
Contents
• Accessing the lexicon
• Partially specified query terms
• Boolean query processing
• Ranking and information retrieval
• Evaluating retrieval effectiveness
• Implementation of the cosine measure
• Interactive retrieval

Preview (2)
• Ranked query
  – Uses a heuristic to measure the similarity between the query and each document
  – Based on this numeric indicator, the r most closely matching documents are returned (r is typically 10 or 100)
  – low precision + high recall
  – high precision + low recall

Accessing the lexicon
• The lexicon for an inverted file index stores auxiliary information as well as the terms
  – I_t: address in the inverted file of the corresponding list of document numbers
  – f_t: the number of documents containing the term
• How the lexicon is stored matters

Access structures
• Array of records (Figure 4.1)
  – 20-byte string + 4-byte address field + 4-byte f_t
• Array of string pointers (Figure 4.2)
  – One long contiguous string + an array of 4-byte character pointers
• One word in four indexed (Figure 4.3)
  – Blocking reduces the number of string pointers
  – Blocking makes the search process more complicated

Front coding (2)
• Partial "3-in-4" front coding
  – See Table 4.1, p. 122
  – Words pointed to by a block pointer keep their complete string and length
  – The last word in a block omits its suffix length
  – Binary search remains possible
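
The blocked front coding described on this slide can be sketched as follows. This is a minimal illustration, not the byte-level layout of Table 4.1: the function names and the tuple representation are assumptions, and the detail that the last word of a block omits its suffix length is not modeled. The first word of each block is stored in full, which is what keeps binary search over block heads possible.

```python
def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_blocks(sorted_words, block_size=4):
    """Front-code a sorted word list in blocks.

    Each block is (head, tail): head is the first word stored in full,
    tail is a list of (shared_prefix_length, suffix) pairs, each relative
    to the previous word in the same block.
    """
    blocks = []
    for start in range(0, len(sorted_words), block_size):
        group = sorted_words[start:start + block_size]
        head, prev, tail = group[0], group[0], []
        for word in group[1:]:
            k = common_prefix_len(prev, word)
            tail.append((k, word[k:]))
            prev = word
        blocks.append((head, tail))
    return blocks


def decode_block(block):
    """Rebuild the words of one block from its head and coded tail."""
    head, tail = block
    words = [head]
    for k, suffix in tail:
        words.append(words[-1][:k] + suffix)
    return words


words = ["jezebel", "jezer", "jezerit", "jeziah",
         "jeziel", "jezliah", "jezoar", "jezrahiah"]
blocks = encode_blocks(words)
```

With a block size of 4, only one word in four needs a string pointer, and a lookup binary-searches the block heads before decoding a single block.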

Minimal perfect hashing (1)
• Hash function: a mechanism for mapping a set L of n keys x_j into a set of integer values h(x_j) in the range 1 ≤ h(x_j) ≤ m, with duplicates allowed
• h(x) = x mod m (m > n/a, where a is the loading: the ratio of records to available addresses)
• The smaller a is, the less likely two keys are to collide at the same hash value

Minimal perfect hashing (2)
• Perfect hash function
  – A hash function for which h(x_i) = h(x_j) only when i = j
  – No collisions
  – Minimal perfect hash function (MPHF)
    • A perfect hash function with m = n
    • Each of the n keys hashes to a unique integer between 1 and n
    • a = 1.0
  – Order-preserving minimal perfect hash function
    • A minimal perfect hash function for which x_i < x_j implies h(x_i) < h(x_j)

Minimal perfect hashing (3)

Design of a minimal perfect hash function (1)
• Probability: the birthday paradox
  – n items are to be hashed into m slots
  – Probability that no two items collide, with m = 365: n = 22 → 0.524, n = 23 → 0.493
• Checking for acyclicity and assigning a mapping
  – See Figure 4.5
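
The birthday-paradox figures on this slide can be reproduced with a short calculation. The sketch below assumes each of the n items lands in one of m slots uniformly and independently; the function name is illustrative.

```python
def prob_no_collision(n, m):
    """Probability that n items hashed uniformly into m slots all land
    in distinct slots: the product of (m - i) / m for i = 0 .. n-1."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m
    return p


p22 = prob_no_collision(22, 365)  # about 0.524
p23 = prob_no_collision(23, 365)  # about 0.493
```

Even with the table less than 7% full, a collision is more likely than not, which is why a perfect hash function has to be constructed rather than hoped for.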

Design of a minimal perfect hash function (2)
[Figure: numeric labels only; the diagram is not recoverable from the extracted text]

Design of a minimal perfect hash function (3)
• Generating a perfect hash function
  – Step 1: Choose a value for m
  – Step 2: Choose w1[i] and w2[i] at random
  – Step 3: Build the graph G = (V, E), where V = {1, …, m} and E = {(h1(t), h2(t)) : t ∈ L}
  – Step 4: Compute the mapping g using Figure 4.5
  – Step 5: If the labeling algorithm fails, return to step 2
  – Step 6: Return w1, w2, and g
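
The generation loop on this slide can be sketched as below. Several details are assumptions rather than something the slides state: h1 and h2 are taken to be weighted sums of character codes modulo m, m is set to 3n + 1, vertices and hash values are numbered from 0 rather than 1, and the final hash is taken as (g[h1(t)] + g[h2(t)]) mod n. The labeling routine is a generic acyclicity check, not a transcription of Figure 4.5.

```python
import random


def _h(weights, term, m):
    """Assumed form of h1/h2: weighted sum of character codes mod m."""
    return sum(w * ord(ch) for w, ch in zip(weights, term)) % m


def _label(adj, n):
    """Assign g so that g[u] + g[v] = edge index (mod n) on every edge.
    Returns None if the graph has a cycle (including self-loops and
    duplicate edges)."""
    g = [None] * len(adj)
    used = set()
    for root in range(len(adj)):
        if g[root] is not None:
            continue
        g[root] = 0
        stack = [root]
        while stack:
            u = stack.pop()
            for v, idx in adj[u]:
                if idx in used:
                    continue
                used.add(idx)
                if g[v] is not None:  # unused edge to a labeled vertex
                    return None
                g[v] = (idx - g[u]) % n
                stack.append(v)
    return g


def build_mphf(keys, ratio=3, seed=0, max_tries=1000):
    n = len(keys)
    m = ratio * n + 1                      # step 1 (assumed choice of m)
    width = max(len(k) for k in keys)
    rng = random.Random(seed)
    for _ in range(max_tries):
        w1 = [rng.randrange(1, m) for _ in range(width)]   # step 2
        w2 = [rng.randrange(1, m) for _ in range(width)]
        adj = [[] for _ in range(m)]                       # step 3
        for idx, t in enumerate(keys):
            u, v = _h(w1, t, m), _h(w2, t, m)
            adj[u].append((v, idx))
            adj[v].append((u, idx))
        g = _label(adj, n)                                 # step 4
        if g is not None:
            return w1, w2, g                               # step 6
    raise RuntimeError("no acyclic graph found; try a larger ratio")


def lookup(term, w1, w2, g, n):
    m = len(g)
    return (g[_h(w1, term, m)] + g[_h(w2, term, m)]) % n


keys = sorted(["jezebel", "jezer", "jezerit", "jeziah", "jeziel", "jezliah",
               "jezoar", "jezrahiah", "jezreel", "jezreelites"])
w1, w2, g = build_mphf(keys)
```

Because edge i carries the i-th key of a sorted list, the resulting function is order-preserving: each key hashes to its own rank.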

Disk-based lexicon storage
• See Figure 4.7, p. 131
• A way to reduce the primary memory needed for the lexicon: put it on disk
• Primary memory holds only enough information to identify the disk block for each term
• To look up a term, find the block number in the in-memory index, read that block into a buffer, and continue the search there
• Simple, and needs only a small amount of primary memory
• Effective when a relatively small lexicon is stored on a small workstation or personal computer
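
The two-level lookup on this slide can be sketched as follows. Disk blocks are simulated with in-memory lists, and the in-memory index is assumed to hold the first term of each block; both choices, and all names, are illustrative rather than taken from Figure 4.7.

```python
import bisect


def build_blocks(sorted_terms, block_size=4):
    """Split a sorted lexicon into fixed-size 'disk blocks' and build the
    in-memory index: the first term of each block."""
    blocks = [sorted_terms[i:i + block_size]
              for i in range(0, len(sorted_terms), block_size)]
    first_terms = [block[0] for block in blocks]
    return blocks, first_terms


def lookup(term, blocks, first_terms):
    """Return (block_number, offset) for term, or None if it is absent."""
    b = bisect.bisect_right(first_terms, term) - 1  # in-memory search
    if b < 0:
        return None
    block = blocks[b]            # stands in for one disk read into a buffer
    i = bisect.bisect_left(block, term)
    if i < len(block) and block[i] == term:
        return (b, i)
    return None


terms = ["ant", "bee", "cat", "dog", "eel", "fox", "gnu", "hen", "ibis", "jay"]
blocks, first_terms = build_blocks(terms)
```

Only `first_terms` has to stay in primary memory, so the memory cost shrinks by roughly the block size at the price of one block read per lookup.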

Partially specified query terms (1)
• Applies when the term being sought contains the wildcard character *
• Brute-force string matching
  – If the whole lexicon is in main memory, a fast pattern-matching algorithm can compare the pattern against every term
• Indexing using n-grams
  – See Table 4.3, p. 133
  – Decompose the query term into n-grams
  – False matches can occur, so the results must be checked again with a pattern matcher (e.g. lab*r also retrieves laaber and lavacaber)
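
The n-gram approach and its false matches can be sketched with digrams (n = 2). The `$` boundary marker, the function names, and the use of Python's `fnmatch` as the pattern matcher are assumptions made for illustration.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def digrams(piece):
    """All 2-character substrings of piece."""
    return {piece[i:i + 2] for i in range(len(piece) - 1)}


def build_index(lexicon):
    """Map each digram to the set of terms containing it. Terms are
    padded with '$' so word boundaries also produce digrams."""
    index = defaultdict(set)
    for term in lexicon:
        for gram in digrams("$" + term + "$"):
            index[gram].add(term)
    return index


def wildcard_lookup(pattern, index):
    """Return (candidates, matches): terms containing every digram of the
    pattern, and the subset that truly matches it."""
    grams = set()
    for piece in ("$" + pattern + "$").split("*"):
        grams |= digrams(piece)
    if not grams:
        raise ValueError("pattern yields no digrams")
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    matches = {t for t in candidates if fnmatchcase(t, pattern)}
    return candidates, matches


lexicon = ["lab", "labor", "labour", "laaber", "lavacaber", "slab"]
index = build_index(lexicon)
candidates, matches = wildcard_lookup("lab*r", index)
```

For `lab*r` the digrams are `$l`, `la`, `ab`, and `r$`. The terms laaber and lavacaber contain all four and so survive the index step; only the final pattern check removes them.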

Partially specified query terms (2)
• Rotated lexicons
  – See Table 4.4, p. 135
  – Keep an indexing pointer for every character in the lexicon
  – Fast, but uses a lot of memory
  – When wildcard characters appear at both the start and the end, they are merged into one
  – Multiple wildcard characters
    • Build a candidate set from the longest specified character sequence in the query
    • Check each specified component in turn
    • Check with a pattern matcher
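
A rotated lexicon can be sketched as a sorted list of every rotation of each term. The version below is a simplification under stated assumptions: it stores the rotated strings themselves instead of pointers into the lexicon, uses `$` as the end-of-term marker, and handles only patterns with exactly one `*`.

```python
import bisect


def build_rotations(lexicon):
    """Sorted list of (rotation, term) for every rotation of term + '$'."""
    rots = []
    for term in lexicon:
        s = term + "$"
        for i in range(len(s)):
            rots.append((s[i:] + s[:i], term))
    rots.sort()
    return rots


def single_wildcard_lookup(pattern, rots):
    """Answer a pattern of the form X*Y (X or Y may be empty) by rotating
    it to Y$X and scanning the rotations that start with that prefix."""
    head, _, tail = pattern.partition("*")
    prefix = tail + "$" + head
    i = bisect.bisect_left(rots, (prefix,))
    found = set()
    while i < len(rots) and rots[i][0].startswith(prefix):
        found.add(rots[i][1])
        i += 1
    return sorted(found)


rots = build_rotations(["lab", "labor", "labour", "laaber", "slab"])
```

Rotating the query turns the wildcard into a trailing one, so a single binary search finds exactly the matching terms with no false matches. The cost is one entry per lexicon character, which is where the memory overhead comes from.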

Boolean query processing (2)
• Term processing order
  – Why the candidate set is built starting from the least frequent term
    • To reduce the amount of memory needed for query processing
    • Because processing terms in order of increasing frequency is faster (compare the time spent on look-up operations with that spent on merge operations)
  – A compressed inverted file saves a great deal of space, but conjunctive query processing takes more time (the time to decompress and merge)
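
The term processing order on this slide can be sketched as below. The inverted file is assumed to be an uncompressed in-memory dictionary from term to sorted document numbers, and the names are illustrative.

```python
import bisect


def contains(sorted_list, doc):
    """Binary-search look-up of doc in a sorted inverted list."""
    i = bisect.bisect_left(sorted_list, doc)
    return i < len(sorted_list) and sorted_list[i] == doc


def conjunctive_query(terms, inverted):
    """AND together the inverted lists of terms, least frequent first."""
    lists = sorted((inverted.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    candidates = list(lists[0])       # start from the shortest list
    for postings in lists[1:]:
        if not candidates:
            break                     # nothing left to filter
        candidates = [d for d in candidates if contains(postings, d)]
    return candidates


inverted = {
    "a": [1, 2, 3, 5, 8, 9],
    "b": [2, 5, 9],
    "c": [1, 2, 9, 10],
}
result = conjunctive_query(["a", "b", "c"], inverted)
```

The candidate set can never grow beyond the shortest list, which bounds the working memory. Each longer list is then probed by look-up rather than merged in full.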

Boolean query processing (3)
• Random access and fast lookup
  – To support fast searching, random access must be available within an inverted file entry
  – Three issues
    • Storage mechanism used for the index
    • Suitable value for b_t, the blocking constant for term t
    • Trade-off of time for space
• Nonconjunctive queries
  – When the Boolean query expression becomes complex, use an informal or ranked query

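Ranking and Information Retrieval
• Boolean query
  – Suited to cases where the information about the data is known precisely
  – Exact search is possible
  – Commercial databases, bibliographic systems
• Coordinate matching
  – A document containing more of the query terms is more similar
  – A document containing even one query term is judged relevant
• Ranking
  – Measure the similarity between the query and each document
  – Display the documents in decreasing order of similarity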
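
Coordinate matching can be sketched as a count of distinct query terms per document. The dictionary-based inverted file and the tie-breaking by document number are assumptions made for illustration.

```python
from collections import Counter


def coordinate_match(query_terms, inverted):
    """Score each document by how many distinct query terms it contains.
    Returns (doc, score) pairs, highest score first, ties by doc number."""
    scores = Counter()
    for term in set(query_terms):
        for doc in inverted.get(term, []):
            scores[doc] += 1
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


inverted = {"a": [1, 2], "b": [2, 3], "c": [2]}
ranking = coordinate_match(["a", "b", "c"], inverted)
```

Any document with at least one query term appears in the ranking, which matches the slide: a single shared term is enough to count as relevant, and more shared terms rank higher.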

Cosine Measure

Implementation of the cosine measure
• Issue: memory or disk?
• Process all of the data, or only part of it?
• N-best algorithm
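
One possible in-memory arrangement is sketched below: accumulate the cosine numerator per document from the inverted lists, divide by the vector lengths, and keep the r best with a heap. The slides do not specify the term weights or the selection method, so the raw term counts used as weights, the heap-based selection, and all names here are assumptions.

```python
import heapq
import math
from collections import Counter, defaultdict


def build(docs):
    """Build term -> [(doc_id, f_dt)] lists and per-document vector lengths."""
    inverted = defaultdict(list)
    lengths = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        lengths[doc_id] = math.sqrt(sum(f * f for f in counts.values()))
        for term, f in counts.items():
            inverted[term].append((doc_id, f))
    return inverted, lengths


def top_r(query, inverted, lengths, r):
    """Return the r highest (cosine, doc_id) pairs for the query."""
    q_counts = Counter(query.split())
    q_len = math.sqrt(sum(f * f for f in q_counts.values()))
    acc = defaultdict(float)          # one accumulator per candidate doc
    for term, f_qt in q_counts.items():
        for doc_id, f_dt in inverted.get(term, []):
            acc[doc_id] += f_qt * f_dt
    scored = ((s / (q_len * lengths[d]), d) for d, s in acc.items())
    return heapq.nlargest(r, scored)


docs = {1: "a b", 2: "a a", 3: "c"}
inverted, lengths = build(docs)
best = top_r("a", inverted, lengths, 2)
```

Only documents sharing at least one term with the query get an accumulator, and the heap avoids sorting every score when r is small. Whether accumulators and lengths fit in memory is the memory-versus-disk question the slide raises.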

Relevance Feedback