Accumulator Representations
Dr. Susan Gauch

Criteria
- Fast lookup by docid
  - Need to be able to add posting data efficiently
  - Acc.Add(docid, wt)
- Small space in memory
  - Most documents do not contain any of the query words
  - Accumulator is thus a sparse array
  - Avoid storing buckets for non-matching documents
- Fast sort by total weight
  - After scores are accumulated, sort by total weight before presenting top matches to the user

Option 1: Array
- One element per document
- Fast lookup by docid – YES
  - Acc[docid] += wt
  - O(1)
- Small space in memory – NO
  - Store one element per document in the collection
  - What if there are billions of documents?
  - O(N) buckets, where N = number of docs in collection
- Fast sort by total weight – MAYBE
  - If just sort the array – NO (array can be huge)
  - O(N log N), where N = number of docs in collection
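A minimal sketch of the array option, assuming docids are integers in the range 0..N-1. The function name `accumulate_array` and the sample postings are hypothetical and are not taken from the slides.

```python
def accumulate_array(num_docs, postings):
    """Dense accumulator: one slot per document in the collection."""
    acc = [0.0] * num_docs        # O(N) space, even if most slots stay 0
    for docid, wt in postings:
        acc[docid] += wt          # O(1) lookup and update by docid
    return acc


# Hypothetical postings for a 10-document collection.
postings = [(3, 0.5), (7, 1.25), (3, 0.25), (1, 2.0)]
acc = accumulate_array(10, postings)
```

Only three of the ten slots end up non-zero here, which is the sparsity the later options try to exploit.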

More Efficient Sorts
- Take advantage of 2 things:
- 1) Array stores mostly 0
  - Keep track of number of non-0 entries
  - Copy those into new array
  - Sort that smaller array
  - O(r log r), where r is number of non-0 results
  - r << N
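The copy-then-sort idea can be sketched as follows. The name `sort_nonzero` is hypothetical. A Python list grows on demand, so this sketch does not need the running count of non-0 entries that a fixed-size array would use to size the copy.

```python
def sort_nonzero(acc):
    """Copy the non-zero entries into a smaller list and sort only that list."""
    # One O(N) scan to collect the r non-zero (docid, wt) pairs.
    nonzero = [(docid, wt) for docid, wt in enumerate(acc) if wt != 0]
    # O(r log r) sort by weight, highest first.
    nonzero.sort(key=lambda pair: pair[1], reverse=True)
    return nonzero


acc = [0.0, 2.0, 0.0, 0.75, 0.0, 0.0, 0.0, 1.25, 0.0, 0.0]
ranked = sort_nonzero(acc)
```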

More Efficient Sorts
- 2) Take advantage of the fact that we usually only present p results, p << r (10? 20? 100?)
  - Use a bounded-size data structure to store the top-weighted results so far: a heap or a bounded-size linked list
  - Iterate over Acc
    - If list not full
      - Add (docid, wt) to list in sorted location
    - Else
      - if (wt > list->tail.wt)
        - Add (docid, wt) to list in sorted location
        - Remove tail element
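The loop above can be sketched with a plain Python list standing in for the bounded linked list. The name `top_p` is hypothetical, and skipping zero-weight slots is an added assumption for dense array accumulators.

```python
def top_p(acc_items, p):
    """Keep the p highest-weighted (docid, wt) pairs, sorted by descending weight."""
    if p <= 0:
        return []
    top = []                              # top[-1] plays the role of list->tail
    for docid, wt in acc_items:
        if wt == 0:
            continue                      # assumption: skip empty accumulator slots
        if len(top) == p and wt <= top[-1][1]:
            continue                      # below the cut-off: rejected immediately
        i = len(top)                      # find the sorted location, O(p)
        while i > 0 and top[i - 1][1] < wt:
            i -= 1
        top.insert(i, (docid, wt))
        if len(top) > p:
            top.pop()                     # remove the tail element
    return top


acc = [0.0, 2.0, 0.0, 0.75, 0.0, 0.0, 0.0, 1.25, 0.0, 0.5]
best = top_p(enumerate(acc), 2)
```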

More Efficient Sorts (2)
- Before long, most (docid, wt) don’t make it past the cut-off and are immediately rejected
- O(A), where A is the size of the accumulator, when p << r << A
  - You must loop over the accumulator, but most of the time no inserts actually happen
  - When inserting, it is O(p), where p is the size of the linked list
- For the array accumulator, this is O(N)
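A rough empirical check of the claim that few entries get past the cut-off, assuming uniformly random weights and p >= 1. The seed, the data, and the name `count_inserts` are illustrative assumptions. Real query weights may be distributed differently.

```python
import bisect
import random


def count_inserts(weights, p):
    """Count how many weights get past the cut-off while keeping the top p (p >= 1)."""
    kept = []                 # ascending order, so kept[0] is the current cut-off
    inserts = 0
    for wt in weights:
        if len(kept) < p:
            bisect.insort(kept, wt)
            inserts += 1
        elif wt > kept[0]:
            bisect.insort(kept, wt)
            kept.pop(0)       # drop the smallest kept weight
            inserts += 1
    return inserts


rng = random.Random(0)        # fixed seed so the run is repeatable
weights = [rng.random() for _ in range(10_000)]
inserts = count_inserts(weights, p=10)
```

With randomly ordered input, `inserts` should be far smaller than the 10,000 entries scanned. Input that arrives in ascending weight order is the worst case, because every entry beats the current cut-off.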

Option 2: Hashtable
- Size of hashtable: number of expected non-0 results * 3 (r * 3)
- Fast lookup by docid – YES
  - Loc = hashfn(docid)
  - HT[Loc] += wt
  - O(c), where c is number of collisions + 1
- Small space in memory – YES
  - O(r)
- Fast sort by total weight – MAYBE
  - Can use same sort approaches as for Option 1: Array
  - O(A) == O(r)
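A minimal sketch of the hashtable option, using Python's built-in dict as the hashtable. This is an assumption made for brevity: a dict picks its own size and resolves collisions internally, so the r * 3 sizing and the explicit `hashfn` step from the slide are not modeled. The name `accumulate_hash` is hypothetical.

```python
from collections import defaultdict


def accumulate_hash(postings):
    """Sparse accumulator: an entry exists only for docids that matched."""
    acc = defaultdict(float)      # missing docids start at 0.0
    for docid, wt in postings:
        acc[docid] += wt          # hash lookup, then update
    return acc


postings = [(3, 0.5), (7, 1.25), (3, 0.25), (1, 2.0)]
acc = accumulate_hash(postings)
```

Space grows with the number of matching documents rather than with the collection size.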

Option 3: Heap
- Can bound the heap to approximate size p
  - Height of the heap: h = log2 p
- Fast lookup by docid – NO
  - Must walk the whole heap, O(p)
- Small space in memory – YES
  - Store one element for each result you plan to present to the user (just keep top p at any time)
  - O(p)
- Fast sort by total weight – YES
  - Results are always partially sorted
  - Just remove top element iteratively to present results at the end
  - O(p log p) == heap sort
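A sketch of a bounded heap using Python's `heapq` module, which implements a min-heap. It assumes each incoming (docid, wt) pair already carries a final total weight, since a heap alone has no fast lookup by docid for accumulating. The name `top_p_heap` is hypothetical.

```python
import heapq


def top_p_heap(pairs, p):
    """Keep the p highest-weighted pairs in a min-heap bounded to size p."""
    if p <= 0:
        return []
    heap = []                                     # heap[0] is the smallest kept weight
    for docid, wt in pairs:
        if len(heap) < p:
            heapq.heappush(heap, (wt, docid))     # O(log p)
        elif wt > heap[0][0]:
            heapq.heapreplace(heap, (wt, docid))  # pop smallest, push new: O(log p)
    # Remove the top element repeatedly: O(p log p), smallest weight first.
    ordered = [heapq.heappop(heap) for _ in range(len(heap))]
    ordered.reverse()                             # present highest weight first
    return [(docid, wt) for wt, docid in ordered]


best = top_p_heap([(1, 2.0), (3, 0.75), (7, 1.25), (9, 0.5)], 2)
```

Because the smallest kept weight sits at the top of a min-heap, popping yields ascending order, so the sketch reverses the list before presenting results.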

Option 4: Hashtable + Heap
- Use both a hashtable AND a heap
- Both store pointers to nodes that contain (docid, total_weight)
- Fast lookup by docid – YES
  - O(c) in hashtable
- Small space in memory – YES
  - O(r) + O(p) for hashtable and heap
- Fast sort by total weight – YES
  - O(p log p) from heap
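One possible reading of the combined option, sketched under a stated assumption: the slides do not say when the heap is maintained, so this version accumulates in the hashtable first and fills the bounded heap in a single pass at the end. Both structures hold references to the same node objects. The name `rank` and the `[total_weight, docid]` node layout are hypothetical.

```python
import heapq


def rank(postings, p):
    """Accumulate weights in a hashtable, then pick the top p with a bounded heap."""
    if p <= 0:
        return []
    nodes = {}                            # hashtable: docid -> node [total_weight, docid]
    for docid, wt in postings:
        node = nodes.get(docid)
        if node is None:
            nodes[docid] = [wt, docid]
        else:
            node[0] += wt                 # update the shared node in place
    heap = []                             # min-heap holding references to the same nodes
    for node in nodes.values():
        if len(heap) < p:
            heapq.heappush(heap, node)
        elif node[0] > heap[0][0]:
            heapq.heapreplace(heap, node)
    # Sorting the p surviving nodes is O(p log p).
    return [(docid, wt) for wt, docid in sorted(heap, reverse=True)]


postings = [(3, 0.5), (7, 1.25), (3, 0.25), (1, 2.0), (9, 0.5)]
best = rank(postings, 2)
```

The hashtable holds r nodes and the heap holds at most p references, which matches the O(r) + O(p) space on the slide.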