Index Construction: sorting (Paolo Ferragina, Dipartimento di Informatica)

Index Construction: sorting
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading: Chapter 4

Indexer steps
Dictionary & postings: how do we construct them?
• Scan the texts
• Find the proper tokens-list
• Append the docID to the proper tokens-list
How do we:
• Find? …time issue…
• Append? …space issues…
• Postings' size?
• Dictionary size? …in-memory issues…

Indexer steps: create token
• Sequence of pairs: <modified token, document ID>
• What about var-length strings?
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
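The scan step above can be sketched in a few lines of Python. This is a toy illustration, not the lecture's code: the `docs` dict and the deliberately simplistic regex tokenizer are assumptions for the example.

```python
import re

# Scan the texts and emit <modified token, docID> pairs.
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}
pairs = [(token, doc_id)
         for doc_id, text in docs.items()
         for token in re.findall(r"[a-z']+", text.lower())]  # crude tokenizer
print(pairs[:3])  # [('i', 1), ('did', 1), ('enact', 1)]
```

Note that the tokens are var-length strings, which is exactly the issue the slide raises: storing and sorting them efficiently is not as simple as for fixed-size records.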

Indexer steps: sort
• Sort by term
• Then sort by docID
This is the core indexing step.

Indexer steps (Sec. 1.2)
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Doc. frequency information is added.
Dictionary & Postings
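The sort-merge-split sequence of these two slides can be sketched as follows. This is a minimal illustration under assumed representations: the dictionary maps each term to a hypothetical `(pointer into postings, document frequency)` pair.

```python
# Sort pairs by term, then docID; merge duplicate entries within a doc;
# split into dictionary (with doc frequency) and postings.
pairs = [("caesar", 2), ("brutus", 1), ("caesar", 1),
         ("brutus", 2), ("caesar", 2)]
pairs.sort()                            # by term first, then by docID
dictionary, postings = {}, []
for term, doc_id in pairs:
    start, df = dictionary.get(term, (len(postings), 0))
    if df and postings[-1] == doc_id:
        continue                        # same term, same doc: merge entries
    postings.append(doc_id)
    dictionary[term] = (start, df + 1)  # (pointer into postings, doc freq)
print(dictionary)                       # {'brutus': (0, 2), 'caesar': (2, 2)}
print(postings)                         # [1, 2, 1, 2]
```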

Some key issues… (Sec. 1.2)
Dictionary: terms and counts. Postings: lists of docIDs. Pointers connect the two.
Now:
• How do we sort?
• How much storage is needed?

Keep attention on disk…
If sorting needs to manage strings, memory contains the strings and array A is an "array of pointers to objects".
Key observation: you sort A, not the strings.
• Each object-to-object comparison A[i] vs A[j] costs 2 random accesses to 2 memory locations.
• Θ(n log n) random memory accesses (I/Os??)
Again caching helps, but how much?
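The pointer-based sorting above can be made concrete with a small sketch (the names `strings`, `A`, `cmp` are illustrative; Python indices stand in for pointers):

```python
import functools

# A holds pointers (indices) into a pool of strings; the sort moves
# the pointers, never the strings themselves.
strings = ["julius", "caesar", "brutus", "capitol", "enact"]
A = list(range(len(strings)))

def cmp(i, j):
    # Each comparison dereferences A[i] and A[j]: 2 random accesses.
    return (strings[i] > strings[j]) - (strings[i] < strings[j])

A.sort(key=functools.cmp_to_key(cmp))
sorted_view = [strings[i] for i in A]
print(sorted_view)  # ['brutus', 'caesar', 'capitol', 'enact', 'julius']
```

With Θ(n log n) comparisons, that is Θ(n log n) random accesses to the string pool, which is the cost the slide warns about.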

Binary Merge-Sort(A, i, j)
  if (i < j) then
    m = (i + j) / 2          (Divide)
    Merge-Sort(A, i, m)      (Conquer)
    Merge-Sort(A, m+1, j)    (Conquer)
    Merge(A, i, m, j)        (Combine)
[figure: merging two sorted runs]
Merge is linear in the #items to be merged.
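A runnable Python version of the pseudocode above, as a sketch (the helper `merge` is written naively for clarity, not efficiency):

```python
def merge_sort(A, i, j):
    """Binary Merge-Sort(A, i, j): sort A[i..j] in place."""
    if i < j:                       # Divide
        m = (i + j) // 2
        merge_sort(A, i, m)         # Conquer left half
        merge_sort(A, m + 1, j)     # Conquer right half
        merge(A, i, m, j)           # Combine

def merge(A, i, m, j):
    """Merge sorted runs A[i..m] and A[m+1..j]: linear in #items."""
    left, right = A[i:m + 1], A[m + 1:j + 1]
    k = i
    while left and right:
        A[k] = left.pop(0) if left[0] <= right[0] else right.pop(0)
        k += 1
    A[k:j + 1] = left + right       # copy whichever run is left over

A = [10, 2, 7, 1, 13, 19, 9, 5]
merge_sort(A, 0, len(A) - 1)
print(A)  # [1, 2, 5, 7, 9, 10, 13, 19]
```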

Few key observations
• Items = (short) strings = atomic…
• On English Wikipedia, about 10^9 tokens to sort.
• Θ(n log n) memory accesses (I/Os??)
• [5 ms] * n log_2 n ≈ 3 years
In practice it is "faster". Why?
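A back-of-the-envelope check of the estimate above, under the slide's assumed numbers (n = 10^9 tokens, 5 ms per random access); the result lands in the same few-years ballpark:

```python
import math

n = 10**9                              # tokens on English Wikipedia
accesses = n * math.log2(n)            # Theta(n log n) random accesses
seconds = accesses * 5e-3              # 5 ms per random access
years = seconds / (3600 * 24 * 365)
print(f"about {years:.0f} years")      # a few years, as the slide claims
```

In practice it is much faster because caching turns most of these nominally random accesses into cheap in-memory ones.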

SPIMI: Single-pass in-memory indexing
• Key idea #1: generate separate dictionaries for each block (no need for termIDs).
• Key idea #2: don't sort; accumulate postings in postings lists as they occur (in internal memory).
• Generate an inverted index for each block:
  • more space for postings is available
  • compression is possible
• Merge the block indexes into one big index:
  • easy append with one file per postings list (docIDs are increasing within a block); otherwise, multi-merge as before.
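A minimal sketch of the SPIMI idea above. The function name `spimi_invert` and the token-count proxy `block_size` (standing in for "memory is full") are assumptions of this example:

```python
from collections import defaultdict

def spimi_invert(token_stream, block_size):
    """Accumulate unsorted postings lists in memory; when the memory
    budget is hit, flush a block index sorted by term."""
    blocks = []
    index = defaultdict(list)          # term -> postings list (no termIDs)
    for n, (term, doc_id) in enumerate(token_stream, 1):
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)    # docIDs arrive in increasing order
        if n % block_size == 0:        # "memory full": flush this block
            blocks.append(sorted(index.items()))
            index = defaultdict(list)
    if index:
        blocks.append(sorted(index.items()))
    return blocks                      # later: multi-merge the blocks

stream = [("caesar", 1), ("brutus", 1), ("caesar", 2),
          ("brutus", 2), ("noble", 2), ("caesar", 2)]
print(spimi_invert(stream, block_size=4))
```

Note there is no sorting of (term, docID) pairs: postings are appended as tokens occur, and only the per-block dictionaries are sorted at flush time.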

SPIMI-Invert

Use the same algorithm for disk?
• Multi-way merge-sort, aka BSBI: Blocked Sort-Based Indexing
• The term-to-termID mapping:
  • must be kept in memory for constructing the pairs
  • consumes memory during the pairs' creation
  • needs two passes, unless you use hashing (and thus accept some probability of collision).
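The BSBI pair-creation step can be sketched as follows; `term_id` and `to_pairs` are hypothetical names for this illustration:

```python
# The term -> termID mapping must stay in memory while the
# fixed-size (termID, docID) pairs are built and sorted.
term_id = {}

def to_pairs(tokens):
    """tokens: iterable of (term, doc_id); returns sorted (termID, docID) pairs."""
    pairs = []
    for term, doc in tokens:
        tid = term_id.setdefault(term, len(term_id))  # memory-resident map
        pairs.append((tid, doc))
    pairs.sort()                       # sort fixed-size pairs, not strings
    return pairs

print(to_pairs([("caesar", 1), ("brutus", 1), ("caesar", 2)]))
# [(0, 1), (0, 2), (1, 1)]
```

This is the memory cost SPIMI avoids: SPIMI keys its in-memory blocks by the terms themselves, so no global termID table is needed.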

Recursion
[figure: recursion tree of binary Merge-Sort on N items, log_2 N levels deep]

Implicit caching…
[figure: bottom-up view of binary Merge-Sort on N items]
• N/M runs, each sorted in internal memory (no I/Os).
• Each of the remaining log_2 (N/M) merge levels costs 2 passes (one read, one write) = 2 (N/B) I/Os.
• Total I/O cost of binary Merge-Sort ≈ 2 (N/B) log_2 (N/M).

A key inefficiency
After a few steps, every run is longer than B!!!
[figure: binary merge of two long runs through one output buffer and two input buffers of B items each]
We are using only 3 pages, but memory contains M/B pages ≈ 2^30 / 2^15 = 2^15.

Multi-way Merge-Sort
Sort N items with main memory M and disk pages of B items:
• Pass 1: produce N/M sorted runs.
• Pass i: merge X = M/B - 1 runs at a time.
This yields log_X (N/M) passes.
[figure: X input pages (one per run) and one output page in main memory]
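The merging passes can be sketched with Python's `heapq.merge`, which performs exactly the X-way streaming merge described above (the function name `multiway_merge` is illustrative):

```python
import heapq

def multiway_merge(runs, X):
    """Merge X sorted runs at a time, one level per iteration,
    so the number of levels is log_X(#runs)."""
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + X]))   # one X-way merge pass
                for i in range(0, len(runs), X)]
    return runs[0]

runs = [[1, 5, 10], [2, 7, 13], [9, 19], [3, 4]]
print(multiway_merge(runs, 3))
# [1, 2, 3, 4, 5, 7, 9, 10, 13, 19]
```

With X = 3, the four runs above need two levels instead of the three a binary merge would take; with a realistic X = M/B - 1 in the tens of thousands, one level usually suffices.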

How it works
[figure: X-way Merge-Sort on N items]
• N/M runs, each sorted in internal memory.
• Each of the log_X (N/M) merge levels costs 2 passes (one read, one write) = 2 (N/B) I/Os.

Cost of Multi-way Merge-Sort
• Number of passes = log_X (N/M) = log_{M/B} (N/M)
• Total I/O cost is Θ( (N/B) log_{M/B} (N/M) ) I/Os.
In practice M/B ≈ 10^5, so #passes = 1: a few minutes.
Tuning depends on disk features:
• a large fan-out (M/B) decreases #passes;
• compression would decrease the cost of a pass!

Distributed indexing
• For web-scale indexing we must use a distributed computing cluster of PCs.
• Individual machines are fault-prone: they can unpredictably slow down or fail.
• How do we exploit such a pool of machines?

Distributed indexing
• Maintain a master machine directing the indexing job; it is considered "safe".
• Break up indexing into sets of (parallel) tasks.
• The master machine assigns tasks to idle machines.
• Other machines can play many roles during the computation.

Parallel tasks
We will use two sets of parallel tasks: Parsers and Inverters.
Break the document collection in two ways:
• Term-based partition: one machine handles a subrange of terms.
• Doc-based partition: one machine handles a subrange of documents.

Data flow: doc-based partitioning
[figure: the master assigns document splits to Parsers; each Parser feeds its own Inverter, producing inverted lists IL_1, IL_2, …, IL_k]
Each query term goes to many machines.

Data flow: term-based partitioning
[figure: the master assigns document splits to Parsers; Parsers route (term, docID) pairs into term ranges a-f, g-p, q-z, each handled by one Inverter]
Each query term goes to one machine.

MapReduce
• This is a robust and conceptually simple framework for distributed computing, without having to write code for the distribution part.
• The Google indexing system (ca. 2002) consisted of a number of phases, each implemented in MapReduce.

Data flow: term-based partitioning
[figure: the master assigns 16-64 MB splits to Parsers (Map phase); Parsers write (term, docID) pairs to segment files on local disks, partitioned into term ranges a-f, g-p, q-z; Inverters (Reduce phase) turn each partition into postings]
Guarantee fitting in one machine?
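A toy single-process sketch of this Map/Reduce data flow (the names `map_phase` and `reduce_phase` are hypothetical; real MapReduce distributes these functions across machines):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser/mapper: emit (term, docID) pairs for one split."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Inverter/reducer: turn all pairs of one term partition
    into postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):              # group pairs by term
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)           # docIDs come out sorted
    return dict(postings)

pairs = map_phase(1, "so let it be with caesar") + \
        map_phase(2, "caesar was ambitious")
index = reduce_phase(pairs)
print(index["caesar"])  # [1, 2]
```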

Dynamic indexing
Up to now, we have assumed static collections. It now more frequently occurs that:
• documents come in over time;
• documents are deleted and modified.
And this induces:
• postings updates for terms already in the dictionary;
• new terms added to / deleted from the dictionary.

Simplest approach
• Maintain a "big" main index; new docs go into a "small" auxiliary index.
• Search across both, and merge the results.
• Deletions:
  • keep an invalidation bit-vector for deleted docs;
  • filter search results (i.e. docs) by the invalidation bit-vector.
• Periodically, re-index into one main index.
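A minimal sketch of query processing under this scheme, with toy dict-based indexes (all names here are illustrative):

```python
def search(term, main_index, aux_index, deleted):
    """Query both indexes, then drop docs flagged in the
    invalidation bit-vector."""
    results = main_index.get(term, []) + aux_index.get(term, [])
    return [d for d in results if not deleted[d]]

main = {"caesar": [1, 2]}               # "big" main index
aux = {"caesar": [3]}                   # "small" auxiliary index
deleted = [False, False, True, False]   # docID 2 was deleted
print(search("caesar", main, aux, deleted))  # [1, 3]
```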

Issues with 2 indexes
Poor performance:
• Merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list: the merge is then a simple append (new docIDs are greater).
• But this needs a lot of files, which is inefficient for the O/S.
In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.).

Logarithmic merge
• Maintain a series of indexes, each twice as large as the previous one: 2^0 M, 2^1 M, 2^2 M, 2^3 M, …
• Keep a small index Z in memory (of size up to 2^0 M = M).
• Store I_0, I_1, … on disk (of sizes 2^0 M, 2^1 M, …).
• If Z gets too big (> M), write it to disk as I_0, or merge it with I_0 (if I_0 already exists).
• Either write the merged Z to disk as I_1 (if there is no I_1), or merge it with I_1 to form a larger Z, and so on.
• #indexes = logarithmic
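The cascade above can be sketched as follows. This is a toy model under stated assumptions: indexes are plain dicts (term to sorted docID list), `len(Z)` stands in for index size, and `merge_two`/`log_merge` are hypothetical names.

```python
def merge_two(a, b):
    """Merge two inverted indexes (dicts: term -> sorted docID list)."""
    out = dict(b)
    for term, plist in a.items():
        out[term] = sorted(set(out.get(term, []) + plist))
    return out

def log_merge(indexes, Z, M):
    """One step of logarithmic merging: Z is the in-memory index,
    indexes[k] holds an on-disk index of size ~2^k * M (or None)."""
    if len(Z) < M:
        return indexes, Z               # Z still fits in memory
    k = 0
    while k < len(indexes) and indexes[k] is not None:
        Z = merge_two(Z, indexes[k])    # cascade: merged index doubles
        indexes[k] = None
        k += 1
    if k == len(indexes):
        indexes.append(None)
    indexes[k] = Z                      # park the merged index at level k
    return indexes, {}                  # start a fresh in-memory index

indexes, Z = log_merge([], {"a": [1], "b": [2]}, M=2)
indexes, Z = log_merge(indexes, {"a": [3], "c": [4]}, M=2)
print(indexes)  # level 0 empty, level 1 holds the merged index
```

Each posting participates in O(log(T/M)) such merges, which is where the bound on the next slide comes from.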

Some analysis (T = #postings = #tokens)
• Auxiliary and main index: index construction time is O(T^2), as each posting is touched in each merge.
• Logarithmic merge: each posting is merged O(log(T/M)) times, so the complexity is O(T log(T/M)).
• Logarithmic merge is more efficient for index construction, but its query processing requires merging O(log(T/M)) lists of results.

Web search engines
Most search engines now support dynamic indexing:
• news items, blogs, new topical web pages.
But (sometimes/typically) they also periodically reconstruct the index from scratch:
• query processing is then switched to the new index, and the old index is deleted.