Inverted Index Compression and Query Processing with Optimized Document Ordering Hao Yan, Shuai Ding, Torsten Suel 1. Department of Computer Science and Engineering, Polytechnic Institute of New York University, Brooklyn, NY 11201 2. Yahoo! Research

What are the best compression methods for search engine indexes? In particular, which methods work best when the documents are given in a particular order?

Outline • Motivation & Background • Our Methods • Conclusion

Search Engines – Performance Challenge
Performance challenges in large web search engines:
- A large amount of data (>1,000,000 pages)
- Fast! (>1,000 queries/second)

Many ways to improve performance • Caching • Early termination • Parallel processing • Data compression (inverted index compression)

Using Compression is Better [Figure: before/after comparison of disk traffic and the required size of memory]

What are Inverted Indexes?
• They store information about where a word (term) occurs in the collection
• Each word has an inverted list, a sequence of integers (called docIDs), each of which uniquely identifies a document in the collection
• Inverted index = a set of such inverted lists
  madrid: 345, 777, 11437, …
  museum: 777, 1234, 4356, 12457, …
  spain: 4, 19, 29, 98, 143, 777, …
  barcelona: 145, 457, 777, 789, …
  mall: 678, 777, 2134, 3970, …
  university: 90, 256, 372, 511, 777, 1000, …

DocID + Frequency
• madrid: 345, 777, 11437, …
• madrid: <345, 7>, <777, 4>, <11437, 11>, …
• Inverted list: <docID_1, freq_1>, <docID_2, freq_2>, <docID_3, freq_3>, …
• Normal layout for better compression: store docIDs and frequencies separately
  madrid docIDs: 345, 777, 11437, …
  madrid freqs: 7, 4, 11, …

D-gaps (Differences Between DocIDs)
• DocIDs madrid: 345, 777, 11437, …
• D-gaps madrid: 345, 432, 10660, …
• Frequencies (freqs) madrid: 7, 4, 11, … These are not in sorted order, so no gaps can be taken!
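
The d-gap transform on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming (as in the madrid example) that the first d-gap is the first docID itself; the function names `to_dgaps` and `from_dgaps` are made up for this sketch.

```python
from itertools import accumulate

def to_dgaps(doc_ids):
    """Turn a strictly increasing docID list into d-gaps.

    Assumption: the first gap is measured from 0, so it equals the first docID.
    """
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]

def from_dgaps(gaps):
    """Restore the docIDs by taking the running sum of the gaps."""
    return list(accumulate(gaps))

doc_ids = [345, 777, 11437]
gaps = to_dgaps(doc_ids)
print(gaps)              # [345, 432, 10660]
print(from_dgaps(gaps))  # [345, 777, 11437]
```

The gaps are never larger than the docIDs they replace, which is what makes them cheaper to encode.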

Blocks of Inverted Lists (d-gaps)
• madrid: 345, 111, 1, …, 2, 14, …, 312, 423, …
• Partitioned into blocks:
  block 1: 345, 111, 1, …
  block 2: 2, 14, …
  block 3: 312, 423, …
  …
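
Partitioning a list of d-gaps into blocks can be sketched as follows. The block size of 128 is an assumption borrowed from the PForDelta slides later in the talk, and `split_blocks` is a hypothetical helper name.

```python
def split_blocks(values, block_size=128):
    """Cut a list into consecutive blocks; the last block may be shorter."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]

gaps = list(range(1, 301))      # 300 stand-in d-gaps
blocks = split_blocks(gaps)
print([len(block) for block in blocks])  # [128, 128, 44]
```

Each block can then be compressed and decompressed independently of the others.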

Indexes – In Summary
• D-gaps:
  block 1: 345, 111, 1, …, 78
  block 2: 2, 14, …, 112
  block 3: 312, 423, …, 1238
  …
• Freqs:
  block 1: 2, 8, …, 4, 3
  block 2: 3, 7, …, 5
  block 3: 7, 3, …, 1
  …

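Query Processing [Figure: for the query "madrid museum", the inverted lists madrid (345, 777, 11437, …) and museum (777, 1234, 4356, 12457, …) are read from the index on disk into memory; the docID common to both lists is 777]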

How to Judge Compression Techniques • Small compressed index size • Fast decompression (frequently called) • Compression can be slower (seldom called)

Existing Research on Compressing • DocIDs: many existing techniques • Frequencies: few techniques

Related Research – Reordering
• Given a collection of N documents, we first need to assign them document IDs
• DocIDs can be assigned randomly, or simply in the order in which documents are crawled
• However, researchers have found that special assignment strategies may result in better compression

Related Research – Reordering
• R. Blanco and A. Barreiro. Document identifier reassignment through dimensionality reduction. ECIR'05.
• D. Blandford and G. Blelloch. Index compression through document reordering. DCC'02.
• W. Shieh, T. Chen, J. Shann, and C. Chung. Inverted file compression through document identifier reassignment. Inf. Processing and Management, 2003.
• F. Silvestri. Sorting out the document identifier assignment problem. ECIR'07.
• F. Silvestri, S. Orlando, and R. Perego. Assigning identifiers to documents to enhance the clustering property of fulltext indexes. SIGIR'04.

Related Research – Reordering
• Idea: first sort documents in a particular order such that similar documents are clustered (close to each other); then assign docIDs sequentially
• Advantage: most d-gaps are smaller and can be better compressed
• Reason: if a word occurs in a document, it is very likely to occur in similar documents; if these similar documents have similar docIDs, the d-gaps in the word's inverted list become much smaller
• madrid (docIDs): 345, 777, 11437, 11438, 11443, 11450, …
• madrid (d-gaps): 345, 432, 10660, 1, 5, 7, …
• One interesting re-ordering method: sorting documents by URLs (F. Silvestri, Sorting out the document identifier assignment problem, ECIR'07)
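
As a rough sketch of the URL-sorting idea, docIDs can be handed out in sorted URL order. The plain lexicographic sort, the docIDs starting at 1, and the example URLs are all assumptions made for illustration; the cited work may normalize URLs differently.

```python
def assign_doc_ids(urls):
    """Assign sequential docIDs (starting at 1) in lexicographic URL order.

    Assumption: a plain string sort stands in for the URL ordering.
    """
    return {url: doc_id for doc_id, url in enumerate(sorted(urls), start=1)}

# Hypothetical crawl order: pages of the same site arrive interleaved.
crawled = ["b.example/x", "a.example/news/2", "a.example/news/1"]
for url, doc_id in assign_doc_ids(crawled).items():
    print(doc_id, url)
```

Pages from the same site and directory end up with adjacent docIDs, which is where the small d-gaps come from.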

Existing Research about Compression with Reordering
• Previous work has focused on determining the best possible ordering of documents
• However, few existing techniques focus on compressing indexes AFTER reordering
  • DocID: NO
  • Frequency: NO

Our work – the rest of the talk
Given the special reordering (sorting by URLs), we study:
• The best inverted index compression methods for
  • DocIDs
  • Frequencies
• Query processing
• A hybrid approach combining different methods

DocID Compression - Contributions • Extensively study most existing methods • Propose improved PForDelta coding • Study the effects of document re-ordering on docID compression

Experiment Setup
• Data set
  • TREC GOV2, 25.2 million pages
  • 1000 randomly selected queries
• Measurement metrics
  • Size
    • Data associated with each query: MB/query
    • Data for the entire index: MB
    • Bits/int (in some places)
  • Speed
    • Million integers/second
• Different orderings
  • Original
  • Random
  • Sorted

DocID (d-gap) Compression • Gamma coding • Delta coding • Variable byte coding (Var-byte) • Golomb coding • RiceVT coding • Simple9 (S9) • Simple16 (S16) • Interpolative (IPC) • PForDelta (PFD)
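
To make one entry of this list concrete, here is a minimal sketch of variable byte coding. Several byte layouts are in use; this one assumes 7 payload bits per byte, least significant group first, with the high bit set on the last byte of each number. It is not necessarily the variant measured in the talk.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte (assumed layout)."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, high bit clear: more bytes follow
            n >>= 7
        out.append(n | 0x80)      # high bit set marks the last byte of a number
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

gaps = [345, 111, 1, 10660]
encoded = vbyte_encode(gaps)
print(len(encoded))           # 6 bytes instead of 4 * 4 = 16
print(vbyte_decode(encoded))  # [345, 111, 1, 10660]
```

Small d-gaps take a single byte, which is why the method benefits from any ordering that shrinks the gaps.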

PForDelta Coding (PFD)
• S. Heman. Super-scalar database compression between RAM and CPU-cache. MS Thesis, Centrum voor Wiskunde en Informatica, Netherlands, July 2005
• M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. of the Int. Conf. on Data Engineering, 2006
• Decompression is extremely fast
• Compression ratio is not the best but still good

PForDelta Coding (PFD, by S. Heman et al.)
• Using 2 bits to encode 128 numbers
[Figure: a block of 128 numbers (positions 1 to 128), each stored in a 2-bit slot; values that do not fit, such as 42 and 23, are stored as exceptions]
Problem:
• We have to insert extra exceptions if two consecutive exceptions are far from each other

First Improvement of PForDelta (NewPFD)
• Using b (e.g., 2) bits to encode 128 numbers
[Figure: a block of 128 numbers; the lower b bits of every number (e.g., the lower 2 bits of 23) go into the b-bit slots, and the higher bits of the exceptions (e.g., of 42 and 23) are kept separately]
• Offset array
• Exception array
• Both arrays are compressed with S9 or S16
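
The NewPFD layout can be sketched as below. This is a simplified illustration under stated assumptions: the arrays are left as plain Python lists (no bit packing, no S9/S16 step), exception positions are stored directly rather than as offsets, and the short block is made up rather than taken from the figure.

```python
def newpfd_encode(block, b):
    """Split a block into b-bit low parts plus exception data.

    Returns (low, positions, high): the lower b bits of every value, the
    positions of values that need more than b bits, and their higher bits.
    """
    mask = (1 << b) - 1
    low = [v & mask for v in block]
    positions = [i for i, v in enumerate(block) if v > mask]
    high = [block[i] >> b for i in positions]
    return low, positions, high

def newpfd_decode(low, positions, high, b):
    """Rebuild the block by patching the higher bits back into the exceptions."""
    out = list(low)
    for i, h in zip(positions, high):
        out[i] |= h << b
    return out

block = [3, 42, 11, 2, 3, 3, 1, 1, 23]   # hypothetical short block
low, positions, high = newpfd_encode(block, b=2)
print(low)        # [3, 2, 3, 2, 3, 3, 1, 1, 3]
print(positions)  # [1, 2, 8]
print(high)       # [10, 2, 5]
print(newpfd_decode(low, positions, high, b=2) == block)  # True
```

Because every slot holds real low bits, no slot has to be spent on linking exceptions together, which is the problem noted for the original PFD.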

Second Improvement of PForDelta (OptPFD)
• How to select the number of bits b for each block?
• The original PFD uses a constant value for b
• A constant b is not good enough
  • Larger b: wastes more bits to encode each number
  • Smaller b: results in more exceptions
• The best b should achieve the best tradeoff between compressed size and decompression speed
• This is difficult to formalize

Second Improvement of PForDelta (OptPFD)
How to select the number of bits b for each block?
• Derive a global table for the choice of b, capturing the tradeoff between size and speed
• Based on the table, we can dynamically choose b during query processing to achieve the best overall performance (we will talk about this in more detail later)
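
One simple way to pick b per block is to estimate the compressed size for each candidate and keep the smallest. The sketch below does only that; it ignores decompression speed and the global table mentioned above, and its cost model (a fixed `offset_bits` per exception plus the bit length of the higher bits) is an assumption, not the measured cost of any real encoder.

```python
def estimated_bits(block, b, offset_bits=8):
    """Rough size estimate in bits for one block encoded with b-bit slots."""
    mask = (1 << b) - 1
    exceptions = [v for v in block if v > mask]
    high_bits = sum((v >> b).bit_length() for v in exceptions)
    return len(block) * b + len(exceptions) * offset_bits + high_bits

def choose_b(block, candidates=range(1, 33)):
    """Return the candidate b with the smallest estimated size."""
    return min(candidates, key=lambda b: estimated_bits(block, b))

mostly_small = [1] * 127 + [1000]   # one large outlier
uniform = [3] * 128                 # every value needs exactly 2 bits
print(choose_b(mostly_small))  # 1
print(choose_b(uniform))       # 2
```

A single outlier no longer forces a wide slot for the whole block; it is cheaper to pay for one exception.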

Global Table for OptPFD
We want:
• Decompression speed above 1200 million docIDs/second
• Compressed size of no more than 1.5 MB/query

PForDelta is Extremely Fast for Decompression! Over 1 billion docIDs/second! [Chart: decompression speed in million docIDs/second]

DocID Compression – with Re-ordering • Only 1.47 bits/docID

Frequency Compression – Related Work
• Few papers focus specifically on it
• Usually one just chooses a method that works well for compressing small numbers, such as Gamma coding or Rice coding
• No one has studied it for web indexes under document reordering

Frequency Compression – Unique Features
• Normally frequencies are quite small values
• Unlike docIDs, they are not in sorted order (this has nothing to do with re-ordering)
• Re-ordering makes them clustered, but we cannot take advantage of the clustering as directly as for docIDs, since we cannot take gaps between them

Frequency Compression - My Algorithms
• However, we still want to take advantage of the clustering property brought by reordering
• Idea: when things are clustered but unsorted, we can use some transform to make the frequency values smaller
• Preprocessing techniques:
  • Move to front (MTF) (Bentley, Burrows-Wheeler compression, Comm. of the ACM, 1986)
  • Most likely next (MLN)

Our Preprocessing Algorithms: MTF / MLN
• Idea: use indexes of previous occurrences to encode the current number
• Move to front (MTF): keep an additional index array, and do a move-to-front operation during encoding
• Most likely next (MLN): keep a small table that stores, for each value, which values are most likely to follow; sort values by their likelihoods, and then use indexes in the table to represent the values
• Why better? The indexes are smaller than the frequencies themselves when frequencies are clustered!
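
The move-to-front step can be sketched as follows. This is the textbook transform, in which a value jumps all the way to the front of the table; the variant used in the talk may move values differently, and the `alphabet` of possible frequency values is an assumption of this sketch. MLN is not shown.

```python
def mtf_encode(values, alphabet):
    """Replace each value by its current position in a move-to-front table."""
    table = list(alphabet)
    out = []
    for v in values:
        i = table.index(v)
        out.append(i)
        table.insert(0, table.pop(i))  # move the value just seen to the front
    return out

def mtf_decode(indexes, alphabet):
    """Inverse transform: replay the same table updates."""
    table = list(alphabet)
    out = []
    for i in indexes:
        v = table.pop(i)
        out.append(v)
        table.insert(0, v)
    return out

freqs = [7, 7, 7, 4, 4, 7]                 # clustered but unsorted
alphabet = range(1, 12)                    # assumed range of frequency values
codes = mtf_encode(freqs, alphabet)
print(codes)                               # [6, 0, 0, 4, 0, 1]
print(mtf_decode(codes, alphabet) == freqs)  # True
```

Runs of equal frequencies turn into runs of zeros, which the subsequent coder can store in very few bits.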

Results – Compressed Size • Only 1.54 bits/freq [Chart: compressed size in MBytes/query]

Results – Decompression Speed [Chart: decompression speed in million frequencies/second]

Compressed Size for the Entire Index
• Methods compared: IPC, NewPFD, OptPFD
• Compressed index size (MB) on the entire GOV2 data set, which contains 25.2 million web pages; its uncompressed size is 500 GB!
• IPC: 3.45 GB; our optimized PFD: 3.88 GB, and super fast!

Query Processing – Skipping
[Figure: the docIDs 14, 23, …, 43, 67, 77, …, 89, 100, 123, …, 150 stored in three compressed blocks; some blocks are skipped and only the needed block is decompressed]
• We must decompress the entire block of docIDs
• Search 123:
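
Block skipping can be sketched as below. The sketch assumes the largest docID of every block is kept uncompressed next to the blocks, which the slide does not state explicitly; the block boundaries are made up and "compression" is a stand-in (plain tuples).

```python
from bisect import bisect_left

def find_block(last_doc_ids, target):
    """Return the index of the only block that could hold target, or None."""
    i = bisect_left(last_doc_ids, target)
    return i if i < len(last_doc_ids) else None

def lookup(blocks, last_doc_ids, target, decompress):
    """Check whether target occurs, decompressing at most one block."""
    i = find_block(last_doc_ids, target)
    if i is None:
        return False
    return target in decompress(blocks[i])

# Hypothetical block boundaries; tuples stand in for compressed blocks.
blocks = [(14, 23, 43), (67, 77, 89), (100, 123, 150)]
last_doc_ids = [block[-1] for block in blocks]

decompressed = []          # records which blocks were actually decoded
def decompress(block):
    decompressed.append(block)
    return list(block)

print(lookup(blocks, last_doc_ids, 123, decompress))  # True
print(len(decompressed))                              # 1
```

The search for 123 touches one block out of three; the other two are skipped without being decoded.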

Query Processing – Impact of Reordering
[Chart: results for OptPFD]
• Query processing is faster!
• Reason:
  • After sorting by URLs, docIDs are clustered into fewer blocks!
  • Therefore, fewer blocks of docIDs need to be decompressed

Then, How To Compress Indexes?
• Tradeoff: compressed size vs. decompression speed
• Mixed methods
  • Frequently used indexes: faster decompression, e.g., PForDelta
  • Infrequently used indexes: smaller compressed size, e.g., IPC

One Example of Using Mixed Methods
• Goal:
  • The overall index size is minimized
  • While average time per query < a given time limit T
• For each inverted list: PForDelta or IPC?
  madrid: 345, 777, 11437, …
  museum: 777, 1234, 4356, 12457, …
  spain: 4, 19, 29, 98, 143, 777, …
  barcelona: 145, 457, 777, 789, …
  mall: 678, 777, 2134, 3970, …
  university: 90, 256, 372, 511, 777, 1000, …
• Solution:
  • Choose the list and compression method that give the smallest increase in index size per time saved
• Note: this can be easily integrated into the normal process of index construction, especially when indexes are built block-wise
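
The greedy rule in the solution can be sketched as follows. Everything concrete here is hypothetical: the per-list sizes and times are invented numbers, the total estimated decoding time stands in for the average time per query, and only two methods (IPC as the small one, PFD as the fast one) are considered.

```python
def choose_methods(lists, time_limit):
    """Greedily switch lists from IPC to PFD until the time limit is met.

    lists maps a term to (size_ipc, time_ipc, size_pfd, time_pfd).
    Lists are switched in order of size increase per unit of time saved.
    """
    choice = {term: "IPC" for term in lists}
    total_time = sum(v[1] for v in lists.values())
    candidates = sorted(
        (t for t, v in lists.items() if v[1] > v[3]),   # only lists that get faster
        key=lambda t: (lists[t][2] - lists[t][0]) / (lists[t][1] - lists[t][3]),
    )
    for term in candidates:
        if total_time <= time_limit:
            break
        choice[term] = "PFD"
        total_time -= lists[term][1] - lists[term][3]
    return choice, total_time

# Hypothetical sizes (MB) and decoding times (ms) per list.
lists = {
    "madrid": (10, 8.0, 14, 2.0),
    "museum": (20, 3.0, 30, 1.0),
    "spain": (5, 1.0, 6, 0.9),
}
choice, total_time = choose_methods(lists, time_limit=7.0)
print(choice)      # {'madrid': 'PFD', 'museum': 'IPC', 'spain': 'IPC'}
print(total_time)  # 6.0
```

Only the list whose switch buys the most time per added megabyte is moved to the fast method; the rest stay in the smaller encoding.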

Mixed Methods with OptPFD and IPC [Chart: only the label "OptPFD" is recoverable]

Conclusions – How to Compress Web Indexes?
Previous researchers found:
• Reordering improves docID compression using standard compression methods
In our paper, given a particular ordering (sorting by URLs):
• DocID:
  • We confirmed this (sorting by URLs is better) by testing most existing compression methods
  • We proposed optimized PForDelta, which achieves the best performance in terms of both compressed size and decompression speed
• Frequency:
  • We proposed MTF/MLN to reduce the compressed size
• Query processing (QP):
  • We found that reordering improves QP since fewer docIDs/freqs need to be decoded
• Mixed methods: we propose a hybrid method that tries to achieve the best tradeoff between
  • Compressed size
  • Decompression speed

Thank you !