A Search Engine Architecture Based on Collection Selection

Diego Puppin
University of Pisa, Italy
Supervisors: D. Laforenza, M. Vanneschi
September 2007

Introduction

Motivations
- The Web is getting bigger and bigger, and users are more and more demanding: precise results are needed very fast
- The index is growing, due to added pages and advanced indexing
- Big IR problems for Web, book, and multimedia search engines

Motivations (2)
- There is a need for new solutions that give high-quality results with reduced computing load
- Parallel computing looks like the most natural choice to help algorithms face this growth rate [Baeza-Yates et al., 2007a]
- Billions of pages and several TB of data are available: the index is still very big (about 5x the collection size)
- New approaches to partitioning are key to the next phase

Parallel (Distributed) IRSs

Term vs Doc partitioning

Term vs Doc partitioning
- Reduced computing load for term partitioning
  - Only the servers with relevant terms are involved
  - Problems of load balancing
  - Heavier communication patterns
- Doc. partitioning gives better balancing, but all documents are scanned
- How to reduce the load with doc. partitioning?
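
The contrast between the two schemes can be sketched in a few lines. This is only a toy illustration: the function names, the byte-sum hash used to place terms, and the server count are assumptions made for the example, not part of the presented system.

```python
def term_server(term, n_servers):
    # Toy deterministic placement of a term on a server (assumed for this sketch).
    return sum(term.encode("utf-8")) % n_servers


def servers_contacted_term(query, n_servers):
    # Term partitioning: only the servers holding the query terms are involved.
    return sorted({term_server(t, n_servers) for t in query.split()})


def servers_contacted_doc(query, n_servers):
    # Document partitioning: without collection selection, every server
    # scans its local index for every query.
    return list(range(n_servers))
```

With a two-term query, term partitioning contacts at most two servers, while plain document partitioning contacts all of them; collection selection aims to cut the latter number down.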

Main contributions
1. Query-vector document model
  - More efficient for partitioning and selection (co-clustering and PCAP)
2. Load-driven routing
  - Makes better use of the available load
  - Based on the effective load of the system
3. Incremental caching
  - Improves throughput AND quality

Acknowledgments
- Fabrizio Silvestri
- Raffaele Perego
- Ricardo Baeza-Yates
- Abdur Chowdhury, Ophir Frieder, Gerhard Weikum, and the various reviewers…

Other contributions
- More compact collection representation
  - 1/5 of CORI, and outperforming it
- A way to select documents (50%) to move out of the index
  - The documents in the supplemental index contribute to only 3% of top results
- A simple way to update the index in a doc.-partitioned system
- Extended simulation
  - 6 M documents, 800 k test queries, real computing costs, several configurations tested

Reviewers' Requests: Frieder
- More detailed discussion of the co-clustering algorithm
- Improved cost scheme
- Experiments to be extended in the future

Reviewers' Requests: Weikum
- Improved description of the pipelined term-partitioned IR system
- Improved description of co-clustering
- Better definition of shingles
- New realistic cost model
- Deeper discussion of cache and silent documents

How to Improve Partitions

Partitioning Strategy
(Diagram: a partitioning strategy splits the document collection into partitions p1, p2, ..., pp.)
- Random
- Content-based (e.g. k-means, link-based clustering)
- Usage-based

The QV Model
(Figure: a query-by-document matrix, shown before and after co-clustering reorders its rows into query clusters and its columns into document clusters.)
- Each cell of the matrix records whether document j is returned in answer to query i, or document j is not relevant to query i
- Each document cluster corresponds to a different partition; in this case three partitions are generated
- For each query cluster, a vocabulary is built out of all the different query terms of the queries in the cluster
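
A minimal sketch of how such a query-by-document matrix could be assembled from a training query log. It assumes the boolean variant of the matrix, and it treats documents never returned by any training query as the "silent" documents mentioned elsewhere in these slides; `build_qv` and its argument names are hypothetical.

```python
def build_qv(query_results, all_docs):
    """Build a boolean query-by-document matrix.

    query_results: {query: [doc_id, ...]} as returned by a reference engine
                   over the training log (assumed input format).
    all_docs:      iterable of every doc_id in the collection.
    """
    returned = set().union(*query_results.values())
    docs = sorted(returned)              # columns: documents recalled at least once
    queries = sorted(query_results)      # rows: training queries
    col = {d: j for j, d in enumerate(docs)}
    matrix = [[0] * len(docs) for _ in queries]
    for i, q in enumerate(queries):
        for d in query_results[q]:
            matrix[i][col[d]] = 1        # document d is returned in answer to query q
    silent = sorted(set(all_docs) - returned)  # never recalled by the training log
    return queries, docs, matrix, silent
```

The matrix would then be handed to a co-clustering routine; the silent documents have empty columns and cannot be clustered this way, so they need separate handling.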

Theoretical Model of Co-clustering
- The algorithm we use [Dhillon et al., 2003] finds the clustering that minimizes the loss of information between the original matrix and the clustered matrix (given the number of row and column clusters)
- Efficient implementation, very robust solution
  - Stable with respect to test period, number of clusters, training set used, and matrix model (scores, boolean, repeated)

QV for Collection Selection
(Diagram: query -> query clusters -> document clusters.)
- Partitions are ranked according to their relevance to the query
- We called this strategy PCAP

PCAP collection selection

Experimental Settings
- Experiments were carried out using:
  - WBR99: 5,939,061 documents; 22 GB uncompressed text
    - Snapshot of the Brazilian Web (domain .br) back in 1999
  - A query log from todobr.com relative to the period Jan-Oct 2003
  - Zettair as the IR core
- Training: 190,000 queries; test: 800,000 queries
- We created 16 + 1 doc. clusters and 128 query clusters
- Model tested on the successive week (the fourth week)
- Metrics used:
  - Intersection: percentage of relevant results returned using only k servers out of 16+1 (from [Puppin et al., 2006])
  - Competitive similarity: percentage of relevance score obtained using only k servers out of 16+1 (adapted from [Chierichetti et al., 2007])
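
One way to read the two metrics, as a hedged sketch: the top-N list from the full index is taken as the reference, and the top-N list obtained from only the polled servers is compared against it. The function names and the `(doc_id, score)` list format are assumptions for illustration; the exact definitions are in the cited papers.

```python
def intersection(full_top_n, partial_top_n):
    # Number of reference results that also appear when polling only k servers.
    reference = {doc for doc, _ in full_top_n}
    return sum(1 for doc, _ in partial_top_n if doc in reference)


def competitive_similarity(full_top_n, partial_top_n):
    # Fraction of the reference relevance score recovered by the partial result list.
    total = sum(score for _, score in full_top_n)
    if total == 0:
        return 1.0
    return sum(score for _, score in partial_top_n) / total
```

Intersection counts exact matches with the reference list, while competitive similarity also rewards results that are not in the reference list but score almost as well.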

Quality Metrics

Very Effective Partitioning and Selection
Intersection at 5, 10, and 20, by number of servers polled (out of 16+1).

CORI on Random Partitioning
Servers   at 5    at 10   at 20
1         0.30    0.59    1.20
2         0.57    1.16    2.49
4         1.27    2.55    5.04
8         2.62    5.00    9.77
16        4.60    9.30   18.71
17        5.00   10.00   20.00

CORI on QV Partitioning
Servers   at 5    at 10   at 20
1         1.55    3.05    5.97
2         2.29    4.48    8.77
4         3.01    5.92   11.61
8         3.83    7.62   15.10
16        4.89    9.77   19.54
17        5.00   10.00   20.00

PCAP on QV Partitioning
Servers   at 5    at 10   at 20
1         1.73    3.47    6.92
2         2.26    4.51    9.02
4         2.89    5.75   11.47
8         3.76    7.50   14.98
16        4.84    9.66   19.29
17        5.00   10.00   20.00

- CORI on QV performs about 5.2 times better than CORI on random
- PCAP on QV performs about 1.1 times better than CORI on QV
- PCAP on QV performs about 5.8 times better than CORI on random
- On random partitioning, CORI performs really badly: almost equal to relevant/Nclusters, e.g. 5/17 = 0.29411765 ~ 0.3
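
The quoted improvement factors appear to correspond to the intersection at 5 with a single server polled (0.30, 1.55, and 1.73 in the slide's figures). The short check below recomputes them under that assumption; the dictionary keys are labels chosen for this example.

```python
# Intersection at 5 with one server polled, as reported on the slide.
inter5_one_server = {
    "CORI on random": 0.30,
    "CORI on QV": 1.55,
    "PCAP on QV": 1.73,
}


def improvement(method, baseline):
    # Ratio of the two intersection values, rounded as on the slide.
    return round(inter5_one_server[method] / inter5_one_server[baseline], 1)


# Expected intersection at 5 when one of 17 servers is chosen with no useful signal.
random_baseline = round(5 / 17, 2)
```

Here `improvement("CORI on QV", "CORI on random")` gives 5.2, `improvement("PCAP on QV", "CORI on QV")` gives 1.1, `improvement("PCAP on QV", "CORI on random")` gives 5.8, and `random_baseline` is 0.29.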

Strengths
- Popular queries are driving the distribution
- Low-dimensional space to represent documents
- More efficient collection representation
- QV may be built while answering queries

Weaknesses
- Dependent on the training set
  - Actually… NOT!
- Cannot manage new query terms
  - Very small fraction, CORI does not help
  - Inc. caching can help
- Collection selection dependent on assignment
  - But addition does not break performance

Issues with Load Distribution

Load Balancing
- Load is measured as the maximum number of queries answered by each IR core within a sliding query window of 1000 queries
- Still, the maximum load is ~25% of the maximum capacity available at each IR core

Load Balancing Strategies
- Load-driven basic <L>
  - Servers are ranked according to their relevance, using a collection selection function. The first gets priority 1, then linearly down to 1/17
  - Every server i has to answer if: L(i) < p(i) * L
- Load-driven boost <L, T>
  - Priority is 1 for the first T servers, then linearly down to 1/(17 - T)
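
A minimal sketch of the two strategies, under one reading of the slide: ranks are 0-based, the basic strategy is the boost strategy with T = 0, and after the first T servers the priority decreases in equal steps until it reaches 1/(N - T) at the last rank. The function names and the `load` dictionary are hypothetical.

```python
def priorities(n_servers, boost=0):
    # Priority per rank: 1 for the first `boost` ranks, then linear down to 1/(n - boost).
    tail = n_servers - boost
    return [
        1.0 if rank < boost else (tail - (rank - boost)) / tail
        for rank in range(n_servers)
    ]


def servers_to_poll(ranked_servers, load, threshold, boost=0):
    # ranked_servers: server ids ordered by the collection selection function.
    # load[s]: current load L(s); threshold: the load cap L.
    prios = priorities(len(ranked_servers), boost)
    return [s for s, p in zip(ranked_servers, prios) if load[s] < p * threshold]
```

Under this rule a lightly loaded server can still be polled even when it is ranked low, while a heavily loaded server answers only the queries for which it is ranked near the top.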

Experimental Settings (2)
- The broker models the load in the cores as the number of queries served from the last W queries
- Assumption: cost = 1 for each query and collection
  - We will change this
- We count the number of relevant results we can get by polling the servers, up to the chosen load threshold

Load Balancing Results

Intersection (# of relevant results retrieved)
at    FIXED 4   BASIC<24.7>   BOOST<4, 24.7>
5     3.10      3.40          3.55
10    6.00      6.80          7.00
20    12.20     13.60         14.00

Competitive Similarity (% of rank score retrieved)
at    FIXED 4   BASIC<24.7>   BOOST<4, 24.7>
5     0.88      0.91          0.92
10    0.87, 0.90 (only two of the three values survive in the source)
20    0.85      0.89          0.90

Caching and Collection Selection

Interaction with a Cache
- Result caching is commonly used in WSEs [Baeza-Yates et al., 2007a; Baeza-Yates et al., 2007b]
- Caching has the effect of reshaping the power law underlying the query distribution [Baeza-Yates et al., 2007a]
- We designed a novel caching strategy (i.e. Incremental Caching) integrated with collection selection

Incremental Caching
(Diagram: queries Q reach the incremental cache, placed in front of IR Cores 1-4; each cache line holds the query, the servers polled, and the results.)
- An incremental cache is effective both at load reduction and at improving result quality
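
The slide does not spell out the update rule, so the following is only a sketch of one plausible reading: each cache line remembers which servers were already polled for a query, a repeated query polls only the newly selected servers, and the merged result list replaces the cached one. The class name, `poll_fn`, and the `(doc_id, score)` format are assumptions for illustration.

```python
class IncrementalCache:
    def __init__(self, poll_fn, top_n=10):
        # poll_fn(query, server) -> list of (doc_id, score); hypothetical callback.
        self.poll_fn = poll_fn
        self.top_n = top_n
        self.entries = {}  # query -> (set of servers polled so far, cached results)

    def answer(self, query, selected_servers):
        polled, results = self.entries.get(query, (set(), []))
        new_servers = [s for s in selected_servers if s not in polled]
        merged = dict(results)
        for server in new_servers:          # load: only servers not polled before
            for doc, score in self.poll_fn(query, server):
                merged[doc] = max(score, merged.get(doc, score))
        # Quality: the cached list can only improve as more servers contribute.
        best = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[: self.top_n]
        self.entries[query] = (polled | set(new_servers), best)
        return best
```

Under this reading, a popular query gradually accumulates results from more servers across repetitions, so its cached answer approaches the one a full broadcast would give.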

Incremental Caching Results

Intersection (like P@N: # of relevant results retrieved)
at    BASIC<24.7>   BOOST<4, 24.7>   INCREMENTAL
5     3.40          3.55             4.00
10    6.80          7.00             7.80
20    13.60         14.00            15.60

Competitive Similarity (% of rank score retrieved)
at    BASIC<24.7>   BOOST<4, 24.7>   INCREMENTAL
5     0.91          0.92             0.94
10    0.90, 0.93 (only two of the three values survive in the source)
20    0.89          0.90             0.93

Refined Cost Model and Prioritization

Collection Prioritization
- We move the load control from the broker to the cores
- The broker broadcasts the query, and sends info about the relative rank of each core (the priority)
- Each core serves the query if L(i) < p(i) * L
- L(i) = sum of the computing cost (timing) of served queries
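
A core-side sketch of this rule. It assumes that the load is accumulated over a sliding window of the most recent broadcast queries (a window is used for the broker-side model in these slides; applying it here is an assumption), and that the measured cost of a query is supplied by the caller. The class and parameter names are hypothetical.

```python
from collections import deque


class Core:
    def __init__(self, threshold, window=1000):
        self.threshold = threshold           # load cap L
        self.costs = deque(maxlen=window)    # one entry per recent query; 0.0 if skipped

    def load(self):
        # L(i): total computing cost of the queries served within the window.
        return sum(self.costs)

    def handle(self, priority, cost):
        # Serve only if the current load is below the priority-scaled cap.
        serve = self.load() < priority * self.threshold
        self.costs.append(cost if serve else 0.0)
        return serve
```

Because each core decides locally, the broker only needs to broadcast the query together with each core's priority; it no longer has to track per-core load itself.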

Extended Tests
- We actually partitioned the documents onto different servers
- We indexed locally, and we measured the timing of each query
- The actual timing is used to compute the load and drive the system
- Load cap is the AVERAGE load
  - The peak can vary heavily!

…the bill, please!

Conclusions
- We presented an architecture for a distributed search engine, based on collection selection
- The load-driven strategy and the incremental caching can retrieve very high quality results, with reduced load
- Verified with an extensive simulation

Impact and Benefits
- If a given precision is expected, we can use FEWER servers
- With a given number of servers, we get HIGHER precision
  - Confirmed with different metrics
- Smaller load for the IR system, with more focus on top results
- Nice trade-off of cost vs. quality

Impact and Benefits (2)
- Load-driven routing can be used
  - to absorb query peaks
  - to offer higher/lower quality results to selected users
- Consistent ranking due to local indexing
- Inc. caching can be used to reduce the negative effects of selection

Furthermore
- Caching posting lists is very effective on local indices
- Simple way to add new documents
- Inc. caching could help with impact-ordered posting lists
- Caching could be based on line value (query frequency, number of polled servers)

Future Work
- Comparison with other results in clustering (k-means, link-based, P2P, LSI, SVD)
- Test on a large-scale, real-world search engine
- Real-world implementation at Google
- TOIS paper to wrap up

References
- [Puppin et al., 2006] Diego Puppin, Fabrizio Silvestri, Domenico Laforenza. "Query-Driven Document Partitioning and Collection Selection". Invited Paper. Proceedings of INFOSCALE '06.
- [Puppin & Silvestri, 2006] Diego Puppin, Fabrizio Silvestri. "The Query-Vector Document Model". Proceedings of CIKM '06.
- [Puppin et al., 2007] Diego Puppin, Ricardo Baeza-Yates, Raffaele Perego, Fabrizio Silvestri. "Incremental Caching for Collection Selection Architectures". Proceedings of INFOSCALE '07.

References (2)
- [Baeza-Yates et al., 2007a] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, Fabrizio Silvestri. "Challenges in Distributed Information Retrieval". Invited Paper. Proceedings of ICDE 2007.
- [Chierichetti et al., 2007] F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, E. Upfal. "Finding Near Neighbors Through Cluster Pruning". Proceedings of PODS 2007.
- [Baeza-Yates et al., 2007b] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, Fabrizio Silvestri. "The Impact of Caching on Search Engines". Proceedings of SIGIR 2007.

References (3)
- [Dhillon et al., 2003] I. S. Dhillon, S. Mallela, D. S. Modha. "Information-Theoretic Co-Clustering". Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003).

Backup Slides

Adding Documents

Adding Documents
- It is important to assign new documents to the fittest clusters
  - New versions, new pages, etc.
- The new documents will be found along with the previously assigned documents
- Hopefully the coll. selection will find them with similar docs

A Modest Proposal
- The body of the new document is used as a query for the PCAP selection
- The body is compared to the query clusters
- We will find a similarity between doc. body and query cluster
- We use PCAP to rank doc. collections

Implementation
- The first 1000 bytes of the (stripped) document body are used
- The new doc is assigned to the doc. cluster with the top PCAP score
- New docs are locally indexed
- No need to re-train / re-assign
- New docs have consistent score and ranking
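
The assignment step can be sketched as follows. The PCAP scoring function itself is not defined in these slides, so it is passed in as a parameter; `assign_new_document`, `score_clusters`, and the UTF-8 handling of the 1000-byte prefix are assumptions made for this example.

```python
def assign_new_document(body, score_clusters, prefix_bytes=1000):
    """Pick the document cluster for a new document.

    body:           stripped text of the new document.
    score_clusters: callable taking a text query and returning
                    {cluster_id: score}; stands in for PCAP (hypothetical).
    """
    # Use only the first `prefix_bytes` bytes of the body as the query text.
    prefix = body.encode("utf-8")[:prefix_bytes].decode("utf-8", errors="ignore")
    scores = score_clusters(prefix)
    # The document goes to the cluster with the top score.
    return max(scores, key=scores.get)
```

Since the document is routed with the same selection function later used for queries, queries similar to its text are likely to be sent to the cluster where it was indexed.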

Test Configurations