Practical considerations for a web-scale search engine
Michael Isard, Microsoft Research Silicon Valley
Search and research
• Lots of research motivated by web search
  – Explore specific research questions
  – Small to moderate scale
• A few large-scale production engines
  – Many additional challenges
  – Not all purely algorithmic/technical
• What are the extra constraints for a production system?
Production search engines
• Scale up
  – Tens of billions of web pages, images, etc.
  – Tens of thousands to millions of computers
• Geographic distribution
  – For performance and reliability
• Continuous crawling and serving
  – No downtime, need fresh results
• Long-term test/maintenance
  – Simplicity a core goal
Disclaimer
• Not going to describe any particular web-scale search engine
  – No detailed public description of any engine
• But, general principles apply
Outline
• Anatomy of a search engine
• Query serving
• Link-based ranking
• Index generation
Structure of a search engine
[Diagram: components include document crawling from the Web, link structure analysis, page feature training, index building, ranker training, query serving, user behavior analysis, and auxiliary answers]
Some index statistics
• Tens of billions of documents
  – Each document contains thousands of terms
  – Plus metadata
  – Plus snippet information
• Billions of unique terms
  – Serial numbers, etc.
• Hundreds of billions of nodes in web graph
• Latency a few ms on average
  – Well under a second worst-case
Query serving pipeline
[Diagram: queries from the Web pass through front-end web servers, caches, etc. to banks of index servers]
Page relevance
• Query-dependent component
  – Query/document match, user metadata, etc.
• Query-independent component
  – Document rank, spam score, click rate, etc.
• Ranker needs:
  – Term frequencies and positions
  – Document metadata
  – Near-duplicate information
  – …
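A minimal sketch of how a ranker might blend query-dependent and query-independent signals into one score. The feature names and weights below are illustrative assumptions, not any engine's actual formula.

```python
def combined_score(qd_features, qi_features, weights):
    """Blend query-dependent and query-independent signals into one score.

    Feature names and weights are illustrative, not from any real engine.
    """
    score = 0.0
    for name, value in {**qd_features, **qi_features}.items():
        score += weights.get(name, 0.0) * value
    return score

# Example: a term-match score plus static rank, minus a spam penalty.
qd = {"bm25_body": 7.2, "bm25_anchor": 3.1}                        # query-dependent
qi = {"static_rank": 0.8, "spam_score": 0.3, "click_rate": 0.12}   # query-independent
w = {"bm25_body": 1.0, "bm25_anchor": 0.7, "static_rank": 2.0,
     "spam_score": -4.0, "click_rate": 1.5}
print(combined_score(qd, qi, w))
```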
Single-box query outline
[Diagram: per-term posting lists (term → doc.position entries, e.g. "hello" → 1.2, 45.48, 1125.3, …), per-document metadata (doc id → URL, language, …), and per-document snippet data; the ranker takes the query "Hello world" + {EN-US, …}, walks the matching posting lists, and produces the ranked results]
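A toy single-box version of the structures in the diagram, assuming an in-memory index. The data, the scoring rule, and the URLs are made up for illustration.

```python
from collections import defaultdict

# Toy in-memory index: posting lists, doc metadata, snippet data.
postings = {                       # term -> {doc_id: [positions]}
    "hello": {1: [2], 45: [48], 1125: [3]},
    "world": {7: [12], 45: [49], 1125: [4]},
}
doc_meta = {1: "example.com/a", 7: "example.com/b",
            45: "example.com/hw", 1125: "example.com/c"}
snippets = defaultdict(lambda: "…", {45: "hello world once a week …"})

def search(query):
    terms = query.lower().split()
    # Intersect posting lists: candidate docs must contain every term.
    candidates = set(postings.get(terms[0], {}))
    for t in terms[1:]:
        candidates &= set(postings.get(t, {}))
    # Trivial score: total term occurrences (a stand-in for the real ranker).
    scored = sorted(candidates,
                    key=lambda d: -sum(len(postings[t][d]) for t in terms))
    return [(doc_meta[d], snippets[d]) for d in scored]

print(search("Hello world"))       # -> results for docs 45 and 1125
```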
Query statistics
• Small number of terms (fewer than 10)
• Posting list lengths range from 1 to 100s of millions
  – Most terms occur once
• Potentially millions of documents to rank
  – Response is needed in a few ms
  – Tens of thousands of near-duplicates
  – Sorting documents by QI rank may help (see the sketch below)
• Tens or hundreds of snippets
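One way ordering by query-independent rank might help: if posting lists are stored in decreasing QI-rank order, the matcher can stop after collecting enough good candidates instead of scoring millions of documents. A hedged sketch under that assumption, not any engine's actual policy:

```python
def candidates_by_qi_rank(lists, budget=1000):
    """Each list holds doc ids already sorted by decreasing QI rank.

    Walk the shortest list in QI-rank order, keep docs present in all other
    lists, and stop once `budget` candidates are found. The best
    query-independent documents are seen first, so the cut-off is assumed
    to cost little quality; real engines combine this with other pruning.
    """
    shortest = min(lists, key=len)
    others = [set(l) for l in lists if l is not shortest]
    out = []
    for doc in shortest:
        if all(doc in s for s in others):
            out.append(doc)
            if len(out) >= budget:
                break
    return out
```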
Distributed index structure
• Tens of billions of documents
• Thousands of queries per second
• Index is constantly updated
  – Most pages turn over in at most a few weeks
  – Some very quickly (news sites)
  – Almost every page is never returned
How to distribute?
Distributed index: split by term
• Each computer stores a subset of terms
• Each query goes only to a few computers
• Document metadata stored separately
[Diagram: the query "Hello world" + {EN-US, …} is routed by the ranker to the term partitions (A-G, H-M, N-S, T-Z) holding its terms, with a separate metadata store]
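A sketch of how a front-end might map terms to index servers under this scheme. The hashing scheme and shard count are assumptions; the slide's A-G/H-M/… split is a range partition instead, but either way the front-end needs the term-to-machine map.

```python
import hashlib

NUM_TERM_SHARDS = 4

def term_shard(term, num_shards=NUM_TERM_SHARDS):
    """Map a term to the index server owning its posting list (stable hash)."""
    h = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16)
    return h % num_shards

def shards_for_query(query):
    """A multi-term query touches at most one shard per distinct term."""
    return {term: term_shard(term) for term in set(query.lower().split())}

print(shards_for_query("hello world"))   # e.g. {'hello': 2, 'world': 0}
```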
Split by term: pros
• Short queries only touch a few computers
  – With high probability all are working
• Long posting lists improve compression
  – Most words occur many times in corpus
Split by term: cons (1)
• Must ship posting lists across network
  – Multi-term queries make things worse
  – But maybe pre-computing can help?
    • Intersections of lists for common pairs of terms
    • Needs to work with constantly updating index
• Extra network roundtrip for doc metadata
  – Too expensive to store in every posting list
• Where does the ranker run?
  – Hundreds of thousands of ranks to compute
Split by term: cons (2)
• Front-ends must map terms to computers
  – Simple hashing may be too unbalanced
  – Some terms may need to be split/replicated
    • Long posting lists
    • "Hot" posting lists
• Sorting by QI rank is a global operation
  – Needs to work with index updates
Distributed index: split by document
• Each computer stores a subset of docs
• Each query goes to many computers
• Document metadata stored inline
[Diagram: the query "Hello world" + {EN-US, …} is broadcast to every document partition (docs 1-1000, 1001-2000, 2001-3000, 3001-4000); each runs a ranker locally and an aggregator merges the per-partition results]
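A scatter-gather sketch of this layout: every shard ranks its own documents and returns only its top few results, so the aggregator merges hundreds of (doc, score) pairs rather than millions of candidates. The per-shard scoring here is a toy stand-in for the real ranker.

```python
import heapq

def search_shard(shard_postings, query, k=10):
    """One document partition ranks its own docs and returns its top k.

    `shard_postings` maps term -> {doc_id: score contribution}.
    """
    scores = {}
    for term in query.lower().split():
        for doc, s in shard_postings.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + s
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

def aggregate(per_shard_results, k=10):
    """Merge the small per-partition result lists into the final top k."""
    return heapq.nlargest(k, (r for shard in per_shard_results for r in shard),
                          key=lambda kv: kv[1])

shards = [
    {"hello": {1: 1.0, 2: 0.5}, "world": {1: 0.8}},
    {"hello": {1001: 0.9}, "world": {1001: 0.9, 1002: 0.4}},
]
print(aggregate([search_shard(s, "hello world") for s in shards]))
```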
Split by document: pros
• Ranker on same computer as document
  – All data for a given doc in the same place
  – Ranker computation is distributed
• Can get low latency
• Sorting by QI rank local to each computer
• Only ranks+scores need to be aggregated
  – Hundreds of results, not millions
Split by document: cons
• A query touches hundreds of computers
  – One slow computer makes query slow
  – Computers per query is linear in corpus size
  – But query speeds are not iid
• Shorter posting lists: worse compression
  – Each word split into many posting lists
Index replication
• Multiple copies of each partition
  – Needed for redundancy, performance
• Makes things more complicated
  – Can mitigate latency variability
    • Ask two replicas, one will probably return quickly (sketched below)
  – Interacts with data layout
    • Split by document may be simpler
• Consistency may not be essential
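A sketch of the "ask two replicas" idea using hedged requests. The replica call is simulated; real systems typically hedge only after a short delay so load is not simply doubled, which this toy omits.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def query_replica(replica_id, query):
    """Stand-in for a network call to one replica of an index partition."""
    time.sleep(random.uniform(0.005, 0.05))   # simulated, variable latency
    return f"results for {query!r} from replica {replica_id}"

def hedged_query(query, replicas=(0, 1)):
    """Send the same query to two replicas and use whichever answers first."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(query_replica, r, query) for r in replicas]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False)   # don't block on the slower replica
    return result

print(hedged_query("hello world"))
```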
Splitting: word vs document
• Original Google paper split by word
• All major engines split by document now?
  – Tens of microseconds to rank a document
Link-based ranking
• Intuition: "quality" of a page is reflected somehow in the link structure of the web
• Made famous by PageRank
  – Can be seen as the stationary distribution of a random walk on the web graph
  – Google's original advantage over AltaVista?
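A textbook power-iteration sketch of PageRank as the stationary distribution of a random walk with teleportation. The graph and damping factor are illustrative; at web scale (hundreds of billions of nodes) this runs as a distributed, out-of-core computation.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration over a dict of node -> list of out-link targets."""
    nodes = list(out_links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, targets in out_links.items():
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:                      # dangling node: spread rank everywhere
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy_graph))
```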
Some hints
• PageRank is (no longer) very important
• Anchor text contains similar information
  – BM25F includes a lot of link structure
• Query-dependent link features may be useful
Comparing the Effectiveness of HITS and SALSA, M. Najork, CIKM 2007
Query-dependent link features
[Diagram: a small web-graph neighborhood with nodes labeled A-N]
Real-time QD link information
• Lookup of neighborhood graph
• Followed by SALSA
• In a few ms
Seems like a good topic for approximation/learning
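A plain power-iteration sketch of SALSA's alternating random walk on a small query neighborhood graph. The input format and iteration count are assumptions, and this is nowhere near the optimized few-millisecond implementation the slide alludes to.

```python
def salsa(neighborhood_links, iterations=30):
    """SALSA-style hub/authority scores.

    `neighborhood_links` maps hub -> set of authorities it points to.
    """
    hubs = list(neighborhood_links)
    auths = sorted({a for targets in neighborhood_links.values() for a in targets})
    in_links = {a: [h for h in hubs if a in neighborhood_links[h]] for a in auths}

    h = {u: 1.0 / len(hubs) for u in hubs}
    a = {v: 1.0 / len(auths) for v in auths}
    for _ in range(iterations):
        # Authority step: each hub splits its weight over its out-links.
        a = {v: sum(h[u] / len(neighborhood_links[u]) for u in in_links[v])
             for v in auths}
        # Hub step: each authority splits its weight over its in-links.
        h = {u: sum(a[v] / len(in_links[v]) for v in neighborhood_links[u])
             for u in hubs}
    return a   # authority scores for the candidate result pages

links = {"p": {"x", "y"}, "q": {"y"}, "r": {"y", "z"}}
print(salsa(links))
```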
Index building
• Catch-all term
  – Create inverted files
  – Compute document features
  – Compute global link-based statistics
  – Which documents to crawl next?
  – Which crawled documents to put in the index?
• Consistency may be needed here
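A single-machine sketch of the "create inverted files" step, turning crawled documents into posting lists plus a simple per-document feature. At tens of billions of documents this runs as a large distributed batch job; the tokenization and feature here are deliberately simplistic.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Turn {doc_id: text} into term -> [(doc_id, [positions]), ...]."""
    postings = defaultdict(list)
    doc_features = {}
    for doc_id, text in documents.items():
        positions = defaultdict(list)
        terms = text.lower().split()
        for pos, term in enumerate(terms):
            positions[term].append(pos)
        for term, plist in positions.items():
            postings[term].append((doc_id, plist))
        doc_features[doc_id] = {"length": len(terms)}   # a per-doc feature
    return dict(postings), doc_features

docs = {1: "hello world", 2: "hello again hello"}
inv, feats = build_inverted_file(docs)
print(inv["hello"])   # [(1, [0]), (2, [0, 2])]
print(feats)
```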
Index lifecycle
[Diagram: a cycle linking index selection, page crawling of the Web, query serving, and usage analysis]
Experimentation
• A/B testing is best
  – Ranking, UI, etc.
  – Immediate feedback on what works
  – Can be very fine-grained (millions of queries)
• Some things are very hard
  – Index selection, etc.
  – Can run parallel build processes
• Long time constants: not easy to do brute force
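A sketch of how fine-grained A/B tests can be carved off live traffic by deterministic bucketing. The hashing scheme and the 1% treatment split are illustrative assumptions, not anything the talk specifies.

```python
import hashlib

def experiment_bucket(user_id, experiment, treatment_fraction=0.01):
    """Deterministically assign a user to control or treatment for one experiment."""
    h = int(hashlib.sha1(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return "treatment" if (h % 10_000) < treatment_fraction * 10_000 else "control"

print(experiment_bucket("user-42", "new-snippet-ranker"))
```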
Implementing new features
• Document-specific features much "cheaper"
  – Spam probability, duplicate fingerprints, language
• Global features can be done, but with a higher bar
  – Distribute anchor text
  – PageRank et al.
• Danger of "butterfly effect" on system as a whole
Distributing anchor text
[Diagram: the crawler extracts anchor text and ships it to the indexer holding the target document's partition (docs f0-ff)]
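A sketch of the data movement in the diagram: anchor text found on a crawled page must be routed to the index partition that owns the *target* document, not the source. The partitioning by URL hash and the 256-partition count are assumptions for illustration.

```python
import hashlib
from collections import defaultdict

NUM_DOC_PARTITIONS = 256   # illustrative; the slide's "docs f0-ff" hints at 256

def doc_partition(url, num_partitions=NUM_DOC_PARTITIONS):
    """Pick the index partition that owns a document, here by hashing its URL."""
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % num_partitions

def route_anchor_text(crawled_pages):
    """Group (target_url, anchor_text) pairs by the target document's partition."""
    outgoing = defaultdict(list)
    for _source_url, links in crawled_pages.items():
        for target_url, anchor in links:
            outgoing[doc_partition(target_url)].append((target_url, anchor))
    return outgoing

pages = {"a.example/p": [("b.example/q", "great q page"),
                         ("c.example/r", "see also r")]}
print(route_anchor_text(pages))
```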
Distributed infrastructure
• Things are improving
  – Large-scale partitioned file systems
    • Files commonly contain many TB of data
    • Accessed in parallel
  – Large-scale data-mining platforms
  – General-purpose data repositories
• Data-centric
  – Traditional supercomputing is cycle-centric
Software engineering
• Simple always wins
• Hysteresis
  – Prove a change will improve things
    • Big improvement needed to justify big change
  – Experimental platforms are essential
Summary
• Search engines are big and complicated
• Some things are easier to change than others
• Harder changes need more convincing experiments
• Small datasets are not good predictors for large datasets
• Systems/learning may need to collaborate