
Google and Scalable Query Services Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems April 6, 2005

Administrivia
§ Please send me an email updating your project status
§ Next readings and summaries:
  § Monday: Berners-Lee paper (very short, fluffy)
  § Wednesday: first two sections of the Piazza paper
  § For both: summarize the goals, key ideas, and challenges
§ Reduced reading so you can work on the project!

Today’s Trivia Question

Google Architecture [Brin/Page 98]
§ Focus was on scalability to the size of the Web
§ First to really exploit link analysis
§ Started as an academic project at Stanford; became a startup
§ Our discussion will be on early Google; today they keep things secret!

Google’s Focus
§ Commodity, cheap hardware
  § Unreliable
  § Not very powerful
  § A fair amount of memory, reasonable hard disks
§ Lots of racks
  § Special air conditioning, power systems, big net pipes
§ Special queries
§ Partitioning of service between “two” versions:
  § The version being crawled and fleshed out
  § The version being searched
  § (Really, different pieces can be crawled & updated at different times)

What Does Google Need to Do?
§ Scalable crawling of documents
§ Archival of documents (“cache”)
§ Inverted indexing
§ Duplicate removal
§ Ranking, which requires iteration over link structure
  § PageRank
  § TF/IDF
  § Heuristics
§ Do the new Google services change any of that?
  § Some may not need the crawler, e.g., maps, perhaps Froogle

The Heart of Google Storage
The main database: Repository
§ Basically, a warehouse of every HTML page (this is the cached page entry), compressed with zlib
§ Useful for doing additional processing, any necessary rebuilds
§ Repository entry format:
  § [DocID][ECode][UrlLen][PageLen][Url][Page]
§ The repository is indexed (not inverted here)
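
A minimal sketch of how a repository entry with this layout could be written and read back. The field widths, byte order, and the names `HEADER`, `pack_entry`, and `unpack_entry` are assumptions for illustration; the slides give only the field order and the use of zlib.

```python
import struct
import zlib

# Assumed header: DocID (4 bytes), ECode (1), UrlLen (2), PageLen (4), little-endian.
HEADER = struct.Struct("<IBHI")

def pack_entry(doc_id: int, ecode: int, url: str, html: str) -> bytes:
    """Serialize one entry as [DocID][ECode][UrlLen][PageLen][Url][Page]."""
    url_bytes = url.encode("utf-8")
    page_bytes = zlib.compress(html.encode("utf-8"))  # page is stored compressed
    header = HEADER.pack(doc_id, ecode, len(url_bytes), len(page_bytes))
    return header + url_bytes + page_bytes

def unpack_entry(buf: bytes, offset: int = 0):
    """Read the entry at `offset`; also return the offset of the next entry."""
    doc_id, ecode, url_len, page_len = HEADER.unpack_from(buf, offset)
    url_start = offset + HEADER.size
    page_start = url_start + url_len
    url = buf[url_start:page_start].decode("utf-8")
    page = zlib.decompress(buf[page_start:page_start + page_len]).decode("utf-8")
    return doc_id, ecode, url, page, page_start + page_len
```

Because each entry carries its own lengths, entries can be appended back to back and scanned sequentially, which is what makes rebuilds from the repository straightforward.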

Repository Index
§ One index for looking up documents by DocID
  § Done in ISAM (think of this as a B+ tree without smart re-balancing)
  § Index points to repository entries (or to URL entry if not crawled)
§ One index for mapping URL to DocID
  § Sorted by checksum of URL
  § Compute checksum of URL, then binary-search by checksum
  § Allows update by merge with another similar file
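
The URL-to-DocID index can be sketched as a sorted list of (checksum, DocID) pairs. CRC32 stands in for the checksum here, since the slides do not say which one was used, and the function names are hypothetical. Checksum collisions are ignored in this sketch.

```python
import bisect
import heapq
import zlib

def url_checksum(url: str) -> int:
    # Assumption: CRC32 as a stand-in checksum.
    return zlib.crc32(url.encode("utf-8"))

def build_index(url_to_docid: dict) -> list:
    """Return (checksum, doc_id) pairs sorted by checksum."""
    return sorted((url_checksum(u), d) for u, d in url_to_docid.items())

def lookup(index: list, url: str):
    """Compute the checksum, then binary-search the sorted index for it."""
    key = url_checksum(url)
    i = bisect.bisect_left(index, (key,))  # first pair with checksum >= key
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return None

def merge_indexes(a: list, b: list) -> list:
    """Update by merging two sorted index files in one linear pass."""
    return list(heapq.merge(a, b))
```

Keeping the file sorted means a batch of newly crawled URLs can be sorted separately and merged in sequentially, with no random-access updates.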

Lexicon
§ The list of searchable words
  § (Presumably, today it’s used to suggest alternative words as well)
§ The “root” of the inverted index
§ As of 1998, 14 million “words”
  § Kept in memory (was 256 MB)
§ Two parts:
  § Hash table of pointers to words and the “barrels” (partitions) they fall into
  § List of words (null-separated)
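
A rough sketch of the two-part layout: one null-separated byte string of words, plus a hash table whose entries point into it. The class name, the use of CRC32 as the hash, and assigning barrels by fixed-size WordID ranges are all assumptions made for illustration.

```python
import zlib

class Lexicon:
    def __init__(self, words, words_per_barrel=2):
        self.words_per_barrel = words_per_barrel
        self.blob = bytearray()   # part 2: null-separated list of words
        self.buckets = {}         # part 1: hash -> [(offset into blob, word_id)]
        for word_id, word in enumerate(sorted(set(words))):
            data = word.encode("utf-8")
            bucket = self.buckets.setdefault(zlib.crc32(data), [])
            bucket.append((len(self.blob), word_id))
            self.blob += data + b"\0"

    def _word_at(self, offset: int) -> bytes:
        end = self.blob.index(b"\0", offset)
        return bytes(self.blob[offset:end])

    def lookup(self, word: str):
        """Return (word_id, barrel) for a known word, else None."""
        data = word.encode("utf-8")
        for offset, word_id in self.buckets.get(zlib.crc32(data), []):
            if self._word_at(offset) == data:  # follow the pointer to confirm
                return word_id, word_id // self.words_per_barrel
        return None
```

Storing the words once in a flat buffer and keeping only offsets in the table is one way a 14-million-word lexicon could fit in a few hundred megabytes of memory.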

Indices: Inverted and “Forward”
§ Inverted index divided into “barrels” (partitions by range)
  § Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document
§ Forward index uses the same barrels
  § Used to find multi-word queries with words in same barrel
  § Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs
§ Two barrels: short (anchor and title); full (all text)
(original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm)
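
To make the relationship between the two indices concrete, here is a small sketch that builds a forward index partitioned into barrels and then inverts it. Hits are reduced to plain word positions, barrels are assigned by fixed-size WordID ranges, and the function names are hypothetical.

```python
from collections import defaultdict

def build_forward(docs: dict, lexicon: dict, barrel_size: int):
    """forward[barrel][doc_id][word_id] -> hit list (word positions here)."""
    forward = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            if word in lexicon:
                word_id = lexicon[word]
                barrel = word_id // barrel_size  # partition by WordID range
                forward[barrel][doc_id][word_id].append(position)
    return forward

def invert(forward):
    """inverted[barrel][word_id][doc_id] -> hit list, DocIDs in sorted order."""
    inverted = defaultdict(lambda: defaultdict(dict))
    for barrel, per_doc in forward.items():
        for doc_id in sorted(per_doc):
            for word_id, hits in per_doc[doc_id].items():
                inverted[barrel][word_id][doc_id] = hits
    return inverted
```

Since both indices share the same barrels, inverting is a per-barrel regrouping of the same hit lists, keyed by WordID instead of DocID.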

Hit Lists (Not Mafia-Related)
§ Used in inverted and forward indices
§ Goal was to minimize the size: the bulk of data is in hit entries
§ For 1998 version, made it down to 2 bytes per hit (though that’s likely climbed since then):

  Plain:   cap: 1 | font: 3 | position: 12
  Fancy:   cap: 1 | font: 7 | type: 4 | position: 8
  Anchor:  cap: 1 | font: 7 | type: 4 | hash: 4 | pos: 4   (fancy hit special-cased for anchors)
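
A sketch of packing hits into 16 bits. It reads the table as bit widths, with the 3-bit font field set to the reserved value 7 to flag a fancy hit. The bit order, the clamping of large positions, and the function names are assumptions, and the anchor variant (which splits the low 8 bits into hash and position) is left out.

```python
FANCY_FLAG = 7  # font value assumed to be reserved to mark a fancy hit

def pack_plain(cap: bool, font: int, position: int) -> int:
    """cap: 1 bit, font: 3 bits, position: 12 bits."""
    if not 0 <= font < FANCY_FLAG:
        raise ValueError("font must be 0..6; 7 marks a fancy hit")
    position = min(position, 0xFFF)  # assumption: clamp positions that overflow
    return (int(cap) << 15) | (font << 12) | position

def pack_fancy(cap: bool, hit_type: int, position: int) -> int:
    """cap: 1 bit, font field = 7, type: 4 bits, position: 8 bits."""
    if not 0 <= hit_type <= 0xF:
        raise ValueError("type must fit in 4 bits")
    position = min(position, 0xFF)
    return (int(cap) << 15) | (FANCY_FLAG << 12) | (hit_type << 8) | position

def unpack(hit: int):
    """Decode a 16-bit hit into a tagged tuple."""
    cap = bool(hit >> 15)
    font = (hit >> 12) & 0x7
    if font != FANCY_FLAG:
        return ("plain", cap, font, hit & 0xFFF)
    return ("fancy", cap, (hit >> 8) & 0xF, hit & 0xFF)
```

Every hit fits in an unsigned 16-bit integer, so a hit list is just a packed array of 2-byte values.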

Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to start of doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, goto step 4
8. Sort the documents by rank; return the top K
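
The eight steps can be sketched as a loop over the two barrels with a sorted-list intersection in the middle. This is a simplified illustration: doclists are plain sorted lists of DocIDs, the barrels and lexicon are in-memory dicts, and `rank` is a caller-supplied scoring function, all of which are assumptions.

```python
import heapq

def intersect(doclists):
    """Yield DocIDs that appear in every sorted doclist (steps 4 and 7)."""
    if not doclists:
        return
    pos = [0] * len(doclists)
    while all(p < len(d) for p, d in zip(pos, doclists)):
        current = [d[p] for p, d in zip(pos, doclists)]
        target = max(current)
        if all(c == target for c in current):
            yield target
            pos = [p + 1 for p in pos]
        else:
            # Advance only the lists that are behind the largest current DocID.
            pos = [p + (d[p] < target) for p, d in zip(pos, doclists)]

def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    words = query.lower().split()                         # step 1
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]                # step 2
    scored = {}
    for barrel in (short_barrel, full_barrel):            # step 6
        doclists = [barrel.get(w, []) for w in word_ids]  # step 3
        for doc_id in intersect(doclists):
            if doc_id not in scored:
                scored[doc_id] = rank(doc_id, word_ids)   # step 5
        if len(scored) >= k:                              # "unless we have enough"
            break
    return heapq.nlargest(k, scored, key=scored.get)      # step 8
```

Checking the short (anchor and title) barrel first means that strong matches can satisfy the query without scanning the much larger full-text doclists.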

Ranking in Google
§ Considers many types of information:
  § Position, font size, capitalization
  § Anchor text
  § PageRank
    § Done offline, in a non-query-sensitive way
  § Count of occurrences (basically, TF) in a way that tapers off
  § Multi-word queries consider proximity also
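
Because PageRank depends only on the link graph, it can be computed offline by repeated passes over the links. The following is a textbook-style power iteration, not Google's production code. The damping factor of 0.85, the fixed iteration count, and the even spreading of rank from pages with no outlinks are assumptions of this sketch.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank
```

The result is one query-independent score per page, which can then be combined at query time with the hit-based signals listed above.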

Why Isn’t Google Based on a DBMS?
§ Transactional locking is not necessary
  § Helps with partitioning and replication
§ Main memory indexing on lexicon
§ Unusual query model: what’s special here?
§ Weird consistency model!
  § OK if different users see different views
  § As long as we route same user to same machine(s), we’re OK
§ Updates are happening in a separate “instance”
  § Slipstream it in place
  § Can even extend this to change versions of software on the machines, as long as interfaces stay the same

Could We Change a DBMS?
§ What would a DBMS for Google-like environments look like?
§ What would it be useful for, other than Google?

Beyond Google
§ What if we wanted to:
  § Add on-the-fly query capabilities to Google?
    § e.g., query over up-to-the-second stock market results
  § Use WordNet or some thesaurus to supplement Google?
  § Do PageRank in a topic-specific way?
  § Supplement Google with “ontology” info?
  § Do some sort of XML path matching along with keywords?
  § Allow for OLAP-style analysis?
  § Do a cooperative, e.g., P2P, Google?
§ Benefits of this?