
Google Architecture

Overview
• Problem
• Design Goals
• Google Search Engine Features
• Google Architecture
• Scalability
• Conclusions

Problem
• Web is vast and growing exponentially
• Web is heterogeneous
  – ASCII
  – HTML
  – Images
  – Java applets
  – Etc.
• Human-maintained lists can’t keep up
• Previous search methodologies relied on keyword matching, producing low-quality matches
• Human attention is confined to ~10-1000 documents

Specific Design Goals
• Deliver results that have very high precision, even at the expense of recall
• Make search engine technology transparent, i.e. advertising shouldn’t bias results
• Bring search engine technology into the academic realm in order to support novel research activities on large web data sets
• Make the system easy to use for most people, e.g. users shouldn’t have to specify more than a couple of words

Google Search Engine Features
Two main features to increase result precision:
• Uses link structure of the web (PageRank)
• Uses text surrounding hyperlinks to improve accurate document retrieval
Other features include:
• Takes into account word proximity in documents
• Uses font size, word position, etc. to weight words
• Stores full raw HTML pages
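The link-structure feature can be illustrated with a small power-iteration sketch of PageRank. The slides give no formula or parameters, so the normalized form of the computation, the damping factor of 0.85, the handling of pages without outlinks, the function name `pagerank`, and the toy graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Assumes every link target also appears as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a base share from the random-jump term.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy graph: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph the most-linked page `c` ends up with the highest rank and the unlinked page `d` with the lowest, which is the behavior the feature relies on.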

Google Architecture

Google Architecture (cont.)
• Crawlers: Multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and ~300 connections open at once.
• URL Server: Keeps track of URLs that have been crawled and URLs that still need to be crawled.
• Store Server: Compresses and stores web pages.
• Anchors: Stores each link and the text surrounding the link.
• URL Resolver: Converts relative URLs into absolute URLs.
• Indexer: Uncompresses and parses documents. Stores link information in the anchors file.
• Repository: Contains the full HTML of every web page. Each document is prefixed by docID, length, and URL.
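The repository layout described above (each compressed page prefixed by docID, length, and URL) can be sketched roughly as follows. The exact field widths, the byte order, the extra URL-length field, and the use of Python's `zlib` module are assumptions; the slides only say that pages are compressed and prefixed.

```python
import struct
import zlib

# Hypothetical fixed-size header: docID, URL length, compressed-page length.
HEADER = struct.Struct("<QHI")


def pack_record(doc_id, url, html):
    """Serialize one page as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    compressed = zlib.compress(html.encode("utf-8"))
    header = HEADER.pack(doc_id, len(url_bytes), len(compressed))
    return header + url_bytes + compressed


def unpack_records(blob):
    """Yield (doc_id, url, html) for each record packed back to back in blob."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, page_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + page_len]).decode("utf-8")
        offset += page_len
        yield doc_id, url, html


# Usage: two pages stored one after the other, then read back in order.
repository = (
    pack_record(1, "http://example.com/", "<html>one</html>")
    + pack_record(2, "http://example.com/two", "<html>two</html>")
)
pages = list(unpack_records(repository))
```

Because every record carries its own lengths, the repository can be scanned sequentially without any other data structure.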

Google Architecture (cont.)
• URL Resolver: Maps absolute URLs into docIDs stored in the Doc Index. Stores anchor text in “barrels”. Generates a database of links (pairs of docIDs).
• Indexer: Parses documents and distributes hit lists into “barrels”.
• Barrels: Partially sorted forward indexes, sorted by docID. Each barrel stores hit lists for a given range of wordIDs.
• Lexicon: In-memory hash table that maps words to wordIDs. Contains a pointer to the doclist in the barrel that the wordID falls into.
• Sorter: Creates the inverted index, whereby the document list containing docIDs and hit lists can be retrieved given a wordID.
• Doc Index: docID-keyed index where each entry includes info such as a pointer to the doc in the repository, checksum, statistics, status, etc. Also contains URL info if the doc has been crawled; if not, it contains just the URL.
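The relationship between the lexicon, the forward index, and the inverted index can be sketched with plain dictionaries. This is a minimal in-memory illustration, not the actual on-disk barrel format: the function names, the use of word positions as hits, and the toy documents are all assumptions.

```python
from collections import defaultdict


def build_lexicon(documents):
    """Assign a wordID to every distinct word, in order of first appearance."""
    lexicon = {}
    for words in documents.values():
        for word in words:
            lexicon.setdefault(word, len(lexicon))
    return lexicon


def forward_index(documents, lexicon):
    """Map docID -> {wordID: hit list}, where a hit is a word position."""
    forward = {}
    for doc_id, words in documents.items():
        hits = defaultdict(list)
        for position, word in enumerate(words):
            hits[lexicon[word]].append(position)
        forward[doc_id] = dict(hits)
    return forward


def invert(forward):
    """Re-sort the forward index by wordID: wordID -> [(docID, hit list), ...]."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, hits in forward[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(sorted(inverted.items()))


# Toy documents keyed by docID, already split into words.
docs = {1: ["web", "search", "web"], 2: ["search", "engine"]}
lexicon = build_lexicon(docs)
inverted = invert(forward_index(docs, lexicon))
```

Given a wordID, `inverted` returns the list of documents containing that word together with their hit lists, which is the lookup the inverted index is meant to support.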

Google Architecture (cont.)
• There are 2 kinds of barrels: short barrels, which contain hit lists that include title or anchor hits, and long barrels, which contain all hit lists.
• The new lexicon keyed by wordID, the inverted doc index keyed by docID, and PageRanks are used to answer queries.
• The list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create the new lexicon used by the searcher.
• The lexicon stores ~14 million words.

Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
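The evaluation steps above can be sketched as follows. This is a simplified illustration: barrels are modeled as nested dictionaries (wordID -> docID -> hit list), the sequential doclist scan is replaced by a set intersection, a document found in both barrels is ranked only once from its short-barrel hits, and the names `evaluate`, `matching_docs`, and `hit_count` are hypothetical.

```python
def matching_docs(barrel, word_ids):
    """Return docIDs whose doclists in this barrel contain every query word."""
    doclists = [set(barrel.get(w, {})) for w in word_ids]
    return set.intersection(*doclists) if doclists else set()


def evaluate(query, lexicon, short_barrel, full_barrel, rank, k=10):
    # Steps 1-2: parse the query and convert words into wordIDs.
    words = query.lower().split()
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]
    scores = {}
    # Steps 3-7: go through the short barrel first, then the full barrel.
    for barrel in (short_barrel, full_barrel):
        for doc_id in matching_docs(barrel, word_ids):
            if doc_id not in scores:
                hits = [barrel[w][doc_id] for w in word_ids]
                scores[doc_id] = rank(doc_id, hits)
    # Step 8: sort by rank (ties broken by docID) and return the top k.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


def hit_count(doc_id, hits):
    """Toy rank function: total number of hits across all query words."""
    return sum(len(h) for h in hits)


lexicon = {"web": 0, "search": 1}
short_barrel = {0: {7: [0, 1]}, 1: {7: [2]}}  # title/anchor hits only
full_barrel = {0: {3: [4], 7: [0, 1, 5]}, 1: {3: [2], 7: [2]}}
results = evaluate("web search", lexicon, short_barrel, full_barrel, hit_count)
```

Here document 7 matches in the short barrel and document 3 only in the full barrel, so `results` lists 7 before 3.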

Single Word Query Ranking
• Hit list is retrieved for the single word
• Each hit can be one of several types: title, anchor, URL, large font, small font, etc.
• Each hit type is assigned its own weight
• Type-weights make up a vector of weights
• Number of hits of each type is counted to form a count vector
• Dot product of the two vectors is used to compute the IR score
• IR score is combined with PageRank to compute the final rank
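The single-word ranking can be sketched as a dot product of a count vector and a type-weight vector. The weight values, the hit-type names, and the linear combination with PageRank (parameter `alpha`) are assumptions for illustration; the slides do not give actual weights or say how the two scores are combined.

```python
# Hypothetical hit types and type-weights; the slides give no actual values.
TYPE_WEIGHTS = {
    "title": 10.0,
    "anchor": 8.0,
    "url": 6.0,
    "large_font": 3.0,
    "small_font": 1.0,
}


def ir_score(hit_types):
    """Dot product of the count vector with the type-weight vector."""
    counts = {t: 0 for t in TYPE_WEIGHTS}
    for hit_type in hit_types:
        counts[hit_type] += 1
    return sum(counts[t] * TYPE_WEIGHTS[t] for t in TYPE_WEIGHTS)


def final_rank(hit_types, pagerank, alpha=0.5):
    """Combine IR score and PageRank; the linear mix is an assumption."""
    return alpha * ir_score(hit_types) + (1.0 - alpha) * pagerank


# One title hit, one anchor hit, two small-font hits: 10 + 8 + 2 * 1 = 20.
hits = ["title", "anchor", "small_font", "small_font"]
score = ir_score(hits)
rank = final_rank(hits, pagerank=4.0)
```

With these made-up weights the example document gets an IR score of 20.0, and mixing it equally with a PageRank of 4.0 gives a final rank of 12.0.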