- Slides: 9
The Anatomy of a Large-Scale Hypertextual Web Search Engine
What we want from a search engine
• Speed
• Quantity of results
• Efficient storage space
• Quality of results
Google attempts to deliver all of these.
Precision of results: a second-generation search engine
• PageRank
• Anchor text
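PageRank
The more links point to a page from other pages, the higher its PageRank. The rank can be read as the probability that a random internet surfer reaches the page by clicking links at random. A link's weight also depends on the page it comes from: a link from a highly ranked page counts for more, and because page A's rank is divided among all of its outgoing links, the more links A has, the less each link from A to B passes on.
PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
Here T1 … Tn are the pages that link to A, C(T) is the number of links going out of page T, and d is a damping factor between 0 and 1.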
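The formula can be evaluated by repeated substitution until the values settle. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the toy link graph, the iteration count, and the choice d = 0.85 are all assumptions made for illustration, and pages with no outgoing links are not given any special handling.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical input format).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of distinct outgoing links on page T.
    out_count = {p: len(set(links.get(p, []))) for p in pages}

    # For each page, the pages T1..Tn that link to it.
    inbound = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            inbound[target].append(source)

    pr = {p: 1.0 for p in pages}  # arbitrary starting value
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in inbound[p])
            for p in pages
        }
    return pr


# Toy graph: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C ends up ranked above B: both receive half of A's rank, but C also receives all of B's.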
Anchor Text
Every link on the web carries anchor text: the text the page author attaches to the link to say what it does, where it leads, or what it explains. Google associates this text with the page the link points to. By collecting the anchor text of links from hundreds of different sites, Google can return more relevant search results.
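As a rough illustration of the idea, the sketch below gathers anchor text from several pages and files it under the URL each link points to. It uses Python's standard `html.parser`; the class `AnchorCollector`, the function `build_anchor_index`, and the example URLs are hypothetical names chosen for this example and say nothing about how Google stores anchors internally.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_anchor_index(pages):
    """Map each link target to the anchor texts other pages use for it."""
    index = defaultdict(list)
    for html in pages.values():
        parser = AnchorCollector()
        parser.feed(html)
        parser.close()
        for href, text in parser.anchors:
            if text:
                index[href].append(text)
    return dict(index)


pages = {
    "http://a.example/": '<p>See the <a href="http://c.example/">search engine paper</a>.</p>',
    "http://b.example/": '<a href="http://c.example/">Google prototype</a>',
}
print(build_anchor_index(pages))
```

The target page is described by words it may never contain itself, which is why anchor text from many sources helps relevance.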
Proximity Search and Others
Google keeps track of how close the query words are to each other in a document, and also of their visual presentation (font size, color, boldness, etc.).
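One simple way to picture proximity is to record the position of every word in a document and measure the smallest gap between two query words. The sketch below does only that; the helper names `word_positions` and `min_distance` are assumptions for this example, and it does not reproduce how Google actually scores proximity or visual presentation.

```python
def word_positions(text):
    """Map each lowercased word to the sorted list of positions where it occurs."""
    positions = {}
    for pos, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(pos)
    return positions


def min_distance(positions_a, positions_b):
    """Smallest gap between two sorted position lists, or None if either is empty."""
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance whichever list currently points at the earlier position.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


doc = "large scale web search engine for the web"
pos = word_positions(doc)
print(min_distance(pos["search"], pos["engine"]))
print(min_distance(pos["web"], pos["engine"]))
```

A document where the query words sit next to each other would then be preferred over one where they are far apart.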
Crawling and Indexing
• Google typically ran about 3 crawlers.
• Each crawler keeps roughly 300 connections open at once.
• At peak performance, with 4 crawlers, Google could crawl 100 web pages per second.
• That is roughly 600 K of data per second.
• Parsing
• Indexing documents into barrels
• Sorting
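The crawl figures above imply a few derived numbers that are easy to check. The sketch below is back-of-the-envelope arithmetic only: the function name `crawl_estimates` is hypothetical, "600 K" is assumed to mean 600,000 bytes, and the time-per-page figure assumes every connection is kept busy.

```python
def crawl_estimates(crawlers=4, connections_per_crawler=300,
                    pages_per_second=100, bytes_per_second=600_000):
    """Derive rough per-page and per-connection figures from the peak crawl rates."""
    total_connections = crawlers * connections_per_crawler
    return {
        # All connections open across the crawlers at once.
        "total_connections": total_connections,
        # Average amount of data fetched per page.
        "avg_page_bytes": bytes_per_second / pages_per_second,
        # Share of the crawl rate carried by each crawler.
        "pages_per_second_per_crawler": pages_per_second / crawlers,
        # Average time one connection spends per page if all stay busy.
        "seconds_per_page_per_connection": total_connections / pages_per_second,
    }


for name, value in crawl_estimates().items():
    print(name, value)
```

Under these assumptions the average page is about 6 K, and each open connection finishes a page roughly every 12 seconds, which suggests the crawl is limited by waiting on remote servers rather than by local processing.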