The Anatomy Of A Large Scale Search Engine
![The Anatomy Of A Large Scale Search Engine Based on a paper by Sergey The Anatomy Of A Large Scale Search Engine Based on a paper by: Sergey](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-1.jpg)
The Anatomy Of A Large Scale Search Engine

Based on a paper by: Sergey Brin & Lawrence Page, Computer Science Department, Stanford University, submitted to WWW7 (1997)

Lecture by: Tal Blum for the SDBI seminar

The Anatomy Of A Large Scale Hypertextual Web Search Engine
![Index Introduction Design Goals System Features Related Work System Anatomy Results Index • • • Introduction Design Goals System Features Related Work System Anatomy Results](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-2.jpg)
Index
- Introduction
- Design Goals
- System Features
- Related Work
- System Anatomy
- Results & Performance
- Conclusions
- Future Work
- References

10/11/2020 The Anatomy Of A Large Scale Hypertextual Web Search Engine
![What is Google Largescale search engine makes extensive use in hypertext What is Google? • Large-scale search engine – makes extensive use in hypertext –](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-3.jpg)
What is Google?
- Large-scale search engine
  - makes extensive use of hypertext
  - designed to crawl & index the web efficiently
  - produces more satisfying results than existing systems
  - prototype at http://google.stanford.edu or http://www.google.com
  - googol = 10^100
![Why talk about google To engineer a SE is a challenging task Why talk about google? • To engineer a SE is a challenging task –](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-4.jpg)
Why talk about Google?
- Engineering a search engine (SE) is a challenging task
  - millions of pages, terms, queries
- little academic research
- SEs today are not what they were 5 years ago
- the first detailed public description of an SE
- better results using hypertext
- uncontrolled hypertext collections
![The web IR challenge 2 main ways for surfing high quality The web - IR challenge • 2 main ways for surfing: – high quality](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-5.jpg)
The web - IR challenge
- 2 main ways for surfing:
  - high quality human maintained lists (Yahoo)
    - too slow to improve
    - cannot cover esoteric topics
    - expensive to build and maintain
  - search engines (Google, Altavista)
    - search by keywords
    - too many low quality matches
    - people try to mislead automated search engines
![Web Growth 10112020 The Anatomy Of A Large Scale Hypertextual Web Search Engine 6 Web Growth 10/11/2020 The Anatomy Of A Large Scale Hypertextual Web Search Engine 6](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-6.jpg)
Web Growth
![Web Search Engine ScalingUp 1994 2000 First SE WWWW 1994 had an index Web Search Engine Scaling-Up 1994 -2000 • First SE WWWW (1994) had an index](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-7.jpg)
Web Search Engine Scaling-Up 1994-2000
- The first SE, WWWW (1994), had an index of 110,000 web pages and received about 1,500 queries per day
- November 1997: indexes of 2 to 100 million web pages; about 20 million queries per day (Altavista)
- By 2000, SEs were expected to index a billion web pages and handle hundreds of millions of queries per day
![Web Search Engine ScalingUp 1999 Challenges in Creating a Search Engine which scales Web Search Engine Scaling-Up 1999 • Challenges in Creating a Search Engine which scales](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-8.jpg)
Web Search Engine Scaling-Up 1999
- Challenges in creating a search engine which scales even to today's web
  - Fast crawling technology
    - gather documents, keep them up to date
  - Efficient storage space
    - indices, optionally the documents
  - Handle queries quickly
    - at a rate of thousands per second
![Google Scaling with the web Improved Hardware Performance exceptions disk seek time Google: Scaling with the web • Improved Hardware Performance – exceptions disk seek time,](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-9.jpg)
Google: Scaling with the web
- Hardware performance keeps improving
  - notable exceptions: disk seek time and operating system robustness
- Google is designed to scale well to extremely large data sets
- Google's data structures are optimized for fast & efficient access
- Google is a centralized SE
![Design Goals Improved Search Quality Junk Results Number of documents has Design Goals • Improved Search Quality – Junk Results • Number of documents has](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-10.jpg)
Design Goals
- Improved Search Quality
  - Junk results
    - The number of documents has increased by many orders of magnitude
    - The user's ability to look at documents has not
    - As the collection size grows we need tools with very high precision, even at the expense of recall
  - Use of hypertextual information
    - In Google: link structure & anchor text
![Design Goals 2 Academic Search Engine Research SE has migrated from academic Design Goals (2) • Academic Search Engine Research – SE has migrated from academic](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-11.jpg)
Design Goals (2)
- Academic Search Engine Research
  - SE development has migrated from the academic domain to the commercial one
    - SE technology has become mostly a black art and is advertising oriented
  - Gather information on how people actually use the system
    - such usage data is considered commercially valuable
  - Support novel research activities on large-scale web data
![System Features Page Rank Bringing order to the web most web SE System Features • Page. Rank: Bringing order to the web – most web SE](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-12.jpg)
System Features
- PageRank: Bringing order to the web
  - most web SEs have largely ignored the link graph
  - 518 million hyperlinks
  - corresponds well with people's idea of importance
  - PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where T1…Tn are the pages linking to A, C(T) is the number of links going out of page T, and d is a damping factor between 0 and 1
  - differences from traditional methods
    - links from different pages are not counted equally
    - each contribution is normalized by the number of links on the linking page
    - different from HITS of Kleinberg
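
The PageRank formula can be evaluated by simple iteration. The following is a minimal sketch, not the authors' implementation: it uses the form from the original paper, PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), with the damping factor d = 0.85 that the paper reports as its usual setting. The function name `pagerank`, the toy graph, and the fixed iteration count are assumptions made for illustration, and each page is assumed to list a target at most once.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to
    (hypothetical input format; each target listed at most once).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page T contributes PR(T) divided by its out-link count C(T).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Toy web of three pages: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph C is linked from both A and B, so it ends up with a higher rank than B, which only receives half of A's contribution.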
![System Features 2 Anchor Text Associate link text with the page it System Features (2) • Anchor Text – Associate link text with the page it](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-13.jpg)
System Features (2)
- Anchor Text
  - Associate link text with the page it points to
  - advantages
    - anchors often provide a more accurate description of a page
    - anchors can exist for documents that can't be indexed
      - images, programs, databases, mp3, non-text docs, e-mails
    - can return web pages that haven't been crawled
  - was first used in the WWW Worm (1994)
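
As a rough illustration of the anchor text idea, the sketch below indexes the words of each link under the URL the link points to, rather than under the page containing the link. The function name `index_anchor_text`, the tuple layout, and the example URLs are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Map each anchor word to the URLs that links containing it point to.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples
    (an assumed input format for this sketch).
    """
    anchor_index = defaultdict(list)
    for _source_url, target_url, anchor_text in links:
        for word in anchor_text.lower().split():
            # The word is credited to the link target, not the linking page.
            anchor_index[word].append(target_url)
    return anchor_index


example_links = [
    ("http://a.example/", "http://b.example/logo.gif", "Company logo"),
    ("http://a.example/", "http://c.example/", "search engine paper"),
]
anchor_index = index_anchor_text(example_links)
```

Here the image `logo.gif` becomes findable under "logo" even though it contains no text and was never fetched.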
![System Features 3 Other Features Location Information Use of proximity in System Features (3) • Other Features – Location Information • Use of proximity in](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-14.jpg)
System Features (3)
- Other Features
  - Location information
    - use of proximity in search
  - Visual presentation information
    - relative font size
  - Full raw HTML is available
    - users can view a cached version of the page
    - users can view the page as it was when indexed
    - can be used for research
![Related Work SE have short history wwww 1994 commercial services closely guard Related Work • SE have short history (wwww 1994) • commercial services closely guard](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-15.jpg)
Related Work
- Search engines have a short history (WWWW, 1994)
- commercial services closely guard the details of their databases
- work on specialized features of SEs
  - especially on post-processing the results of SEs
- work on Information Retrieval systems
  - especially in well controlled environments
![IR Differences Between the Web and Well Controlled Collections TREC 96s Very Large IR &Differences Between the Web and Well Controlled Collections • “TREC 96”s “Very Large](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-16.jpg)
IR & Differences Between the Web and Well Controlled Collections
- TREC 96's "Very Large Corpus" is only 20 GB, compared to the 147 GB of the Google crawl
- The Web is a vast collection of heterogeneous documents
  - language, vocabulary, format
- things that work well for TREC often do not produce good results on the web
- there is no control over what people put on the web
![System Anatomy High Level Overview 10112020 The Anatomy Of A Large Scale Hypertextual System Anatomy • High Level Overview 10/11/2020 The Anatomy Of A Large Scale Hypertextual](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-17.jpg)
System Anatomy
- High Level Overview
![Major Data Structures Big Files virtual files spanning multiple file systems Major Data Structures • Big Files – virtual files spanning multiple file systems –](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-18.jpg)
Major Data Structures
- Big Files
  - virtual files spanning multiple file systems
  - addressable by 64 bit integers
  - handles allocation & deallocation of file descriptors, since the operating system does not provide enough
  - supports rudimentary compression
![Major Data Structures 2 Repository tradeoff between speed compression ratio Major Data Structures (2) • Repository – tradeoff between speed & compression ratio –](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-19.jpg)
Major Data Structures (2)
- Repository
  - tradeoff between speed & compression ratio
  - chose zlib (3 to 1) over bzip (4 to 1)
  - requires no other data structure to access it
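
To make the repository idea concrete, here is a small sketch of self-describing, zlib-compressed records that can be read back sequentially without any other data structure. The exact field layout (64-bit docID, URL length, payload length) and the function names `pack_record` and `unpack_record` are assumptions for illustration; the slides do not specify the on-disk format.

```python
import struct
import zlib

# Assumed header layout: docID (8 bytes), URL length (2 bytes), payload length (4 bytes).
HEADER = "<QHI"
HEADER_SIZE = struct.calcsize(HEADER)


def pack_record(doc_id, url, html):
    """Serialize one document as header + URL + zlib-compressed page."""
    payload = zlib.compress(html.encode("utf-8"))
    url_bytes = url.encode("utf-8")
    return struct.pack(HEADER, doc_id, len(url_bytes), len(payload)) + url_bytes + payload


def unpack_record(buf, offset=0):
    """Read one record starting at `offset`; also return the next record's offset."""
    doc_id, url_len, payload_len = struct.unpack_from(HEADER, buf, offset)
    url_start = offset + HEADER_SIZE
    payload_start = url_start + url_len
    url = buf[url_start:payload_start].decode("utf-8")
    html = zlib.decompress(buf[payload_start:payload_start + payload_len]).decode("utf-8")
    return doc_id, url, html, payload_start + payload_len


page = "<p>hello web</p>" * 1000
repository = pack_record(1, "http://a.example/", page) + pack_record(2, "http://b.example/", "<b>hi</b>")
first = unpack_record(repository)
second = unpack_record(repository, first[3])
```

Because every record carries its own lengths, the repository can be scanned from the start to rebuild all other structures.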
![Major Data Structures 3 Document Index keeps information about each document Major Data Structures (3) • Document Index – keeps information about each document –](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-20.jpg)
Major Data Structures (3)
- Document Index
  - keeps information about each document
  - fixed width ISAM (index sequential access mode) index
  - includes various statistics
    - pointer to repository, if crawled, pointer to info lists
  - compact data structure
  - we can fetch a record in 1 disk seek during search
![Major Data Structures 4 URLs doc ID file used to convert Major Data Structures (4) • URL’s - doc. ID file – used to convert](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-21.jpg)
Major Data Structures (4)
- URL-to-docID file
  - used to convert URLs to docIDs
  - list of URL checksums with their docIDs
  - sorted by checksum
  - given a URL, a binary search is performed
  - conversion is done in batch mode
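
The checksum lookup described above can be sketched with a sorted list and a binary search. CRC32 is used here purely as a stand-in, since the slides do not say which checksum function is used; the function names and the example URLs are likewise hypothetical, and checksum collisions are ignored.

```python
import bisect
import zlib


def url_checksum(url):
    """Stand-in checksum (CRC32); the real checksum function is not specified."""
    return zlib.crc32(url.encode("utf-8"))


def build_table(url_to_doc_id):
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(url), doc_id) for url, doc_id in url_to_doc_id.items())


def lookup(table, url):
    """Binary-search the sorted table for the URL's checksum."""
    target = url_checksum(url)
    # (target,) sorts before any (target, doc_id), so this finds the first match.
    index = bisect.bisect_left(table, (target,))
    if index < len(table) and table[index][0] == target:
        return table[index][1]
    return None


table = build_table({
    "http://a.example/": 1,
    "http://b.example/": 2,
    "http://c.example/": 3,
})
doc_id = lookup(table, "http://b.example/")
```

In batch mode, many URLs would be checksummed, sorted, and merged against this table in one pass instead of being looked up one at a time.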
![Major Data Structures 4 Lexicon can fit in memory for reasonable price Major Data Structures (4) • Lexicon – can fit in memory for reasonable price](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-22.jpg)
Major Data Structures (4)
- Lexicon
  - can fit in memory for a reasonable price
    - currently 256 MB
    - contains 14 million words
  - 2 parts
    - a list of words
    - a hash table
![Major Data Structures 4 Hit Lists includes position font capitalization Major Data Structures (4) • Hit Lists – includes position font & capitalization –](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-23.jpg)
Major Data Structures (4)
- Hit Lists
  - include position, font & capitalization
  - account for most of the space used in the indexes
  - 3 alternatives: simple, Huffman, hand-optimized
  - hand-optimized encoding uses 2 bytes for every hit
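
A 2-byte hit can be sketched as a packed bit field. The split used below (1 capitalization bit, 3 bits of relative font size, 12 bits of word position) follows the description of a plain hit in the original paper; the bit order, the clamping of large positions to 4095, and the function names are assumptions of this sketch.

```python
import struct


def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 cap bit | 3 font bits | 12 position bits."""
    if not 0 <= font_size <= 6:
        # Value 7 is reserved in the paper as a flag for other hit types.
        raise ValueError("font_size must be between 0 and 6")
    position = min(position, 4095)  # positions beyond the 12-bit range are clamped
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(True, 3, 42)
encoded = struct.pack("<H", hit)  # exactly two bytes on disk
decoded = unpack_plain_hit(hit)
```

Packing by hand like this trades a little bit manipulation for much less space than a naive record per hit.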
![Major Data Structures 4 Hit Lists 2 10112020 The Anatomy Of A Large Major Data Structures (4) • Hit Lists (2) 10/11/2020 The Anatomy Of A Large](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-24.jpg)
Major Data Structures (4)
- Hit Lists (2)
![Major Data Structures 5 Forward Index partially ordered used 64 Barrels Major Data Structures (5) • Forward Index – partially ordered – used 64 Barrels](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-25.jpg)
Major Data Structures (5)
- Forward Index
  - partially ordered
  - uses 64 Barrels
  - each Barrel holds a range of wordIDs
  - requires slightly more storage
  - each wordID is stored as a relative difference from the minimum wordID of the Barrel
  - saves considerable time in the sorting
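
The relative wordID scheme can be illustrated as follows. This sketch assumes that the 64 barrels split a 14-million-word lexicon into equal-width ranges; the slides only say that each barrel holds a range of wordIDs, so the equal split, the constants, and the function names are hypothetical.

```python
NUM_BARRELS = 64
LEXICON_SIZE = 14_000_000  # lexicon size quoted in the slides
BARREL_WIDTH = -(-LEXICON_SIZE // NUM_BARRELS)  # ceiling division: wordIDs per barrel


def to_barrel(word_id):
    """Return (barrel number, wordID relative to the barrel's minimum wordID)."""
    barrel = word_id // BARREL_WIDTH
    return barrel, word_id - barrel * BARREL_WIDTH


def from_barrel(barrel, relative_id):
    """Recover the absolute wordID from its barrel and relative value."""
    return barrel * BARREL_WIDTH + relative_id


barrel, relative_id = to_barrel(13_999_999)
```

Storing only the offset within the barrel keeps each stored wordID small, which leaves more bits for the rest of the entry.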
![Major Data Structures 6 Inverted Index 64 Barrels same as the Forward Major Data Structures (6) • Inverted Index – 64 Barrels (same as the Forward](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-26.jpg)
Major Data Structures (6)
- Inverted Index
  - 64 Barrels (same as the Forward Index)
  - for each wordID the Lexicon contains a pointer to the Barrel that wordID falls into
  - the pointer points to a doclist of docIDs together with their hit lists
  - the order of the docIDs is important
    - by docID or by the ranking of the word's occurrence in each document
  - Google chose a compromise between the two
![Major Data Structures 7 Crawling the Web fast distributed crawling system Major Data Structures (7) • Crawling the Web – fast distributed crawling system –](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-27.jpg)
Major Data Structures (7)
- Crawling the Web
  - fast distributed crawling system
  - URLserver & Crawlers are implemented in Python
  - each Crawler keeps about 300 connections open
  - at peak the rate is about 100 pages, or 600 K, per second
  - uses an internal cached DNS lookup
  - asynchronous IO to handle events
  - a number of queues
  - robust & carefully tested
![Major Data Structures 8 Indexing the Web Parsing should know to Major Data Structures (8) • Indexing the Web – Parsing • should know to](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-28.jpg)
Major Data Structures (8)
- Indexing the Web
  - Parsing
    - must be able to handle errors
      - HTML typos
      - kilobytes of zeros in the middle of a tag
      - non-ASCII characters
      - HTML tags nested hundreds deep
    - Developed their own parser
      - involved a fair amount of work
      - did not cause a bottleneck
![Major Data Structures 9 Indexing Documents into Barrels turning words into word Major Data Structures (9) • Indexing Documents into Barrels – turning words into word.](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-29.jpg)
Major Data Structures (9)
- Indexing Documents into Barrels
  - turning words into wordIDs
  - in-memory hash table: the Lexicon
  - new additions are logged to a file
  - parallelization
    - shared lexicon of 14 million words
    - log of all the extra words
![Major Data Structures 10 Indexing the Web Sorting creating the inverted Major Data Structures (10) • Indexing the Web – Sorting • creating the inverted](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-30.jpg)
Major Data Structures (10)
- Indexing the Web
  - Sorting
    - creates the inverted index
    - produces two types of barrels
      - for titles and anchors
      - for full text
    - sorts every barrel separately
    - sorters run in parallel
    - the sorting is done in main memory
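
The sorting step can be pictured as re-ordering a forward barrel by wordID. The sketch below is only an in-memory illustration under assumed data shapes: a forward barrel is taken to be a list of (docID, wordID, hits) entries, and `invert_barrel` is a hypothetical name.

```python
from collections import defaultdict


def invert_barrel(forward_barrel):
    """Turn (doc_id, word_id, hits) entries into word_id -> [(doc_id, hits), ...]."""
    inverted = defaultdict(list)
    # Sorting by wordID, then docID, groups each word's postings in docID order.
    for doc_id, word_id, hits in sorted(forward_barrel, key=lambda entry: (entry[1], entry[0])):
        inverted[word_id].append((doc_id, hits))
    return dict(inverted)


forward_barrel = [
    (2, 7, [5]),
    (1, 7, [1, 9]),
    (1, 3, [4]),
]
inverted_barrel = invert_barrel(forward_barrel)
```

Since each barrel covers its own range of wordIDs, barrels can be sorted independently, which is what allows several sorters to run in parallel.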
![Searching Algorithm 5 Compute the rank of that 1 Parse Searching • Algorithm – – –. 5 Compute the rank of that 1. Parse](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-31.jpg)
Searching
- Algorithm
  1. Parse the query
  2. Convert words into wordIDs
  3. Seek to the start of the doclist in the short barrel for every word
  4. Scan through the doclists until there is a document that matches all of the search terms
  5. Compute the rank of that document
  6. If we're at the end of the short barrels, start at the doclists of the full barrel, unless we have enough results
  7. If we're not at the end of any doclist, go to step 4
  8. Sort the documents by rank and return the top K
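
A simplified sketch of the query loop is given below. It replaces the step-by-step doclist scan with a set intersection, which finds the same matching documents but is not how a disk-based scan would be written. The barrel representation (wordID mapped to a collection of docIDs), the `rank` callable, and the `enough` threshold are assumptions for illustration.

```python
import heapq


def matching_docs(barrel, query_word_ids):
    """Return the docIDs that appear in the doclist of every query word."""
    doclists = [set(barrel.get(word_id, ())) for word_id in query_word_ids]
    if not doclists:
        return set()
    return set.intersection(*doclists)


def search(query_word_ids, short_barrel, full_barrel, rank, k=10, enough=10):
    """Try the short barrel first; fall back to the full barrel if too few matches."""
    matched = matching_docs(short_barrel, query_word_ids)
    if len(matched) < enough:
        matched |= matching_docs(full_barrel, query_word_ids)
    # Sort the matched documents by rank and keep the top k.
    return heapq.nlargest(k, matched, key=rank)


short_barrel = {1: [10, 11], 2: [11]}
full_barrel = {1: [10, 11, 12], 2: [11, 12, 13]}
scores = {11: 0.9, 12: 0.5}
top_docs = search([1, 2], short_barrel, full_barrel, lambda doc: scores.get(doc, 0.0), enough=2)
```

With `enough=2` the short barrel yields only document 11, so the full barrel is consulted and document 12 is added to the result.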
![The Ranking System The information Position Font Size Capitalization Anchor Text The Ranking System • The information – Position, Font Size, Capitalization – Anchor Text](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-32.jpg)
The Ranking System
- The information
  - Position, Font Size, Capitalization
  - Anchor Text
  - PageRank
- Hit Types
  - title, anchor, URL, etc.
  - small font, large font, etc.
![The Ranking System 2 Each Hit type has its own weight Counts The Ranking System (2) • Each Hit type has it’s own weight • Counts](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-33.jpg)
The Ranking System (2)
- Each hit type has its own weight
- Count-weights increase linearly with counts at first but quickly taper off
- Combining the type weights with the count-weights gives the IR score of the document
- The IR score is combined with PageRank to give the final rank
- For multi-word queries
  - a proximity score is computed for every set of hits, with a weight per proximity type
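
The scoring idea can be sketched for a single-word query as a weighted sum over hit types. Everything numeric below is hypothetical: the slides give neither the type weights, nor the shape of the tapering function, nor the way the IR score and PageRank are combined, so the `tanh` curve and the linear blend are stand-ins chosen only to show the structure.

```python
import math

# Hypothetical weights per hit type; the real values are not published in the slides.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "url": 3.0, "plain": 1.0}


def count_weight(count, cap=8.0):
    """Grow roughly linearly for small counts, then level off near `cap`."""
    return cap * math.tanh(count / cap)


def ir_score(hit_counts):
    """Weighted sum of tapered counts: one term per hit type."""
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(count)
               for hit_type, count in hit_counts.items())


def final_rank(hit_counts, pagerank, pagerank_weight=0.5):
    """Assumed linear blend of IR score and PageRank."""
    return (1 - pagerank_weight) * ir_score(hit_counts) + pagerank_weight * pagerank


title_score = ir_score({"title": 1})
plain_score = ir_score({"plain": 1})
```

The tapering means that repeating a word a thousand times scores almost the same as repeating it a hundred times, which limits the payoff of keyword stuffing.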
![Feedback A trusted user may optionally evaluate the results The feedback is Feedback • A trusted user may optionally evaluate the results • The feedback is](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-34.jpg)
Feedback
- A trusted user may optionally evaluate the results
- The feedback is saved
- When modifying the ranking function we can see the impact of this change on all previous searches that were ranked
![Results Produce better results than major commercial search engines for most searches Results • Produce better results than major commercial search engines for most searches •](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-35.jpg)
Results
- Produces better results than major commercial search engines for most searches
- Example: query "bill clinton"
  - returns results from whitehouse.gov
  - email addresses of the president
  - all the results are high quality pages
  - no broken links
  - no bill without clinton & no clinton without bill
![Storage Requirements Using Compression on the repository about 55 GB for all Storage Requirements • Using Compression on the repository • about 55 GB for all](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-36.jpg)
Storage Requirements
- Using compression on the repository
- about 55 GB for all the data used by the SE
- most of the queries can be answered by just the short inverted index
- with better compression, a high quality SE can fit onto a 7 GB drive of a new PC
![Storage Statistics 10112020 Web Page Statistics The Anatomy Of A Large Scale Hypertextual Web Storage Statistics 10/11/2020 Web Page Statistics The Anatomy Of A Large Scale Hypertextual Web](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-37.jpg)
Storage Statistics

Web Page Statistics
![System Performance It took 9 days to download 26 million pages System Performance • • • It took 9 days to download 26 million pages](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-38.jpg)
System Performance
- It took 9 days to download 26 million pages
- Once running smoothly, the crawler reached 48.5 pages per second
- The Indexer & Crawler ran simultaneously
- The Indexer runs at 54 pages per second
- The sorters run in parallel using 4 machines; the whole process took 24 hours
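
The two crawl figures describe different periods, which a quick calculation makes visible. The 9 days and 26 million pages come from the slide; the figures of 11 million pages in 63 hours for the final, smooth-running stretch are taken from the original paper and are not on the slide. The variable names are illustrative.

```python
SECONDS_PER_HOUR = 3600

# Whole crawl: 26 million pages in 9 days (from the slide).
overall_rate = 26_000_000 / (9 * 24 * SECONDS_PER_HOUR)

# Final stretch: 11 million pages in 63 hours (from the original paper).
steady_rate = 11_000_000 / (63 * SECONDS_PER_HOUR)
```

The overall average works out to roughly 33 pages per second, while the steady-state stretch gives the quoted 48.5 pages per second.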
![Conclusions Scalable Search Engine High Quality Search Results Search techniques Conclusions • Scalable Search Engine • High Quality Search Results • Search techniques –](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-39.jpg)
Conclusions
- Scalable Search Engine
- High Quality Search Results
- Search techniques
  - PageRank
  - Anchor Text
  - Proximity Information
- A Complete Architecture
![Future Work Improve search efficiency Scale to 100 million Boolean Operators Future Work • • • Improve search efficiency Scale to 100 million Boolean Operators](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-40.jpg)
Future Work
- Improve search efficiency
- Scale to 100 million web pages
- Boolean operators
- Text surrounding links
- Personalization of PageRank
- Result summarization
![New Features Google Scout Documents Caching Uncle Sams Link option 10112020 The New Features • • Google Scout Documents Caching Uncle Sam’s Link: option 10/11/2020 The](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-41.jpg)
New Features
- Google Scout
- Documents Caching
- Uncle Sam’s
- "Link:" option
![The End 10112020 The Anatomy Of A Large Scale Hypertextual Web Search Engine 42 The End 10/11/2020 The Anatomy Of A Large Scale Hypertextual Web Search Engine 42](https://slidetodoc.com/presentation_image_h/7c0a064b4ab4f19b1d211bbc2f16e4d3/image-42.jpg)
The End