The Anatomy Of A Large Scale Search Engine

The Anatomy Of A Large Scale Search Engine
Based on a paper by: Sergey Brin & Lawrence Page, Computer Science Department, Stanford University
Submitted to WWW 7 (1997)
Lecture by: Tal Blum for the SDBI seminar
The Anatomy Of A Large Scale Hypertextual Web Search Engine

Index
• Introduction
• Design Goals
• System Features
• Related Work
• System Anatomy
• Results & Performance
• Conclusions
• Future Work
• References
10/11/2020 The Anatomy Of A Large Scale Hypertextual Web Search Engine

What is Google?
• Large-scale search engine
  – makes extensive use of the structure present in hypertext
  – designed to crawl & index the web efficiently
  – gives better results than existing systems
  – prototype at http://google.stanford.edu or http://www.google.com
  – the name plays on googol = 10^100

Why talk about Google?
• Engineering a search engine (SE) is a challenging task
  – millions of pages, terms, and queries
• Little academic research exists on large-scale SEs
• SEs today are not what they were 5 years ago
• The first detailed public description of an SE
• Better results using hypertext
• Works on uncontrolled hypertext collections

The web - IR challenge
• 2 main ways for surfing:
  – high quality human maintained lists (Yahoo)
    • too slow to improve
    • cannot cover esoteric topics
    • expensive to build and maintain
  – search engines (Google, AltaVista)
    • search by keywords
    • too many low quality matches
    • people try to mislead automated search engines

Web Growth
[Figure: web growth chart, not reproduced]

Web Search Engine Scaling-Up 1994-2000
• The first SE, WWWW (1994), had an index of 110,000 web pages and received about 1,500 queries per day
• November 1997: the top SEs index 2 to 100 million web pages; AltaVista handles about 20 million queries per day
• By 2000, an SE is expected to index over a billion web pages and handle hundreds of millions of queries per day

Web Search Engine Scaling-Up 1999
• Challenges in creating a search engine which scales even to today's web
  – Fast crawling technology
    • gather documents, keep them up to date
  – Efficient storage space
    • indices, optionally the documents
  – Handle queries quickly
    • rate of thousands per second

Google: Scaling with the web
• Hardware performance keeps improving
  – notable exceptions: disk seek time, OS robustness
• Google is designed to scale well to extremely large data sets
• Google's data structures are optimized for fast & efficient access
• Google is a centralized SE

Design Goals
• Improved Search Quality
  – Junk results
    • The number of documents has increased by many orders of magnitude
    • The user's ability to look at documents has not
    • As the collection size grows we need tools with very high precision, even at the expense of recall
• Use of hypertextual information
  – In Google: link structure & anchor text

Design Goals (2)
• Academic Search Engine Research
  – SE development has migrated from the academic domain to the commercial one
    • SE technology has become largely a black art and advertising oriented
  – Collect usage information
    • hard to obtain, since it is considered commercially valuable
  – Support novel research activities on large-scale web data

System Features
• PageRank: Bringing order to the web
  – most web SEs have largely ignored the link graph
  – link maps containing 518 million hyperlinks
  – corresponds well with people's idea of importance
  – Pr(A) = (1 - d) + d (Pr(T1)/C(T1) + … + Pr(Tn)/C(Tn))
    • T1 … Tn are the pages that link to A
    • C(T) is the number of links going out of page T
    • d is a damping factor between 0 and 1
  – difference from traditional methods
    • links from different pages are not counted equally
    • each page's contribution is normalized by the number of links on it
    • different from HITS of Kleinberg
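The PageRank formula on this slide can be evaluated by simple iteration. The following is a minimal sketch, not the authors' implementation: the function name `pagerank`, the toy link graph, the value d = 0.85, and the fixed iteration count are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate Pr(A) = (1 - d) + d * sum(Pr(T) / C(T)) over the pages T linking to A.

    `links` maps each page to the list of pages it links to (hypothetical input format).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # in this sketch a page without out-links passes on no rank
            share = d * pr[source] / len(targets)  # Pr(T) / C(T), damped by d
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Toy web: A links to B and C, B links to C, C links back to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In the toy graph, C collects rank from both A and B, so it ends up ranked above B, which only receives half of A's rank. This illustrates why a link from a page with few out-links counts for more than a link from a page with many.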

System Features (2)
• Anchor Text
  – Associate the text of a link with the page it points to
  – advantages
    • anchors often provide a more accurate description of a page than the page itself
    • anchors can exist for documents that cannot be indexed
      – images, programs, databases, MP3 files, non-text docs, e-mails
    • the SE can return web pages that have not been crawled
  – first used in the World Wide Web Worm (WWWW), 1994
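A small sketch of the anchor-text idea: link text is stored under the page the link points to, so a target can be found even if it was never crawled or cannot be parsed. The function names, the `(source, target, anchor_text)` tuple format, and the example URLs are hypothetical.

```python
from collections import defaultdict


def index_anchors(links):
    """Map each target URL to the anchor texts of the links pointing at it."""
    anchors = defaultdict(list)
    for source_url, target_url, anchor_text in links:
        anchors[target_url].append(anchor_text.lower())
    return anchors


def search_anchors(anchors, term):
    """Return the URLs that have at least one anchor text containing `term` as a word."""
    term = term.lower()
    return sorted(
        url
        for url, texts in anchors.items()
        if any(term in text.split() for text in texts)
    )


# Neither target below needs to be crawled or parsed to become searchable.
links = [
    ("http://a.example/", "http://img.example/logo.gif", "Stanford logo"),
    ("http://b.example/", "http://uncrawled.example/", "search engine paper"),
]
anchors = index_anchors(links)
```

Here `search_anchors(anchors, "logo")` finds the image URL, although an image has no text of its own to index.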

System Features (3)
• Other Features
  – Location information
    • use of proximity in search
  – Visual presentation information
    • relative font size
  – Full raw HTML is available
    • users can view a cached version of the page
    • users can view the page as it was when indexed
    • can be used for research

Related Work
• SEs have a short history (WWWW, 1994)
• Commercial services closely guard the details of their databases
• Work on specialized features of SEs
  – especially on post-processing the results of SEs
• Work on Information Retrieval systems
  – especially on well controlled environments

IR & Differences Between the Web and Well Controlled Collections
• The TREC 96 "Very Large Corpus" is only 20 GB, compared to the 147 GB of the Google crawl
• The Web is a vast collection of heterogeneous documents
  – language, vocabulary, format
• Things that work well for TREC often do not produce good results on the web
• There is no control over what people put on the web

System Anatomy
• High Level Overview
[Figure: high level architecture diagram, not reproduced]

Major Data Structures
• Big Files
  – virtual files spanning multiple file systems
  – addressable by 64 bit integers
  – handle allocation & deallocation of file descriptors, since what the OS provides is not enough
  – support rudimentary compression

Major Data Structures (2)
• Repository
  – holds the full HTML of every crawled page, compressed
  – tradeoff between speed & compression ratio
  – chose zlib (3 to 1) over bzip (4 to 1) because of its speed
  – requires no other data structure to access it
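The repository idea can be sketched with Python's standard `zlib` module. The record layout below (a tuple of docID, URL, and compressed page) and the helper names are simplifying assumptions; the compression ratio actually achieved depends entirely on the data.

```python
import zlib


def store_page(repository, doc_id, url, html):
    """Compress a page with zlib and append it to the repository; return the compressed size."""
    packet = zlib.compress(html.encode("utf-8"))
    repository.append((doc_id, url, packet))
    return len(packet)


def load_page(repository, index):
    """Read a record back; nothing but the repository itself is needed to do so."""
    doc_id, url, packet = repository[index]
    return doc_id, url, zlib.decompress(packet).decode("utf-8")


repository = []
html = "<html><body>" + "<p>hello web</p>" * 200 + "</body></html>"
compressed_size = store_page(repository, 1, "http://example.com/", html)
```

Because every record carries its own docID and URL, the other data structures could in principle be rebuilt from the repository alone.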

Major Data Structures (3)
• Document Index
  – keeps information about each document
  – fixed width ISAM (index sequential access mode) index, ordered by docID
  – each entry includes various statistics
    • a pointer into the repository and, if the document was crawled, a pointer to its document info
  – compact data structure
  – a record can be fetched in 1 disk seek during a search

Major Data Structures (4)
• URL to docID file
  – used to convert URLs to docIDs
  – list of URL checksums with their docIDs
  – sorted by checksum
  – given a URL, its checksum is computed and a binary search is performed
  – conversion is done in batch mode
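A sketch of the checksum lookup described on this slide, using the standard `bisect` module for the binary search. CRC-32 from `zlib` stands in for the checksum as an assumption, since the slides do not say which checksum is used; the function names and example URLs are hypothetical.

```python
import bisect
import zlib


def checksum(url):
    # Assumption: CRC-32 is used here only as a convenient stand-in checksum.
    return zlib.crc32(url.encode("utf-8"))


def build_table(url_to_docid):
    """Build the list of (checksum, docID) pairs, sorted by checksum."""
    return sorted((checksum(url), doc_id) for url, doc_id in url_to_docid.items())


def lookup(table, url):
    """Binary-search the sorted table for the URL's checksum; None if absent."""
    key = checksum(url)
    i = bisect.bisect_left(table, (key,))  # first entry whose checksum is >= key
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None


table = build_table({
    "http://a.example/": 1,
    "http://b.example/": 2,
    "http://c.example/": 3,
})
```

A batch conversion could sort the checksums of all pending URLs and merge them against the table in a single pass instead of running one binary search per URL.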

Major Data Structures (4)
• Lexicon
  – can fit in memory for a reasonable price
    • currently fits on a machine with 256 MB of main memory
    • contains 14 million words
  – 2 parts
    • a list of words
    • a hash table

Major Data Structures (4)
• Hit Lists
  – each hit records the position, font & capitalization of a word occurrence
  – account for most of the space used in the indexes
  – 3 encoding alternatives: simple, Huffman, hand-optimized
  – the hand-optimized encoding uses 2 bytes for every hit
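The 2-byte hand-optimized encoding can be sketched with bit packing. The layout used here follows the plain-hit layout described in the Brin & Page paper (1 capitalization bit, 3 bits of relative font size, 12 bits of position). The function names and the choice to clamp large positions to 4095 are assumptions of this sketch.

```python
def encode_plain_hit(capitalized, font_size, position):
    """Pack one hit into 16 bits: 1 bit capitalization, 3 bits font size, 12 bits position."""
    if not 0 <= font_size <= 6:
        # Only 7 font values are usable; in the paper the value 7 flags a "fancy" hit.
        raise ValueError("font_size must be between 0 and 6")
    position = min(position, 4095)  # assumption: clamp positions that exceed 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_plain_hit(hit):
    """Unpack a 16-bit hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = encode_plain_hit(True, 3, 42)
packed = hit.to_bytes(2, "big")  # exactly two bytes per hit
```

Decoding `hit` gives back `(True, 3, 42)`.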

Major Data Structures (4)
• Hit Lists (2)
[Figure: hit list encoding diagram, not reproduced]

Major Data Structures (5)
• Forward Index
  – partially sorted
  – stored in 64 Barrels
  – each Barrel holds a range of wordIDs
  – requires slightly more storage, because a docID is repeated in every Barrel that holds one of its words
  – each wordID is stored as a relative difference from the minimum wordID of the Barrel
  – saves considerable time in the sorting
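A sketch of the relative wordID scheme. The slides do not say how wordID ranges are assigned to Barrels, so the equal-width ranges here are an assumption, as are the constant and function names. The 64 Barrels and the 14 million word lexicon come from the slides.

```python
NUM_BARRELS = 64
LEXICON_SIZE = 14_000_000
# Assumption: every Barrel covers an equal-width range of wordIDs (ceiling division).
BARREL_WIDTH = -(-LEXICON_SIZE // NUM_BARRELS)


def to_relative(word_id):
    """Return (barrel number, offset from the Barrel's minimum wordID)."""
    barrel = word_id // BARREL_WIDTH
    return barrel, word_id - barrel * BARREL_WIDTH


def to_absolute(barrel, offset):
    """Recover the original wordID from its Barrel number and stored offset."""
    return barrel * BARREL_WIDTH + offset


barrel, offset = to_relative(13_999_999)
```

With these numbers every offset stays below 218,750, so it fits comfortably in 24 bits. That leaves room in a 32-bit field for other data, such as a hit list length.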

Major Data Structures (6)
• Inverted Index
  – 64 Barrels (same as the Forward Index)
  – for each wordID the Lexicon contains a pointer into the Barrel that wordID falls into
  – the pointer points to a doclist of docIDs together with their hit lists
  – the order of the docIDs in a doclist is important
    • sorted by docID, or by a ranking of the word's occurrence in each doc
  – Google chose a compromise: one set of Barrels for title and anchor hits, another for all hits

Major Data Structures (7)
• Crawling the Web
  – fast distributed crawling system
  – URLserver & Crawlers are implemented in Python
  – each Crawler keeps about 300 connections open
  – at peak the rate is over 100 pages, about 600 KB, per second
  – each Crawler uses:
    • an internal cached DNS lookup
    • asynchronous IO to handle events
    • a number of queues
  – robust & carefully tested

Major Data Structures (8)
• Indexing the Web – Parsing
  – the parser must know how to handle errors
    • HTML typos
    • kilobytes of zeros in the middle of a tag
    • non-ASCII characters
    • HTML tags nested hundreds deep
• Developed their own parser
  – involved a fair amount of work
  – did not cause a bottleneck

Major Data Structures (9)
• Indexing Documents into Barrels
  – turning words into wordIDs
  – in-memory hash table: the Lexicon
  – new additions are logged to a file
  – parallelization
    • shared base lexicon of 14 million words
    • each indexer writes a log of all the extra words that are not in the base lexicon

Major Data Structures (10)
• Indexing the Web – Sorting
  – creates the inverted index
  – produces two types of barrels
    • for titles and anchors
    • for full text
  – sorts every barrel separately
  – sorters run in parallel
  – the sorting is done in main memory, on pieces of a barrel small enough to fit

Searching
• Algorithm
  1. Parse the query
  2. Convert words into wordIDs
  3. Seek to the start of the doclist in the short barrel for every word
  4. Scan through the doclists until there is a document that matches all of the search terms
  5. Compute the rank of that document
  6. If we're at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
  7. If we're not at the end of any doclist, go to step 4
  8. Sort the documents by rank and return the top K
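The eight steps can be sketched as follows, assuming every doclist is a list of docIDs sorted in ascending order. The data layout (dicts from wordID to doclist), the `rank` callback, and the parameters `k` and `enough` are hypothetical simplifications; real doclists also carry hit lists.

```python
def matching_docs(doclists):
    """Yield the docIDs that appear in every doclist (each sorted by docID)."""
    positions = [0] * len(doclists)
    while all(p < len(dl) for p, dl in zip(positions, doclists)):
        current = [dl[p] for p, dl in zip(positions, doclists)]
        highest = max(current)
        if all(doc == highest for doc in current):
            yield highest  # step 4: a document matching all search terms
            positions = [p + 1 for p in positions]
        else:
            # Advance only the lists that are behind the highest docID seen.
            positions = [p + (doc < highest) for p, doc in zip(positions, current)]


def search(query, lexicon, short_barrel, full_barrel, rank, k=10, enough=10):
    words = query.lower().split()                              # step 1
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]                     # step 2
    results = {}
    for barrel in (short_barrel, full_barrel):                 # step 6: short barrels first
        doclists = [barrel.get(w, []) for w in word_ids]       # step 3
        for doc_id in matching_docs(doclists):                 # steps 4 and 7
            results[doc_id] = rank(doc_id)                     # step 5
        if len(results) >= enough:
            break                                              # enough matches already
    return sorted(results, key=results.get, reverse=True)[:k]  # step 8


lexicon = {"bill": 0, "clinton": 1}
short_barrel = {0: [1, 5], 1: [5, 9]}
full_barrel = {0: [1, 2, 5, 7], 1: [2, 5, 7, 9]}
scores = {2: 0.5, 5: 0.9, 7: 0.7}
top = search("bill clinton", lexicon, short_barrel, full_barrel, scores.get, k=2, enough=2)
```

With `enough=2` the single match in the short barrels is not sufficient, so the full barrels are scanned as well.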

The Ranking System
• The information
  – Position, Font Size, Capitalization
  – Anchor Text
  – PageRank
• Hit Types
  – title, anchor, URL, etc.
  – small font, large font, etc.

The Ranking System (2)
• Each hit type has its own weight (type-weight)
• The number of hits of each type is converted into a count-weight
  – count-weights increase linearly with counts at first but quickly taper off
• The type-weights and count-weights are combined into the IR score of the doc
• The IR score is combined with PageRank to give the final rank
• For a multi-word query
  – a proximity score is computed for every set of hits, with a weight per proximity type
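A sketch of the single-word scoring described on this slide. The slides give no numbers, so the type-weights, the cap used to model the taper, and the way the IR score is combined with PageRank are all hypothetical choices made for illustration.

```python
# Hypothetical type-weights; the real values are not given in the slides.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "large_font": 2.0, "plain": 1.0}
COUNT_CAP = 5  # hypothetical: counts above this add nothing


def count_weight(count):
    """Grow linearly with the count at first, then stop growing (a crude taper)."""
    return float(min(count, COUNT_CAP))


def ir_score(hit_counts):
    """Combine the count-weights with the type-weights of one document's hits."""
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(n) for hit_type, n in hit_counts.items())


def final_rank(hit_counts, pagerank, pagerank_weight=10.0):
    # Assumption: a weighted sum; the slides only say the two scores are combined.
    return ir_score(hit_counts) + pagerank_weight * pagerank


score = final_rank({"title": 1, "plain": 3}, pagerank=0.5)
```

The cap means that repeating a word hundreds of times in plain text earns no more than repeating it `COUNT_CAP` times, which blunts keyword stuffing.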

Feedback
• A trusted user may optionally evaluate the results
• The feedback is saved
• When modifying the ranking function we can see the impact of this change on all previous searches that were ranked

Results
• Produces better results than major commercial search engines for most searches
• Example: query "bill clinton"
  – returns results from whitehouse.gov
  – returns email addresses of the president
  – all the results are high quality pages
  – no broken links
  – no "bill" without "clinton" & no "clinton" without "bill"

Storage Requirements
• Using compression on the repository
• About 55 GB for all the data used by the SE
• Most of the queries can be answered by just the short inverted index
• With better compression, a high quality SE can fit onto a 7 GB drive of a new PC

Storage Statistics
Web Page Statistics
[Tables: storage statistics and web page statistics, not reproduced]

System Performance
• It took 9 days to download 26 million pages
• Once running smoothly, the crawler downloaded the last 11 million pages in 63 hours: 48.5 pages per second
• The Indexer & Crawler ran simultaneously
• The Indexer runs at 54 pages per second
• The sorters run in parallel using 4 machines; the whole sorting process took 24 hours
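A quick check of the crawl rates. The 48.5 pages per second figure is not the average over the whole 9 days. In the Brin & Page paper it is the rate of the final stretch of the crawl, when the last 11 million pages were downloaded in 63 hours. The variable names below are only illustrative.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# Average over the whole crawl: 26 million pages in 9 days.
overall_rate = 26_000_000 / (9 * SECONDS_PER_DAY)

# Steady-state rate: the last 11 million pages in 63 hours.
steady_rate = 11_000_000 / (63 * SECONDS_PER_HOUR)

# Time the Indexer needs for all 26 million pages at 54 pages per second.
indexing_days = 26_000_000 / 54 / SECONDS_PER_DAY
```

The overall average is about 33.4 pages per second, and the steady-state rate is about 48.5. At 54 pages per second, the Indexer is slightly faster than the crawler at its steady-state rate.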

Conclusions
• Scalable Search Engine
• High Quality Search Results
• Search techniques
  – PageRank
  – Anchor Text
  – Proximity Information
• A Complete Architecture

Future Work
• Improve search efficiency
• Scale to 100 million pages
• Boolean Operators
• Text Surrounding Links
• PageRank Personalization
• Result Summarization

New Features
• Google Scout
• Documents Caching
• Uncle Sam’s
• Link: option

The End