
Google's Deep-Web Crawl (VLDB 2008)
Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy, Google Inc.
Speaker: Tom

What is the Deep Web?
• Content hidden behind HTML forms
• Deep = not accessible through search engines

Why is it important?
• Large source of structured data
  • Forms present a search interface over backend databases
• Significant gap in search engine coverage
  • Potentially more content than the currently searchable web [Bergman+, Madhavan+, He+]
  • More than 10 million distinct HTML forms
  • Likely to increase as more data comes online
• Challenge: make the Deep Web accessible to web search

What is in the Deep Web?
• Yes: informational forms
  • store locations, used cars, patents, radio stations, recipes
• No: login forms, anything that requires user information
• Maybe: interactive forms, e.g., airline reservations


Virtual Integration
• Mediator forms per domain
• Mappings between forms [Doan+, He+, Wu+]
• Query routing/reformulation at run-time
• Popular with vertical search engines
[Figure labels: mediated form, semantic mappings, deep-web sources]
Impractical for web search!
• Modeling all domains in all languages might not be possible
• High cost of building and maintaining
• Query routing at run-time is very difficult
• Potentially high loads on deep-web sources



Surfacing the Deep Web
• Pre-compute all interesting form submissions for each HTML form
• Each form submission corresponds to a distinct URL
• Add the URL for each form submission to the search engine index
• Enables the reuse of existing search engine infrastructure
  • Deep-web URLs are like any other URL
• Reduced load on deep-web sites
  • Only in response to user clicks on a search result
• Search engine performance does not depend on the deep-web source

Surfacing Challenges
1. Predicting the appropriate values for text inputs
  • Valid input values are required for retrieving data
  • E.g., ingredients on recipes.com and zip codes on borderstores.com
2. Predicting the correct input combinations
  • Generating all possible URLs is wasteful and unnecessary
  • Cars.com has ~500K listings, but 250M possible queries

Surfacing for a Search Engine
• Goal: access to as much Deep-Web content as possible
  • Distribution of form-generated traffic is heavy-tailed
  • More than 800,000 distinct forms in a week
  • Overall coverage is more important than site-specific coverage
• A completely automatic and efficient solution is required!
  • Many domains and many languages
  • No human in the loop, no site-specific scripts

Contributions and Impact
• Research contributions
  • Formulation: searching for informative query templates
  • Algorithms: predicting input combinations
  • Algorithms: predicting input values for text boxes
• Google's Deep-Web crawling system
  • Affects more than 1000 queries per second
  • Enables access to more than a million Deep-Web sites
  • Spans 50+ languages and 100+ domains

Part 1: Problem Formulation

Form Processing 101
    <form action="http://www.borders.com/locator" method="GET">
      <select name="store"><option …/>… </select>
      …
      <input name="zip" type="text"/>
      <input name="search" type="submit" value="Go"/>
      <input name="site" type="hidden" value="homepage"/>
    </form>
On submit, the browser requests the URL:
    http://www.borders.com/locator?store=All&city=&state=&zip=94043&within=25&search=Go&site=homepage
• GET and POST are the two submission methods for HTML forms
• Only GET forms can be surfaced
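
As a rough illustration of the slide above, the sketch below rebuilds the submitted URL from the form's action and its input values. The helper name `submission_url` is made up for this example; the `city`, `state`, and `within` inputs are taken from the URL on the slide, since the form listing elides them.

```python
from urllib.parse import urlencode


def submission_url(action, inputs):
    """Build the URL a browser would request when a GET form is submitted.

    `inputs` is an ordered list of (name, value) pairs, one per form input.
    """
    return action + "?" + urlencode(inputs)


# Input values as they appear in the slide's example URL.
inputs = [
    ("store", "All"),
    ("city", ""),
    ("state", ""),
    ("zip", "94043"),
    ("within", "25"),
    ("search", "Go"),
    ("site", "homepage"),
]
url = submission_url("http://www.borders.com/locator", inputs)
print(url)
```

Because every input ends up in the query string, each distinct assignment of values yields a distinct, indexable URL. A POST submission carries the values in the request body instead, which is why only GET forms can be surfaced this way.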

Problem Formulation
• Form submission ~ SQL query
    select * from DB where I1 = V1 and … and IN = VN
• Not all inputs impose selection predicates
  • E.g., sort order and results per page affect only presentation
• Problem: find the best set of SQL queries

Query Templates
• Query template: compact representation of a set of queries
  • IB: the binding inputs in the form
  • PB: selection predicates involving only IB
  • { select * from DB where PB }: all queries with different values for IB
  • Default values are assigned to the other inputs
• A store locator with zip and type inputs can have the templates <Z>, <T>, and <T, Z>:
  • <Z>: { select * from DB where zip = z | z is a valid zip code }
  • <T>: { select * from DB where type = t | t is a valid store type }
  • <T, Z>: { select * from DB where zip = z and type = t | … }
• Problem: find the best possible query templates
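
One possible way to picture a query template in code: treat it as the set of binding inputs, and expand it into one input assignment per combination of candidate values, leaving every other input at its default. The function name, the sample zip codes, and the store types below are all invented for illustration.

```python
from itertools import product


def expand_template(binding, values, defaults):
    """Yield one input assignment per combination of values for the binding inputs.

    Inputs outside `binding` keep their default value.
    """
    names = sorted(binding)
    for combo in product(*(values[name] for name in names)):
        assignment = dict(defaults)
        assignment.update(zip(names, combo))
        yield assignment


# Hypothetical candidate values and defaults for a store-locator form.
values = {"zip": ["94043", "10001"], "type": ["books", "music", "cafe"]}
defaults = {"zip": "", "type": "All"}

zip_only = list(expand_template({"zip"}, values, defaults))              # template <Z>
zip_and_type = list(expand_template({"zip", "type"}, values, defaults))  # template <T, Z>
print(len(zip_only), len(zip_and_type))
```

The number of submissions grows as the product of the value-list sizes of the binding inputs, which is why the choice of template matters far more than the choice of any single value.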

Part 2: Predicting Input Combinations

Predicting Input Combinations
• Forms can have multiple inputs
  • Generating all possible URLs is wasteful!
  • … and unnecessary!
• Goal: minimize URLs while maximizing retrieval!
• Other considerations
  • Generated URLs must be good candidates for the index
  • Only need URLs sufficient to drive traffic
  • Only need URLs sufficient to seed the web crawler

Query Template Quality
• A presentation input is binding
  - There exists a template with fewer binding inputs
• Large query templates (many binding inputs)
  - Too many queries generated
  - Numerous queries with empty results
  + Likely to ensure complete coverage
• Small query templates (fewer binding inputs)
  + Smaller number of queries
  - Lower actual coverage (restrictions on the results per page)
  - Results of a single query not sufficiently related

Good Query Templates
• Do not contain presentation inputs
• Neither too small nor too large
  • Dependent on database size?
  • Dependent on potential query traffic?

Informative Query Templates
• Result pages differ: informative
    http://jobs.shrm.org/search?state=All&kw=&type=All
    http://jobs.shrm.org/search?state=AL&kw=&type=All
    http://jobs.shrm.org/search?state=AK&kw=&type=All
    …
    http://jobs.shrm.org/search?state=WV&kw=&type=All
• Result pages are similar: un-informative
    http://jobs.shrm.org/search?state=All&kw=&type=ALL
    http://jobs.shrm.org/search?state=All&kw=&type=ANY
    http://jobs.shrm.org/search?state=All&kw=&type=EXACT

Identifying Informative Templates
• Generate a sample of possible form submissions
• Analyze and compare the contents of the result pages
  • Compute a content signature for each result page
  • Dist. Frac. = # Distinct Signatures / # URLs
  • Dist. Frac. > Threshold implies an informative template
• Content signatures must be robust to
  • Changes in HTML layout
  • Minor differences in content
  • Presence of advertisements and transient content
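
The distinctness fraction can be sketched in a few lines. The signature used here (a hash of lower-cased, whitespace-normalized text) is only a stand-in assumed for this example; it is not robust to layout changes or advertisements the way the slide requires. The 0.25 threshold is likewise an assumed default, not a value stated on the slide.

```python
import hashlib
import re


def content_signature(page_text):
    """Toy signature: hash of lower-cased text with whitespace collapsed (an assumption)."""
    normalized = re.sub(r"\s+", " ", page_text.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def distinctness_fraction(pages):
    """Number of distinct signatures divided by the number of sampled result pages."""
    if not pages:
        return 0.0
    return len({content_signature(page) for page in pages}) / len(pages)


def is_informative(pages, threshold=0.25):
    """A template is informative when its distinctness fraction exceeds the threshold."""
    return distinctness_fraction(pages) > threshold


# Four sampled result pages; the third differs from the first only in case and spacing.
pages = ["Jobs in AL", "Jobs in AK", "jobs  in al", "Jobs in WV"]
print(distinctness_fraction(pages), is_informative(pages))
```

A template whose submissions all return the same error page scores close to zero, and so does one that varies only a presentation input, so both fall below the threshold.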


URL Generation
• A low distinctness fraction implies one of:
  • presentation inputs: many pages have similar results
  • very large template: many pages are empty
  • error template: all pages are the same, showing an error message
• In each case the generated submissions are unlikely to be useful
• URL generation strategy
  • Enumerate all possible query templates
  • Test each template for informativeness
  • Generate all URLs from informative templates

Incremental Template Search
• Determine informative templates with one binding input
• Determine informative templates with two binding inputs
  • Only consider pairs with one input known to be informative
• Incrementally build candidate templates
  • Only consider supersets of smaller informative templates
  • Halt when no larger templates are possible
• ISIT: Incremental Search for Informative Templates
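
The incremental search can be sketched as a level-by-level loop. This is a simplified reading of the slide, not the authors' implementation: the function name `isit`, the `max_size` cap, and the toy oracle standing in for the real informativeness test are all assumptions made for this example.

```python
def isit(inputs, is_informative, max_size=3):
    """Search templates by size, extending only those found informative."""
    informative = []
    candidates = [frozenset([name]) for name in inputs]
    size = 1
    while candidates and size <= max_size:
        passed = [t for t in candidates if is_informative(t)]
        informative.extend(passed)
        # Next level: each informative template extended by one more input.
        extended = {t | {name} for t in passed for name in inputs if name not in t}
        candidates = sorted(extended, key=sorted)
        size += 1
    return informative


# Toy oracle: pretend only <state> and <state, kw> are informative.
truth = {frozenset({"state"}), frozenset({"state", "kw"})}
tested = []


def oracle(template):
    tested.append(template)
    return template in truth


found = isit(["state", "kw", "type"], oracle)
print(found, len(tested))
```

In this toy run six of the seven possible templates are tested, and the pair <kw, type> is never probed because neither of its inputs is informative on its own. On forms with more inputs the pruning saves far more.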

Scalable URL Generation
• Competitors
  • Cartesian: all possible URLs
  • Triple: templates with three binding inputs
• Our algorithm generates far fewer URLs
• Informativeness test plays a critical role
• Number of URLs generated depends on database size

Other Significant Results
• Larger templates are useful
  • Compare with the simple strategy of single-binding-input templates
  • Among forms that have informative templates with 3 inputs:
    • Templates of size 1 contribute 6% of search results on Google.com
    • Templates of size 2 contribute 37%
    • Templates of size 3 contribute 57%
• Informative templates are discovered efficiently
  • Among forms with 5 inputs, on average:
    • Only 12.6 templates (out of a possible 2^5 - 1 = 31) are tested
    • Only 1300 URLs are analyzed in total

Part 3: Predicting Text Values

Generic and Typed Text Boxes
• Generic search boxes
  • Accept any keywords
  • Challenge: selecting the most appropriate values
• Typed text boxes
  • Accept only values belonging to specific types, e.g., zip codes
  • Challenge: selecting the type of the input

Example: www.wipo.int

Input Values for Generic Search
• Iterative probing for search boxes
  • Select an initial list of candidate keywords
  • Download pages based on the current set of keywords
  • Extract more candidate keywords from the result pages
  • Refine the current set of keywords
  • Repeat until there are no more new candidate keywords
  • Prune the list of candidate keywords
• Related work
  • Classifying Deep-Web sources [Ipeirotis+]
  • Extracting text documents [Ntoulas+, Barbosa+]
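
The probing loop might look roughly like the following. Everything here is a simplification assumed for illustration: `fetch_results` stands in for submitting the form and downloading result pages, candidate keywords are ranked by how many fetched pages contain them, and the `per_round` cap is a crude substitute for the refine and prune steps.

```python
import re
from collections import Counter


def iterative_probe(seed_keywords, fetch_results, max_rounds=5, per_round=10):
    """Grow a keyword set by repeatedly probing the form with newly found words."""
    seen = set(seed_keywords)
    frontier = list(seed_keywords)
    for _ in range(max_rounds):
        if not frontier:
            break  # no new candidate keywords were found in the last round
        counts = Counter()
        for keyword in frontier:
            for page in fetch_results(keyword):
                counts.update(set(re.findall(r"[a-z]+", page.lower())))
        # Rank by page count, break ties alphabetically, keep only unseen words.
        ranked = sorted(counts, key=lambda word: (-counts[word], word))
        frontier = [word for word in ranked if word not in seen][:per_round]
        seen.update(frontier)
    return seen


# Toy "site": a keyword query returns every document containing that word.
documents = [
    "protein antibody assay",
    "antibody viscosity",
    "metalworking nosepiece",
    "ozonizer viscosity",
]


def fetch_results(keyword):
    return [doc for doc in documents if keyword in doc.split()]


keywords = iterative_probe(["protein"], fetch_results)
print(sorted(keywords))
```

Starting from the single seed "protein", the loop reaches "ozonizer" through "antibody" and "viscosity", but never finds the document about metalworking, which shares no word with the others. Seed selection therefore affects which part of the database is reachable.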

Example: www.wipo.int
Extracted keywords: Metalworking, Protein, Antibody, Pyrazole, Immobilizer, Vasoconstriction, Phosphinates, Nosepiece, Sandbridge, Viscosity, Carboxydiphenylsulphide, Ozonizer, …


Results Summary
• The distribution of extracted keywords is heavy-tailed
• A large fraction of records is retrieved
• Text inputs and select menus are complementary, and both are important
• The web crawler can automatically retrieve additional content

Typed Text Boxes
• Library of types that are common across domains
  • Name patterns and sample values
  • Zip codes, city names, prices, dates
• Re-use the informativeness test
  • Test singleton text boxes
  • A text box is informative only when the correct type is used
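
A sketch of how the informativeness test could be re-used to pick a type: probe the text box with sample values from each type and keep the type whose result pages are most distinct. The function name, the tiny type library, the exact-match page comparison, and the 0.25 threshold are all assumptions made for this illustration.

```python
def guess_input_type(type_library, fetch_page, threshold=0.25):
    """Return the type whose sample values give the most distinct result pages.

    Returns None when no type exceeds the threshold. Pages are compared by
    exact equality, a stand-in for real content signatures.
    """
    best_type, best_fraction = None, threshold
    for type_name, samples in type_library.items():
        if not samples:
            continue
        pages = [fetch_page(value) for value in samples]
        fraction = len(set(pages)) / len(pages)
        if fraction > best_fraction:
            best_type, best_fraction = type_name, fraction
    return best_type


# Hypothetical sample values per type.
type_library = {
    "zipcode": ["94043", "10001", "60601", "02139"],
    "city": ["Boston", "Austin", "Denver", "Seattle"],
}


def fetch_page(value):
    """Toy store locator: only 5-digit zip codes return results."""
    if value.isdigit() and len(value) == 5:
        return "Stores near " + value
    return "No results"


guess = guess_input_type(type_library, fetch_page)
print(guess)
```

City names all produce the same "No results" page here, so their distinctness fraction is 0.25 and fails the threshold, while zip codes produce four distinct pages.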


Summary: Google's Deep-Web Crawl
• Solution based on the idea of informative templates
• Automatic descriptions learned for millions of forms
• Spans many domains and 50+ languages
• Affects more than 1000 queries per second
  • Results served from 400K+ distinct forms per day
  • Results served from 800K+ distinct forms per week
• Results validate the utility of Deep-Web content

Future Work
• Extending the coverage of crawlable forms
  • Dependencies between inputs, which are currently ignored
  • JavaScript-based submissions, which involve complex URL generation
• Surfacing is only part of the solution
  • POST forms cannot be indexed by surfacing
  • Surfacing flattens structure, so structure cannot be exploited during ranking

Related to 3D-LBS
• Mobile application
• Accessibility
• Limited screen size, hard to fill in forms
• Recommendation
• Location-sensitive query suggestion
• Dependency of inputs
• Hong Kong Style Dim Sum Shatin

Q&A
Thanks!