Structural Web Search Using a Graph-Based Discovery System
Nitish Manocha, Diane J. Cook, and Lawrence B. Holder
University of Texas at Arlington
cook@cse.uta.edu
http://www-cse.uta.edu/~cook

Structured Web Search
- Existing search engines use linear feature match
- Web contains structural information as well
  - Hyperlink information
  - Web viewed as a graph [Kleinberg]
- Subdue searches based on structure
  - Use as foundation of a structural search engine
- Incorporation of WordNet allows for synonym match

- Subdue discovers structural patterns in input graphs
- A substructure is a connected subgraph
- An instance of a substructure is a subgraph that is isomorphic to the substructure definition
- Pattern discovery, classification, clustering
[Figure: input database (T1 to T4), substructure S1 in graph form (object and shape vertices, triangle and square labels, "on" relation), and the compressed database (S1 to S4)]

Subdue Algorithm
- Start with individual vertices
- Keep only best substructures on queue
- Expand substructure by adding edge/vertex
- Compress graph and repeat to generate hierarchical description
- Optional use of background knowledge

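The first two steps of the algorithm (grow from single vertices, keep only the best candidates on a bounded queue) can be illustrated with a small sketch. This is not Subdue's implementation: it covers only one-edge substructures, it ranks them by raw instance count (an assumption, since the slides do not say how "best" is measured), and it performs no compression. The toy graph and the names `single_edge_patterns`, `best_patterns`, and `beam_width` are hypothetical.

```python
from collections import Counter

# Toy labeled graph (hypothetical data): vertex id -> label,
# and directed edges as (source id, target id, edge label).
vertices = {1: "object", 2: "triangle", 3: "object", 4: "square",
            5: "object", 6: "triangle", 7: "object", 8: "square",
            9: "object", 10: "triangle"}
edges = [(1, 2, "shape"), (3, 4, "shape"), (1, 3, "on"),
         (5, 6, "shape"), (7, 8, "shape"), (5, 7, "on"),
         (9, 10, "shape")]

def single_edge_patterns(vertices, edges):
    """Count instances of each one-edge substructure,
    keyed by (source label, edge label, target label)."""
    return Counter((vertices[s], lbl, vertices[d]) for s, d, lbl in edges)

def best_patterns(counts, beam_width):
    """Keep only the beam_width most frequent patterns.
    Frequency is an assumed score; ties are broken by label order."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:beam_width]

counts = single_edge_patterns(vertices, edges)
queue = best_patterns(counts, beam_width=2)
for pattern, n in queue:
    print(pattern, n)
```

A fuller version would expand each queued pattern by one more edge or vertex, rescore, and repeat, replacing instances of the winning pattern with a single vertex before the next pass.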
Inexact Graph Match
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph to make it isomorphic to another
- Match if cost/size < threshold

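The match rule can be written as a short check. This is a minimal sketch that assumes the transformation cost has already been computed elsewhere and that "size" is a positive number describing the graph being matched; the slides do not define either quantity precisely. The function names are hypothetical.

```python
def is_inexact_match(transform_cost: float, size: float, threshold: float) -> bool:
    """Return True when cost/size is strictly below the threshold."""
    if size <= 0:
        raise ValueError("size must be positive")
    return transform_cost / size < threshold

def max_allowed_cost(size: float, threshold: float) -> float:
    """Largest cost budget implied by the rule: cost < threshold * size."""
    return threshold * size

# Hypothetical size of 21, chosen so that a 0.2 threshold gives a budget of about 4.2.
budget = max_allowed_cost(size=21, threshold=0.2)
print(round(budget, 2))                 # 4.2
print(is_inexact_match(4, 21, 0.2))     # True: 4/21 is below 0.2
print(is_inexact_match(5, 21, 0.2))     # False: 5/21 is above 0.2
```

A later slide reports a threshold of 0.2 with 4.2 transformations allowed, which under this reading would correspond to a size of 21. That size is inferred here, not stated in the slides.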
Application Domains
- Protein data
- Human Genome DNA data
- Spatial-temporal domains
  - Earthquake data
  - Aircraft Safety and Reporting System
- Telecommunications data
- Program source code
- Web data

Represent Web as Graph
- Breadth-first search of domain to generate graph
- Nodes represent pages / documents
- Edges represent hyperlinks
- Additional nodes represent document keywords
[Figure: example graph of page nodes joined by hyperlink edges, with word edges to keyword nodes such as subdue, texas, university, projects, work, parallel, group, learning, robotics, planning]

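The graph construction described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the "site" is a hypothetical in-memory dictionary rather than a real crawl, each distinct keyword gets one shared vertex (the slides do not say whether keyword vertices are shared between pages), and `build_web_graph` is an invented name.

```python
from collections import deque

# Hypothetical in-memory site: URL -> (outgoing links, keywords on the page).
SITE = {
    "/": (["/projects.html", "/people.html"], ["texas", "university"]),
    "/projects.html": (["/"], ["subdue", "projects"]),
    "/people.html": ([], ["group"]),
}

def build_web_graph(start, site):
    """Breadth-first traversal from start; returns (vertices, edges, page_id)."""
    vertices = {}   # vertex id -> label ("page" or a keyword)
    edges = []      # (source id, target id, "hyperlink" or "word")
    page_id = {}    # URL -> vertex id
    word_id = {}    # keyword -> vertex id (one shared vertex per keyword: an assumption)

    def new_vertex(label):
        vid = len(vertices) + 1
        vertices[vid] = label
        return vid

    page_id[start] = new_vertex("page")
    queue = deque([start])
    while queue:
        url = queue.popleft()
        links, words = site[url]
        for word in words:
            if word not in word_id:
                word_id[word] = new_vertex(word)
            edges.append((page_id[url], word_id[word], "word"))
        for target in links:
            if target not in site:
                continue  # stay inside the domain being searched
            if target not in page_id:
                page_id[target] = new_vertex("page")
                queue.append(target)
            edges.append((page_id[url], page_id[target], "hyperlink"))
    return vertices, edges, page_id

vertices, edges, page_id = build_web_graph("/", SITE)
print(len(vertices), "vertices,", len(edges), "edges")
```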
WebSubdue’s Structural Search
- Formulate query as graph
- Use Subdue’s predefined substructure option to search for instances of query
[Figure: example query graph with labels Instructor, Teaching, Research, Publication, Robotics, http, and Postscript | PDF]

Query: Find all pages which link to a page containing term ‘Subdue’
[Figure: query graph, page --hyperlink--> page --word--> Subdue]

Query graph:
  s
  /* Vertex ID Label */
  v 1 page
  v 2 page
  v 3 Subdue
  /* Edge Vertex1 Vertex2 Label */
  d 1 2 hyperlink
  d 2 3 word

Subgraph vertices:
  1 page URL: http://cygnus.uta.edu
  7 page URL: http://cygnus.uta.edu/projects.html
  8 Subdue
  [1->7] hyperlink
  [7->8] word

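For this particular query shape, the search can be sketched with a hand-written matcher. This is not Subdue's general subgraph matcher; it handles only the pattern page --hyperlink--> page --word--> term. The vertex ids 1, 7, and 8 follow the instance listed above, vertex 9 is added as hypothetical extra data, and `find_linking_pages` is an invented name.

```python
def find_linking_pages(vertices, edges, term):
    """Return (p1, p2, w) triples where page p1 has a hyperlink to page p2
    and p2 has a word edge to vertex w labeled with term."""
    # Pages that contain the term, mapped to the matching word vertex.
    has_term = {}
    for src, dst, label in edges:
        if label == "word" and vertices[src] == "page" and vertices[dst] == term:
            has_term.setdefault(src, dst)
    # Pages that link to one of those pages.
    matches = []
    for src, dst, label in edges:
        if label == "hyperlink" and vertices[src] == "page" and dst in has_term:
            matches.append((src, dst, has_term[dst]))
    return matches

vertices = {1: "page", 7: "page", 8: "Subdue", 9: "page"}
edges = [(1, 7, "hyperlink"), (7, 8, "word"), (9, 1, "hyperlink")]

instances = find_linking_pages(vertices, edges, "Subdue")
print(instances)  # [(1, 7, 8)]
```

Page 9 links to page 1, but page 1 does not itself contain the term, so it produces no instance.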
Search for Presentation Pages
[Figure: query graph, page nodes joined by hyperlink edges]
- WebSubdue
  - 22 instances
- AltaVista
  - Query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif”
  - 12 instances

Search for Reference Pages
[Figure: query graph, page nodes (page … page) each joined by a hyperlink edge to one page]
- Search for page with at least 35 in-links
  - WebSubdue found 5 pages in www-cse
  - AltaVista cannot perform this type of search

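On an edge list, the in-link criterion reduces to counting distinct linking pages per target. The sketch below is a direct count rather than a Subdue query, and it assumes self-links and repeated links from the same page should not be counted, which the slides do not specify. The vertex ids and the name `pages_with_min_inlinks` are hypothetical.

```python
def pages_with_min_inlinks(edges, min_inlinks):
    """Return sorted ids of pages linked to by at least min_inlinks distinct pages."""
    sources = {}  # target page id -> set of distinct source page ids
    for src, dst, label in edges:
        if label == "hyperlink" and src != dst:
            sources.setdefault(dst, set()).add(src)
    return sorted(page for page, srcs in sources.items() if len(srcs) >= min_inlinks)

# Hypothetical data: page 100 has 35 in-links, page 200 has 10.
edges = [(src, 100, "hyperlink") for src in range(1, 36)]
edges += [(src, 200, "hyperlink") for src in range(1, 11)]

reference_pages = pages_with_min_inlinks(edges, min_inlinks=35)
print(reference_pages)  # [100]
```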
Inclusion of WordNet
- When generating graph
  - Use common stopword list
- When searching for subgraph instances
  - Morphology functions
    - October = Oct
    - teaching = teach
  - Synsets
    - Optional allowance of synonyms

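The effect of these steps on label comparison can be sketched without calling WordNet itself. The three small tables below are hypothetical stand-ins for a stopword list, WordNet's morphology functions, and a synset lookup; the example words are taken from the slides, but the table contents and all function names are assumptions.

```python
# Hypothetical stand-ins for the real resources.
STOPWORDS = {"the", "of", "in", "and", "a"}
BASE_FORMS = {"oct": "october", "teaching": "teach", "jobs": "job"}
SYNONYMS = {"job": {"employment", "work", "problem", "task"}}

def keywords(text):
    """Split text into lowercase words and drop stopwords (graph generation step)."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def base_form(word):
    """Map a word to its base form, or return it unchanged if unknown."""
    w = word.lower()
    return BASE_FORMS.get(w, w)

def labels_match(query_word, page_word, allow_synonyms=False):
    """Compare two labels after morphology, optionally allowing one level of synonyms."""
    q, p = base_form(query_word), base_form(page_word)
    if q == p:
        return True
    if allow_synonyms:
        return p in SYNONYMS.get(q, set()) or q in SYNONYMS.get(p, set())
    return False

print(keywords("Jobs in computer science"))                       # ['jobs', 'computer', 'science']
print(labels_match("October", "Oct"))                             # True
print(labels_match("jobs", "employment"))                         # False
print(labels_match("jobs", "employment", allow_synonyms=True))    # True
```

Keeping synonym matching behind a flag mirrors the slide's "optional allowance of synonyms": exact and morphological matches stay cheap, and the broader match is applied only when the query asks for it.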
Search for pages on ‘jobs in computer science’
[Figure: query graph, a page node with word edges to jobs, computer, science]
- Inexact match: allow one level of synonyms
- WebSubdue found 33 matches
  - Words include employment, work, job, problem, task
- AltaVista found 2 matches

Search for ‘authority’ hub and authority pages
[Figure: query graph, HUBS pages joined by hyperlink edges to AUTHORITIES pages, which have word edges to algorithms]
- WebSubdue found 3 hub (and 3 authority) pages
- AltaVista cannot perform this type of search
[Figure: query graph, page nodes with word edges to algorithms]
- Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
- WebSubdue found 13 matches

Subdue Learning from Web Data
- Distinguish professors’ and students’ web pages
  - Learned concept (professors have “box” in address field)
  [Figure: learned substructure, a page node with a word edge to box]
- Distinguish online stores and professors’ web pages
  - Learned concept (stores have more levels in graph)
  [Figure: learned substructure, a chain of page nodes]

Conclusions
- WebSubdue can be used to search for structural web data
- Could be enhanced with additional WordNet features such as synset path length
- Efficient structural search necessary for future of web search tools

To Learn More
- cygnus.uta.edu/subdue
- cook@cse.uta.edu
- http://www-cse.uta.edu/~cook