Search Engines for Semantic Web Knowledge
Tim Finin, University of Maryland, Baltimore County
Joint work with Li Ding, Anupam Joshi, Yun Peng, Pranam Kolari, Pavan Reddivari, Sandor Dornbush, Rong Pan, Akshay Java, Joel Sachs, Scott Cost and Vishal Doshi
UMBC an Honors University in Maryland
http://creativecommons.org/licenses/by-nc-sa/2.0/
This work was partially supported by DARPA contract F30602-97-1-0215, NSF grants CCR007080 and IIS9875433 and grants from IBM, Fujitsu and HP.
This talk
• Motivation
• Semantic web 101
• Swoogle Semantic Web search engine
• Use cases and applications
• Conclusions
Once there were only a few large computers
Then there were many,
All connected 24x7,
Internet, cellular telephony, IrDA, 802.11, Bluetooth, Ultra Wide Band, RFID and more to come
Interoperating;
tcp/ip, ftp, smtp, rpc, corba, ssh, http, html, xml, gif, jpg, mp3, pdf, …
Access to the world’s knowledge
[Figure label: del.icio.us]
Google has made us smarter
But what about our agents?
Agents still have a very minimal understanding of text and images.
[Figure labels: tell, register]
This talk
• Motivation
• Semantic web 101
• Swoogle Semantic Web search engine
• Use cases and applications
• Conclusions
XML helps
“XML is Lisp's bastard nephew, with uglier syntax and no semantics. Yet XML is poised to enable the creation of a Web of data that dwarfs anything since the Library at Alexandria.”
-- Philip Wadler, Et tu XML? The fall of the relational empire, VLDB, Rome, September 2001.
Semantic Web adds semantics
“The Semantic Web will globalize KR, just as the WWW globalized hypertext”
-- Tim Berners-Lee
Semantic Web 101
    <?xml version="1.0" encoding="utf-8"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:foaf="http://xmlns.com/foaf/0.1/"
             xmlns:uni="http://ebiquity.umbc.edu/ontologies/uni/">
      <uni:Student>
        <foaf:name>Li Ding</foaf:name>
        <foaf:mbox rdf:resource="mailto:dingli1@umbc.edu"/>
      </uni:Student>
    </rdf:RDF>
• RDF/XML
• rdf:RDF tag
• namespaces ontologies
• Semantic graph, URIs as nodes & links
• triples
[Graph: a node linked by rdf:type to uni:Student and by foaf:name to "Li Ding"]
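The RDF/XML example above can be read as a small set of triples. The following is a minimal sketch, using only Python's standard library, that assumes the simplest RDF/XML shape (typed node elements with plain property elements); the helper names `expand` and `triples` are made up for illustration, and a real application would use a full RDF parser instead.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The document from the slide, with the namespace declarations quoted.
RDF_XML = """<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:uni="http://ebiquity.umbc.edu/ontologies/uni/">
  <uni:Student>
    <foaf:name>Li Ding</foaf:name>
    <foaf:mbox rdf:resource="mailto:dingli1@umbc.edu"/>
  </uni:Student>
</rdf:RDF>"""


def expand(tag):
    """Turn ElementTree's '{namespace}local' form into a full URI."""
    return tag[1:].replace("}", "", 1) if tag.startswith("{") else tag


def triples(rdf_xml):
    """Hypothetical helper: list (subject, predicate, object) triples.

    Only handles node elements directly under rdf:RDF whose children
    are simple property elements; this is not a complete RDF/XML parser.
    """
    root = ET.fromstring(rdf_xml.encode("utf-8"))
    result = []
    for i, node in enumerate(root):
        # Nodes without rdf:about get a made-up blank node label.
        subject = node.get("{%s}about" % RDF_NS) or "_:b%d" % i
        if expand(node.tag) != RDF_NS + "Description":
            # A typed node element implies an rdf:type triple.
            result.append((subject, RDF_NS + "type", expand(node.tag)))
        for prop in node:
            resource = prop.get("{%s}resource" % RDF_NS)
            value = resource if resource is not None else (prop.text or "")
            result.append((subject, expand(prop.tag), value))
    return result


for triple in triples(RDF_XML):
    print(triple)
```

Each printed tuple corresponds to one edge of the semantic graph: the namespace prefixes are expanded so that every predicate and class is a full URI.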
Where’s the semantics?
• URIs as “rigid designators”
• Conventions for URIs denoting things in the “real world”
• Namespaces and URIs provide an unambiguous shared vocabulary
• RDF, RDFS and OWL have semantics defined using model theory and also axioms
• Ontologies allow agents to draw inferences
  – uni:Student is a subclass of foaf:Person
  – Every uni:Student has at least one uni:school, which must be an instance of uni:School
  – A foaf:Person with a uni:school is necessarily a uni:Student
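To make "agents can draw inferences" concrete, here is a minimal sketch of forward chaining over a set of triples. It hard-codes two of the axioms listed above (the subclass relation and the "a foaf:Person with a uni:school is a uni:Student" rule); the function name `infer` and the `ex:` individuals are invented for the example, and a real OWL reasoner works quite differently.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer(triples):
    """Hypothetical helper: apply two rules until nothing new is derived."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            # Rule 1: instances of a class are instances of its superclasses.
            if p == RDF_TYPE:
                for c, p2, sup in facts:
                    if p2 == SUBCLASS_OF and c == o:
                        new.add((s, RDF_TYPE, sup))
            # Rule 2: a foaf:Person with a uni:school is a uni:Student.
            if p == "uni:school" and (s, RDF_TYPE, "foaf:Person") in facts:
                new.add((s, RDF_TYPE, "uni:Student"))
        if new <= facts:
            return facts
        facts |= new


facts = infer({
    ("uni:Student", SUBCLASS_OF, "foaf:Person"),
    ("ex:li", RDF_TYPE, "uni:Student"),
    ("ex:jo", RDF_TYPE, "foaf:Person"),
    ("ex:jo", "uni:school", "ex:umbc"),
})
print(("ex:li", RDF_TYPE, "foaf:Person") in facts)
print(("ex:jo", RDF_TYPE, "uni:Student") in facts)
```

Neither of the two checked triples is stated in the input; both follow only because the ontology's axioms are applied.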
RDFa is a W3C proposal for embedding RDF in XHTML documents
An HTML document with RDF embedded:
    <html xmlns:foaf="http://xmlns.com/foaf/0.1/">
    <head><title>Jo Lambda's Home Page</title></head>
    <body>
    Hello. This is <span property="foaf:name">Jo Lambda</span>'s home page.
    <h2>Work</h2>
    If you want to contact me at work, you can either
    <a rel="foaf:mbox" href="mailto:jo.lambda@example.org">email me</a>,
    or call <span property="foaf:phone">+1 777 888 9999</span>.
    </body>
    </html>
The triples, written in N3 (Turtle-style) notation:
    <> foaf:name "Jo Lambda"^^rdf:XMLLiteral ;
       foaf:mbox <mailto:jo.lambda@example.org> ;
       foaf:phone "+1 777 888 9999"^^rdf:XMLLiteral .
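As a rough illustration of how the embedded attributes map to triples, the sketch below walks the page with Python's standard `html.parser` and collects `property` and `rel` attributes. The class name `RdfaSketch` is invented here, the page itself (`<>`) is assumed to be the subject of every triple, and nesting, datatypes and prefix resolution are ignored, so this is far from a conforming RDFa processor.

```python
from html.parser import HTMLParser

HTML = """<html xmlns:foaf="http://xmlns.com/foaf/0.1/">
<head><title>Jo Lambda's Home Page</title></head>
<body>
Hello. This is <span property="foaf:name">Jo Lambda</span>'s home page.
<h2>Work</h2>
If you want to contact me at work, you can either
<a rel="foaf:mbox" href="mailto:jo.lambda@example.org">email me</a>,
or call <span property="foaf:phone">+1 777 888 9999</span>.
</body>
</html>"""


class RdfaSketch(HTMLParser):
    """Hypothetical, simplified extractor for property= and rel= attributes."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._prop = None   # property currently being read, if any
        self._tag = None    # tag that opened the current property
        self._text = []     # text collected for the current property

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "rel" in a and "href" in a:
            # rel + href: the object is the linked resource.
            self.triples.append(("<>", a["rel"], a["href"]))
        if "property" in a:
            self._prop, self._tag, self._text = a["property"], tag, []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None and tag == self._tag:
            # property: the object is the element's text content.
            self.triples.append(("<>", self._prop, "".join(self._text)))
            self._prop = None


parser = RdfaSketch()
parser.feed(HTML)
parser.close()
for triple in parser.triples:
    print(triple)
```

The three tuples line up with the three triples shown for Jo Lambda's page: a name, a mailbox and a phone number.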
But what about our agents?
A Google for knowledge on the Semantic Web is needed by software agents and programs.
[Figure labels: Swoogle, tell, register]
This talk
• Motivation
• Semantic web 101
• Swoogle Semantic Web search engine
• Use cases and applications
• Conclusions
• http://swoogle.umbc.edu/
• Running since summer 2004
• 1.4M RDF documents, 250M RDF triples, 10K ontologies
• Semantic Web archive: many dynamic RDF documents
Swoogle Architecture
[Figure: information flows from the Web and the Semantic Web through four groups of components.
 Discovery: SwoogleBot, Bounded Web Crawler, Google Crawler, candidate URLs.
 Analysis: SWD classifier, Ranking.
 Index: IR Indexer, SWD Indexer, …
 Search Services: Web Server (html, for humans), Web Service (rdf/xml, for machines).
 Stores: Semantic Web metadata, document cache.
 Legend: information flow; Swoogle's web interface.]
A Hybrid Harvesting Framework
[Figure labels: Manual submission; Inductive learner; Seeds M, Seeds R, Seeds H; Meta crawling (Google API call); Bounded HTML crawling; RDF crawling; Swoogle Sample Dataset; the Web]
Performance – crawlers’ contribution
• High SWD ratio: 42% of URLs are confirmed as Semantic Web documents (SWDs)
• Consistent growth rate: 3000 SWDs per day
• RDF crawler: best harvesting method
• HTML crawler: best accuracy
• Meta crawler: best in detecting websites
[Chart axis: # of documents]
This talk
• Motivation
• Swoogle overview
• Bots navigate the Semantic Web
• Ranking Semantic Web content
• Use cases and applications
• Conclusions
Applications and use cases
• Supporting Semantic Web developers
  – Ontology designers, vocabulary discovery, who’s using my ontologies or data?, use analysis, errors, statistics, etc.
• Searching specialized collections
  – Spire: aggregating observations and data from biologists
  – InferenceWeb: searching over and enhancing proofs
  – SemNews: Text Meaning of news stories
• Supporting SW tools
  – Triple shop: finding data for SPARQL queries
Web-scale semantic web data access
[Figure: an agent talks to a data access service, which indexes RDF data from the Web]
(1) Search vocabulary: the agent sends ask (“person”); the service searches URIrefs in the SW vocabulary and replies inform (“foaf:Person”)
(2) Compose query: the agent sends ask (“?x rdf:type foaf:Person”); the service searches URLs in the SWD index and replies inform (doc URLs)
(3) Populate RDF database: the agent fetches the docs from the Web
(4) Query local RDF database
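The ask/inform exchange on this slide can be sketched as a short client-side workflow. Everything below is a toy stand-in: the in-memory dictionaries play the role of the data access service and of the Web, and the function names (`search_vocabulary`, `search_documents`, `fetch`, `find_people`) and `example.org` URLs are invented for illustration, not Swoogle's actual API.

```python
# Toy stand-ins for the service's two indexes and for the Web itself.
VOCAB_INDEX = {"person": ["foaf:Person"]}
DOC_INDEX = {
    ("rdf:type", "foaf:Person"): [
        "http://example.org/a.rdf",
        "http://example.org/b.rdf",
    ],
}
WEB = {
    "http://example.org/a.rdf": [
        ("ex:li", "rdf:type", "foaf:Person"),
        ("ex:li", "foaf:name", "Li Ding"),
    ],
    "http://example.org/b.rdf": [
        ("ex:jo", "rdf:type", "foaf:Person"),
    ],
}


def search_vocabulary(keyword):
    """ask("person") -> inform("foaf:Person")"""
    return VOCAB_INDEX.get(keyword.lower(), [])


def search_documents(predicate, obj):
    """ask("?x rdf:type foaf:Person") -> inform(doc URLs)"""
    return DOC_INDEX.get((predicate, obj), [])


def fetch(url):
    """Fetch one document's triples (here: a dictionary lookup)."""
    return WEB[url]


def find_people():
    term = search_vocabulary("person")[0]       # search vocabulary
    urls = search_documents("rdf:type", term)   # compose query
    local_db = set()
    for url in urls:                            # populate local RDF database
        local_db.update(fetch(url))
    # Query the local database for instances of the chosen term.
    return sorted(s for s, p, o in local_db if p == "rdf:type" and o == term)


print(find_people())
```

The point of the design is that the service only answers index lookups; the agent itself fetches the documents and runs the final query locally.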
UMBC Triple Shop
• Online SPARQL RDF query processing based on HP’s Joseki, with two added features
• Selectable level of inference
• Automatically finds SWDs for given queries using the Swoogle backend database
  – Provides a dataset creation wizard and server-side dataset storage
  – Tag and share saved datasets
SPARQL: a query language for getting information from RDF graphs (dataset)
UMBC Triple Shop
Querying the Semantic Web is as easy as shopping
(1) Go to http://sparql.cs.umbc.edu/
(2) You provide a SPARQL query and constraints on what sources to use
(3) Swoogle finds and suggests documents with relevant data, producing a dataset
(4) You specify the amount of reasoning to do, possibly resulting in an enhanced dataset
(5) We run the query and give you the results
(6) You can also download the dataset or save it on the server and give it tags
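Step (3) amounts to attaching a dataset to a query that does not name its sources. One way to picture this, as a sketch only, is to add SPARQL `FROM` clauses for the suggested documents. The helper `add_from_clauses`, the sample query and the `example.org` URLs are assumptions for illustration; the slides do not say how Triple Shop represents the dataset internally.

```python
import re

# A query that names no data sources.
QUERY = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE { ?person a foaf:Person ; foaf:name ?name }"""


def add_from_clauses(query, doc_urls):
    """Hypothetical helper: insert one FROM clause per document URL.

    Naive on purpose: it looks for the first WHERE keyword and does not
    parse SPARQL, so it would be fooled by 'WHERE' inside a string or IRI.
    """
    match = re.search(r"\bWHERE\b", query, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("this sketch expects an explicit WHERE keyword")
    from_lines = "".join("FROM <%s>\n" % url for url in doc_urls)
    return query[:match.start()] + from_lines + query[match.start():]


# Made-up stand-ins for the documents a search engine might suggest.
SUGGESTED = ["http://example.org/a.rdf", "http://example.org/b.rdf"]
print(add_from_clauses(QUERY, SUGGESTED))
```

The resulting query can then be run by any SPARQL processor that loads the graphs named in its `FROM` clauses.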
This talk
• Motivation
• Swoogle overview
• Bots navigate the Semantic Web
• Ranking Semantic Web content
• Use cases and applications
• Conclusions
Will it Scale? How?
Here’s a rough estimate of the data in RDF documents on the semantic web based on Swoogle’s crawling
System/date   Terms        Documents    Individuals   Triples       Bytes
Swoogle 2     1.5 x 10^5   3.5 x 10^5   7 x 10^6      5 x 10^7      7 x 10^9
Swoogle 3     2 x 10^5     7 x 10^5     1.5 x 10^7    7.5 x 10^7    1 x 10^10
2006          1 x 10^6     5 x 10^7     -             5 x 10^9      5 x 10^11
2008          5 x 10^6     5 x 10^9     -             5 x 10^11     5 x 10^13
We think Swoogle’s centralized approach can be made to work for the next few years if not longer.
How much reasoning?
• SwoogleN (N<=3) does limited reasoning
  – It’s expensive
  – It’s not clear how much should be done
• More reasoning would benefit many use cases
  – e.g., type hierarchy
• Recognizing specialized metadata
  – e.g., that ontology A maps some terms from B to C
Conclusion
• The web will contain the world’s knowledge in forms accessible to people and computers
  – We need better ways to discover, index, search and reason over SW knowledge
• SW search engines address different tasks than HTML search engines
  – So they require different techniques and APIs
• Swoogle-like systems can help create consensus ontologies and foster best practices
  – Swoogle is for Semantic Web 1.0
  – Semantic Web 2.0 will make different demands
For more information
http://ebiquity.umbc.edu/
Annotated in OWL