Problems in Semantic Search Krishnamurthy Viswanathan and Varish Mulwad {krishna3, varish1} AT umbc DOT edu
Agenda • Introduction • Swoogle • Cool things others do • Swoogle facts/figures • Our ideas • References
Why is Semantic Search significant?
Swoogle • Swoogle is a search engine for Semantic Web (SW) documents • It offers the following services: – Search SW ontologies and documents – Search SW terms, i.e., URIs that have been defined as classes and properties – Provide metadata of SW documents and support browsing the Semantic Web
Swoogle • Swoogle supports two relevant query types: – Ontology: searches a small collection that consists only of Semantic Web ontologies – Document: searches all SW documents; this search space is much larger • Swoogle indexes only the document’s URL, the terms being defined in the document, explicit descriptions about the document, and the namespaces used by the document
Swoogle capabilities • Web search: – Basic metadata: e.g., url, desc, ns – Document metadata: hasEncoding, hasLength, etc. – RDF metadata: hasGrammar, hasCntTriple, etc. • Advanced search using Lucene features • REST-based services: compose an HTTP GET query and retrieve the results in the form of RDF/XML
Examples of REST queries • A query is represented as a URL: – REST_QUERY ::= SERVICE_URI?PARAMS • Example: search SW documents which are classified as ontologies (ontoRatio > 0) – queryType: e.g., search_swd_ontology – searchString: user constructed (see manual) – key • http://logos.cs.umbc.edu:8080/swoogle31/q?queryType=search_swd_ontology&searchString=person&key=demo
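The grammar REST_QUERY ::= SERVICE_URI?PARAMS can be sketched in a few lines of Python. This is only an illustration: the service URI and the three parameter names come from the example on this slide, while the helper name `build_swoogle_query` is made up here. The sketch only composes the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Service URI as given in the slide's example query.
SERVICE_URI = "http://logos.cs.umbc.edu:8080/swoogle31/q"


def build_swoogle_query(query_type: str, search_string: str, key: str = "demo") -> str:
    """Compose REST_QUERY ::= SERVICE_URI?PARAMS (hypothetical helper)."""
    params = {
        "queryType": query_type,        # e.g. search_swd_ontology
        "searchString": search_string,  # user-constructed search string
        "key": key,                     # access key
    }
    # urlencode percent-encodes values, so search strings with spaces are safe.
    return f"{SERVICE_URI}?{urlencode(params)}"


url = build_swoogle_query("search_swd_ontology", "person")
print(url)
```

Issuing an HTTP GET on the resulting URL would, per the previous slide, return the results as RDF/XML.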
Cool things other semantic search engines do …
Sindice • Sindice is a Semantic Web search engine created at the Digital Enterprise Research Institute (DERI) • Interesting things to note about Sindice: – Architecture – Indexing
Sindice • Sindice builds its architecture on cloud computing paradigms • Sindice uses Hadoop / Nutch to distribute crawling across multiple machines • Collected data is stored in HBase, a distributed column store
Sindice • Sindice indexes based on: – Inverse Functional Properties (IFPs) – URIs – Literals (keywords) • IFP: an OWL global cardinality constraint on a property; its value uniquely identifies the subject • Benefit: faster retrieval
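A toy sketch of the three lookup indexes named above, to make the idea concrete. It is not Sindice's implementation: the document URLs, the `ex:` identifiers, the choice of `foaf:mbox` as the only IFP, and the convention that quoted objects are literals are all assumptions for illustration.

```python
from collections import defaultdict

# Assumed set of inverse functional properties (illustrative only).
IFPS = {"foaf:mbox"}


def build_indexes(docs):
    """docs maps a document URL to a list of (subject, predicate, object) triples."""
    ifp_index = defaultdict(set)      # (property, value) -> documents
    uri_index = defaultdict(set)      # URI -> documents mentioning it
    keyword_index = defaultdict(set)  # lowercase token -> documents
    for doc, triples in docs.items():
        for s, p, o in triples:
            uri_index[s].add(doc)
            if p in IFPS:
                ifp_index[(p, o)].add(doc)
            if o.startswith('"'):
                # Quoted objects are treated as literals and tokenized.
                for token in o.strip('"').lower().split():
                    keyword_index[token].add(doc)
            else:
                uri_index[o].add(doc)
    return ifp_index, uri_index, keyword_index


docs = {
    "http://example.org/a.rdf": [
        ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
        ("ex:alice", "foaf:name", '"Alice Smith"'),
    ],
    "http://example.org/b.rdf": [
        ("ex:asmith", "foaf:mbox", "mailto:alice@example.org"),
    ],
}
ifp_index, uri_index, keyword_index = build_indexes(docs)
# Both documents share an IFP value, so they describe the same entity
# even though they use different subject URIs.
same_entity_docs = ifp_index[("foaf:mbox", "mailto:alice@example.org")]
```

Each lookup is a single dictionary access, which is the sense in which precomputed indexes of this kind speed up retrieval.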
Watson – A gateway to the Semantic Web • From the Knowledge Management Institute at the Open University in the UK • Interesting things to note about Watson: – Considers implicit semantic relationships – Quality of semantic documents – “Rich access” to semantic data
Watson • Implicit relationships between Semantic Web documents – Equivalence (duplicate detection) • Quality of semantic documents • “Richer” access to semantic data – Web interface for humans – SPARQL endpoint – Java/SOAP and REST APIs
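One simple way to picture equivalence-based duplicate detection is to fingerprint each document by its set of triples, so that serialization order and repeated statements do not matter. This is a generic sketch, not Watson's actual method; the function name and sample triples are hypothetical, and it does not handle blank-node renaming.

```python
import hashlib


def fingerprint(triples):
    """Hash the sorted, de-duplicated triples of a document (illustrative only)."""
    canonical = "\n".join(sorted(" ".join(t) for t in set(triples)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


doc_a = [
    ("ex:a", "rdf:type", "ex:Person"),
    ("ex:a", "ex:name", '"A"'),
]
# Same statements in a different order, with one repeated.
doc_b = [
    ("ex:a", "ex:name", '"A"'),
    ("ex:a", "rdf:type", "ex:Person"),
    ("ex:a", "ex:name", '"A"'),
]
doc_c = [("ex:a", "rdf:type", "ex:Organization")]

is_duplicate = fingerprint(doc_a) == fingerprint(doc_b)
```

Documents with equal fingerprints can then be grouped and served as a single result.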
Others • Semantic Web Search Engine (SWSE) – Pipelined architecture for crawling and indexing – Improved index and storage structure • Falcons – Class subsumption reasoning – Includes a triple store
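Class subsumption reasoning lets a search for one class also match its direct and indirect subclasses. A minimal sketch of that idea follows, computing all subclasses of a class from subClassOf edges; the class names and the helper `subclasses_of` are hypothetical, and this is not a description of how Falcons implements it.

```python
from collections import defaultdict


def subclasses_of(cls, subclass_edges):
    """Return every direct or indirect subclass of cls.

    subclass_edges is an iterable of (child, parent) pairs,
    i.e. child rdfs:subClassOf parent.
    """
    children = defaultdict(set)
    for child, parent in subclass_edges:
        children[parent].add(child)
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in seen:  # the seen set also guards against cycles
                seen.add(child)
                stack.append(child)
    return seen


edges = [
    ("ex:Student", "ex:Person"),
    ("ex:GradStudent", "ex:Student"),
    ("ex:Course", "ex:Thing"),
]
# A query restricted to ex:Person can be expanded to these classes too.
expanded = subclasses_of("ex:Person", edges)
```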
PowerAqua • Multi-ontology-based QA system powered by PowerMap and Watson • Takes inputs in the form of natural language (NL) queries • Handles factual queries that can be expressed as one or more linguistic triples • Handles common wh-questions
PowerAqua • Key challenges in answering NL questions: – Locating the ontologies relevant to a particular query – Identifying semantically sound relationships – Combining information from multiple queries
Swoogle facts/figures • The search engine components currently run on 4 machines • These machines host the crawler, the Lucene index, the MySQL database, etc., and access files over NFS • Approximately 20,000 pages are accessed by Swoogle every day (and are queued) • About 1,731,371 pure SW documents have been discovered
Swoogle facts/figures • The Swoogle crawler has a large queue of documents to be crawled and indexed • Swoogle accesses metadata and index files over NFS, which makes information retrieval slower
Our Ideas: Research and Engineering • Acquire new hardware • Parallelize Swoogle • Focus on a particular domain • Project Swoogle as a search engine for agents
Our Ideas: Research and Engineering • Improve Swoogle’s indexing scheme • Analyze Swoogle’s ranking scheme • Make use of Swoogle metadata • Improve the usability of the website • Google-like services
References • Li Ding et al., “Swoogle: A Search and Metadata Engine for the Semantic Web”, Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, November 2004. • P. Mika, G. Tummarello, “Web Semantics in the Clouds”, IEEE Intelligent Systems, Volume 23, Issue 5, September 2008. • E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, G. Tummarello, “Sindice.com: A document-oriented lookup index for open linked data”, International Journal of Metadata, Semantics and Ontologies, 3(1), 2008. • Mathieu d’Aquin et al., “Watson: A Gateway for the Semantic Web”, Poster session of the European Semantic Web Conference, ESWC 2007. • Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, “Searching Semantic Web Objects Based on Class Hierarchies”, WWW 2008 Workshop on Linked Data on the Web, 2008.
Questions?