Annotating Documents for the Semantic Web Using Data-Extraction Ontologies Dissertation Proposal Yihong Ding
Motivation • The representation of web content limits usability • A machine-understandable web – Shared, explicit, formal conceptualizations (ontologies) – The semantic web 2
A Problem • How can the current web be transformed into the semantic web? 3
A Solution: Semantic Annotation • Add explicit, formal, and unambiguous metadata to web documents • Explicit: publicly accessible • Formal: publicly agreeable • Unambiguous: publicly identifiable 4
Annotation Representation • [Figure: implicit annotation vs. explicit annotation] 5
Semantic Annotation: Current Research Status • Manual annotation through friendly interfaces [Annotea, etc.] • Automatic annotation with ontology generation [SCORE] • Automatic annotation using automated IE tools based on pre-defined ontologies [SemTag, MnM, etc.] 6
Current Automatic Annotator: A Typical Paradigm • (1) Extraction • (2) Alignment • (3) Annotation • [Diagram labels: non-ontology-based IE wrapper; wrapper rules and extracting categories; domain ontology; document] 7
Current Automatic Annotator: Problems • (1) Problem of data recognition • (2) Problem of concept disambiguation • (3) Problem of annotation formatting, storing, indexing, and sharing • (4) Problem of assembling ontologies • [Diagram labels: non-ontology-based IE wrapper; wrapper rules and extracting categories; domain ontology; document] 8
“Main Drawback of Using Automated IE” [Kiryakov 04] • “none of these approaches expects an input or produces output with respect to ontologies” • “a set of heuristics for post-processing and mapping of the IE results to an ontology … not sufficient for large-scale, domain-independent semantic annotation.” • “IE and wrapper induction techniques need to use the ontology more directly during the process of extraction.” 9
Ontology-driven Paradigm (Data-Extraction Ontology) for Semantic Annotation • [Diagram labels: ontology-based IE wrapper; non-ontology-based IE wrapper; document] 10
Ontology-driven Paradigm for Semantic Annotation: Some Arguments • Resiliency w.r.t. web page layouts (helps scale to large sets of web pages) • Adaptiveness w.r.t. domain specifications (helps scale to large domains) • Creation of ontologies: still a problem but no longer a drawback • Speed of execution: still a drawback (but we propose a solution next) 11
Two-Layer Annotation Model • [Diagram labels: massive annotation process (similar documents; structural annotator); sample annotation process (document; conceptual annotator using an ontology-based IE tool)] 12
Structural Annotator • Major components – HTML hierarchical path that leads to concept locations – Local context around locations – Dependencies among multiple semantic categories • Significance – Identify both categories and their semantic meanings 13
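The first two components above (the HTML hierarchical path and the local context around a location) can be illustrated with a small sketch. This is not the proposal's structural annotator; the class `PathRecorder`, the function `locate`, and the sample page are hypothetical names introduced only for illustration, and "local context" is simplified here to the text node that immediately precedes the value.

```python
from html.parser import HTMLParser


class PathRecorder(HTMLParser):
    """Record the tag path that leads to each non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # tags that are currently open
        self.records = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append(("/".join(self.stack), text))


def locate(html, value):
    """Return (path, preceding text) for every text node containing value."""
    recorder = PathRecorder()
    recorder.feed(html)
    recorder.close()
    hits = []
    for i, (path, text) in enumerate(recorder.records):
        if value in text:
            context = recorder.records[i - 1][1] if i > 0 else ""
            hits.append((path, context))
    return hits


# Hypothetical sample page: a value already recognized by a conceptual annotator.
sample = ("<html><body><table><tr><td>Price</td><td>$4,500</td></tr>"
          "</table></body></html>")
print(locate(sample, "$4,500"))
```

A path and context learned from one sample page could then be looked up directly on similarly structured pages, which is the intuition behind applying the structural annotator to many similar documents.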
Ontology Factors in Semantic Annotation Tasks • Knowledge specification – Semantic web community – Web Ontology Language (OWL) • Knowledge instantiation – IE and database community – Object-oriented System Model in XML (OSMX) 14
Ontology Conversion • Similarities (OWL vs. OSMX) – Class vs. object set – ObjectProperty vs. relationship set – Cardinality restriction vs. participation constraint – subClassOf vs. is-a relationship • Unique features – OWL: subPropertyOf; symmetric and transitive properties; namespace declaration; ontology importing – OSMX: arbitrary n-ary relationship sets; data frames; general constraints 15
Ontology Construction: An Unavoidable Problem • Semantic annotation tasks require ontologies. • The ontology for a specific semantic annotation task is not guaranteed to be available. 16
Ontology Construction: General and Special • Generally speaking – Until now, the mainstream approach has been manual construction – Automatic and semi-automatic ontology generation: many research papers, few or no practical tools, a very hard problem • Specific to semantic annotation – Very dynamic and variant domains – Much overlapping information – Limited scope for any one web page – Flat structure 17
Ontology Construction: Knowledge Reuse • “What has been will be again, what has been done will be done again; there is nothing new under the sun.” (The Holy Bible, Ecclesiastes 1:9, NIV translation) • A “new” ontology is a new assembly, built from unions and projections of several preexisting ontologies. 18
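The idea of a new ontology as unions and projections of existing ones can be sketched with plain sets. The `Ontology` class and the car-ad and contact examples are hypothetical and heavily simplified (concepts are matched by name only, and relationships are binary); they are an assumption for illustration, not the OSMX or OWL model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ontology:
    concepts: frozenset       # concept names
    relationships: frozenset  # (source, target) pairs between concepts


def project(onto, keep):
    """Keep only the named concepts and the relationships among them."""
    kept = onto.concepts & frozenset(keep)
    rels = frozenset((a, b) for a, b in onto.relationships
                     if a in kept and b in kept)
    return Ontology(kept, rels)


def union(left, right):
    """Merge two ontologies, identifying concepts that share a name."""
    return Ontology(left.concepts | right.concepts,
                    left.relationships | right.relationships)


# Two hypothetical preexisting ontologies.
car = Ontology(frozenset({"Car", "Price", "Mileage", "Dealer"}),
               frozenset({("Car", "Price"), ("Car", "Mileage"),
                          ("Dealer", "Car")}))
contact = Ontology(frozenset({"Person", "Phone", "Email"}),
                   frozenset({("Person", "Phone"), ("Person", "Email")}))

# A "new" ontology for a page that lists cars with prices and a seller's phone.
assembled = union(project(car, {"Car", "Price"}),
                  project(contact, {"Person", "Phone"}))
print(sorted(assembled.concepts))
print(sorted(assembled.relationships))
```

A real assembler would also have to align concepts that carry different names in different source ontologies; matching by name is the simplest possible stand-in.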
Architecture on Dynamically Assembling • (1) Knowledge-component selection • (2) Ontology assembly • [Diagram labels: domain of interest; web page; collection of knowledge; selected knowledge components; assembled ontology] 19
Thesis Statement • Propose a new solution to perform semantic annotation on normal HTML web pages; specifically: 1. apply ontology-based automatic IE techniques 2. augment OWL with a knowledge-recognition extension 3. combine a conceptual annotator and a layout-based annotator 4. dynamically assemble a new domain ontology for an annotation task 20
Standard Evaluation • Annotation performance – Precision – Recall – Speed of execution • Test bed – 5 to 10 different domains, with over 10 lexical concepts in each domain ontology – 20 to 50 web pages in each domain 21
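Precision and recall for an annotation run can be computed as below. The sketch assumes each annotation is a (concept, value) pair and that a hand-labeled gold set exists for the page; the function name and the sample data are illustrative only and are not results from this work.

```python
def precision_recall(predicted, gold):
    """Precision = correct / predicted; recall = correct / gold."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations for one car-ad page.
gold = {("Price", "$4,500"), ("Year", "1999"),
        ("Make", "Honda"), ("Mileage", "82,000")}
# Hypothetical annotator output: two correct pairs, one wrong, one missed.
predicted = {("Price", "$4,500"), ("Year", "1999"), ("Make", "Accord")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```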
Ontology Converter Test • A complete and sound check is costly and difficult to implement. • Our simple test – Start with an OSMX ontology A – Convert it to OWL and then transform it back into an OSMX ontology B – Use both A and B to annotate the same set of web pages (say, 30 to 50 web pages) – Annotation results should be identical 22
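The round-trip test above can be sketched as a small harness. The converters and the annotator are passed in as functions because their real interfaces are not given here; the toy stand-ins below (a dict from concept name to regular expression as the "OSMX" side, and a dict with a `classes` list as the "OWL" side) are assumptions made only so the harness can run.

```python
import re


def round_trip_mismatches(ontology_a, pages, to_owl, to_osmx, annotate):
    """Return the pages on which A and its round-tripped copy B disagree."""
    ontology_b = to_osmx(to_owl(ontology_a))
    return [page for page in pages
            if annotate(ontology_a, page) != annotate(ontology_b, page)]


# Toy stand-ins (hypothetical, not the real OSMX/OWL formats).
def toy_to_owl(osmx):
    return {"classes": sorted(osmx.items())}


def toy_to_osmx(owl):
    return dict(owl["classes"])


def lossy_to_osmx(owl):
    # Simulates a converter bug that drops every recognizer.
    return {}


def toy_annotate(ontology, page):
    return sorted((concept, match)
                  for concept, pattern in ontology.items()
                  for match in re.findall(pattern, page))


ontology_a = {"Price": r"\$\d[\d,]*"}
pages = ["Honda Accord, $4,500", "No price listed"]

ok = round_trip_mismatches(ontology_a, pages,
                           toy_to_owl, toy_to_osmx, toy_annotate)
bad = round_trip_mismatches(ontology_a, pages,
                            toy_to_owl, lossy_to_osmx, toy_annotate)
print(ok)   # []
print(bad)  # ['Honda Accord, $4,500']
```

An empty result is necessary but not sufficient evidence of a correct converter: a page that exercises none of the lost information, like the second page here, cannot reveal the loss.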
Two-Layer Annotation Model Evaluation • Standard evaluation • In addition – About five large web sites with machine-generated web pages, each of which contains at least dozens of web pages 23
Dynamic Ontology Assembler Evaluation • Regular precision and recall study according to selected knowledge components • A pilot study on when the ontology assembler works better than manual ontology construction – Record the time to use a tool to create an ontology from scratch – Record the time to assemble the same ontology – Compare the differences and the special conditions for each case – Make empirical suggestions about how to build a knowledge base that favors ontology assembly 24
Delimitations • Automatic ontology creation from scratch • Annotation storing, indexing, and sharing mechanisms • Semantic annotation for multimedia content • Parallel or distributed computing to further scale the semantic annotation system to a large number of web pages 25
Contributions • Converting current web pages into machine-understandable semantic web pages • Producing a purely ontology-driven semantic annotator using an ontology-based IE wrapper • Proposing a novel two-layer annotation model for fast, accurate, and resilient annotation • Studying a dynamic ontology assembler that helps maximize the reuse of existing knowledge and minimize the load of manual ontology creation • Implementing an ontology converter so that this work is useful to the rest of the semantic web community 26