Semantic Web Bootstrapping & Annotation
Hassan Sayyadi
sayyadi@ce.sharif.edu
Semantic Web Research Laboratory, Computer Department, Sharif University of Technology
Outline
• What is annotation?
• Why use annotation?
• Crawler
• Annotation model
• Annotation methods
• Our implementation
What is annotation?
• People make notes to themselves to preserve ideas that arise during a variety of activities.
• The purpose of these notes is often to summarize, criticize, or emphasize specific phrases or events.
• Semantic annotation tags the instances of ontology classes found in the data and maps them to those classes.
Why use annotation?
• Having the world's knowledge at one's fingertips seems possible.
• The Internet is the platform for information.
• Unfortunately, most of the information is provided in an unstructured and non-standardized form.
Crawler
• A crawler is a program that traverses the Internet by following links from one page to the next.
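The traversal described on this slide can be sketched as a breadth-first walk over a link graph. This is only an illustration: the class name `CrawlSketch`, the `maxPages` limit, and the in-memory `links` map (standing in for "download the page and extract its links") are assumptions, not part of the original slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CrawlSketch {
    // Breadth-first traversal: visit a page, then queue every link found on it.
    // `links` is hypothetical data that replaces real page fetching.
    static List<String> crawl(String seed, Map<String, List<String>> links, int maxPages) {
        Set<String> seen = new HashSet<>();          // pages already queued
        Deque<String> frontier = new ArrayDeque<>(); // pages waiting to be visited
        List<String> visited = new ArrayList<>();    // visit order

        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String page = frontier.poll();
            visited.add(page);
            for (String next : links.getOrDefault(page, List.of())) {
                // Set.add returns false for pages that were queued before.
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }
}
```

The `seen` set is what keeps the crawler from looping when pages link back to each other.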
Focused crawler
• Not all of the Internet's knowledge is required for every query.
• This assumption seems reasonable because most people work in a restricted domain and do not need the knowledge of the whole Internet.
• Searching the whole Internet in this case is very inefficient and expensive.
• Free text on the Internet contains varied information from diverse domains.
Focused crawler (continued)
• Focus can be achieved by examining keywords
• Problems:
  – “Understanding” the semantics of a document
  – Focusing too narrowly on one topic
• Another way to focus is to use the Internet's connectivity (link) structure
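A keyword-based focus test might look like the sketch below. The scoring rule (the share of a page's words that appear in a topic keyword set) and the names `KeywordFocus`, `relevance`, `shouldFollow`, and `threshold` are assumptions chosen for illustration; the slides do not say how keywords are examined.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

final class KeywordFocus {
    // Fraction of the page's words that belong to the (lower-case) keyword set.
    static double relevance(String pageText, Set<String> keywords) {
        String[] words = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        long total = Arrays.stream(words).filter(w -> !w.isEmpty()).count();
        if (total == 0) {
            return 0.0;
        }
        long hits = Arrays.stream(words).filter(keywords::contains).count();
        return (double) hits / total;
    }

    // Follow the links of a page only if it looks sufficiently on-topic.
    static boolean shouldFollow(String pageText, Set<String> keywords, double threshold) {
        return relevance(pageText, keywords) >= threshold;
    }
}
```

A score like this only counts surface words, which illustrates the first problem listed on the slide: it does not understand what the document means.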
Annotation models
• Mark in web page
• Example:
  – SUT is one of the largest engineering schools in the Islamic Republic of Iran
  – <university>SUT</university> is one of the largest engineering schools in the <country>Islamic Republic of Iran</country>
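In-page marking can be sketched as wrapping known entity mentions in tags. The gazetteer lookup below (a map from surface string to class name) is an assumption made for the example; the slides do not state how mentions are recognized, and `InlineTagger` is a hypothetical name.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InlineTagger {
    // Wraps every whole-word occurrence of a gazetteer entry in <class>...</class>.
    // `gazetteer` maps a surface string to its ontology class (hypothetical data).
    static String annotate(String text, Map<String, String> gazetteer) {
        String result = text;
        for (Map.Entry<String, String> entry : gazetteer.entrySet()) {
            String mention = entry.getKey();
            String tag = entry.getValue();
            Pattern pattern = Pattern.compile("\\b" + Pattern.quote(mention) + "\\b");
            String tagged = "<" + tag + ">" + mention + "</" + tag + ">";
            // quoteReplacement keeps '$' and '\' in the mention literal.
            result = pattern.matcher(result).replaceAll(Matcher.quoteReplacement(tagged));
        }
        return result;
    }
}
```

Note that the original sentence is left intact and only the tags are added around the matched mentions.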
Annotation models (continued)
• Generate RDF
• Example:
  – SUT is one of the largest engineering schools in the Islamic Republic of Iran
  – <rdf:Description rdf:about="http://sharif.edu/#SUT">
      <rdf:type>university</rdf:type>
      <SHARIF:be_in rdf:resource="http://sharif.edu/#Islamic+Republic+of+Iran"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://sharif.edu/#Islamic+Republic+of+Iran">
      <rdf:type>Country</rdf:type>
    </rdf:Description>
Annotation methods
• Manually
• Semi-automatically
• Automatically
Automatic annotation
• The fully automatic creation of semantic annotations is an unsolved problem.
• Automatic semantic annotation of the natural-language sentences in web pages is a daunting task, so we are often forced to do it manually or semi-automatically using handwritten rules.
Manual annotation
• Manual annotation is more easily accomplished today using authoring tools, which provide an integrated environment for simultaneously authoring and annotating text.
• However, the use of human annotators is often fraught with errors due to factors such as annotator familiarity with the domain, amount of training, personal motivation, and complex schemas.
• Manual annotation is also an expensive process.
Semi-automatic annotation
• To overcome the annotation acquisition bottleneck, semi-automatic annotation of documents has been proposed.
Semi-automatic annotation
• Assumptions:
  – The vocabulary set is limited
  – Word usage has patterns
  – Semantic ambiguities are rare
  – Terms and jargon of the domain appear frequently
Semantic Annotation Platform (SAP)
Multistrategy SAPs
• Multistrategy SAPs can combine methods from both pattern-based and machine-learning-based systems.
• No SAP currently implements the multistrategy approach for semantic annotation, although it has been implemented in systems for ontology extraction (such as On-To-Knowledge).
Semi-automatic annotation (continued)
• Example
  – I go to Shanghai
• Link structure is more like an RDF graph
The accuracy of concepts and relations for different algorithms
Automatic annotation
Source preprocessing
• Document Object Model (DOM)
• Text Model
• Layout Model
• NLP Model
Information identification
• Operators
  – perform extractions on document access models
  – Retrieval, Check, Execute
• Strategies
  – build operator sequences according to user time and quality requirements
• Source Description
  – build operator sequences according to user time and quality requirements
Ontology population
• The final stage of the overall process is to decide which hypothesis best represents the extracted information and should be inserted into the ontology.
• The module simulates insertions and calculates their cost according to the number of new instance creations, instance modifications, or inconsistencies found.
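The cost-based choice between hypotheses can be illustrated with a weighted sum over the three counts named on the slide. The weights below and the names `PopulationCost`, `Hypothesis`, and `cheapest` are assumptions for this sketch; the slides give neither a formula nor concrete weights.

```java
import java.util.Comparator;
import java.util.List;

final class PopulationCost {
    // Counts obtained by simulating the insertion of one hypothesis.
    static final class Hypothesis {
        final String name;
        final int creations;       // new instances that would be created
        final int modifications;   // existing instances that would change
        final int inconsistencies; // conflicts with the ontology

        Hypothesis(String name, int creations, int modifications, int inconsistencies) {
            this.name = name;
            this.creations = creations;
            this.modifications = modifications;
            this.inconsistencies = inconsistencies;
        }
    }

    // Hypothetical weights: inconsistencies are penalized most heavily.
    static final int W_CREATE = 1;
    static final int W_MODIFY = 2;
    static final int W_INCONSISTENT = 10;

    static int cost(Hypothesis h) {
        return W_CREATE * h.creations
             + W_MODIFY * h.modifications
             + W_INCONSISTENT * h.inconsistencies;
    }

    // Picks the hypothesis whose simulated insertion is cheapest.
    static Hypothesis cheapest(List<Hypothesis> candidates) {
        return candidates.stream()
                .min(Comparator.comparingInt(PopulationCost::cost))
                .orElseThrow();
    }
}
```

With these illustrative weights, a hypothesis that creates a few extra instances is preferred over one that introduces a single inconsistency.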
Our implementation
• Crawler:
  – Crawl all links that contain:
    • sharif.ir
    • sharif.edu
    • sharif.ac.ir
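One way to apply this restriction is to check each link's host before queueing it. The sketch below is not the original crawler: it matches on the host name rather than on the whole URL string, which is a stricter reading of "contain", and `SharifLinkFilter` and `inScope` are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Locale;

final class SharifLinkFilter {
    // The three domains listed on the slide.
    static final List<String> ALLOWED = List.of("sharif.ir", "sharif.edu", "sharif.ac.ir");

    // True if the URL's host is one of the allowed domains or a subdomain of one.
    static boolean inScope(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false; // e.g. mailto: links or relative paths
            }
            String h = host.toLowerCase(Locale.ROOT);
            for (String domain : ALLOWED) {
                if (h.equals(domain) || h.endsWith("." + domain)) {
                    return true;
                }
            }
            return false;
        } catch (URISyntaxException e) {
            return false; // malformed links are skipped
        }
    }
}
```

Matching on the host avoids following an unrelated page merely because one of the domain names appears in its query string.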
Our implementation
• Source pre-processing
  – HTML to text
    // Protect newlines so that ".*?" can match across line breaks.
    text = text.replaceAll("\n", "*_newline_*");
    text = text.replaceAll("<script.*?</script>", "");
    // Non-greedy up to the closing tag; a greedy ".*>" would delete the rest of the page.
    text = text.replaceAll("<style.*?</style>", "");
    text = text.replaceAll("<!--.*?-->", "");
    text = text.replaceAll("<.*?>", "");
    text = text.replaceAll("&nbsp;", " ");
    text = text.replaceAll("&lt;", "<");
    …
    // Restore the protected newlines.
    text = text.replaceAll("\\*_newline_\\*", "\n");
  – Additional
    // Treat runs of blank lines as sentence breaks.
    text = text.replaceAll("\n(\n| )*\n", ". ");
    // Turn comma-separated lists into "and" lists for the parser.
    text = text.replaceAll(", ", " and ");
Our implementation
• Information extraction:
  – JMontyLingua
    • SUT is one of the largest engineering schools in the Islamic Republic of Iran
    • ("be" "SUT" "one" "of largest engineering school" "in Islamic Republic" "of Iran")
Our implementation
• JMontyLingua problem:
  – SUT has computer, mechanic and electric engineering departments
  – ("have" "SUT" "computer mechanic and electric engineering departments")
  – ("have" "SUT" "computer and mechanic and electric engineering departments")
Our implementation
• ("be" "SUT" "university" "in Islamic Republic" "of Iran")
• => ("be" "SUT" "university" "in Islamic Republic of Iran")
• => SUT, be, university & SUT, be_in, Islamic Republic of Iran
• <rdf:Description rdf:about="http://sharif.edu/#SUT">
    <rdf:type>university</rdf:type>
    <SHARIF:be_in rdf:resource="http://sharif.edu/#Islamic+Republic+of+Iran"/>
  </rdf:Description>
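The last two steps on this slide (deriving a `be_in` triple from the prepositional argument and serializing it) might be sketched as follows. This is not the original implementation: `TripleToRdf` and its methods are hypothetical, and the `http://sharif.edu/#` base and `SHARIF:` prefix are simply copied from the slide's example output.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class TripleToRdf {
    static final String BASE = "http://sharif.edu/#"; // taken from the slide's example

    // ("be", "SUT", "in Islamic Republic of Iran")
    //   -> { "SUT", "be_in", "Islamic Republic of Iran" }
    static String[] prepositionalTriple(String verb, String subject, String prepPhrase) {
        int space = prepPhrase.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("expected '<preposition> <object>': " + prepPhrase);
        }
        String preposition = prepPhrase.substring(0, space);
        String object = prepPhrase.substring(space + 1);
        return new String[] { subject, verb + "_" + preposition, object };
    }

    // URLEncoder turns spaces into '+', as in the slide's resource URIs.
    static String resource(String label) {
        return BASE + URLEncoder.encode(label, StandardCharsets.UTF_8);
    }

    // Serializes one typed subject with one relation, mirroring the slide's layout.
    static String describe(String subject, String type, String predicate, String object) {
        return "<rdf:Description rdf:about=\"" + resource(subject) + "\">\n"
             + "  <rdf:type>" + type + "</rdf:type>\n"
             + "  <SHARIF:" + predicate + " rdf:resource=\"" + resource(object) + "\"/>\n"
             + "</rdf:Description>";
    }
}
```

For example, `prepositionalTriple("be", "SUT", "in Islamic Republic of Iran")` yields the subject, predicate, and object that `describe` then needs, together with the type `university` from the first triple.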
Any questions?