Semantic NLP for Knowledge Extraction from Text
Anette Frank
Department of Computational Linguistics, Heidelberg University
DFG-Rundgespräch „Wissenserschließung im Web“ (DFG round-table discussion "Knowledge Discovery on the Web")
19 May 2011, Darmstadt

Knowledge on the Web
  News, Wikipedia, Wiki-X, Social Networks, Blogs, Conflicts, Information Portals, Document Repositories
  Opinion, Perspective, “Belief”
  Science, Culture, Commerce, Politics
  Fact, Wish, Desire, Obligation, Need, Promise

Creating validated content
  Crowdsourcing: Wiki-X
  Crowdsourcing bottlenecks
    Formal tagging
    Consistency
    Timeliness and participation in special domains
  Leverage content hidden in masses of unstructured text: Extracting Knowledge from Text

Structured content

Massive amounts of unstructured, hyperlinked textual information

Extensive reference to external documents and information sources

Knowledge extraction from text
  Information Sources
    Wikipedia: huge gap between structured and unstructured knowledge
    Scientific documents, books, …
    Online information: news, politics, blogs, …
  Chances and Challenges
    Data redundancy and conflict resolution
    Building dense, connected content
    Making implicit information explicit
    Fact vs. non-fact / attribution, …

Machine Reading
  “… formation of a coherent set of beliefs based on a text corpus and a background theory”
  “many of the beliefs of interest will only be implied”
  “express the resultant beliefs and reasoning process in probabilistic terms”
  (Etzioni et al. 2006)

Machine Reading
  Probabilistic Inference <-> representations
  Machine Learning
  Semantic NLP
  Joint models vs. pipeline models
  Modeling dependencies
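The contrast between joint and pipeline models can be illustrated with a toy example. The sketch below is not taken from the talk: the two-stage setup (entity typing followed by relation labeling), the function names, and all probabilities are invented for illustration. It only shows how a pipeline that commits to the best first-stage decision can miss the globally best combination that a joint decoder finds.

```python
# Toy illustration of pipeline vs. joint decoding.
# All labels and probabilities are hypothetical, chosen to make the effect visible.

# Stage 1: P(entity type) for an ambiguous mention.
TYPE_PROB = {"PERSON": 0.55, "ORG": 0.45}

# Stage 2: P(relation | entity type).
REL_PROB = {
    "PERSON": {"born_in": 0.40, "works_for": 0.35, "other": 0.25},
    "ORG": {"located_in": 0.90, "other": 0.10},
}


def pipeline_decode():
    """Commit to the best type first, then pick the best relation for it."""
    best_type = max(TYPE_PROB, key=TYPE_PROB.get)
    rels = REL_PROB[best_type]
    best_rel = max(rels, key=rels.get)
    return best_type, best_rel, TYPE_PROB[best_type] * rels[best_rel]


def joint_decode():
    """Score every (type, relation) pair and pick the best combination."""
    candidates = [
        (t, r, TYPE_PROB[t] * p)
        for t, rels in REL_PROB.items()
        for r, p in rels.items()
    ]
    return max(candidates, key=lambda c: c[2])


if __name__ == "__main__":
    print("pipeline:", pipeline_decode())
    print("joint:   ", joint_decode())
```

With these invented numbers the pipeline settles on (PERSON, born_in), while the joint decoder prefers (ORG, located_in) because the second-stage evidence outweighs the small first-stage margin. Real joint models replace the exhaustive enumeration with structured inference.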

Semantic NLP: from IE to MR
  from supervised to unsupervised
    template-based IE --> relation extraction --> Open IE --> MR
    distributional latent semantic models
    text classification --> document structure --> lexical and compositional semantics
  exploiting
    heuristically labeled / additional data
    linguistic clues in data
    linguistic resources (semantic lexica, ontologies)
    interpreted content
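As a point of reference for the starting end of this progression, template-based IE can be sketched as a hand-written pattern for one fixed relation. The example below is a minimal illustration, not a system described in the talk: the pattern, the relation name born_in, and the helper extract_born_in are hypothetical, and the capitalization heuristic for names is deliberately crude.

```python
import re

# Hypothetical template for a single relation: "<Person> was born in <Place>".
# Names are approximated as runs of capitalized words, which is a crude assumption.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
BORN_IN = re.compile(rf"(?P<person>{NAME}) was born in (?P<place>{NAME})")


def extract_born_in(text):
    """Return (person, 'born_in', place) triples matched by the template."""
    return [
        (m.group("person"), "born_in", m.group("place"))
        for m in BORN_IN.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Albert Einstein was born in Ulm. He later moved to Bern."
    print(extract_born_in(sample))
```

The template only fires on one surface form of one relation and ignores the second sentence entirely, which is the kind of limitation that motivates the move toward relation extraction, Open IE, and machine reading.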

Chances and Challenges
  Mining Content from Discourse
    A large amount of content is revealed only in discourse context
    Build dense graphs, filling in implicit relations through inference, using background knowledge
    Semantic NLP is still sentence-oriented and hits performance ceilings
    Resolving conflicts, telling truth from opinion
  Exploitation, Presentation and Use
    Aggregate graphs from multiple sources, extract thematic subgraphs (summarization), generate from densely linked graphs (NLG)
    Relate queries and knowledge graphs by inference (QA, TE)

Discourse-level semantic NLP
  Joint models for NERC, WSD, SRL, Coreference
    Entity chains -> semantic constraints for NERC, WSD, SRL -> help classify mentions into entities
  Inference on densely connected graphs
    Infer implicit meaning (local and discourse-level) <-> use added information for improving NLP components, further inference, summaries, …
  Machine Learning, Joint Inference, Statistical Inference
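One simple way to picture how entity chains can act as semantic constraints for NERC is to let the mentions in a coreference chain vote on a shared entity type. The sketch below makes that assumption explicit: the majority-vote rule, the function harmonize_chain, and the example chain are illustrative inventions, not the joint model proposed in the talk.

```python
from collections import Counter


def harmonize_chain(mentions):
    """Assign every mention in a coreference chain the chain's majority type.

    mentions: list of (text, predicted_type) pairs, where predicted_type may be
    None for mentions the local classifier could not type (e.g. pronouns).
    This majority vote is a stand-in for a real joint model.
    """
    votes = Counter(t for _, t in mentions if t is not None)
    if not votes:
        # No typed mention in the chain: nothing to propagate.
        return list(mentions)
    chain_type = votes.most_common(1)[0][0]
    return [(text, chain_type) for text, _ in mentions]


if __name__ == "__main__":
    chain = [
        ("Angela Merkel", "PERSON"),
        ("Merkel", "ORG"),        # local classifier error
        ("she", None),            # untyped pronoun
        ("the chancellor", "PERSON"),
    ]
    print(harmonize_chain(chain))
```

In this toy chain the isolated ORG prediction is overruled and the pronoun inherits the chain type. A joint model would instead weigh coreference and typing decisions against each other rather than treating the chain as fixed.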

Relevant research at ICL Heidelberg
  Induction of lexical semantic knowledge
    adjective semantics
    fine-grained classification of named entities (>200 classes)
    word sense disambiguation
    inference relations triggered by verbs
  Event-based discourse semantics
    linking implicit semantic roles in discourse
    event alignment across documents
  Identification of factual vs. generic knowledge

Thank you