ANC/OANC, MASC, LAF/GrAF, and ULA
Nancy Ide
Vassar College

ANC
  • So far 22 million words across several genres
      • Written (travel guides, blog, fiction, letters, newspaper, non-fiction, technical, journal, government documents)
      • Spoken (face-to-face, academic, telephone)
      • Available through LDC ($75)
  • OANC
      • ~15 million word subset, free of license restrictions, redistributable
      • Download from ANC website
  • Will add more data (~20 million words) this summer
      • All part of OANC
  • WSJ that is open, annotated by TimeBank, PropBank, NomBank, PTB, PDTB (34 documents)

ANC Annotations
  • ANC is distributed with stand-off annotations for POS and lemma (2 different tagsets), noun and verb chunks
      • Automatically produced, no validation
  • Recently added:
      • Co-reference for portion of Slate Magazine
      • CLAWS C5 and C7
  • Download from ANC site (http://AmericanNationalCorpus.org)

ANC Annotations
  • Want to add annotations for a wide variety of linguistic phenomena
  • Use any software we can get our hands on
      • Freely available
      • Contributed or “allowed use”
  • Contributed annotations
  • Multiple annotations of the same type
  • E.g. (hopefully) BBN NER software
  • E.g. syntactic annotation: Charniak, Collins, Minipar, CMU Link parser…
  • Adding annotations as funding allows

ANC Process
[Diagram labels]
  • Texts in different formats
  • ANC processing
  • Primary data
  • Automatically annotate
  • Annotations as graph of feature structures in stand-off XML documents
  • ANC Tool: merge some or all annotations
  • Input to UIMA
  • Input to GraphViz
  • Input to NLTK
  • Others...

Merging
  • ANC does not merge on linguistic grounds (i.e., no unification, à la GLARF)
  • Merging involves simply combining information referring to common elements (spans, nodes, etc.)
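
The span-based merging described on this slide can be sketched roughly as follows. This is a minimal illustration and not the ANC implementation: the function name `merge_by_span`, the layer names, and the sample text and offsets are all hypothetical.

```python
from collections import defaultdict


def merge_by_span(*layers):
    """Combine stand-off annotation layers keyed by (start, end) character span.

    Each layer is a (name, {span: features}) pair. Nothing is unified or
    reconciled: features are simply collected under the span they refer to.
    """
    merged = defaultdict(dict)
    for name, annotations in layers:
        for span, features in annotations.items():
            merged[span][name] = features
    return dict(merged)


# Hypothetical primary data and two independent annotation layers.
text = "Dogs bark"
pos = ("pos", {(0, 4): {"tag": "NNS"}, (5, 9): {"tag": "VBP"}})
chunks = ("chunk", {(0, 4): {"type": "NP"}, (5, 9): {"type": "VP"}})

merged = merge_by_span(pos, chunks)
for (start, end), info in sorted(merged.items()):
    print(text[start:end], info)
```

Because the layers are only grouped by the element they point at, conflicting analyses sit side by side rather than being resolved, which matches the "no unification" point above.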

MASC
  • Manually Annotated Sub-Corpus
  • NSF-funded project to
      • Validate token and sentence boundaries, noun chunks, verb chunks, POS in a 5 million? word sub-corpus of the ANC
          • Budget cut by over 50%, so we are not sure of final size
      • Manually or semi-automatically annotate for WordNet senses and FrameNet frames
          • ICSI, Princeton
      • IA agreement studies
          • Columbia

MASC
  • Training examples
  • Co-reference annotations
  • FrameNet annotations
  • Genre-representative core with WN, entity, NP and VP annotations
  • WSJ with PropBank, NomBank, PTB, and PDTB annotations

LAF/GrAF
  • Linguistic Annotation Framework (LAF)
      • Developed in ISO TC37 SC4
  • GrAF
      • The XML instantiation of the LAF abstract model
      • ANC is represented using GrAF
  • LAF defines a framework involving an abstract model for annotations that serves as a “pivot” into and out of which other formats are mapped
      • Abstract model = a graph of typed feature structures representing stand-off annotations
  • All annotations are stand-off
      • Multiple annotations of the same type (e.g. POS)
      • Annotations of multiple types
          • POS, noun and verb chunks, adding WordNet, FrameNet, dependency parse…
  • We have an IBM Innovation Award to map GrAF to UIMA CAS
      • The graph of feature structures is the model underlying design of CAS
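
To make the abstract model more concrete, here is a small sketch of a graph whose nodes carry feature structures and point into unmodified primary data. It is an assumption-laden illustration only: the class names, the use of plain dicts for feature structures, and the sample sentence are hypothetical, and it does not reproduce the GrAF XML serialization.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A graph node carrying a feature structure (a plain dict in this sketch)."""
    node_id: str
    features: dict
    span: Optional[tuple] = None  # (start, end) offsets into the primary data


@dataclass
class AnnotationGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (parent_id, child_id) pairs

    def add_node(self, node_id, features, span=None):
        self.nodes[node_id] = Node(node_id, features, span)

    def add_edge(self, parent_id, child_id):
        self.edges.append((parent_id, child_id))

    def covered_span(self, node_id):
        """Return the text span a node covers, following edges when the node
        is not anchored directly. Assumes every unanchored node has children."""
        node = self.nodes[node_id]
        if node.span is not None:
            return node.span
        child_spans = [self.covered_span(c) for p, c in self.edges if p == node_id]
        return (min(s for s, _ in child_spans), max(e for _, e in child_spans))


# The primary data is never modified; annotations only reference offsets.
primary = "Dogs bark"
g = AnnotationGraph()
g.add_node("t1", {"pos": "NNS", "lemma": "dog"}, span=(0, 4))
g.add_node("t2", {"pos": "VBP", "lemma": "bark"}, span=(5, 9))
g.add_node("np1", {"cat": "NP"})
g.add_edge("np1", "t1")

start, end = g.covered_span("np1")
print(primary[start:end])
```

Any format that can be expressed as such nodes, edges, and feature structures can be mapped into and out of the same representation, which is the sense in which the model acts as a pivot.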

OANC, MASC, ULA
[Diagram labels]
  • OANC: 15 million
  • MASC
  • Core: 50K
      • ULA: 40K
      • Extra 10K: fiction, court transcript, blog

GrAF and ULA
  • As proof of concept, transduced WSJ annotations for NomBank, TimeBank, PropBank, PTB, PDTB into GrAF
      • Separate stand-off documents
  • Merged annotations
  • Generated GraphViz output
  • Paper at LAW I, Prague, 2007
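
Generating GraphViz output from an annotation graph amounts to writing the nodes and edges in the DOT language. The sketch below shows one way to do that; the function `to_dot`, the node ids, and the labels are hypothetical, and it assumes labels contain no double quotes (no escaping is done).

```python
def to_dot(nodes, edges, name="annotations"):
    """Render labelled nodes and directed edges as a GraphViz DOT string."""
    lines = [f"digraph {name} {{"]
    for node_id, label in nodes.items():
        lines.append(f'    "{node_id}" [label="{label}"];')
    for parent, child in edges:
        lines.append(f'    "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical merged graph: a small constituency tree over two tokens.
nodes = {"s": "S", "np": "NP", "vp": "VP", "t1": "Dogs/NNS", "t2": "bark/VBP"}
edges = [("s", "np"), ("s", "vp"), ("np", "t1"), ("vp", "t2")]

dot = to_dot(nodes, edges)
print(dot)
```

Saved to a file, the result can be rendered with the GraphViz command-line tool, for example `dot -Tpng graph.dot -o graph.png`.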

Merging Annotations
  • Involves simply combining the graphs for each annotation
  • Graph algorithms can be applied to collapse identically-labeled nodes with edges to common subgraphs
LAW 2007 • Prague
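
One possible reading of "collapse identically-labeled nodes with edges to common subgraphs" is sketched below: each node gets a signature built from its label and the signatures of its children, and nodes from different graphs with equal signatures are treated as one. The graph encoding, the function names, and the sample data are hypothetical; the algorithm used for the LAW 2007 work may differ.

```python
def signature(graph, node_id):
    """Build a hashable signature from a node's label and what it dominates.

    graph maps node_id -> (label, children, span); leaves carry a span,
    inner nodes carry a tuple of child ids and span=None.
    """
    label, children, span = graph[node_id]
    if span is not None:
        return (label, span)
    return (label, tuple(signature(graph, child) for child in children))


def merge_graphs(*graphs):
    """Union the graphs, grouping nodes that share a signature."""
    merged = {}
    for index, graph in enumerate(graphs):
        for node_id in graph:
            merged.setdefault(signature(graph, node_id), []).append((index, node_id))
    return merged


# Two hypothetical annotations of the same two-token text.
treebank = {
    "t1": ("tok", (), (0, 4)),
    "t2": ("tok", (), (5, 9)),
    "np": ("NP", ("t1",), None),
    "vp": ("VP", ("t2",), None),
    "s": ("S", ("np", "vp"), None),
}
chunker = {
    "a": ("tok", (), (0, 4)),
    "b": ("tok", (), (5, 9)),
    "c1": ("NP", ("a",), None),
    "c2": ("VP", ("b",), None),
}

merged = merge_graphs(treebank, chunker)
print(len(treebank) + len(chunker), "nodes collapse to", len(merged))
```

Here the NP and VP nodes from the two sources dominate the same tokens and carry the same labels, so each pair collapses to a single node, while the S node exists only in the first graph and is kept as is.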

Transduction
[Diagram labels]
  • Different annotation formats: PTB, PropBank, NomBank, PDTB, TimeBank
  • Transduce to GrAF
  • Merge
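
As a rough illustration of what transduction involves for one of these formats, the sketch below turns a PTB-style bracketed tree into stand-off nodes and edges over a token sequence. It is not the ANC transducer: the function name and output layout are hypothetical, and it assumes a simple well-formed bracketing with no empty outer parentheses or other treebank-specific conventions.

```python
import re


def transduce_ptb(bracketed):
    """Convert a PTB-style bracketed tree into stand-off nodes and edges.

    Returns (tokens, nodes, edges). nodes maps id -> (label, (start, end)),
    where the span is a half-open range of token indices, and edges is a
    list of (parent_id, child_id) pairs.
    """
    symbols = re.findall(r"\(|\)|[^\s()]+", bracketed)
    tokens, nodes, edges = [], {}, []
    pos = 0

    def parse(parent):
        nonlocal pos
        pos += 1                          # consume "("
        label = symbols[pos]
        pos += 1
        node_id = f"n{len(nodes)}"
        nodes[node_id] = None             # reserve the id before visiting children
        if parent is not None:
            edges.append((parent, node_id))
        start = len(tokens)
        while symbols[pos] != ")":
            if symbols[pos] == "(":
                parse(node_id)
            else:
                tokens.append(symbols[pos])   # a word under a POS node
                pos += 1
        pos += 1                          # consume ")"
        nodes[node_id] = (label, (start, len(tokens)))

    parse(None)
    return tokens, nodes, edges


tokens, nodes, edges = transduce_ptb("(S (NP (NNS Dogs)) (VP (VBP bark)))")
for node_id, (label, (start, end)) in nodes.items():
    print(node_id, label, tokens[start:end])
```

Once every source format is expressed as nodes and edges over the same tokens or offsets, the merge step no longer needs to know which format an annotation came from.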

GraphViz Output

ULA/MASC Corpus
  • PTB annotations transduced to GrAF
  • Merged PTB’s POS labels with Penn tagging produced by automatic tagger in GATE
      • Using this to validate
      • Will also do this for NP, VP
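
Once two POS layers refer to the same spans, validation can start from a simple comparison of the tags they assign. The sketch below is a hypothetical illustration: the function name, the example sentence, and the tags are invented and are not actual PTB or GATE output.

```python
def compare_tags(gold, automatic):
    """Compare two POS layers keyed by (start, end) span.

    Returns the agreement rate over spans present in both layers and the
    list of (span, gold_tag, automatic_tag) disagreements.
    """
    shared = sorted(set(gold) & set(automatic))
    disagreements = [
        (span, gold[span], automatic[span])
        for span in shared
        if gold[span] != automatic[span]
    ]
    agreement = 1 - len(disagreements) / len(shared) if shared else 0.0
    return agreement, disagreements


# Invented tags for the text "Time flies fast".
gold = {(0, 4): "NN", (5, 10): "VBZ", (11, 15): "RB"}
automatic = {(0, 4): "NN", (5, 10): "NNS", (11, 15): "RB"}

agreement, disagreements = compare_tags(gold, automatic)
print(f"agreement: {agreement:.2f}")
for span, gold_tag, auto_tag in disagreements:
    print(span, gold_tag, auto_tag)
```

Spans that appear in only one layer, for example because the two tokenizations differ, are ignored here and would need separate handling.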

FrameNet example

Contributions
  • Any annotation of OANC we get (automatic or manual), we will transduce to GrAF and make available
      • Please contribute annotations
      • Web interface for contributions
  • Web interface: “create-a-corpus”
      • Allows user to choose from among available annotations and available OANC texts, merge, produce output in format of choice
      • Choose:
          • how to combine (e.g. if two tokens are one in other scheme)
          • “strong” vs. “weak” merging (?)
      • Could e.g. generate a PropBank annotation referring to spans instead of PTB elements
  • May also allow various annotation formats -> GrAF
      • We already have an API for transducing all the ULA annotation formats plus others (e.g. FrameNet) to GrAF

Contributions
  • Also contribute texts
      • American English
      • Post-1990
      • Unrestricted for re-distribution
  • Web interface for that too

Contributions
  • Software for annotation
  • documentHandler interfaces for additional output formats from ANCTool
  • Derived data
      • Frequency lists, bigrams, trigrams (some already there)
  • Anything else…
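
Derived data of the kind listed here can be produced with very little code. The sketch below counts unigrams, bigrams, and trigrams over a token list; the function name and the sample tokens are hypothetical, and it assumes the text has already been tokenized.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count n-grams (as tuples) in a token sequence."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


tokens = "the dog saw the dog".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(bigrams.most_common(1))
```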

Desideratum
  • OANC texts, other freely re-distributable texts annotated by many projects/researchers for many phenomena
  • We are providing the infrastructure, you provide the annotations!

SIGANN
  • New ACL Special Interest Group for Annotation
  • SIGANN Shared Corpus = ULA corpus or MASC 50K core
  • Solicit annotations of all types from community