ANC/OANC, MASC, LAF/GrAF, and ULA
Nancy Ide, Vassar College
ANC
- ANC: 22 million words so far, across several genres
  - Written: travel guides, blog, fiction, letters, newspaper, non-fiction, technical, journal, government documents
  - Spoken: face-to-face, academic, telephone
  - Available through LDC ($75)
- OANC: ~15 million word subset, free of license restrictions, redistributable
  - Download from the ANC website
  - Will add more data (~20 million words) this summer
  - All part of OANC
  - WSJ that is open, annotated by TimeBank, PropBank, NomBank, PTB, PDTB (34 documents)
ANC Annotations
- ANC is distributed with stand-off annotations for POS and lemma (2 different tagsets), noun and verb chunks
  - Automatically produced, no validation
- Recently added:
  - Co-reference for a portion of Slate Magazine
  - CLAWS C5 and C7
- Download from the ANC site (http://AmericanNationalCorpus.org)
ANC Annotations
- Want to add annotations for a wide variety of linguistic phenomena
- Use any software we can get our hands on
  - Freely available
  - Contributed or "allowed use"
- Contributed annotations
- Multiple annotations of the same type
  - E.g. (hopefully) BBN NER software
  - E.g. syntactic annotation: Charniak, Collins, Minipar, CMU Link parser…
- Adding annotations as funding allows
ANC Process
[Diagram; labels recovered from the slide]
- Texts in different formats -> ANC processing -> primary data
- Automatically annotate: annotations as a graph of feature structures in stand-off XML documents
- ANC Tool: merge some or all annotations
- Output: input to UIMA, input to GraphViz, input to NLTK, others...
Merging
- ANC does not merge on linguistic grounds (i.e., no unification, à la GLARF)
- Merging involves simply combining information referring to common elements (spans, nodes, etc.)
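The kind of merge described above can be sketched in a few lines. This is only an illustration under stated assumptions: annotations are modeled as `(start, end, features)` tuples over character offsets, and the function name `merge_by_span` and the layer names are hypothetical, not part of any ANC tool.

```python
from collections import defaultdict

def merge_by_span(layers):
    """Combine stand-off layers that refer to the same character spans.

    layers: {layer_name: [(start, end, feature_dict), ...]}
    Returns {(start, end): {layer_name: [feature_dict, ...]}}.
    Nothing is unified or reconciled; information is only collected per span.
    """
    merged = defaultdict(dict)
    for name, annotations in layers.items():
        for start, end, features in annotations:
            merged[(start, end)].setdefault(name, []).append(dict(features))
    return dict(merged)

# Primary text (hypothetical): "Dogs bark"
layers = {
    "pos_penn": [(0, 4, {"pos": "NNS"}), (5, 9, {"pos": "VBP"})],
    "pos_c5": [(0, 4, {"pos": "NN2"}), (5, 9, {"pos": "VVB"})],
    "np_chunk": [(0, 4, {"type": "NP"})],
}
merged = merge_by_span(layers)
```

Two POS layers with different tagsets end up side by side on the same span, which is the point: conflicting labels are kept, not resolved.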
MASC
- Manually Annotated Sub-Corpus
- NSF-funded project to:
  - Validate token and sentence boundaries, noun chunks, verb chunks, POS in a 5 million(?) word sub-corpus of the ANC
    - Budget cut by over 50%, so we are not sure of the final size
  - Manually or semi-automatically annotate for WordNet senses and FrameNet frames (ICSI, Princeton)
  - Inter-annotator (IA) agreement studies (Columbia)
MASC
[Diagram; labels recovered from the slide]
- Training examples
- Co-reference annotations
- FrameNet annotations
- Genre-representative core with WN, entity, NP and VP annotations
- WSJ with PropBank, NomBank, PTB, and PDTB annotations
LAF/GrAF
- Linguistic Annotation Framework (LAF): developed in ISO TC 37/SC 4
- GrAF: the XML instantiation of the LAF abstract model
  - ANC is represented using GrAF
- LAF defines a framework involving an abstract model for annotations that serves as a "pivot" into and out of which other formats are mapped
  - Abstract model = a graph of typed feature structures representing stand-off annotations
- All annotations are stand-off
  - Multiple annotations of the same type (e.g. POS)
  - Annotations of multiple types: POS, noun and verb chunks; adding WordNet, FrameNet, dependency parse…
- We have an IBM Innovation Award to map GrAF to UIMA CAS
  - The graph of feature structures is the model underlying the design of CAS
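To make the "pivot" idea concrete, here is a minimal sketch in which one format is mapped into a small graph of feature structures and another format is mapped out of it. The classes `Node` and `Graph`, the functions `from_slash_tags` and `to_columns`, and both toy formats are assumptions made for illustration; none of this is the GrAF schema or an ANC API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    start: int                 # character offsets into the primary text
    end: int
    annotations: list = field(default_factory=list)  # [(label, feature dict)]

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (from_id, to_id)

def from_slash_tags(tagged):
    """Map 'word/TAG word/TAG' INTO the pivot; returns (primary_text, graph)."""
    graph, words, offset = Graph(), [], 0
    for i, item in enumerate(tagged.split()):
        word, tag = item.rsplit("/", 1)
        node = Node(f"n{i}", offset, offset + len(word))
        node.annotations.append(("pos", {"tag": tag}))
        graph.nodes[node.node_id] = node
        words.append(word)
        offset += len(word) + 1   # assumes single spaces in the primary text
    return " ".join(words), graph

def to_columns(text, graph):
    """Map OUT of the pivot as 'start<TAB>end<TAB>word<TAB>tag' lines."""
    lines = []
    for node in sorted(graph.nodes.values(), key=lambda n: n.start):
        for label, feats in node.annotations:
            if label == "pos":
                word = text[node.start:node.end]
                lines.append(f"{node.start}\t{node.end}\t{word}\t{feats['tag']}")
    return "\n".join(lines)

text, graph = from_slash_tags("Dogs/NNS bark/VBP")
# A second annotation type on the same primary data: a chunk node with an edge
# to the token node it covers.
graph.nodes["np0"] = Node("np0", 0, 4, [("chunk", {"type": "NP"})])
graph.edges.append(("np0", "n0"))
columns = to_columns(text, graph)
```

With a pivot, each format needs only one importer and one exporter, rather than a converter for every pair of formats.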
OANC, MASC, ULA
[Diagram; labels recovered from the slide]
- OANC: 15 M
- MASC Core: 50 K
- ULA: 40 K
- Extra: 10 K (fiction, court transcript, blog)
GrAF and ULA
- As proof of concept, transduced WSJ annotations for NomBank, TimeBank, PropBank, PTB, PDTB into GrAF
  - Separate stand-off documents
  - Merged annotations
  - Generated GraphViz output
- Paper at LAW I, Prague, 2007
Merging Annotations
- Involves simply combining the graphs for each annotation
- Graph algorithms can be applied to collapse identically-labeled nodes with edges to common subgraphs
LAW 2007 • Prague
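One way to read "collapse identically-labeled nodes with edges to common subgraphs" is as a bottom-up pass that keeps a single node per (label, children) combination. The sketch below takes that reading. The graph encoding, the function name `collapse`, and the node ids are hypothetical, and the input is assumed to be acyclic.

```python
def collapse(graph):
    """Collapse nodes that share a label and point to the same subgraphs.

    graph: {node_id: (label, [child_id, ...])}, assumed acyclic.
    Returns (collapsed_graph, canon), where canon maps every original
    node id to the id of the node that was kept.
    """
    canon = {}   # original node id -> kept node id
    seen = {}    # (label, kept child ids) -> kept node id

    def visit(node_id):
        if node_id in canon:
            return canon[node_id]
        label, children = graph[node_id]
        key = (label, tuple(visit(child) for child in children))
        canon[node_id] = seen.setdefault(key, node_id)
        return canon[node_id]

    for node_id in graph:
        visit(node_id)

    collapsed = {}
    for node_id, kept in canon.items():
        if node_id == kept:
            label, children = graph[node_id]
            collapsed[kept] = (label, [canon[child] for child in children])
    return collapsed, canon

# Shared token nodes, labeled by span, for the primary text "Dogs bark".
tokens = {"t0": ("tok 0-4", []), "t1": ("tok 5-9", [])}
# Layer A: a small tree. Layer B: chunks only. Both point at the same tokens.
layer_a = {"a_s": ("S", ["a_np", "a_vp"]),
           "a_np": ("NP", ["t0"]),
           "a_vp": ("VP", ["t1"])}
layer_b = {"b_np": ("NP", ["t0"]), "b_vp": ("VP", ["t1"])}

combined = {**tokens, **layer_a, **layer_b}   # merging = combining the graphs
collapsed, canon = collapse(combined)
```

Here the chunker's NP and VP nodes fold into the tree's NP and VP nodes, because each pair carries the same label over the same token subgraph.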
Transduction
[Diagram; labels recovered from the slide]
- Different annotation formats: PTB, PropBank, NomBank, PDTB, TimeBank
- Transduce to GrAF
- Merge
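As a rough illustration of what transducing one of these formats involves, the sketch below turns a PTB-style bracketing into stand-off nodes with character spans over a reconstructed primary text. It is a toy under explicit assumptions: the function `ptb_to_standoff` and its output layout are hypothetical, words are rejoined with single spaces, and empty elements, traces, and the unlabeled outer bracket of real PTB files are not handled.

```python
import re

def ptb_to_standoff(bracketed):
    """Return (primary_text, nodes) for a bracketing like '(S (NP (NNS Dogs)) ...)'."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    words, nodes = [], []
    pos = 0   # index of the next unread token

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        node = {"id": f"n{len(nodes)}", "label": label, "children": []}
        nodes.append(node)
        if tokens[pos] != "(":
            # Preterminal: the next token is the word itself.
            word = tokens[pos]
            pos += 1
            start = sum(len(w) + 1 for w in words)
            words.append(word)
            node["start"], node["end"] = start, start + len(word)
        else:
            kids = []
            while tokens[pos] == "(":
                kids.append(parse())
            node["children"] = [kid["id"] for kid in kids]
            # A phrase spans from its first child's start to its last child's end.
            node["start"], node["end"] = kids[0]["start"], kids[-1]["end"]
        assert tokens[pos] == ")"
        pos += 1
        return node

    parse()
    return " ".join(words), nodes

text, nodes = ptb_to_standoff("(S (NP (NNS Dogs)) (VP (VBP bark)))")
```

After this step the tree no longer embeds the text: every node refers to the primary data by offsets, so other annotations over the same spans can be combined with it.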
GraphViz Output
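For readers who want to produce a similar rendering, an annotation graph can be written out as GraphViz DOT text with plain string formatting. This is a generic sketch, not the code used for the slides; the function name `to_dot` and the node ids are hypothetical, and the ids are assumed to be valid unquoted DOT identifiers.

```python
def to_dot(nodes, edges, name="annotations"):
    """Render {node_id: label} and [(src, dst)] as a DOT digraph string."""
    lines = [f"digraph {name} {{"]
    for node_id, label in nodes.items():
        escaped = label.replace('"', '\\"')   # keep quotes inside labels legal
        lines.append(f'  {node_id} [label="{escaped}"];')
    for src, dst in edges:
        lines.append(f"  {src} -> {dst};")
    lines.append("}")
    return "\n".join(lines)

dot = to_dot(
    {"n0": "S", "n1": "NP", "n2": "VP"},
    [("n0", "n1"), ("n0", "n2")],
)
```

The resulting string can be saved to a file and passed to the `dot` command-line tool to draw the graph.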
ULA/MASC Corpus
- PTB annotations transduced to GrAF
- Merged PTB's POS labels with Penn tagging produced by the automatic tagger in GATE
  - Using this to validate
  - Will also do this for NP, VP
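Once two POS layers sit on the same spans, validation can start from a simple list of disagreements. The sketch below assumes each layer is a mapping from character span to tag; the function name `disagreements` and the sample tags are illustrative only.

```python
def disagreements(reference, automatic):
    """List spans where two tag layers differ or where one layer has no tag.

    reference, automatic: {(start, end): tag}
    Returns [((start, end), reference_tag, automatic_tag), ...] in span order.
    """
    differing = []
    for span in sorted(set(reference) | set(automatic)):
        ref_tag, auto_tag = reference.get(span), automatic.get(span)
        if ref_tag != auto_tag:
            differing.append((span, ref_tag, auto_tag))
    return differing

ptb_pos = {(0, 4): "NNS", (5, 9): "VBP"}     # hand-corrected layer
tagger_pos = {(0, 4): "NNS", (5, 9): "NN"}   # automatic layer (made-up output)
to_review = disagreements(ptb_pos, tagger_pos)
```

Spans present in only one layer show up with `None` on the other side, which flags tokenization mismatches as well as tag mismatches.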
FrameNet example
Contributions
- Any annotation of OANC we get (automatic or manual), we will transduce to GrAF and make available
- Please contribute annotations
  - Web interface for contributions
- Web interface: "create-a-corpus"
  - Allows the user to choose from among available annotations and available OANC texts, merge, and produce output in the format of choice
  - Choose how to combine (e.g. if two tokens are one in the other scheme): "strong" vs. "weak" merging (?)
  - Could e.g. generate a PropBank annotation referring to spans instead of PTB elements
- May also allow various annotation formats -> GrAF
  - We already have an API for transducing all the ULA annotation formats plus others (e.g. FrameNet) to GrAF
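The tokenization mismatch mentioned above (two tokens in one scheme, one token in another) can at least be detected mechanically from spans. The sketch below does only that detection and does not define "strong" or "weak" merging, which the slide leaves open. The function name `containment_groups` and the example tokenizations are assumptions.

```python
def containment_groups(coarse, fine):
    """Map each coarse span to the fine spans that lie entirely inside it."""
    return {
        c: [f for f in fine if c[0] <= f[0] and f[1] <= c[1]]
        for c in coarse
    }

# Primary text (hypothetical): "can't stop"
scheme_a = [(0, 5), (6, 10)]            # "can't", "stop"
scheme_b = [(0, 2), (2, 5), (6, 10)]    # "ca", "n't", "stop"

groups = containment_groups(scheme_a, scheme_b)
# Coarse tokens that the other scheme splits into several tokens.
split_tokens = {span: parts for span, parts in groups.items() if len(parts) > 1}
```

How to combine the annotations on such tokens is then a policy decision that the user would make in the interface.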
Contributions
- Also contribute texts
  - American English
  - Post-1990
  - Unrestricted for re-distribution
- Web interface for that too
Contributions
- Software for annotation
- documentHandler interfaces for additional output formats from ANCTool
- Derived data
  - Frequency lists, bigrams, trigrams (some already there)
- Anything else…
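Derived data of the kind listed above is cheap to produce from tokenized text. A minimal sketch, assuming the tokens are already available as a list of strings; the function name `ngram_counts` is hypothetical and this is not how the existing ANC lists were built.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams (as tuples) over a token list."""
    # zip over n shifted copies of the list yields each window of length n.
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "the dog saw the dog".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
```

`Counter.most_common()` then gives a frequency-ordered list for any of the three.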
Desideratum
- OANC texts and other freely re-distributable texts, annotated by many projects/researchers for many phenomena
- We are providing the infrastructure, you provide the annotations!
SIGANN
- New ACL Special Interest Group for Annotation
- SIGANN Shared Corpus = ULA corpus or MASC 50 K core
- Solicit annotations of all types from the community