CODA – CATCHPlus Open Document Annotation
Hennie Brugman
OAC II Project Review meeting, Chicago – July 26-27, 2012
Annotation context
• Audiovisual – ASR, language, gesture, oral history
• Text – semantic annotation
• Music – lyrics, music notation
• Linguistic annotation – named entities
• Image annotation
Programs: CATCH, CATCHPlus, CLARIN
CODA main use cases
• Queen’s Cabinet (Henny van Schie/National Archive, Lambert Schomaker/Univ Groningen)
  – Line strip and word zone annotations
  – ML: search in manuscript images
  – Add Named Entity annotations
• Sailing Letters (Nicoline van de Sijs/Meertens + consortium, Lambert Schomaker)
  – Support manual annotation
  – Line strip detection service
Line annotation tools (catchplus)
<txt>godefroit</txt>
<id>navis-SAL7316_0195-line-026-y1=2094-y2=2317-zone-HUMAN-x=1145-y=105-w=315-h=116-unshear=0.0-version=ortho</id>
<user>mceunen</user>
<time>Wed Jan 26 16:37:01 2011</time>
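The identifier in the record above appears to pack the line number, the zone type and several numeric fields into one hyphen-separated string. The following is a minimal sketch of how such an identifier could be unpacked, assuming that layout, which is inferred from this single example with the extraction whitespace removed. The function name `parse_line_id` and the reading of `y1`/`y2` as the vertical extent of the line strip and `x`/`y`/`w`/`h` as the word zone rectangle are assumptions, not part of the CATCHPlus tools.

```python
import re


def _convert(value: str):
    """Return an int or float when the value is numeric, else the raw string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_line_id(identifier: str) -> dict:
    """Unpack a line-strip identifier (field layout assumed from one example)."""
    fields = {}
    # key=value fields such as -y1=2094, -w=315 or -version=ortho
    for key, value in re.findall(r"-([A-Za-z]+\d*)=([^-\s]+)", identifier):
        fields[key] = _convert(value)
    # fields written as -name-value rather than name=value
    line = re.search(r"-line-(\d+)", identifier)
    zone = re.search(r"-zone-([A-Za-z]+)", identifier)
    if line:
        fields["line"] = int(line.group(1))
    if zone:
        fields["zone"] = zone.group(1)
    return fields


record_id = (
    "navis-SAL7316_0195-line-026-y1=2094-y2=2317-zone-HUMAN"
    "-x=1145-y=105-w=315-h=116-unshear=0.0-version=ortho"
)
fields = parse_line_id(record_id)
```

With the fields separated like this, the zone rectangle and line extent can be turned into the spatial constraints of an annotation target without re-parsing the string each time.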
OAC representation
[Diagram: ImageAnnotation and TextAnnotations nodes connected by hasBody, hasTarget and constrains links. Node labels: ia:1, ia:2, ib:0, ct:1, ct:2, cb:1, cb:2, imageScan.jpg, page:0, Canvas1, linestrip.jpg, line:1, zone:2, Named Entity, and the cnt:chars text “Dit is een beschrijving van Den Haag. En dit is een tweede zin.” (“This is a description of The Hague. And this is a second sentence.”)]
OAC representation – Named Entities
! Annotation of annotations?
! Annotation of segments of inline text?
[Diagram: ImageAnnotation, TextAnnotations and EntityAnnotation nodes connected by hasBody, hasTarget and constrains links. Node labels: ia:1, ta:0, ta:1, ta:2, ea:1, ib:0, ib:1, ct:1, ct:2, ct:3, ct:4, cb:1, cb:2, Canvas1, imageScan.jpg, InlineTextConstraint, “location”, and the cnt:chars text “Dit is een beschrijving van Den Haag. En dit is een tweede zin.”]

<rdf:Description rdf:about="urn:uuid:533624bb-d565-40ba-a14a-2e95c19c20df">
  <rdf:type rdf:resource="http://www.openannotation.org/ns/ConstrainedTarget"/>
  <constrains xmlns="http://www.openannotation.org/ns/" rdf:resource="http://oas.dev.seecr.nl:8000/resolve/urn%3Auuid%3Ad8741024-18bf-40a8-a648-2cd5ebb9acfd"/>
  <constrainedBy xmlns="http://www.openannotation.org/ns/" rdf:resource="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208"/>
</rdf:Description>
<rdf:Description rdf:about="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208">
  <rdf:type rdf:resource="http://www.openannotation.org/ns/Constraint"/>
  <rdf:type rdf:resource="http://www.catchplus.nl/annotation/InlineTextConstraint"/>
  <rdf:type rdf:resource="http://www.w3.org/2008/content#ContentAsText"/>
  <chars xmlns="http://www.w3.org/2008/content#">"<textsegment offset="279" range="2"/>"</chars>
  <characterEncoding xmlns="http://www.w3.org/2008/content#">UTF-8</characterEncoding>
</rdf:Description>
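The InlineTextConstraint above selects part of an inline text through a `textsegment` element with `offset` and `range` attributes. The sketch below shows how a client might resolve such a constraint against the text it constrains. It assumes that `offset` is a zero-based character index and that `range` is a length in characters; the slides do not spell this out. The function name `resolve_textsegment` is hypothetical, and the offsets are computed for the sample sentence rather than taken from the slide.

```python
import xml.etree.ElementTree as ET


def resolve_textsegment(inline_text: str, constraint_xml: str) -> str:
    """Return the characters selected by a <textsegment offset=".." range=".."/>.

    Assumption: offset is zero-based and range counts characters.
    """
    element = ET.fromstring(constraint_xml)
    offset = int(element.get("offset"))
    length = int(element.get("range"))
    if offset < 0 or length < 0 or offset + length > len(inline_text):
        raise ValueError("text segment lies outside the inline text")
    return inline_text[offset:offset + length]


text = "Dit is een beschrijving van Den Haag. En dit is een tweede zin."
entity = "Den Haag"
start = text.find(entity)  # position of the entity in the sample sentence
constraint = '<textsegment offset="%d" range="%d"/>' % (start, len(entity))
segment = resolve_textsegment(text, constraint)
```

The bounds check matters in practice: an offset computed against a different version of the inline text would otherwise silently select the wrong characters.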
KdK-2-OAC conversion
• Implicit line and page text
• Word and line order
• Text offsets and ranges
• Spatial information
• Identifiers and ‘annotatability’
• Redundant text for searchability
! Need for explicit representation of Sequence?
! Search on text of ConstrainedTarget/Body?
KdK2OAC conclusions
• Bidirectional mapping is possible
• Compatible with SharedCanvas model
• OAC + Canvas links everything together
• Implicit information made explicit
• Supports alternative text segmentations
• OAC representation is extremely verbose
! For many annotation tasks OA may be overkill
Open Annotation Service (OAS)
• Upload annotation RDF using SRU/Update
• Inlines external text and XML Bodies and authors
• Indexes OA and DC properties
• Assigns resolvable http URIs and resolves those
• Implementation: RDF store in combination with Solr, production quality software components (Meresco)
• Built-in OAI-PMH data provider and harvester for ‘annotation sets’
• Query: SRU/CQL, SPARQL, OAI-PMH
• Simple management dashboard (authentication and authorization, collection management, harvesting)
• Easy installation and Open Source
! Model does not support Annotation “sets”
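The query interfaces listed above are standard protocols, so requests can be built with ordinary URL tooling. The sketch below builds an SRU searchRetrieve URL and an OAI-PMH ListRecords URL. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`, `verb`, `metadataPrefix`, `set`) come from the SRU and OAI-PMH specifications. The host, the `/sru` and `/oai` paths, the CQL index name and the set name are hypothetical placeholders, not the actual OAS endpoints.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://oas.example.org"  # hypothetical host, not the real service


def sru_search_url(cql_query: str, start_record: int = 1,
                   maximum_records: int = 10) -> str:
    """Build an SRU searchRetrieve request for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return BASE_URL + "/sru?" + urlencode(params)  # path is an assumption


def oai_list_records_url(set_spec: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request for one annotation set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    return BASE_URL + "/oai?" + urlencode(params)  # path is an assumption


# Index name "dc.creator" and set name "sailing-letters" are illustrative only.
search_url = sru_search_url('dc.creator = "mceunen"')
harvest_url = oai_list_records_url("sailing-letters")
```

Letting `urlencode` do the escaping avoids hand-encoding the quotes and spaces that CQL queries usually contain.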
OAS: issues
• Annotation publication
• Searchability: ‘harvest and index’
• Text search on external bodies
• Annotation boundaries
• ‘Bypassing’ oac:constrains
! In RDF, what are the boundaries of an annotation?
Entity Recognition service
[Diagram labels: URL or text, service, resolve, OAS, URL or ID, source_text, frog, FoLiA_document, converter, entity annotations]
‘frog’ and FoLiA
• ‘Frog’ tool generates FoLiA XML document with
  – Segmentation of text in paragraphs, sentences and words (tokens)
  – XML hierarchy
  – Part of speech, lemma, morphology, chunking, dependency structure and named entities
• Mix of inline and standoff annotation
  – ‘Frog’ does not keep track of character offsets
  – Explicit ordering: numbering system in ids
• Trained for Dutch
• Widely used for Dutch corpora
• Made available by: ILK @ Tilburg University
FoLiA-2-OAC conversion
• Reconstruct character offsets after tokenization
• Operates on inline text as published by OAS
• Construct and add entity text from tokens + sequence (the+hague != hague+the)
• Two approaches
  1. Minimal: extract entity annotations and tokens, and convert to OAC
  2. Maximal: full conversion to OAC
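Since the tokenizer output carries no character offsets, the converter has to recover them from the inline text and the token order. The sketch below shows one simple way this could be done: scan the text left to right and locate each token after the end of the previous one. It assumes tokens appear verbatim and in order in the text; it is an illustration, not the method used by the actual converter, and the function names are hypothetical. An entity span then runs from the start of its first token to the end of its last token, which also preserves token sequence ("Den Haag", not "Haag Den").

```python
def token_offsets(text: str, tokens: list) -> list:
    """Return (start, end) character spans for tokens, in token order."""
    spans = []
    cursor = 0  # search for each token only after the previous one
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError("token %r not found after offset %d" % (token, cursor))
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


def entity_span(spans: list, first: int, last: int) -> tuple:
    """Span covering tokens first..last (inclusive token indexes)."""
    return spans[first][0], spans[last][1]


text = "Dit is een beschrijving van Den Haag. En dit is een tweede zin."
tokens = ["Dit", "is", "een", "beschrijving", "van", "Den", "Haag", ".",
          "En", "dit", "is", "een", "tweede", "zin", "."]
spans = token_offsets(text, tokens)
start, end = entity_span(spans, 5, 6)  # tokens "Den" and "Haag"
entity_text = text[start:end]
```

Advancing the cursor is what keeps repeated tokens apart: the second "is" and the second "een" are matched in the second sentence rather than at their first occurrence.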
Linguistic Annotation
! Mix-in domain semantics as subtypes/subproperties?
! Maximal OA mapping or embed linguistic standards?
! Layers, hierarchies (syntax) and Documents
! Sequence (e.g. entities, morpheme breakup)
Synchronized viewing client demo • Demo/screenshot
Summary of OA issues
! Annotation of annotations?
! Annotation of segments of inline text?
! Need for explicit representation of Sequence?
! Search on ConstrainedTarget/Body?
! For many annotation tasks OA may be overkill
! Model does not support Annotation sets
! In RDF, what are the boundaries of an annotation?
Future work
• Finalize and integrate software (with web services)
• Upgrade to new OA spec (incl. OAS)
• Line strip detection web service
• Possible applications
  – AV annotation in CATCHPlus
  – Nederlab
Questions?