
Tags in the cloud: Crowdsourcing semantic annotation with CATMA
Jan Christoph Meister, University of Hamburg
www.catma.de

CATMA: an integrated textual markup and analysis tool
29.10.2012, CLARIN's Turn Towards The Literary Text

Text vs. sentence, or: What's so different about processing texts?
• structural complexity: min TEXT > 2 (SENTENCE)
• structural activity: TEXT processing actualizes paradigmatic cross-reference across sentences
• structural dynamic: TEXT processing represents & simulates cognitive and empirical processes
TEXT yields more INTERPRETATIONS than SENTENCE
+ CONTINGENCY: The more complex & dynamic structure, when activated during processing, results in a higher degree of contingency in functional “outcome”

The what and why of MarkUp
procedural, descriptive & discursive function
• discursive markup: enables human readers to interpret a text and to explore its hermeneutic potential in collaboration. “What might this text mean to us?”
• declarative markup: informs a human reader how to process a text as a communicative device. “How is this text put together and how does it function in its communicative universe?”
• procedural markup: instructs a (natural or artificial) text processor how to handle a text as a structured character string. “What is the correct operation to perform on this input?”
performative function

Hermeneutic “must haves” of discursive markup
• facilitate collaboration & non-deterministic annotation
• allow for multiple markup
• allow for overlap
• allow for concurrent tagging
• conceptualize markup as dynamic & recursive
• allow for extensibility
• allow for multiple (and even contradictory) markup
• seamlessly integrate markup and analysis & support the hermeneutic loop

MarkUp types & data models
• stand off, discursive:
  <1, 5, word class = “Preposition”> <1, 5, segment = “SentenceStart”> <1, 8, POS = “noun phrase”> <1, 5, word class = “Adverb”> <1, 38, speech act = “declaration”> <1, 11, POS = “verb phrase”>
  There is no such thing as “no-mark up”.
• stand off, descriptive:
  <1, 5, word class = “Adverb”> <1, 5, segment = “SentenceStart”> <1, 5, POS = “verb phrase element”>
  There is no such thing as “no-mark up”.
• inline, deterministic (nested):
  <SentenceStart><Adverb>There</Adverb></SentenceStart> is no such thing as “no-mark up”.
• inline, deterministic (sequential):
  <SentenceStart>There</SentenceStart> is no such thing as “no-mark up”.
• implicit:
  There is no such thing as “no-mark up”. (Coombs, Renear, DeRose 1987)
Data model labels on the slide: network, relational, nested, sequential, linear, opaque
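The stand-off rows above can be sketched in a few lines of Python. This is only an illustration, not CATMA's data model: the `Annotation` type and both helper functions are hypothetical names, and the sketch assumes that the slide's offsets are 1-based and inclusive, so that <1, 5> covers "There".

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int    # assumed 1-based, inclusive (as the slide's <1, 5> suggests)
    end: int      # assumed 1-based, inclusive
    feature: str
    value: str


TEXT = 'There is no such thing as "no-mark up".'

# Stand-off markup lives next to the text, so annotations may overlap
# and may even contradict each other (Adverb vs. Preposition).
ANNOTATIONS = [
    Annotation(1, 5, "word class", "Adverb"),
    Annotation(1, 5, "word class", "Preposition"),
    Annotation(1, 5, "segment", "SentenceStart"),
    Annotation(1, 38, "speech act", "declaration"),
]


def covered_text(text: str, ann: Annotation) -> str:
    """Return the stretch of text an annotation points at."""
    return text[ann.start - 1:ann.end]


def annotations_at(position: int, annotations: list) -> list:
    """Return every annotation whose range includes a 1-based position."""
    return [a for a in annotations if a.start <= position <= a.end]
```

Because the text itself is never modified, adding a further (even contradictory) annotation is just one more tuple in the list.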

Implementation in CATMA
www.catma.de

The CATMA/CLÉA approach to markup
• text range based model: a tag references a text range with a start and an end offset
• external standoff markup:
  • markup is stored in external files or databases to facilitate tagging and exchange of markup by multiple users
  • markup is stored in a standoff manner to allow overlapping markup
• tolerates non-deterministic tagging & supports analytical operations that exploit semantic ambiguity
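A minimal sketch of the range-based idea, assuming 0-based offsets with an exclusive end. The `TagReference` class, the user names and the sample text are hypothetical and are not taken from CATMA's code base. Since tags are kept outside the text, references from different users can overlap freely, and finding the overlaps is a simple interval comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagReference:
    tag: str
    start: int   # 0-based start offset, inclusive (assumption)
    end: int     # 0-based end offset, exclusive (assumption)
    user: str


def overlapping_pairs(refs):
    """Return all pairs of tag references whose text ranges overlap."""
    ordered = sorted(refs, key=lambda r: (r.start, r.end))
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # later references start even further right
            pairs.append((a, b))
    return pairs


TEXT = "Jan Christoph Meister, University of Hamburg"
REFS = [
    TagReference("person", 0, 21, "user_a"),
    TagReference("affiliation", 23, 44, "user_a"),
    TagReference("keynote_speaker", 0, 44, "user_b"),
]
```

Here `overlapping_pairs(REFS)` pairs the second user's broad tag with each of the first user's narrower tags, while the two narrower tags do not overlap each other.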

Example of overlapping markup in CATMA
(NB: In CATMA tag sets can be imported/exported; tags can be created/manipulated ad hoc during markup)

TEI feature structure tag declaration & overlapping markup

Tag declaration:
  <fs xml:id="CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5" n="1_7985fdf0-77a5-4060-9a3d2d977e0ab954" type="catma_tag">
    <f xml:id="CATMA_aa9b3727-187e-4fb8-9990-e7880912a409" name="catma_tagname">
      <string>Keynote_speaker&affiliation</string>
    </f>
    <f xml:id="CATMA_564825ba-28b2-4dab-b136-b87c8a3d9e28" name="catma_displaycolor">
      <numeric value="-13421569"/>
    </f>
  </fs>

Overlapping markup:
  <ptr target="Abstracts.doc#range(/.21736,/.21888)" type="inclusion"/>
  <seg ana="#CATMA_0a252cc2-96d2-4ed4-8fb8-52380550ec0b #CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5 #CATMA_8513fe2d-2e35-4d0a-a3a207528bcfa012">
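To illustrate how such a feature structure could be read back, here is a small sketch using Python's standard `xml.etree.ElementTree`. It is an assumption-laden simplification: the snippet omits the TEI namespace and the `n` attribute, escapes the ampersand as `&amp;` so that the XML is well formed, and `read_tag_declaration` is a hypothetical helper, not part of CATMA.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined in XML; ElementTree expands it to this URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

FS_SNIPPET = """
<fs xml:id="CATMA_d7251f99-14e9-4c36-8ff7-24058ae81ce5" type="catma_tag">
  <f name="catma_tagname"><string>Keynote_speaker&amp;affiliation</string></f>
  <f name="catma_displaycolor"><numeric value="-13421569"/></f>
</fs>
"""


def read_tag_declaration(xml_text: str) -> dict:
    """Collect the id, type and feature values of one <fs> element."""
    fs = ET.fromstring(xml_text.strip())
    features = {}
    for f in fs.findall("f"):
        string = f.find("string")
        numeric = f.find("numeric")
        if string is not None:
            features[f.get("name")] = string.text
        elif numeric is not None:
            features[f.get("name")] = int(numeric.get("value"))
    return {"id": fs.get(XML_ID), "type": fs.get("type"), "features": features}


declaration = read_tag_declaration(FS_SNIPPET)
```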

Question 1: How can we model a collaborative markup practice?

Answer 1: CATMA's “n meta-data sets to 1 object data instance” model
• meta-data: user markup 1..n (procedural, declarative, hermeneutic), Tagsets
• object-data: TEXT

Question 2: But how, on top of that, can we also model the recursive routines that characterize the humanistic workflow?

Example of recursion: a simple query across the object data/meta-data divide
Step 1: object data query ...
Step 2: refinement by adding an additional meta-data constraint ...

... which is why (reg="\b\S*\Qez\E(?=\W)") where (tag="Keynote_speaker&affiliation") generates this:
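The two-step query can be approximated as follows. This is a sketch of the idea only, not CATMA's query engine: the function names, the sample text and the tag range are hypothetical, and the pattern is a simplified Python stand-in (Python's `re` module has no `\Q...\E` quoting) that matches words ending in "ez".

```python
import re

TEXT = "Keynote: Ana Martinez, Hamburg. Chair: Luis Gomez."

# Hypothetical tag range: where a "Keynote_speaker&affiliation" tag might sit.
_tagged = "Ana Martinez, Hamburg"
_start = TEXT.index(_tagged)
TAG_RANGES = [(_start, _start + len(_tagged))]

PATTERN = r"\b\S*ez(?=\W)"   # simplified assumption: words ending in "ez"


def regex_hits(text, pattern):
    """Step 1: object data query, returning (start, end, match) triples."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


def within_tag(hits, tag_ranges):
    """Step 2: keep only hits that lie entirely inside a tagged range."""
    return [
        hit for hit in hits
        if any(start <= hit[0] and hit[1] <= end for start, end in tag_ranges)
    ]


all_hits = regex_hits(TEXT, PATTERN)
refined = within_tag(all_hits, TAG_RANGES)
```

Step 1 looks only at the text; step 2 narrows the result using the markup, which is the move across the object data/meta-data divide.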

Answer 2: CATMA's dynamic data model, e.g. (n meta-data set to 1 object instance) > n+1
• meta-data: markup 1..n (procedural, declarative, hermeneutic), Tagsets
• object-data: TEXT

Question 3: How can we implement this practice in a system?

Answer 3: Call the big sister – CLÉA!
CLÉA Data Base Model

CATMA/CLÉA: User and resource administration

Manage corpora & source documents, markup collections and tag libraries

Annotate texts or corpora using pre-defined or ready-made tags

Build and execute queries on source text & tags, or any combination thereof

Visualize results

What's in it for CLARIN?
• Import any text or corpus into CATMA/CLÉA
• Run standard analytical procedures automatically or interactively on upload (indexing, POS tagging etc.)
• Annotate and analyse texts or corpora collaboratively
• Share and export markup from the CATMA/CLÉA database in multiple formats
CLÉA = Collaborative Literature Éxploration and Annotation

Mille grazie to my CATMA/CLÉA development team
• Evelyn Gius
• Malte Meister
• Marco Petris
• Lena Schüch
and to our funders
• University of Hamburg (2009)
• Google DH Awards (2010-2013)
• BMBF (2013-2016)

Tag definition
• each Tag has a type
• each Tag has a color
• each Tag can have additional user-defined properties

Tag instance
• each Tag instance is of a type
• a Tag instance can have individual values for the user-defined properties
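The relation between a tag definition and its instances might be modelled along these lines. The class and field names are hypothetical, and the check in `__post_init__` is an added assumption about how undeclared properties could be handled, not something stated on the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TagDefinition:
    type_name: str
    color: int                  # e.g. the numeric display colour
    properties: tuple = ()      # names of user-defined properties


@dataclass
class TagInstance:
    definition: TagDefinition   # each instance is of exactly one type
    start: int
    end: int
    values: dict = field(default_factory=dict)

    def __post_init__(self):
        # Assumption: an instance may only set properties its definition declares.
        unknown = set(self.values) - set(self.definition.properties)
        if unknown:
            raise ValueError(f"undeclared properties: {sorted(unknown)}")


speaker = TagDefinition("Keynote_speaker&affiliation", -13421569, ("certainty",))
first = TagInstance(speaker, 21736, 21888, {"certainty": "high"})
second = TagInstance(speaker, 100, 120, {"certainty": "low"})
```

Both instances share one definition (type and colour) but carry their own value for the user-defined property.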

Tag referencing
• The content of a range is referenced by a pointer to an external entity.
• The URI is based on RFC 5147 for pointing to plain text.
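As a rough illustration of plain-text pointing, the sketch below resolves a fragment written in RFC 5147's own `char=start,end` scheme, where positions count the gaps between characters starting at 0. The file name and the `resolve` helper are hypothetical, only the two-position form is handled (no integrity checks), and CATMA's actual pointer syntax may differ from this surface form.

```python
import re
from urllib.parse import urldefrag

# Only the simple two-position range is handled in this sketch.
_CHAR_RANGE = re.compile(r"^char=(\d+),(\d+)")


def resolve(uri: str, text: str) -> str:
    """Return the characters of `text` selected by a char= fragment."""
    fragment = urldefrag(uri).fragment
    match = _CHAR_RANGE.match(fragment)
    if match is None:
        raise ValueError(f"unsupported fragment: {fragment!r}")
    start, end = int(match.group(1)), int(match.group(2))
    return text[start:end]


TEXT = "Keynote: Ana Martinez, Hamburg."
selected = resolve("abstracts.txt#char=9,21", TEXT)
```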

Potential problems and possible solutions
• range references based on character offsets are vulnerable to modifications of the content
  • possible solution: automated adjustments with checksums and context information, and
  • track versioning and revision history in the source document header
• the encoding of the tags is machine readable but not interoperable out of the box
  • possible solution: defining the feature structure encoding of tags in terms of the Open Annotation framework
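One way the "checksums and context information" idea could work is sketched below. This is a hypothetical illustration rather than CATMA's implementation: the anchor layout, the choice of SHA-256 and the brute-force scan are all assumptions made for the example.

```python
import hashlib


def _digest(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def make_anchor(text, start, end, context=12):
    """Store offsets plus a checksum of the range and some preceding context."""
    return {
        "start": start,
        "end": end,
        "checksum": _digest(text[start:end]),
        "before": text[max(0, start - context):start],
    }


def relocate(anchor, new_text):
    """Find the anchored range in a modified text, or return None."""
    length = anchor["end"] - anchor["start"]
    # Fast path: the old offsets still point at the same content.
    if _digest(new_text[anchor["start"]:anchor["end"]]) == anchor["checksum"]:
        return anchor["start"], anchor["end"]
    candidates = [
        i for i in range(len(new_text) - length + 1)
        if _digest(new_text[i:i + length]) == anchor["checksum"]
    ]
    if not candidates:
        return None

    def score(i):
        before = new_text[max(0, i - len(anchor["before"])):i]
        # Prefer matching context, then the smallest shift from the old offset.
        return (before != anchor["before"], abs(i - anchor["start"]))

    best = min(candidates, key=score)
    return best, best + length


OLD = "Keynote: Ana Martinez, Hamburg."
NEW = "Opening keynote: Ana Martinez, Hamburg."
anchor = make_anchor(OLD, 9, 21)
new_range = relocate(anchor, NEW)
```

Scanning every window is slow for long texts; it is used here only to keep the sketch short.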