Text Analysis Meets Computational Lexicography
Hannah Kermes, IMS Universität Stuttgart


Motivation
• maintenance of consistency and completeness within lexica
  → computer-assisted methods
• lexical engineering
  → scalable lexicographic work process
  → processes reproducible on large amounts of text

Motivation
• rising interest in using evidence derived from automatic syntactic analysis
• statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research
• full parsers are not robust enough
  → need for analysis tools that meet the specific needs of corpus linguistic studies

Information needed
• syntactic information: subcategorization patterns
• semantic information: selectional preferences, collocations, MWL
• morphological information: case, number, gender; compounding and derivation

A corpus linguistic approach

Hypothesis
The better and more detailed the off-line annotation, the better and faster the on-line extraction.
However, the more detailed the off-line annotation, the more complex the grammar, the more time-consuming and difficult the grammar development, and the slower the parsing process.

Requirements for the tool
• it has to work on unrestricted text
• shortcomings in the grammar should not lead to a complete failure to parse
• no manual checking should be required
• it should provide a clearly defined interface
• annotation should follow linguistic standards

Requirements for the annotation
• head lemma
• morpho-syntactic information
• lexical-semantic information
• structural and textual information
• hierarchical representation

Chunking vs. full parsing
YAC: between chunking and full parsing

Chunking                         Full parsing
flat non-recursive structures    full hierarchical representation
simple grammar                   complex grammar
robust and efficient             not very robust
non-ambiguous output             ambiguous output

A classical chunker
• robust: works on unrestricted text
• works fully automatically
• does not provide a full but a partial analysis of text
• no highly ambiguous attachment decisions are made

YAC goes beyond
• extends the chunk definition of Abney
• provides additional information about annotated chunks

Applying and processing rules
[Diagram; components: grammar rules, lexicon, corpus, rule application, post-processing (Perl scripts), annotation of results]

Advantages of the system
• efficient work even with large corpora
• modular query language
• interactive grammar development
• powerful post-processing of rules

Annotated chunk categories
• Adverbial phrases (AdvP)
• Adjectival phrases (AP)
• Noun phrases (NP)
• Prepositional phrases (PP)
• Verbal complexes (VC)
• Clauses (CL)

Additional information
• head lemma
• morpho-syntactic information
• lexical-semantic properties

Feature annotation
[Table: features (lexical-semantic feature value, head lemma, agreement info, verbal head lemma) by chunk category (AdvP, AP, NP, PP, VC, CL); four cells are marked X, but their positions are not recoverable]

Some properties of NPs
• cardinal noun
• measure noun
• ne: named entity
• quot: NP in quotation marks
• street address
• temporal noun
• date
• pronominal NP

Other lexical-semantic properties
• VC with separated prefix: pref
  Er kommt an (he arrives)
• PP with contracted preposition and article: fus
  am Bahnhof (at the station)
• complex APs embedding PPs: pp
  über die Köpfe der Apostel gesetzten (placed above the heads of the apostles)
• AP with deverbal adjectives: vder

Target data
• predicative(-like) constructions
  Es war klar, daß ...
  It was clear, that ...
• ... with adverbial pronoun
  Er ist davon überzeugt, daß ...
  He is of it convinced, that ...
• ... with reflexive pronoun
  Es zeigt sich deutlich, daß ...
  It shows itself clear, that ...

Target data
• ... with infinitival clauses
  Es ist möglich, ihn zu besuchen.
  It is possible, him to visit.
• ... with clause in topicalized position
  Daß ..., ist klar.
  That ..., is clear.
  Ihn zu besuchen, ist möglich.
  Him to visit, is possible.

Sample query: adjective + verb + finite clause
Successive refinements of the query:
• VC AP CL
• VC APpred CLfin
• VC Adjuncts* APpred CLfin
• VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin
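The final query reads like a regular expression over chunk labels, which can be sketched in Python as follows. This is only an illustration: the flat, space-separated label encoding and the function name `matches` are assumptions made here, not the query language actually used with YAC.

```python
import re

# Assumption: each sentence is reduced to a flat sequence of top-level
# chunk labels, joined by single spaces (hypothetical encoding).
ADJUNCT = r"(?:AdvP|PP|NPtemp|CLrel)"

# VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin
QUERY = re.compile(rf"\bVC(?: {ADJUNCT})* APpred CLfin\b")


def matches(chunk_labels):
    """Return True if the label sequence contains the query pattern."""
    return QUERY.search(" ".join(chunk_labels)) is not None


# Hypothetical label sequences for sentences of the type "Es war klar, daß ..."
print(matches(["NP", "VC", "APpred", "CLfin"]))          # no adjunct
print(matches(["NP", "VC", "AdvP", "APpred", "CLfin"]))  # one adjunct
print(matches(["NP", "VC", "NP", "APpred", "CLfin"]))    # plain NP intervenes
```

The first two sequences match; the third does not, because a plain NP is not in the adjunct list.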

adjective + verb + finite clause

            sein   bleiben   machen   werden
fraglich     326        34        -        3
unklar       320       103        -        -
klar         225         -       41       30
offen        228        40        -        -
möglich      160         -       30        2
wichtig      180         -        -        2
deutlich       5         -       97       34
total       1500       177      168       75

Topicalized finite clause: adjective + verb + finite clause
CLfin VC (AdvP|PP|NPtemp|CLrel)* APpred

adjective + verb + finite clause

            fincl_ex   fincl_top   total
fraglich          91         335     426
unklar            13           ?     426
klar             221         159     380
offen             19         266     285
möglich          207           4     211
wichtig          192           9     201
deutlich         139          22     161
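To make the distributional differences easier to see, the share of topicalized clauses per adjective can be computed from the counts. The sketch below assumes that fincl_ex and fincl_top are the two positions of the finite clause and that they add up to the total, as they do in every complete row; "unklar" is left out because one of its cells is missing. The function name is hypothetical.

```python
# Counts copied from the table: adjective -> (fincl_ex, fincl_top).
# "unklar" is omitted because its row is incomplete in the source.
counts = {
    "fraglich": (91, 335),
    "klar": (221, 159),
    "offen": (19, 266),
    "möglich": (207, 4),
    "wichtig": (192, 9),
    "deutlich": (139, 22),
}


def topicalized_share(fincl_ex, fincl_top):
    """Fraction of all finite-clause hits where the clause is topicalized."""
    return fincl_top / (fincl_ex + fincl_top)


for adjective, (ex, top) in counts.items():
    print(f"{adjective:10s} {topicalized_share(ex, top):6.1%}")
```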

adjective + verb + infinitival clause

                sein   fallen   haben   werden   machen
bereit           431        -       4        6        -
schwer           162      221     108       33       26
möglich          532        -       -       40       35
schwierig        245        -       -       93       12
leicht           120       59      31        8       16
nötig            112        -      48        2        7
erforderlich     102        -       -        1       15
total           1708      280     195      183      111

Low-frequency adjective + verb + infinitival clause
[Table: verbs stehen, bringen, sein, haben; adjectives frei, satt, fertig; counts in extraction order: 35, 4, 19, 24, 10, 1; cell positions not recoverable]

Low-frequency adjective + verb + clause
[Table: verbs stehen, bringen, sein, haben; adjectives frei, satt, fertig; counts in extraction order: 37, 6, 27, 26, 11, 1; cell positions not recoverable]

Conclusion
• recursive chunking is a workable compromise between depth of analysis and robustness
• extracted data show correlation between
  • collocational preference
  • subcategorization frames
  • semantic classes of adjectives
  • to a certain extent, distributional preferences

Evaluation on automatic PoS tags

        all chunks             maximal chunks
        precision   recall     precision   recall
NP          89.93    91.67         89.43    91.68
PP          94.05    89.67         94.04    89.65
AP          84.24    89.25         83.67    89.59
VC              -        -         97.72    96.62

Evaluation on ideal PoS tags

        all chunks             maximal chunks
        precision   recall     precision   recall
NP          96.36    96.51         95.55    96.47
PP          98.08    96.51         98.07    96.50
AP          96.39    97.50         96.12    97.45
VC              -        -         99.01    98.59
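Precision and recall can be combined into a single balanced F-score for a quicker comparison of the two tagging conditions. The slides report only precision and recall; the F-score below is an addition made here, computed from the "all chunks" columns, and the names `f1`, `automatic`, and `ideal` are chosen for this sketch.

```python
def f1(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# "All chunks" columns from the two evaluation tables: (precision, recall).
automatic = {"NP": (89.93, 91.67), "PP": (94.05, 89.67), "AP": (84.24, 89.25)}
ideal = {"NP": (96.36, 96.51), "PP": (98.08, 96.51), "AP": (96.39, 97.50)}

for category in automatic:
    auto_f = f1(*automatic[category])
    ideal_f = f1(*ideal[category])
    print(f"{category}: automatic {auto_f:.2f}, ideal {ideal_f:.2f}, "
          f"difference {ideal_f - auto_f:.2f}")
```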

Chunking process
[Diagram; labels: First Level, Second Level, Third Level, Corpus, Lexicon]

Chunking process
• First Level
  • lexical information is introduced
  • chunks with specific internal structure are built
  • non-recursive chunks are built
• Second Level
  • main parsing level
  • complex (recursive) structures are built in several iterations
• Third Level
  • chunk hierarchy is built
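The idea of building recursive structures in several iterations can be illustrated with a toy cascade. This is a minimal sketch under strong assumptions: the three rules, the tag names, and all function names are invented for the example and are not YAC's grammar or implementation; the first level is reduced to turning tagged tokens into leaf nodes, and the third level to printing the resulting hierarchy.

```python
# Toy rules (hypothetical): left-hand side and the label sequence it replaces.
RULES = [
    ("NP", ("DET", "NOUN")),
    ("PP", ("PREP", "NP")),
    ("AP", ("PP", "ADJ")),  # complex AP embedding a PP
]


def first_level(tagged):
    """Turn (word, tag) pairs into leaf nodes of the form (label, word)."""
    return [(tag, word) for word, tag in tagged]


def apply_rule(nodes, lhs, rhs):
    """Replace the first occurrence of the label sequence rhs by one lhs node."""
    n = len(rhs)
    for i in range(len(nodes) - n + 1):
        if tuple(label for label, _ in nodes[i:i + n]) == rhs:
            return nodes[:i] + [(lhs, nodes[i:i + n])] + nodes[i + n:], True
    return nodes, False


def second_level(nodes):
    """Apply all rules repeatedly until no rule changes the sequence."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            nodes, hit = apply_rule(nodes, lhs, rhs)
            changed = changed or hit
    return nodes


def bracket(node):
    """Render the chunk hierarchy as a labelled bracketing."""
    label, content = node
    if isinstance(content, str):
        return content
    return f"[{label} " + " ".join(bracket(child) for child in content) + "]"


tagged = [("über", "PREP"), ("die", "DET"), ("Köpfe", "NOUN"), ("gesetzten", "ADJ")]
result = second_level(first_level(tagged))
print(" ".join(bracket(node) for node in result))
# prints: [AP [PP über [NP die Köpfe]] gesetzten]
```

Each pass can only combine what earlier passes have already built, which is why the embedded NP and PP are formed before the AP that contains them.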

Rule blocks

Advantages
• specific rules do not interact with main parsing rules
• additional (e.g. domain-specific) rules can be included easily
• main parsing rules can be kept simple
• number of main parsing rules can be kept small