MorphoSyntactic Analysis and Language Modeling using Machine Learning


Morpho-Syntactic Analysis and Language Modeling using Machine Learning Techniques
Guy De Pauw (guy.depauw@ua.ac.be)
Walter Daelemans (walter.daelemans@ua.ac.be)
CNTS – Language Technology Group
http://www.cnts.ua.ac.be

1 Morpho-Syntactic Analysis using Machine Learning Techniques
• Why?
  - As an NLP tool proper (!)
  - Annotate new datasets (e.g. Mediargus)
  - Extra information source for language modeling
• How?
  - Machine Learning techniques (MBL + maxent)
  - Shallow linguistic analysis
FLaVoR Workshop (17/11/2006) – Morpho-Syntactic Analysis and Language Modeling using Machine Learning Techniques

2 Shallow linguistic analysis
• For many NLP applications, full analysis is often not necessary
  - e.g. morphological analysis of uitzonderingsgevallen:
    FULL:    ((((uitzonder)[V], (ing)[N|V.])[N], (s)[N|N.N], (geval)[N]), (en)[N-m]
    vs
    SHALLOW: uitzonder@V + ing@N|V. + s@N|N.N + geval@N + en@N-m
• Shallow analysis: fast + robust

3 Shallow linguistic analysis
word              morphology           POS-tag  SP-tag
nu                nu                   BW       I-ADVP
treft             tref+t               WW3      S-MAIN
de                de                   LID      I-NP
nietsvermoedende  niets+vermoed+end+e  ADJ1     I-NP
poolreiziger      pool+reiziger        N1       I-NP
vuilnisbelten     vuil+nis+belt+en     N3       B-NP
tussen                                 VZ1      I-PP
de                de                   LID      I-NP
ijsbergen         ijs+berg+en          N3       I-NP
aan                                    VZ2      I-SVP
.                 .                    LET      0

4 Shallow linguistic analysis
[ADVP nu] [SMAIN tref+t] [NP de niets+vermoed+end+e pool+reiziger] [NP vuilnis+belt+en] [PP tussen] [NP de ijs+berg+en] [SVP aan].
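
The step from per-token SP-tags to bracketed chunks can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that an `I-X` tag continues the current `X` chunk (or opens one when the label changes) and that a `B-X` tag always opens a new chunk, which is how the NP/PP tags in the example sentence appear to behave. The function names are hypothetical.

```python
def chunks_from_tags(rows):
    """Group (token, SP-tag) rows into (label, tokens) chunks.

    Assumed convention: 'I-X' continues the current X chunk (or opens one
    if the label changes); 'B-X' always opens a new X chunk.
    """
    chunks = []
    for token, tag in rows:
        prefix, _, label = tag.partition("-")
        starts_new = (
            prefix == "B"
            or not chunks
            or chunks[-1][0] != label
        )
        if starts_new:
            chunks.append((label, [token]))
        else:
            chunks[-1][1].append(token)
    return chunks


def bracket(chunks):
    """Render chunks in the bracketed notation used on the slide."""
    return " ".join(f"[{label} {' '.join(tokens)}]" for label, tokens in chunks)


# NP/PP part of the example sentence, with the morphological segmentation
# used in the bracketed version of the sentence.
rows = [
    ("de", "I-NP"),
    ("niets+vermoed+end+e", "I-NP"),
    ("pool+reiziger", "I-NP"),
    ("vuilnis+belt+en", "B-NP"),
    ("tussen", "I-PP"),
    ("de", "I-NP"),
    ("ijs+berg+en", "I-NP"),
]

print(bracket(chunks_from_tags(rows)))
# [NP de niets+vermoed+end+e pool+reiziger] [NP vuilnis+belt+en] [PP tussen] [NP de ijs+berg+en]
```

The `B-NP` tag is what separates the two adjacent noun phrases; without it they would merge into one chunk.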

5 Morphological Analysis
parelvissers
  Segmentation: parel+viss+er+s
  Tagging:      parel@N+viss@V+er@N|V.+s@INFLm
  Alternation:  parel@N+vis@V+er@N|V.+s@INFLm

6 Morphological Analysis
Parelvissers
  Segmentation: Parel+viss+er+s
  Tagging:      parel@N+viss@V+er@N|V.+s@INFLm
  Alternation:  parel@N+vis@V+er@N|V.+s@INFLm

7 Morphological Segmentation

8 Morphological Segmentation
• Trained and evaluated on (adapted) morphological database of CELEX
• Experimental results (full word score):
  - FS (minimal boundaries + unigram): 86.7%
  - Morpheme Boundary Prediction: 89.2%
  - FS + Morpheme Prediction: 94.8%
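
A "full word score" can be read as the fraction of words whose complete segmentation matches the reference; that reading is an assumption, since the slides do not define the metric. Under that assumption, a minimal sketch looks like this (the function name and the deliberately wrong prediction `pool+reiz+iger` are hypothetical):

```python
def full_word_score(predicted, gold):
    """Fraction of words whose complete segmentation equals the gold one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


gold = ["parel+viss+er+s", "ijs+berg+en", "pool+reiziger", "tref+t"]
# Hypothetical system output with one over-segmented word.
predicted = ["parel+viss+er+s", "ijs+berg+en", "pool+reiz+iger", "tref+t"]

score = full_word_score(predicted, gold)
print(f"{score:.1%}")  # 75.0%
```

A word with a single misplaced boundary counts as fully wrong under this metric, which makes it stricter than a per-boundary score.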

9 Morphological Analysis
Parelvissers
  Segmentation: Parel+viss+er+s
                96%
  Tagging:      parel@N+viss@V+er@N|V.+s@INFLm
  Alternation:  parel@N+vis@V+er@N|V.+s@INFLm

10 Morphological Analysis
parelvissers
  Segmentation: parel+viss+er+s
  Tagging:      parel@N+viss@V+er@N|V.+s@INFLm
  Alternation:  parel@N+vis@V+er@N|V.+s@INFLm

11 Alternation
• Map
  - parel+viss+er+s to parel+vis+er+s
  - aan+lop+en to aan+loop+en
  - but also aan+ge+bracht to aan+ge+breng

12-14 Alternation
• Grapheme-based alternation

15 Alternation
• Grapheme-based alternation
• 99.4% of morphemes correctly alternated
  - Including complex alternations like bracht->breng
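
One way to picture grapheme-based alternation is as a per-grapheme classification task: each grapheme, seen in a window of neighbouring graphemes, is mapped to an output string (itself, a replacement, a doubled vowel, or nothing). The toy sketch below uses a nearest-neighbour lookup with a simple overlap count, in the spirit of memory-based learning. The window width, the hand-made alignments and all function names are assumptions for illustration; the slides do not specify the actual features or training data.

```python
PAD = "_"


def windows(morpheme, width=2):
    """One feature tuple per grapheme: `width` graphemes of context per side."""
    padded = PAD * width + morpheme + PAD * width
    return [tuple(padded[i:i + 2 * width + 1]) for i in range(len(morpheme))]


def train(examples):
    """Store every (window, output) instance in memory.

    `examples` pairs a surface morpheme with hand-aligned per-grapheme
    outputs (hypothetical training data).
    """
    memory = []
    for surface, outputs in examples:
        if len(surface) != len(outputs):
            raise ValueError(f"bad alignment for {surface!r}")
        memory.extend(zip(windows(surface), outputs))
    return memory


def overlap(a, b):
    """Number of window positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b))


def alternate(morpheme, memory):
    """Rewrite each grapheme with the output of its most similar stored instance."""
    out = []
    for window in windows(morpheme):
        _, best_output = max(memory, key=lambda inst: overlap(window, inst[0]))
        out.append(best_output)
    return "".join(out)


examples = [
    ("viss", ["v", "i", "s", ""]),                # viss -> vis
    ("lop", ["l", "oo", "p"]),                    # lop -> loop
    ("bracht", ["b", "r", "e", "n", "g", ""]),    # bracht -> breng
    ("parel", ["p", "a", "r", "e", "l"]),         # unchanged
]
memory = train(examples)

print(alternate("viss", memory))    # vis
print(alternate("lop", memory))     # loop
print(alternate("bracht", memory))  # breng
```

The three printed cases are all training items, so the lookup finds an exact match; behaviour on unseen morphemes depends entirely on how much training data is stored.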

16 Morphological Analysis
• Use morphological analysis cascade to analyze all words in CGN and Mediargus (not in CELEX), e.g.
    F1: flowerpower-afstammelingen
    F2: flowerpower-@N+af@P+stamm@V+eling@N|V.+en@INFLm
    F3: flowerpower@N+af@P+stam@V+eling@N|V.+en@INFLm
    F4: m
• Huge morphological database of ± 2.7M words
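
Entries in the `morpheme@TAG+morpheme@TAG` notation are easy to take apart for downstream use. The sketch below assumes that `+` only ever separates morphemes and that the last `@` in each part separates the morpheme from its tag; `parse_analysis` is a hypothetical helper, not part of the described tools.

```python
def parse_analysis(analysis):
    """Split 'morph@TAG+morph@TAG' into a list of (morpheme, tag) pairs."""
    pairs = []
    for part in analysis.split("+"):
        morpheme, sep, tag = part.rpartition("@")
        if not sep:
            raise ValueError(f"missing '@' in {part!r}")
        pairs.append((morpheme, tag))
    return pairs


pairs = parse_analysis("flowerpower@N+af@P+stam@V+eling@N|V.+en@INFLm")
for morpheme, tag in pairs:
    print(f"{morpheme:12} {tag}")

print("+".join(m for m, _ in pairs))  # flowerpower+af+stam+eling+en
```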

17 Shallow linguistic analysis
(word / morphology / POS-tag / SP-tag table as on slide 3)

18 Part-of-Speech Tagging
• Trained and evaluated on CGN + STIL
• Some experimental results
  - Contextual + orthographic features: 96.6% (uw 82.5%)
  - + morphological information: 97.2% (uw 86.9%)
    • Tags of morphemes
    • Lemma
    • Flection tag

19 Shallow linguistic analysis
(word / morphology / POS-tag / SP-tag table as on slide 3)
• 89.5% tagging accuracy
• 87.4 F-score

20 System for morpho-syntactic analysis
• Morphological analysis: ± 5 w/s
• Tagging + Phrase Chunking: ± 450 w/s
• Used to annotate entire Mediargus corpus
  - Morphological analysis (± 2B morphemes)
  - Part-of-speech tags
  - Phrase chunks
::demo:: http://www.cnts.ua.ac.be/flavor

21 Language Modeling
• Problem 1: input is not a sequence of words, but a sequence of morphemes
• Problem 2: scoring hypotheses using shallow linguistic annotation

22 Language Modeling
• Problem 1: input is a sequence of morphemes
    Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan
• Disambiguate between word and morpheme boundaries
• Use morphologically analyzed Mediargus as training material
• Approach: morpheme sequence tagging

23 Language Modeling
morpheme  tag
nu        NWB
tref      V
t         INFLtWB
de        EWB
niets     B
vermoed   V
end       A|BV.
e         INFLPWB
pool      N
reiziger  NWB
vuil      A
nis       N|A.
belt      N
en        INFLmWB
tussen    BWB
de        EWB
ijs       N
berg      N
en        INFLmWB
aan       PWB
.         .

24 Language Modeling
• Problem 1: input is a sequence of morphemes
    Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan
    [w - nu] [w tref t] [w de] [w niets vermoed end e] …
• Word boundaries: 97.2%
• Morpheme boundaries: 93.1%
• F-score of 92.3%
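
Once every morpheme carries a tag, recovering words is a matter of closing a word wherever the tag signals a word boundary. The sketch below assumes that a tag ending in `WB` marks the last morpheme of a word, which is how the tags in the morpheme sequence tagging example read; the exact tag spellings (e.g. `INFLtWB`) and the function name are assumptions.

```python
def group_into_words(tagged_morphemes):
    """Close a word after every morpheme whose tag ends in 'WB'."""
    words, current = [], []
    for morpheme, tag in tagged_morphemes:
        current.append(morpheme)
        if tag.endswith("WB"):
            words.append(current)
            current = []
    if current:  # trailing morphemes without a closing boundary tag
        words.append(current)
    return words


# First part of the example sentence with its morpheme tags.
tagged = [
    ("nu", "NWB"),
    ("tref", "V"),
    ("t", "INFLtWB"),
    ("de", "EWB"),
    ("niets", "B"),
    ("vermoed", "V"),
    ("end", "A|BV."),
    ("e", "INFLPWB"),
]

print(" ".join(f"[w {' '.join(w)}]" for w in group_into_words(tagged)))
# [w nu] [w tref t] [w de] [w niets vermoed end e]
```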

25 Language Modeling
• (Big) remaining problem:
  - aanlopen ->
  - gebracht ->
  - But not: aan+lop+en or ge+bracht or aan+loop+en en aan+loop+en ge+breng ge+bracht
  - Information not available in CELEX
  - But: Orthography closest guess True pronounced morphemes quite workable Decent accuracy on harder task ?? Regular expression + grapheme-to-phoneme conversion
  - Not yet integrated in recognizer

26 Language Modeling
• Turn morphemes into word forms (+ reverse alternation)
  - Re-analyze word form
• Tag + shallow parse sequence of words
::demo:: www.cnts.ua.ac.be/flavor

27 Language Modeling
• Problem 2: scoring hypotheses
  - Option 1: n-gram models trained on annotated Mediargus corpus
    • Morpheme n-grams: de niets vermoed end <e>
    • Tagged-morpheme n-grams: EWB B V A|BV. <INFLPWB>
    • Word n-grams
    • Part-of-Speech tag n-grams
    • Shallow Parsing tag n-grams
    • Combination: de@LID@NP <kan@WW@NP> or <kan@N1@NP>
  - Interpolate LM scores
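
"Interpolate LM scores" is read here as linear interpolation: a weighted sum of the probabilities that several models assign to the same event, with weights summing to 1. That reading, the weights and the probabilities below are all assumptions for illustration; the slides do not say how the scores were combined.

```python
import math


def interpolate(probabilities, weights):
    """Linear interpolation: sum of weight_i * probability_i."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probabilities))


# Hypothetical probabilities assigned to the same event by a word model,
# a part-of-speech tag model and a shallow parsing tag model.
p_word, p_pos, p_chunk = 0.02, 0.30, 0.50
p = interpolate([p_word, p_pos, p_chunk], [0.6, 0.3, 0.1])
print(round(p, 3))  # 0.152
```

In practice the weights would be tuned on held-out data rather than set by hand.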

28 Language Modeling
• Problem 2: scoring hypotheses
  - Option 2: classifier “certainty”
    • Use maximum entropy classifiers, which can output proper probabilities
    • Quite informative for WSJ LM-task
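
A maximum entropy classifier yields proper probabilities because it normalizes exponentiated feature scores over all classes. The sketch below shows that computation for the ambiguous word "kan" after a determiner; the feature names and weights are invented for illustration and are not taken from any trained model.

```python
import math


def maxent_probabilities(active_features, weights, classes):
    """p(y | x) = exp(sum of weights for (feature, y)) / Z(x)."""
    scores = {
        y: sum(weights.get((f, y), 0.0) for f in active_features)
        for y in classes
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


# Hypothetical weights for (feature, class) pairs.
weights = {
    ("prev_tag=LID", "N1"): 2.0,
    ("prev_tag=LID", "WW"): -1.0,
    ("word=kan", "N1"): 0.5,
    ("word=kan", "WW"): 1.5,
}

probs = maxent_probabilities(["prev_tag=LID", "word=kan"], weights, ["N1", "WW"])
print(round(probs["N1"], 3))  # 0.881
```

The resulting distribution sums to 1, so the probability of the chosen tag can be used directly as a "certainty" score for a hypothesis.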

29 Language Modeling
• Problem 2: scoring hypotheses
  - Option 3: Maxent classifier as LM
    • Information source: surrounding context (words, morphemes, linguistic annotation)
    • To classify: word (or morpheme)
    • VERY slow training time

30 Language Modeling: circumstantial evidence
• Wall-Street Journal: n-gram rescoring
  - VP set:
  - NVP set:
  8.11% 8.08% 7.57% 7.74%
  + maxent classifier probabilities + POS 3-grams
• Mediargus: perplexity
  - Word 3-gram: 148.42
  - Morpheme 3-gram: 56.36
  - Tagged Morpheme 3-gram: 53.17
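
For reference, perplexity is the exponential of the average negative log-probability that a model assigns to each token of a test sequence. A minimal sketch, assuming the per-token probabilities are already available (the function name and the example values are illustrative):

```python
import math


def perplexity(probabilities):
    """exp of the average negative log-probability per token."""
    if not probabilities:
        raise ValueError("need at least one token probability")
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


# A model that gives every token probability 1/4 has perplexity 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 2))  # 4.0
```

Because the average is taken per token, a word-level figure and a morpheme-level figure are normalized over different units.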

31 Limitations
• Morpheme representation problematic for integration in recognizer
• Efficiency as LM not yet properly evaluated for Dutch

32 Available Tools & Data
Tools:
  - All-in-one morpho-syntactic analyzer for Dutch
    • Morphological analyzer
    • Part-of-Speech tagger
    • Phrase Chunker
  - Word vs Morpheme Boundary detector for Dutch
  - Promising outlook for Dutch N-gram LM using extra annotation layers
Data:
  - Adjusted version of CELEX (incl. segmented orthographic forms)
  - 2.7M word database of morphologically analyzed words
  - Morphologically analyzed, tagged & shallow-parsed Mediargus