CPSC 503 Computational Linguistics Lecture 3 Giuseppe Carenini

CPSC 503 Computational Linguistics Lecture 3 Giuseppe Carenini 10/3/2020 CPSC 503 Winter 2008 1


• Subscribe to mailing list
• Some more Intros
• NLP@UBC

Introductions
• Your Name
• Previous experience in NLP?
• Why are you interested in NLP?
• Are you thinking of NLP as your main research area? If not, what else do you want to specialize in?
• Anything else…

NLP research at UBC
TOPICS
• Generation and Summarization of Evaluative Text (e.g., customer reviews)
• Summarization of conversations (emails, blogs, meetings)
PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students
SUPPORT: NSERC, Google, BObjects (now SAP), MSResearch

Linguistic Knowledge (English)
• Morphology
• Syntax
• Semantics
• Pragmatics
• Discourse and Dialogue
Formalisms and associated Algorithms
• State Machines (no prob.)
– Finite State Automata (and Regular Expressions)
– Finite State Transducers
• Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners

Computational tasks in Morphology
• Recognition: recognize whether a string is an English/… word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
– e.g., bought <-> buy +V +PAST-PART
– e.g., bought <-> buy +V +PAST
• Stemming: word -> stem

Today Sept 15
• Finite State Transducers (FSTs) and Morphological Parsing
• Stemming (Porter Stemmer)

FST definition
• $Q$: a finite set of states
• $I$, $O$: input and output alphabets (which may include ε)
• $\Sigma$: a finite alphabet of complex symbols $i{:}o$, with $i \in I$ and $o \in O$
• $q_0$: the start state
• $F$: a set of accept/final states ($F \subseteq Q$)
• A transition relation $\delta$ that maps $Q \times \Sigma$ to $2^Q$
E.g., $|Q| = 3$; $I = \{a, b, c, ε\}$; $O = \{a, b\}$; $|\Sigma| = ?$; $0 \le |\delta| \le ?$
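The definition above can be turned into a small data structure. The following is a minimal sketch, not a reference implementation: the class name `FST`, the method `transduce`, the empty string as a stand-in for ε, and the toy machine are all assumptions made for illustration. It also assumes the machine has no cycles made only of ε-input arcs, since those would make the search loop forever.

```python
EPS = ""  # hypothetical stand-in for ε

class FST:
    """Nondeterministic FST: delta maps (state, (i, o)) to a set of states."""

    def __init__(self, states, start, finals, delta):
        self.states = set(states)
        self.start = start
        self.finals = set(finals)
        self.delta = delta  # dict: (state, (i, o)) -> set of next states

    def transduce(self, inp):
        """Return every output string the machine can produce for `inp`."""
        results = set()
        # Each agenda item: (state, position in input, output so far).
        agenda = [(self.start, 0, "")]
        seen = set()
        while agenda:
            item = agenda.pop()
            if item in seen:
                continue
            seen.add(item)
            state, pos, out = item
            if pos == len(inp) and state in self.finals:
                results.add(out)
            for (src, (i, o)), targets in self.delta.items():
                if src != state:
                    continue
                if i == EPS:
                    step = 0            # ε-input arc consumes nothing
                elif pos < len(inp) and inp[pos] == i:
                    step = 1            # arc consumes one input symbol
                else:
                    continue
                for target in targets:
                    agenda.append((target, pos + step, out + o))
        return results

# Toy machine over I = {a, b}, O = {a, b}: it requires a leading "a"
# and then swaps a and b on every symbol.
toy = FST(
    states={0, 1},
    start=0,
    finals={1},
    delta={
        (0, ("a", "b")): {1},
        (1, ("a", "b")): {1},
        (1, ("b", "a")): {1},
    },
)

print(toy.transduce("aab"))  # one output: "bba"
print(toy.transduce("b"))    # no accepting path: empty set
```

Returning a set of outputs, rather than a single string, mirrors the fact that $\delta$ maps into $2^Q$ and the machine may be nondeterministic.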

FST can be used as…
• Translators: input one string from $I$, output another from $O$ (or vice versa)
• Recognizers: input a string from $I \times O$
• Generators: output a string from $I \times O$
Terminology warning!
E.g., if $I = \{a, b, c, ε\}$; $O = \{a, b\}$; …

FST: inflectional morphology of plural
[Figure: FST with paths for some regular nouns and some irregular nouns; arc labels include o:i]
Notes: X -> X:X; arc labels read lexical:surface

Examples (tapes to fill in)
• lexical: (blank); surface: m i c e
• lexical: c a t +N +PL; surface: (blank)

Computational Morphology: Problems/Challenges
1. Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)
2. Spelling changes: may occur when two morphemes are combined, e.g., butterfly + -s -> butterflies

Ambiguity: more complex example
• What's the right parse for Unionizable?
– Union-ize-able
– Un-ion-ize-able
• Each would represent a valid path through an FST for derivational morphology.
• Both Adj…

Dealing with Morphological Ambiguity
• Find all the possible outputs (all paths) and return them all (without choosing)
• Then use part-of-speech tagging to choose: look at the neighboring words
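"Return all paths without choosing" can be illustrated with a plain recursive segmenter rather than a real FST. This is only a sketch under assumptions: the function name `segmentations` and the tiny surface-morph inventory (with "iz" standing in for -ize before -able) are invented for the example.

```python
def segmentations(word, morphs):
    """Return every way to split `word` into pieces drawn from `morphs`."""
    if not word:
        return [[]]  # one way to segment the empty string: no pieces
    results = []
    for k in range(1, len(word) + 1):
        prefix = word[:k]
        if prefix in morphs:
            for rest in segmentations(word[k:], morphs):
                results.append([prefix] + rest)
    return results

# Hypothetical surface morphs; "iz" is -ize with its final e dropped.
MORPHS = {"un", "ion", "union", "iz", "able"}

for parse in segmentations("unionizable", MORPHS):
    print("-".join(parse))
# un-ion-iz-able
# union-iz-able
```

Both analyses come back; picking between them is left to a later stage, as the slide suggests.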

(2) Spelling Changes
When morphemes are combined inflectionally, the spelling at the boundaries may change.
Examples
• E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
• Y-replacement: when -s or -ed is added to a word ending in -y, -y changes to -ie or -i respectively (e.g., try, butterfly)
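The two rules above can be written directly as string operations. This sketch follows the slide's simplified wording; the function names `add_s` and `add_ed` are made up for the example. Note that the Y-replacement rule as stated would also fire on words like "play", where English keeps the y, so a fuller version would need to check for a preceding consonant.

```python
import re

def add_s(word):
    """Attach -s, applying E-insertion and Y-replacement as stated above."""
    if re.search(r"(s|z|sh|ch|x)$", word):
        return word + "es"          # E-insertion: box -> boxes
    if word.endswith("y"):
        return word[:-1] + "ies"    # Y-replacement: butterfly -> butterflies
    return word + "s"               # no spelling change: cat -> cats

def add_ed(word):
    """Attach -ed, applying Y-replacement as stated above."""
    if word.endswith("y"):
        return word[:-1] + "ied"    # try -> tried
    return word + "ed"

for w in ["kiss", "waltz", "bush", "watch", "box", "try", "butterfly"]:
    print(w, "->", add_s(w))
```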

Solution: Multi-Tape Machines
• Add intermediate tape
• Use the output of one tape machine as the input to the next
• Add intermediate symbols
– ^ morpheme boundary
– # word boundary

Multi-Level Tape Machines
FST-1, FST-2
• FST-1 translates between the lexical and the intermediate level
• FST-2 handles the spelling changes (due to one rule) to the surface tape

FST-1 for inflectional morphology of plural (Lexical <-> Intermediate)
[Figure: FST-1 with paths for some regular nouns and some irregular nouns; arc labels include +PL:^s#, #, o:i, +PL:^, ε:s, ε:#]

Example (intermediate tapes to fill in)
• lexical: f o x +N +PL; intermediate: (blank)
• lexical: m o u s e +N +PL; intermediate: (blank)

FST-2 for E-insertion (Intermediate <-> Surface)
E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x, as in fox^s# <-> foxes
[Figure: FST-2 state diagram; arc labels include #:ε]
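The lexical -> intermediate -> surface cascade can be imitated with two ordinary functions. This is a rough stand-in for the two transducers, not the machines in the figures: the function names, the `+N+SG` tag, and the decision to cover regular nouns only are assumptions for the example (irregular nouns such as mouse are not handled).

```python
import re

def lexical_to_intermediate(lexical):
    """Stand-in for FST-1 on regular nouns: 'fox+N+PL' -> 'fox^s#'."""
    if lexical.endswith("+N+PL"):
        return lexical[: -len("+N+PL")] + "^s#"
    if lexical.endswith("+N+SG"):       # assumed singular tag
        return lexical[: -len("+N+SG")] + "#"
    raise ValueError("unsupported lexical form: " + lexical)

def intermediate_to_surface(intermediate):
    """Stand-in for FST-2: apply E-insertion, then erase ^ and #."""
    # Insert e between a stem ending in s, z, sh, ch, x and the suffix s.
    with_e = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", intermediate)
    return with_e.replace("^", "").replace("#", "")

for lex in ["fox+N+PL", "cat+N+PL", "fox+N+SG"]:
    mid = lexical_to_intermediate(lex)
    print(lex, "->", mid, "->", intermediate_to_surface(mid))
```

Feeding the output of the first function into the second is the "output of one tape machine as the input to the next" idea; the E-insertion pattern only fires when the morpheme boundary is followed by s and the word boundary, so a form like box^ing# passes through unchanged apart from losing its boundary symbols.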

Examples (surface tapes to fill in)
• intermediate: f o x ^ s #; surface: (blank)
• intermediate: b o x ^ i n g #; surface: (blank)

Where are we?
[Figure: overview diagram, not recoverable from the extracted text]

Final Scheme: Part 1
[Figure: not recoverable from the extracted text]

Final Scheme: Part 2
[Figure: not recoverable from the extracted text]

Intersection(FST1, FST2)
• States of FST1 and FST2: $Q_1$ and $Q_2$
• States of intersection: $Q_1 \times Q_2$
• Transitions of FST1 and FST2: $\delta_1$, $\delta_2$
• Transitions of intersection: $\delta_3$
For all $i, j, n, m, a, b$: $\delta_3((q_{1i}, q_{2j}), a{:}b) = (q_{1n}, q_{2m})$ iff
– $\delta_1(q_{1i}, a{:}b) = q_{1n}$ AND
– $\delta_2(q_{2j}, a{:}b) = q_{2m}$
[Diagram: an arc a:b from $q_{1i}$ to $q_{1n}$ and an arc a:b from $q_{2j}$ to $q_{2m}$ give an arc a:b from $(q_{1i}, q_{2j})$ to $(q_{1n}, q_{2m})$]
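The transition rule for intersection is a product construction, sketched below. The dictionary encoding of $\delta$ (keys are `(state, (a, b))`, values are the next state), the function name `intersect`, and the two toy transition tables are assumptions for illustration. Start and final states are not covered on the slide and are left out here.

```python
def intersect(delta1, delta2):
    """Build delta3: keep an a:b arc only if both machines have it."""
    delta3 = {}
    for (q1, label1), n1 in delta1.items():
        for (q2, label2), n2 in delta2.items():
            if label1 == label2:
                # Paired source states move to paired target states.
                delta3[((q1, q2), label1)] = (n1, n2)
    return delta3

# Toy deterministic transition tables (hypothetical).
d1 = {(0, ("a", "b")): 1, (1, ("c", "c")): 1}
d2 = {("x", ("a", "b")): "y", ("y", ("a", "a")): "y"}

print(intersect(d1, d2))
# {((0, 'x'), ('a', 'b')): (1, 'y')}
```

Only the a:b arc survives, because it is the only label the two tables share.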

Composition(FST1, FST2)
• States of FST1 and FST2: $Q_1$ and $Q_2$
• States of composition: $Q_1 \times Q_2$
• Transitions of FST1 and FST2: $\delta_1$, $\delta_2$
• Transitions of composition: $\delta_3$
For all $i, j, n, m, a, b$: $\delta_3((q_{1i}, q_{2j}), a{:}b) = (q_{1n}, q_{2m})$ iff there exists $c$ such that
– $\delta_1(q_{1i}, a{:}c) = q_{1n}$ AND
– $\delta_2(q_{2j}, c{:}b) = q_{2m}$
[Diagram: an arc a:c from $q_{1i}$ to $q_{1n}$ and an arc c:b from $q_{2j}$ to $q_{2m}$ give an arc a:b from $(q_{1i}, q_{2j})$ to $(q_{1n}, q_{2m})$]
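Composition differs from intersection only in how labels are matched: the output symbol of the first machine must equal the input symbol of the second. A minimal sketch follows, with the same assumed dictionary encoding of $\delta$ and a hypothetical function name `compose`. It ignores ε arcs, which a real implementation must treat specially, and it stores sets of targets because different intermediate symbols $c$ can lead to the same a:b label.

```python
def compose(delta1, delta2):
    """Build delta3: a:c in machine 1 plus c:b in machine 2 gives a:b."""
    delta3 = {}
    for (q1, (a, c1)), n1 in delta1.items():
        for (q2, (c2, b)), n2 in delta2.items():
            if c1 == c2:
                # The shared middle symbol disappears from the label.
                delta3.setdefault(((q1, q2), (a, b)), set()).add((n1, n2))
    return delta3

# Toy transition tables (hypothetical).
t1 = {(0, ("a", "x")): 1, (1, ("b", "y")): 1}
t2 = {("p", ("x", "b")): "q", ("q", ("z", "a")): "q"}

print(compose(t1, t2))
# {((0, 'p'), ('a', 'b')): {(1, 'q')}}
```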

FSTs in Practice
• Install an FST package… (pointers)
• Describe your "formal language" (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer)
• Your specification is compiled into a single FST
Ref: "Finite State Morphology" (Beesley and Karttunen, 2003, CSLI Publications)
Complexity/Coverage:
• FSTs for the morphology of a natural language may have $10^5$ – $10^7$ states and arcs
• Spanish (1996): $46 \times 10^3$ stems; $3.4 \times 10^6$ word forms
• Arabic (2002?): $131 \times 10^3$ stems; $7.7 \times 10^6$ word forms

Other important applications of FST in NLP
From segmenting words into morphemes to…
• Tokenization:
– finding word boundaries in text (?!) … maxmatch
– finding sentence boundaries: punctuation… but "." is ambiguous; look at the example in Fig. 3.22
• Shallow syntactic parsing: e.g., find only noun phrases
• Phonological Rules… (Chpt. 11?)
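The "maxmatch" idea for finding word boundaries is greedy longest-match against a word list. The sketch below assumes a hypothetical function name `max_match`, a toy lexicon, and a fallback of emitting a single character when nothing in the lexicon matches.

```python
def max_match(text, lexicon):
    """Greedy longest-match segmentation of an unspaced string."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to one character.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

LEXICON = {"the", "table", "down", "there", "theta", "bled", "own"}

print(max_match("thetabledownthere", LEXICON))
# ['theta', 'bled', 'own', 'there']
```

The example shows why the slide marks word segmentation with "(?!)": the greedy choice of "theta" blocks the intended reading "the table down there".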

Computational tasks in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
– e.g., bought <-> buy +V +PAST-PART
– e.g., bought <-> buy +V +PAST
• Stemming: word -> stem

Stemmer
• E.g., the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules:
• (condition) S1 -> S2
– ATIONAL -> ATE (relational -> relate)
– (*v*) ING -> ε if stem contains vowel (motoring -> motor)
• Cascade of rules applied to: computerization
– ization -> -ize: computerize
– ize -> ε: computer
• Errors occur:
– organization -> organ, doing -> doe, university -> universe
Code freely available in most languages: Python, Java, …
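The "(condition) S1 -> S2" rule format and the cascade can be sketched with a handful of rules. This is a toy illustration of the rule style only, not the Porter algorithm: the function names, the grouping of rules into two stages, and the simplified conditions are assumptions, and the real algorithm has many more rules and stricter conditions.

```python
import re

def has_vowel(stem):
    """The (*v*) condition: the stem contains a vowel."""
    return re.search(r"[aeiou]", stem) is not None

def always(stem):
    return True

# Each stage is a list of (S1, S2, condition) rules.
# Within a stage only the first matching rule fires.
STAGES = [
    [("ational", "ate", always),
     ("ization", "ize", always),
     ("ing", "", has_vowel)],
    [("ize", "", always)],
]

def toy_stem(word):
    """Apply the stages in order, feeding each result to the next stage."""
    for stage in STAGES:
        for suffix, replacement, condition in stage:
            stem = word[: -len(suffix)]
            if word.endswith(suffix) and condition(stem):
                word = stem + replacement
                break
    return word

for w in ["relational", "motoring", "computerization", "sing"]:
    print(w, "->", toy_stem(w))
# relational -> relate
# motoring -> motor
# computerization -> computer
# sing -> sing
```

The last example shows the condition at work: stripping -ing from "sing" would leave a stem with no vowel, so the rule does not fire.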

Stemming mainly used in Information Retrieval
1. Run a stemmer on the documents to be indexed
2. Run a stemmer on users' queries
3. Compute similarity between queries and documents (based on the stems they contain)
Seems to work especially well with smaller documents
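The three steps can be sketched end to end. Everything here is an assumption made for illustration: `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer, and Jaccard overlap of stem sets is just one simple choice of similarity measure; the slide does not say which measure is used.

```python
def crude_stem(word):
    """Naive stand-in for a stemmer: strip one common suffix."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_set(text):
    """Steps 1 and 2: reduce a document or query to its set of stems."""
    return {crude_stem(w) for w in text.lower().split()}

def similarity(query, document):
    """Step 3: Jaccard overlap between the two stem sets."""
    q, d = stem_set(query), stem_set(document)
    union = q | d
    return len(q & d) / len(union) if union else 0.0

print(similarity("transducer word", "transducers map words"))
```

Without stemming, the query and the document in this example share no tokens at all; after stemming they share two of three distinct stems.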

Porter as an FST
• The original exposition of the Porter stemmer did not describe it as a transducer but…
– Each stage is a separate transducer
– The stages can be composed to get one big transducer

Next Time
• Read handout
– Probability
– Stats
– Information theory
• Next Lecture:
– finish Chpt 3, 3.10-11
– Start Probabilistic Models for NLP (Chpt. 4, 4.1 – 4.2 and 5.9!)