CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini

CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini 9/19/2021 CPSC 503 Winter 2012 1


Today Jan 10
• Questionnaire
• Brief check of some background knowledge
• English Morphology
• FSA and Morphology
• Start: Finite State Transducers (FST) and Morphological Parsing/Generation

Finite state machines
– Regular Expressions & Finite State Automata: 7.1
– Finite State Transducers: 3.4
– Hidden-Markov Models: 5.7
Basic Probability, Bayesian Statistics and Information Theory
– Conditional Probability: 8.1
– Bayesian Networks: 6.6
– Entropy: 6.1
Machine Learning
– Supervised Classification (e.g., Decision Trees): 6.1
– Unsupervised Learning (e.g., clustering): 5.3
– Graphical Models: 3.7
Richer Formalisms
– Context-Free Grammar: 5.2
– First-Order Logics: 5.8
Programming
– Java: 7.4
– Python: 7.8
Dynamic Programming: 6.3
Search Algorithms: 6.8
Linguistics: 4.4
Linear Alg.: 7.9

Today Jan 10
• Brief check of some background knowledge
• English Morphology
• FSA and Morphology
• Start: Finite State Transducers (FST) and Morphological Parsing/Generation

Knowledge-Formalisms Map (including probabilistic formalisms)
Knowledge: Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue
Formalisms:
• State Machines (and prob. versions) (Finite State Automata, Finite State Transducers, Markov Models)
• Rule systems (and prob. versions) (e.g., (Prob.) Context-Free Grammars)
• Logical formalisms (First-Order Logics, Prob. Logics)
• AI planners (MDP: Markov Decision Processes)
Side label: Machine Learning

Today
Knowledge: (English) Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue
Formalisms:
• State Machines (no prob.)
  – Finite State Automata (and Regular Expressions)
  – Finite State Transducers
• Rule systems (and prob. version) (e.g., (Prob.) Context-Free Grammars)
• Logical formalisms (First-Order Logics)
• AI planners

??
[Figure: finite-state automaton diagrams with numbered states and arcs labeled b, a and !; the layout is not recoverable from the extraction]

??
/CPSC 50[34]/
/^([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)/
/[0-9]+(\.[0-9]+){3}/
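A minimal sketch of how the three patterns on this slide could be tried out with Python's `re` module. It assumes the word-boundary escape `\b` and an escaped dot `\.` are what the slide intends; the helper name `matches` and the sample strings are hypothetical.

```python
import re

# The three slide patterns, written as Python raw strings.
COURSE = re.compile(r"CPSC 50[34]")                            # CPSC 503 or CPSC 504
HEADER = re.compile(r"^([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)")    # line starts with a header word
DOTTED = re.compile(r"[0-9]+(\.[0-9]+){3}")                    # four dot-separated numbers


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


print(matches(COURSE, "CPSC 503 Winter 2012"))   # course code is present
print(matches(HEADER, "Subject: lecture 2"))     # header word followed by a boundary
print(matches(HEADER, "Fromage"))                # no word boundary after "From"
print(matches(DOTTED, "host 192.168.0.1 up"))    # looks like a dotted quad
```

Note that the last pattern only checks the shape of the string (digits separated by three dots), not whether each number is in a valid range.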

Fundamental Relations
• FSA implement (generate and recognize) Regular Expressions
• Regular Expressions describe Many Linguistic Phenomena
• FSA model Many Linguistic Phenomena

Second Usage of Reg. Exp.: Text Searching/Editing
Find me all instances of the determiner “the” in an English text.
– To count them
– To substitute them with something else
Text: The other cop went to the bank but there were no people there.
You try:
/the/
/[tT]he/
/\bthe\b/
/\b[tT]he\b/
s/\b([tT]he|[Aa]n?)\b/DET/
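The "you try" exercise can be checked with a short sketch in Python's `re` module, assuming `\b` (word boundary) is the intended escape in the later patterns. The variable names are illustrative only.

```python
import re

TEXT = "The other cop went to the bank but there were no people there."


def count(pattern, text=TEXT):
    """Count non-overlapping matches of the pattern in the text."""
    return len(re.findall(pattern, text))


naive = count(r"the")             # also matches inside "other" and "there"
any_case = count(r"[tT]he")       # adds the sentence-initial "The"
whole_lower = count(r"\bthe\b")   # only the standalone lowercase "the"
whole_word = count(r"\b[tT]he\b") # both standalone determiners

# Substitution form of the last slide pattern: replace determiners with DET.
tagged = re.sub(r"\b([tT]he|[Aa]n?)\b", "DET", TEXT)

print(naive, any_case, whole_lower, whole_word)
print(tagged)
```

The progression shows the usual trade-off: the looser patterns produce false positives ("other", "there"), and the tighter ones are needed to avoid missing "The" while excluding substrings.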

Today
Knowledge: (English) Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue
Formalisms:
• State Machines (no prob.)
  – Finite State Automata (and Regular Expressions)
  – Finite State Transducers
• Rule systems (and prob. version) (e.g., (Prob.) Context-Free Grammars)
• Logical formalisms (First-Order Logics)
• AI planners

English Morphology
Def.: The study of how words are formed from minimal meaning-bearing units (morphemes)
• We can usefully divide morphemes into two classes
  – Stems: the core meaning-bearing units
  – Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
Examples: unhappily, ……………

Word Classes
• For now word classes: nouns, verbs, adjectives and adverbs.
• We’ll go into the gory details in Ch. 5
• Word class determines to a large degree the way that stems and affixes combine

English Morphology
• We can also divide morphology up into two broad classes
  – Inflectional
  – Derivational

Inflectional Morphology
• The resulting word:
  – Has the same word class as the original
  – Serves a grammatical/semantic purpose different from the original

Nouns, Verbs and Adjectives (English)
• Nouns are simple (not really)
  – Markers for plural and possessive
• Verbs are only slightly more complex
  – Markers appropriate to the tense of the verb and to the person
• Adjectives
  – Markers for comparative and superlative

Regulars and Irregulars
• Some words misbehave (refuse to follow the rules)
  – Mouse/mice, goose/geese, ox/oxen
  – Go/went, fly/flew
• Regulars…
  – Walk, walks, walking, walked
• Irregulars
  – Eat, eats, eating, ate, eaten
  – Catch, catches, catching, caught
  – Cut, cuts, cutting, cut

Derivational Morphology
• Derivational morphology is the messy stuff that no one ever taught you.
  – Changes of word class
  – Less productive (-ant V -> N only with V of Latin origin!)

Derivational Examples
• Verb/Adj to Noun
  – -ation: computerize -> computerization
  – -ee: appointee
  – -er: killer
  – -ness: fuzzy -> fuzziness

Derivational Examples
• Noun/Verb to Adj
  – -al: computational
  – -able: embraceable
  – -less: clueless

Compute
• Many paths are possible…
• Start with compute
  – Computer -> computerize -> computerization
  – Computation -> computational
  – Computer -> computerize -> computerizable
  – Compute -> computee

Summary
Knowledge: (English) Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue
Formalisms:
• State Machines (no prob.)
  – Finite State Automata (and Regular Expressions)
  – Finite State Transducers
• Rule systems (and prob. version) (e.g., (Prob.) Context-Free Grammars)
• Logical formalisms (First-Order Logics)
• AI planners

FSAs and Morphology
• GOAL 1: recognize whether a string is an English word
• PLAN:
  1. First we’ll capture the morphotactics (the rules governing the ordering of affixes in a language)
  2. Then we’ll add in the actual stems

FSA for Portion of Noun Inflectional Morphology
[Figure: FSA diagram; not recoverable from the extraction]
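Since the diagram for this slide did not survive, here is a rough sketch of what an FSA for a portion of noun inflection might look like. The three states, the arc labels and the tiny lexicon are all assumptions made for illustration, not the slide's actual automaton.

```python
# Hypothetical mini-lexicon: the stems that get "added in" to the morphotactics.
REG_NOUN = {"cat", "dog", "fox"}
IRREG_SG_NOUN = {"goose", "mouse"}
IRREG_PL_NOUN = {"geese", "mice"}

# Morphotactics as (state, morpheme class) -> next state. State names are assumed.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
ACCEPT = {"q1", "q2"}


def classify(morpheme):
    """Map a morpheme to its class label, or None if it is unknown."""
    if morpheme in REG_NOUN:
        return "reg-noun"
    if morpheme in IRREG_SG_NOUN:
        return "irreg-sg-noun"
    if morpheme in IRREG_PL_NOUN:
        return "irreg-pl-noun"
    if morpheme == "-s":
        return "plural-s"
    return None


def accepts(morphemes):
    """Run the FSA over a sequence of morphemes and report acceptance."""
    state = "q0"
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, classify(morpheme)))
        if state is None:
            return False
    return state in ACCEPT


print(accepts(["cat", "-s"]))    # regular noun plus plural -s
print(accepts(["geese", "-s"]))  # irregular plural cannot take -s
```

Under these assumptions the automaton happily accepts fox + -s, which would be spelled "foxs"; it only captures morpheme ordering and says nothing about spelling.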

Adding the Stems
But it does not express that:
• Reg. nouns ending in -s, -z, -sh, -ch, -x take -es (kiss, waltz, bush, rich, box)
• Reg. nouns ending in -y preceded by a consonant change the -y to -i

Small Fragment of V and N Derivational Morphology
[Figure: FSA diagram; only the class examples below are recoverable]
• [noun_i] e.g. hospital
• [adj_al] e.g. formal
• [adj_ous] e.g. arduous
• [verb_j] e.g. speculate
• [verb_k] e.g. conserve

GOAL 2: Morphological Parsing/Generation (vs. Recognition)
• Recognition is usually not quite what we need.
  – Usually given a word we need to find the stem and its class and morphological features (parsing)
  – Or we have a stem and its class and morphological features and we want to produce the word (production/generation)
• Examples (parsing)
  – From “cats” to “cat +N +PL”
  – From “lies” to ……

Computational problems in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
  – e.g., lies <-> lie +N +PL, lie +V +3SG
• Stemming: word -> stem

Finite State Transducers
• FSA cannot help….
• The simple story
  – Add another tape
  – Add extra symbols to the transitions
  – On one tape we read “cats”, on the other we write “cat +N +PL”

FSTs
[Figure: FST diagram with the two directions labeled "parsing" and "generation"]

(Simplified) FST formal definition (you can skip 3.4.1 unless you want to work on FST)
• Q: a finite set of states
• I, O: input and output alphabets (which may include ε)
• Σ: a finite alphabet of complex symbols i:o, with i ∈ I and o ∈ O
• q0: the start state
• F: a set of accept/final states (F ⊆ Q)
• A transition relation δ that maps Q × Σ to 2^Q

FST can be used as…
• Translators: input one string from I, output another from O (or vice versa)
• Recognizers: input a string from I × O
• Generators: output a string from I × O

Simple Example
Arcs: c:c a:a t:t +N:ε +PL:s +SG:ε
Transitions (as a translator):
• c:c means read a c on one tape and write a c on the other (or vice versa)
• +N:ε means read a +N symbol on one tape and write nothing on the other (or vice versa)
• +PL:s means read +PL and write an s (or vice versa)

Examples (as a translator)
Arcs: c:c a:a t:t +N:ε +PL:s +SG:ε
• Parsing: surface "c a t s" -> lexical ……
• Generation: lexical "c a t +N +SG" -> surface ……
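The translator reading of this small FST can be sketched as follows. The arc labels come from the slide; the integer state numbers, the arc ordering and the function name `transduce` are assumptions for illustration.

```python
EPS = ""  # stands for the empty symbol ε

# Arcs as (source state, lexical symbol, surface symbol, target state).
# State numbering is assumed; only the labels appear on the slide.
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+PL", "s", 5),
    (4, "+SG", EPS, 5),
]
FINAL = {5}


def transduce(symbols, lexical_to_surface=True):
    """Return every output sequence the FST pairs with the input sequence.

    With lexical_to_surface=True this is generation; with False it is parsing.
    Arcs whose input side is ε are followed without consuming input.
    """
    results = []

    def walk(state, pos, out):
        if pos == len(symbols) and state in FINAL:
            results.append(out)
        for src, lex, surf, dst in ARCS:
            if src != state:
                continue
            inp, outp = (lex, surf) if lexical_to_surface else (surf, lex)
            emitted = out + ([outp] if outp != EPS else [])
            if inp == EPS:
                walk(dst, pos, emitted)
            elif pos < len(symbols) and symbols[pos] == inp:
                walk(dst, pos + 1, emitted)

    walk(0, 0, [])
    return results


print(transduce(list("cats"), lexical_to_surface=False))  # parsing
print(transduce(["c", "a", "t", "+N", "+SG"]))            # generation
```

The same arc table serves both directions; only which side of each pair is treated as input changes, which is the "or vice versa" in the slide.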

Slightly More Complex Example
States: q0 … q7
Arcs: l:l, i:i, e:e, then two branches: +N:ε followed by +PL:s, and +V:ε followed by +3SG:s
Transitions (as a translator):
• l:l means read an l on one tape and write an l on the other (or vice versa)
• +N:ε means read a +N symbol on one tape and write nothing on the other (or vice versa)
• +PL:s means read +PL and write an s (or vice versa)
• …

Examples (as a translator)
Arcs: l:l, i:i, e:e, then +N:ε followed by +PL:s, or +V:ε followed by +3SG:s
• Parsing: surface "l i e s" -> lexical ……
• Generation: lexical "l i e +V +3SG" -> surface ……

Examples (as a recognizer and a generator)
Arcs: l:l, i:i, e:e, then +N:ε followed by +PL:s, or +V:ε followed by +3SG:s
• Recognizer: lexical "l i e +V +3SG", surface "l i e s"
• Generator: lexical ……, surface ……

Examples
• surface "m i c e" -> lexical ……
• lexical "c a t +N +PL" -> surface ……

FST: inflectional morphology of plural
[Figure: FST with paths for some regular nouns and some irregular nouns, including an o:i arc]
Notes: X stands for X:X; arc labels read lexical:surface

Computational Morphology: Problems/Challenges
1. Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)
2. Spelling changes: may occur when two morphemes are combined, e.g. butterfly + -s -> butterflies

Deal with Morphological Ambiguity
• Find all the possible outputs (all paths) and return them all (without choosing)
• Then use part-of-speech tagging to choose…… look at the neighboring words
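"Return all paths" can be sketched with the lie/lies transducer from the earlier slides. The arc labels are from those slides; which numbered state each branch passes through is an assumption, as is the helper name `parses`.

```python
# state -> list of (lexical symbol, surface string, next state); "" plays the role of ε.
# The assignment of q4..q7 to the noun and verb branches is assumed.
ARCS = {
    "q0": [("l", "l", "q1")],
    "q1": [("i", "i", "q2")],
    "q2": [("e", "e", "q3")],
    "q3": [("+N", "", "q4"), ("+V", "", "q5")],
    "q4": [("+PL", "s", "q6")],
    "q5": [("+3SG", "s", "q7")],
}
FINAL = {"q6", "q7"}


def parses(surface, state="q0", pos=0, lexical=""):
    """Yield every lexical string whose surface side spells out `surface`."""
    if pos == len(surface) and state in FINAL:
        yield lexical
    for lex, surf, nxt in ARCS.get(state, []):
        if surface.startswith(surf, pos):
            # Feature tags are separated by a space, letters are concatenated.
            piece = " " + lex if lex.startswith("+") else lex
            yield from parses(surface, nxt, pos + len(surf), lexical + piece)


print(list(parses("lies")))
```

The analyzer deliberately does not choose between the noun and verb readings; per the slide, that decision is left to a later step such as part-of-speech tagging that looks at neighboring words.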

(2) Spelling Changes
When morphemes are combined inflectionally, the spelling at the boundaries may change.
Examples
• E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
• Y-replacement: when -s or -ed is added to a word ending in -y, the -y changes to -ie or -i respectively (e.g., butterfly, try)

Solution: Multi-Tape Machines
• Add intermediate tape
• Use the output of one tape machine as the input to the next
• Add intermediate symbols
  – ^ morpheme boundary
  – # word boundary

Multi-Level Tape Machines
[Figure: FST-1 and FST-2 connecting the lexical, intermediate and surface tapes]
• FST-1 translates between the lexical and the intermediate level
• FST-2 handles the spelling changes (due to one rule) to the surface tape

FST-1 for inflectional morphology of plural (Lexical <-> Intermediate)
[Figure: FST with paths for some regular nouns and some irregular nouns; recoverable arc labels: +PL:^s#, #, o:i, +PL:^, ε:s, ε:#]

Example
• lexical "f o x +N +PL" -> intermediate ……
• lexical "m o u s e +N +PL" -> intermediate ……

FST-2 for E-insertion (Intermediate <-> Surface)
E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x
…as in fox^s# <-> foxes
[Figure: FST-2 diagram; recoverable arc label: #:ε]

Examples
• intermediate "f o x ^ s #" -> surface ……
• intermediate "b o x ^ i n g #" -> surface ……
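The two-stage cascade (lexical to intermediate, then intermediate to surface) can be approximated with plain string rewriting. This is a simplified stand-in rather than a real transducer: the irregular-plural table, the function names, and the choice to write irregular plurals directly onto the intermediate tape are all assumptions.

```python
import re

# Hypothetical lexicon of irregular plurals.
IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese"}


def lexical_to_intermediate(stem, features):
    """Stage 1: attach the boundary symbols ^ (morpheme) and # (word)."""
    if features == ["+N", "+SG"]:
        return stem + "#"
    if features == ["+N", "+PL"]:
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"
        return stem + "^s#"
    raise ValueError("unsupported feature bundle: %r" % (features,))


def intermediate_to_surface(tape):
    """Stage 2: apply E-insertion, then erase the boundary symbols."""
    # Insert e when a stem ending in s, z, sh, ch or x is followed by ^s#.
    tape = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", tape)
    return tape.replace("^", "").replace("#", "")


for stem in ("fox", "cat", "mouse"):
    mid = lexical_to_intermediate(stem, ["+N", "+PL"])
    print(stem, "->", mid, "->", intermediate_to_surface(mid))
print(intermediate_to_surface("box^ing#"))
```

Because the E-insertion rule only fires before ^s#, an intermediate form such as box^ing# passes through with just its boundary symbols removed.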

Where are we?
[Figure: diagram not recoverable from the extraction]

Final Scheme: Part 1
[Figure: diagram not recoverable from the extraction]

Final Scheme: Part 2
[Figure: diagram not recoverable from the extraction]

Intersection(FST1, FST2) = FST3
• States of FST1 and FST2: Q1 and Q2
• States of intersection: Q1 × Q2
• Transitions of FST1 and FST2: δ1, δ2
• Transitions of intersection: δ3
For all i, j, n, m, a, b:
δ3((q1_i, q2_j), a:b) = (q1_n, q2_m) iff
– δ1(q1_i, a:b) = q1_n AND
– δ2(q2_j, a:b) = q2_m
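The product construction in this definition can be sketched directly. The dictionary encoding of δ (keys are (state, (a, b)) pairs) and the example machines are assumptions; like the slide, the sketch treats δ as a deterministic function and ignores ε.

```python
def intersect(delta1, delta2):
    """Build delta3 over paired states: an arc exists only if both machines
    have an arc with the identical a:b label."""
    delta3 = {}
    for (p, label1), p_next in delta1.items():
        for (q, label2), q_next in delta2.items():
            if label1 == label2:
                delta3[((p, q), label1)] = (p_next, q_next)
    return delta3


# Two hypothetical machines; labels are (input, output) pairs standing for a:b.
delta1 = {("p0", ("a", "b")): "p1", ("p1", ("c", "d")): "p0"}
delta2 = {("r0", ("a", "b")): "r1", ("r1", ("x", "y")): "r0"}

print(intersect(delta1, delta2))
```

The result may contain paired states that are unreachable from the paired start state; a fuller implementation would prune those.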

Composition(FST1, FST2) = FST3
• States of FST1 and FST2: Q1 and Q2
• States of composition: Q1 × Q2
• Transitions of FST1 and FST2: δ1, δ2
• Transitions of composition: δ3
For all i, j, n, m, a, b:
δ3((q1_i, q2_j), a:b) = (q1_n, q2_m) iff there exists c such that
– δ1(q1_i, a:c) = q1_n AND
– δ2(q2_j, c:b) = q2_m
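Composition differs from intersection only in how labels are matched: the output side of the first machine must equal the input side of the second. A minimal sketch under the same assumed dictionary encoding (keys are (state, (input, output)) pairs, deterministic, no ε handling):

```python
def compose(delta1, delta2):
    """Build delta3: an a:b arc exists between paired states if, for some
    middle symbol c, machine 1 has an a:c arc and machine 2 has a c:b arc."""
    delta3 = {}
    for (p, (a, c1)), p_next in delta1.items():
        for (q, (c2, b)), q_next in delta2.items():
            if c1 == c2:
                delta3[((p, q), (a, b))] = (p_next, q_next)
    return delta3


# Hypothetical machines: machine 1 maps a to c, machine 2 maps c to b.
delta1 = {("p0", ("a", "c")): "p1"}
delta2 = {("r0", ("c", "b")): "r1", ("r0", ("z", "b")): "r1"}

print(compose(delta1, delta2))
```

The middle symbol c disappears from the composed machine, which is why a cascade of transducers can in principle be collapsed into a single one.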

FSTs in Practice
• Install an FST package…… (pointers)
• Describe your “formal language” (e.g., lexicon, morphotactics and rules) in a Reg. Exp.-like notation (pointer)
• Your specification is compiled into a single FST
Ref: “Finite State Morphology” (Beesley and Karttunen, 2003, CSLI Publications) (pointer)
Complexity/Coverage:
• FSTs for the morphology of a natural language may have 10^5 – 10^7 states and arcs
• Spanish (1996): 46 × 10^3 stems; 3.4 × 10^6 word forms
• Arabic (2002?): 131 × 10^3 stems; 7.7 × 10^6 word forms

Other important applications of FST in NLP
From segmenting words into morphemes to…
• Tokenization:
  – Finding word boundaries in text (?!) …maxmatch
  – Finding sentence boundaries: punctuation… but "." is ambiguous; look at the example in Fig. 3.22
• Shallow syntactic parsing: e.g., find only noun phrases
• Phonological Rules…… (Chpt. 11)
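The maxmatch idea mentioned under tokenization (greedily take the longest dictionary word at each position) can be sketched as below. The lexicon, the function name and the single-character fallback for unknown material are assumptions for illustration.

```python
def max_match(text, lexicon):
    """Greedy longest-match segmentation of an unspaced string."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word is found.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical lexicon.
LEXICON = {"the", "theta", "table", "bled", "down", "own", "there"}

print(max_match("thetabledownthere", LEXICON))
```

With this lexicon the greedy choice of "theta" at the start leads to a segmentation quite different from "the table down there", which illustrates why finding word boundaries earns the "(?!)" on the slide.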

Computational tasks in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
  – e.g., bought <-> buy +V +PAST-PART, buy +V +PAST
• Stemming: word -> stem

Stemmer
• E.g. the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules:
• (condition) S1 -> S2
  – ATIONAL -> ATE (relational -> relate)
  – (*v*) ING -> ε if stem contains a vowel (motoring -> motor)
• Cascade of rules applied to: computerization
  – -ization -> -ize: computerize
  – -ize -> ε: computer
• Errors occur:
  – organization -> organ, university -> universe
Code freely available in most languages: Python, Java, …
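A toy cascade using only the four rewrite rules quoted on this slide gives a feel for how the stages chain. This is not the Porter algorithm itself (which has many more rules and measure-based conditions); the function names and the rule order are assumptions.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def rewrite(word, suffix, replacement):
    """Apply a single unconditional rule S1 -> S2 at the end of the word."""
    if word.endswith(suffix):
        return word[: len(word) - len(suffix)] + replacement
    return word


def strip_ing(word):
    """(*v*) ING -> ε: drop -ing only if the remaining stem contains a vowel."""
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        return word[:-3]
    return word


def toy_stem(word):
    """Run the rules as a cascade: each stage sees the previous stage's output."""
    word = rewrite(word, "ational", "ate")
    word = rewrite(word, "ization", "ize")
    word = strip_ing(word)
    word = rewrite(word, "ize", "")
    return word


for w in ("relational", "motoring", "computerization", "sing", "organization"):
    print(w, "->", toy_stem(w))
```

Even this small cascade reproduces one of the slide's error cases: organization is reduced to organ, conflating two unrelated words.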

Stemming mainly used in Information Retrieval
1. Run a stemmer on the documents to be indexed
2. Run a stemmer on users' queries
3. Compute similarity between queries and documents (based on the stems they contain)
Seems to work especially well with smaller documents

Porter as an FST
• The original exposition of the Porter stemmer did not describe it as a transducer but…
  – Each stage is a separate transducer
  – The stages can be composed to get one big transducer

Next Time
• Read handout
  – Probability
  – Stats
  – Information theory
• Next Lecture:
  – (if not finished today) finish Chpt. 3, 3.10-11
  – Start Probabilistic Models for NLP (Chpt. 4, 4.1 – 4.2 and 5.9!)