CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini
- Slides: 60
CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini 9/19/2021 CPSC 503 Winter 2012 1
Today Jan 10 • Questionnaire • Brief check of some background knowledge • English Morphology • FSA and Morphology • Start: Finite State Transducers (FST) and Morphological Parsing/Gen.
Finite state machines: Regular Expressions & Finite State Automata 7.1; Finite State Transducers 3.4; Hidden-Markov Models 5.7
Basic Probability, Bayesian Statistics and Information Theory: Conditional Probability 8.1; Bayesian Networks 6.6; Entropy 6.1
Machine Learning: Supervised Classification (e.g., Decision Trees) 6.1; Unsupervised Learning (e.g., clustering) 5.3; Graphical Models 3.7
Richer Formalisms: Context-Free Grammar 5.2; First-Order Logics 5.8
Programming: Java 7.4; Python 7.8
Dynamic Programming 6.3
Search Algorithms 6.8
Linguistics 4.4
Linear Alg. 7.9
Today Jan 10 • Brief check of some background knowledge • English Morphology • FSA and Morphology • Start: Finite State Transducers (FST) and Morphological Parsing/Gen.
Knowledge-Formalisms Map (including probabilistic formalisms) • Knowledge: Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue • Formalisms: State Machines (and prob. versions) (Finite State Automata, Finite State Transducers, Markov Models); Rule systems (and prob. versions) (e.g., (Prob.) Context-Free Grammars); Logical formalisms (First-Order Logics, Prob. Logics); AI planners (MDP, Markov Decision Processes) • Machine Learning
Today • Knowledge: (English) Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue • Formalisms: State Machines (no prob.): Finite State Automata (and Regular Expressions), Finite State Transducers; Rule systems (and prob. version) (e.g., (Prob.) Context-Free Grammars); Logical formalisms (First-Order Logics); AI planners
?? [Figure: finite-state automata with numbered states and arcs labeled b, a, and !]
?? /CPSC 50[34]/ /^([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)/ /[0-9]+(\.[0-9]+){3}/
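The three quiz patterns can be tried out directly. The sketch below uses Python's `re` module and assumes the word-boundary marker `\b` and the escaped dot `\.` are what the slide intended; the names `COURSE`, `HEADER`, `DOTTED`, and `matches` are made up for illustration.

```python
import re

# The three slide patterns, written as Python raw strings.
COURSE = re.compile(r"CPSC 50[34]")
HEADER = re.compile(r"^([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)")
DOTTED = re.compile(r"[0-9]+(\.[0-9]+){3}")


def matches(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


print(matches(COURSE, "CPSC 503 Winter 2012"))  # course number 503 or 504
print(matches(HEADER, "Subject: questionnaire"))  # e-mail header line
print(matches(DOTTED, "192.168.0.1"))  # four dot-separated number groups
```

Read this way, the first pattern picks out two course numbers, the second finds lines that start with an e-mail header keyword, and the third finds strings shaped like dotted quads.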
Fundamental Relations • FSA implement (generate and recognize) Regular Expressions • Regular Expressions describe Many Linguistic Phenomena • FSA model Many Linguistic Phenomena
Second Usage of Reg. Exp: Text Searching/Editing Find me all instances of the determiner “the” in an English text. – To count them – To substitute them with something else You try: /the/ /[tT]he/ on the sentence: The other cop went to the bank but there were no people there. Then: /\bthe\b/ /\b[tT]he\b/ s/\b([tT]he|[Aa]n?)\b/DET/
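The difference between these patterns is easiest to see by counting matches on the example sentence. This is a small sketch with Python's `re` module; `TEXT`, `count`, and `tag_determiners` are illustrative names, not part of the lecture.

```python
import re

TEXT = "The other cop went to the bank but there were no people there."


def count(pattern):
    """Number of non-overlapping matches of the pattern in TEXT."""
    return len(re.findall(pattern, TEXT))


def tag_determiners(text):
    """Replace the determiners the/The/a/an/A/An with the tag DET."""
    return re.sub(r"\b([tT]he|[Aa]n?)\b", "DET", text)


for pattern in (r"the", r"[tT]he", r"\bthe\b", r"\b[tT]he\b"):
    print(pattern, count(pattern))
print(tag_determiners(TEXT))
```

Without word boundaries the patterns also fire inside "other" and "there" (false positives), and the lowercase-only patterns miss the sentence-initial "The" (a false negative).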
Today • Knowledge: (English) Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue • Formalisms: State Machines (no prob.): Finite State Automata (and Regular Expressions), Finite State Transducers; Rule systems (and prob. version) (e.g., (Prob.) Context-Free Grammars); Logical formalisms (First-Order Logics); AI planners
English Morphology Def. The study of how words are formed from minimal meaning-bearing units (morphemes) • We can usefully divide morphemes into two classes – Stems: The core meaning bearing units – Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions Examples: unhappily, ……………
Word Classes • For now word classes: nouns, verbs, adjectives and adverbs. • We’ll go into the gory details in Ch 5 • Word class determines to a large degree the way that stems and affixes combine
English Morphology • We can also divide morphology up into two broad classes – Inflectional – Derivational
Inflectional Morphology • The resulting word: – Has the same word class as the original – Serves a grammatical/semantic purpose different from the original
Nouns, Verbs and Adjectives (English) • Nouns are simple (not really) – Markers for plural and possessive • Verbs are only slightly more complex – Markers appropriate to the tense of the verb and to the person • Adjectives – Markers for comparative and superlative
Regulars and Irregulars • Some words misbehave (refuse to follow the rules) – Mouse/mice, goose/geese, ox/oxen – Go/went, fly/flew • Regulars… – Walk, walks, walking, walked • Irregulars – Eat, eats, eating, ate, eaten – Catch, catches, catching, caught – Cut, cuts, cutting, cut
Derivational Morphology • Derivational morphology is the messy stuff that no one ever taught you. – Changes of word class – Less Productive (-ant V -> N only with V of Latin origin!)
Derivational Examples • Verb/Adj to Noun – -ation: computerize -> computerization – -ee: appoint -> appointee – -er: kill -> killer – -ness: fuzzy -> fuzziness
Derivational Examples • Noun/Verb to Adj – -al: computation -> computational – -able: embrace -> embraceable – -less: clue -> clueless
Compute • Many paths are possible… • Start with compute – Computer -> computerize -> computerization – Computation -> computational – Computer -> computerize -> computerizable – Compute -> computee
Summary • Knowledge: (English) Morphology, Syntax, Semantics, Pragmatics, Discourse and Dialogue • Formalisms: State Machines (no prob.): Finite State Automata (and Regular Expressions), Finite State Transducers; Rule systems (and prob. version) (e.g., (Prob.) Context-Free Grammars); Logical formalisms (First-Order Logics); AI planners
FSAs and Morphology • GOAL 1: recognize whether a string is an English word • PLAN: 1. First we’ll capture the morphotactics (the rules governing the ordering of affixes in a language) 2. Then we’ll add in the actual stems
FSA for Portion of Noun Inflectional Morphology
Adding the Stems But the resulting FSA does not express that: • Regular nouns ending in -s, -z, -sh, -ch, -x take -es (kiss, waltz, bush, rich, box) • Regular nouns ending in -y preceded by a consonant change the -y to -i
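A minimal sketch of such a recognizer, assuming the usual three noun classes (regular stems, irregular singulars, irregular plurals) and a toy lexicon invented for illustration. It walks the arcs by hand instead of building a general automaton, and it deliberately shows the gap named above: with no spelling rules it accepts "foxs" and rejects "foxes".

```python
# Hypothetical toy lexicon; a real system would load thousands of stems.
REG_NOUN = {"cat", "dog", "fox"}
IRREG_SG_NOUN = {"goose", "mouse"}
IRREG_PL_NOUN = {"geese", "mice"}


def recognize_noun(word):
    """Accept a word if some path through the noun FSA spells it out."""
    # Arc from the start state on a regular stem or an irregular form;
    # the state reached is accepting in both cases.
    if word in REG_NOUN or word in IRREG_SG_NOUN or word in IRREG_PL_NOUN:
        return True
    # Arc on the plural morpheme -s, allowed only after a regular stem.
    return word.endswith("s") and word[:-1] in REG_NOUN


for candidate in ("cats", "geese", "gooses", "foxs", "foxes"):
    print(candidate, recognize_noun(candidate))
```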
Small Fragment of V and N Derivational Morphology • noun_i, e.g. hospital • adj_al, e.g. formal • adj_ous, e.g. arduous • verb_j, e.g. speculate • verb_k, e.g. conserve
GOAL 2: Morphological Parsing/Generation (vs. Recognition) • Recognition is usually not quite what we need. – Usually given a word we need to find: the stem and its class and morphological features (parsing) – Or we have a stem and its class and morphological features and we want to produce the word (production/generation) • Examples (parsing) – From “cats” to “cat +N +PL” – From “lies” to ……
Computational problems in Morphology • Recognition: recognize whether a string is an English word (FSA) • Parsing/Generation: word <-> stem, class, lexical features, e.g., lies <-> lie +N +PL, lie +V +3SG • Stemming: word -> stem
Finite State Transducers • FSA cannot help…. • The simple story – Add another tape – Add extra symbols to the transitions – On one tape we read “cats”, on the other we write “cat +N +PL”
FSTs parsing generation
(Simplified) FST formal definition (you can skip 3.4.1 unless you want to work on FST) • Q: a finite set of states • I, O: input and output alphabets (which may include ε) • Σ: a finite alphabet of complex symbols i:o, with i ∈ I and o ∈ O • q0: the start state • F: a set of accept/final states (F ⊆ Q) • δ: a transition relation that maps Q × Σ to 2^Q
FST can be used as… • Translators: input one string from I, output another from O (or vice versa) • Recognizers: input a string from I × O • Generators: output a string from I × O
Simple Example Arcs: c:c a:a t:t +N:ε +PL:s +SG:ε Transitions (as a translator): • c:c means read a c on one tape and write a c on the other (or vice versa) • +N:ε means read a +N symbol on one tape and write nothing on the other (or vice versa) • +PL:s means read +PL and write an s (or vice versa)
Examples (as a translator) Arcs: c:c a:a t:t +N:ε +PL:s +SG:ε • Parsing: surface c a t s, lexical … • Generation: lexical c a t +N +SG, surface …
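The generation direction of this small transducer can be sketched as a table lookup, since reading the lexical tape is deterministic here. The state numbers, the dictionary layout, and the names `ARCS`, `FINAL_STATES`, and `generate` are assumptions made for illustration; only the arc labels come from the slide.

```python
EPS = ""  # epsilon: nothing is written on the surface tape

# (state, lexical symbol) -> (surface symbol, next state)
ARCS = {
    (0, "c"): ("c", 1),
    (1, "a"): ("a", 2),
    (2, "t"): ("t", 3),
    (3, "+N"): (EPS, 4),
    (4, "+PL"): ("s", 5),
    (4, "+SG"): (EPS, 5),
}
FINAL_STATES = {5}


def generate(lexical_symbols):
    """Translate a lexical symbol sequence to a surface string, or None."""
    state, surface = 0, []
    for symbol in lexical_symbols:
        if (state, symbol) not in ARCS:
            return None  # no arc: the input is not in the language
        output, state = ARCS[(state, symbol)]
        surface.append(output)
    return "".join(surface) if state in FINAL_STATES else None


print(generate(["c", "a", "t", "+N", "+PL"]))
print(generate(["c", "a", "t", "+N", "+SG"]))
```

Parsing (surface to lexical) cannot use the same simple loop, because arcs whose surface side is ε consume no input; it needs a search over paths.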
Slightly More Complex Example States q0 to q7; arcs q0 -l:l-> q1 -i:i-> q2 -e:e-> q3, then two branches from q3: +N:ε followed by +PL:s, or +V:ε followed by +3SG:s Transitions (as a translator): • l:l means read an l on one tape and write an l on the other (or vice versa) • +N:ε means read a +N symbol on one tape and write nothing on the other (or vice versa) • +PL:s means read +PL and write an s (or vice versa) • …
Examples (as a translator), same FST • Parsing: surface l i e s, lexical … • Generation: lexical l i e +V +3SG, surface …
Examples (as a recognizer and a generator), same FST • lexical l i e +V +3SG • surface l i e s
Examples • surface m i c e, lexical … • lexical c a t +N +PL, surface …
FST: inflectional morphology of plural [Figure: FST with paths for some regular nouns and some irregular nouns] Notes: X abbreviates X:X; arc labels read lexical:surface (e.g., o:i)
Computational Morphology: Problems/Challenges 1. Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages) 2. Spelling changes: may occur when two morphemes are combined, e.g., butterfly + -s -> butterflies
Deal with Morphological Ambiguity • Find all the possible outputs (all paths) and return them all (without choosing) • Then use part-of-speech tagging to choose…… by looking at the neighboring words
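Returning every analysis amounts to a depth-first search over all accepting paths of the transducer. The sketch below does this for a "lie" FST; the state numbering, the arc tuple layout, and the function name `parse` are illustrative assumptions.

```python
EPS = ""  # epsilon on the surface side: the arc consumes no input

# (source state, lexical symbol, surface symbol, target state)
ARCS = [
    (0, "l", "l", 1),
    (1, "i", "i", 2),
    (2, "e", "e", 3),
    (3, "+N", EPS, 4),
    (4, "+PL", "s", 6),
    (3, "+V", EPS, 5),
    (5, "+3SG", "s", 7),
]
FINAL_STATES = {6, 7}


def parse(surface):
    """Return every lexical analysis of a surface string (all paths)."""
    analyses = []

    def walk(state, pos, lexical):
        if state in FINAL_STATES and pos == len(surface):
            analyses.append(lexical)
        for src, lex, surf, dst in ARCS:
            # An epsilon arc matches trivially because surf is "".
            if src == state and surface.startswith(surf, pos):
                sep = " " if lex.startswith("+") else ""
                walk(dst, pos + len(surf), lexical + sep + lex)

    walk(0, 0, "")
    return analyses


print(parse("lies"))
```

The parser hands back both the noun and the verb reading of "lies"; choosing between them is left to a later component such as a part-of-speech tagger.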
(2) Spelling Changes When morphemes are combined inflectionally the spelling at the boundaries may change Examples • E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box) • Y-replacement: when -s or -ed is added to a word ending in -y, the -y changes to -ie or -i respectively (e.g., butterfly, try)
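Y-replacement can be sketched as a small rewrite on the stem before the suffix is attached. The function name `add_suffix` is made up for illustration, and the restriction to a -y preceded by a consonant is taken from the earlier noun slide, not from this one.

```python
import re


def add_suffix(stem, suffix):
    """Attach -s or -ed, applying Y-replacement where it is triggered."""
    # Assumption: the rule fires only when -y follows a consonant,
    # so "play" + "s" stays "plays".
    if suffix in ("s", "ed") and re.search(r"[^aeiou]y$", stem):
        replacement = "ie" if suffix == "s" else "i"
        return stem[:-1] + replacement + suffix
    return stem + suffix


print(add_suffix("butterfly", "s"))
print(add_suffix("try", "ed"))
print(add_suffix("play", "s"))
```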
Solution: Multi-Tape Machines • Add intermediate tape • Use the output of one tape machine as the input to the next • Add intermediate symbols – ^ morpheme boundary – # word boundary
Multi-Level Tape Machines FST-1 FST-2 • FST-1 translates between the lexical and the intermediate level • FST-2 handles the spelling changes (due to one rule) to the surface tape
FST-1 for inflectional morphology of plural (Lexical <-> Intermediate) [Figure: FST with paths for some regular nouns and some irregular nouns; arc labels visible in the figure include o:i, +PL:^s#, +PL:^, ε:s, ε:#]
Example • lexical f o x +N +PL, intermediate … • lexical m o u s e +N +PL, intermediate …
FST-2 for E-insertion (Intermediate <-> Surface) E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x, as in fox^s# <-> foxes [Figure: E-insertion FST; arc labels include #:ε]
Examples • intermediate f o x ^ s #, surface … • intermediate b o x ^ i n g #, surface …
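The intermediate-to-surface step for E-insertion can be approximated with one regular-expression rewrite followed by deletion of the boundary symbols. This is a sketch of the rule's effect, not the transducer from the figure; the function name `e_insertion` is made up.

```python
import re


def e_insertion(intermediate):
    """Map an intermediate form such as 'fox^s#' to its surface form."""
    # Insert e between a stem ending in s, z, sh, ch, or x and the
    # suffix s; the ^ must sit directly before that final s.
    rewritten = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", intermediate)
    # Delete the morpheme (^) and word (#) boundary symbols.
    return rewritten.replace("^", "").replace("#", "")


print(e_insertion("fox^s#"))
print(e_insertion("box^ing#"))
print(e_insertion("cat^s#"))
```

The rule fires only when the morpheme boundary is followed by exactly the suffix s, so "box^ing#" passes through with just its boundary symbols removed.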
Where are we?
Final Scheme: Part 1
Final Scheme: Part 2
Intersection(FST1, FST2) = FST3 • States of FST1 and FST2: Q1 and Q2 • States of intersection: Q1 × Q2 • Transitions of FST1 and FST2: δ1, δ2 • Transitions of intersection: δ3 • For all i, j, n, m, a, b: δ3((q1_i, q2_j), a:b) = (q1_n, q2_m) iff – δ1(q1_i, a:b) = q1_n AND – δ2(q2_j, a:b) = q2_m [Figure: arc a:b from q1_i to q1_n and arc a:b from q2_j to q2_m combine into arc a:b from (q1_i, q2_j) to (q1_n, q2_m)]
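The definition translates almost line by line into a product construction. The sketch below assumes deterministic machines whose arc labels a:b are treated as atomic symbols with no ε, and stores each machine as a dictionary with hypothetical keys `start`, `final`, and `delta`; the two example machines are invented.

```python
def intersect(fst1, fst2):
    """Product construction: keep only arcs both machines share."""
    delta3 = {}
    for (p, pair1), p_next in fst1["delta"].items():
        for (q, pair2), q_next in fst2["delta"].items():
            if pair1 == pair2:  # same a:b label in both machines
                delta3[((p, q), pair1)] = (p_next, q_next)
    return {
        "start": (fst1["start"], fst2["start"]),
        "final": {(f1, f2) for f1 in fst1["final"] for f2 in fst2["final"]},
        "delta": delta3,
    }


def accepts(fst, pairs):
    """Run the machine as a recognizer over a sequence of a:b pairs."""
    state = fst["start"]
    for pair in pairs:
        if (state, pair) not in fst["delta"]:
            return False
        state = fst["delta"][(state, pair)]
    return state in fst["final"]


AB, CC, XY = ("a", "b"), ("c", "c"), ("x", "y")
# FST1: any sequence of a:b and c:c. FST2: an odd number of a:b arcs.
FST1 = {"start": 0, "final": {0}, "delta": {(0, AB): 0, (0, CC): 0}}
FST2 = {"start": 0, "final": {1},
        "delta": {(0, AB): 1, (1, AB): 0, (0, XY): 0}}
FST3 = intersect(FST1, FST2)
print(accepts(FST3, [AB]), accepts(FST3, [AB, AB]))
```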
Composition(FST1, FST2) = FST3 • States of FST1 and FST2: Q1 and Q2 • States of composition: Q1 × Q2 • Transitions of FST1 and FST2: δ1, δ2 • Transitions of composition: δ3 • For all i, j, n, m, a, b: δ3((q1_i, q2_j), a:b) = (q1_n, q2_m) iff there exists c such that – δ1(q1_i, a:c) = q1_n AND – δ2(q2_j, c:b) = q2_m [Figure: arc a:c from q1_i to q1_n and arc c:b from q2_j to q2_m combine into arc a:b from (q1_i, q2_j) to (q1_n, q2_m)]
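Composition differs from intersection only in the matching condition: the output symbol of the first machine must equal the input symbol of the second. A sketch under the assumption of ε-free machines stored as sets of arcs `(source, input, output, target)`; all names and the two example machines are illustrative.

```python
def compose(arcs1, arcs2):
    """Pair up arcs a:c of FST1 with arcs c:b of FST2 into arcs a:b."""
    return {
        ((p, q), a, b, (p_next, q_next))
        for (p, a, c1, p_next) in arcs1
        for (q, c2, b, q_next) in arcs2
        if c1 == c2  # the shared intermediate symbol c
    }


def translate(arcs, start, finals, symbols):
    """All outputs the machine produces for an input sequence."""
    results = set()

    def walk(state, i, out):
        if i == len(symbols):
            if state in finals:
                results.add(out)
            return
        for src, a, b, dst in arcs:
            if src == state and a == symbols[i]:
                walk(dst, i + 1, out + b)

    walk(start, 0, "")
    return results


# FST1 rewrites a as b and x as y; FST2 rewrites b as c and keeps y.
ARCS1 = {(0, "a", "b", 0), (0, "x", "y", 0)}
ARCS2 = {(0, "b", "c", 0), (0, "y", "y", 0)}
ARCS3 = compose(ARCS1, ARCS2)
print(translate(ARCS3, (0, 0), {(0, 0)}, "axa"))
```

The composed machine maps its input to the final output in one pass, with no intermediate tape.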
FSTs in Practice • Install an FST package…… (pointers) • Describe your “formal language” (e.g., lexicon, morphotactics and rules) in a RegExp-like notation (pointer) • Your specification is compiled into a single FST Ref: “Finite State Morphology” (Beesley and Karttunen, 2003, CSLI Publications) (pointer) Complexity/Coverage: • FSTs for the morphology of a natural language may have 10^5 – 10^7 states and arcs • Spanish (1996): 46 × 10^3 stems; 3.4 × 10^6 word forms • Arabic (2002?): 131 × 10^3 stems; 7.7 × 10^6 word forms
Other important applications of FST in NLP From segmenting words into morphemes to… • Tokenization: – Finding word boundaries in text (?!) …maxmatch – Finding sentence boundaries: punctuation… but “.” is ambiguous; look at the example in Fig. 3.22 • Shallow syntactic parsing: e.g., find only noun phrases • Phonological Rules…… (Chpt. 11)
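Maxmatch is a greedy procedure: at each position take the longest lexicon word that starts there, and fall back to a single character if none does. A minimal sketch with a toy lexicon invented for illustration; the function name `max_match` is an assumption.

```python
def max_match(text, lexicon):
    """Greedy longest-match segmentation of an unspaced string."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no lexicon word starts here: emit one character
        tokens.append(text[i:j])
        i = j
    return tokens


LEXICON = {"the", "theta", "table", "down", "there", "bled", "own"}
print(max_match("thetabledownthere", LEXICON))
```

On this input the greedy choice of "theta" derails the rest of the segmentation, which illustrates why word-boundary detection is harder than it first looks.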
Computational tasks in Morphology • Recognition: recognize whether a string is an English word (FSA) • Parsing/Generation: word <-> stem, class, lexical features, e.g., bought <-> buy +V +PAST-PART, buy +V +PAST • Stemming: word -> stem
Stemmer • E.g. the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules of the form (condition) S1 -> S2 – ATIONAL -> ATE (relational -> relate) – (*v*) ING -> ε if the stem contains a vowel (motoring -> motor) • Cascade of rules applied to computerization: – -ization -> -ize: computerize – -ize -> ε: computer • Errors occur: – organization -> organ, university -> universe • Code freely available in most languages: Python, Java, …
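The cascade idea can be sketched with just the four rules quoted on the slide. This is a toy, not the Porter algorithm: the real stemmer has many more rules and conditions on the stem, and the helper names here are made up.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def rewrite(word, suffix, replacement):
    """Apply one rule S1 -> S2 if the word ends in S1."""
    if word.endswith(suffix):
        return word[: len(word) - len(suffix)] + replacement
    return word


def strip_ing(word):
    """(*v*) ING -> epsilon: only if the remaining stem has a vowel."""
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        return word[:-3]
    return word


def stem(word):
    """Run the rules in a fixed order, each on the previous output."""
    word = strip_ing(word)
    word = rewrite(word, "ational", "ate")
    word = rewrite(word, "ization", "ize")
    word = rewrite(word, "ize", "")
    return word


for example in ("computerization", "relational", "motoring", "sing"):
    print(example, stem(example))
```

Even this toy version reproduces the slide's error case: "organization" is reduced to "organ".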
Stemming mainly used in Information Retrieval 1. Run a stemmer on the documents to be indexed 2. Run a stemmer on user queries 3. Compute similarity between queries and documents (based on the stems they contain) Seems to work especially well with smaller documents
Porter as an FST • The original exposition of the Porter stemmer did not describe it as a transducer but… – Each stage is a separate transducer – The stages can be composed to get one big transducer
Next Time • Read handout – Probability – Stats – Information theory • Next Lecture: – (if not finished today) finish Chpt 3, 3.10-11 – Start Probabilistic Models for NLP (Chpt. 4, 4.1 – 4.2 and 5.9!)