CPSC 503 Computational Linguistics Lecture 3 Giuseppe Carenini
- Slides: 33
CPSC 503 Computational Linguistics Lecture 3 Giuseppe Carenini 10/3/2020 CPSC 503 Winter 2008 1
• Subscribe to mailing list
• Some more Intros
• NLP@UBC
Introductions
• Your name
• Previous experience in NLP?
• Why are you interested in NLP?
• Are you thinking of NLP as your main research area? If not, what else do you want to specialize in?
• Anything else…
NLP research at UBC
TOPICS
• Generation and Summarization of Evaluative Text (e.g., customer reviews)
• Summarization of conversations (emails, blogs, meetings)
PEOPLE: G. Carenini & R. Ng (Profs), G. Murray (Postdoc) + Students
SUPPORT: NSERC, Google, BObjects (now SAP), MS Research
Linguistic Knowledge (English)
• Morphology
• Syntax
• Semantics
• Pragmatics
• Discourse and Dialogue
Formalisms and associated Algorithms
• State Machines (no prob.)
 – Finite State Automata (and Regular Expressions)
 – Finite State Transducers
• Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars
• Logical formalisms (First-Order Logics)
• AI planners
Computational tasks in Morphology
• Recognition: recognize whether a string is an English/… word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
 – e.g., bought <-> buy +V +PAST-PART, buy +V +PAST
• Stemming: word -> stem
Today Sept 15
• Finite State Transducers (FSTs) and Morphological Parsing
• Stemming (Porter Stemmer)
FST definition
• Q: a finite set of states
• I, O: input and output alphabets (which may include ε)
• Σ: a finite alphabet of complex symbols i:o, with i ∈ I and o ∈ O
• q0: the start state
• F: a set of accept/final states (F ⊆ Q)
• δ: a transition relation that maps Q × Σ to 2^Q
E.g., |Q| = 3; I = {a, b, c, ε}; O = {a, b}; |Σ| = ?; 0 <= |δ| <= ?
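The definition above can be written down as a small data structure. The following is a minimal sketch, not a reference implementation: the class name `FST`, its method names, and the use of the empty string for ε are illustrative assumptions. It also works through the slide's example under two further assumptions: that Σ is taken to be the full product I × O, and that |δ| counts (state, i:o, state) triples.

```python
from dataclasses import dataclass, field
from itertools import product

EPS = ""  # assumption: the empty string stands in for epsilon


@dataclass
class FST:
    states: set
    input_alphabet: set
    output_alphabet: set
    start: object
    finals: set
    # delta maps (state, (i, o)) to the set of possible next states (2^Q)
    delta: dict = field(default_factory=dict)

    def add_arc(self, src, i, o, dst):
        # every arc label must be a pair i:o with i in I and o in O
        assert i in self.input_alphabet and o in self.output_alphabet
        self.delta.setdefault((src, (i, o)), set()).add(dst)

    def num_arcs(self):
        # number of (state, i:o, state) triples in the relation
        return sum(len(dsts) for dsts in self.delta.values())


# The example from the slide: |Q| = 3, I = {a, b, c, eps}, O = {a, b}
I = {"a", "b", "c", EPS}
O = {"a", "b"}
all_pairs = set(product(I, O))   # largest possible set of complex symbols
max_arcs = 3 * len(all_pairs) * 3  # every state to every state on every pair

print(len(all_pairs), max_arcs)
```

Under these assumptions the product I × O has 4 × 2 = 8 pairs, and a machine with 3 states can have between 0 and 3 × 8 × 3 = 72 arcs.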
FST can be used as…
• Translators: input one string from I, output another from O (or vice versa)
• Recognizers: input a string from I×O
• Generators: output a string from I×O
Terminology warning!
E.g., if I = {a, b, c, ε}; O = {a, b}; ……
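The translator and recognizer views can be sketched over one arc table. The machine below is a made-up toy (its states and arcs are not from the lecture), and the function names `translate` and `recognize` are hypothetical. The sketch assumes there are no cycles of ε-input arcs, so the search terminates.

```python
EPS = ""  # assumption: the empty string stands in for epsilon

# Toy arc table: state -> list of (input symbol, output symbol, next state)
ARCS = {
    0: [("a", "b", 1)],
    1: [("b", "a", 1), ("c", "a", 2)],
    2: [(EPS, "b", 3)],
}
START, FINALS = 0, {3}


def translate(s, arcs=ARCS, start=START, finals=FINALS):
    """Translator view: read s on the input side, collect all outputs."""
    results = set()
    stack = [(start, 0, "")]  # (state, input position, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(s) and state in finals:
            results.add(out)
        for i, o, nxt in arcs.get(state, []):
            if i == EPS:
                stack.append((nxt, pos, out + o))      # consume no input
            elif pos < len(s) and s[pos] == i:
                stack.append((nxt, pos + 1, out + o))  # consume one symbol
    return results


def recognize(pairs, arcs=ARCS, start=START, finals=FINALS):
    """Recognizer view: accept or reject a sequence of i:o pairs."""
    current = {start}
    for i, o in pairs:
        current = {nxt for q in current
                   for (ai, ao, nxt) in arcs.get(q, [])
                   if (ai, ao) == (i, o)}
    return bool(current & finals)


print(translate("abc"))
print(recognize([("a", "b"), ("b", "a"), ("c", "a"), (EPS, "b")]))
```

Because `translate` returns a set, it also covers the case where one input has several outputs.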
FST: inflectional morphology of plural
[Figure: FST with paths for some regular nouns and some irregular nouns; arcs are labeled lexical:surface, e.g., o:i]
Notes: X stands for X:X
Examples
• lexical: ?   surface: m i c e
• lexical: c a t +N +PL   surface: ?
Computational Morphology: Problems/Challenges
1. Ambiguity: one word can correspond to multiple structures (more critical in morphologically richer languages)
2. Spelling changes: may occur when two morphemes are combined, e.g., butterfly + -s -> butterflies
Ambiguity: more complex example
• What’s the right parse for Unionizable?
 – Union-ize-able
 – Un-ion-ize-able
• Each would represent a valid path through an FST for derivational morphology.
• Both Adj……
Dealing with Morphological Ambiguity
• Find all the possible outputs (all paths) and return them all (without choosing)
• Then use part-of-speech tagging to choose…… look at the neighboring words
(2) Spelling Changes
When morphemes are combined inflectionally, the spelling at the boundaries may change.
Examples
• E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x (e.g., kiss, miss, waltz, bush, watch, rich, box)
• Y-replacement: when -s or -ed is added to a word ending in -y, -y changes to -ie or -i respectively (e.g., try, butterfly)
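The two rules above can be tried out directly on words. This is a rough sketch with hypothetical function names `add_s` and `add_ed`. It adds one assumption not stated on the slide: Y-replacement is applied only when -y follows a consonant, so that words like "toy" are left alone.

```python
import re


def add_s(word):
    """Attach -s, applying E-insertion or Y-replacement where they fire."""
    if re.search(r"(s|z|sh|ch|x)$", word):
        return word + "es"            # E-insertion: kiss -> kisses
    if re.search(r"[^aeiou]y$", word):
        return word[:-1] + "ies"      # Y-replacement: try -> tries
    return word + "s"


def add_ed(word):
    """Attach -ed, applying Y-replacement where it fires."""
    if re.search(r"[^aeiou]y$", word):
        return word[:-1] + "ied"      # Y-replacement: try -> tried
    return word + "ed"


for w in ["kiss", "waltz", "bush", "watch", "box", "try", "butterfly", "cat"]:
    print(w, "->", add_s(w))
print("try", "->", add_ed("try"))
```

Writing the rules as plain string functions hides the morpheme boundary, which is the information the intermediate tape on the following slides makes explicit.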
Solution: Multi-Tape Machines
• Add intermediate tape
• Use the output of one tape machine as the input to the next
• Add intermediate symbols
 – ^ morpheme boundary
 – # word boundary
Multi-Level Tape Machines
[Figure: FST-1 and FST-2 in cascade]
• FST-1 translates between the lexical and the intermediate level
• FST-2 handles the spelling changes (due to one rule) to the surface tape
FST-1 for inflectional morphology of plural (Lexical <-> Intermediate)
[Figure: FST with paths for some regular nouns and some irregular nouns; arc labels include o:i, +PL:^s#, +PL:^, ε:s, ε:#]
Example
• lexical: f o x +N +PL   intermediate: ?
• lexical: m o u s e +N +PL   intermediate: ?
FST-2 for E-insertion (Intermediate <-> Surface)
E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, -x
…as in fox^s# <-> foxes
[Figure: FST-2; arc labels include #:ε]
Examples
• intermediate: f o x ^ s #   surface: ?
• intermediate: b o x ^ i n g #   surface: ?
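The two-machine cascade can be imitated with two string functions, one per level. This is only a sketch: it uses regular expressions rather than real transducers, the tiny lexicon and the function names are made up for illustration, and it assumes that irregular plurals reach the intermediate tape with just a word boundary (no ^s).

```python
import re

# Assumption: a toy lexicon, not the one used in the lecture
IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese"}


def lexical_to_intermediate(stem, features):
    """Stand-in for FST-1: lexical level -> intermediate level."""
    if features == "+N+PL":
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"
        return stem + "^s#"      # ^ marks the morpheme boundary
    if features == "+N+SG":
        return stem + "#"        # # marks the word boundary
    raise ValueError("unsupported features: " + features)


def intermediate_to_surface(intermediate):
    """Stand-in for FST-2: E-insertion, then drop boundary symbols."""
    with_e = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", intermediate)
    return with_e.replace("^", "").replace("#", "")


def generate(stem, features):
    """Run the two levels in cascade."""
    return intermediate_to_surface(lexical_to_intermediate(stem, features))


print(lexical_to_intermediate("fox", "+N+PL"))
print(generate("fox", "+N+PL"))
print(intermediate_to_surface("box^ing#"))
print(generate("mouse", "+N+PL"))
```

The E-insertion rule only fires when the boundary is followed by s#, so box^ing# passes through with just its boundary symbols removed.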
Where are we?
Final Scheme: Part 1
Final Scheme: Part 2
Intersection (FST1, FST2)
• States of FST1 and FST2: Q1 and Q2
• States of intersection: (Q1 × Q2)
• Transitions of FST1 and FST2: δ1, δ2
• Transitions of intersection: δ3
For all i, j, n, m, a, b:
δ3((q1_i, q2_j), a:b) = (q1_n, q2_m) iff
 – δ1(q1_i, a:b) = q1_n AND
 – δ2(q2_j, a:b) = q2_m
[Figure: arc a:b from q1_i to q1_n and arc a:b from q2_j to q2_m yield arc a:b from (q1_i, q2_j) to (q1_n, q2_m)]
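The intersection construction translates almost line by line into code. The sketch below assumes deterministic, ε-free transition tables stored as dictionaries from (state, (a, b)) to a single next state, matching the way δ is written on the slide. The function name `intersect` and the toy tables are hypothetical.

```python
def intersect(delta1, delta2):
    """Build delta3 over pair states: both machines must take the same a:b arc."""
    delta3 = {}
    for (q1, label1), n1 in delta1.items():
        for (q2, label2), n2 in delta2.items():
            if label1 == label2:
                delta3[((q1, q2), label1)] = (n1, n2)
    return delta3


# Toy transition tables (illustrative only)
delta1 = {("p0", ("a", "b")): "p1"}
delta2 = {("r0", ("a", "b")): "r1",
          ("r0", ("c", "c")): "r0"}

print(intersect(delta1, delta2))
```

The c:c arc of the second machine has no partner in the first, so it does not appear in the result.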
Composition (FST1, FST2)
• States of FST1 and FST2: Q1 and Q2
• States of composition: Q1 × Q2
• Transitions of FST1 and FST2: δ1, δ2
• Transitions of composition: δ3
For all i, j, n, m, a, b:
δ3((q1_i, q2_j), a:b) = (q1_n, q2_m) iff there exists c such that
 – δ1(q1_i, a:c) = q1_n AND
 – δ2(q2_j, c:b) = q2_m
[Figure: arc a:c from q1_i to q1_n and arc c:b from q2_j to q2_m yield arc a:b from (q1_i, q2_j) to (q1_n, q2_m)]
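Composition differs from intersection only in how the labels are matched: the output symbol of the first machine must equal the input symbol of the second. The sketch below ignores ε handling and stores a set of targets per entry, since two different intermediate symbols c can produce the same a:b label. The function name `compose` and the toy tables are hypothetical.

```python
def compose(delta1, delta2):
    """Build delta3: an a:c arc in FST1 and a c:b arc in FST2 give an a:b arc."""
    delta3 = {}
    for (q1, (a, c1)), n1 in delta1.items():
        for (q2, (c2, b)), n2 in delta2.items():
            if c1 == c2:  # the shared intermediate symbol c
                delta3.setdefault(((q1, q2), (a, b)), set()).add((n1, n2))
    return delta3


# Toy transition tables (illustrative only)
delta1 = {(0, ("a", "x")): 1}
delta2 = {(0, ("x", "b")): 1,
          (0, ("y", "b")): 0}

print(compose(delta1, delta2))
```

The intermediate symbol x disappears from the result, which is how a lexical-to-intermediate machine and an intermediate-to-surface machine can be collapsed into one lexical-to-surface machine.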
FSTs in Practice
• Install an FST package…… (pointers)
• Describe your “formal language” (e.g., lexicon, morphotactics, and rules) in a regular-expression-like notation (pointer)
• Your specification is compiled into a single FST
Ref: “Finite State Morphology” (Beesley and Karttunen, 2003, CSLI Publications)
Complexity/Coverage:
• FSTs for the morphology of a natural language may have 10^5 – 10^7 states and arcs
• Spanish (1996): 46 × 10^3 stems; 3.4 × 10^6 word forms
• Arabic (2002?): 131 × 10^3 stems; 7.7 × 10^6 word forms
Other important applications of FST in NLP
From segmenting words into morphemes to…
• Tokenization:
 – Finding word boundaries in text (?!) …maxmatch
 – Finding sentence boundaries: punctuation… but "." is ambiguous; look at the example in Fig. 3.22
• Shallow syntactic parsing: e.g., find only noun phrases
• Phonological Rules…… (Chpt. 11?)
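The "maxmatch" idea mentioned under tokenization is greedy longest-match segmentation: at each position, take the longest dictionary word that starts there. Below is a minimal sketch with a hypothetical function name and a made-up lexicon. It falls back to a single character when no word matches.

```python
def max_match(text, lexicon):
    """Greedy left-to-right segmentation using the longest matching word."""
    tokens = []
    i = 0
    while i < len(text):
        # try the longest candidate first, then shorter ones
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # no dictionary word starts here: emit one character
            tokens.append(text[i])
            i += 1
    return tokens


# Assumption: a toy lexicon chosen to show the greedy behaviour
LEXICON = {"the", "theta", "table", "down", "there", "bled", "own"}
print(max_match("thetabledownthere", LEXICON))
```

On this input the greedy choice of "theta" leads to "theta bled own there" rather than "the table down there", which shows why the approach works poorly for English even though it is a common baseline for languages written without spaces.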
Computational tasks in Morphology
• Recognition: recognize whether a string is an English word (FSA)
• Parsing/Generation: word <-> stem, class, lexical features
 – e.g., bought <-> buy +V +PAST-PART, buy +V +PAST
• Stemming: word -> stem
Stemmer
• E.g., the Porter algorithm, which is based on a series of sets of simple cascaded rewrite rules:
• (condition) S1 -> S2
 – ATIONAL -> ATE (relational -> relate)
 – (*v*) ING -> ε if stem contains vowel (motoring -> motor)
• Cascade of rules applied to: computerization
 – -ization -> -ize: computerize
 – -ize -> ε: computer
• Errors occur:
 – organization -> organ, doing -> doe, university -> universe
Code freely available in most languages: Python, Java, …
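The rule cascade can be imitated with a few suffix rewrites applied in order. This is a toy sketch limited to the rules shown on the slide, with hypothetical function names. It is not the Porter algorithm, which has many more rules and conditions on the shape of the stem.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def rewrite(word, suffix, replacement):
    """Apply one unconditional rule S1 -> S2 if the word ends in S1."""
    if word.endswith(suffix):
        return word[:-len(suffix)] + replacement
    return word


def strip_ing(word):
    """(*v*) ING -> empty: drop -ing only if the remaining stem has a vowel."""
    stem = word[:-3]
    if word.endswith("ing") and VOWEL.search(stem):
        return stem
    return word


def cascade(word):
    """Apply the rewrite rules from the slide one after another."""
    word = rewrite(word, "ational", "ate")
    word = rewrite(word, "ization", "ize")
    word = rewrite(word, "ize", "")
    return word


for w in ["relational", "computerization", "organization"]:
    print(w, "->", cascade(w))
print("motoring", "->", strip_ing("motoring"))
print("sing", "->", strip_ing("sing"))
```

The sketch reproduces one of the errors listed on the slide: "organization" is reduced to "organ".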
Stemming mainly used in Information Retrieval
1. Run a stemmer on the documents to be indexed
2. Run a stemmer on users' queries
3. Compute similarity between queries and documents (based on the stems they contain)
Seems to work especially well with smaller documents
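The three steps can be sketched end to end. Everything here is an illustrative assumption: `toy_stem` is a crude suffix stripper standing in for a real stemmer, and Jaccard overlap of stem sets stands in for whatever similarity measure an actual retrieval system would use.

```python
def toy_stem(word):
    """Crude stand-in for a stemmer: strip one known suffix, keep >= 3 letters."""
    for suffix in ("ization", "izing", "ize", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def stems(text):
    """Steps 1 and 2: reduce a document or a query to its set of stems."""
    return {toy_stem(w) for w in text.lower().split()}


def similarity(query, document):
    """Step 3: Jaccard overlap between the two stem sets (an assumption)."""
    q, d = stems(query), stems(document)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


query = "computerizing records"
document = "the computerization of record keeping"
print(stems(query))
print(similarity(query, document))
```

Without stemming, the query and the document in this example share no words at all. After stemming they share the stems "computer" and "record".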
Porter as an FST
• The original exposition of the Porter stemmer did not describe it as a transducer but…
 – Each stage is a separate transducer
 – The stages can be composed to get one big transducer
Next Time
• Read handout
 – Probability
 – Stats
 – Information theory
• Next Lecture:
 – Finish Chpt. 3, 3.10-11
 – Start Probabilistic Models for NLP (Chpt. 4, 4.1 – 4.2 and 5.9!)