CSA 3050 Natural Language Algorithms Finite State Devices

CSA 3050: Natural Language Algorithms
Finite State Devices
October 2005, CSA 3180 NLP

Sources
• Blackburn & Striegnitz Ch. 2

Part I: Parsers and Transducers

Parsers vs. Recognisers
• Recognisers tell us whether a given input is accepted by some finite state automaton.
• Often we would like to have an explanation of why it was accepted.
• Parsers give us that kind of explanation.
• What form does it take?

Finite State Parser
• The output of a finite state parser is a sequence of nodes and arcs. If we gave the input [h, a, !] to a parser for our first laughing automaton, it should give us [1, h, 2, a, 3, !, 4].
• The standard technique in Prolog for turning a recogniser into a parser is to add one or more extra arguments to keep track of the structure that was found.

Base Case

Recogniser:
    recognize1(Node, []) :-
        final(Node).

Parser:
    parse1(Node, [], [Node]) :-
        final(Node).

Recursive Case

Recogniser:
    recognize1(Node1, String) :-
        arc(Node1, Node2, Label),
        traverse1(Label, String, NewString),
        recognize1(Node2, NewString).

Parser:
    parse1(Node1, String, [Node1, Label|Path]) :-
        arc(Node1, Node2, Label),
        traverse1(Label, String, NewString),
        parse1(Node2, NewString, Path).
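The base and recursive cases can be put together into a small runnable program. This is only a sketch for SWI-Prolog: the arc facts for the laughing automaton and the definition of traverse1/3 do not appear on these slides, so both are assumptions, chosen so that the query reproduces the path quoted earlier.

```prolog
% Assumed laughing automaton: 1 -h-> 2 -a-> 3 -!-> 4, plus 3 -h-> 2 to repeat "ha".
initial(1).
final(4).
arc(1, 2, h).
arc(2, 3, a).
arc(3, 4, '!').
arc(3, 2, h).

% Assumed helper: consume one input symbol that matches the arc label.
traverse1(Label, [Label|Symbols], Symbols).

% Base case: input exhausted in a final state; the path ends with that state.
parse1(Node, [], [Node]) :-
    final(Node).
% Recursive case: record the current node and the label, then continue.
parse1(Node1, String, [Node1, Label|Path]) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    parse1(Node2, NewString, Path).

testparse1(Symbols, Parse) :-
    initial(Node),
    parse1(Node, Symbols, Parse).

% ?- testparse1([h, a, '!'], Z).
% Z = [1, h, 2, a, 3, !, 4].
```

The only difference from the recogniser is the third argument, which accumulates the visited nodes and arc labels as the recursion proceeds.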

Words as Labels
• So far we have only considered transitions with single-character labels.
• More complex labels are possible, e.g. words comprising several characters.
• We can construct an FSA recognizing English noun phrases that can be built from the words: the, a, wizard, witch, broomstick, hermione, harry, ron, with, fast.

FSA for Noun Phrases
[Figure: state diagram of the noun phrase FSA]

FSA for NPs in Prolog
    initial(1).
    final(3).
    arc(1, 2, a).
    arc(1, 2, the).
    arc(2, 2, brave).
    arc(2, 2, fast).
    arc(2, 3, witch).
    arc(2, 3, wizard).
    arc(2, 3, broomstick).
    arc(2, 3, rat).
    arc(1, 3, harry).
    arc(1, 3, ron).
    arc(1, 3, hermione).
    arc(3, 1, with).

Parsing a Noun Phrase
    testparse1(Symbols, Parse) :-
        initial(Node),
        parse1(Node, Symbols, Parse).

    ?- testparse1([the, fast, wizard], Z).
    Z = [1, the, 2, fast, 2, wizard, 3]

Rewriting Categories
• It is also possible to obtain a more abstract parse, e.g.
    ?- testparse2([the, fast, wizard], Z).
    Z = [1, det, 2, adj, 2, noun, 3]
• What changes are required to obtain this behaviour?

1. Changes to the FSA
    initial(1).
    final(3).
    arc(1, 2, det).
    arc(2, 2, adj).
    arc(2, 3, cn).
    arc(1, 3, pn).
    arc(3, 1, prep).

    % Lexicon
    lex(a, det).
    lex(the, det).
    lex(fast, adj).
    lex(brave, adj).
    lex(witch, cn).
    lex(wizard, cn).
    lex(broomstick, cn).
    lex(rat, cn).
    lex(harry, pn).
    lex(hermione, pn).
    lex(ron, pn).
    lex(with, prep).

Changes to the Parser

Parse 1:
    parse1(Node1, String, [Node1, Label|Path]) :-
        arc(Node1, Node2, Label),
        traverse1(Label, String, NewString),
        parse1(Node2, NewString, Path).

Parse 2:
    parse2(Node1, String, [Node1, Label|Path]) :-
        arc(Node1, Node2, Label),
        traverse2(Label, String, NewString),
        parse2(Node2, NewString, Path).

    traverse2(Cat, [Word|S], S) :-
        lex(Word, Cat).
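Combining the category-labelled FSA, the lexicon and parse2 gives a complete program. The sketch below assumes a base case for parse2 and a testparse2 wrapper analogous to parse1 and testparse1; neither is shown on the slides. Note that with the lexicon as given, the common-noun label comes out as cn rather than noun.

```prolog
initial(1).
final(3).
arc(1, 2, det).
arc(2, 2, adj).
arc(2, 3, cn).
arc(1, 3, pn).
arc(3, 1, prep).

% Lexicon: word, category.
lex(a, det).        lex(the, det).
lex(fast, adj).     lex(brave, adj).
lex(witch, cn).     lex(wizard, cn).
lex(broomstick, cn). lex(rat, cn).
lex(harry, pn).     lex(hermione, pn).   lex(ron, pn).
lex(with, prep).

% An arc labelled Cat is traversed by any word the lexicon assigns to Cat.
traverse2(Cat, [Word|S], S) :-
    lex(Word, Cat).

% Assumed base case, mirroring parse1.
parse2(Node, [], [Node]) :-
    final(Node).
parse2(Node1, String, [Node1, Label|Path]) :-
    arc(Node1, Node2, Label),
    traverse2(Label, String, NewString),
    parse2(Node2, NewString, Path).

% Assumed wrapper, mirroring testparse1.
testparse2(Symbols, Parse) :-
    initial(Node),
    parse2(Node, Symbols, Parse).

% ?- testparse2([the, fast, wizard], Z).
% Z = [1, det, 2, adj, 2, cn, 3].
% ?- testparse2([harry, with, the, broomstick], Z).
% Z = [1, pn, 3, prep, 1, det, 2, cn, 3].
```

Separating the lexicon from the arcs means that adding a new word requires only one new lex/2 fact, not a new arc.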

Handling Jumps
    % A jump arc ('#') consumes no input: the string is passed on unchanged.
    traverse3('#', String, String).
    traverse3(Cat, [Word|Words], Words) :-
        lex(Word, Cat).
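To see a jump arc at work, the sketch below uses a hypothetical automaton in which the determiner is optional: state 1 reaches state 2 either by reading a det or by a '#' jump. The automaton, the reduced lexicon, and the parse3 and testparse3 predicates are all assumptions for illustration, modelled on parse2.

```prolog
initial(1).
final(3).
arc(1, 2, det).
arc(1, 2, '#').     % hypothetical jump arc: skip the determiner
arc(2, 2, adj).
arc(2, 3, cn).

lex(the, det).
lex(fast, adj).
lex(wizard, cn).

% Jump: no input consumed.
traverse3('#', String, String).
% Ordinary arc: consume one word of the right category.
traverse3(Cat, [Word|Words], Words) :-
    lex(Word, Cat).

parse3(Node, [], [Node]) :-
    final(Node).
parse3(Node1, String, [Node1, Label|Path]) :-
    arc(Node1, Node2, Label),
    traverse3(Label, String, NewString),
    parse3(Node2, NewString, Path).

testparse3(Symbols, Parse) :-
    initial(Node),
    parse3(Node, Symbols, Parse).

% ?- testparse3([fast, wizard], Z).
% Z = [1, #, 2, adj, 2, cn, 3].
% ?- testparse3([the, wizard], Z).
% Z = [1, det, 2, cn, 3].
```

An automaton with a cycle made up only of jump arcs could loop forever under this simple strategy; the example avoids such cycles.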

Finite State Transducers
• A finite state transducer is essentially a finite state automaton that works on two (or more) tapes.
• The most common way to think about transducers is as a kind of “translating machine” which works by reading from one tape and writing onto the other.

A Translator from a to b
[Figure: a single state 1 with a loop labelled a:b]
• initial state: arrowhead
• final state: double circle
• a:b: read a from the first tape and write b to the second tape

Prolog Representation
    :- op(250, xfx, :).
    initial(1).
    final(1).
    arc(1, 1, a:b).

Modes of Operation
• generation mode: It writes a string of a's on one tape and a string of b's on the other tape. Both strings have the same length.
• recognition mode: It accepts when the word on the first tape consists of exactly as many a's as the word on the second tape consists of b's.
• translation mode (left to right): It reads a's from the first tape and writes a b onto the second tape for every a that it reads.
• translation mode (right to left): It reads b's from the second tape and writes an a onto the first tape for every b that it reads.

Part II: Computational Morphology

Morphology
• Morphemes: the smallest units in a word that bear some meaning, such as rabbit and s.
• Morphology studies how morphemes combine to form words that are legal in some language.
• Two kinds of morphology
  – Inflectional
  – Derivational

Inflectional/Derivational Morphology
• Inflectional
  – +s plural, +ed past
  – category preserving
  – productive: always applies (esp. new words, e.g. fax)
  – systematic: same semantic effect
• Derivational
  – +ment
  – category changing: escape+ment
  – not completely productive: detractment*
  – not completely systematic: apartment

Example: English Noun Inflections

              Regular            Irregular
    Singular  cat    church      mouse   ox
    Plural    cats   churches    mice    oxen

Morphological Parsing
[Figure: Input Word (cats) → Morphological Parser → Output Analysis (cat N PL)]
• Output is a string of morphemes
• lexeme, other meaningful morphemes
• Reversibility?

Morphological Parsing
• The goal of morphological parsing is to find out what morphemes a given word is built from.
    cats   →  cat N PL
    mice   →  mouse N PL
    foxes  →  fox N PL

Morphological Analysis with FSTs
• The basic idea is to write FSTs that map the surface form of a word to a description of the morphemes that constitute that word, or vice versa.
• Example: wizard+s to wizard+PL or kiss+ed to kiss+PAST.

Plural Nouns in English
• Regular forms
  – add an s, as in wizard+s
  – add -es, as in witch+es
• Handled with morpho-phonological rules that insert an e whenever the morpheme preceding the s ends in s, x, ch or another fricative.
• Irregular forms
  – mouse/mice
  – automaton/automata
• Handled on a case-by-case basis.
• This requires a transducer that translates wizard+s into wizard+PL, witch+es into witch+PL, mice into mouse+PL and automata into automaton+PL.

2 Steps
1. Split the word up into its possible components, using + to indicate possible morpheme boundaries.
    cats   →  cat + s
    foxes  →  fox + s
    mice   →  mouse + s
2. Look up the categories of the stems and the meaning of the affixes, using a lexicon of stems and affixes.
    cat + s    →  cat N PL
    fox + s    →  fox N PL
    mouse + s  →  mouse N PL

Step 1
• The transducer may or may not insert a ‘+’ (morpheme boundary) if the word ends in ‘s’.
• If the word ends in ses, xes, or zes, it may also delete the ‘e’ when inserting the morpheme boundary, e.g. foxes → fox + s.

Transducer for Step 1
[Figures (two slides): transducer mapping the Surface tape to the Intermediate tape]

Prolog Representation
• The transducer specifications we have seen translate easily into Prolog format, except for the <other> transitions.
    arc(1, 3, z:z).
    arc(1, 3, s:s).
    arc(1, 3, x:x).
    arc(1, 2, '#':'+').
    arc(3, 1, <other>).
    arc(1, 1, <other>).

One Way to Handle <other> arcs
    arc(1, 3, z:z).
    arc(1, 3, s:s).
    arc(1, 3, x:x).
    arc(1, 2, '#':'+').
    arc(3, 1, a:a).
    arc(3, 1, b:b).
    arc(3, 1, c:c).
    % etc.
    arc(3, 1, y:y).
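Listing one fact per letter works but is tedious. An alternative, which is not the approach shown on the slide, is to generate the <other> arcs with a rule. The sketch below assumes that <other> means "any lowercase letter except z, s and x"; the helper predicates letter/1, special/1 and other_letter/1 are hypothetical names, and the code relies on SWI-Prolog's built-in `:` operator.

```prolog
% Explicit arcs from the slide ('#' and '+' quoted so they read as plain atoms).
arc(1, 3, z:z).
arc(1, 3, s:s).
arc(1, 3, x:x).
arc(1, 2, '#':'+').
% <other> arcs, generated by rule instead of one fact per letter.
arc(3, 1, C:C) :- other_letter(C).
arc(1, 1, C:C) :- other_letter(C).

% letter(C): C is one of the atoms a..z.
letter(C) :-
    between(0'a, 0'z, Code),
    char_code(C, Code).

% Letters that have their own explicit arcs (assumption about <other>).
special(z).
special(s).
special(x).

other_letter(C) :-
    letter(C),
    \+ special(C).

% ?- arc(3, 1, h:h).      succeeds
% ?- arc(3, 1, s:s).      fails
```

The rule form keeps the definition of <other> in one place, so changing the set of special letters does not require editing a long list of facts.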

Transducer for Step 2: Intermediate → Morphemes
Possible inputs to the transducer are:
• Regular noun stem: cat
• Regular noun stem + s: cat+s
• Singular irregular noun stem: mouse
• Plural irregular noun stem: mice

2. Intermediate → Morphemes Transducer
[Figure: transducer diagram]

Handling Stems
[Figure: stem arcs labelled cat/cat and mice/mouse]

Completed Stage 2
[Figure: completed Stage 2 transducer]

Joining Stages 1 and 2
• If the two transducers run in a cascade (i.e. we let the second transducer run on the output of the first one), we can do a morphological parse of (some) English noun phrases.
• We can also change the direction of translation (in translation mode).
• This transducer can also be used for generating a surface form from an underlying form.
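The idea of a cascade can be sketched with two toy transducers. Everything here is an assumption for illustration: the transducers t1 (a to b) and t2 (b to c) stand in for Stages 1 and 2, automata are given names by adding a first argument to initial, final and arc, and run/3 and cascade/4 are hypothetical predicate names. The code relies on SWI-Prolog's built-in `:` operator.

```prolog
% Two named toy transducers: t1 rewrites a as b, t2 rewrites b as c.
initial(t1, 1).  final(t1, 1).  arc(t1, 1, 1, a:b).
initial(t2, 1).  final(t2, 1).  arc(t2, 1, 1, b:c).

% Read L1 from the first tape and write L2 to the second.
traverse(L1:L2, [L1|Rest1], Rest1, [L2|Rest2], Rest2).

transduce(Name, Node, [], []) :-
    final(Name, Node).
transduce(Name, Node1, Tape1, Tape2) :-
    arc(Name, Node1, Node2, Label),
    traverse(Label, Tape1, NewTape1, Tape2, NewTape2),
    transduce(Name, Node2, NewTape1, NewTape2).

% Run a named transducer from its initial state.
run(Name, Tape1, Tape2) :-
    initial(Name, Node),
    transduce(Name, Node, Tape1, Tape2).

% Cascade: the output tape of First is the input tape of Second.
cascade(First, Second, In, Out) :-
    run(First, In, Mid),
    run(Second, Mid, Out).

% ?- cascade(t1, t2, [a, a], Out).
% Out = [c, c].
```

Because the intermediate tape is just a shared variable, the same cascade/4 clause would chain any two named transducers defined in this representation.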

Combining Rules
• Consider the word “berries”.
• Two rules are involved
  – berry + s
  – y → ie under certain circumstances
• Combinations of such rules can be handled in two ways
  – Cascade, i.e. sequentially
  – Parallel
• Algorithms exist for combining transducers together in series or in parallel.
• Such algorithms involve computations over regular relations.

3 Related Frameworks
[Figure relating REGULAR LANGUAGES, REGULAR EXPRESSIONS and FSA]

Concatenation over FS Automata
[Figure: concatenation of two automata; arc labels a, b, c, d]

[Figure relating REGULAR RELATIONS, AUGMENTED REGULAR EXPRESSIONS and FINITE STATE TRANSDUCERS]

Putting it all together
[Figure] Execution of the FSTi takes place in parallel.

Kaplan and Kay: The Xerox View
• FSTi are aligned but separate
• FSTi intersected together

Summary
• Morphological processing can be handled by finite state machinery.
• Finite State Transducers are formally very similar to Finite State Automata.
• They are formally equivalent to regular relations, i.e. sets of pairings of sentences of regular languages.

Exercises
• Change the representation of automata to allow them to be given names.
• Make the corresponding changes to the transducer.
• Write a predicate which allows two named automata to be composed, i.e. the output of one becomes the input of the other.

Simple Transducer in Prolog
    transduce1(Node, [], []) :-
        final(Node).
    transduce1(Node1, Tape1, Tape2) :-
        arc(Node1, Node2, Label),
        traverse1(Label, Tape1, NewTape1, Tape2, NewTape2),
        transduce1(Node2, NewTape1, NewTape2).

Traverse for FST
    traverse1(L1:L2, [L1|RestTape1], RestTape1, [L2|RestTape2], RestTape2).

    testtrans1(Tape1, Tape2) :-
        initial(Node),
        transduce1(Node, Tape1, Tape2).
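Put together with the a:b translator from Part I, these clauses can be tried in each of the four modes of operation. This is a minimal sketch for SWI-Prolog; it omits the op/3 declaration and relies on the system's built-in `:` operator, which is an assumption about the environment.

```prolog
% The a-to-b translator: one state, both initial and final.
initial(1).
final(1).
arc(1, 1, a:b).

transduce1(Node, [], []) :-
    final(Node).
transduce1(Node1, Tape1, Tape2) :-
    arc(Node1, Node2, Label),
    traverse1(Label, Tape1, NewTape1, Tape2, NewTape2),
    transduce1(Node2, NewTape1, NewTape2).

% Read L1 from tape 1 and write L2 to tape 2.
traverse1(L1:L2, [L1|RestTape1], RestTape1, [L2|RestTape2], RestTape2).

testtrans1(Tape1, Tape2) :-
    initial(Node),
    transduce1(Node, Tape1, Tape2).

% Translation, left to right:
% ?- testtrans1([a, a, a], Tape2).
% Tape2 = [b, b, b].
%
% Translation, right to left:
% ?- testtrans1(Tape1, [b, b]).
% Tape1 = [a, a].
%
% Recognition:
% ?- testtrans1([a, a], [b, b]).     succeeds
% ?- testtrans1([a], [b, b]).        fails
%
% Generation: with both tapes unbound, backtracking enumerates
% [] and [], then [a] and [b], then [a, a] and [b, b], and so on.
```

No mode-specific code is needed: which mode is used depends only on which of the two tape arguments is instantiated in the query.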

Transducers and Jumps
• Transducers can make jumps, going from one state to another without doing anything on either one or both of the tapes.
• So, transitions of the form a:#, #:a or #:# are possible.

Handling Jumps: 4 cases
• Jump on both tapes.
• Jump on the first but not on the second tape.
• Jump on the second but not on the first tape.
• Jump on neither tape (this is what traverse1 does).

4 Corresponding Clauses
    % Jump on both tapes: neither tape changes.
    traverse2('#':'#', Tape1, Tape1, Tape2, Tape2).
    % Jump on the first tape only: write L2 to the second tape.
    traverse2('#':L2, Tape1, Tape1, [L2|RestTape2], RestTape2).
    % Jump on the second tape only: read L1 from the first tape.
    traverse2(L1:'#', [L1|RestTape1], RestTape1, Tape2, Tape2).
    % Jump on neither tape.
    traverse2(L1:L2, [L1|RestTape1], RestTape1, [L2|RestTape2], RestTape2).
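A small example shows the jump clauses in use. The transducer below is hypothetical: it writes one b for every pair of a's, using an a:'#' arc to read the second a without writing anything. The predicates transduce2 and testtrans2 are assumed to be the obvious analogues of transduce1 and testtrans1, and the code relies on SWI-Prolog's built-in `:` operator.

```prolog
% Hypothetical transducer: state 1 is initial and final.
initial(1).
final(1).
arc(1, 2, a:b).       % read a, write b
arc(2, 1, a:'#').     % read a, write nothing (jump on tape 2)

traverse2('#':'#', Tape1, Tape1, Tape2, Tape2).
traverse2('#':L2, Tape1, Tape1, [L2|RestTape2], RestTape2).
traverse2(L1:'#', [L1|RestTape1], RestTape1, Tape2, Tape2).
traverse2(L1:L2, [L1|RestTape1], RestTape1, [L2|RestTape2], RestTape2).

transduce2(Node, [], []) :-
    final(Node).
transduce2(Node1, Tape1, Tape2) :-
    arc(Node1, Node2, Label),
    traverse2(Label, Tape1, NewTape1, Tape2, NewTape2),
    transduce2(Node2, NewTape1, NewTape2).

testtrans2(Tape1, Tape2) :-
    initial(Node),
    transduce2(Node, Tape1, Tape2).

% ?- once(testtrans2([a, a, a, a], Tape2)).
% Tape2 = [b, b].
```

once/1 keeps only the first answer here because, on backtracking, the last traverse2 clause also matches a label such as a:'#' and then treats '#' as an ordinary symbol to be written to the tape.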

FST in Prolog
    lex(wizard:wizard, 'STEM-REG1').
    lex(witch:witch, 'STEM-REG2').
    lex(automaton:automaton, 'IRREG-SG').
    lex(automata:'automaton-PL', 'IRREG-PL').
    lex(mouse:mouse, 'IRREG-SG').
    lex(mice:'mouse-PL', 'IRREG-PL').