Computational Linguistics Introduction
Finite State Machinery and Language Description
Apr 2009 CLINT-LIN: Finite State Machinery

Acknowledgement
The material for this lecture is derived from a series of talks given by Dr. Ken Beesley (Xerox European Research Centre, Grenoble) in Malta, 2001.

Today’s Topics
• Finite State Technology
• Regular Languages and Relations
• Review of Set Theory
• Understand the mathematical operations that can be performed on such Languages.
• Understand how Languages, Relations, Regular Expressions, and Networks are interrelated.

What is Finite State Technology?
• Finite State Technology refers to a collection of techniques for applying Finite State Automata (FSA) to a range of linguistically motivated problems.
• Such techniques include:
  • Design of user languages for specifying FSA
  • Compilation of such languages into efficient transition networks
  • Development environments and runtime systems

What is Finite-State Technology Good For?
• Finite-state techniques cannot handle centre embedding:
  • the man the dog the cat bit followed ate.
• They are well suited to “lower-level” natural language processing such as:
  • Tokenization: what is the next word?
  • Spelling error detection: does the next word belong to a list?
  • Morphological/phonological analysis/generation
  • Shallow syntactic parsing and “chunking”

Tokenisation Problems
VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday.
(example from Mary Dalrymple, University of London)
Problem tokens in this sentence:
• VfB Stuttgart, Manchester United
• succession
• 2-1
• Wednesday.
• Finite state techniques provide a means to specify the language of words, thus defining what it means to be the next token.
• There are three ways to specify such languages.

Languages, Notations and Machines
[Diagram relating three notions: LANGUAGE (set of strings), NOTATION, MACHINE]

Languages, Notations and Machines
[Diagram relating three notions: FINITE STATE LANGUAGE, FINITE STATE NOTATION, FINITE STATE AUTOMATON]

FINITE STATE AUTOMATA: preliminary definition
A finite state automaton includes:
• A finite set of states
• A finite set of labelled transitions between states

Physical Machines with Finite States
The Lightswitch Machine
[Diagram: two states, OFF and ON, with transitions labelled UP and DOWN]

Physical Machines with Finite States
The Lightswitch Toggle Machine
[Diagram: two states, OFF and ON, with both transitions labelled PUSH]
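
The two lightswitch machines can be written down as transition tables. The following is only a sketch: the arc directions (UP leads from OFF to ON, DOWN leads back) are an assumption, since the slides list only the states and the labels, and the names `LIGHTSWITCH`, `TOGGLE` and `run` are invented for illustration.

```python
# Transition tables: (current state, input) -> next state.
# Assumed arc directions: UP turns the light on, DOWN turns it off.
LIGHTSWITCH = {("OFF", "UP"): "ON", ("ON", "DOWN"): "OFF"}
# The toggle machine uses the single input PUSH in both directions.
TOGGLE = {("OFF", "PUSH"): "ON", ("ON", "PUSH"): "OFF"}


def run(transitions, start, inputs):
    """Follow one transition per input; return the end state, or None if stuck."""
    state = start
    for symbol in inputs:
        state = transitions.get((state, symbol))
        if state is None:
            return None  # no arc with this label leaves the current state
    return state


print(run(TOGGLE, "OFF", ["PUSH", "PUSH", "PUSH"]))
print(run(LIGHTSWITCH, "OFF", ["UP", "DOWN"]))
```

In this sketch an input with no matching arc (for example DOWN while the switch is already OFF) simply leaves the machine stuck.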

The Five Cent Machine
Problem:
• Assume you have one, two, and five cent pieces
• Design a finite state automaton which accepts exactly 5 cents.
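
One possible solution, as a sketch rather than the intended answer: let each state stand for the number of cents inserted so far, so that state 0 is the start state and state 5 is the only final state. The names `TRANSITIONS` and `accepts` are hypothetical.

```python
COINS = (1, 2, 5)        # coin values in cents
START, FINAL = 0, 5      # a state is the number of cents inserted so far

# Labelled transitions: (state, coin) -> next state, never overshooting FINAL.
TRANSITIONS = {
    (state, coin): state + coin
    for state in range(FINAL)
    for coin in COINS
    if state + coin <= FINAL
}


def accepts(coins):
    """Return True if the coin sequence ends exactly in the final state."""
    state = START
    for coin in coins:
        if (state, coin) not in TRANSITIONS:
            return False  # unknown coin, or the coin would overshoot 5 cents
        state = TRANSITIONS[(state, coin)]
    return state == FINAL


print(accepts([2, 2, 1]), accepts([2, 2]), accepts([5, 1]))
```

The machine has six states (0 to 5 cents), and no arcs leave the final state, so any extra coin after 5 cents is rejected.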

The Cola Machine
• Need to enter 25 cents (USA) to get a drink
• Accepts the following coins:
  • Nickel = 5 cents
  • Dime = 10 cents
  • Quarter = 25 cents
• For simplicity, our machine needs exact change
• We will model only the coin-accepting mechanism

Physical Machines with Finite States
The Cola Machine
[Diagram: states 0 (Start State), 5, 10, 15, 20 and 25 (Final State), connected by arcs labelled N, D and Q]

The Cola Machine Language
• List of all the sequences of coins accepted:
  • { Q, DDN, DND, NDD, DNNN, NDNN, NNDN, NNND, NNNNN }
• Think of the coins as SYMBOLS or CHARACTERS
• The set of symbols accepted is the ALPHABET of the machine
• Think of sequences of coins as WORDS or “strings”
• The set of words accepted by the machine is its LANGUAGE
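
The language of the cola machine can be enumerated mechanically by walking every path from the start state to the final state. This is a minimal sketch under the exact-change assumption stated earlier; `cola_language` and the constant names are made up for the example.

```python
COIN_VALUES = {"N": 5, "D": 10, "Q": 25}  # nickel, dime, quarter
PRICE = 25                                # the final state


def cola_language(state=0):
    """Yield every coin string that leads from `state` to the final state."""
    if state == PRICE:
        yield ""  # nothing more to insert
        return
    for coin, value in COIN_VALUES.items():
        if state + value <= PRICE:  # exact change only: never overshoot
            for rest in cola_language(state + value):
                yield coin + rest


ALPHABET = set(COIN_VALUES)      # the symbols of the machine
LANGUAGE = set(cola_language())  # the words of the machine

print(sorted(LANGUAGE, key=lambda word: (len(word), word)))
```

The enumeration yields nine words: one of length 1, three of length 3, four of length 4 and one of length 5.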

FINITE STATE AUTOMATA: better definition
A finite state automaton includes:
• A finite set of states
  • Initial State
  • Final State(s)
• A finite set of labelled transitions between states
• Labels are symbols from an alphabet
• Recognises a language
• Generates a language as well!

A Network that Accepts a One Word Language
[Diagram: a linear network with arcs labelled c, a, n, t, o]

A Network that Accepts a Three Word Language
[Diagram: a network whose three paths spell c-a-n-t-o, t-i-g-r-e and m-e-s-a]

Scaling Up the Network
• Imagine the same network expanded to handle three million words, all of them corresponding to valid words of a given language.
• We supply a word and ‘apply’ it to the network. If it is accepted by the network, then it is a valid word. Otherwise it does not belong to the language.
• This is the basis for a Spanish spelling error detector.

Looking Up a Word
[Diagram: the input m-e-s-a is “applied” to the three-word network]

Lookup Failure
Lookup succeeds when all input is consumed and a final state is reached.
Lookup can fail because:
• Not all input is consumed ("libro", "tigra")
• Input is fully consumed but the state is not final ("cant")
• A final state is reached but there is still unconsumed input ("mesas")
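
The success and failure cases can be reproduced with a small word-list network. This is a sketch only: it assumes the three words are canto, tigre and mesa, builds a plain letter tree (no shared endings), and uses the invented names `build_network` and `lookup`.

```python
def build_network(words):
    """Build a letter-by-letter network (a simple letter tree) from a word list."""
    transitions = {}  # (state, symbol) -> next state
    finals = set()
    next_state = 1    # state 0 is the start state
    for word in words:
        state = 0
        for symbol in word:
            key = (state, symbol)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)  # the state reached at the end of a word is final
    return transitions, finals


def lookup(network, word):
    """Apply a word to the network and report why it succeeds or fails."""
    transitions, finals = network
    state = 0
    for symbol in word:
        following = transitions.get((state, symbol))
        if following is None:
            if state in finals:
                return "final state reached, input left over"
            return "stuck, not all input consumed"
        state = following
    return "accepted" if state in finals else "input consumed, state not final"


NETWORK = build_network(["canto", "tigre", "mesa"])
for candidate in ["mesa", "libro", "tigra", "cant", "mesas"]:
    print(candidate, "->", lookup(NETWORK, candidate))
```

A production network would normally be minimised so that words share states and arcs; the letter tree here is chosen only because it is short to write.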

Shared Structure
[Diagram: a network with arcs labelled c, l, e, a, v, e, r, e, in which several words share states and arcs]

Transducers
[Diagram: a path whose arcs pair lexical symbols with surface symbols]
Lexical side: m e s a +Noun +Fem +Pl (mesa+Noun+Fem+Pl)
Surface side: m e s a 0 0 s (mesas)
“Lookup” maps mesas to mesa+Noun+Fem+Pl; “Lookdown” maps mesa+Noun+Fem+Pl to mesas.
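
The mesa example can be imitated with a list of symbol pairs. This is a deliberately simplified sketch: it stores a transducer as a list of complete paths instead of a state network, treats "0" as the empty symbol as on the slide, and the names `PATH`, `side`, `lookup` and `lookdown` are illustrative only.

```python
# One path through the transducer: (lexical symbol, surface symbol) per arc.
# "0" marks the empty symbol, following the slide's notation.
PATH = [
    ("m", "m"), ("e", "e"), ("s", "s"), ("a", "a"),
    ("+Noun", "0"), ("+Fem", "0"), ("+Pl", "s"),
]


def side(path, index):
    """Spell out one side of a path (0 = lexical, 1 = surface), skipping "0"."""
    return "".join(pair[index] for pair in path if pair[index] != "0")


def lookup(surface, paths):
    """Surface form -> all lexical analyses."""
    return [side(path, 0) for path in paths if side(path, 1) == surface]


def lookdown(lexical, paths):
    """Lexical form -> all surface forms."""
    return [side(path, 1) for path in paths if side(path, 0) == lexical]


print(lookup("mesas", [PATH]))
print(lookdown("mesa+Noun+Fem+Pl", [PATH]))
```

The same data serves both directions, which is the point of the slide: analysis and generation differ only in which side is matched against the input.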

A Morphological Analyzer
[Diagram: a Transducer relating dog +n +pl to dogs]

A Morphological Analyzer
[Diagram: a Transducer relating the Lexical Language to the Surface Language]

A Quick Review of Set Theory
A set is a collection of objects.
[Diagram: a set containing A, B, D, E]
We can enumerate the “members” or “elements” of finite sets: { A, D, B, E }.
There is no significant order in a set, so { A, D, B, E } is the same set as { E, A, D, B }, etc.

Uniqueness of Elements
You cannot have two or more ‘A’ elements in the same set.
[Diagram: a set containing A, B, D, E]
{ A, A, D, B, E } is just a redundant specification of the set { A, D, B, E }.
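
Both properties (no significant order, no repeated elements) can be checked directly with Python's built-in set type. The variable names are chosen for this example only.

```python
ordered_one_way = {"A", "D", "B", "E"}
ordered_another_way = {"E", "A", "D", "B"}
redundant = {"A", "A", "D", "B", "E"}  # the second "A" is silently dropped

print(ordered_one_way == ordered_another_way)  # order does not matter
print(redundant == ordered_one_way)            # repetition does not matter
print(len(redundant))                          # number of distinct members
```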

Cardinality of Sets
• The Empty Set: { }
• A Finite Set: { Norway, Denmark, Sweden }
• An Infinite Set: e.g. the set of all positive integers

Simple Operations on Sets: Union
[Diagram: Set 1 and Set 2 between them contain A, B, C, D, E]
Union of Set 1 and Set 2: { A, B, C, D, E }

Simple Operations on Sets (2): Union
Set 1: { A, B, C }
Set 2: { C, D }
Union of Set 1 and Set 2: { A, B, C, D }

Simple Operations on Sets (3): Intersection
Set 1: { A, B, C }
Set 2: { C, D }
Intersection of Set 1 and Set 2: { C }

Simple Operations on Sets (4): Subtraction
Set 1: { A, B, C }
Set 2: { C, D }
Set 1 minus Set 2: { A, B }

Formal Languages
Very Important Concept in Formal Language Theory: A Language is just a Set of Words.
• We use the terms “word” and “string” interchangeably.
• A Language can be empty, have finite cardinality, or be infinite in size.
• You can union, intersect and subtract languages, just like any other sets.

Union of Languages (Sets)
[Diagram: Language 1 and Language 2 between them contain dog, cat, rat, elephant, mouse]
Union of Language 1 and Language 2: { dog, cat, rat, elephant, mouse }

Intersection of Languages (Sets)
[Diagram: Language 1 and Language 2 between them contain dog, cat, rat, elephant, mouse and share no word]
Intersection of Language 1 and Language 2: { } (the empty language)

Intersection of Languages (Sets)
Language 1: { dog, cat, rat }
Language 2: { rat, mouse }
Intersection of Language 1 and Language 2: { rat }

Subtraction of Languages (Sets)
Language 1: { dog, cat, rat }
Language 2: { rat, mouse }
Language 1 minus Language 2: { dog, cat }
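
Because a language is just a set of words, the three operations map directly onto Python's set operators. A small sketch, taking Language 1 to be { dog, cat, rat } and Language 2 to be { rat, mouse }, which is the reading consistent with the intersection and subtraction results shown:

```python
language_1 = {"dog", "cat", "rat"}
language_2 = {"rat", "mouse"}

union = language_1 | language_2         # words in either language
intersection = language_1 & language_2  # words in both languages
difference = language_1 - language_2    # words in Language 1 but not Language 2

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
```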

Languages
• A language is a set of words (=strings).
• Words (strings) are composed of symbols (letters) that are “concatenated” together.
• At another level, words are composed of “morphemes”.
• In most natural languages, we concatenate morphemes together to form whole words.
For sets consisting of words (i.e. for Languages), the operation of concatenation is very important.

Concatenation of Languages
Root Language: work, talk, walk
Suffix Language: 0, ing, ed, s
The concatenation of the Suffix language after the Root language: working, worked, works, talking, talked, talks, walking, walked, walks
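
Concatenating two languages means pairing every word of the first with every word of the second. The sketch below assumes that the 0 in the suffix language stands for the empty string; under that reading the bare roots also belong to the result. The function name `concatenate` is invented for the example.

```python
ROOTS = {"work", "talk", "walk"}
SUFFIXES = {"", "ing", "ed", "s"}  # "" stands in for the slide's 0 (assumed empty string)


def concatenate(first, second):
    """Return every word formed by a word of `first` followed by a word of `second`."""
    return {left + right for left in first for right in second}


WORDS = concatenate(ROOTS, SUFFIXES)
print(sorted(WORDS))
print(len(WORDS))
```

Leaving the empty suffix out, as in `concatenate(ROOTS, {"ing", "ed", "s"})`, gives exactly the nine suffixed words listed on the slide.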

Languages and Networks
[Diagram: Network/Language 1 (arcs spelling the roots), Network/Language 2 (arcs 0, s, i-n-g, e-d), and a third network that is the concatenation of Network 1 and Network 2]

Why is “Finite State” Computing so Interesting?
• Finite-state systems are mathematically elegant, easily manipulated and modifiable.
• Computationally efficient. Usually very compact.
• The programming we linguists do is declarative. We describe the facts of our natural language; i.e. we write grammars. We do not hack ad hoc code.
• The runtime code, which applies our systems to linguistic input, is already written and it is completely language-independent.
• Finite-state systems are inherently bidirectional: we can use the same system to analyze and to generate.

Languages, Notations and Machines
[Diagram relating three notions: FINITE STATE LANGUAGE, FINITE STATE NOTATION, FINITE STATE MACHINE]