CSA 3050: Natural Language Algorithms
Words, Strings and Regular Expressions; Finite State Automata
October 2004

This lecture
• Outline
  – Words
  – The language of words
  – FSAs in Prolog
• Acknowledgement
  – Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
  – Blackburn and Striegnitz, NLP Techniques in Prolog: http://www.coli.uni-sb.de/~kris/nlp-with-prolog/html/

What is a Word?
• A series of speech sounds that symbolizes meaning without being divisible into smaller units
• Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark
• A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements
• A number of bytes processed as a unit

Information Associated with Words
• Spelling
  – orthographic
  – phonological
• Syntax
  – POS
  – Valency
• Semantics
  – Meaning
  – Relationship to other words

Properties of Words
• Sequence (example word: pollution)
  – characters
  – phonemes
• Delimitation
  – whitespace
  – other?
• Structure
  – simple ("atomic") words
  – complex ("molecular") words

Complex Words
• enlargement
  – en + large + ment
  – (en + large) + ment
  – en + (large + ment)
• affixation
  – prefix
  – suffix
  – infix

Sets Underlie the Formation of Complex Words

prefixes  +  roots   +  suffixes
dis          large      ed
re           charge     ing
un           infect     ee
en           code       er
             decide     ly

Structure of Complex Words
• Complex words are made by concatenating elements chosen from
  – a set of prefixes
  – a set of roots
  – a set of suffixes
• The set of valid words for a given human language (e.g. English, Maltese) can be regarded as a formal language.
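The concatenation idea can be sketched in Prolog as follows. This is only an illustration: the predicate names word_prefix/1, word_root/1, word_suffix/1 and candidate_word/1 are hypothetical, and the affix and root sets are the ones listed on the "Sets" slide. Plain concatenation ignores spelling rules and overgenerates, so the result is a set of candidate strings, not the set of valid English words.

```prolog
% Affix and root sets from the slide on sets of prefixes, roots and suffixes.
word_prefix(dis). word_prefix(re). word_prefix(un). word_prefix(en).
word_root(large). word_root(charge). word_root(infect).
word_root(code). word_root(decide).
word_suffix(ed). word_suffix(ing). word_suffix(ee).
word_suffix(er). word_suffix(ly).

% candidate_word(?Word): Word is prefix + root + suffix, joined by
% plain concatenation (no spelling adjustments, no validity check).
candidate_word(Word) :-
    word_prefix(P),
    word_root(R),
    word_suffix(S),
    atom_concat(P, R, Stem),
    atom_concat(Stem, S, Word).
```

For example, the query candidate_word(disinfected) succeeds, but so does candidate_word(enlargeed), which shows why a realistic model needs more than concatenation over three sets.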

The Language of Words
• What kind of formal language is the language of words?
• One which can be constructed out of
  – A characteristic set of basic symbols (alphabet)
  – A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Closure (iteration)
• Regular Language; Regular Sets

Characterising Classes of Set
[Diagram: a class of sets or languages, its notation, and its machine]

Regular Expressions
• Notation for describing regular sets
• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
• Xerox Finite State tools use a somewhat different notation, but with similar function.

Regular Expressions

a      a simple symbol
AB     concatenation
A|B    alternation operator
A&B    intersection operator
A*     Kleene star
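To make the five operators concrete, here is a minimal sketch of a matcher over lists of symbols. The term encoding (sym/1, cat/2, alt/2, and/2, star/1) and the predicates match/3 and matches/2 are assumptions introduced for this illustration; they are not the notation of any particular tool.

```prolog
% match(+RE, +String, -Rest): RE matches a prefix of String, leaving Rest.
match(sym(C), [C|Rest], Rest).                 % a: a simple symbol
match(cat(A, B), S0, S) :-                     % AB: concatenation
    match(A, S0, S1),
    match(B, S1, S).
match(alt(A, _), S0, S) :- match(A, S0, S).    % A|B: alternation
match(alt(_, B), S0, S) :- match(B, S0, S).
match(and(A, B), S0, S) :-                     % A&B: both match the same prefix
    match(A, S0, S),
    match(B, S0, S).
match(star(_), S, S).                          % A*: zero repetitions
match(star(A), S0, S) :-                       % A*: one or more repetitions
    match(A, S0, S1),
    S1 \== S0,                                 % each repetition must consume input
    match(star(A), S1, S).

% matches(+RE, +String): RE matches the whole of String.
matches(RE, String) :- match(RE, String, []).
```

For instance, matches(cat(star(cat(sym(h), sym(a))), sym('!')), [h, a, h, a, '!']) succeeds, while the same expression does not match [h, '!'].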

Finite Automaton
A finite automaton comprises
• A finite set of states Q
• An alphabet of symbols I
• A start state q0 ∈ Q
• A set of final states F ⊆ Q
• A transition function δ(q, i) which maps a state q ∈ Q and a symbol i ∈ I to a new state q' ∈ Q

Encoding FSAs in Prolog
• Three predicates
  – initial/1: initial(s) – s is an initial state
  – final/1: final(f) – f is a final state
  – arc/3: arc(s, t, c) – there is an arc from s to t labelled c

Example 1: FSA

    initial(1).
    final(4).
    arc(1, 2, h).
    arc(2, 3, a).
    arc(3, 4, !).
    arc(3, 2, h).

[Diagram: states 1 to 4; arcs 1 -h-> 2, 2 -a-> 3, 3 -h-> 2, 3 -!-> 4; state 4 is final]

Example 2: FSA with jump arc

    initial(1).
    final(4).
    arc(1, 2, h).
    arc(2, 3, a).
    arc(3, 4, !).
    arc(3, 1, #).

[Diagram: states 1 to 4; arcs 1 -h-> 2, 2 -a-> 3, 3 -#-> 1 (jump arc), 3 -!-> 4; state 4 is final]
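A jump arc consumes no input, so a recogniser has to treat the label # specially. The sketch below shows one possible way to do that; the names recognize2/2, traverse2/3 and test2/1 are hypothetical, and the symbols ! and # are quoted only for portability (they denote the same atoms as in the example). Note that this simple approach would loop forever on a network containing a cycle made only of jump arcs.

```prolog
% Example 2: the FSA with a jump arc from state 3 back to state 1.
initial(1).
final(4).
arc(1, 2, h).
arc(2, 3, a).
arc(3, 4, '!').
arc(3, 1, '#').

% recognize2(+Node, +String): String is accepted starting from Node.
recognize2(Node, []) :-
    final(Node).
recognize2(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse2(Label, String, NewString),
    recognize2(Node2, NewString).

% traverse2(+Label, +String, -Rest)
traverse2('#', String, String).                 % jump arc: consume nothing
traverse2(Label, [Label|Symbols], Symbols) :-   % ordinary arc: consume one symbol
    Label \== '#'.

% test2(+Symbols): start recognition from an initial state.
test2(Symbols) :-
    initial(Node),
    recognize2(Node, Symbols).
```

With these definitions, test2([h, a, h, a, '!']) succeeds by taking the jump arc once, and test2([h, a, h, '!']) fails.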

Example 3: NDA (non-deterministic automaton)

    initial(1).
    final(4).
    arc(1, 2, h).
    arc(2, 3, a).
    arc(3, 4, !).
    arc(2, 1, a).

[Diagram: states 1 to 4; arcs 1 -h-> 2, 2 -a-> 3, 2 -a-> 1, 3 -!-> 4; state 4 is final]
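State 2 has two outgoing arcs labelled a, so the automaton cannot know in advance which one to take. In Prolog, backtracking resolves this: if one choice leads to a dead end, the other is tried. The following sketch records the states visited so that the effect is visible; path/3 and accepting_path/2 are hypothetical names introduced here.

```prolog
% Example 3: two arcs labelled a leave state 2.
initial(1).
final(4).
arc(1, 2, h).
arc(2, 3, a).
arc(3, 4, '!').
arc(2, 1, a).

% path(+Node, +Symbols, -Nodes): Nodes is the sequence of states
% visited while accepting Symbols, starting at Node.
path(Node, [], [Node]) :-
    final(Node).
path(Node1, [Label|Symbols], [Node1|Nodes]) :-
    arc(Node1, Node2, Label),
    path(Node2, Symbols, Nodes).

% accepting_path(+Symbols, -Nodes): as path/3, from an initial state.
accepting_path(Symbols, Nodes) :-
    initial(Node),
    path(Node, Symbols, Nodes).
```

For the input [h, a, h, a, '!'], the first attempt goes from state 2 to state 3 on the first a and gets stuck; after backtracking, the accepted path is [1, 2, 1, 2, 3, 4].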

A Recogniser

    % Accept when the input is exhausted in a final state.
    recognize1(Node, []) :-
        final(Node).
    % Otherwise follow an arc whose label matches the next symbol.
    recognize1(Node1, String) :-
        arc(Node1, Node2, Label),
        traverse1(Label, String, NewString),
        recognize1(Node2, NewString).

    % Consume one symbol equal to the arc label.
    traverse1(Label, [Label|Symbols], Symbols).
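To run the recogniser, it needs an automaton and a driver that starts from an initial state. The program below combines it with the arcs of Example 1. The definition of test1/1 is an assumption: the name appears in the trace, but its clause is not shown on the slides. The symbol ! is quoted only for portability.

```prolog
% Example 1 automaton.
initial(1).
final(4).
arc(1, 2, h).
arc(2, 3, a).
arc(3, 4, '!').
arc(3, 2, h).

% The recogniser from the slide.
recognize1(Node, []) :-
    final(Node).
recognize1(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    recognize1(Node2, NewString).

traverse1(Label, [Label|Symbols], Symbols).

% Assumed driver: recognise Symbols starting from an initial state.
test1(Symbols) :-
    initial(Node),
    recognize1(Node, Symbols).
```

The queries test1([h, a, '!']) and test1([h, a, h, a, '!']) succeed; test1([h, a]) fails because the input ends in state 3, which is not final.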

Trace

    Call: (7) test1([h, a, !])
    Call: (8) initial(_L181)
    Exit: (8) initial(1)
    Call: (8) recognize1(1, [h, a, !])
    Call: (9) arc(1, _L199, _L200)
    Exit: (9) arc(1, 2, h)
    Call: (9) traverse1(h, [h, a, !], _L201)
    Exit: (9) traverse1(h, [h, a, !], [a, !])
    Call: (9) recognize1(2, [a, !])
    Call: (10) recognize1(3, [!])
    Call: (11) recognize1(4, [])
    Call: (12) final(4)
    Exit: (11) recognize1(4, [])
    Exit: (10) recognize1(3, [!])
    Exit: (9) recognize1(2, [a, !])
    Exit: (8) recognize1(1, [h, a, !])
    Exit: (7) test1([h, a, !])

Generation
• test1(X)
• X = [h, a, !] ;
• X = [h, a, h, a, !] ;
• etc.
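Because the language of Example 1 is infinite, an unbounded query keeps producing answers on backtracking. One way to collect a finite sample is to fix the length of the string before calling the recogniser. The sketch below assumes the Example 1 arcs and the same assumed test1/1 driver as in the trace; strings_up_to/2 is a hypothetical helper.

```prolog
% Example 1 automaton and recogniser.
initial(1).
final(4).
arc(1, 2, h).
arc(2, 3, a).
arc(3, 4, '!').
arc(3, 2, h).

recognize1(Node, []) :-
    final(Node).
recognize1(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    recognize1(Node2, NewString).

traverse1(Label, [Label|Symbols], Symbols).

test1(Symbols) :-
    initial(Node),
    recognize1(Node, Symbols).

% strings_up_to(+Max, -Strings): all accepted strings of length 0..Max,
% shortest first. length/2 builds a list of N unbound symbols, which
% the recogniser then fills in.
strings_up_to(Max, Strings) :-
    findall(S,
            ( between(0, Max, N), length(S, N), test1(S) ),
            Strings).
```

The query strings_up_to(7, L) yields the three strings of length 3, 5 and 7: [h,a,!], [h,a,h,a,!] and [h,a,h,a,h,a,!].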

3 Related Frameworks
[Diagram: regular expressions describe regular languages/sets; finite state networks recognise regular languages/sets]

Regular Operations
• Operations
  – Concatenation
  – Union
  – Closure
• Over What
  – Language
  – Expressions
  – FS Automata

Concatenation over Reg. Expression and Language

Regular Expression         Language
E1 := [a|b]                L1 = {"a", "b"}
E2 := [c|d]                L2 = {"c", "d"}
E1 E2 = [a|b] [c|d]        L1 L2 = {"ac", "ad", "bc", "bd"}
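For finite languages, concatenation can be computed directly by pairing every string of the first language with every string of the second. A minimal sketch, representing each language as a list of atoms; concat_lang/3 is a hypothetical name.

```prolog
:- use_module(library(lists)).

% concat_lang(+L1, +L2, -L): L contains every string of L1 followed
% by every string of L2, in the order the pairs are enumerated.
concat_lang(L1, L2, L) :-
    findall(W,
            ( member(X, L1), member(Y, L2), atom_concat(X, Y, W) ),
            L).
```

The query concat_lang([a, b], [c, d], L) gives L = [ac, ad, bc, bd], matching the language column above.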

Concatenation over FS Automata
[Diagram: an automaton accepting a or b, concatenated with an automaton accepting c or d, equals a single automaton accepting a or b followed by c or d]
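One possible way to compute this operation over networks is to link each final state of the first automaton to the initial state of the second with a jump arc. The sketch below is an illustration under several assumptions: automata are represented as terms fsa(Initial, Finals, Arcs) instead of database facts, the two automata use disjoint state names, and the names concat_fsa/3, accepts/2, run/4 and step/3 are hypothetical. The slide's diagram may use a different construction.

```prolog
:- use_module(library(lists)).

% concat_fsa(+A1, +A2, -A): A accepts a string of A1 followed by a string of A2.
% Assumes A1 and A2 have no state names in common.
concat_fsa(fsa(I1, F1, Arcs1), fsa(I2, F2, Arcs2), fsa(I1, F2, Arcs)) :-
    findall(arc(F, I2, '#'), member(F, F1), Jumps),   % jump arcs link the two
    append(Arcs1, Jumps, Front),
    append(Front, Arcs2, Arcs).

% accepts(+FSA, +String): String is accepted by the automaton term.
accepts(fsa(Initial, Finals, Arcs), String) :-
    run(Arcs, Finals, Initial, String).

run(_, Finals, Node, []) :-
    memberchk(Node, Finals).
run(Arcs, Finals, Node, String) :-
    member(arc(Node, Next, Label), Arcs),
    step(Label, String, Rest),
    run(Arcs, Finals, Next, Rest).

step('#', String, String).                    % jump arc: consume nothing
step(Label, [Label|Rest], Rest) :-            % ordinary arc: consume one symbol
    Label \== '#'.
```

For example, concatenating fsa(1, [2], [arc(1,2,a), arc(1,2,b)]) with fsa(3, [4], [arc(3,4,c), arc(3,4,d)]) gives an automaton that accepts [a, c] and [b, d] but rejects [a] and [c, a].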

Issues
• Handling jump arcs
• Handling non-determinism
• Computing operations over networks
• Maintaining multiple states in DB
• Representation