CSA 3050 Natural Language Algorithms Words Strings and
- Slides: 26
CSA 3050: Natural Language Algorithms
Words, Strings and Regular Expressions
Finite State Automata
October 2004 CSA 3050 NL Algorithms

This lecture
• Outline
  – Words
  – The language of words
  – FSAs in Prolog
• Acknowledgement
  – Jurafsky and Martin, Speech and Language Processing, Prentice Hall 2000
  – Blackburn and Striegnitz: NLP Techniques in Prolog: http://www.coli.uni-sb.de/~kris/nlp-with-prolog/html/

What is a Word?
• A series of speech sounds that symbolizes meaning without being divisible into smaller units
• Any segment of written or printed discourse ordinarily appearing between spaces or between a space and a punctuation mark
• A set of linguistic forms produced by combining a single base with various inflectional elements without change in the part-of-speech elements
• A number of bytes processed as a unit

Information Associated with Words
• Spelling
  – orthographic
  – phonological
• Syntax
  – POS
  – Valency
• Semantics
  – Meaning
  – Relationship to other words

Properties of Words
• Sequence
  – characters
  – phonemes
  – example: pollution
• Delimitation
  – whitespace
  – other?
• Structure
  – simple ("atomic") words
  – complex ("molecular") words

Complex Words
• enlargement
  – en + large + ment
  – (en + large) + ment
  – en + (large + ment)
• affixation
  – prefix
  – suffix
  – infix

Sets Underlie the Formation of Complex Words
  prefixes: dis, re, un, en
  + roots: large, charge, infect, code, decide
  + suffixes: ed, ing, ee, er, ly

Structure of Complex Words
• Complex words are made by concatenating elements chosen from
  – a set of prefixes
  – a set of roots
  – a set of suffixes
• The set of valid words for a given human language (e.g. English, Maltese) can be regarded as a formal language.

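The concatenation idea can be sketched in Prolog using the three sets shown on the earlier slide. The predicate names prefix/1, root/1, suffix/1 and candidate/1 are assumptions made for this illustration and do not come from the lecture.

```prolog
% Hypothetical encoding of the three sets from the slide.
prefix(dis). prefix(re). prefix(un). prefix(en).
root(large). root(charge). root(infect). root(code). root(decide).
suffix(ed). suffix(ing). suffix(ee). suffix(er). suffix(ly).

% candidate(?Word): Word is one prefix, one root and one suffix,
% joined by plain string concatenation.
candidate(Word) :-
    prefix(P),
    root(R),
    suffix(S),
    atom_concat(P, R, PR),
    atom_concat(PR, S, Word).
```

The query `?- candidate(W).` enumerates every prefix + root + suffix combination. Plain concatenation overgenerates: it ignores spelling changes at morpheme boundaries and says nothing about which combinations are real words, so the result is only a superset of candidate forms.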
The Language of Words
• What kind of formal language is the language of words?
• One which can be constructed out of
  – A characteristic set of basic symbols (alphabet)
  – A characteristic set of combining operations
    • Union (disjunction)
    • Concatenation
    • Closure (iteration)
• Regular Language; Regular Sets

Characterising Classes of Set
[Diagram with three labels: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]

Regular Expressions
• Notation for describing regular sets
• Used extensively in the Unix operating system (grep, sed, etc.) and also in some Microsoft products (Word)
• Xerox Finite State tools use a somewhat different notation, but similar function.

Regular Expressions
  a      a simple symbol
  AB     concatenation
  A|B    alternation operator
  A&B    intersection operator
  A*     Kleene star

Characterising Classes of Set
[Diagram with three labels: CLASS OF SETS or LANGUAGES, NOTATION, MACHINE]

Finite Automaton
A finite automaton comprises
• A finite set of states Q
• An alphabet of symbols I
• A start state q0 ∈ Q
• A set of final states F ⊆ Q
• A transition function δ(q, i) which maps a state q ∈ Q and a symbol i ∈ I to a new state q' ∈ Q

Encoding FSAs in Prolog
• Three predicates
  – initial/1: initial(s), s is an initial state
  – final/1: final(f), f is a final state
  – arc/3: arc(s, t, c), there is an arc from s to t labelled c

Example 1: FSA
  initial(1).
  final(4).
  arc(1, 2, h).
  arc(2, 3, a).
  arc(3, 4, !).
  arc(3, 2, h).
[Diagram: 1 -h-> 2 -a-> 3 -!-> 4 (final), with an arc 3 -h-> 2]

Example 2: FSA with jump arc
  initial(1).
  final(4).
  arc(1, 2, h).
  arc(2, 3, a).
  arc(3, 4, !).
  arc(3, 1, #).
[Diagram: 1 -h-> 2 -a-> 3 -!-> 4 (final), with a jump arc 3 -#-> 1]

Example 3: NDA
  initial(1).
  final(4).
  arc(1, 2, h).
  arc(2, 3, a).
  arc(3, 4, !).
  arc(2, 1, a).
[Diagram: 1 -h-> 2 -a-> 3 -!-> 4 (final), with an arc 2 -a-> 1]

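What makes Example 3 non-deterministic is that state 2 has two outgoing arcs with the same label. A small check for this property can be sketched as follows; nondeterministic/2 is a hypothetical helper, not part of the lecture code.

```prolog
% Arcs of Example 3 ('!' is quoted here only for portability).
arc(1, 2, h).
arc(2, 3, a).
arc(3, 4, '!').
arc(2, 1, a).

% nondeterministic(?State, ?Label): State has arcs to two different
% targets that carry the same Label. T1 @< T2 reports each pair once.
nondeterministic(State, Label) :-
    arc(State, T1, Label),
    arc(State, T2, Label),
    T1 @< T2.
```

The query `?- nondeterministic(S, L).` gives `S = 2, L = a`. On Examples 1 and 2 the same query fails, since no state there repeats a label.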
A Recogniser
  recognize1(Node, []) :-
      final(Node).
  recognize1(Node1, String) :-
      arc(Node1, Node2, Label),
      traverse1(Label, String, NewString),
      recognize1(Node2, NewString).

  traverse1(Label, [Label|Symbols], Symbols).

Trace
  Call: (7) test1([h, a, !])
  Call: (8) initial(_L181)
  Exit: (8) initial(1)
  Call: (8) recognize1(1, [h, a, !])
  Call: (9) arc(1, _L199, _L200)
  Exit: (9) arc(1, 2, h)
  Call: (9) traverse1(h, [h, a, !], _L201)
  Exit: (9) traverse1(h, [h, a, !], [a, !])
  Call: (9) recognize1(2, [a, !])
  Call: (10) recognize1(3, [!])
  Call: (11) recognize1(4, [])
  Call: (12) final(4)
  Exit: (11) recognize1(4, [])
  Exit: (10) recognize1(3, [!])
  Exit: (9) recognize1(2, [a, !])
  Exit: (8) recognize1(1, [h, a, !])
  Exit: (7) test1([h, a, !])

Generation
• test1(X)
• X = [h, a, !] ;
• X = [h, a, h, a, !] ;
• etc.

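The slides use test1/1 without showing its definition. The sketch below is a guess based on the trace, which shows test1 calling initial/1 and then recognize1/2. Combined with Example 1 and the recogniser, it gives a complete program that both recognises and generates.

```prolog
% Example 1 automaton ('!' is quoted here only for portability).
initial(1).
final(4).
arc(1, 2, h).
arc(2, 3, a).
arc(3, 4, '!').
arc(3, 2, h).

% Recogniser as on the slide.
recognize1(Node, []) :-
    final(Node).
recognize1(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse1(Label, String, NewString),
    recognize1(Node2, NewString).

% Consume one symbol matching the arc label.
traverse1(Label, [Label|Symbols], Symbols).

% Assumed driver: start from an initial state and recognise String.
test1(String) :-
    initial(Node),
    recognize1(Node, String).
```

With a ground list, `?- test1([h, a, '!']).` checks membership. With a variable, `?- test1(X).` enumerates strings on backtracking. The order of the arc/3 facts matters for generation: because arc(3, 4, '!') is listed before arc(3, 2, h), the shortest string is produced first.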
3 Related Frameworks
[Diagram relating REGULAR LANGS/SETS, REGULAR EXPRESSIONS and FINITE STATE NETWORKS, with arrows labelled "describe" and "recognise"]

Regular Operations
• Operations
  – Concatenation
  – Union
  – Closure
• Over What
  – Language
  – Expressions
  – FS Automata

Concatenation over Reg. Expression and Language
  Regular Expression         Language
  E1 := [a|b]                L1 = {"a", "b"}
  E2 := [c|d]                L2 = {"c", "d"}
  E1 E2 = [a|b] [c|d]        L1 L2 = {"ac", "ad", "bc", "bd"}

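For finite languages, the language column can be reproduced directly in Prolog. This is a minimal sketch that represents a language as a list of atoms; concat_langs/3 is a hypothetical name chosen for the illustration.

```prolog
:- use_module(library(lists)).

% concat_langs(+L1, +L2, -L): L holds every string formed by one
% string from L1 followed by one string from L2.
concat_langs(L1, L2, L) :-
    findall(W,
            ( member(X, L1),
              member(Y, L2),
              atom_concat(X, Y, W) ),
            L).
```

The query `?- concat_langs([a, b], [c, d], L).` gives `L = [ac, ad, bc, bd]`, matching L1 L2 on the slide. The list representation only covers finite languages; infinite regular languages need expressions or automata.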
Concatenation over FS Automata
[Diagram: an automaton with arcs a and b, concatenated with an automaton with arcs c and d, equals an automaton with arcs a and b followed by arcs c and d]

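The effect of concatenating two automata can be approximated at recognition time without building the combined network. The sketch below is that simpler substitute, not the network construction the slide depicts. The machine names m1 and m2 and all predicate names are assumptions.

```prolog
:- use_module(library(lists)).

% Two hypothetical machines: m1 accepts a or b, m2 accepts c or d.
initial(m1, 1).  final(m1, 2).  arc(m1, 1, 2, a).  arc(m1, 1, 2, b).
initial(m2, 1).  final(m2, 2).  arc(m2, 1, 2, c).  arc(m2, 1, 2, d).

% rec(+Machine, +Node, +String): String leads from Node to a final state.
rec(M, Node, []) :-
    final(M, Node).
rec(M, Node1, [Label|Rest]) :-
    arc(M, Node1, Node2, Label),
    rec(M, Node2, Rest).

% accepts(+Machine, +String): String is in the machine's language.
accepts(M, String) :-
    initial(M, Node),
    rec(M, Node, String).

% accepts_concat(+M1, +M2, +String): some split of String has a front
% part accepted by M1 and a back part accepted by M2.
accepts_concat(M1, M2, String) :-
    append(Front, Back, String),
    accepts(M1, Front),
    accepts(M2, Back).
```

For example, `?- accepts_concat(m1, m2, [a, c]).` succeeds, while `?- accepts_concat(m1, m2, [c, a]).` fails. Trying every split is simple but repeats work; a constructed network would instead link the final states of the first machine to the initial state of the second.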
Issues
• Handling jump arcs
• Handling non-determinism
• Computing operations over networks
• Maintaining multiple states in DB
• Representation

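The first issue, jump arcs, can be handled by letting the traverse step succeed on a '#' arc without consuming any input. This is a sketch under that assumption, using Example 2. The names recognize/2, traverse/3 and accepts/1 are chosen here to keep it separate from recognize1/2.

```prolog
% Example 2 automaton; '#' marks the jump arc.
initial(1).
final(4).
arc(1, 2, h).
arc(2, 3, a).
arc(3, 4, '!').
arc(3, 1, '#').

recognize(Node, []) :-
    final(Node).
recognize(Node1, String) :-
    arc(Node1, Node2, Label),
    traverse(Label, String, NewString),
    recognize(Node2, NewString).

% A jump arc leaves the input untouched.
traverse('#', String, String).
% Any other arc consumes one matching symbol.
traverse(Label, [Label|Symbols], Symbols) :-
    Label \== '#'.

% Assumed driver, analogous to test1/1.
accepts(String) :-
    initial(Node),
    recognize(Node, String).
```

Both `?- accepts([h, a, '!']).` and `?- accepts([h, a, h, a, '!']).` succeed. This approach assumes '#' never occurs as an input symbol, and a network containing a cycle made only of jump arcs would make the recogniser loop.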