Modeling Natural Language Syntax and Morphology in Ling.Bench IDE™ Parser, Examples from English and Persian

Supervisor: Prof. Dr. Frank Van Eynde
Natlanco Senior Supervisor: Filip De Brabander
Examiner: Prof. Dr. Geert Adriaens
12/11/2021
Peyman Nojoumian

Introduction
  • Background
  • Defined Project and Tasks
  • Developing an efficient language model
  • Parsing at two linguistic levels
      1. Morphology
      2. Syntax
  • Conclusion

What is Language Parsing?
  • Parsing, in its broader sense, is “the identification of parts of a sentence as subject, verb, object, etc. and of words in a sentence as noun (plural), verb (past tense), etc.” [5]
  • Assigning structures to language components.
  • An intermediate step in the semantic analysis of an input sentence.

Parsing Techniques
  • Originated from Chomsky's formal language theory.
  • Use computer algorithms and dynamic programming to analyze natural languages.
  • Based on different techniques: top-down and bottom-up parsing, finite-state automata (FSA), RTN, ATN, etc.

Formal Language Theory
  • Formal Languages
  • Formal Grammars
  • Automata (‘Recognizers’ or ‘Transducers’)
  • Parsers
Formal language: Natural language (infinite sentences) = finite elements (vocabularies)
Grammars (automata): Rule systems, e.g. S → NP VP

Types of Formal Grammars
  • Type-0: Recursively enumerable languages (no restrictions on the rules). Not interesting for natural languages because of its random and uncertain nature. No uniform way to assign structures to sentences.
      SE → AB
      AB → cde
  • Type-1: Context-sensitive grammar. Interesting for natural languages; decidable rules. Can assign structures to sentences.
      |A| → |B| / X_Y
      in-possible → impossible
      |n| → |m| / _ [labial consonants] (labialization rule in generative phonology)
  • Type-2: Context-free grammar. The left-hand side has only one node.
      S → NP VP
    Good for showing derivations. Decidable algorithms. A large number of well-defined recognition algorithms.
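The labialization rule |n| → |m| / _ [labial consonants] can be sketched in a few lines of Python. This is only an illustration of a context-dependent rewrite at a morpheme boundary; the function name `labialize` and the set of labial consonants used here are assumptions, not part of the original slides.

```python
# Illustrative set of labial consonants that trigger the rule (an assumption).
LABIALS = {"p", "b", "m"}

def labialize(prefix: str, stem: str) -> str:
    """Join prefix and stem, applying |n| -> |m| / _ [labial consonant]."""
    # The rewrite fires only in its context: a prefix-final 'n'
    # immediately followed by a stem-initial labial consonant.
    if prefix.endswith("n") and stem[:1] in LABIALS:
        prefix = prefix[:-1] + "m"
    return prefix + stem

print(labialize("in", "possible"))  # impossible
print(labialize("in", "tolerant"))  # intolerant
```

The second call shows that the rule leaves the prefix unchanged when the following consonant is not labial, which is what makes the rule context-sensitive.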

Chomsky’s hierarchy
  • Type-3: Finite-state grammar (regular grammars). The right-hand side contains one terminal node followed by one non-terminal node.
      A → aB
  • [Diagram labels: 0-enumerable, 1-context-sensitive, 2-context-free, 3-regular]
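A regular grammar with rules of the form A → aB can be run directly as a finite-state recognizer. The sketch below assumes a small hypothetical grammar (S → aA, A → bA, A → b) that accepts one "a" followed by one or more "b"; the rule table and the function name `accepts` are illustrative only.

```python
# Right-linear rules: non-terminal -> [(terminal, next non-terminal)].
# A next symbol of None encodes a terminating rule such as A -> b.
RULES = {
    "S": [("a", "A")],
    "A": [("b", "A"), ("b", None)],
}

def accepts(string: str, start: str = "S") -> bool:
    """Return True if the string can be derived from the start symbol."""
    states = {start}  # all non-terminals we could currently be expanding
    for ch in string:
        states = {
            nxt
            for st in states
            if st is not None  # a finished derivation cannot consume more input
            for term, nxt in RULES.get(st, [])
            if term == ch
        }
        if not states:
            return False
    # Accept only if some derivation ended with a terminating rule.
    return None in states

print(accepts("abbb"))  # True
print(accepts("a"))     # False
```

Tracking a set of states handles the non-determinism between A → bA and A → b without backtracking.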

RTN & ATN
  • RTN: Recursive Transition Network. It contains FSAs (finite-state automata) as directed transitions over terminal or non-terminal nodes.
  • “In an RTN, every time the machine comes to an arc labeled with a non-terminal, it treats that non-terminal as a subroutine. It places its current location onto a stack, jumps to the non-terminal and then jumps back when that non-terminal has been parsed.” [5]
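The subroutine behavior described in the quotation can be sketched as a small recognizer. This is a minimal illustration, not the Ling.Bench parser: the toy lexicon, the three networks (S, NP, VP) and the function names are all assumptions. Python's own call stack plays the role of the stack that remembers where to resume.

```python
# Toy lexicon mapping words to categories (hypothetical).
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "sees": "V", "sleeps": "V"}

# Each network: (arcs, final states); arcs map state -> [(label, next state)].
# A label is either a word category or the name of another network.
NETWORKS = {
    "S":  ({0: [("NP", 1)], 1: [("VP", 2)]}, {2}),
    "NP": ({0: [("Det", 1), ("N", 2)], 1: [("N", 2)]}, {2}),
    "VP": ({0: [("V", 1)], 1: [("NP", 2)]}, {1, 2}),
}

def traverse(net, words, pos, state=0):
    """Yield every input position at which `net` can finish, starting at `pos`."""
    arcs, finals = NETWORKS[net]
    if state in finals:
        yield pos
    for label, nxt in arcs.get(state, []):
        if label in NETWORKS:
            # Non-terminal arc: run the sub-network as a subroutine,
            # then resume this network in state `nxt`.
            for end in traverse(label, words, pos):
                yield from traverse(net, words, end, nxt)
        elif pos < len(words) and LEXICON.get(words[pos]) == label:
            # Terminal arc: consume one word of the matching category.
            yield from traverse(net, words, pos + 1, nxt)

def recognize(sentence: str) -> bool:
    words = sentence.split()
    return any(end == len(words) for end in traverse("S", words, 0))

print(recognize("the dog sees the cat"))  # True
print(recognize("the sees"))              # False
```

The networks here contain no left recursion, so the traversal always terminates; a production parser would need extra care for that case.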

Simple Demonstration
  [Figure]

Augmented Transition Network
  • An extension of the RTN.
  • Implements conditions and actions within the transitions (for unification).
  • Uses features.
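Conditions and actions on transitions can be illustrated with a single-network sketch that enforces number agreement. Everything here is an assumption made for illustration (the lexicon, the `num` feature, the Det-N-V network, and the function names); it is not how Ling.Bench represents features.

```python
# Toy lexicon: word -> (category, feature dict). All entries are hypothetical.
LEXICON = {
    "this": ("Det", {"num": "sg"}), "these": ("Det", {"num": "pl"}),
    "the": ("Det", {}),
    "dog": ("N", {"num": "sg"}), "dogs": ("N", {"num": "pl"}),
    "barks": ("V", {"num": "sg"}), "bark": ("V", {"num": "pl"}),
}

# One network, Det -> N -> V, with a single final state.
ARCS = {0: [("Det", 1)], 1: [("N", 2)], 2: [("V", 3)]}
FINAL = {3}

def unify(a: dict, b: dict):
    """Merge two feature dicts; return None if a shared feature clashes."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged

def parse(sentence: str):
    """Return the accumulated features, or None if the sentence is rejected."""
    state, registers = 0, {}
    for word in sentence.split():
        if word not in LEXICON:
            return None
        cat, feats = LEXICON[word]
        for label, nxt in ARCS.get(state, []):
            if label == cat:
                merged = unify(registers, feats)  # condition on the arc
                if merged is None:
                    return None                   # agreement clash
                registers, state = merged, nxt    # action: update registers
                break
        else:
            return None  # no arc accepts this category
    return registers if state in FINAL else None

print(parse("these dogs bark"))  # {'num': 'pl'}
print(parse("this dogs bark"))   # None
```

The registers act as the memory that a plain RTN lacks: the condition on each arc consults them, and the action updates them.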

BTN, RTN, ATN
  • BTN (Basic Transition Network) [finite-state grammar]
  • via RTN (Recursive Transition Network) [context-free grammar]
  • to ATN (Augmented Transition Network) [unrestricted rewrite system]

Ling.Bench IDE™
  • “Integrated Development Environment”: a program user interface that allows the user to create, modify, model, parse, and manipulate a set of linguistic data.
  • The core parser is mainly a combination of different parsing techniques.
  • It employs RTN, embeds FSAs as phrases, uses probabilities to disambiguate, and uses conditional features to allow only well-formed transitions, in compatibility with ATN.
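How probabilities can disambiguate between competing analyses may be sketched as follows. This is not the Ling.Bench scoring method, which the slides do not describe; the lexicon, the probability values, the set of allowed tag patterns and the function name are all invented for illustration.

```python
from itertools import product
from math import prod

# Hypothetical relative probabilities of each word class per word.
LEXICON = {
    "time": {"N": 0.8, "V": 0.2},
    "flies": {"N": 0.4, "V": 0.6},
}

# Hypothetical set of well-formed tag sequences standing in for the grammar.
PATTERNS = {("N", "V"), ("V", "N")}

def best_analysis(words):
    """Return the well-formed tag sequence with the highest probability product."""
    candidates = []
    for tags in product(*(LEXICON[w] for w in words)):
        if tags in PATTERNS:  # keep only analyses the grammar allows
            score = prod(LEXICON[w][t] for w, t in zip(words, tags))
            candidates.append((score, tags))
    return max(candidates)[1] if candidates else None

print(best_analysis(["time", "flies"]))  # ('N', 'V')
```

Both (N, V) and (V, N) are structurally possible here; the lexical probabilities (0.48 against 0.08) decide between them.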

Ling.Bench IDE™
  • Uses an enhanced 3-dimensional graphic environment for easy, fast and efficient language modeling.
  • Very easy to use.
  • Fast and reliable.
  • Very efficient parsing.
  • Demonstration

Hardware and Software Requirements
  • A Pentium with at least 1 GHz CPU speed and 256 MB of RAM, equipped with an advanced graphics accelerator.
  • Works on Windows platforms.
  • A light version is available to download from the Internet: http://www.natlantech.com/lingbench_ide_light.html

Defined Project and Tasks
  • The project was defined to be done in three main phases.
  • Developing language models is easier when it is done in cycles.
  1. Getting acquainted with Ling.Bench IDE™; spotting available corpora (lexicon and sentences); starting the modeling with 20 simple sentences and a limited lexicon; reporting the application’s bugs.

Defined Project and Tasks
  2. Developing a lexicon of 1000 words and a sentence corpus of 150 sentences as the training set; developing further rules and testing them; reporting bugs.
  3. Creating a lexicon of 10000-30000 words and 250 sentences; using the corpus to design rules and testing them; reporting bugs (5 new versions).

Syntactic model of Persian Sentences
  • Modeling started with 7 phrases using 10 word classes.
  • In the end, 16 rules (phrases) were created and a total of 28 word classes (POSs) were implemented, based on the training set.

Sentence Model
  [Figure]

Sample of a Parsed Sentence
  [Figure]

Morphological Model of Words
  • Using 2000 word stems of simple verbs and a number of related verb affixes, a large number of our examples of Persian verb forms were recognized by the morphological grammar module.
  • 11 phrases and 36 morphological classes were designed and modeled in the morphological grammar.
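Recognizing a verb form from a stem list plus affixes can be sketched as below. This is a heavily simplified illustration in Latin transliteration, not the morphological grammar built in Ling.Bench: it assumes two verbs (raftan "go", xordan "eat"), only the prefix mi-, and the regular personal endings; the table and function names are hypothetical.

```python
# Tiny stem lists in transliteration (x = the sound written kh).
PRESENT_STEMS = {"rav": "go", "xor": "eat"}
PAST_STEMS = {"raft": "go", "xord": "eat"}

# Personal endings; the past third person singular has a zero ending.
PRESENT_ENDINGS = {"am": "1sg", "i": "2sg", "ad": "3sg",
                   "im": "1pl", "id": "2pl", "and": "3pl"}
PAST_ENDINGS = {"am": "1sg", "i": "2sg", "": "3sg",
                "im": "1pl", "id": "2pl", "and": "3pl"}

def analyze(form: str):
    """Return all (prefix, stem, gloss, tense, person) analyses of a verb form."""
    # Strip the optional durative prefix mi- (simplification: always stripped).
    prefix, rest = ("mi", form[2:]) if form.startswith("mi") else ("", form)
    analyses = []
    for stems, endings, tense in (
        (PRESENT_STEMS, PRESENT_ENDINGS, "present"),
        (PAST_STEMS, PAST_ENDINGS, "past"),
    ):
        for stem, gloss in stems.items():
            if rest.startswith(stem):
                ending = rest[len(stem):]
                if ending in endings:
                    analyses.append((prefix, stem, gloss, tense, endings[ending]))
    return analyses

print(analyze("miravam"))  # [('mi', 'rav', 'go', 'present', '1sg')]
print(analyze("xordand"))  # [('', 'xord', 'eat', 'past', '3pl')]
```

An unrecognized combination such as "raftad" returns an empty list, because the ending -ad is listed only for present stems.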

An example of the Morphological Analyzer
  [Figure]

Further Functionalities of the Ling.Bench IDE™
  • Lexicon development. Three lexicons: 1) syntactic 2) morphological 3) multi-word
  • Defining features and efficient probabilities.
  • Defining variables in nodes and transitions to add more constraints.
  • Unicode compatibility.

Conclusion
  • It was expected that the system would be able to parse correctly at least 50% of the 200 sentences at the end of the 3rd phase.
  • 100% of the sentences of the corpus are parsable.
  • Estimated to be able to parse 80% of the words and more than 50% of the sentences of a new test corpus.

Conclusion
  • Efficient, fast and reliable parsing was made possible by an enriched lexicon featuring subcategories, word frequencies and relative word probabilities, together with a simple model that combines features of the RTN and ATN while using probabilities as well as feature structures.
  • Using the model for the parser module of my Ph.D. thesis: “Designing a diacritizer for Persian”

References
1. Assi, S. Mostafa & M. Haji Abdolhosseini (2000). “Grammatical Tagging of a Persian Corpus”, International Journal of Corpus Linguistics, Vol. 5(1), pp. 69-81.
2. Bateni, M. (1998). “Towsife sakhtemane dasturiye zabane Farsi” (Persian Syntax). Amir Kabir Publishers: Tehran.
3. De Brabander, Filip (2003). Ling.Bench IDE™ Users Manual.
4. Garside, R. et al., eds. (1997). Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman.
5. Jurafsky, Daniel & J. H. Martin (2000). An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, New Jersey.
6. Kennedy, Graeme (1998). An Introduction to Corpus Linguistics. Longman.
7. Lambton, Ann K. (1996). “Persian Grammar”. London: Cambridge University Press.
8. Lazard, G. (1992). “A Grammar of Contemporary Persian”. California & New York: Mazda Publishers.
9. Longman Dictionary of Language Teaching & Applied Linguistics (1992). London.
10. Manning, Christopher D. & H. Schutze (2001). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge.
11. Nojoumian, Peyman (1999). Design and Implementation of a Computer Assisted Language Learning (CALL) System for Persian. MA Dissertation, Alame Tabatabaei University, Tehran.
12. Sag, Ivan A. & Thomas Wasow (1999). Syntactic Theory: A Formal Introduction. CSLI, Stanford, USA.
13. Van Eynde, Frank & D. Gibbon, eds. (2000). Lexicon Development for Speech and Language Processing. Kluwer Academic Publishers, London.
14. Van Valin, Robert D. & Randy J. LaPolla (1997). Syntax: Structure, Meaning and Function. Cambridge University Press.