CSA 2050 Natural Language Processing Tagging 1 Tagging
- Slides: 37
CSA 2050: Natural Language Processing
Tagging 1
• Tagging
• POS and Tagsets
• Ambiguities
• NLTK
February 2007 CSA 3050: Tagging I

Tagging 1 Lecture
• Slides based on Mike Rosner and Marti Hearst notes
• Diane Litman’s version of Steven Bird’s notes
• Additions from NLTK tutorials

Tagging
Mr. Sherlock Holmes, who was usually very X, …
What is the part of speech of X?

Tagging
Mr. Sherlock Holmes, who was usually very late/ADJ in the mornings, save upon those not infrequent occasions when he was up all night, was Y
What is the part of speech of Y?

Tagging
Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated/VBN at the breakfast table

Tagging Terminology
• Tagging – The process of associating labels with each token in a text
• Tags – The labels
• Tag Set – The collection of tags used for a particular task

Tagging Example
Typically a tagged text is a sequence of whitespace-separated base/tag tokens:
The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./.
Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.

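The base/tag format can be read with a few lines of plain Python. The following is a minimal sketch, not the course's own code: the function name parse_tagged_text and the FRAGMENT constant (the opening of the second example sentence) are illustrative assumptions. Current NLTK releases provide a comparable helper, nltk.tag.str2tuple, for single tokens.

```python
def parse_tagged_text(text):
    """Split whitespace-separated base/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash, so that punctuation tokens such as ",/,"
        # and words that themselves contain a slash are handled correctly.
        word, slash, tag = token.rpartition("/")
        if not slash:
            raise ValueError(f"token without a tag: {token!r}")
        pairs.append((word, tag))
    return pairs


# Opening of the second example sentence, used here as hypothetical input.
FRAGMENT = "Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn"

for word, tag in parse_tagged_text(FRAGMENT):
    print(f"{word:10s} {tag}")
```

Splitting on the last slash rather than the first is a design choice: the tag never contains a slash in this format, while the word occasionally does.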
What does tagging do?
1. Collapses Some Distinctions
   • Lexical identity may be discarded
   • e.g. all personal pronouns tagged with PRP
2. … But Introduces Others
   • Ambiguities may be removed
   • e.g. deal tagged with NN or VB
   • e.g. deal tagged with DEAL1 or DEAL2
3. Helps classification and prediction

Parts of Speech (POS)
• A word’s POS tells us a lot about the word and its neighbors:
  – Limits the range of meanings (deal), pronunciation (noun OBject vs. verb obJECT) or both (wind)
  – Helps in stemming
  – Limits the range of following words for Speech Recognition
  – Can help select nouns from a document for IR
  – Basis for partial parsing (chunked parsing)
  – Parsers can build trees directly on the POS tags instead of maintaining a lexicon

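As an illustration of selecting nouns for IR, the sketch below filters a tagged sentence by tag prefix. It assumes Penn Treebank tags, where all noun tags (NN, NNS, NNP, NNPS) begin with "NN"; the function name select_nouns is hypothetical.

```python
def select_nouns(tagged_tokens):
    """Return the words whose Penn Treebank tag marks them as nouns."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]


# A Penn Treebank style tagged sentence as (word, tag) pairs.
TAGGED = [
    ("The", "DT"), ("grand", "JJ"), ("jury", "NN"), ("commented", "VBD"),
    ("on", "IN"), ("a", "DT"), ("number", "NN"), ("of", "IN"),
    ("other", "JJ"), ("topics", "NNS"), (".", "."),
]

print(select_nouns(TAGGED))  # ['jury', 'number', 'topics']
```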
POS and Tagsets
• The choice of tagset greatly affects the difficulty of the problem
• Need to strike a balance between
  – Getting better information about context (best: introduce more distinctions)
  – Making it possible for classifiers to do their job (need to minimize distinctions)

Common Tagsets
• Brown corpus: 87 tags
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the British National Corpus, BNC): 61 tags
• Lancaster C7: 145 tags

Brown Corpus
• The first digital corpus (1961)
  – Francis and Kucera, Brown University
• Contents: 500 texts, each 2000 words long
  – From American books, newspapers, magazines
  – Representing genres:
    • Science fiction, romance fiction, press reportage, scientific writing, popular lore

Penn Treebank
• First syntactically annotated corpus
• 1 million words from Wall Street Journal
• Part of speech tags and syntax trees

Penn Treebank
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

VB   DT   NN     .
Book that flight .

VBZ  DT   NN     VB    NN     ?
Does that flight serve dinner ?

Penn Treebank – Important Tags

Penn Treebank – Verb Tags

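For reference, the six verb tags of the Penn Treebank tagset fit in a small lookup table. This is a minimal sketch: the example word forms (take, took, ...) and the helper is_verb_tag are illustrative additions.

```python
# The six Penn Treebank verb tags, each with a short gloss and an example form.
PENN_VERB_TAGS = {
    "VB": "base form (take)",
    "VBD": "past tense (took)",
    "VBG": "gerund or present participle (taking)",
    "VBN": "past participle (taken)",
    "VBP": "non-3rd person singular present (take)",
    "VBZ": "3rd person singular present (takes)",
}


def is_verb_tag(tag):
    """True if the tag is one of the Penn Treebank verb tags."""
    return tag in PENN_VERB_TAGS


for tag, gloss in PENN_VERB_TAGS.items():
    print(f"{tag:4s} {gloss}")
```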
Penn Treebank Example
(S (NP-SBJ-1 (DT The) (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take)
                  (PRT (RP up))
                  (NP (DT the) (NN measure))
                  (ADV-TMP (RB quickly))))))
   (. .))

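The part-of-speech tags sit at the innermost brackets of such a tree, so they can be recovered with a regular expression. The sketch below works on a copy of the example tree; the names TREE, LEAF and tagged_leaves are hypothetical, and the pattern assumes that every leaf has the form (TAG word) with no spaces inside the word.

```python
import re

TREE = """(S (NP-SBJ-1 (DT The) (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take)
                  (PRT (RP up))
                  (NP (DT the) (NN measure))
                  (ADV-TMP (RB quickly))))))
   (. .))"""

# A leaf is "(TAG word)": two bracket-free, space-free items in one pair of brackets.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\s*\)")


def tagged_leaves(tree, keep_traces=False):
    """Return (word, tag) pairs for the leaves of a bracketed tree."""
    pairs = [(word, tag) for tag, word in LEAF.findall(tree)]
    if not keep_traces:
        # -NONE- marks empty elements such as the trace *-1, not real words.
        pairs = [(word, tag) for word, tag in pairs if tag != "-NONE-"]
    return pairs


print(" ".join(f"{word}/{tag}" for word, tag in tagged_leaves(TREE)))
```

Dropping the -NONE- leaves yields the flat word/tag sequence that a tagger would be expected to produce for this sentence.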
Tagging
• Typically the set of tags is larger than basic parts of speech
• Tags often contain some morphological information
• Often referred to as “morphosyntactic labels”

Tagging Ambiguities
N       N-V     V-IN   DT  N
FRUIT   FLIES   LIKE   A   BANANA

Interpretation 1
(S (NP (N FRUIT) (N FLIES))
   (VP (V LIKE)
       (NP (DT A) (N BANANA))))

Interpretation 2
(S (NP (N FRUIT))
   (VP (V FLIES)
       (PP (IN LIKE)
           (NP (DT A) (N BANANA)))))

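The two interpretations correspond to two of the tag sequences that the ambiguous words allow. A minimal sketch that enumerates all candidates, assuming the small hand-written LEXICON below (N or V for "flies", V or IN for "like"); the function name candidate_taggings is hypothetical.

```python
from itertools import product

# Possible tags per word, as in the "fruit flies like a banana" example.
LEXICON = {
    "fruit": ["N"],
    "flies": ["N", "V"],
    "like": ["V", "IN"],
    "a": ["DT"],
    "banana": ["N"],
}


def candidate_taggings(words, lexicon):
    """Return every tag sequence that the lexicon permits for the words."""
    options = [lexicon[word.lower()] for word in words]
    return [tuple(tags) for tags in product(*options)]


sentence = "Fruit flies like a banana".split()
for tags in candidate_taggings(sentence, LEXICON):
    print(" ".join(f"{w}/{t}" for w, t in zip(sentence, tags)))
```

The sketch prints four sequences (1 x 2 x 2 x 1 x 1). Interpretation 1 uses N N V DT N and Interpretation 2 uses N V IN DT N; a tagger has to choose among such candidates.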
Lots of ambiguities…
1. He can a can.
2. I can light a fire and you can open a can of beans. Now the can is open, and we can eat in the light of the fire.

Lots of ambiguities…
• In the Brown Corpus
  – 11.5% of word types are ambiguous
  – 40% of word tokens are ambiguous
• Most words in English are unambiguous.
• Many of the most common words are ambiguous.
• Typically ambiguous tags are not equally probable.

Lots of ambiguities…
Brown Corpus (Table: DeRose, 1988)
Unambiguous (1 tag):    35,340 types
Ambiguous (2-7 tags):    4,100 types
  2 tags   3,760
  3 tags     264
  4 tags      61
  5 tags      12
  6 tags       2
  7 tags       1

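The difference between type ambiguity and token ambiguity can be made concrete on a toy example. The sketch below is illustrative only: the six-token corpus TOY is hand-tagged for this purpose (it is not Brown Corpus data), words are lowercased before counting as a simplifying assumption, and a type counts as ambiguous when it occurs with more than one tag in the corpus.

```python
from collections import defaultdict


def ambiguity_rates(tagged_tokens):
    """Return (share of ambiguous types, share of ambiguous tokens)."""
    tags_by_type = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_type[word.lower()].add(tag)

    # A word type is ambiguous if it was seen with two or more tags.
    ambiguous = {w for w, tags in tags_by_type.items() if len(tags) > 1}

    type_rate = len(ambiguous) / len(tags_by_type)
    token_hits = sum(1 for word, _ in tagged_tokens if word.lower() in ambiguous)
    token_rate = token_hits / len(tagged_tokens)
    return type_rate, token_rate


# Hand-tagged toy corpus (an assumption for illustration).
TOY = [("I", "PRP"), ("can", "MD"), ("open", "VB"),
       ("a", "DT"), ("can", "NN"), (".", ".")]

type_rate, token_rate = ambiguity_rates(TOY)
print(f"ambiguous types:  {type_rate:.0%}")   # 1 of 5 types
print(f"ambiguous tokens: {token_rate:.0%}")  # 2 of 6 tokens
```

Even here the token rate exceeds the type rate, because the single ambiguous type ("can") occurs more than once. The same effect, on a much larger scale, explains how a minority of ambiguous types can account for 40% of tokens.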
Approaches to Tagging
1. Rule-Based Tagger: ENGTWOL Tagger (Voutilainen 1995)
2. Stochastic Tagger: HMM-based Tagger
3. Transformation-Based Tagger: Brill Tagger (Brill 1995)

NLTK
• Natural Language Toolkit (NLTK)
• http://nltk.sourceforge.net/
• Please download and install!
• Runs on Python

NLTK Introduction
• The Natural Language Toolkit (NLTK) provides:
  – Basic classes for representing data relevant to natural language processing.
  – Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
  – Standard implementations of each task, which can be combined to solve complex problems.
• Two versions: NLTK and NLTK-Lite

NLTK Modules
• nltk.token: processing individual elements of text, such as words or sentences.
• nltk.probability: modeling frequency distributions and probabilistic systems.
• nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags.
• nltk.parser: high-level interface for parsing texts.
• nltk.chartparser: a chart-based implementation of the parser interface.
• nltk.chunkparser: a regular-expression based surface parser.

Python for NLP
• Python is a great language for NLP:
  – Simple
  – Easy to debug:
    • Exceptions
    • Interpreted language
  – Easy to structure
    • Modules
    • Object oriented programming
  – Powerful string manipulation

Python Modules and Packages
• Python modules “package program code and data for reuse.” (Lutz)
  – Similar to library in C, package in Java.
• Python packages are hierarchical modules (i.e., modules that contain other modules).
• Three commands for accessing modules:
  1. import
  2. from…import
  3. reload

Import Command
• The import command loads a module:
  # Load the regular expression module
  >>> import re
• To access the contents of a module, use dotted names:
  # Use the search function from the re module (text is any string)
  >>> re.search(r'\w+', text)
• To list the contents of a module, use dir:
  >>> dir(re)
  ['DOTALL', 'IGNORECASE', …]

from...import
• The from…import command loads individual functions and objects from a module:
  # Load the search function from the re module
  >>> from re import search
• Once an individual function or object is loaded with from…import, it can be used directly:
  # Use the search function from the re module (text is any string)
  >>> search(r'\w+', text)

Import vs. from...import
import
• Keeps module functions separate from user functions.
• Requires the use of dotted names.
• Works with reload.
from…import
• Puts module functions and user functions together.
• More convenient names.
• Does not work with reload.

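The two import styles can be compared side by side in a short script. A minimal sketch; the variable names text, dotted and direct are illustrative.

```python
import re               # plain import: names stay behind the "re." prefix
from re import search   # from...import: "search" is bound in this namespace

text = "Book that flight."

dotted = re.search(r"\w+", text)   # dotted name
direct = search(r"\w+", text)      # direct name

print(dotted.group())        # Book
print(direct.group())        # Book
print(search is re.search)   # True: both names refer to the same function
```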
Reload
• If you edit a module, you must use the reload command before the changes become visible in Python:
  >>> import mymodule
  ...
  >>> reload(mymodule)
• The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from...import.

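The behaviour described above can be observed with a throwaway module. The sketch below targets Python 3, where reload is no longer a builtin and is called as importlib.reload; the module name mymodule_demo, the GREETING variable and the function demo_reload are hypothetical and exist only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_reload():
    """Edit a module on disk and show what reload does and does not update."""
    with tempfile.TemporaryDirectory() as tmp:
        module_path = Path(tmp) / "mymodule_demo.py"
        module_path.write_text("GREETING = 'hello'\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            import mymodule_demo                  # plain import
            from mymodule_demo import GREETING    # from...import copy
            before = mymodule_demo.GREETING

            # "Edit" the module, then reload it.
            module_path.write_text("GREETING = 'hello, world'\n")
            importlib.invalidate_caches()
            importlib.reload(mymodule_demo)
            after = mymodule_demo.GREETING

            # The dotted name sees the change; the from...import copy does not.
            return before, after, GREETING
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("mymodule_demo", None)


print(demo_reload())  # ('hello', 'hello, world', 'hello')
```

The third value stays at its old setting because from...import bound a separate name at import time; reload rebinds names inside the module object only.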
Next Sessions…
• Rule-Based Tagging
• Stochastic Tagging
• Hidden Markov Models (HMMs)
• N-Grams
• Read Jurafsky and Martin Chapter 4 (PDF)
• Install NLTK