CSA 3050: Natural Language Processing, Tagging 1

CSA 3050: Natural Language Processing, Tagging 1
• Tagging
• POS and Tagsets
• Ambiguities
• NLTK
February 2007, CSA 3050: Tagging I

Tagging 1 Lecture
• Slides based on Mike Rosner and Marti Hearst notes
• Diane Litman’s version of Steven Bird’s notes
• Additions from NLTK tutorials

Tagging
Mr. Sherlock Holmes, who was usually very X, …
What is the part of speech of X?

Tagging
Mr. Sherlock Holmes, who was usually very late/ADJ in the mornings, save upon those not infrequent occasions when he was up all night, was Y …
What is the part of speech of Y?

Tagging
Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated/VBN at the breakfast table …

Tagging Terminology
• Tagging – the process of associating labels with each token in a text
• Tags – the labels
• Tag Set – the collection of tags used for a particular task

Tagging Example
Typically a tagged text is a sequence of whitespace-separated base/tag tokens:
The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
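The base/tag convention above is straightforward to process programmatically. As a rough illustration (a hypothetical helper, not part of NLTK), splitting each whitespace-separated token on its last slash recovers (word, tag) pairs:

```python
# Parse a whitespace-separated base/tag text into (word, tag) pairs.
tagged_text = "The/at interior/nn is/bez truly/ql majestic/jj ./."

def parse_tagged(text):
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST "/", so slashes inside the
        # word itself (e.g. "1/2/cd") stay with the word.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

print(parse_tagged(tagged_text))
# [('The', 'at'), ('interior', 'nn'), ('is', 'bez'),
#  ('truly', 'ql'), ('majestic', 'jj'), ('.', '.')]
```

Splitting on the last slash rather than the first is the safer choice precisely because the word may contain slashes while the tag never does.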

What does tagging do?
1. Collapses some distinctions
• Lexical identity may be discarded
• e.g. all personal pronouns tagged with PRP
2. … but introduces others
• Ambiguities may be removed
• e.g. deal tagged with NN or VB
• e.g. deal tagged with DEAL 1 or DEAL 2
3. Helps classification and prediction

Parts of Speech (POS)
• A word’s POS tells us a lot about the word and its neighbors:
– Limits the range of meanings (deal), pronunciation (OBject vs obJECT), or both (wind)
– Helps in stemming
– Limits the range of following words for Speech Recognition
– Can help select nouns from a document for IR
– Basis for partial parsing (chunked parsing)
– Parsers can build trees directly on the POS tags instead of maintaining a lexicon

POS and Tagsets
• The choice of tagset greatly affects the difficulty of the problem
• Need to strike a balance between
– Getting better information about context (best: introduce more distinctions)
– Making it possible for classifiers to do their job (need to minimize distinctions)

Common Tagsets
• Brown corpus: 87 tags
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the British National Corpus, BNC): 61 tags
• Lancaster C7: 145 tags

Brown Corpus
• The first digital corpus (1961)
– Francis and Kucera, Brown University
• Contents: 500 texts, each 2000 words long
– From American books, newspapers, magazines
– Representing genres:
• Science fiction, romance fiction, press reportage, scientific writing, popular lore

Penn Treebank
• First syntactically annotated corpus
• 1 million words from the Wall Street Journal
• Part of speech tags and syntax trees

Penn Treebank
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
VB DT NN .
Book that flight .
VBZ DT NN VB NN ?
Does that flight serve dinner ?

Penn Treebank (tagset table)
Penn Treebank – Important Tags (table)
Penn Treebank – Verb Tags (table)

Penn Treebank Example
(S (NP-SBJ-1 (DT The) (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take)
                  (PRT (RP up))
                  (NP (DT the) (NN measure))
                  (ADVP-TMP (RB quickly))))))
   (. .))

Tagging
• Typically the set of tags is larger than basic parts of speech
• Tags often contain some morphological information
• Often referred to as “morphosyntactic labels”

Tagging Ambiguities
FRUIT: N   FLIES: N-V   LIKE: V-IN   A: DT   BANANA: N
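The lattice above can be read as a set of candidate tags per word; the number of candidate tag sequences is the product of the per-word ambiguities. A small sketch (tag inventory taken directly from the slide):

```python
from itertools import product

# Candidate tags per word, as in the lattice above.
candidates = {
    "fruit":  ["N"],
    "flies":  ["N", "V"],
    "like":   ["V", "IN"],
    "a":      ["DT"],
    "banana": ["N"],
}

sentence = ["fruit", "flies", "like", "a", "banana"]

# Every combination of one tag per word is a candidate reading.
readings = list(product(*(candidates[w] for w in sentence)))
print(len(readings))  # 2 * 2 = 4 candidate tag sequences
```

Only two of the four sequences correspond to the grammatical interpretations shown next; a tagger's job is to pick the right one.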

Interpretation 1
(S (NP (N fruit) (N flies)) (VP (V like) (NP (DT a) (N banana))))

Interpretation 2
(S (NP (N fruit)) (VP (V flies) (PP (IN like) (NP (DT a) (N banana)))))

Lots of ambiguities…
1. He can a can.
2. I can light a fire and you can open a can of beans. Now the can is open, and we can eat in the light of the fire.

Lots of ambiguities…
• In the Brown Corpus
– 11.5% of word types are ambiguous
– 40% of word tokens are ambiguous
• Most words in English are unambiguous.
• Many of the most common words are ambiguous.
• Typically ambiguous tags are not equally probable.

Lots of ambiguities… (Brown Corpus)
Unambiguous (1 tag): 35,340 types
Ambiguous (2-7 tags): 4,100 types
2 tags: 3,760   3 tags: 264   4 tags: 61   5 tags: 12   6 tags: 2   7 tags: 1
(Table: DeRose, 1988)
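The type-ambiguity counts above are just a matter of collecting, for each word type, the set of distinct tags it receives in a tagged corpus. A minimal sketch over a tiny hand-tagged sample (the sample and its tags are hypothetical, not Brown data):

```python
from collections import defaultdict

# Tiny hand-tagged sample illustrating the counting method.
tagged = [("can", "MD"), ("can", "NN"), ("can", "VB"),
          ("light", "NN"), ("light", "JJ"),
          ("fire", "NN"), ("the", "DT")]

# Map each word type to the set of tags observed for it.
tags_per_type = defaultdict(set)
for word, tag in tagged:
    tags_per_type[word].add(tag)

ambiguous = {w for w, tags in tags_per_type.items() if len(tags) > 1}
print(sorted(ambiguous))  # ['can', 'light']
```

Run over the full Brown Corpus, the same loop yields the DeRose-style table: bucket the types by `len(tags)` and count each bucket.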

Approaches to Tagging
1. Rule-Based Tagger: ENGTWOL Tagger (Voutilainen 1995)
2. Stochastic Tagger: HMM-based Tagger
3. Transformation-Based Tagger: Brill Tagger (Brill 1995)
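A useful reference point for all three approaches is the most-frequent-tag baseline, which exploits the earlier observation that a word's possible tags are not equally probable. A minimal sketch with hypothetical training counts (not any of the taggers listed above):

```python
from collections import Counter, defaultdict

# Hypothetical training data: (word, tag) pairs.
train = [("book", "VB"), ("book", "VB"), ("book", "NN"),
         ("that", "DT"), ("flight", "NN"),
         ("that", "DT"), ("flight", "NN")]

# Count how often each tag occurs with each word.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def tag(word, default="NN"):
    # Pick the tag seen most often with this word;
    # fall back to a default tag for unseen words.
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default

print([tag(w) for w in ["book", "that", "dinner"]])  # ['VB', 'DT', 'NN']
```

Despite ignoring context entirely, this baseline is surprisingly hard to beat, which is why the stochastic and transformation-based taggers are usually evaluated against it.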

NLTK
• Natural Language Toolkit (NLTK)
• http://nltk.sourceforge.net/
• Please download and install!
• Runs on Python

NLTK Introduction
• The Natural Language Toolkit (NLTK) provides:
– Basic classes for representing data relevant to natural language processing.
– Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
– Standard implementations of each task, which can be combined to solve complex problems.
• Two versions: NLTK and NLTK-Lite

NLTK Modules
• nltk.token: processing individual elements of text, such as words or sentences.
• nltk.probability: modeling frequency distributions and probabilistic systems.
• nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags.
• nltk.parser: high-level interface for parsing texts.
• nltk.chartparser: a chart-based implementation of the parser interface.
• nltk.chunkparser: a regular-expression-based surface parser.

Python for NLP
• Python is a great language for NLP:
– Simple
– Easy to debug:
• Exceptions
• Interpreted language
– Easy to structure:
• Modules
• Object oriented programming
– Powerful string manipulation

Python Modules and Packages
• Python modules “package program code and data for reuse.” (Lutz)
– Similar to a library in C or a package in Java.
• Python packages are hierarchical modules (i.e., modules that contain other modules).
• Three commands for accessing modules:
1. import
2. from…import
3. reload

Import Command
• The import command loads a module:
# Load the regular expression module
>>> import re
• To access the contents of a module, use dotted names:
# Use the search function from the re module
>>> re.search(r'\w+', s)
• To list the contents of a module, use dir:
>>> dir(re)
['DOTALL', 'IGNORECASE', …]

from…import
• The from…import command loads individual functions and objects from a module:
# Load the search function from the re module
>>> from re import search
• Once an individual function or object is loaded with from…import, it can be used directly:
# Use the search function from the re module
>>> search(r'\w+', s)

Import vs. from…import
import:
• Keeps module functions separate from user functions.
• Requires the use of dotted names.
• Works with reload.
from…import:
• Puts module functions and user functions together.
• More convenient names.
• Does not work with reload.

Reload
• If you edit a module, you must use the reload command before the changes become visible in Python:
>>> import mymodule
. . .
>>> reload(mymodule)
• The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from…import.
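These slides predate Python 3, where the built-in reload was moved into the standard library. A modern equivalent, using the stdlib json module merely as an example target (any already-imported module works the same way):

```python
import importlib
import json  # stand-in for a module you have edited

# importlib.reload re-executes the module's code and returns the
# (updated) module object. As the slide notes, names bound earlier
# via "from json import dumps" would NOT be refreshed.
json = importlib.reload(json)
print(json.dumps({"ok": True}))  # {"ok": true}
```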


Next Sessions…
• Rule-Based Tagging
• Stochastic Tagging
• Hidden Markov Models (HMMs)
• N-Grams
• Read Jurafsky and Martin, Chapter 4 (PDF)
• Install NLTK