Word classes and part of speech tagging Chapter 5

Outline
• Why part of speech tagging?
• Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• In Part 2: finish stochastic tagging, then continue on to evaluation
Slide 1

Definition
“The process of assigning a part-of-speech or other lexical class marker to each word in a corpus” (Jurafsky and Martin)
WORDS: the girl kissed the boy on the cheek
TAGS: N, V, P, DET
Slide 2

An Example
WORD     LEMMA    TAG
the      the      +DET
girl     girl     +NOUN
kissed   kiss     +VPAST
the      the      +DET
boy      boy      +NOUN
on       on       +PREP
the      the      +DET
cheek    cheek    +NOUN
Slide 3

Motivation
• Speech synthesis: pronunciation
• Speech recognition: class-based N-grams
• Information retrieval: stemming, selection of high-content words
• Word-sense disambiguation
• Corpus analysis of language & lexicography
Slide 4

Word Classes
• Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …
• Open vs. Closed classes
  – Open: Nouns, Verbs, Adjectives, Adverbs. Why “open”?
  – Closed:
      determiners: a, an, the
      pronouns: she, I
      prepositions: on, under, over, near, by, …
Slide 5

Open Class Words
• Every known human language has nouns and verbs
• Nouns: people, places, things
  – Classes of nouns: proper vs. common, count vs. mass
• Verbs: actions and processes
• Adjectives: properties, qualities
• Adverbs: hodgepodge!
  – Unfortunately, John walked home extremely slowly yesterday
• Numerals: one, two, three, third, …
Slide 6

Closed Class Words
• Differ more from language to language than open class words
• Examples:
  – prepositions: on, under, over, …
  – particles: up, down, off, …
  – determiners: a, an, the, …
  – pronouns: she, who, I, …
  – conjunctions: and, but, or, …
  – auxiliary verbs: can, may, should, …
Slide 7

Word Classes: Tag Sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and purpose
  – Some tagging approaches (e.g., constraint grammar based) make fewer distinctions, e.g., conflating prepositions, conjunctions, particles
  – Simple morphology = more ambiguity = fewer tags
Slide 8

Word Classes: Tag set example
[Tag set table not recoverable; only the entry PRP$ survives]
Slide 9

Example of Penn Treebank Tagging of Brown Corpus Sentence
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

VB   DT   NN     .
Book that flight .

VBZ  DT   NN     VB    NN     ?
Does that flight serve dinner ?

See http://www.infogistics.com/posdemo.htm
Buffalo buffalo Buffalo buffalo
Slide 10
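
The word/TAG notation in the Brown Corpus example is straightforward to read programmatically. The following is a minimal sketch, assuming whitespace-separated tokens in which the tag follows the last slash; the function name `parse_tagged` is illustrative and does not come from the slides.

```python
def parse_tagged(sentence):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' keeps its slash and only the tag is cut off.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
          "number/NN of/IN other/JJ topics/NNS ./.")
pairs = parse_tagged(tagged)
tags = [tag for _, tag in pairs]
print(pairs[:3])
print(tags)
```

Splitting on the last slash rather than the first is a design choice for this sketch; it keeps tokens such as fractions intact at the cost of assuming that tags never contain a slash.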

The Problem
Words often have more than one word class: this
  – This is a nice day = PRP
  – This day is nice = DT
  – You can go this far = RB
Slide 11

Word Class Ambiguity (in the Brown Corpus)
Unambiguous (1 tag):     35,340
Ambiguous (2-7 tags):     4,100
  2 tags                  3,760
  3 tags                    264
  4 tags                     61
  5 tags                     12
  6 tags                      2
  7 tags                      1
(DeRose, 1988)
Slide 12
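
The counts above can be turned into a proportion with a few lines of arithmetic. This sketch only recomputes totals from the figures quoted on the slide; the variable names are ours. Note that these are counts of word types, not running tokens, and ambiguous types tend to be frequent words.

```python
# Word-type counts keyed by number of possible tags,
# as quoted on the slide (DeRose, 1988).
types_by_tag_count = {1: 35340, 2: 3760, 3: 264, 4: 61, 5: 12, 6: 2, 7: 1}

total_types = sum(types_by_tag_count.values())
ambiguous_types = sum(n for k, n in types_by_tag_count.items() if k > 1)
ambiguous_share = ambiguous_types / total_types

print(f"total types:     {total_types}")
print(f"ambiguous types: {ambiguous_types}")
print(f"ambiguous share: {ambiguous_share:.1%}")
```

The per-tag rows sum to the 4,100 ambiguous types reported on the slide, which is a useful consistency check on the table.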

Part-of-Speech Tagging
• Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)
• Stochastic Tagger: HMM-based
• Transformation-Based Tagger (Brill) (we won’t cover this)
Slide 13

Rule-Based Tagging
• Basic Idea:
  – Assign all possible tags to words
  – Remove tags according to a set of rules of the type: if word+1 is an adj, adv, or quantifier, and the following is a sentence boundary, and word-1 is not a verb like “consider”, then eliminate non-adv tags; else eliminate adv.
  – Typically more than 1000 hand-written rules
Slide 14
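
The two-step idea (assign every possible tag, then let rules remove tags) can be sketched in a few lines. Everything specific here is an assumption made for illustration: the toy lexicon, the tag names, and the single rule `no_verb_after_det` are hypothetical and far simpler than the hand-written rules the slide refers to.

```python
# Toy lexicon listing every tag each word may take (hypothetical entries).
LEXICON = {
    "the": {"DET"},
    "can": {"AUX", "NOUN", "VERB"},
    "rusted": {"VERB"},
}


def no_verb_after_det(readings, i):
    """Hypothetical rule: directly after an unambiguous determiner,
    eliminate the AUX and VERB readings."""
    if i > 0 and readings[i - 1] == {"DET"}:
        return readings[i] - {"AUX", "VERB"}
    return readings[i]


def tag(words, rules):
    # Step 1: assign all possible tags to each word.
    readings = [set(LEXICON[w.lower()]) for w in words]
    # Step 2: let each rule remove tags, never deleting the last reading.
    for rule in rules:
        for i in range(len(readings)):
            remaining = rule(readings, i)
            if remaining:
                readings[i] = remaining
    return readings


result = tag(["The", "can", "rusted"], [no_verb_after_det])
print(result)
```

Rules in this style only ever remove readings, so the order in which they run can matter; the guard against emptying a reading set is one simple way to keep every word tagged.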

Sample ENGTWOL Lexicon
Demo: http://www2.lingsoft.fi/cgi-bin/engtwol
Slide 15

Stage 1 of ENGTWOL Tagging
First Stage: Run words through a morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …

Pavlov       PAVLOV N NOM SG PROPER
had          HAVE V PAST VFIN SVO
             HAVE PCP2 SVO
shown        SHOW PCP2 SVOO SV
that         ADV
             PRON DEM SG
             DET CENTRAL DEM SG
             CS
salivation   N NOM SG
Slide 16
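
The output of the first stage can be pictured as a lookup that returns every candidate reading for each word. The sketch below is not the ENGTWOL analyzer: it replaces the two-level morphological analysis with a hand-written dictionary holding the readings from the Pavlov example, and the names `READINGS`, `analyze`, and the `<unknown>` placeholder are assumptions.

```python
# Readings copied from the "Pavlov had shown that salivation" example.
# A real system derives them by morphological analysis; this dictionary
# is only a stand-in for illustration.
READINGS = {
    "pavlov": ["PAVLOV N NOM SG PROPER"],
    "had": ["HAVE V PAST VFIN SVO", "HAVE PCP2 SVO"],
    "shown": ["SHOW PCP2 SVOO SV"],
    "that": ["ADV", "PRON DEM SG", "DET CENTRAL DEM SG", "CS"],
    "salivation": ["N NOM SG"],
}


def analyze(words):
    """Return every candidate reading for each word, left unresolved."""
    return [(w, READINGS.get(w.lower(), ["<unknown>"])) for w in words]


analysis = analyze("Pavlov had shown that salivation".split())
for word, readings in analysis:
    print(f"{word:<12}{len(readings)} reading(s)")
    for reading in readings:
        print(f"{'':<12}{reading}")
```

The point of the stage is that nothing is decided yet: "that" leaves with four readings, and choosing among them is left to the constraints of the second stage.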

Stage 2 of ENGTWOL Tagging
Second Stage: Apply constraints. Constraints are used in a negative way, to eliminate readings.
Example: Adverbial “that” rule
Given input: “that”
If
  (+1 A/ADV/QUANT)
  (+2 SENT-LIM)
  (NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV
Slide 17
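
The adverbial “that” rule can be written out as a small function. This is a sketch under stated assumptions, not ENGTWOL's own rule format: readings are modelled as frozensets of feature labels, the helper names are ours, and the feature sets given to the two test sentences are hypothetical simplifications.

```python
def has_feature(readings, features):
    """True if any reading of a word carries any of the given features."""
    return any(features & reading for reading in readings)


def adverbial_that(sentence, i):
    """Return the readings of 'that' at index i that survive the rule.

    `sentence` is a list of (word, readings) pairs; each reading is a
    frozenset of feature labels.
    """
    _, readings = sentence[i]
    nxt = sentence[i + 1][1] if i + 1 < len(sentence) else []
    prev = sentence[i - 1][1] if i > 0 else []
    # +2 SENT-LIM: two positions to the right is a sentence boundary.
    at_limit = (i + 2 >= len(sentence)
                or has_feature(sentence[i + 2][1], {"SENT-LIM"}))
    if (has_feature(nxt, {"A", "ADV", "QUANT"})      # +1 A/ADV/QUANT
            and at_limit                              # +2 SENT-LIM
            and not has_feature(prev, {"SVOC/A"})):   # NOT -1 SVOC/A
        return [r for r in readings if "ADV" in r]    # eliminate non-ADV
    return [r for r in readings if "ADV" not in r]    # eliminate ADV


THAT = [frozenset({"ADV"}), frozenset({"PRON", "DEM", "SG"}),
        frozenset({"DET", "CENTRAL", "DEM", "SG"}), frozenset({"CS"})]
ODD = [frozenset({"A"})]
END = [frozenset({"SENT-LIM"})]

isnt_that_odd = [("it", [frozenset({"PRON"})]),
                 ("isn't", [frozenset({"V"})]),
                 ("that", THAT), ("odd", ODD), (".", END)]
consider_that_odd = [("I", [frozenset({"PRON"})]),
                     ("consider", [frozenset({"V", "SVOC/A"})]),
                     ("that", THAT), ("odd", ODD), (".", END)]

kept_adv = adverbial_that(isnt_that_odd, 2)
kept_other = adverbial_that(consider_that_odd, 2)
print(kept_adv)
print(len(kept_other))
```

In "it isn't that odd" all three conditions hold, so only the ADV reading is kept; in "I consider that odd" the preceding verb carries SVOC/A, so the ADV reading is the one that is removed.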