Ch 10 Part-of-Speech Tagging
Edited from: L. Venkata Subramaniam, February 28, 2002

Part-of-Speech Tagging

• Tagging is the task of labeling (or tagging) each word in a sentence with its appropriate part of speech.
• Example: The[AT] representative[NN] put[VBD] chairs[NNS] on[IN] the[AT] table[NN].
• Tagging is a case of limited syntactic disambiguation: many words have more than one syntactic category.
• Tagging has limited scope: we fix only the syntactic categories of the words and do not do a complete parse.
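As a concrete illustration (not part of the original slides), here is a minimal sketch using NLTK's off-the-shelf tagger. It assumes nltk is installed with its tokenizer and tagger resources downloaded, and it emits Penn Treebank tags (DT, NN, ...) rather than the Brown-style tags ([AT], [NN], ...) shown above.

```python
# Minimal POS-tagging sketch with NLTK (assumes nltk is installed and its
# tokenizer/tagger data packages have been fetched via nltk.download).
import nltk

tokens = nltk.word_tokenize("The representative put chairs on the table.")
print(nltk.pos_tag(tokens))
# Roughly: [('The', 'DT'), ('representative', 'NN'), ('put', 'VBD'),
#           ('chairs', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('table', 'NN'), ('.', '.')]
```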

Information Sources in Tagging

• How do we decide the correct POS for a word?
• Syntagmatic information: look at the tags of the other words in the context of the word we are interested in.
• Lexical information: predict the tag from the word itself. Even when a word has several possible parts of speech, it usually occurs predominantly as one particular POS.
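A small sketch of the lexical-information idea: a baseline that assigns each word its single most frequent tag from a tagged corpus. All names and the tiny corpus here are illustrative.

```python
from collections import Counter, defaultdict

def most_frequent_tag_lexicon(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs from some tagged text."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # Lexical information only: each word's single most frequent tag.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

corpus = [("the", "AT"), ("representative", "NN"), ("put", "VBD"),
          ("chairs", "NNS"), ("on", "IN"), ("the", "AT"), ("table", "NN")]
lexicon = most_frequent_tag_lexicon(corpus)
print(lexicon["the"], lexicon.get("sofa", "NN"))  # AT NN (unknowns default to NN)
```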

Markov Model Taggers

• We view the sequence of tags in a text as a Markov chain, with two standard assumptions:
• Limited horizon: P(X_{i+1} = t^j | X_1, ..., X_i) = P(X_{i+1} = t^j | X_i).
• Time invariance (stationarity): P(X_{i+1} = t^j | X_i) = P(X_2 = t^j | X_1).

The Visible Markov Model Tagging Algorithm

• The MLE of tag t^k following tag t^j is obtained from a tagged training corpus: a_{kj} = P(t^k | t^j) = C(t^j, t^k) / C(t^j), where C(t^j, t^k) counts occurrences of t^j immediately followed by t^k.
• The probability of a word w^l being emitted by a tag t^j: b_{jl} = P(w^l | t^j) = C(w^l, t^j) / C(t^j).
• The best tagging sequence t_{1,n} for a sentence w_{1,n}:

  argmax_{t_{1,n}} P(t_{1,n} | w_{1,n})
    = argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n}) / P(w_{1,n})
    = argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n})
    = argmax_{t_{1,n}} ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
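Both estimates above are plain relative-frequency counts. A minimal sketch, assuming sentences are given as lists of (word, tag) pairs and using a PERIOD pseudo-tag as the sentence boundary (as the Viterbi slide below also does):

```python
from collections import Counter

def mle_estimates(tagged_sentences):
    """Relative-frequency estimates from sentences given as lists of
    (word, tag) pairs, with a PERIOD pseudo-tag as sentence boundary."""
    trans, emit, tag_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["PERIOD"] + [t for _, t in sent]
        for tj, tk in zip(tags, tags[1:]):   # C(tj, tk): tj followed by tk
            trans[(tj, tk)] += 1
        for w, t in sent:
            emit[(t, w)] += 1                # C(w, t)
            tag_count[t] += 1                # C(t)
        tag_count["PERIOD"] += 1
    a = {(tj, tk): c / tag_count[tj] for (tj, tk), c in trans.items()}
    b = {(tj, w): c / tag_count[tj] for (tj, w), c in emit.items()}
    return a, b
```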

The Viterbi Algorithm

We determine the optimal tags for a sentence:

1.  comment: Given a sentence of length n
2.  comment: Initialization
3.  δ_1(PERIOD) := 1.0
4.  δ_1(t) := 0.0 for t ≠ PERIOD
5.  comment: Induction
6.  for i := 1 to n step 1 do
7.    for all tags t^j do
8.      δ_{i+1}(t^j) := max_{1 ≤ k ≤ T} [ δ_i(t^k) P(w^{i+1} | t^j) P(t^j | t^k) ]
9.      ψ_{i+1}(t^j) := argmax_{1 ≤ k ≤ T} [ δ_i(t^k) P(w^{i+1} | t^j) P(t^j | t^k) ]
10.   end
11. end
12. comment: Termination and path read-out: follow the back-pointers ψ from the most probable final state.
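A runnable sketch of the algorithm above, working in log space to avoid underflow; `a` and `b` are the transition and emission dictionaries produced by the MLE sketch earlier. A real implementation would smooth unseen events instead of giving them probability zero.

```python
import math

def viterbi(words, tags, a, b, start="PERIOD"):
    """Decode the most probable tag sequence for `words` under transition
    probabilities a[(tk, tj)] = P(tj|tk) and emission probabilities
    b[(tj, w)] = P(w|tj), computed in log space."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    # Initialization: delta_1 puts all probability mass on the boundary tag.
    delta = {t: (0.0 if t == start else float("-inf")) for t in [start] + tags}
    backpointers = []
    # Induction: extend the best path into each tag, one word at a time.
    for w in words:
        new_delta, back = {}, {}
        for tj in tags:
            tk, score = max(((tk, delta[tk] + logp(a.get((tk, tj), 0.0)))
                             for tk in delta), key=lambda kv: kv[1])
            new_delta[tj] = score + logp(b.get((tj, w), 0.0))
            back[tj] = tk
        delta = new_delta
        backpointers.append(back)
    # Termination and path read-out via the back-pointers (psi).
    best = max(delta, key=delta.get)
    path = [best]
    for back in reversed(backpointers[1:]):
        path.append(back[path[-1]])
    return list(reversed(path))
```

With `a` and `b` from the MLE sketch, `viterbi(sentence_words, tag_list, a, b)` returns the highest-probability tag sequence read out from the back-pointers.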

Unknown Words

• Some tags are more common than others; for example, a new word is most likely to be a verb or a noun, but not a preposition or an article.
• Use features of the word (morphological and other cues); for example, words ending in -ed are likely to be past-tense forms or past participles.
• Use context.
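A purely illustrative sketch of the morphological-cue idea; the suffix-to-tag mapping is hand-written here, not a learned model, and real taggers combine such cues probabilistically with context.

```python
def guess_unknown_tag(word):
    """A hand-written heuristic (illustrative only) mapping surface cues
    of an unseen word to a likely open-class tag."""
    if word[:1].isupper():
        return "NP"    # capitalized: likely a proper noun
    if word.endswith("ed"):
        return "VBD"   # likely past tense or past participle
    if word.endswith("ing"):
        return "VBG"   # likely gerund/present participle
    if word.endswith("ly"):
        return "RB"    # likely adverb
    if word.endswith("s"):
        return "NNS"   # likely plural noun
    return "NN"        # default to common noun, never preposition/article

print([guess_unknown_tag(w) for w in ["frobnicated", "glarping", "Zorblat"]])
# ['VBD', 'VBG', 'NP']
```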

Hidden Markov Model Taggers

• Often a large tagged training set is not available.
• In this case we can use an HMM to learn the regularities of tag sequences.
• The states of the HMM correspond to the tags, and the output alphabet consists of the words in the dictionary or of classes of words.

Initialization of the HMM Tagger

• (Jelinek, 1985) The output alphabet consists of words. Emission probabilities are given by:

  b_{j.l} = b*_{j.l} C(w^l) / Σ_{w^m} b*_{j.m} C(w^m)

  where the sum is over all words w^m in the dictionary, and

  b*_{j.l} = 0 if t^j is not a part of speech allowed for w^l,
  b*_{j.l} = 1 / T(w^l) otherwise, where T(w^l) is the number of tags allowed for w^l.
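A sketch of this initialization, assuming a dictionary mapping each word to its set of allowed tags; all names here are illustrative.

```python
def jelinek_init(word_counts, allowed_tags, tags):
    """Sketch of the dictionary-constrained emission initialization.
    word_counts: dict of raw frequencies C(w), assumed to cover the
    dictionary; allowed_tags: word -> set of tags the dictionary allows."""
    b = {}
    for tj in tags:
        # b*_{j.l}: uniform over the tags allowed for each word, else 0.
        bstar = {w: (1.0 / len(allowed_tags[w]) if tj in allowed_tags[w] else 0.0)
                 for w in word_counts}
        z = sum(bstar[w] * word_counts[w] for w in word_counts)  # denominator
        for w in word_counts:
            b[(tj, w)] = bstar[w] * word_counts[w] / z if z > 0 else 0.0
    return b
```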

Initialization (Cont.)

• (Kupiec, 1992) The output alphabet consists of word equivalence classes, i.e., metawords u_L, where L is a subset of the integers from 1 to T and T is the number of different tags in the tag set.

  b_{j.L} = b*_{j.L} C(u_L) / Σ_{u_L'} b*_{j.L'} C(u_L')

  where the sum in the denominator is over all metawords u_L', and

  b*_{j.L} = 0 if j is not in L,
  b*_{j.L} = 1 / |L| otherwise, where |L| is the number of indices in L.
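A sketch of the equivalence-class idea: words that allow exactly the same set of tags collapse into one metaword, which shrinks the output alphabet. The mini-dictionary is made up for illustration.

```python
def metaword(word, allowed_tags):
    """Kupiec-style equivalence class: words allowing exactly the same set
    of tags map to the same metaword u_L."""
    return frozenset(allowed_tags[word])

# A made-up mini-dictionary for illustration:
allowed_tags = {"the": {"AT"}, "a": {"AT"},
                "put": {"VB", "VBD", "VBN"}, "walk": {"VB", "NN"}}
print(metaword("the", allowed_tags) == metaword("a", allowed_tags))    # True
print(metaword("put", allowed_tags) == metaword("walk", allowed_tags)) # False
```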

Training the HMM

• Once the initialization is completed, the HMM is trained using the Forward-Backward algorithm (a sketch follows below).

Tagging Using the HMM

• Use the Viterbi algorithm.
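A compact sketch of the E-step of Forward-Backward (Baum-Welch) training mentioned above, using NumPy: it computes the posterior state probabilities that the re-estimation step accumulates as soft counts. The matrix shapes and names are assumptions for illustration.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """E-step sketch of Baum-Welch: posterior state probabilities for one
    observation sequence. pi: (S,) initial probs; A: (S, S) transitions;
    B: (S, V) emissions; obs: indices into the output alphabet."""
    n, S = len(obs), len(pi)
    alpha = np.zeros((n, S))          # forward probabilities
    beta = np.zeros((n, S))           # backward probabilities
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, n):             # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):    # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # P(state_t = s | obs)
    return gamma   # the M-step re-estimates A and B from these soft counts
```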

Transformation-Based Tagging

• Exploits a wider range of lexical and syntactic regularities.
• Conditions the tags on preceding words, not just on preceding tags.
• Uses more context than a bigram or trigram model.

Transformations

• A transformation consists of two parts: a triggering environment and a rewrite rule.
• Examples of some transformations learned in transformation-based tagging:

  Source tag   Target tag   Triggering environment
  NN           VB           previous tag is TO
  VBP          VB           one of the previous three tags is MD
  JJR          RBR          next tag is JJ
  VBP          VB           one of the previous two words is n't
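A hypothetical encoding of the rules in the table as (source tag, target tag, trigger) triples, where each trigger inspects the current tagging around position i:

```python
# Illustrative encoding of the table's rules; each trigger is a predicate
# over the current tags/words and a position i.
rules = [
    ("NN",  "VB",  lambda tags, words, i: i > 0 and tags[i - 1] == "TO"),
    ("VBP", "VB",  lambda tags, words, i: "MD" in tags[max(0, i - 3):i]),
    ("JJR", "RBR", lambda tags, words, i: i + 1 < len(tags) and tags[i + 1] == "JJ"),
    ("VBP", "VB",  lambda tags, words, i: "n't" in words[max(0, i - 2):i]),
]

def apply_rules(words, tags):
    """Apply each transformation once, left to right, in learned order."""
    tags = list(tags)
    for src, tgt, trigger in rules:
        for i in range(len(tags)):
            if tags[i] == src and trigger(tags, words, i):
                tags[i] = tgt
    return tags

words = ["to", "book", "a", "flight"]
print(apply_rules(words, ["TO", "NN", "AT", "NN"]))
# ['TO', 'VB', 'AT', 'NN']   (NN -> VB after TO)
```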

Learning Algorithm

• The learning algorithm selects the best transformations and determines their order of application (see the sketch below).
• Initially, each word is tagged with its most frequent tag.
• Iteratively, we choose the transformation that reduces the error rate the most.
• We stop when no remaining transformation reduces the error rate by more than a prespecified threshold.
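A sketch of the greedy selection loop, reusing the (source, target, trigger) rule format from the previous sketch; `candidates`, `gold`, and the threshold semantics are illustrative assumptions.

```python
def apply_rule(words, tags, rule):
    """Apply one (source, target, trigger) rule everywhere it fires."""
    src, tgt, trigger = rule
    out = list(tags)
    for i in range(len(out)):
        if out[i] == src and trigger(out, words, i):
            out[i] = tgt
    return out

def learn_transformations(words, gold, tags, candidates, threshold=0):
    """Greedily pick the candidate rule that most reduces errors against
    the gold tags; stop when no rule beats the threshold."""
    def errors(t):
        return sum(x != y for x, y in zip(t, gold))

    learned = []
    while True:
        best_rule, best_gain = None, threshold
        for rule in candidates:
            gain = errors(tags) - errors(apply_rule(words, tags, rule))
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:  # no remaining rule clears the threshold
            return learned
        tags = apply_rule(words, tags, best_rule)
        learned.append(best_rule)
```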

Tagging Accuracy

• Ranges from 95% to 97%.
• Depends on:
  • the amount of training data available;
  • the tag set;
  • the difference between the training corpus and dictionary and the corpus of application;
  • unknown words in the corpus of application.

Applications of Tagging

• Partial parsing: syntactic analysis.
• Information extraction: tagging and partial parsing help identify useful terms and the relationships between them.
• Information retrieval: noun phrase recognition and query-document matching based on meaningful units rather than individual terms.
• Question answering: analyzing a query to determine what type of entity the user is looking for and how it is related to other noun phrases mentioned in the question.