Named Entity Tagging

Thanks to Dan Jurafsky, Jim Martin, Ray Mooney, and Tom Mitchell for slides

Outline
• Named Entities and the basic idea
• IOB Tagging
• A new classifier: Logistic Regression
  • Linear regression
  • Logistic regression
  • Multinomial logistic regression = MaxEnt
• Why classifiers aren’t as good as sequence models
• A new sequence model:
  • MEMM = Maximum Entropy Markov Model

Named Entity Tagging

CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

Slide from Jim Martin

Named Entity Recognition
• Find the named entities and classify them by type
• Typical approach
  • Acquire training data
  • Encode using IOB labeling
  • Train a sequential supervised classifier
  • Augment with pre- and post-processing using available list resources (census data, gazetteers, etc.)
Slide from Jim Martin

Temporal and Numerical Expressions
• Temporals
  • Find all the temporal expressions
  • Normalize them based on some reference point
• Numerical Expressions
  • Find all the expressions
  • Classify by type
  • Normalize
Slide from Jim Martin

NE Types
Slide from Jim Martin

NE Types: Examples
Slide from Jim Martin

Ambiguity
Slide from Jim Martin

Biomedical Entities
• Disease
• Symptom
• Drug
• Body Part
• Treatment
• Enzyme
• Protein
• Difficulty: discontiguous or overlapping mentions
  • Abdomen is soft, nontender, nondistended, negative bruits

NER Approaches
• As with partial parsing and chunking, there are two basic approaches (and hybrids)
  • Rule-based (regular expressions)
    • Lists of names
    • Patterns to match things that look like names
    • Patterns to match the environments that classes of names tend to occur in
  • ML-based approaches
    • Get annotated training data
    • Extract features
    • Train systems to replicate the annotation
Slide from Jim Martin

ML Approach
Slide from Jim Martin

Encoding for Sequence Labeling
• We can use IOB encoding:

  …United  Airlines  said  Friday  it  has increased
   B_ORG   I_ORG     O     O       O

  the  move,  spokesman  Tim    Wagner  said.
  O    O      O          B_PER  I_PER   O

• How many tags?
  • For N classes we have 2*N+1 tags
    • An I and a B for each class and one O for no-class
• Each token in a text gets a tag
• Can use simpler IO tagging if what?
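
The IOB encoding above can be sketched in a few lines of Python. This is only an illustration: the helper names `iob_encode` and `tagset_size`, and the `(start, end, label)` span format with an exclusive end index, are assumptions and do not come from the slides.

```python
def iob_encode(tokens, spans):
    """Return one IOB tag per token.

    spans: list of (start, end, label) entity spans, end exclusive
    (a hypothetical input format chosen for this sketch).
    """
    tags = ["O"] * len(tokens)          # every token starts as "outside"
    for start, end, label in spans:
        tags[start] = "B_" + label      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I_" + label      # remaining tokens of the entity
    return tags


def tagset_size(num_classes):
    # One B tag and one I tag per class, plus a single O tag.
    return 2 * num_classes + 1


tokens = ["spokesman", "Tim", "Wagner", "said"]
tags = iob_encode(tokens, [(1, 3, "PER")])
print(list(zip(tokens, tags)))
print(tagset_size(3))
```

With three entity classes (for example PER, ORG, LOC) the tag set has 7 members.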

NER Features
Slide from Jim Martin

Discriminative vs Generative
• Generative model:
  • Estimate the full joint distribution P(y, x)
  • Use Bayes rule to obtain P(y | x), or use argmax for classification
• Discriminative model:
  • Estimate P(y | x) in order to predict y from x

How to do NE tagging?
• Classifiers
  • Naïve Bayes
  • Logistic Regression
• Sequence Models
  • HMMs
  • MEMMs
  • CRFs
  • Convolutional Neural Network
• Sequence models work better

Linear Regression
• Example from Freakonomics (Levitt and Dubner 2005)
  • Fantastic/cute/charming versus granite/maple
• Can we predict price from # of adjs?

  # vague adjectives    Price increase
  4                     $0
  3                     $1000
  2                     $1500
  2                     $6000
  1                     $14000
  0                     $18000
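
As a rough check on the question above, a least-squares line can be fitted to the six data points from the table. The sketch below uses the standard closed-form formulas for simple regression; the function names `fit_line` and `predict` are made up for this example.

```python
# (number of vague adjectives, price increase in dollars), copied from the table
data = [(4, 0), (3, 1000), (2, 1500), (2, 6000), (1, 14000), (0, 18000)]


def fit_line(points):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(data)


def predict(num_adjectives):
    return intercept + slope * num_adjectives


print(slope, intercept)   # -4900.0 16550.0
print(predict(5))         # extrapolated price increase for 5 vague adjectives
```

On this data the fitted line is price = 16550 − 4900 × (number of vague adjectives), so each additional vague adjective is associated with a lower price increase.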

Linear Regression

Multiple Linear Regression
• Predicting values:
• In general:
  • Let’s pretend an extra “intercept” feature f0 with value 1
• Multiple Linear Regression

Learning in Linear Regression
• Consider one instance xj
• We would like to choose weights to minimize the difference between predicted and observed value for xj:
• This is an optimization problem that turns out to have a closed-form solution

• Put the observations from the training set into a matrix X, one row of feature values f(i) per observation
• Put the observed values in a vector y
• Formula that minimizes the cost: $W = (X^{T}X)^{-1}X^{T}\vec{y}$
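
The equations on these slides did not survive extraction. A standard way to write the cost and arrive at the closed-form solution is sketched below; the index names ($j$ over the $M$ training instances, $i$ over the $N$ features plus the intercept feature $f_0$) are assumptions chosen to match the surrounding text.

```latex
% Prediction for instance j, with intercept feature f_0 = 1
y_{\mathrm{pred}}^{(j)} = \sum_{i=0}^{N} w_i \, f_i^{(j)}

% Sum-squared error over the training set
\mathrm{cost}(W) = \sum_{j=1}^{M} \left( y_{\mathrm{pred}}^{(j)} - y_{\mathrm{obs}}^{(j)} \right)^{2}
                 = \lVert XW - \vec{y} \rVert^{2}

% Setting the gradient to zero gives the normal equations
\nabla_W \, \mathrm{cost}(W) = 2\,X^{T}(XW - \vec{y}) = 0
\;\Longrightarrow\; X^{T}X\,W = X^{T}\vec{y}
\;\Longrightarrow\; W = (X^{T}X)^{-1}X^{T}\vec{y}
```

The last step assumes $X^{T}X$ is invertible.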

Logistic Regression

Logistic Regression
• But in language problems we are doing classification
  • Predicting one of a small set of discrete values
• Could we just use linear regression for this?

Logistic regression
• Not possible: the result doesn’t fall between 0 and 1
• Instead of predicting prob, predict ratio of probs:
  • but still not good: the odds lie between 0 and ∞, while a linear function ranges from −∞ to ∞
• So how about if we predict the log:

Logistic regression
• Solving this for p(y=true)
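
The derivation itself was an image on the original slide. Assuming the model predicts the log odds as a linear function $w \cdot f$ of the features, the algebra goes as follows, writing $p$ for $p(y = \mathrm{true} \mid x)$:

```latex
\ln\!\left(\frac{p}{1 - p}\right) = w \cdot f
\;\Longrightarrow\; \frac{p}{1 - p} = e^{\,w \cdot f}
\;\Longrightarrow\; p = e^{\,w \cdot f}\,(1 - p)
\;\Longrightarrow\; p\left(1 + e^{\,w \cdot f}\right) = e^{\,w \cdot f}

p(y = \mathrm{true} \mid x) = \frac{e^{\,w \cdot f}}{1 + e^{\,w \cdot f}} = \frac{1}{1 + e^{-w \cdot f}},
\qquad
p(y = \mathrm{false} \mid x) = \frac{1}{1 + e^{\,w \cdot f}}
```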

Logistic function
• The inverse of the logit, a.k.a. the sigmoid; maps any real value into the range [0, 1]

Logistic Regression
• How do we do classification?
• Or:
• Or, in explicit sum notation:
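
The decision rules on this slide were images and are not preserved. Under the usual formulation, we answer "true" when p(y=true|x) > p(y=false|x), which is equivalent to w·f > 0, or in sum notation Σ w_i f_i > 0. A minimal Python sketch of that rule follows; the weights and feature values are invented for illustration.

```python
import math


def dot(weights, features):
    # Explicit sum notation: sum over i of w_i * f_i
    return sum(w * f for w, f in zip(weights, features))


def sigmoid(z):
    # Logistic function: maps a real-valued score to a probability
    return 1.0 / (1.0 + math.exp(-z))


def p_true(weights, features):
    return sigmoid(dot(weights, features))


def classify(weights, features):
    # p(true|x) > p(false|x) holds exactly when w.f > 0
    return dot(weights, features) > 0


# Hypothetical weights; features[0] is the intercept feature f0 = 1
weights = [-1.0, 2.0]
features = [1.0, 1.0]
print(p_true(weights, features), classify(weights, features))
```

Comparing the score against 0 avoids computing the exponential at all when only the decision is needed.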

Multinomial logistic regression
• Multiple classes:
• One change: indicator functions f(c, x) instead of real values
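
A small sketch of the multi-class case, assuming the usual MaxEnt form p(c|x) = exp(Σ w_i f_i(c, x)) / Σ_c' exp(Σ w_i f_i(c', x)). The three indicator features, the class list, and the weights below are all hypothetical; a real tagger would have many thousands of features with learned weights.

```python
import math

CLASSES = ["B_PER", "I_PER", "O"]


def features(c, word, prev_word):
    """Hypothetical indicator features f_i(c, x): each fires (1.0) only for
    one class and one property of the input. Words are assumed non-empty."""
    return [
        1.0 if c == "B_PER" and word[0].isupper() else 0.0,
        1.0 if c == "I_PER" and word[0].isupper() and prev_word[0].isupper() else 0.0,
        1.0 if c == "O" and word.islower() else 0.0,
    ]


def class_probs(weights, word, prev_word):
    # Unnormalized score exp(w . f(c, x)) for each class, then normalize
    scores = {
        c: math.exp(sum(w * f for w, f in zip(weights, features(c, word, prev_word))))
        for c in CLASSES
    }
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


weights = [1.0, 2.0, 1.5]                      # made-up weights, one per feature
probs = class_probs(weights, "Wagner", "Tim")
print(max(probs, key=probs.get), probs)
```

Because each feature names the class it supports, a single weight vector serves all classes.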

Estimating the weights
• Generalized Iterative Scaling (GIS)
  • (Darroch and Ratcliff, 1972)
• Improved Iterative Scaling (IIS)
  • (Della Pietra et al., 1995)

GIS: setup
Requirements for running GIS:
• Obey form of model and constraints:
• An additional constraint:
• Add a new feature f_{k+1}:

GIS algorithm
• Compute d_j, j = 1, …, k+1
• Initialize (any values, e.g., 0)
• Repeat until convergence
  • for each j
    • compute
    • update
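
The formulas for the compute and update steps were not preserved. The sketch below follows the standard GIS recipe for a conditional MaxEnt model, under these assumptions: d_j is the empirical expectation of feature j, the "compute" step is the model expectation of feature j under the current weights, and the update is λ_j += (1/C)·log(d_j / E_model[f_j]), where C is the constant feature sum enforced by the extra correction feature f_{k+1}. The toy data, the two base features, and the fixed iteration count are invented for illustration.

```python
import math

CLASSES = ["PER", "O"]
DATA = [("Tim", "PER"), ("Wagner", "PER"), ("said", "O"), ("the", "O"), ("Friday", "O")]


def base_features(x, y):
    # Two hypothetical indicator features f_1, f_2
    return [
        1.0 if x[0].isupper() and y == "PER" else 0.0,
        1.0 if x.islower() and y == "O" else 0.0,
    ]


# C: the largest feature sum over all (x, y) pairs
C = max(sum(base_features(x, y)) for x, _ in DATA for y in CLASSES)


def features(x, y):
    # Append the correction feature f_{k+1} so every feature vector sums to C
    f = base_features(x, y)
    return f + [C - sum(f)]


def prob(weights, x):
    scores = [math.exp(sum(w * f for w, f in zip(weights, features(x, y)))) for y in CLASSES]
    z = sum(scores)
    return [s / z for s in scores]


def log_likelihood(weights):
    return sum(math.log(prob(weights, x)[CLASSES.index(y)]) for x, y in DATA)


def gis(iterations=50):
    n = len(DATA)
    k = len(features(*DATA[0]))
    # d_j: empirical expectation of each feature
    d = [sum(features(x, y)[j] for x, y in DATA) / n for j in range(k)]
    weights = [0.0] * k                      # initialize to 0
    for _ in range(iterations):              # fixed count instead of a convergence test
        expected = [0.0] * k                 # model expectation of each feature
        for x, _ in DATA:
            p = prob(weights, x)
            for c, y in enumerate(CLASSES):
                for j, f in enumerate(features(x, y)):
                    expected[j] += p[c] * f / n
        for j in range(k):
            # Simplification: skip features whose expectation is zero
            if d[j] > 0 and expected[j] > 0:
                weights[j] += math.log(d[j] / expected[j]) / C
    return weights


def predict(weights, x):
    p = prob(weights, x)
    return CLASSES[p.index(max(p))]


weights = gis()
print(weights, log_likelihood(weights))
```

Each GIS iteration should not decrease the training log-likelihood, which gives a simple sanity check on an implementation.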

Features

Summary so far
• Naïve Bayes Classifier
• Logistic Regression Classifier
  • Also called Maximum Entropy classifier

How do we apply classification to sequences?

Sequence Labeling as Classification
• Classify each token independently, but use information about the surrounding tokens as input features (sliding window).

Classifier output, one token at a time:

  John  saw  the  saw  and  decided  to  take  it   to  the  table
  NNP   VBD  DT   NN   CC   VBD      TO  VB    PRP  IN  DT   NN

Slide from Ray Mooney
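
One way to build sliding-window input features for the token at position i is sketched below. The feature names (`w[-1]`, `w[+1]`, `is_capitalized`), the window size, and the padding symbols are illustrative choices, not taken from the slides.

```python
def window_features(tokens, i, size=2):
    """Features for tokens[i] drawn from a window of +/- size tokens."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        if j < 0:
            word = "<s>"                 # padding before the sentence
        elif j >= len(tokens):
            word = "</s>"                # padding after the sentence
        else:
            word = tokens[j].lower()
        feats["w[%+d]" % offset] = word  # e.g. "w[-1]", "w[+0]", "w[+1]"
    feats["is_capitalized"] = tokens[i][0].isupper()
    return feats


tokens = "John saw the saw and decided to take it to the table .".split()
print(window_features(tokens, 3))  # features for the second "saw"
```

The two occurrences of "saw" get different feature dictionaries because their neighbors differ, which is what lets an independent per-token classifier tell the verb from the noun.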

Using Outputs as Inputs
• Better input features are usually the categories of the surrounding tokens, but these are not available yet
• Can use category of either the preceding or succeeding tokens by going forward or back and using previous output
Slide from Ray Mooney

Forward Classification

John saw the saw and decided to take it to the table.

Tags are assigned left to right; each decision can use the tags already assigned to its left. Classifier outputs shown on the slides:

  John  saw  the  saw  and  decided  to  take
  NNP   VBD  DT   NN   CC   VBD      TO  VB

Slide from Ray Mooney
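
The left-to-right procedure can be sketched as a loop that feeds each predicted tag into the next decision. The `toy_classify` function below is a hand-written stand-in for a trained classifier, with a made-up lexicon and one rule for "saw"; only the control flow is the point.

```python
def tag_forward(tokens, classify):
    """Greedy left-to-right tagging: the previous output is an input feature."""
    tags = []
    for word in tokens:
        prev_tag = tags[-1] if tags else "<s>"   # "<s>" marks the sentence start
        tags.append(classify(word, prev_tag))
    return tags


# Hypothetical lexicon for unambiguous words
LEXICON = {"John": "NNP", "the": "DT", "and": "CC", "decided": "VBD"}


def toy_classify(word, prev_tag):
    # Stand-in for a trained classifier: "saw" after a determiner is a noun,
    # otherwise a past-tense verb.
    if word == "saw":
        return "NN" if prev_tag == "DT" else "VBD"
    return LEXICON.get(word, "NN")


tokens = ["John", "saw", "the", "saw", "and", "decided"]
print(tag_forward(tokens, toy_classify))
```

Backward classification is the same loop run over the reversed token list, with the tag to the right used in place of the tag to the left.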

Backward Classification
• Disambiguating “to” in this case would be even easier backward.

John saw the saw and decided to take it to the table.

Tags are assigned right to left, starting from “the table” (DT NN); each decision can use the tags already assigned to its right. Classifier outputs in the order produced: IN, PRP, VB, TO, VBD, CC, VBD, DT, VBD, NNP. Tags as shown on the final slide:

  John  saw  the  saw  and  decided  to  take  it   to  the  table
  NNP   VBD  DT   VBD  CC   VBD      TO  VB    PRP  IN  DT   NN

Slide from Ray Mooney

NER as Sequence Labeling

Why classifiers are not as good as sequence models

Problems with using Classifiers for Sequence Labeling
• It is not easy to integrate information from hidden labels on both sides
• We make a hard decision on each token
  • We should rather choose a global optimum
  • The best labeling for the whole sequence
  • Keeping each local decision as just a probability, not a hard decision

Probabilistic Sequence Models
• Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determining the most likely global assignment
• Common approaches
  • Hidden Markov Model (HMM)
  • Conditional Random Field (CRF)
  • Maximum Entropy Markov Model (MEMM), a simplified version of CRF
  • Convolutional Neural Networks (CNN)

HMMs vs. MEMMs
Slide from Jim Martin

HMM vs MEMM

Viterbi in MEMMs
• We condition on the observation AND the previous state:
• HMM decoding:
• Which is the HMM version of:
• MEMM decoding:
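
The decoding equations on this slide were images. In the standard textbook notation, which is assumed here ($v_t(j)$ is the Viterbi score of state $s_j$ at time $t$, $o_t$ the observation at time $t$), the two models differ only in the local factor:

```latex
% Best tag sequence, HMM (generative): likelihood times transition prior
\hat{T} = \operatorname*{argmax}_{T} P(T \mid W)
        = \operatorname*{argmax}_{T} \prod_{i} P(\mathrm{word}_i \mid \mathrm{tag}_i)\, P(\mathrm{tag}_i \mid \mathrm{tag}_{i-1})

% Best tag sequence, MEMM (discriminative): one conditional per position
\hat{T} = \operatorname*{argmax}_{T} P(T \mid W)
        = \operatorname*{argmax}_{T} \prod_{i} P(\mathrm{tag}_i \mid \mathrm{word}_i, \mathrm{tag}_{i-1})

% Viterbi recurrence, HMM
v_t(j) = \max_{i} \; v_{t-1}(i)\, P(s_j \mid s_i)\, P(o_t \mid s_j)

% Viterbi recurrence, MEMM
v_t(j) = \max_{i} \; v_{t-1}(i)\, P(s_j \mid s_i, o_t)
```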

Decoding in MEMMs
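
A compact sketch of Viterbi decoding for an MEMM, assuming a local model P(tag | previous tag, word). The function `local_prob` below is a hand-written toy distribution standing in for a trained MaxEnt classifier, and the simplified tag set B/I/O is chosen only to keep the example small.

```python
TAGS = ["B", "I", "O"]


def local_prob(prev_tag, word):
    """Toy stand-in for a trained MaxEnt model P(tag | prev_tag, word)."""
    if word[0].isupper():
        if prev_tag in ("B", "I"):
            return {"B": 0.1, "I": 0.8, "O": 0.1}   # continue the entity
        return {"B": 0.8, "I": 0.1, "O": 0.1}       # start a new entity
    return {"B": 0.05, "I": 0.05, "O": 0.9}


def viterbi_memm(words):
    # v[t][tag]: probability of the best tag sequence ending in `tag` at position t
    v = [{tag: local_prob("<s>", words[0])[tag] for tag in TAGS}]
    back = [{}]
    for t in range(1, len(words)):
        v.append({})
        back.append({})
        for tag in TAGS:
            # Maximize over the previous state, conditioning on the observation
            best_prev = max(TAGS, key=lambda p: v[t - 1][p] * local_prob(p, words[t])[tag])
            v[t][tag] = v[t - 1][best_prev] * local_prob(best_prev, words[t])[tag]
            back[t][tag] = best_prev
    # Follow back-pointers from the best final state
    last = max(TAGS, key=lambda tag: v[-1][tag])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


print(viterbi_memm(["spokesman", "Tim", "Wagner", "said"]))  # ['O', 'B', 'I', 'O']
```

Unlike greedy forward classification, the decision for each token is deferred until the whole sentence has been scored, so the result is the best labeling for the entire sequence under the local model.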

Evaluation Metrics

• Precision: how many of the names we returned are really names?
• Recall: how many of the names in the database did we find?
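
Both measures can be computed by comparing the set of predicted entities with the set of gold-standard entities. In this sketch an entity is a `(start, end, label)` triple and counts as correct only on an exact match; that representation and the sample data are assumptions made for illustration.

```python
def precision_recall(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)                      # exact matches only
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations and system output
gold = {(0, 2, "ORG"), (7, 9, "PER"), (12, 13, "LOC")}
predicted = {(0, 2, "ORG"), (7, 8, "PER")}               # second span has a wrong boundary

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Here one of the two returned entities is right (precision 1/2) and one of the three gold entities was found (recall 1/3).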

F-measure
• F-measure is a way to combine these:
• More generally:

F-measure
• Harmonic mean is the reciprocal of the arithmetic mean of reciprocals:
• Hence F-measure is:
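
The formulas on the F-measure slides were images. The standard definitions, assumed here, with $P$ for precision, $R$ for recall, and $\beta$ weighting recall relative to precision:

```latex
% Balanced F-measure
F_1 = \frac{2PR}{P + R}

% General form
F_\beta = \frac{(\beta^{2} + 1)\,PR}{\beta^{2}P + R}

% Harmonic mean of a_1, ..., a_n
H = \frac{n}{\dfrac{1}{a_1} + \dfrac{1}{a_2} + \cdots + \dfrac{1}{a_n}}

% F as a weighted harmonic mean of P and R
F = \frac{1}{\alpha \dfrac{1}{P} + (1 - \alpha)\dfrac{1}{R}},
\qquad \beta^{2} = \frac{1 - \alpha}{\alpha}

% With equal weights (alpha = 1/2, beta = 1)
F_1 = \frac{1}{\tfrac{1}{2}\left(\dfrac{1}{P} + \dfrac{1}{R}\right)} = \frac{2PR}{P + R}
```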

Outline
• Named Entities and the basic idea
• IOB Tagging
• A new classifier: Logistic Regression
  • Linear regression
  • Logistic regression
  • Multinomial logistic regression = MaxEnt
• Why classifiers are not as good as sequence models
• A new sequence model:
  • MEMM = Maximum Entropy Markov Model