Integer Linear Programming Formulations in Natural Language Processing



Integer Linear Programming Formulations in Natural Language Processing
Dan Roth and Vivek Srikumar
University of Illinois, University of Utah
http://ilpinference.github.io

Nice to Meet You
ILP Formulations in NLP
- Part 1: Introduction [30 min]
- Part 2: Applications of ILP Formulations in NLP [15 min]
- Part 3: Modeling: Inference methods and Constraints [45 min]
- BREAK
- Part 4: Training Paradigms [30 min]
- Part 5: Constraints Driven Learning [30 min]
- Part 6: Developing ILP based applications [15 min]
- Part 7: Final words [15 min]

PART 1: INTRODUCTION
ILP Formulations in NLP
- Part 1: Introduction [30 min]
  - Motivation
  - Examples:
    - NE + Relations
    - Vision
    - Additional NLP Examples
  - Problem Formulation
    - Constrained Conditional Models: Integer Linear Programming Formulations
  - Initial thoughts about learning
    - Learning independent models
    - Constraints Driven Learning
  - Initial thoughts about Inference
Joint Inference with General Constraint Structure [Roth & Yih '04, '07, ...]: Recognizing Entities and Relations
- Joint inference gives good improvement.
- Example: "Jane, Bernie's wife, is a native of Brooklyn" (entities E1, E2, E3; relations R12, R23).
- An objective function that models knowledge: a Constrained Conditional Model.
- Key Questions:
  - How to learn the model(s)?
  - What is the source of the knowledge?
  - How to guide the global inference?
- [Figure: local classifier score tables for each entity (other / per / loc) and each relation (irrelevant / spouse_of / born_in).]
- Models could be learned separately/jointly; constraints may come up only at decision time (a small ILP sketch of the joint inference follows below).
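The joint inference above can be posed as a small integer program. The following is a minimal sketch of that idea, not the authors' code: it uses the open-source PuLP modeler, made-up classifier scores, and a single illustrative knowledge constraint (a relation labeled spouse_of forces both of its arguments to be labeled per).

```python
# Minimal sketch of joint entity/relation inference as an ILP.
# Scores are hypothetical; requires PuLP (pip install pulp).
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

entities = ["E1", "E2", "E3"]
ent_labels = ["other", "per", "loc"]
relations = [("E1", "E2"), ("E2", "E3")]
rel_labels = ["irrelevant", "spouse_of", "born_in"]

# Hypothetical local classifier scores (same table reused per item, for brevity).
ent_score = {(e, l): s for e in entities for l, s in zip(ent_labels, [0.05, 0.85, 0.10])}
rel_score = {(r, l): s for r in relations for l, s in zip(rel_labels, [0.05, 0.45, 0.50])}

prob = LpProblem("entities_and_relations", LpMaximize)
x = {(e, l): LpVariable(f"x_{e}_{l}", cat=LpBinary) for e in entities for l in ent_labels}
y = {(r, l): LpVariable(f"y_{r[0]}_{r[1]}_{l}", cat=LpBinary) for r in relations for l in rel_labels}

# Objective: sum of the local scores of the chosen labels.
prob += lpSum(ent_score[k] * x[k] for k in x) + lpSum(rel_score[k] * y[k] for k in y)

# Each entity / relation gets exactly one label.
for e in entities:
    prob += lpSum(x[e, l] for l in ent_labels) == 1
for r in relations:
    prob += lpSum(y[r, l] for l in rel_labels) == 1

# Knowledge constraint: spouse_of(a, b) implies per(a) and per(b).
for (a, b) in relations:
    prob += y[(a, b), "spouse_of"] <= x[a, "per"]
    prob += y[(a, b), "spouse_of"] <= x[b, "per"]

prob.solve()
```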

Motivation
- The typical pipeline: Raw Data -> POS Tagging -> Phrases -> Semantic Entities -> Relations (similarly: Parsing, WSD, Semantic Role Labeling).
- Most problems are not single classification problems. Either way, we need a way to learn models and make predictions (inference; decoding) that assign values to multiple interdependent variables.
- Conceptually, pipelining is a crude approximation:
  - Interactions occur across levels, and downstream decisions often interact with previous decisions.
  - Leads to propagation of errors.
  - Occasionally, later stages could be used to correct earlier errors.
- But there are good reasons to use pipelines:
  - Reusability of components; not putting everything in one basket.
  - How about choosing some stages and thinking about them jointly?

Example 2: Object detection
- How would you design a predictor that labels all the parts using the tools we have seen so far?
- Right-facing bicycle with labeled parts: saddle/seat, handle bar, left wheel, right wheel.
- Photo by Andrew Dressel, own work. Licensed under Creative Commons Attribution-Share Alike 3.0.

One approach to build this structure
- Train local binary classifiers, e.g., a left wheel detector: Is there a wheel in this box?
  1. (Left) wheel detector
  2. (Right) wheel detector
  3. Handle bar detector
  4. Seat detector
- Final output: combine the predictions of these individual (local) classifiers.
- The predictions interact with each other, e.g., the same box cannot be both a left wheel and a right wheel; the handle bar does not overlap with the seat, etc.
- Need inference to compose the output (a sketch follows below).
- Photo by Andrew Dressel, own work. Licensed under Creative Commons Attribution-Share Alike 3.0.
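As a rough illustration of "inference to compose the output", here is a sketch under simplified assumptions: hypothetical candidate boxes and detector scores, with assignment-style constraints standing in for the slide's mutual-exclusion and non-overlap conditions.

```python
# Sketch: compose local part detectors into a consistent labeling of candidate boxes.
# Boxes and scores are made up; requires PuLP (pip install pulp).
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

boxes = ["box1", "box2", "box3", "box4"]
parts = ["left_wheel", "right_wheel", "handle_bar", "seat"]

# Hypothetical detector scores: score[box][part] = confidence of that part in that box.
score = {
    "box1": {"left_wheel": 0.9, "right_wheel": 0.4, "handle_bar": 0.1, "seat": 0.1},
    "box2": {"left_wheel": 0.3, "right_wheel": 0.8, "handle_bar": 0.2, "seat": 0.1},
    "box3": {"left_wheel": 0.1, "right_wheel": 0.1, "handle_bar": 0.7, "seat": 0.3},
    "box4": {"left_wheel": 0.1, "right_wheel": 0.2, "handle_bar": 0.2, "seat": 0.6},
}

prob = LpProblem("bicycle_parts", LpMaximize)
z = {(b, p): LpVariable(f"z_{b}_{p}", cat=LpBinary) for b in boxes for p in parts}

# Maximize the total score of the selected (box, part) assignments.
prob += lpSum(score[b][p] * z[b, p] for b in boxes for p in parts)

# Each part is assigned to exactly one box.
for p in parts:
    prob += lpSum(z[b, p] for b in boxes) == 1

# A box can hold at most one part (so it cannot be both a left and a right wheel).
for b in boxes:
    prob += lpSum(z[b, p] for p in parts) <= 1

prob.solve()
print([(b, p) for (b, p) in z if z[b, p].value() == 1])
```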

Tasks of Interest: Structured Output
- For each instance, assign values to a set of variables
  - Account for the fact that output variables depend on each other
- Common NLP tasks: Parsing; Semantic Parsing; Summarization; Co-reference; ...
- Common Information Extraction tasks: Entities, Relations, ...
- Common Vision tasks: Parsing objects; scene segmentation and interpretation; ...
- Many "pure" machine learning approaches exist:
  - Hidden Markov Models (HMMs); CRFs [...there are special cases...]
  - Structured Perceptrons and SVMs... [...to be discussed later]
- However, ...

Information Extraction without Output Expectations
- Input citation: Lars Ole Andersen. Program analysis and specialization for the C Programming language. Ph.D. thesis. DIKU, University of Copenhagen, May 1994.
- Prediction result of an HMM trained on a small number of examples: the citation is segmented into the fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE].
- Violates lots of natural constraints!

Strategies for Improving the Results
- (Standard) machine learning approaches: increasing the model complexity
  - Higher-order HMM/CRF? NN?
  - Increasing the window size?
  - Adding a lot of new features
- But this increases the difficulty of learning and requires a lot of labeled examples. What if we only have a few labeled examples? Can we keep the learned model simple and still make expressive decisions?
- Instead:
  - Constrain the output to make sense – satisfy our output expectations.
  - Push the (simple) model in a direction that makes sense – minimally violates our expectations.

Expectations from the output (Constraints)
- Each field must be a consecutive list of words and can appear at most once in a citation.
- State transitions must occur on punctuation marks.
- The citation can only start with AUTHOR or EDITOR.
- The words pp., pages correspond to PAGE.
- Four digits starting with 20xx and 19xx are DATE.
- Quotations can appear only in TITLE.
- ...
Easy to express pieces of "knowledge"; non-propositional; may use quantifiers.

Information Extraction with Expectation Constraints
- Adding constraints, we get correct results!
  - Without changing the model.
- [AUTHOR] [TITLE] [TECH-REPORT] [INSTITUTION] [DATE]: Lars Ole Andersen. Program analysis and specialization for the C Programming language. Ph.D. thesis. DIKU, University of Copenhagen, May 1994.
- We introduce the Constrained Conditional Models formulation, which allows:
  - Learning a simple model
  - Making decisions with a more complex model
  - Some of the structure imposed externally/declaratively
  - Accomplished by directly incorporating constraints to bias/re-rank decisions made by the simpler model

Constrained Conditional Models
- Objective: y = argmax_{y ∈ Y} wᵀφ(x, y) + uᵀC(x, y); equivalently, y = argmax_y wᵀφ(x, y) subject to constraints C(x, y) (restated below).
  - wᵀφ(x, y): w is the weight vector for the "local" models; the models are features, classifiers (Lin; NN), log-linear models (HMM, CRF), or a combination.
  - C(x, y): how far y is from a "legal/expected" assignment, e.g., in an entities-and-relations model.
  - u: penalty for violating the constraint. Decouple? Decompose? Force u to model hard constraints?
- Any MAP problem w.r.t. any probabilistic model can be formulated as an ILP [Roth+ 04, Taskar 04].
- Inference: a way to push the learned model to satisfy our output expectations (or expectations from a latent representation).
- Training: learning the objective function (w, u). How? [CoDL, Chang, Ratinov & Roth (07, 12); Posterior Regularization, Ganchev et al. (10); Unified EM, Samdani & Roth (12); dozens of applications in NLP]
- Knowledge component: (soft) constraints.
- The benefits of thinking about it as an ILP are conceptual and computational.
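Restating the slide's objective in standard notation (my own rendering; it assumes C(x, y) counts constraint violations, so the penalty weights in u are non-positive for soft constraints, with the hard-constraint case as the limit below):

```latex
% CCM objective from the slide: local-model score plus constraint penalties
% (assumption: C(x,y) counts violations, so entries of u are <= 0 for soft constraints).
\[
   y^{*} \;=\; \arg\max_{y \in \mathcal{Y}} \; w^{\top}\phi(x, y) \;+\; u^{\top} C(x, y)
\]
% Hard-constraint special case (violations forbidden rather than penalized):
\[
   y^{*} \;=\; \arg\max_{y \in \mathcal{Y}} \; w^{\top}\phi(x, y)
   \quad \text{subject to} \quad C_{k}(x, y) = 0 \;\; \text{for all } k .
\]
```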

Examples: CCM Formulations
- y = argmax_{y ∈ Y} wᵀφ(x, y) + uᵀC(x, y)
  - The 2nd and 3rd parts of the tutorial are on how to model and do inference; the 4th and 5th parts are on how to learn.
  - While φ(x, y) and C(x, y) could be the same, we want C(x, y) to express high-level declarative knowledge over the statistical models.
- Constrained Conditional Models allow:
  - Decoupling the complexity of the learned model from that of the desired output.
  - Learning a simple model (multiple models; pipelines); reasoning with a complex one.
  - Accomplished by incorporating constraints (knowledge/linguistics) to bias/re-rank global decisions to satisfy (minimally violate) expectations.
- Formulate NLP problems as ILP problems (inference may be done otherwise):
  1. Sequence tagging (HMM/CRF + global constraints): argmax Σ λ_ij x_ij, e.g., cannot have both A states and B states in an output sequence.
  2. Sentence compression/summarization (language model + global constraints): argmax Σ λ_ijk x_ijk, e.g., if a modifier is chosen, include its head; if a verb is chosen, include its arguments.
  3. SRL (independent classifiers + global constraints).

Semantic Role Labeling (SRL)
- Archetypical information extraction problem: e.g., concept identification and typing, event identification, etc.
- I left my pearls to my daughter in my will.
  [I]_A0 left [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC
  - A0: Leaver
  - A1: Things left
  - A2: Benefactor
  - AM-LOC: Location

Algorithmic Approach
- Identify argument candidates
  - Pruning [Xue & Palmer, EMNLP'04]
  - Argument identifier: binary classification (e.g., over spans of "I left my nice pearls to her")
- Classify argument candidates
  - Argument classifier: multi-class classification
- Inference
  - Use the estimated probability distribution given by the argument classifier.
  - Use structural and linguistic constraints: no duplicate argument classes, unique labels, no overlapping or embedding, relations between verbs and arguments, ...
  - Infer the optimal global output: a variable y_{a,t} indicates whether candidate argument a is assigned label t, and c_{a,t} is the corresponding model score; solve argmax Σ_{a,t} c_{a,t} y_{a,t} subject to one label per argument (Σ_t y_{a,t} = 1), one inference problem for each verb predicate.
- Learning Based Java allows a developer to encode constraints in First Order Logic; these are compiled into linear inequalities automatically.
- Use the pipeline architecture's simplicity while maintaining uncertainty: keep probability distributions over decisions and use global inference at decision time.

Semantic Role Labeling (SRL)
- I left my pearls to my daughter in my will.
- [Figure, shown across three animation slides: tables of argument classifier scores (e.g., 0.5, 0.15, 0.6, ...) for each candidate argument and label.]

Constraints
- No duplicate argument classes.
- Reference-Ax: if there is a Reference-Ax phrase, there is an Ax.
- Continuation-Ax: if there is a Continuation-Ax phrase, there is an Ax before it.
- Many other possible constraints:
  - Unique labels
  - No overlapping or embedding
  - Relations between number of arguments; order constraints
  - If verb is of type A, no argument of type B
- These are universally quantified rules, and any Boolean rule can be encoded as a set of linear inequalities (worked example below).
- Learning Based Java allows a developer to encode constraints in First Order Logic; these are compiled into linear inequalities automatically.
- The tutorial web page will point to material on how to write down linear inequalities for various logical expressions [details later].
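As a worked illustration (my own example, in the y_{a,t} indicator notation used above), the Reference-Ax rule and the no-duplicate rule become linear inequalities:

```latex
% Reference-AX rule: if a candidate is labeled R-AX, some candidate must be labeled AX.
% With y_{a,t} = 1 iff candidate a gets label t:
\[
   \forall a: \quad y_{a,\text{R-AX}} \;\le\; \sum_{a'} y_{a',\text{AX}}
\]
% "No duplicate argument classes": each core label AX is used at most once.
\[
   \sum_{a} y_{a,\text{AX}} \;\le\; 1
\]
```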

SRL: Posing the Problem
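The slide's detailed formulation is not reproduced in this text, but the problem can be sketched as a small ILP over the indicator variables y_{a,t}. The following is a minimal illustrative sketch, not the authors' implementation, with hypothetical candidate arguments and scores; it enforces the unique-label and no-duplicate-core-argument constraints described above.

```python
# Sketch of SRL inference as an ILP: pick one label per candidate argument,
# maximizing classifier scores, with a no-duplicate constraint on core labels.
# Candidates and scores are hypothetical; requires PuLP (pip install pulp).
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

candidates = ["I", "my pearls", "to my daughter", "in my will"]
labels = ["A0", "A1", "A2", "AM-LOC", "null"]

# Hypothetical argument-classifier scores c[a][t].
c = {
    "I":              {"A0": 0.9, "A1": 0.1, "A2": 0.1, "AM-LOC": 0.0, "null": 0.1},
    "my pearls":      {"A0": 0.1, "A1": 0.8, "A2": 0.2, "AM-LOC": 0.1, "null": 0.1},
    "to my daughter": {"A0": 0.1, "A1": 0.3, "A2": 0.7, "AM-LOC": 0.2, "null": 0.1},
    "in my will":     {"A0": 0.0, "A1": 0.1, "A2": 0.2, "AM-LOC": 0.7, "null": 0.2},
}

prob = LpProblem("srl_inference", LpMaximize)
y = {(a, t): LpVariable(f"y_{i}_{j}", cat=LpBinary)
     for i, a in enumerate(candidates) for j, t in enumerate(labels)}

# Objective: sum of scores of the chosen labels.
prob += lpSum(c[a][t] * y[a, t] for a in candidates for t in labels)

# One label per candidate argument.
for a in candidates:
    prob += lpSum(y[a, t] for t in labels) == 1

# No duplicate core argument classes (A0, A1, A2 each used at most once).
for t in ["A0", "A1", "A2"]:
    prob += lpSum(y[a, t] for a in candidates) <= 1

prob.solve()
print([(a, t) for a in candidates for t in labels if y[a, t].value() == 1])
```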

Example 2: Sequence Tagging
- HMM: here, the y's are labels and the x's are observations. The ILP's objective function must include all entries of the conditional probability table.
- Every edge in the tag trellis is a Boolean variable that selects a transition CPT entry. They are related: if we choose y_0 = D, then we must choose an edge y_0 = D ∧ y_1 = ?.
- Example: "the man saw the dog", with candidate tags D, N, A, V for each word.
- Every assignment to the y's is a path through the trellis.

Example 2: Sequence Tagging, as an ILP (built up over several slides)
- Inference variables: the trellis edge indicators; learned parameters: the CPT entries that score them.
- Constraints, added step by step:
  - Unique label for each word.
  - Edges that are chosen must form a path.
  - There must be a verb!
- Without additional constraints, the ILP formulation of an HMM is totally unimodular (a sketch of this ILP follows below).
- [Roth & Yih, ICML'05] discuss training paradigms for HMMs and CRFs when augmented with additional knowledge.
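A minimal sketch of the trellis-as-ILP idea under simplified assumptions (made-up emission/transition scores, first-order model); the "must contain a verb" constraint at the end is the knowledge constraint from the slide.

```python
# Sketch: first-order sequence tagging (Viterbi) posed as an ILP, with one extra
# knowledge constraint ("the sentence must contain a verb").
# Scores are made-up log-probabilities; requires PuLP (pip install pulp).
import math
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

words = ["the", "man", "saw", "the", "dog"]
tags = ["D", "N", "A", "V"]
n = len(words)

# Hypothetical emission and transition log-scores.
emit = {w: {t: -1.0 for t in tags} for w in set(words)}
emit["the"]["D"] = emit["man"]["N"] = emit["dog"]["N"] = -0.1
emit["saw"]["V"] = -0.3
trans = {s: {t: math.log(0.25) for t in tags} for s in tags}
trans["D"]["N"] = trans["N"]["V"] = trans["V"]["D"] = math.log(0.6)

prob = LpProblem("sequence_tagging", LpMaximize)
y = {(i, t): LpVariable(f"y_{i}_{t}", cat=LpBinary) for i in range(n) for t in tags}
e = {(i, s, t): LpVariable(f"e_{i}_{s}_{t}", cat=LpBinary)
     for i in range(n - 1) for s in tags for t in tags}

# Objective: emission scores of chosen labels + transition scores of chosen edges.
prob += (lpSum(emit[words[i]][t] * y[i, t] for i in range(n) for t in tags)
         + lpSum(trans[s][t] * e[i, s, t] for (i, s, t) in e))

# Unique label for each word.
for i in range(n):
    prob += lpSum(y[i, t] for t in tags) == 1

# Chosen edges must form a path consistent with the chosen labels.
for i in range(n - 1):
    for s in tags:
        prob += lpSum(e[i, s, t] for t in tags) == y[i, s]
    for t in tags:
        prob += lpSum(e[i, s, t] for s in tags) == y[i + 1, t]

# Knowledge constraint from the slide: there must be a verb.
prob += lpSum(y[i, "V"] for i in range(n)) >= 1

prob.solve()
print([t for i in range(n) for t in tags if y[i, t].value() == 1])
```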

Constraints
- We have seen three different constraints in this example:
  - Unique label for each word
  - Chosen edges must form a path
  - There must be a verb
- All three can be expressed as linear inequalities.
- In terms of modeling, there is a difference:
  - The first two define the output structure (in this case, a sequence); this is what a conventional model captures.
  - The third one adds knowledge to the problem.
- In CCMs, knowledge is an integral part of the modeling.

Constrained Conditional Models – ILP Formulations
- Have been shown useful in the context of many NLP problems [Roth & Yih 04, 07: entities and relations; Punyakanok et al.: SRL; ...]
  - Summarization; co-reference; information and relation extraction; event identification and causality; transliteration; textual entailment; knowledge acquisition; sentiments; temporal reasoning; parsing; ...
- Some theoretical work on training paradigms [Punyakanok et al. 05 and more; Constraints Driven Learning, PR, Constrained EM, ...]
- Some work on inference, mostly approximations, bringing back ideas on Lagrangian relaxation, etc.
- Good summary and description of training paradigms: [Chang, Ratinov & Roth, Machine Learning Journal 2012]
- Summary of work & a bibliography: http://L2R.cs.uiuc.edu/tutorials.html

The Rest of the Tutorial
- y = argmax_{y ∈ Y} wᵀφ(x, y) + uᵀC(x, y); equivalently, y = argmax_y wᵀφ(x, y) subject to constraints C(x, y).
- The next two parts (the 2nd and 3rd) will provide more details on how to model structured decision problems in NLP as ILP problems, and on inference.
- After the break we will discuss learning (the 4th and 5th parts):
  - First, learning paradigms within this framework: inference is necessary, but can be incorporated in multiple ways into the learning process.
  - Then, move to Constraints Driven Learning:
    - How constraints can take the place of many examples
    - Semi-supervised scenarios
    - Learning with latent representations; response driven learning

END OF PART 1

First Summary
- Introduced structured prediction, with many examples.
- Introduced the key building blocks of structured learning and inference.
- Focused on Constrained Conditional Models (CCMs): the motivating scenario is the case in which
  - Joint INFERENCE is essential
  - Joint LEARNING should be done thoughtfully
    - Not everything can be learned together
    - We don't always want to learn everything together
- Moving on to:
  - Details on joint learning
  - Details on inference