Learning Ensembles of First-Order Clauses That Optimize Precision-Recall Curves

Learning Ensembles of First-Order Clauses That Optimize Precision-Recall Curves
Mark Goadrich
Computer Sciences Department, University of Wisconsin - Madison
Ph.D. Defense, August 13th, 2007

Biomedical Information Extraction: mapping free text into a structured database (*image courtesy of SEER Cancer Training Site)

Biomedical Information Extraction: http://www.geneontology.org

Biomedical Information Extraction
• NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.
• ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
• Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.

Outline
• Biomedical Information Extraction
• Inductive Logic Programming
• Gleaner
• Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting (time permitting)

Inductive Logic Programming
• Machine Learning
– Classify data into categories
– Divide data into train and test sets
– Generate hypotheses on the train set and then measure performance on the test set
• In ILP, data are Objects …
– person, block, molecule, word, phrase, …
• … and Relations between them
– grandfather, has_bond, is_member, …

Seeing Text as Relational Objects
• Word: verb(…), alphanumeric(…), internal_caps(…)
• Phrase: noun_phrase(…), phrase_child(…, …), phrase_parent(…, …)
• Sentence: long_sentence(…)

Protein Localization Clause
prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_alphanumeric(Protein, E),
    phrase_contains_some_internal_cap_word(Protein, E),
    phrase_next(Protein, _),
    different_phrases(Protein, Location),
    one_POS_in_phrase(Location, noun),
    phrase_contains_some_arg2_10x_word(Location, _),
    phrase_previous(Location, _),
    avg_length_sentence(Sentence).

ILP Background
• Seed Example – a positive example that our clause must cover, e.g. prot_loc(P, L, S)
• Bottom Clause – all predicates which are true about the seed example, e.g. prot_loc(P, L, S) :- alphanumeric(P), leading_cap(L), …

Clause Evaluation
• Prediction vs Actual: each prediction is Positive or Negative, and comparing it against the actual label makes it True or False, giving the four counts TP, FP, FN, TN
• Focus on positive examples:
    Recall = TP / (TP + FN)
    Precision = TP / (TP + FP)
    F1 Score = 2PR / (P + R)
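A minimal sketch of these three metrics in Python (the function name and the example counts are illustrative, not from the thesis):

```python
def clause_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Counts chosen to roughly match the clause on the next slide:
# precision 0.51 and recall 0.15 give an F1 score of about 0.23.
print(clause_metrics(tp=271, fp=260, fn=1539))
```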

Protein Localization Clause
prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_alphanumeric(Protein, E),
    phrase_contains_some_internal_cap_word(Protein, E),
    phrase_next(Protein, _),
    different_phrases(Protein, Location),
    one_POS_in_phrase(Location, noun),
    phrase_contains_some_arg2_10x_word(Location, _),
    phrase_previous(Location, _),
    avg_length_sentence(Sentence).
Recall 0.15, Precision 0.51, F1 Score 0.23

Aleph (Srinivasan ‘ 03) n Aleph learns theories of clauses – Pick positive seed

Aleph (Srinivasan ‘ 03) n Aleph learns theories of clauses – Pick positive seed example – Use heuristic search to find best clause – Pick new seed from uncovered positives and repeat until threshold of positives covered n Sequential learning is time-consuming n Can we reduce time with ensembles? n And also increase quality?

Outline
• Biomedical Information Extraction
• Inductive Logic Programming
• Gleaner
• Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting

Gleaner (Goadrich et al. '04, '06)
• Definition of Gleaner
– One who gathers grain left behind by reapers
• Key Ideas of Gleaner
– Use Aleph as the underlying ILP clause engine
– Search clause space with Rapid Random Restart
– Keep a wide range of clauses usually discarded
– Create separate theories for diverse recall

Gleaner - Learning (plotted in PR space, Precision vs. Recall)
• Create B bins
• Generate clauses
• Record the best per bin

Gleaner - Learning (repeated in PR space for Seed 1, Seed 2, Seed 3, …, Seed K)

Gleaner - Ensemble
Clauses from bin 5 are applied to all positive and negative examples; each example's score is the number of clauses that match it:
ex1: prot_loc(…) 12
ex2: prot_loc(…) 47
ex3: prot_loc(…) 55
…
ex598: prot_loc(…) 5
ex599: prot_loc(…) 14
ex600: prot_loc(…) 2
ex601: prot_loc(…) 18

Gleaner - Ensemble
Examples sorted by score, with precision and recall at each threshold (plotted as a PR curve from recall 0 to 1.0):
pos3:   prot_loc(…)  score 55  precision 1.00  recall 0.05
neg28:  prot_loc(…)  score 52  precision 0.50  recall 0.05
pos2:   prot_loc(…)  score 47  precision 0.66  recall 0.10
…
neg4:   prot_loc(…)  score 18  precision 0.12  recall 0.85
neg475: prot_loc(…)  score 17
pos9:   prot_loc(…)  score 17  precision 0.13  recall 0.90
neg15:  prot_loc(…)  score 16  precision 0.12  recall 0.90
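A hedged sketch of this scoring-and-thresholding step: each example's score is its number of matching clauses, and sweeping a threshold down the sorted list traces out the PR curve (the data layout is mine, not the thesis code):

```python
def pr_curve(scored_examples):
    """scored_examples: list of (score, is_positive) pairs, one per example.
    Returns (recall, precision) points as the threshold sweeps high to low."""
    total_pos = sum(1 for _, is_pos in scored_examples if is_pos)
    tp = fp = 0
    points = []
    # Ties in score are broken arbitrarily here; a careful implementation
    # would process all examples with equal scores as a single threshold.
    for score, is_pos in sorted(scored_examples, key=lambda e: -e[0]):
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

# Mirrors the top of the slide's table: pos (55), neg (52), pos (47), ...
print(pr_curve([(55, True), (52, False), (47, True)]))
```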

Gleaner - Overlap
• For each bin, take the topmost curve in PR space

How to Use Gleaner (Version 1)
• Generate the tuneset curve
• User selects a recall bin (e.g., Recall = 0.50, Precision = 0.70)
• Return testset classifications, ordered by their score

Gleaner Algorithm
• Divide PR space into B bins
• For K positive seed examples
– Perform RRR search with the precision × recall heuristic
– Save the best clause found in each bin b
• For each bin b
– Combine the clauses in b to form theory_b
– Find the L-of-K threshold for theory_b which performs best in bin b on the tuneset
• Evaluate the thresholded theories on the testset
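A high-level Python sketch of this loop, assuming a caller-supplied RRR search that yields (clause, precision, recall) triples; the names, the search interface, and the scoring are simplifications, not the thesis implementation:

```python
def gleaner_learning(seeds, num_bins, rrr_search):
    """Keep, for every recall bin, the best clause found from each seed,
    scored by the precision x recall heuristic from the slide."""
    bins = [[] for _ in range(num_bins)]          # bin b: up to K clauses
    for seed in seeds:                            # K positive seed examples
        best = [None] * num_bins                  # (score, clause) per bin
        for clause, p, r in rrr_search(seed):     # RRR search from this seed
            b = min(int(r * num_bins), num_bins - 1)
            if best[b] is None or p * r > best[b][0]:
                best[b] = (p * r, clause)
        for b, entry in enumerate(best):
            if entry is not None:
                bins[b].append(entry[1])
    return bins
```

The L-of-K step then treats each bin's clauses as one theory and picks, on the tuneset, the number L of matching clauses required before an example is called positive.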

Aleph Ensembles (Dutra et al. '02)
• Compare to ensembles of theories
• Ensemble Algorithm
– Use K different initial seeds
– Learn K theories containing C rules
– Rank examples by the number of theories that cover them

YPD Protein Localization
• Hand-labeled dataset (Ray & Craven '01)
– 7,245 sentences from 871 abstracts
– Examples are phrase-phrase combinations: 1,810 positive & 279,154 negative
• 1.6 GB of background knowledge
– Structural, statistical, lexical, and ontological
– In total, 200+ distinct background predicates
• Performed five-fold cross-validation

Evaluation Metrics
• Area Under the Precision-Recall Curve (AUC-PR)
– All curves standardized to cover the full recall range, 0 to 1.0
– AUC-PR averaged over the 5 folds
• Number of clauses considered
– A rough estimate of time
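For intuition, average precision is a common stand-in for AUC-PR; note that the thesis computes areas with the correct PR-space interpolation (Davis and Goadrich '06), which is not naive linear interpolation, so this scikit-learn sketch is only an approximation:

```python
from sklearn.metrics import average_precision_score

y_true  = [1, 0, 1, 0, 0, 1, 0]         # gold labels (illustrative)
y_score = [55, 52, 47, 18, 17, 17, 16]  # ensemble scores per example
print(average_precision_score(y_true, y_score))  # approximates AUC-PR
```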

PR Curves - 100,000 Clauses

Protein Localization Results

Other Relational Datasets
• Genetic Disorder (Ray & Craven '01)
– 233 positive & 103,959 negative
• Protein Interaction (Bunescu et al. '04)
– 799 positive & 76,678 negative
• Advisor (Richardson and Domingos '04)
– Students, professors, courses, papers, etc.
– 113 positive & 2,711 negative

Genetic Disorder Results

Protein Interaction Results

Advisor Results

Gleaner Summary
• Gleaner makes use of clauses that are not the highest-scoring ones, for improved speed and quality
• Issues with Gleaner
– Output is a PR curve, not a probability
– Redundant clauses across seeds
– L-of-K clause combination

Outline
• Biomedical Information Extraction
• Inductive Logic Programming
• Gleaner
• Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting

Estimating Probabilities - SRL
• Given highly skewed relational datasets
• Produce accurate probability estimates
• Gleaner only produces PR curves

GleanerSRL Algorithm (Goadrich '07)
• Divide PR space into B bins
• For K positive seed examples
– Perform RRR search with the precision × recall heuristic
– Save the best clause found in each bin b
• For each clause found, create propositional feature-vectors
• Learn scores with an SVM or other propositional learning algorithms
• Calibrate the scores into probabilities
• Evaluate the probabilities with Cross Entropy

GleanerSRL Algorithm

Learning with Gleaner (plotted in PR space, Precision vs. Recall)
• Generate clauses
• Create B bins
• Record the best per bin
• Repeat for K seeds

Creating Feature Vectors
The clauses from bin 5 (one per seed, K in total) become the features of each example:
• Boolean features: 1 if the clause matches the example, 0 if not (e.g., ex1: prot_loc(…) → [1, 1, …, 0])
• Binned features: the example's match count within the bin (e.g., ex1: prot_loc(…) → 12)
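A minimal sketch of both encodings, assuming a `matches(clause, example)` coverage test is available (that interface is hypothetical, standing in for the ILP engine):

```python
def boolean_features(example, clauses, matches):
    """One 0/1 feature per clause: does the clause cover the example?"""
    return [int(matches(c, example)) for c in clauses]

def binned_count_feature(example, clauses, matches):
    """One count feature per bin: how many of the bin's clauses match,
    i.e., the per-example score from the ensemble step."""
    return sum(int(matches(c, example)) for c in clauses)
```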

Learning Scores via SVM

Calibrating Probabilities
• Use Isotonic Regression (Zadrozny & Elkan '03) to transform SVM scores into probabilities
• Example: scores -2, -0.4, 0.2, 0.4, 0.5, 0.9, 1.3, 1.7 with classes 0, 0, 1, 1, … map onto the monotone probability steps 0.00, 0.50, 0.66, 1.00
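A sketch of the calibration step using scikit-learn's isotonic regression (the library is a modern stand-in chosen by me; the scores echo the slide, while the labels are partly assumed since the slide shows only some of them):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

scores = np.array([-2.0, -0.4, 0.2, 0.4, 0.5, 0.9, 1.3, 1.7])  # SVM outputs
labels = np.array([0, 0, 1, 1, 0, 1, 1, 1])                    # classes (partly assumed)

iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(scores, labels)  # monotone stepwise estimate of P(positive | score)
print(iso.predict(np.array([-1.0, 0.3, 1.5])))  # calibrated probabilities
```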

GleanerSRL Results for Advisor (Davis et al. '05; Davis et al. '07)

Outline
• Biomedical Information Extraction
• Inductive Logic Programming
• Gleaner
• Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting

Diversity of Gleaner Clauses

Negative Salt
• Seed Example – a positive example, e.g. prot_loc(P, L, S), that our clause must cover
• Salt Example – a negative example that our clause should avoid

Gleaner Algorithm with Negative Salt
• Divide PR space into B bins
• For K positive seed examples
– Select a negative salt example
– Perform RRR search with the salt-avoiding heuristic
– Save the best clause found in each bin b
• For each bin b
– Combine the clauses in b to form theory_b
– Find the L-of-K threshold for theory_b which performs best in bin b on the tuneset
• Evaluate the thresholded theories on the testset

Diversity of Negative Salt

Effect of Salt on Theory_m Choice

Negative Salt AUC-PR

Outline
• Biomedical Information Extraction
• Inductive Logic Programming
• Gleaner
• Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting

Gleaner Algorithm with F-Measure Search
• Divide PR space into B bins
• For K positive seed examples
– Perform RRR search with an F-Measure heuristic (in place of precision × recall)
– Save the best clause found in each bin b
• For each bin b
– Combine the clauses in b to form theory_b
– Find the L-of-K threshold for theory_b which performs best in bin b on the tuneset
• Evaluate the thresholded theories on the testset

RRR Search Heuristic
• A heuristic function directs the RRR search
• Can provide direction through the F_β measure:
    F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
• Low values of β encourage Precision; high values of β encourage Recall
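A small sketch of the F_β heuristic (this is the standard definition; only the function name is mine):

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta measure: beta < 1 pushes the search toward precision,
    beta > 1 toward recall; beta = 1 is the ordinary F1 score."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# The next three slides correspond to beta = 0.01, 1, and 100.
```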

F_0.01 Measure Search

F_1 Measure Search

F_100 Measure Search

F-Measure AUC-PR Results: Protein Localization and Genetic Disorder

Weighting Clauses
• Alter the L-of-K combination in Gleaner
• Within a single theory
– Cumulative weighting schemes are successful
– Precision is the highest-scoring scheme
• Within Gleaner
– Precision beats equal-weighted voting and Naïve Bayes
– Significant results on the genetic-disorder dataset

Weighting Clauses (schemes for scoring an example against the clauses from a bin)
• Cumulative schemes:
– ∑(precision of each matching clause)
– ∑(recall of each matching clause)
– ∑(F1 measure of each matching clause)
– Naïve Bayes and TAN learn a probability for the example
• Ranked List: max(precision of each matching clause)
• Weighted Vote: ave(precision of each matching clause)
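A sketch of these schemes, assuming each clause object stores its tuneset precision, recall, and F1, and that a `matches(clause, example)` test exists (the representation and interface are mine):

```python
def cumulative_score(example, clauses, matches, stat="precision"):
    """Cumulative schemes: sum a per-clause statistic ('precision',
    'recall', or 'f1') over the clauses that match the example."""
    return sum(getattr(c, stat) for c in clauses if matches(c, example))

def ranked_list_score(example, clauses, matches):
    """Ranked List: the max precision among the matching clauses."""
    return max((c.precision for c in clauses if matches(c, example)),
               default=0.0)

def weighted_vote_score(example, clauses, matches):
    """Weighted Vote: the average precision among the matching clauses."""
    ps = [c.precision for c in clauses if matches(c, example)]
    return sum(ps) / len(ps) if ps else 0.0
```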

Dominance Results
• Statistically significant dominance of weighting scheme i over scheme j
• Precision is never dominated
• Naïve Bayes is competitive with the cumulative schemes

Weighting Gleaner Results

Conclusions and Future Work
• Gleaner is a flexible and fast ensemble algorithm for highly skewed ILP datasets
• Other Work
– Proper interpolation of PR space (Goadrich et al. '04, '06)
– Relationship of PR and ROC curves (Davis and Goadrich '06)
• Future Work
– Explore Gleaner on propositional datasets
– Learn a heuristic function for diversity (Oliphant and Shavlik '07)

Acknowledgements
• USA DARPA Grant F30602-01-2-0571
• USA Air Force Grant F30602-01-2-0571
• USA NLM Grant 5T15LM007359-02
• USA NLM Grant 1R01LM07050-01
• UW Condor Group
• Jude Shavlik, Louis Oliphant, David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Patricia Brennan, AnHai Doan, Jesse Davis, Frank DiMaio, Ameet Soni, Irene Ong, Laura Goadrich, and all 6th-floor MSCers