Relational Data Mining with Inductive Logic Programming for Link Discovery
Ray Mooney, Prem Melville, Rupert Tang
University of Texas at Austin
Jude Shavlik, Inês de Castro Dutra, David Page, Vítor Santos Costa
University of Wisconsin at Madison

EELD Program
  • Evidence Extraction
  • Link Discovery
  • Pattern Learning

Link Discovery Task (from Jim Antonisse, GITI)
[Figure: link discovery task diagram. Labels: Evidence request(s), Link Discovery, Vetted hyp cases, Core: Pattern Matching, Queries, Ontologies, Pattern(s) of Interest, Domain Patterns, Alerts based on Hypothesized cases, Problem Context. Legend: pre-run-time processing]

Link Discovery
  • Data is multi-relational, with many people, places, objects and actions and numerous types of relations between them.
  • Link analysis in intelligence and criminology involves exploring and visualizing such data as a graph with many nodes and edges of various types.
  • Link discovery entails finding new links and recognizing threatening patterns in such highly relational data.

EELD Program
  • Evidence Extraction
  • Link Discovery
  • Pattern Learning

Pattern Learning for Link Discovery
  • Automated discovery of “patterns of interest” that indicate potentially threatening activities in large amounts of heterogeneous, multi-relational data.
  • Requires inducing multi-relational patterns that characterize multiple entities and multiple links between them.

Limitations of Traditional Data Mining
  • Traditional KDD methods assume the data to be mined is in a single relational table and that examples are flat tuples of attribute values.
  • This assumption stems from:
      – 1) Properties of typical data mining tasks like market basket analysis.
      – 2) Focus in machine learning and statistics on classification or regression using feature vectors as inputs.

Relational Data Mining
  • Data contains multiple relations.
  • Patterns to be discovered contain multiple relations.
  • Knowledge to be discovered may be the definition of another relation rather than a classification or regression function.

Relational Data Mining Example
[Figure: family tree over Bob, Alice, Mary, Fred, Tom, Jane, John, Sue, Jack and Carol, with Male, Female, Married and Parent marked; the variables Z, X, W and Y label Alice, Tom, Jane and Carol]
Uncle(tom, carol)
Uncle(X, Y) :- Parent(Z, X), Parent(Z, W), Parent(W, Y), Male(X), not(X=W).
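A minimal sketch of how the Uncle rule on this slide can be evaluated against a set of facts. The facts below are an assumption made for illustration (a small fragment consistent with the example Uncle(tom, carol)), and the function name `uncles` is hypothetical. Each body literal becomes one nested loop over the Parent relation, which is the join the rule describes.

```python
# Assumed facts for illustration only: parent(Parent, Child) and male(Person).
parent = {("alice", "tom"), ("alice", "jane"), ("jane", "carol")}
male = {"tom"}

def uncles(parent, male):
    """Return every (X, Y) that satisfies
    Uncle(X, Y) :- Parent(Z, X), Parent(Z, W), Parent(W, Y), Male(X), not(X=W)."""
    result = set()
    for (z, x) in parent:                 # Parent(Z, X)
        if x not in male:                 # Male(X)
            continue
        for (z2, w) in parent:            # Parent(Z, W), same Z
            if z2 != z or w == x:         # not(X=W)
                continue
            for (w2, y) in parent:        # Parent(W, Y), same W
                if w2 == w:
                    result.add((x, y))
    return result

uncle_pairs = uncles(parent, male)
print(uncle_pairs)
```

The point of the sketch is that the pattern spans several tuples of the Parent relation at once, which a single flat attribute-value table cannot express directly.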

Relational Data Mining Example (cont)
[Figure: the same family tree; the variables W, V, Z, Y and X label Alice, Tom, Jane, John and Jack]
Uncle(jack, john)
Uncle(X, Y) :- Married(X, Z), Parent(W, Z), Parent(W, V), Parent(V, Y), Male(X), not(Z=V).

Most KDD Ignores RDM
  • KDD textbooks barely mention RDM:
      – Han & Kamber, 2001
      – Hand, Mannila, & Smyth, 2001
      – Witten & Frank, 1999
  • But there is a recent edited collection on RDM:
      – S. Džeroski & N. Lavrač, eds. Relational Data Mining, Springer-Verlag, 2001.

Inductive Logic Programming (ILP)
  • The standard formal language for representing relational knowledge is first-order predicate logic.
  • ILP studies the induction of hypotheses in first-order predicate logic.
  • Logic programs (e.g. Prolog) and function-free logic programs (e.g. Datalog) are useful, reasonably tractable subsets of first-order predicate logic.
  • ILP is the best-studied approach to relational data mining.

ILP Problem Definition
Given
  • Positive Example Set: P
  • Negative Example Set: N
  • Background Knowledge: B
Find
  • Hypothesis, H, such that H together with B entails every positive example and no negative example: ∀p ∈ P: H ∪ B ⊨ p, and ∀n ∈ N: H ∪ B ⊭ n.
P, N, B and H are all sets of rules in first-order logic (i.e. Horn clauses, logic programs).
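A small sketch of the acceptance test behind this definition, under the usual reading that a hypothesis should cover all positive examples (completeness) and no negative ones (consistency). Everything here is an illustrative assumption: the toy background facts, the target relation grandparent, and the helper names. Each candidate hypothesis is modelled simply as a function that says whether H together with B derives an example.

```python
# Toy background knowledge B (assumed for illustration): parent(Parent, Child).
parent = {("alice", "tom"), ("alice", "jane"), ("jane", "carol")}
people = {person for pair in parent for person in pair}

def covers_grandparent(example):
    """H1: grandparent(X, Y) :- parent(X, Z), parent(Z, Y)."""
    x, y = example
    return any((x, z) in parent and (z, y) in parent for z in people)

def covers_any_parent(example):
    """H2 (too general): grandparent(X, Y) :- parent(X, Z)."""
    x, _ = example
    return any((x, z) in parent for z in people)

def check(covers, positives, negatives):
    """Return (complete, consistent) for a hypothesis given as a coverage test."""
    complete = all(covers(p) for p in positives)
    consistent = not any(covers(n) for n in negatives)
    return complete, consistent

positives = [("alice", "carol")]
negatives = [("tom", "carol"), ("jane", "carol")]
result_h1 = check(covers_grandparent, positives, negatives)
result_h2 = check(covers_any_parent, positives, negatives)
print(result_h1, result_h2)
```

H1 passes both checks, while the over-general H2 covers the negative example (jane, carol) and is rejected. Practical ILP systems typically relax these conditions to tolerate noisy data.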

ILP Algorithms
  • We have utilized two ILP systems for EELD problems in link discovery.
      – Aleph (Srinivasan, 2001): a variant of the popular Progol algorithm (Muggleton, 1995)
      – mFoil+ (Tang and Mooney, 2002): a variant of the popular Foil algorithm (Quinlan, 1990)

EELD Russian Nuclear Smuggling Data
  • Data manually extracted from news sources about events related to nuclear smuggling (developed by Veridian Inc.)
  • Size of data set:
      – 40 relational tables
      – 2 to 800 tuples per relation
  • Translated Access database to Prolog, mapping each relational table to a predicate.
  • Used Aleph to learn rules for the relation Linked(A, B), which determines whether or not two events are part of the same incident.
      – 143 positive examples
      – 517 negative examples
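A hedged sketch of the table-to-predicate translation mentioned on this slide: each row of a relational table becomes one Prolog fact whose predicate name is the table name. The table name `event_location`, its columns and its rows are invented for illustration, and the helper names are hypothetical; the actual conversion used for the EELD data is not described in the slides. The sketch assumes the table name is already a valid Prolog atom (lowercase initial, letters, digits and underscores).

```python
def to_prolog_term(value):
    """Render a Python value as a Prolog term: numbers as-is, anything else as a quoted atom."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    # Inside a quoted atom, double backslashes and single quotes.
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + text + "'"

def table_to_facts(table_name, rows):
    """Map every row of a table to a fact: table_name(col1, col2, ...)."""
    return [
        f"{table_name}({', '.join(to_prolog_term(v) for v in row)})."
        for row in rows
    ]

# Hypothetical rows standing in for one exported database table.
rows = [(1, "event_17", "Moscow"), (2, "event_42", "St. Petersburg")]
facts = table_to_facts("event_location", rows)
for fact in facts:
    print(fact)
```

Applied to every table, this yields one predicate per relation, which an ILP system can then use as background knowledge.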

Illustration of Linked Relation
[Figure: New Event, Partial Incident N, Partial Incident M]

Find Correct Incident for New Event
[Figure: Expanded Incident N, Partial Incident M]

Sample Rule
linked(EventA, EventB) :-
    lk_event_material(_, EventA, _, _, _, ConcealmentG, DescH),
    lk_event_person(_, EventB, PersonD, _, C, C, _),
    lk_person_material(_, PersonD, MatF, EvE, _, _, _),
    lk_event_material(_, EvE, MatF, I, _, ConcealmentG, DescH),
    l_relations(I, _, "Stolen").
If A is linked to a specific type of material <G, H>, and B is linked to a person linked to the same specific type of material, through an event in which that material was stolen, then events A and B are linked.

Linked(A, B)
[Figure: step-by-step graph illustration of the sample rule. Legend: Event, Material, Person. Nodes: events A, B and E, person D, and material of type GH, with a Stolen label on the material reached through E.]

Accuracy Results for Learning Linked for Nuclear Smuggling Data
  • Experimental Method: 5-fold cross validation.
  • Also tried bagging Aleph to produce an ensemble of 25 hypotheses.

  Method                        Accuracy
  Majority Class (not Linked)   78%
  Aleph                         83%
  Bagged Aleph                  86%
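A sketch of the two ideas behind these numbers, under stated assumptions. The majority-class baseline is recomputed from the example counts given on the data-set slide (143 positive, 517 negative). Bagging is shown generically: the `learn` argument is a hypothetical stand-in for one run of an ILP learner such as Aleph, which is not invoked here, and the toy learner and data are invented for illustration.

```python
import random
from collections import Counter

def majority_baseline(n_pos, n_neg):
    """Accuracy of always predicting the larger class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def train_bagged(learn, data, n_models=25, seed=0):
    """Train one model per bootstrap sample; `learn` maps a sample to a classifier."""
    rng = random.Random(seed)
    return [learn(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagged_predict(models, example):
    """Combine the ensemble by majority vote."""
    votes = Counter(model(example) for model in models)
    return votes.most_common(1)[0][0]

baseline = majority_baseline(143, 517)  # counts from the data-set slide
print(round(baseline, 2))

# Toy stand-in learner (assumption): always predicts the sample's most common label.
def learn_majority(sample):
    label = Counter(lbl for _, lbl in sample).most_common(1)[0][0]
    return lambda example: label

data = [("e1", False), ("e2", False), ("e3", False), ("e4", True)]
models = train_bagged(learn_majority, data)
vote = bagged_predict([lambda e: True, lambda e: False, lambda e: True], "e5")
```

With an odd ensemble size such as 25, a binary majority vote cannot tie.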

Synthetic Contract Killing Data
  • Data generated by a plan-based simulator that generates evidence emulating contract killings and other types of murders (developed by IET Inc.).
  • Simulator used to generate evidence from 200 murder events of three types:
      – Murder for Hire (71 exs)
      – First Degree (75 exs)
      – Second Degree (54 exs)
  • Used mFoil+ to classify events into one of these three categories.

Sample Rules
  • Murder For Hire(A) :- groupMemberMaleficiary(A, B), subEvents(A, C), crimeMotive(C, economic).
  • First Degree Murder(A) :- subEvents(A, B), performedBy(B, C), loves(C, D).
  • Second Degree Murder(A) :- subEvents(A, B), eventOccursAtLocationType(B, publicProperty), crimeMotive(B, rival), occurrentSubeventType(B, stealing_Generic).

Results on Synthetic Contract Killing Data

               MurderForHire   FirstDegree   SecondDegree
  PRECISION    85.52%          91.17%        95.83%
  RECALL       91.07%          88.48%        59.45%

  ACCURACY
  Majority Class   37.50%
  mFOIL+           76.67%
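For reference, a short sketch of how the metrics in this table are defined. The class sizes come from the data-set slide (71, 75 and 54 events), which reproduces the 37.50% majority-class figure; the true-positive, false-positive and false-negative counts in the usage line are made up for illustration, since the slides do not give a confusion matrix.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def majority_class_accuracy(class_counts):
    """Accuracy of always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

# Class sizes as stated on the data-set slide.
class_counts = {"MurderForHire": 71, "FirstDegree": 75, "SecondDegree": 54}
baseline = majority_class_accuracy(class_counts)
print(f"{baseline:.2%}")

# Hypothetical counts, only to show the call.
p, r = precision_recall(tp=8, fp=2, fn=2)
```

A class can have high precision and low recall at the same time, as SecondDegree does here: the rules that fire for it are usually right, but they miss many of its events.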

Recent Result from EELD Challenge Problem
murder_for_hire(A) :-
    eventOccursAt(A, B),
    perpetrator(A, C),
    agentPhoneNumber(C, D),
    callerNumber(E, D),
    accountHolder(F, C),
    to_Generic(G, F),
    from_Generic(G, H),
    to_Generic(I, H).
  • Says an event is a “murder for hire” if it has a recorded location and perpetrator, we have a recorded phone call to the perpetrator, and there was a chain of bank transfers resulting in money reaching the perpetrator’s account.
  • 100% accuracy on a held-out test set.
  • Similar pattern found manually by LD researchers working on this challenge problem.

Future Research
  • Scaling to larger datasets
      – Stochastic search
      – Logic program optimization
      – Integration with relational and deductive database technology
  • Integrating probabilistic reasoning
      – Logic programs with Bayes-net constraints
  • Active Learning
  • Theory Refinement

Related Research
  • Graph-based Relational Data Mining
      – Subdue (Cook & Holder, UT Arlington)
  • Probabilistic Relational Models
      – PRMs (Koller, Stanford)
  • Relational Feature Construction
      – PROXIMITY (Jensen, UMass)

Record Linkage
  • Identify and merge duplicate field values and duplicate records in a database.
  • Applications
      – Duplicates in mailing lists
      – Merging multiple databases of stores, restaurants, etc.
      – Matching bibliographic references in research papers (Cora/ResearchIndex)
      – Identifying individuals who are trying to hide their identity by providing slightly erroneous personal information.

Record Linkage Examples

Bibliographic records (fields: Author, Title, Venue, Address, Year)
  Record 1
    Author:  Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby
    Title:   Information, prediction, and query by committee
    Venue:   Advances in Neural Information Processing System
    Address: San Mateo, CA
    Year:    1993
  Record 2
    Author:  Freund, Y., Seung, H. S., Shamir, E. & Tishby, N.
    Title:   Information, prediction, and query by committee
    Venue:   Advances in Neural Information Processing Systems
    Address: San Mateo, CA.
    Year:

Restaurant records (fields: Name, Address, City, Cuisine)
  Record 1
    Name:    Second Avenue Deli
    Address: 156 2nd Ave. at 10th
    City:    New York
    Cuisine: Delicatessen
  Record 2
    Name:    Second Avenue Deli
    Address: 156 Second Ave.
    City:    New York City
    Cuisine: Delis

Trainable Record Linkage
  • MARLIN (Multiply Adaptive Record Linkage using INduction)
  • Learn parameterized similarity metrics for comparing each field.
      – Trainable edit-distance
          • Use EM to set edit-operation costs
  • Learn to combine multiple similarity metrics for each field to determine equivalence.
      – Use SVM to decide on duplicates
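To make "parameterized edit distance" concrete, here is a minimal sketch of a Levenshtein-style distance whose insert, delete and substitute costs are parameters. This is a simplification and not MARLIN's actual metric: the cost values are supplied by hand rather than estimated with EM, and one cost per operation type is an assumption made to keep the example short. The function name and the `costs` dictionary layout are hypothetical.

```python
def weighted_edit_distance(a, b, costs):
    """Minimum total cost of turning string a into string b.

    costs: dict with per-operation costs under the keys
    "insert", "delete" and "substitute" (matching characters cost 0).
    """
    ins, dele, sub = costs["insert"], costs["delete"], costs["substitute"]
    # prev[j] = cost of turning the first i-1 characters of a into b[:j]
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * dele]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + dele,                           # delete ca
                curr[j - 1] + ins,                        # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub)  # match or substitute
            ))
        prev = curr
    return prev[-1]

unit = {"insert": 1.0, "delete": 1.0, "substitute": 1.0}
cheap_insert = {"insert": 0.2, "delete": 1.0, "substitute": 1.0}

d_unit = weighted_edit_distance("Deli", "Delis", unit)
d_cheap = weighted_edit_distance("Deli", "Delis", cheap_insert)
print(d_unit, d_cheap)
```

Lowering a cost makes the corresponding kind of variation count as less of a difference for that field. A trained metric would set these values from labelled duplicate pairs instead of by hand.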

MARLIN Record Linkage Framework
[Figure: for each field pair (A.Field1, B.Field1), (A.Field2, B.Field2), …, (A.Fieldn, B.Fieldn), trainable similarity metrics m1 … mk are applied; above them sits a trainable duplicate detector]

Conclusions
  • Pattern Learning for Link Discovery is an important application of data mining for counterterrorism.
  • Learning for Link Discovery requires Relational Data Mining (RDM).
  • Other problem domains require RDM
      – Bioinformatics
      – Web
      – Natural Language Understanding
  • RDM is an important next-generation KDD capability.