
Schema & Ontology Matching: Current Research Directions
AnHai Doan
Database and Information Systems Group, University of Illinois at Urbana-Champaign
Spring 2004

Road Map
• Schema Matching
  – motivation & problem definition
  – representative current solutions: LSD, iMAP, Clio
  – broader picture
• Ontology Matching
  – motivation & problem definition
  – representative current solution: GLUE
  – broader picture
• Conclusions & Emerging Directions

Motivation: Data Integration
A new faculty member wants to find houses with 2 bedrooms priced under 200K, across realestate.com, homeseekers.com, and homes.com.

Architecture of a Data Integration System
The query "find houses with 2 bedrooms priced under 200K" is posed over a mediated schema, which is mapped to the source schemas: source schema 1 (realestate.com), source schema 2 (homeseekers.com), source schema 3 (homes.com).

Semantic Matches between Schemas
Mediated schema: price, agent-name, address
homes.com schema: listed-price (320K, 240K), contact-name (Jane Brown, Mike Smith), city (Seattle, Miami), state (WA, FL)
• 1-1 match: price = listed-price
• complex match: address = concat(city, state)

Schema Matching is Ubiquitous!
Fundamental problem in numerous applications
• Databases
  – data integration
  – data translation
  – schema/view integration
  – data warehousing
  – semantic query processing
  – model management
  – peer data management
• AI
  – knowledge bases, ontology merging, information-gathering agents, ...
• Web
  – e-commerce
  – marking up data using ontologies (e.g., on the Semantic Web)

Why Schema Matching is Difficult
• Schema & data never fully capture semantics!
  – not adequately documented
  – schema creator has retired to Florida!
• Must rely on clues in schema & data
  – using names, structures, types, data values, etc.
• Such clues can be unreliable
  – same names => different entities: area => location or square-feet
  – different names => same entity: area & address => location
• Intended semantics can be subjective
  – house-style = house-description?
  – military applications require committees to decide!
• Cannot be fully automated, needs user feedback!

Current State of Affairs
• Finding semantic mappings is now a key bottleneck!
  – largely done by hand; labor intensive & error prone
  – data integration at GTE [Li & Clifton, 2000]: 40 databases, 27,000 elements, estimated time 12 years
• Will only be exacerbated
  – data sharing becomes pervasive
  – translation of legacy data
• Need semi-automatic approaches to scale up!
• Many research projects in the past few years
  – Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, U Wisconsin, NCSU, UIUC, Washington, ...
  – AI: Stanford, Karlsruhe University, NEC Japan, ...


LSD: Learning Source Descriptions
• Developed at Univ of Washington, 2000-2001
  – with Pedro Domingos and Alon Halevy
• Designed for data integration settings
  – has been adapted to several other contexts
• Desirable characteristics
  – learns from previous matching activities
  – exploits multiple types of information in schema and data
  – incorporates domain integrity constraints
  – handles user feedback
  – achieves high matching accuracy (66-97%) on real-world data

Schema Matching for Data Integration: the LSD Approach
Suppose a user wants to integrate 100 data sources
1. User manually creates matches for a few sources, say 3, and shows LSD these matches
2. LSD learns from the matches
3. LSD predicts matches for the remaining 97 sources

Learning from the Manual Matches
Mediated schema: price, agent-name, agent-phone, office-phone, description
realestate.com schema: listed-price, contact-name, contact-phone, office, comments
  e.g. ($250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house), ($320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location)
homes.com schema: sold-at, contact-agent, extra-info
  e.g. ($350K | (206) 634 9435 | Beautiful yard), ($230K | (617) 335 4243 | Close to Seattle)
Learned hints:
• if "office" occurs in the element name => office-phone
• if "fantastic" & "great" occur frequently in data instances => description

Must Exploit Multiple Types of Information!
Both schema information (element names such as "office") and data instances (words such as "fantastic" and "great") provide matching clues; no single type of clue suffices on its own.

Multi-Strategy Learning
• Use a set of base learners
  – each exploits well certain types of information
• To match a schema element of a new source
  – apply the base learners
  – combine their predictions using a meta-learner
• Meta-learner
  – uses training sources to measure base-learner accuracy
  – weighs each learner based on its accuracy

Base Learners
• Training: from examples (X1, C1), (X2, C2), ..., (Xm, Cm), build a classification model (hypothesis)
• Matching: given an observed object X, output labels weighted by confidence score
• Name Learner
  – training: ("location", address), ("contact name", name)
  – matching: agent-name => (name, 0.7), (phone, 0.3)
• Naive Bayes Learner
  – training: ("Seattle, WA", address), ("250K", price)
  – matching: "Kent, WA" => (address, 0.8), (name, 0.2)

The LSD Architecture
Training phase: the mediated schema and training data from the manually matched sources feed base learners 1..k, each producing a hypothesis; the meta-learner uses these to compute weights for the base learners.
Matching phase: for a new source schema, base learners 1..k make predictions for data instances; the meta-learner and prediction combiner turn these into predictions for schema elements; the constraint handler applies domain constraints to produce the final mappings.

Training the Base Learners
Mediated schema: address, price, agent-name, agent-phone, office-phone, description
realestate.com schema: location, price, contact-name, contact-phone, office, comments
  (Miami, FL | $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house)
  (Boston, MA | $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location)
Name Learner training examples: ("location", address), ("price", price), ("contact name", agent-name), ("contact phone", agent-phone), ("office", office-phone), ("comments", description)
Naive Bayes training examples: ("Miami, FL", address), ("$250K", price), ("James Smith", agent-name), ("(305) 729 0831", agent-phone), ("(305) 616 1822", office-phone), ("Fantastic house", description), ("Boston, MA", address), ...

Meta-Learner: Stacking [Wolpert 92, Ting & Witten 99]
• Training
  – uses training data to learn weights, one for each (base-learner, mediated-schema element) pair
  – e.g., weight(Name-Learner, address) = 0.2, weight(Naive-Bayes, address) = 0.8
• Matching: combine the predictions of the base learners
  – computes a weighted average of base-learner confidence scores
  – e.g., for element area with values "Seattle, WA", "Kent, WA", "Bend, OR":
    Name Learner: (address, 0.4); Naive Bayes: (address, 0.9)
    Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
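The stacking combination above can be sketched in a few lines; the two base learners and the weight values below are illustrative stand-ins using the slide's numbers, not LSD's actual trained models.

```python
# A minimal sketch of the stacking step, using the slide's numbers.
# The base learners and weights are illustrative assumptions.

def name_learner(name, values):
    # toy: weak evidence for "address" from the element name alone
    return {"address": 0.4}

def naive_bayes_learner(name, values):
    # toy: strong evidence for "address" from city/state-like values
    return {"address": 0.9}

# one weight per (base learner, mediated-schema element) pair
weights = {("name", "address"): 0.2, ("bayes", "address"): 0.8}

def meta_score(name, values, label):
    preds = {"name": name_learner(name, values)[label],
             "bayes": naive_bayes_learner(name, values)[label]}
    # weighted combination of base-learner confidence scores
    return sum(weights[(learner, label)] * p for learner, p in preds.items())

score = meta_score("area", ["Seattle, WA", "Kent, WA", "Bend, OR"], "address")
print(round(score, 2))  # 0.4*0.2 + 0.9*0.8 = 0.8
```

Because the weights for each element sum to 1 here, the weighted sum is exactly the weighted average the slide describes.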


Applying the Learners
homes.com schema: area, sold-at, contact-agent, extra-info
For area (values "Seattle, WA", "Kent, WA", "Bend, OR"):
  Name Learner: (address, 0.8), (description, 0.2)
  Naive Bayes: (address, 0.6), (description, 0.4)
  Meta-Learner: (address, 0.7), (description, 0.3)
Prediction Combiner output:
  area: (address, 0.7), (description, 0.3)
  sold-at: (price, 0.9), (agent-phone, 0.1)
  contact-agent: (agent-phone, 0.9), (description, 0.1)
  extra-info: (address, 0.6), (description, 0.4)

Domain Constraints
• Encode user knowledge about the domain
• Specified only once, by examining the mediated schema
• Examples
  – at most one source-schema element can match address
  – if a source-schema element matches house-id, then it is a key
  – avg-value(price) > avg-value(num-baths)
• Given a mapping combination, can verify whether it satisfies a given constraint
  – e.g., area: address; sold-at: price; contact-agent: agent-phone; extra-info: address

The Constraint Handler
Predictions from the Prediction Combiner:
  area: (address, 0.7), (description, 0.3)
  sold-at: (price, 0.9), (agent-phone, 0.1)
  contact-agent: (agent-phone, 0.9), (description, 0.1)
  extra-info: (address, 0.6), (description, 0.4)
Domain constraint: at most one element matches address
Candidate combinations, scored by the product of their confidences:
  • area: address, sold-at: price, contact-agent: agent-phone, extra-info: address => 0.7 * 0.9 * 0.9 * 0.6 = 0.3402 (violates the constraint)
  • area: description, sold-at: agent-phone, contact-agent: description, extra-info: description => 0.3 * 0.1 * 0.1 * 0.4 = 0.0012
  • area: address, sold-at: price, contact-agent: agent-phone, extra-info: description => 0.7 * 0.9 * 0.9 * 0.4 = 0.2268 (best valid combination)
• Searches the space of mapping combinations efficiently
• Can handle arbitrary constraints
• Also used to incorporate user feedback
  – e.g., sold-at does not match price
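The search above can be sketched as brute-force filter-and-rank over the slide's example; LSD's actual handler searches this space far more efficiently, and the predicate below encodes just the one constraint shown.

```python
# Brute-force sketch: enumerate label combinations, drop those violating
# the constraint, keep the highest product of confidences.
from itertools import product

predictions = {
    "area":          [("address", 0.7), ("description", 0.3)],
    "sold-at":       [("price", 0.9), ("agent-phone", 0.1)],
    "contact-agent": [("agent-phone", 0.9), ("description", 0.1)],
    "extra-info":    [("address", 0.6), ("description", 0.4)],
}

def at_most_one_address(assignment):
    return sum(label == "address" for _, (label, _) in assignment) <= 1

elements = list(predictions)
best, best_score = None, 0.0
for combo in product(*(predictions[e] for e in elements)):
    assignment = list(zip(elements, combo))
    if not at_most_one_address(assignment):
        continue
    score = 1.0
    for _, (_, conf) in assignment:
        score *= conf
    if score > best_score:
        best, best_score = assignment, score

print({e: label for e, (label, _) in best}, round(best_score, 4))
```

The winner maps extra-info to description with score 0.2268, matching the slide: the higher-scoring combination (0.3402) is discarded because it uses address twice.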

The Current LSD System
• Can also handle data in XML format
  – matches XML DTDs
• Base learners
  – Naive Bayes [Duda & Hart 93, Domingos & Pazzani 97]: exploits frequencies of words & symbols
  – WHIRL Nearest-Neighbor Classifier [Cohen & Hirsh, KDD-98]: employs an information-retrieval similarity metric
  – Name Learner [SIGMOD-01]: matches elements based on their names
  – County-Name Recognizer [SIGMOD-01]: stores all U.S. county names
  – XML Learner [SIGMOD-01]: exploits the hierarchical structure of XML data

Empirical Evaluation
• Four domains
  – Real Estate I & II, Course Offerings, Faculty Listings
• For each domain
  – created mediated schema & domain constraints
  – chose five sources
  – extracted & converted data into XML
  – mediated schemas: 14-66 elements; source schemas: 13-48
• Ten runs per domain; in each run:
  – manually provided 1-1 matches for 3 sources
  – asked LSD to propose matches for the remaining 2 sources
  – accuracy = % of 1-1 matches correctly identified

High Matching Accuracy
[chart: average matching accuracy (%) per domain]
• LSD's accuracy: 71-92%
• Best single base learner: 42-72%
• + Meta-learner: +5-22%
• + Constraint handler: +7-13%
• + XML learner: +0.8-6%

Contribution of Schema vs. Data
[chart: average matching accuracy (%) for LSD with only schema info, LSD with only data info, and the complete LSD]
More experiments in [Doan et al., SIGMOD-01]

LSD Summary
• LSD
  – learns from previous matching activities
  – exploits multiple types of information, by employing multi-strategy learning
  – incorporates domain constraints & user feedback
  – achieves high matching accuracy
• LSD focuses on 1-1 matches
• Next challenge: discover more complex matches!
  – the iMAP (Illinois Mapping) system [SIGMOD-04]
  – developed at Washington and Illinois, 2002-2004
  – with Robin Dhamanka, Yoonkyong Lee, Alon Halevy, Pedro Domingos

The iMAP Approach
Mediated schema: price, num-baths, address
homes.com schema: listed-price (320K, 240K), agent-id (53211, 11578), full-baths (2, 1), half-baths (1, 1), city (Seattle, Miami), zipcode (98105, 23591)
• For each mediated-schema element
  – searches the space of all matches
  – finds a small set of likely match candidates
  – uses LSD to evaluate them
• To search efficiently
  – employs a specialized searcher for each element type
  – Text Searcher, Numeric Searcher, Category Searcher, ...

The iMAP Architecture [SIGMOD-04]
Searchers 1..k take the mediated schema, the source schema + data, and domain knowledge and data, and generate match candidates. Base learners 1..k and a meta-learner score the candidates into a similarity matrix; a match selector picks the 1-1 and complex matches; an explanation module supports interaction with the user.

An Example: Text Searcher
• Beam search in the space of all concatenation matches
• Example: find match candidates for address
  – homes.com columns: listed-price, agent-id, full-baths, half-baths, city, zipcode
  – candidate concatenations: concat(agent-id, city) ("532a Seattle", "115c Miami"), concat(agent-id, zipcode) ("532a 98105", "115c 23591"), concat(city, zipcode) ("Seattle 98105", "Miami 23591")
• Best match candidates for address
  – (agent-id, 0.7), (concat(agent-id, city), 0.75), (concat(city, zipcode), 0.9)
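The Text Searcher idea can be sketched as ranking column concatenations against known target values; the word-overlap similarity below is a crude stand-in for LSD's learned evaluators, and the "beam" is simply the top 3 of all single columns and pairs.

```python
# Toy sketch of searching concatenation matches for address.
from itertools import combinations

source = {"city": ["Seattle", "Miami"],
          "zipcode": ["98105", "23591"],
          "agent-id": ["532a", "115c"]}
target_values = ["Seattle 98105", "Miami 23591"]   # known address instances

def concat_values(cols):
    return [" ".join(vals) for vals in zip(*(source[c] for c in cols))]

def similarity(values):
    # fraction of target words covered by the candidate's words
    cand = {w for v in values for w in v.split()}
    targ = [w for v in target_values for w in v.split()]
    return sum(w in cand for w in targ) / len(targ)

candidates = [(c,) for c in source] + list(combinations(source, 2))
scored = sorted(((similarity(concat_values(c)), c) for c in candidates),
                reverse=True)
print(scored[:3])   # concat(city, zipcode) scores highest
```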

Empirical Evaluation
• Current iMAP system: 12 searchers
• Four real-world domains
  – real estate, product inventory, cricket, financial wizard
  – target schemas: 19-42 elements; source schemas: 32-44
• Accuracy: 43-92%
• Sample discovered matches
  – agent-name = concat(first-name, last-name)
  – area = building-area / 43560
  – discount-cost = (unit-price * quantity) * (1 - discount)
• More detail in [Dhamanka et al., SIGMOD-04]

Observations
• Finding complex matches is much harder than finding 1-1 matches!
  – they require gluing together many components
  – e.g., num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms
  – if one component is missing => incorrect match
• However, even partial matches are already very useful!
  – so are top-k matches => need methods to handle partial/top-k matches
• Huge/infinite search spaces
  – domain knowledge plays a crucial role!
• Matches are fairly complex; hard to know whether they are correct
  – must be able to explain matches
• Humans must be fairly active in the loop
  – need strong user-interaction facilities
• Break the matching architecture into multiple "atomic" boxes!


Finding Matches is only Half of the Job!
To translate data/queries, we need mappings, not just matches.
Schema S:
  HOUSES(location, price ($), agent-id): (Atlanta, GA | 360,000 | 32), (Raleigh, NC | 430,000 | 15)
  AGENTS(id, name, city, state, fee-rate): (32 | Mike Brown | Athens | GA | 0.03), (15 | Jean Laup | Raleigh | NC | 0.04)
Schema T:
  LISTINGS(area, list-price, agent-address, agent-name): (Denver, CO | 550,000 | Boulder, CO | Laura Smith), (Atlanta, GA | 370,800 | Athens, GA | Mike Brown)
Mappings:
  area = SELECT location FROM HOUSES
  agent-address = SELECT concat(city, state) FROM AGENTS
  list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id

Clio: Elaborating Matches into Mappings
• Developed at Univ of Toronto & IBM Almaden, 2000-2003
  – by Renee Miller, Laura Haas, Mauricio Hernandez, Lucian Popa, Howard Ho, Ling Yan, Ron Fagin
• Given a match
  – list-price = price * (1 + fee-rate)
• Refine it into a mapping
  – list-price = SELECT price * (1 + fee-rate) FROM HOUSES (FULL OUTER JOIN) AGENTS WHERE agent-id = id
• Need to discover
  – the correct join path among tables, e.g., agent-id = id
  – the correct join type, e.g., full outer join? inner join?
• Use heuristics to decide
  – when in doubt, ask users
  – employ sophisticated user-interaction methods [VLDB-00, SIGMOD-01]
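The refined mapping can be exercised on the HOUSES/AGENTS example from the earlier slide; this sketch uses sqlite3 with hyphens in column names replaced by underscores, and shows the inner-join variant discussed above (FULL OUTER JOIN requires SQLite >= 3.39).

```python
# Run the list-price mapping over the slide's example data.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE HOUSES(location TEXT, price REAL, agent_id INT);
INSERT INTO HOUSES VALUES ('Atlanta, GA', 360000, 32), ('Raleigh, NC', 430000, 15);
CREATE TABLE AGENTS(id INT, name TEXT, city TEXT, state TEXT, fee_rate REAL);
INSERT INTO AGENTS VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03),
                          (15, 'Jean Laup', 'Raleigh', 'NC', 0.04);
""")
rows = con.execute("""
    SELECT location, ROUND(price * (1 + fee_rate)) AS list_price
    FROM HOUSES JOIN AGENTS ON agent_id = id
""").fetchall()
print(rows)   # the Atlanta house lists at 370,800, matching Schema T
```

Choosing the join path (agent-id = id) and the join type is exactly what Clio must discover; with a different join path or an outer join, the result set would differ.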



Broader Picture: Find Matches
• Hand-crafted rules, exploiting the schema, 1-1 matches: TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
• Single learner, exploiting the data, 1-1 matches: SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95], DELTA [Clifton et al. 97], AutoMatch/Autoplex [Berlin & Motro, 01-03]
• Learners + rules with multi-strategy learning, exploiting schema + data and domain constraints, 1-1 + complex matches: LSD [Doan et al., SIGMOD-01], iMAP [Dhamanka et al., SIGMOD-04]
• Other important works: COMA (Erhard Rahm's group), David Embley's group at BYU, Jaewoo Kang's group at NCSU, Kevin Chang's group at UIUC, Clement Yu's group at UIC
• More about some of these works soon...

Broader Picture: From Matches to Mappings
• Rules, exploiting the data, with powerful user interaction: Clio [Miller et al., 00], [Yan et al. 01]
• Learners + rules, exploiting schema + data, 1-1 + complex matches, automating as much as possible: iMAP [Dhamanka et al., SIGMOD-04]
• Combining the two directions remains open


Ontology Matching
• Increasingly critical for knowledge bases, the Semantic Web
• An ontology
  – concepts organized into a taxonomy tree
  – each concept has a set of attributes and a set of instances
  – relations among concepts
• Example: CS Dept. US: Entity > Courses (Undergrad Courses, Grad Courses), People (Faculty {Assistant Professor, Associate Professor, Professor}, Staff); a sample instance has attributes name: Mike Burns, degree: Ph.D.
• Matching: concepts, attributes, relations

Matching Taxonomies of Concepts
CS Dept. US: Entity > Undergrad Courses, Grad Courses, People (Faculty {Assistant Professor, Associate Professor, Professor}, Staff)
CS Dept. Australia: Entity > Courses, Staff (Academic Staff {Lecturer, Senior Lecturer, Professor}, Technical Staff)

GLUE
• Solution
  – use data instances extensively
  – learn classifiers using information within the taxonomies
  – use a rich constraint-satisfaction scheme
• [Doan, Madhavan, Domingos, Halevy; WWW-2002]

Concept Similarity
Think of concepts A and S as sets of instances drawn from a hypothetical universe of all examples, partitioned into the four regions (A,S), (A,¬S), (¬A,S), (¬A,¬S).
Jaccard similarity [Jaccard, 1908]:
  Sim(A, S) = P(A ∩ S) / P(A ∪ S) = P(A,S) / (P(A,S) + P(A,¬S) + P(¬A,S))
Multiple similarity measures can be defined in terms of the joint probability distribution (JPD): P(A,S), P(A,¬S), P(¬A,S), P(¬A,¬S).

Machine Learning for Computing Similarities
• Train a classifier CL_A on taxonomy 1 to recognize instances of A; train CL_S on taxonomy 2 to recognize instances of S
• Apply CL_S to taxonomy 1's instances and CL_A to taxonomy 2's instances, partitioning all instances into (A,S), (A,¬S), (¬A,S), (¬A,¬S)
• The JPD is estimated by counting the sizes of the partitions
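The counting step and the Jaccard computation can be sketched as follows; the two "classifiers" here are trivial keyword predicates standing in for GLUE's learned models, and the concepts and instances are illustrative assumptions.

```python
# Estimate the JPD by partition counts, then read off Jaccard similarity.

def cl_A(text):   # stands in for a classifier trained on taxonomy 1
    return "course" in text

def cl_S(text):   # stands in for a classifier trained on taxonomy 2
    return "course" in text or "seminar" in text

instances = ["intro course", "grad course", "faculty seminar",
             "staff directory", "course catalog"]

# count the four partitions (A,S), (A,notS), (notA,S), (notA,notS)
n = {(a, s): 0 for a in (True, False) for s in (True, False)}
for x in instances:
    n[(cl_A(x), cl_S(x))] += 1
p = {k: v / len(instances) for k, v in n.items()}

# Jaccard: P(A and S) / (P(A,S) + P(A,notS) + P(notA,S))
sim = p[(True, True)] / (p[(True, True)] + p[(True, False)] + p[(False, True)])
print(round(sim, 2))
```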

The GLUE System
Distribution Estimator: base learners + a meta-learner, applied to taxonomies O1 and O2 (tree structure + data instances), estimate the joint probability distribution: P(A,B), P(A',B), ...
Similarity Estimator: a similarity function over the JPD produces a similarity matrix.
Relaxation Labeling: uses common knowledge & domain constraints to produce the matches for O1 and O2.

Constraints in Taxonomy Matching
• Domain-dependent
  – at most one node matches department-chair
  – a node that matches professor cannot be a child of a node that matches assistant-professor
• Domain-independent
  – two nodes match if their parents & children match
  – if all children of X match Y, then X also matches Y
  – variations have been exploited in many restricted settings [Melnik & Garcia-Molina, ICDE-02], [Milo & Zohar, VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]
• Challenge: find a general & efficient approach

Solution: Relaxation Labeling
• Relaxation labeling [Hummel & Zucker, 83]
  – applied to graph labeling in vision, NLP, hypertext classification
  – finds the best label assignment, given a set of constraints
  – starts with an initial label assignment
  – iteratively improves the labels, using the constraints
• Standard relaxation labeling is not directly applicable
  – extended it in many ways [Doan et al., WWW-02]
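The iterative improvement loop can be sketched on a two-node toy example: a child node's label scores are re-weighted by how compatible each label is with the parent's current labels. All node names, initial scores, and compatibility values below are illustrative assumptions, not GLUE's actual model.

```python
# Toy relaxation labeling: label "Acad. Staff" (child of "Staff") with a
# concept from the target taxonomy, where Faculty is a child of People.

labels = ["People", "Faculty"]
parent = {"Acad. Staff": "Staff"}   # child -> parent in the source taxonomy

# initial scores from a (hypothetical) similarity estimator
score = {("Staff", "People"): 0.5, ("Staff", "Faculty"): 0.5,
         ("Acad. Staff", "People"): 0.3, ("Acad. Staff", "Faculty"): 0.7}

# compatibility of (child label, parent label) in the target taxonomy
compat = {("Faculty", "People"): 1.0, ("People", "Faculty"): 0.1,
          ("People", "People"): 0.5, ("Faculty", "Faculty"): 0.5}

for _ in range(5):                       # a few relaxation iterations
    new = dict(score)
    for child, par in parent.items():
        for lc in labels:
            # support = how well this label fits the parent's labels
            support = sum(score[(par, lp)] * compat[(lc, lp)] for lp in labels)
            new[(child, lc)] = score[(child, lc)] * support
        z = sum(new[(child, l)] for l in labels)
        for l in labels:                 # renormalize the child's scores
            new[(child, l)] /= z
    score = new

best = max(labels, key=lambda l: score[("Acad. Staff", l)])
print(best)
```

The constraint (Faculty sits under People in the target) pushes Acad. Staff toward the Faculty label over the iterations, which matches the intuition on the later backup slides.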

Real-World Experiments
• Taxonomies on the web
  – university organization (UW and Cornell): colleges, departments, and sub-fields
  – companies (Yahoo and The Standard): industries and sectors
• For each taxonomy
  – extracted data instances: course descriptions, company profiles
  – trivial data cleaning
  – 100-300 concepts per taxonomy
  – taxonomy depth 3-4
  – 10-90 data instances per concept
• Evaluation against manual mappings as the gold standard

GLUE's Performance
[accuracy charts: University Depts 1, University Depts 2, Company Profiles]

Broader Picture
• Ontology matching parallels the development of schema matching
  – rule-based & learning-based approaches
  – PROMPT family, OntoMorph, OntoMerge, Chimaera, Onion, OBSERVER, FCA-Merge, ...
  – extensive work by Ed Hovy's group
  – ontology versioning (e.g., by Noy et al.)
• More powerful user-interaction methods
  – e.g., iPROMPT, Chimaera
• Much more theoretical work in this area


Develop the Theoretical Foundation
• Not much is going on; however...
  – see work by Alon Halevy (AAAI-02) and Phil Bernstein (in model-management contexts)
  – some preliminary work in AnHai Doan's Ph.D. dissertation
  – work by Stuart Russell and other AI researchers on identity uncertainty is potentially relevant
• Most likely foundation: a probability framework

Need Much More Domain Knowledge
• Where to get it?
  – past matches (e.g., LSD, iMAP)
  – other schemas in the domain
    – holistic matching by Kevin Chang's group [SIGMOD-02]
    – corpus-based matching by Alon Halevy's group [IJCAI-03]
    – clustering to achieve bridging effects by Clement Yu's group [SIGMOD-04]
  – external data (e.g., iMAP at SIGMOD-04)
  – mass of users (e.g., MOBS at WebDB-03)
• How to get it and how to use it?
  – no clear answer yet

Employ a Multi-Module Architecture
• Many "black boxes", each good at doing a single thing
• Combine them and tailor them to each application
• Examples
  – LSD, iMAP, COMA, David Embley's systems
• Open issues
  – what are these black boxes?
  – how to build them?
  – how to combine them?

Powerful User Interaction
• Minimize user effort, maximize its impact
• Make it very easy for users to
  – supply domain knowledge
  – provide feedback on matches/mappings
• Develop powerful explanation facilities

Other Issues
• What to do with partial/top-k matches?
• Meaning negotiation
• Fortifying schemas for interoperability
• Very-large-scale matching scenarios (e.g., the Web)
• What can we do without the mappings?
• Interaction between schema matching and tuple matching?
• Benchmarks, tools?

Summary
• Schema/ontology matching: key to numerous data-management problems
  – much attention in the database, AI, and Semantic Web communities
• Simple problem definition, yet very difficult to solve
  – no satisfactory solution yet
  – AI-complete?
• We now understand the problems much better
  – still at the beginning of the journey
  – will need techniques from multiple fields

Backup Slides

Training the Meta-Learner (for address)
Extracted XML instances: <location>Miami, FL</>, <listed-price>$250,000</>, <area>Seattle, WA</>, <house-addr>Kent, WA</>, <num-baths>3</>, ...
Name Learner scores: 0.5, 0.4, 0.3, 0.6, 0.3, ...
Naive Bayes scores: 0.8, 0.3, 0.9, 0.8, 0.3, ...
True predictions: 1, 0, 1, 1, 0, ...
Least-squares linear regression yields:
  weight(Name-Learner, address) = 0.1
  weight(Naive-Bayes, address) = 0.9
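The regression step can be sketched directly from the table above: regress the true 0/1 labels on the base learners' confidence scores. The scores and labels mirror the slide, but with only five examples the fitted weights are whatever the regression yields, not necessarily the slide's 0.1 / 0.9.

```python
# Fit per-learner weights by ordinary least squares (two features).

name_scores  = [0.5, 0.4, 0.3, 0.6, 0.3]   # Name Learner confidences
bayes_scores = [0.8, 0.3, 0.9, 0.8, 0.3]   # Naive Bayes confidences
truth        = [1, 0, 1, 1, 0]             # 1 iff instance really is address

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# normal equations (X^T X) w = X^T y, solved by Cramer's rule
a, b = dot(name_scores, name_scores), dot(name_scores, bayes_scores)
c, d = b, dot(bayes_scores, bayes_scores)
e, f = dot(name_scores, truth), dot(bayes_scores, truth)

det = a * d - b * c
w_name  = (e * d - b * f) / det
w_bayes = (a * f - c * e) / det
print(round(w_name, 2), round(w_bayes, 2))
```

As on the slide, Naive Bayes ends up with the larger weight, since its scores track the true labels much more closely than the Name Learner's.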

Sensitivity to Amount of Available Data
[chart: average matching accuracy (%) vs. number of data listings per source (Real Estate I)]

Contribution of Each Component
[chart: average matching accuracy (%) for LSD without the Name Learner, without Naive Bayes, without the WHIRL Learner, without the Constraint Handler, and the complete LSD system]

Exploiting Hierarchical Structure
• Existing learners flatten out all structure
  – e.g., <contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact> vs. <description>Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.</description>
• Developed an XML learner
  – similar to the Naive Bayes learner: input instance = bag of tokens
  – differs in one crucial aspect: considers not only text tokens, but also structure tokens

Reasons for Incorrect Matchings
• Unfamiliarity
  – e.g., suburb
  – solution: add a suburb-name recognizer
• Insufficient information
  – correctly identified the general type, failed to pinpoint the exact type
  – e.g., agent-name and phone: Richard Smith, (206) 234 5412
  – solution: add a proximity learner
• Subjectivity
  – house-style = description?
  – e.g., "Victorian" vs. "Beautiful neo-gothic house", "Mexican" vs. "Great location"

Evaluate Mapping Candidates
• For address, the Text Searcher returns
  – (agent-id, 0.7)
  – (concat(agent-id, city), 0.8)
  – (concat(city, zipcode), 0.75)
• Employ multi-strategy learning to evaluate the mappings
• Example: (concat(agent-id, city), 0.8)
  – Naive Bayes Learner: 0.8
  – Name Learner ("address" vs. "agent id city"): 0.3
  – Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65
• Meta-Learner returns
  – (agent-id, 0.59)
  – (concat(agent-id, city), 0.65)
  – (concat(city, zipcode), 0.70)

Relaxation Labeling
• Applied to similar problems in vision, NLP, hypertext classification
• Here: assign to each node of the Dept Australia taxonomy (Courses, Acad. Staff, Tech. Staff, ...) a label from the Dept U.S. taxonomy (Courses, People, Faculty, Staff)

Relaxation Labeling for Taxonomy Matching
• Must define
  – the neighborhood of a node
  – k features of the neighborhood
  – how to combine the influence of the features
• Algorithm
  – init: for each pair <N, L>, compute an initial score
  – loop: for each pair <N, L>, re-compute the score given the neighborhood configuration
• Example neighborhood configuration: Staff = People, Acad. Staff = Faculty, Tech. Staff = Staff

Relaxation Labeling for Taxonomy Matching
• Huge number of neighborhood configurations!
  – typically neighborhood = immediate nodes
  – here the neighborhood can be the entire graph
  – 100 nodes, 10 labels => an astronomical number of configurations
• Solution
  – label abstraction + dynamic programming
  – guarantees quadratic time for a broad range of domain constraints
• Empirical evaluation
  – GLUE system [Doan et al., WWW-02]
  – three real-world domains
  – 30-300 nodes per taxonomy
  – high accuracy: 66-97% vs. 52-83% for the best base learner
  – relaxation labeling very fast, finished in several seconds