Discovering Direct and Indirect Matches for Schema Elements
- Slides: 47
Discovering Direct and Indirect Matches for Schema Elements Li Xu and David W. Embley Brigham Young University BYU Data Extraction Group Funded by NSF DASFAA 2003
Information Exchange Source Target Information Extraction Leverage this … … to do this Schema Matching BYU Data Extraction Group DASFAA 2003
Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML Tables Conclusions BYU Data Extraction Group DASFAA 2003
Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML Tables Conclusions BYU Data Extraction Group DASFAA 2003
Extracting Pertinent Information from Documents BYU Data Extraction Group DASFAA 2003
A Conceptual-Modeling Solution Year Price 1. . * Make 1. . * has has 0. . 1 Car 0. . 1 has Model 1. . * has 1. . * Mileage has 0. . * 0. . 1 is for 1. . * Feature 1. . * Phone. Nr 0. . 1 has 1. . * Extension BYU Data Extraction Group DASFAA 2003
Car-Ads Ontology Car [->object]; Car [0. . 1] has Year [1. . *]; Car [0. . 1] has Make [1. . *]; Car [0. . . 1] has Model [1. . *]; Car [0. . 1] has Mileage [1. . *]; Car [0. . *] has Feature [1. . *]; Car [0. . 1] has Price [1. . *]; Phone. Nr [1. . *] is for Car [0. . *]; Phone. Nr [0. . 1] has Extension [1. . *]; Year matches [4] constant {extract “d{2}”; context "([^$d]|^)[4 -9]d[^d]"; substitute "^" -> "19"; }, … … End; BYU Data Extraction Group DASFAA 2003
Recognition and Extraction Car 0001 0002 0003 Year 1989 1998 1994 Make Model Mileage Price Phone. Nr Subaru SW $1900 (336)835 -8597 Elantra (336)526 -5444 HONDA ACCORD EX 100 K (336)526 -1081 BYU Data Extraction Group Car 0001 0002 0002 0002 0003 Feature Auto AC Black 4 door tinted windows Auto pb ps cruise am/fm cassette stereo a/c Auto jade green gold DASFAA 2003
Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML Tables Conclusions BYU Data Extraction Group DASFAA 2003
Attribute Matching for Populated Schemas • Central Idea: Exploit All Data & Metadata • Matching Possibilities (Facets) – Attribute Names – Data-Value Characteristics – Expected Data Values – Data-Dictionary Information – Structural Properties BYU Data Extraction Group DASFAA 2003
Approach • Target Schema T • Source Schema S • Framework – Individual Facet Matching – Combining Facets – Best-First Match Iteration BYU Data Extraction Group DASFAA 2003
Example Year Make Feature Year Make has 0: 1 Model 0: 1 0: * 0: 1 has Mileage 0: 1 has Phone Target Schema T BYU Data Extraction Group has Cost Car has has 0: 1 Model 0: 1 has Style Car 0: 1 Miles 0: 1 0: * has Cost Source Schema S DASFAA 2003
Individual Facet Matching • Attribute Names • Data-Value Characteristics • Expected Data Values BYU Data Extraction Group DASFAA 2003
Attribute Names • Target and Source Attributes – T: A – S: B • Word. Net • C 4. 5 Decision Tree: feature selection, trained on schemas in DB books – – – BYU Data Extraction Group f 0: same word f 1: synonym f 2: sum of distances to a common hypernym root f 3: number of different common hypernym roots f 4: sum of the number of senses of A and B DASFAA 2003
Word. Net Rule The number of different common hypernym roots of A and B The sum of distances of A and B to a common hypernym BYU Data Extraction Group The sum of the number of senses of A and B DASFAA 2003
Confidence Measures BYU Data Extraction Group DASFAA 2003
Data-Value Characteristics • C 4. 5 Decision Tree • Features – Numeric data (Mean, variation, standard deviation, …) – Alphanumeric data (String length, numeric ratio, space ratio) BYU Data Extraction Group DASFAA 2003
Confidence Measures BYU Data Extraction Group DASFAA 2003
Expected Data Values • Target Schema T and Source Schema S – Regular expression recognizer for attribute A in T – Data instances for attribute B in S • Hit Ratio = N'/N for (A, B) match – N' : number of B data instances recognized by the regular expressions of A – N: number of B data instances BYU Data Extraction Group DASFAA 2003
Confidence Measures BYU Data Extraction Group DASFAA 2003
Combined Measures 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 Threshold: 0. 5 BYU Data Extraction Group DASFAA 2003
Final Confidence Measures 0 0 0 BYU Data Extraction Group DASFAA 2003
Experimental Results • This schema, plus 6 other schemas – 32 matched attributes – 376 unmatched attributes • Matched: 100% • Unmatched: 99. 5% – “Feature” ---”Color” – “Feature” ---”Body Type” BYU Data Extraction Group F 1 F 2 F 3 93. 8% 98. 9% 84% 92% 97. 9% 98. 4% F 1: Word. Net F 2: Value Characteristics F 3: Expected Values DASFAA 2003
Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML Tables Conclusions BYU Data Extraction Group DASFAA 2003
Schema Matching Year Make Model Feature Cost Car Phone Mileage Target BYU Data Extraction Group Year Make & Model Color Body Type Car Miles Style Cost Source DASFAA 2003
Mapping Generation • Direct Matches as described earlier: – Attribute Names based on Word. Net – Value Characteristics based on value lengths, averages, … – Expected Values based on regular-expression recognizers • Indirect Matches: – Direct matches – Structure Evaluation • • BYU Data Extraction Group Union Selection Decomposition Composition DASFAA 2003
Union and Selection Year Make Model Feature Make & Model Color Body Type Cost Car Style Car Phone Mileage Target BYU Data Extraction Group Year Miles Cost Source DASFAA 2003
Decomposition and Composition Year Make Model Feature Make & Model Color Body Type Cost Car Style Car Phone Mileage Target BYU Data Extraction Group Year Miles Cost Source DASFAA 2003
Structure Example Taken From [MBR, VLDB’ 01] PO Purchase. Order Items POShip. To POBill. To POLines Invoice. To Count Item. Count City Street Address Item. Number Line Qty BYU Data Extraction Group City Street Uo. M Quantity Target Deliver. To Unit. Of. Measure Source DASFAA 2003
Structure (Nonlexical Matches) PO Purchase. Order Items POShip. To POBill. To POLines Invoice. To Count City Street Address Item. Number Line Qty City BYU Data Extraction Group Street Uo. M Quantity Target Deliver. To Unit. Of. Measure Source DASFAA 2003
Structure (Join over FD Relationship Sets, …) PO Purchase. Order Items POShip. To POBill. To POLines Invoice. To Deliver. To City Count City Street Item. Number Line Qty Uo. M Quantity Target BYU Data Extraction Group Unit. Of. Measure Source DASFAA 2003
Structure (Lexical Matches) PO Purchase. Order Items POShip. To POBill. To POLines Invoice. To Deliver. To City Count City Street Item. Number Line Qty Uo. M Quantity Target BYU Data Extraction Group Unit. Of. Measure Source DASFAA 2003
Experimental Results Applications (Number of Schemes) Precision (%) Recall (%) F (%) Correct False Positive False Negative Course Schedule (5) 98 93 96 119 2 9 Faculty Member (5) 100 100 140 0 0 Real Estate (5) 92 96 94 235 20 10 Indirect Matches: 94% (precision, recall, F-measure) Data borrowed from Univ. of Washington [DDH, SIGMOD 01] Rough Comparison with U of W Results (direct matches only) * Course Schedule – Accuracy: ~71% * Faculty Members – Accuracy, ~92% * Real Estate (2 tests) – Accuracy: ~75% BYU Data Extraction Group DASFAA 2003
Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML Tables Conclusions BYU Data Extraction Group DASFAA 2003
Problem: Different Schemas Target Database Schema {Car, Year, Make, Model, Mileage, Price, Phone. Nr}, {Phone. Nr, Extension}, {Car, Feature} Different Source Table Schemas – {Run #, Yr, Make, Model, Tran, Color, Dr} – {Make, Model, Year, Colour, Price, Auto, Air Cond. , AM/FM, CD} – {Vehicle, Distance, Price, Mileage} – {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy} BYU Data Extraction Group DASFAA 2003
Solution • Form attribute-value pairs • Do extraction • Infer mappings from extraction patterns BYU Data Extraction Group DASFAA 2003
Solution: Remove Internal Factoring ACURA Legend Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)* Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table BYU Data Extraction Group DASFAA 2003
Solution: Replace Boolean Values Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM AutoβAir CondβAM/FM β CD Table βYes, BYU Data Extraction Group DASFAA 2003
Solution: Form Attribute-Value Pairs Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond. , Air Cond. >, <AM/FM, AM/FM>, <CD, > BYU Data Extraction Group DASFAA 2003
Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM BYU Data Extraction Group DASFAA 2003
Solution: Infer Mappings Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM μ(Year, EachπMake rowμis(Model, car. πModel μ(Year, Table Year, Colour, Price, Auto, Colour, Air Cond, AM/FM, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table πa. Year Table Price, Auto, CD)* Air Cond, AM/FM, CD)* {Car, Year, Make, Model, Mileage, Price, Phone. Nr}, {Phone. Nr, Extension}, {Car, Feature} Note: Mappings produce sets for attributes. Joining to form records BYU Data Extraction Group is trivial because we have OIDs for table rows (e. g. for each Car). DASFAA 2003
Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM πModelμ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table {Car, Year, Make, Model, Mileage, Price, Phone. Nr}, {Phone. Nr, Extension}, {Car, Feature} BYU Data Extraction Group DASFAA 2003
Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM πPrice. Table {Car, Year, Make, Model, Mileage, Price, Phone. Nr}, {Phone. Nr, Extension}, {Car, Feature} BYU Data Extraction Group DASFAA 2003
Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM ρ Colour←Feature π Colour. Table U ρ Auto←Feature π Auto β Auto. Table U ρ Air Cond. ←Feature π Air Cond. Yes, Air Cond. Table U ρ AM/FM CDTable β Yes, AM/FM←Feature π AM/FM β Yes, Table U ρ CD←Featureπ CDβ Yes, {Car, Year, Make, Model, Mileage, Price, Phone. Nr}, {Phone. Nr, Extension}, {Car, Feature} BYU Data Extraction Group DASFAA 2003
Experiment • • Tables from 60 sites 10 “training” tables 50 test tables 357 mappings (from all 60 sites) – 172 direct mappings (same attribute and meaning) – 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split) BYU Data Extraction Group DASFAA 2003
Results • 10 “training” tables – 100% of the 57 mappings – No false mappings • 50 test tables – 94. 7% of the 300 mappings – No false mappings • 16 missed mappings – – – 4 partial (not all unions included) 6 non-U. S. car-ads (unrecognized makes and models) 2 U. S. unrecognized makes and models 3 prices (missing $ or found MSRP instead) 1 mileage (mileages less than 1, 000) BYU Data Extraction Group DASFAA 2003
Conclusions • Direct Attribute Matching – Matched 32 of 32: 100% Recall – 2 False Positives: 94% Precision • Direct and Indirect Attribute Matching – Matched 494 of 513: 96% Recall – 22 False Positives: 96% Precision • Table Mappings – Matched 284 of 300: 94. 7% Recall – No False Positives: 100% Precision www. deg. byu. edu BYU Data Extraction Group DASFAA 2003
- Ethics discovering right and wrong
- Job analysis information
- Discovering the humanities 4th edition
- Discovering computers 2018 chapter 1 ppt
- Discovering computers 2016
- Discovering computers
- Self motivation explanation
- Discovering science 7
- P value in statistics
- Discovering economic systems guided practice
- Discovering computers 2011
- Dapps goal
- Discovering computers 2018 chapter 1
- Discovering science 7
- Mohamed computer technology
- Discovering engineering
- Defining and debating america's founding ideals
- Teachers discovering computers
- Discovering science 7
- Grade 8 science unit 3
- Amenhotop
- Discovering personal genius
- Tactile output devices
- Discovering computers 2012
- Chapter 7 lesson 1 discovering parts of an atom answer key
- Discovering the internet
- Discovering cells
- Discovering our past ancient civilizations
- Choose the picture that matches the word la sudadera
- Identifying forms of energy
- Identify a contextual selector that matches any element
- Y x 2 graph
- Collective nouns for banana
- Chemistry of matches
- Blue marble matches answer key
- Principles of portfolio assessment
- Blue marble matches answer key
- Practicscore
- Social groups primary and secondary
- Behavior that matches group expectations.
- Ancient chinese matches
- Countable uncountable exercises
- Which geometric construction matches the diagram below?
- Three gray geese in the green grass grazing
- Portfolio assessment matches assessment to teaching
- Indirect direct characterization
- Indirect characterization examples
- Indirect proof assumption