Discovering Direct and Indirect Matches for Schema Elements

  • Slides: 47
Download presentation
Discovering Direct and Indirect Matches for Schema Elements Li Xu and David W. Embley

Discovering Direct and Indirect Matches for Schema Elements Li Xu and David W. Embley Brigham Young University BYU Data Extraction Group Funded by NSF DASFAA 2003

Information Exchange Source Target Information Extraction Leverage this … … to do this Schema

Information Exchange Source Target Information Extraction Leverage this … … to do this Schema Matching BYU Data Extraction Group DASFAA 2003

Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML

Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML Tables Conclusions BYU Data Extraction Group DASFAA 2003

Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML

Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML Tables Conclusions BYU Data Extraction Group DASFAA 2003

Extracting Pertinent Information from Documents BYU Data Extraction Group DASFAA 2003

Extracting Pertinent Information from Documents BYU Data Extraction Group DASFAA 2003

A Conceptual-Modeling Solution Year Price 1. . * Make 1. . * has has

A Conceptual-Modeling Solution Year Price 1. . * Make 1. . * has has 0. . 1 Car 0. . 1 has Model 1. . * has 1. . * Mileage has 0. . * 0. . 1 is for 1. . * Feature 1. . * Phone. Nr 0. . 1 has 1. . * Extension BYU Data Extraction Group DASFAA 2003

Car-Ads Ontology Car [->object]; Car [0. . 1] has Year [1. . *]; Car

Car-Ads Ontology Car [->object]; Car [0. . 1] has Year [1. . *]; Car [0. . 1] has Make [1. . *]; Car [0. . . 1] has Model [1. . *]; Car [0. . 1] has Mileage [1. . *]; Car [0. . *] has Feature [1. . *]; Car [0. . 1] has Price [1. . *]; Phone. Nr [1. . *] is for Car [0. . *]; Phone. Nr [0. . 1] has Extension [1. . *]; Year matches [4] constant {extract “d{2}”; context "([^$d]|^)[4 -9]d[^d]"; substitute "^" -> "19"; }, … … End; BYU Data Extraction Group DASFAA 2003

Recognition and Extraction Car 0001 0002 0003 Year 1989 1998 1994 Make Model Mileage

Recognition and Extraction Car 0001 0002 0003 Year 1989 1998 1994 Make Model Mileage Price Phone. Nr Subaru SW $1900 (336)835 -8597 Elantra (336)526 -5444 HONDA ACCORD EX 100 K (336)526 -1081 BYU Data Extraction Group Car 0001 0002 0002 0002 0003 Feature Auto AC Black 4 door tinted windows Auto pb ps cruise am/fm cassette stereo a/c Auto jade green gold DASFAA 2003

Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML

Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML Tables Conclusions BYU Data Extraction Group DASFAA 2003

Attribute Matching for Populated Schemas • Central Idea: Exploit All Data & Metadata •

Attribute Matching for Populated Schemas • Central Idea: Exploit All Data & Metadata • Matching Possibilities (Facets) – Attribute Names – Data-Value Characteristics – Expected Data Values – Data-Dictionary Information – Structural Properties BYU Data Extraction Group DASFAA 2003

Approach • Target Schema T • Source Schema S • Framework – Individual Facet

Approach • Target Schema T • Source Schema S • Framework – Individual Facet Matching – Combining Facets – Best-First Match Iteration BYU Data Extraction Group DASFAA 2003

Example Year Make Feature Year Make has 0: 1 Model 0: 1 0: *

Example Year Make Feature Year Make has 0: 1 Model 0: 1 0: * 0: 1 has Mileage 0: 1 has Phone Target Schema T BYU Data Extraction Group has Cost Car has has 0: 1 Model 0: 1 has Style Car 0: 1 Miles 0: 1 0: * has Cost Source Schema S DASFAA 2003

Individual Facet Matching • Attribute Names • Data-Value Characteristics • Expected Data Values BYU

Individual Facet Matching • Attribute Names • Data-Value Characteristics • Expected Data Values BYU Data Extraction Group DASFAA 2003

Attribute Names • Target and Source Attributes – T: A – S: B •

Attribute Names • Target and Source Attributes – T: A – S: B • Word. Net • C 4. 5 Decision Tree: feature selection, trained on schemas in DB books – – – BYU Data Extraction Group f 0: same word f 1: synonym f 2: sum of distances to a common hypernym root f 3: number of different common hypernym roots f 4: sum of the number of senses of A and B DASFAA 2003

Word. Net Rule The number of different common hypernym roots of A and B

Word. Net Rule The number of different common hypernym roots of A and B The sum of distances of A and B to a common hypernym BYU Data Extraction Group The sum of the number of senses of A and B DASFAA 2003

Confidence Measures BYU Data Extraction Group DASFAA 2003

Confidence Measures BYU Data Extraction Group DASFAA 2003

Data-Value Characteristics • C 4. 5 Decision Tree • Features – Numeric data (Mean,

Data-Value Characteristics • C 4. 5 Decision Tree • Features – Numeric data (Mean, variation, standard deviation, …) – Alphanumeric data (String length, numeric ratio, space ratio) BYU Data Extraction Group DASFAA 2003

Confidence Measures BYU Data Extraction Group DASFAA 2003

Confidence Measures BYU Data Extraction Group DASFAA 2003

Expected Data Values • Target Schema T and Source Schema S – Regular expression

Expected Data Values • Target Schema T and Source Schema S – Regular expression recognizer for attribute A in T – Data instances for attribute B in S • Hit Ratio = N'/N for (A, B) match – N' : number of B data instances recognized by the regular expressions of A – N: number of B data instances BYU Data Extraction Group DASFAA 2003

Confidence Measures BYU Data Extraction Group DASFAA 2003

Confidence Measures BYU Data Extraction Group DASFAA 2003

Combined Measures 1 0 0 0 0 0 0 0 0 0 0 0

Combined Measures 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 Threshold: 0. 5 BYU Data Extraction Group DASFAA 2003

Final Confidence Measures 0 0 0 BYU Data Extraction Group DASFAA 2003

Final Confidence Measures 0 0 0 BYU Data Extraction Group DASFAA 2003

Experimental Results • This schema, plus 6 other schemas – 32 matched attributes –

Experimental Results • This schema, plus 6 other schemas – 32 matched attributes – 376 unmatched attributes • Matched: 100% • Unmatched: 99. 5% – “Feature” ---”Color” – “Feature” ---”Body Type” BYU Data Extraction Group F 1 F 2 F 3 93. 8% 98. 9% 84% 92% 97. 9% 98. 4% F 1: Word. Net F 2: Value Characteristics F 3: Expected Values DASFAA 2003

Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML

Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML Tables Conclusions BYU Data Extraction Group DASFAA 2003

Schema Matching Year Make Model Feature Cost Car Phone Mileage Target BYU Data Extraction

Schema Matching Year Make Model Feature Cost Car Phone Mileage Target BYU Data Extraction Group Year Make & Model Color Body Type Car Miles Style Cost Source DASFAA 2003

Mapping Generation • Direct Matches as described earlier: – Attribute Names based on Word.

Mapping Generation • Direct Matches as described earlier: – Attribute Names based on Word. Net – Value Characteristics based on value lengths, averages, … – Expected Values based on regular-expression recognizers • Indirect Matches: – Direct matches – Structure Evaluation • • BYU Data Extraction Group Union Selection Decomposition Composition DASFAA 2003

Union and Selection Year Make Model Feature Make & Model Color Body Type Cost

Union and Selection Year Make Model Feature Make & Model Color Body Type Cost Car Style Car Phone Mileage Target BYU Data Extraction Group Year Miles Cost Source DASFAA 2003

Decomposition and Composition Year Make Model Feature Make & Model Color Body Type Cost

Decomposition and Composition Year Make Model Feature Make & Model Color Body Type Cost Car Style Car Phone Mileage Target BYU Data Extraction Group Year Miles Cost Source DASFAA 2003

Structure Example Taken From [MBR, VLDB’ 01] PO Purchase. Order Items POShip. To POBill.

Structure Example Taken From [MBR, VLDB’ 01] PO Purchase. Order Items POShip. To POBill. To POLines Invoice. To Count Item. Count City Street Address Item. Number Line Qty BYU Data Extraction Group City Street Uo. M Quantity Target Deliver. To Unit. Of. Measure Source DASFAA 2003

Structure (Nonlexical Matches) PO Purchase. Order Items POShip. To POBill. To POLines Invoice. To

Structure (Nonlexical Matches) PO Purchase. Order Items POShip. To POBill. To POLines Invoice. To Count City Street Address Item. Number Line Qty City BYU Data Extraction Group Street Uo. M Quantity Target Deliver. To Unit. Of. Measure Source DASFAA 2003

Structure (Join over FD Relationship Sets, …) PO Purchase. Order Items POShip. To POBill.

Structure (Join over FD Relationship Sets, …) PO Purchase. Order Items POShip. To POBill. To POLines Invoice. To Deliver. To City Count City Street Item. Number Line Qty Uo. M Quantity Target BYU Data Extraction Group Unit. Of. Measure Source DASFAA 2003

Structure (Lexical Matches) PO Purchase. Order Items POShip. To POBill. To POLines Invoice. To

Structure (Lexical Matches) PO Purchase. Order Items POShip. To POBill. To POLines Invoice. To Deliver. To City Count City Street Item. Number Line Qty Uo. M Quantity Target BYU Data Extraction Group Unit. Of. Measure Source DASFAA 2003

Experimental Results Applications (Number of Schemes) Precision (%) Recall (%) F (%) Correct False

Experimental Results Applications (Number of Schemes) Precision (%) Recall (%) F (%) Correct False Positive False Negative Course Schedule (5) 98 93 96 119 2 9 Faculty Member (5) 100 100 140 0 0 Real Estate (5) 92 96 94 235 20 10 Indirect Matches: 94% (precision, recall, F-measure) Data borrowed from Univ. of Washington [DDH, SIGMOD 01] Rough Comparison with U of W Results (direct matches only) * Course Schedule – Accuracy: ~71% * Faculty Members – Accuracy, ~92% * Real Estate (2 tests) – Accuracy: ~75% BYU Data Extraction Group DASFAA 2003

Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML

Outline • • • Information Extraction Direct Schema Matching Indirect Schema Matching for HTML Tables Conclusions BYU Data Extraction Group DASFAA 2003

Problem: Different Schemas Target Database Schema {Car, Year, Make, Model, Mileage, Price, Phone. Nr},

Problem: Different Schemas Target Database Schema {Car, Year, Make, Model, Mileage, Price, Phone. Nr}, {Phone. Nr, Extension}, {Car, Feature} Different Source Table Schemas – {Run #, Yr, Make, Model, Tran, Color, Dr} – {Make, Model, Year, Colour, Price, Auto, Air Cond. , AM/FM, CD} – {Vehicle, Distance, Price, Mileage} – {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy} BYU Data Extraction Group DASFAA 2003

Solution • Form attribute-value pairs • Do extraction • Infer mappings from extraction patterns

Solution • Form attribute-value pairs • Do extraction • Infer mappings from extraction patterns BYU Data Extraction Group DASFAA 2003

Solution: Remove Internal Factoring ACURA Legend Discover Nesting: Make, (Model, (Year, Colour, Price, Auto,

Solution: Remove Internal Factoring ACURA Legend Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)* Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table BYU Data Extraction Group DASFAA 2003

Solution: Replace Boolean Values Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto

Solution: Replace Boolean Values Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM AutoβAir CondβAM/FM β CD Table βYes, BYU Data Extraction Group DASFAA 2003

Solution: Form Attribute-Value Pairs Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto

Solution: Form Attribute-Value Pairs Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond. , Air Cond. >, <AM/FM, AM/FM>, <CD, > BYU Data Extraction Group DASFAA 2003

Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air

Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM BYU Data Extraction Group DASFAA 2003

Solution: Infer Mappings Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air

Solution: Infer Mappings Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM μ(Year, EachπMake rowμis(Model, car. πModel μ(Year, Table Year, Colour, Price, Auto, Colour, Air Cond, AM/FM, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table πa. Year Table Price, Auto, CD)* Air Cond, AM/FM, CD)* {Car, Year, Make, Model, Mileage, Price, Phone. Nr}, {Phone. Nr, Extension}, {Car, Feature} Note: Mappings produce sets for attributes. Joining to form records BYU Data Extraction Group is trivial because we have OIDs for table rows (e. g. for each Car). DASFAA 2003

Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air

Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM πModelμ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table {Car, Year, Make, Model, Mileage, Price, Phone. Nr}, {Phone. Nr, Extension}, {Car, Feature} BYU Data Extraction Group DASFAA 2003

Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air

Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM πPrice. Table {Car, Year, Make, Model, Mileage, Price, Phone. Nr}, {Phone. Nr, Extension}, {Car, Feature} BYU Data Extraction Group DASFAA 2003

Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air

Solution: Do Extraction Auto Air Cond. AM/FM CD ACURA Legend Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM ρ Colour←Feature π Colour. Table U ρ Auto←Feature π Auto β Auto. Table U ρ Air Cond. ←Feature π Air Cond. Yes, Air Cond. Table U ρ AM/FM CDTable β Yes, AM/FM←Feature π AM/FM β Yes, Table U ρ CD←Featureπ CDβ Yes, {Car, Year, Make, Model, Mileage, Price, Phone. Nr}, {Phone. Nr, Extension}, {Car, Feature} BYU Data Extraction Group DASFAA 2003

Experiment • • Tables from 60 sites 10 “training” tables 50 test tables 357

Experiment • • Tables from 60 sites 10 “training” tables 50 test tables 357 mappings (from all 60 sites) – 172 direct mappings (same attribute and meaning) – 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split) BYU Data Extraction Group DASFAA 2003

Results • 10 “training” tables – 100% of the 57 mappings – No false

Results • 10 “training” tables – 100% of the 57 mappings – No false mappings • 50 test tables – 94. 7% of the 300 mappings – No false mappings • 16 missed mappings – – – 4 partial (not all unions included) 6 non-U. S. car-ads (unrecognized makes and models) 2 U. S. unrecognized makes and models 3 prices (missing $ or found MSRP instead) 1 mileage (mileages less than 1, 000) BYU Data Extraction Group DASFAA 2003

Conclusions • Direct Attribute Matching – Matched 32 of 32: 100% Recall – 2

Conclusions • Direct Attribute Matching – Matched 32 of 32: 100% Recall – 2 False Positives: 94% Precision • Direct and Indirect Attribute Matching – Matched 494 of 513: 96% Recall – 22 False Positives: 96% Precision • Table Mappings – Matched 284 of 300: 94. 7% Recall – No False Positives: 100% Precision www. deg. byu. edu BYU Data Extraction Group DASFAA 2003