Schema Mapping: Experiences and Lessons Learned
Yihong Ding
Data Extraction Group
Brigham Young University
Sponsored by NSF

Schema Mapping
• Semantic correspondence between two schemas
• Significance
  – data integration
  – data warehouses
  – ontology merging
  – message translation in e-commerce
  – semantic query processing
  – etc.

Schema Representation
[Figure: two example schemas. Node labels: MLS, Phone_evening, Bedrooms, MLS, Basic_features, House, location, Agent, Name, Golf course, Water front, SQFT, location_description, beds, agent, Phone_day, Location, Address, name, cell phone, Street, City, State, home phone, office phone]

1:1 Mapping Cardinality
[Figure: the two schemas from the Schema Representation slide]

n:1 Mapping Cardinality
[Figure: the two schemas from the Schema Representation slide]

n:m Mapping Cardinality
[Figure: the two schemas from the Schema Representation slide]

Object-Set Matcher (schema-level)
• Name-based matcher
  – string and substring comparison
  – linguistic methods: stemming, stop words, removing ignorable characters, etc.
  – thesaurus: WordNet, etc.
• 1:1 mapping cardinality
• Example pairs: Agent and agent; Name and name
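A minimal sketch of a name-based matcher along the lines listed above. The stop-word list, the tiny synonym table (standing in for a thesaurus such as WordNet), and the function names are assumptions made for illustration; stemming is left out for brevity.

```python
import re

# Hypothetical stop words and synonym table; a real matcher would use a
# fuller list and a thesaurus such as WordNet.
STOP_WORDS = {"of", "the", "a", "an"}
SYNONYMS = {"beds": "bedrooms"}

def normalize(name):
    """Lowercase, split on ignorable characters, drop stop words, map synonyms."""
    tokens = re.split(r"[^a-z0-9]+", name.lower())
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in tokens]

def names_match(a, b):
    """True if the normalized names are equal or one contains the other."""
    na, nb = "".join(normalize(a)), "".join(normalize(b))
    if not na or not nb:
        return False
    return na == nb or na in nb or nb in na

print(names_match("Agent", "agent"))    # string comparison after lowercasing
print(names_match("beds", "Bedrooms"))  # match through the synonym table
print(names_match("SQFT", "City"))      # unrelated names
```

Plain substring comparison is deliberately loose (for example, it pairs Phone_day with phone), which is one reason a name-based matcher is usually combined with other evidence.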

Object-Set Matcher (instance-level)
• Data Frame
  – multiple regular expressions in Perl style
  – as simple as a list of data values, e.g. Car Model: Ford, Honda, Chevy, Toyota …
• Data-frame matcher
  – use: compare recognized data values
  – benefit: able to recognize disjunctive data value sets
  – bias: data frame may not correspond 100% with the semantics
  – limitation: a needed data frame might not exist
• 1:1 mapping cardinality
[Figure: Car Model data frame with Object-set A and Object-set B; values shown: Ford, Chevy, Honda, Toyota]
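The data-frame idea can be sketched as follows, assuming a data frame is just one compiled regular expression and that two object sets match when the same frame recognizes most values in each. The 0.8 threshold, the function names, and the split of values between the two object sets are assumptions, not taken from the slides.

```python
import re

# Hypothetical "Car Model" data frame: a plain list of values compiled into
# one case-insensitive regular expression.
CAR_MODEL_FRAME = re.compile(r"\b(Ford|Honda|Chevy|Toyota)\b", re.IGNORECASE)

def recognized_fraction(frame, values):
    """Fraction of the data values that the data frame recognizes."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if frame.search(v))
    return hits / len(values)

def data_frame_match(frame, values_a, values_b, threshold=0.8):
    """Two object sets match if the same frame recognizes most values in both."""
    return (recognized_fraction(frame, values_a) >= threshold
            and recognized_fraction(frame, values_b) >= threshold)

# The two value sets are disjoint, yet the shared frame still relates them.
object_set_a = ["Ford", "Honda"]
object_set_b = ["Chevy", "Toyota"]
print(data_frame_match(CAR_MODEL_FRAME, object_set_a, object_set_b))
```

Because the comparison goes through the frame rather than through the values themselves, the two object sets do not need to share a single value.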

Extended Data-Frame Matcher (instance-level)
• n:1 mapping cardinality
• Add a STRICT_SUBSTRING operation
• With the help of structural analysis
[Figure: example value "120 N. University Ave., Provo, UT"; Schema 1: location; Schema 2: Address with Street, City, State]
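The slides do not define STRICT_SUBSTRING, so the following is only one plausible reading: a value is a strict substring of another if it occurs inside it and is not the whole value. Under that assumption, an n:1 match can be suspected when each of several target values is a strict substring of a single source value. The function names are hypothetical.

```python
def strict_substring(part, whole):
    """True if `part` occurs inside `whole` and is not the whole value."""
    part, whole = part.strip(), whole.strip()
    return bool(part) and part != whole and part in whole

def composes(whole, parts):
    """True if every part is a strict substring of the single source value."""
    return all(strict_substring(p, whole) for p in parts)

location = "120 N. University Ave., Provo, UT"        # Schema 1: location
address = ["120 N. University Ave.", "Provo", "UT"]    # Schema 2: Street, City, State
print(composes(location, address))
```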

Direct Structure Matcher
• Comparing structure similarity between two candidate schemas
• 1:1 mapping cardinality
[Figure labels: Name, agent, Agent, Fax, Location, name, phone_day, fax, phone, Address]

Reference Structure Matcher
• If A and B match C, then A matches B.
• Able to solve n:m mapping cardinality
• 1:1, n:1, and n:m mapping cardinalities
[Figure labels: Phone; Day Phone, Evening Phone, Cell Phone, Home Phone, Office Phone; Schema 2; Schema 1; Home Phone, Evening Phone, Cell Phone, Day Phone, Office Phone]
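The rule "if A and B match C, then A matches B" can be sketched by grouping the elements of each schema under the reference concept they match. The element-to-concept tables below are hypothetical inputs (the slides do not say how they are obtained), and `match_via_reference` is an illustrative name.

```python
from collections import defaultdict

def match_via_reference(map1, map2):
    """Group elements of two schemas by the reference concept they match.

    map1 and map2 map element names to a reference concept. Returns
    {concept: (elements from schema 1, elements from schema 2)} for every
    concept matched from both sides.
    """
    groups = defaultdict(lambda: (set(), set()))
    for elem, concept in map1.items():
        groups[concept][0].add(elem)
    for elem, concept in map2.items():
        groups[concept][1].add(elem)
    return {c: (s1, s2) for c, (s1, s2) in groups.items() if s1 and s2}

# Hypothetical matches against a reference structure.
schema1 = {"Phone_day": "Phone", "Phone_evening": "Phone"}
schema2 = {"cell phone": "Phone", "home phone": "Phone",
           "office phone": "Phone", "Street": "Address"}
print(match_via_reference(schema1, schema2))
```

In this example two elements on one side and three on the other fall under the same reference concept, which is an n:m cluster; a concept hit by one element on each side gives a 1:1 match.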

Experiments

Application (Number of Schemas)   Precision (%)   Recall (%)   F (%)   Number Matches   Number Correct   Number Incorrect
Faculty Member (5)                100             100          100     540              540              0
Course Schedule (5)               99              93           96      490              454              6
Real Estate (5)                   90              94           92      876              820              92

• Indirect Matches: precision 87%, recall 94%, F-measure 90%
• Data borrowed from Univ. of Washington [DDH, SIGMOD 01]

Rough Comparison with U of W Results
* Faculty Member – Accuracy: ~92%
* Course Schedule – Accuracy: ~71%
* Real Estate (2 tests) – Accuracy: ~75%
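The slides do not state the formulas behind the table. Assuming the usual definitions (precision = correct / (correct + incorrect), recall = correct / number of matches, F = harmonic mean of the two), the reported percentages for Course Schedule and Real Estate can be reproduced from the counts:

```python
def prf(matches, correct, incorrect):
    """Return (precision, recall, F) as whole percentages.

    Assumed definitions: precision = correct / (correct + incorrect),
    recall = correct / matches, F = 2PR / (P + R).
    """
    precision = correct / (correct + incorrect)
    recall = correct / matches
    f_measure = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * x) for x in (precision, recall, f_measure))

print(prf(490, 454, 6))    # Course Schedule -> (99, 93, 96)
print(prf(876, 820, 92))   # Real Estate     -> (90, 94, 92)
```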

Lessons Learned
• n:1 and n:m matches occur frequently.
  – 22% = 97/437 [DMD+03] (Course Catalog, Company Profile)
  – 45% = 287/638 (Car Ads, Cell Phones, Real Estate)
• Reference structures provide a way to solve the long-standing, hard cluster-mapping (n:m cardinality) problem.
• Data frames improve the instance-level matchers.
• The combination of schema-level and instance-level matchers improves the results.