Data Exchange with DataMetadata Translations MAD Algorithm Paolo

  • Slides: 40
Download presentation
Data Exchange with Data-Metadata Translations MAD Algorithm Paolo Papotti Mauricio A. Hernández Wang-Chiew Tan

Data Exchange with Data-Metadata Translations MAD Algorithm Paolo Papotti Mauricio A. Hernández Wang-Chiew Tan

Data Exchange “Scientia potentia est” • What is Data Exchange? : • The process

Data Exchange “Scientia potentia est” • What is Data Exchange? : • The process of taking data built under a source schema and transforming it into data built under a target schema • Data Exchange is the restructuring of data

Data Exchange – why? 1. Today when companies merge they also merge information sources.

Data Exchange – why? 1. Today when companies merge they also merge information sources.

Data Exchange – why? 2. When several institutions are working on a joint venture

Data Exchange – why? 2. When several institutions are working on a joint venture – a combined database is

Data Exchange – why? 3. Refreshing and updating data base scheme

Data Exchange – why? 3. Refreshing and updating data base scheme

Few problems with data exchange 1. The labels in the Source Schema and the

Few problems with data exchange 1. The labels in the Source Schema and the values Target Schema could be very different 2. Data could be kept in a plethora of ways For instance: Car price could be stored in Shekels and in U. S dollars 3. Data could be lost in the exchange process if the Source Schema and Target Schema don’t correspond well

Data Exchange v In the past Data Exchange was done manually, taking many resources

Data Exchange v In the past Data Exchange was done manually, taking many resources such as time and money. v Many researchers struggle with ways of improving data exchange

Antique Car Dealership Schema Buy-A-Wreck Car Model price Agent-id Vauxhall 14 360, 000 48

Antique Car Dealership Schema Buy-A-Wreck Car Model price Agent-id Vauxhall 14 360, 000 48 Ford 430, 000 66 Model T Id Name 48 66 Car cars model Comm ission Nigel Dodds Vauxhall 14 0. 03 Ian Paisley T 0. 04 Ford Car AGENTS Schema Clunkers –R-Us Location List-price Automobil Seniorit e y Agentname Belfast, NR 650000 Morris 8 Gerry Adams Newry, NR 500000 Bentley Mark V 2 1 Martin Mc. Guiness Clunker table

Matching Examples Schema Buy-A-Wreck Name Nigel Dodds Ian Paisley Schema Clunkers –R-Us Agent- name

Matching Examples Schema Buy-A-Wreck Name Nigel Dodds Ian Paisley Schema Clunkers –R-Us Agent- name Nigel Dodds Ian Paisley

Matching Examples Schema Buy-A-Wreck Car model Vauxhall 14 Ford T Schema Clunkers –R-Us Automobile

Matching Examples Schema Buy-A-Wreck Car model Vauxhall 14 Ford T Schema Clunkers –R-Us Automobile Vauxhall 14 Ford T

Matching Examples Schema Buy-A-Wreck Car type price Agent-id Vauxh 14 all 360, 000 48

Matching Examples Schema Buy-A-Wreck Car type price Agent-id Vauxh 14 all 360, 000 48 Model 430, 000 T 66 Ford Id Commission 48 0. 03 66 0. 04 cars Car AGENTS Schema Clunkers –R-Us List-price Car model 370800 Vauxhall 14 447200 Ford Model T

How do we match? • Creating mappings: 1. schema matching: find matches 2. create

How do we match? • Creating mappings: 1. schema matching: find matches 2. create query expressions: for automated data translation or exchange Schema Matching Create Query expressions

Data Exchange 1. There may be no way to transform an instance given all

Data Exchange 1. There may be no way to transform an instance given all of our constraints. 2. There may be numerous ways to transform the instance (possibly infinitely many). 3. We must identify and justify a best suited choice of solutions for our need.

Data Exchange - Summery To conclude: 1. Data exchange is exchanging data from a

Data Exchange - Summery To conclude: 1. Data exchange is exchanging data from a Source Schema to a Target Schema 2. It is a greatly dealt problem in the computerized world 3. Some Data exchange scenarios deal with Metadata Source schema S S Target schema T T

What is Metadata? • Metadata: Data on Data. Metadata can come as: Video Audio

What is Metadata? • Metadata: Data on Data. Metadata can come as: Video Audio Image Text

Why Do we need Meta – Data? Meta-Data helps us to understand data Can

Why Do we need Meta – Data? Meta-Data helps us to understand data Can anyone tell what these numbers mean? Jan 120 223 89 Feb 83 168 56

Why Do we need Meta – Data? After adding Meta-Data… Umbrella Sales Month USA

Why Do we need Meta – Data? After adding Meta-Data… Umbrella Sales Month USA UK Italy Jan 120 223 89 Feb 83 168 56

Why Do we need Meta – Data? We all know this picture…

Why Do we need Meta – Data? We all know this picture…

Why Do we need Meta – Data? What is this picture all about?

Why Do we need Meta – Data? What is this picture all about?

Why Do we need Meta – Data? Sir Edward Carson signing the Ulster Covenant

Why Do we need Meta – Data? Sir Edward Carson signing the Ulster Covenant

Why Do we need Meta – Data?

Why Do we need Meta – Data?

Why Do we need Meta – Data? Wall Street, New York City, New York.

Why Do we need Meta – Data? Wall Street, New York City, New York.

Data-Metadata Translations 23 • Data exchange scenarios may involve metadata transformations. • Transforming the

Data-Metadata Translations 23 • Data exchange scenarios may involve metadata transformations. • Transforming the data in the Stock Ticker table to metadata in the Stock Quotes table is vital in the stock exchange world.

Data-Metadata Translations • • Mapping systems support Data-to-Data transformations with fixed schemas (Clio). Goal:

Data-Metadata Translations • • Mapping systems support Data-to-Data transformations with fixed schemas (Clio). Goal: Extend mapping systems to support Data-Metadata Translations.

Data Exchange Clio • One software developed for simple graphic data exchange is “Clio”

Data Exchange Clio • One software developed for simple graphic data exchange is “Clio” • Clio corresponded values between the source scheme and the target scheme • However, the Clio solution did not provide answers for possible data exchange scenarios that involve Metadata • the solution involving Metadata is based on Clio

Clio interface

Clio interface

Metadata-to-Data How can we transform the following source data into the corresponding target? Source.

Metadata-to-Data How can we transform the following source data into the corresponding target? Source. Sales month USA UK Italy Jan 120 223 89 Feb 83 168 56 Target. Sales month country units Jan Jan Feb Feb USA UK Italy 120 223 89 83 168 56 Schema mapping m 1 Source: Rcd Target: Rcd Sales: Set. Of Rcd month “USA” USA country UK units Italy 27 m 1: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “USA” and $t. units = $s. USA

Metadata-to-Data How can we transform the following source data into the corresponding target? Source.

Metadata-to-Data How can we transform the following source data into the corresponding target? Source. Sales month USA UK Italy Jan 120 223 89 Feb 83 168 56 Target. Sales month country units Jan Jan Feb Feb USA UK Italy 120 223 89 83 168 56 Schema mapping m 2 Source: Rcd Sales: Set. Of Rcd month “UK” USA UK Italy 28 Target: Rcd Sales: Set. Of Rcd month country units m 1: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “USA” and $t. units = $s. USA m 2: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “UK” and $t. units = $s. UK

Metadata-to-Data How can we transform the following source data into the corresponding target? Source.

Metadata-to-Data How can we transform the following source data into the corresponding target? Source. Sales month USA UK Italy Jan 120 223 89 Feb 83 168 56 Target. Sales month country units Jan Jan Feb Feb USA UK Italy 120 223 89 83 168 56 Schema mapping m 3 Source: Rcd Sales: Set. Of Rcd month “Italy” USA UK Italy 29 Target: Rcd Sales: Set. Of Rcd month country units m 1: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “USA” and $t. units = $s. USA m 2: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “UK” and $t. units = $s. UK m 3: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “Italy” and $t. units = $s. Italy

Metadata-to-Data: Our solution Metadat. A-Data (MAD) mapping: Source: Rcd Sales: Set. Of Rcd month

Metadata-to-Data: Our solution Metadat. A-Data (MAD) mapping: Source: Rcd Sales: Set. Of Rcd month USA countries UK label Italy Target: Rcd Sales: Set. Of Rcd month country units Source. Sales Jan 120 223 89 Feb 83 168 56 value Select the elements to group for exists where and 30 Placeholder $s in Source. Sales, $c in {“USA”, “UK”, “Italy”} $t in Target. Sales $t. month = $s. month $t. country = $c $t. units = $s. ($c) Copy elements’ labels Target. Sales Jan USA 120 Jan UK 223 Jan Italy 89 Feb USA 83 Feb UK 168 Feb Italy 56 Copy elements’ values Set of labels (strings) Is a label value Dynamic selection of the source element

Data-to-Metadata Now we want to support the opposite operation The target schema depends on

Data-to-Metadata Now we want to support the opposite operation The target schema depends on the source data We define a target template: Nested Dynamic Output Schemas (ndos) Source: Rcd Target: Rcd Stock. Ticker: Set. Of Rcd Stockquotes: Set. Of Rcd time symbols Dynamic price label value element Run-time: The dynamic element defines the target instance and the target schema. 31

Data-to-Metadata: Heterogeneous records Consider this mapping and this source instance: Source: Rcd Target: Rcd

Data-to-Metadata: Heterogeneous records Consider this mapping and this source instance: Source: Rcd Target: Rcd Stock. Ticker: Set. Of Rcd Stockquotes: Set. Of Rcd time symbols price label value Source Instance Stock. Ticker (time: 0900, Symbol : MSFT, Stock. Ticker (time: 0900, Symbol : IBM, Stock. Ticker (time: 0905, Symbol : MSFT, First alternative: Heterogeneous target records Computed Target Instance Stockquotes (time: 0900, MSFT: 27. 20 ) Stockquotes (time: 0900, IBM: 120. 00 ) Stockquotes (time: 0905, MSFT: 27. 30 ) Price: 27. 20 ) Price: 120. 00 ) Price: 27. 30 ) There are two possible interpretations for the target ndos: Computed Target Schema Target: Rcd Stockquotes: Set. Of Rcd time symbols: Choice MSFT IBM

Data-to-Metadata: Homogenous records Consider this mapping and this source instance: Source: Rcd Target: Rcd

Data-to-Metadata: Homogenous records Consider this mapping and this source instance: Source: Rcd Target: Rcd Stock. Ticker: Set. Of Rcd Stockquotes: Set. Of Rcd time symbols price label value Source Instance Stock. Ticker (time: 0900, Symbol : MSFT Stock. Ticker (time: 0900, Symbol : IBM Stock. Ticker (time: 0905, Symbol : MSFT Price: 27. 20 ) Price: 120. 00 ) Price: 27. 30 ) Second alternative: Homogeneous target records Computed Target Instance Stockquotes (time: 0900, MSFT: 27. 20, IBM: null ) Stockquotes (time: 0900, MSFT: null , IBM: 120. 00 ) Stockquotes (time: 0905, MSFT: 27. 30, IBM: null ) There are two possible interpretations for the target ndos: Computed Target Schema Target: Rcd Stockquotes: Set. Of Rcd time MSFT IBM

Data-to-Metadata: Homogenous records Source: Rcd Target: Rcd Stock. Ticker: Set. Of Rcd Stockquotes: Set.

Data-to-Metadata: Homogenous records Source: Rcd Target: Rcd Stock. Ticker: Set. Of Rcd Stockquotes: Set. Of Rcd time symbols price label value Stockquotes(time: 0900, MSFT : 27. 20 ) Stockquotes(time: 0900, IBM : 120. 00 ) Stockquotes(time: 0905, MSFT : 27. 30 ) Natural solution for semistructured data models (XSD, DTD, JSON) Homogeneity Constraint: “For every pair of tuples t 1 and t 2, if a is a label in t 1, then a is a label in t 2” Stockquotes(time: 0900, MSFT : 27. 20, IBM: null ) Stockquotes(time: 0900, MSFT : null , IBM: 120. 00) Stockquotes(time: 0905, MSFT : 27. 30, IBM: null ) 34 The Homogenous approach is a MAD improvemnet

MAD Mapping Metadat. A-Data(MAD) mapping three steps: 1. Preliminary mapping Ø How do we

MAD Mapping Metadat. A-Data(MAD) mapping three steps: 1. Preliminary mapping Ø How do we map the Source schema to the Target schema Ø Preliminary mapping for <<D>> includes the metadata label and the value label of <<D>>.

Target. Sales Source. Sales month USA UK Italy Jan 120 223 89 Feb 83

Target. Sales Source. Sales month USA UK Italy Jan 120 223 89 Feb 83 168 56 36 { $x 1 Source. Sales. By. Countries, $x 2 <<countries>>; $x 3=$x 1. ($x 2) } Source: Rcd Sales. By. Countries: month USA UK Italy Set. Of Rcd countries label value Target: Rcd Sales: Set. Of Rcd month country units Jan Jan Feb Feb USA UK Italy 120 223 89 83 168 56 Preliminary Mapping Label Value Transfer

MAD Mapping 2. Skeletons: Ø n x m matrix of skeletons is constructed for

MAD Mapping 2. Skeletons: Ø n x m matrix of skeletons is constructed for the set of source preliminary mapping and the set of target preliminary mapping while each entry(i, j) can be potential mapping. 3. Creating MAD Mapping: Ø At this stage, the value correspondences need to be matched against the preliminary mapping in order to factor them into the appropriate skeletons. Source. Sales. country Target. Country. Sales. country Matched against one or more source mappings 37 Matched against one or more target mappings

MAD Mapping Generation Example Source : { $x 1 Source. Sales. By. Country, $x

MAD Mapping Generation Example Source : { $x 1 Source. Sales. By. Country, $x 2 <<countries>>; $x 3: =$x 1. ($x 2) } Target : { $y 1 Target. Sales} Source. Sales. By. Countries. <<countries>> Target. Sales. country Source. Sales. By. Countries. &<<countries>> Target. Sales. units Source: Rcd Target: Rcd Sales. By. Country: Set. Of Rcd Sales: Set. Of Rcd month USA country countries UK units label Italy value

MAD vs Clio Data exchange with data-metadata support: Data to Data is a special

MAD vs Clio Data exchange with data-metadata support: Data to Data is a special case Source schema S GUI Target schema T Declarative (internal) representation Executable code (XSLT, XSLT XQuery, Java) Java § New construct to iterate over elements’ labels: placeholder § Target schema can be incomplete: nested dynamic output schema (ndos) § New mapping & query generation algorithms

Fin. 40

Fin. 40