Data Exchange with DataMetadata Translations WangChiew Tan Mauricio

  • Slides: 25
Download presentation
Data Exchange with Data-Metadata Translations Wang-Chiew Tan Mauricio A. Hernández IBM Almaden Research Center

Data Exchange with Data-Metadata Translations Wang-Chiew Tan Mauricio A. Hernández IBM Almaden Research Center UC Santa Cruz Paolo Papotti Università Roma Tre VLDB 2008 August 24 -- Auckland, New Zealand

Data-Metadata Translations • Data exchange scenarios may involve metadata transformations. – E. g. ,

Data-Metadata Translations • Data exchange scenarios may involve metadata transformations. – E. g. , Pivot/Unpivot in spreadsheets. [example from Miller 98] • • 2 Mapping systems support Data-to-Data transformations with fixed schemas. Goal: Extend mapping systems to support Data. Metadata Translations.

Mapping Systems Source schema S GUI Target schema T IBM Clio Hep. Tox MS

Mapping Systems Source schema S GUI Target schema T IBM Clio Hep. Tox MS ADO. net Declarative (internal) representation Altova Map. Force Stylus. Studio Executable code (XSLT, XQuery, Java) I 3 J BEA Aqualogic Data exchange

Outline 1. Data and Metadata translations 2. Generation Algorithms Data-to-Data Mapping Generation Metadata-to-Data Query

Outline 1. Data and Metadata translations 2. Generation Algorithms Data-to-Data Mapping Generation Metadata-to-Data Query Generation Graphic Design Data-to-Metadata 3. Results & Discussion Experiments Related Work Conclusion Graphic Design

Data-to-Data • Mapping Generation Algorithm: [PVMHF 2002] – Input: Source and Target schemas, and

Data-to-Data • Mapping Generation Algorithm: [PVMHF 2002] – Input: Source and Target schemas, and correspondences. – Output: declarative schema mapping • For example: Source: Rcd Sales: Set. Of Rcd country region style shipdate units price Target: Rcd Country. Sales: Set. Of Rcd country Sales: Set. Of Rcd style shipdate units id for $s in Source. Sales exists $t in Target. Country. Sales, $c in $t. Sales where $t. country = $s. country and $c. style = $s. style and $c. shipdate = $s. shipdate and $c. units = $s. units

Mappings for $s in Source. Sales exists $t in Target. Country. Sales, $c in

Mappings for $s in Source. Sales exists $t in Target. Country. Sales, $c in $t. Sales where $t. country = $s. country and $c. style = $s. style and $c. shipdate = $s. shipdate and $c. units = $s. units • Query Generation into multiple query languages: – Input: a data to data schema mapping – Output: a query script (XQuery, XSLT, SQL, etc. ) for $x 0 in $doc/Source/Sales return ( <Country. Sales> <country> { $x 0/country/text() } </country> …

“State-of-the-art” Metadata-to-Data How can we transform the following source data into the corresponding target?

“State-of-the-art” Metadata-to-Data How can we transform the following source data into the corresponding target? Source. Sales month USA UK Italy Jan 120 223 89 Feb 83 168 56 Target. Sales month country units Jan Jan Feb Feb USA UK Italy 120 223 89 83 168 56 Schema mapping m 1 Source: Rcd Target: Rcd Sales: Set. Of Rcd month “USA” USA country UK units Italy 7 m 1: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “USA” and $t. units = $s. USA

“State-of-the-art” Metadata-to-Data How can we transform the following source data into the corresponding target?

“State-of-the-art” Metadata-to-Data How can we transform the following source data into the corresponding target? Source. Sales month USA UK Italy Jan 120 223 89 Feb 83 168 56 Target. Sales month country units Jan Jan Feb Feb USA UK Italy 120 223 89 83 168 56 Schema mapping m 2 Source: Rcd Sales: Set. Of Rcd month “UK” USA UK Italy 8 Target: Rcd Sales: Set. Of Rcd month country units m 1: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “USA” and $t. units = $s. USA m 2: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “UK” and $t. units = $s. UK

“State-of-the-art” Metadata-to-Data How can we transform the following source data into the corresponding target?

“State-of-the-art” Metadata-to-Data How can we transform the following source data into the corresponding target? Source. Sales month USA UK Italy Jan 120 223 89 Feb 83 168 56 Target. Sales month country units Jan Jan Feb Feb USA UK Italy 120 223 89 83 168 56 Schema mapping m 3 Source: Rcd Sales: Set. Of Rcd month “Italy” USA UK Italy 9 Target: Rcd Sales: Set. Of Rcd month country units m 1: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “USA” and $t. units = $s. USA m 2: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “UK” and $t. units = $s. UK m 3: for $s in Source. Sales exists $t in Target. Sales where $t. month = $s. month and $t. country = “Italy” and $t. units = $s. Italy

Metadata-to-Data: Our solution Metadat. A-Data (MAD) mapping: Source: Rcd Sales: Set. Of Rcd month

Metadata-to-Data: Our solution Metadat. A-Data (MAD) mapping: Source: Rcd Sales: Set. Of Rcd month USA countries UK label Italy Target: Rcd Sales: Set. Of Rcd month country units Source. Sales Jan 120 223 89 Feb 83 168 56 value Select the elements to group for exists where and 10 Placeholder $s in Source. Sales, $c in {“USA”, “UK”, “Italy”} $t in Target. Sales $t. month = $s. month $t. country = $c $t. units = $s. ($c) Copy elements’ labels Target. Sales Jan USA 120 Jan UK 223 Jan Italy 89 Feb USA 83 Feb UK 168 Feb Italy 56 Copy elements’ values Set of labels (strings) Is a label value Dynamic selection of the source element

Data-to-Metadata Now we want to support the opposite operation [example from Miller 98] The

Data-to-Metadata Now we want to support the opposite operation [example from Miller 98] The target schema depends on the source data We define a target template: Nested Dynamic Output Schemas (ndos) Source: Rcd Target: Rcd Stock. Ticker: Set. Of Rcd Stockquotes: Set. Of Rcd time symbols Dynamic price label value element Run-time: The dynamic element defines the target instance and the target schema. 11

Data-to-Metadata: Heterogeneous records Consider this mapping and this source instance: Source: Rcd Target: Rcd

Data-to-Metadata: Heterogeneous records Consider this mapping and this source instance: Source: Rcd Target: Rcd Stock. Ticker: Set. Of Rcd Stockquotes: Set. Of Rcd time symbols price label value Source Instance Stock. Ticker (time: 0900, Symbol : MSFT, Stock. Ticker (time: 0900, Symbol : IBM, Stock. Ticker (time: 0905, Symbol : MSFT, First alternative: Heterogeneous target records Computed Target Instance Stockquotes (time: 0900, MSFT: 27. 20 ) Stockquotes (time: 0900, IBM: 120. 00 ) Stockquotes (time: 0905, MSFT: 27. 30 ) Price: 27. 20 ) Price: 120. 00 ) Price: 27. 30 ) There are two possible interpretations for the target ndos: Computed Target Schema Target: Rcd Stockquotes: Set. Of Rcd time symbols: Choice MSFT IBM

Data-to-Metadata: Homogenous records Consider this mapping and this source instance: Source: Rcd Target: Rcd

Data-to-Metadata: Homogenous records Consider this mapping and this source instance: Source: Rcd Target: Rcd Stock. Ticker: Set. Of Rcd Stockquotes: Set. Of Rcd time symbols price label value Source Instance Stock. Ticker (time: 0900, Symbol : MSFT Stock. Ticker (time: 0900, Symbol : IBM Stock. Ticker (time: 0905, Symbol : MSFT Price: 27. 20 ) Price: 120. 00 ) Price: 27. 30 ) Second alternative: Homogeneous target records Computed Target Instance Stockquotes (time: 0900, MSFT: 27. 20, IBM: null ) Stockquotes (time: 0900, MSFT: null , IBM: 120. 00 ) Stockquotes (time: 0905, MSFT: 27. 30, IBM: null ) There are two possible interpretations for the target ndos: Computed Target Schema Target: Rcd Stockquotes: Set. Of Rcd time MSFT IBM

Data-to-Metadata: Homogenous records Source: Rcd Target: Rcd Stock. Ticker: Set. Of Rcd Stockquotes: Set.

Data-to-Metadata: Homogenous records Source: Rcd Target: Rcd Stock. Ticker: Set. Of Rcd Stockquotes: Set. Of Rcd time symbols price label value Stockquotes(time: 0900, MSFT : 27. 20 ) Stockquotes(time: 0900, IBM : 120. 00 ) Stockquotes(time: 0905, MSFT : 27. 30 ) Natural solution for semistructured data models (XSD, DTD, JSON) Homogeneity Constraint: “For every pair of tuples t 1 and t 2, if a is a label in t 1, then a is a label in t 2” for $t 1 in Target. Stockquotes, $t 2 in Target. Stockquotes, $a in dom ($t 1) exists $a’ in dom ($t 2) where $a = $a’ Stockquotes(time: 0900, MSFT : 27. 20, IBM: null ) Stockquotes(time: 0900, MSFT : null , IBM: 120. 00) Stockquotes(time: 0905, MSFT : 27. 30, IBM: null ) 14 Natural solution for the Relational data model

MAD Mapping Generation Target: Rcd By. Shipdate. Country: Set. Of Choice dates Source: Rcd

MAD Mapping Generation Target: Rcd By. Shipdate. Country: Set. Of Choice dates Source: Rcd Sales: Set. Of Rcd country region style shipdate units price label 1 value 1 : Rcd countries label 2 value 2 : Set. Of <By. Ship. Date. Country> Rcd style units price <12 -07> <USA> <style>Tee</style><units>11</units><price>1200</price> </USA><USA> <style>Elec. </style><units>12</units><price>3600</price> Source. Sales country region style shipdate units price USA USA UK East West Tee Elec. Tee 12 -07 01 -08 02 -08 11 12 10 12 1200 3600 1600 2000 </USA> </12 -07> <01 -08> <USA> <style>Tee</style><units>10</units><price>1600</price> </USA> </01 -08> <02 -08> <UK> <style>Tee</style><units>12</units><price>2000</price> 15 </UK> </02 -08> </By. Ship. Data. Country>

MAD Mapping Generation Source: Rcd Sales: Set. Of Rcd country region style shipdate units

MAD Mapping Generation Source: Rcd Sales: Set. Of Rcd country region style shipdate units price Target: Rcd By. Shipdate. Country: Set. Of Choice dates label 1 value 1 : Rcd Three Steps: 1. Modify schemas with dynamic placeholders 2. Compile mappings 3. Simplify mapping countries label 2 value 2 : Set. Of Rcd style units price This is what we get from Clio [PVMHF 02] 16 for $s in Source. Sales for exists $t in Target. By. Shipdate. Country, exists $y in dates , $u in case $t of $y, $z in countries , where $v in $u. ($z) and where $y = $s. shipdate and $z= $s. country and $v. style = $s. style and $v. units = $s. units and $v. price = $s. price and $u. ($z) = SK[$s. shipdate, $s. country] $s in Source. Sales $t in Target. By. Shipdate. Country, $u in case $t of $s. shipdate, $v in $u. ($s. country) $v. style = $s. style $v. units = $s. units and $v. price = $s. price $u. ($s. country) = SK[$s. shipdate, $s. country]

Query Generation: two-phase algorithm [PVMHF 02] Phase 1: Phase 2: Q 1 shreds the

Query Generation: two-phase algorithm [PVMHF 02] Phase 1: Phase 2: Q 1 shreds the source instance I over relational views of the target schema Q 2 assembles the target instance J from the relational views T 2 T 1 T 4 T 3 S 1 S conforms-to I conforms-to Q 1 rr rr Q 2 J T

New Query Generation Phase 2: Phase 1: Q 1 shreds the source instance I

New Query Generation Phase 2: Phase 1: Q 1 shreds the source instance I over relational views of the target ndos S 1 S conforms-to I Q 2 assembles the target instance J from the relational views Q 3 computes the target schema T Q 4 is the optional post - processing Q 1 rr Q 3 T 1 ndos T 3 conforms-to T 2 T 1 T 4 T 3 conforms-to rr Q 2 T 2 J Q 4 T

MAD Clio vs. Commercial Tools Commercial Tool 19

MAD Clio vs. Commercial Tools Commercial Tool 19

MAD Clio vs. Commercial Tools MAD Clio Optimized query 48 source labels (10 MB):

MAD Clio vs. Commercial Tools MAD Clio Optimized query 48 source labels (10 MB): naïve 183 s, dynamic 14 s, optimized 10 s

MAD Clio Performance 12 target labels (10 MB): naïve 590 s, optimized 80 s

MAD Clio Performance 12 target labels (10 MB): naïve 590 s, optimized 80 s [1 phase: 3 s] 21

Some Related Work • Lots of related work in the relational setting: – FIRA/FISQL

Some Related Work • Lots of related work in the relational setting: – FIRA/FISQL [Wyss, Robertson 2005] has an excellent survey. – Schema. SQL [Lakshmanan, Sadri, Subramanian 1996], FIRA/FISQL [Wyss, Robertson 2005] • Extensions to SQL to handle metadata as data • Only relational dynamic output schemas • Language and semantics, NO transformations from GUI • In XML settings – Hep. Tox [BCHLP 2005], commercial mapping tools [Altova Map. Force, MS ADO. net, Stylus. Studio, BEA (Oracle) Aqualogic] • No dynamic elements in the target 22

MAD Clio Data exchange with data-metadata support: Data to Data is a special case

MAD Clio Data exchange with data-metadata support: Data to Data is a special case Source schema S GUI Target schema T Declarative (internal) representation Executable code (XSLT, XSLT XQuery, Java) Java 23 § New construct to iterate over elements’ labels: placeholder § Target schema can be incomplete: nested dynamic output schema (ndos) § New constructs for the mapping language § New mapping & query generation algorithms § Including a query to generate the target schema.

Data Exchange with Data-Metadata Translations Thank you. Questions? 24

Data Exchange with Data-Metadata Translations Thank you. Questions? 24

Metadata-to-Metadata. . . <properties name=“price” lang=“en-us” date=“ 01 -01 -2008”. . . > <pval>48.

Metadata-to-Metadata. . . <properties name=“price” lang=“en-us” date=“ 01 -01 -2008”. . . > <pval>48. 15</pval> </properties>. . . <price value=“ 48. 15” lang=“en-us” date=“ 01 -01 -2008”. . . />. . . Source: Rcd Target: Rcd properties: Set. Of Rcd label 1 <<names>> @name value 1: Set. Of Rcd <<attrs>> @lang @value label @date label 2 <<elems>> value … value 2 @format pval for $x 1 in Source. properties, $x 2 in { @lang, @date, …, @format } exists $y 1 in Target. ($x 1. @name), where $y 1. @value = $x 1. pval and $y 1. ($x 2) = $x 1. ($x 2) 25 Metadata to Metadata: placeholder to dynamic element