Conceptual Modeling for ETL processes Panos Vassiliadis Alkis

  • Slides: 41
Download presentation
Conceptual Modeling for ETL processes Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos {pvassil, asimi, spiros}@dblab.

Conceptual Modeling for ETL processes Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos {pvassil, asimi, spiros}@dblab. ece. ntua. gr National Technical University of Athens KDBS Laboratory http: //www. dbnet. ece. ntua. gr Vassiliadis, Simitsis, Skiadopoul os - DOLAP'02

General Idea n The problem: n n The conceptual part of the definition of

General Idea n The problem: n n The conceptual part of the definition of ETL process in the early stages of a DW project The key idea: n The mapping of the attributes of the data sources to the attributes of the DW tables Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 2

Outline n n n Motivation Conceptual Model Instantiation and Specialization Layers Methodology for the

Outline n n n Motivation Conceptual Model Instantiation and Specialization Layers Methodology for the usage of the conceptual model Conclusions and Future Work Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 3

Extract-Transform-Load (ETL) Extract Sources Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 Transform & Clean DSA Load

Extract-Transform-Load (ETL) Extract Sources Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 Transform & Clean DSA Load DW 4

Motivation n Practical necessity n n n e. g. , 80% of the development

Motivation n Practical necessity n n n e. g. , 80% of the development time in a DW project In-house development, ad-hoc solutions Lack of related work n The front end of the DW has monopolized the research on the conceptual part of DW modeling Thus, the design, development and deployment of ETL processes, needs modeling, design and methodological foundations Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 5

Motivation n Early stages of the DW design : n n Concepts are still

Motivation n Early stages of the DW design : n n Concepts are still fuzzy and changing frequently Lots of interviews with people No time for a full, clean-cut definition of the DW and the ETL workflow Still, we can: n n Trace the mapping of the attributes of the data sources to the attributes of the DW tables PK S 1. A Trace necessary constraints and transformations for the ETL process Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 DW. A 6

Outline n n n Motivation Conceptual Model Instantiation and Specialization Layers Methodology for the

Outline n n n Motivation Conceptual Model Instantiation and Specialization Layers Methodology for the usage of the conceptual model Conclusions and Future Work Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 7

Conceptual Model n Entities of our model: n n n n n Concepts Attributes

Conceptual Model n Entities of our model: n n n n n Concepts Attributes Part-of Relationships Transformations Serial Composition of Transformations Provider Relationships Notes ETL Constraints Candidate Relationships Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 8

Conceptual Model n Concepts n n n a name, finite set of attributes represent

Conceptual Model n Concepts n n n a name, finite set of attributes represent an entity in the source database or in the DW Attributes n n same role as in ER/dimensional models a granular module of information We do not employ standard UML notation for concepts and attributes, for the reason that we need to treat attributes as first class citizens of our model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 9

Conceptual Model n Part-of Relationships n n finite set of attributes emphasize the fact

Conceptual Model n Part-of Relationships n n finite set of attributes emphasize the fact that a concept is composed of a set of attributes Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 10

Conceptual Model n Example n Source 1 p n S 1. PARTSUPP {PKEY, SUPPKEY,

Conceptual Model n Example n Source 1 p n S 1. PARTSUPP {PKEY, SUPPKEY, QTY, COST} Data Warehouse p DW. PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST} Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 11

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 12

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 12

Conceptual Model n Transformations n n n finite set of input/output attributes, a symbol

Conceptual Model n Transformations n n n finite set of input/output attributes, a symbol abstractions that represent parts, or full modules of code, executing a single task two categories: filtering or data cleaning operations (e. g. , foreign key violations) p transformation operations (e. g. , aggregation) p Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 13

Conceptual Model n Provider Relationships n n * finite set of input/output attributes, an

Conceptual Model n Provider Relationships n n * finite set of input/output attributes, an appropriate transformation map a set of input attributes to a set of output attributes through a relevant transformation* If the attributes are semantically and physically compatible, no transformation is required Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 14

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 15

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 15

Conceptual Model n Notes n n informal tags, exactly as in UML modeling used

Conceptual Model n Notes n n informal tags, exactly as in UML modeling used for: simple comments explaining design decisions p explanation of the semantics of the applied transformation p tracing of runtime constraints p Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 16

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 17

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 17

Conceptual Model n ETL Constraints n n finite set of attributes, a single transformation

Conceptual Model n ETL Constraints n n finite set of attributes, a single transformation express the fact that the data of a certain concept fulfill several requirements Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 18

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 19

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 19

Conceptual Model n Candidate Relationships n n a single candidate concept, a single target

Conceptual Model n Candidate Relationships n n a single candidate concept, a single target concept used when a certain DW concept is populated by a finite set of more than one candidate source concepts Active Candidate Relationship n n a certain candidate that has been selected for the population of the target concept a specialization of candidate relationships Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 20

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 21

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 21

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 22

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 22

Outline n n n Motivation Conceptual Model Instantiation and Specialization Layers Methodology for the

Outline n n n Motivation Conceptual Model Instantiation and Specialization Layers Methodology for the usage of the conceptual model Conclusions and Future Work Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 23

Instantiation & Specialization Layers n The key issues: n generecity p n identification of

Instantiation & Specialization Layers n The key issues: n generecity p n identification of a small set of generic constructs to capture all cases usability p construction of a ‘palette’ of frequently used types Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 24

Instantiation & Specialization Layers n Metamodel layer n n n Template layer n n

Instantiation & Specialization Layers n Metamodel layer n n n Template layer n n a set of generic entities, able to represent any ETL scenario involves classes: Concept, Attribute, Transformation, ETL Constraint and Relationship a set of ‘built-in’ specializations of the entities of the Metamodel layer, specifically tailored for the most frequent elements of ETL scenarios Schema layer n n a specific ETL scenario all the entities of the Schema layer are instances of the classes of the Metamodel layer Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 25

Instantiation & Specialization Layers Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 26

Instantiation & Specialization Layers Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 26

Instantiation & Specialization Layers n Template layer n Four groups of logical transformations Filters

Instantiation & Specialization Layers n Template layer n Four groups of logical transformations Filters p Unary transformations p Binary transformations p Composite transformations p n Two groups of physical transformations Transfer operations p File operations p Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 27

Instantiation & Specialization Layers Filters Composite transformations Selection (σ) Not null (NN) Primary key

Instantiation & Specialization Layers Filters Composite transformations Selection (σ) Not null (NN) Primary key violation (PK) Foreign key violation (FK) Unique value (UN) Domain mismatch DM) Slowly changing dimension (Type 1, 2, 3) (SDC-1/2/3) Format mismatch (FM) Data type conversion (DTC) Switch (σ*) Extended union (U) Unary transformations Push Aggregation (γ) Projection (π) Function application (f) Surrogate key assignment(SK) Tuple normalization (N) Tuple denormalization (DN) File operations EBCDIC to ASCII conversion (EB 2 AS) Sort file (Sort) Transfer operations Ftp (FTP) Compress/Decompress (Z/d. Z) Encrypt/Decrypt (Cr/d. Cr) Binary transformations Union (U) Join ( ) Diff (Δ) Update Detection (ΔUPD) Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 28

Outline n n n Introduction Motivation Conceptual Model Instantiation and Specialization Layers Methodology for

Outline n n n Introduction Motivation Conceptual Model Instantiation and Specialization Layers Methodology for the usage of the conceptual model Conclusions and Future Work Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 29

Methodology n Step 1 n n Step 2 n n Candidates and active candidates

Methodology n Step 1 n n Step 2 n n Candidates and active candidates for the involved data stores Step 3 n n Identification of the proper data stores Attribute mapping between the providers and the consumers Step 4 n Annotating the diagram with runtime constraints Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 30

Outline n n n Introduction Motivation Conceptual Model Instantiation and Specialization Layers Methodology for

Outline n n n Introduction Motivation Conceptual Model Instantiation and Specialization Layers Methodology for the usage of the conceptual model Conclusions and Future Work Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 31

Conclusions n Our contributions lies in: n n n The proposal of a novel

Conclusions n Our contributions lies in: n n n The proposal of a novel conceptual model which is customized for the tracing of interattribute relationships and the respective ETL activities A customizable and extensible construction The introduction of a 'palette' of a set of frequently used ETL activities Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 32

On-going/Future Work The Arktos II project is aimed towards the n Conceptual modeling n

On-going/Future Work The Arktos II project is aimed towards the n Conceptual modeling n Logical modeling n Optimization n What-if analysis of ETL scenarios http: //www. dblab. ece. ntua. gr/ ~pvassil/projects/arktos_II Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 33

Thank you Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 34

Thank you Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 34

Back-up slides Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 35

Back-up slides Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 35

Logical Model [DMDW’ 02] DS. PS_NEW 1. PKEY, DS. PS_OLD 1. PKEY SUPPKEY=1 DS.

Logical Model [DMDW’ 02] DS. PS_NEW 1. PKEY, DS. PS_OLD 1. PKEY SUPPKEY=1 DS. PS 1. PKEY, LOOKUP_PS. SKEY, SUPPKEY COST DATE DS. PS_NEW 1 DIFF 1 DS. PS 1 Add_SPK 1 SK 1 rejected DS. PS_OLD 1 DS. PS_NEW 2. PKEY, DS. PS_OLD 2. PKEY SUPPKEY=2 DS. PS_NEW 2 DIFF 2 DS. PS 2 Add_SPK 2 rejected Log DS. PS 2. PKEY, LOOKUP_PS. SKEY, SUPPKEY COST DATE=SYSDATE Not. NULL SK 2 DSA Log U rejected Log rejected DS. PS_OLD 2 A 2 EDate $2€ Add. Date QTY>0 Check. QTY rejected Log PKEY, DAY MIN(COST) S 1_PARTSUPP FTP 1 Aggregate 1 DW. PARTSUPP V 1 DW. PARTSUPP. DATE, PKEY, MONTH AVG(COST) DAY S 2_PARTSUPP FTP 2 Sources Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 TIME Aggregate 2 V 2 DW 36

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 37

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 37

The lifecycle of a Data Warehouse and its ETL processes Vassiliadis, Simitsis, Skiadopoulos -

The lifecycle of a Data Warehouse and its ETL processes Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 38

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 39

Conceptual Model Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 39

Conceptual Model n General Notes n n It is not a process/workflow model It

Conceptual Model n General Notes n n It is not a process/workflow model It is orthogonal to the conceptual models which are available for the modeling of DW star schemata It is specifically tailored for the back end of the DW Any of the proposals for the DW front end can be combined with our approach Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 40

Conceptual Model n Serial Composition of Transformations n n a single initiating transformation, a

Conceptual Model n Serial Composition of Transformations n n a single initiating transformation, a single subsequent transformation combine several transformations in a single provider relationship Vassiliadis, Simitsis, Skiadopoulos - DOLAP'02 41