4 Extraction Transformation and Loading ETL Extraction and

  • Slides: 41
Download presentation
4 Extraction, Transformation, and Loading (ETL) Extraction and Transportation Copyright © 2005, Oracle. All

4 Extraction, Transformation, and Loading (ETL) Extraction and Transportation Copyright © 2005, Oracle. All rights reserved.

Objectives After completing this lesson, you should be able to do the following: •

Objectives After completing this lesson, you should be able to do the following: • Describe the core ETL framework inside the database and its integration advantage • Explain data warehousing extraction methods • Identify transportation methods: – Flat file – Distributes operations – Transportablespaces • 4 -2 Describe transformation flow Copyright © 2005, Oracle. All rights reserved.

Overview • • • 4 -3 Lesson 4: Extraction/Transportation Lesson 5: Loading Lesson 6:

Overview • • • 4 -3 Lesson 4: Extraction/Transportation Lesson 5: Loading Lesson 6: Transformation Copyright © 2005, Oracle. All rights reserved.

What Is ETL? • • ETL is an acronym for Extraction, Transformation, and Loading.

What Is ETL? • • ETL is an acronym for Extraction, Transformation, and Loading. The following happen during the ETL process: – The desired data is identified and extracted from many different sources. – Some transformations may take place during this extraction process. – After extraction, the data must be transported to a target system or an intermediate system for further processing. – Depending on the method of transportation, some transformations can be done simultaneously. • 4 -4 ETL refers to a broad process. Copyright © 2005, Oracle. All rights reserved.

Extraction Methods • Extraction can be thought of in two parts: – Extraction –

Extraction Methods • Extraction can be thought of in two parts: – Extraction – Transportation • There are two extraction methods: – Logical – Physical • • Your logical choice influences the way the data is physically extracted. Some criteria for choosing a combination: – – 4 -6 Business needs Location of the source and target systems Availability of the source system Time required to extract data Copyright © 2005, Oracle. All rights reserved.

Logical Extraction Methods There are two kinds of logical extraction: • Full extraction –

Logical Extraction Methods There are two kinds of logical extraction: • Full extraction – All data is pulled – Less information to track – More time required to pull the data • Incremental extraction – A subset of data is pulled – Must track what data needs to be pulled – Less time required to pull the data 4 -7 Copyright © 2005, Oracle. All rights reserved.

Physical Extraction Methods There are two types of physical extraction. • Online extraction: –

Physical Extraction Methods There are two types of physical extraction. • Online extraction: – Pulls data from the source system • Offline extraction: – Pulls data from a staging area – Staging areas include flat files, dump files, and transportablespaces. 4 -9 Copyright © 2005, Oracle. All rights reserved.

Offline Extraction Staging areas: • Flat files – Requires data in a predefined, generic

Offline Extraction Staging areas: • Flat files – Requires data in a predefined, generic format • Dump files – Must be in an Oracle-specific format • Redo and archive logs – Data located in special dump files • Transportablespaces – Powerful, fast method for moving large volumes of data 4 -10 Copyright © 2005, Oracle. All rights reserved.

Implementing Methods of Extraction • Extracting to a file: – – • 4 -11

Implementing Methods of Extraction • Extracting to a file: – – • 4 -11 Spooling from SQL*Plus Using OCI or Pro*C to dump to a file Using Data Pump to export to an Oracle dump file Using external tables Extracting through distributed operations Copyright © 2005, Oracle. All rights reserved.

Incremental Extraction Using CDC can capture and publish committed change data in either of

Incremental Extraction Using CDC can capture and publish committed change data in either of the following modes: • Synchronous – Triggers on the source database allow change data to be captured immediately. – Change data is captured as part of the transaction modifying the source table. • Asynchronous – Change data is captured after a SQL statement performing DML is committed using the redo logs. – Asynchronous Change Data Capture is built on Oracle Streams. 4 -13 Copyright © 2005, Oracle. All rights reserved.

Publish and Subscribe Model The publisher performs the following tasks: • Identifies source tables

Publish and Subscribe Model The publisher performs the following tasks: • Identifies source tables from which the data warehouse is interested in capturing change data • Uses the DBMS_CDC_PUBLISH package to: – Set up the capture of data from the source tables – Determine and advance the change sets – Publish the change data • 4 -14 Allows controlled access to subscribers using the SQL GRANT and REVOKE statements Copyright © 2005, Oracle. All rights reserved.

Publish and Subscribe Model The subscriber uses the DBMS_CDC_SUBSCRIBE package to: • Subscribe to

Publish and Subscribe Model The subscriber uses the DBMS_CDC_SUBSCRIBE package to: • Subscribe to source tables • Extend the window and create change view • Prepare the subscriber views • View data stored in change tables • Purge the subscriber view • Remove the subscriber views 4 -16 Copyright © 2005, Oracle. All rights reserved.

Synchronous CDC Source database SYNC_SOURCE Change source Source database transactions Change set Trigger execution

Synchronous CDC Source database SYNC_SOURCE Change source Source database transactions Change set Trigger execution Change tables Source tables Subscriber views 4 -18 Copyright © 2005, Oracle. All rights reserved.

Asynchronous CDC: • Captures change data from redo log files after changes have been

Asynchronous CDC: • Captures change data from redo log files after changes have been committed to the source database • Modes are dependent on the level of supplemental logging used on the source database • Uses Oracle Streams to capture change data from redo log files • Has three source modes: – Asynchronous Auto. Log mode – Asynchronous Hot. Log mode – Asynchronous Distributed Hot. Log mode 4 -19 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Auto. Log Mode Staging database Source database transactions Distributed Auto. Log change set

Asynchronous Auto. Log Mode Staging database Source database transactions Distributed Auto. Log change set Change set LOG_ARCHIVE_DEST_2 RFS Change tables Source tables LGWR Online redo logs Streams capture Distributed Auto. Log change source 4 -20 Standby redo logs Copyright © 2005, Oracle. All rights reserved. Subscriber views

Asynchronous Hot. Log Configuration Source database HOTLOG_SOURCE Change Source database transactions Change set LGWR

Asynchronous Hot. Log Configuration Source database HOTLOG_SOURCE Change Source database transactions Change set LGWR Change tables Streams local capture Source tables Online redo logs 4 -22 Subscriber views Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Mode Staging database Source database Distributed Hot. Log change set

Asynchronous Distributed Hot. Log Mode Staging database Source database Distributed Hot. Log change set Source database transactions DBlink Change set Source tables LGWR Streams propagation Change tables Online redo logs Distributed Hot. Log change source 4 -23 DBlink Subscriber views Copyright © 2005, Oracle. All rights reserved.

Preparing to Publish Change Data 1. Gather requirements from the subscribers. 2. Determine which

Preparing to Publish Change Data 1. Gather requirements from the subscribers. 2. Determine which source database contains the relevant source tables. 3. Choose the capture mode: • • Synchronous Asynchronous Hot. Log Asynchronous Distributed Hot. Log Asynchronous Auto. Log 4. Ensure that the source and staging databases have appropriate database initialization parameters set. 5. Set up database links between the source database and the staging database. 4 -24 Copyright © 2005, Oracle. All rights reserved.

Creating a Publisher User • The staging database publisher must be granted the following

Creating a Publisher User • The staging database publisher must be granted the following privileges and roles: – EXECUTE_CATALOG_ROLE privilege – SELECT_CATALOG_ROLE privilege – CREATE TABLE and CREATE SESSION privileges – EXECUTE on the DBMS_CDC_PUBLISH package • 4 -25 Create a default tablespace for the publisher. Copyright © 2005, Oracle. All rights reserved.

Synchronous Publishing 1. Create a change set. BEGIN DBMS_CDC_PUBLISH. CREATE_CHANGE_SET( change_set_name => 'CHICAGO_DAILY', description

Synchronous Publishing 1. Create a change set. BEGIN DBMS_CDC_PUBLISH. CREATE_CHANGE_SET( change_set_name => 'CHICAGO_DAILY', description => 'Change set for sales history info', change_source_name => 'SYNC_SOURCE'); END; 2. Create a change table. 3. Grant access to subscribers. GRANT SELECT ON cdcpub. products_ct TO subscriber 1; 4 -27 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Publishing Prepare the source and staging databases: 1. Configure Oracle

Asynchronous Distributed Hot. Log Publishing Prepare the source and staging databases: 1. Configure Oracle Net so that the source database can communicate with the staging database. 2. Set initialization parameters on the source database. compatible = 10. 2. 0 global_names = true job_queue_processes = <current value> + 2 open_links = 4 parallel_max_servers = <current value> + 3 processes = <current value> + 4 sessions = <current value> + 1 streams_pool_size = <current value> + 20 MB undo_retention = 3600 3. Set initialization parameters on the staging database. 4 -29 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Publishing Prepare the staging database: • Set the database initialization

Asynchronous Distributed Hot. Log Publishing Prepare the staging database: • Set the database initialization parameters on the staging database. compatible = 10. 2. 0 global_names = true java_pool_size = 50000000 open_links = 4 job_queue_processes = 2 parallel_max_servers = <current_value> + 2 processes = <current_value> + 3 sessions = <current value> + 1 streams_pool_size = <current_value> + 11 MB undo_retention = 3600 4 -30 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Publishing Alter the source database: 1. Place the database into

Asynchronous Distributed Hot. Log Publishing Alter the source database: 1. Place the database into FORCE LOGGING logging mode to protect against unlogged direct writes. ALTER DATABASE FORCE LOGGING; 2. Enable supplemental logging. ALTER DATABASE ADD SUPPLEMENTAL LOG DATA; 3. Create an unconditional log group on all columns to be captured in the source table. ALTER TABLE SH. PRODUCTS ADD SUPPLEMENTAL LOG GROUP log_group_products (PROD_ID, PROD_NAME, PROD_LIST_PRICE) ALWAYS; 4 -31 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Publishing Publisher privileges on source and staging databases: 1. Create

Asynchronous Distributed Hot. Log Publishing Publisher privileges on source and staging databases: 1. Create and grant privileges to the source database publisher. CREATE USER source_cdcpub IDENTIFIED BY source_cdcpub QUOTA UNLIMITED ON SYSTEM QUOTA UNLIMITED ON SYSAUX; GRANT CREATE SESSION TO source_cdcpub; GRANT DBA TO source_cdcpub; GRANT CREATE DATABASE LINK TO source_cdcpub; GRANT EXECUTE on DBMS_CDC_PUBLISH TO source_cdcpub; GRANT EXECUTE_CATALOG_ROLE TO source_cdcpub; GRANT SELECT_CATALOG_ROLE TO source_cdcpub; EXECUTE DBMS_STREAMS_AUTH. GRANT_ADMIN_PRIVILEGE( GRANTEE=> 'source_cdcpub'); 2. Create and grant privileges to the staging database publisher. 4 -32 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Publishing Create source and staging database links: 1. Create the

Asynchronous Distributed Hot. Log Publishing Create source and staging database links: 1. Create the source database link. CREATE DATABASE LINK staging_db CONNECT TO staging_cdcpub IDENTIFIED BY staging_cdcpub USING 'staging_db'; 2. Create the staging database link. CREATE DATABASE LINK source_db CONNECT TO source_cdcpub IDENTIFIED BY source_cdcpub USING 'source_db'; 4 -34 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Publishing Create change sources and change sets: 1. Create the

Asynchronous Distributed Hot. Log Publishing Create change sources and change sets: 1. Create the change sources. BEGIN DBMS_CDC_PUBLISH. CREATE_HOTLOG_CHANGE_SOURCE( change_source_name => 'CHICAGO', description => 'test source', source_database => 'source_db'); END; 2. Create the change sets. DBMS_CDC_PUBLISH. CREATE_CHANGE_SET( change_set_name => 'CHICAGO_DAILY', description => 'change set for product info', change_source_name => 'CHICAGO', stop_on_ddl => 'y'); END; 4 -35 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Publishing Create the change tables on the staging database: BEGIN

Asynchronous Distributed Hot. Log Publishing Create the change tables on the staging database: BEGIN DBMS_CDC_PUBLISH. CREATE_CHANGE_TABLE( owner => 'staging_cdcpub', change_table_name => 'products_ct', change_set_name => 'CHICAGO_DAILY', source_schema => 'SH', source_table => 'PRODUCTS', column_type_list => 'PROD_ID NUMBER(6), PROD_NAME VARCHAR 2(50), PROD_LIST_PRICE NUMBER(8, 2), JOB_ID VARCHAR 2(10), DEPARTMENT_ID NUMBER(4)', capture_values => 'both', rs_id => 'y', row_id => 'n', . . . options_string => 'TABLESPACE TS_CHICAGO_DAILY'); END; 4 -36 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Publishing Enable the change source and change set: 1. Enable

Asynchronous Distributed Hot. Log Publishing Enable the change source and change set: 1. Enable the change source. BEGIN DBMS_CDC_PUBLISH. ALTER_HOTLOG_CHANGE_SOURCE( change_source_name => 'CHICAGO', enable_source => 'Y'); END; 2. Enable the change set. BEGIN DBMS_CDC_PUBLISH. ALTER_CHANGE_SET( change_set_name => 'CHICAGO_DAILY', enable_capture => 'y'); END; 3. Grant access to subscribers. 4 -38 Copyright © 2005, Oracle. All rights reserved.

Subscribing to Change Data 1. Find the source tables for which the subscriber has

Subscribing to Change Data 1. Find the source tables for which the subscriber has access privileges. SQL> SELECT * FROM ALL_SOURCE_TABLES; SOURCE_SCHEMA_NAME SOURCE_TABLE_NAME -----------------SH PRODUCTS 2. Find the change set names and columns for which the subscriber has access privileges. SQL> SELECT UNIQUE CHANGE_SET_NAME, COLUMN_NAME, PUB_ID FROM 2 ALL_PUBLISHED_COLUMNS WHERE SOURCE_SCHEMA_NAME ='SH' AND 3 SOURCE_TABLE_NAME = 'PRODUCTS'; CHANGE_SET_NAME COLUMN_NAME PUB_ID ------------------CHICAGO_DAILY PROD_ID 41494 CHICAGO_DAILY PROD_LIST_PRICE 41494 CHICAGO_DAILY PROD_NAME 41494 4 -39 Copyright © 2005, Oracle. All rights reserved.

Subscribing to Change Data 3. Create a subscription. BEGIN DBMS_CDC_SUBSCRIBE. CREATE_SUBSCRIPTION( change_set_name => 'CHICAGO_DAILY',

Subscribing to Change Data 3. Create a subscription. BEGIN DBMS_CDC_SUBSCRIBE. CREATE_SUBSCRIPTION( change_set_name => 'CHICAGO_DAILY', description => 'Change data for PRODUCTS', subscription_name => 'SALES_SUB'); END; 4. Subscribe to a source table and columns. BEGIN DBMS_CDC_SUBSCRIBE( subscription_name => 'SALES_SUB', source_schema => 'SH', source_table => 'PRODUCTS', column_list => 'PROD_ID, PROD_NAME, PROD_LIST_PRICE', subscriber_view => 'SALES_VIEW'); END; 4 -40 Copyright © 2005, Oracle. All rights reserved.

Subscribing to Change Data 5. Activate the subscription. BEGIN DBMS_CDC_SUBSCRIBE. ACTIVATE_SUBSCRIPTION( subscription_name => 'SALES_SUB');

Subscribing to Change Data 5. Activate the subscription. BEGIN DBMS_CDC_SUBSCRIBE. ACTIVATE_SUBSCRIPTION( subscription_name => 'SALES_SUB'); END; 6. Get the next set of change data. BEGIN DBMS_CDC_SUBSCRIBE. EXTEND_WINDOW( subscription_name => 'SALES_SUB'); END; 4 -41 Copyright © 2005, Oracle. All rights reserved.

Subscribing to Change Data 7. Query the subscriber views. SELECT PROD_ID, PROD_NAME, PROD_LIST_PRICE FROM

Subscribing to Change Data 7. Query the subscriber views. SELECT PROD_ID, PROD_NAME, PROD_LIST_PRICE FROM SALES_VIEW; PROD_ID PROD_NAME PROD_LIST_PRICE --------------------------30 And 2 Crosscourt Tee Kids 14. 99 30 And 2 Crosscourt Tee Kids 17. 66 10 Gurfield& Murks Pleated Trousers 17. 99 10 Gurfield& Murks Pleated Trousers 21. 99 8. Indicate that the change data is no longer needed. BEGIN DBMS_CDC_SUBSCRIBE. PURGE_WINDOW( subscription_name => 'SALES_SUB'); END; 9. End the subscription. 4 -42 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Source Database Initialization Parameters For all Oracle Database 10 g

Asynchronous Distributed Hot. Log Source Database Initialization Parameters For all Oracle Database 10 g releases: Parameter Value COMPATIBLE GLOBAL_NAMES JOB_QUEUE_PROCESSES 10. 2. 0 or 10. 0. 0 TRUE OPEN_LINKS Should be equal to the number of Distributed Hot. Log change sources planned The current value + (3 times the number of change sources planned) The current value + (4 times the number of change sources planned) The current value + (the number of change sources planned) 3600 PARALLEL_MAX_SERVERS PROCESSES SESSIONS UNDO_RETENTION 4 -43 Maximum number of DBMS_JOB jobs that can run simultaneously plus 2 Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Source Database Initialization Parameters For Oracle 9. 2 databases: Parameter

Asynchronous Distributed Hot. Log Source Database Initialization Parameters For Oracle 9. 2 databases: Parameter Value COMPATIBLE GLOBAL_NAMES JOB_QUEUE_PROCESSES 9. 2. 0 TRUE LOG_PARALLELISM 1 LOGMNR_MAX_PERSISTENT_ SESSIONS OPEN_LINKS The number of change sources planned PARALLEL_MAX_SERVERS PROCESSES 4 -44 Maximum number of DBMS_JOB jobs that can run simultaneously plus 2 The number of Distributed Hot. Log change sources planned The current value + (3 times the number of change sources planned) The current value + (the number of change sources planned) Copyright © 2005, Oracle. All rights reserved.

Asynchronous Distributed Hot. Log Staging Database Initialization Parameters For Oracle Database 10 g Release

Asynchronous Distributed Hot. Log Staging Database Initialization Parameters For Oracle Database 10 g Release 2: Parameter Value COMPATIBLE GLOBAL_NAMES JAVA_POOL_SIZE OPEN_LINKS 10. 2. 0 TRUE 50000000 PARALLEL_MAX_SERVERS PROCESSES SESSIONS STREAMS_POOL_SIZE 4 -45 Equal to the number of Distributed Hot. Log change sources planned, but no less than 4 The current value + (2 times the number of change sources planned) The current value + (3 times the number of change sources planned) The current value + (the number of change sources planned) Set to the current value + ((the number of change sources planned) * (11 MB)) Copyright © 2005, Oracle. All rights reserved.

Data Dictionary Views Supporting CDC • • • 4 -46 CHANGE_SOURCES lists existing change

Data Dictionary Views Supporting CDC • • • 4 -46 CHANGE_SOURCES lists existing change sources. CHANGE_SETS lists existing change sets. CHANGE_PROPAGATIONS describes the streams propagation associated with a given distributed Hot. Log change source on the source database. CHANGE_TABLES lists existing change tables. DBA_SOURCE_TABLES lists published source tables. DBA_PUBLISHED_COLUMNS lists published source table columns. DBA_SUBSCRIPTIONS lists all registered subscriptions. DBA_SUBSCRIBED_TABLES lists published tables to which subscribers have subscribed. DBA_SUBSCRIBED_COLUMNS lists the columns of tables to which subscribers have subscribed. Copyright © 2005, Oracle. All rights reserved.

Transportation in a Data Warehouse Three basic choices in transportation: • Transportation using flat

Transportation in a Data Warehouse Three basic choices in transportation: • Transportation using flat files • Transportation through distributed operations • Transportation using transportablespaces 4 -47 Copyright © 2005, Oracle. All rights reserved.

Transportable Tablespaces • • 4 -48 This is the fastest method for moving large

Transportable Tablespaces • • 4 -48 This is the fastest method for moving large volumes of data. Source and target databases can have different block sizes. The method is especially useful for transporting data from OLTP to data warehouse. Before Oracle Database 10 g, source and target databases needed to use the same operating system. Copyright © 2005, Oracle. All rights reserved.

Transportable Tablespaces: Example 1. Place the data into its own tablespace. CREATE TABLE temp_jan_sales

Transportable Tablespaces: Example 1. Place the data into its own tablespace. CREATE TABLE temp_jan_sales NOLOGGING TABLESPACE ts_temp_sales AS SELECT * FROM sales WHERE time_id BETWEEN '31 -DEC-1999' AND '01 -FEB-2000'; 2. Export the metadata. EXPDP DIRECTORY=DW_DUMP_DIR DUMPFILE=jan. dmp TRANSPORT_TABLESPACES=ts_temp_sales 3. Copy the data and export file to the target system. 4. Import the metadata. IMPDP DIRECTORY=DM_DUMP_DIR DUMPFILE=jan. dmp TRANSPORT_DATAFILES='/db/tempjan. f' 5. Insert the new data into the fact table or employ the partition exchange feature. 4 -50 Copyright © 2005, Oracle. All rights reserved.

Summary In this lesson, you should have learned how to: • Describe the core

Summary In this lesson, you should have learned how to: • Describe the core ETL framework inside the database and its integration advantage • Explain data warehousing extraction methods • Identify transportation methods: – Flat file – Distributes operations – Transportablespaces • 4 -52 Describe transformation flow Copyright © 2005, Oracle. All rights reserved.

Practice 4: Overview This practice covers the following topics: • Loading data from a

Practice 4: Overview This practice covers the following topics: • Loading data from a flat file by using SQL*Loader • Configuring synchronous Change Data Capture • Loading data from a transportablespace by using Data Pump 4 -53 Copyright © 2005, Oracle. All rights reserved.