What Is ETL? Extract Transform Load

What Is ETL? Extract Transform Load • ETL is the automated and auditable data acquisition process from source systems that involves one or more sub-processes of data extraction, data cleaning, data transportation, data transformation, data consolidation, data integration and data loading.

ETL • ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers the process of loading data from the source system into the data warehouse. Nowadays ETL commonly includes cleaning as a separate step, so the sequence becomes Extract-Clean-Transform-Load.

Data Quality • Data quality is a critical factor for the success of data warehousing projects. If data is of inadequate quality, then the knowledge workers who query the data warehouse and the decision makers who receive the information cannot trust the results. Data quality has to be maintained for individual records, or even small bits of information, to ensure the accuracy of the complete database. Three powerful forces have converged to put data quality concerns near the top of the list for executives:
• First, the long-term cultural trend that says, “If only I could see the data, then I could manage my business better” continues to grow; today’s knowledge workers believe instinctively that data is a crucial requirement for them to function in their jobs.
• Second, most organizations understand that their data sources are profoundly distributed, typically around the world, and that effectively integrating a myriad of disparate data sources is required.
• Third, sharply increased demands for compliance mean careless handling of data will not be overlooked or excused.

Data Profiling
• Data profiling is the technical analysis of data to describe its content, consistency, and structure.
• In some sense, any time you perform a SELECT DISTINCT investigative query on a database field, you are doing data profiling.
• There are a variety of tools specifically designed to do powerful profiling.
• Data profiling plays two distinct roles: strategic and tactical.
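
As a minimal illustration of profiling a single column, here is a sketch in Python using the standard sqlite3 module. The table and column names ("customers", "state") and the sample data are hypothetical, and the statistics shown are only a small subset of what dedicated profiling tools report.

```python
# Minimal data-profiling sketch (illustrative only; table, column and
# sample data are hypothetical).
import sqlite3

def profile_column(conn, table, column):
    """Report basic content statistics for one column."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}) FROM {table}")
    total, non_null, distinct = cur.fetchone()
    cur.execute(f"SELECT {column}, COUNT(*) FROM {table} "
                f"GROUP BY {column} ORDER BY COUNT(*) DESC LIMIT 5")
    top_values = cur.fetchall()
    return {
        "rows": total,
        "nulls": total - non_null,   # missing values
        "distinct": distinct,        # cardinality of non-null values
        "top_values": top_values,    # most frequent values
    }

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, state TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "CT"), (2, "ct"), (3, None), (4, "CT")])
    print(profile_column(conn, "customers", "state"))
```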

Data enrichment • Data enrichment is adding/combining external data values or rules to enrich the information already existing in the data. • Example: if we can get a list that provides the relationship between ZIP Code, City and State, then an address field with ZIP code 06905 can be safely enriched by doing a lookup on this table: ZIP Code 06905 –> City Stamford –> State CT.
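
A minimal sketch of that ZIP-code enrichment in Python. The lookup table here contains only the example value from the text; a real enrichment source would be an external reference dataset.

```python
# Enrichment sketch: fill in City/State from a ZIP code lookup table.
# The table below is illustrative; a real one would come from an external provider.
ZIP_LOOKUP = {
    "06905": {"city": "Stamford", "state": "CT"},
}

def enrich_address(record):
    """Add city/state when they are missing and the ZIP code is known."""
    match = ZIP_LOOKUP.get(record.get("zip"))
    if match:
        record.setdefault("city", match["city"])
        record.setdefault("state", match["state"])
    return record

print(enrich_address({"zip": "06905", "street": "1 Main St"}))
# {'zip': '06905', 'street': '1 Main St', 'city': 'Stamford', 'state': 'CT'}
```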

Data Deduplication • Data Deduplication is a process of removing redundant data. The result of Data Deduplication is a unique list of values which can be used to populate data warehouse dimension tables. • The typical business example would be creating a unique list of customers. • Data deduplication is also a technique for reducing the amount of storage space an organization needs to save its data.
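
A minimal deduplication sketch for the customer example, assuming a simple normalized natural key (real matching rules are usually richer, e.g. fuzzy name or address matching).

```python
# Deduplication sketch: build a unique customer list keyed on a normalized
# natural key (illustrative key choice and sample data).
def deduplicate(customers, key=lambda c: (c["name"].strip().lower(), c["zip"])):
    seen = {}
    for c in customers:
        seen.setdefault(key(c), c)   # keep the first occurrence of each key
    return list(seen.values())

rows = [
    {"name": "Acme Corp", "zip": "06905"},
    {"name": "acme corp ", "zip": "06905"},   # duplicate after normalization
    {"name": "Globex", "zip": "10001"},
]
print(deduplicate(rows))   # two unique customers remain
```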

ETL architecture

Extract System • The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.

There are several ways to perform the extract:
• Update notification - if the source system is able to provide a notification that a record has been changed and describe the change, this is the easiest way to get the data.
• Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify changes and propagate them down. Note that by using a daily extract, we may not be able to handle deleted records properly.
• Full extract - some systems are not able to identify which data has been changed at all, so a full extract is the only way one can get the data out of the system. The full extract requires keeping a copy of the last extract in the same format in order to be able to identify changes. Full extract handles deletions as well.
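
For the full-extract case, comparing today's extract with the saved copy of the previous extract might look like the following sketch. The record layout (id mapped to a row dict) and the sample data are assumed for illustration.

```python
# Change detection for the "full extract" case: compare the current extract
# with the previous one to find inserts, updates and deletes.
def diff_extracts(previous, current):
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted  = {k: v for k, v in previous.items() if k not in current}
    updated  = {k: v for k, v in current.items()
                if k in previous and previous[k] != v}
    return inserted, updated, deleted

prev = {1: {"name": "Acme", "city": "Stamford"}, 2: {"name": "Globex", "city": "NYC"}}
curr = {1: {"name": "Acme", "city": "Norwalk"}, 3: {"name": "Initech", "city": "Austin"}}
print(diff_extracts(prev, curr))
# inserts: id 3, updates: id 1, deletes: id 2
```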

Clean • The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules, such as:
• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are all translated to the standard Male/Female/Unknown)
• Converting null values into a standardized Not Available/Not Provided value
• Converting phone numbers and ZIP codes to a standardized form
• Validating address fields and converting them into proper naming, e.g. Street/St/St./Str
• Validating address fields against each other (State/Country, City/State, City/ZIP code, City/Street).
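
A minimal sketch of two of these rules, unifying sex codes and standardizing phone numbers. The mappings and the assumed 10-digit US phone format are illustrative, not a complete cleaning rule set.

```python
# Cleaning sketch: unify sex codes and normalize phone numbers.
import re

SEX_MAP = {
    "m": "Male", "male": "Male", "man": "Male",
    "f": "Female", "female": "Female", "woman": "Female",
}

def clean_sex(value):
    """Translate the various source codes to the standard Male/Female/Unknown."""
    if value is None:
        return "Unknown"
    return SEX_MAP.get(value.strip().lower(), "Unknown")

def clean_phone(value):
    """Keep digits only and reformat; assumes 10-digit US numbers for this sketch."""
    digits = re.sub(r"\D", "", value or "")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}" if len(digits) == 10 else None

print(clean_sex("M"), clean_sex(None), clean_phone("203.555.0142"))
# Male Unknown (203) 555-0142
```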

Transform • The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. a conformed dimension) using the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
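
Two of those tasks, generating surrogate keys and deriving a calculated value, in a minimal sketch. The column names and the currency conversion rate are assumed for illustration only.

```python
# Transform-step sketch: assign surrogate keys and derive a calculated value.
import itertools

_surrogate = itertools.count(1)   # simple surrogate-key generator
_key_cache = {}                   # natural key -> surrogate key

def surrogate_key(natural_key):
    """Return a stable surrogate key for a natural key."""
    if natural_key not in _key_cache:
        _key_cache[natural_key] = next(_surrogate)
    return _key_cache[natural_key]

def transform(row):
    return {
        "customer_key": surrogate_key(row["customer_id"]),  # surrogate key
        "amount_usd": round(row["amount_eur"] * 1.08, 2),    # unit conversion (assumed rate)
        "order_date": row["order_date"],
    }

print(transform({"customer_id": "C-42", "amount_eur": 100.0, "order_date": "2024-01-31"}))
```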

Load • During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them back only after the load completes. Referential integrity then needs to be maintained by the ETL tool to ensure consistency.
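
A minimal sketch of that pattern with SQLite: drop the index, bulk insert in one transaction, then rebuild. Table, index and data are illustrative; production databases usually also offer dedicated bulk-load utilities.

```python
# Load-step sketch: bulk insert with the index dropped and rebuilt afterwards.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (customer_key INTEGER, amount REAL)")
conn.execute("CREATE INDEX ix_sales_customer ON sales_fact (customer_key)")

rows = [(1, 19.99), (2, 5.00), (1, 42.50)]

conn.execute("DROP INDEX ix_sales_customer")           # disable index before load
with conn:                                              # load in a single transaction
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
conn.execute("CREATE INDEX ix_sales_customer ON sales_fact (customer_key)")  # rebuild

print(conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone())   # (3,)
```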

Change Data Capture System • During the data warehouse’s initial historic load, capturing source data content changes is not important because you load all data from a point in time forward. However, many data warehouse tables are so large that they cannot be refreshed during every ETL cycle. • The idea behind CDC is to just transfer the data that has changed since the last load. But building a good CDC system is not as easy as it sounds.

The key goals for the change data capture subsystem are:
• Isolate the changed source data to allow selective processing rather than a complete refresh.
• Capture all changes (deletions, edits, and insertions) made to the source data, including changes made through nonstandard interfaces.
• Tag changed data with reason codes to distinguish error corrections from true updates.
• Support compliance tracking with additional metadata.
• Perform the CDC step as early as possible, preferably before a bulk data transfer to the data warehouse.
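
One common (and deliberately simplified) way to implement the "only changed data" idea is a last-modified timestamp filter, sketched below with illustrative table and column names. As the goals above imply, a timestamp alone does not capture deletes or changes made through nonstandard interfaces, which is part of why real CDC is harder than it sounds.

```python
# CDC sketch using a last-modified timestamp filter.
import sqlite3

def extract_changes(conn, last_load_ts):
    """Return only rows changed since the previous load."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_load_ts,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Acme",   "2024-01-30T09:00:00"),
    (2, "Globex", "2024-02-02T14:30:00"),
])
print(extract_changes(conn, "2024-02-01T00:00:00"))   # only the Globex row
```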

Data Mart • Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.

Points to Remember About Data Marts
• Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.
• The life cycle of data marts may be complex in the long run if their planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data warehouse.
• Data marts are flexible.

Datamart

Data Cube • A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise preserves its records.

Load Manager • This component performs the operations required to extract and load data. • The size and complexity of the load manager varies between specific solutions, from one data warehouse to another. The load manager performs the following functions:
• Extract the data from the source system.
• Fast-load the extracted data into a temporary data store.
• Perform simple transformations into a structure similar to the one in the data warehouse.

Load Manager Architecture

Late Arriving Facts • Data warehouses are usually built around the ideal assumption that measured activity (fact records) arrives in the data warehouse at the same time as the context of the activity (dimension records). • When you have both the fact records and the correct contemporary dimension rows, you have the luxury of first maintaining the dimension keys and then using these up-to-date keys in the accompanying fact rows. However, for a variety of reasons, the ETL system may need to process late arriving fact or dimension data.

• In some environments, special modifications to the standard processing procedures may be needed to deal with late arriving facts, namely fact records that come into the warehouse very much delayed. • Late arriving dimensions occur when the activity measurement (fact record) arrives at the data warehouse without its full context. • For example, the identification of the new customer may come in a separate feed delayed by several hours; you may just be able to wait until the dependency is resolved. But in many situations, especially real-time environments, this delay is not acceptable. You cannot suspend the rows and wait for the dimension updates to occur; the business requirements demand that you make the fact row visible before knowing the dimensional context. • The ETL system needs additional capabilities to support this requirement.
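
The text does not prescribe a specific technique, but one common approach to a late-arriving dimension is sketched below: insert a placeholder dimension row for the unknown natural key so the fact row can be made visible immediately, then complete the placeholder when the real dimension record arrives. All names and data here are illustrative.

```python
# Placeholder-row sketch for a late-arriving dimension (one common approach,
# assumed here for illustration).
customer_dim = {}     # natural key -> dimension row (with surrogate key)
_next_key = [1]

def get_or_create_customer(natural_key):
    """Return the surrogate key, creating a placeholder row if needed."""
    if natural_key not in customer_dim:
        customer_dim[natural_key] = {
            "customer_key": _next_key[0],
            "natural_key": natural_key,
            "name": "Unknown (late arriving)",   # placeholder attributes
        }
        _next_key[0] += 1
    return customer_dim[natural_key]["customer_key"]

# The fact arrives before its customer record:
fact = {"customer_key": get_or_create_customer("C-99"), "amount": 12.50}

# Later, the dimension feed catches up and the placeholder is completed:
customer_dim["C-99"]["name"] = "New Customer Ltd"
print(fact, customer_dim["C-99"])
```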

Staging • It should be possible to restart at least some of the phases independently from the others. For example, if the transformation step fails, it should not be necessary to restart the Extract step. We can ensure this by implementing proper staging. Staging means that the data is simply dumped to a location (called the staging area) so that it can then be read by the next processing phase. The staging area is also used during the ETL process to store intermediate results of processing. The staging area should be accessed by the ETL process only; it should never be available to anyone else, particularly not to end users, as it is not intended for data presentation and may contain incomplete or in-the-middle-of-the-processing data.
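
A minimal file-based staging sketch: each phase dumps its output so a later phase can be restarted without re-running extraction. The directory, file names and data are illustrative; real staging areas are often database schemas or flat-file landing zones.

```python
# Staging sketch: dump intermediate results to a staging area for the next phase.
import json
import pathlib

STAGING_DIR = pathlib.Path("staging")
STAGING_DIR.mkdir(exist_ok=True)

def stage(name, rows):
    """Dump intermediate results for the next processing phase."""
    (STAGING_DIR / f"{name}.json").write_text(json.dumps(rows))

def read_stage(name):
    return json.loads((STAGING_DIR / f"{name}.json").read_text())

stage("extracted_customers", [{"id": 1, "name": "Acme"}])
# If the transform phase fails, it can be re-run from the staged extract:
print(read_stage("extracted_customers"))
```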

Look ups
• Lookup transformation - returns values from a database table or a flat file associated with a given input value.
• It can be used to look up data in a flat file, relational table, view or synonym.
• There are the following Lookup transformation types:
• Dynamic Lookup (connected)
• Unconnected lookup
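
As a rough sketch of what a lookup does, the snippet below caches a small flat-file reference table in memory and returns a value for a given input key. This is illustrative only and not tied to any specific ETL tool's lookup transformation; the reference data is hypothetical.

```python
# Lookup sketch: return values from a cached flat-file reference table
# for a given input value.
import csv
import io

REFERENCE_CSV = "zip,city,state\n06905,Stamford,CT\n10001,New York,NY\n"

def build_lookup(csv_text, key_field):
    """Cache the reference data keyed on the lookup field."""
    return {row[key_field]: row for row in csv.DictReader(io.StringIO(csv_text))}

zip_lookup = build_lookup(REFERENCE_CSV, "zip")

def lookup_city(zip_code, default=None):
    row = zip_lookup.get(zip_code)
    return row["city"] if row else default

print(lookup_city("06905"))                     # Stamford
print(lookup_city("99999", default="UNKNOWN"))  # UNKNOWN
```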

Time-lag • The Time-lag (or Backlog) is one of the Key Performance Indicators of ETL. It is defined here as the difference between today’s date and the date of the latest loaded data. • This time-lag is due to the time needed to collect, load and process the data in the ETL procedure.
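
That definition translates directly into a one-line calculation; the sample dates below are illustrative.

```python
# Time-lag KPI sketch: today's date minus the date of the latest loaded data.
from datetime import date

def time_lag_days(latest_loaded, today=None):
    """Time-lag (backlog) of the warehouse, in days."""
    today = today or date.today()
    return (today - latest_loaded).days

print(time_lag_days(date(2024, 1, 29), today=date(2024, 2, 1)))   # 3
```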

Scheduling and Dependency Matrix • The Dependency Scheduling Matrix (DSM) has been identified as a powerful tool to plan activity sequences and to identify and manage information exchanges. However, its application in scheduling is very limited. So far, DSM has been used to enable critical-path calculations by assigning the amount of effort/work done as durations to the activities.
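
As a rough illustration of that idea, the sketch below takes a dependency matrix of ETL jobs with assigned durations, sequences the jobs, and walks back from the latest-finishing job to list a critical path. The job names, dependencies and durations are hypothetical, and this is a simplified sketch rather than a full DSM method.

```python
# Dependency-and-duration sketch: sequence ETL jobs and find a critical path.
from graphlib import TopologicalSorter

durations = {"extract_orders": 3, "extract_customers": 1,
             "clean_customers": 1, "transform": 2, "load": 1}
depends_on = {"extract_orders": set(), "extract_customers": set(),
              "clean_customers": {"extract_customers"},
              "transform": {"extract_orders", "clean_customers"},
              "load": {"transform"}}

finish = {}          # job -> earliest finish time
critical_pred = {}   # job -> predecessor that determines its start
for job in TopologicalSorter(depends_on).static_order():
    start = max((finish[d] for d in depends_on[job]), default=0)
    finish[job] = start + durations[job]
    critical_pred[job] = max(depends_on[job], key=lambda d: finish[d], default=None)

# Walk back from the job that finishes last to list the critical path.
job = max(finish, key=finish.get)
path = [job]
while critical_pred[job]:
    job = critical_pred[job]
    path.append(job)
print(list(reversed(path)), finish)
# ['extract_orders', 'transform', 'load'] plus the finish times per job
```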