Helping Your Data Warehouse Succeed 10 Mistakes to

  • Slides: 34
Download presentation
Helping Your Data Warehouse Succeed: 10 Mistakes to Avoid in Data Integration Rafael Salas

Helping Your Data Warehouse Succeed: 10 Mistakes to Avoid in Data Integration Rafael Salas t: @Raf. Salas w: www. Rafael-Salas. com e: rfsalas@yahoo. es

About Rafael • DW BI – 12 years • SQL Server MVP • Solution

About Rafael • DW BI – 12 years • SQL Server MVP • Solution Architect - Quaero a CSG Systems Solution • Charlotte, NC • Lots mistakes along the way!

Mistakes are the portals of discovery. James Joyce

Mistakes are the portals of discovery. James Joyce

Today’s Plan • 10 mistakes to avoid • What, why, how to prevent them

Today’s Plan • 10 mistakes to avoid • What, why, how to prevent them • Share real life examples • No magic formulas

1 Ignoring Data Realities …Or finding them too late.

1 Ignoring Data Realities …Or finding them too late.

1 The Problem • Relying on common knowledge: – The data is ‘good’ –

1 The Problem • Relying on common knowledge: – The data is ‘good’ – I know this data well • We don’t have time • Cycle: Code Load Explode! • Research-Recode-Retest = Rework

1 The Fix Requirements Data Profiling Compare Clue: Business want ‘good’ quality data: •

1 The Fix Requirements Data Profiling Compare Clue: Business want ‘good’ quality data: • Accurate • Timeliness • Relevant • Complete • Understood • Trusted

1 Benefits • Early awareness about data quality issues • Better ETL development estimates

1 Benefits • Early awareness about data quality issues • Better ETL development estimates • Uncover new business rules • Better understanding of business requirements

1 How? • 3 rd Party tools • Hand crafted SQL queries • SSIS:

1 How? • 3 rd Party tools • Hand crafted SQL queries • SSIS: Data Profiling Task – – – Decent profiling Get up and running quickly SQL Server data sources only Output is XML Results can be loaded in a table – XSLT required

2 Exception Handling …Actually the lack thereof

2 Exception Handling …Actually the lack thereof

2 The Problem • Data’s ‘Buts and Ifs’ nobody mentioned • Unreliable data sources

2 The Problem • Data’s ‘Buts and Ifs’ nobody mentioned • Unreliable data sources • Missed homework: data profiling – Data type mismatches and overflow – Referential integrity • Cycle: Run Fail Patch

2 The Fix • Consider exceptions at different levels – Data/Database – Network –

2 The Fix • Consider exceptions at different levels – Data/Database – Network – Operative System • Design a system-wide strategy – Design Patterns Templates • Log and notify!

2 How? • Data/Database: – In SSIS: Use dataflow error outputs to redirect offending

2 How? • Data/Database: – In SSIS: Use dataflow error outputs to redirect offending rows • Network: – Pre-process: test connectivity – In SSIS: Event handlers, precedence constraints with conditional logic • O/S – Pre-process: Validate space available, File available, etc. – In SSIS: Event handlers, precedence constraints with conditional logic

3 Inadequate Logging …What, when, how?

3 Inadequate Logging …What, when, how?

3 The Problem • No/Little Logging • Too Much Logging • Meaningless Logging •

3 The Problem • No/Little Logging • Too Much Logging • Meaningless Logging • • Error troubleshooting Execution monitoring Performance tracking Auditing

3 The Fix • Add logging capabilities – Start with key events, add more

3 The Fix • Add logging capabilities – Start with key events, add more as needed – – Start – End date & Times Row Counts On Error On Warning • Create reports on top of logging tables • Don’t forget to clean/prune logs • Logging I/O are expensive

3 The Fix • SSIS logging • SSIS event handler • Be aware of

3 The Fix • SSIS logging • SSIS event handler • Be aware of the concept of containers in SSIS events ‘bubble-up’ • Have to be included on each package – Use package templates

4 No Recovery & Restart …Game Over!

4 No Recovery & Restart …Game Over!

4 • • • The Problem Re-starting after failure is not automated It requires

4 • • • The Problem Re-starting after failure is not automated It requires manual clean-up of partial results Prone to human error May require to start process from the beginning Risk of ‘skipping’ data Risk of duplicating data

4 The Fix • Create restart-ability points • Consider piggybacking on logging • Use

4 The Fix • Create restart-ability points • Consider piggybacking on logging • Use ternary logic at each recovery point: – Skip – Run – Clean-up and re-run • Staging source data is handy • Custom

5 Staging Area Unauthorized Use …could cause injuries.

5 Staging Area Unauthorized Use …could cause injuries.

5 The Problem • Failing to understand staging area is a ‘construction zone’ •

5 The Problem • Failing to understand staging area is a ‘construction zone’ • Reports and applications accessing staging data • Using staging tables as on-line data archive

5 The Fix • Easy: Keep staging area off-limit • Make all required data

5 The Fix • Easy: Keep staging area off-limit • Make all required data in data presentation layer • Keep staging data available only for required time • Use appropriate data aging and archiving policies and processes

6 Performance: Losing the focus …

6 Performance: Losing the focus …

6 Very Fast, but…

6 Very Fast, but…

7 Vanity Testing …good for feeling awesome.

7 Vanity Testing …good for feeling awesome.

8 No Portability …deployment in progress!

8 No Portability …deployment in progress!

9 Forgetting the Owner’s Manual …aka the beloved documentation.

9 Forgetting the Owner’s Manual …aka the beloved documentation.

10 Missing the Bigger Picture …the architecture.

10 Missing the Bigger Picture …the architecture.

10 The problem • Jumping to coding without a blueprint • Break it down

10 The problem • Jumping to coding without a blueprint • Break it down into group of tasks • List all tasks and functionality you can’t live without • Place the tasks in the appropriate group

10 The Fix • Create an attack plan • Embrace an architecture • Divide

10 The Fix • Create an attack plan • Embrace an architecture • Divide and conquer! • List all tasks and functionality you require • Place the tasks in the appropriate group

10 An example Extract Changed data capture Transform Load Data cleansing Data Load Deduplication

10 An example Extract Changed data capture Transform Load Data cleansing Data Load Deduplication Exception Handling Data Staging Data Error tracking Other Data Transformations Load Aggregates OLAP Cube Processing Other Post Load Actions ETL Management Job Scheduler Activity Monitor Backup Recovery Restart Alerting ABC Support Security Compliance

Helping Your Data Warehouse Succeed: 10 Mistakes to Avoid in Data Integration Rafael Salas

Helping Your Data Warehouse Succeed: 10 Mistakes to Avoid in Data Integration Rafael Salas t: @Raf. Salas w: www. Rafael-Salas. com e: rfsalas@yahoo. es