Lets talk about your model Greg Baldini Solutions
Let’s talk about your model Greg Baldini Solutions Architect
Let’s talk about me • I do • • Things with data Philly Power BI User Group • I work • • At Nihilent Technologies, Inc Leading our analytics and BI practice for the East Coast • I enjoy • Shopping at my local data mart • • Reading Kimball on the beach Warm summer nights with SSAS • Organic data, fresh from local analysts • I look • Like this on the internet: • Contact • • Greg. Baldini@nihilentinc. com http: //nihilentinc. com • I put • Too many words on this slide
Experience, Expertise & Domain Knowledge BI & Advanced Analytics Cloud Transformation Services Enterprise Transformation Consulting Customer Relationship Management 1, 500+ Employees Application Development 14, 000+ Portals & Collaboration Healthcare Migrated Cloud Users & devices Manufacturing Financial Services Retail
Alternate titles for this presentation • Storage is cheap: End users are impatient • In my measures, a horror • Your model is bad and you should feel bad • Servers don’t sleep at night • Optimize your model, not your measures • Your measures are a sickness: I Am The Cure
Alternate titles for this presentation • Storage is cheap: End users are impatient • In my measures, a horror • Your model is bad and you should feel bad • Servers don’t sleep at night • Optimize your model, not your measures • Your measures are a sickness: I Am The Cure
Some design thoughts • Storage is cheap • Servers don’t sleep at night • Don’t calculate static values at runtime • Your business questions define your model • This question is up to you to specify, as a BI professional • End users will give incomplete specifications • Assumptions tend to be untested • BI is not OLTP • Breaking normal form is a design pattern, not a code smell • Write-optimized data structures do not translate well to fast aggregation
What is a model? • A representation of the business processes and concepts of interest • A tool for thinking about a more complex phenomenon • Wrong
Let’s talk about dates The business problem Given the following table, create a prior quarter sales amount measure 1. 2. If Sales_Year is Current Year then show the total sales of previous quarter (current quarter minus 1) If Sales_Year is previous year(s) then show the total sales of Q 4 for that year. Sales_Year Sales_Quater Sales_Amount 2013 Q 1 100 2013 Q 2 200 2013 Q 3 300 2013 Q 4 400 2014 Q 1 500 2014 Q 2 600 2014 Q 3 700 2014 Q 4 800
Problems The business problem Given the following table, create a prior quarter sales amount measure Sales_Year Sales_Quater Sales_Amount 1. 2. If Sales_Year is Current Year then show the total sales of previous quarter (current quarter minus 1) If Sales_Year is previous year(s) then show the total sales of Q 4 for that year. 2013 2014 Q 1 Q 2 Q 3 Q 4 100 200 300 400 500 600 700 800
Mo’ Problems The business problem Given the following table, create a prior quarter sales amount measure Sales_Year Sales_Quater Sales_Amount 1. 2. If Sales_Year is Current Year then show the total sales of previous quarter (current quarter minus 1) If Sales_Year is previous year(s) then show the total sales of Q 4 for that year. 2013 2014 Q 1 Q 2 Q 3 Q 4 100 200 300 400 500 600 700 800
Solution? Sum. Sales: = SUM( Sales[Sales_Amount] ) Cur. Yr. Sales: = CALCULATE( [Sum. Sales] , FILTER( ALL( Sales ) , Sales[Sales_Year] = IF( VALUES( Sales[Sales_Quarter] ) = "Q 1" , YEAR( TODAY() ) - 1 , YEAR( TODAY() ) && Sales[Sales_Quarter] = SWITCH( VALUES( Sales[Sales_Quarter] ) , "Q 1", "Q 4" , "Q 2", "Q 1" , "Q 3", "Q 2" , "Q 3" ) ) )
Solution? ? Prev. Yr. Sales: = CALCULATE( [Sum. Sales] , ALL( Sales[Sales_Quarter] ) , Sales[Sales_Quarter] = "Q 4" ) Conditional. Sum: = IF( HASONEVALUE( Sales[Sales_Year] ) && HASONEVALUE( Sales[Sales_Quarter] ) , IF( VALUES( Sales[Sales_Year] ) < YEAR( TODAY() ) , [Prev. Yr. Sales] , [Cur. Yr. Sales] ) , BLANK() )
Sum. Sales: = SUM( Sales[Sales_Amount] ) All together Please don’t do this: Cur. Yr. Sales: = CALCULATE( [Sum. Sales] , FILTER( ALL( Sales ) , Sales[Sales_Year] = IF( VALUES( Sales[Sales_Quarter] ) = "Q 1" , YEAR( TODAY() ) - 1 , YEAR( TODAY() ) && Sales[Sales_Quarter] = SWITCH( VALUES( Sales[Sales_Quarter] ) , "Q 1", "Q 4" , "Q 2", "Q 1" , "Q 3", "Q 2" , "Q 3" ) ) ) Prev. Yr. Sales: = CALCULATE( [Sum. Sales] , ALL( Sales[Sales_Quarter] ) , Sales[Sales_Quarter] = "Q 4" ) Conditional. Sum: = IF( HASONEVALUE( Sales[Sales_Year] ) && HASONEVALUE( Sales[Sales_Quarter] ) , IF( VALUES( Sales[Sales_Year] ) < YEAR( TODAY() ) , [Prev. Yr. Sales] , [Cur. Yr. Sales] ) , BLANK() )
Think of the children. Use a date table. Date. Key Date Year Quarter. Key Quarter. Sequential Quarter. Name Month Weekday. Flag Weekday Holiday. Flag Workday. Flag 20100101 2010 -01 -01 20101 1 2010 -Q 1 1 1 Weekday 0 1 … … … 20101231 2010 -12 -31 2010 4 20104 4 2010 -Q 4 12 1 Weekday 0 1 20110101 2011 -01 -01 20111 5 2011 -Q 1 1 0 Weekend 0 0 … … ….
Think of the children. Use a date table. Date. Key Date Year Quarter. Key Quarter. Sequential Quarter. Name Month Weekday. Flag Weekday Holiday. Flag Workday. Flag 20100101 2010 -01 -01 20101 1 2010 -Q 1 1 1 Weekday 0 1 … … … 20101231 2010 -12 -31 2010 4 20104 4 2010 -Q 4 12 1 Weekday 0 1 20110101 2011 -01 -01 20111 5 2011 -Q 1 1 0 Weekend 0 0 … … …. Rather than handle year-end as a conditional, you can always subtract a fixed value – lookup becomes a single index seek for current [Quarter. Sequential] - 1
Think of the children. Use a date table. Date. Key Date Year Quarter. Key Quarter. Sequential Quarter. Name Month Weekday. Flag Weekday Holiday. Flag Workday. Flag 20100101 2010 -01 -01 20101 1 2010 -Q 1 1 1 Weekday 0 1 … … … 20101231 2010 -12 -31 2010 4 20104 4 2010 -Q 4 12 1 Weekday 0 1 20110101 2011 -01 -01 20111 5 2011 -Q 1 1 0 Weekend 0 0 … … …. Depending on storage engine, integer joins evaluate faster than date/time joins
Think of the children. Use a date table. Date. Key Date Year Quarter. Key Quarter. Sequential Quarter. Name Month Weekday. Flag Weekday Holiday. Flag Workday. Flag 20100101 2010 -01 -01 20101 1 2010 -Q 1 1 1 Weekday 0 1 … … … 20101231 2010 -12 -31 2010 4 20104 4 2010 -Q 4 12 1 Weekday 0 1 20110101 2011 -01 -01 20111 5 2011 -Q 1 1 0 Weekend 0 0 … … …. Used for date functions – can often do direct arithmetic on date/time
Think of the children. Use a date table. Date. Key Date Year Quarter. Key Quarter. Sequential Quarter. Name Month Weekday. Flag Weekday Holiday. Flag Workday. Flag 20100101 2010 -01 -01 20101 1 2010 -Q 1 1 1 Weekday 0 1 … … … 20101231 2010 -12 -31 2010 4 20104 4 2010 -Q 4 12 1 Weekday 0 1 20110101 2011 -01 -01 20111 5 2011 -Q 1 1 0 Weekend 0 0 … … …. Bit flags are fast to filter on – don’t use a weekday function on every row Storage is cheap, end users are impatient Slicers/filters are more user friendly with textual descriptions
Let’s talk about dates The business problem Always filter a report to “This year”, “This month”, “Today” Depending on the product, parameters can be defined programmatically – not a big deal YEAR(GETDATE()) Some only offer visual selection
Think of the children. Use your date table. Date 2014 -10 -01 2014 -10 -02 … 2015 -10 -28 2015 -10 -29 2015 -10 -30 … 2016 -10 -01 2016 -10 -02 … Year 2014 … 2015 … 2016 … Month Today Current. Year. Slicer October 0 0 2014 … … October 0 1 This Year October 1 1 This Year October 0 1 This Year … … October 0 0 2016 … … Storage is cheap, end users are impatient Date. Slicer 2014 -10 -01 2014 -10 -02 … 2015 -10 -28 Today 2015 -10 -30 … 2016 -10 -01 2016 -10 -02 …
Let’s talk about dates The business problem Every store has different days of operation – compare only sales on open days in the prior year, matching on day of week If store is open on Sunday of week 5 in 2014, only include these in comparative sales if store is open on Sunday of week 5 in 2015 as well Store 1 1 2 2 3 3 Date 2014 -02 -01 2015 -01 -31 Year 2014 2015 Week. Number 5 5 5 Day. Of. Week 7 7 7 Open. Closed Open Closed Open
Solution? 1. Iterate over stores 1. Iterate over dates 2. For each date in current year that is open 1. Check last year open status 2. Include current year only if last year is open 3. Sum sales 2. Return totals
What do you do to your servers when you leave for the day? Good night, sweet computer Sleep tight, don’t let the viruses bite See you in the morning light
No. Destroy your servers every night
No Your servers are not Skynet We are not in the Matrix Don’t destroy your machines They are expensive
Make a better date (and store) table Store 1 1 2 2 3 3 Date 2014 -02 -01 2015 -01 -31 Year 2014 2015 Week. Number 5 5 5 Day. Of. Week 7 7 7 Open. Closed Open Closed Open Storage is cheap, end users are impatient Your servers don’t sleep at night Evaluate helper columns as part of ETL Include. In. Comp 0 0 1 1 0 0
Let’s talk about “dynamic” reports If it doesn’t change between ETL/model refresh processes, it’s not dynamic • “Today” • Doesn’t change for 24 hours • “This year” • Doesn’t change for 365+/- days (fiscal calendars are weird) • But, fiscal calendars are easy once you think of the children (use a date table) • “Tomorrow’s new facts might change how we evaluate today” • Update that field during ETL, your servers don’t sleep at night
Let’s talk about Slowly Changing Dimensions The business problem The sales team is reorganized every year, and we want to report on [team | manager | region] sales over time, based on the attributes at the time of sale
Wait a minute • SCDs are really well understood • Have you even read Kimball? • Seriously, there’s even a pre-defined transformation is SSIS • I’m sorry, audience, my presenter clearly doesn’t deserve to be up here • You should just leave now – he doesn’t know what he’s talking about • Warning: Opinionated design tip coming
Let’s talk about Slowly Changing Dimensions So here’s a fairly standard (and simple) slowly changing dimension Sales. Rep. Key Sales. Rep Current. Manager Historic. Manager From To 1 Bob Joe Jack 2014 -01 -01 2014 -06 -01 2 Bob Joe James 2014 -06 -02 2014 -12 -31 3 Bob Joe Jack 2015 -01 -01 2015 -08 -31 4 Bob Joe 2015 -09 -01
Let’s talk about Slowly Changing Dimensions Sales. Rep. Key Sales. Rep Current. Manager Historic. Manager From To 1 Bob Joe Jack 2014 -01 -01 2014 -06 -01 2 Bob Joe James 2014 -06 -02 2014 -12 -31 3 Bob Joe Jack 2015 -01 -01 2015 -08 -31 4 Bob Joe 2015 -09 -01 This is often confusing for end users
Let’s talk about Slowly Changing Dimensions Sales. Rep. Key Sales. Rep Current. Manager Historic. Manager From To 1 Bob Joe Jack 2014 -01 -01 2014 -06 -01 2 Bob Joe James 2014 -06 -02 2014 -12 -31 3 Bob Joe Jack 2015 -01 -01 2015 -08 -31 4 Bob Joe 2015 -09 -01 This is often confusing for end users
Let’s talk about Slowly Changing Dimensions Sales. Rep. Key Sales. Rep Current. Manager Historic. Manager From To 1 Bob Joe Jack 2014 -01 -01 2014 -06 -01 2 Bob Joe James 2014 -06 -02 2014 -12 -31 3 Bob Joe Jack 2015 -01 -01 2015 -08 -31 4 Bob Joe 2015 -09 -01 This is the only thing that is actually a member of Dim. Sales. Rep
Let’s talk about Slowly Changing Dimensions • The concern is often point-in time reporting • An SCD actually becomes a business relationship fact table • We already have a table that has a historic record of attributes related to a given fact • It’s called the fact table
Let’s talk about Slowly Changing Dimensions Sales. Rep. Key 1 Sales. Rep Bob Manager. Key 1 2 3 Manager Jack James Joe Region. Key 1 Region 1 Sales. Rep. Key Manager. Key 1 1 Region. Key 1 Date. Key 20140201 Sales 500 Note the date table
Let’s talk about Slowly Changing Dimensions • The concern is often point-in time reporting • An SCD actually becomes a business relationship fact table • We already have a table that has a historic record of attributes related to a given fact • It’s called the fact table
Let’s talk about Slowly Changing Dimensions • An SCD actually becomes a business relationship fact table
Let’s talk about Slowly Changing Dimensions The business problem The following tables are maintained in an employee database • Employees (ID, Name) • Events (Start Date, Separation Date (Negative Attrition), Transition Date(Positive Attrition). . . ) Some employees have left and returned. • Contract History • Role History • Team History Trying to accomplish month to month headcount sliceable by contract, role and team. So, I have multiple type 2 slowly changing dimensions and I need my measures to be tied to all of them.
Solution? Events in progress 1. Based on dates selected (range or start/end pair) iterate through entire events table checking for each row 1. 2. 3. 4. 5. Whether the date of the event is in the date selection Whether this is a first hire Whether this is a rehire Whether a role or team transition has occurred Whether the employee matches the criteria selected in other filters 2. Count everything with true for all conditions in (1)
No. Children cry and servers don’t sleep 01 -11 • Employee 1 starts 2015 -01 -01 • Works until 2015 -01 -10 as Role 1 • Quits on 2015 -01 -15 • Transitions to Role 2 as of 2015 - • Rehired on 2015 -01 -20 as Role 3 Employee. Key 1 1 … 1 1 1 … Role. Key 1 1 … 1 2 2 … 2 3 3 … Date 2015 -01 -01 2015 -01 -02 … 2015 -01 -10 2015 -01 -11 2015 -01 -12 … 2015 -01 -15 2015 -01 -20 2015 -01 -21 … New. Hire. Flag 1 0 … 0 0 0 … Re. Hire. Flag 0 0 … 0 1 0 … New. Role. Flag 0 0 … 0 1 0 …
No. Children cry and servers don’t sleep Employee. Key Role. Key Date New. Hire. Flag Re. Hire. Flag New. Role. Flag 1 1 … 1 2 2 … 2 3 3 … 2015 -01 -01 2015 -01 -02 … 2015 -01 -10 2015 -01 -11 2015 -01 -12 … 2015 -01 -15 2015 -01 -20 2015 -01 -21 … 1 0 … 0 0 0 … There is one measure: distinct count of Employee. Key Easily sliceable by any attribute you want to add 0 0 … 0 1 0 … 0 1 0 …
No. Children cry and servers don’t sleep Employee. Key Role. Key Date New. Hire. Flag Re. Hire. Flag New. Role. Flag 1 1 … 1 2 2 … 2 3 3 … 2015 -01 -01 2015 -01 -02 … 2015 -01 -10 2015 -01 -11 2015 -01 -12 … 2015 -01 -15 2015 -01 -20 2015 -01 -21 … 1 0 … 0 0 0 … Bit flags are cheap – literally 1 bit per row 8 rows = 1 byte 8, 000 rows = 1 kilobyte 8, 000 rows = 1 megabyte 8, 000, 000 rows = 1 gigabyte 0 0 … 0 1 0 … 0 1 0 …
Let’s talk about sales distribution The business problem For commission purposes, a given sale may be shared among multiple sales reps. We have a sale fact table and a distribution table Sales. Order. Key Sales. Rep. Key Date. Key Sales 1 1 2015 -10 -29 500 2 4 2015 -10 -31 1000 Sales. Order. Key Sales. Rep. Key Distribution 1 1 0. 5 1 2 0. 5 2 3 0. 2 2 4 0. 5 2 5 0. 3
Why? Sales. Order. Key Sales. Rep. Key Date. Key Sales 1 1 2015 -10 -29 500 2 4 2015 -10 -31 1000 Sales. Order. Key Sales. Rep. Key Distribution 1 1 0. 5 1 2 0. 5 2 3 0. 2 2 4 0. 5 2 5 0. 3 This is an operational table, and in a dimensional model, is a duplicate fact
Let’s talk about sales distribution Sales. Order. Key 1 1 2 2 2 Sales. Rep. Key 1 2 3 4 5 Date. Key 2015 -10 -29 2015 -10 -31 Sales 250 200 500 300 This sums the same Performing aggregates across relationships gets expensive Fact table – millions of rows Distribution table – 2 x millions of rows
Let’s talk about sales distribution Sometimes we need to preserve the detail of the original Sales. Order. Key 1 1 2 2 2 Sales. Rep. Key 1 2 3 4 5 Date. Key 2015 -10 -29 2015 -10 -31 Sales 250 200 500 300 Sales. Order. Total 500 1000 Bit flags and original sales amount This is a simple filter for the original sales table: WHERE Primary. Rep. Flag = 1 Primary. Rep. Flag 1 0 0 1 0
Let’s talk about sales distribution Sometimes we need to preserve the detail of the original Sales. Order. Key 1 1 2 2 2 Sales. Rep. Key 1 2 3 4 5 Date. Key 2015 -10 -29 2015 -10 -31 Sales 250 200 500 300 Ratio 0. 5 0. 2 0. 5 0. 3 Distributed amount and distribution ratio This is simple arithmetic from any row to derive the original amount: Sales / Ratio
Some design thoughts • Storage is cheap • Bit flags on all the attributes • Pre-calculate helpful intermediate values • Servers don’t sleep at night • If it doesn’t change based on parameters/filters, calculate it in ETL • Update fields daily if necessary • Your business questions define your model • What are you logically trying to do? • Make your physical model match the logical description • BI is not OLTP • Optimize for fast reads, not fast writes • Ad-hoc workload vs optimized stored procedures every time • Non-technical end users need to navigate and use these models
A parting word Strive for reasonable models A model that is easy to reason about
Closing items • Contact information • Greg. Baldini@nihilentinc. com • http: //nihilentinc. com • Feel free to reach out with questions
Fin Don’t forget to join your local PUG to enjoy year-round networking and learning. www. pbiusergroup. com/NYNJ www. pbiusergroup. com/philly
- Slides: 54