CIS Data Warehousing for Analytics Week 1 A

  • Slides: 38
Download presentation
CIS - Data Warehousing for Analytics Week 1

CIS - Data Warehousing for Analytics Week 1

A few notes before we begin. . . • The course textbooks are in

A few notes before we begin. . . • The course textbooks are in the Baruch Campus Library. • Business Intelligence Guidebook: From Data Integration to Analytics by Rick Sherman. Elsevier Science & Technology (2014). ISBN: 978 -0 -12 -411461 -6 Price: $60 – NOTE: Book is available through Baruch Library. Link is here • The Data Warehouse Toolkit (3 rd. Edition). by Ralph Kimball and Margy Ross. Publisher: Wiley Edition: 3, Year Published: 2013, Price: 60. 00 USD. ISBN: 978 -1 -118 -53080 -1 – NOTE: Book is available through Baruch Library. Link is here These are available on the website syllabus link. 2 Information Systems Program

We start this course with SQL ! But why? I already learned it! 3

We start this course with SQL ! But why? I already learned it! 3 Information Systems Program

SQL and Data Warehouse Queries SQL is not an end-user tool (not for the

SQL and Data Warehouse Queries SQL is not an end-user tool (not for the recipient of analyses) • There is an industry around end-user tools that can work with various databases and other data sources. • But, it is the major analyst’s tool. 4 Information Systems Program

First tool of the analyst: SQL Why SQL? – It is the one common

First tool of the analyst: SQL Why SQL? – It is the one common tool amongst nearly every database and business intelligence/analytics product – It is universally known. – It is the engine behind most tools you will use. – It is the direct way to check results without products in-between. 5 Information Systems Program

Client Server 2 Tier Architecture [Reference: http: //holowczak. com/database-system-architectures/5/] • • • Clients Run

Client Server 2 Tier Architecture [Reference: http: //holowczak. com/database-system-architectures/5/] • • • Clients Run own copy of an operating system. Run one or more applications using the client machine’s CPU, memory. Application communicates with DBMS server running on server machine through a Database Driver Database driver (middleware) makes a connection to the DBMS server over a network • • • Run own copy of an operating system. Run a Database Management System that manages a single database. Provides a Listening daemon (process) that accepts connections from client machines and submits transactions to DBMS on behalf of the client machines Middleware: Small portion of software that sits between client and server. Establishes a connection from the client to the server and passes commands (e. g. , SQL) between them. E. g. ODBC, JDBC, ADO. NET/OLE DB 6 Information Systems Program

Client Server Advantages/Disadvantages [Reference: http: //holowczak. com/database-system-architectures/5/] Advantages • Processing of the entire Database

Client Server Advantages/Disadvantages [Reference: http: //holowczak. com/database-system-architectures/5/] Advantages • Processing of the entire Database System is spread out over clients and server. • DBMS can achieve high performance because it is dedicated to processing transactions (not running applications). • Client Applications can take full advantage of advanced user interfaces such as Graphical User Interfaces. • • • Business rules may be enforced at: The client application – so called “Thick Clients”. Entirely on the database server – so called “Thin Clients” A Mix of both. Disadvantages • Implementation is more complex because one needs to deal with middleware and the network. • It is possible the network is not well suited for client/server communications and may become a bottleneck. • Additional burden on DBMS server to handle concurrency control, etc. • As more business rule logic is programmed into the client side applications, they can become unwieldy. Stored procedures and triggers can help in this case. 7 Information Systems Program

Let’s talk about… • HW #0 -> 2 Surveys; Due Today! • HW #1

Let’s talk about… • HW #0 -> 2 Surveys; Due Today! • HW #1 -> OUT Tuesday • Oracle accounts – Your own personal login – Can only access on campus 8 Information Systems Program

Try connecting now… • Bring up Oracle SQL Developer & use the connection information…

Try connecting now… • Bring up Oracle SQL Developer & use the connection information… • Change your password!! • Recommendations by the way: – Use a Password Manager! • Lastpass • 1 Password • Dashlane 9 Information Systems Program

The sample schema • Schema to use is the OE schema • OE =

The sample schema • Schema to use is the OE schema • OE = Order Entry • Tracks customers’ orders of products [like Amazon!] 10 Information Systems Program

The sample tables 11 Information Systems Program

The sample tables 11 Information Systems Program

The sample schema 12 Information Systems Program

The sample schema 12 Information Systems Program

SQL [Structured Query Language] • SELECT – Retrieve existing rows from a table. If

SQL [Structured Query Language] • SELECT – Retrieve existing rows from a table. If the table is empty, a message indicating that no rows were found will be displayed. A simplified syntax for the SELECT statement is: – SELECT <column 1, column 2, . . . > FROM <table 1, table 2, . . . > WHERE <where clause> GROUP BY <column 1, column 2, . . . > HAVING <having clause> ORDER BY <column 1, column 2, . . . > [OPTIONAL] • The select query is the WORKHORSE of Data Analytics! 13 Information Systems Program

SQL Aggregate Commands 14 Information Systems Program

SQL Aggregate Commands 14 Information Systems Program

GROUP BY Clause Review Row summary Aggregat e function One row per combination of

GROUP BY Clause Review Row summary Aggregat e function One row per combination of grouping values One value per set of rows Grouping columns, summary values MIN, MAX, COUNT, SUM, AVG, … 15 Information Systems Program

GROUP BY Rules SELECT/GROUP BY consistency • SELECT Col 1, Col 2, …, Coln,

GROUP BY Rules SELECT/GROUP BY consistency • SELECT Col 1, Col 2, …, Coln, COUNT(*) • GROUP BY Col 1, Col 2, …, Coln Condition placement • Row conditions (no aggregate function) in WHERE • Group conditions (aggregate function) in HAVING 16 Information Systems Program

Query Clause Evaluation Order Rows • FROM • WHERE Groups • GROUP BY •

Query Clause Evaluation Order Rows • FROM • WHERE Groups • GROUP BY • HAVING Result • ORDER BY • SELECT 17 Information Systems Program

GROUP BY Example • Summarize (sum, min, and count) store sales for USA and

GROUP BY Example • Summarize (sum, min, and count) store sales for USA and Canada in 2016 by store zip and month • Only include groups with more than one row SELECT Store. Zip, Time. Month, SUM(Sales. Dollar) AS Sum. Sales, MIN(Sales. Dollar) AS Min. Sales, COUNT(*) AS Row. Count FROM SSSales, SSStore, SSTime. Dim WHERE SSSales. Store. Id = SSStore. Id AND SSSales. Time. No = SSTime. Dim. Time. No AND (Store. Nation = 'USA' OR Store. Nation = 'Canada') AND Time. Year = 2016 GROUP BY Store. Zip, Time. Month HAVING COUNT(*) > 1 ORDER BY Store. Zip, Time. Month; 18 Information Systems Program

SQL [Structured Query Language] • JOIN – When you have multiple tables that have

SQL [Structured Query Language] • JOIN – When you have multiple tables that have keys in common and you want to see attributes from both tables 19 Information Systems Program

ATTENDANCE ! 20 Information Systems Program

ATTENDANCE ! 20 Information Systems Program

Extensions to SQL for Analytics • Understand processing extension for analytic functions • Write

Extensions to SQL for Analytics • Understand processing extension for analytic functions • Write SELECT statements using basic syntax elements (OVER and ORDER BY) • Reflect on the motivation for analytic function extensions 21 Information Systems Program

Business Intelligence Analysis BI Applications typically involve analysis that combines data retrieval & computations

Business Intelligence Analysis BI Applications typically involve analysis that combines data retrieval & computations Top and worst performers Trends Quantitative contributions e. g. Who is the best salesperson of jet aircraft engines? e. g. Identify changes between time periods; this involves summary calcs on moving windows, e. g. moving averages e. g. Which contributions contributed what percentage to the entire amount? Information Systems Program 22

Organizations found standard SQL select statement inadequate for BI apps: • • • Applications

Organizations found standard SQL select statement inadequate for BI apps: • • • Applications required a complex skill set involving data retrieval, procedural coding and external tool usage Organizations experience difficulties to find individuals possessing the necessary skill set. Productivity was poor for developing business intelligence applications with complex select statements and procedural coding often required. • Performance was slow, with SQL compilers often developing poorly performing plans for complex select statements. • In addition, complex calculations were typically done outside of SQL statements, so optimizing SQL compilers could not optimize both data retrieval and computations. Poor productivity Poor performanc e Difficult to write SELECT statement extension s 23 Information Systems Program

Analytic Function versus Aggregate Function Know the difference! • Computes one value • Reduces

Analytic Function versus Aggregate Function Know the difference! • Computes one value • Reduces group to a single row • Calculated before analytic functions Analytic Function • Computes multiple values • Preserves number of rows in a group • Calculated after aggregate functions 24 Information Systems Program

Analytic Function Processing Rows Groups • FROM • WHERE Analytic Result • GROUP BY

Analytic Function Processing Rows Groups • FROM • WHERE Analytic Result • GROUP BY • HAVING • Create • ORDER partition BY s • SELECT • Evaluate function s • Order partition s • Analytic functions occur AFTER row & group processing to perform calculations of GROUP BY results 25 • Analytic function processing involves organizing results into partitions, evaluating functions over partitions, and then ordering the partitions Information Systems Program

Basic Syntax Elements • <Analytic. Function> ([<column-list>]) OVER ( [ORDER BY <ordering>] ) –

Basic Syntax Elements • <Analytic. Function> ([<column-list>]) OVER ( [ORDER BY <ordering>] ) – Place in SELECT clause list – OVER clause identifies window (set of rows) – Ordering criteria for function evaluation 26 Information Systems Program

SQL ANALYTIC Commands – RANK clause • Used to RANK rows after grouping them

SQL ANALYTIC Commands – RANK clause • Used to RANK rows after grouping them (although you don’t need to group them!) RANK() OVER ([ query_partition_clause ] order_by_clause) • Note that the query_partition_clause is optional; if you leave it out, it will treat ALL rows as one group • Examples RANK() OVER (ORDER BY Item. Price) AS Rank. Unit. Price RANK() OVER (ORDER BY SUM(Sales. Dollar) ) AS Rank. Sales 27 Information Systems Program

Oracle Diagram for the Store Sales Tables 28 Information Systems Program

Oracle Diagram for the Store Sales Tables 28 Information Systems Program

Rank Example I • Rank items by ascending item price SELECT Item. Id, Item.

Rank Example I • Rank items by ascending item price SELECT Item. Id, Item. Brand, Item. Unit. Price, RANK() OVER ( ORDER BY Item. Unit. Price ) AS Rank. Unit. Price FROM SSItem; ITEMID -------I 1412138 I 1445671 I 6677900 ITEMBRAND ITEMUNITPRICE RANKUNITPRICE --------- ------Ethlite 12 1 Intersafe 14. 99 2 Connex 25. 69 3 … 29 Information Systems Program

Rank Example II • Rank customers by descending sum of dollar sales • Group

Rank Example II • Rank customers by descending sum of dollar sales • Group on customer name SELECT Cust. Name, SUM(Sales. Dollar) AS Sum. Sales, RANK() OVER (ORDER BY SUM(Sales. Dollar) DESC) Sales. Rank FROM SSSales, SSCustomer WHERE SSSales. Cust. Id = SSCustomer. Cust. Id GROUP BY Cust. Name; CUSTNAME SUMSALESRANK --------------- -----Sheri Gordon 556322 1 Wally Jones 94004 2 Jim Glussman 91100 3 Candy Kendall 90664 4 … 30 Information Systems Program

Extended Partitioning Syntax • <Analytic. Function> ([<column-list>]) OVER ([PARTITION BY <partitioning>] [ORDER BY <ordering>]

Extended Partitioning Syntax • <Analytic. Function> ([<column-list>]) OVER ([PARTITION BY <partitioning>] [ORDER BY <ordering>] ) – PARTITION BY keywords – Divides result into partitions – Analytic function evaluated for each partition • Example RANK() OVER ( PARTITION BY Cust. State ORDER BY SUM(Sales. Dollar) ) AS Sales. Rank 31 Information Systems Program

Windows! Big in BI • Window comparisons common for financial • Changes in numeric

Windows! Big in BI • Window comparisons common for financial • Changes in numeric variables in sets of rows known as windows • analysis and forecasting Examples – 90 day moving average of stock prices – Percentage annual sales growth – Performance of ad campaign in recent months – Cumulative sales performance for current year 32 Information Systems Program

Window Concepts Units Movemen t Physical (ROWS) Cumulativ e Logical (RANGE) Sliding Window is

Window Concepts Units Movemen t Physical (ROWS) Cumulativ e Logical (RANGE) Sliding Window is a collection of rows in which a numeric function is calculated 33 Information Systems Program

SQL ANALYTIC Commands – LAG clause • Used to provide access to a row

SQL ANALYTIC Commands – LAG clause • Used to provide access to a row at a specified physical offset which comes before the current row – in other words, by using the LAG() function, from the current row, you can access data of the previous row, or from the second row before the current row, or from the third row before current row, and so on LAG(return_value [, offset[, default_value ]]) OVER ( PARTITION BY expr 1, expr 2, . . . ORDER BY expr 1 [ASC | DESC], expr 2, . . . ) • Note that the query_partition_clause is optional; if you leave it out, it will treat ALL rows as one group Information Systems Program 34

Lag Clause cont’d SELECT employee_id, fiscal_year, salary, LAG(salary) OVER ( PARTITION BY employee_id ORDER

Lag Clause cont’d SELECT employee_id, fiscal_year, salary, LAG(salary) OVER ( PARTITION BY employee_id ORDER BY fiscal_year) previous_salary FROM basic_pays; 35 Information Systems Program

VIEWS Can create a custom view of data. Why? • Can hide complexity of

VIEWS Can create a custom view of data. Why? • Can hide complexity of data from end users – End users don’t have to fiddle with multiple tables; you can do all the custom logic and joins for them and they can just select from the view! • Easier for end users to write queries to get out what they need • Can ‘modularize’ database and give access to only what you need users to see NOTE: Views can use one or multiple tables or even other views! 36 Information Systems Program

VIEWS Cont’d • To create a view: – Create view Csaccept as select s.

VIEWS Cont’d • To create a view: – Create view Csaccept as select s. ID, c. Name from Apply where major=‘CS’ and decision= ‘Y’ – Will now show those students/colleges where students got accepted into the CS major – Pretty easy! • NOTE that views are virtual; there is nothing actually created! 37 Information Systems Program

Other Handy SQL functions that are useful • to_char : takes a date and

Other Handy SQL functions that are useful • to_char : takes a date and extracts month or year, etc. – TO_CHAR(sysdate, 'yyyy/mm/dd') – Result: '2003/07/09' – TO_CHAR(sysdate, 'Month DD, YYYY') – Result: 'July 09, 2003' – TO_CHAR(sysdate, 'FMMonth DD, YYYY') – Result: 'July 9, 2003' – TO_CHAR(sysdate, 'MON DDth, YYYY') – Result: 'JUL 09 TH, 2003' – TO_CHAR(sysdate, 'FMMON DDth, YYYY') – Result: 'JUL 9 TH, 2003' 38 – TO_CHAR(sysdate, 'FMMon ddth, YYYY') – Result: 'Jul 9 th, 2003' Information Systems Program