CIS Data Warehousing for Analytics Week 1 A
- Slides: 38
CIS - Data Warehousing for Analytics Week 1
A few notes before we begin. . . • The course textbooks are in the Baruch Campus Library. • Business Intelligence Guidebook: From Data Integration to Analytics by Rick Sherman. Elsevier Science & Technology (2014). ISBN: 978 -0 -12 -411461 -6 Price: $60 – NOTE: Book is available through Baruch Library. Link is here • The Data Warehouse Toolkit (3 rd. Edition). by Ralph Kimball and Margy Ross. Publisher: Wiley Edition: 3, Year Published: 2013, Price: 60. 00 USD. ISBN: 978 -1 -118 -53080 -1 – NOTE: Book is available through Baruch Library. Link is here These are available on the website syllabus link. 2 Information Systems Program
We start this course with SQL ! But why? I already learned it! 3 Information Systems Program
SQL and Data Warehouse Queries SQL is not an end-user tool (not for the recipient of analyses) • There is an industry around end-user tools that can work with various databases and other data sources. • But, it is the major analyst’s tool. 4 Information Systems Program
First tool of the analyst: SQL Why SQL? – It is the one common tool amongst nearly every database and business intelligence/analytics product – It is universally known. – It is the engine behind most tools you will use. – It is the direct way to check results without products in-between. 5 Information Systems Program
Client Server 2 Tier Architecture [Reference: http: //holowczak. com/database-system-architectures/5/] • • • Clients Run own copy of an operating system. Run one or more applications using the client machine’s CPU, memory. Application communicates with DBMS server running on server machine through a Database Driver Database driver (middleware) makes a connection to the DBMS server over a network • • • Run own copy of an operating system. Run a Database Management System that manages a single database. Provides a Listening daemon (process) that accepts connections from client machines and submits transactions to DBMS on behalf of the client machines Middleware: Small portion of software that sits between client and server. Establishes a connection from the client to the server and passes commands (e. g. , SQL) between them. E. g. ODBC, JDBC, ADO. NET/OLE DB 6 Information Systems Program
Client Server Advantages/Disadvantages [Reference: http: //holowczak. com/database-system-architectures/5/] Advantages • Processing of the entire Database System is spread out over clients and server. • DBMS can achieve high performance because it is dedicated to processing transactions (not running applications). • Client Applications can take full advantage of advanced user interfaces such as Graphical User Interfaces. • • • Business rules may be enforced at: The client application – so called “Thick Clients”. Entirely on the database server – so called “Thin Clients” A Mix of both. Disadvantages • Implementation is more complex because one needs to deal with middleware and the network. • It is possible the network is not well suited for client/server communications and may become a bottleneck. • Additional burden on DBMS server to handle concurrency control, etc. • As more business rule logic is programmed into the client side applications, they can become unwieldy. Stored procedures and triggers can help in this case. 7 Information Systems Program
Let’s talk about… • HW #0 -> 2 Surveys; Due Today! • HW #1 -> OUT Tuesday • Oracle accounts – Your own personal login – Can only access on campus 8 Information Systems Program
Try connecting now… • Bring up Oracle SQL Developer & use the connection information… • Change your password!! • Recommendations by the way: – Use a Password Manager! • Lastpass • 1 Password • Dashlane 9 Information Systems Program
The sample schema • Schema to use is the OE schema • OE = Order Entry • Tracks customers’ orders of products [like Amazon!] 10 Information Systems Program
The sample tables 11 Information Systems Program
The sample schema 12 Information Systems Program
SQL [Structured Query Language] • SELECT – Retrieve existing rows from a table. If the table is empty, a message indicating that no rows were found will be displayed. A simplified syntax for the SELECT statement is: – SELECT <column 1, column 2, . . . > FROM <table 1, table 2, . . . > WHERE <where clause> GROUP BY <column 1, column 2, . . . > HAVING <having clause> ORDER BY <column 1, column 2, . . . > [OPTIONAL] • The select query is the WORKHORSE of Data Analytics! 13 Information Systems Program
SQL Aggregate Commands 14 Information Systems Program
GROUP BY Clause Review Row summary Aggregat e function One row per combination of grouping values One value per set of rows Grouping columns, summary values MIN, MAX, COUNT, SUM, AVG, … 15 Information Systems Program
GROUP BY Rules SELECT/GROUP BY consistency • SELECT Col 1, Col 2, …, Coln, COUNT(*) • GROUP BY Col 1, Col 2, …, Coln Condition placement • Row conditions (no aggregate function) in WHERE • Group conditions (aggregate function) in HAVING 16 Information Systems Program
Query Clause Evaluation Order Rows • FROM • WHERE Groups • GROUP BY • HAVING Result • ORDER BY • SELECT 17 Information Systems Program
GROUP BY Example • Summarize (sum, min, and count) store sales for USA and Canada in 2016 by store zip and month • Only include groups with more than one row SELECT Store. Zip, Time. Month, SUM(Sales. Dollar) AS Sum. Sales, MIN(Sales. Dollar) AS Min. Sales, COUNT(*) AS Row. Count FROM SSSales, SSStore, SSTime. Dim WHERE SSSales. Store. Id = SSStore. Id AND SSSales. Time. No = SSTime. Dim. Time. No AND (Store. Nation = 'USA' OR Store. Nation = 'Canada') AND Time. Year = 2016 GROUP BY Store. Zip, Time. Month HAVING COUNT(*) > 1 ORDER BY Store. Zip, Time. Month; 18 Information Systems Program
SQL [Structured Query Language] • JOIN – When you have multiple tables that have keys in common and you want to see attributes from both tables 19 Information Systems Program
ATTENDANCE ! 20 Information Systems Program
Extensions to SQL for Analytics • Understand processing extension for analytic functions • Write SELECT statements using basic syntax elements (OVER and ORDER BY) • Reflect on the motivation for analytic function extensions 21 Information Systems Program
Business Intelligence Analysis BI Applications typically involve analysis that combines data retrieval & computations Top and worst performers Trends Quantitative contributions e. g. Who is the best salesperson of jet aircraft engines? e. g. Identify changes between time periods; this involves summary calcs on moving windows, e. g. moving averages e. g. Which contributions contributed what percentage to the entire amount? Information Systems Program 22
Organizations found standard SQL select statement inadequate for BI apps: • • • Applications required a complex skill set involving data retrieval, procedural coding and external tool usage Organizations experience difficulties to find individuals possessing the necessary skill set. Productivity was poor for developing business intelligence applications with complex select statements and procedural coding often required. • Performance was slow, with SQL compilers often developing poorly performing plans for complex select statements. • In addition, complex calculations were typically done outside of SQL statements, so optimizing SQL compilers could not optimize both data retrieval and computations. Poor productivity Poor performanc e Difficult to write SELECT statement extension s 23 Information Systems Program
Analytic Function versus Aggregate Function Know the difference! • Computes one value • Reduces group to a single row • Calculated before analytic functions Analytic Function • Computes multiple values • Preserves number of rows in a group • Calculated after aggregate functions 24 Information Systems Program
Analytic Function Processing Rows Groups • FROM • WHERE Analytic Result • GROUP BY • HAVING • Create • ORDER partition BY s • SELECT • Evaluate function s • Order partition s • Analytic functions occur AFTER row & group processing to perform calculations of GROUP BY results 25 • Analytic function processing involves organizing results into partitions, evaluating functions over partitions, and then ordering the partitions Information Systems Program
Basic Syntax Elements • <Analytic. Function> ([<column-list>]) OVER ( [ORDER BY <ordering>] ) – Place in SELECT clause list – OVER clause identifies window (set of rows) – Ordering criteria for function evaluation 26 Information Systems Program
SQL ANALYTIC Commands – RANK clause • Used to RANK rows after grouping them (although you don’t need to group them!) RANK() OVER ([ query_partition_clause ] order_by_clause) • Note that the query_partition_clause is optional; if you leave it out, it will treat ALL rows as one group • Examples RANK() OVER (ORDER BY Item. Price) AS Rank. Unit. Price RANK() OVER (ORDER BY SUM(Sales. Dollar) ) AS Rank. Sales 27 Information Systems Program
Oracle Diagram for the Store Sales Tables 28 Information Systems Program
Rank Example I • Rank items by ascending item price SELECT Item. Id, Item. Brand, Item. Unit. Price, RANK() OVER ( ORDER BY Item. Unit. Price ) AS Rank. Unit. Price FROM SSItem; ITEMID -------I 1412138 I 1445671 I 6677900 ITEMBRAND ITEMUNITPRICE RANKUNITPRICE --------- ------Ethlite 12 1 Intersafe 14. 99 2 Connex 25. 69 3 … 29 Information Systems Program
Rank Example II • Rank customers by descending sum of dollar sales • Group on customer name SELECT Cust. Name, SUM(Sales. Dollar) AS Sum. Sales, RANK() OVER (ORDER BY SUM(Sales. Dollar) DESC) Sales. Rank FROM SSSales, SSCustomer WHERE SSSales. Cust. Id = SSCustomer. Cust. Id GROUP BY Cust. Name; CUSTNAME SUMSALESRANK --------------- -----Sheri Gordon 556322 1 Wally Jones 94004 2 Jim Glussman 91100 3 Candy Kendall 90664 4 … 30 Information Systems Program
Extended Partitioning Syntax • <Analytic. Function> ([<column-list>]) OVER ([PARTITION BY <partitioning>] [ORDER BY <ordering>] ) – PARTITION BY keywords – Divides result into partitions – Analytic function evaluated for each partition • Example RANK() OVER ( PARTITION BY Cust. State ORDER BY SUM(Sales. Dollar) ) AS Sales. Rank 31 Information Systems Program
Windows! Big in BI • Window comparisons common for financial • Changes in numeric variables in sets of rows known as windows • analysis and forecasting Examples – 90 day moving average of stock prices – Percentage annual sales growth – Performance of ad campaign in recent months – Cumulative sales performance for current year 32 Information Systems Program
Window Concepts Units Movemen t Physical (ROWS) Cumulativ e Logical (RANGE) Sliding Window is a collection of rows in which a numeric function is calculated 33 Information Systems Program
SQL ANALYTIC Commands – LAG clause • Used to provide access to a row at a specified physical offset which comes before the current row – in other words, by using the LAG() function, from the current row, you can access data of the previous row, or from the second row before the current row, or from the third row before current row, and so on LAG(return_value [, offset[, default_value ]]) OVER ( PARTITION BY expr 1, expr 2, . . . ORDER BY expr 1 [ASC | DESC], expr 2, . . . ) • Note that the query_partition_clause is optional; if you leave it out, it will treat ALL rows as one group Information Systems Program 34
Lag Clause cont’d SELECT employee_id, fiscal_year, salary, LAG(salary) OVER ( PARTITION BY employee_id ORDER BY fiscal_year) previous_salary FROM basic_pays; 35 Information Systems Program
VIEWS Can create a custom view of data. Why? • Can hide complexity of data from end users – End users don’t have to fiddle with multiple tables; you can do all the custom logic and joins for them and they can just select from the view! • Easier for end users to write queries to get out what they need • Can ‘modularize’ database and give access to only what you need users to see NOTE: Views can use one or multiple tables or even other views! 36 Information Systems Program
VIEWS Cont’d • To create a view: – Create view Csaccept as select s. ID, c. Name from Apply where major=‘CS’ and decision= ‘Y’ – Will now show those students/colleges where students got accepted into the CS major – Pretty easy! • NOTE that views are virtual; there is nothing actually created! 37 Information Systems Program
Other Handy SQL functions that are useful • to_char : takes a date and extracts month or year, etc. – TO_CHAR(sysdate, 'yyyy/mm/dd') – Result: '2003/07/09' – TO_CHAR(sysdate, 'Month DD, YYYY') – Result: 'July 09, 2003' – TO_CHAR(sysdate, 'FMMonth DD, YYYY') – Result: 'July 9, 2003' – TO_CHAR(sysdate, 'MON DDth, YYYY') – Result: 'JUL 09 TH, 2003' – TO_CHAR(sysdate, 'FMMON DDth, YYYY') – Result: 'JUL 9 TH, 2003' 38 – TO_CHAR(sysdate, 'FMMon ddth, YYYY') – Result: 'Jul 9 th, 2003' Information Systems Program
- Upenn mse in data science
- What is kdd process in data mining
- Javachive
- Datamart olap
- Data warehousing data mining and olap
- Best practice etl architecture
- Introduction to data mining and data warehousing
- "amplitude" analytics or "product analytics"
- Week by week plans for documenting children's development
- Introduction to data warehousing
- Greenplum data warehousing
- Data warehousing components
- Data warehouse project charter
- Temporal parallelism
- Data warehouse principles
- An overview of data warehousing and olap technology
- Introduction to data warehousing
- An overview of data warehousing and olap technology
- Concept hierarchy in data warehousing
- Cognos impromptu in data warehousing
- Data warehouse basic concepts
- Inmon cif
- Research problems in data warehousing
- Healthcare data warehouse model
- Starnet query model in data warehouse
- Technical architecture data warehouse
- Formuö
- Novell typiska drag
- Nationell inriktning för artificiell intelligens
- Returpilarna
- Varför kallas perioden 1918-1939 för mellankrigstiden
- En lathund för arbete med kontinuitetshantering
- Adressändring ideell förening
- Tidbok yrkesförare
- Anatomi organ reproduksi
- Förklara densitet för barn
- Datorkunskap för nybörjare
- Stig kerman
- Att skriva debattartikel