Data Science and Big Data Analytics Chap 11
Data Science and Big Data Analytics Chap 11: Adv. Analytics – Tech & Tools: In-Database Analytics Charles Tappert Seidenberg School of CSIS, Pace University
Chapter Contents n 11. 1 SQL Essentials n n n 11. 2 In-Database Text Analysis 11. 3 Advanced SQL n n n 11. 1. 1 Joins 11. 1. 2 Set Operations 11. 1. 3 Grouping Extensions 11. 3. 1 11. 3. 2 11. 3. 3 11. 3. 4 Window Functions User-Defined Functions and Aggregates Ordered Aggregates MADlib Summary
Chap 11: Adv. Analytics – Tech & Tools: In-Database Analytics n n n In-database analytics is a broad term that describes the processing of data within its repository This is in contrast to extracting data from a source and loading it into a sandbox or workspace like R Advantages and disadvantages n n n Advantage: Eliminates need to move the data Advantage: Fast – often almost real-time results Disadvantage: Data must be mostly structured Disadvantage: Data must be limited, not too huge Applications – Credit card transaction fraud detection, product recommendations, web advertisement selection
11. 1 SQL Essentials Relational Database – Entity Relationship Diagram 11. 1 SQL Essentials Tables Records (rows) Fields Primary Keys Foreign Keys Normalization reduces dup SQL queries
11. 2 In-Database Text Analysis n n SQL offers basic text string functions Example – extract zip code from text string
11. 2 In-Database Text Analysis n Example – identify invalid zip codes
11. 3 Advanced SQL n Window functions – moving averages
11. 3 Advanced SQL n EWMA = Exponentially Weighted Moving Average
11. 3 Advanced SQL n n MADlib – open-source library for in-database analytics MAD n n Magnetic – attract all the data sources within organization Agile – physical and logical contents in continuous rapid evolution Deep – sophisticated statistical methods http: //doc. madlib. net/latest/modules. html
11. 3 Advanced SQL MADlib Modules
Summary n n SQL can perform in-database analytics Although SQL is usually associated with structured data, SQL tables often contain unstructured data – comments, descriptions, etc. – that can be restructured for further analysis Complex SQL queries can utilize window functions and user-defined functions Libraries such as MADlib can be used to conduct statistical analyses within a database
- Slides: 11