BIG DATA 101 SERIOUSLY, IT IS JUST 101
PARESH MOTIWALA, PMP ® • pareshmotiwala@gmail.com • http://www.linkedin.com/in/pareshmotiwala • @pareshmotiwala • www.circlesofgrowth.com • 781 254 4096
BIG DATA 101
Who should attend
• DBAs
• CIO
• Marketing peeps
• Developers
• Big Data Enthusiasts
Who should not attend
BIG DATA 101 Let’s grab a byte Brontobyte
BIG DATA 101
• Misc info on Big Data
• Sources
• Definition
• Privacy concerns
• Data Lake
• Storing – Hadoop
• Processing – MapReduce
• Presentation
• Data Science and Scientists
• Few Hadoop stacks
• Summary
SO WHY SHOULD I CARE ABOUT THIS?
Data is the new Electricity (Satya Nadella, Spring 2016)
https://www.microsoft.com/en-us/sql-server/data-driven
Companies generate data, distribute it, meter it, and use it.
Where is data stored?
• Current: SQL Server, Oracle, Teradata, DB2, Netezza
• Open Source Databases: Cassandra, MySQL, MongoDB
• Unstructured: Hadoop, Spark, Data Lakes
What type of data is stored?
• Traditional: Rows and Columns
• Big Data Explosion: Images, streaming data, internet-connected devices (IoT), machine data
Source: Microsoft
BIG DATA: DRIVING TRANSFORMATIVE CHANGES
                     | Traditional                                        | Big Data
Data characteristics | Relational data with highly modeled schema         | All data with schema agility
Costs                | Specialized HW                                     | Commodity HW
Culture              | Operational reporting, focus on rear-view analysis | Experimentation leading to intelligent action with machine learning, graph, a/b testing
Source: Microsoft
BIG DATA: DECISION MAKING
          | Today’s                           | Big Data
Effect    | Rearview Mirror                   | Forward Looking
Data Used | < 10%                             | Any and All
Quality   | Batch, Incomplete and Disjointed  | Real-time, Correlated, Governed
Purpose   | Business Monitoring               | Business Optimization
Source: The Big Data by Schmarzo
BIG DATA 101
Sources
• Cell Phones
• Social Media
• Credit Cards
• GPSs
• IoT
• Wearables
BIG DATA 101 Value
BIG DATA 101
Desired Properties:
• Robustness – Fault Tolerance
• Low Latency
• Scalability
• Generalization
• Extensibility
• Ad hoc Queries
• Minimal Maintenance
• Debuggability
BIG DATA 101
Flow
• Collection
• Intervention
• Pre-processing
• Hygiene
• Visualization
• Analysis
OVER 90% OF TODAY’S DATA WAS CREATED IN THE PAST 2 YEARS
BIG DATA 101
5 Rs of Data Quality
• Relevancy
• Recency
• Range
• Robustness
• Reliability
BIG DATA 101
Privacy of Data
• If I collect the data, is it mine?
• Ownership vs. Rights
• Share Answers, not Data
  • OpAl (http://www.trust.mit.edu/projects/)
  • Enigma (more resilient and secure data systems using secure multiparty computation and secret sharing over blockchain)
• Let them know
  • Why you are collecting
  • What you are collecting
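The "share answers, not data" idea can be illustrated with additive secret sharing, one of the building blocks of secure multiparty computation. The sketch below is only an illustration and not the actual protocol used by Enigma or OpAl; the modulus `PRIME`, the helper names `split` and `reconstruct`, and the salary figures are all made up for this example. Each owner splits a private value into random shares, the parties add up the shares they hold, and only the total is ever reconstructed.

```python
import secrets

# Illustrative modulus for the shares (an assumption of this sketch).
PRIME = 2**61 - 1


def split(secret, n_parties):
    """Split a value into n random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine shares; any incomplete subset looks like random noise."""
    return sum(shares) % PRIME


# Hypothetical private inputs, one per data owner.
salaries = [52000, 61000, 58000]
n = len(salaries)

# Each owner splits their value; party j receives the j-th share from every owner.
all_shares = [split(s, n) for s in salaries]

# Each party adds the shares it holds. No party ever sees a raw salary.
partial_sums = [sum(owner[j] for owner in all_shares) % PRIME for j in range(n)]

# Only the combined answer is revealed.
total = reconstruct(partial_sums)
print(total)
```

The parties learn the sum (the answer) without any of them holding another owner's raw value (the data).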
BIG DATA 101
FIPP: Fair Information Practice Principles
• Individual Control
• Transparency
• Respect for Context
• Security
• Access and Accuracy
• Focused Collection
FERPA: Family Educational Rights and Privacy Act
BIG DATA 101
WHAT IS A DATA LAKE? (COURTESY: JAMES SERRA)
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.
• A place to store unlimited amounts of data in any format inexpensively, especially for archive purposes
• Allows collection of data that you may or may not use later: “just in case”
• A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read”
• Complements the EDW and can be seen as a data source for the EDW, capturing all data but only passing relevant data to the EDW
• Frees up expensive EDW resources (storage and processing), especially for data refinement
• Allows data exploration to be performed without waiting for the EDW team to model and load the data (quick user access)
• Some processing is better done with Hadoop tools than with ETL tools like SSIS
• Easily scalable
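A minimal sketch of "schema on read", assuming the lake holds raw JSON lines. The sample events, the helper `read_with_schema`, and the schema dictionaries are all hypothetical and only meant to show the idea: nothing is validated when data is written, and each query brings its own schema.

```python
import json

# Raw events land in the lake exactly as produced; nothing is validated on write.
RAW_EVENTS = [
    '{"device": "thermostat-1", "temp_c": 21.5, "ts": "2016-05-01T10:00:00"}',
    '{"device": "thermostat-2", "temp_c": "22.1"}',
    '{"user": "alice", "clicked": "/pricing"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at query time: keep records that have every
    requested field and cast each value to the requested type."""
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two different "views" over the same raw data, defined only when queried.
temperature_schema = {"device": str, "temp_c": float}
click_schema = {"user": str, "clicked": str}

readings = list(read_with_schema(RAW_EVENTS, temperature_schema))
clicks = list(read_with_schema(RAW_EVENTS, click_schema))
print(readings)
print(clicks)
```

The same raw lines serve both queries, and extra fields such as `ts` stay in the lake until some later query asks for them.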
BIG DATA 101
THE “DATA LAKE” USES A BOTTOM-UP APPROACH
• Store all data: ingest all data regardless of requirements
• Do analysis: in native format without schema definition, using analytic engines like Hadoop
• Sources: Devices, Social, LOB applications, Video, Sensors, Web, Relational, Clickstream
• Analysis: Batch queries, Interactive queries, Real-time analytics, Machine Learning, Data warehouse
A Data Lake quickly turns into a data swamp if you don’t invest in data quality.
Courtesy: James Serra
BIG DATA 101 Hadoop: created by Doug Cutting and Mike Cafarella in 2005
BIG DATA 101 BENEFITS OF HADOOP
BIG DATA 101 Data Lake Big Data
BIG DATA 101
MapReduce
• Map – Sends Queries
• Reduce – Collects Results
• Job Tracker
• Task Tracker
• YARN
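The map and reduce steps can be sketched with a word count, the usual introductory example. This is a single-process illustration in plain Python, not the Hadoop API: the function names are invented for this sketch, and on a real cluster the framework runs the map and reduce tasks on many nodes and performs the shuffle between them.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key into a result."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data is just data", "data lake"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across machines, which is what makes the model scale.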
Base Architecture: Big Data Advanced Analytics Pipeline
Stages:
• Ingest (data sources)
• Prepare (normalize, clean, etc.)
• Analyze (stat analysis, ML, etc.)
• Publish (for programmatic consumption, BI/visualization)
• Consume (alerts, operational stats, insights)
Data in Motion: Near Real-time Data Analytics Pipeline using Azure Stream Analytics
• Telemetry, Event Hub, Stream Analytics (real-time analytics), Machine Learning (Anomaly Detection)
• Live / real-time data stats, anomalies and aggregates
• Power BI dashboard
Data at Rest: Interactive Analytics and Predictive Pipeline using Azure Data Factory
• Customer MIS, scheduled hourly transfer using Azure Data Factory, Azure Storage Blob
• HDI custom ETL (aggregate / partition), Machine Learning, Azure SQL (Predictions)
• Dashboard of predictions / alerts
Big Data Analytics Pipeline using Azure Data Lake
• Azure Data Lake Storage, Azure Data Lake Analytics (Big Data Processing), Azure SQL
• Dashboard of operational stats
VISION FOR BIG DATA AND DATA WAREHOUSING
• Themes: Comprehensive, Connected, Choice
• On-Premises: Microsoft SQL Server
• Cloud: Microsoft Azure
• “Big Data” and Data Warehouse, linked by Azure Data Factory + Federated Query
• Sources: LOB applications, Devices, Social, Relational, Video, Web, Sensors, Clickstream
BIG DATA 101 PRESENTATION • R • Python • Power BI Desktop
BIG DATA 101 • Data Science and Scientist
BIG DATA 101
Summary:
• Misc info on Big Data
• Sources
• Definition
• Privacy concerns
• Data Lake
• Storing – Hadoop
• Processing – MapReduce
• Presentation
• Data Science and Scientists
• Few Hadoop stacks
BIG DATA 101 - CONCLUSION
• SQL Server is the best Relational Database
• The world is much bigger than any one relational database
• What is your company’s data strategy?
• What is your company’s cloud strategy?
• Learn adjacent technologies that will make you valuable: Power BI? Hadoop? NoSQL?
BIG DATA 101
Someday Big Data will just become data.
Thank You
BIG DATA 101
BIBLIOGRAPHY
• http://www.datasciencecentral.com/
• https://www.youtube.com/playlist?list=PLt0mOCwxJ6B_OxTlpevxJNAa7GfCLd3l
• https://www.dezyre.com/article/hadoop-components-and-architecturebig-data-and-hadoop-training/114
• MIT Big Data Analytics Course
• Data Lake presentation by James Serra
• Future of Data... (or something like that) by George Walters
• Big Data Analytics with Microsoft HDInsight in 24 Hours