Big Data Architectures for Improving Customer Analytics
Carey Moretti
Director, Big Data Intelligence
President, San Diego Chapter, TDWI
@careymoretti
© 2014 Trace 3, All rights reserved.

Data is like people – interrogate it hard enough and it will tell you whatever you want to hear

Compelling Business Case?
• Operational Efficiency
  – Lower Cost Storage
  – Data Transformation
  – EDW Archival
• Strategic Advantage
  – Strategic Data Architecture
  – Agile Exploration
  – Data Science

Data Architecture Strategy
• Defined Independently of Hadoop
• Hadoop Is a Technology for Realizing the Strategy
• Train and Level-Set the Team
• Design for Long-Term Integration & Growth
• Implement Incrementally

Data Architecture w/RDBMS
[Diagram. Labeled components:]
• Centralized Data Management: scheduling, flow control, restart, alerts, audits
• Source Data
• Raw Data (Short-Term Retention)
• EDW (RDBMS): Highly Integrated (Data Vault)
• Centralized Business Rules (Transformations)
• Marts, OLAP
• Data Access: Report Writers, Analysts, Power Users, Data Scientists
• Intelligence Feeds Back to Internal Applications

Data Architecture w/Hadoop
[Diagram. Labeled components:]
• Centralized Data Management: scheduling, flow control, restart, alerts, audits
• Source Data
• Data Lake: Raw Data (Long-Term Retention)
• Analytic Cluster(s)
• EDW (RDBMS): Highly Integrated (Data Vault)
• Centralized Business Rules (Transformations)
• OLAP
• Data Access: Report Writers, Analysts, Power Users, Data Scientists
• Intelligence Feeds Back to Internal Applications

Data Architecture w/Hadoop
• Keep Everything
• Historical System of Record
• Long-Term Retention of Raw and Transformed Data
• Highly Scalable Data Transformation
• Storage, Processing and Analytic Capabilities Not Possible Before

Data Architecture Summarized
• Same Concepts as Traditional Data Warehousing
• Mountains of Raw Data Are Liberating
• Colocated or Integrated
• Centralize Business Rules in the Data Transformation Layer
• Dimensional Structures Persist in the RDB
• Specialized Flat Structures for Advanced Analytics

Hadoop Ecosystem

Extract
[Diagram. Source Data (CDR, Billing, ERP, Network, Sensor); Data Lake; EDW; Analytics; Data Access (Power Users, Report Writers, Analysts, Data Scientists)]

Load
[Diagram. Load paths: Batch, Streaming, RDB to Hadoop. Source Data (CDR, Billing, ERP, Network, Sensor); Data Lake; EDW; Analytics; Data Access (Power Users, Report Writers, Analysts, Data Scientists)]

Transform
[Diagram. Transform labels: Internal, External. Source Data (CDR, Billing, ERP, Network, Sensor); Data Lake; EDW; Analytics; Data Access (Power Users, Report Writers, Analysts, Data Scientists)]

Ad Hoc Exploration
[Diagram. Source Data (CDR, Billing, ERP, Network, Sensor); Data Lake; EDW; Analytics; Data Access (Power Users, Report Writers, Analysts, Data Scientists)]

Reporting, Visualization & Analytics
[Diagram. Output labels: Reporting, Visualization, Adv. Analytics. Source Data (CDR, Billing, ERP, Network, Sensor); Data Lake; EDW; Analytics; Data Access (Power Users, Report Writers, Analysts, Data Scientists)]

Telecom
• Incident management
• Tens of thousands of devices and data sources to analyze
• Shift from reactive incident response to proactive maintenance windows
• Needed the ability to predict hardware failures to reduce or eliminate outages
• Direct impact on customer service and satisfaction

Model Inputs: Current and Future
$Y_1\,(\text{Metric}) = b_0 + b_1 X_1\,(\text{Impact Factor 1}) + b_2 X_2\,(\text{Impact Factor 2}) + \dots + E$
What the heck does that even mean? Failure likelihood ($Y_1$) is an aggregate of weighted ($b_1 \dots b_n$) input factors ($X_1 \dots X_n$).
• Current Factors
  – Known-suspect components
  – Ticket counts as "preponderance of evidence"
  – Locality
• Future Factors (as made available in Data Lake)
  – Time-series performance data, e.g. SNMP poll
  – Node event stream data, e.g. syslog, SNMP trap
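To make the formula concrete, here is a minimal Python sketch of the weighted sum. The slide's PoC ran in R; Python is used here only for illustration. The function name `failure_score`, the factor names, and every weight value are hypothetical and not taken from the deck; the error term E is left out because it is not an input at scoring time. Treating a missing factor as zero is also an assumption.

```python
from typing import Mapping


def failure_score(
    factors: Mapping[str, float],
    weights: Mapping[str, float],
    intercept: float = 0.0,
) -> float:
    """Return b_0 + sum(b_i * X_i) over the weighted factors.

    Factors absent from `factors` contribute zero (an assumption of this
    sketch, so that "future" factors can be added to `weights` before the
    data lake supplies values for them).
    """
    return intercept + sum(
        weight * factors.get(name, 0.0) for name, weight in weights.items()
    )


if __name__ == "__main__":
    # Hypothetical weights for the three "current" factors on the slide.
    example_weights = {"known_suspect": 2.0, "ticket_count": 0.5, "locality": 1.0}
    # One device: contains a known-suspect component and has 4 open tickets.
    device = {"known_suspect": 1, "ticket_count": 4}
    print(failure_score(device, example_weights, intercept=0.1))
```

In practice the weights would be fitted from historical failure data rather than chosen by hand.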

Basic Standalone Execution
[Diagram: R Engine, JDBC/ODBC, RDBMS]
• Only used during PoC/Discovery
• Only good for one data input class

Distributed Batch
[Diagram: Unstructured Text, Relational Data, and Structured Time-Series inputs; Model; Force-Ranked Suspect Report; stored on HDFS]
• Periodic Execution: Daily or Better
• Engine Options
  – MapReduce
  – Cascading
  – Spark
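The batch pattern can be sketched on a single machine as follows: combine more than one input class per device, score each device, and emit a force-ranked list. This is plain Python standing in for whichever engine is chosen, not MapReduce, Cascading, or Spark code. The function `force_rank`, the record layout, and the weights are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Set, Tuple


def force_rank(
    tickets: Iterable[Mapping[str, str]],
    known_suspects: Set[str],
    weights: Mapping[str, float],
) -> List[Tuple[str, float]]:
    """Score every device seen in either input and rank highest first."""
    # Input class 1: ticket records, aggregated to a count per device.
    ticket_counts = Counter(ticket["device"] for ticket in tickets)
    # Input class 2: devices containing known-suspect components.
    devices = set(ticket_counts) | set(known_suspects)

    scored = []
    for device in devices:
        score = (
            weights["ticket_count"] * ticket_counts[device]
            + weights["known_suspect"] * (1 if device in known_suspects else 0)
        )
        scored.append((device, score))

    # Highest score first; ties broken by device name for a stable report.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    report = force_rank(
        tickets=[{"device": "rtr-1"}, {"device": "rtr-1"}, {"device": "sw-2"}],
        known_suspects={"sw-3"},
        weights={"ticket_count": 1.0, "known_suspect": 1.5},
    )
    for device, score in report:
        print(device, score)
```

A scheduled job would run this logic once per period (daily or better, per the slide) and write the ranked report back to storage.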

Distributed Stream
[Diagram: Unstructured Text, Relational, and Structured Time-Series inputs; Model; HDFS; New Suspect Event]
• Flow Inputs into Model in Real-Time
  – Global state held resident, reevaluated for every input event
  – Model evaluation crosses threshold, emits new suspect event
• Engine Options
  – DataTorrent
  – Spark Streaming
  – Storm
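The two sub-bullets on real-time flow describe a simple control loop, sketched below in plain Python under stated assumptions. It is not DataTorrent, Spark Streaming, or Storm code. The class `StreamingSuspectModel`, the event fields, the weights, and the threshold are all hypothetical, and emitting only on the first crossing per device is a design choice of this sketch rather than something the slide specifies.

```python
from collections import defaultdict
from typing import Dict, Mapping, Optional, Set


class StreamingSuspectModel:
    """Keeps per-device scores resident and re-evaluates on every event."""

    def __init__(self, weights: Mapping[str, float], threshold: float) -> None:
        self.weights = weights
        self.threshold = threshold
        self.scores: Dict[str, float] = defaultdict(float)  # resident global state
        self.flagged: Set[str] = set()  # devices already reported

    def on_event(self, device: str, factor: str, value: float = 1.0) -> Optional[dict]:
        """Fold one input event into the state; return a suspect event or None."""
        # Unknown factor types are ignored (weight 0.0), an assumption here.
        self.scores[device] += self.weights.get(factor, 0.0) * value
        score = self.scores[device]
        # Emit only when the device first reaches the threshold.
        if score >= self.threshold and device not in self.flagged:
            self.flagged.add(device)
            return {"device": device, "score": score}
        return None


if __name__ == "__main__":
    model = StreamingSuspectModel(
        weights={"syslog_error": 1.0, "snmp_trap": 2.0}, threshold=3.0
    )
    for factor in ("syslog_error", "snmp_trap", "syslog_error"):
        event = model.on_event("rtr-1", factor)
        if event is not None:
            print("new suspect event:", event)
```

In a distributed engine the per-device state would be partitioned by device key so that each event is routed to the worker holding that device's score.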

How We Got There
• Operationalize Data Collection
  – Using technologies such as Apache Kafka, Flume, and Sqoop
• Model Extension
  – Highlight and incorporate useful subsets of other input types to improve predictive capability
  – Select a model execution engine and port the R model to it

How We Got There
[Diagram: Data Model + Data Input Collection Time (1 mo, 3 mo, 6 mo) = Increased Accuracy Results]
Operationalized on Hadoop with APSD

Architecture
[Diagram. Labeled components:]
• Source Data; Enterprise Repositories
• Data Ingestion: ETL, Real Time Ingestion
• Data in Motion: Custom Real Time Streaming Applications, Event Correlation and Notification
• Data at Rest: Batch Applications, Hive (Ad-Hoc SQL Queries), HCatalog (Data Metastore), MapReduce, YARN, Data Lake (HDFS)
• Data flow labels: Transformed, Archive, Persist
• RDBMS, EDW
• Analytics, Reporting Tools
• Applications: Existing Visualization/Reporting/Analytical Tools and Apps

Online Marketing and Advertising Company
• Legacy startup-style MySQL data silos, the result of years of organic growth, slowing time-to-insight
• Desire to ask deeper questions by combining disparate data sets and applying new techniques
• More and more data, shrinking delivery time windows
• Expanded analytics capabilities needed to remain competitive
• Required: platform scalability and flexibility through centralized data and improved query performance

How We Got There
• Modeling customer behavior based on search engine ad placement and keyword string queries
• Expanded machine-learning system leveraging data from various MySQL databases and log collection systems (net-new sponsor case)
• Parallel deployment of data lake: phased workload migration to new environment to protect stability
• As data becomes available, additional net-new capabilities follow
• Outcome: improved capabilities at a lower price point; original batch analytical use cases evolving into real-time responsive analytics

Architecture - Before
[Diagram. Labeled components:]
• Source Data
• Legacy Repositories (MySQL): OPS, KEYWORD, DW, REPORT
• Replication (three links between repositories)
• Legacy Process
• Direct Access
• Applications: Existing Visualization/Reporting/Analytical Tools and Apps

Architecture - After
[Diagram. Labeled components:]
• Source Data; Legacy Repositories; Legacy Process
• Data at Rest: Batch Applications, Hive/Impala (Ad-Hoc SQL Queries), HCatalog (Data Metastore), MapReduce, YARN
• Data Lake (HDFS): Landing Zone, Cold Data (Archived and Transformed Data)
• Data flow label: Transformed
• RDBMS, EDW, Targeted Data-Marts
• Applications: Existing Visualization/Reporting/Analytical Tools and Apps

Roadmap for Success
• Clear Business Case, Short and Long Term
• Well Defined Corporate Data Strategy
• Staff Transformation
• No Source Left Behind Mentality
• Highly Integrated
• Consistent Delivery/Single Source of the Truth
• Intelligence Feeds Back into the Business

Questions?
- Slides: 29