Breeding Data Scientists
• Danielle Dean, Ph.D., Senior Data Scientist Lead, Microsoft
• Amy O’Connor, Business Value Enablement, Cloudera
Five changes in the world of the Data Scientist
• More Data, Insights, Results
• Organization & Culture
• Data Engineering
• Productivity Tools
• Cloud Enabled
More Data, More Insights
• Data is abundant, diverse & shared freely
• As is how we store, process and analyze it: Streaming, Machine Learning, ETL, Modeling, BI
More Results
• Destroying Human Trafficking Networks: Thorn
• Working to Cure Cancer: Top Cancer Research Institutions
• Rocket Science
Organization & Culture: Sobering Statistics
• “Only 27% of the big data projects are regarded as successful”
• Only 13% of organizations have achieved full-scale production for their Big Data implementations
• “Only 8% of the big data projects are regarded as VERY successful”
Source: Capgemini 2014
• “Only 17% of survey respondents said they had a well-developed Predictive/Prescriptive Analytics program in place, while 80% said they planned on implementing such a program within five years”
Source: Dataversity 2015 Survey
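The Data Scientist is not one person
• Curiosity
• Venn diagram of three skill sets: Hacking Skills, Math and Statistical Knowledge, Substantive Expertise
• Overlaps: Machine Learning (hacking + math), Traditional Research (math + expertise), Danger Zone (hacking + expertise), Data Science (all three)
Source: Drew Conway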
The Data Scientist does not stand alone
• Data Engineer/ETL Engineer
• Executive Sponsor
• Data Scientist
• Subject Matter Expert
• Data Steward/SME
• + Product Owner, app developer, program manager, DevOps, etc.
The Data Scientist does not sit in a centralized org
• Other - 37%
• CIO or IT Function - 18%
• CMO - 11%
• CFO - 9%
• Chief Analytics Officer - 7%
• CRO / Risk - 7%
• VP Strategic Planning - 5%
• VP Sales - 3%
• Chief Data Officer - 3%
• VP Customer Service - 3%
Source: Gartner 2016
“How do I become a Data Scientist?”
Importance of Process
• Data Science != Software Engineering
• But we can learn a lot, especially on processes. After all, failing to plan is planning to fail.
Data Acquisition
1. Data Flow Architecture
2. Data Schema Architecture
3. Data Flow Implementation
4. Data Flow Validation
Data Science
1. Data Problem Formulation
2. Acquire Data Sources
3. Data exploration
4. Create analytics dataset
5. Modeling & Descriptive Analysis
6. Model evaluation and tuning
7. Model Deployment
Also on the slide: 2. Feature Extraction
Four Pillars of the Team Data Science Process
1. Standard Project Lifecycle
2. Standardized Document Templates, Project Structure
3. Shared, Distributed Resources
4. Productivity Tools, Shared Utilities
Team Data Science Process at Microsoft
• Data science virtual machines (DSVMs) as the fundamental development platform on cloud
• Use Visual Studio Team Services (VSTS)
• Work item tracking and scrum planning
• Git repositories
• Shared data science utilities in Git repository
• Use cloud-based Azure resources as needed
Data Engineering – ready for ML?
The better the raw materials, the better the product.
• Question is sharp. E.g. Predict whether component X will fail in the next Y days; clear path of action with answer
• Data measures what they care about. E.g. Identifiers at the level they are predicting
• Data is accurate. E.g. Failures are really failures, human labels on root causes; domain knowledge translated into process
• Data is connected. E.g. Machine information linkable to usage information
• A lot of data. E.g. Will be difficult to predict failure accurately with few examples
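A minimal sketch of two of the checks above: turning the sharp question ("will component X fail in the next Y days?") into a binary label, and measuring how much machine information is linkable to usage information. The function names, machine IDs, and dates are hypothetical and purely illustrative; they do not come from the talk.

```python
from datetime import date, timedelta


def label_fails_within(snapshot_day, failure_days, horizon_days):
    """Return 1 if any failure falls in (snapshot_day, snapshot_day + horizon_days], else 0."""
    window_end = snapshot_day + timedelta(days=horizon_days)
    return int(any(snapshot_day < d <= window_end for d in failure_days))


def link_coverage(machine_ids, usage_ids):
    """Return the fraction of machines that have at least one usage record."""
    machines = set(machine_ids)
    if not machines:
        return 0.0
    return len(machines & set(usage_ids)) / len(machines)


# Hypothetical example data: failure dates per machine, and IDs seen in usage logs.
failures = {"m1": [date(2016, 3, 10)], "m2": []}
usage_log_ids = ["m1", "m1", "m3"]

snapshot = date(2016, 3, 1)
labels = {
    machine: label_fails_within(snapshot, days, horizon_days=14)
    for machine, days in failures.items()
}
coverage = link_coverage(failures.keys(), usage_log_ids)

print(labels)    # one 0/1 label per machine
print(coverage)  # share of machines that can be joined to usage data
```

A low coverage value, or very few positive labels, is an early hint that the data is not yet connected or plentiful enough for the question being asked.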
A Bit more on Data Engineering
How do Data Scientists spend their time?
• Cleaning & organizing data - 60%
• Collecting data sets - 19%
• Mining data for patterns - 9%
• Refining algorithms - 4%
• Building training sets - 3%
• Other - 5%
Source: CrowdFlower
Gartner estimates that poor quality of data costs an average organization $13.5 million per year, and yet data governance problems, which all organizations suffer from, are worsening.
A Bit more on Data Engineering
• Data Ingestion (Kafka, Navigator, Search): Cloudera enables users to build real-time, end-to-end data pipelines in order to power their business. Leadership in Apache Spark and Kafka has made Cloudera a trusted resource for users who want to capture real-time, streaming, and time series data without being presented with gaps in security.
• Data Processing (Spark, Hive): Cloudera is helping users accelerate their data pipelines with leadership in technologies like Apache Spark. Data processing in Cloudera Enterprise can help take processing windows from hours to minutes and enables faster access to data for a variety of users and skillsets.
Data Engineering/Science/Analyst Tools
[Bar charts: Cloudera Certified Partners, 2015 vs. 2016, in three panels: Data Engineering, Data Science/Analytics, Data Analyst / BI]
Flexible deployments: Cloud enabled
Cloudera Director
Easy Administration
• Dynamic cluster lifecycle management
• Single pane of glass: multi-cluster view
• Consumption based billing and metering
Enterprise-grade
• Integration across Cloudera Enterprise
• Management of CDH deployments at scale
Flexible Deployments
• No cloud vendor lock-in: open plugin framework for IaaS platforms
• Scaling of provisioned clusters
• Spot instance provisioning
Cortana Intelligence Suite on Azure cloud platform
Data → Intelligence → Action
• Data Sources: Apps, Sensors and devices, Data
• Information Management: Data Factory, Data Catalog, Event Hubs
• Big Data Stores: Data Lake Store, SQL Data Warehouse
• Machine Learning and Analytics: Machine Learning, Data Lake Analytics, HDInsight (Hadoop and Spark), Stream Analytics
• Intelligence: Cognitive Services, Bot Framework, Cortana
• Dashboards & Visualizations: Power BI
• Action: People, Apps (Web, Mobile, Bots), Automated Systems
More Data = More results!
• Create a data driven culture & DS processes
• Careful checking and cleaning of data
• Use the right tool for the job
• Leverage the power of the cloud
Resources
• Microsoft’s “Team Data Science Process” GitHub: http://aka.ms/tdsp
• Productive utilities repository: https://github.com/Azure-TDSP-Utilities
• Sign up for a free VSTS account: http://www.visualstudio.com
• Complete Cloudera resource library: https://www.cloudera.com/resources.html
• Coursera Data Science: http://www.coursera.org