Starting a Data Science Team
Dr. Jonathan D. Adler
jonathandadler@outlook.com
http://jadler.info

About me
• Degrees:
  • BS – Mathematical Sciences (Worcester Polytechnic Institute)
  • MS – Applied Mathematics (Worcester Polytechnic Institute)
  • Ph.D. – Industrial Engineering (Arizona State University)
    • Research – real-time optimization in situations with uncertainty
• Jobs:
  • Vistaprint – Forecasting, customer segmentation, product recommendation engine
  • Boeing – Market forecasting
  • Promontory Growth and Innovation – Data science consulting
    • Created the data science team from scratch
  • Microsoft Studios – Xbox and Windows 10 analytics

The realization
• When a company gets to a certain size, it often realizes a few things:
  • It has a lot of data, and lots of questions
  • This data probably has useful information in it
  • There are people out there who can turn the data into useful insights
• “So let’s hire some people to turn that data into useful information”

There are a LOT of open questions
• What can we really do with data science?
• What makes a data science project successful?
• What are the skills the employees should have?
• Who should be our first hire? Our second? Our fifth?
• What are best practices for running the team?

Outline
1. What can we really do with data science?
2. What makes a data science project successful?
3. What are the skills the employees should have?
4. Who should be our first hire? Our second? Our fifth?
5. What are best practices for running the team?

1. What you can do with data science

Types of data science work
• Data science tends to fall into three broad categories, ordered from simple to complex:
  • Investigating – aggregating and inspecting data to get basic insights on what is currently happening (simple)
  • Predicting – taking the data and using it to understand what will happen in the future
  • Optimizing – using the data to choose the best course of action (complex)

Investigating
• Look at historic data to answer direct questions:
  • If you have two products, which is selling better? How many people are buying both?
  • How frequently do customers order?
  • How are sales changing each month?
• These questions are generally quick to answer and don’t require a mathematical model
• Difficulty is in knowing which measures to use and how to visualize/represent them
• Unfortunately, they don’t tell you much (“so what?”)
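
The questions on this slide can be answered with plain aggregation. A minimal sketch, assuming a hypothetical list of order records (the customer ids, product names, and months below are made up for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical order records: (customer_id, product, month). Made-up data.
orders = [
    ("c1", "A", "2024-01"),
    ("c1", "B", "2024-01"),
    ("c2", "A", "2024-01"),
    ("c2", "A", "2024-02"),
    ("c3", "B", "2024-02"),
    ("c1", "A", "2024-02"),
]

# Which product is selling better? Count orders per product.
units_by_product = Counter(product for _, product, _ in orders)

# How many people are buying both? Collect the set of products per customer.
products_by_customer = defaultdict(set)
for customer, product, _ in orders:
    products_by_customer[customer].add(product)
bought_both = sum(1 for bought in products_by_customer.values() if {"A", "B"} <= bought)

# How are sales changing each month? Count orders per month.
orders_by_month = Counter(month for _, _, month in orders)
```

No model is involved here; the judgment is in choosing which counts to compute and how to present them.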

Predicting
• Look at historic data to predict:
  • How likely is a customer to come back?
  • How will a customer respond to a sale?
  • Is revenue going to increase over time?
• This information is a lot more meaningful; you are more likely to be able to act on it
• Requires mathematical modeling: regressions, clustering algorithms, etc.
• Sometimes the data isn’t there to make the prediction, sometimes the prediction is wrong, and sometimes more skill is required to do it well
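
As an illustration of the simplest kind of model mentioned here, the sketch below fits an ordinary least squares line to monthly revenue and extrapolates one month ahead. The revenue figures are made up, and `fit_line` is a hypothetical helper written for this example, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up monthly revenue for months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
revenue = [100.0, 108.0, 113.0, 121.0, 130.0, 134.0]

intercept, slope = fit_line(months, revenue)

# A positive slope suggests revenue is increasing; extrapolate to month 7.
forecast_month_7 = intercept + slope * 7
```

A straight line is only one possible assumption; if the real trend is seasonal or flattening, this forecast will be wrong, which is the risk the slide points out.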

Optimizing
• Look at the historic data to make the best decisions
  • How much inventory should be held, and when should you reorder?
  • Which product should you recommend to a website visitor for the most profit?
  • What price should you set for a product? When should it go on sale?
• These problems are the hardest to get right
• They also directly provide the most profit
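
The pricing question can be sketched as a small search: given some model of demand, pick the candidate price with the highest expected profit. Everything below is an assumption for illustration, in particular the linear demand curve, the unit cost, and the candidate prices; in practice the demand model would come from a prediction step.

```python
def expected_profit(price, unit_cost, demand_at):
    """Profit = margin per unit times predicted units sold at that price."""
    return (price - unit_cost) * demand_at(price)

def best_price(candidates, unit_cost, demand_at):
    """Return the candidate price with the highest expected profit."""
    return max(candidates, key=lambda p: expected_profit(p, unit_cost, demand_at))

# Hypothetical demand model: 100 units at price 0, losing 4 units per dollar.
def demand_at(price):
    return max(0.0, 100.0 - 4.0 * price)

candidates = [5, 10, 15, 20]
price = best_price(candidates, unit_cost=5.0, demand_at=demand_at)
```

The optimization itself is trivial here; the hard part is that the answer is only as good as the demand model feeding it.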

Things you can’t do with data science
• Solve problems where the core drivers aren’t in the data, or the signal is too weak in the noise
  • Which start-up companies are going to succeed
  • When is the next recession going to hit
• Having data isn’t sufficient for success
  • Knowing the last 10,000 flips of a fair coin won’t help me predict the next flip
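
The coin example can be checked with a short simulation. This is a sketch under stated assumptions: the "predictor" simply guesses whichever side was more common in the history, and the function name, seed, and trial counts are arbitrary choices for illustration.

```python
import random

def majority_guess_accuracy(n_history=10_000, n_trials=2_000, seed=0):
    """Guess the historical majority side, then score it on fresh fair flips."""
    rng = random.Random(seed)
    # True means heads. The history is a sequence of fair flips.
    history = [rng.random() < 0.5 for _ in range(n_history)]
    guess_heads = sum(history) * 2 >= n_history
    # Score the same guess against new, independent fair flips.
    hits = sum((rng.random() < 0.5) == guess_heads for _ in range(n_trials))
    return hits / n_trials

accuracy = majority_guess_accuracy()
```

The accuracy stays close to one half regardless of how much history is used, because the history carries no signal about the next flip.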

2. What makes a data science project successful

Case Study 1: an ecommerce company
• A company that sold brownies online wanted to improve their marketing. They had two types of customers:
  • Consumers ordering for friends and family
  • Businesses ordering for their clients
• Wanted to target their customers differently, but couldn’t consistently tell if a customer was a business or a consumer
• Data science approach: analyze the text on the gift message to determine if it had “business” or “consumer” words
• Result: a continuously running script determined if each new order was for a business or consumer, and the customer was put into one of the two categories

Gift message | Probability of business
THANK YOU ALL FOR YOUR AND HARD WORK, IT IS TRULY APPRECIATED BY THE MANAGEMENT TEAM | 0.989084
CONGRATULATIONS AND BEST OF LUCK ON YOUR NEW JOB! WE ARE VERY PROUD OF YOU! LOVE MOM | 0.019581
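
The slides do not say which text model was used. As one possible approach, the sketch below scores a gift message with a small naive Bayes classifier with add-one smoothing. The training messages, the function names, and the choice of naive Bayes are all assumptions for illustration.

```python
import math
import re
from collections import Counter

LABELS = ("business", "consumer")

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def train(labeled_messages):
    """Count words per label from (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in labeled_messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["business"]) | set(word_counts["consumer"])
    return word_counts, doc_counts, vocab

def probability_of_business(text, model):
    """Naive Bayes posterior for 'business', with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Convert log scores to a probability, subtracting the max for stability.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights["business"] / (weights["business"] + weights["consumer"])

# Made-up training messages; a real model would need many labeled orders.
model = train([
    ("thank you for your hard work this quarter", "business"),
    ("we appreciate your partnership from the management team", "business"),
    ("happy birthday love mom and dad", "consumer"),
    ("congratulations we are so proud of you love grandma", "consumer"),
])
```

Calling `probability_of_business("thank you team for your hard work", model)` gives a value above 0.5, and `probability_of_business("love you mom", model)` gives a value below 0.5.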

The data science process
[Diagram: Questions, Data, Analysis, Modeling, Abort, Result, Productionize]
• All analyses begin with questions
• A data scientist will take the question and investigate the data
• Cases:
  • it’s clear the data isn’t right to answer the question
  • advanced modeling is required
  • the answer to the question can be found immediately in the data
• If a result is found
  • It can be productionized
  • It can raise more questions

Case Study 2: a manufacturing company
• A manufacturing company made highly customized machines
  • “build a machine that builds lightbulbs”
• The company needs to make quotes for the price without ever having made that machine before
• Costs could be substantially higher or lower than quoted
• Problem wasn’t with price but with quality of estimates
• Data science used to better predict how much a machine would cost

Following the data science process:
• Question: how can we predict the cost of a machine?
• Data: features of previous machines, their estimates, and the actual costs
• Analysis: there is a relationship between features and the estimate/actual ratio
• Model: a GAMLSS to predict the true cost and the possible error band
• Result: model successfully predicted costs better
• Productionize: a simple GUI for the company, and a contract for refitting the model
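
The idea of a predicted cost with an error band can be illustrated with something much simpler than a GAMLSS. The sketch below is a stand-in, not the model from the case study: it ignores machine features entirely and only uses the historical actual/estimate ratio and its spread. The cost figures and the `cost_band` helper are made up.

```python
import statistics

# Hypothetical history of past machines: (quoted estimate, actual cost).
history = [
    (100.0, 120.0),
    (200.0, 210.0),
    (150.0, 195.0),
    (300.0, 330.0),
    (250.0, 250.0),
]

# How far off were past estimates, as a ratio of actual to estimate?
ratios = [actual / estimate for estimate, actual in history]
mean_ratio = statistics.mean(ratios)
spread = statistics.stdev(ratios)

def cost_band(estimate, k=2.0):
    """Return (low, expected, high) cost for a new estimate.

    The band is the mean ratio plus or minus k sample standard deviations,
    which is a rough heuristic and not a calibrated interval.
    """
    low = estimate * (mean_ratio - k * spread)
    expected = estimate * mean_ratio
    high = estimate * (mean_ratio + k * spread)
    return low, expected, high

low, expected, high = cost_band(100.0)
```

A model like the one in the case study would let both the expected ratio and the width of the band depend on the machine's features, which this sketch does not attempt.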

The data science process: problems
[Diagram: Questions, Data, Analysis, Modeling, Abort, Result, Productionize]
Possible problems:
• Data inconsistent or poorly formatted
• Question ill-formed
• Model can be faulty – overfitting or incorrect assumptions
• Signs the model won’t work are ignored
• Stuck in a “what about” loop
• Model not built in a way that makes for simple productionizing

Case Study 3: a distributor
Situation:
• A company sells products in bulk to state governments
• Wants to discount each quote to a price that brings in the most profit
• Data science used to determine which price
Problems:
• No clear relationship found between customer and chance of accepting a quote
• A highly advanced model was used that found a relationship, but it was unintuitive and not robust, so it was hard to tell if it was working correctly
• Model was built on training data that was in a different format from the production data, so the entire model had to be rebuilt to productionize
• Was stuck in a “what about” loop: continuously cutting the data in different ways to satisfy the customer
End result: project was over budget and under-delivered, and is now extremely difficult to maintain

3. Skills needed to be a data scientist

The five skills needed to do data science
Technical Skills
• Statistics and Math – The different techniques used on data: regressions, clustering algorithms, time series models
• Software Development – How to write code, how to manage a code base, how to store data in a database
• Business Experience – Where companies waste money, what makes a project succeed, how to get data from within a company
Personal Skills
• Leadership – How to help other data scientists, how to train them, and how to work with a customer to produce good results
• Adaptability – The ability to figure out a solution when presented with an entirely new problem

Data scientist archetypes
• Junior data scientist (J) – Has a BS, and less than three years of experience in industry. Tends to know only the simple statistical techniques. Requires a lot of guidance, but is happy to do the less interesting work. ($)
• Expert junior scientist (E) – A junior data scientist who has been working for 5+ years. Gets very comfortable doing the simple stuff and knows their business area deeply. May have gotten an MS to help their career. ($$)
[Charts: skill levels for each archetype across Statistics, Coding, Business, Leadership, Adaptability]

Data scientist archetypes
• Senior data scientist (S) – A person with an advanced degree, and enough business experience to know what to do with it. Understands coding well enough to do things right. Is less willing to do the less interesting work than a junior data scientist, but will if no one else is around. The big difference between a senior and an expert junior is the ability to learn independently. ($$$)
• Principal data scientist (P) – Just like a senior data scientist, but also with experience leading a team and a project. Difficult to find. ($$$$)
[Charts: skill levels for each archetype across Statistics, Coding, Business, Leadership, Adaptability]

Data scientist archetypes: danger
• Business intelligence analyst (B) – Understands a lot about the business and the data powering it. Doesn’t know much about statistics or what to do with the data. Can be dangerous without proper guidance. ($)
• Academic (A) – Has an MS/Ph.D. and not too much business experience. Loves to think about interesting problems, but is less willing to spend the time doing the mundane work to get a project done (and might not know how!). ($$$)
[Charts: skill levels for each archetype across Statistics, Coding, Business, Leadership, Adaptability]

4. Who you should hire

Hiring
• You only need to hire one data scientist; they’ll hire the rest
• Who you choose for the first hire dramatically alters how your team will end up

First hire choice
• [Expert] Junior data scientist: “the blind leading the blind.” This team will know how to do simple data science but won’t know how to do advanced work. Often won’t even know the advanced things exist. Often very inefficient because they don’t know any better.
• Academic: “the ivory tower.” A team of people who look at only the most complex problems, and spend tons of time talking about how interesting the problems are and about innovative solutions for them. Won’t produce very many solutions.
[Diagram: example team compositions using the archetype letters: E J J J J A A A J A]

First hire choice
• Senior data scientist: “the very strong and expensive team.” This team will be efficient and generally produce good results. The team members won’t enjoy doing the simpler work, but it’ll get done.
• Principal data scientist: “the balanced.” The principal data scientist will be expensive, but the people they hire won’t be. They will set the groundwork for the team to run efficiently, and will be able to support a junior team.
[Diagram: example team compositions using the archetype letters: S S P S S J E J J]

5. Best practices

From data to a result
[Diagram: Questions, Data, Analysis, Modeling, Abort, Result, Productionize; the input is tables in a database and the output is a presentation]
• Tools are needed for this process

From data to a result: work streams
Least efficient (tables in a database to a .pptx file):
1. Database queries to aggregate and join the data
2. SAS code to analyze the aggregated data and run a model
3. Excel worksheets to visualize the data
4. PowerPoint to make the presentation
Most efficient (tables in a database to a .pdf file):
1. A single set of R code to aggregate and join the data, run the model, visualize the output, and make a presentation
Firms underestimate the total loss from an inefficient work stream

Storing knowledge
• Data science produces lots of types of files
  • Raw data from the client
  • Processed intermediate data
  • Code to do the analysis
  • Base results
  • Finalized reports and presentations
• Often an analysis is done once
  • May never be looked at again
  • In a year, someone might ask to do the analysis again with changes

Storing knowledge: methods
Least robust:
Each data scientist has a folder on a share drive for each project, containing all of the data, code, and results for that project
• Doesn’t make clear what files are used for
• Doesn’t track changes over time
• Doesn’t indicate what was delivered to the client
Most robust:
Materials split into three components
1. Input data is stored in folders sharing a consistent scheme
2. Code for analysis is stored in version control to track changes
3. Output is stored in folders with marked versions connected to the code
• Anything delivered to the client is marked with how it was created
• Allows for clear change logs to see differences in versions
• Splitting input data allows for easy data updates

Project management
[Diagram: Questions, Data, Analysis, Modeling, Abort, Result, Productionize]
• Data science process involves many small tasks
  • Finding data
  • Initial analysis
  • Attempting multiple models
• With multiple projects and multiple people, coordination is non-trivial

Project management: methods
• For project setup, have a standard expected timeline, for example:
  • Initial investigation: 2 weeks
  • Modeling: 4 weeks
  • Result validation: 2 weeks
  • Productionizing: 4 weeks
• In the timeline, have set points to meet with the client and review
• Use a card-based tool like Trello to track the progress of individual steps
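
The standard timeline above can be turned into concrete dates with a few lines of code. A minimal sketch, assuming phases run back to back and that a client review falls at the end of each phase (the slide does not say where the review points go); the `schedule` helper and the start date are hypothetical.

```python
from datetime import date, timedelta

# Phase names and durations in weeks, taken from the example timeline.
PHASES = [
    ("Initial investigation", 2),
    ("Modeling", 4),
    ("Result validation", 2),
    ("Productionizing", 4),
]

def schedule(start):
    """Return (phase, start_date, end_date) tuples for back-to-back phases."""
    plan = []
    current = start
    for name, weeks in PHASES:
        end = current + timedelta(weeks=weeks)
        plan.append((name, current, end))
        current = end
    return plan

plan = schedule(date(2024, 1, 1))

# Assumed convention: review with the client at the end of each phase.
review_dates = [end for _, _, end in plan]
```

Each phase could then become a card in the tracking tool, with the review date as its due date.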

Conclusion

Conclusion
1. What can we really do with data science? Many problems that rely on data can be solved, from simple investigation of the data to building predictive models and optimization algorithms.
2. What makes a data science project successful? Having a clear path from the data to the result, and ensuring the project gets completed (or aborted) at the right time.
3. What are the skills the employees should have? Statistics, software development, business experience, leadership, and adaptability.
4. Who should be our first hire? Our second? Our fifth? Someone with all five of those skills, or failing that, someone with all of them but leadership.
5. What are best practices for running the team? Have a clear, efficient process for doing data science and storing the results.

Questions?

Appendix

Hiring roadmap
• If you can find a principal data scientist
  • Hire him or her, and have them set up the groundwork for the team
  • 3 months in, hire a senior data scientist
  • 6 months in, hire a junior data scientist
  • By 18 months in, have a team of 5-6 people
• If you can’t find a principal data scientist
  • Hire a senior data scientist to work independently
  • Every 3 months hire an additional senior data scientist
  • If at any point there seems to be too much simple work, start hiring junior data scientists and assign them senior data scientists as mentors

Ensuring candidates have these skills
During the hiring process, check:
• Statistics and Math – Do they know how to use a linear regression? What overfitting is? Supervised vs. unsupervised learning?
• Software Development – Have they used R, Python, or MATLAB? Have they used source control? Have they pulled data from a SQL database, and do they understand how to do joins?
• Business Experience – Do they have experience working in a company? Have they seen a project through to completion? Can they reflect on why a project succeeded or failed?
• Leadership – Have they managed a project? Have they led employees?
• Adaptability – Do they have experience in figuring out a solution to an entirely new problem without substantial guidance?