https://xkcd.com/1425/

Properties (COMP 309)
DIMENSIONALITY REDUCTION, CLUSTERING, CLASSIFICATION, UNSUPERVISED LEARNING, REGRESSION, REINFORCEMENT LEARNING, ROBOTICS, GAME AI
Adapted from: http://www.cognub.com/index.php/cognitive-platform/

CRISP-DM COMP 309

CRISP-DM
Cross-industry standard process for data mining (CRISP-DM). Most widely recognised and tested:
• As a methodology, it includes descriptions of the typical phases of a project, the tasks involved with each phase, and an explanation of the relationships between these tasks.
• As a process model, CRISP-DM provides an overview of the data mining life cycle.

CRISP-DM: Why successful?
It's simple, easy to implement and works for industry!
• Only has six parts
• Low overhead
• Is process independent

CRISP-DM
Developed in 1996 by big players in data analysis (some say five, some say three, but they were big)
• The life cycle model consists of six phases with arrows indicating the most important and frequent dependencies between phases.
• The sequence of the phases is not strict: most projects move back and forth between phases as necessary.
• The CRISP-DM model is flexible and can be customized easily.
• CRISP-DM allows you to create a data mining model that fits your particular needs.
• Phases can be revisited during long-term planning and for future data mining goals.

CRISP-DM: phases, tasks and outputs
Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report
Data Preparation
• Data Set; Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data
Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision
Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
https://www.the-modeling-agency.com/crisp-dm.pdf

CRISP-DM
• Business Understanding
  • This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.
• Data Understanding
  • The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
• Data Preparation
  • The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
• Modeling
  • In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
• Evaluation
  • At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
• Deployment
  • Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. However, even if the analyst will not carry out the deployment effort, it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models.

Business Understanding <-> Data Understanding
• Business Understanding
  • Statement of Business Objective
  • Statement of Data Mining objective
  • Statement of Success Criteria
  • Determine business objectives
  • Assess situation: resources (data!), risks, costs & benefits
  • Determine data mining goals: ideally with quantitative success criteria
  • Develop project plan
• Data Understanding
  • Explore the data and verify the quality
  • Find outliers
  • Collect initial data
  • Describe data: persist results
  • Explore data: persist results
  • Verify data quality: carefully document problems and issues found!
https://www.slideshare.net/lopusz/crispdm-agile-approach-to-data-mining-projects

Business Understanding

Business Understanding
Estimate time line, budget, but also tools and techniques
Difficult!
• Often, you have to enter a new field
• You have to explain data science limitations to non-experts
  • No, performance will not be 100%
  • We need much more data to train an accurate model
  • “For tomorrow? It is impossible”
DOs and DON'Ts
• Have a lot of patience for vaguely defined problems
• Learn to concretize or even reduce the scope of the initial idea
  • Data sample
  • Real-life use cases
  • Quantitative success metrics
• Do not waste your time on ill-defined, unrealistic projects
https://www.slideshare.net/lopusz/crispdm-agile-approach-to-data-mining-projects

Data Understanding

Data Understanding
Validate Everything. Spot Anomalies.
DOs and DON'Ts
• Do not economize on this phase
  • The earlier you discover issues with your data the better (yes, your data will have issues!)
  • Data understanding leads to domain understanding; it will pay off in the modelling phase
• Do not trust data quality estimates provided by your customer
  • Verify, as far as you can, that your data is correct, complete, coherent, deduplicated, representative, independent, up-to-date, stationary
• Investigate what sort of processing was applied to the raw data
• Understand anomalies and outliers
https://www.slideshare.net/lopusz/crispdm-agile-approach-to-data-mining-projects
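
Some of the checks listed above (complete, deduplicated, outliers) can be automated early. The following is a minimal sketch in plain Python, assuming records arrive as a list of dicts. The field names (`id`, `weight_kg`), the sample values and the 1.5 × IQR outlier rule are illustrative assumptions, not something the slides prescribe.

```python
from statistics import quantiles

def quality_report(records, key_field, numeric_field):
    """Summarise missing values, duplicate keys and IQR outliers for one numeric field."""
    # Completeness: count records where the numeric field is absent or None.
    n_missing = sum(1 for r in records if r.get(numeric_field) is None)

    # Deduplication: collect keys that appear more than once.
    seen, duplicate_keys = set(), []
    for r in records:
        if r[key_field] in seen:
            duplicate_keys.append(r[key_field])
        seen.add(r[key_field])

    # Outliers: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (an assumed rule of thumb).
    values = [r[numeric_field] for r in records if r.get(numeric_field) is not None]
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]

    return {"n_missing": n_missing, "duplicate_keys": duplicate_keys, "outliers": outliers}

# Made-up sample data: one duplicated id, one missing value, one implausible weight.
records = [
    {"id": 1, "weight_kg": 4.1},
    {"id": 2, "weight_kg": 4.3},
    {"id": 3, "weight_kg": 3.9},
    {"id": 4, "weight_kg": 4.0},
    {"id": 5, "weight_kg": 4.2},
    {"id": 5, "weight_kg": 4.2},
    {"id": 6, "weight_kg": None},
    {"id": 7, "weight_kg": 40.0},
]
report = quality_report(records, "id", "weight_kg")
print(report)
```

A report like this is only a starting point: it documents problems found, but deciding whether a flagged value is an error or a genuine extreme still needs domain knowledge.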

Data preparation
Usually takes over 90% of our time
• Data selection
  • active role in ignoring non-contributory data?
  • use of samples
  • visualization tools
  • outliers?
• Collection
• Assessment
• Consolidation and Cleaning
  • table links, aggregation level, missing values, etc.
• Transformations: construct/create new variables
• Integrate data: merge information from different sources
• Format data: convert to format convenient for modelling
Figures: correlation matrix of image features; rank of features by importance (http://www.mdpi.com/1424-8220/18/4/1027)
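
A correlation matrix like the one in the figure can help spot redundant (non-contributory) features during data selection. Below is a small sketch using only the standard library. The feature names and values are made up for illustration, and Pearson correlation is assumed as the measure since the slide does not say which one was used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (neither may be constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation between features a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical image features; "contrast" is deliberately 2 * "brightness".
features = {
    "brightness": [1.0, 2.0, 3.0, 4.0, 5.0],
    "contrast": [2.0, 4.0, 6.0, 8.0, 10.0],
    "noise": [5.0, 3.0, 4.0, 1.0, 2.0],
}
corr = correlation_matrix(features)
for name, row in corr.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Two features with a correlation close to 1 or -1 carry nearly the same linear information, so one of them is a candidate for removal.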

Data preparation

Data preparation
Tedious! Use workflow tools (pipelines) to document, automate & parallelize data prep.
Data understanding and preparation will usually consume half or more of your time! (Source: KDNuggets Poll)
DOs and DON'Ts
• Automate this phase as far as possible
• When merging multiple sources, track provenance of your data
• Use workflow tools to help you with the above
• Prepare your customer for the fact that data understanding and preparation take a considerable amount of time
https://www.slideshare.net/lopusz/crispdm-agile-approach-to-data-mining-projects
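
One lightweight way to follow the "track provenance" advice is to record, for every merged field, which source supplied it. The sketch below assumes each source is a list of dicts sharing a key field. The source names (`crm`, `billing`), the `customer_id` key and the `_provenance` field are hypothetical, and a real workflow tool would offer more than this.

```python
def merge_with_provenance(sources, key):
    """Merge records from several named sources on `key`.

    Each merged row gets a `_provenance` dict mapping field name -> source name.
    """
    merged = {}
    for source_name, records in sources.items():
        for record in records:
            row = merged.setdefault(record[key], {key: record[key], "_provenance": {}})
            for field, value in record.items():
                if field == key:
                    continue
                row[field] = value  # a later source overwrites an earlier one
                row["_provenance"][field] = source_name
    return list(merged.values())

# Hypothetical sources with made-up records.
sources = {
    "crm": [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Bob"}],
    "billing": [{"customer_id": 1, "balance": 12.5}],
}
rows = merge_with_provenance(sources, "customer_id")
for row in rows:
    print(row)
```

Because the overwrite rule is explicit in one place, a later data quality question ("where did this balance come from?") can be answered from the merged row itself.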

Model Building
• Selection of the modeling techniques is based upon the data mining objective
• Modeling is an iterative process, different for supervised and unsupervised learning
• May model for either description or prediction

Model Building

Model Building
• Select modelling technique: assumptions, measure of accuracy
• Generate test design
• Build model: feature eng., optimize model parameters
• Assess model
Iterate the above
• Should I use a general purpose language? Breadth = performance, lots of general purpose libraries and tooling
• Should I use a data analysis language? Depth = easy data manipulation, latest models and statistical techniques available
• Should I use an ML tool? Combines breadth and depth, but at a cost
• Where will your model be deployed?
• Do you need to distribute your computations? (avoid!)
• Can I afford a prototype?
https://www.slideshare.net/lopusz/crispdm-agile-approach-to-data-mining-projects

Model Building
DOs and DON'Ts
• Be creative with your features (feature engineering)
  • Esp. from textual data or time-series you can generate a lot of features
• Make conscious decision about missing data (NAs) and outliers (regression!)
• Allocate time for hyperparameter optimization
• Whenever possible, peek inside your model and consult it with a domain expert
  • Assess feature importance
  • Run your model on simulated data
• Develop your model with deployment conditions in mind
https://www.slideshare.net/lopusz/crispdm-agile-approach-to-data-mining-projects
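
To make "allocate time for hyperparameter optimization" concrete, here is a toy grid search over a single hyperparameter. It is only a sketch: the 1-D k-nearest-neighbour classifier, the candidate values of `k` and the data are stand-ins chosen for illustration, not a technique named in the slides.

```python
def knn_predict(train, x, k):
    """Binary k-NN on one feature: majority label (0 or 1) of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, validation, k):
    """Fraction of validation points whose label is predicted correctly."""
    hits = sum(1 for x, label in validation if knn_predict(train, x, k) == label)
    return hits / len(validation)

def grid_search(train, validation, candidates):
    """Score every candidate k on the validation set; ties go to the first candidate."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Made-up data as (feature, label); the point (3.5, 1) is a deliberately noisy label.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (3.5, 1), (6.0, 1), (7.0, 1), (8.0, 1)]
validation = [(2.5, 0), (3.4, 0), (6.5, 1), (5.5, 1)]

best_k, scores = grid_search(train, validation, candidates=[1, 3, 5])
print(best_k, scores)
```

With `k=1` the noisy training point misleads one validation prediction, while larger `k` smooths it out. The validation set used for tuning should stay separate from the test data used for the final evaluation.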

Model Evaluation
• Evaluation of model: how well did it perform on test data?
  • Methods and criteria depend on model type, e.g. confusion matrix with classification models, mean error rate with regression models
  • [Figure labels: Actual A, Forecast F]
• Interpretation of model: important or not, easy or hard, depends on algorithm
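
The two kinds of criteria mentioned above can be sketched in a few lines. The example assumes a confusion matrix stored as counts of (actual, predicted) pairs, and uses mean absolute error as one concrete choice of "mean error" between actual values A and forecasts F. The labels and numbers are invented.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual label, predicted label) pair occurs."""
    return Counter(zip(actual, predicted))

def mean_absolute_error(actual, forecast):
    """Average of |A - F| over all observations."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

# Classification example with made-up labels.
actual_labels = ["spam", "spam", "ham", "ham", "ham"]
predicted_labels = ["spam", "ham", "ham", "ham", "spam"]
cm = confusion_matrix(actual_labels, predicted_labels)
for (actual, predicted), count in sorted(cm.items()):
    print(f"actual={actual:5} predicted={predicted:5} count={count}")

# Regression example: actual values A and forecasts F.
A = [10.0, 12.0, 9.0]
F = [11.0, 10.0, 9.0]
mae = mean_absolute_error(A, F)
print("MAE:", mae)
```

Which cells of the confusion matrix matter most (for example, spam predicted as ham versus ham predicted as spam) depends on the business success criteria, not on the model.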

Model Evaluation

Model Evaluation
• Evaluate results: business success criteria fulfilled?
• Review process
• Determine next steps: to deploy or not to deploy?
DOs and DON'Ts
• Work with the performance criteria dictated by your customer's business model
• Assess not only performance, but also practical aspects related to deployment, for example:
  • Training and prediction speed
  • Robustness and maintainability (tooling, dependence on other subsystems, library vs. homegrown code)
• Watch out for data leakage, for example:
  • Time series: mixing past and future
  • Meaningful identifiers
  • Other nasty ways of artificially introducing extra information, not available in production
https://www.slideshare.net/lopusz/crispdm-agile-approach-to-data-mining-projects
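
For the time-series leakage case, one common safeguard is to split by time rather than at random, so that the model never trains on observations later than those it is tested on. A minimal sketch, assuming each row carries a date; the field name `day`, the cutoff and the values are hypothetical.

```python
from datetime import date

def chronological_split(rows, time_field, cutoff):
    """Put rows strictly before `cutoff` in train and the rest in test."""
    ordered = sorted(rows, key=lambda row: row[time_field])
    train = [row for row in ordered if row[time_field] < cutoff]
    test = [row for row in ordered if row[time_field] >= cutoff]
    return train, test

# Made-up observations, deliberately out of order.
rows = [
    {"day": date(2024, 1, 15), "sales": 10},
    {"day": date(2024, 4, 2), "sales": 14},
    {"day": date(2024, 2, 20), "sales": 11},
    {"day": date(2024, 3, 10), "sales": 13},
    {"day": date(2024, 2, 1), "sales": 9},
]
train, test = chronological_split(rows, "day", cutoff=date(2024, 3, 1))
print(len(train), "training rows,", len(test), "test rows")
```

A random shuffle of the same rows could place April data in training and February data in testing, which mixes past and future in exactly the way the slide warns against.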

Deployment
• Determine how the results need to be utilized
  • Who needs to use them?
  • How often do they need to be used?
• Deploy Data Mining results by:
  • Scoring a database
  • Utilizing results as business rules
  • Interactive scoring on-line
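
"Scoring a database" can be as simple as writing a model's output back into a table. The sketch below uses an in-memory SQLite database. The table layout and the `churn_score` function are hypothetical stand-ins for a real schema and a real trained model.

```python
import sqlite3

def churn_score(visits_last_month):
    """Hypothetical stand-in for a trained model: fewer visits means a higher score."""
    return round(1.0 / (1 + visits_last_month), 3)

# Set up a throwaway database with made-up customers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, visits INTEGER, score REAL)")
conn.executemany("INSERT INTO customers (id, visits) VALUES (?, ?)", [(1, 0), (2, 4), (3, 9)])

# Batch scoring: read the inputs, compute scores, write them back.
rows = conn.execute("SELECT id, visits FROM customers").fetchall()
updates = [(churn_score(visits), customer_id) for customer_id, visits in rows]
conn.executemany("UPDATE customers SET score = ? WHERE id = ?", updates)
conn.commit()

scores = dict(conn.execute("SELECT id, score FROM customers ORDER BY id"))
print(scores)
conn.close()
```

The same scoring function could instead sit behind an interactive service; which form fits depends on who needs the results and how often.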

Deployment
• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project: collect lessons learned!
DOs: read https://www.slideshare.net/lopusz/crispdm-agile-approach-to-data-mining-projects

Analytic technology can be effective
• Combining multiple models and link analysis can reduce false positives
• Today there are millions of false positives with manual analysis
• Data Mining is just one additional tool to help analysts
• Analytic Technology has the potential to reduce the current high rate of false positives

Data Mining with Privacy
• Data Mining looks for patterns, not people!
• Technical solutions can limit privacy invasion
  • Replacing sensitive personal data with anon. ID
  • Give randomized outputs
  • Multi-party computation: distributed data
  • …
• Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
Source: Gregory Piatetsky-Shapiro
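
Two of the technical solutions listed above can be illustrated briefly. This is a sketch under stated assumptions: the anonymous ID is produced with a keyed hash (HMAC-SHA256, truncated), and "randomized outputs" is interpreted as the classic randomized response scheme. Neither choice is taken from the slides or the cited paper, and the key and data are placeholders.

```python
import hashlib
import hmac
import random

def pseudonymize(value, secret_key):
    """Replace a sensitive identifier with a keyed hash; the same input gives the same ID."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def randomized_response(true_answer, rng, p_truth=0.75):
    """Report the true boolean with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return true_answer
    return rng.random() < 0.5

def estimate_true_rate(reports, p_truth=0.75):
    """Invert the noise: observed = p_truth * true_rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

key = b"demo-key-not-for-production"  # placeholder secret
alice_id = pseudonymize("alice@example.com", key)
print(alice_id)

rng = random.Random(0)  # fixed seed so the demo is repeatable
truths = [i < 3000 for i in range(10000)]  # made-up population, 30% "yes"
reports = [randomized_response(t, rng) for t in truths]
estimate = estimate_true_rate(reports)
print(round(estimate, 3))
```

Individual reports are deniable, yet the overall rate can still be estimated, which matches the idea of looking for patterns rather than people.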

The following table summarizes the main inputs to the deliverables. This does not mean that only the inputs listed should be considered; for example, the business objectives should be pervasive to all deliverables. However, the deliverables should address specific issues raised by their inputs.
https://medium.com/@apolotakeshi/understanding-crisp-dm-with-my-cats-weight-and-covid-19-s-prediction-e84c42309a04