Data Mining 101 with Scikit-Learn: An informal introduction

  • Slides: 23
Data Mining 101 with Scikit-Learn: An informal introduction to data mining
Shuhan Yuan
sy005@uark.edu

What is data mining?
• Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. (https://en.wikipedia.org/wiki/Data_mining)
• Data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories. (Data Mining: Concepts and Techniques)

What is data mining?
• A naïve view of data mining: Data → Data Mining → Knowledge (Models), i.e. knowledge discovery from data
http://hanj.cs.illinois.edu/bk1/

Six common classes of tasks
• Classification – the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
• Clustering – the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
• Regression – attempts to find a function which models the data with the least error, that is, for estimating the relationships among data or datasets.
• Summarization – providing a more compact representation of the data set, including visualization and report generation.
• Anomaly detection (outlier/change/deviation detection) – the identification of unusual data records that might be interesting, or data errors that require further investigation.
• Association rule learning (dependency modelling) – searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
https://en.wikipedia.org/wiki/Data_mining

Classification (Supervised Learning)
https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

Regression (Supervised Learning)
https://quantdare.com/machine-learning-a-brief-breakdown/
https://medium.com/simple-ai/linear-regression-intro-to-machine-learning-66e320dbdaf06
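
As a rough illustration of "finding a function which models the data with the least error", here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The data points and the helper name fit_line are made up for this example; they do not come from the slides.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line minimising the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, roughly following y = 2x + 1 with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6.0 + intercept  # estimate y for an unseen x = 6
print(f"y = {slope:.2f} * x + {intercept:.2f}; prediction at x=6: {predicted:.2f}")
```

The fitted line can then be used to estimate the target value for inputs that were not in the training data, which is the "prediction" step of a regression task.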

Clustering (Unsupervised Learning): Clustering Algorithms
https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
https://apandre.wordpress.com/visible-data/cluster-analysis/

Anomaly Detection
http://machine-learning-class-notes.readthedocs.io/en/latest/lecture16.html
http://amid.fish/anomaly-detection-with-k-means-clustering
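
One simple way to flag "unusual data records" is to mark values that lie far from the mean, measured in standard deviations. The sketch below assumes roughly bell-shaped data; the readings, the threshold of 2.0, and the helper name find_outliers are all illustrative choices, not something taken from the slides or the linked pages.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold in absolute value."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical sensor readings with one obviously odd value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7]
outliers = find_outliers(readings)
print(outliers)
```

Because a large outlier also inflates the mean and the standard deviation, this rule is only a starting point; whether a flagged record is "interesting" or a data error still needs further investigation.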

Association Rule: Market Basket Analysis
http://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html
https://blogs.adobe.com/digitalmarketing/analytics/shopping-for-kpis-market-basket-analysisfor-web-analytics-data/
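
To make "which products are frequently bought together" concrete, the sketch below counts item pairs and computes the two usual rule measures, support and confidence, in plain Python. The shopping baskets are invented for illustration, and this is not the full Apriori algorithm, only the counting idea behind it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (one set of items per transaction).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)


# How often each pair of items appears in the same basket.
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

print(pair_counts.most_common(3))
print("support(diapers, beer):", support({"diapers", "beer"}))
print("confidence(diapers -> beer):", confidence({"diapers"}, {"beer"}))
```

A rule such as "diapers -> beer" is usually kept only if both its support and its confidence exceed thresholds chosen by the analyst.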

Summarization
• Know your data
https://generalassemb.ly/blog/the-best-topical-data-visualizations-of-2015/

Pipeline for Data Mining
Data → Data Preprocessing → Feature Engineering → Model Training → Testing → Prediction
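
The stages of such a pipeline can be sketched with scikit-learn as follows. This is one possible arrangement under stated assumptions, not the presenter's own code: the Iris dataset, StandardScaler for preprocessing, SelectKBest for a simple form of feature engineering, and a decision tree as the model are all choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Data: a small built-in dataset (150 flowers, 4 features, 3 classes).
X, y = load_iris(return_X_y=True)

# Hold out part of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Preprocessing -> feature selection -> model, chained into one estimator.
model = make_pipeline(
    StandardScaler(),               # data preprocessing
    SelectKBest(f_classif, k=2),    # keep the 2 most informative features
    DecisionTreeClassifier(random_state=0),
)

model.fit(X_train, y_train)              # model training
accuracy = model.score(X_test, y_test)   # testing on held-out data
predictions = model.predict(X_test[:5])  # prediction for individual samples

print(f"test accuracy: {accuracy:.2f}")
print("first predictions:", predictions)
```

Wrapping the steps in a single pipeline object means the scaler and the feature selector are fitted on the training data only, and the same transformations are applied automatically at test and prediction time.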

Linus Torvalds: “Talk is cheap. Show me the code.”
http://www.skilledup.com/articles/become-software-engineer

Python Ecosystem

Jupyter Notebook
Contains both computer code (e.g. Python) and rich text elements (paragraphs, equations, figures, links, etc.).

Scikit-Learn
http://scikit-learn.org/stable/

http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html

Like this graph? More here: https://unsupervisedmethods.com/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6
http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html

Scikit-learn
• Simple and consistent API
• Instantiate the model: m = Model()
• Fit the model: m.fit(train_data, train_labels) (the labels are omitted for unsupervised models)
• Predict: predicted_y = m.predict(test_data)
• Evaluate: m.score(test_data, test_labels)
https://medium.com/towards-data-science/train-test-split-and-cross-validation-in-python-80b61beca4b6
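
The four steps can be sketched on a concrete model as follows. The choice of the Iris dataset, KNeighborsClassifier, and the 75/25 split are assumptions made for this example; any scikit-learn classifier should follow the same instantiate, fit, predict, score pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labelled dataset and split it into training and test parts.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

m = KNeighborsClassifier(n_neighbors=5)  # instantiate the model
m.fit(X_train, y_train)                  # fit on the training data and labels
predicted_y = m.predict(X_test)          # predict labels for unseen data
accuracy = m.score(X_test, y_test)       # evaluate: mean accuracy on the test set

print(f"accuracy: {accuracy:.2f}")
```

Note that score takes the test features and the true labels, predicts internally, and compares; it does not take the predicted labels as its first argument.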

Classification: k-nearest neighbors (k-NN)
http://bdewilde.github.io/blogger/2012/10/26/classification-of-hand-written-digits-3/
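
To show the idea behind k-NN without any library, here is a minimal plain-Python sketch: a new point receives the most common label among its k closest training points. The toy points, the labels "A" and "B", and the helper name knn_predict are invented for illustration; in practice scikit-learn's KNeighborsClassifier would normally be used instead.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data: one group near (1, 1), another near (5, 5).
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (5.0, 5.0), (5.2, 4.8), (4.7, 5.1)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.5, 1.5)))
print(knn_predict(train_points, train_labels, (4.5, 5.0)))
```

There is no real training step here: the "model" is simply the stored training data, and all the work happens at prediction time.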

Decision tree

Clustering: k-means
• Given a data set where each observed example has a set of features, but no labels
http://stanford.edu/~cpiech/cs221/handouts/kmeans.html
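
A minimal plain-Python sketch of the standard k-means loop (often called Lloyd's algorithm) is shown below: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points, and repeat. The toy points, the fixed starting centroids, and the fixed iteration count are simplifications chosen for this example; scikit-learn's KMeans handles initialisation and convergence checks itself.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of k-means iterations; return (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabelled 2-D data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
initial_centroids = [points[0], points[3]]  # simple deterministic start

centroids, clusters = kmeans(points, initial_centroids)
print(centroids)
print(clusters)
```

No labels are used anywhere: the grouping comes only from the distances between feature vectors, which is what makes this an unsupervised method.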