Machine Learning

Photo: CMU Machine Learning Department protests G20. Slides: Isabelle Guyon, Erik Sudderth, Mark Johnson, Derek Hoiem, Lana Lazebnik

Machine Learning is… • A branch of artificial intelligence concerned with the construction and study of systems that can learn from data. • The study of how to automatically learn to make accurate predictions based on past observations.

Machine Learning, a.k.a. • Data mining: machine learning applied to databases, i.e. collections of data • Inference and/or estimation in statistics • Pattern recognition in engineering • Signal processing in electrical engineering • Optimization

Supervised vs Unsupervised Learning • In supervised learning, the categories are known in advance. • In unsupervised learning they are not, and the learning process attempts to discover the appropriate categories.

Classification… • Spam Detection: Given email in an inbox, identify those email messages that are spam and those that are not. • Credit Card Fraud Detection: Given credit card transactions for a customer in a month, identify those transactions that were made by the customer and those that were not. • Evil/Good…

Classification framework • Apply a prediction function f to an input image to get a label: f(image) = “apple”, f(image) = “tomato”, f(image) = “cow” Slide credit: L. Lazebnik

Classification, cont. y = f(x), where y is the output, f is the prediction function, and x is the image feature. • Training: given a training set of labeled examples {(x_1, y_1), …, (x_N, y_N)}, estimate the prediction function f by minimizing the prediction error on the training set • Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x) Slide credit: L. Lazebnik
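The training and testing steps above can be sketched with a deliberately tiny model. This is a minimal sketch, not the method from the slides: it assumes one numeric feature per example, exactly two labels, and a hypothetical spam dataset in which the feature is a count of suspicious words. Training picks the threshold with the fewest errors on the training set; testing applies the learned f to an unseen value.

```python
def train_threshold(examples):
    """Estimate f by minimizing the number of errors on the training set.

    Assumes at least two examples and exactly two distinct labels.
    The learned rule is f(x) = high if x > t else low.
    """
    xs = sorted(x for x, _ in examples)
    # Candidate thresholds: midpoints between consecutive feature values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    labels = sorted({y for _, y in examples})
    best = None
    for t in candidates:
        # Try both ways of assigning the two labels to the two sides.
        for low, high in (labels, labels[::-1]):
            errors = sum(1 for x, y in examples
                         if (high if x > t else low) != y)
            if best is None or errors < best[0]:
                best = (errors, t, low, high)
    errors, t, low, high = best

    def f(x):
        return high if x > t else low

    return f, errors


# Hypothetical training set: (number of suspicious words, label).
training_set = [(0, "ham"), (1, "ham"), (4, "spam"), (6, "spam")]

# Training: estimate f from labeled examples.
f, training_errors = train_threshold(training_set)

# Testing: apply f to a never-before-seen example.
prediction = f(5)
```

Real classifiers use richer features and models, but the split into a training step that fits f and a testing step that only calls f stays the same.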

The process • Training: Training Images → Image Features; Image Features + Training Labels → Training → Learned model • Testing: Test Image → Image Features → Learned model → Prediction Slide credit: D. Hoiem and L. Lazebnik

Classifiers: Nearest neighbor [Figure: training examples from class 1, training examples from class 2, and a test example] f(x) = label of the training example nearest to x • All we need is a distance function for our inputs • No training required! Slide credit: L. Lazebnik
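The nearest neighbor rule can be sketched in a few lines. This is an illustrative sketch: the Euclidean distance and the 2-D points are assumptions, since the slide only requires that some distance function exists.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nearest_neighbor(train, x, distance=euclidean):
    """Return the label of the training example nearest to x."""
    _, label = min(train, key=lambda example: distance(example[0], x))
    return label


# Hypothetical training examples: (feature vector, label).
train = [
    ((1.0, 1.0), "class 1"),
    ((1.5, 2.0), "class 1"),
    ((5.0, 5.0), "class 2"),
    ((6.0, 4.5), "class 2"),
]

prediction = nearest_neighbor(train, (5.5, 5.0))
```

There is no training step: the "model" is the stored training set, and all the work happens at prediction time.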

Classifiers: Linear • Find a linear function to separate the classes: f(x) = sgn(w · x + b) Slide credit: L. Lazebnik
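A minimal sketch of the linear classifier f(x) = sgn(w · x + b). The slide does not say how w and b are found; the perceptron update used here is one simple choice, assumed for illustration, and it only settles on a separating line when the classes are linearly separable. Labels are assumed to be +1 and -1, and the data points are hypothetical.

```python
def sgn(z):
    """Sign function; zero is mapped to +1 by convention here."""
    return 1 if z >= 0 else -1


def predict(w, b, x):
    """Linear classifier f(x) = sgn(w . x + b)."""
    return sgn(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_perceptron(examples, epochs=100):
    """Perceptron rule: nudge w and b toward each misclassified example."""
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            if predict(w, b, x) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training example is on the correct side
            break
    return w, b


# Hypothetical, linearly separable data with labels +1 / -1.
examples = [((2, 3), 1), ((3, 3), 1), ((-1, -2), -1), ((-2, -1), -1)]
w, b = train_perceptron(examples)
```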

Regression • Data is labelled with a real value rather than a categorical label (numeric rather than factor). • Useful for predicting time series data, such as the price of a stock over time. • The decision being modelled is what value to predict for new, unseen data. • Learning a linear regression model means estimating the values of the coefficients used in the representation from the data that we have available.
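Estimating the coefficients of a linear regression model can be sketched for the one-variable case y = slope · x + intercept. This sketch uses ordinary least squares; the daily prices are made-up numbers chosen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimate of slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: day number and closing price.
days = [1, 2, 3, 4, 5]
prices = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(days, prices)

# Predict a real value for a new, unseen day.
predicted_price = slope * 6 + intercept
```

With noisy data the fitted line would no longer pass through every point; the coefficients are then the ones that minimize the sum of squared errors.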

Clustering (Data Mining) • Data is not labelled. • It can, however, be divided into groups based on similarity and other measures of natural structure in the data. • Market segmentation is one of the best-known examples of cluster analysis.
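Grouping unlabelled data by similarity can be sketched with k-means, one common clustering algorithm (the slide does not name a specific one, so this choice is an assumption). The customer points and the initial centroids below are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customers: (purchases per month, average basket size).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

# Start from two of the data points as initial centroids.
centroids, clusters = kmeans(points, [points[0], points[-1]])
```

No labels are used anywhere: the two customer segments emerge from the distances alone.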

Dimensionality Reduction • Most algorithms work on columns (as variables). • Datasets with thousands of variables make the algorithms run slower. • It is important to reduce the number of columns in the data set while losing as little information as possible. • Techniques include Missing Values Ratio, Low Variance Filter, High Correlation Filter, PCA, Random Forests / Ensemble Trees, etc.
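Two of the techniques listed above, the Low Variance Filter and the High Correlation Filter, can be sketched as follows. The column names, values, and thresholds are hypothetical, and the dataset is represented as a plain dictionary of columns to keep the sketch self-contained.

```python
import math
from statistics import pvariance


def low_variance_filter(columns, threshold):
    """Drop columns whose variance is not above the threshold."""
    return {name: values for name, values in columns.items()
            if pvariance(values) > threshold}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def high_correlation_filter(columns, threshold):
    """Keep a column only if it is not strongly correlated with any
    column that was already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold
               for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical dataset with a redundant and a constant column.
columns = {
    "age": [23, 35, 47, 59],
    "age_months": [276, 420, 564, 708],  # age * 12, carries no new information
    "constant_flag": [1, 1, 1, 1],       # zero variance
    "income": [40, 38, 52, 45],
}

reduced = high_correlation_filter(low_variance_filter(columns, 0.0), 0.95)
```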

Generalization • Training set: labels known. • Test set: labels unknown. • How well does a learned model generalize from the data it was trained on to a new test set? Slide credit: L. Lazebnik
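One common way to estimate generalization in practice is to hold out part of the labeled data and pretend its labels are unknown until evaluation time. The sketch below assumes a hypothetical 1-D dataset and uses a one-nearest-neighbor model only as a stand-in; the split fraction and seed are arbitrary choices.

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def make_nearest_neighbor(train):
    """Return f(x) = label of the training example nearest to x."""
    def f(x):
        _, label = min(train, key=lambda example: abs(example[0] - x))
        return label
    return f


def accuracy(f, examples):
    """Fraction of examples whose predicted label matches the true one."""
    return sum(1 for x, y in examples if f(x) == y) / len(examples)


# Hypothetical labeled data: one numeric feature and a label.
data = [(x, "low" if x < 10 else "high") for x in range(20)]

train, test = train_test_split(data)
f = make_nearest_neighbor(train)

train_accuracy = accuracy(f, train)  # measured on data the model has seen
test_accuracy = accuracy(f, test)    # estimate of generalization
```

The gap between the two numbers is what the question on the slide is asking about: accuracy on the training set says little by itself about accuracy on new data.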

Generalization • Components of generalization error – Bias: how much does the average model over all training sets differ from the true model? Error due to inaccurate assumptions/simplifications made by the model – Variance: how much do models estimated from different training sets differ from each other? • Underfitting: model is too “simple” to represent all the relevant class characteristics – High bias and low variance – High training error and high test error • Overfitting: model is too “complex” and fits irrelevant characteristics (noise) in the data – Low bias and high variance – Low training error and high test error Slide credit: L. Lazebnik

Bias-Variance Trade-off • Models with too few parameters are inaccurate because of a large bias (not enough flexibility). • Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample). Slide credit: D. Hoiem

Bias-Variance Trade-off E(MSE) = noise² + bias² + variance • noise²: unavoidable error • bias²: error due to incorrect assumptions • variance: error due to variance of training samples See the following for explanations of bias-variance (also Bishop’s “Neural Networks” book): • http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf Slide credit: D. Hoiem
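The decomposition above can be written out in symbols. The notation here is an assumption, not taken from the slides: the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and \hat{f}_D is the model estimated from a training set D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\sigma^2}_{\text{noise}^2}
  + \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
```

The first term does not depend on the model, which is why it is the unavoidable part of the error; the other two depend on how the model class and the training sample interact.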

Bias-variance tradeoff [Figure: training error and test error plotted against model complexity. Low complexity: underfitting, high bias, low variance. High complexity: overfitting, low bias, high variance.] Slide credit: D. Hoiem
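The training and test error curves can be explored with a small experiment. This is a sketch under stated assumptions: the data are synthetic (a sine curve plus Gaussian noise), and the model is k-nearest-neighbor regression, which the slides do not use. Here small k means a complex model (k = 1 memorizes the training set) and large k means a simple one (k = n predicts the overall mean). The exact numbers depend on the random sample.

```python
import math
import random


def make_data(n, rng, offset=0.0):
    """Synthetic data: y = sin(x) plus Gaussian noise."""
    data = []
    for i in range(n):
        x = (i + offset) / n * 2 * math.pi
        y = math.sin(x) + rng.gauss(0, 0.3)
        data.append((x, y))
    return data


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda example: abs(example[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN model on the given data."""
    return sum((knn_predict(train, x, k) - y) ** 2
               for x, y in data) / len(data)


rng = random.Random(0)
train = make_data(40, rng)
test = make_data(40, rng, offset=0.5)  # inputs between the training inputs

# From most complex (k=1) to simplest (k=40).
for k in (1, 5, 40):
    print(k, round(mse(train, train, k), 3), round(mse(train, test, k), 3))
```

With k = 1 the training error is exactly zero while the test error is not, which is the overfitting end of the figure; with k = 40 both errors are large, which is the underfitting end.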

Toolkit • R • Python • Stata • VBA and SQL • Git and GitHub

R • Advantages – Fast and free. – State of the art: statistical researchers provide their methods as R packages. SPSS and SAS are years behind R! – Highly customizable. – Active user community. – Excellent for simulation, programming, computer-intensive analyses, etc. • Disadvantages – Not user friendly at the start. – Steep learning curve, minimal GUI. – Easy to make mistakes and not know it. – Working with large datasets is limited by RAM. Source: A very brief introduction to R, M. Keller

Python • A language with strong similarities to Perl and C, but with powerful typing and object-oriented features. • Commonly used for producing HTML content on websites. Great for text files. • Useful built-in types (lists, dictionaries). • Clean syntax, powerful extensions • Ease of use; interpreter • AI Processing: Statistical – Python has strong numeric processing capabilities: matrix operations, etc. – Suitable for probability and machine learning code. Based on presentation from www.cis.upenn.edu/~cse391/cse391_2004/PythonIntro1.ppt

Stata • Typically used in the areas of economics and politics. • User-friendly environment. • Fairly easy to learn. • Ado files are available for extensions. • Reference: Impact Evaluation in Practice

VBA and SQL • Visual Basic for Applications (VBA) and Structured Query Language (SQL) are both used extensively in BA. • SQL gives the ability to retrieve data stored in SQL databases. • Connections with R and Python are possible and available. • Working with R and Python is easier once SQL is mastered.
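As one illustration of connecting Python to SQL, the standard library's sqlite3 module can run queries against a database. This is only a sketch: the in-memory database, the table name, and the rows are hypothetical, and a production system would more likely connect to an existing database server.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5)],
)

# Retrieve aggregated data with SQL, then continue working in Python.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM transactions "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()
```

The aggregation is done by the database, and Python only receives the summarized rows, which is the usual reason for mastering SQL before moving the data into R or Python.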

Git and GitHub • Git: version control system • GitHub: repository hosting site

Source: Git-GitHub, Safa

References
• http://cs.stackexchange.com/questions/2907/what-exactly-is-the-difference-between-supervised-and-unsupervised-learning
• http://www.kdnuggets.com/2015/05/7-methods-data-dimensionality-reduction.html
• http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
• https://discuss.analyticsvidhya.com/t/difference-between-supervised-and-unsupervised-learning/1196
• https://www.quora.com/Which-is-better-for-data-analysis-R-or-Python
• Git-GitHub, Safa
• A very brief introduction to R, Matthew Keller & Steven Boker
Slide credit: D. Hoiem