Machine Learning Getting Started Andrew Loree

Got a question?
Andrew Loree
www.andyloree.com
andy@andyloree.com
@LowOnDiskSpace

Goals
Outcome:
• What is Machine Learning (ML)?
• Understand the ML process
• Base knowledge of types of ML & algorithms
• Learning path for starting to use ML

What is Machine Learning?
• Using data to find patterns and, based upon those patterns, predict the future
• When is a prediction a guess? When it is not based upon “sufficient” observation, experience or scientific reasoning
• Example questions:
  • How long until a production server is out of disk space?
  • Is this email spam?
  • Customer retention, product recommendations, marketing campaigns, fraud detection, credit worthiness, …

Is Machine Learning…
• …just Statistics?
• …just Calculus/Matrix Algebra/Optimization Mathematics?
• …just Computer Science/Engineering?
• …just applied “domain knowledge”?
• Answer is all of the above and somewhere in between
• Philosophical questions are best left to the philosophers

Is Machine Learning…
• …just Artificial Intelligence?
• …just Deep Learning?
Artificial Intelligence ~1950’s
↓
Machine Learning (boom) ~1980’s
↓
Deep Learning (boom) ~2010

Does the Machine really Learn?
• Pattern recognition comes from learning and past experiences, and we use it every day
• Which of these charges were fraudulent for my credit card?

Ship to Address   Type                      Merchant   Date    Amount
Mine              Computers & Accessories   AMAZON     04-09   $734.95
Mine              Computers & Accessories   AMAZON     04-09   $12.98
Not Mine          Books                     AMAZON     04-10   $468.03
Mine              Electronics               NEWEGG     04-12   $198.32

When is there enough data and when do you have too much? Enter ML

Types of Questions (about Data)
• Descriptive - how many of X did I sell?
• Associative - is there an association between temperature and sales? (hypothesis)
• Comparative - how many X sell versus Y?
• Predictive - using associations and comparatives to predict sales of X
Machine Learning can answer predictive questions

Framing your questions the ML way
In order of importance:
1. Are you asking the right question?
   - ML is not magic, desired outcomes must be definable
   - Days until full? Will customer leave? Fraudulent charge?
2. Do you think you have the right data?
   - Prediction cannot overcome lack of data
   - Data insight (domain knowledge) is critical to success
3. What result is good enough?
   - 50% accuracy, 70%, 99%? No false positives allowed?
   - Wait, what is accuracy?

Machine Learning Ethics
• Bias: Confirmation, …
• Perspective: Recommendations lead to more purchases (seller) vs. lead to higher debt (buyer)
• Moral dilemmas

The Machine (Supervised) Learning Process
• Training Data (contains patterns)
• One (or more) ML algorithms learn the patterns
• A model is generated, used to predict against new data

The Machine Learning Process: Data
• Can be multiple sources: Big Data stores, flat files, DBMS, …
• Rarely in the right format
• Do you have the right “features”? *
• Preprocessing almost always required – usually the hardest part

The Machine Learning Process: Algorithm
• Which algorithm is the “right” one? *
• How do you compare one algorithm's results to another's?

The Machine Learning Process: Model
• In most cases, the entire process is repeated many times
• How stable are our results?
• Rinse and repeat the process
• Model Management – consuming and operationalizing models is a separate, but very critical topic

Machine Learning: Terminology
Training Data/Set – Prepared (training) data ready to use to create a model
Three main ML categories:
• Supervised learning - the outcome (category or value of interest) is present in the training data
• Unsupervised learning - organizes data in a way that describes its structure (clustering)
• Reinforcement learning - makes a choice, measures how “good” that was, modifies the strategy going forward

Machine Learning Model Types
Regression – supervised learning problems, fitting data to a line or curve
How long until I run out of disk space?
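The disk space question can be sketched as a straight-line fit. This is a minimal illustration, not the presenter's demo: the daily usage readings, the 100 GB capacity and the function names `fit_line` and `days_until_full` are all hypothetical, and real disk growth is rarely this linear.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(days, used_gb, capacity_gb):
    """Extrapolate the fitted line to the day usage reaches capacity."""
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None  # usage is flat or shrinking; the line never reaches capacity
    full_day = (capacity_gb - intercept) / slope
    return full_day - days[-1]


# Hypothetical training data: day number and GB used on that day.
days = [0, 1, 2, 3, 4]
used_gb = [50.0, 52.0, 54.0, 56.0, 58.0]

remaining = days_until_full(days, used_gb, capacity_gb=100.0)
print(f"Estimated days until full: {remaining:.1f}")
```

The training data here is the history of readings, the target value is GB used, and the "model" is just the slope and intercept.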

Machine Learning Model Types
Classification – supervised learning problems, capturing data in two (or more) classes
Is it spam or ham?

Machine Learning Model Types
Clustering – unsupervised learning problems, when we don’t know the defined classes
Market research from surveys to generate market segments
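One common clustering technique is k-means, which the slides do not name; the sketch below uses it only as an example of grouping data with no class labels. The survey points (age, monthly spend), the choice of two segments and the starting centroids are all assumptions made for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        centroids = new_centroids
    return centroids, clusters


# Hypothetical survey responses: (age, monthly spend).
points = [(22, 30), (25, 35), (24, 28), (58, 120), (61, 110), (60, 130)]

# Starting centroids are picked by hand here; real tools usually pick them randomly.
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No target value is supplied anywhere; the two "market segments" emerge from the distances between observations.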

Machine Learning: Terminology
Feature - individual measurable property – prepared data. A combination of features for an observation is commonly called a “feature vector”
Target Value (or Class) – our desired outcome of prediction; with supervised learning, the value is in the training data

Text (SMS) Spam
Which of these five messages are spam?
• SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6 days, 16+ TsandCs apply Reply HL 4 info
• This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
• I HAVE A DATE ON SUNDAY WITH WILL!!
• Fine if that's the way u feel. That's the way its gota b
• U GOIN OUT 2 NITE?

Text Spam: Features
What makes these two messages spam?
• SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6 days, 16+ TsandCs apply Reply HL 4 info
• This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
What if you cannot use the message text itself? What are the “features” that are common to spam messages?
• Length of the message?
• Number of numeric strings?
• Number of web links?
• Number of currency symbols?
• Number of punctuation marks?
• Others?
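The candidate features listed above can be computed with a few lines of Python. This is a rough sketch: the function name `extract_features`, the set of currency symbols and the regular expressions are assumptions, and the demo's actual feature definitions may differ.

```python
import re
import string

# Assumed set of currency symbols; extend for other markets.
CURRENCY_SYMBOLS = "£$€"


def extract_features(message):
    """Turn one SMS message into a feature vector (here, a dict)."""
    return {
        "length": len(message),
        "numeric_strings": len(re.findall(r"\d+", message)),
        "web_links": len(re.findall(r"(?:https?://|www\.)\S+", message, flags=re.IGNORECASE)),
        "currency_symbols": sum(message.count(symbol) for symbol in CURRENCY_SYMBOLS),
        # string.punctuation covers ASCII punctuation only.
        "punctuation": sum(1 for ch in message if ch in string.punctuation),
    }


spam = ("U have won the £750 Pound prize. 2 claim is easy, "
        "call 087187272008 NOW1! Only 10p per minute.")
ham = "U GOIN OUT 2 NITE?"

spam_features = extract_features(spam)
ham_features = extract_features(ham)
print(spam_features)
print(ham_features)
```

Each message becomes a row of numbers, so an algorithm can be trained without ever seeing the message text itself.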

Supervised Learning Example: Text Spam
• Collection of SMS messages for mobile phone spam research
• Contains a “training set” of 5,574 messages, marked either SPAM or HAM
• Given just a message, how can we determine if the message is spam or ham?
• Who doesn’t have domain knowledge of “spam” and texting?
References:
• UC Irvine ML Repository: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
• Contributions to the Study of SMS Spam Filtering: New Collection and Results: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

Text Spam: Demo
• Weka Explorer
  • Load training data set
  • Try a couple of algorithms using our features with cross-validation
  • Compare results
• Azure ML Studio
  • Show same solution

ML Algorithms
• Way too many to list
• Commonly used:
  • Decision Trees
  • Random Forest
  • Support Vector Machines (SVM)
  • k-Nearest Neighbors (KNN)
  • Linear Regression
  • Logistic Regression

ML Algorithms: Decision Trees
• Supervised learning, classification
• Weka implements a particular algorithm named C4.5 (called J48)

ML Algorithms: Random Forest
• Supervised learning, classification
• Multiple decision trees

ML Algorithms: Support Vector Machines
• Supervised learning, classification
• Separation by “hyperplane”, Weka version named SGD

ML Algorithms: k-nearest Neighbors
• Supervised learning, classification or regression
• k is the number of nearest neighbors (by distance) used to make the prediction
• Choose an odd number to avoid ties
• Called IBk in Weka
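A k-nearest-neighbors classifier is short enough to sketch in full. The training points below are hypothetical (message length, count of numeric strings) pairs, and `knn_predict` is an illustrative name, not Weka's IBk implementation. Plain Euclidean distance is assumed, and `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k closest training points.

    `training` is a list of (feature_vector, label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors: (message length, number of numeric strings).
training = [
    ((150, 5), "spam"),
    ((140, 4), "spam"),
    ((160, 6), "spam"),
    ((20, 0), "ham"),
    ((35, 1), "ham"),
    ((18, 1), "ham"),
]

print(knn_predict(training, (145, 3)))  # long message with several numbers
print(knn_predict(training, (25, 0)))   # short message with no numbers
```

With two classes and k=3 the vote cannot tie, which is the reason for choosing an odd k. In practice the features would usually be scaled first so that message length does not dominate the distance.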

ML Algorithms: Linear Regression
• Supervised learning, regression
• Continuous values

ML Algorithms: Logistic Regression
• Supervised learning, discrete (binary) values – (yes or no, A or B)
• S-curve to fit against data
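The "S-curve" is the logistic (sigmoid) function. The sketch below fits it to a single hypothetical feature (count of numeric strings in a message) with plain gradient descent; the data, learning rate and epoch count are assumptions chosen for illustration, and real toolkits use more robust solvers.

```python
import math


def sigmoid(z):
    """The S-curve: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data: numeric strings per message, 1 = spam, 0 = ham.
xs = [0, 1, 1, 2, 4, 5, 5, 6]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
for x in (0, 3, 6):
    print(x, round(sigmoid(w * x + b), 3))
```

The model outputs a probability; turning it into yes or no means picking a threshold, commonly 0.5.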

ML Algorithms: Cheat Sheet
https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-cheat-sheet

ML Algorithms: Considerations
• Not all algorithms are the same
• Accuracy
• Other practical measures:
  • Training time
  • Memory requirements
  • Scalability
https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice

ML Algorithms: Testing
• Different ways to “slice and dice” your training set data
  • Entire set
  • Percentage of set
  • Cross-validation - divide the set into subsets – generally best option
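Cross-validation can be sketched as index bookkeeping: split the set into k folds, and let each fold take one turn as the test set while the rest is used for training. The helper name `k_fold_splits` is hypothetical, and shuffling is left out to keep the output predictable; real tools normally shuffle (and often stratify) first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


splits = list(k_fold_splits(10, 3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every observation is used for testing exactly once, and the k scores are typically averaged into a single estimate.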

ML Algorithms: Evaluating Results
• Confusion Matrix
• Accuracy – closeness to the true value (% of all predictions that are correct)
• Precision – more important for non-binary classifications
• Lots of others, some specific to problem type (recall, F-measure, …)
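For a two-class problem, these measures all come from the four cells of the confusion matrix. The labels and predictions below are made up for illustration, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted spam, really spam
        elif p == positive:
            fp += 1  # predicted spam, really ham
        elif a == positive:
            fn += 1  # predicted ham, really spam
        else:
            tn += 1  # predicted ham, really ham
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall and F-measure from the counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical results for eight test messages.
actual = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
print(counts)
print(scores)
```

Accuracy alone can mislead when one class is rare: a classifier that always answers "ham" scores well on a mostly-ham set while catching no spam, which is why precision and recall are reported alongside it.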

ML Algorithms: Pitfalls
• Underfitting – when close enough isn’t close enough

ML Algorithms: Pitfalls
• Overfitting – memorization

ML Algorithms: Pitfalls
• Data Leakage – do NOT use your prediction value as input to the model
• Sampling Bias – poor choices for training set data, e.g. predicting item sales for an entire store chain from a single store’s data
• Predicting Random Outcomes – fair and unfair coin flips, dependent outcomes

ML Algorithms: Text Processing
• Think of “search” on top of machine learning
• The common problems of classic linguistics all challenge machine learning to some extent:
  • Tokenization – word breaking
  • Stemming (and lemmatization) – walk, walking, walked, walks → walk
  • Domain specific dictionaries – company jargon, acronyms, emojis, …
  • Language used – not everyone writes the Queen’s English
• Semantic search – understand “meaning” – may be a better option to generate processing features
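Tokenization and stemming can be illustrated with a deliberately naive sketch. The suffix list below is an assumption that handles the slide's "walk" example and little else; it is not the Porter stemmer or any library's algorithm, and real text needs a proper stemmer or lemmatizer.

```python
import re


def tokenize(text):
    """Word breaking: lowercase the text and pull out runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = tokenize("Walk, walking, walked, walks!")
stems = [naive_stem(token) for token in tokens]
print(tokens)
print(stems)
```

Even this toy version shows why the step matters: four surface forms collapse into one feature, so the algorithm sees "walk" four times instead of four unrelated words.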

ML Toolkits, Platforms & Libraries
• Toolkit/Platforms
  • WEKA
  • R
  • Parts of Python SciPy
  • Microsoft Cognitive Toolkit (CNTK)
• Libraries
  • Scikit-learn (python)
  • JSAT
  • Accord.NET Framework
• APIs
  • Azure ML
  • MLlib
  • PredictionIO
• Operationalize
  • SQL Server Machine Learning Services/Machine Learning Server
