Machine Learning Getting Started
Andrew Loree
- Slides: 40
Got a question?
Andrew Loree
www.andyloree.com
andy@andyloree.com
@LowOnDiskSpace
Goals
Outcome:
• What is Machine Learning (ML)?
• Understand the ML process
• Base knowledge of types of ML & algorithms
• Learning path for starting to use ML
What is Machine Learning?
• Using data to find patterns and, based upon those patterns, predict the future
• When is a prediction a guess? When it is not based upon “sufficient” observation, experience or scientific reasoning
• Example questions:
  • How long until a production server is out of disk space?
  • Is this email spam?
  • Customer retention, product recommendations, marketing campaigns, fraud detection, credit worthiness, …
Is Machine Learning…
• …just Statistics?
• …just Calculus/Matrix Algebra/Optimization Mathematics?
• …just Computer Science/Engineering?
• …just applied “domain knowledge”?
• Answer is all of the above and somewhere in between
• Philosophical questions are best left to the philosophers
Is Machine Learning…
• …just Artificial Intelligence?
• …just Deep Learning?
Artificial Intelligence ~1950’s
↓
Machine Learning (boom) ~1980’s
↓
Deep Learning (boom) ~2010
Does the Machine really Learn?
• Pattern recognition comes from learning and past experiences, and we use it every day
• Which of these charges were fraudulent for my credit card?
Ship to Address  Type                     Merchant  Date   Amount
Mine             Computers & Accessories  AMAZON    04-09  $734.95
Mine             Computers & Accessories  AMAZON    04-09  $12.98
Not Mine         Books                    AMAZON    04-10  $468.03
Mine             Electronics              NEWEGG    04-12  $198.32
When is there enough data and when do you have too much? Enter ML
Types of Questions (about Data)
• Descriptive - how many of X did I sell?
• Associative - is there an association between temperature and sales? (hypothesis)
• Comparative - how many X sell versus Y?
• Predictive - using associations and comparatives to predict sales of X?
Machine Learning can answer predictive questions
Framing your questions the ML way
In order of importance:
1. Are you asking the right question?
   - ML is not magic, desired outcomes must be definable
   - Days until full? Will customer leave? Fraudulent charge?
2. Do you think you have the right data?
   - Prediction cannot overcome lack of data
   - Data insight (domain knowledge) is critical to success
3. What result is good enough?
   - 50% accuracy, 70%, 99%? No false positives allowed?
   - Wait, what is accuracy?
Machine Learning Ethics
• Bias: Confirmation, …
• Perspective: Recommendations lead to more purchases (seller) vs. lead to higher debt (buyer)
• Moral dilemmas
The Machine (Supervised) Learning Process
• Training Data (contains patterns)
• One (or more) ML algorithms learn the patterns
• A model is generated, used to predict against new data
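The three steps above can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk's demo: the training values are invented, and `train_stump` and `predict` are hypothetical helper names for a one-feature threshold learner (a one-level decision tree).

```python
# Hypothetical training data: (number of currency symbols, label).
# The values are invented for illustration only.
training_data = [
    (0, "ham"), (0, "ham"), (1, "ham"),
    (2, "spam"), (3, "spam"), (4, "spam"),
]


def train_stump(rows):
    """Learn the threshold on one feature that misclassifies the fewest rows.

    The returned "model" is just that threshold: values above it are
    predicted as spam, values at or below it as ham.
    """
    candidates = sorted({value for value, _ in rows})
    best_threshold, best_errors = None, len(rows) + 1
    for threshold in candidates:
        errors = sum(
            1
            for value, label in rows
            if ("spam" if value > threshold else "ham") != label
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(model, value):
    """Apply the learned threshold to a new, unseen observation."""
    return "spam" if value > model else "ham"


model = train_stump(training_data)  # the algorithm learns the pattern
print(model)                        # the generated model (a threshold)
print(predict(model, 5), predict(model, 0))
```

Real algorithms learn from many features at once, but the shape is the same: training data in, model out, model applied to new data.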
The Machine Learning Process: Data
• Can be multiple sources: big data stores, flat files, DBMS, …
• Rarely in the right format
• Do you have the right “features”?*
• Preprocessing almost always required – usually the hardest part
The Machine Learning Process: Algorithm
• Which algorithm is the “right” one?*
• How do you compare one algorithm’s results to another’s?
The Machine Learning Process: Model
• In most cases, the entire process is repeated many times
• How stable are our results?
• Rinse and repeat the process
• Model Management – consuming and operationalizing models is a separate, but very critical topic
Machine Learning: Terminology
Training Data/Set – Prepared data ready to be used to create a model
Three main ML categories:
• Supervised learning - The outcome (category or value of interest) is included in the training data
• Unsupervised learning - Organizes data in a way that describes its structure (clustering)
• Reinforcement learning - Makes a choice, measures how “good” that was, and modifies the strategy going forward
Machine Learning Model Types
Regression – supervised learning problems, fitting data to a line or curve
How long until I run out of disk space?
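A minimal sketch of the disk space question as a regression problem, assuming usage grows roughly linearly. The daily measurements and the 500 GB capacity are invented, and `fit_line` is a hypothetical helper that does an ordinary least squares fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: day number and GB used on a 500 GB volume.
days = [0, 1, 2, 3, 4]
used_gb = [400.0, 410.0, 420.0, 430.0, 440.0]
capacity_gb = 500.0

slope, intercept = fit_line(days, used_gb)

# Day on which the fitted line reaches capacity, minus the last observed day.
day_full = (capacity_gb - intercept) / slope
days_remaining = day_full - days[-1]
print(round(days_remaining, 1))
```

With real servers the growth is rarely this clean, which is where the choice of features and algorithm starts to matter.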
Machine Learning Model Types
Classification – supervised learning problems, assigning data to one of two (or more) classes
Is it spam or ham?
Machine Learning Model Types
Clustering – unsupervised learning problems, when the classes are not known in advance
Market research from surveys to generate market segments
Machine Learning: Terminology
Feature – An individual measurable property of the prepared data. The combination of features for an observation is commonly called a “feature vector”
Target Value (or Class) – Our desired outcome of prediction. With supervised learning, the value is in the training data
Text (SMS) Spam
Which of these five messages are spam?
SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6 days, 16+ TsandCs apply Reply HL 4 info
This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
I HAVE A DATE ON SUNDAY WITH WILL!!
Fine if that's the way u feel. That's the way its gota b
U GOIN OUT 2 NITE?
Text Spam: Features
What makes these two messages spam?
SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6 days, 16+ TsandCs apply Reply HL 4 info
This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
What if you cannot use the message text itself? What are the “features” that are common to spam messages?
• Length of the message?
• Number of numeric strings?
• Number of web links?
• Number of currency symbols?
• Number of punctuation marks?
• Others?
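The candidate features listed above can be computed with a few lines of standard-library Python. This is a sketch under assumptions: `extract_features` is a hypothetical helper, the regular expressions are simplistic, the set of currency symbols is an arbitrary choice, and the first sample message is invented (the second is one of the ham messages from the earlier slide).

```python
import re
import string


def extract_features(message):
    """Turn one SMS message into a small dictionary of numeric features."""
    return {
        "length": len(message),
        # Runs of digits such as "500" or "87575".
        "numeric_strings": len(re.findall(r"\d+", message)),
        # Very rough link detection: anything starting with http(s):// or www.
        "web_links": len(
            re.findall(r"(?:https?://|www\.)\S+", message, flags=re.IGNORECASE)
        ),
        # Assumed set of currency symbols; extend as needed.
        "currency_symbols": sum(message.count(symbol) for symbol in "£$€"),
        # ASCII punctuation only (note: "£" and "€" are not in string.punctuation).
        "punctuation": sum(1 for ch in message if ch in string.punctuation),
    }


# Invented spam-like message, for illustration only.
spam_features = extract_features("WIN £500 now! Visit www.example.com or txt 87575")
# Ham message taken from the slide above.
ham_features = extract_features("U GOIN OUT 2 NITE?")

print(spam_features)
print(ham_features)
```

Each dictionary is one feature vector: the model never sees the message text, only these numbers.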
Supervised Learning Example: Text Spam
• Collection of SMS messages for mobile phone spam research
• Contains a “training set” of 5,574 messages, marked either SPAM or HAM
• Given just a message, how can we determine if the message is spam or ham?
• Who doesn’t have domain knowledge of “spam” and texting?
References:
UC Irvine ML Repository: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
Contributions to the Study of SMS Spam Filtering: New Collection and Results: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
Text Spam: Demo
Weka Explorer
• Load training data set
• Try a couple of algorithms using our features with cross-validation
• Compare results
Azure ML Studio
• Show same solution
ML Algorithms
• Way too many to list
• Commonly used:
  • Decision Trees
  • Random Forest
  • Support Vector Machines (SVM)
  • k-Nearest Neighbor (KNN)
  • Linear Regression
  • Logistic Regression
ML Algorithms: Decision Trees
• Supervised learning, classification
• Weka implements a particular algorithm named C4.5 (called J48)
ML Algorithms: Random Forest
• Supervised learning, classification
• Multiple decision trees
ML Algorithms: Support Vector Machines
• Supervised learning, classification
• Separation by “hyperplane”, Weka version named SGD
ML Algorithms: k-Nearest Neighbors
• Supervised learning, classification or regression
• k is the number of nearest neighbors (by distance) that vote on the prediction
• Choose an odd number to avoid ties
• Called IBk in Weka
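A minimal k-nearest neighbors classifier, to show how little "training" is involved. This is a sketch, not Weka's IBk: `knn_predict` is a hypothetical helper, the distance is plain Euclidean, and the feature vectors (number of numeric strings, number of currency symbols) are invented.

```python
import math
from collections import Counter


def knn_predict(training_rows, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows.

    training_rows is a list of (feature_vector, label) pairs.
    """
    by_distance = sorted(training_rows, key=lambda row: math.dist(row[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented feature vectors: (numeric strings, currency symbols).
rows = [
    ((0, 0), "ham"), ((1, 0), "ham"), ((0, 1), "ham"),
    ((4, 1), "spam"), ((5, 2), "spam"), ((6, 1), "spam"),
]

print(knn_predict(rows, (5, 1), k=3))
print(knn_predict(rows, (1, 1), k=3))
```

With two classes, an odd k guarantees the vote cannot end in a tie.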
ML Algorithms: Linear Regression
• Supervised learning, regression
• Continuous values
ML Algorithms: Logistic Regression
• Supervised learning, discrete (binary) values (yes or no, A or B)
• S-curve to fit against data
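The S-curve is the logistic (sigmoid) function, which squeezes any number into the range 0 to 1 so it can be read as a probability. The sketch below only shows prediction, not training; the weights and bias are invented rather than learned, and `predict_probability` is a hypothetical helper.

```python
import math


def sigmoid(z):
    """The logistic S-curve: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Invented (not learned) parameters for two features:
# number of numeric strings and number of currency symbols.
weights = [2.0, 1.0]
bias = -4.0

p_spam_like = predict_probability([3, 1], weights, bias)
p_ham_like = predict_probability([0, 0], weights, bias)

# A common convention is to answer "yes" when the probability exceeds 0.5.
print(round(p_spam_like, 3), p_spam_like > 0.5)
print(round(p_ham_like, 3), p_ham_like > 0.5)
```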
ML Algorithms: Cheat Sheet
https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-cheat-sheet
ML Algorithms: Considerations
• Not all algorithms are the same
• Accuracy
• Other practical measures:
  • Training time
  • Memory requirements
  • Scalability
https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice
ML Algorithms: Testing
• Different ways to “slice and dice” your training set data:
  • Entire set
  • Percentage of set
  • Cross-validation - divide the set into subsets – generally best option
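Cross-validation can be sketched as follows: split the rows into k subsets (folds), and let each fold take one turn as the test set while the rest is used for training. `k_fold_indices` is a hypothetical helper; unlike most real tools it does not shuffle or stratify the rows.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) once for each of k folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every row is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split.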
ML Algorithms: Evaluating Results
• Confusion Matrix
• Accuracy – share of all predictions that are correct (% of overall)
• Precision – share of predicted positives that are truly positive; reported per class for non-binary classifications
• Lots of others, some specific to problem type (recall, F-measure, …)
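A small worked example of these measures for a binary spam/ham problem. The actual and predicted labels are invented, and `confusion_counts` and `metrics` are hypothetical helper names.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Return (tp, fp, tn, fn), the four cells of a binary confusion matrix."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted spam, really spam
        elif p == positive:
            fp += 1  # predicted spam, really ham
        elif a == positive:
            fn += 1  # predicted ham, really spam
        else:
            tn += 1  # predicted ham, really ham
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-measure (F1) from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return accuracy, precision, recall, f1


# Invented labels for eight messages.
actual = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

counts = confusion_counts(actual, predicted)
accuracy, precision, recall, f1 = metrics(*counts)
print(counts)
print(accuracy, precision, recall, f1)
```

Here six of eight predictions are correct, but only two of the three messages flagged as spam really are spam, which is why accuracy alone can mislead.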
ML Algorithms: Pitfalls
• Underfitting – when close enough isn’t close enough
ML Algorithms: Pitfalls
• Overfitting – memorization
ML Algorithms: Pitfalls
• Data Leakage – do NOT use the value you are predicting as an input to the model
• Sampling Bias – poor choices for training set data, e.g. predicting item sales for an entire store chain from a single store’s data
• Predicting Random Outcomes – fair and unfair coin flips, dependent outcomes
ML Algorithms: Text Processing
• Think of “search” on top of machine learning
• All of the common problems of classic linguistics challenge machine learning to an extent:
  • Tokenization – word breaking
  • Stemming (and lemmatization) – walk, walking, walked, walks → walk
  • Domain specific dictionaries – company jargon, acronyms, emojis, …
  • Language used – not everyone writes the Queen’s English
• Semantic search – understand “meaning” – may be a better option to generate processing features
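Tokenization and stemming can be illustrated with a deliberately naive sketch. `tokenize` and `naive_stem` are hypothetical helpers, and the three-suffix rule is an assumption made for this example; real stemmers such as the Porter stemmer use far more rules and still make mistakes.

```python
import re

# Assumed, deliberately tiny suffix list; checked in this order.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Word breaking: lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def naive_stem(token):
    """Strip one known suffix, as long as at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = tokenize("Walk, walking, walked, walks")
stems = [naive_stem(token) for token in tokens]
print(tokens)
print(stems)
```

A rule this crude breaks quickly on words like "running" or "bus", which is exactly the kind of linguistic challenge the slide refers to.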
ML Toolkits, Platforms & Libraries
• Toolkits/Platforms
  • WEKA
  • R
  • Parts of Python SciPy
  • Microsoft Cognitive Toolkit (CNTK)
• Libraries
  • Scikit-learn (Python)
  • JSAT
  • Accord.NET Framework
• APIs
  • Azure ML
  • MLlib
  • PredictionIO
• Operationalize
  • SQL Server Machine Learning Services/Machine Learning Server