Introducing Apache Mahout: Scalable Machine Learning for All
Grant Ingersoll, Lucid Imagination
Overview • What is Machine Learning? • Mahout
Definition • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” – Introduction to Machine Learning by E. Alpaydin • Subset of Artificial Intelligence – Many other fields: comp. sci., biology, math, psychology, etc.
Types • Supervised – Using labeled training data, create function that predicts output of unseen inputs • Unsupervised – Using unlabeled data, create function that predicts output • Semi-Supervised – Uses labeled and unlabeled data
Characterizations • Lots of Data • Identifiable Features in that Data • Too big/costly for people to handle – People still can help
Clustering • Unsupervised • Find Natural Groupings – Documents – Search Results – People – Genetic traits in groups – Many, many more uses
Example: Clustering Google News
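To make "find natural groupings" concrete, here is a minimal single-machine sketch of k-means, one of the clustering algorithms this deck lists for Mahout. It is not Mahout's implementation: the class name `KMeansSketch`, the toy 2-D points, and the starting centroids are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Mahout.
class KMeansSketch {
    /** Returns the index of the centroid nearest to p (squared Euclidean distance). */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centroids[c][i];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    /** Alternates assignment and update steps; centroids are updated in place. */
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        int dim = points[0].length;
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int c = nearest(points[i], centroids);
                if (c != assignment[i]) { assignment[i] = c; changed = true; }
            }
            if (!changed) break; // converged: no point switched cluster
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[centroids.length][dim];
            int[] counts = new int[centroids.length];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += points[i][d];
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) continue; // leave an empty cluster's centroid where it is
                for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Toy data (assumed): two visually obvious groups.
        double[][] points = {{1, 1}, {1.5, 2}, {1, 0.5}, {8, 8}, {9, 9.5}, {8.5, 9}};
        double[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(points, centroids, 20)));
        // prints [0, 0, 0, 1, 1, 1]
    }
}
```

No labels are supplied anywhere; the grouping comes only from distances between points, which is what makes the task unsupervised.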
Collaborative Filtering • Unsupervised • Recommend people and products – User-User • User likes X, you might too – Item-Item • People who bought X also bought Y
Example: Collab Filtering Amazon.com
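The item-item idea ("people who bought X also bought Y") can be sketched by counting how often two items appear in the same user's history. This is a simplified illustration under assumed names and data (`ItemCooccurrenceSketch`, the users, and the items are all hypothetical); it is not the Taste API and it ignores ratings entirely.

```java
import java.util.*;

// Hypothetical illustration class; not part of Taste or Mahout.
class ItemCooccurrenceSketch {
    /** Small helper to build a set of item names. */
    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** For every ordered pair of distinct items, counts the users who have both. */
    static Map<String, Map<String, Integer>> cooccurrence(Map<String, Set<String>> itemsByUser) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (Set<String> items : itemsByUser.values()) {
            for (String a : items) {
                for (String b : items) {
                    if (!a.equals(b)) {
                        counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    /** Recommends up to n items the user lacks, ranked by co-occurrence with items the user has. */
    static List<String> recommend(String user, Map<String, Set<String>> itemsByUser, int n) {
        Set<String> owned = itemsByUser.getOrDefault(user, Collections.emptySet());
        Map<String, Map<String, Integer>> counts = cooccurrence(itemsByUser);
        Map<String, Integer> scores = new TreeMap<>(); // sorted keys give a deterministic tie order
        for (String item : owned) {
            Map<String, Integer> row = counts.getOrDefault(item, Collections.emptyMap());
            for (Map.Entry<String, Integer> e : row.entrySet()) {
                if (!owned.contains(e.getKey())) {
                    scores.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((x, y) -> scores.get(y) - scores.get(x)); // highest score first; sort is stable
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }

    public static void main(String[] args) {
        // Toy purchase histories (assumed data).
        Map<String, Set<String>> itemsByUser = new HashMap<>();
        itemsByUser.put("alice", setOf("book", "lamp", "pen"));
        itemsByUser.put("bob", setOf("book", "lamp"));
        itemsByUser.put("carol", setOf("book", "pen", "desk"));
        itemsByUser.put("dave", setOf("book"));
        itemsByUser.put("erin", setOf("book", "lamp"));
        System.out.println(recommend("dave", itemsByUser, 2)); // prints [lamp, pen]
    }
}
```

A user-user recommender would instead compare whole users to each other and borrow items from the most similar ones.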
Classification/Categorization • Many, many types • Spam Filtering • Named Entity Recognition • Phrase Identification • Sentiment Analysis • Classification into a Taxonomy
Example: NER? Excerpt from Yahoo News
Example: Categorization
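Spam filtering, one of the classification uses listed above, is a typical supervised task: learn from labeled messages, then label unseen ones. A rough sketch with a multinomial naive Bayes classifier and add-one smoothing follows. The class `NaiveBayesSketch`, its methods, and the training messages are assumptions for illustration; Mahout's own Naive Bayes classifier is a separate, distributed implementation.

```java
import java.util.*;

// Hypothetical illustration class; not Mahout's Naive Bayes implementation.
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Records one labeled training document (whitespace-tokenized, lowercased). */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : text.toLowerCase().split("\\s+")) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    /** Returns the label with the highest log posterior, or null if nothing was trained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // Log prior: fraction of training documents carrying this label.
            double score = Math.log((double) docsPerLabel.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            for (String w : text.toLowerCase().split("\\s+")) {
                int c = counts.getOrDefault(w, 0);
                // Add-one (Laplace) smoothing so unseen words do not zero out the score.
                score += Math.log((c + 1.0) / (totalWords.get(label) + vocabulary.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        // Toy labeled data (assumed).
        nb.train("spam", "win cash now");
        nb.train("spam", "cheap cash prize");
        nb.train("ham", "meeting at noon");
        nb.train("ham", "lunch at noon tomorrow");
        System.out.println(nb.classify("cash prize now"));        // prints spam
        System.out.println(nb.classify("lunch meeting at noon")); // prints ham
    }
}
```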
Info. Retrieval • Learning Ranking Functions • Learning Spelling Corrections • User Click Analysis and Tracking
Other • Image Analysis • Robotics • Games • Higher level natural language processing • Many, many others
What is Apache Mahout? • A mahout is an elephant trainer/driver/keeper, hence… Hadoop (and other distributed techniques) + Machine Learning = Mahout
What? • Hadoop brings: – Map/Reduce API – HDFS – In other words, scalability and fault tolerance • Mahout brings: – Library of machine learning algorithms – Examples
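The Map/Reduce model named above splits a job into a map step that emits key/value pairs and a reduce step that combines all values sharing a key. The sketch below simulates that flow in one process with plain Java collections, using word counting as the job. It deliberately does not use the Hadoop API; `MapReduceSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.util.*;

// Hypothetical single-process simulation; not the Hadoop Map/Reduce API.
class MapReduceSketch {
    /** Map phase: emits a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    /** Reduce phase: combines every value emitted for one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    /** Driver: runs map on each line, groups pairs by key (the "shuffle"), then reduces. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("mahout runs on hadoop", "hadoop scales mahout");
        System.out.println(run(lines));
        // prints {hadoop=2, mahout=2, on=1, runs=1, scales=1}
    }
}
```

Because each `map` call touches only its own line and each `reduce` call only its own key, a framework can spread both phases across many machines, which is where the scalability comes from.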
Why Mahout? • Many Open Source ML libraries either: – Lack Community – Lack Documentation and Examples – Lack Scalability – Lack the Apache License ;-) – Or are research-oriented
Why Mahout? • Intelligent Apps are the Present and Future • Thus, Mahout’s Goal is: – Scalable Machine Learning with Apache License
Current Status • What’s in it: – Simple Matrix/Vector library – Taste Collaborative Filtering – Clustering • Canopy/K-Means/Fuzzy K-Means/Mean-shift/Dirichlet – Classifiers • Naïve Bayes • Complementary NB – Evolutionary • Integration with Watchmaker for fitness function
How? • Examples – Taste – Clustering – Classification – Evolutionary
Taste: Movie Recommendations • Given ratings by users of movies, recommend other movies • http://lucene.apache.org/mahout/taste.html#demo
Taste Demo • http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=12&debug=true • http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=43&debug=true
Clustering: Synthetic Control Data • http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series • Each clustering implementation has an example Job for running in <MAHOUT_HOME>/examples – o.a.mahout.clustering.syntheticcontrol.* • Outputs clusters…
Classification: NB and CNB Examples • 20 Newsgroups – http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups • Wikipedia – http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
Evolutionary • Traveling Salesman – http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman • Class Discovery – http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
What’s Next? • More Examples • Winnow/Perceptron (MAHOUT-85) • Text Clustering • Association Rules (MAHOUT-108) • Logistic Regression • Solr Integration (SOLR-769) • GSOC
When, Who • When? Now! – Mahout is growing • Who? You! – We want programmers who: • Are comfortable with math • Like to work on hard problems – We want others to: • Kick the tires
Where? • http://lucene.apache.org/mahout – Hadoop - http://hadoop.apache.org • http://cwiki.apache.org/MAHOUT • mahout-{user|dev}@lucene.apache.org – http://www.lucidimagination.com/search/p:mahout
Resources • “Programming Collective Intelligence” by Segaran • “Data Mining - Practical Machine Learning Tools and Techniques” by Witten and Frank • “Taming Text” by Ingersoll and Morton