Apache Mahout Industrial Strength Machine Learning Jeff Eastman

Current Situation
• Large volumes of data are now available
• Platforms now exist to run computations over large datasets (Hadoop, HBase)
• Sophisticated analytics are needed to turn data into information people can use
• Active research community and proprietary implementations of “machine learning” algorithms
• The world needs scalable implementations of ML under an open license (ASF)

Where is ML Used Today
• Internet search clustering
• Knowledge management systems
• Social network mapping
• Taxonomy transformations
• Marketing analytics
• Recommendation systems
• Log analysis & event filtering
• Fraud detection

History of Mahout
• Summer 2007
  – Developers needed scalable ML
  – Mailing list formed
• Community formed
  – Apache contributors
  – Academia & industry
  – Lots of initial interest
• Project formed under Apache Lucene
  – January 25, 2008

Who We Are (so far)
• Grant Ingersoll
• Jeff Eastman
• Dawid Weiss
• Ted Dunning
• Otis Gospodnetic
• Erik Hatcher
• Karl Wettin
• Isabel Drost

Current Code Base
• Matrix & Vector library
  – Hama collaboration for very large arrays
• Clustering
  – Canopy
  – K-Means
  – Mean Shift
• Utilities
  – Distance Measures
  – Parameters
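
To make the clustering and distance-measure items above concrete, here is a minimal single-machine sketch of K-Means with a pluggable distance measure. It is not Mahout code and does not use Mahout's API: the names `euclidean` and `k_means` are hypothetical, and the choice to seed the centroids with the first k points is an assumption made only to keep the example deterministic.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points (illustrative helper)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, distance=euclidean, max_iter=100):
    """Plain K-Means sketch; returns (centroids, assignments).

    Assumption: the first k points seed the centroids, so the result
    is deterministic for a given input order.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments
```

Passing the distance function as a parameter mirrors the idea of keeping distance measures as a separate utility: the clustering loop stays the same when the measure changes. On a platform such as Hadoop, the assignment step would typically run as the map phase and the centroid update as the reduce phase.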

Algorithms Under Development
• Naïve Bayes
• Perceptron
• PLSI/EM
• Taste Collaborative Filtering Integration
• Genetic Programming
• Dirichlet Process Clustering

GSoC @ Mahout
• Many interesting submissions
• 4 projects approved for Mahout (http://code.google.com/soc/2008/asf/about.html)
  – “Mahout: Parallel implementation of machine learning algorithms”, Farid Bourennani
  – “Implementing Logistic Regression in Mahout”, Yun Jiang
  – “Codename Mahout.GA for mahout-machinelearning”, Abdel Hakim Deneche
  – “To implement Complementary Naïve Bayes and Expectation Maximization algorithm using Map Reduce for Multicore Systems”, Robin Anil

Conclusion
• This is just the beginning
• High demand for scalable machine learning
• Contributors needed who have
  – Interest, enthusiasm & programming ability
  – Test-driven development readiness
  – Comfort with the scary math (or bravery)
  – Interest and/or proficiency with Hadoop
  – Some large data sets you want to analyze