Apache Mahout
Industrial Strength Machine Learning
Jeff Eastman

Current Situation
• Large volumes of data are now available
• Platforms now exist to run computations over large datasets (Hadoop, HBase)
• Sophisticated analytics are needed to turn data into information people can use
• Active research community and proprietary implementations of “machine learning” algorithms
• The world needs scalable implementations of ML under open license - ASF

Where is ML Used Today
• Internet search clustering
• Knowledge management systems
• Social network mapping
• Taxonomy transformations
• Marketing analytics
• Recommendation systems
• Log analysis & event filtering
• Fraud detection

History of Mahout
• Summer 2007
  – Developers needed scalable ML
  – Mailing list formed
• Community formed
  – Apache contributors
  – Academia & industry
  – Lots of initial interest
• Project formed under Apache Lucene
  – January 25, 2008

Who We Are (so far)
Grant Ingersoll
Jeff Eastman
Dawid Weiss
Ted Dunning
Otis Gospodnetic
Erik Hatcher
Karl Wettin
Isabel Drost

Current Code Base
• Matrix & Vector library
  – Hama collaboration for very large arrays
• Clustering
  – Canopy
  – K-Means
  – Mean Shift
• Utilities
  – Distance Measures
  – Parameters
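To make the pairing of "Clustering" and "Distance Measures" concrete, here is a minimal single-machine sketch of K-Means with a pluggable distance measure. It is an illustration only: the names `KMeansSketch`, `DistanceMeasure`, `EUCLIDEAN`, `nearest`, and `cluster` are hypothetical and are not Mahout's API, and the mean-based centroid update assumes a Euclidean-style distance. In a Hadoop setting, the assignment step would roughly correspond to a map phase and the averaging step to a reduce phase, though the sketch itself runs in plain Java.

```java
/** Hypothetical illustration; none of these names come from Mahout itself. */
class KMeansSketch {

    /** Pluggable distance, in the spirit of a "Distance Measures" utility. */
    interface DistanceMeasure {
        double distance(double[] a, double[] b);
    }

    /** Plain Euclidean distance between two equal-length vectors. */
    static final DistanceMeasure EUCLIDEAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    };

    /** Returns the index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids, DistanceMeasure m) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = m.distance(point, centroids[c]);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** Repeats assign-then-average until centroids stop moving or maxIter is hit. */
    static double[][] cluster(double[][] points, double[][] initial,
                              DistanceMeasure m, int maxIter) {
        int k = initial.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone(); // do not mutate the caller's array
        }
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: each point contributes to its nearest centroid.
            for (double[] p : points) {
                int c = nearest(p, centroids, m);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: each centroid becomes the mean of its assigned points.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int i = 0; i < dim; i++) {
                    double updated = sums[c][i] / counts[c];
                    if (updated != centroids[c][i]) {
                        moved = true;
                    }
                    centroids[c][i] = updated;
                }
            }
            if (!moved) {
                break;
            }
        }
        return centroids;
    }
}
```

Calling `KMeansSketch.cluster(points, initialCentroids, KMeansSketch.EUCLIDEAN, 10)` returns the final centroids. Swapping in another `DistanceMeasure` changes only the assignment step, which is why a distance abstraction is a natural shared utility for clustering code.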

Algorithms Under Development
• Naïve Bayes
• Perceptron
• PLSI/EM
• Taste Collaborative Filtering Integration
• Genetic Programming
• Dirichlet Process Clustering

GSoC @ Mahout
• Many interesting submissions
• 4 projects approved for Mahout (http://code.google.com/soc/2008/asf/about.html)
  – “Mahout: Parallel implementation of machine learning algorithms”, Farid Bourennani
  – “Implementing Logistic Regression in Mahout”, Yun Jiang
  – “Codename Mahout.GA for mahout-machinelearning”, Abdel Hakim Deneche
  – “To implement Complementary Naïve Bayes and Expectation Maximization algorithm using Map Reduce for Multicore Systems”, Robin Anil

Conclusion
• This is just the beginning
• High demand for scalable machine learning
• Contributors needed who have
  – Interest, enthusiasm & programming ability
  – Test driven development readiness
  – Comfort with the scary math (or bravery)
  – Interest and/or proficiency with Hadoop
  – Some large data sets you want to analyze