
http://mahout.apache.org/
Presented by: Javier Pastorino
Fall 2016

About Mahout
Installation
Algorithms
Examples
    Classification
    Clustering

An environment for quickly creating scalable, performant machine learning applications.
Exploits Hadoop for parallel processing.
Implements 3 ML techniques:
    Recommendation: uses personal and community information to make a recommendation.
        Video streaming like Netflix and Hulu, radio like Pandora and Spotify, and others like eHarmony and Amazon.
    Classification: uses known data to classify new data.
        Antispam systems.
    Clustering: groups data into new categories.
YouTube: BTI-360

Quite easy: download, unzip, ready.
Prerequisites:
    Java installed
    Apache Hadoop
        Download: http://hadoop.apache.org/#Download+Hadoop
        Quick setup: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Installing_Software
Set up environment variables:
    MAHOUT_HOME=/path/to/mahout
    MAHOUT_LOCAL=true    # for running standalone on your dev machine; unset for running on a cluster
    JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre
    HADOOP_HOME=/home/bdlab/hadoop-2.7.3
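A quick pre-flight check can catch a missing variable before the first Mahout run. The sketch below is only an illustration: the variable names are the ones listed on the slide, while the helper names (missing_variables, runs_locally) are hypothetical, and treating any non-empty MAHOUT_LOCAL value as "standalone" is an assumption about how the launcher reads it.

```python
import os

# Variable names taken from the slide; the checking logic is illustrative only.
REQUIRED = ("MAHOUT_HOME", "JAVA_HOME", "HADOOP_HOME")


def missing_variables(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]


def runs_locally(env):
    """Assumption: any non-empty MAHOUT_LOCAL value selects standalone mode."""
    return bool(env.get("MAHOUT_LOCAL"))


# Report on the current shell environment.
print("missing:", missing_variables(os.environ))
print("standalone mode:", runs_locally(os.environ))
```

Passing a plain dictionary instead of os.environ makes the two helpers easy to try out without touching the real environment.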

Mahout Math-Scala Core Library and Scala DSL
    Mahout Distributed BLAS. Distributed Row Matrix API with R- and Matlab-like operators. Distributed ALS, SPCA, SSVD, thin-QR. Similarity Analysis.
Mahout Interactive Shell
    Interactive REPL shell for Spark optimized Mahout DSL
Collaborative Filtering with CLI drivers
    User-Based Collaborative Filtering
    Item-Based Collaborative Filtering
    Matrix Factorization with ALS on Implicit Feedback
    Weighted Matrix Factorization, SVD++
Classification with CLI drivers
    Logistic Regression - trained via SGD
    Naive Bayes / Complementary Naive Bayes
    Hidden Markov Models
Clustering with CLI drivers
    Canopy Clustering
    k-Means Clustering
    Fuzzy k-Means
    Streaming k-Means
    Spectral Clustering
Dimensionality Reduction
    Singular Value Decomposition
    Lanczos Algorithm
    Stochastic SVD
    PCA (via Stochastic SVD)
    QR Decomposition


Create a working directory (${WD}) for the dataset and all input/output.
Convert the full 20 newsgroups dataset into a <Text, Text> SequenceFile.
    $ mahout seqdirectory -i ${WD}/20news-all -o ${WD}/20news-seq -ow
Convert and preprocess the dataset into a <Text, VectorWritable> SequenceFile containing term frequencies for each document.
    $ mahout seq2sparse -i ${WD}/20news-seq -o ${WD}/20news-vectors -lnorm -nv -wt tfidf
Split the preprocessed dataset into training and testing sets (40% selected at random for testing).
    $ mahout split -i ${WD}/20news-vectors/tfidf-vectors --trainingOutput ${WD}/20news-train-vectors --testOutput ${WD}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
Train the classifier.
    $ mahout trainnb -i ${WD}/20news-train-vectors -el -o ${WD}/model -li ${WD}/labelindex -ow -c
Test the classifier.
    $ mahout testnb -i ${WD}/20news-test-vectors -m ${WD}/model -l ${WD}/labelindex -ow -o ${WD}/20news-testing -c
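To make the train/test steps less of a black box, here is a minimal sketch of what a Naive Bayes text classifier does conceptually. It is not Mahout's implementation: it uses raw token counts with add-one smoothing rather than TF-IDF vectors, it is the plain multinomial variant rather than the complementary one, and the function names and toy documents are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (label, tokens). Collect the counts the classifier needs."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        label_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, term_counts, vocab


def classify_nb(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, term_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_terms = sum(term_counts[label].values())
        # Log prior: share of training documents carrying this label.
        score = math.log(label_counts[label] / total_docs)
        for term in tokens:
            if term in vocab:  # terms never seen in training are ignored
                # Add-one (Laplace) smoothed term likelihood.
                score += math.log(
                    (term_counts[label][term] + 1) / (total_terms + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, labelled with two of the newsgroup names.
training = [
    ("rec.autos", ["engine", "car", "wheel"]),
    ("rec.autos", ["car", "dealer", "engine"]),
    ("sci.space", ["orbit", "launch", "shuttle"]),
    ("sci.space", ["shuttle", "orbit", "moon"]),
]
model = train_nb(training)
print(classify_nb(model, ["car", "engine"]))
print(classify_nb(model, ["orbit", "moon", "car"]))
```

Working in log space avoids numeric underflow when documents contain many terms, which is why the scores are sums of logarithms rather than products of probabilities.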

=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
381 0 0 9 1 0 0 0 1 0 0 2 0 1 0 0 3 0 |398 a=rec.motorcycles
1 284 0 0 1 0 6 3 11 0 66 3 0 6 0 4 9 0 |395 b=comp.windows.x
2 0 339 2 0 3 5 1 0 0 1 1 12 1 7 0 2 0 |376 c=talk.politics.mideast
4 0 1 327 0 2 2 0 0 2 1 1 0 5 1 4 12 0 |364 d=talk.politics.guns
7 0 4 32 27 7 7 2 0 12 0 0 6 0 100 9 7 31 0 0 |251 e=talk.religion.misc
10 0 0 359 2 2 0 0 3 0 1 6 0 1 0 0 11 0 |396 f=rec.autos
0 0 0 1 383 9 1 0 0 0 0 3 0 0 |397 g=rec.sport.baseball
1 0 0 0 9 382 0 0 1 1 1 0 2 0 |399 h=rec.sport.hockey
2 0 0 4 3 0 330 4 4 0 5 12 0 0 2 0 12 7 |385 i=comp.sys.mac.hardware
0 3 0 0 1 0 0 368 0 0 10 4 1 3 2 0 |394 j=sci.space
0 0 0 3 1 0 27 2 291 0 11 25 0 0 13 18 |392 k=comp.sys.ibm.pc.hardware
8 0 1 109 0 6 11 4 1 18 0 98 1 3 11 10 27 1 1 0 |310 l=talk.politics.misc
0 11 0 0 0 3 6 0 10 6 11 0 299 13 0 2 13 0 7 8 |389 m=comp.graphics
6 0 1 0 0 4 2 0 5 2 12 0 8 321 0 4 14 0 8 6 |393 n=sci.electronics
2 0 0 0 4 1 0 3 1 372 6 0 2 1 2 |398 o=soc.religion.christian
4 0 0 1 0 2 3 3 0 4 2 0 7 12 6 342 1 0 9 0 |396 p=sci.med
0 1 0 1 4 0 3 0 1 0 8 4 0 2 369 0 1 1 |396 q=sci.crypt
10 0 4 10 1 5 6 2 2 6 2 0 2 1 86 15 14 152 0 1 |319 r=alt.atheism
4 0 0 9 1 1 8 1 12 0 3 0 2 0 0 0 341 2 |390 s=misc.forsale
8 5 0 0 0 1 6 0 8 5 50 0 40 2 1 0 9 0 3 256 |394 t=comp.os.ms-windows.misc
=======================================================
Statistics
-------------------------------------------------------
Kappa                                0.8808
Accuracy                             90.8596%
Reliability                          86.3632%
Reliability (standard deviation)     0.2131
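For readers unfamiliar with the summary statistics, the sketch below shows the textbook way to derive accuracy and Cohen's kappa from a confusion matrix. The 3-class matrix is a made-up example, not the slide's data, and whether Mahout computes its reported Kappa with exactly this formula is an assumption.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = number of items of true class i classified as class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: the diagonal holds the correctly classified items.
    observed = sum(matrix[i][i] for i in range(n)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical 3-class confusion matrix, for illustration only.
example = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
acc, kappa = accuracy_and_kappa(example)
print(f"accuracy={acc:.4f} kappa={kappa:.4f}")
```

Kappa discounts the agreement a classifier would reach by chance alone, so it is always at or below the raw accuracy when chance agreement is positive.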

Selects the clustering type: kmeans, fuzzykmeans, lda, or streamingkmeans.
Parses the data: runs org.apache.lucene.benchmark.utils.ExtractReuters to generate reuters-out from reuters-sgm (the downloaded archive).
    $MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters ${WD}/reuters-sgm ${WD}/reuters-out
Runs seqdirectory to convert reuters-out to SequenceFile format.
    $MAHOUT seqdirectory -i ${WD}/reuters-out -o ${WD}/reuters-out-seqdir -c UTF-8 -chunk 64 -xm sequential
Runs seq2sparse to convert SequenceFiles to sparse vector format.
    $MAHOUT seq2sparse -i ${WD}/reuters-out-seqdir/ -o ${WD}/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector
Runs k-means with 20 clusters.
    $MAHOUT kmeans -i ${WD}/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c ${WD}/reuters-kmeans-clusters -o ${WD}/reuters-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
Runs clusterdump to show the results.
    $MAHOUT clusterdump -i `$DFS -ls -d ${WD}/reuters-kmeans/clusters-*-final | awk '{print $8}'` -o ${WD}/reuters-kmeans/clusterdump -d ${WD}/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir ${WD}/reuters-kmeans/clusteredPoints
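The k-means step can be pictured with a small single-machine sketch. This is not Mahout's distributed implementation: it assumes dense points in memory, seeds the centroids with the first k points so the result is reproducible (Mahout picks its own seeds), and borrows only the Euclidean distance and the iteration cap of 10 from the slide. The sample points are invented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    # Assumption for reproducibility: seed with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical 2-D points forming two obvious groups.
sample = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, assignments = kmeans(sample, k=2)
print(centroids)
print(assignments)
```

In the Reuters run each point is a TF-IDF document vector rather than a 2-D coordinate, but the assign-then-update loop is the same idea.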

:{"identifier": "VL-5965", "r": [], "c": [{"10": 2.643}, {"11": 2.714}, {"16.2": 7.612}, {"16.9": 7.545}, {"17": 3.
    Top Terms:
        ionics => 9.880817413330078
        3,001,000 => 9.880817413330078
        ion => 9.880817413330078
        938,000 => 9.880817413330078
        419,000 => 9.593134880065918
        28.34 => 9.593134880065918
        383,000 => 9.593134880065918
        64.6 => 9.18766975402832
        952,000 => 9.18766975402832
        68.8 => 8.899988174438477
        175,000 => 8.899988174438477
        nonrecurring => 8.340372085571289
        31.9 => 8.034990310668945
        30.3 => 8.034990310668945
        16.2 => 7.612133502960205
        vs => 7.593847751617432
        16.9 => 7.545442581176758
        backlog => 6.622720718383789
        95 => 6.016584873199463
        net => 5.726622104644775
    Weight : [props - optional]:  Point:
Inter-Cluster Density: 0.5185260382824893
Intra-Cluster Density: 0.6068435718170822
CDbw Inter-Cluster Density: 0.0
CDbw Intra-Cluster Density: 20.521295965955037
CDbw Separation: 29373.551986330545
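The "Top Terms" listing can be read as the highest-weighted dimensions of a cluster centroid, mapped back to words through the dictionary file. The sketch below assumes exactly that ranking, which is consistent with the descending weights on the slide. The function name and the term indices are hypothetical, and the four weights are copied from the output above.

```python
def top_terms(centroid, dictionary, n=20):
    """centroid: {term_index: weight}; dictionary: {term_index: term}.

    Return the n highest-weighted (term, weight) pairs, heaviest first.
    """
    ranked = sorted(centroid.items(), key=lambda item: item[1], reverse=True)
    return [(dictionary[index], weight) for index, weight in ranked[:n]]


# Term indices are made up; weights are four of the values shown on the slide.
dictionary = {0: "net", 1: "ionics", 2: "backlog", 3: "vs"}
centroid = {
    0: 5.726622104644775,
    1: 9.880817413330078,
    2: 6.622720718383789,
    3: 7.593847751617432,
}
for term, weight in top_terms(centroid, dictionary, n=3):
    print(f"{term} => {weight}")
```

The default n=20 mirrors the -n 20 option passed to clusterdump in the example run.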
