Introduction to WEKA
Mark Hall
Pentaho Corporation
Suite 340, 5950 Hazeltine National Dr.
Orlando, FL 32822, USA

- Data Mining
- WEKA - what is it?
- WEKA UIs
- Integration with Pentaho
- Projects based on WEKA

Data Mining
- A definition: “Extraction of implicit, previously unknown, and potentially useful information from data”
- Goal (business oriented): improve marketing, sales, and customer support operations
  - Who is likely to remain a loyal customer/jump ship?
  - What products should be marketed to which prospects?
  - What determines whether a person will respond to a certain offer?
  - How can I detect potential fraud?
- Central idea: historical data contains information that will be useful in the future
  - Historical patterns provide useful insight and generalize to future situations
- Data Mining: algorithms that automatically detect patterns and regularities in data

Data Mining
- Strong patterns can be used to make predictions
  - Problem 1: a lot of patterns are not interesting
  - Problem 2: patterns may be inexact (or even completely spurious) if data is garbled or missing
- Techniques borrowed from statistics, computer science, machine learning research
- Compared to traditional statistics
  - Statistics is manual, user driven, top-down - formulate a hypothesis, convert hypothesis into database query, test significance of results
  - Data mining automates the data interrogation
  - Data-driven, self-organizing, bottom-up
  - Automatic examination of a large number of hypotheses
- Compared to OLAP
  - OLAP: data summarization - aggregation via addition
    - Number of widgets sold in all ZIP codes in the country
  - Data Mining: ratios, patterns and influences
    - Factors influencing the sales of the widgets in those ZIP codes
  - DM can enhance OLAP - suggest dimensions for cube, discretization etc.

Data Mining is a Process
Select -> Selected data -> Preprocess -> Preprocessed data -> Transform -> Transformed data -> Mine -> Extracted information -> Analyze & Assimilate -> Assimilated knowledge
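The stages on this slide can be pictured as a chain of small functions, each consuming the output of the previous one. The following is only a toy sketch: the function names, the customer records and the "churn rate per spend band" question are all invented for illustration and do not come from WEKA.

```python
from statistics import mean


def select(rows, columns):
    """Select: keep only the columns relevant to the question."""
    return [{c: r.get(c) for c in columns} for r in rows]


def preprocess(rows):
    """Preprocess: drop rows that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def transform(rows, column, threshold):
    """Transform: discretize a numeric column into 'low' / 'high'."""
    return [{**r, column: "high" if r[column] >= threshold else "low"} for r in rows]


def mine(rows, feature, target):
    """Mine: fraction of rows with a true target, per feature value."""
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(1 if r[target] else 0)
    return {value: mean(flags) for value, flags in groups.items()}


# Invented records; one has a missing 'spend' value.
raw = [
    {"id": 1, "spend": 120.0, "churned": False},
    {"id": 2, "spend": 15.0, "churned": True},
    {"id": 3, "spend": None, "churned": True},
    {"id": 4, "spend": 30.0, "churned": True},
    {"id": 5, "spend": 200.0, "churned": False},
    {"id": 6, "spend": 25.0, "churned": False},
]

selected = select(raw, ["spend", "churned"])
cleaned = preprocess(selected)
transformed = transform(cleaned, "spend", threshold=100.0)
churn_rate = mine(transformed, "spend", "churned")
print(churn_rate)
```

The final "Analyze & Assimilate" stage is deliberately left to the reader: it is the human step of deciding whether the extracted pattern is interesting and acting on it.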

What is WEKA?
Copyright: Martin Kramer (mkramer@wxs.nl)

Hamilton

WEKA: The Software
- WEKA (Waikato Environment for Knowledge Analysis)
- Funded by the NZ government for more than 10 years
  - Develop an open-source, state-of-the-art workbench of data mining tools
  - Explore fielded applications
  - Develop new fundamental methods
- Became part of the Pentaho suite in 2006
  - Pentaho Data Mining (PDM)

Core Functionality
- Support for the whole process of experimental data mining
  - Preparation of input data
  - Statistical evaluation of learning schemes
  - Visualization of input data and the result of learning
- Tools and algorithms
  - 69 data pre-processing tools
  - 118 classification/regression algorithms
  - 11 clustering algorithms
  - 18 attribute/subset evaluators + 12 search algorithms for feature selection
  - 6 algorithms for finding association rules
- User Interfaces
  - Explorer - data exploration/visualization, model construction and export, preliminary evaluation
  - Experimenter - large-scale algorithm comparison with statistical tests for significant differences in performance
  - KnowledgeFlow - process model view of data mining, export of DM process

Architecture
- Modular, object-oriented architecture
- Packages for different types of algorithms: filters, classifiers, clusterers, associations, attribute selection etc.
  - Sub-packages group components by functionality or purpose
  - E.g. classifiers.bayes, filters.unsupervised.attribute
- Public interface prescribed by abstract base class or interface for all types of algorithms
- Algorithms are Java Beans
  - GUIs use introspection/reflection to dynamically generate editor dialogs at runtime
- All components rely to a greater or lesser extent on a “core” top-level package
  - Classes and data structures for reading data sources; representing instances, sparse instances and attributes; and providing common utility methods
  - Additional interfaces that indicate extra functionality
- Packages containing learning schemes have associated “Evaluation” classes
  - Routines for performing cross-validation, computing performance metrics, generating ROC curves etc.
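Two ideas from this slide, a public interface fixed by an abstract base class and an evaluation routine that performs cross-validation, can be sketched in a few lines. This is an illustration in Python rather than WEKA's Java code: the `Classifier` and `MajorityClass` classes and the `cross_validate` function are hypothetical names, and the folds here are neither shuffled nor stratified.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Classifier(ABC):
    """Hypothetical stand-in for a common classifier interface."""

    @abstractmethod
    def build(self, instances):
        """Learn from a list of (features, label) pairs."""

    @abstractmethod
    def classify(self, features):
        """Return the predicted label for one feature tuple."""


class MajorityClass(Classifier):
    """Baseline that always predicts the most frequent training label."""

    def build(self, instances):
        counts = Counter(label for _, label in instances)
        self.label = counts.most_common(1)[0][0]

    def classify(self, features):
        return self.label


def cross_validate(make_classifier, instances, folds):
    """Accuracy over `folds` train/test splits (every instance tested once)."""
    correct = 0
    for k in range(folds):
        # Instance i belongs to test fold i % folds.
        test = instances[k::folds]
        train = [inst for i, inst in enumerate(instances) if i % folds != k]
        model = make_classifier()
        model.build(train)
        correct += sum(model.classify(x) == y for x, y in test)
    return correct / len(instances)


# Toy data: 4 'no' labels and 8 'yes' labels.
data = [((i,), "yes" if i % 3 else "no") for i in range(12)]
accuracy = cross_validate(MajorityClass, data, folds=4)
print(accuracy)
```

Because `cross_validate` only relies on the abstract interface, any other learner that implements `build` and `classify` can be evaluated by the same routine, which is the point of prescribing the interface centrally.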

Explorer

Explorer
- “Preprocess” panel
  - Load data from various sources (file, SQL database, URL etc.)
  - Apply pre-processing “filters” to the data
  - Summary statistics & histograms
- “Classify” panel
  - Apply classification and regression algorithms
  - Evaluate resulting models
    - Numerically via statistical estimation
    - Graphically through visualization (data and model)
- “Cluster” panel
  - Apply clustering algorithms to the data
  - Visualize the outcome
  - Clusters that represent density estimates can be evaluated based on the statistical likelihood of the data
- “Associate” panel
  - Learn association rules for market-basket type analysis
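For readers unfamiliar with market-basket analysis, the two standard measures used to judge an association rule, support and confidence, can be computed directly. The sketch below uses invented baskets and hypothetical helper names; it shows the definitions only, not the rule-search algorithms that WEKA provides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent => consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Invented shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "butter", "jam"},
]

# Rule: bread => butter
rule_support = support(baskets, {"bread", "butter"})       # 3 of 5 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 3 of the 4 bread baskets
print(rule_support, rule_confidence)
```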

Explorer
- “Select attributes” panel
  - Mix and match algorithms for evaluating the utility of attributes and sets of attributes with different search methods
- “Visualize” panel
  - Color-coded scatter plot matrix of the data
  - Select, enlarge, zoom in etc.
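The "mix and match" idea is that an attribute evaluator and a search method are separate, interchangeable parts. A minimal sketch, assuming information gain as the evaluator and a plain ranking as the search, might look as follows; the function names and the six-row toy data set are made up for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Evaluator: reduction in class entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_attributes(rows, attributes, target, evaluator):
    """Search: order attributes from most to least useful under `evaluator`."""
    return sorted(attributes, key=lambda a: evaluator(rows, a, target), reverse=True)


# Invented toy data.
weather = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

ranking = rank_attributes(weather, ["windy", "outlook"], "play", information_gain)
print(ranking)
```

Swapping in a different scoring function for `evaluator`, or replacing the ranking with a subset search, leaves the other part untouched.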

Knowledge Flow
- Define a data mining “process”
  - Like the Explorer, all of WEKA’s algorithms are available
  - Data flows through the process from node to node
- Accommodates both batch-based processing and data streams
  - Command line interface to WEKA can also train incremental classifiers on data streams
- Fully multi-threaded
  - Accommodates multiple independent “flows” on the same layout
  - Knowledge Flow’s Classifier step is multi-threaded
    - Build models for more than one cross-validation fold in parallel
- Binary and XML-based persistence of flow layouts
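What distinguishes an incremental classifier from a batch one is that it can be updated one instance at a time, so the full data set never has to be held in memory. The following sketch shows the idea with a small naive Bayes learner over nominal attributes; the class name, its methods and the example stream are assumptions made for illustration and are not WEKA's API.

```python
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over nominal attributes, updated one instance at a time."""

    def __init__(self):
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
        self.values_seen = defaultdict(set)       # attribute index -> distinct values

    def update(self, features, label):
        """Fold a single labelled instance into the running counts."""
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.value_counts[(i, label)][value] += 1
            self.values_seen[i].add(value)

    def predict(self, features):
        """Return the label with the highest (unnormalized) posterior score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, count in self.class_counts.items():
            score = count / total
            for i, value in enumerate(features):
                # Laplace smoothing so unseen values do not zero out the score.
                seen = self.value_counts[(i, label)][value]
                score *= (seen + 1) / (count + len(self.values_seen[i]))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def instance_stream():
    """Invented stream of (features, label) pairs, yielded one at a time."""
    yield ("sunny", "strong"), "no"
    yield ("sunny", "weak"), "no"
    yield ("rainy", "weak"), "yes"
    yield ("overcast", "weak"), "yes"
    yield ("overcast", "strong"), "yes"


model = IncrementalNaiveBayes()
for features, label in instance_stream():
    model.update(features, label)  # the model is usable after every update

print(model.predict(("overcast", "weak")), model.predict(("sunny", "strong")))
```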

Knowledge Flow

Experimenter
- Automate the process of determining the best method to use
  - This is an interactive process in the Explorer or Knowledge Flow
- Run classification and regression algorithms on a corpus of data sets
  - Try different parameter settings
  - Collect performance statistics
  - Perform significance tests on the results
- Raw output saved to files or databases
- Analysis results can be exported as text, CSV, Gnuplot, LaTeX or HTML
- Advanced users can distribute the processing load across multiple machines
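To make "perform significance tests on the results" concrete, here is a sketch of the textbook paired t statistic applied to two learning schemes evaluated on the same runs. The accuracy figures are invented, the function name is hypothetical, and this plain statistic is not necessarily the test the Experimenter itself applies (results from overlapping cross-validation folds are not independent, which the plain test ignores).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Textbook paired t statistic for two schemes scored on the same runs."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both schemes must be scored on the same runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Invented per-run accuracies for two hypothetical schemes.
scheme_a = [0.80, 0.82, 0.78, 0.85, 0.81]
scheme_b = [0.75, 0.80, 0.74, 0.79, 0.77]

t = paired_t_statistic(scheme_a, scheme_b)

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
CRITICAL_T_DF4 = 2.776
significant = abs(t) > CRITICAL_T_DF4
print(round(t, 2), significant)
```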

Experimenter

Extensibility
- Plugin mechanisms allow WEKA to be extended without modifying the classes in the WEKA distribution
- New tabs can be added to the Explorer
- New visualizations can be added to the pop-up menu in the Explorer’s Classify panel
  - Classifier errors, predictions, trees and graphs
- Knowledge Flow - “Plugins” tab
  - Drop a jar file into $HOME/.knowledgeFlow/plugins/<a plugin>/

Standards and Interoperability
- Support for PMML import
  - Regression, general regression and neural network model types
  - More model types and support for export in future development releases
- LibSVM/SVM-Light data format support
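The LibSVM data format is a sparse text format: each line holds a label followed by `index:value` pairs, with omitted indices meaning zero. A minimal reader, assuming 1-based feature indices and a known feature count, could be sketched as below; `parse_libsvm_line` is a hypothetical helper, not part of WEKA or LibSVM.

```python
def parse_libsvm_line(line, num_features):
    """Expand one sparse 'label index:value ...' line into a dense vector."""
    parts = line.split()
    label = float(parts[0])
    dense = [0.0] * num_features
    for pair in parts[1:]:
        index, value = pair.split(":")
        dense[int(index) - 1] = float(value)  # indices assumed to be 1-based
    return label, dense


label, features = parse_libsvm_line("+1 1:0.5 4:2 6:-1.25", num_features=6)
print(label, features)
```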

Integration With Pentaho
- Main point of integration is with Pentaho Data Integration (PDI), aka the Kettle project
- PDI (Kettle) - streaming, engine-driven ETL tool
  - PDI can export data in ARFF format
  - High-volume, low memory consumption
- WEKA-specific transformation steps
  - WekaScoring: score data using a pre-constructed WEKA model (classification, regression or clustering) or PMML model as part of an ETL transformation
  - KnowledgeFlow: execute arbitrary Knowledge Flow processes as part of an ETL transformation
    - Can be used to automatically refresh/rebuild a predictive model
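ARFF, the format mentioned above, is a plain-text layout with a header that declares the relation and its attributes, followed by comma-separated data rows in which `?` marks a missing value. The sketch below writes a tiny ARFF document by hand to show that structure. The `to_arff` helper and the customer data are invented, and the helper does no quoting, so it assumes names and values contain no spaces or commas.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    `attributes` is a list of (name, kind) pairs where kind is either a type
    keyword such as 'numeric' or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
        else:
            lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        # None is written as '?', the ARFF marker for a missing value.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


arff = to_arff(
    "customers",
    [("age", "numeric"), ("region", ["north", "south"]), ("churned", ["yes", "no"])],
    [(34, "north", "no"), (None, "south", "yes")],
)
print(arff)
```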

Scoring as Part of an ETL process

Refreshing a predictive model

Projects Based on/Using WEKA
- Open-source data mining systems
  - Konstanz Information Miner (KNIME) & RapidMiner provide WEKA plugins
  - R provides an interface to WEKA through the RWeka package
- Scientific workflow environment
  - Kepler Weka project integrates all of WEKA’s functionality into the Kepler open-source scientific workflow platform
- Systems for natural language processing
  - GATE NLP workbench
  - Balie - language identification, tokenization, sentence boundary detection, named entity recognition
  - Kea - automatic keyphrase extraction
- Text mining
  - Judge & IR Utilities - two systems that perform document categorization and clustering

Projects Based on/Using WEKA
- Knowledge discovery in biology
  - BioWEKA - extension to WEKA for tasks in biology, bioinformatics and biochemistry
  - Epitopes Toolkit - platform based on WEKA for developing epitope prediction tools
  - maxdView & Mayday - visualization and analysis of microarray data
- Distributed and parallel data mining
  - Weka-Parallel & GridWeka - distributed cross-validation, scoring and testing
  - FAEHIM & Weka4WS - make WEKA available as a web service
- Connectionist and artificial immune system algorithms
  - Weka Classification Algorithms Project - several artificial neural networks and artificial immune system based algorithms

Impact
- Has been downloaded more than 1.5 million times since placed on SourceForge in April 2000
  - Current download rate of more than 20,000 per month
- Large user-base and active community
  - 2750 people subscribed to the mailing list