WEKA
Waikato Environment for Knowledge Analysis
Branko Kavšek
MPŠ Jožef Stefan
November 2005

Goals
• Acquisition of functional knowledge about the WEKA platform
• Ability to process (own) data in WEKA:
  – identify a problem
  – transform into data
  – choose appropriate DM technique
  – apply to data
  – evaluate results
  – interpretation

What is WEKA?
Some basic facts about WEKA:
• WEKA(1) = a flightless bird with an inquisitive nature (found only on the islands of New Zealand)
• WEKA(2) = a software ‘workbench’ incorporating several standard ML/DM techniques
• Authors = Ian H. Witten, Eibe Frank (et al.)
• Programming language = JAVA
• Origin = The University of Waikato, New Zealand
• Literature = Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with JAVA Implementations, Morgan Kaufmann, 1999
• Homepage = http://www.cs.waikato.ac.nz/~ml/weka

Objectives of WEKA
• make ML/DM techniques generally available
• apply them to practical problems (in agriculture)
• develop new ML/DM algorithms
• contribute to theoretical framework of the field (ML/DM)

Versions of WEKA
• There are several versions of WEKA:
  – WEKA 3.0: “book version”, compatible with the description in the data mining book
  – WEKA 3.2: “first GUI version”, adds graphical user interfaces (the book version is command-line only)
  – WEKA 3.5: “development version” with lots of improvements
• This workshop is based on WEKA 3.5(.2)

Outline
• WEKA on the WEB
• Transforming data into the “right” format
• Using the “Explorer”
• WEKA from the command line (Simple CLI)
• Knowledge flow in brief
• Performing the experiments
• Tips & tricks
• The PROs and the CONs of WEKA

WEKA on the WEB

The input to WEKA
• ARFF (Attribute-Relation File Format): “flat” files
• Example: Play-tennis domain

    % this is an example of a knowledge
    % domain in ARFF format
    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny, 85, 85, FALSE, no
    sunny, 80, 90, TRUE, no
    overcast, 83, 86, FALSE, yes
    rainy, 70, 96, FALSE, yes
    rainy, 68, 80, FALSE, yes
    rainy, 65, 70, TRUE, no
    overcast, 64, 65, TRUE, yes
    sunny, 72, 95, FALSE, no
    sunny, 69, 70, FALSE, yes
    rainy, 75, 80, FALSE, yes
    sunny, 75, 70, TRUE, yes
    overcast, 72, 90, TRUE, yes
    overcast, 81, 75, FALSE, yes
    ...

Conversion to the ARFF format?
• Example: converting from MS-EXCEL to ARFF
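One way to picture the conversion is a spreadsheet exported as CSV and then rewritten with an ARFF header. The sketch below is only an illustration in Python, not WEKA's own converter: the function name `csv_to_arff` and the `nominal` mapping are hypothetical, and it assumes a header row, no quoted values, and that every column not listed as nominal is numeric.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal):
    """Convert CSV text with a header row into ARFF text.

    `nominal` maps a column name to its list of allowed values;
    every other column is declared as `real` (an assumption of this sketch).
    """
    rows = list(csv.reader(io.StringIO(csv_text), skipinitialspace=True))
    header, data = rows[0], rows[1:]

    # Header section: relation name, then one @attribute line per column.
    lines = ["@relation " + relation, ""]
    for name in header:
        if name in nominal:
            lines.append("@attribute %s {%s}" % (name, ", ".join(nominal[name])))
        else:
            lines.append("@attribute %s real" % name)

    # Data section: one comma-separated instance per line.
    lines += ["", "@data"]
    for number, row in enumerate(data, start=2):
        if len(row) != len(header):
            raise ValueError("CSV line %d has %d values, expected %d"
                             % (number, len(row), len(header)))
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


CSV_TEXT = """outlook,temperature,humidity,windy,play
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
"""

NOMINAL = {
    "outlook": ["sunny", "overcast", "rainy"],
    "windy": ["TRUE", "FALSE"],
    "play": ["yes", "no"],
}

arff_text = csv_to_arff(CSV_TEXT, "weather", NOMINAL)
print(arff_text)
```

The row-length check is there because a missing value in a single row is an easy mistake to make when preparing data by hand.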

Starting WEKA – the GUI

A quick tour of the “Explorer”
• Preprocess panel
[Screenshot of the Preprocess panel; labelled areas: Filters panel, Domain info. panel, Attributes panel, Attribute visualization panel, Status bar, Log file]

A quick tour of the “Explorer”
• Classify panel
[Screenshot of the Classify panel; labelled areas: Classifier panel, Test options panel, Class attribute, Result panel, Output panel]

A quick tour of the “Explorer”
• Visualize panel

The command line
• example:

C:\Temp>java weka.classifiers.trees.J48

Weka exception: No training file and no object input file given.

General options:

-t <name of training file>
        Sets training file.
-T <name of test file>
        Sets test file. If missing, a cross-validation
        will be performed on the training data.
-c <class index>
        Sets index of class attribute (default: last).
-x <number of folds>
        Sets number of folds for cross-validation (default: 10).
-s <random number seed>
        Sets random number seed for cross-validation (default: 1).
-m <name of file with cost matrix>
        Sets file with cost matrix.
-l <name of input file>
        Sets model input file.
-d <name of output file>
        Sets model output file.
-v
        Outputs no statistics for training data.
-o
        Outputs statistics only, not the classifier.
-i
        Outputs detailed information-retrieval
        statistics for each class.
-k
        Outputs information-theoretic statistics.
-p
        Only outputs predictions for test instances.
-r
        Only outputs cumulative margin distribution.
-z <class name>
        Only outputs the source representation of the classifier,
        giving it the supplied name.
-g
        Only outputs the graph representation of the classifier.

Options specific to weka.classifiers.j48.J48:

-U
        Use unpruned tree.
-C <pruning confidence>
        Set confidence threshold for pruning.
        (default 0.25)
-M <minimum number of instances>
        Set minimum number of instances per leaf.
        (default 2)
-R
        Use reduced error pruning.
-N <number of folds>
        Set number of folds for reduced error pruning.
        One fold is used as pruning set.
        (default 3)
-B
        Use binary splits only.
-S
        Don't perform subtree raising.
-L
        Do not clean up after the tree has been built.
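To show how a few of these options combine, here is a minimal Python sketch that only assembles the argument list for a J48 run; it does not start Java. The helper name `j48_command` and the file name `weather.arff` are hypothetical, the sketch assumes WEKA is already on the Java classpath, and treating -U and -C as alternatives is an assumption of the sketch rather than something the option listing states.

```python
def j48_command(train, test=None, folds=None, confidence=None,
                min_instances=None, unpruned=False):
    """Build the argument list for a J48 run from a few of the listed options."""
    cmd = ["java", "weka.classifiers.trees.J48", "-t", train]

    # With a test file (-T) no cross-validation is performed,
    # so -x is only added when there is no test file.
    if test is not None:
        cmd += ["-T", test]
    elif folds is not None:
        cmd += ["-x", str(folds)]

    # Assumption of this sketch: an unpruned tree (-U) makes a
    # pruning confidence (-C) pointless, so only one of them is emitted.
    if unpruned:
        cmd.append("-U")
    elif confidence is not None:
        cmd += ["-C", str(confidence)]

    if min_instances is not None:
        cmd += ["-M", str(min_instances)]
    return cmd


print(" ".join(j48_command("weather.arff", folds=5, confidence=0.1)))
```

A list built this way could be handed to `subprocess.run` for batch processing over several data sets; options left as `None` fall back to the defaults shown in the listing.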

Using the “Simple CLI”

The “flow of knowledge”

Performing the experiments

Tips & tricks
• More memory:
    java -mx10000 -oss10000 ...
• Converting to ARFF & verifying:
    java weka.core.converters.CSVLoader filename.csv > filename.arff
    java weka.core.Instances filename.arff
• Checking available memory:
  – right-click on the status bar
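As a rough stand-in for the verification step when WEKA is not at hand, the following Python sketch performs a basic structural check of an ARFF file. It is not what `weka.core.Instances` does; the function name `check_arff` is hypothetical, and the sketch assumes plain (non-sparse) data rows without quoted values containing commas.

```python
def check_arff(text):
    """Return (relation, attribute_names, row_count) or raise ValueError."""
    relation = None
    attributes = []
    rows = 0
    in_data = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            # Every instance must supply one value per declared attribute.
            values = [v.strip() for v in line.split(",")]
            if len(values) != len(attributes):
                raise ValueError("line %d: %d values, expected %d"
                                 % (number, len(values), len(attributes)))
            rows += 1
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            # The second token is the attribute name; the rest is its type.
            attributes.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """% small excerpt of the play-tennis domain
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
"""

print(check_arff(SAMPLE))
```

A row with a missing value, such as `sunny, 85, FALSE, no`, makes the check raise a `ValueError` that names the offending line.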

GUI vs. command line
GUI (+):
• visualisation of data and (some) models
GUI (-):
• not all the parameters can be set (reduced functionality)
Command line (+):
• full functionality
• batch processing
Command line (-):
• only textual visualisation of models
• awkward to use

PROs & CONs of WEKA
PROs:
• open source (GNU licence)
• platform-independent (JAVA)
• easy to use
• (relatively) easy to modify
CONs:
• relatively slow (JAVA)
• ‘incomplete’ documentation (some GUI features could be explained better)
• some features available only from command line

That’s it !!! Thanks
