WEKA: Waikato Environment for Knowledge Analysis

Goals of the workshop
• Acquisition of functional knowledge about the WEKA platform
• Ability to process (own) data in WEKA:
  – identify a problem
  – transform it into data
  – choose an appropriate DM technique
  – apply it to the data
  – evaluate & interpret the results
  – write the seminar work

What is WEKA?
Some basic facts about WEKA:
• WEKA (1) = a flightless bird with an inquisitive nature (found only on the islands of New Zealand)
• WEKA (2) = a software ‘workbench’ incorporating several standard ML/DM techniques
• Authors = Ian H. Witten, Eibe Frank (et al.)
• Programming language = Java
• Origin = The University of Waikato, New Zealand
• Literature = Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999
• Homepage = http://www.cs.waikato.ac.nz/~ml/weka

Objectives of WEKA
• make ML/DM techniques generally available
• apply them to practical problems (in agriculture)
• develop new ML/DM algorithms
• contribute to the theoretical framework of the field (ML/DM)

Versions of WEKA
• There are several versions of WEKA:
  – WEKA 3.0: “book version”, compatible with the description in the data mining book
  – WEKA 3.2: “GUI version”, adds graphical user interfaces (the book version is command-line only)
  – WEKA 3.4: “development version”, with lots of improvements
• This workshop is based on WEKA 3.4(.3)

The input to WEKA
ARFF format (“flat” files):
• example: Play-tennis domain

    % this is an example of a knowledge
    % domain in ARFF format
    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes
    rainy,68,80,FALSE,yes
    rainy,65,70,TRUE,no
    overcast,64,65,TRUE,yes
    sunny,72,95,FALSE,no
    sunny,69,70,FALSE,yes
    rainy,75,80,FALSE,yes
    sunny,75,70,TRUE,yes
    overcast,72,90,TRUE,yes
    overcast,81,75,FALSE,yes
    ...

Conversion to the ARFF format?
• Example: converting from MS Excel to ARFF
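One possible route from a spreadsheet to ARFF is to save the sheet as CSV and then add the header lines. The following is a minimal sketch of that idea in plain Java, not WEKA's own converter. The class name `CsvToArff` and its methods are hypothetical. It assumes a simple CSV file: the first line holds the column names, there are no quoted fields, and there are no missing values. A column is declared `real` when every cell parses as a number; otherwise it becomes a nominal attribute whose values are listed in order of first appearance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of WEKA.
class CsvToArff {

    /** Converts CSV lines (first line = column names) into ARFF text. */
    static String convert(String relation, List<String> csvLines) {
        String[] names = csvLines.get(0).split(",");
        List<String[]> rows = new ArrayList<>();
        for (String line : csvLines.subList(1, csvLines.size())) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split(",");
            if (cells.length != names.length) {
                throw new IllegalArgumentException("Wrong number of values: " + line);
            }
            for (int i = 0; i < cells.length; i++) {
                cells[i] = cells[i].trim();
            }
            rows.add(cells);
        }

        StringBuilder out = new StringBuilder();
        out.append("@relation ").append(relation).append("\n\n");
        for (int col = 0; col < names.length; col++) {
            boolean numeric = true;
            Set<String> values = new LinkedHashSet<>(); // keeps first-seen order
            for (String[] row : rows) {
                values.add(row[col]);
                if (!isNumber(row[col])) {
                    numeric = false;
                }
            }
            String type = numeric ? "real" : "{" + String.join(", ", values) + "}";
            out.append("@attribute ").append(names[col].trim())
               .append(' ').append(type).append('\n');
        }
        out.append("\n@data\n");
        for (String[] row : rows) {
            out.append(String.join(",", row)).append('\n');
        }
        return out.toString();
    }

    /** True if the cell can be parsed as a floating-point number. */
    static boolean isNumber(String cell) {
        try {
            Double.parseDouble(cell);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList(
            "outlook,temperature,humidity,windy,play",
            "sunny,85,85,FALSE,no",
            "sunny,80,90,TRUE,no",
            "overcast,83,86,FALSE,yes",
            "rainy,70,96,FALSE,yes");
        System.out.print(convert("weather", csv));
    }
}
```

Because nominal values are collected from the data, a value that never occurs in the file will not appear in the attribute declaration; for real data it is worth checking the generated header by hand.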

Starting WEKA – the GUI

A quick tour of the “explorer”
• Preprocess panel
  – Filters panel
  – Domain info. panel
  – Attributes panel
  – Attribute visualization panel
  – Status bar
  – Log file

A quick tour of the “explorer”
• Classify panel
  – Classifier panel
  – Test options panel
  – Class attribute
  – Result panel
  – Output panel

A quick tour of the “explorer” • Visualize panel

The command line
• example:

    C:\Temp>java weka.classifiers.trees.J48

    Weka exception: No training file and no object input file given.

    General options:

    -t <name of training file>
        Sets training file.
    -T <name of test file>
        Sets test file. If missing, a cross-validation
        will be performed on the training data.
    -c <class index>
        Sets index of class attribute (default: last).
    -x <number of folds>
        Sets number of folds for cross-validation (default: 10).
    -s <random number seed>
        Sets random number seed for cross-validation (default: 1).
    -m <name of file with cost matrix>
        Sets file with cost matrix.
    -l <name of input file>
        Sets model input file.
    -d <name of output file>
        Sets model output file.
    -v
        Outputs no statistics for training data.
    -o
        Outputs statistics only, not the classifier.
    -i
        Outputs detailed information-retrieval
        statistics for each class.
    -k
        Outputs information-theoretic statistics.
    -p
        Only outputs predictions for test instances.
    -r
        Only outputs cumulative margin distribution.
    -z <class name>
        Only outputs the source representation of the classifier,
        giving it the supplied name.
    -g
        Only outputs the graph representation of the classifier.

    Options specific to weka.classifiers.j48.J48:

    -U
        Use unpruned tree.
    -C <pruning confidence>
        Set confidence threshold for pruning.
        (default 0.25)
    -M <minimum number of instances>
        Set minimum number of instances per leaf.
        (default 2)
    -R
        Use reduced error pruning.
    -N <number of folds>
        Set number of folds for reduced error pruning.
        One fold is used as pruning set.
        (default 3)
    -B
        Use binary splits only.
    -S
        Don't perform subtree raising.
    -L
        Do not clean up after the tree has been built.
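To make the `-x` (number of folds) and `-s` (random number seed) options more concrete, here is a rough sketch of what a seeded k-fold split does. It is an illustration in plain Java, not WEKA's own cross-validation code, and the class name `CrossValidationFolds` is hypothetical. Instance indices are shuffled with the seed and dealt into folds; each fold in turn serves as the test set while the remaining folds are used for training.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; this is not how WEKA itself is implemented.
class CrossValidationFolds {

    /** Shuffles indices 0..numInstances-1 with the seed and deals them into folds. */
    static List<List<Integer>> split(int numInstances, int numFolds, long seed) {
        if (numFolds < 2 || numFolds > numInstances) {
            throw new IllegalArgumentException("Need 2 <= folds <= instances");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            indices.add(i);
        }
        // The same seed always produces the same shuffle, so runs are repeatable.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < numFolds; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal the shuffled indices round-robin so fold sizes differ by at most one.
        for (int i = 0; i < indices.size(); i++) {
            folds.get(i % numFolds).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        // 14 instances, 10 folds and seed 1 mirror the defaults listed in the help text.
        List<List<Integer>> folds = split(14, 10, 1L);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("fold " + f + " tests on instances " + folds.get(f));
        }
    }
}
```

Changing the seed changes which instances land in which fold, which is why results of two cross-validation runs are only directly comparable when the same seed is used.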

GUI vs. command line
GUI (+):
• visualisation of data and (some) models
GUI (-):
• not all the parameters can be set (reduced functionality)
Command line (+):
• full functionality (‘saving the model’)
• batch processing
Command line (-):
• only textual visualisation of models
• awkward to use
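As a small illustration of the batch-processing point, the sketch below assembles one J48 command line per data set, using only the `-t`, `-x` and `-d` options from the help text. The class `BatchCommands` and the file names are hypothetical placeholders, and the program only prints the commands rather than running them.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical batch helper; the data set names below are placeholders.
class BatchCommands {

    /** Builds the argument list for training J48 and saving the model. */
    static List<String> j48Command(String trainFile, int folds, String modelFile) {
        return Arrays.asList(
            "java", "weka.classifiers.trees.J48",
            "-t", trainFile,             // training file
            "-x", String.valueOf(folds), // number of cross-validation folds
            "-d", modelFile);            // model output file
    }

    public static void main(String[] args) {
        for (String name : Arrays.asList("weather", "mydata")) {
            List<String> command = j48Command(name + ".arff", 10, name + ".model");
            System.out.println(String.join(" ", command));
        }
    }
}
```

The printed lines could be pasted into a batch file, or each list could be handed to `java.lang.ProcessBuilder`, assuming WEKA is on the classpath.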

PROs & CONs of WEKA
PROs:
• open source (GNU licence)
• platform-independent (Java)
• easy to use
• (relatively) easy to modify
CONs:
• relatively slow (Java)
• ‘incomplete’ documentation (some GUI features could be explained better)
• some features available only from the command line

Let’s go to work
