Datcracker: Open data-mining platform connecting Rseslib and WEKA

Marcin Wojnarski, Warsaw University, Poland

Outline
- Datcracker is …
- Motivation
- What is available in version 0.5
- HOWTO …
- Architecture
- Future releases

Datcracker is…
…an open-source, extensible data-mining platform that provides a common architecture for data-processing algorithms of various types. The algorithms can be combined to build data-processing schemes of high complexity.

Main characteristics
- Extensibility of the algorithm pool through a well-defined API
- Extensibility of the types of data that algorithms operate on
- Stream-based data processing, for efficient handling of large volumes of data and for freedom in designing complex experiments
- Language: Java
- Licence: GPL
- Download: www.datcracker.org

Motivation
- To enable independent research groups to exchange and combine their algorithms
- To simplify implementation of new algorithms

Available in version 0.5
Rseslib algorithms:
- classifiers (~20 algorithms)
Weka algorithms:
- ARFF reader
- classifiers (~60)
- filters (47)
Datcracker algorithms:
- Train&Test evaluation scheme
Data types:
- vectors of numeric and/or symbolic features

HOWTO: Read ARFF file
    Cell arff = new ArffReaderCell();
    arff.set("filename", "data/iris.arff");
    arff.set("labelIndex", "last");
    arff.open();
    System.out.println(arff.next());
    arff.close();
Output:
    [data: [5.1 3.5 1.4 0.2] label: [Iris-setosa]]
    [data: [4.9 3.0 1.4 0.2] label: [Iris-setosa]]

HOWTO: Train classifier (Rseslib)
    Cell learner = new RseslibClassifier("C45");
    learner.set("pruning", "true");
    learner.setSource(arff);
    learner.build();
    learner.setSource(arff_test);
    learner.open();
    System.out.println(learner.next());
    learner.close();

HOWTO: Train classifier (Weka)
    Cell learner = new WekaClassifier("J48");
    learner.set("minNumObj", "2");
    learner.setSource(arff);
    learner.build();

HOWTO: Apply Weka filter
    Cell filter = new WekaFilter("attribute.Remove");
    filter.set("attributeIndices", "3-6");
    filter.setSource(arff);
    filter.open();
    System.out.println(filter.next());
    filter.close();

HOWTO: Set parameters
    arff.set("filename", "data/iris.arff");
    arff.set("labelIndex", "last");
    ...
OR
    Parameters par = new Parameters();
    par.set("filename", "data/iris.arff");
    par.set("labelIndex", "last");
    ...
    arff.setParameters(par);
    par = arff.getParameters();

HOWTO: Train & Test
    Cell learner = new RseslibClassifier("C45");
    learner.set("pruning", "true");
    TrainAndTest tt = new TrainAndTest(learner);
    tt.set("trainPercent", "70");
    tt.set("repetitions", "10");
    tt.setSource(source);
    tt.build();
    System.out.println(tt.report());

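As a rough illustration of what a train-and-test scheme with a training percentage and a number of repetitions might compute, the sketch below shuffles the labels, "trains" on the first part, tests on the rest, and averages the accuracy over the repetitions. It does not use Datcracker: the class HoldoutSketch, its methods, the fixed random seed and the majority-class "classifier" are all hypothetical stand-ins chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical stand-in, not Datcracker code: repeated train-and-test
// (holdout) evaluation of a trivial majority-class classifier.
class HoldoutSketch {

    // Number of samples that go to the training part.
    static int trainSize(int total, int trainPercent) {
        return total * trainPercent / 100;
    }

    // "Training": return the most frequent label in the training part.
    static String majorityLabel(List<String> labels) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String best = null;
        int bestCount = 0;
        for (String label : labels) {
            Integer old = counts.get(label);
            int c = (old == null ? 0 : old) + 1;
            counts.put(label, c);
            if (c > bestCount) { best = label; bestCount = c; }
        }
        return best;
    }

    // Mean test accuracy over `repetitions` random splits.
    static double evaluate(List<String> labels, int trainPercent,
                           int repetitions, long seed) {
        Random rnd = new Random(seed);
        double sum = 0.0;
        for (int r = 0; r < repetitions; r++) {
            List<String> shuffled = new ArrayList<String>(labels);
            Collections.shuffle(shuffled, rnd);
            int n = trainSize(shuffled.size(), trainPercent);
            String predicted = majorityLabel(shuffled.subList(0, n));
            List<String> test = shuffled.subList(n, shuffled.size());
            int correct = 0;
            for (String actual : test) {
                if (actual.equals(predicted)) correct++;
            }
            sum += (double) correct / test.size();
        }
        return sum / repetitions;
    }

    public static void main(String[] args) {
        List<String> labels = new ArrayList<String>();
        for (int i = 0; i < 10; i++) labels.add(i < 7 ? "setosa" : "virginica");
        System.out.println("mean accuracy: " + evaluate(labels, 70, 10, 42L));
    }
}
```

Each repetition draws a fresh random split, so the reported figure is an average over several train/test partitions rather than the result of a single one.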
Data Processing Chain
Cell.setSource(sourceCell)
[Diagram: cells labelled ARFF, New ARFF, Filter 1, Filter 2, Classifier and Another Classifier; the filters are annotated with set("attributeIndices", "0-3") and set("attributeIndices", "5")]

Architecture

Outline
- Cell
  - interfaces
  - state
  - how to override
- Data
- MetaData

Cell
- Main class of the Datcracker architecture
- Base class for all data-processing algorithms:
  - classifiers
  - clusterers
  - filters
  - data loaders
  - data generators
  - …
- Cells can be connected in a Data Processing Chain
- Data transfer between cells has the form of a stream of samples
- A receiving cell may immediately consume incoming samples, so large volumes of data are processed efficiently

Cell’s interface
A cell can be:
- a data source
- a data receiver
- buildable
- parameterized

Cell as a data source
Cell’s interface for data transfer:
    open() : MetaSample    opens communication session
    next() : Sample        retrieves next sample of data
    close()                closes communication session

Cell as a data receiver
Cell’s interface for receiving data:
    setSource(Cell)    sets the source cell

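The interplay of a data source and a data receiver can be sketched as a small pull-based chain. This is not the Datcracker API: SketchCell, ListSource and UpperCaseFilter are hypothetical names, samples are plain strings, open() returns nothing (the slides give it a MetaSample return type), and signalling the end of the stream with null is an assumption made only for this example.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins, not the Datcracker classes: a pull-based chain.
abstract class SketchCell {
    protected SketchCell source;           // upstream cell, may be null

    void setSource(SketchCell source) { this.source = source; }

    abstract void open();                  // start a session
    abstract String next();                // next sample, or null when exhausted
    abstract void close();                 // end the session
}

// A data source that streams samples from an in-memory list.
class ListSource extends SketchCell {
    private final List<String> samples;
    private Iterator<String> it;

    ListSource(List<String> samples) { this.samples = samples; }

    void open() { it = samples.iterator(); }
    String next() { return it.hasNext() ? it.next() : null; }
    void close() { it = null; }
}

// A receiver that transforms each incoming sample as soon as it arrives,
// without buffering the whole data set.
class UpperCaseFilter extends SketchCell {
    void open() { source.open(); }
    String next() {
        String s = source.next();
        return s == null ? null : s.toUpperCase(Locale.ROOT);
    }
    void close() { source.close(); }
}

class ChainSketchDemo {
    public static void main(String[] args) {
        SketchCell src = new ListSource(Arrays.asList("setosa", "virginica"));
        SketchCell filter = new UpperCaseFilter();
        filter.setSource(src);             // connect receiver to its source
        filter.open();
        for (String s = filter.next(); s != null; s = filter.next()) {
            System.out.println(s);
        }
        filter.close();
    }
}
```

Because the filter asks its source for one sample at a time, only the sample currently being processed has to be held in memory.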
Buildable cells
Some cells may be buildable: they have to be built before use.
Building a cell is implemented by subclasses and may mean different things:
- training a decision system
- running an evaluation scheme (T&T, CV, …)
- buffering input data
- …
Cell’s interface for building:
    build()    builds the cell
    erase()    erases the cell; it can be built again afterwards

Fixed cells
Cells that are not buildable are called fixed. They are usable right after construction or after setting parameters:
- file reader
- WEKA filter
- …

Parameterized cells
Cell’s interface for parameterization:
    set(String name, String value)    sets a parameter
    setParameters(Parameters)         sets all parameters at once
    getParameters() : Parameters      returns all parameters that are set

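One possible way to back such an interface is a map of string-valued parameters. The sketch below is an illustration only: SketchParameters and ParameterizedSketchCell are hypothetical names, and the decision to store and return copies of the parameter set is an assumption, not something the slides state about Datcracker.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a set of string-valued parameters.
class SketchParameters {
    private final Map<String, String> values = new LinkedHashMap<String, String>();

    SketchParameters() {}

    // Copy constructor, so that parameter sets can be handed over safely.
    SketchParameters(SketchParameters other) { values.putAll(other.values); }

    void set(String name, String value) { values.put(name, value); }
    String get(String name) { return values.get(name); }
}

// Hypothetical cell supporting both styles of setting parameters.
class ParameterizedSketchCell {
    private SketchParameters params = new SketchParameters();

    // Sets a single parameter.
    void set(String name, String value) { params.set(name, value); }

    // Replaces all parameters at once (assumption: the cell keeps its own copy).
    void setParameters(SketchParameters p) { params = new SketchParameters(p); }

    // Returns the parameters that are set (assumption: as a copy).
    SketchParameters getParameters() { return new SketchParameters(params); }
}

class ParametersSketchDemo {
    public static void main(String[] args) {
        ParameterizedSketchCell cell = new ParameterizedSketchCell();
        cell.set("filename", "data/iris.arff");
        SketchParameters par = cell.getParameters();
        par.set("labelIndex", "last");
        cell.setParameters(par);
        System.out.println(cell.getParameters().get("labelIndex"));
    }
}
```

Copying on the way in and out means that a caller who keeps modifying its own parameter object cannot silently change a cell that has already been configured.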
State of the cell
    EMPTY     cell has no content, cannot be used
    CLOSED    content has been built, cell is ready to use
    OPEN      cell is being used now (generating samples of data)
[State diagram: build() EMPTY -> CLOSED; erase() CLOSED -> EMPTY; open() CLOSED -> OPEN; next() OPEN -> OPEN; close() OPEN -> CLOSED]

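The three states can be read as a small state machine. The sketch below assumes the transitions build(): EMPTY to CLOSED, erase(): CLOSED to EMPTY, open(): CLOSED to OPEN, next(): stays OPEN, close(): OPEN to CLOSED, which is one reading of the state descriptions. SketchState and StateSketch are hypothetical names, and calls are identified by plain strings only to keep the example short.

```java
// Hypothetical sketch of the three cell states and the calls allowed in each.
enum SketchState { EMPTY, CLOSED, OPEN }

class StateSketch {

    // Returns the state after `call`, or throws if the call is not
    // allowed in state `from`.
    static SketchState transition(SketchState from, String call) {
        switch (from) {
            case EMPTY:
                if (call.equals("build")) return SketchState.CLOSED;
                break;
            case CLOSED:
                if (call.equals("open")) return SketchState.OPEN;
                if (call.equals("erase")) return SketchState.EMPTY;
                break;
            case OPEN:
                if (call.equals("next")) return SketchState.OPEN;
                if (call.equals("close")) return SketchState.CLOSED;
                break;
        }
        throw new IllegalStateException(call + "() not allowed in state " + from);
    }

    public static void main(String[] args) {
        SketchState s = SketchState.EMPTY;
        String[] calls = { "build", "open", "next", "next", "close", "erase" };
        for (String call : calls) {
            s = transition(s, call);
            System.out.println(call + "() -> " + s);
        }
    }
}
```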
…motivation
- To check against access violations when the cell is accessed. Examples:
  - two cells try to retrieve data from a given cell at the same time
  - someone tries to use an empty cell
  - someone tries to reconnect cells during their activity
- To simplify implementation of subclasses (new algorithms): they may safely assume that access is correct (build() before open(), open() before next(), …)
- To detect bugs early, which is important in a heterogeneous system!

How to override Cell
Methods to override:
- onBuild()
- onErase()
- onOpen()
- onNext()
- onClose()
The public methods build(), … can’t be overridden. They perform state checking and then call the corresponding on…() method, like event handlers in event-driven programming.
You do not have to override all of them (e.g. a cell for reading data will not be buildable).
You can provide additional interface in your subclass.

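The combination of non-overridable public methods and overridable on…() hooks is the classic template-method pattern. The sketch below shows the idea under simplifying assumptions: HookedCell and CountingCell are hypothetical classes, every cell is treated as buildable, erase() is left out, and the state is tracked with two booleans rather than whatever Datcracker uses internally.

```java
// Hypothetical sketch of "final public method + overridable hook".
abstract class HookedCell {
    private boolean built = false;
    private boolean open = false;

    // Public methods are final: they check the state, then delegate.
    public final void build() {
        if (built) throw new IllegalStateException("already built");
        onBuild();
        built = true;
    }

    public final void open() {
        if (!built || open) throw new IllegalStateException("cannot open now");
        onOpen();
        open = true;
    }

    public final Object next() {
        if (!open) throw new IllegalStateException("cell is not open");
        return onNext();
    }

    public final void close() {
        if (!open) throw new IllegalStateException("cell is not open");
        onClose();
        open = false;
    }

    // Hooks with empty defaults, so subclasses override only what they need.
    protected void onBuild() {}
    protected void onOpen() {}
    protected Object onNext() { return null; }
    protected void onClose() {}
}

// A subclass that may assume onNext() is only ever called on an open cell.
class CountingCell extends HookedCell {
    private int counter;

    @Override protected void onOpen() { counter = 0; }
    @Override protected Object onNext() { return Integer.valueOf(counter++); }
}

class HookSketchDemo {
    public static void main(String[] args) {
        CountingCell cell = new CountingCell();
        cell.build();
        cell.open();
        System.out.println(cell.next());
        System.out.println(cell.next());
        cell.close();
    }
}
```

The subclass contains no state checks at all; misuse such as calling next() on a closed cell is rejected by the base class before any hook runs.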
Data representation
A data set is split into samples.
Sample:
    data : Data     input data
    label : Data    associated decision label
Separation of data and label:
- useful for complex types of data/labels, e.g. in image processing (like segmentation)
- useful for meta-learning algorithms, which operate on labels alone
- labelled / unlabelled / partially labelled samples are handled in the same way
Data: abstract base class, downcast by cells to the type they expect.
Currently available subclasses:
- NumericFeature, SymbolicFeature, DataVector
In the future: time series, images, special types of labels, ...

Immutability
Data objects are immutable: they cannot be modified after creation (like the String class).
They can be freely shared among cells without risk of accidental modification:
- safety
- simplicity
- efficiency:
  - no need to copy data between cells
  - no need for synchronization in multi-threaded execution

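A minimal sketch of what immutability can look like for a vector of numeric features is given below. ImmutableVectorSketch is a hypothetical class and not the Datcracker DataVector; it only illustrates the usual ingredients: final fields, a defensive copy on construction, no setters, and "modification" that returns a new object.

```java
import java.util.Arrays;

// Hypothetical immutable vector of numeric features.
final class ImmutableVectorSketch {
    private final double[] values;

    ImmutableVectorSketch(double[] values) {
        // Defensive copy: later changes to the caller's array have no effect.
        this.values = values.clone();
    }

    int size() { return values.length; }
    double get(int i) { return values[i]; }

    // "Modification" returns a new object; the original stays untouched.
    ImmutableVectorSketch with(int i, double v) {
        double[] copy = values.clone();
        copy[i] = v;
        return new ImmutableVectorSketch(copy);
    }

    @Override public String toString() { return Arrays.toString(values); }
}

class ImmutabilitySketchDemo {
    public static void main(String[] args) {
        double[] raw = { 5.1, 3.5, 1.4, 0.2 };
        ImmutableVectorSketch v = new ImmutableVectorSketch(raw);
        raw[0] = -1.0;                       // does not affect v
        ImmutableVectorSketch w = v.with(1, 9.9);
        System.out.println(v);
        System.out.println(w);
    }
}
```

Since no method can change an existing instance, the same object can be handed to several cells, or to several threads, without copying or locking.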
Metadata
Many algorithms have to know the "type" of input data in advance, before processing of data starts: hence metadata.
Separation of data and metadata: base class MetaData.
Describes common properties of all Data objects generated in a given session:
- number and types of features in a DataVector
- dictionary of possible values of a SymbolicFeature
- …
Each Data subclass has an associated MetaData subclass.
Immutable!

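Why a receiver benefits from metadata arriving before the data can be shown with a small example. In the sketch below, a label counter allocates its counters from the dictionary of a symbolic feature before it has seen a single sample. SymbolicMetaSketch and LabelCounterSketch are hypothetical classes, not part of Datcracker, and labels are plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable metadata for one symbolic feature:
// the dictionary of its possible values.
final class SymbolicMetaSketch {
    private final List<String> dictionary;

    SymbolicMetaSketch(List<String> dictionary) {
        this.dictionary =
            Collections.unmodifiableList(new ArrayList<String>(dictionary));
    }

    int size() { return dictionary.size(); }
    int indexOf(String value) { return dictionary.indexOf(value); }
}

// A receiver that sizes its counters from metadata before any sample arrives.
class LabelCounterSketch {
    private final SymbolicMetaSketch meta;
    private final int[] counts;

    LabelCounterSketch(SymbolicMetaSketch meta) {
        this.meta = meta;
        this.counts = new int[meta.size()];
    }

    // Consumes one label from the stream.
    void consume(String label) {
        int i = meta.indexOf(label);
        if (i < 0) throw new IllegalArgumentException("not in dictionary: " + label);
        counts[i]++;
    }

    // Number of times a dictionary value has been seen so far.
    int count(String label) {
        int i = meta.indexOf(label);
        if (i < 0) throw new IllegalArgumentException("not in dictionary: " + label);
        return counts[i];
    }
}

class MetaSketchDemo {
    public static void main(String[] args) {
        SymbolicMetaSketch meta =
            new SymbolicMetaSketch(Arrays.asList("Iris-setosa", "Iris-virginica"));
        LabelCounterSketch counter = new LabelCounterSketch(meta);
        counter.consume("Iris-setosa");
        counter.consume("Iris-setosa");
        System.out.println(counter.count("Iris-setosa"));
    }
}
```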
Future releases
Architecture:
- Multi-input and multi-output cells
- Composite cells (e.g. meta-learning)
- Serialization and copying
- Progress info and suspension of cell building
Algorithms:
- cross-validation
- data buffering
- …
Data types:
- time series
- …

Home: www.datcracker.org