Data Preparation for Data Mining: Chapter 4, Basic Preparation

Markku Roiha, Global Technology Platform, Datex-Ohmeda Group, Instrumentarium Corp.

Datex-Ohmeda

In Short: What Basic Preparation Is About
- Finding data for mining
- Creating an understanding of the quality of the data
- Manipulating the data to remove inaccuracies and redundancies
- Making a row-column (text) file of the data

Assessing Data: the "Data Assay"
- Data Discovery: discovering and locating the data to be used; coping with the bureaucrats and data hiders.
- Data Characterization: what is the data that was found? Does it contain the material needed, or is it mere garbage?
- Data Set Assembly: making an (ASCII) table file out of the data coming from different sources.

Outcome of the Data Assay
- Detailed knowledge, in the form of a report on the quality, problems, shortcomings, and suitability of the data for mining.
- Tacit knowledge of the database: the miner gains a perception of the data's suitability.

Data Discovery
- The input to data mining is a row-column text file.
- The original sources of the data may be various databases, flat measurement data files, binary data files, etc.
- Data access issues:
  - Overcoming accessibility challenges such as legal issues, cross-departmental access limitations, and company politics.
  - Overcoming technical challenges such as data format incompatibilities, data storage media, database architecture incompatibilities, and measurement concurrency issues.
  - Internal versus external sources and the cost of data.

Data Characterization
- Characterize the nature of the data sources.
- Study the nature of the variables and their usefulness for modelling.
- Look at frequency distributions and cross-tabulations (see the sketch below).
- Avoid "Garbage In".
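
A minimal sketch of this first look using pandas; the file name, the separator, and the gender/region column names are assumptions for illustration only:

```python
import pandas as pd

# Load the assembled row-column text file (file name and separator assumed).
df = pd.read_csv("credit.txt", sep="\t")

# Frequency distribution of each variable: reveals constants, sparse codes,
# and unexpected values such as a "B" in a gender field.
for col in df.columns:
    print(df[col].value_counts(dropna=False).head(10), "\n")

# Cross-tabulation of two categorical variables shows whether an expected
# relationship is present at all (column names are hypothetical).
print(pd.crosstab(df["gender"], df["region"]))
```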

Characterization: Granularity
- Variables fall on a continuum from very detailed to very aggregated.
- A sum implies aggregation, and so does a mean value.
- General rule: detailed data is preferred over aggregated data for mining.
- The level of aggregation determines the accuracy of the model.
- Use input at least one level of aggregation finer than the required output: if the model must output weekly variance, use daily measurements for modelling (see the sketch below).
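
A small illustration of the aggregation rule with pandas, using simulated daily measurements (the data and the weekly target are assumptions):

```python
import numpy as np
import pandas as pd

# Simulated daily measurements indexed by date.
days = pd.date_range("2024-01-01", periods=90, freq="D")
daily = pd.Series(np.random.default_rng(0).normal(100, 5, len(days)), index=days)

# Weekly variance (the required model output) can always be derived from
# the daily data...
weekly_var = daily.resample("W").var()

# ...but a series stored only as weekly aggregates cannot be disaggregated
# back to days, so the finer granularity is worth keeping.
weekly_sum = daily.resample("W").sum()
print(weekly_var.head())
print(weekly_sum.head())
```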

Characterization: Consistency
- Undiscovered inconsistency in a data stream leads to a Garbage Out model.
- If the car model is stored as M-B, Mercedes, M-Benz, or Mersu, it is impossible to detect cross-relations between a person's characteristics and the model of car owned (see the normalization sketch below).
- The labelling of variables depends on the system that produces the variable data.
- "Employee" means different things to the HR department's system and to the payroll system when contractors are present. So, how many employees do we have?
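
One simple way to repair such label inconsistency, sketched with pandas; the synonym map is of course an assumption built from domain knowledge:

```python
import pandas as pd

# The same make of car recorded under several spellings (values from the slide).
cars = pd.Series(["M-B", "Mercedes", "M-Benz", "Mersu", "Toyota"])

# A hand-built synonym map enforces a single canonical label; anything not
# listed in the map is left untouched.
canonical = {"M-B": "Mercedes", "M-Benz": "Mercedes", "Mersu": "Mercedes"}
cars_clean = cars.replace(canonical)
print(cars_clean.value_counts())
```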

Characterization: Pollution
- Data is polluted if the variable label does not reveal the meaning of the variable's data.
- Typical sources of pollution:
  - Misuse of a record field, e.g. "B" to signify "Business" in the gender field of credit card holders. How do you then do statistical analysis based on gender? (See the sketch below.)
  - Unsuccessful data transfer: fields misinterpreted while copying (e.g. a stray comma).
  - Human resistance: the car sales example, work time reports.
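
A sketch of how the gender-field pollution might be detected and repaired; the tiny data set and the repair strategy (splitting the business flag into its own column) are assumptions for illustration:

```python
import pandas as pd

# Gender field of credit card holders; "B" marks business cards (field misuse).
gender = pd.Series(["M", "F", "F", "B", "M", "B"])

# A frequency count exposes values that do not belong to the documented domain.
print(gender.value_counts())

# One possible repair: move the business flag into its own column and blank
# out the polluted entries so gender-based analysis is not distorted.
df = gender.to_frame("gender")
df["is_business"] = df["gender"].eq("B")
df.loc[df["is_business"], "gender"] = pd.NA
print(df)
```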

Characterization: Objects
- The precise nature of the object being measured needs to be known (recall the employee example).
- The data miner needs to understand why the information was captured in the first place.
- The original perspective may color the data.

Characterization: Relationships
- Data mining needs a row-column text file for input; this file is created from multiple data streams.
- Data streams may be difficult to merge.
- There must be some sort of key that is common to each stream.
- Example: different customer ID values in different databases.
- The key may be inconsistent, polluted, or hard to get access to; there may be duplicates, etc. (see the merge sketch below).
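
A merge sketch in pandas that makes key problems visible before the streams are combined; the two toy streams and the cust_id key are assumptions:

```python
import pandas as pd

# Two hypothetical data streams that must be merged on a shared customer key.
customers = pd.DataFrame({"cust_id": [1, 2, 3], "age": [34, 51, 28]})
orders = pd.DataFrame({"cust_id": [1, 1, 2, 4],
                       "amount": [120.0, 80.0, 45.0, 300.0]})

# Inspect the key first: duplicates and unmatched values are exactly the
# key problems the slide warns about.
print("duplicate keys in orders:", orders["cust_id"].duplicated().sum())
print("order keys missing from customers:",
      set(orders["cust_id"]) - set(customers["cust_id"]))

# An outer merge with an indicator column keeps the mismatches visible.
merged = customers.merge(orders, on="cust_id", how="outer", indicator=True)
print(merged)
```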

Characterization: Domain
- Variable values must lie within the permissible range of values.
- Summary statistics and frequency counts reveal out-of-bounds values.
- Conditional domains:
  - A diagnosis bound to gender.
  - Business rules, such as fraud investigation for claims over $1k (see the sketch below).
- Automated tools exist for finding unknown business rules, e.g. WizRule on the CD-ROM that accompanies the book.
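
A sketch of simple domain and conditional-domain checks; the claims table, the bounds, and the gender-bound diagnosis are illustrative assumptions:

```python
import pandas as pd

claims = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "diagnosis": ["fracture", "prenatal", "flu", "prenatal"],
    "claim_usd": [500, 1500, 90, 2500],
})

# Plain domain check: summary statistics plus an out-of-bounds count.
print(claims["claim_usd"].describe())
print("out of bounds:", (~claims["claim_usd"].between(0, 100_000)).sum())

# Conditional domain: a diagnosis that is only valid for one gender.
bad = claims[(claims["diagnosis"] == "prenatal") & (claims["gender"] == "M")]
print("gender/diagnosis violations:\n", bad)

# Business rule from the slide: claims over $1k go to fraud investigation.
print("flag for investigation:\n", claims[claims["claim_usd"] > 1000])
```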

Characterization: Defaults
- Default values in the data may cause problems.
- Conditional defaults that depend on other entries may create fake patterns, when really it is a question of a lack of data (see the sketch below).
- These may be useful patterns, but they are often of limited use.
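
A sketch of one way to keep defaults from masquerading as patterns, assuming the sentinel values 0 and 9999 stand in for "no data":

```python
import numpy as np
import pandas as pd

# A field where 0 and 9999 are system defaults rather than real measurements
# (the sentinel values are assumptions for illustration).
home_value = pd.Series([250_000, 0, 180_000, 9999, 0])

# Replacing the defaults with NaN lets the modelling tool see "no data"
# instead of a fake cluster sitting at the default value.
cleaned = home_value.replace({0: np.nan, 9999: np.nan})
print(cleaned)
print("share of defaults:", cleaned.isna().mean())
```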

Characterization: Integrity
- Check the possible/permitted relationships between variables: many cars perhaps, but only one spouse (except in Utah).
- Acceptable range: an outlier may actually be the data we are looking for.
- Fraud often looks like outlying data, because the majority of claims are not fraudulent (see the sketch below).
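
A sketch of an inter-variable integrity check and a simple outlier screen; the toy policy table and the IQR rule are assumptions, and as the slide notes the outliers may be exactly what we want to keep:

```python
import pandas as pd

policy = pd.DataFrame({
    "n_cars": [1, 2, 3, 1],
    "n_spouses": [1, 1, 2, 0],      # 2 spouses violates the integrity rule
    "claim_usd": [400, 380, 420, 25_000],
})

# Integrity check between variables: the permitted relationship is explicit.
print("integrity violations:\n", policy[policy["n_spouses"] > 1])

# Outlier screen via the interquartile range; flag, do not silently drop,
# because the outlying claim may be the fraud we are mining for.
q1, q3 = policy["claim_usd"].quantile([0.25, 0.75])
iqr = q3 - q1
print("outlying claims:\n", policy[policy["claim_usd"] > q3 + 1.5 * iqr])
```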

Characterization: Concurrency
- Data may have been captured in different epochs.
- Such streams may not be comparable at all.
- Example: last year's tax report and current income/possessions may not match.

Characterization: Duplicates/Redundancies
- Different data streams may carry redundant data; even a single source may have redundancies,
  - like date of birth and age, or
  - price_per_unit, number_purchased, and total_price.
- Removing redundancies may increase modelling speed.
- Some algorithms may crash if two variables are identical.
- Tip: if two variables are almost colinear, use their difference (see the sketch below).
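
A sketch of spotting near-colinear variables and applying the slide's difference tip; the two simulated sensor readings are an assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Two nearly colinear measurements of the same quantity (simulated).
sensor_a = rng.normal(100, 10, n)
sensor_b = sensor_a + rng.normal(0, 0.5, n)
df = pd.DataFrame({"sensor_a": sensor_a, "sensor_b": sensor_b})

# The correlation matrix exposes the redundancy.
print(df.corr().round(3))

# Tip from the slide: rather than feeding both raw variables (which some
# algorithms handle badly), keep one of them plus their difference.
df["a_minus_b"] = df["sensor_a"] - df["sensor_b"]
df = df.drop(columns=["sensor_b"])
print(df.head())
```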

Data Set Assembly
- The data is assembled from the different data streams into a row-column text file.
- Data assessment then continues from this file.

Data Set Assembly: Reverse Pivoting
- Feature extraction by sorting the transaction data by one key and deriving new fields.
- E.g. from transaction data to a customer profile (see the sketch below).
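
A minimal reverse-pivot sketch with pandas; the transaction table and the derived profile fields are illustrative assumptions:

```python
import pandas as pd

# Hypothetical transaction stream: one row per purchase.
tx = pd.DataFrame({
    "cust_id": [1, 1, 2, 2, 2, 3],
    "category": ["food", "fuel", "food", "food", "toys", "fuel"],
    "amount": [20.0, 50.0, 15.0, 30.0, 60.0, 45.0],
})

# Reverse pivot: group by the customer key and derive one row per customer
# with new fields summarising that customer's transactions.
profile = tx.groupby("cust_id").agg(
    n_tx=("amount", "size"),
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
)

# Spend per category becomes one derived column per category.
per_category = tx.pivot_table(index="cust_id", columns="category",
                              values="amount", aggfunc="sum", fill_value=0)
print(profile.join(per_category))
```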

Data Set Assembly: Feature Extraction
- The choice of variables to extract determines how the data is presented to the data mining tool.
- The miner must judge which features are predictive; that choice cannot be automated, but the actual extraction of the features can.
- The reverse pivot is not the only way to extract features: source variables may be replaced by derived variables.
- Physical models are flat most of the time; take only the sequences where there are rapid changes (see the sketch below).
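
A sketch of the "keep only the rapid changes" idea for a physical signal; the simulated signal and the rate threshold are assumptions that would come from domain knowledge in practice:

```python
import numpy as np
import pandas as pd

# A mostly flat physical signal with one injected rapid change (simulated).
rng = np.random.default_rng(2)
t = pd.date_range("2024-01-01", periods=1000, freq="s")
signal = np.cumsum(rng.normal(0, 0.01, len(t)))
signal[400:420] += np.linspace(0, 5, 20)

s = pd.Series(signal, index=t)

# Derived variable: the absolute rate of change. Keeping only the samples
# where it exceeds a threshold extracts the interesting sequences.
rate = s.diff().abs()
interesting = s[rate > 0.1]
print(f"kept {len(interesting)} of {len(s)} samples")
```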

Data Set Assembly: Explanatory Structure
- The data miner needs to have an idea of how the data set can address the problem area; this is called the explanatory structure of the data set.
- It explains how the variables are expected to relate to each other, and how the data set relates to solving the problem.
- Sanity check, the last phase of the data assay: checking that the explanatory structure actually holds as expected. Many tools can be used, such as OLAP.

Data Set Assembly: Enhancement/Enrichment
- The assembled data set may not be sufficient.
- Data set enrichment: adding external data to the data set.
- Data enhancement: embellishing or expanding the data set without external data, e.g.
  - feature extraction,
  - adding bias (removing non-responders from the data set),
  - data multiplication (generating rare events by adding some noise; see the sketch below).
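
A sketch of the data-multiplication idea: replicate the rare rows and jitter them with noise so the modelling tool sees more positive examples. The feature names, the event rate, and the noise scale are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.normal(0, 1, 1000),
                   "x2": rng.normal(0, 1, 1000)})
df["rare_event"] = (rng.random(1000) < 0.01).astype(int)   # ~1% positives

# Data multiplication: copy the rare rows and add a little noise to each copy.
rare = df[df["rare_event"] == 1]
copies = pd.concat([rare] * 20, ignore_index=True)
copies.loc[:, ["x1", "x2"]] += rng.normal(0, 0.05, size=(len(copies), 2))

enhanced = pd.concat([df, copies], ignore_index=True)
print("rare-event share before:", df["rare_event"].mean())
print("rare-event share after: ", enhanced["rare_event"].mean())
```

Note that this deliberately introduces bias, which the miner must remember and account for, as the sampling-bias slide below points out.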

Data Set Assembly: Sampling Bias
- Undetected sampling bias may ruin the model.
- US Census: it cannot find the poorest segment of society (no home, no address).
- Telephone polls: respondents have to own a telephone and have to be willing to share opinions over the phone.
- At this phase, the end of the data assay, the miner needs to recognize any possible bias and explain it.

Example 1: CREDIT
- Study of the data source report:
  - to find out the integrity of the variables,
  - to find out the expected relationships between variables for the integrity assessment.
- Tools for single-variable integrity study: the status report for the credit file and the complete content report; these lead to removing some variables.
- Tools for cross-correlation analysis: KnowledgeSeeker chi-square analysis, checking that the expected relationships are there.

Example 1: CREDIT: Single-Variable Status
- Conclusions (see the status-report sketch below):
  - BEACON_C: lin > 0.98
  - CRITERIA: constant
  - EQBAL: empty; distinct values?
  - DOB month: sparse, 14 values?
  - HOME VALUE: min 0.0? Rent/own?
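
A rough sketch of a single-variable status report of the kind these conclusions are drawn from; the metrics chosen here (missing fraction, distinct count, constant flag, min/max) are an assumption and only approximate the report used in the book, and `credit_df` is a hypothetical data frame:

```python
import pandas as pd

def variable_status(df: pd.DataFrame) -> pd.DataFrame:
    """Rough single-variable status report: one row per variable."""
    rows = []
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "variable": col,
            "missing_frac": s.isna().mean(),
            "distinct": s.nunique(dropna=True),
            "constant": s.nunique(dropna=True) <= 1,
            "min": s.min() if numeric else None,
            "max": s.max() if numeric else None,
        })
    return pd.DataFrame(rows)

# e.g. variable_status(credit_df) would flag a constant CRITERIA, an empty
# EQBAL, and a HOME VALUE minimum of 0.0 for closer inspection.
```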

Example 1: Relationships
- Chi-square analysis (see the sketch below):
  - AGE_INFERR is expected to correlate with DOB_YEAR. It does, so the data seems OK. Do we need both? Remove one of them?
  - HOME_ED correlates with PRCNT_PROF. It does, so the data seems OK.
- On bias: bias can be introduced deliberately, e.g. to increase the number of child-bearing families when studying the marketing of child-related products.
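
A minimal chi-square cross-tabulation check in the spirit of the KnowledgeSeeker analysis; the tiny example table is an assumption, and scipy is used here in place of the tool named on the slide:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical check that AGE_INFERR really does track DOB_YEAR.
df = pd.DataFrame({
    "AGE_INFERR": ["30-40", "30-40", "40-50", "40-50", "50-60", "50-60"],
    "DOB_YEAR":   ["1985", "1988", "1975", "1978", "1965", "1968"],
})
table = pd.crosstab(df["AGE_INFERR"], df["DOB_YEAR"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.3f}")

# A strong association confirms the expected relationship; if both variables
# carry the same information, one of them can be dropped.
```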

Example 2: Shoe
- What is interesting here: using WizRule to find probable hidden rules in the data set.

Data Assay
- Assessment of the quality of the data for mining; it leads to the assembly of the data sources into one file.
- How to get the data, and does it suit the purpose?
- Main goal: the miner understands where the data comes from, what is in it, and what remains to be done.
- It is helpful to write a report on the state of the data.
- The assay involves the miner directly, rather than relying on automated tools; after the assay, the rest can be carried out with tools.