Data Preparation Part 1 Exploratory Data Analysis Data

  • Slides: 48
Download presentation
Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data CAS Predictive

Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data CAS Predictive Modeling Seminar Louise Francis Analytics and Actuarial Data Mining, Inc. www. data-mines. com Louise. francis@data-mines. cm 1

Objectives n Introduce data preparation and where it fits in in modeling process n

Objectives n Introduce data preparation and where it fits in in modeling process n Discuss Data Quality n Focus on a key part of data preparation n Exploratory data analysis n n n Identify data glitches and errors Understanding the data Identify possible transformations What to do about missing data n Provide resources on data preparation n 2

CRISP-DM n Guidelines for data mining projects n Gives overview of life cycle of

CRISP-DM n Guidelines for data mining projects n Gives overview of life cycle of data mining project n Defines different phases and activities that take place in phase 3

Modelling Process 4

Modelling Process 4

Data Preprocessing 5

Data Preprocessing 5

Data Quality Problem 6

Data Quality Problem 6

Data Quality: A Problem n Actuary reviewing a database 7

Data Quality: A Problem n Actuary reviewing a database 7

May’s Law 8

May’s Law 8

It’s Not Just Us n “In just about any organization, the state of information

It’s Not Just Us n “In just about any organization, the state of information quality is at the same low level” n Olson, Data Quality 9

Some Consequences of poor data quality n Affects quality (precision) of result n Can’t

Some Consequences of poor data quality n Affects quality (precision) of result n Can’t do modeling project because of data problems n If errors not found – modeling blunder 10

Data Exploration in Predictive Modeling 11

Data Exploration in Predictive Modeling 11

Exploratory Data Analysis n Typically the first step in analyzing data n Makes heavy

Exploratory Data Analysis n Typically the first step in analyzing data n Makes heavy use of graphical techniques n Also makes use of simple descriptive statistics n Purpose Find outliers (and errors) n Explore structure of the data n 12

Definition of EDA Exploratory data analysis (EDA) is that part of statistical practice concerned

Definition of EDA Exploratory data analysis (EDA) is that part of statistical practice concerned with reviewing, communicating and using data where there is a low level of knowledge about its cause system. . Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking. - www. wikipedia. org 13

Example Data n Private passenger auto n Some variables are: n n n n

Example Data n Private passenger auto n Some variables are: n n n n Age Gender Marital status Zip code Earned premium Number of claims Incurred losses Paid losses 14

Some Methods for Numeric Data n Visual n Histograms n Box and Whisker Plots

Some Methods for Numeric Data n Visual n Histograms n Box and Whisker Plots n Stem and Leaf Plots n Statistical n Descriptive statistics n Data spheres 15

Histograms n Can do them in Microsoft Excel 16

Histograms n Can do them in Microsoft Excel 16

Histograms Frequencies for Age Variable 17

Histograms Frequencies for Age Variable 17

Histograms of Age Variable Varying Window Size 18

Histograms of Age Variable Varying Window Size 18

Formula for Window Width 19

Formula for Window Width 19

Example of Suspicious Value 20

Example of Suspicious Value 20

Discrete-Numeric Data 21

Discrete-Numeric Data 21

Filtered Data Filter out Unwanted Records 22

Filtered Data Filter out Unwanted Records 22

Box Plot Basics: Five – Point Summary n Minimum n 1 st quartile n

Box Plot Basics: Five – Point Summary n Minimum n 1 st quartile n Median n 2 nd quartile n Maximum 23

Functions for five point summary n =min(data range) n =quartile(data range 1) n =median(data

Functions for five point summary n =min(data range) n =quartile(data range 1) n =median(data range) n =quartile(data range, 3) n =max(data range) 24

Box and Whisker Plot 25

Box and Whisker Plot 25

Plot of Heavy Tailed Data Paid Losses 26

Plot of Heavy Tailed Data Paid Losses 26

Heavy Tailed Data – Log Scale 27

Heavy Tailed Data – Log Scale 27

Box and Whisker Example 28

Box and Whisker Example 28

Descriptive Statistics Analysis Tool. Pak 29

Descriptive Statistics Analysis Tool. Pak 29

Descriptive Statistics n Claimant age has minimum and maximums that are impossible 30

Descriptive Statistics n Claimant age has minimum and maximums that are impossible 30

Data Spheres: The Mahalanobis Distance Statistic 31

Data Spheres: The Mahalanobis Distance Statistic 31

Screening Many Variables at Once n Plot of Longitude and Latitude of zip codes

Screening Many Variables at Once n Plot of Longitude and Latitude of zip codes in data n Examination of outliers indicated drivers in Ca and PR even though policies only in one mid-Atlantic state 32

Records With Unusual Values Flagged 33

Records With Unusual Values Flagged 33

Categorical Data: Data Cubes 34

Categorical Data: Data Cubes 34

Categorical Data n Data Cubes n Usually frequency tables n Search for missing values

Categorical Data n Data Cubes n Usually frequency tables n Search for missing values coded as blanks 35

Categorical Data n Table highlights inconsistent coding of marital status 36

Categorical Data n Table highlights inconsistent coding of marital status 36

Missing Data 37

Missing Data 37

Screening for Missing Data 38

Screening for Missing Data 38

Blanks as Missing 39

Blanks as Missing 39

Types of Missing Values n Missing completely at random n Missing at random n

Types of Missing Values n Missing completely at random n Missing at random n Informative missing 40

Methods for Missing Values n Drop record if any variable used in model is

Methods for Missing Values n Drop record if any variable used in model is missing n Drop variable n Data Imputation n Other n n CART, MARS use surrogate variables Expectation Maximization 41

Imputation n A method to “fill in” missing value n Use other variables (which

Imputation n A method to “fill in” missing value n Use other variables (which have values) to predict value on missing variable n Involves building a model for variable with missing value n Y = f(x 1, x 2, …xn) 42

Example: Age Variable n About 14% of records missing values n Imputation will be

Example: Age Variable n About 14% of records missing values n Imputation will be illustrated with simple regression model n Age = a+b 1 X 1+b 2 X 2…bn. Xn 43

Model for Age 44

Model for Age 44

Missing Values n A problem for many traditional statistical models n Elimination of records

Missing Values n A problem for many traditional statistical models n Elimination of records missing on anything from analysis n Many data mining procedures have techniques built in for handling missing values n If too many records missing on a given variable, probably need to discard variable 45

Metadata 46

Metadata 46

Metadata n Data about data n A reference that can be used in future

Metadata n Data about data n A reference that can be used in future modeling projects n Detailed description of the variables in the file, their meaning and permissible values 47

Library for Getting Started n Dasu and Johnson, Exploratory Data Mining and Data Cleaning,

Library for Getting Started n Dasu and Johnson, Exploratory Data Mining and Data Cleaning, Wiley, 2003 n Francis, L. A. , “Dancing with Dirty Data: Methods for Exploring and Claeaning Data”, CAS Winter Forum, March 2005, www. casact. org n Find a comprehensive book for doing analysis in Excel such as: John Walkebach, Excel 2003 Formulas or Jospeh Schmuller, Statistical Analysis With Excel for Dummies n If you use R, get a book like: Fox, John, An R and S-PLUS Companion to Applied Regression, Sage Publications, 2002 48