Preprocessing Part 1 Preprocessing create a new data

  • Slides: 35
Download presentation
Pre-processing – Part 1

Pre-processing – Part 1

Pre-processing – create a new data set with the dependent and candidate independent variables

Pre-processing – create a new data set with the dependent and candidate independent variables to be used in the analysis.

Data wrangling, data munging

Data wrangling, data munging

From Wikipedia Data wrangling, sometimes referred to as data munging, is the process of

From Wikipedia Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

5 Analysis Time Line Allotted Time Projected: Actual: Dreaded: (Data Acquisition) Needed: Data Preparation

5 Analysis Time Line Allotted Time Projected: Actual: Dreaded: (Data Acquisition) Needed: Data Preparation Data Analysis

Part 1: Make a direct copy of data. Examine contents. Transform all character variables

Part 1: Make a direct copy of data. Examine contents. Transform all character variables to numeric. Modify or transform variables based on subject matter knowledge. Drop variables no longer needed.

Create a copy of the data set as a first step in creating an

Create a copy of the data set as a first step in creating an analytic file. /* create a new data set chd 2018_a This will eventually be the file used in analysis */ libname a "d: dropboxchd 2018_data"; data a. chd 2018_a; set s 5238. chd 2018; run;

What is on the data set? What is the dependent (target) variable? What are

What is on the data set? What is the dependent (target) variable? What are the initial candidate independent variables?

Figure out what is on the data set. Proc Contents Proc freq with nlevels

Figure out what is on the data set. Proc Contents Proc freq with nlevels and tables with noprint; Proc means for numeric variables Proc freq for character variables

Proc Contents, chd 2018_a

Proc Contents, chd 2018_a

/*a macro that clears log and results*/ %clearall /*first take a look at the

/*a macro that clears log and results*/ %clearall /*first take a look at the data set */ libname a "d: dropboxchd 2018_data"; proc contents data=a. chd 2018_a position; run;

Number of unique values for each variable

Number of unique values for each variable

Number of unique values for each variable %clearall proc freq data=a. chd 2018_a nlevels;

Number of unique values for each variable %clearall proc freq data=a. chd 2018_a nlevels; tables age--diab/noprint; run;

Examine numeric variables.

Examine numeric variables.

Examine numeric variables. %clearall proc means data=a. chd 2018_a; run;

Examine numeric variables. %clearall proc means data=a. chd 2018_a; run;

An aside -- The clearall macro %macro clearall; dm "odsresults; clear"; dm "log; clear";

An aside -- The clearall macro %macro clearall; dm "odsresults; clear"; dm "log; clear"; ods html close; ods html; %mend;

mautosource option %let path=d: /dropbox/sasdata; options mautosource sasautos=("&pathdlmmacros", sasautos);

mautosource option %let path=d: /dropbox/sasdata; options mautosource sasautos=("&pathdlmmacros", sasautos);

autoexec. sas file A file that contains SAS statements that are executed immediately after

autoexec. sas file A file that contains SAS statements that are executed immediately after SAS initializes and before any user input is accepted Create with any text processor (notepad, sas, etc. ) Save in a location where SAS can find it.

My Autoexec. sas File (in my “home” directory c: usersdan) %let path=d: /dropbox/sasdata; options

My Autoexec. sas File (in my “home” directory c: usersdan) %let path=d: /dropbox/sasdata; options mautosource sasautos=("&pathdlmmacros", sasautos); %include "d: /dropbox/dlmauto. sas"; run;

dlmauto. sas (partial) options font=("Lucida Console" 18); OPTIONS FORMCHAR = "|----|+|---+=|-/<>*"; ods html path="c:

dlmauto. sas (partial) options font=("Lucida Console" 18); OPTIONS FORMCHAR = "|----|+|---+=|-/<>*"; ods html path="c: /tmp/sasfiles"; libname nh 9 Mort "&pathnhanes 1999mortalitysas"; libname nh 9 ques "&pathnhanes 1999questionnairesas"; libname nh 9 lab "&pathnhanes 1999labsas"; libname nh 9 exam "&pathnhanes 1999examsas"; libname nh 9 demo "&pathnhanes 1999demographicssas"; libname nh 9 diet "&pathnhanes 1999dietarysas"; libname nh 9 (nh 9 demo nh 9 exam nh 9 lab nh 9 mort nh 9 ques); libname framoff "&pathframoff"; libname framcoh "&pathframcoh"; libname hon "&pathhonolulu"; data _null_; put "End of Dan's autoexec. sas"; run;

Examine the coding of character variables

Examine the coding of character variables

Change character to numeric variables

Change character to numeric variables

Examine the coding of character variables %clearall; proc freq data=a. chd 2018_a; tables gender

Examine the coding of character variables %clearall; proc freq data=a. chd 2018_a; tables gender chd smoking diab; run;

If lots of char variables, an easier way using SQL %clearall; proc sql; select

If lots of char variables, an easier way using SQL %clearall; proc sql; select name into : vars separated by " " from dictionary. columns where libname="A" and memname="CHD 2018_A" and type="char" ; quit; %put &vars; proc freq data=a. chd 2018_a; tables &vars; run;

Change character variables to numeric. %clearall; libname a "d: dropboxchd 2018_data"; data a. chd

Change character variables to numeric. %clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking); set s 5238. chd 2018(rename=(chd=c_chd diab=c_diab smoking=c_smoking)); chd=c_chd="Developed Chd"; diab=c_diab="Diabetic"; select (c_smoking); when ("Never Smoked") smoking=0; when ("Past Smoker") smoking=1; when ("Current Smok") smoking=2; otherwise smoking=. ; end; male=gender="Male"; run; proc contents data=a. chd 2018_a; run; proc freq data=a. chd 2018_a; tables male chd smoking diab; run;

Changes in the original data are made: For statistical reasons For subject matter reasons.

Changes in the original data are made: For statistical reasons For subject matter reasons.

Intra-individual variability sbp 1, sbp 2, sbp 3 dbp 1, dbp 2, dbp 3

Intra-individual variability sbp 1, sbp 2, sbp 3 dbp 1, dbp 2, dbp 3

data tmp; set s 5238. chd 2018; sd_sbp=std(of sbp 1 -sbp 3); run; proc

data tmp; set s 5238. chd 2018; sd_sbp=std(of sbp 1 -sbp 3); run; proc univariate data=tmp; var sd_sbp; histogram sd_sbp; run;

%clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking

%clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking sbp 1 -sbp 3 dbp 1 -dbp 3); set s 5238. chd 2018(rename=(chd=c_chd diab=c_diab smoking=c_smoking)); chd=c_chd="Developed Chd"; diab=c_diab="Diabetic"; select (c_smoking); when ("Never Smoked") smoking=0; when ("Past Smoker") smoking=1; when ("Current Smok") smoking=2; otherwise smoking=. ; end; male=gender="Male"; sbp=mean(of sbp 1 -sbp 3); dbp=mean(of dbp 1 -dbp 3); run; proc contents data=a. chd 2018_a; run; proc means data=a. chd 2018_a; var sbp dbp; run;

The problem of standardizing for height. /* the problem of height*/ proc corr data=a.

The problem of standardizing for height. /* the problem of height*/ proc corr data=a. chd 2018_a nosimple; var weight fvc; with height; run;

Weight and height are usually combined into a single measure: body mass index

Weight and height are usually combined into a single measure: body mass index

%clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking

%clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking sbp 1 -sbp 3 dbp 1 -dbp 3 weight); set s 5238. chd 2018(rename=(chd=c_chd diab=c_diab smoking=c_smoking)); chd=c_chd="Developed Chd"; diab=c_diab="Diabetic"; select (c_smoking); when ("Never Smoked") smoking=0; when ("Past Smoker") smoking=1; when ("Current Smok") smoking=2; otherwise smoking=. ; end; male=gender="Male"; sbp=mean(of sbp 1 -sbp 3); dbp=mean(of dbp 1 -dbp 3); bmi=(weight/height**2)*703; run; proc contents data=a. chd 2018_a; run; proc means data=a. chd 2018_a; var bmi; run;

Create a standardized fvc

Create a standardized fvc

%clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking

%clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking sbp 1 -sbp 3 dbp 1 -dbp 3 weight height fvc); set s 5238. chd 2018(rename=(chd=c_chd diab=c_diab smoking=c_smoking)); chd=c_chd="Developed Chd"; diab=c_diab="Diabetic"; select (c_smoking); when ("Never Smoked") smoking=0; when ("Past Smoker") smoking=1; when ("Current Smok") smoking=2; otherwise smoking=. ; end; male=gender="Male"; sbp=mean(of sbp 1 -sbp 3); dbp=mean(of dbp 1 -dbp 3); bmi=(weight/height**2)*703; fvcht=fvc/height; run; proc contents data=a. chd 2018_a; run; proc means data=a. chd 2018_a; var fvcht; run;

Preprocessing, Part 2 Missing Data Multi-collinearity Complete Separation Quasi-Complete Separation

Preprocessing, Part 2 Missing Data Multi-collinearity Complete Separation Quasi-Complete Separation