Preprocessing Part 1 Preprocessing create a new data
- Slides: 35
Pre-processing – Part 1
Pre-processing – create a new data set with the dependent and candidate independent variables to be used in the analysis.
Data wrangling, data munging
From Wikipedia Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
5 Analysis Time Line Allotted Time Projected: Actual: Dreaded: (Data Acquisition) Needed: Data Preparation Data Analysis
Part 1: Make a direct copy of data. Examine contents. Transform all character variables to numeric. Modify or transform variables based on subject matter knowledge. Drop variables no longer needed.
Create a copy of the data set as a first step in creating an analytic file. /* create a new data set chd 2018_a This will eventually be the file used in analysis */ libname a "d: dropboxchd 2018_data"; data a. chd 2018_a; set s 5238. chd 2018; run;
What is on the data set? What is the dependent (target) variable? What are the initial candidate independent variables?
Figure out what is on the data set. Proc Contents Proc freq with nlevels and tables with noprint; Proc means for numeric variables Proc freq for character variables
Proc Contents, chd 2018_a
/*a macro that clears log and results*/ %clearall /*first take a look at the data set */ libname a "d: dropboxchd 2018_data"; proc contents data=a. chd 2018_a position; run;
Number of unique values for each variable
Number of unique values for each variable %clearall proc freq data=a. chd 2018_a nlevels; tables age--diab/noprint; run;
Examine numeric variables.
Examine numeric variables. %clearall proc means data=a. chd 2018_a; run;
An aside -- The clearall macro %macro clearall; dm "odsresults; clear"; dm "log; clear"; ods html close; ods html; %mend;
mautosource option %let path=d: /dropbox/sasdata; options mautosource sasautos=("&pathdlmmacros", sasautos);
autoexec. sas file A file that contains SAS statements that are executed immediately after SAS initializes and before any user input is accepted Create with any text processor (notepad, sas, etc. ) Save in a location where SAS can find it.
My Autoexec. sas File (in my “home” directory c: usersdan) %let path=d: /dropbox/sasdata; options mautosource sasautos=("&pathdlmmacros", sasautos); %include "d: /dropbox/dlmauto. sas"; run;
dlmauto. sas (partial) options font=("Lucida Console" 18); OPTIONS FORMCHAR = "|----|+|---+=|-/<>*"; ods html path="c: /tmp/sasfiles"; libname nh 9 Mort "&pathnhanes 1999mortalitysas"; libname nh 9 ques "&pathnhanes 1999questionnairesas"; libname nh 9 lab "&pathnhanes 1999labsas"; libname nh 9 exam "&pathnhanes 1999examsas"; libname nh 9 demo "&pathnhanes 1999demographicssas"; libname nh 9 diet "&pathnhanes 1999dietarysas"; libname nh 9 (nh 9 demo nh 9 exam nh 9 lab nh 9 mort nh 9 ques); libname framoff "&pathframoff"; libname framcoh "&pathframcoh"; libname hon "&pathhonolulu"; data _null_; put "End of Dan's autoexec. sas"; run;
Examine the coding of character variables
Change character to numeric variables
Examine the coding of character variables %clearall; proc freq data=a. chd 2018_a; tables gender chd smoking diab; run;
If lots of char variables, an easier way using SQL %clearall; proc sql; select name into : vars separated by " " from dictionary. columns where libname="A" and memname="CHD 2018_A" and type="char" ; quit; %put &vars; proc freq data=a. chd 2018_a; tables &vars; run;
Change character variables to numeric. %clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking); set s 5238. chd 2018(rename=(chd=c_chd diab=c_diab smoking=c_smoking)); chd=c_chd="Developed Chd"; diab=c_diab="Diabetic"; select (c_smoking); when ("Never Smoked") smoking=0; when ("Past Smoker") smoking=1; when ("Current Smok") smoking=2; otherwise smoking=. ; end; male=gender="Male"; run; proc contents data=a. chd 2018_a; run; proc freq data=a. chd 2018_a; tables male chd smoking diab; run;
Changes in the original data are made: For statistical reasons For subject matter reasons.
Intra-individual variability sbp 1, sbp 2, sbp 3 dbp 1, dbp 2, dbp 3
data tmp; set s 5238. chd 2018; sd_sbp=std(of sbp 1 -sbp 3); run; proc univariate data=tmp; var sd_sbp; histogram sd_sbp; run;
%clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking sbp 1 -sbp 3 dbp 1 -dbp 3); set s 5238. chd 2018(rename=(chd=c_chd diab=c_diab smoking=c_smoking)); chd=c_chd="Developed Chd"; diab=c_diab="Diabetic"; select (c_smoking); when ("Never Smoked") smoking=0; when ("Past Smoker") smoking=1; when ("Current Smok") smoking=2; otherwise smoking=. ; end; male=gender="Male"; sbp=mean(of sbp 1 -sbp 3); dbp=mean(of dbp 1 -dbp 3); run; proc contents data=a. chd 2018_a; run; proc means data=a. chd 2018_a; var sbp dbp; run;
The problem of standardizing for height. /* the problem of height*/ proc corr data=a. chd 2018_a nosimple; var weight fvc; with height; run;
Weight and height are usually combined into a single measure: body mass index
%clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking sbp 1 -sbp 3 dbp 1 -dbp 3 weight); set s 5238. chd 2018(rename=(chd=c_chd diab=c_diab smoking=c_smoking)); chd=c_chd="Developed Chd"; diab=c_diab="Diabetic"; select (c_smoking); when ("Never Smoked") smoking=0; when ("Past Smoker") smoking=1; when ("Current Smok") smoking=2; otherwise smoking=. ; end; male=gender="Male"; sbp=mean(of sbp 1 -sbp 3); dbp=mean(of dbp 1 -dbp 3); bmi=(weight/height**2)*703; run; proc contents data=a. chd 2018_a; run; proc means data=a. chd 2018_a; var bmi; run;
Create a standardized fvc
%clearall; libname a "d: dropboxchd 2018_data"; data a. chd 2018_a (drop=c_chd c_diab gender c_smoking sbp 1 -sbp 3 dbp 1 -dbp 3 weight height fvc); set s 5238. chd 2018(rename=(chd=c_chd diab=c_diab smoking=c_smoking)); chd=c_chd="Developed Chd"; diab=c_diab="Diabetic"; select (c_smoking); when ("Never Smoked") smoking=0; when ("Past Smoker") smoking=1; when ("Current Smok") smoking=2; otherwise smoking=. ; end; male=gender="Male"; sbp=mean(of sbp 1 -sbp 3); dbp=mean(of dbp 1 -dbp 3); bmi=(weight/height**2)*703; fvcht=fvc/height; run; proc contents data=a. chd 2018_a; run; proc means data=a. chd 2018_a; var fvcht; run;
Preprocessing, Part 2 Missing Data Multi-collinearity Complete Separation Quasi-Complete Separation
- Etl in data cleaning and preprocessing stands for
- Entity identification problem in data integration
- Data preprocessing examples
- Data preparation and preprocessing
- Aggregation in data preprocessing
- Neural network data preprocessing
- Major tasks in data preprocessing
- Apa itu mutiple queue dan one way list
- Image url to text
- Text operations
- Image preprocessing
- Image preprocessing
- Preprocessing fem
- Preprocessing in image processing
- Password hashing and preprocessing
- Password hashing and preprocessing
- Dti preprocessing
- Sap hybris extensions
- Three old worlds create a new
- Create a new profile
- Part part whole addition
- Part to part ratio definition
- Part part whole
- Technical object description example
- Vodka chimney glass
- The phase of the moon you see depends on ______.
- Part to part variation
- What is enterprise data model
- Split speech punctuation
- New york, new jersey, pennsylvania, and delaware
- Fresh oil new wine
- Movies new hartford
- 3 weaknesses of the articles of confederation
- New-old approach to creating new ventures
- New marketing realities pdf
- New years old is new again