Introduction to SAS Essentials Mastering SAS for Data

  • Slides: 77
Download presentation
Introduction to SAS Essentials Mastering SAS for Data Analytics Alan Elliott and Wayne Woodward

Introduction to SAS Essentials Mastering SAS for Data Analytics Alan Elliott and Wayne Woodward 1 SAS ESSENTIALS -- Elliott & Woodward

Chapter 6: Advanced Programming Topics Part I 2 SAS ESSENTIALS -- Elliott & Woodward

Chapter 6: Advanced Programming Topics Part I 2 SAS ESSENTIALS -- Elliott & Woodward

LEARNING OBJECTIVES � To be able to use SAS® Functions � To be able

LEARNING OBJECTIVES � To be able to use SAS® Functions � To be able to TRANSPOSE data sets � To be able to perform data recoding using SELECf � To be able to use SAS programming techniques to clean up a messy data set 3 SAS ESSENTIALS -- Elliott & Woodward

6. 1 USING SAS FUNCTIONS � Sophisticated calculations can be created using SAS functions.

6. 1 USING SAS FUNCTIONS � Sophisticated calculations can be created using SAS functions. These functions include arithmetic and text and date manipulation. The format for the use of SAS functions is result = function(argl, arg 2, etc. ); � Where arg 1 etc are arguments (information) you “send” to the function for it to act upon, and it returns a result. 4 SAS ESSENTIALS -- Elliott & Woodward

Function Arguments � Functions can require one or more arguments. Some require no arguments.

Function Arguments � Functions can require one or more arguments. Some require no arguments. For example, a few mathematical functions in SAS are as follows: ABS(X); * Absolute value; FACT(X); * Factorial; INT(X); * Integer portion of a value; LOG(X); * Natural log; SQRT(X); * Square root; 5 SAS ESSENTIALS -- Elliott & Woodward

Types of functions � Function in SAS can be categorized into these types: �ARITHMETIC

Types of functions � Function in SAS can be categorized into these types: �ARITHMETIC �TRIGNOMETRIC �DATE AND TIME �CHARACTER �TRUNCATION �SPECIAL USE AND MISCELANNEOUS �FINANCIAL �ACCESS PREVIOUS OBSERVATIONS (LAG) F 6 SAS ESSENTIALS -- Elliott & Woodward

Using Functions in a Calculation � Functions can also be used as a part

Using Functions in a Calculation � Functions can also be used as a part of a more extensive calculation. For example, C =MEASURE+ SQRT(A**2 + B**2 ) ; � would calculate the square of A and then the square of B, then add those two numbers, take the square root of that value, add that number to the value of MEASURE, and assign the result to the variable named C. 7 SAS ESSENTIALS -- Elliott & Woodward

Functions that take more than one argument � Some examples of functions that take

Functions that take more than one argument � Some examples of functions that take more the one argument are: MAX(xl, x 2, x 3, . . . ) ; * Maximum of a list; MIN(xl, x 2, x 3, . . . ) ; * Minimum of a list; SUM(xl, x 2, x 3, . . . ) ; * Sum of a list; MEDIAN(xl, x 2, x 3, . . . ) ; * Median of a list; ROUND(value, round); * Round off a value; � For example TOTAL=SUM(TIMEl, TIME 2, TIME 3, TIME 4); 8 SAS ESSENTIALS -- Elliott & Woodward

Rounding Values � The round-off unit in the ROUND function determines how the rounding

Rounding Values � The round-off unit in the ROUND function determines how the rounding will be performed. The default is 1. A round-off value of 0. 01 means to round it off to the nearest l. OOth, and a round-off value of 5 means to round it off to the nearest 5. Here a few examples: ROUND (3. 1415, . 01) *Returns the value 3. 14; ROUND(107, 5) * Returns the value 105; ROUND(3. 6234)* Returns the value 4 (rounds to integer); 9 SAS ESSENTIALS -- Elliott & Woodward

A Function with No Argument � A function that takes no argument is TODAY

A Function with No Argument � A function that takes no argument is TODAY (). For example, the code NOW=TODAY(); � puts the current date value from the computer's clock into the variable named NOW. 10 SAS ESSENTIALS -- Elliott & Woodward

Specifying Arguments � When arguments are a list of values, such as in MAX

Specifying Arguments � When arguments are a list of values, such as in MAX or MIN, you can specify the list as variables separated by commas, or as a range preceded by the word OF. For example, if these variables have the following values: X 1 = 1; X 2 = 2; X 3 = 13; X 4=10; � Then MAX(1, 2, 3, 4, 5) * Returns the value 5; MAX(X 1, X 2, X 3, X 4) * Returns the value 13; MAX(OF X 1 -X 4) * Returns the value 13; 11 SAS ESSENTIALS -- Elliott & Woodward

More about Specifying Arguments � Note that the designation OF X 1 -X 4

More about Specifying Arguments � Note that the designation OF X 1 -X 4 is interpreted by SAS as all same-named consecutively numbered variables from X 1 to X 4 in the MAX () function example above. If there are missing values in the list, they are ignored. Other similar functions are illustrated here: MIN(OF X 1 -X 4) * Returns the value 1; SUM(OF X 1 -X 4) * Returns the value 26; MEDIAN(OF X 1 -X 4) * Returns the value; NMISS(OF X 1 -X 4) * Returns 0 (# missing values); N(OF X 1 -X 4) * Returns 4 (# non-missing); 12 SAS ESSENTIALS -- Elliott & Woodward

EXAMPLE � Suppose you have this code DATA NUM; X 1 = 1; X

EXAMPLE � Suppose you have this code DATA NUM; X 1 = 1; X 2 = 2; X 3 = 13; X 4=. ; M 1= SUM(X 1 -X 4) ; M 2= SUM(OF X 1 -X 4); RUN; PROC PRINT; RUN; What is the value of M 1 and M 2? EXERCISE – Enter this code and run the program. PAUSE and return when you’ve successfully run the code 13 SAS ESSENTIALS -- Elliott & Woodward

RESULTS � The answer is shown here. When you use M 1= SUM(X 1

RESULTS � The answer is shown here. When you use M 1= SUM(X 1 -X 4) ; A missing value in the list makes the returned value missing, whereas if you use M 2= SUM(OF X 1 -X 4); The function uses only non-missing values in its evaluation of the function. � 14 SAS ESSENTIALS -- Elliott & Woodward

6. 2 USING PROC TRANSPOSE � PROC TRANSPOSE allows you to restructure the values

6. 2 USING PROC TRANSPOSE � PROC TRANSPOSE allows you to restructure the values in your data set by transposing (or reorienting) the data. This is typically performed when your data are not in the structure required for an analysis. ROWS Transpose This is one example of how PROC Transpose is used. 15 SAS ESSENTIALS -- Elliott & Woodward COLUMNS

Simplified Syntax for TRANSPOSE PROC TRANSPOSE DATA=input-data OUT=output-data; PREFIX specifies a prefix to the

Simplified Syntax for TRANSPOSE PROC TRANSPOSE DATA=input-data OUT=output-data; PREFIX specifies a prefix to the <PREFIX=prefix>; names of variables created in the <BY <variables>; transposition. The default names are COL 1, COL 2, and so on <ID variable>; VAR variables; The VAR statement specifies which variables are to be transposed. 16 The BY variable, when specified, indicates the variable that is used to form BY groups. SAS ESSENTIALS -- Elliott & Woodward

Hands On Example Page 140 �Open the file DTRANSPOSE 1. SAS DATA SUBJECTS; INPUT

Hands On Example Page 140 �Open the file DTRANSPOSE 1. SAS DATA SUBJECTS; INPUT SUB 1 $ SUB 2 $ SUB 3 $ SUB 4 $; DATALINES; 12 21 13 14 Notice how the data are not in a 13 21 12 14 desired form. Use PROC TRANSPOSE 15 31 23 23 to “flip” the data. 15 33 21 32 M F F M ; RUN; 17 SAS ESSENTIALS -- Elliott & Woodward

Using PROC TRANSPOSE to “flip” data PROC TRANSPOSE DATA=SUBJECTS OUT=TRANSPOSED; VAR SUB 1 SUB

Using PROC TRANSPOSE to “flip” data PROC TRANSPOSE DATA=SUBJECTS OUT=TRANSPOSED; VAR SUB 1 SUB 2 SUB 3 SUB 4; RUN; PROC PRINT DATA=TRANSPOSED; RUN; 18 Results: Note that column names are COL 1, COL 2 by default. Obs _NAME_ COL 1 COL 2 COL 3 COL 4 COL 5 1 SUB 1 12 13 15 15 M 2 SUB 2 21 21 31 33 F 3 SUB 3 13 12 23 21 F 4 SUB 4 14 14 23 32 M SAS ESSENTIALS -- Elliott & Woodward

HANDS ON EXERCISE � You want column named to be INFO 1, INFO 2,

HANDS ON EXERCISE � You want column named to be INFO 1, INFO 2, etc. Do this with the PREFIX Statement: EXERCISE – 1. Change the PROC statement to PROC TRANSPOSE DATA=SUBJECTS OUT=TRANSPOSED PREFIX=INFO; Ad ded Rerun the code, and observe the results… particularly the column headings. PAUSE and return when you’ve run the new code. 19 SAS ESSENTIALS -- Elliott & Woodward

RESULTS �The added PREFIX= statement in PROC TRANSPOSE DATA=SUBJECTS OUT=TRANSPOSED PREFIX=INFO; �Changed the column

RESULTS �The added PREFIX= statement in PROC TRANSPOSE DATA=SUBJECTS OUT=TRANSPOSED PREFIX=INFO; �Changed the column headings: Note new column names 20 SAS ESSENTIALS -- Elliott & Woodward

CONTINUE EXERCISE � 2. To make the results better, add the following code (to

CONTINUE EXERCISE � 2. To make the results better, add the following code (to RENAME the columns) This is used to rename the variables. For example INFO 2 is renamed T 2. DATA NEW; SET TRANSPOSED; RENAME INFO 1=T 1 INFO 2=T 2 INFO 3=T 3 INFO 4=T 4 INFO 5=GENDER _NAME_=SUBJECT; RUN; PROC PRINT DATA=NEW; RUN; PAUSE – enter the new code, run it, and observe the results. Return when you are finished. 21 SAS ESSENTIALS -- Elliott & Woodward

RESULTS • Notice how the column names are now much more informative. 22 SAS

RESULTS • Notice how the column names are now much more informative. 22 SAS ESSENTIALS -- Elliott & Woodward

ANOTHER WAY TO NAME COLUMNS � Open the file DTRANSPOSE 1 a. SAS �

ANOTHER WAY TO NAME COLUMNS � Open the file DTRANSPOSE 1 a. SAS � Notice the label variable LAB in this version of the code: DATA SUBJECTS; INPUT LAB $ SUB 1 $ SUB 2 $ SUB 3 $ SUB 4 $; DATALINES; BASELINE 12 21 13 14 TIME 1 13 21 12 14 TIME 2 15 31 23 23 TIME 3 15 33 21 32 GENDER M F F M ; 23 SAS ESSENTIALS -- Elliott & Woodward

TRANSPOSE WITH LABELS �When you have a label variable, use the ID LAB statement

TRANSPOSE WITH LABELS �When you have a label variable, use the ID LAB statement to create column names. PROC TRANSPOSE DATA=SUBJECTS OUT=TRANSPOSED; ID LAB; VAR SUB 1 SUB 2 SUB 3 SUB 4; RUN; PROC PRINT DATA=TRANSPOSED; RUN; 24 SAS ESSENTIALS -- Elliott & Woodward

RESULTS WHEN YOU RUN THIS CODE � Notice how the column names reflect the

RESULTS WHEN YOU RUN THIS CODE � Notice how the column names reflect the LAB variable used in the ID LAB Statement. � (You may still want to rename the _NAME_ Column. ) 25 SAS ESSENTIALS -- Elliott & Woodward

 USING PROC TRANSPOSE WHEN YOU HAVE MULTIPLE RECORDS PER SUBJECT � Suppose you

USING PROC TRANSPOSE WHEN YOU HAVE MULTIPLE RECORDS PER SUBJECT � Suppose you have data that have one or more observations per subject, but you want to analyze the data by observation (a set of observations per row). Use PROC TRANSPOSE to transpose the data by a key variable. Note: Some subjects have more than one record in the data set. 26 SAS ESSENTIALS -- Elliott & Woodward

COMBINE MULTIPLE RECORDS ONTO ONE LINE � Use PROC TRANSPOSE to combine multiple records

COMBINE MULTIPLE RECORDS ONTO ONE LINE � Use PROC TRANSPOSE to combine multiple records (identified by a key variable) onto to one line per record PROC TRANSPOSE DATA="C: SASDATACOMPLICATIONS" The PREFIX allows you to name the OUT=COMP_OUT combined variable. PREFIX=COMP; The BY variable identifies the key BY SUBJECT; variable to expand on. VAR COMPLICATION; RUN; 27 The VAR identifies which variables to expand. SAS ESSENTIALS -- Elliott & Woodward

HANDS ON EXERCISE P 143 (DTRANSPOSE 2. SAS) � Run the code. Performs the

HANDS ON EXERCISE P 143 (DTRANSPOSE 2. SAS) � Run the code. Performs the transpose. Cleans up the data and limits output to subjects with 3 or more complications. Limits the names of the complications to 10 characters in length for the report. 28 SAS ESSENTIALS -- Elliott & Woodward

You’ve found that subject 2076 has three complications listed. RESULTS OF PROC TRANSPOSE Obs

You’ve found that subject 2076 has three complications listed. RESULTS OF PROC TRANSPOSE Obs SUBJECT COMP 1 COMP 2 COMP 3 COMP 4 1 2076 Pneumonia Heart Atta Renal Fail 2 3585 DVT (Lower Pneumonia Renal Fail 3 3630 DVT (Lower Heart Atta Pneumonia 4 4585 Compartmen Pneumonia Skin Break 5 4599 Aspiration Pneumonia Renal Fail 6 4760 Acute Resp Pneumonia Renal Fail 7 4775 Pneumonia DVT (Lower Pancreatit Pneumonia Renal Fail Note how PROC TRANSPOSE expanded SUBJECT records that had multiple complications and named them COMP 1, COMP 2 etc. 29 SAS ESSENTIALS -- Elliott & Woodward

EXERCISE - ANALYZE THE RESULTS � How many subjects had Renal Failure? Examine this

EXERCISE - ANALYZE THE RESULTS � How many subjects had Renal Failure? Examine this code DATA RENAL; SET COMP_OUT; CCAT=CATT(OF COMP 1 -COMP 7); CATT concatenates all of the complications into a single variable named CCAT. IF FIND(UPCASE(CCAT), "RENAL") NE 0 then RENALFAILURE="Yes"; ELSE RENALFAILURE="No"; Use FIND() to find any instance of RUN; “RENAL” in CCAT, and create a new PROC FREQ DATA=RENAL; variable named RENALFAILURE TABLES RENALFAILURE; RUN; Counts the number of subjects with Renal Failure. 30 SAS ESSENTIALS -- Elliott & Woodward

WHAT THIS CODE IS DOING… CCAT=CATT(OF COMP 1 -COMP 7); Concatenates complications for each

WHAT THIS CODE IS DOING… CCAT=CATT(OF COMP 1 -COMP 7); Concatenates complications for each subject, producing the records shown here: Note how multiple complications are concatenated. 31 SAS ESSENTIALS -- Elliott & Woodward

WHAT THIS CODE IS DOING… IF FIND(UPCASE(CCAT), "RENAL") NE 0 then RENALFAILURE="Yes"; This is

WHAT THIS CODE IS DOING… IF FIND(UPCASE(CCAT), "RENAL") NE 0 then RENALFAILURE="Yes"; This is the new ELSE RENALFAILURE="No"; RENALFAILURE variable created by the IF statement RESULTS Note this is Yes because RENAL is in the complications list. 32 SAS ESSENTIALS -- Elliott & Woodward

HANDS ON EXERCISE DATA RENAL; SET COMP_OUT; CCAT=CATT(OF COMP 1 -COMP 7); IF FIND(UPCASE(CCAT),

HANDS ON EXERCISE DATA RENAL; SET COMP_OUT; CCAT=CATT(OF COMP 1 -COMP 7); IF FIND(UPCASE(CCAT), "RENAL") NE 0 then RENALFAILURE="Yes"; ELSE RENALFAILURE="No"; RUN; PROC FREQ DATA=RENAL; TABLES RENALFAILURE; RUN; EXERCISE – Enter and run the new code and observe results. PAUSE – Return once you’ve completed this exercise. 33 SAS ESSENTIALS -- Elliott & Woodward

RESULTS You discover that 50 of the 528 subjects had Renal Failure as one

RESULTS You discover that 50 of the 528 subjects had Renal Failure as one of their complications. 34 SAS ESSENTIALS -- Elliott & Woodward

6. 3 THE SELECT STATEMENT � The SELECT statement evaluates the value of a

6. 3 THE SELECT STATEMENT � The SELECT statement evaluates the value of a variable and creates new assignments based on those values. Syntax (simplified) is as follows: SELECT <(select - expression)> ; WHEN- 1 statement; WHEN- n statement; < OTHERWISE statement; > 35 SAS ESSENTIALS -- Elliott & Woodward

SELECT Statement Example � Suppose you want to calculate NEWVAL according to some specific

SELECT Statement Example � Suppose you want to calculate NEWVAL according to some specific values of the variable OBSERVED. That is, if OBSERVED=1, you want to set NEWVAL=AGE+2. If OBSERVED=2 or 3, you want to set NEWVAL=AGE+10, and so on. A SELECT statement to perform this recoding would be as follows: Note that OBSERVED is the comparison variable – identified as the SELECT EXPRESSION SELECT (OBSERVED); WHEN (1) NEWVAL=AGE+2; WHEN (2, 3) NEWVAL=AGE+10; WHEN (4, 5, 6) NEWVAL=AGE+20 ; OTHERWISE NEWVAL=0; END; 36 SAS ESSENTIALS -- Elliott & Woodward WHEN (1) is interpreted as WHEN OBSERVED=1

Without a Specific Select Expression � Another way to use SELECT is without a

Without a Specific Select Expression � Another way to use SELECT is without a specific select- expression. In this case, the WHEN statements include conctitional expressions that should be in parentheses. For example: Note in this version there is no specified comparison variables. Comparisons are specified in the WHEN statements. SELECT; WHEN (GP='A') STATUS 2=1; WHEN (GP='B' and SEX=1) STATUS 2=2; WHEN (GP='C ' and SEX=0) STATUS 2=3; OTHERWISE STATUS 2=0; Use this version of SELECT when comparisons are more END; complex (and not just equal. ) 37 SAS ESSENTIALS -- Elliott & Woodward

Do Hands On Example p 145( DSELECT. SAS) DATA MYDATA; SET "C: SASDATASOMEDATA"; FORMAT

Do Hands On Example p 145( DSELECT. SAS) DATA MYDATA; SET "C: SASDATASOMEDATA"; FORMAT ECONOMIC $7. ; Note the difference in using a specified SELECT(STATUS); SELECT expression (here is it STATUS) WHEN (1, 2) ECONOMIC="LOW"; and in the second WHEN (3) ECONOMIC="MIDDLE"; part of this example, when you do not WHEN (4, 5) ECONOMIC="HIGH"; have a specified OTHERWISE ECONOMIC="MISSING"; expression. END; PROC PRINT DATA=MYDATA; RUN; 38 SAS ESSENTIALS -- Elliott & Woodward

6. 4 GOING DEEPER: CLEANING A MESSY DATA SET � Many of the features

6. 4 GOING DEEPER: CLEANING A MESSY DATA SET � Many of the features in the SAS language are helpful in cleaning up messy data. � By messy we mean data sets that are not quite ready for analysis. � Most data analysts experience problems dealing with files that contain data that have coding problems and must be fixed before a proper analysis is possible. � This section walks you through a case study of a data set with problems and illustrates how they might be corrected. 39 SAS ESSENTIALS -- Elliott & Woodward

A TYPICAL MESSY DATA SET – NOTE SOME OF THE ISSUES (A FEW ARE

A TYPICAL MESSY DATA SET – NOTE SOME OF THE ISSUES (A FEW ARE MARKED) 40 SAS ESSENTIALS -- Elliott & Woodward

Problems with the data set � A few of the problems you might quickly

Problems with the data set � A few of the problems you might quickly note include the following: � Line 17 is blank. � There are non-date values in the "Date. Left" column. � There is a non-number in the "Age" column (>29). � Values of Gender are mixed upper and lower cases. � There are multiple answers in columns that should have only one answer. �You can correct these issues by using SAS code. One reason doing this in SAS code is that it leaves an audit trail of changes. 41 SAS ESSENTIALS -- Elliott & Woodward

A TYPICAL MESSY DATA SET – NOTE SOME OF THE ISSUES (A FEW ARE

A TYPICAL MESSY DATA SET – NOTE SOME OF THE ISSUES (A FEW ARE MARKED) 42 SAS ESSENTIALS -- Elliott & Woodward

6. 4. 1 FIX LABELS, RENAME VARIABLES � Often, a first step in creating

6. 4. 1 FIX LABELS, RENAME VARIABLES � Often, a first step in creating a clean data set is to attach labels to the variables to make output more “readable. ” Variable Name Type Label 1 Subject Char Subject ID 2 Date. Arrived Date Arrived 3 Time. Arrive Time Arrived 4 Date. Left Date Left 5 Time. Left Time Left 6 Married Num Married? 7 Single Num Single? 8 Age Num Age Jan 1, 2014 9 Gender Char Gender 10 Education Num Years of Schooling 11 Race Char Race 12 How_Arrived Char How Arrived at Clinic 13 Top_Reason Num Top Reason for Coming 14 Arrival Num Temperature 15 Satisfaction Num Satisfaction Score Use the LABEL statement to created these labels… SAS friendly variable named created in Excel. 43 SAS ESSENTIALS -- Elliott & Woodward

HANDS ON EXERCISE P 148 � Open the file MESSY 1. SAS � Add

HANDS ON EXERCISE P 148 � Open the file MESSY 1. SAS � Add labels to the variables. For example: LABEL statement These statements shows how you might rename a variable using a more descriptive name (TEMP for temperature. ) 44 SAS ESSENTIALS -- Elliott & Woodward

DISPLAY THE RESULTS � The rest of the code in MESSY 1. SAS displays

DISPLAY THE RESULTS � The rest of the code in MESSY 1. SAS displays the first 10 records so you can verify the changes � Run the code to see the changes so far. Note that we’re saving changes in the file named CLEANED PROC PRINT LABEL DATA=MYSASLIB. CLEANED (firstobs=10); VAR SUBJECT EDUCATION TEMP TOP_REASON SATISFACTION; RUN; 45 SAS ESSENTIALS -- Elliott & Woodward

CLEANED DATA SO FAR 46 Note the explanatory column labels… SAS ESSENTIALS -- Elliott

CLEANED DATA SO FAR 46 Note the explanatory column labels… SAS ESSENTIALS -- Elliott & Woodward

Other housekeeping in the data set � There are some other housekeeping chores in

Other housekeeping in the data set � There are some other housekeeping chores in the data set to clean up the variable names. For example TEMP=ARRIVAL; DROP ARRIVAL; 47 The ARRIVAL variable (which has to do with temperature) is renamed to avoid confusion with the HOW_ARRIVED variable. SAS ESSENTIALS -- Elliott & Woodward

EXERCISE Variable Name Type Label 1 Subject Char Subject ID 2 Date. Arrived Date

EXERCISE Variable Name Type Label 1 Subject Char Subject ID 2 Date. Arrived Date Arrived 3 Time. Arrive Time Arrived 4 Date. Left Date Left 5 Time. Left Time Left 6 Married Num Married? 7 Single Num Single? 8 Age Num Age Jan 1, 2014 9 Gender Char Gender 10 Education Num Years of Schooling 11 Race Char Race 12 How_Arrived Char How Arrived at Clinic 13 Top_Reason Num Top Reason for Coming 14 Arrival Num Temperature 15 Satisfaction Num Satisfaction Score 48 SAS ESSENTIALS -- Elliott & Woodward PAUSE the tutorial and enter the remaining labels as shown in this table. Run the code… return once you’ve completed this exercise.

THE COMPLETED LABEL STATEMENT LABEL EDUCATION='Years of Schooling' HOW_ARRIVED='How Arrived at Clinic' TOP_REASON='Top Reason

THE COMPLETED LABEL STATEMENT LABEL EDUCATION='Years of Schooling' HOW_ARRIVED='How Arrived at Clinic' TOP_REASON='Top Reason for Coming' SATISFACTION='Satisfaction Score' Subject="Subject ID" Date. Arrived="Date Arrived" Time. Arrive="Time Arrived" Date. Left="Date Left" Time. Left="Time Left" Married="Married? " Single="Single? " Note: We left out the label for ARRIVAL since we replaced that variable with TEMP. Age="Age Jan 1, 2014" Gender="Gender" Race="Race" Satisfaction="Satisfaction Score"; 49 SAS ESSENTIALS -- Elliott & Woodward

Fix Case Problems, Allowed Categories, and Delete Unneeded Lines � To correct case problems,

Fix Case Problems, Allowed Categories, and Delete Unneeded Lines � To correct case problems, you can use the UPCASE ( ) or LOWCASE () function to convert data values to all upper or all lower case. � A second common fix is to verify that all items in a categorical variable are allowable. For example, in the HOW_ ARRIVED variable, only CAR, BUS, or WALK is acceptable. This statement can fix that problem… IF HOW_ARRIVED NOT IN ('CAR', 'BUS', 'WALK') THEN HOW_ARRIVED=" "; (Refer to the IN() function in Appendix B for more info. ) 50 SAS ESSENTIALS -- Elliott & Woodward

More fixes… � A third easy-to-perform check is to delete irrelevant records. In this

More fixes… � A third easy-to-perform check is to delete irrelevant records. In this data set, if a line does not contain a Subject ID, we want to eliminate that record. This is done with an IF statement IF SUBJECT ="" THEN DELETE; 51 SAS ESSENTIALS -- Elliott & Woodward

HANDS ON EXAMPLE P 150 � Open the file MESSY 2. SAS (Note it

HANDS ON EXAMPLE P 150 � Open the file MESSY 2. SAS (Note it includes labels from the previous exercise. ) Fix case problems. Eliminate bad entries in HOW_ARRIVED Get rid of empty records. 52 SAS ESSENTIALS -- Elliott & Woodward

EXERCISE – MAKE MORE CORRECTIONS � Make these changes: Use IN() to correct GENDER

EXERCISE – MAKE MORE CORRECTIONS � Make these changes: Use IN() to correct GENDER IF GENDER NOT IN(‘M’, ’F’) THEN GENDER=""; � Fix RACE using the code The only correct race codes are H, C, and AA IF IF RACE="MEX" OR RACE="M" then RACE="H"; RACE=“A" then RACE=“AA"; RACE=“W" then RACE=“C"; RACE="X" OR RACE="NA" then RACE=""; PAUSE – Make these changes to the code. Rerun the program and verify that the changes have taken place. Return to the tutorial when you have completed this exercise. 53 SAS ESSENTIALS -- Elliott & Woodward

RESULTS – FIRST FEW (CORRECTED) RECORDS OF OUTPUT Note the missing RACE entry where

RESULTS – FIRST FEW (CORRECTED) RECORDS OF OUTPUT Note the missing RACE entry where there was previously an incorrect entry. � We’ll continue with this example in the next tutorial… 54 SAS ESSENTIALS -- Elliott & Woodward

Check and Fix Incorrect Categories, Fix duplicated Variables � Two troubling variables are MARRIED

Check and Fix Incorrect Categories, Fix duplicated Variables � Two troubling variables are MARRIED and SINGLE. The survey asked respondents their marital status, and the information was recorded in the data set where 1 means yes and 0 means no. � Technically, these two variables should be the opposite of each other, and you should only require one of them in the data set. However, if you look at the frequencies of each using PROC FREQ, you discover that they are not telling you the same thing. 55 SAS ESSENTIALS -- Elliott & Woodward

Do Hands On Exercise p 151 (DISCOVER 1. SAS) � Notice how the MARRIED

Do Hands On Exercise p 151 (DISCOVER 1. SAS) � Notice how the MARRIED and SINGLE frequencies do not match… � You must make some decision to reconcile this problem. � The researcher should make this decision. 56 SAS ESSENTIALS -- Elliott & Woodward

TOP REASON FOR COMING � This variable allows subject to select the top reason

TOP REASON FOR COMING � This variable allows subject to select the top reason for coming to the clinic. The survey dis not intend to allow multiple answers, but some respondents chose more than one answer…results in these data problems: 57 SAS ESSENTIALS -- Elliott & Woodward

SOME FIXES APPLIED TO CATEGORICAL PROBLEMS Ø This code is in the file MESSY

SOME FIXES APPLIED TO CATEGORICAL PROBLEMS Ø This code is in the file MESSY 3. SAS… DROP MARRIED; Note the decision to use the SINGLE variable. Thus, MARRIED is no longer needed. IF TOP_REASON NE "1" AND TOP_REASON NE "2" AND TOP_REASON NE "3" THEN TOP_REASON=. ; Get rid of any TOP_REASON values that are not allowed. RUN; To double check this fix, go back to DISCOVER 1. SAS (take MARRIED out of the PROC PRINT) and run it again to make sure the fix is correct. 58 SAS ESSENTIALS -- Elliott & Woodward

Check and Fix Out-of-Range Numeric Variables � Do Hands On Exercise p 153 (DISCOVER

Check and Fix Out-of-Range Numeric Variables � Do Hands On Exercise p 153 (DISCOVER 2. SAS) to discover unusual minimum and maximum values in numeric variables. This can be done using the simple code: PROC MEANS MAXDEC=2 DATA=CLEANED; RUN; 59 Note problems Also note that AGE does not show up as a numeric variable… it is current a character variables. SAS ESSENTIALS -- Elliott & Woodward

Correct “out of range” problems Convert all temperatures to Fahrenheit IF EDUCATION=99 then EDUCATION=.

Correct “out of range” problems Convert all temperatures to Fahrenheit IF EDUCATION=99 then EDUCATION=. ; IF TEMP LT 45 THEN TEMP=(9/S)*TEMP+32; IF TEMP=1018 then TEMP=101. 8; IF SATISFACTION= -99 THEN SATISFACTION=. ; * Convert AGE from character to numeric; When you convert a AGEN=INPUT(AGE, 5. ); variable from character to DROP AGE; numeric, you can’t use the same variable name, so LABEL AGEN="Age Jan 1, 2014"; we chose AGEN to be the numeric version of the AGE variable. 60 SAS ESSENTIALS -- Elliott & Woodward

RECHECK AND FIX AGEN � Now that AGEN is numeric, a rerun of DISCOVER

RECHECK AND FIX AGEN � Now that AGEN is numeric, a rerun of DISCOVER 2. SAS reveals a range problem for AGEN Out of range values for AGEN Exercise. Using MESSY 4. SAS, add this code to correct the AGEN problem: IF AGEN LT 10 OR AGEN GT 99 then AGEN =. ; PAUSE. Rerun this code, and observe results (in DISCOVER 2. SAS). Return after you’ve completed this exercise. 61 SAS ESSENTIALS -- Elliott & Woodward

RESULTS (FROM DISCOVER 2. SAS) � Note that AGEN values are within an acceptable

RESULTS (FROM DISCOVER 2. SAS) � Note that AGEN values are within an acceptable range. 62 SAS ESSENTIALS -- Elliott & Woodward

ASIDE: CORRECTING DATA, ADMINISTRATIVE ISSUES � Statisticians (Data Scientists) often work with data provided

ASIDE: CORRECTING DATA, ADMINISTRATIVE ISSUES � Statisticians (Data Scientists) often work with data provided to them from other people such as: � research data from an experiment or a clinical trial � data from an on-line survey or extracted from on-line forms � electronically gathered data from observed behaviors � etc. � Before making data corrections, determine who has the authority, knowledge, and/or responsibility for making data change decisions. � Keep track of changes (SAS provides an audit trail) in case there are questions in the future. 63 SAS ESSENTIALS -- Elliott & Woodward

CORRECT DATE AND TIME VALUES � The date and time the subject arrived and

CORRECT DATE AND TIME VALUES � The date and time the subject arrived and left the clinic are needed to calculate how long it took to serve each patient. However, these values are currently of character type. � The following example illustrates how to convert the character variables to SAS date and time values, and how to combine them in a single “datetime” value. � The date values are store as character values. For example 2/7/2005 � Use INPUT() to convert them to dates: DATEARRIVED 2=INPUT(TRIM(DATEARRIVED), MMDDYYl 0. ); 64 SAS ESSENTIALS -- Elliott & Woodward

CONVERTING THE TIME VALUE � Converting the TIME value is a little more complex.

CONVERTING THE TIME VALUE � Converting the TIME value is a little more complex. A TIMEARRIVE value looks like this: � 11: 18: 00 A Use FIND to locate the blank between the number and either A or P. Assign it the value I I= FIND(TIMEARRIVE, " "); Use SUBSTR function and the I location of P or A and extract the number portion of the time. TIMEARRIVE=SUBSTR(TIMEARRIVE, l , I-1); P=FIND(TIMEARRIVE, "P"); 65 Also, determine if there is a “P” in the value – which implies that it is PM. If P=0 then it is AM. SAS ESSENTIALS -- Elliott & Woodward

CONTINUE CONVERSION OF TIME Convert TIMEARRIVE 2 to a number using INPUT() TIMEARRIVET=INPUT(TRIM(TIMEARRIVE 2),

CONTINUE CONVERSION OF TIME Convert TIMEARRIVE 2 to a number using INPUT() TIMEARRIVET=INPUT(TRIM(TIMEARRIVE 2), TIME 8. ); If the time is after noon, add 12 hours of seconds to the value… (P>0 means that this time in in the PM (afternoon. )) IF P>0 AND TIMEARRIVET LT 43200 THEN TIMEARRIVET=TIMEARRIVET+43200; Convert seconds (ARRIVEDT) value to a DATETIME SAS variable using the DHMS function, and give it a label. ARRIVEDT=DHMS(DATEARRIVED 2, 0, 0, TIMEARRIVET); Label ARRIVEDT="Date & Time Arrived"; 66 SAS ESSENTIALS -- Elliott & Woodward

DO HANDS ON EXAMPLE P 156 � Open MESSY 5. SAS to do correct

DO HANDS ON EXAMPLE P 156 � Open MESSY 5. SAS to do correct datetime variables. 67 SAS ESSENTIALS -- Elliott & Woodward

RUN MESSY 5. SAS, OBSERVE RESULTS. Note the finalized Date & Time Arrived variable,

RUN MESSY 5. SAS, OBSERVE RESULTS. Note the finalized Date & Time Arrived variable, which is needed to calculate the difference between when arrived and when left the clinic. 68 SAS ESSENTIALS -- Elliott & Woodward

CALCULATE DATE & TIME SUBJECT LEFT CLINIC To complete the calculations, correct values for

CALCULATE DATE & TIME SUBJECT LEFT CLINIC To complete the calculations, correct values for TIMELEFT using a similar process (The time subject left the clinic): And add the label LABEL LEFTDT ="Date & Time Left"; 69 SAS ESSENTIALS -- Elliott & Woodward

CALCULATE HOW LONG STAYED IN CLINIC � Once ARRIVEDR and LEFTST are calculated, calculate

CALCULATE HOW LONG STAYED IN CLINIC � Once ARRIVEDR and LEFTST are calculated, calculate how long a subject stayed in the clinic using the INTCK() function with a "MIN" (minutes) argument STAYMINUTES=INTCK('MIN', ARRIVEDT, LEFTDT); � Divide by 60 and round it off to get the number of hours stayed in the clinic: STAYHOURS=ROUND(STAYMINUTES/60, . 1); As always, check the results… 70 SAS ESSENTIALS -- Elliott & Woodward

STILL PROBLEMS WITH STAY LENGTH Ø There are some STAYHOURS that are negative or

STILL PROBLEMS WITH STAY LENGTH Ø There are some STAYHOURS that are negative or too large. Ø Use code to eliminate “impossible” values. For example: IF STAYHOURS<0 or STAYHOURS>48 then STAYHOURS=. ; 71 SAS ESSENTIALS -- Elliott & Woodward

LOOK FOR DUPLICATE RECORDS � A final check for this data set is to

LOOK FOR DUPLICATE RECORDS � A final check for this data set is to determine if there are duplicate records. Typically, this is accomplished by looking for duplicate IDs. � A simple way to do this is with PROC FREQ. � This final example for this section illustrates this process. � Do Hands On Exercise p 158. Use PROC FREQ to count the � DISCOVER 3. SAS number of unique SUBJECT IDs PROC FREQ DATA=MYSASLIB. CLEANED NOPRINT; TABLES SUBJECT / OUT=FREQCNT; RUN; PROC PRINT DATA=FREQCNT; WHERE COUNT>1; Display results where there are more than one SUBJCTS with RUN; the same ID 72 SAS ESSENTIALS -- Elliott & Woodward

CORRECT DUPLICATE RECORDS � The PROC FREQ identifies one duplicated record. There are two

CORRECT DUPLICATE RECORDS � The PROC FREQ identifies one duplicated record. There are two SUBJECTS with the ID number 26. � In the example, you discover that SUBJECT 27 was miscoded as 26, thus you can fix that type with the code: IF_N_=27 THEN SUBJECT=27; 73 SAS ESSENTIALS -- Elliott & Woodward

CLEANING A DATA SET: SUMMARY � When you get a new data set, do

CLEANING A DATA SET: SUMMARY � When you get a new data set, do these checks: � Visually inspect the data set for obvious problems. � Rename variables that have strange or unclear names. � Label your variables to make them more readable. � Fix case problems. � Delete unneeded records. � Use PROC FREQ to discover incorrect categorical values. � Check for and correct any duplicated variables. � Use PROC MEANS to check for unusual minimum or maximums. � Set those missing value codes. � Convert variables not in a correct format. � Search for and reconcile any duplicate records. 74 SAS ESSENTIALS -- Elliott & Woodward

THE COMPLETED MESSY FIX �The entire SAS code for fixing the MESSY data set

THE COMPLETED MESSY FIX �The entire SAS code for fixing the MESSY data set is in the file MESSY_ALL. SAS �Review this code to see how each fix builds on one another �See how this code provides an audit for the fixes, so you can verify them if needed, or illustrate how a change was performed to the data set. �EXAMINE THIS CODE: Use it as a template for fixing your own data sets. 75 SAS ESSENTIALS -- Elliott & Woodward

6. 5 SUMMARY � This chapter provides additional information on common programming topics for

6. 5 SUMMARY � This chapter provides additional information on common programming topics for the SAS language. The subjects covered are not exhaustive but were selected because they are often used for preparing data for analysis. Many more topics could have been covered, and readers are encouraged to refer to the SAS documentation for additional information. � Continue to Chapter 7: SAS® ADVANCED PROGRAMMING TOPICS PART 2 76 SAS ESSENTIALS -- Elliott & Woodward

These slides are based on the book: Introduction to SAS Essentials Mastering SAS for

These slides are based on the book: Introduction to SAS Essentials Mastering SAS for Data Analytics, 2 nd Edition By Alan C, Elliott and Wayne A. Woodward Paperback: 512 pages Publisher: Wiley; 2 edition (August 3, 2015) Language: English ISBN-10: 111904216 X ISBN-13: 978 -1119042167 These slides are provided for you to use to teach SAS using this book. Feel free to modify them for your own needs. Please send comments about errors in the slides (or suggestions for improvements) to acelliott@smu. edu. Thanks. 77 SAS ESSENTIALS -- Elliott & Woodward