Lesson 7 Topics Reading SAS data sets Subsetting
Lesson 7 - Topics • Reading SAS data sets • Sub-setting and merging SAS data sets • Permanent SAS data sets • Programs 10 -13 in course notes • LSB 2: 19 -21 3: 7 6: 1 -2, 4 -5, 9, 11 -13
Working With SAS Data Sets • Reading SAS dataset – SET Statement • Merging SAS datasets – MERGE Statement Done within a DATA step
SET STATEMENT • Reads SAS data set (one row at a time) • Replaces INFILE and INPUT statements used when reading in raw data • KEEP brings in selected variables (columns) • Where brings in selected observations (rows) DATA new; SET old (KEEP = varlist); WHERE = condition; RUN; This creates a new data set called new that has the variables in varlist and selected observations from old.
PROGRAM 10 Making SAS Datasets from Other SAS Datasets; DATA tdata; INFILE ‘C: SAS_Filestomhs. data' ; INPUT @ 1 ptid $10. @ 12 clinic $1. @ 25 group 1. @ 30 sex 1. @ 123 sbp 12 3. @ 14 randdate $10. ; RUN; * Making a new dataset containing only men; DATA men; SET tdata; * reads the existing dataset; WHERE sex = 1; This does the selection; if group in(1, 2, 3, 4, 5) then active = 1; else if group in(6) then active = 2; KEEP ptid clinic group sbp 12 randdate active; RUN;
* Making a new dataset containing only women; DATA women; SET tdata; WHERE sex = 2; if group in(1, 2, 3, 4, 5) then active = 1; else if group in(6) then active = 2; KEEP ptid clinic group sbp 12 randdate active; RUN; We now have 3 datasets “active” tdata men women
* Making both datasets in one data step; DATA men women; SET tdata; if group in(1, 2, 3, 4, 5) then active = 1; else if group in(6) then active = 2; if sex = 1 then OUTPUT men; else if sex = 2 then OUTPUT women; If SAS sees an OUTPUT statement then SAS will output only then; if there is no OUTPUT statement SAS outputs at the end of the data step. KEEP ptid clinic randdate group sbp 12 active; RUN; Partial Log: NOTE: There were 100 obs read from WORK. TDATA NOTE: The data wet WORK. MEN has 73 obs and 7 variables NOTE: The data set WORK. WOMEN has 27 obs and 7 variables
KEEP OPTION vs KEEP STATEMENT Purpose: Restricts variables read-in or written out DATA highbp; * Brings in only these variables; SET tdata (KEEP = ptid sex sbp 12); RUN; DATA highbp; SET tdata ; * Reads in all variables; * Writes out only these variables; KEEP ptid sex sbp 12; RUN; * There is also a DROP option/statement;
WHERE Versus Logical IF DATA highbp; * Brings in only certain rows; SET tdata; WHERE sbp 12 > 140; RUN; DATA highbp; SET tdata ; * Brings in all rows, outputs only certain rows (can be a new variable); IF sbp 12 > 140; * If true then continue; RUN;
PROGRAM 11 - Merging SAS Datasets DATA clinic; INFILE DATALINES; INPUT id $ sbp ; DATALINES; C 03615 115 B 00979 107 B 00644 138 D 01348 142 A 01088 117 B 01408 121 B 00025 130 B 00714 144 A 01166 113 … more data ; DATA lab; INFILE DATALINES; INPUT id $ glucose; DATALINES; C 03615 102 B 00644 089 D 01348 111 A 01088 093 B 01408 094 B 00025 077 B 00714 100 A 01166 113 D 00942 103 … more data ;
* Creating merged dataset; PROC SORT DATA= clinic; BY id; PROC SORT DATA= lab; BY id; DATA study; MERGE clinic lab; BY id ; RUN; Note: The BY statement is very important!
Obs 1 2 3 4 5 6 7 8 18 20 id A 00869 A 01088 A 01166 B 00025 B 00644 B 00714 B 00867 B 00979 D 00942 D 01809 Merged Dataset sbp glucose 110 117 113 130 138 144 114 107. 129 99 93 113 77 89 100 98. 103.
What if you want only observations that are in both datasets? DATA study; MERGE clinic (IN=in 1) lab (IN=in 2); BY id; if in 1 and in 2; RUN; PROC PRINT DATA=study; TITLE ‘Patients with Clinic and Lab'; RUN;
Logical Statements * Must be in 1 st dataset; if in 1; * Same as: if in 1 = 1; * Must be in 2 nd dataset; if in 2; * Must be in both datasets; if in 1 and in 2;
Things to Remember When Merging Datasets • Need to have common variable name in each dataset to use as linking variable • Variables in dataset with no match will be set to missing • Rows matched that have same variable names will be assigned right-most dataset value • Always remember the BY statement in the merge
Temporary vs Permanent SAS Datasets Temporary (or working) SAS dataset - After SAS session is over the dataset is deleted. DATA bp; * bp is deleted after SAS session; (rest of program) Permanent SAS dataset - After program is run the dataset is saved and is available for use in future programs. You need to tell SAS where to store/retrieve the dataset. Note: For PC SAS the working dataset is available until you end the SAS session.
Reasons to Create Permanent SAS Datasets • Read raw data and compute calculated variables only once • All variables have assigned names and labels. • Data is ready to be analyzed. • Dataset can be sent to other computers or users.
Creating a Permanent Dataset LIBNAME mylib ‘C: My SAS Datasets’; LIBNAME – assigns a directory (folder) reference name. In this example the directory ‘C: My SAS Datasets’ is assigned a reference name of mylib. DATA mylib. sescore; Tells SAS to create a dataset called sescore in the directory referenced by mylib, which is ‘C: My SAS Datasets’.
Examples of LIBNAME Statements LIBNAME mylib ‘C: My SAS Files'; LIBNAME class ‘C: My SAS Files' ; LIBNAME ph 6420 'C: My SAS FilesSASClass' ; LIBNAME points to a directory (folder) DATA mylib. datasetname; DATA class. datasetname; DATA ph 6420. datasetname; On UNIX and PC the file will be called datasetname. sas 7 bdat
PROGRAM 12 LIBNAME mylib ‘C: SAS_Files'; DATA mylib. sescore; INFILE INPUT @ 12 @ 14 @ 25 @ 49 @ 85 @ 97 @115 @123 @236 @276 ‘C: SAS_Filestomhs. data' LRECL =400; ptid $10. clinic $1. randdate mmddyy 10. group 1. educ 1. wtbl 5. 1 wt 12 5. 1 sbpbl 3. 0 sbp 12 3. 0 (sebl_1 -sebl_20) (1. +1) (se 12_1 -se 12_20) (1. +1) ;
wtd 12 sbpd 12 sescrbl sescr 12 sescrd 12 = = = wt 12 - wtbl; sbp 12 - sbpbl; MEAN (OF sebl_1 - sebl_20) ; MEAN (OF se 12_1 - se 12_20) ; sescr 12 - sescrbl ; LABEL educ = 'Highest Education Level'; LABEL wt 12 = 'Weight (lbs) at 12 Months'; LABEL wtbl = 'Weight (lbs) at Baseline'; LABEL wtd 12 = 'Weight Change at Baseline'; LABEL sbpbl = 'Systolic BP (mm. Hg) at Baseline'; LABEL sbp 12 = 'Systolic BP (mm. Hg) at 12 Months'; LABEL sbpd 12 = 'Systolic BP Change at 12 Months'; LABEL group = 'Treatment Group (1 -6)'; LABEL sescrbl = 'Side Effect at Baseline'; LABEL sescr 12 = 'Side Effect at 12 Months'; LABEL sescrd 12 = 'Side Effect Change Score'; FORMAT randdate mmddyy 10. ; DROP sebl_1 -sebl_20 se 12_1 -se 12_20 ;
60 LIBNAME mylib 'C: SAS_Files'; NOTE: Libref MYLIB was successfully assigned as follows: Engine: V 9 Physical Name: C: SAS_Files DATA mylib. sescore; NOTE: The infile 'C: SAS_Filestomhs. data' is: File Name=C: SAS_Filestomhs. data, RECFM=V, LRECL=400 NOTE: 100 records were read from the infile 'C: SAS_Filestomhs. data'. NOTE: The data set MYLIB. SESCORE has 100 observations and 14 variables.
PROC CONTENTS DATA=mylib. sescore VARNUM ; TITLE 'Description of Variables in Dataset SESCORE' ; RUN; What is inside a SAS dataset? Names, labels, and formats of all variables Data PROC CONTENTS reads the descriptor portion of the dataset
Description of Variables in Dataset SESCORE The CONTENTS Procedure Data Set Name: Member Type: Engine: Created: Last Modified: Protection: Data Set Type: Label: MYLIB. SESCORE DATA V 9 10: 59 Wednesday, August 11, 2004 Observations: Variables: Indexes: Observation Length: Deleted Observations: Compressed: Sorted: 100 14 0 112 0 NO NO -----Engine/Host Dependent Information----- File Name: C: SAS_Filessescore. sas 7 bdat Release Created: Host Created: File Size (bytes): 9. 1. 3 XP_PRO 24576 Note: mylib is not a part of the dataset name
Variables listed in creation order # Variable Type Len Pos Format Label --------------------------------------1 ptid Char 10 96 Patient ID 2 clinic Char 1 106 3 randdate Num 8 0 4 group Num 8 8 5 educ Num 8 16 Highest Education Level 6 wtbl Num 8 24 Weight (lbs) at Baseine 7 wt 12 Num 8 32 Weight (lbs) at 12 Months 8 sbpbl Num 8 40 Systolic BP (mm. Hg) at Baseline 9 sbp 12 Num 8 48 Systolic BP (mm. Hg) at 12 Months 10 wtd 12 Num 8 56 Weight Change at Baseline 11 sbpd 12 Num 8 64 Systolic BP Change at 12 Months 12 sescrbl Num 8 72 Side Efect at Baseline 13 sescr 12 Num 8 80 Side Efect at 12 Months 14 sescrd 12 Num 8 88 Side Efect Change Score Clinical Center MMDDYY 10. Randomization Date Treatment Group (1 -6) This becomes the documentation of the dataset
Using PROC COPY to copy work dataset to permanent dataset LIBNAME mylib ‘C: SAS_Files'; DATA sescore; …. RUN; PROC COPY IN=work OUT=mylib; SELECT sescore; RUN; Make a work dataset first – then when you know that is working correctly copy the work dataset to a permanent dataset.
PROGRAM 13 LIBNAME class ‘C: SAS_Files' ; * Tells SAS where to find the SAS dataset; PROC MEANS DATA=class. sescore ; TITLE 'Means of All Numeric Variables on SAS Permanent Dataset'; RUN; PROC CORR DATA=class. sescore; VAR wtd 12 sbpd 12 sescrd 12; TITLE 'Correlation Matrix of 3 Change Variables'; RUN; What if dataset was moved to a different folder? Just need to change LIBNAME
Means of All Numeric Variables on SAS Permanent Dataset The MEANS Procedure Variable Label N Mean ---------------------------------randdate Randomization Date 100 10101. 29 group Treatment Group (1 -6) 100 3. 62 educ Highest Education Level 99 6. 00 wtbl Weight (lbs) at Baseline 100 191. 76 wt 12 Weight (lbs) at 12 Months 92 180. 33 sbpbl Systolic BP (mm. Hg) at Baseline 100 139. 92 sbp 12 Systolic BP (mm. Hg) at 12 Months 92 124. 04 wtd 12 Weight Change at Baseline 92 -11. 53 sbpd 12 Systolic BP Change at 12 Months 92 -15. 64 sescrbl Side Effect at Baseline 100 1. 19 sescr 12 Side Effect at 12 Months 95 1. 16 sescrd 12 Side Effect Change Score 95 -0. 03 ---------------------------------
Pearson Correlation Coefficients Prob > |r| under H 0: Rho=0 Number of Observations wtd 12 sbpd 12 sescrd 12 1. 00000 0. 23986 0. 15341 0. 0213 0. 1443 92 92 92 0. 23986 1. 00000 0. 05679 Weight Change at Baseline sbpd 12 Systolic BP Change at 12 Months sescrd 12 Side Efect Change Score 0. 0213 0. 5908 92 92 92 0. 15341 0. 05679 1. 00000 0. 1443 0. 5908 92 92 95
*--------------------------------* Often you will read the permanent SAS dataset in a DATA step to modify or add variables. Usually these will be put on a new temporary SAS dataset. The SET statement reads a SAS dataset *--------------------------------*; LIBNAME class 'C: SAS_Files' DATA rxdata; SET class. sescore; if group in(1, 2, 3, 4, 5) then rx = 1; else rx = 2; RUN; PROC MEANS DATA=rxdata N MEAN MAXDEC=2 FW=7; CLASS group; VAR sbpd 12 wtd 12 sescrd 12; TITLE 'Change in SBP, Weight, and Side Effect Score by Treatment'; RUN;
- Slides: 29