Best of Both Worlds DATA Step and PROC

Best of Both Worlds: DATA Step and PROC SQL Joins Sunil Gupta Sunil@Gupta. Programming. com www. SASSavvy. com 1

DATA Step vs. PROC SQL Camps DATA Step Die-Hards Factors: First learned in SAS, First encounter of flexibility, First frustration of complexity PROC SQL Die-Hards Happy Camper – Use the best of both worlds 2

Factors to Consider when Merging Data Sets 1. Number of Observations per Key (One/Many Data Sets) 2. Direction of Merge (Left/Right Data Set) 3. Common Non-BY Variables (Yes/No) 4. BY statement (Yes/No) 5. Your strengths – DATA Step, Proc SQL 6. DATA Step and Proc SQL defaults, strengths 7. Many-to-Many join (With/without By Variables) – DATA Step and Proc SQL defaults, strengths Goal is to link related records from two or more datasets into one dataset. Generally, along with the key variables, conditions are included to control how related records are linked.

Data Step vs. Proc SQL: Issues with common variables between data sets data B; data A; input patno source $ age; $ gender $; cards; 2 B 45 2 B 40 1 A male 4 B 35 2 A female ; 3 A male run; 5 A male ; Typical one-to-many merge, source is the common non-by variable run; 4

Issues in Merging Data Sets: Output Data Sets A B all. A Anot. B Aand. B Bnot. A Six possible data sets 5 All. B

Issues in Merging Data Sets: Six Output Data Sets data allab alla allb aandb anotb bnota; merge A (in=a) B (in=b); by patno; All six possible output allab; output data sets if if if run; a b a a b then output alla; then output allb; and b then output aandb; and not b then output anotb; and not a then output bnota; 6

Issues in Merging Data Sets: Six Output Data Sets NOTE: There were 4 observations read from the data set WORK. A. NOTE: There were 3 observations read from the data set WORK. B. NOTE: NOTE: WORK. ALLAB has 6 observations and 4 variables. WORK. ALLA has 5 observations and 4 variables. WORK. ALLB has 3 observations and 4 variables. WORK. AANDB has 2 observations and 4 variables. WORK. ANOTB has 3 observations and 4 variables. WORK. BNOTA has 1 observations and 4 variables. Helpful to identify missing or non-matching records 7

Issues in Merging Data Sets: Six Output Data Sets Data Set ALLAB ALLA # records All rows One sided ALLB AANDB Matching 6 DATA 1, 2, 2, 3, 4, 5 Step 5 1, 2, 2, 3, 3 2, 2, 2 Proc 2, 2 SQL ANOTB BNOTA Patno Values 3 Non. Matching 1 1, 5 4 3, 5 4 8

Issues in Merging Data Sets: Six Output Data Sets Data Set Merge; By; Proc SQL; No conditions Full Outer Join ALLA If A; Left Outer Join ALLB If B; Right Outer Join AANDB If A and B; Inner Join ANOTB If A and not B; Except BNOTA If B and not A; Except ALLAB 9

Merge Two Data Sets - Common non-By variable: Attributes and Value A: First Length, Format, Type, Label B: Second Common Variables Prevent applying Data Value First and Second Data Sets Merge: First and Second Data Sets Apply with caution Data Value, Length, Format, Type, Label First Data Set Proc SQL: First Data Set 10

Four Main PROC SQL components: Columns, Joins, Conditions, & Sorts proc sql; select name, sex from sashelp. class where sex = 'F' order by name; quit; Joins Columns Conditions Sorts 80% of PROC SQL’s query can be broken down to 20% of the syntax 11

Data Step vs. Proc SQL PROC SQL: Outer Joins (Matching Records) LEFT RIGHT FULL 12

Data Step vs. Proc SQL PROC SQL ØLink related information. ØNo need to presort data sets. ØThe INNER join combines common records. ØThe OUTER join has three different types: LEFT, RIGHT and FULL join. ØBest method for cartesian joins. ØKey variables names can be different. 13

Proc SQL: Example proc sql; create table aandb as select * from A as a, B as b where a. patno = b. patno; quit; Default: Inner Join 14

Proc SQL: Inner Join Output ( A, B ) (A) (B) Obs patno source gender age 1 2 A female 45 Different from 2 2 A female 40 Match-Merge 2 matching records: patno = 2 source = ‘A’ (Common non-by variable, value from first data set) 15

Data Step vs. Proc SQL: Missing data A; input patno source $ gender $; cards; Missing data 1 A male 2. female 3 A male 5 A male ; run; data B; input patno source $ age; cards; 2 B 45 2 B 40 4 B 35 ; run; 16

Proc SQL: Inner Join Output ( A, B ) Obs patno source 1 2 2 2 (A) (B) gender age female 45 female 40 Caution: source = ‘ ’ (If source is missing in first data set even if source in second data set is non-missing) 17

Proc SQL: Warning WARNING: Variable patno already exists on file WORK. AANDB. WARNING: Variable source already exists on file WORK. AANDB. NOTE: Table WORK. AANDB created, with 2 rows and 4 columns. 30 quit; Warning for two common variables: patno, source Can use NOWARN to prevent WARNING message 18

Proc SQL: Issues with common variables between data sets of the original data sets proc sql; create table allab as Coalesce() select coalesce(a. patno, b. patno) as patno, coalesce(a. source, b. source) as source, a. gender, b. age from A as a Full Outer Join: full join B as b Resembles Match Merge on a. patno = b. patno; quit; 19

Proc SQL: Full Outer Join Output ( A, B ) Obs patno source 1 1 A 2 2 A 3 2 A 4 3 A 5 4 B 6 5 A No matching patno from B Data Set (A) (B) gender age male. female 45 female 40 male. 35 male. Warning free SAS Log Matching patno: source = ‘A’ (First Data Set) Different from Match-Merge Missing values in A or B 20

Many-to-many Merging Options • Best to mix only apples together Apply formula to calculate # of records • Apply date range for more focused links • Bad to mix apples and oranges With BY, Without WHERE Without BY, Without WHERE With BY, With WHERE Without BY, With WHERE • Unrelated records for date range 21

1) PROC SQL Joins: Many-to-Many Join proc sql; create table many_to_many_sql as select b. patno, b. name, b. weight, c. height from b, c where b. patno = c. patno; quit; Key_Var patno 1 1 2 2 4 Many: B (n=5) Comm_Var Non_Comm_B name weight Tim 140 150 Bell 130 Bell 160 Sue 160 Key_Var patno 1 1 1 2 2 2 4 Best to use PROC SQL instead of DATA Step C (n=7) Comm_Var name Tim Tim Bell Sue Non_Comm_C height 4. 1. 5. 1. 6 6. 2 5. 4

1) PROC SQL Joins: Many-to-Many C (n=7) Many: B (n=5) Join Key_Var Comm_Var Non_Comm_C Key_Var Comm_Var Non_Comm_B patno 1 1 2 2 4 name Tim Bell Sue PATNO B C 1 2 3 2 2 3 4 11 patno name PROC 1 Tim SQL 1 Tim Join 2 Bell 4 Sue Many_To_Many B to C (PROC SQL Cartesian Product) (n=13) Key_Var Comm_Var Non_Comm_B Non_Comm_C patno name weight height 1 Tim 140 4. 1 1 150 4. 1 1 Tim 140. 1 150. 1 Tim 140 5. 1 1 150 5. 1 2 Bell 130. 2 Bell 160. 2 Bell 130 6 2 Bell 160 6 2 Bell 130 6. 2 2 Bell 160 6. 2 4 Sue 160 5. 4 weight 140 150 130 160 height 4. 1. 5. 1. 6 6. 2 5. 4 All possible combinations of B and C by patno for a total of 13 records – 6 patno=1, 6 patno=2 and 1 patno=4

DATA Step or Proc SQL? 6 records PATNO=1 Correct join and records from Proc SQL 6 records PATNO=2 = # records/ PATNO X # records/ PATNO 1 record PATNO=4 24

DATA Step or Proc SQL? data m_to_m; merge b c; by patno; method 1='DATA Step'; run; Incorrect join and records from DATA Step Keeps the maximum number of records per PATNO from 25 datasets

DATA Step or Proc SQL? data m_to_m; merge b c; by patno; method 1='DATA Step'; run Key_Var patno 1 1 2 2 4 Many: B (n=5) Comm_Var Non_Comm_B name weight Tim 140 150 Bell 130 Bell 160 Sue 160 Incorrect join and records from DATA Step Key_Var patno 1 1 1 2 2 2 4 C (n=7) Comm_Var name Tim Tim Bell Sue Non_Comm_C height 4. 1. 5. 1. 6 6. 2 5. 4

DATA Step or Proc SQL? Method 1 = ‘DATA Step’ Incorrect join and records from DATA Step Method 2 = ‘Proc SQL’ is correct 27

2) Merging AE with Con. Meds 28

General DATA STEP Merging Best Practices 1. Requires pre-sorting data sets by BY variables or using indexes. 2. In general, specify a BY statement to link related data and prevent one-to-one merge. 3. Always have same attributes for each common BY variable to prevent BY variable truncation. 4. Remember that length for common non-BY variables are stored from the first data set. 5. Remember that values for common non-BY variables are stored from the second data set. 6. In general, exclude common non-BY variables to prevent SAS from overwriting common variables. 7. Set option MSGLEVEL=I to be aware of any common non-By variables being overwritten.

DATA Step Merge/PROC SQL Joins PROC SQL allows for joining by different variable names 30

Comparing Data Step with PROC SQL Data Step (SAS Programming) PROC SQL (Database Programming) SAS Functions, Data set options SAS Functions, COALESCE()*, Data set options If-Then Statements, Case-Select Clause Do Loop, Output Joins can simulate Do Loop Space to separate variables Comma to separate variables New variable = valid expression; Valid expression AS new variable IF/Where Statements WHERE for details/HAVING for summaries Multiple SAS Statements One SAS Statement By default, includes all variables By default, excludes all variables Many-to-many merge Cartesian Product is better By default, If A or B; Full Outer By default, If A and B; Inner Join Can recycle data set names Requires new data set names, else get a warning N/A Unique PROC SQL keywords

No PROC SQL Equivalent DATA Step Only FIRST (DOT) OUTPUT LAG(), DIF() Description By Group Processing Ex. if FIRST. SEX then sexcount + 1; Write record to multiple datasets Ex. OUTPUT MYDATA; Access values from previous record Ex. PREV_WEIGHT = LAG(WEIGHT); ARRAYS DO LOOPS Create and process arrays Ex. ARRAY VAR(3) $ VAR 1 – VAR 3; Create and process do-loops Ex. DO I = 1 TO 10; _ALL_, Reserved Keywords _CHARACTER_, Ex. KEEP _NUMERIC_; _NUMERIC_

Unique PROC SQL Keywords Keyword Description Creating new columns AS CALCULATED Ex. ((weight/sum(weight))*100) as wpercent Referencing new columns after being specified Ex. where calculated wgroup = 'high' DISTINCT/ UNIQUE Displaying unique combination of variables Ex. distinct patno Creating macro variables INTO : / SEPARATED BY Ex. sum(weight) into : wsum CHAR / DATE / NUM Variable type when creating tables and variables Ex. create table mydata (client char(25) format=$25. label ='Client‘);

Building Datasets with PROC SQL Tip - Notice that COALESCE() keeps any non-missing value and when specified with A. * may cause SAS to display warning messages. Try to trouble-shot Proc SQL code one clause at a time since messages may not be descriptive.

Building Datasets with PROC SQL Tip – For the final Proc SQL, order each line as follows – each dataset and then derived variables referencing calculated variables, format, length and label. Option to keep source variables to confirm calculated variables in following PROC FREQ.

DATA Step and PROC SQL Joins to Create datasets 1) PROC SQL Create intermediate datasets keeping only the variables required if keep list is shorter than drop list. 2) DATA Step Create intermediate datasets requiring unique DATA step statements such as FIRST. or LAG() or keep list is longer than drop list. 3) PROC SQL Keep all variables, LEFT JOIN with one master dataset to combine all intermediate datasets without presorting. Tip – Build and confirm one dataset at a time. Do not put do too much in one PROC SQL. As needed for post processing, can apply dataset options in PROC SQL or DATA Steps to subset, 36 rename, drop or keep variables.

Free PROC SQL Mind Maps – www. SASSavvy. com 37

Free PROC SQL Mind Maps 38

Download PROC SQL Summary Sheet from sassavvy. com/SAS_Tips 39