HRP 223 2008 HRP 223 2008 Topic 6

  • Slides: 66
Download presentation
HRP 223 - 2008 HRP 223 2008 Topic 6 – Relational Data Copyright ©

HRP 223 - 2008 HRP 223 2008 Topic 6 – Relational Data Copyright © 1999 -2008 Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law.

HW 2 HRP 223 2008 § SORRY! I apologize for not getting it posted

HW 2 HRP 223 2008 § SORRY! I apologize for not getting it posted before yesterday! It is due in two weeks. – The new datasets have less variables and one variable renamed. You want to change new. ID to have the same name as the old subject ID. You can put a rename command on the line that does the import: proc import out = wide (rename = (dude = subject. ID)) datafile = "C: ProjectsclassesHRP 2232008day 6wide. Dx. xls" replace; mixed = yes; sheet = "Sheet 1"; run;

Flat Files HRP 223 2008 § Some people try to store all their data

Flat Files HRP 223 2008 § Some people try to store all their data in a single file. This causes lots of extra work because of holes in the tables and repeated information. § Both problems can be fixed by a relational model. – Split the data into many tables. § You need to use SQL to work with data split across multiple tables.

Not Normalized HRP 223 2008 § I frequently get data, from people who are

Not Normalized HRP 223 2008 § I frequently get data, from people who are not professional programmers, where the diagnosis data is organized “wide” across the page. Where the first diagnosis is in the first column, the second is in the second, etc. and the task is to find or fix a diagnosis.

Subsetting Based on 5 Variables HRP 223 2008

Subsetting Based on 5 Variables HRP 223 2008

SQL vs. Datastep § The GUI generates this code: HRP 223 2008 § Or

SQL vs. Datastep § The GUI generates this code: HRP 223 2008 § Or you could write a little data step program:

Change All 9 s to 999 s? § It is a lot of clicking.

Change All 9 s to 999 s? § It is a lot of clicking. HRP 223 2008

Code § The SQL is a bit complicated HRP 223 2008

Code § The SQL is a bit complicated HRP 223 2008

As Data Step HRP 223 2008 § If it is more than 5 columns,

As Data Step HRP 223 2008 § If it is more than 5 columns, things get unruly. Imagine doing this across 20 possible diagnoses. There is an easy solution in data step code. § First, the SQL code can be done easily in a data step.

A List HRP 223 2008 § As you can see, there is a list

A List HRP 223 2008 § As you can see, there is a list of variables and you are doing the same things over and over. § You want to make a list called dx and have the 1 st element refer to dx 1, the 2 nd thing refer to dx 2, etc. The concept of a named list of variables or an alias to a bunch of variables is instantiated as an array.

Arrays § A major improvement…. . Ummmm. HRP 223 2008 § You want to

Arrays § A major improvement…. . Ummmm. HRP 223 2008 § You want to process the same one line over and over. You need to count from 1 to 5…. Sounds like a loop.

Change Lots of Things HRP 223 2008 § If you have an array, you

Change Lots of Things HRP 223 2008 § If you have an array, you can process wide files easily.

Restructuring with Arrays HRP 223 2008 § You can use similar code to restructure

Restructuring with Arrays HRP 223 2008 § You can use similar code to restructure data so that you have only a couple of columns of data. § Add a new column that is called dx. Num and another called the. DX. Those two columns plus the subject ID number can contain the same information without all the “holes”.

How does that work? HRP 223 2008 § Go through all five variables, one

How does that work? HRP 223 2008 § Go through all five variables, one at a time. § If the variable is not missing, you need to do three things: – Copy the diagnosis counter number into the dx. Num variable. – Copy the diagnosis code number into the variable called the. Dx. – Write to the new data set.

Repeated Ifs HRP 223 2008 § This is a lot of typing and it

Repeated Ifs HRP 223 2008 § This is a lot of typing and it obscures the fact that you are doing three things if a condition is true:

do end HRP 223 2008 § You have seen do statements in the context

do end HRP 223 2008 § You have seen do statements in the context where you do stuff over and over. There is also a do end command for when you need to do a block of instructions if a condition is true. You need both do and end

Actual Code HRP 223 2008

Actual Code HRP 223 2008

Normalization Part 2 HRP 223 2008 § I got data where I needed to

Normalization Part 2 HRP 223 2008 § I got data where I needed to analyze age for people who have a particular diagnosis. The data was a not-normalized mess:

Normalization Part 2 The Wrong Way HRP 223 2008 § If your database is

Normalization Part 2 The Wrong Way HRP 223 2008 § If your database is like this, you need code like this: data bad 2; set bad; if (dob 1 ne. and not missing(dx 1)) then do; if code 1= 22 then Is. Case 1=1; You will end up with the same code else Iscase 1=0; repeated as many times as you have end; repetitions. if (dob 2 ne. and not missing(dx 2)) if code 2=22 then Is. Case 2=1; end; if (dob 3 ne. and not missing(dx 3)) if code 3=22 then Is. Case 3=1; end; if (dob 4 ne. and not missing(dx 4)) if code 4=22 then Is. Case 4=1; end; if (dob 5 ne. and not missing(dx 5)) if code 5=22 then Is. Case 5=1; end; run; then do; else Iscase 2=0; then do; else Iscase 3=0; then do; else Iscase 4=0; then do; else Iscase 5=0;

Normalization Part 2 The Right Way § Instead, you should have a record in

Normalization Part 2 The Right Way § Instead, you should have a record in a table corresponding to each repetition. § With code like this: data good 2; set good; if code= 22 then is. Case 1=1; else is. Case 1=0; run; HRP 223 2008

§ Your first attempt could go something like this: data normal 1 (keep =

§ Your first attempt could go something like this: data normal 1 (keep = sid mid dob dx code); set bad; format dob dx mmddyy 8. ; HRP 223 if (dob 1 ne. and dx 1 ne. and code 1 ne. ) then do; 2008 mid = 1; dob = dob 1; But you end up with dx = dx 1; code = code 1; output; just as many blocks end; of code. if (dob 2 ne. and dx 2 ne. and code 2 ne. ) then do; mid = 2; dob=dob 2; dx=dx 2; code=code 2; output; end; if (dob 3 ne. and dx 3 ne. and code 3 ne. ) then do; mid=3; dob=dob 3; dx=dx 3; code=code 3; output; end; if (dob 4 ne. and dx 4 ne. and code 4 ne. ) then do; mid=4; dob=dob 4; dx=dx 4; code=code 4; output; end; if (dob 5 ne. and dx 5 ne. and code 5 ne. ) then do; mid=5; dob=dob 5; dx=dx 5; code=code 5; output; end; run;

Setting up Aliases (Arrays) HRP 223 2008 § What you want is a way

Setting up Aliases (Arrays) HRP 223 2008 § What you want is a way to repeat this code over the five sets of variables: if (dob 1 ne. and dx 1 ne. and code 1 ne. ) then do; mid = 1; dob = dob 1; dx = dx 1; code = code 1; output; end; § You need: – A dob alias (dob_a) to refer to dob 1, dob 2, dob 3, dob 4 and dob 5 – A dx alias (dx_a) to refer to dx 1, dx 2, dx 3, dx 4 and dx 5 – A code alias (code_a) to refer to code 1, code 2, code 3, code 4 and code 5

Setting up Aliases (Arrays) HRP 223 2008 data normal 2 a; set bad; array

Setting up Aliases (Arrays) HRP 223 2008 data normal 2 a; set bad; array dob_a dob 1 -dob 5; array dx_a dx 1 -dx 5; array code_a code 1 -code 5; This sets up the arrays but they are not used in this program. if (dob 1 ne. and dx 1 ne. and code 1 ne. ) then do; mid = 1; dob = dob 1; dx = dx 1; code = code 1; output; end; run;

Setting up Aliases (Arrays) HRP 223 2008 data normal 2 a; set bad; array

Setting up Aliases (Arrays) HRP 223 2008 data normal 2 a; set bad; array dob_a dob 1 -dob 5; array dx_a dx 1 -dx 5; array code_a code 1 -code 5; if (dob_a[1] ne. and dx_a[1] ne. and code_a[1] ne. ) then do; mid = 1; dob = dob_a[1]; dx = dx_a[1]; code = code_a[1]; output; end; run;

Setting up Aliases (Arrays) HRP 223 2008 data normal 2 c (keep = sid

Setting up Aliases (Arrays) HRP 223 2008 data normal 2 c (keep = sid mid dob dx code); set bad; array dob_a dob 1 -dob 5; array dx_a dx 1 -dx 5; array code_a code 1 -code 5; do c = 1 to 5 by 1; if (dob_a[c] ne. and dx_a[c] ne. and code_a[c] ne. ) then do; mid = c; dob = dob_a[c]; dx = dx_a[c]; code = code_a[c]; output; end; run;

Arrays HRP 223 2008 § You can tell SAS that a set of variables

Arrays HRP 223 2008 § You can tell SAS that a set of variables are related by putting them into an array statement. § Arrays in SAS are not like arrays in other languages like BASIC or C. SAS arrays are only aliases to an existing set of variables. They are created using the array statement: My notation for arrays array times_a [365] time 1 -time 365; An optional size of the array What the array refers to

Arrays(2) HRP 223 2008 § If your array references variables that do not exist,

Arrays(2) HRP 223 2008 § If your array references variables that do not exist, they will be created. Make sure to use the $ if you intend to create character variables. § If you want to reference all numeric variables between the. Value and thingy 2, do it like this: array x the. Value -- thingy 2 _numeric_; -- means all values between and including the starting and ending variables - indicates the numeric sequence starting with the first variable and ending with the second

SQL and Colors HRP 223 2008 § You may have noticed that the guys

SQL and Colors HRP 223 2008 § You may have noticed that the guys who made the enhanced editor don’t know SQL commands because some of the key words were not colorized. There are lots of them, but they can be easily fixed.

Fix Color HRP 223 2008 § Go to Tools > Options …. > SAS

Fix Color HRP 223 2008 § Go to Tools > Options …. > SAS Programs and then click Editor Options… then User Defined Keywords

Missing Words § Add – – – – – calculated coalesce corresponding except full

Missing Words § Add – – – – – calculated coalesce corresponding except full group inner intersect join – – – – left on or order outer right union HRP 223 2008

Minimal SQL § Print a report showing the contents of variables from a single

Minimal SQL § Print a report showing the contents of variables from a single data set. HRP 223 2008 Put a comma-delimited list of variables here. Specify a library. table here.

What variables? HRP 223 2008 § Use an * to indicate that you want

What variables? HRP 223 2008 § Use an * to indicate that you want all variables instead of typing them all. § There is no syntax to specify variables based on position in the source files. That is, you can not specify that you want to select the 2 nd and 7 th variables (from left to right) or to select the first 3 variables.

Likely Tweaks HRP 223 2008 § You can rename a variable in the list

Likely Tweaks HRP 223 2008 § You can rename a variable in the list with an as statement. § You can also specify variable formats and labels.

More Tweaks HRP 223 2008 § The from line references tables which are in

More Tweaks HRP 223 2008 § The from line references tables which are in libraries. Complex queries require you to reference the table name over and over again. Instead of having to type the long library and dataset names repeatedly, you can refer to the files as an alias. Print the column called dude from the table reported. Cancers which is in the ov. Cancer library. Here the c. is optional because dude is only in one table (the query only uses one table).

Stacking HRP 223 2008 § You already know how to use proc append or

Stacking HRP 223 2008 § You already know how to use proc append or the Data > Append Table menu item to combine two sets of data on top of one another. How do you “copy/paste” to insert columns from one table into another?

The GUI can do easy SQL. HRP 223 2008 § You could write data

The GUI can do easy SQL. HRP 223 2008 § You could write data step or proc sql code. § Happily, most of the merges you need are in the graphical user interface.

How are tables linked? HRP 223 2008 § You need to tell it who

How are tables linked? HRP 223 2008 § You need to tell it who is matched with whom in the tables. If you have a demographics table and a disease table, you need to specify which column says which disease belongs to which person. In this case you would say match on the subject ID numbers in the two tables using a key column.

Inner Join HRP 223 2008 § If you want records where there is a

Inner Join HRP 223 2008 § If you want records where there is a match in both tables, you want an inner join (aka, equijoin or natural join). – For example, which subjects have demographic and cancer information?

HRP 223 2008

HRP 223 2008

Alternate Syntax HRP 223 2008 This is what I write.

Alternate Syntax HRP 223 2008 This is what I write.

All Information from the Left Table HRP 223 2008 § If you want all

All Information from the Left Table HRP 223 2008 § If you want all the demographics, as well as the cancers if they occur:

Left Join Code HRP 223 2008

Left Join Code HRP 223 2008

All Information from the Right Table HRP 223 2008

All Information from the Right Table HRP 223 2008

§ If you wanted the cancers info plus demographics where there were any: HRP

§ If you wanted the cancers info plus demographics where there were any: HRP 223 2008

Right Join Code HRP 223 2008

Right Join Code HRP 223 2008

Full Join § If you wanted all information: HRP 223 2008 It would be

Full Join § If you wanted all information: HRP 223 2008 It would be nice if you could combine the two dude variables so the first not –missing value was used.

Full Join Code HRP 223 2008

Full Join Code HRP 223 2008

Coalesce HRP 223 2008 § Coalesce says take the first not-missing value from the

Coalesce HRP 223 2008 § Coalesce says take the first not-missing value from the set of variables.

Checking for ID Numbers with SQL HRP 223 2008 § A task that I

Checking for ID Numbers with SQL HRP 223 2008 § A task that I need to do frequently is to build a list of all subject IDs when data is coming from multiple sources. – List IDs with duplicates. – List unique ID numbers. – List who is in both files. – List who is in one file but not the second. – Make a summary showing all IDs and an indicator for who appears where.

PROC SQL - Set Operators NO GUI § Outer Union Corresponding – concatenates §

PROC SQL - Set Operators NO GUI § Outer Union Corresponding – concatenates § Unions – unique rows from both queries § Except – rows that are part of first query § Intersect – rows common to both queries HRP 223 2008

outer union corresponding § You can concatenate data files. § I rarely use it.

outer union corresponding § You can concatenate data files. § I rarely use it. § proc sql; create table is. Outer as select dude from baseline outer union corresponding select dude from followup; quit; HRP 223 2008

union HRP 223 2008 § You can also concatenate data files and keep unique

union HRP 223 2008 § You can also concatenate data files and keep unique records: proc sql; create table is. Union as select dude from baseline union select dude from followup; quit;

except HRP 223 2008 § Say you needed everyone who did not come back.

except HRP 223 2008 § Say you needed everyone who did not come back. Start out with the baseline group and remove the people who came back. proc sql; select id from baseline except select id from followup; quit;

intersect HRP 223 2008 § Say you wanted to know who came back. In

intersect HRP 223 2008 § Say you wanted to know who came back. In other words, what IDs are in both files? proc sql; select id from baseline intersect select id from followup; quit;

PROC SQL - Set Operators HRP 223 2008 § When you have tables (with

PROC SQL - Set Operators HRP 223 2008 § When you have tables (with more than one column) with the same structure, you can combine them with these set operators. – Be extremely careful because SAS/SQL is forgiving about the structure of the tables and you may notice problems in the data. – For this to work as intended, the two tables must have the same variables, in the same order, and the variables must be of the same type (variables with the same name must both be character or both be numeric). Use the key word corresponding to have it match like named variables.

corresponding HRP 223 2008 § The columns do not need to have matching names

corresponding HRP 223 2008 § The columns do not need to have matching names or even the same length and it will still operate on them. § Use correponding to help spot this problem.

Summary Table HRP 223 2008 § Say you have two or more files and

Summary Table HRP 223 2008 § Say you have two or more files and they are supposed to have the same subject IDs. How do you make a summary table showing who has information in each table? – Make a master list that has all people regardless of the source file. – Add an indicator column with the value 1 where the subject ID in a table matches the master ID table. – Add in a second column with the value 1 where the subject ID in the second file matches the master ID.

Make Some Data HRP 223 2008 Make a file with 100 random numbers between

Make Some Data HRP 223 2008 Make a file with 100 random numbers between 1 and 100 (you can get the same number more than once) and an indicator variable holding the value 1. Sort the data and remove the duplicates.

In Code HRP 223 2008

In Code HRP 223 2008

Subquery HRP 223 2008 § In real life the tables that you are comparing

Subquery HRP 223 2008 § In real life the tables that you are comparing will not include a convenient variable that is holding “ 1”. You can have SQL make a new variable easily enough: Notice there is no column indicating it is in. Day 1. Add in a column called in. Day 1 with the value 1 for everyone.

Subquery HRP 223 2008 You can do this with a single query.

Subquery HRP 223 2008 You can do this with a single query.

Order HRP 223 2008 § Notice that the data is never put into order.

Order HRP 223 2008 § Notice that the data is never put into order. In this case, it ended up ordered correctly because of the union statement. You can explicitly request having the data sorted so you do not need to use the Data > Sort Data… menu. Just add an order by clause.

Working with Repeated Keys HRP 223 2008 § A file tracking diagnoses or treatments

Working with Repeated Keys HRP 223 2008 § A file tracking diagnoses or treatments will have multiple records for some people. – If you want to count the number of records for a person, specify what variable(s) are used to group by. – Count records in the group with count(*) or count not missing values with count(variable. Name)

Joining on Duplicated Keys HRP 223 2008 § If you join tables that have

Joining on Duplicated Keys HRP 223 2008 § If you join tables that have duplicated key values, you will end up with lots of records. Specifically, the new table will have as many records as the sum of the product of the two key counts. 2 in appt * 2 in dx = 4 records 2 in appt * 4 in dx = 8 records

distinct HRP 223 2008 § The word distinct removes duplicates. § If you want

distinct HRP 223 2008 § The word distinct removes duplicates. § If you want the IDs of people who had any records in both tables, use distinct.

Joint Keys HRP 223 2008 § Sometimes you need to use more than one

Joint Keys HRP 223 2008 § Sometimes you need to use more than one variable to indicate which records match across tables. For example, if you use both pedigree numbers and family member numbers in tables to identify people, you need to use both these pedigree ID number and accession number variables to join tables.