Data Transformation Data cleaning Importing Data Reading data

  • Slides: 31
Download presentation
Data Transformation Data cleaning

Data Transformation Data cleaning

Importing Data • • Reading data from external formats Libname/Infile/Input for text form data

Importing Data • • Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access data ODBC for external database data

Importing an Excel Spreadsheet PROC IMPORT OUT= WORK. Fall 2007 DATAFILE= "L: Data. Warehousing

Importing an Excel Spreadsheet PROC IMPORT OUT= WORK. Fall 2007 DATAFILE= "L: Data. Warehousing 07 fCourse. DatabaseFall 2007. xls" DBMS=EXCEL REPLACE; SHEET="'Fall 07$'"; GETNAMES=YES; MIXED=NO; SCANTEXT=YES; USEDATE=YES; SCANTIME=YES; RUN;

Import an Access Table PROC IMPORT OUT= WORK. Order. Line DATATABLE= "Order. Lin" DBMS=ACCESS

Import an Access Table PROC IMPORT OUT= WORK. Order. Line DATATABLE= "Order. Lin" DBMS=ACCESS REPLACE; DATABASE="I: Data. Warehousing 07 fWholesale. Product s. mdb"; SCANMEMO=YES; USEDATE=NO; SCANTIME=YES; RUN;

 Good Practice • Check the metadata for a dataset PROC CONTENTS DATA= Order.

Good Practice • Check the metadata for a dataset PROC CONTENTS DATA= Order. Line; RUN; • Print a few records PROC PRINT DATA= Order. Line (OBS= 10); RUN;

Saving SAS Datasets LIBNAME course "L: Data. Warehousing 07 fCourse. Database"; Data course. Spring

Saving SAS Datasets LIBNAME course "L: Data. Warehousing 07 fCourse. Database"; Data course. Spring 2008; set spring 2008; run; Note: the name associated with the libname command (“course”) must be 8 characters or less.

LIBNAME / INFILE / INPUT for character data • LIBNAME identifies the location or

LIBNAME / INFILE / INPUT for character data • LIBNAME identifies the location or folder where the data file is stored • INFILE specifies the libname to use for reading external data. • INPUT reads text format data • SET reads SAS data

INFILE with INPUT for character data files DATA Fitness; INFILE "L: Data. Warehousing 07

INFILE with INPUT for character data files DATA Fitness; INFILE "L: Data. Warehousing 07 fTransformation. SASSAS 1. txt"; INPUT NAME $ WEIGHT WAIST PULSE CHINS SITUPS JUMPS; run;

Creating Derived Attributes Generating new attributes for a table. SAS creates attributes when they

Creating Derived Attributes Generating new attributes for a table. SAS creates attributes when they are referred to in a data step. The metadata depends on the context of the code. • LENGTH statements • FORMATS and INFORMATS • PUT • INPUT

PUT and INPUT Functions Text. Output = PUT(variable, format) Note: the result of a

PUT and INPUT Functions Text. Output = PUT(variable, format) Note: the result of a put function is always character Note: there is also a PUT statement that writes the contents of a variable to the SAS log Output = INPUT(Character. Input, informat) Note: the variable for an input function is always character

Formats • Formats always contain a period • Formats for character variables always start

Formats • Formats always contain a period • Formats for character variables always start with a $ • The most used format categories are Character, Date and Time, and Numeric Note: use the SAS “search” tab to look for “Formats. ” For a list of SAS formats look under: “Formats: Formats by Category”

 Good Practice The following code is handy for testing functions and formats in

Good Practice The following code is handy for testing functions and formats in SAS. The _Null_ dataset name tells SAS not to create the datset in the WORK library Data _Null_; Input. Val= 123; Output. Val= PUT(Input. Val, Roman 30. ); PUT Input. Val Output. Val; run;

Generating Dates • Generating a Date dimension • Usually done offline in something like

Generating Dates • Generating a Date dimension • Usually done offline in something like Excel • SAS has extensive date and datetime functions and formats • SAS formats apply to only one of datetime, date or time variable types. Convert from one type to another with SAS functions.

Creating a text variable for Date Data Orders 2; Length Date $10. ; Set

Creating a text variable for Date Data Orders 2; Length Date $10. ; Set Orders; Date= PUT( Datepart(Order. Date), MDDYY 8. ); • The Length statement assures that the variable will have enough space. It must come before the SET. • Order. Date has Date. Time format. The DATEPART function produces a date format output. MMDDYYx. is a date format type.

SAS Functions We are especially interested in “Character” and “Date and Time” functions Note:

SAS Functions We are especially interested in “Character” and “Date and Time” functions Note: use the SAS “search” tab to look for “Functions. ” For a list of SAS functions look under: “Functions and CALL routines: Functions and CALL Routines by Category”

Useful Data Cleaning Functions • Text Manipulation: – COMPRESS, STRIP, TRIM, LEFT, RIGHT, UPCASE,

Useful Data Cleaning Functions • Text Manipulation: – COMPRESS, STRIP, TRIM, LEFT, RIGHT, UPCASE, LOWCASE • Text Extraction – INDEX, SCAN, SUBSTR, TRANSLATE, TRANWRD

Parsing • The process of splitting a text field into multiple fields • Uses

Parsing • The process of splitting a text field into multiple fields • Uses SAS functions to extract parts of a character string. – Fixed position in a string: SUBSTR – Known delimiter: SCAN Note: it is a good idea to strip blanks before you try to parse a string.

Example of Parsing Data Customer 2; LENGTH street cust_addr $20. ; FORMAT street cust_addr

Example of Parsing Data Customer 2; LENGTH street cust_addr $20. ; FORMAT street cust_addr $20. ; SET Customer; Cust_Addr= TRIM(Cust_Addr); Number= Scan(Cust_Addr, 1, ' '); Street= Scan(Cust_Addr, 2, ' '); run; Note: The LENGTH and FORMAT statements clear trailing blanks for further display.

Parsing Results Obs 1 2 3 4 5 6 7 cust_addr 481 OAK 215

Parsing Results Obs 1 2 3 4 5 6 7 cust_addr 481 OAK 215 PETE 48 COLLEGE 914 CHERRY 519 WATSON 16 ELM 108 PINE Number 481 215 48 914 519 16 108 street OAK PETE COLLEGE CHERRY WATSON ELM PINE

 Good Practice Always print the before and after images here. Parsing free form

Good Practice Always print the before and after images here. Parsing free form text can be quite a problem. For example, apartment addresses ‘ 110 b Elm’ and ‘ 110 b Elm’ will parse differently. In this case you may have to search the second word for things that look like apartments and correct the data.

=SUBSTR( string, position<, length>) Use this when you have a known position for characters.

=SUBSTR( string, position<, length>) Use this when you have a known position for characters. • String: character expression • Position: start position (starts with 1) • Length: number of characters to take (missing takes all to the end) VAR= ‘ABCDEFG’ NEWVAR= SUBSTR(VAR, 2, 2) NEWVAR 2= SUBSTR(VAR, 4) NEWVAR= ‘BC’ NEWVAR 2= ‘DEFG’

SUBSTR(variable, position<, length>) = new-characters Replaces character value contents. Use this when you know

SUBSTR(variable, position<, length>) = new-characters Replaces character value contents. Use this when you know where the replacement starts. a='KIDNAP'; substr(a, 1, 3)='CAT'; a: CATNAP substr(a, 4)='TY' ; a: KIDTY

INDEX(source, excerpt) • Searches a character expression for a string of characters. Returns the

INDEX(source, excerpt) • Searches a character expression for a string of characters. Returns the location (number) where the string begins. a='ABC. DEF (X=Y)'; b='X=Y'; x=index(a, b); x: 10 x= index(a, ’DEF’); x: 5

Alternative INDEX functions • INDEXC searches for a single character • INDEXW searches for

Alternative INDEX functions • INDEXC searches for a single character • INDEXW searches for a word: Syntax INDEXW(source, excerpt<, delimiter>)

Length Returns the length of a character variable • The LENGTH and LENGTHN functions

Length Returns the length of a character variable • The LENGTH and LENGTHN functions return the same value for non-blank character strings. LENGTH returns a value of 1 for blank character strings, whereas LENGTHN returns a value of 0. • The LENGTH function returns the length of a character string, excluding trailing blanks, whereas the LENGTHC function returns the length of a character string, including trailing blanks. LENGTH always returns a value that is less than or equal to the value returned by LENGTHC.

Standardizing • Adjusting terms to standard format. • Based off of frequency prints. •

Standardizing • Adjusting terms to standard format. • Based off of frequency prints. • Use functions or IF statements – TRANWRD is easy but can produce unexpected results – IF statements are safer, but less general

Standardization Code Supplier= Tranwrd(supplier, " Incorporated", ""); If Supplier= "Trinkets & Things" then supplier=

Standardization Code Supplier= Tranwrd(supplier, " Incorporated", ""); If Supplier= "Trinkets & Things" then supplier= "Trinkets n' Things"; More complex logic is often needed. See the course examples.

 Good Practice It is a good idea to produce a change log for

Good Practice It is a good idea to produce a change log for standardized changes: Data Products 2 Changed; Set Products; Supplier. Old= Supplier; **** Output Products 2; If Trim(supplier) ^= Trim(Supplier. Old) then output Changed; Proc Print Data= Changed; Var Supplier. Old Supplier;

Locating Anomalies • Frequency counts are a good way to identify anomalies. • It

Locating Anomalies • Frequency counts are a good way to identify anomalies. • It is also helpful to identify standard changes that you do not have to review. • Probably the safest way to execute standard changes is with a “Change Table” that lists From and To values. (Advanced SAS exercise – go for it!!)

De Duplicating • Reconcile different representations of the same entity • Done after standardizing.

De Duplicating • Reconcile different representations of the same entity • Done after standardizing. Usually requires multi-field testing. • May use probabilistic logic, depending on the application. • Should produce a change log.

Correcting • Identifying and correcting values that are wrong • Very difficult to do.

Correcting • Identifying and correcting values that are wrong • Very difficult to do. Usually based off of exception reports or range checks.