Data Transformation Data cleaning Importing Data Reading data
- Slides: 31
Data Transformation Data cleaning
Importing Data • • Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access data ODBC for external database data
Importing an Excel Spreadsheet PROC IMPORT OUT= WORK. Fall 2007 DATAFILE= "L: Data. Warehousing 07 fCourse. DatabaseFall 2007. xls" DBMS=EXCEL REPLACE; SHEET="'Fall 07$'"; GETNAMES=YES; MIXED=NO; SCANTEXT=YES; USEDATE=YES; SCANTIME=YES; RUN;
Import an Access Table PROC IMPORT OUT= WORK. Order. Line DATATABLE= "Order. Lin" DBMS=ACCESS REPLACE; DATABASE="I: Data. Warehousing 07 fWholesale. Product s. mdb"; SCANMEMO=YES; USEDATE=NO; SCANTIME=YES; RUN;
Good Practice • Check the metadata for a dataset PROC CONTENTS DATA= Order. Line; RUN; • Print a few records PROC PRINT DATA= Order. Line (OBS= 10); RUN;
Saving SAS Datasets LIBNAME course "L: Data. Warehousing 07 fCourse. Database"; Data course. Spring 2008; set spring 2008; run; Note: the name associated with the libname command (“course”) must be 8 characters or less.
LIBNAME / INFILE / INPUT for character data • LIBNAME identifies the location or folder where the data file is stored • INFILE specifies the libname to use for reading external data. • INPUT reads text format data • SET reads SAS data
INFILE with INPUT for character data files DATA Fitness; INFILE "L: Data. Warehousing 07 fTransformation. SASSAS 1. txt"; INPUT NAME $ WEIGHT WAIST PULSE CHINS SITUPS JUMPS; run;
Creating Derived Attributes Generating new attributes for a table. SAS creates attributes when they are referred to in a data step. The metadata depends on the context of the code. • LENGTH statements • FORMATS and INFORMATS • PUT • INPUT
PUT and INPUT Functions Text. Output = PUT(variable, format) Note: the result of a put function is always character Note: there is also a PUT statement that writes the contents of a variable to the SAS log Output = INPUT(Character. Input, informat) Note: the variable for an input function is always character
Formats • Formats always contain a period • Formats for character variables always start with a $ • The most used format categories are Character, Date and Time, and Numeric Note: use the SAS “search” tab to look for “Formats. ” For a list of SAS formats look under: “Formats: Formats by Category”
Good Practice The following code is handy for testing functions and formats in SAS. The _Null_ dataset name tells SAS not to create the datset in the WORK library Data _Null_; Input. Val= 123; Output. Val= PUT(Input. Val, Roman 30. ); PUT Input. Val Output. Val; run;
Generating Dates • Generating a Date dimension • Usually done offline in something like Excel • SAS has extensive date and datetime functions and formats • SAS formats apply to only one of datetime, date or time variable types. Convert from one type to another with SAS functions.
Creating a text variable for Date Data Orders 2; Length Date $10. ; Set Orders; Date= PUT( Datepart(Order. Date), MDDYY 8. ); • The Length statement assures that the variable will have enough space. It must come before the SET. • Order. Date has Date. Time format. The DATEPART function produces a date format output. MMDDYYx. is a date format type.
SAS Functions We are especially interested in “Character” and “Date and Time” functions Note: use the SAS “search” tab to look for “Functions. ” For a list of SAS functions look under: “Functions and CALL routines: Functions and CALL Routines by Category”
Useful Data Cleaning Functions • Text Manipulation: – COMPRESS, STRIP, TRIM, LEFT, RIGHT, UPCASE, LOWCASE • Text Extraction – INDEX, SCAN, SUBSTR, TRANSLATE, TRANWRD
Parsing • The process of splitting a text field into multiple fields • Uses SAS functions to extract parts of a character string. – Fixed position in a string: SUBSTR – Known delimiter: SCAN Note: it is a good idea to strip blanks before you try to parse a string.
Example of Parsing Data Customer 2; LENGTH street cust_addr $20. ; FORMAT street cust_addr $20. ; SET Customer; Cust_Addr= TRIM(Cust_Addr); Number= Scan(Cust_Addr, 1, ' '); Street= Scan(Cust_Addr, 2, ' '); run; Note: The LENGTH and FORMAT statements clear trailing blanks for further display.
Parsing Results Obs 1 2 3 4 5 6 7 cust_addr 481 OAK 215 PETE 48 COLLEGE 914 CHERRY 519 WATSON 16 ELM 108 PINE Number 481 215 48 914 519 16 108 street OAK PETE COLLEGE CHERRY WATSON ELM PINE
Good Practice Always print the before and after images here. Parsing free form text can be quite a problem. For example, apartment addresses ‘ 110 b Elm’ and ‘ 110 b Elm’ will parse differently. In this case you may have to search the second word for things that look like apartments and correct the data.
=SUBSTR( string, position<, length>) Use this when you have a known position for characters. • String: character expression • Position: start position (starts with 1) • Length: number of characters to take (missing takes all to the end) VAR= ‘ABCDEFG’ NEWVAR= SUBSTR(VAR, 2, 2) NEWVAR 2= SUBSTR(VAR, 4) NEWVAR= ‘BC’ NEWVAR 2= ‘DEFG’
SUBSTR(variable, position<, length>) = new-characters Replaces character value contents. Use this when you know where the replacement starts. a='KIDNAP'; substr(a, 1, 3)='CAT'; a: CATNAP substr(a, 4)='TY' ; a: KIDTY
INDEX(source, excerpt) • Searches a character expression for a string of characters. Returns the location (number) where the string begins. a='ABC. DEF (X=Y)'; b='X=Y'; x=index(a, b); x: 10 x= index(a, ’DEF’); x: 5
Alternative INDEX functions • INDEXC searches for a single character • INDEXW searches for a word: Syntax INDEXW(source, excerpt<, delimiter>)
Length Returns the length of a character variable • The LENGTH and LENGTHN functions return the same value for non-blank character strings. LENGTH returns a value of 1 for blank character strings, whereas LENGTHN returns a value of 0. • The LENGTH function returns the length of a character string, excluding trailing blanks, whereas the LENGTHC function returns the length of a character string, including trailing blanks. LENGTH always returns a value that is less than or equal to the value returned by LENGTHC.
Standardizing • Adjusting terms to standard format. • Based off of frequency prints. • Use functions or IF statements – TRANWRD is easy but can produce unexpected results – IF statements are safer, but less general
Standardization Code Supplier= Tranwrd(supplier, " Incorporated", ""); If Supplier= "Trinkets & Things" then supplier= "Trinkets n' Things"; More complex logic is often needed. See the course examples.
Good Practice It is a good idea to produce a change log for standardized changes: Data Products 2 Changed; Set Products; Supplier. Old= Supplier; **** Output Products 2; If Trim(supplier) ^= Trim(Supplier. Old) then output Changed; Proc Print Data= Changed; Var Supplier. Old Supplier;
Locating Anomalies • Frequency counts are a good way to identify anomalies. • It is also helpful to identify standard changes that you do not have to review. • Probably the safest way to execute standard changes is with a “Change Table” that lists From and To values. (Advanced SAS exercise – go for it!!)
De Duplicating • Reconcile different representations of the same entity • Done after standardizing. Usually requires multi-field testing. • May use probabilistic logic, depending on the application. • Should produce a change log.
Correcting • Identifying and correcting values that are wrong • Very difficult to do. Usually based off of exception reports or range checks.
- Offset counter trade
- Insourching
- The great promise of exporting is that
- While reading activities
- Data quality and data cleaning an overview
- Data quality and data cleaning an overview
- Data quality and data cleaning an overview
- Pengolahan data editing, coding processing cleaning
- Potter's wheel data cleaning tool
- Stata data cleaning
- Disambiguation data cleaning
- Etl in data cleaning and preprocessing stands for
- Entering data in spss
- Data cleaning lesson
- Cse 234 ucsd
- St. louis
- Reading aims
- Reading techniques and strategies
- Edb net section
- Active process of discovery
- Intensive reading and extensive reading
- Intensive and extensive reading
- What is intensive reading
- Tru count air clutch installation manual
- Power tool cleaning
- Ultra shine company manufactures a cleaning solvent
- Water hardness can affect cleaning by servsafe
- Sedimclean
- What are three types of hazards that make food unsafe
- Pt among
- Hospital patient room cleaning procedures
- Cleaning methods