Basics of writing SPSS syntax files
Vince Gray
DLI Boot Camp, June 3, 2014

Session goals
• Introduce the basic parts of an SPSS syntax file that reads in data
  ◦ Not intended to show how to analyze data, but how to make them available for analysis
• Tips and tricks for preparing a syntax file
• Cleaning up blatant problems with the data
• A short exercise in coding an SPSS syntax file

Why know how to do this?
• Older files may not have syntax available: it may exist on paper only
• SPSS is not Statistics Canada's specialty: they don't do much work with it, and that can show in what you receive from them
• Faculty members may wish to deposit old data with you

Sample of a print-only codebook
Household Income, Facilities and Equipment Micro Data File, 1971 Income: Survey of Consumer Finances, 1972 and Survey of Household Facilities and Equipment, 1972

SPSS foundational concepts
• SPSS is generally case insensitive
  ◦ Commands and labels are capitalized for display purposes only
  ◦ On Unix computers, file specifications are case sensitive (C:\Datafile.txt <> c:\dataFile.txt)
  ◦ Operations on string variables are case sensitive
• SPSS commands end with a period
• Recommendation: edit syntax files using a fixed-pitch font (e.g., Courier New)

SPSS foundational concepts
• Comments (text that isn't a command) can be used to explain what you're doing
  ◦ May be placed at the start of a line, beginning with either the word comment or an asterisk and ending with a period
  ◦ May be placed within a command or at the end of a line, enclosed within /* comment */
    Variable labels /* this is a fatuous comment */
     var001 1-4 …
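A minimal sketch of the three comment styles described above. The variable name province is only a placeholder and is assumed to exist in the active file.

```spss
* A comment that starts with an asterisk and ends with a period.
COMMENT A comment that starts with the word COMMENT and ends with a period.
FREQUENCIES /* an inline comment placed inside a command */
  VARIABLES=province.
```

The inline comment is treated like a blank, so the command simply continues on the next line.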

Basics of an SPSS syntax file
• Where is the file; what are its attributes?
• What are the variables and what format?
• What are the variable labels?
• What values do you want to label?
• Are any of the values missing (i.e., should they be ignored during analysis)?
• Do you need to repair data?

Where is the file and what are its attributes?
• Usually done with data list
    data list file='drive:\directory\filename.ext' format records=# table
     / variable list
     / line 2 variable list ….
• Need to define a file handle first for very large files (record length 8192+)
    file handle myhandle / name='drive:\directory\filename.ext'
     /recform=?? /lrecl=####.
    data list file=myhandle format records=# table
     /variable list.
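As an illustration of the two templates above, here is a sketch with the placeholders filled in. The file names, variable names, column positions and record length are all invented for the example; fixed is one of the values that can stand in for the "format" placeholder.

```spss
* Hypothetical file with two records (lines) per case.
DATA LIST FILE='c:\data\survey.dat' FIXED RECORDS=2 TABLE
  /1 uniqueid 1-8 province 9-10
  /2 income 1-9 (2).

* Hypothetical very wide file: declare the record length on a file handle first.
FILE HANDLE bigfile /NAME='c:\data\widefile.dat' /LRECL=9000.
DATA LIST FILE=bigfile FIXED RECORDS=1 TABLE
  /1 uniqueid 1-8 lastvar 8991-9000.
```

Each slash starts the variable list for the next record of the case, and the table keyword asks SPSS to print a summary of the layout it used.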

What are the variables and what format?
• Each variable being read in from the file must be described
• Must be assigned a variable name: see Variable Names in SPSS help (Syntax)
  ◦ Cannot be a reserved word
  ◦ May be up to 64 characters long: no spaces
  ◦ Start with A-Z, @, # (scratch variables), or $ (system variables)
  ◦ May contain A-Z 0-9 _ . $ # @

Thoughts on long variable names
• Users of older (perpetual) versions of SPSS may not be able to use them
• Variable names may wrap across lines
• For the lazy, they mean more typing
• Can use rename variables syntax to retain long variable names
• Recommendation: use 8 characters or fewer for variable names
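One way to follow the recommendation and still keep a long name on offer is to read the data with a short name and supply a rename for those who want it. A sketch, in which both the short name farmdep and the long name are hypothetical:

```spss
* Assumes the data were read in with the short name farmdep.
RENAME VARIABLES (farmdep = farm_income_dependence).
```

Users of older versions can simply leave this command out and keep the 8-character name.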

Defining variable format
• Multiple ways to do it
  ◦ Specify columns and type
      Uniqueid 1-8 Recwght 9-15 (3) Cityname 16-45 (A)
      var001 46-50 var002 51-55 var003 56-60
      income gvttrans othrincm 61-87 (2) …
  ◦ Use Fortran encoding
      Uniqueid (F8.0) Recwght (F7.3) Cityname (A30)
      var001 to var003 (3F5.0)
      income gvttrans othrincm (3F9.2) …
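To see how the two notations line up, here is a small self-contained sketch that reads two made-up records from inline data. The variable names echo the slide, but the widths and values are invented for the example.

```spss
* Column-style specification: (2) declares two implied decimals, (A) a string.
DATA LIST FIXED /uniqueid 1-3 recwght 4-8 (2) cityname 9-15 (A).
BEGIN DATA
00112345Toronto
00200250Regina
END DATA.
LIST.

* The same layout in Fortran-style notation, read consecutively from column 1.
DATA LIST FIXED /uniqueid (F3.0) recwght (F5.2) cityname (A7).
BEGIN DATA
00112345Toronto
00200250Regina
END DATA.
LIST.
```

Because the data contain no decimal point, the digits 12345 in columns 4-8 are read as 123.45 under either specification.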

Defining variable format (cont'd)
• Can combine various formats in a data list command
    * Here we will declare variables in the file.
    data list file=oldfile records=1 table
     / uniqueid 1-5 province 6-7 urbnrurl 8 farmflag 9 hhldwght 10-12
     numprsns nmadults nmchlt06 nmch0615 nmch1617 nmch1824 13-24
     hhldcomp (F1.0) farm_income_dependence 56 mjsrcinc 57
     nmearner nmpsninc (2F2.0)
     earnrmbd invstmnt govttran miscincm ttlincom 62-91.
• Note the indentation of 1 space on each variable line: it used to be required, and is now more stylistic

Defining variable format (cont'd)
• Don't define variables as strings unless the data contain non-numeric characters
  ◦ Can lose ordinal variable relationships
  ◦ This may mean revising StatCan syntax files, which have been known to define non-interval variables as strings, regardless of the coding actually used for the variable

String variables (cont'd)
  ◦ In the worst case (and at your discretion, based on comfort level), this means recoding variables (e.g., Discharge Abstract Database)
  ◦ If you convert to Stata, value labels won't convert, since they can't be assigned to string variables
  ◦ Recommendation: if the string requires a value label to be meaningful, convert it to a coded numeric value (therefore, leave place names, census tract numbers, etc. as strings)

What are the variable labels?
• The purpose of variable labels is to give more descriptive information than the variable name can provide
  ◦ Sex
    ▪ Probably safe to guess that it is a gender variable
    ▪ But not necessarily: Have you had sex in the past month?
    ▪ Recorded for whom? Respondent/spouse/first-born?
• If any doubt might exist, try to remove it!
• Do not use arbitrary contractions, especially if loading into a searchable metadata service

Variable labels (cont'd)
• Sample code with arbitrary contractions
    VARIABLE LABELS
     YEAR "Refyr - 1998"
     PUCPID26 "Cross-sect random pers ID - 1998"
     PUCHID25 "Cross-sect random hhld ID - 1998"
     D31CF26 "Census family ID - 1998"
     ICSWT26 "Int cross-sect weight - 1998"
     ECYOB26 "Ext YOB (cross-sect) - 1998"
     ECAGE26 "Ext age refyr (cross-sect) - 1998"
     ECSEX99 "Ext sex refyr (cross-sect) - 1998"
     MARST26 "Marital status refyr - 1998"
     MJACT26 "Major activity - 1998"
     MJIEH26 "Major inc earner Hhld - 1998"
     MJINE26 "Major inc earner EF - 1998"
     RMJIG26 "Rel maj inc earner grp EF - 1998"
     MJICE26 "Major inc earner CF - 1998"

Variable labels (cont'd)
• Make sure that the label includes the most important information. In the variables below, the key information was omitted by StatCan: does it what, would you what, described as what?
    HAL_Q150 "Does a physical condition or mental condition or health prob"
    HAL_Q160 "Does a physical condition or mental condition or health prob"
    HAL_Q170 "Does a physical condition or mental condition or health prob"
    HAL_Q210 "Do you regularly have trouble going to sleep or staying asle"
    MSS_Q110 "Thinking about the amount of stress in your life, would you"
    MSS_Q120 "What is your main source of stress?"
    HS_Q110 "Presently, would you describe yourself as:"

Variable labels (cont'd)
• Meaningful labels
    HAL_Q150 "Reduction of amount/kind of activity at home"
    HAL_Q160 "Reduction of amount/kind of activity at work or school"
    HAL_Q170 "Reduction of amount/kind of activity in other activities (transport/leisure)"
    HAL_Q210 "Regularly have trouble going to sleep or staying asleep"
    MSS_Q110 "Self-assessed amount of stress in respondent's life"
    MSS_Q120 "What is your main source of stress"
    HS_Q110 "Self-assessed happiness"

Variable labels (cont'd)
• If labels are repeated, explain why (the variable names may not be intuitive):
    SUDDLAI 'Any drug use (incl 1 time cann)'
    SUDDLAE 'Any drug use (excl 1 time cann)'
    SUDDLID 'Any drug use (excl cann) - life (D)'
    SUDDYAI 'Any drug use (incl 1 time cann)'
    SUDDYAE 'Any drug use (excl 1 time cann)'
  is less useful than
    SUDDLAI "Ever used drugs (including 1 time cannabis, derived)"
    SUDDLAE "Ever used drugs (excluding 1 time cannabis, derived)"
    SUDDLID "Ever used drugs (excluding cannabis, derived)"
    SUDDYAI "Used any drugs in past 12 months (including 1 time cannabis, derived)"
    SUDDYAE "Used any drugs in past 12 months (excluding 1 time cannabis, derived)"

Variable label formatting
• Recommend placing all labels in double quotes rather than single quotes
    noanswr1 "Didn't answer: wasn't at home"
  rather than
    noanswr1 'Didn''t answer: wasn''t at home'
  ◦ Either works, but single quotes can lead to more mistakes due to carelessness in data entry
• Variable labels may have up to 255 characters: not all of them may be displayed, though (some procedures show only 40 characters)

What values do you want to label?
• Nominal and ordinal variables are generally meaningless without value labels
  ◦ Gender: is 1 male and 0 female, or vice versa?
  ◦ Does a scale variable run worse to better or better to worse? (The value alone doesn't necessarily suffice to tell you this)
  ◦ What does value 3 in Agegroup represent?
• Continuous values variables may have key …

Value label formats
• Do not use arbitrary contractions: up to 120 characters can be displayed
• Recommend placing all labels in double quotes rather than single quotes
    6 "Don't know"
  rather than
    6 'Don''t know'
• String values must be enclosed in quotes (e.g., "B" "Boston lettuce")
  ◦ but you won't be using string variables if you need value labels to make sense, right?

Value label formats (cont'd)
• A single label declaration can be used for any and all variables using that coding, or separate declarations can be made
    value labels
     SUDDYOA SUDDYOD SUDFINT SUDFLAU SUDFLCA SUDFLCM SUDFLSU SUDFLTU SUDFYCM
     SUDGLOTH SUD_87 SUI_01 SUI_02 SUI_03 TWD_1 TWD_3 TWD_5
     1 "YES" 2 "NO" 6 "NOT APPLICABLE" 7 "DON'T KNOW" 8 "REFUSAL" 9 "NOT STATED" /
• Each declaration is separated from the previous one with a /

Value label formats (cont'd)
• Can explicitly identify variables to which no value labels are assigned
• If consecutive variables use the same coding, use "to"
    value labels
     uniqueid hhldwght /
     SUDDYOA to SUDGLOTH SUD_87 SUI_01 SUI_02 SUI_03 TWD_1 TWD_3 TWD_5
     1 "YES" 2 "NO" 6 "NOT APPLICABLE" 7 "DON'T KNOW" 8 "REFUSAL" 9 "NOT STATED".
• Repeated value labels for any variable are ignored: the first one found is used, and a warning is issued in the syntax window

Missing values
• Missing values get omitted from analysis: if you are looking for the average income of spouses, you don't include households that don't have spouses
• Statistics Canada normally uses values ending in 6/7/8/9 as missings (i.e., not applicable, don't know, refusal, not asked), but often defines only the 9 values as missing values in SPSS: this varies by Division

Missing values (cont'd)
• Other values may be missing as well
    mthrplbr fthrplbr
     1 "Born in Canada"
     2 "Born outside of Canada - North America/Europe"
     3 "Born outside of Canada - Other country"
     4 "Country uncodeable"
     8 "Not stated"
     9 "Don't know" /
  ◦ The value 4 might be considered missing: I would code it as missing!
  ◦ Check the codebook carefully!

Missing values format
• SPSS allows up to three discrete values to be defined as missing, or a range (using thru, which includes all values within the range), or one discrete value and a range.
• May explicitly declare that no values are missing for a variable.
    Missing values uniqueid () /* Can explicitly show no missings */
     var001 to var028 (6, 7, 9)
     var029 var031 (6 thru 9)
     var030 (-1, 6 thru highest).

Missing values format (cont'd)
• String and non-string missing values can't be declared in the same missing values statement.
    Missing values uniqueid ()
     var001 to var028 (6, 7, 9)
     var029 var031 (6 thru 9)
     var030 (-1, 6 thru highest).
    Missing values stringv1 ("ZZZZZZZ", "-1 ").
• Missing values are dealt with immediately: be aware of the order of operations

Do you need to repair data?
• Does each record have a unique record identifier (used to match variables from different files or subsets)?
  ◦ If not, create one:
      compute uniqueid=($casenum).
      variable labels uniqueid "Unique record identifier".
      * The formats command will specify how many columns are reserved for the
        field: by default, new variables are created as F8.2. No decimals are
        needed for this variable. Length (#) is based on the number of records
        in the file.
      formats uniqueid (F#.0).

Repairing data (cont'd)
• If numerically coded variables are defined as string, change that to be non-string.
    Data list …                     >>>   Data list …
     uniqueid 1-8 gender 9 (A) …           uniqueid 1-8 gender 9 …
    Value labels gender                   Value labels gender
     "1" "Male"                            1 "Male"
     "2" "Female"                          2 "Female"
     "9" "Not ascertained".                9 "Not ascertained".
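If the data already sit in a system file and the raw file cannot be re-read, a different technique from the one on the slide is the ALTER TYPE command, available in more recent SPSS releases. A sketch, assuming gender is a one-character string holding only the codes 1, 2 and 9:

```spss
* Convert the string variable to numeric in place, then declare numeric labels.
ALTER TYPE gender (F1.0).
VALUE LABELS gender 1 "Male" 2 "Female" 9 "Not ascertained".
```

Check a frequency table before and after the conversion: any value that cannot be read as a number should come out as system-missing.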

Repairing data (cont'd)
• If string variables require value labels to be meaningful, create non-string versions: this is case sensitive!
    Value labels gradelvl
     "H" "Top 10% of the class"
     "M" "Middle 80% of the class"
     "L" "Bottom 10% of the class"
     " " "Rank in class not known".
    Missing values gradelvl (" ").
    * Create a non-string version of the variable.
    If gradelvl="H" newgrdlv=1.
    If gradelvl="M" newgrdlv=2.
    If gradelvl="L" newgrdlv=3.
    If (missing(gradelvl)) newgrdlv=9.
    * newgrdlv must exist before Formats can refer to it, so Formats follows the If commands.
    Formats newgrdlv (F1.0).
    Value labels newgrdlv
     1 "Top 10% of the class"
     2 "Middle 80% of the class"
     3 "Bottom 10% of the class"
     9 "Rank in class not known".
    Missing values newgrdlv (9).
    Variable labels newgrdlv "Reformatted gradelvl: class placement".
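The same conversion can also be sketched with a single recode command instead of four separate conditions. This is an alternative to the slide's approach, not a replacement for it, and it uses the same variable names:

```spss
* One RECODE creates the numeric variable newgrdlv from the string gradelvl.
* ELSE catches the blank code and any other value not listed.
RECODE gradelvl ("H"=1) ("M"=2) ("L"=3) (ELSE=9) INTO newgrdlv.
FORMATS newgrdlv (F1.0).
EXECUTE.
```

Because string comparisons are case sensitive, a lower-case "h" would fall into ELSE and be coded 9, so check the frequencies of gradelvl first. The value labels, missing values and variable label for newgrdlv are declared as before.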

Repairing data (cont'd)
• Repairing coding flaws is the most difficult and possibly the most important thing you can do for your users: do it if you're comfortable!

Solution to coding problem
    * Find records where there is no wife.
    * According to documentation, should use (hdmarsta=1) or (hdmarsta=8) or (hdmarsta=9) or (hdmarsta=10).
    * Doing that results in 17,129 valid (non-missing) records.
    * Defining 0 as missing for age gives 14,352 valid records.
    * Since 0 is defined as a missing code for wfagegrp, you cannot use "wfagegrp=0" as the condition.
    do if (missing(wfagegrp)).
    * Reset values from 0 to a specified missing code.
    + compute wfincome=999999.
    + compute wfwkswrk=-1.
    end if.
• Try not to change the format of the variable when adding a value: wfincome has 6 columns, with valid entries from -99999 to +99999, so 999999 is outside the valid range. For wfwkswrk, we could have used 99 as the missing code (the valid range is 0 to 52).

Solution to coding problem (cont'd)
• Value labels are needed:
    Value labels
     wfincome 999999 "Not applicable - no wife" /
     wfincsrc 1 "No income"
      2 "Wages and salaries"
      3 "Military pay and allowances"
      4 "Net income from self-employment"
      5 "Net income from roomers and boarders"
      6 "Government transfer payments"
      7 "Net income from investment"
      8 "Retirement pensions, superannuation and annuities"
      9 "Other money income"
      0 "Not applicable - no wife" /
     wfagegrp 76 "Age 76 and over"
      0 "Not applicable - no wife" / …

Solution to coding problem (cont'd)
• Missing value declarations are needed, to make having done this worthwhile
    Missing values wfincsrc wfagegrp (0)
     wfincome (999999) …
• The ripple effect of the change isn't necessarily as simple as changing one piece of code: you have to track down the rest of the effects of the change and document them.

Where & how to save files?
• Write: creates ASCII file (for preservation)
  ◦ Doesn't actually do anything until the program encounters an executable command
      write outfile='drive:\directory\filename.dat' table /all.
  ◦ The table parameter tells SPSS to include the format used in writing the ASCII file in the log file; /all indicates to write out all variables on the file.
  ◦ Does not preserve variable/value labels or missing declarations in the ASCII file: you need syntax to read the file created by write back into SPSS

Where & how to save files?
• Export: creates portable file
  ◦ No longer widely used: used to transport between platforms or programs
      export outfile='drive:\directory\filename.por' /keep=? /drop=? /map.
  ◦ Keep and drop allow you to include or exclude variables by naming them; map lists variable names and labels
  ◦ Preserves variable/value labels and missing declarations: can be read back into SPSS
  ◦ Long variable names truncate to 8 characters
  ◦ Is an executable command (will force Write)

Where & how to save files?
• Save: creates system file
  ◦ This is the native format of SPSS: files will load into SPSS and keep all variable/value labels, missing declarations and long variable names
      save outfile='drive:\directory\filename.sav' /keep=? /drop=? /map.
  ◦ Keep and drop allow you to include or exclude variables by naming them; map lists variable names and labels
  ◦ Is an executable command (will force Write)

Where & how to save files?
• Syntax for saving data & metadata:
    write outfile='j:\presentations\hife1972.dat' table /all.
    save outfile='j:\presentations\hife1972.sav' /map.
    display dictionary.
• Display dictionary
  ◦ Writes information about the system file into the output: variable names, formats, labels, missing declarations, etc.
• Save your output file, at least as a .spv file, better by exporting to text (because you can 'always' read it: preservation purposes!)

Exercise
• Create syntax to read the 4 variables on the next page into SPSS, including:
  ◦ A data list command (c:\data\192_1972.dat)
  ◦ Variable labels
  ◦ Value labels
  ◦ Missing declarations
  ◦ Comments for any "fixups" that need to be done: reflect any fixups in value labels and missing declarations
  ◦ Saving your work

Exercise page

Good, better and horrible news
• Good news
  ◦ You're done!
• Better news
  ◦ You may never have to do this: if you can't locate a syntax file on the EFT site, ask on the DLI list whether other DLI reps have one that they can provide!
• Horrible news
  ◦ If a faculty member shows up with a file that he or she collected, no one else will have syntax: someone may have to do this!