MPA STATISTICS Eva Witesman, MPA, Ph.D., Brigham Young University, Romney Institute of Public Management
Sampling
Definitions • Population: The set of observations the study is intended to represent (the people/units you want to reach or make conclusions about). • Sampling frame: How the population is operationalized. This may include a list (e.g., phone book, e-mail list) or other identification method (e.g., people standing in line at the DMV) for finding people in the population. (The people/units you can reach.) Mismatches between the sampling frame and the population may introduce bias, or systematic differences between the sample and the population.
More Definitions • Sample: The subset of the population you attempt to survey/observe/measure. (The people/units you tried to reach). Sampling using any nonrandom method may introduce selection bias. • Respondents/Observations: The subset of the sample actually represented in the study. • Response rate: For human subject studies, the percent of the sample who participated in the study. Mismatch between the respondents and the sample may introduce bias. When this bias is due to people's decision to participate or not participate in the study, this is called self-selection bias.
Sampling Visualization: diagram of nested sets, narrowing from Population to Sampling Frame to Sample to Observations.
Approaches to Sampling A sample that includes all units in the sampling frame is called a census. A sample that results from randomizing the sampling frame and selecting units is called random sampling. This is the general standard for sampling. Organizing the sampling frame into subgroups and then randomly selecting samples from each subgroup is called stratified random sampling. Sampling units based on cost and/or convenience rather than randomness is called convenience sampling. Sampling people by getting referrals from people who are already in the study is called snowball sampling.
When to Use Sampling Approaches • Census - Census sampling is ideal if you have the manual and technical capacity to gather and manage the quantity of data necessary. Attempted census data can be biased if a true census is not achieved. • Random Sampling - Random samples are a statistically valid method for representing a population without the capacity demands of a census. • Stratified Random Sampling - Appropriate particularly when you are interested in fair representation of views that cross meaningful boundaries and might be underrepresented in a random sample (e.g., race, college). • Convenience Sampling - This type of sampling is appropriate for exploratory studies, pilot studies, and instrument/construct refinement. • Snowball Sampling - This type of sampling is appropriate for building samples of people with very rare characteristics (such as rare medical conditions).
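The contrast between random and stratified random sampling can be sketched in code (Python here, purely for illustration; the frame, colleges, and sample sizes are all invented):

```python
import random

# Hypothetical sampling frame: each unit tagged with a stratum (e.g., college)
frame = [{"id": i, "college": c} for i, c in enumerate(
    ["Business"] * 60 + ["Nursing"] * 30 + ["Fine Arts"] * 10)]

random.seed(42)  # fix the seed so the draw is reproducible

# Simple random sample: 10 units drawn from the whole frame
simple = random.sample(frame, 10)

# Stratified random sample: 4 units drawn from EACH college, guaranteeing
# that small groups (here, Fine Arts) are represented
strata = {}
for unit in frame:
    strata.setdefault(unit["college"], []).append(unit)
stratified = [u for units in strata.values() for u in random.sample(units, 4)]

print(len(simple), len(stratified))  # 10 12
```

Note that the stratified draw always contains four Fine Arts units, while a simple random sample of ten may contain none.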
Correctly Formatted Raw Datasets
Raw Data Characteristics • The data should be in Excel format (or an Excel-readable format like .csv) • The data should be organized so that each row is a single observation (student, instructor, course, program, year, etc.) • The data should be organized so that each column is a single variable (characteristic, measurement, value, descriptor, etc.) • The data file should not include summary statistics or other data
How to Randomly Select a Sample via Excel • In Excel, create a new column. In the variable name row, title the new column “random”. • Using the command “=rand()”, create random numbers for all rows in your dataset. • Replace the =rand() command with values (copy/paste/keep values only) • Select the ENTIRE dataset (with only one header row). • Sort the dataset according to the random variable. • Keep the desired number of observations from the top rows.
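The same sort-on-a-random-column procedure can be mirrored outside Excel; a minimal Python sketch with made-up rows:

```python
import random

# Hypothetical dataset: one row per observation
rows = [{"student_id": i} for i in range(100)]

random.seed(7)
for row in rows:                      # create a "random" column
    row["random"] = random.random()   # once assigned, the values are fixed

rows.sort(key=lambda r: r["random"])  # sort the dataset by the random variable
sample = rows[:20]                    # keep the desired number from the top

print(len(sample))  # 20
```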
Unit of analysis • A unit of analysis is the nature of the subject being studied, such as “departments” or “incoming calls” or “counties” or “individuals.” It is the unit that is sampled and observed. • A well-designed study typically has a consistent unit of analysis, meaning that all variables in the study are measured at the same unit level (e.g., if it is a study of individuals, all the variables deal with individuals rather than households, cities, or counties).
Identifying the unit of analysis • In a study that has already been conducted, you can generally identify the unit of analysis by identifying what the data points uniquely describe. • In some cases, it requires more than one piece of information to identify what makes each row in a database unique. For example, each row in your database may be a department, but each department may appear more than once (e.g., once for each of five years). In this case, the unit of analysis would be department/year. • Ultimately, the units of analysis will be rows in the database we will use for analysis.
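One way to test a candidate unit of analysis is to check whether a combination of fields uniquely identifies every row; a Python sketch with invented department/year data:

```python
# Hypothetical rows: each department appears once per year
rows = [
    {"dept": "Parks", "year": 2020}, {"dept": "Parks", "year": 2021},
    {"dept": "Fire",  "year": 2020}, {"dept": "Fire",  "year": 2021},
]

def is_unique_key(rows, fields):
    """True if the combination of fields uniquely identifies every row."""
    keys = {tuple(r[f] for f in fields) for r in rows}
    return len(keys) == len(rows)

print(is_unique_key(rows, ["dept"]))          # False: departments repeat
print(is_unique_key(rows, ["dept", "year"]))  # True: unit is department/year
```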
Data Management
The Purposes of Data Management
1. Make the data usable
2. Work efficiently
3. Make sure that findings can be replicated
4. Communicate with others
Version Everything • When you make a change—particularly a significant change—to your dataset, use Save As to create a new version of the document. Saving a file overwrites your previous file and makes undoing changes or going back to earlier versions of a document virtually impossible. Saving new versions of the document protects you from having to redo everything if you make a mistake and save it. • I suggest the following naming convention for data files to automatically sort your files from oldest to newest: filename.YYYYMMDDVV (filename, year, month, day, version)
Keep a Log of Changes • Keep a data cleaning log in a separate sheet of your data file. Record major changes reflected in each saved version. That way, each data file will automatically have its own updated log included and you’ll know exactly what has happened with the data in that particular file. • Keep a research log. When you run models and other analysis in R, recording what you did, why, the results, and your interpretations can prevent duplicated work. • Record why you are doing things when you do them. You think you’ll remember but you won’t!
File Types • XLSX files allow multiple sheets, formatting, and various other functions not allowed in CSV files. However, these files can be more difficult to upload into statistical software. • CSV (Comma Separated Values) files are text files that use commas to indicate breaks in cells, rows, and columns. CSV files can be read and created using Excel. This is an extremely common and useful file format for data, but such files do not allow multiple sheets or formatting. • Typically, use XLSX file format for preparing and documenting data. When you are ready to import to statistical software, convert to CSV.
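The comma-as-cell-divider behavior described above can be seen with Python's standard csv module (the tiny dataset is invented):

```python
import csv
import io

# Write a tiny dataset to CSV text: one header row, one row per observation
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "hrspwk"])   # variable names
writer.writerow([1, 40])
writer.writerow([2, 35])

# Read it back: each line splits on commas into cells
rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(rows[0])  # ['id', 'hrspwk']
print(rows[1])  # ['1', '40']  -- CSV stores plain text, not numbers or formats
```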
Creating a Codebook • Add six rows above your data in Excel and one column to the left. In the top six cells of the (new) first column, enter the words: “Content,” “Data Source,” “Level of Measurement,” “Units or Coding,” “Notes,” and “Variable Name.” • In row 1, identify the content of the variable. This may be a construct (e.g., “birthdate”) or a survey question (e.g., “What is your date of birth?”). • In row 2, enter the source of the data. For datasets with multiple sources, identify the database or data source. For single-source data, identify the data title, callout, or original source (e.g., a particular line on a form). For survey data, identify the actual question asked (for a question) or selection indicated (for multipart questions). • In row 3, enter the level of measurement of the data. Examples: binary, ordinal, interval, text. • In row 4, enter the units or coding associated with the data. Examples: hours per week; 1=strongly disagree … 5=strongly agree; 1=yes, 0=no. • In row 5, enter any notes about the variable. • In row 6, enter a variable name. This name should be short (8 characters is standard) and descriptive. This name should NOT INCLUDE SPACES OR SPECIAL CHARACTERS. Examples: q04aexpens, q04bqual, q01hrspwk, q10female. For ordinal variables, the name should reflect the order of the scale from low to high (e.g., “libcons” instead of “ideology”). For properly coded binary variables (0/1), the variable name should reflect the meaning of the code “1” (e.g., “female” instead of “gender”).
Sample Codebook
Codebook (one entry per variable):
• Content: Gender; Data Source: Question 5: “What is your gender?”; Level of Measurement: Text/binary; Units or Coding: As entered; Notes: This is the original version; Variable Name: gender
• Content: Gender; Data Source: Question 5: “What is your gender?”; Level of Measurement: Binary; Units or Coding: 1=female, 0=male; Notes: This is the clean version; Variable Name: female
• Content: Sports-rugby; Data Source: Question 6: “What sports do you like? Check all that apply. –rugby”; Level of Measurement: Binary; Units or Coding: 1=rugby is checked, 0=rugby is not checked; Variable Name: rugby
Data (one row per respondent): gender = Female, female, Male, male; female = 1, 1, 0, 0; rugby = 1, 1, 0, 1
Transposing the Codebook
• Keep this codebook with your data while you make coding changes to the data (this allows you to conveniently update the codebook in the same sheet).
• Once the data is clean, create a nicely formatted codebook in another sheet by doing the following:
• Copy your six header rows
• Click the top left cell of a new sheet
• Right click and select “Paste Special”
• Select the option “Transpose”
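The transpose step flips the six header rows into one codebook row per variable; the same operation sketched in Python (labels abbreviated to three of the six rows for the example):

```python
# Hypothetical codebook header rows (truncated to three of the six)
header_rows = [
    ["Content",        "Gender", "Rugby"],
    ["Level of Meas.", "Binary", "Binary"],
    ["Variable Name",  "female", "rugby"],
]

# Transpose: rows become columns, so each variable gets its own codebook row
codebook = [list(col) for col in zip(*header_rows)]
print(codebook[1])  # ['Gender', 'Binary', 'female']
```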
Cleaning Data
Raw Data vs. Clean Data
• Raw data is the data as you receive it from its original source.
• This includes files uploaded from survey software or administrative databases; original forms, text, notes, and recordings; and other data that has not been formatted for analysis.
• There may be various stages of “raw” data, including data that has been transcribed or entered into electronic databases even if it was collected in an alternative form.
• Clean data is data that is ready for analysis. This includes:
• Coding of data
• Data recovery and imputation
• Deleting observations with missing data
• Verifying that data is accurate and falls within expected ranges
• Etc.
Checking for Symbols and Non-entries
• Remove placeholder codes. Codes for missing data should be deleted to leave blank cells.
• Check for missing data codes that might be embedded in your data. Some programs use “.”, “-99”, “999” or similar values to indicate missing values.
• Verify that all formatting is uniform, with no symbols or spaces in any cells.
• Remove all commas (,) and apostrophes (') from the dataset, including those used as place value indicators. These can be confused as cell dividers when saving a dataset as a comma separated values (CSV) file.
• Replace all formulas with values after first verifying all formula results. Do this by selecting the data (you can just select all data), copying it, and pasting it in the same place. Click the clipboard icon that appears, and from the drop-down menu, select “values only.”
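These checks can be sketched as one cleaning pass (Python for illustration; the missing-data codes and raw values are invented for the example):

```python
# Hypothetical missing-data codes: "." and -99 stand in for blank cells;
# "1,200" has a place-value comma that would break a CSV export
MISSING_CODES = {".", "-99", "999", ""}

def clean_cell(cell):
    """Strip commas/apostrophes and turn missing-data codes into blanks."""
    text = str(cell).strip().replace(",", "").replace("'", "")
    return None if text in MISSING_CODES else text

raw = ["1,200", ".", "-99", "35", " 40 "]
print([clean_cell(c) for c in raw])  # ['1200', None, None, '35', '40']
```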
Examine Each Variable
• Examine each variable to determine the following:
• Whether the data appears complete, whether it represents a subset of the full dataset, whether there are missing values that can be recovered, or whether the variable has too little data to be functionally useful
• Whether the data appears analytically useful (i.e., it represents the construct well and varies within the dataset)
• Whether the data is coded numerically and appropriately
• Typically, the variables you should keep for the analysis are variables that:
• Match constructs in your logic model (inputs, activities, outcomes) or
• Are important control variables
Make Data Cleaning Decisions • Decide which variables to keep for analysis and which to discard based on utility or missingness of values. • Decide what to do with blank cells: • Recover (fill blanks with data from the original source, from alternative sources for the same information, or with known quantities) • Delete the entire observation (this can create a biased sample) • Impute (fill blanks with value estimates. USE WITH EXTREME CAUTION AS THIS CREATES BIAS) • Record and report all major data cleaning decisions
Recode Variables • Interval variables should be checked for consistency of units and checked for obvious typos and/or values outside of reasonable ranges. • Ordinal variables should be coded from low to high values of a construct. Some variables need to be “reverse coded” if responses are opposite of the construct being measured. Ordinal scales should include ONLY ordinal options. “Other” and “not applicable” categories change the level of measurement to categorical because they cannot be ordered. • Binary variables should be coded as 0 and 1. • Categorical variables should typically be recoded as sets of binary variables.
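Reverse coding an ordinal item reduces to (high + low) - value; a small sketch assuming a 1-5 scale:

```python
def reverse_code(value, low=1, high=5):
    """Flip an ordinal scale so high values mean more of the construct."""
    return (high + low) - value

# A hypothetical 1-5 agreement item worded opposite to the construct measured
responses = [1, 2, 5, 4]
print([reverse_code(v) for v in responses])  # [5, 4, 1, 2]
```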
Recoding Categorical Variables to Binary • Because binary variables are more analytically useful than categorical variables, we typically recapture the content of a categorical variable with several binary variables. This process is known as “creating dummy variables.” • Example: Imagine a single variable, “Professor Type,” containing three categories: “Adjunct Professor,” “Tenured Professor,” and “Tenure-track Professor.” • To make this data more useful for analysis, we create three NEW variables—one for “Adjunct Professor,” one for “Tenured Professor,” and one for “Tenure-track Professor.” • Each adjunct professor would have a “1” in the Adjunct Professor variable and a “0” in each of the other new variables. Tenured professors would have a “1” in the tenured professor variable but “0” in each of the other categories.
Creating Dummy Variables: Example

| Professor ID | Professor Type | Adjunct Professor | Tenure-track Professor | Tenured Professor |
|---|---|---|---|---|
| 1098234 | Adjunct Professor | 1 | 0 | 0 |
| 2435980 | Adjunct Professor | 1 | 0 | 0 |
| 3875039 | Tenure-track | 0 | 1 | 0 |
| 3458906 | Adjunct Professor | 1 | 0 | 0 |
| 2943029 | Tenured | 0 | 0 | 1 |
| 1508749 | Adjunct Professor | 1 | 0 | 0 |
| 1231512 | Tenured | 0 | 0 | 1 |
| 1234234 | Tenured | 0 | 0 | 1 |
| 3523123 | Tenure-track | 0 | 1 | 0 |
| 2341032 | Tenured | 0 | 0 | 1 |
| 2355432 | Tenure-track | 0 | 1 | 0 |
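The dummy-variable construction shown in the example can be sketched as (category labels abbreviated, IDs taken from the example):

```python
# A few rows from the professor example, with abbreviated category labels
rows = [
    {"id": 1098234, "prof_type": "Adjunct"},
    {"id": 3875039, "prof_type": "Tenure-track"},
    {"id": 2943029, "prof_type": "Tenured"},
]

categories = sorted({r["prof_type"] for r in rows})
for row in rows:
    for cat in categories:  # one new 0/1 variable per category
        row[cat] = 1 if row["prof_type"] == cat else 0

print(rows[0]["Adjunct"], rows[0]["Tenured"])  # 1 0
```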
Dates and Times
• Dates and times are not analytically useful by themselves. Typically dates represent some other information that is useful. These may be:
• Age
• Time elapsed between events
• Day of the week
• Month of the year
• Time of day
• Year
• Season
• In order to use dates or times, they should be converted to the information they are intended to represent.
Recoding Dates and Times • When dates or times are intended to capture information about the type of day or time (e.g., morning vs. evening, winter vs. summer, etc.), they should be recoded into dummy variables as though the data were categorical. • When dates or times are intended to capture age or time elapsed between events, identify the reference date(s) and perform one of the following functions in Excel: • Subtract one date from the other to find the total number of days between the dates • Use the NETWORKDAYS command to find the total number of work days between the dates • Use the DATEDIF command to find other specific types of date differences
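The date arithmetic above, plain subtraction plus a rough NETWORKDAYS analogue, can be sketched in Python (the simplified work-day count is inclusive of both endpoints and ignores holidays; the dates are invented):

```python
from datetime import date, timedelta

start, end = date(2024, 3, 1), date(2024, 3, 15)

# Total days between dates: plain subtraction, as in Excel
total_days = (end - start).days

# Work days: count weekdays in the interval (a simplified NETWORKDAYS,
# inclusive of both endpoints, ignoring holidays)
work_days = sum(
    1 for i in range(total_days + 1)
    if (start + timedelta(days=i)).weekday() < 5)

print(total_days, work_days)  # 14 11
```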
Preparing Data and Importing to R
Data Cleaning in Statistical Software vs. Excel • Some datasets or variables are easier to clean in statistical software than they are in Excel. This is particularly true of large datasets. • Cleaning data in statistical software can be more uniform and less prone to idiosyncratic human error. It is also often faster and easier. • Cleaning data in statistical software can be less transparent than in a program like Excel, because the software automates tasks and you don’t always see the changes in the raw data. Always check your work to make sure the software accomplished the tasks you intended. • It is often prudent to 1) version before major data cleaning tasks and 2) create duplicate variables before recoding. This can prevent lost work in the case of an error.
Preparing Data for Statistical Software Import • Make sure all formulas have been replaced with values in the dataset and there are no commas, apostrophes, or special symbols in the dataset. • Save your data, codebook, and research notes and version the file. • Select the data sheet and delete all header rows EXCEPT the variable name. If there is an empty column to the left of your data, delete it. • Select all remaining data and format as number (right click/format cells/number). • While the data sheet is selected, click “Save As.” • Identify a filename (the filename should not have spaces or numbers). • Change the file extension to Comma Separated Values format (.csv). This will save ONLY the values in each cell. All formatting and additional sheets will be deleted from this file.
Importing Data Into R • In R Commander, select Import/export data under the MPA Statistics tab. You can import various types of files here, but we will only describe how to import a .csv. • Select “…from text file, clipboard or URL”
Importing Data Into R (Continued) • Enter the name by which you would like your dataset to be known in R (no numbers or spaces). • Change the field separator to “comma.” Make sure the box for “variable names in file” is checked. • Click OK and navigate to the .csv file containing the data and variable names.
Successful Import • The command window in RStudio will indicate whether or not the data import was successful. If successful, it should read, “The dataset filename has XX rows and XX columns. ” • Your data name should also appear blue in the R Commander data set window in the top right of the R Commander window. • If the upload was not successful, the words “no active dataset” will appear in red in the R Commander data set window. • Verify that the dataset is functional by selecting MPA Statistics/Descriptive Statistics/Summarize dataset. Review the data summary for errors. • You can also view the dataset by clicking the “View data set” button.
Recoding Data With R Commander • Under the edit data tab, select the “manage variables in active data set” option and select “Recode variable.” • Select the variable you wish to recode, and enter a new name for the recoded variable (unless you want to overwrite your original variable). • Uncheck the box that says “Make (each) new variable a factor.” • Enter the recodes you would like to complete. • Each recode should appear on a different line in the box. • For example, if I want to change all values of 1 to values of 10, I would enter “1=10”. If I also want all values of 2 to change to values of 20, I would enter “2=20” on a separate line. If I want values of 99 to be coded as missing, I would enter “99=NA”. If I want values of “.” to be coded as missing, I would enter “.”=NA (note that the period must be in quotes).
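The recode lines entered in R Commander amount to a value map; a Python sketch of the same 1=10, 2=20, 99=NA example:

```python
# Recode map mirroring the R Commander example: 99 and "." become missing (None)
recodes = {1: 10, 2: 20, 99: None, ".": None}

def recode(value):
    """Apply the recode map; values without a rule pass through unchanged."""
    return recodes[value] if value in recodes else value

values = [1, 2, 3, 99, "."]
print([recode(v) for v in values])  # [10, 20, 3, None, None]
```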
Renaming or Deleting Variables • Under the edit data tab, select the “manage variables in active data set” and select the appropriate option: • Rename. With this option, you can rename variables in your dataset. Be sure to update your codebook. • Delete variables. With this option, you can delete variables from your dataset.
To Delete Observations with Missing Data • In the R Commander window, select the Data tab. • Select “Active data set”. • In the submenu that appears, select “remove cases with missing data”. • Select the variables that have missing data you want to delete. • This will delete ALL data for rows that are missing data in one or more of the selected variables. • Use with caution.
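The remove-cases step above is listwise deletion on the selected variables; a sketch with invented rows:

```python
# Hypothetical observations; None marks a missing value
rows = [
    {"id": 1, "age": 34,   "income": 51000},
    {"id": 2, "age": None, "income": 62000},
    {"id": 3, "age": 29,   "income": None},
]

# Listwise deletion: drop any row missing a value in the selected variables.
# This removes ALL data for that row, so use with caution.
selected = ["age", "income"]
complete = [r for r in rows if all(r[v] is not None for v in selected)]
print([r["id"] for r in complete])  # [1]
```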
Characteristics of a Clean Dataset
A “Clean” Dataset Has The Following Characteristics: • The codebook associated with the data is complete and accurate • Any data that is not useful for analysis (variables that have been recoded, observations with missing data, etc. ) has been deleted. • The maximum and minimum values for all variables make sense • All variables have been recoded appropriately • All observations with missing data that could not be recovered (or imputed with extreme caution) have been deleted, leaving an equal number of observations for every variable