Data preparation and descriptive statistics (Chapter 4)

Codebook example Statistics for Marketing & Consumer Research Copyright © 2008 - Mario Mazzocchi

Post-editing in SPSS • Check for consistency by screening variables

Missing data and SPSS (1) • Mean/median interpolation replacement based on the target

Missing data & SPSS (2) • More complex imputation techniques

Weighting cases in SPSS

Univariate qualitative or discrete data

Univariate continuous data (1)
[Figure: plot annotated with maximum, upper quartile, median, lower quartile and minimum]

Bivariate and multivariate plots

Frequency table

Descriptive statistics

Cross-tabulation

3-variable frequency table

Quantitative by categorical

Outliers: scatterplots (consumption and expenditure)

Data editing
• Are non-response errors within acceptable limits?
• Does the questionnaire meet the basic respondent requirements?
• Are the responses in the questionnaire complete?
• Are they consistent and clear?
• Should low-quality questionnaires be replaced or discarded?
• How should the database be organized?
• How are the questions coded?
• How is the transcribing process organized?

Non-response errors and countermeasures
• Refusals
  • Contacts prior to the interview
  • Incentives
  • Follow-ups
• Not-at-homes
  • Call-back plan
• Other countermeasures
  • Interview a sub-sample of non-respondents (with a different interview method)
  • Substitute non-respondents with other respondents from the sampling frame who are similar to the non-respondents with respect to some key characteristics
  • Post-editing (discussed later)

Problems in returned questionnaires
• Basic requirements
  • Has it reached the targeted respondent (or did someone else respond)?
  • Filter questions ensure that the respondent is qualified to answer
  • Invalid or deceptive questionnaires should be discarded
• Completeness of questionnaires
  • Missing responses to some items (e.g. sensitive questions, missing pages)
• Consistency
  • Specific structures and multiple related questions allow a quality check (inconsistencies, ambiguities, etc.)
  • E.g. "How much do you drive?" and "Do you have a driving license?"

Dealing with unsatisfactory responses
• Best solution: go back to the respondent and clarify the issues (with a different interview method)
  • Call-backs are expensive, but improve quality
• Alternative solution: assign missing values to unsatisfactory responses
  • If a questionnaire has a large proportion of missing values, it is better to discard the whole questionnaire
  • If only a small proportion of questionnaires are incomplete, it is better to discard those questionnaires entirely
• The discarding procedure may reintroduce biases that the sampling procedure was expected to avoid
  • Example: all low-quality questionnaires come from the same socio-economic group; discarding them reduces the representativeness of the sample
  • Non-random unsatisfactory responses indicate a problem in the survey procedure
• Other statistical procedures deal with missing data (discussed later)

From the questionnaire to the data sheet
1) Organize the database
  • Clearly define the structure of the electronic records and the number of variables
  • Example: a multiple-choice item can be classified as a single categorical variable or as a set of binary variables
2) Prepare the codebook
  • Assign variable names
  • Decide the variable types (see lecture 1: qualitative vs. quantitative, nominal vs. ordinal, etc.)
  • Decide the variable width (decimals, etc.)
  • Decide the coding that identifies missing values (e.g. -999 non-response, -888 unreadable, etc.)
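The codebook step can be sketched in a few lines of Python. This is only an illustration: the variable names (`q1_age`, `q2_income`) and the dictionary layout are assumptions, and only the two missing-value codes (-999 and -888) come from the slide.

```python
# Missing-value codes taken from the slide.
MISSING_CODES = {-999: "non-response", -888: "unreadable"}

# Hypothetical codebook: variable name -> type and number of decimals.
codebook = {
    "q1_age": {"type": "quantitative", "decimals": 0},
    "q2_income": {"type": "ordinal", "decimals": 0},
}

def decode(value):
    """Return (value, None) for valid data, or (None, reason) for a missing code."""
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    return value, None

# One transcribed record (made-up values) decoded against the missing codes.
raw_record = {"q1_age": 34, "q2_income": -999}
clean = {name: decode(value) for name, value in raw_record.items()}
```

Keeping the reason alongside the missing value preserves the distinction between a refusal and an unreadable answer for later analysis.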

The transcription process
• With computer-assisted administration methods (CAPI, CATI or CAWI), the interviewer keys in the answers directly and the software automatically produces the data-set
• The software also performs consistency checks to validate the data-set, and anomalies are signalled to the interviewer
• For other questionnaires, electronic reading is possible using appropriate optical scanners and software
• In some circumstances, however, the questionnaires are keyed into the computer manually; this is a potential source of error, so random checks on the entered data are advisable

Post-editing treatment
• The editing step creates the data-set
• The post-editing process is a series of operations to improve the quality of the data prior to the actual statistical analysis through:
  • Further consistency checks
  • Missing data treatment
  • Weighting cases
  • Transforming variables
  • Creating new variables

Post-editing consistency checks
• More complex automatic checks than those in the editing step
• Check for outliers
  • Test whether some of the recorded quantitative values are too high or too low
  • Check whether multiple-choice responses are out of the established range
• Set rules (usually by software) for consistency checks
  • Example: an automatic search for those who declare they drive but have no driving license
• Cross-check the collected data against an external source (mainly census and other demographic data) to detect anomalies
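A minimal sketch of rule-based consistency checks of this kind, in Python. The records, the field names and the assumed 1-5 satisfaction scale are all hypothetical; the driving-license rule mirrors the example on the slide.

```python
# Hypothetical transcribed records.
records = [
    {"id": 1, "km_per_week": 120, "has_license": True, "satisfaction": 4},
    {"id": 2, "km_per_week": 80, "has_license": False, "satisfaction": 3},
    {"id": 3, "km_per_week": 0, "has_license": False, "satisfaction": 5},
    {"id": 4, "km_per_week": 45, "has_license": True, "satisfaction": 9},
]

VALID_SATISFACTION = range(1, 6)  # assumed 1-5 multiple-choice scale

def check(record):
    """Return the list of consistency problems found in one record."""
    problems = []
    if record["km_per_week"] > 0 and not record["has_license"]:
        problems.append("drives without a driving license")
    if record["satisfaction"] not in VALID_SATISFACTION:
        problems.append("satisfaction out of range")
    return problems

# Keep only the records that break at least one rule.
flagged = {r["id"]: check(r) for r in records if check(r)}
```

Flagged cases are only candidates for review; whether to correct, recode as missing or discard them remains a judgment call.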

Missing data treatment
• Problems with missing data
  • Reduction in sample size (hence in the precision of estimates)
  • Introduction of systematic biases when missing data are related to specific characteristics (e.g. missing data in the lower income range)
• When missing data are not a problem:
  • Large sample size
  • Non-responses (missing data) are random, as they do not depend on respondent characteristics (missing at random or MAR data)
• In that case they can be discarded
  • Casewise or listwise deletion: all data for the case with missing data are deleted
  • Pairwise deletion: for each estimation all valid cases are considered, so the same case might enter one estimation and not another because of missing data
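The difference between listwise and pairwise deletion can be illustrated with a small Python sketch. The four cases and the variable names are made up; `None` stands for a missing value.

```python
from statistics import mean

# Hypothetical cases; None marks a missing value.
cases = [
    {"age": 25, "income": 30},
    {"age": 40, "income": None},
    {"age": None, "income": 50},
    {"age": 35, "income": 40},
]

# Listwise deletion: drop every case that has at least one missing value.
complete = [c for c in cases if None not in c.values()]
listwise_mean_age = mean(c["age"] for c in complete)

# Pairwise deletion: each estimate uses all cases valid for that variable.
pairwise_mean_age = mean(c["age"] for c in cases if c["age"] is not None)
pairwise_mean_income = mean(c["income"] for c in cases if c["income"] is not None)
```

Here listwise deletion estimates mean age from two cases, while pairwise deletion uses three, which is why the two approaches can give different estimates from the same data-set.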

Non-random missing data
• Diagnosis: test whether data are MAR
  • Divide the sample into two groups (missing data vs. non-missing data)
  • Perform mean comparison tests on relevant respondent characteristics (control variables)
• Cures:
  • Statistical imputation: missing information is replaced by information already in the sample
    • Mean substitution (reduces data variability)
    • Regression analysis (see lecture 9) using relations with other variables
    • Multiple imputation: several imputation methods are exploited and an average of the estimates is used
    • EM algorithm: assumes a specific distribution (usually normal), then obtains estimates through maximum likelihood; the procedure is iterative
  • Directly in the statistical analysis: for example, treating cases with missing values as a stand-alone group
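The diagnosis and the simplest cure can be sketched in Python as follows. All numbers are hypothetical, and the Welch t statistic is used here as one possible mean comparison test (the slide does not name a specific one); in practice the statistic would still have to be compared with a t distribution.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Diagnosis: compare a control variable (age) between respondents with and
# without a recorded income. All figures are hypothetical.
age_income_present = [30, 35, 40, 45, 50]
age_income_missing = [55, 60, 65, 70]

def welch_t(a, b):
    """Welch t statistic for the difference between two group means."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

t_stat = welch_t(age_income_present, age_income_missing)

# Cure (simplest option): mean substitution for the missing incomes.
income_observed = [10.0, 12.0, 14.0, 16.0, 18.0]
n_missing = len(age_income_missing)
income_imputed = income_observed + [mean(income_observed)] * n_missing

sd_before = stdev(income_observed)
sd_after = stdev(income_imputed)  # smaller: imputed values add no spread
```

A large absolute t statistic suggests that missingness is related to age, so the data are unlikely to be missing at random. The drop from `sd_before` to `sd_after` shows why mean substitution reduces data variability.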

Weighting cases
• If some particular sub-group of the population is under- or over-represented in the sample, applying a weight to each case restores the representativeness of the sample
• When to apply weights:
  • With stratified sampling or cluster sampling, weights are necessary and depend on the sampling design (see lecture 5)
  • Probabilistic samples are expected to be self-weighted, but the self-weighting property is undermined by non-responses: weights adjust the sample
  • Weighting can also be used to give the information from one sub-group of the population more value than that from others, as an explicit choice of the researcher in relation to the final objective
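As an illustration of how weights restore representativeness, the sketch below computes simple post-stratification weights (population share divided by sample share). The groups, shares and group means are invented, and this is only one of several possible weighting schemes.

```python
# Hypothetical population shares and achieved sample counts by age group.
population_share = {"under_40": 0.50, "40_plus": 0.50}
sample_counts = {"under_40": 30, "40_plus": 70}
group_means = {"under_40": 2.0, "40_plus": 4.0}  # e.g. kg purchased per week

n = sum(sample_counts.values())

# Weight for each case in a group: population share / sample share.
weights = {g: population_share[g] / (sample_counts[g] / n) for g in sample_counts}

# Unweighted mean over-represents the 40+ group.
unweighted_mean = sum(sample_counts[g] * group_means[g] for g in sample_counts) / n

# Weighted mean: each case counts in proportion to its weight.
weighted_total = sum(weights[g] * sample_counts[g] * group_means[g] for g in sample_counts)
weight_sum = sum(weights[g] * sample_counts[g] for g in sample_counts)
weighted_mean = weighted_total / weight_sum
```

With these figures the under-represented group receives a weight above 1 and the over-represented group a weight below 1, and the weighted mean moves to the value implied by the population shares.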

Transforming and creating variables
Variables can be modified (or new variables can be generated) through:
(a) Mathematical transformations
  • E.g. logarithmic transformation to reduce variability
(b) Banding
  • Categorization of quantitative variables
(c) Recoding
  • Change the response categories of a qualitative variable
(d) Ranking
  • Transform a scale value into an ordinal variable which reflects ranking (e.g. transformation into quartiles)
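The four kinds of transformation can be sketched with the Python standard library. The income values, band limits and category labels are assumptions chosen for illustration, and `statistics.quantiles` uses its default ("exclusive") method, which may differ slightly from the quartiles another package reports.

```python
from math import log
from statistics import quantiles

income = [12, 18, 25, 31, 40, 55, 70, 120]  # hypothetical, in thousands

# (a) Mathematical transformation: logarithm compresses large values.
log_income = [log(x) for x in income]

# (b) Banding: turn the quantitative variable into categories.
def band(x):
    if x < 20:
        return "low"
    if x < 50:
        return "medium"
    return "high"

banded = [band(x) for x in income]

# (c) Recoding: merge response categories of a qualitative variable.
recode_map = {"low": "not high", "medium": "not high", "high": "high"}
recoded = [recode_map[b] for b in banded]

# (d) Ranking: assign each value to its quartile (1 = lowest).
q1, q2, q3 = quantiles(income, n=4)
quartile = [1 + (x > q1) + (x > q2) + (x > q3) for x in income]
```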

Transforming and creating variables in SPSS
• COMPUTE: performs mathematical computations and transformations (optionally with an IF condition)
• RECODE: re-classifies a nominal or ordinal variable into a smaller or larger set of categories
• VISUAL BANDER: categorizes a scale variable
• RANK: creates rank variables
• AUTOMATIC RECODE: automatic recoding, such as the transformation of string variables into nominal variables

Data exploration
• Graphical plots of the data: to get a first overview of the main characteristics of the data-set, especially the distribution of the original variables across the whole sample and for sub-samples
• Univariate descriptive statistics and one-way tabulation: to synthesize the main characteristics of each of the variables in the data-set
• Multivariate descriptive statistics and cross-tabulation: to get a first understanding of the relationships between different variables by examining two or more variables jointly
• Missing data and outlier detection: to allow an early detection of potential issues in subsequent analysis

Graphs
• Univariate plots of qualitative or discrete data
• Univariate plots of quantitative data
• Bivariate and multivariate plots of quantitative versus qualitative data

Univariate continuous data (2)
Pareto charts
• Bars ordered in decreasing order of the frequencies they represent
• The line indicates the cumulative proportion
• Useful for quality control (ANALYZE/QUALITY CONTROL in SPSS)
Q-Q plots
• Compare the empirical (observed) data distribution with some theoretical distribution
• When the observed distribution is close to the theoretical one, the plotted values tend to lie on a straight line
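The numbers behind a Pareto chart (ordered bars plus the cumulative line) can be computed as in the sketch below. The complaint categories are invented data, and plotting is left out so that the example needs only the standard library.

```python
from collections import Counter

# Hypothetical complaint records from a quality-control log.
complaints = [
    "late", "damaged", "late", "wrong item", "late",
    "damaged", "late", "other", "late", "damaged",
]

# Bars: categories sorted by decreasing frequency.
bars = Counter(complaints).most_common()
total = sum(count for _, count in bars)

# Line: cumulative proportion after each bar.
cumulative = []
running = 0
for _, count in bars:
    running += count
    cumulative.append(running / total)
```

In this example the first two categories already account for 80% of the complaints, which is the kind of concentration a Pareto chart is meant to reveal.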

Tabulation in SPSS
• Frequency tables
  • Discrete, qualitative or banded metric variables
• Tables with descriptive statistics
• Cross-tabulation
  • Joint frequencies for two discrete or qualitative variables
  • Statistics for a metric variable by category of a non-metric variable
  • Multi-way frequency tables

Detection of missing values
Cross-tabulation with a related variable (e.g. income by self-perceived wealth)

Self-perceived wealth    Income present (count)    Income present (%)    Income missing (% Sys. Mis)
Total                    342                       68.4                  31.6
Not very well off        23                        88.5                  11.5
Difficult                33                        76.7                  23.3
Modest                   114                       74.5                  25.5
Reasonable               117                       63.9                  36.1
Well off                 53                        65.4                  34.6
Missing (Sys. Mis)       2                         14.3                  85.7

• Non-responses for income (31.6% in total) are not evenly distributed across self-perceived wealth: there is a larger proportion of non-response at higher wealth levels
• Hypothesis testing methods (see lectures 6 and 7) may add statistical evidence about the non-randomness of missing values

Detection of outliers
• Outliers are anomalous values compared to others in the sample
• They could be determined by:
  • A measurement error
  • A truly anomalous (exceptional) value with some specific cause
• If a measurement error is ruled out, should outliers be discarded or not?
  • Answer: it depends upon whether anomalous observations are relevant to the research objectives

Dealing with outliers (1)
1) Define them
  • A value that is more than 2.5 standard deviations from the mean
  • A value that lies more than 1.5 times the interquartile range beyond the upper or the lower quartile
As the sample size increases, the probability of having anomalous values increases and their impact on statistical analysis decreases. Hence one may wish to raise the above coefficients to higher values (for example 4 standard deviations instead of 2.5).

Dealing with outliers (2)
2) Detect them
  • Graphically (scatterplots, boxplots)
  • Check whether the combination of values across variables is anomalous, rather than the individual values
  • Multivariate analysis, bivariate scatterplots, etc.
3) Decide what to do (according to objective and subjective decisions)
  • Delete outliers
  • Correct outliers (similarly to missing values)
  • Leave them alone

Outlier detection in SPSS
• E.g. Trust data set, variable q4kilos:

Mean:                 1.06 kg
Standard deviation:   1.45 kg
Lower quartile:       0.50 kg
Upper quartile:       1.36 kg
Interquartile range:  0.86 kg

Outliers:
• Values above 4.69 (mean + 2.5 times the standard deviation)
• Or values above 2.65 (upper quartile + 1.5 times the interquartile range)
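The two thresholds can be reproduced from the summary statistics on the slide. The sketch assumes that 0.50 is the lower quartile and 1.36 the upper quartile (their difference gives the stated interquartile range of 0.86). The 2.65 figure is obtained by adding 1.5 times the interquartile range to the upper quartile, in line with the definition of outliers given earlier. The function name `is_outlier` is hypothetical.

```python
# Summary statistics for q4kilos as reported on the slide (kg).
mean_kg = 1.06
sd_kg = 1.45
lower_quartile = 0.50
upper_quartile = 1.36

iqr = upper_quartile - lower_quartile

# Rule 1: more than 2.5 standard deviations above the mean.
sd_threshold = mean_kg + 2.5 * sd_kg

# Rule 2: more than 1.5 times the IQR above the upper quartile.
iqr_threshold = upper_quartile + 1.5 * iqr

def is_outlier(value, threshold):
    """Flag a value lying above the chosen upper threshold."""
    return value > threshold
```

Only upper thresholds are computed because a purchased quantity cannot be negative, so the lower bounds of both rules fall below zero for these figures.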

Exploring outliers
• Filter out cases that are not outliers (e.g. values below 4.69)
• Summarize the cases left in the data-set together with some variables which could explain the outliers
• 5 outliers
  • The first two are likely to be response errors (expenditure is zero)
  • The remaining three values seem consistent with the expenditure and the number of household components