Data preparation and descriptive statistics Chapter 4 Statistics
- Slides: 39
Data preparation and descriptive statistics Chapter 4 Statistics for Marketing & Consumer Research Copyright © 2008 - Mario Mazzocchi 1
Data editing
• Are non-response errors within acceptable limits?
• Does the questionnaire meet the basic respondent requirements?
• Are the responses in the questionnaire complete?
• Are they consistent and clear?
• Should low-quality questionnaires be replaced or discarded?
• How should the database be organized?
• How are the questions coded?
• How is the transcribing process organized?
Non-response errors and countermeasures
• Refusals
  • Contacts prior to the interview
  • Incentives
  • Follow-ups
• Not-at-homes
  • Call-back plan
• Other countermeasures
  • Interview a sub-sample of non-respondents (with a different interview method)
  • Substitute non-respondents with other respondents from the sampling frame who are similar to the non-respondents with respect to some key characteristics
  • Post-editing (discussed later)
Problems in returned questionnaires
• Basic requirements
  • Has it reached the targeted respondent, or did someone else respond?
  • Filter questions ensure that the respondent is qualified to answer
  • Invalid or deceptive questionnaires should be discarded
• Completeness of questionnaires
  • Missing responses to some items (e.g. sensitive questions, missing pages)
• Consistency
  • Specific structures and multiple related questions allow a quality check (inconsistencies, ambiguities, etc.)
  • E.g. "How much do you drive?" and "Do you have a driving license?"
Dealing with unsatisfactory responses
• Best solution: go back to the respondent and clarify the issues (with a different interview method)
  • Call-backs are expensive, but improve quality
• Alternative solution: assign missing values to unsatisfactory responses
  • If a questionnaire has a large proportion of missing values, it is better to discard the whole questionnaire
  • If only a small proportion of questionnaires are incomplete, it is better to discard those questionnaires entirely
• The discarding procedure may reintroduce biases that the sampling procedure was expected to avoid
  • Example: all low-quality questionnaires come from the same socio-economic group; discarding them reduces the representativeness of the sample
  • Non-random unsatisfactory responses indicate a problem in the survey procedure
• Other statistical procedures deal with missing data (discussed later)
From the questionnaire to the data sheet
1) Organize the database
• Clearly define the structure of the electronic records and the number of variables
• Example: a multiple-choice item can be classified as a single categorical variable or as a set of binary variables
2) Prepare the codebook
• Assign variable names
• Decide the variable types (see lecture 1: qualitative vs. quantitative, nominal vs. ordinal, etc.)
• Decide the variable width (decimals, etc.)
• Decide the coding used to identify missing values (e.g. -999 non-response, -888 unreadable, etc.)
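The missing-value codes in the codebook have to be excluded before any statistic is computed. The following is a minimal Python sketch of that step, assuming the -999 and -888 codes mentioned above; the function name `clean_value` and the income figures are hypothetical and used only for illustration.

```python
# Missing-value codes as suggested in the codebook step (assumed convention).
MISSING_CODES = {-999: "non-response", -888: "unreadable"}

def clean_value(raw):
    """Return (value, reason); value is None when raw is a missing-value code."""
    if raw in MISSING_CODES:
        return None, MISSING_CODES[raw]
    return raw, None

# Hypothetical raw income column as keyed in from questionnaires.
raw_income = [1200, -999, 850, -888, 2300]

cleaned = [clean_value(v) for v in raw_income]
valid = [value for value, reason in cleaned if value is not None]
reasons = [reason for value, reason in cleaned if reason is not None]

# Averaging the raw column would treat -999 and -888 as real incomes.
mean_income = sum(valid) / len(valid)
print(valid, reasons, mean_income)
```

Keeping the reason alongside the value preserves the distinction between non-response and unreadable entries, which may matter when missing data are examined later.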
Codebook example
The transcription process
• With computer-assisted administration methods (CAPI, CATI or CAWI), the interviewer keys in the answers directly and the software automatically produces the data-set
• The software also performs consistency checks to validate the data-set, and anomalies are signalled to the interviewer
• For other questionnaires, there is an option for electronic reading using appropriate optical scanners and software
• In some circumstances, however, the questionnaires are keyed into the computer manually; this is a potential source of error, so random checks on the entered data are advisable
Post-editing treatment
• The editing step creates the data-set
• The post-editing process is a series of operations to improve the quality of the data prior to the actual statistical analysis, through:
  • Further consistency checks
  • Missing data treatment
  • Weighting cases
  • Transforming variables
  • Creating new variables
Post-editing consistency checks
• More complex automatic checks than those in the editing step
• Check for outliers
  • Test whether some of the recorded quantitative values are too high or too low
  • Check whether multiple-choice responses are out of the established range
• Set rules (usually by software) for consistency checks:
  • Example: an automatic search for those who declare they drive but have no driving license
• Cross-check the collected data against an external source (mainly census and other demographic data) to detect anomalies
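A rule-based consistency check of this kind can be sketched in a few lines of Python. The records, the field names (`km_per_week`, `has_license`) and the plausibility limit are all assumptions made for illustration, not values from the course data.

```python
# Hypothetical survey records; field names are illustrative only.
records = [
    {"id": 1, "km_per_week": 120, "has_license": True},
    {"id": 2, "km_per_week": 60, "has_license": False},
    {"id": 3, "km_per_week": 0, "has_license": False},
    {"id": 4, "km_per_week": 9000, "has_license": True},
]

MAX_PLAUSIBLE_KM = 3000  # assumed upper bound for a range check

def check(record):
    """Return a list of rule violations for one record."""
    problems = []
    # Logical rule: driving is declared but no license is held.
    if record["km_per_week"] > 0 and not record["has_license"]:
        problems.append("drives without a license")
    # Range rule: quantitative value too high or too low.
    if not 0 <= record["km_per_week"] <= MAX_PLAUSIBLE_KM:
        problems.append("km_per_week out of range")
    return problems

flagged = {r["id"]: check(r) for r in records if check(r)}
print(flagged)
```

Flagged cases are only candidates for review; whether to correct, recode as missing or discard them remains a judgment for the analyst.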
Missing data treatment
• Problems with missing data
  • Reduction in sample size (hence in the precision of estimates)
  • Introduction of systematic biases when missing data are related to specific characteristics (e.g. missing data in the lower income range)
• When missing data are not a problem:
  • Large sample size
  • Non-responses (missing data) are random, as they do not depend on respondent characteristics (missing at random, or MAR, data)
• In that case they can be discarded
  • Casewise or listwise deletion: all data for a case with missing data are deleted
  • Pairwise deletion: for each estimation all valid cases are considered, so the same case might enter one estimation and not another because of missing data
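The difference between listwise and pairwise deletion can be illustrated with a small Python sketch. The four cases and their variable names are invented for the example, with `None` standing for a missing value.

```python
from statistics import mean

# Hypothetical cases; None marks a missing value.
cases = [
    {"age": 34, "income": 2100, "spend": 40},
    {"age": 51, "income": None, "spend": 55},
    {"age": None, "income": 1800, "spend": 30},
    {"age": 45, "income": 2600, "spend": None},
]

def listwise(rows):
    """Keep only cases with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_mean(rows, var):
    """Use every case that is valid for this particular variable."""
    return mean(r[var] for r in rows if r[var] is not None)

complete = listwise(cases)
listwise_spend = mean(r["spend"] for r in complete)
pairwise_spend = pairwise_mean(cases, "spend")

print(len(complete), listwise_spend, pairwise_spend)
```

Here listwise deletion keeps a single case, while the pairwise mean of `spend` uses three; the price of pairwise deletion is that different estimates rest on different sub-samples.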
Non-random missing data
• Diagnosis: test whether data are MAR
  • Divide the sample into two groups (missing data vs. non-missing data)
  • Perform mean comparison tests on relevant respondent characteristics (control variables)
• Cures:
  • Statistical imputation: missing information is replaced by information already in the sample
    • Mean substitution (reduces data variability)
    • Regression analysis (see lecture 9) using relations with other variables
    • Multiple imputation: several imputation methods are exploited and an average of the estimates is used
    • EM algorithm: assumes a specific distribution (usually normal), then obtains estimates through maximum likelihood; the procedure is iterative
  • Directly in the statistical analysis: for example, treating cases with missing values as a stand-alone group
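The diagnosis and the simplest cure can be sketched as follows. This is an illustration under stated assumptions: the data are invented, the mean comparison uses a Welch-type t statistic (one possible choice of mean comparison test), and no p-value is computed.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical (age, income) pairs; None marks a missing income.
rows = [(25, 1500), (31, 1700), (38, None), (42, 2100),
        (47, None), (53, 2400), (58, None), (64, None)]

# Diagnosis: split on missingness of income, then compare a control variable.
age_missing = [age for age, income in rows if income is None]
age_present = [age for age, income in rows if income is not None]

def welch_t(a, b):
    """t statistic for a difference in means with unequal variances."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

t_age = welch_t(age_missing, age_present)

# Cure: mean substitution, which shrinks the variability of the variable.
observed = [income for _, income in rows if income is not None]
fill = mean(observed)
imputed = [income if income is not None else fill for _, income in rows]

print(t_age, variance(observed), variance(imputed))
```

A large t statistic on a control variable suggests that missingness is related to respondent characteristics; the drop in variance after imputation shows why mean substitution is noted above as reducing data variability.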
Weighting cases
• If some sub-group of the population is under- or over-represented in the sample, applying a weight to each case restores the representativeness of the sample
• When to apply weights:
  • With stratified sampling or cluster sampling, weights are necessary and depend on the sampling design (see lecture 5)
  • Probabilistic samples are expected to be self-weighted, but the self-weighting property is undermined by non-responses: weights adjust the sample
  • Weighting can also be used to give the information from one sub-group of the population more or less importance than that from others, as an explicit choice of the researcher in relation to the final objective
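One common way to build such weights is to divide each group's population share by its sample share. The sketch below assumes three hypothetical age groups with made-up shares, counts and group means; it is one simple weighting scheme, not necessarily the one used in the course.

```python
# Assumed population shares and achieved sample counts per group.
population_share = {"under_35": 0.30, "35_to_64": 0.50, "65_plus": 0.20}
sample_counts = {"under_35": 40, "35_to_64": 50, "65_plus": 10}
group_mean_spend = {"under_35": 20.0, "35_to_64": 30.0, "65_plus": 50.0}

n = sum(sample_counts.values())

# Weight = population share / sample share, applied to each case in the group.
weights = {g: population_share[g] / (sample_counts[g] / n) for g in sample_counts}

unweighted_mean = sum(sample_counts[g] * group_mean_spend[g] for g in sample_counts) / n
weighted_total = sum(weights[g] * sample_counts[g] for g in sample_counts)
weighted_mean = sum(
    weights[g] * sample_counts[g] * group_mean_spend[g] for g in sample_counts
) / weighted_total

print(weights, unweighted_mean, weighted_mean)
```

The under-represented group (here the oldest one) receives a weight above 1, and the weighted mean moves towards the value that group would contribute under a representative sample.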
Transforming and creating variables
Variables can be modified (or new variables can be generated) through:
(a) Mathematical transformations
• E.g. logarithmic transformation to reduce variability
(b) Banding
• Categorization of quantitative variables
(c) Recoding
• Change the response categories of a qualitative variable
(d) Ranking
• Transform a scale value into an ordinal variable which reflects ranking (e.g. transformation into quartiles)
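The four operations can be sketched in plain Python. The income values, band limits and category labels are assumptions chosen for the example, and the quartile ranking shown does not handle tied values.

```python
import math

# Hypothetical monthly incomes.
incomes = [800, 1200, 1500, 2100, 2600, 3400, 5200, 9000]

# (a) Mathematical transformation: natural logarithm.
log_income = [math.log(x) for x in incomes]

# (b) Banding: categorize a quantitative variable (band limits are assumed).
def band(x):
    if x < 1500:
        return "low"
    if x < 3000:
        return "medium"
    return "high"

bands = [band(x) for x in incomes]

# (c) Recoding: collapse three categories into two.
recode_map = {"low": "not high", "medium": "not high", "high": "high"}
recoded = [recode_map[b] for b in bands]

# (d) Ranking: quartile group (1 to 4) by position in the sorted order.
order = sorted(range(len(incomes)), key=lambda i: incomes[i])
quartile = [0] * len(incomes)
for rank, i in enumerate(order):
    quartile[i] = rank * 4 // len(incomes) + 1

print(bands, recoded, quartile)
```

The logarithm compresses the upper tail: the ratio between the largest and smallest value falls from more than 11 on the original scale to well under 2 on the log scale.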
Post-editing in SPSS
• Check for consistency by screening variables
Missing data and SPSS (1)
• Mean, median or interpolation replacement, based on the target variable only
Missing data and SPSS (2)
• More complex imputation techniques
Weighting cases in SPSS
Transforming and creating variables in SPSS
• COMPUTE: performs mathematical computations and transformations (optionally with an IF condition)
• RECODE: re-classifies a nominal or ordinal variable into a smaller or larger set of categories
• VISUAL BANDER: categorizes a scale variable
• RANK: creates rank variables
• AUTOMATIC RECODE: performs automatic recoding, such as the transformation of string variables into nominal variables
Data exploration
• Graphical plots of the data: to get a first overview of the main characteristics of the data-set, especially the distribution of the original variables across the whole sample and for sub-samples
• Univariate descriptive statistics and one-way tabulation: to synthesize the main characteristics of each of the variables in the data-set
• Multivariate descriptive statistics and cross-tabulation: to get a first understanding of the relationships existing between different variables, enabling the joint examination of two or more variables
• Missing data and outlier detection: to allow an early detection of potential issues in subsequent analysis
Graphs
• Univariate plots of qualitative or discrete data
• Univariate plots of quantitative data
• Bivariate and multivariate plots of quantitative versus qualitative data
Univariate qualitative or discrete data
Univariate continuous data (1)
[Figure; labels: maximum, upper quartile, median, lower quartile, minimum]
Univariate continuous data (2)
Pareto charts
• Bars ordered in decreasing order of the frequencies they represent
• The line indicates the cumulative proportion
• Useful for quality control (ANALYZE/QUALITY CONTROL in SPSS)
Q-Q plots
• Compare the empirical (observed) data distribution with some theoretical distribution
• When the observed distribution is close to the theoretical one, the plotted values tend to lie on a straight line
Bivariate and multivariate plots
Bivariate and multivariate plots
Tabulation in SPSS
• Frequency tables
  • Discrete, qualitative or banded metric variables
• Tables with descriptive statistics
• Cross-tabulation
  • Joint frequencies for two discrete or qualitative variables
  • Statistics for a metric variable by category of a non-metric variable
  • Multi-way frequency tables
Frequency table
Descriptive statistics
Cross-tabulation
3-variable frequency table
Quantitative by categorical
Detection of missing values
Cross-tabulation with a related variable (e.g. income by self-perceived wealth)

Income                  Total   Not very well off   Difficult   Modest   Reasonable   Well off   Missing (Sys. Mis)
Present (count)           342                  23          33      114          117         53                    2
Present (%)              68.4                88.5        76.7     74.5         63.9       65.4                 14.3
Missing (% Sys. Mis)     31.6                11.5        23.3     25.5         36.1       34.6                 85.7

• Non-responses for income (31.6% in total) are not evenly distributed across self-perceived wealth: there is a larger proportion of non-response at higher wealth levels
• Hypothesis testing methods (see lectures 6 and 7) may add statistical evidence about the non-randomness of missing values
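The percentages in the cross-tabulation can be reproduced with a short Python sketch. The counts of present responses come from the table; the counts of missing responses are not shown on the slide and are inferred here from the reported percentages, so they should be read as an assumption.

```python
# Present counts are taken from the table above.
present = {
    "Not very well off": 23, "Difficult": 33, "Modest": 114,
    "Reasonable": 117, "Well off": 53, "Missing": 2,
}
# Missing counts are NOT on the slide: inferred from the reported percentages.
missing = {
    "Not very well off": 3, "Difficult": 10, "Modest": 39,
    "Reasonable": 66, "Well off": 28, "Missing": 12,
}

def pct_missing(group):
    """Share of cases in the group with missing income, in percent."""
    total = present[group] + missing[group]
    return round(100 * missing[group] / total, 1)

by_group = {g: pct_missing(g) for g in present}

n_present = sum(present.values())
n_missing = sum(missing.values())
overall = round(100 * n_missing / (n_present + n_missing), 1)

print(by_group, overall)
```

Comparing the group percentages with the overall figure is the informal version of the check; a formal test of association between missingness and the related variable would be needed to draw a statistical conclusion.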
Detection of outliers
• Outliers are anomalous values compared to the others in the sample
• They could be determined by:
  • A measurement error
  • A truly anomalous (exceptional) value with some specific cause
• If a measurement error is ruled out, should outliers be discarded or not?
• Answer: it depends upon whether anomalous observations are relevant to the research objectives
Dealing with outliers (1)
1) Define them
• A value that is more than 2.5 standard deviations from the mean
• A value that lies more than 1.5 times the interquartile range beyond the upper or the lower quartile
• As the sample size increases, the probability of having anomalous values increases and their impact on statistical analysis decreases
• Hence one may wish to raise the above coefficients to higher values (for example, 4 standard deviations instead of 2.5)
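Both definitions can be applied to raw data with the Python standard library. The sample below is invented, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is one of several quartile conventions and may differ slightly from what SPSS reports.

```python
from statistics import mean, stdev, quantiles

def sd_rule(values, k=2.5):
    """Values more than k standard deviations away from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

def iqr_rule(values, k=1.5):
    """Values more than k interquartile ranges beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]

# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

sd_outliers = sd_rule(sample)         # default coefficient 2.5
iqr_outliers = iqr_rule(sample)       # default coefficient 1.5
strict_outliers = sd_rule(sample, k=4)  # raised coefficient

print(sd_outliers, iqr_outliers, strict_outliers)
```

Raising the coefficient from 2.5 to 4 leaves no flagged value in this sample, which illustrates how the threshold choice changes what counts as an outlier.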
Dealing with outliers (2)
2) Detect them
• Graphically (scatterplots, boxplots)
• Check whether the combination of variables is anomalous, rather than the individual values
  • Multivariate analysis, bivariate scatterplots, etc.
3) Decide what to do (according to objective and subjective decisions)
• Delete outliers
• Correct outliers (similarly to missing values)
• Leave them alone
Outlier detection in SPSS
• E.g. Trust data set, variable q4kilos:
  Mean: 1.06 Kg
  Standard deviation: 1.45 Kg
  Upper quartile: 1.36 Kg
  Lower quartile: 0.50 Kg
  Interquartile range: 0.86 Kg
• Outliers: values above 4.69 (mean + 2.5 times the standard deviation), or values above 2.65 (upper quartile + 1.5 times the interquartile range)
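The two cut-offs follow directly from the summary statistics. The sketch below recomputes them in Python from the reported figures, taking 1.36 as the upper quartile and 0.50 as the lower quartile (the only assignment consistent with an interquartile range of 0.86); the 2.65 threshold is obtained as upper quartile + 1.5 times the interquartile range, in line with the definition of outliers given earlier.

```python
# Summary statistics reported for the consumption variable (Kg).
mean_kg = 1.06
sd_kg = 1.45
lower_quartile = 0.50
upper_quartile = 1.36

iqr = upper_quartile - lower_quartile  # about 0.86

# Rule 1: mean + 2.5 standard deviations, about 4.69 as reported.
sd_cutoff = mean_kg + 2.5 * sd_kg

# Rule 2: upper quartile + 1.5 interquartile ranges, about 2.65 as reported.
iqr_cutoff = upper_quartile + 1.5 * iqr

print(f"{iqr:.2f} {sd_cutoff:.3f} {iqr_cutoff:.2f}")
```

The second rule is the stricter one here: it flags every value above roughly 2.65 Kg, whereas the first only flags values above roughly 4.69 Kg.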
Exploring outliers
• Eliminate cases that are not outliers (e.g. below 4.69)
• Summarize the cases left in the data-set using some variables which could explain the outliers
• 5 outliers
  • The first two are likely to be response errors (expenditure is zero)
  • The remaining three values seem consistent with the expenditure and the number of components
Outliers: scatterplots
[Figures: consumption and expenditure; consumption and prices]