Sampling: Chapter 5, Statistics for Marketing & Consumer Research


Sampling Chapter 5 Statistics for Marketing & Consumer Research Copyright © 2008 - Mario Mazzocchi 1

Sample
• A sub-set of the units in a population, expected to represent the whole population
• By measuring data on a sample:
  – Information on the entire population is gathered at a lower cost compared to censuses
  – Some margin of error is necessarily accepted

Probability vs. non-probability sampling
Probability sampling: each unit in the sampling frame is associated with a known probability of being included in the sample, which means that the probability of each potential sample is known
Non-probability sampling: the extraction of sample units is not based on probability rules

Inference and sampling error
• Prior knowledge of the probability of all potential samples allows statistical inference
• Statistical inference is the generalization of sample statistics to the parameters of the target population, subject to a margin of uncertainty, or sampling error
• The probability laws ruling probability sampling allow one to ascertain how closely the sample estimates reflect the true characteristics of the target population
• The sampling error can be estimated and used to assess the precision of sample estimates
• The sampling error is only a portion of the survey error (which also includes non-sampling error), but has the advantage that it can be estimated and controlled using the information on the sampling method

Characteristics and limits of non-probability sampling
• The use of non-probability samples is common in marketing research, especially quota sampling (discussed later)
• It can be argued that the sampling error is often much smaller than the error from non-sampling sources
• However, the problems with non-probability samples are that:
  – Selection of the sampling units is subjective
  – It is impossible to assess scientifically the ability to avoid the potential biases of a non-probability sample
  – Sampling error, precision and accuracy cannot be estimated
  – Statistical methods to analyze sample data are based on probability assumptions (e.g. normality of the data distribution), which can only be determined by probability extraction rules

Sampling concepts
• The sampling error depends on:
  – The sampling fraction (ratio between sample size and population size)
    – Larger sampling fractions increase precision
    – However, the gain in precision decreases as the sampling fraction increases
  – The data variability in the population
    – The less variable the population data, the more precise the sample estimates
    – However, population variability is rarely known and generally estimated
  – The precision of the sample estimator
    – Precision of an estimator (measured through the standard error of the estimator) is the variability of an estimate across multiple measurements (across different samples)

Standard deviation vs. standard error
• The standard deviation measures the variability of a given variable (e.g. X) within the population or sample
• The standard error is a precision measure which refers to the variability of the sample estimator (e.g. the sample mean) across multiple estimates
• The standard error depends on the standard deviation, but they are not the same concept

Accuracy and precision
Accuracy: the degree to which the sample estimate is close to the true population value
  – Maximum accuracy is obtained when the estimate equals the true population value
Precision: the variability of the sample estimate in repeated measurements (across different samples)
  – Maximum precision is obtained when the estimate is the same across all samples
The standard error of an estimator is a measure of precision

Samples and population: some terminology
Target population: the set of units which are the object of the research and from which the sample is extracted. The characteristics to be measured in the population are usually called parameters
Sampling frame: a list of the population units
Sample size: the number of units (n) in the sample
Sample statistic: an estimate of the population parameter based on the sample observations
Sampling distribution: the probability distribution of the sample statistics around the true population value

The indirect problem and inference
• If one extracted all potential samples from a population (the sampling space), the sampling distribution would be known exactly
• However, this would be a pointless exercise, given that the true population parameter would already be known
• Thus, statisticians are interested in the indirect problem:
  – Only one sample is extracted
  – Only the sample statistics are known
  – The sampling distribution is not known exactly, but it can be ascertained from the probabilistic sampling method
• Given the sampling distribution and the sample statistics, one obtains estimates of the true population parameters through statistical inference

Example – sample mean
Extract two elements from a population of four
POPULATION: A=4; B=1; C=3; D=4 (population average = 3)
SAMPLING SPACE (sample means):
• AB: 2.5
• AC: 3.5
• AD: 4
• BC: 2
• BD: 2.5
• CD: 3.5

Example – the sampling distribution
• The average of all sample means in the sampling space is three (equal to the true population mean)
• None of the extracted samples exactly reflects the population (none has a mean of 3)
• The mean absolute error which we commit by observing only two out of four population units is 0.667; this is a direct measure of sampling error
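The figures in this example can be checked by enumerating the whole sampling space. The sketch below is only an illustration of the slide's arithmetic; the variable names are assumptions and not part of the original material.

```python
from itertools import combinations

# Population from the example slide
population = {"A": 4, "B": 1, "C": 3, "D": 4}


def average(values):
    """Plain arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


pop_mean = average(population.values())

# Every possible sample of two distinct units (the sampling space)
sample_means = {
    "".join(pair): average(population[unit] for unit in pair)
    for pair in combinations(population, 2)
}

# Average of the sample means and mean absolute error across all samples
mean_of_means = average(sample_means.values())
mean_abs_error = average(abs(m - pop_mean) for m in sample_means.values())

for name, value in sample_means.items():
    print(name, value)
print("mean of sample means:", mean_of_means)
print("mean absolute error:", round(mean_abs_error, 3))
```

Enumerating all samples is only feasible for toy populations like this one, which is why the indirect problem relies on a single sample.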

Example – the indirect problem
• Suppose we have observed only one sample, which is extracted randomly
• The probability extraction method and the sample observations allow us to:
  – Obtain an estimate of the population mean
  – Obtain a precision estimate (the sampling error)
• By combining the sample estimate with the sampling error, one can draw inference on the true population value, for example by defining a bracket which is likely to include the true value

Example results
• Suppose we extract the sample AB through simple random sampling
• The sample mean is 2.5
• The mean error within the sample is 0.5
• Under a very rough (and inexact) assumption, namely that the mean error within the sample reflects the sampling error, we might claim that the true population value lies between 2.5 − 0.5 and 2.5 + 0.5, that is between two and three
• This is a rough example, but with large samples and probability theory, knowledge based on a single sample can lead to accurate conclusions on the whole population, accounting for sampling error

In practice…
• One sample is extracted
• The sample mean and the sample standard deviation are obtained
• An estimate of precision is obtained through an estimate of the standard error of the mean, which is a function of the sample standard deviation and the sample size
• Using the sample mean and the measure of precision, one draws conclusions on the population mean (see lecture 6)

Population parameters (in a population of N elements)
• Mean
• Variance
• Standard deviation
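The formulas on this slide are not preserved in the transcript. As a reminder, the usual textbook definitions are given below; the symbols (x_i for the value of unit i, μ and σ for the population mean and standard deviation) are conventional choices assumed here, not necessarily the slide's own notation.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad
\sigma = \sqrt{\sigma^2}
```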

Sample statistics
• Sample mean
• Sample variance (the n − 1 divisor ensures unbiasedness)
• Sample standard deviation
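The slide's formulas are not preserved in the transcript. The standard definitions are sketched below under the assumption that the usual notation applies (x̄ for the sample mean, s for the sample standard deviation, n for the sample size). Dividing by n − 1 rather than n is what makes the sample variance an unbiased estimator of the population variance.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad
s = \sqrt{s^2}
```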

Simple random sampling
• Each element of the population has a known and equal probability of selection
• Every element is selected independently from other elements
• The probability of selecting a given sample of n elements is computable (known)
• The Central Limit Theorem guarantees that, for simple random samples with a sufficiently large sample size (n > 40), the sampling distribution of the sample mean is approximately normal
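A small simulation can make the Central Limit Theorem statement more tangible. The sketch below is an illustration under assumed settings (a hypothetical right-skewed population, n = 50, 2,000 repeated samples, a fixed seed); none of these values come from the slides.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical, clearly non-normal (right-skewed) population
population = [rng.expovariate(1.0) for _ in range(10_000)]
pop_mean = sum(population) / len(population)

n = 50              # sample size, above the slide's n > 40 rule of thumb
replications = 2_000

# Draw many simple random samples (without replacement) and store their means
sample_means = []
for _ in range(replications):
    sample = rng.sample(population, n)
    sample_means.append(sum(sample) / n)

grand_mean = sum(sample_means) / replications
# Empirical standard error: standard deviation of the sample means
variance = sum((m - grand_mean) ** 2 for m in sample_means) / (replications - 1)
empirical_se = variance ** 0.5

# Under normality about 95% of sample means fall within 1.96 standard errors
share_within = sum(
    abs(m - pop_mean) <= 1.96 * empirical_se for m in sample_means
) / replications

print("population mean:", round(pop_mean, 3))
print("mean of sample means:", round(grand_mean, 3))
print("share within 1.96 SE:", round(share_within, 3))
```

Even though the population is skewed, the share of sample means within 1.96 standard errors of the population mean should be close to 95%, which is what a normal sampling distribution implies.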

The normal distribution (again)
• Recall the curve of measurement error
• With simple random sampling the sample means follow the same probability distribution

Basic SRS sample statistics (unknown pop. variance)
• Two cases: mean case and proportion case (p)
• Sample standard deviation of X
• Standard error of the mean/proportion: the PRECISION of sample estimates
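The slide's formulas are not preserved, so the sketch below uses common textbook forms as an assumption: s/√n for the mean, √(p(1 − p)/n) for a proportion, and an optional finite-population correction √(1 − n/N) when the population size N is supplied. The function names are hypothetical.

```python
import math


def se_mean(s, n, N=None):
    """Estimated standard error of the sample mean under SRS.

    s: sample standard deviation, n: sample size,
    N: population size (optional, enables the finite-population correction).
    """
    se = s / math.sqrt(n)
    if N is not None:
        se *= math.sqrt(1 - n / N)  # finite-population correction
    return se


def se_proportion(p, n, N=None):
    """Estimated standard error of a sample proportion p (large-sample form)."""
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt(1 - n / N)
    return se


# Hypothetical figures: s = 10, n = 100, optionally N = 400
print(se_mean(10, 100))
print(se_mean(10, 100, N=400))
print(se_proportion(0.5, 100))
```

The correction matters only when the sampling fraction n/N is sizeable; for large populations the two versions are almost identical.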

Precision of estimators and sample size
• The standard error increases with higher population variances and decreases with larger sample sizes
• However, the relative gain in precision decreases as sample size increases
• Very large sample sizes are rarely cost-effective, because the gain in precision is very small and the increase in costs is very large
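The diminishing gain in precision follows from the standard error shrinking with the square root of n. A minimal sketch, assuming the simple form s/√n and a hypothetical standard deviation of 10:

```python
import math


def standard_error(s, n):
    """Standard error of the mean, ignoring any finite-population correction."""
    return s / math.sqrt(n)


s = 10.0  # hypothetical sample standard deviation
sizes = [100, 400, 1600, 6400]
errors = [standard_error(s, n) for n in sizes]

for n, se in zip(sizes, errors):
    print(f"n={n:5d}  standard error={se:.3f}")
```

Each fourfold increase in sample size only halves the standard error, while survey costs typically grow roughly in proportion to n.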

Accuracy and confidence level (1)
• Suppose the sampling distribution is normal (as for simple random sampling)
• The significance level α (further discussed in lecture 6) is the probability that the relative difference between the estimated sample mean and the true population mean is larger than a given relative accuracy level r:
  $$\Pr\left(\left|\frac{\bar{x}-\mu}{\mu}\right| > r\right) = \alpha$$
  where μ is the population mean and 1 − α is the level of confidence

Accuracy and confidence level (2)
• The confidence level is chosen by the researcher
• For example, suppose we want a 95% confidence level: what is the value of the relative accuracy r?
• In other words, if we extracted 100 different samples, we would expect to commit a relative error larger than r in only about 5 of them
• Relative accuracy is expressed in percentage terms with respect to the population mean

Example – relative accuracy in SRS
• Suppose we set α = 0.05 (confidence level of 95%)
• Relative accuracy with simple random sampling is computed from an equation (not preserved in this transcript)
• Accuracy depends on:
  – Sample size
  – Population size
  – Standard error of the mean
  – A constant value (t_{α/2}) which depends on the confidence level and the sample size
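Since the slide's equation is missing, the sketch below shows one standard way to compute relative accuracy from the ingredients listed above. It is an assumption, not the author's formula: the constant t_{α/2} is approximated by the normal quantile (reasonable for large samples), and the finite-population correction √(1 − n/N) is applied only when N is given. All input values are hypothetical.

```python
import math
from statistics import NormalDist


def relative_accuracy(sample_mean, s, n, N=None, alpha=0.05):
    """Half-width of the confidence interval relative to the sample mean."""
    # Normal approximation to the t constant (assumption: large sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = s / math.sqrt(n)
    if N is not None:
        se *= math.sqrt(1 - n / N)  # finite-population correction
    return z * se / sample_mean


# Hypothetical survey: mean 50, standard deviation 20, 400 respondents
r_large_pop = relative_accuracy(50, 20, 400)
r_small_pop = relative_accuracy(50, 20, 400, N=1600)
print(round(r_large_pop, 4), round(r_small_pop, 4))
```

With these inputs the relative error is a little under 4% for a very large population and somewhat smaller when the sample covers a quarter of the population.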

Relative accuracy, sample size and population size
• For larger population sizes it is not necessary to increase sample size
• A sample size of 500 keeps the error below 5% for any population size, provided the population variability is not too large
• Above a size of 500, it is better to consider spending money on reducing non-sampling errors

Determining sample size
Factors influencing sample size (n):
• Size of the population (N)
• Variability of the population (σ)
• Desired level of accuracy (standard error s_x̄ or relative accuracy r)
• Level of confidence (1 − α)
• Budget constraint

Simple random sampling – determining sample size
Determining sample size for a given relative error (the equation is not preserved in this transcript):
• σ needs to be estimated (or conservative assumptions can be made)
• r is the relative level of precision
• t, as before, is a constant which depends on α and on the sample size
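The missing equation can be approximated by inverting the relative accuracy relation. The sketch below is an assumption rather than the slide's formula: it uses the normal quantile in place of t, starts from n0 = (z·s / (r·mean))², and applies the usual finite-population adjustment n = n0 / (1 + n0/N) when N is given. The function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(s, mean, r, alpha=0.05, N=None):
    """Smallest n giving a relative error of at most r at confidence 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # normal approximation to t
    n0 = (z * s / (r * mean)) ** 2           # infinite-population sample size
    if N is not None:
        n0 = n0 / (1 + n0 / N)               # finite-population adjustment
    return math.ceil(n0)


# Hypothetical target variable with coefficient of variation s/mean = 0.5
print(required_sample_size(s=25, mean=50, r=0.05))            # very large population
print(required_sample_size(s=25, mean=50, r=0.05, N=10_000))
print(required_sample_size(s=25, mean=50, r=0.05, N=1_000))
```

In this illustration the required size never exceeds a few hundred units however large the population becomes, which is consistent with the point that larger populations do not call for proportionally larger samples.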

The sampling design process
1. Define the target population, its elements and the sampling units
2. Determine the sampling frame (list)
3. Select a sampling technique
4. Determine the sample size
5. Execute the sampling process

Selection bias
• Improper selection of sample units (ignoring a relevant control variable), so that the values observed in the sample are biased and the sample is not representative
• Some units have a higher probability of being selected, without this being acknowledged in the sampling process
• If the units with higher inclusion probabilities have specific characteristics that differ from the rest of the population, as is often the case, sample measurement will suffer from a significant bias
Example: a survey is conducted to measure goat milk consumption, but the interviewers only select people in urban areas, who on average drink less goat milk.

The sampling techniques
• Probability sampling
  – Simple random sampling
  – Systematic sampling
  – Stratified sampling
  – Cluster sampling
  – Complex sampling techniques
• Non-probability sampling
  – Convenience sampling
  – Judgmental sampling
  – Quota sampling
  – Snowball sampling

Simple random sampling
• Each element of the population has a known and equal probability of selection
• Every element is selected independently from other elements
• The probability of selecting a given sample of n elements is computable (known)
Advantages:
• Statistical inference is possible
• It is easily understood
Disadvantages:
• Representative samples are large and expensive
• Standard errors are larger than in other probabilistic sampling techniques
• Sometimes it is difficult to execute truly random sampling

Systematic sampling
• A list of the N elements in the population is compiled and ordered according to a specified variable, which may be:
  – Unrelated to the target variable (similar to SRS)
  – Related to the target variable (increased representativeness)
• A sample size n is chosen
• A systematic step of k = N/n is set
• A random number s between 1 and k is extracted and represents the first element to be included
• Then the other elements selected are s+k, s+2k, s+3k…
Advantages:
• Cheaper and easier than SRS
• More representative if the order is related to the variable of interest (monotone)
• Sampling frame not always necessary
Disadvantages:
• Less representative (biased) if the order is cyclical
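The selection steps can be sketched in a few lines. This is a minimal illustration, assuming an ordered list as the sampling frame and a random start within the first step interval; the function name and the example frame are hypothetical.

```python
import random


def systematic_sample(frame, n, rng=random):
    """Select n units from an ordered frame with a fixed step k = N // n."""
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("sample size must be between 1 and the frame size")
    k = N // n                 # systematic step (integer part of N/n)
    start = rng.randrange(k)   # zero-based random start within the first interval
    return [frame[start + i * k] for i in range(n)]


# Hypothetical frame of 100 customer identifiers, already ordered
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, rng=random.Random(1))
print(sample)
```

Only one random number is needed, which is why the method is cheaper than SRS; the price is the risk of bias when the ordering has a cycle whose length matches the step.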

Stratified sampling
• The population is partitioned into strata through control variables (stratification variables) closely related to the target variable, so that there is homogeneity within each stratum and heterogeneity between strata
• Simple random sampling is applied within each stratum of the population
  – Proportionate sampling: the size of the sample from each stratum is proportional to the relative size of the stratum in the total population
  – Disproportionate sampling: the size is also proportional to the standard deviation of the target variable in each stratum
Advantages:
• Gains in precision
• Includes all relevant sub-populations, even if small
Disadvantages:
• Stratification variables may not be easily identifiable
• Stratification can be expensive
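The two allocation rules can be compared numerically. The sketch below assumes that disproportionate sampling allocates in proportion to stratum size times stratum standard deviation (often called Neyman allocation); stratum names, sizes and standard deviations are hypothetical, and the results are left unrounded.

```python
def proportionate_allocation(stratum_sizes, n):
    """Allocate n in proportion to each stratum's share of the population."""
    N = sum(stratum_sizes.values())
    return {h: n * N_h / N for h, N_h in stratum_sizes.items()}


def disproportionate_allocation(stratum_sizes, stratum_sds, n):
    """Allocate n in proportion to stratum size times stratum standard deviation."""
    weights = {h: stratum_sizes[h] * stratum_sds[h] for h in stratum_sizes}
    total = sum(weights.values())
    return {h: n * w / total for h, w in weights.items()}


# Hypothetical population split into two strata
sizes = {"urban": 6000, "rural": 4000}
sds = {"urban": 10.0, "rural": 30.0}  # assumed variability of the target variable

prop = proportionate_allocation(sizes, 500)
disp = disproportionate_allocation(sizes, sds, 500)
print(prop)
print(disp)
```

The more variable rural stratum receives a larger share under the disproportionate rule, because extra observations reduce the standard error most where the data are most dispersed.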

Post-stratification
• Typical obstacle to stratified sampling: unavailability of a sampling frame for each of the strata
• It may be useful to proceed through simple random sampling and exploit the stratified estimator once the sample has been extracted, which increases efficiency
• All that is required is knowledge of the stratum sizes in the population, and that the post-stratum sample sizes are sufficiently large
• The advantage of post-stratification is two-fold:

Applying post-stratification (PS)
• It allows one to correct the potential bias due to insufficient coverage of the survey (incomplete sampling frame)
• PS allows one to correct the bias from missing responses, provided that the post-stratification variable is related both to the target variable and to the cause of non-response
• It is carried out by extracting a Simple Random Sample (SRS) of size n and then classifying units into strata. Instead of the usual SRS mean, a PS estimator is computed by weighting the means of the sub-groups by the size of each sub-group
• The procedure is identical to that of stratified sampling; the only difference is that the allocation into strata is made ex post
• The standard error of the PS mean estimator is larger than the stratified sampling one, because additional variability arises from the fact that the sample stratum sizes are themselves the outcome of a random process
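The PS mean estimator described above can be sketched as a weighted average of the sub-group means. This is a minimal illustration assuming that the population share of each stratum is known; the function name, the strata and the data are hypothetical.

```python
def post_stratified_mean(observations, population_shares):
    """Weight each stratum's sample mean by its known population share.

    observations: iterable of (stratum, value) pairs from an SRS
    population_shares: dict mapping stratum -> N_h / N (shares sum to 1)
    Raises KeyError if a stratum has no observations in the sample.
    """
    sums, counts = {}, {}
    for stratum, value in observations:
        sums[stratum] = sums.get(stratum, 0.0) + value
        counts[stratum] = counts.get(stratum, 0) + 1
    return sum(
        share * sums[h] / counts[h] for h, share in population_shares.items()
    )


# Hypothetical SRS in which older respondents are over-represented
observations = [
    ("young", 10), ("young", 14),
    ("old", 20), ("old", 22), ("old", 24), ("old", 26),
]
shares = {"young": 0.5, "old": 0.5}  # assumed known population shares

srs_mean = sum(v for _, v in observations) / len(observations)
ps_mean = post_stratified_mean(observations, shares)
print(round(srs_mean, 3), ps_mean)
```

Here the plain SRS mean is pulled towards the over-represented group, while the PS estimator restores the population weights after the sample has been drawn.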

Cluster sampling
1. The population is partitioned into clusters
   • Elements within the cluster should be as heterogeneous as possible with respect to the variable of interest (e.g. area sampling)
   • A random sample of clusters is extracted through SRS (or with probability proportional to the cluster size)
2a. All the elements of the cluster are selected (one-stage cluster sampling)
2b. A probabilistic sample is extracted from the cluster (two-stage cluster sampling)
Advantages:
• Reduced costs
• Higher feasibility
Disadvantages:
• Less precision
• Inference can be difficult
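The one-stage and two-stage variants differ only in what happens inside the selected clusters. The sketch below assumes clusters are drawn by SRS (not with probability proportional to size) and that the second stage is also an SRS; the function name and the block data are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, per_cluster=None, rng=random):
    """Draw clusters by SRS, then take all units or an SRS within each cluster.

    clusters: dict mapping cluster name -> list of units
    per_cluster: None for one-stage sampling, or the second-stage sample size
    """
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            units.extend(members)  # one-stage: every unit in the cluster
        else:
            size = min(per_cluster, len(members))
            units.extend(rng.sample(members, size))  # two-stage: SRS within cluster
    return chosen, units


# Hypothetical city blocks and the households they contain
blocks = {"block1": [1, 2, 3], "block2": [4, 5], "block3": [6, 7, 8, 9]}

one_stage_blocks, one_stage_units = cluster_sample(blocks, 2, rng=random.Random(0))
two_stage_blocks, two_stage_units = cluster_sample(
    blocks, 2, per_cluster=1, rng=random.Random(0)
)
print(one_stage_blocks, one_stage_units)
print(two_stage_blocks, two_stage_units)
```

Only the selected clusters need to be listed in full, which is where the cost savings come from.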

Complex sampling designs
• Combination of different sampling methods to increase efficiency or reduce costs
• Two-stage sampling: two different sampling units, where the second-stage sampling units are a sub-set of the first-stage ones
• Typically, in household surveys a sample of cities or municipalities is extracted in the first stage, while in the second stage the actual sample of households is extracted out of the first-stage units
• Any probability design can be applied within each stage
• For example, municipalities can be stratified according to their populations in the first stage to ensure that the sample will include small and rural towns as well as large cities, while in the second stage one could apply area sampling, a particular type of cluster sampling where:
  1) each sampled municipality is subdivided into blocks on a map through geographical coordinates;
  2) blocks are extracted through simple random sampling;
  3) all households in a block are interviewed

Probability sampling with SPSS (a) [screenshot not preserved]

Probability sampling with SPSS (b) [screenshot not preserved]

Sampling with SAS
• The SAS/STAT component provides procedures for the extraction of samples and statistical inference
• PROC SURVEYSELECT allows one to extract probability-based samples
• PROC SURVEYMEANS computes sample statistics taking into account the sample design
• PROC SURVEYREG estimates sample-based regression relationships

Non-probability sampling
• Non-probability sampling does not allow one to accompany sample estimates with evaluations of their precision and accuracy
• Still, non-probability sampling is a common practice in marketing research, especially quota sampling
• It is not necessarily biasing or uninformative
• In some circumstances, for example when there is no sampling frame, it may be the only viable solution
• Key limit: in general, techniques for statistical inference cannot be used to generalize sample results to the population

Convenience sampling
• Only convenient elements enter the sample
Advantages:
• Cheapest method
• Quickest method
Disadvantages:
• Selection bias
• Non-representativeness
• Inference is not possible

Judgmental sampling
• Selection based on the judgment of the researcher
Advantages:
• Low cost
• Quick
Disadvantages:
• Non-representativeness
• Inference is not possible
• Subjective (potential selection bias)

Quota sampling
1. Define control categories (quotas) for the population elements, such as sex, age…
2. Apply a restricted judgmental sampling so that the quotas in the sample are the same as those in the population
Advantages:
• Cheapest method
• Quickest method
Disadvantages:
• There is no guarantee that the sample is representative (it depends on the relevance of the control characteristics chosen)
• Many sources of selection bias
• No assessment of sampling error

Snowball sampling
• A first small sample is selected randomly
• Respondents are asked to identify others who belong to the population of interest
• The referrals will have demographic and psychographic characteristics similar to the referrers
Advantages:
• Lower costs
• Low variability
• Useful for rare populations
Disadvantages:
• Inference is not possible