Sample Size calculations for multilevel models Part II

Summary • • Introduction to sample size calculations. Simulation-based approaches. The MLPOWSIM software. Two

Background • Many quantitative social science research questions are of the form of a

Example • Suppose I have the research question ‘Are Welshmen on average taller than

Hypothesis Testing • Let us assume our null hypothesis is that the average height

Type I and Type II errors • No hypothesis test is perfect and there

Example hypothesis test • Let us assume that in reality our sample mean is

Power calculations • Our last slide in some sense is backwards as we cannot

Welsh height example Here we have looked at two examples with effect sizes 5

Power curve for Welshmen example Here we see the two power curves for the

Extending the idea • The simple formula can be used in many situations and

The use of simulation • In reality our (hoped for) research path will be

Simulation in our example • Consider our Welsh height example case 2 where we

Power curve comparison Note simulation curve is a good approximation of theoretical curve although

Advantages/Disadvantages • Theoretical approach is quick when the formula can be derived. • Approximations

Methods to calculate Power calculations from simulations • The method illustrated previously we will

MLPOWSIM software package • Software package currently being written and tested (90% complete). •

MLPOWSIM state of play • The software still requires code to generate macros for

What happens with multilevel data? We will here mainly consider 2 -level models and

Education Example • For motivation we considered a two level dataset with exam marks

Design effect formula • • If we assume balance then with n pupils in

Comparison of formula/simulations • The following graph compares the design effect formula to the

Other sample size issues • There are other reasons why we may be interested

Sampling policy The design effect formula: Design effect = 1 + (n-1)ρ suggests that

More complex examples – random intercepts and random slopes • We will now look

Example problem • We will continue with our educational example but also consider the

What happens when we include random slopes? • The following table gives power values

Binary response models • The simplest model for binary data is to compare an

Model based approach • We will use a logistic regression model • and now

Power graphs Here we see the results of the 0/1 approach in blue and

Multilevel binary response models • We can extend the earlier modelling of the Bangladeshi

Differences between methods – MQL 1 Here we see that the SE method (red)

Differences between methods – PQL 2 Here we see less difference between the SE

Cross-classified models • We will here consider a dataset of education results from Fife

Cross classified methods • Currently in MLPOWSIM we have only implemented cross classified models

Comparison of Power estimates # XC 1 # XC 2 MCMC R 20 10

Cross-classified models • From the table we see reasonable agreement between MLwi. N and

Conclusions • In this talk we have shown the flexibility of using simulation to

Further work • Find (faster) approximations to simulation results – potentially create look up

References • Bosker, R. J. , Snijders, T. A. B. and Guldemond, H. (1996)

Slides: 41

Download presentation

Sample Size calculations for multilevel models (Part II) William Browne and Mousa Golalizadeh Department of Clinical Veterinary Sciences University of Bristol

Summary • • Introduction to sample size calculations. Simulation-based approaches. The MLPOWSIM software. Two level normal examples. Binary responses. Cross classified models. Future plans.

Background • Many quantitative social science research questions are of the form of a hypothesis – A has a significant effect on B. • To answer such a question data is collected that allows the researcher to (hopefully) test whether statistically A has a significant effect on B. (In fact we aim to reject the hypothesis that A doesn’t significantly affect B). • A test is performed and either the researcher is happy and A indeed has a significant effect on B or is left wondering why the data collected do not back up their hypothesis. Is the hypothesis false or was the data not sufficient? • The sufficiency of the data is the motivation for sample size calculations.

Example • Suppose I have the research question ‘Are Welshmen on average taller than 175 cms? ’ • I now need to get hold of a random sample of n Welshmen and measure each of their heights. • I make some statistical assumption about the distribution of the heights of Welshmen e. g. that they come from a Normal distribution. • I might like to check this assumption by plotting a histogram of the data. • I can then form a statistical hypothesis test and test whether indeed Welshmen are taller than 175 cms. • I need to decide how big to make n, my sample of Welshmen.

Hypothesis Testing • Let us assume our null hypothesis is that the average height of Welshmen (μ) is 175 cm. • So we test H 0: μ=175 vs HA: μ>175 (or alternatively H 0: θ=0 vs HA: θ>0 where θ=μ 175) • In practice we calculate from our sample its mean ( ) and standard deviation (s 2) and use these along with n to form a test statistic which we can compare with the distribution assumed under H 0

Type I and Type II errors • No hypothesis test is perfect and there is always the possibility of errors Truth Reject H 0 True Type I error Decision Accept H Correct 0 H 0 False Correct Type II error • P(Type I error) = α = significance level or size • P(Type II error) = β, 1 -β is the power of the test. • In general we fix α to some value e. g. 0. 05, 0. 01 then 1 -β depends on our sample size.

Example hypothesis test • Let us assume that in reality our sample mean is 180 cms and the population standard deviation (sd) is 5 cms (known). • We can then form a test statistic as follows: • Note here that for small n and unknown sd we should use a student-t distribution rather than Normal. • For a 1 -sided Z test we wish Z= > 1. 645 and so we need our sample to be of size 3 to reject H 0, using a student-t distribution increases this to 5. (Here α=0. 05) • However if the sample mean had been only 176 cms then we would need n > (1. 645*5)2 = 68 Welshmen to reject H 0

Power calculations • Our last slide in some sense is backwards as we cannot get from a given sample mean to choosing a sample size! • What we do instead is use different terminology and play God! • We will choose an ‘effect size’, γ which will represent a guess at the increase in the sample mean for Welshmen. • There then exists an (approximate) formula that links four quantities, size (α), power (1 -β), effect size (γ) and sample size (n) Here RHS is sum of cases H 0 true and H 0 false. • Note that the standard error (SE) of γ is a function of n and σ the population sd which is assumed known. • We can now evaluate one of these quantities conditional on the others e. g. what sample size is required given α, 1 -β and γ?

Welsh height example Here we have looked at two examples with effect sizes 5 and 1 respectively. Assume σ takes the value 5 and so let us suppose we take a sample of size 25 Welshmen. Then Case 1: 5/(5/√ 25)=1. 645+z 1 -β, z 1 -β=3. 355 β=0. 9996 Case 2: 1/(5/ √ 25)=1. 645+z 1 -β, z 1 -β=-0. 645 β=0. 25946 So here a sample of 25 Welshmen from a population with mean 180 cms would almost always result in rejecting H 0, but if the population mean is 176 cms then only 26% of such samples would be rejected. We can plot curves of how power increases with sample size as shown in the next slide.

Power curve for Welshmen example Here we see the two power curves for the two scenarios:

Extending the idea • The simple formula can be used in many situations and hypothesis tests. • To generalise the idea we assume that γ is an effect size associated with a statistic that we wish to compare with a (null) hypothesized value of 0. • The complication occurs in finding a formula for the standard error for the statistic and relating this formula to the sample size, n. • We will next consider an alternative approach before returning to look at how the above approach extends to multilevel models.

The use of simulation • In reality our (hoped for) research path will be as follows: Construct research question -> Form null hypothesis that we believe false -> Collect appropriate data -> Reject hypothesis therefore proving our research question. • Assuming what we believe in our research question is correct and hence null hypothesis is false we can still be let down by not collecting enough data. • The idea behind using simulation is to simulate the data gathering process (assuming we know the right answer) many times and see how often we can reject the null hypothesis. The percentage of rejected null hypotheses (via simulation) will then estimate power.

Simulation in our example • Consider our Welsh height example case 2 where we believe Welshmen have a mean height of 176 cms (and sd = 5 cms) and we are testing the hypothesis H 0: μ=175 cms, and we consider a sample size 25. • Then we generate N samples (e. g. 5000) of size 25, • and for each sample form a lower bound for the confidence interval of the form • . This we compare with the value 175 and the proportion greater than 175 is an estimate of the power of the test. • We can repeat this exercise for different sample sizes and form a power curve.

Power curve comparison Note simulation curve is a good approximation of theoretical curve although there are some minor (Monte Carlo) errors even with 5000 simulations per sample size.

Advantages/Disadvantages • Theoretical approach is quick when the formula can be derived. • Approximations for more complex situations exist which are equally quick. • Simulation approach generalizes to more situations but is much slower and we may need large numbers of simulations per scenario to get accurate power estimates. • Note that alternative method described next typically needs less simulations per scenario for the same accuracy.

Methods to calculate Power calculations from simulations • The method illustrated previously we will describe as the 0/1 method where each simulation results in a sample that would either reject or fail to reject the null hypothesis. • An alternative method (the standard error method) is to instead take from each simulation the estimated standard error of the parameter to be tested. Then the average of these standard errors can be used with the effect size to construct the power. • A 2 nd alternative (when MCMC is used) is as in the 0/1 approach to establish whether the simulation sample rejects the null hypothesis but rather than construct the interval assuming normality to use the quantiles of the MCMC chain.

MLPOWSIM software package • Software package currently being written and tested (90% complete). • A ‘rather old fashioned’ text-based interface allows user to specify sample size scenarios. • Software then generates either MLwi. N macro code or an R command file to run the simulations to calculate power for scenarios. • Normal, Binomial and Poisson response offered. • Software will cope with 1 -level, 2 -level (balanced and unbalanced), 3 -level nested (balanced and unbalanced) and cross-classified (balanced and unbalanced) with 2 higher classifications models. • Many options for unbalanced designs. • Extensive user manual in preparation (~100 pages with examples)

MLPOWSIM state of play • The software still requires code to generate macros for unbalanced cross-classified models in MLwi. N to be written. • Documentation for normal responses is almost complete with short chapters on other response types to write. • Currently MLwi. N code (using IGLS) runs considerably faster than R code (using lmer) for most models though we have had some success speeding up R code. • Exception is cross-classified models where we have only programmed an MCMC-based macro in MLwi. N that is typically slower than the lmer code in R. See more details later.

What happens with multilevel data? We will here mainly consider 2 -level models and take as our application area education, so we have students nested within schools. When deciding on a sampling scheme we have many choices: • How many schools, N ? • How many pupils per school, nj ? • Should we collect the same size sample from each school ? Our decision will depend on which parameter we wish to estimate in the model.

Education Example • For motivation we considered a two level dataset with exam marks measured for each student in a collection of schools. In fact this dataset exists and has 4915 students in 96 schools. • Our hypothesis of interest is that the exam mark for an average student is > 20 (null hypothesis = 20) which with such a large sample results in the null hypothesis being rejected for our particular data. • If we fit the following multilevel model to the data we get the estimates given: • If we treat these estimates as population values, we are interested in what power for testing our hypothesis results from various combinations of N and nj

Design effect formula • • If we assume balance then with n pupils in each of N schools for our simple model (and only this simple model) the following formula holds: Design effect = 1 + (n-1)ρ where ρ is the intra-class correlation. So if we know the simple random sample size required for a given power we need to multiply this by the design effect. For example our data has ρ=16. 205/(16. 205+139. 367)=0. 104 So for schools of size 10 pupils we would need 1+9*0. 104=1. 94 times as many students (in total) to get the same power. For this model (and this model only) we could therefore perform our power calculations assuming simple random sampling from a population with variance 155. 572 and scale up the sample required based on the design. So And for schools of size 10 we require 1. 94*338. 4=657 pupils which we can round up to 66 schools.

Comparison of formula/simulations • The following graph compares the design effect formula to the simulation approach:

Other sample size issues • There are other reasons why we may be interested in sample size questions in multilevel modelling. • It is often problematic to fit multilevel models when the number of higher level units is small as demonstrated in the last graph • Also some methods can be biased for small sample sizes. • Note although method comparison is done using a similar approach of generating simulated datasets, here power calculations are not the main aim; that said when performing power calculations parameter bias of methods should be noted as this will result in bias of predicted power (see later). • Browne (1998), Browne and Draper(2006) compare MCMC, RIGLS and IGLS for small sample sizes and continuous responses, and MCMC, MQL and PQL for binary response models. • Maas and Hox (2004) also look at small sample sizes and how robust estimation is to the Normal distributional assumption of the level 2 residuals.

Sampling policy The design effect formula: Design effect = 1 + (n-1)ρ suggests that if we are to sample a fixed (balanced) number of pupils n*N then our best power results when n is smallest i. e. sampling one pupil each from 100 schools is better than sampling 100 pupils from the same school. The effect of sampling policy is most important in scenarios where ρ is large e. g. repeated measures designs. The simulation procedure gives approximately the same power curve and so in this simple example we have an easy to use formula. The reason in practice for sampling several pupils from each school is purely the additional cost incurred in visiting additional schools.

More complex examples – random intercepts and random slopes • We will now look at more complex random effect models with predictor variables. • We will consider the random intercept model • and the random slopes model • We will consider how to extend (approximately) theoretical approach and also the simulation approach.

Example problem • We will continue with our educational example but also consider the effect of gender (β 1). For a random intercepts model let us assume the true parameter values are β 0=20. 9, β 1= 1. 6, σ2 u=15, σ2 e=135. • We have two hypotheses to test: • Hypothesis 1: boys get on average more than 20 marks (H 0: β 0=20 vs HA: β 0>20) • Hypothesis 2: girls do better on average than boys (H 0: β 1=0 vs HA: β 1>0) • We will also consider the effect of random slopes on the dataset so will have a second model with additionally σ2 u 0=10, σu 01=2, σ2 u 1=5

What happens when we include random slopes? • The following table gives power values for β 1 for the random intercept model. • Note that pairs of values with the same total N*n have similar powers. Schools (N) 20 30 40 50 60 20 0. 39 0. 49 0. 60 0. 75 Pupils 30 (n) 40 0. 51 0. 67 0. 76 0. 84 0. 89 0. 61 0. 77 0. 86 0. 92 0. 95 50 0. 69 0. 84 0. 92 0. 96 0. 98 60 0. 77 0. 90 0. 95 0. 98 0. 99

What happens when we include random slopes? • The following table gives power values for β 1 for the random slopes model. • Note that pairs of values with the same total N*n now do not have similar powers and larger N is better. Schools (N) 20 30 40 50 60 20 0. 33 0. 45 0. 54 0. 63 0. 69 Pupils 30 (n) 40 0. 42 0. 55 0. 67 0. 76 0. 82 0. 49 0. 64 0. 76 0. 83 0. 89 50 0. 57 0. 69 0. 80 0. 89 0. 93 60 0. 76 0. 85 0. 91 0. 95

Binary response models • The simplest model for binary data is to compare an observed proportion with some fixed proportion, e. g. to detect whether more than 9 out of 10 cats prefer…! • We will be using as our example the use of contraceptives by women in Bangladesh taken from the MLwi. N user guide. Here the full dataset (nearly 2000 women) has roughly 40% contraceptive use and we are interested in what sample size we need to say that less than 50% use contraceptives. • The standard approach uses a normal approximation to the binomial for the underlying probability to give a standard formula that can be easily calculated and gives a sample size of 153 for a power of 0. 80.

Model based approach • We will use a logistic regression model • and now the probability 0. 5 will correspond to the value 0 for β 0. • The MLPOWSIM software will perform approximate Wald tests i. e. assume that β 0 is normally distributed. • As can be seen by the curves in the following slide the power is fairly similar (sample size of ~ 160 for power of 0. 8) but not identical to that obtained by the standard formulae as a different quantity is being assumed normally distributed.

Power graphs Here we see the results of the 0/1 approach in blue and the much smoother red line corresponds to the standard error approach for calculating power.

Multilevel binary response models • We can extend the earlier modelling of the Bangladeshi dataset to account for the structure of the data – women nested within districts. • We will again test whether usage of contraceptive is less than 50% but now we account for the clustering in the model with a random effects logistic regression model. • The estimate of the district level variability is 0. 246 • We also now have options of different estimation methods and will consider MQL 1 and PQL 2.

Differences between methods – MQL 1 Here we see that the SE method (red) gives higher power than the 0/1 method (blue) but this is due to the bias in the estimates and hence the SEs. Here the 0/1 method is preferable.

Differences between methods – PQL 2 Here we see less difference between the SE method (red) and the 0/1 method (blue) as PQL is far less biased than MQL.

Cross-classified models • We will here consider a dataset of education results from Fife in Scotland. Here for each student their attainment at 16 can be affected by both the secondary school and the primary school they attended. We will look at perhaps the simplest model that accounts for the structure. Here our response is a mark out of 10 from which we subtract 5 which we treat as a pass mark. This response is then assumed to be affected by both the primary and secondary school attended Approximate estimates from the actual data that we will use for our power calculations are β 0 = 0. 5, variances for primary, secondary and residual are 1. 2, 0. 4 and 5 respectively. We are interested in sample sizes to detect an average greater than 5.

Cross classified methods • Currently in MLPOWSIM we have only implemented cross classified models using MCMC. • The IGLS method uses a clever trick of converting cross-classified models to constrained nested models but this is harder to do within simulation macros. • The R function lmer allows cross-classified models which although slower than nested models can be computed with no problems. • We looked at balanced structures with 5 pupils per combination of secondary and primary school, 20 -50 primary schools in steps of 10 and 10 -30 secondary schools in steps of 10. • The MCMC method was run for a burnin of 500 and main run of 5001 for each simulation. We used the default inverse-gamma (ε, ε) priors for the variance parameters. • The power values for various combinations can be seen on the next page based on the standard error method.

Comparison of Power estimates # XC 1 # XC 2 MCMC R 20 10 0. 39 0. 47 20 20 0. 47 0. 56 20 30 0. 67 0. 59 30 10 0. 54 0. 55 30 20 0. 67 0. 66 30 30 0. 72 0. 70 40 10 0. 61 40 20 0. 73 0. 72 40 30 0. 75 0. 78 50 10 0. 62 0. 64 50 20 0. 79 0. 77 50 30 0. 83

Cross-classified models • From the table we see reasonable agreement between MLwi. N and R. • When the number of higher level units is 10 we have bigger differences as the effect of prior and running for only 5000 will affect the MCMC estimates. • This was more marked when we initially also considered only 10 primary schools. • It is reassuring that in the range that counts i. e. where power is ~0. 8 the methods agree well. • R took of the order of 1 hour 20 minutes whilst the MCMC code ran overnight.

Conclusions • In this talk we have shown the flexibility of using simulation to perform power calculations for multilevel models. • Although computationally the approach can be slow the flexibility of the approach means it can be used for virtually all models given enough assumptions. • Low powered studies often involve small amounts of data thus making the power calculations quicker. • Other software – PINT (Bosker, Snijders and Guldemond, 1996) is quick for specific models – balanced 2 level models and we compare answers in our documentation.

Further work • Find (faster) approximations to simulation results – potentially create look up tables, power curves etc. • Investigate efficient MCMC methods. • Consider the IGLS algorithm for crossclassified models • Expand the model classes considered. • Link MLPOWSIM directly within MLwi. N.

References • Bosker, R. J. , Snijders, T. A. B. and Guldemond, H. (1996) PINT (Power IN Two-level designs) User Manual. • Browne, W. J. (1998) Applying MCMC methods to Multilevel models. Unpublished Ph. D. thesis. University of Bath. • Browne, W. J. and Draper, D. (2006) A comparison of Bayesian and likelihood-based methods for fitting multilevel models (with discussion). Bayesian Analysis 1, 473 -550. • Mass, C. J. M. and Hox, J. J. (2004) Robustness issues in multilevel regression analysis. Statistica Neerlandica 58, 127 -137.