Topic 20 Single Factor Analysis of Variance Outline

  • Slides: 55
Download presentation
Topic 20: Single Factor Analysis of Variance

Topic 20: Single Factor Analysis of Variance

Outline • Analysis of Variance – One set of treatments (i. e. , single

Outline • Analysis of Variance – One set of treatments (i. e. , single factor) • Cell means model • Factor effects model – Link to linear regression using indicator explanatory variables

One-Way ANOVA • The response variable Y is continuous • The explanatory variable is

One-Way ANOVA • The response variable Y is continuous • The explanatory variable is categorical – We call it a factor – The possible values are called levels • This approach is a generalization of the independent two-sample pooled t-test • In other words, it can be used when there are more than two treatments

Data for One-Way ANOVA • Y is the response variable • X is the

Data for One-Way ANOVA • Y is the response variable • X is the factor (it is qualitative/discrete) – r is the number of levels – often refer to these levels as groups or treatments • Yi, j is the jth observation in the ith group

Notation • For Yi, j we use – i to denote the level of

Notation • For Yi, j we use – i to denote the level of the factor – j to denote the jth observation at factor level i • i = 1, . . . , r levels of factor X • j = 1, . . . , ni observations for level i of factor X – ni does not need to be the same in each group

KNNL Example (p 685) • Y is the number of cases of cereal sold

KNNL Example (p 685) • Y is the number of cases of cereal sold • X is the design of the cereal package – there are 4 levels for X because there are 4 different package designs • i =1 to 4 levels • j =1 to ni stores with design i (ni=5, 5, 4, 5) • Will use n if ni the same across groups

Data for one-way ANOVA data a 1; infile 'c: . . /data/ch 16 ta

Data for one-way ANOVA data a 1; infile 'c: . . /data/ch 16 ta 01. txt'; input cases design store; proc print data=a 1; run;

The data Obs 1 2 3 4 5 6 7 8 cases 11 17

The data Obs 1 2 3 4 5 6 7 8 cases 11 17 16 14 15 12 10 15 design 1 1 1 2 2 2 store 1 2 3 4 5 1 2 3

Plot the data symbol 1 v=circle i=none; proc gplot data=a 1; plot cases*design; run;

Plot the data symbol 1 v=circle i=none; proc gplot data=a 1; plot cases*design; run;

The plot

The plot

Plot the means proc means data=a 1; var cases; by design; output out=a 2

Plot the means proc means data=a 1; var cases; by design; output out=a 2 mean=avcases; proc print data=a 2; symbol 1 v=circle i=join; proc gplot data=a 2; plot avcases*design; run;

New Data Set Obs design _TYPE_ _FREQ_ avcases 1 1 0 5 14. 6

New Data Set Obs design _TYPE_ _FREQ_ avcases 1 1 0 5 14. 6 2 2 0 5 13. 4 3 3 0 4 19. 5 4 4 0 5 27. 2

Plot of the means

Plot of the means

The Model • We assume that the response variable is – Normally distributed with

The Model • We assume that the response variable is – Normally distributed with a 1. mean that may depend on the level of the factor 2. constant variance • All observations assumed independent • NOTE: Same assumptions as linear regression except there is no assumed linear relationship between X and E(Y|X)

Cell Means Model • A “cell” refers to a level of the factor •

Cell Means Model • A “cell” refers to a level of the factor • Yij = μi + εij – where μi is theoretical mean or expected value of all observations at level (or cell) i – the εij are iid N(0, σ2) which means – Yij ~N(μi, σ2) and independent – This is called the cell means model

Parameters • The parameters of the model are – μ 1, μ 2, …

Parameters • The parameters of the model are – μ 1, μ 2, … , μ r – σ2 • Question (Version 1) – Does our explanatory variable help explain Y? • Question (Version 2) – Do the μi vary? H 0: μ 1= μ 2= … = μr = μ (a constant) Ha: not all μ’s are the same

Estimates • Estimate μi by the mean of the observations at level i, (sample

Estimates • Estimate μi by the mean of the observations at level i, (sample mean) • ûi = = ΣYi, j/ni • For each level i, also get an estimate of the variance • = Σ(Yij- )2/(ni-1) (sample variance) • We combine these to get an overall estimate of σ2 • Same approach as pooled t-test

Pooled estimate of 2 σ • If the ni were all the same we

Pooled estimate of 2 σ • If the ni were all the same we would average the – Do not average the si • In general we pool the , giving weights proportional to the df, ni -1 • The pooled estimate is

Running proc glm Difference 1: Need to specify factor variables proc glm data=a 1;

Running proc glm Difference 1: Need to specify factor variables proc glm data=a 1; class design; model cases=design; Difference 2: Ask means design; for mean estimates lsmeans design run;

Output Class Level Information Class Levels Values design 41234 Number of Observations Read Number

Output Class Level Information Class Levels Values design 41234 Number of Observations Read Number of Observations Used Important summaries to check these summaries!!! 19 19

SAS 9. 3 default output for MEANS statement

SAS 9. 3 default output for MEANS statement

MEANS statement output Level of design 1 2 3 4 N 5 5 4

MEANS statement output Level of design 1 2 3 4 N 5 5 4 5 cases Mean 14. 6000000 13. 4000000 19. 5000000 27. 2000000 Std Dev 2. 30217289 3. 64691651 2. 64575131 3. 96232255 Table of sample means and sample variances

SAS 9. 3 default output for LSMEANS statement

SAS 9. 3 default output for LSMEANS statement

LSMEANS statement output design 1 2 3 4 cases LSMEAN 14. 6000000 13. 4000000

LSMEANS statement output design 1 2 3 4 cases LSMEAN 14. 6000000 13. 4000000 19. 5000000 27. 2000000 Standard Error Pr > |t| 1. 4523544 <. 0001 1. 6237816 <. 0001 1. 4523544 <. 0001 Provides estimates based on model (i. e. , constant variance)

Notation

Notation

ANOVA Table Source df SS MS Model r-1 Σij( - )2 SSR/df. R Error

ANOVA Table Source df SS MS Model r-1 Σij( - )2 SSR/df. R Error n. T-r Σij(Yij - )2 SSE/df. E Total n. T-1 Σij(Yij - )2 SST/df. T

ANOVA SAS Output Source Model Sum of Mean DF Squares Square F Value Pr

ANOVA SAS Output Source Model Sum of Mean DF Squares Square F Value Pr > F 3 588. 2210526 196. 0736842 18. 59 <. 0001 Error 15 158. 2000000 Corrected Total 18 746. 4210526 10. 5466667 R-Square Coeff Var Root MSE cases Mean 0. 788055 17. 43042 3. 247563 18. 63158

Expected Mean Squares • E(MSR) > E(MSE) when the group means are different •

Expected Mean Squares • E(MSR) > E(MSE) when the group means are different • See KNNL p 694 – 698 for more details • In more complicated models, these tell us how to construct the F test

F test • • • F = MSR/MSE H 0: μ 1 = μ

F test • • • F = MSR/MSE H 0: μ 1 = μ 2 = … = μ r Ha: not all of the μi are equal Under H 0, F ~ F(r-1, n. T-r) Reject H 0 when F is large Report the P-value

Maximum Likelihood Approach proc glimmix data=a 1; class design; model cases=design / dist=normal; lsmeans

Maximum Likelihood Approach proc glimmix data=a 1; class design; model cases=design / dist=normal; lsmeans design; run;

GLIMMIX Output Data Set Model Information WORK. A 1 Response Variable cases Response Distribution

GLIMMIX Output Data Set Model Information WORK. A 1 Response Variable cases Response Distribution Gaussian Link Function Identity Variance Function Default Variance Matrix Diagonal Estimation Technique Restricted Maximum Likelihood Residual Degrees of Freedom Method

GLIMMIX Output Fit Statistics -2 Res Log Likelihood AIC (smaller is better) AICC (smaller

GLIMMIX Output Fit Statistics -2 Res Log Likelihood AIC (smaller is better) AICC (smaller is better) BIC (smaller is better) CAIC (smaller is better) HQIC (smaller is better) Pearson Chi-Square / DF 84. 12 94. 12 100. 79 97. 66 102. 66 94. 08 158. 20 10. 55

GLIMMIX Output Type III Tests of Fixed Effects Num Den Effect DF DF F

GLIMMIX Output Type III Tests of Fixed Effects Num Den Effect DF DF F Value Pr > F design 3 15 18. 59 <. 0001 design 1 2 3 4 design Least Squares Means Standard Estimate Error DF t Value 14. 6000 1. 4524 15 10. 05 13. 4000 1. 4524 15 9. 23 19. 5000 1. 6238 15 12. 01 27. 2000 1. 4524 15 18. 73 Pr > |t| <. 0001

Factor Effects Model • A reparameterization of the cell means model • Useful way

Factor Effects Model • A reparameterization of the cell means model • Useful way at looking at more complicated models • Null hypotheses are easier to state • Yij = μ + i + εij – the εij are iid N(0, σ2)

Parameters • The parameters of the model are – μ, 1, 2, … ,

Parameters • The parameters of the model are – μ, 1, 2, … , r – σ2 • The cell means model had r + 1 parameters – r μ’s and σ2 • The factor effects model has r + 2 parameters – μ, the r ’s, and σ2 – Cannot uniquely estimate all parameters

An example • Suppose r=3; μ 1 = 10, μ 2 = 20, μ

An example • Suppose r=3; μ 1 = 10, μ 2 = 20, μ 3 = 30 • What is an equivalent set of parameters for the factor effects model? • We need to have μ + i = μi • μ = 0, 1 = 10, 2 = 20, 3 = 30 • μ = 20, 1 = -10, 2 = 0, 3 = 10 • μ = 5000, 1 = -4990, 2 = -4980, 3 = -4970

Problem with factor effects? • These parameters are not estimable or not well defined

Problem with factor effects? • These parameters are not estimable or not well defined (i. e. , unique) • There are many solutions to the least squares problem • There is an X΄X matrix for this parameterization that does not have an inverse (perfect multicollinearity) • The parameter estimators here are biased (SAS proc glm)

Factor effects solution • Put a constraint on the i • Common to assume

Factor effects solution • Put a constraint on the i • Common to assume Σi i = 0 • This effectively reduces the number of parameters by 1 • Numerous other constraints possible

Consequences • Regardless of constraint, we always have μi = μ + i •

Consequences • Regardless of constraint, we always have μi = μ + i • The constraint Σi i = 0 implies – μ = (Σi μi)/r (unweighted grand mean) – i = μi – μ (group effect) • The “unweighted” complicates things when the ni are not all equal; see KNNL p 702 -708

Hypotheses • H 0: μ 1 = μ 2 = … = μ r

Hypotheses • H 0: μ 1 = μ 2 = … = μ r • H 1: not all of the μi are equal are translated into • H 0: 1 = 2 = … = r = 0 • H 1: at least one i is not 0

Estimates of parameters • With the constraint Σi i = 0

Estimates of parameters • With the constraint Σi i = 0

Solution used by SAS • Recall, X΄X does not have an inverse • We

Solution used by SAS • Recall, X΄X does not have an inverse • We can use a generalized inverse in its place • (X΄X)- is the standard notation • There are many generalized inverses, each corresponding to a different constraint

Solution used by SAS • (X΄X)- used in proc glm corresponds to the constraint

Solution used by SAS • (X΄X)- used in proc glm corresponds to the constraint r = 0 • Recall that μ and the i are not estimable • But the linear combinations μ + i are estimable • These are estimated by the cell means

Cereal package example • • Y is the number of cases of cereal sold

Cereal package example • • Y is the number of cases of cereal sold X is the design of the cereal package i =1 to 4 levels j =1 to ni stores with design i

SAS coding for X • Class statement generates r explanatory variables • The ith

SAS coding for X • Class statement generates r explanatory variables • The ith explanatory variable is equal to 1 if the observation is from the ith group • In other words, the rows of X are 1 1 0 0 0 for design=1 1 0 0 for design=2 1 0 0 1 0 for design=3 1 0 0 0 1 for design=4

Some options proc glm data=a 1; class design; model cases=design /xpx inverse solution; run;

Some options proc glm data=a 1; class design; model cases=design /xpx inverse solution; run;

Also contains X’Y Output The X'X Matrix Int d 1 d 2 d 3

Also contains X’Y Output The X'X Matrix Int d 1 d 2 d 3 d 4 cases Int 19 5 5 4 5 354 d 1 5 5 0 0 0 73 d 2 5 0 0 67 d 3 4 0 0 4 0 78 d 4 5 0 0 0 5 136 cases 354 73 67 78 136 7342

Output X'X Generalized Inverse (g 2) Int d 1 d 2 d 3 d

Output X'X Generalized Inverse (g 2) Int d 1 d 2 d 3 d 4 cases Int 0. 2 -0. 2 0 27. 2 d 1 -0. 2 0. 4 0. 2 0 -12. 6 d 2 -0. 2 0. 4 0. 2 0 -13. 8 d 3 -0. 2 0. 45 0 -7. 7 d 4 0 0 0 cases 27. 2 -12. 6 -13. 8 -7. 7 0 158. 2

Output matrix • Actually, this matrix is (X΄X)Y΄X(X΄X)- X΄Y Y΄Y-Y΄X(X΄X)- X΄Y • Parameter estimates

Output matrix • Actually, this matrix is (X΄X)Y΄X(X΄X)- X΄Y Y΄Y-Y΄X(X΄X)- X΄Y • Parameter estimates are in upper right corner, SSE is lower right corner (last column on previous page)

Parameter estimates Par Int d 1 d 2 d 3 d 4 Est 27.

Parameter estimates Par Int d 1 d 2 d 3 d 4 Est 27. 2 B -12. 6 B -13. 8 B -7. 7 B 0. 0 B St Err 1. 45 2. 05 2. 17. t 18. 73 -6. 13 -6. 72 -3. 53. P <. 0001 0. 0030.

Caution Message NOTE: The X'X matrix has been found to be singular, and a

Caution Message NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

Interpretation • If r = 0 (in our case, 4 = 0), then the

Interpretation • If r = 0 (in our case, 4 = 0), then the corresponding estimate should be zero • the intercept μ is estimated by the mean of the observations in group 4 • since μ + i is the mean of group i, the i are the differences between the mean of group i and the mean of group 4

Recall the means output Level of design N Mean Std Dev 1 2 3

Recall the means output Level of design N Mean Std Dev 1 2 3 4 14. 6 13. 4 19. 5 27. 2 2. 3 3. 6 2. 6 3. 9 5 5 4 5

Parameter estimates based on means Level of design Mean 1 2 3 4 14.

Parameter estimates based on means Level of design Mean 1 2 3 4 14. 6 13. 4 19. 5 27. 2 = 14. 6 -27. 2 = -12. 6 = 13. 4 -27. 2 = -13. 8 = 19. 5 -27. 2 = -7. 7 = 27. 2 -27. 2 = 0

Last slide • Read KNNL Chapter 16 up to 16. 10 • We used

Last slide • Read KNNL Chapter 16 up to 16. 10 • We used programs topic 20. sas to generate the output for today • Will focus more on the relationship between regression and one-way ANOVA in next topic