AMS 572 ANOVA: One-Way, Two-Way, and Multiway

Group 3
1 - Intro & Hist. - Na Chan
2 - Basics of ANOVA - Alla Tashlitsky
3 - Data Collection - Bryan Rong
4 - Checking Assumptions in SAS - Junying Zhang
5 - 1-Way ANOVA Derivation - Yingying Lin and Wenyi Dong
6 - 1-Way ANOVA in SAS - Yingying Lin and Wenyi Dong
7 - 2-Way ANOVA Derivation - Peng Yang
8 - 2-Way ANOVA in SAS - Phil Caffrey and Yin Diao
9 - Multi-Way ANOVA Derivation - Michael Biro
10 - ANOVA and Regression - Cris (Jiangyang) Liu

Intro & History
Na Chen

USES OF T-TEST
• A one-sample location test of whether the mean of a normally distributed population has a value specified in a null hypothesis.
• A two-sample location test of the null hypothesis that the means of two normally distributed populations are equal.

USES OF T-TEST (cont.)
• A test of the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero.
• A test of whether the slope of a regression line differs significantly from 0.

BACKGROUND
• Comparing means among more than 2 groups with t-tests requires 3 or more pairwise tests.
  – Time-consuming (the number of t-tests grows with the number of groups)
  – Inherently flawed (the probability of making at least one Type I error increases)
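
The growth in both the number of tests and the Type I error rate can be made concrete. The following is a minimal sketch, not taken from the slides: it assumes each pairwise test is run at level alpha and, as a simplification, that the tests are independent. The function names `pairwise_tests` and `familywise_error` are illustrative.

```python
from math import comb

def pairwise_tests(k: int) -> int:
    """Number of two-sample t-tests needed to compare k groups pairwise."""
    return comb(k, 2)

def familywise_error(k: int, alpha: float = 0.05) -> float:
    """P(at least one Type I error) across all pairwise tests.

    Assumption: the tests are independent, which is only an approximation
    because pairwise comparisons share groups.
    """
    m = pairwise_tests(k)
    return 1.0 - (1.0 - alpha) ** m

for k in (3, 4, 5, 10):
    print(k, pairwise_tests(k), round(familywise_error(k), 3))
```

Under these assumptions, 3 groups already need 3 tests and 10 groups need 45, and the chance of at least one false rejection climbs well above the nominal alpha.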

RONALD A. FISHER
• Biologist
• Eugenicist
• Geneticist
• Statistician
• ANOVA was informally used by researchers in the 1800s
• Formally proposed by Ronald A. Fisher in 1918
“A genius who almost single-handedly created the foundations for modern statistical science” - Anders Hald
“The greatest of Darwin's successors” - Richard Dawkins

HISTORY
• Fisher proposed a formal analysis of variance in his 1918 paper The Correlation Between Relatives on the Supposition of Mendelian Inheritance.
• His first application of the analysis of variance was published in 1921.
• It became widely known after being included in Fisher's 1925 book Statistical Methods for Research Workers.

DEFINITION
• An abbreviation for: ANalysis Of VAriance
• A procedure for comparing means from k independent groups, where k is 2 or greater.

ANOVA and T-TEST
• ANOVA and the t-test are similar
  – Both compare means between groups
• With 2 groups, both work
• With more than 2 groups, ANOVA is better

TYPES
• ANOVA - analysis of variance
  – One-way (F-ratio for 1 factor)
  – Two-way (F-ratios for 2 factors)
• ANCOVA - analysis of covariance
• MANOVA - multivariate analysis of variance

APPLICATION
• Biology
• Microbiology
• Medical Science
• Computer Science
• Industry
• Finance

Basics of ANOVA
Alla Tashlitsky

Definition
• ANOVA can determine whether there is a significant relationship between variables. It is also used to determine whether a measurable difference exists between two or more sample means.
• Objective: to identify important independent variables (predictor variables, yi’s) and determine how they affect the response variables.
• Whether one-way, two-way, or multi-way ANOVA applies depends on the number of independent variables in the experiment that affect the outcome of the hypothesis test.

Model & Assumptions

Classes of ANOVA
1. Fixed Effects: concrete (e.g. sex, age)
2. Random Effects: representative sample (e.g. treatments, locations, tests)
3. Mixed Effects: combination of fixed and random

Procedure
• $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ vs. $H_a$: at least one of the equalities does not hold
• $F = MSR/MSE \sim f_{k,\, n-(k+1)}$ under $H_0$, and $F = t^2$ when there are only 2 means
  – In the two-mean case, mean square regression is $MSR = SSR/1$ and mean square error is $MSE = SSE/(n-2)$
• The rejection region for a given significance level $\alpha$ is $F > f_{k,\, n-(k+1),\, \alpha}$
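
The identity F = t² for two means can be checked numerically. This is a minimal sketch on made-up data, assuming the pooled-variance (equal-variance) two-sample t statistic; the helper names `pooled_t` and `one_way_f` are illustrative and not from the slides.

```python
from statistics import mean

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance (equal-variance assumption)."""
    nx, ny = len(x), len(y)
    mx, my = mean(x), mean(y)
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    sp2 = (ssx + ssy) / (nx + ny - 2)          # pooled variance, equal to MSE
    return (mx - my) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group MS over within-group MS."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

x = [4.1, 5.0, 6.2, 5.5, 4.8]   # made-up data
y = [6.0, 7.1, 6.5, 7.8, 6.9]
t = pooled_t(x, y)
f = one_way_f([x, y])
print(t ** 2, f)                # the two values agree up to rounding
```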

Regression
• SST (sum of squares total) = SSR (sum of squares regression) + SSE (sum of squares error)
• Sample variance: $S^2 = MSE = SSE/(n-k)$, an unbiased estimator of $\sigma^2$

Mean Variation

Data Collection
Bryan Rong

Data Collection
• 3 industries: Application Software, Credit Service, Apparel Stores
• Sample 15 stocks from each industry
• For each stock, we observed the last 30 days and calculated
  – Mean daily percentage change
  – Mean daily percentage range
  – Mean volume

Application Software
• CA, Inc. [CA]
• Compuware Corporation [CPWR]
• Deltek, Inc. [PROJ]
• Epicor Software Corporation [EPIC]
• Fundtech Ltd. [FNDT]
• Intuit Inc. [INTU]
• Lawson Software, Inc. [LWSN]
• Microsoft Corporation [MSFT]
• MGT Capital Investments, Inc. [MGT]
• Magic Software Enterprises Ltd. [MGIC]
• SAP AG [SAP]
• Sonic Foundry, Inc. [SOFO]
• RealPage, Inc. [RP]
• Red Hat, Inc. [RHT]
• VeriSign, Inc. [VRSN]

Credit Service
• Advance America, Cash Advance Centers, Inc. [AEA]
• Alliance Data Systems Corporation [ADS]
• American Express Company [AXP]
• Asset Acceptance Capital Corp. [AACC]
• Capital One Financial Corporation [COF]
• CapitalSource Inc. [CSE]
• Cash America International, Inc. [CSH]
• Discover Financial Services [DFS]
• Equifax Inc. [EFX]
• Global Cash Access Holdings, Inc. [GCA]
• Federal Agricultural Mortgage Corporation [AGM]
• Intervest Bancshares Corporation [IBCA]
• Manhattan Bridge Capital, Inc. [LOAN]
• MicroFinancial Incorporated [MFI]
• Moody's Corporation [MCO]

Apparel Stores
• Abercrombie & Fitch Co. [ANF]
• American Eagle Outfitters, Inc. [AEO]
• bebe stores, inc. [BEBE]
• DSW Inc. [DSW]
• Express, Inc. [EXPR]
• J. Crew Group, Inc. [JCG]
• New York & Company, Inc. [NWY]
• Nordstrom, Inc. [JWN]
• Pacific Sunwear of California, Inc. [PSUN]
• The Gap, Inc. [GPS]
• The Buckle, Inc. [BKE]
• The Children's Place Retail Stores, Inc. [PLCE]
• The Dress Barn, Inc. [DBRN]
• The Finish Line, Inc. [FINL]
• Urban Outfitters, Inc. [URBN]


Final Data Look

Checking Assumptions
Zhang Junying

Major Assumptions of Analysis of Variance
• The assumptions
  – Normal populations
  – Independent samples
  – Equal (unknown) population variances
• Our purpose
  – Examine these assumptions by graphical analysis of residuals

Residual plot
• Violations of the basic assumptions and model adequacy can be easily investigated by examining the residuals.
• We define the residual for observation j in treatment i as $e_{ij} = y_{ij} - \hat{y}_{ij}$, the observed value minus the fitted value.
• If the model is adequate, the residuals should be structureless; that is, they should contain no obvious patterns.

Normality
• Why normal?
  – ANOVA is an analysis of variance
  – More specifically, an analysis of the ratio of two variances
  – Statistical inference is based on the F distribution, which is the distribution of the ratio of two independent chi-squared variables, each divided by its degrees of freedom
  – No surprise, then, that each variance in the ANOVA ratio comes from a parent normal distribution
• Normality is only needed for statistical inference.

SAS code for getting residuals
  PROC IMPORT datafile='C:\Users\junyzhang\Desktop\mydata.xls' out=stock;
  RUN;
  PROC PRINT DATA=stock;
  RUN;
  PROC GLM data=stock;
    CLASS indu;
    MODEL adpcdata=indu;
    OUTPUT out=stock1 p=yhat r=resid;
  RUN;
  PROC PRINT DATA=stock1;
  RUN;

Normality test
The normal probability plot of the residuals is used to check the normality assumption.
  proc univariate data=stock1 normal plot;
    var resid;
  run;

Normality
Tests for Normality (first output)
  Test                 Statistic          p Value
  Shapiro-Wilk         W     0.731203     Pr < W      <0.0001
  Kolmogorov-Smirnov   D     0.206069     Pr > D      <0.0100
  Cramer-von Mises     W-Sq  1.391667     Pr > W-Sq   <0.0050
  Anderson-Darling     A-Sq  7.797847     Pr > A-Sq
[Normal probability plot, vertical axis from 0.25 to 8.25]

Tests for Normality (second output)
  Test                 Statistic          p Value
  Shapiro-Wilk         W     0.989846     Pr < W      0.6521
  Kolmogorov-Smirnov   D     0.057951     Pr > D      >0.1500
  Cramer-von Mises     W-Sq  0.03225      Pr > W-Sq   >0.2500
  Anderson-Darling     A-Sq  0.224264     Pr > A-Sq
[Normal probability plot, vertical axis from -2.1 to 2.3, horizontal axis from -2 to +2]

Normality Tests

Independence
• Independent observations
  – No correlation between error terms
  – No correlation between independent variables and error
• Positively correlated data inflate the standard error
  – The estimates of the treatment means are more accurate than the standard error suggests.

SAS code for independence test
The plot of the residuals against the factor is used to check independence.
  proc plot;
    plot resid*indu;
  run;

Independence Tests

Homogeneity of Variances
• Eisenhart (1947) describes the problem of unequal variances as follows:
  – The ANOVA model is based on the ratio of the mean squares of the factors to the residual mean square
  – The residual mean square is the unbiased estimator of $\sigma^2$, the variance of a single observation
  – The between-treatment mean square takes into account not only the differences between observations, $\sigma^2$, just like the residual mean square, but also the variance between treatments
  – If there were non-constant variance among treatments, we would have to replace the residual mean square with some overall variance, $\sigma_a^2$, and a treatment variance, $\sigma_t^2$, which is some weighted version of $\sigma_a^2$
  – The “neatness” of ANOVA is lost

SAS code for homogeneity of variances test
The plot of the residuals against the fitted values is used to check the constant-variance assumption.
  proc plot;
    plot resid*yhat;
  run;

Data with Homogeneity of Variances

Tests for Homogeneity of Variances

Result about our data
  – Normal populations
  – Nearly independent samples
  – Equal (unknown) population variances
So we can employ ANOVA to analyze our data.

1-Way ANOVA Derivation and SAS
Yingying Lin & Wenyi Dong

Derivation – 1-Way ANOVA
• Hypotheses
  – $H_0: \mu = \mu_1 = \mu_2 = \mu_3 = \dots = \mu_n$
  – $H_1: \mu_i \neq \mu_j$ for some $i, j$
• We assume that the jth observation in group i is related to the mean by $x_{ij} = \mu + (\mu_i - \mu) + \varepsilon_{ij}$, where $\varepsilon_{ij}$ is a random noise term.
• We wish to separate the variability of the individual observations into parts due to differences between groups and individual variability.

Derivation – 1-Way ANOVA (cont.)

Derivation – 1-Way ANOVA (cont.)
• Using the above equation, we define
• We can show that
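
For reference, the standard one-way decomposition that goes with the model $x_{ij} = \mu + (\mu_i - \mu) + \varepsilon_{ij}$ can be sketched as follows. The notation is an assumption rather than the slides' own: $n$ groups, $n_i$ observations in group $i$, $N$ observations in total, $\bar{x}_{i\cdot}$ for the mean of group $i$, and $\bar{x}_{\cdot\cdot}$ for the grand mean.

```latex
\begin{align*}
x_{ij} - \bar{x}_{\cdot\cdot} &= (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot}) + (x_{ij} - \bar{x}_{i\cdot}) \\
SS_{\text{total}} &= \sum_{i=1}^{n}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{\cdot\cdot})^2 \\
SS_{\text{between}} &= \sum_{i=1}^{n} n_i\, (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^2 \\
SS_{\text{within}} &= \sum_{i=1}^{n}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i\cdot})^2 \\
SS_{\text{total}} &= SS_{\text{between}} + SS_{\text{within}} \\
MS_{\text{between}} &= \frac{SS_{\text{between}}}{n-1}, \qquad
MS_{\text{within}} = \frac{SS_{\text{within}}}{N-n}, \qquad
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
\end{align*}
```

The cross term drops out when the first line is squared and summed, because $\sum_{j} (x_{ij} - \bar{x}_{i\cdot}) = 0$ within every group.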

Derivation – 1-Way ANOVA (cont.)
• Given the distributions of the mean sums of squares (MSS), we reject the null hypothesis if the between-group variance is significantly higher than the within-group variance.
• That is, we reject the null hypothesis if $F > f_{n-1,\, N-n,\, \alpha}$

Brief Summary Statistics
• Code
  proc means data=stock maxdec=5 n mean std;
    by industry;
    var ADPC;
  run;
  Gets simple summary statistics (sample size, sample mean, and SD for each industry) with a maximum of 5 decimal places.

Brief Summary Statistics
• Output
  Industry               N    Mean      Std Dev
  Apparel Stores         15   0.00253   0.00356
  Application Software   15   0.00413   0.00742
  Credit Service         15   0.00135   0.00443
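
These summary statistics are enough to recompute the one-way F statistic by hand. The sketch below is an illustration in Python rather than SAS and is not part of the original analysis; it uses only the rounded values printed above, so the sums of squares match the SAS ANOVA table only approximately. The function name `f_from_summary` is hypothetical.

```python
# Summary statistics copied from the PROC MEANS output: (n, mean, std dev)
summary = {
    "Apparel Stores":       (15, 0.00253, 0.00356),
    "Application Software": (15, 0.00413, 0.00742),
    "Credit Service":       (15, 0.00135, 0.00443),
}

def f_from_summary(stats):
    """One-way ANOVA sums of squares and F from per-group (n, mean, sd) triples."""
    k = len(stats)
    total_n = sum(n for n, _, _ in stats.values())
    grand_mean = sum(n * m for n, m, _ in stats.values()) / total_n
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in stats.values())
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in stats.values())
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (total_n - k)
    return ss_between, ss_within, ms_between / ms_within

ss_b, ss_w, f_value = f_from_summary(summary)
print(ss_b, ss_w, f_value)
```

With these inputs the F statistic comes out close to 1, on 2 and 42 degrees of freedom.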

Data Plot
• Code
  proc plot data=stock;
    plot industry*ADPC;
  run;
  Produces crude graphical output.

Data Plot
• Output
  [Plot of industry*ADPC: industry (CreditSe, Applicat, ApparelS) on the vertical axis against ADPC on the horizontal axis, from -0.015 to 0.025. Legend: A = 1 obs, B = 2 obs, D = 4 obs.]

One Way ANOVA Test
• Code
  proc anova data=stock;
    class industry;
    model ADPC=industry;
    means industry / tukey cldiff;
    means industry / tukey lines;
  run;
• The class statement indicates that “industry” is a factor.
• The model statement assumes that “industry” influences average daily percentage change.
• means industry / tukey cldiff; performs multiple comparisons by Tukey’s method and gives the actual confidence intervals.
• means industry / tukey lines; gives a pictorial display of the comparisons.

GLM analysis
• Code
  proc glm data=stock;
    class industry;
    model ADPC=industry;
    output out=stockfit p=yhat r=resid;
  run;
  This procedure is similar to proc anova, but proc glm allows residual plots at the cost of more extraneous output.

One Way ANOVA Test
• Output
  Dependent Variable: ADPC
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              2   0.00005833       0.00002916    1.00      0.3757
  Error             42   0.00122217       0.00002910
  Corrected Total   44   0.00128050

  R-Square   Coeff Var   Root MSE   ADPC Mean
  0.045552   201.8054    0.005394   0.002673

  Source     DF   Anova SS     Mean Square   F Value   Pr > F
  industry    2   0.00005833   0.00002916    1.00      0.3757

One Way ANOVA Test
Tukey's Studentized Range (HSD) Test for ADPC
  Alpha                                 0.05
  Error Degrees of Freedom              42
  Error Mean Square                     0.000029
  Critical Value of Studentized Range   3.43582
  Minimum Significant Difference        0.0048
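
The minimum significant difference follows from the other quantities in this output. A minimal sketch, assuming equal group sizes of 15 and the usual Tukey formula q·sqrt(MSE/n); the group means are the rounded values from the summary statistics, and the short labels are illustrative.

```python
from math import sqrt

q_crit = 3.43582        # critical value of the studentized range, from the output
mse = 0.00002910        # error mean square, from the ANOVA table
n_per_group = 15        # stocks sampled per industry

# Tukey's minimum significant difference for equal group sizes
msd = q_crit * sqrt(mse / n_per_group)

means = {"Applicat": 0.00413, "ApparelS": 0.00253, "CreditSe": 0.00135}
pairs = [("Applicat", "ApparelS"), ("Applicat", "CreditSe"), ("ApparelS", "CreditSe")]

for a, b in pairs:
    diff = means[a] - means[b]
    lower, upper = diff - msd, diff + msd
    significant = abs(diff) > msd
    print(f"{a} - {b}: diff={diff:.6f}, CI=({lower:.6f}, {upper:.6f}), significant={significant}")
```

No pairwise difference exceeds the minimum significant difference, so every simultaneous interval contains zero.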

One Way ANOVA Test
  Industry Comparison    Difference Between Means   Simultaneous 95% Confidence Limits
  Applicat - ApparelS     0.001601                  -0.003184    0.006387
  Applicat - CreditSe     0.002778                  -0.002008    0.007563
  ApparelS - Applicat    -0.001601                  -0.006387    0.003184
  ApparelS - CreditSe     0.001177                  -0.003609    0.005962
  CreditSe - Applicat    -0.002778                  -0.007563    0.002008
  CreditSe - ApparelS    -0.001177                  -0.005962    0.003609

Univariate Procedure
• Code
  proc univariate data=stockfit plot normal;
    var resid;
  run;
  We use proc univariate to produce the stem-and-leaf and normal probability plots, and we use the stem-and-leaf plot to visualize the overall distribution of a variable.

Univariate Procedure
• Output
  Moments
  N                 45           Sum Weights        45
  Mean              0            Sum Observations   0
  Std Deviation     0.00527035   Variance           0.00002778
  Skewness          1.33008795   Kurtosis           5.46395169
  Uncorrected SS    0.00122217   Corrected SS       0.00122217
  Coeff Variation   .            Std Error Mean     0.00078566

Tests for Location: Mu0=0
  Test          Statistic    p Value
  Student's t   t   0        Pr > |t|    1.0000
  Sign          M   -1.5     Pr >= |M|   0.7660
  Signed Rank   S   -43.5    Pr >= |S|   0.6288

Basic Statistical Measures
  Location              Variability
  Mean     0.00000      Std Deviation         0.00527
  Median   -0.00048     Variance              0.0000278
  Mode     .            Range                 0.03389
                        Interquartile Range   0.00623

Tests for Normality
  Test                 Statistic          p Value
  Shapiro-Wilk         W     0.904256     Pr < W      0.0013
  Kolmogorov-Smirnov   D     0.112584     Pr > D      >0.1500
  Cramer-von Mises     W-Sq  0.096018     Pr > W-Sq   0.1266
  Anderson-Darling     A-Sq  0.781507     Pr > A-Sq   0.0410

Quantiles
  Quantile     Estimate
  100% Max      0.021509105
  99%           0.021509105
  95%           0.007261567
  90%           0.005106613
  75% Q3        0.002667399
  50% Median   -0.000477723
  25% Q1       -0.003565176
  10%          -0.004824061
  5%           -0.005444811
  1%           -0.012376248
  0% Min       -0.012376248

Extreme Observations
  Lowest                Highest
  Value         Obs     Value        Obs
  -0.01237625   41      0.00510661   6
  -0.00807339   25      0.00596875   34
  -0.00544481   13      0.00726157   29
  -0.00483936   3       0.00814126   27
  -0.00482406   28      0.02150911   22

Stem Leaf Plot and Boxplot
[Stem-and-leaf plot and boxplot of the residuals; stems run from -12 to 20. Multiply Stem.Leaf by 10**-3.]

Plot
• Code
  proc plot;
    plot resid*industry;
    plot resid*yhat;
  run;
  Plots the residuals against industry and against the fitted ADPC values.

Graph
[Plot of resid*industry: residuals from -0.015 to 0.025 on the vertical axis against industry (ApparelS, Applicat, CreditSe). Legend: A = 1 obs, B = 2 obs, D = 4 obs.]

Plot of resid*yhat
[Plot of resid*yhat: residuals from -0.015 to 0.025 on the vertical axis against yhat from 0.0010 to 0.0035. Legend: A = 1 obs, B = 2 obs, D = 4 obs.]

Conclusion
• The one-way ANOVA test gives F = 1.00 and p = 0.3757. Since the p-value is larger than the 0.05 significance level, we fail to reject the null hypothesis: the data show no significant difference between the mean daily percentage changes of stocks in the different industries. On this evidence, it makes no detectable difference which of these industries we buy stocks in over the long term.

2-Way ANOVA Derivation and SAS
Peng Yang, Phil Caffrey, Yin Diao

2-Way ANOVA Derivation
We now have two factors (A & B).

2-Way ANOVA Derivation
Linear Model, Dot Notation

2-Way ANOVA Derivation
Least Squares Method
SST = SSA + SSB + SSAB + SSE
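
A sketch of the standard balanced two-way model behind this decomposition is given below. The notation is an assumption, not recovered from the slides: factor A has $a$ levels, factor B has $b$ levels, each cell has $n$ replicates, and dots in a subscript denote averaging over that index.

```latex
\begin{align*}
y_{ijk} &= \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk},
  \qquad \varepsilon_{ijk} \sim N(0, \sigma^2) \text{ i.i.d.} \\
SSA &= b\,n \sum_{i=1}^{a} (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2,
  & \text{d.f.} &= a-1 \\
SSB &= a\,n \sum_{j=1}^{b} (\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot})^2,
  & \text{d.f.} &= b-1 \\
SSAB &= n \sum_{i=1}^{a}\sum_{j=1}^{b}
  (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot})^2,
  & \text{d.f.} &= (a-1)(b-1) \\
SSE &= \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij\cdot})^2,
  & \text{d.f.} &= ab(n-1) \\
SST &= SSA + SSB + SSAB + SSE,
  & \text{d.f.} &= abn-1
\end{align*}
```

Each effect is tested by dividing its mean square (sum of squares over d.f.) by $MSE = SSE / (ab(n-1))$.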

2-Way ANOVA Derivation
Test Criteria and Rejection Conditions

2-Way ANOVA Derivation
Pivotal Quantity

2-Way ANOVA Derivation
Pivotal Quantity (cont.)

Two-Way ANOVA in SAS
By: Philip Caffrey & Yin Diao

Model
• An extension of one-way ANOVA. It provides more insight into how the two IVs (independent variables) interact and individually affect the DV (dependent variable). Thus, the main effects and the interaction effect of the two IVs on the DV need to be tested.
• Model:
• Null hypothesis:

Sum of Squares
Comparing each term with the error term leads to an F distribution. In this way, we can conclude whether there is a main effect or an interaction effect.
SSTOTAL = SSA + SSB + SSINTERACTION + SSERROR
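
The decomposition can be verified numerically on a small balanced design. This is a sketch on made-up numbers, not the stock data; the nested-list layout `data[i][j]` (replicates for level i of factor A and level j of factor B) and all variable names are assumptions for illustration.

```python
from statistics import mean

# Hypothetical balanced data: data[i][j] holds the replicates for
# level i of factor A and level j of factor B (made-up numbers).
data = [
    [[4.0, 5.0, 6.0], [7.0, 8.0, 6.0]],
    [[5.0, 6.0, 4.0], [9.0, 8.0, 10.0]],
    [[3.0, 4.0, 5.0], [6.0, 5.0, 7.0]],
]

a, b, n = len(data), len(data[0]), len(data[0][0])
grand = mean(v for row in data for cell in row for v in cell)
a_means = [mean(v for cell in row for v in cell) for row in data]
b_means = [mean(v for row in data for v in row[j]) for j in range(b)]
cell_means = [[mean(cell) for cell in row] for row in data]

ss_a = b * n * sum((m - grand) ** 2 for m in a_means)
ss_b = a * n * sum((m - grand) ** 2 for m in b_means)
ss_ab = n * sum(
    (cell_means[i][j] - a_means[i] - b_means[j] + grand) ** 2
    for i in range(a) for j in range(b)
)
ss_e = sum(
    (v - cell_means[i][j]) ** 2
    for i in range(a) for j in range(b) for v in data[i][j]
)
ss_total = sum((v - grand) ** 2 for row in data for cell in row for v in cell)

# Each effect's mean square is compared with the error mean square
ms_e = ss_e / (a * b * (n - 1))
f_a = (ss_a / (a - 1)) / ms_e
f_b = (ss_b / (b - 1)) / ms_e
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_e
print(ss_total, ss_a + ss_b + ss_ab + ss_e, f_a, f_b, f_ab)
```

The identity holds exactly only for balanced designs (equal replicates per cell), which is the case assumed here.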

Example
Using the same data as in the one-way analysis, we now separate the data further by introducing a second factor, Average Daily Volume.

Example
Factor 1: Industry
• Apparel Stores
• Application Software
• Credit Services
Factor 2: Average Daily Volume
• Low
• Medium
• High

Two-Way Design
• Rows (VOLUME): High, Medium, Low
• Columns (INDUSTRY): Credit, Apparel, Software
• Each of the 9 cells is repeated 5 times

Using SAS
Code:
  PROC IMPORT DATAFILE='G: Stony Brok Univ Text BooksAMS ProjectData.xls' OUT=TWOWAY;
  RUN;
  PROC ANOVA DATA=TWOWAY;
    TITLE "ANALYSIS OF STOCK DATA";
    CLASS INDUSTRY VOLUME;
    MODEL ADPC = INDUSTRY | VOLUME;
    MEANS INDUSTRY | VOLUME / TUKEY CLDIFF;
  RUN;

Using SAS
  /* PLOT THE CELL MEANS */
  PROC MEANS DATA=WAY NOPRINT;
    CLASS INDT ADTV;
    VAR ADPC;
    OUTPUT OUT=MEANS MEAN=;
  RUN;
  PROC GPLOT DATA=MEANS;
    PLOT INDT*ADTV;
  RUN;

ANOVA Table
No significant results

Using SAS
• To test the main effect of one IV, we combine all the data across the levels of the other IV. This is what the one-way ANOVA does.
• From the ANOVA we know there are no significant main effects and no significant interaction effect of the two IVs.
• To look for an interaction effect, we can plot the means of each cell formed by combining the levels of the IVs.

PLOT OF CELL MEANS
Industry by Average Daily Volume

Interpreting the Output
• Given that the F tests were not significant, we would normally stop our analysis here.
• If an F test is significant, we would want to know exactly which means differ from each other. Use Tukey’s test:
  MEANS INDUSTRY | VOLUME / TUKEY CLDIFF;

Interpreting the Output
Comparing Means
  Comparison             Diff. b/w Means   95% CI
  Software - Apparel      0.001601         [-0.003184, 0.006387]
  Software - Credit       0.002778         [-0.002008, 0.007563]
  Credit - Apparel       -0.001177         [-0.005962, 0.003609]
  Med.Vol. - Low.Vol.    -0.003698         [-0.008435, 0.001038]
  Med.Vol. - High.Vol.   -0.001252         [-0.005989, 0.003484]
  High.Vol. - Low.Vol.   -0.002446         [-0.007182, 0.002290]

Conclusion
• We cannot conclude that there is a significant difference between any of the group means.
• We found no evidence that either IV affects the DV.

Multi-Way ANOVA Derivation
Michael Biro & Cris Liu

M-way ANOVA (Derivation)
• Let us have n factors, $A_1, A_2, \dots, A_n$, each with 2 or more levels, $a_1, a_2, \dots, a_n$, respectively. Then there are $N = a_1 a_2 \cdots a_n$ types of treatment to conduct, with each treatment having sample size $n_i$. Let $x_{i_1 i_2 \dots i_n k}$ be the kth observation from treatment $i_1 i_2 \dots i_n$.
• By the ANOVA assumptions, $x_{i_1 i_2 \dots i_n k}$ is a normally distributed random variable. We use the model $x_{i_1 i_2 \dots i_n k} = \mu_{i_1 i_2 \dots i_n} + \varepsilon_{i_1 i_2 \dots i_n k}$, where the residuals $\varepsilon_{i_1 i_2 \dots i_n k}$ are i.i.d. $N(0, \sigma^2)$.

M-way ANOVA (Derivation)

M-way ANOVA (Derivation)
• These are all distributed as independent $\chi^2$ random variables (when multiplied by the correct constants and when some hypotheses hold), with d.f. satisfying the equation:

M-way ANOVA (Derivation)
• There are a total of $2^m$ hypotheses in an m-way ANOVA.
  – The null hypothesis, which states that there is no difference or interaction between factors
  – For k from 1 to m, there are $\binom{m}{k}$ alternative hypotheses about the interaction between every collection of k factors.
  – Then we have $1 + \binom{m}{1} + \binom{m}{2} + \dots + \binom{m}{m} = 2^m$ by a well-known combinatorial identity.
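
The count can be checked by enumerating the collections of factors directly. A minimal sketch for m = 3 with hypothetical factor labels; `effect_subsets` is an illustrative helper, not something from the slides.

```python
from itertools import combinations
from math import comb

def effect_subsets(factors):
    """Every non-empty collection of factors (main effects and interactions)."""
    return [
        subset
        for k in range(1, len(factors) + 1)
        for subset in combinations(factors, k)
    ]

factors = ["A1", "A2", "A3"]          # hypothetical factor labels, m = 3
m = len(factors)
subsets = effect_subsets(factors)

# 1 null hypothesis + C(m,1) + ... + C(m,m) alternatives = 2**m
total = 1 + sum(comb(m, k) for k in range(1, m + 1))
print(len(subsets), total)
```

For three factors this lists 3 main effects, 3 two-way interactions, and 1 three-way interaction, which together with the null hypothesis gives 8 hypotheses.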

M-way ANOVA (Derivation)
• These hypotheses are:

M-way ANOVA (Derivation)
• We want to see if the variability between groups is larger than the variability within the groups.
• To do this, we use the F distribution as our pivotal quantity, and then we can derive the proper tests, very similar to the 1-way and 2-way tests.

M-way ANOVA (Derivation)

RELATIONSHIP BETWEEN ANOVA and Regression
Presenter: Cris J. Y. Liu

What we know
• Regression is the statistical model used to predict a continuous outcome on the basis of one or more continuous predictor variables.
• ANOVA compares several groups (usually categorical predictor variables) in terms of a certain dependent variable (continuous outcome). If there is a mixture of categorical and continuous predictors, ANCOVA is an alternative method.
Take a second look
• They are just different sides of the same coin!

Review of ANOVA
• Compare the means of different groups
• n groups, $n_i$ elements in the ith group, N elements in total
• SST = SSbetween + SSwithin
• What about only two groups, X and Y, each with n data points?

Review of Simple Linear Regression
• We try to find a line $y = \beta_0 + \beta_1 x$ that best fits our data, so that we can calculate the best estimate of y from x.
• Least squares finds the $\beta_0$ and $\beta_1$ that minimize the distance $Q = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$ between the actual and estimated scores.
• Let the predicted values form one group, while the other group consists of all the original values.
• It is a special (and also simple) case of ANOVA!

Review of Regression
  Total         =   Model (Between)     +   Error (Within)
  d.f.: n-1     =   d.f.: 2-1 = 1       +   d.f.: n-2

ANOVA table of Regression

How are they alike?
• If we use the group means as the X values from which we predict Y, we can see that ANOVA and regression are the same!
• The group mean is the best prediction of a Y-score.
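
The equivalence can be demonstrated by regressing the response on a 0/1 dummy variable that codes group membership. This is a sketch on made-up two-group data; the dummy coding and variable names are assumptions for illustration.

```python
from statistics import mean

# Made-up two-group data; x is a 0/1 dummy variable coding group membership.
group0 = [2.0, 3.0, 4.0, 3.5]
group1 = [5.0, 6.5, 6.0, 7.0]
x = [0.0] * len(group0) + [1.0] * len(group1)
y = group0 + group1

# Simple linear regression by least squares
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar
fitted = [beta0 + beta1 * xi for xi in x]

ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((yi - f) ** 2 for yi, f in zip(y, fitted))

# One-way ANOVA sums of squares for the same two groups
ss_between = sum(len(g) * (mean(g) - y_bar) ** 2 for g in (group0, group1))
ss_within = sum((v - mean(g)) ** 2 for g in (group0, group1) for v in g)

print(beta0, beta0 + beta1)      # fitted values are the two group means
print(ssr, ss_between, sse, ss_within)
```

The fitted values equal the group means, so SSR coincides with SSbetween and SSE with SSwithin.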

Term comparison (Regression vs. ANOVA)
• Dependent variable
• Explanatory variable
• Total mean
• SSR corresponds to SSbetween
• SSE corresponds to SSwithin

Term comparison if more than one predictor
  Regression            ANOVA
  Multiple regression   Multi-way ANOVA
  Dummy variable        Categorical variable
  Interaction effect    Covariance
  ...                   ...

Notes
• Both are applicable only when the outcome variables are continuous.
• They share basically the same procedure for checking the underlying assumptions.

Robust ANOVA - Taguchi Method

What is Robustness?
• The term “robustness” is often used to refer to methods designed to be insensitive to distributional assumptions (such as normality) in general, and to unusual observations (“outliers”) in particular.
• Why robust ANOVA?
  – There is always the possibility that some observations contain excessive noise.
  – Excessive noise during experiments might lead to incorrect inferences.
  – Widely used in quality control

Robust ANOVA
• What do we want from robust ANOVA? Methods that withstand nonideal conditions while being no more difficult to perform than ordinary ANOVA.
• The standard technique, the least squares method, is highly sensitive to unusual observations.

Robust ANOVA
• Our aim is to choose $\beta$ to minimize $\sum_i \rho(r_i)$, where $r_i$ is the ith residual.
• In standard ANOVA, we let $\rho(x) = x^2$.
• We can also try some other $\rho(x)$.

Least absolute deviation
• It is well known that the median is much more robust to outliers than the mean.
• The least absolute deviation (LAD) estimate takes $\rho(x) = |x|$.
• How is LAD related to the median? The LAD estimator determines the “center” of the data set by minimizing the sum of the absolute deviations from the estimate of the center, which turns out to be the median.
• It has been shown to be quite effective in the presence of fat-tailed data.
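
The link between LAD and the median can be seen with a brute-force search. A minimal sketch on a made-up sample with one outlier; the grid search is only for illustration and is not how LAD estimates are computed in practice.

```python
from statistics import mean, median

def sum_abs_dev(center, data):
    """LAD objective: sum of absolute deviations from a candidate center."""
    return sum(abs(v - center) for v in data)

def sum_sq_dev(center, data):
    """Least squares objective: sum of squared deviations."""
    return sum((v - center) ** 2 for v in data)

data = [1.0, 2.0, 3.0, 4.0, 100.0]      # made-up sample with one outlier

# Brute-force search over a grid of candidate centers (step 0.01)
grid = [i / 100 for i in range(0, 10001)]
lad_center = min(grid, key=lambda c: sum_abs_dev(c, data))
ls_center = min(grid, key=lambda c: sum_sq_dev(c, data))

print(lad_center, median(data))   # LAD minimizer sits at the median
print(ls_center, mean(data))      # least squares minimizer sits at the mean
```

The single outlier drags the least squares center far from the bulk of the data, while the LAD center stays at the median.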

M-estimation
• M-estimation is based on replacing $\rho(\cdot)$ with a function that is less sensitive to unusual observations than the quadratic.
• The “M” indicates that the estimator is of maximum-likelihood type.
• LAD, with $\rho(x) = |x|$, is an example of a robust M-estimator.
• Another popular choice of $\rho$ is the Tukey bisquare: $\rho(r; c) = 1 - [1 - (r/c)^2]^3$ if $|r| \le c$, and $\rho(r; c) = 1$ otherwise, where r is the residual and c is a constant.
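
To see how these choices of ρ treat large residuals, the loss functions can be compared side by side. A sketch only: the bisquare is written in the normalized form that saturates at 1, and the default c = 4.685 is an assumed, commonly quoted tuning constant for standardized residuals rather than a value from the slides.

```python
def rho_quadratic(r: float) -> float:
    """Least squares loss."""
    return r * r

def rho_absolute(r: float) -> float:
    """Least absolute deviation loss."""
    return abs(r)

def rho_bisquare(r: float, c: float = 4.685) -> float:
    """Tukey bisquare loss, scaled so that it saturates at 1 for |r| > c."""
    if abs(r) > c:
        return 1.0
    return 1.0 - (1.0 - (r / c) ** 2) ** 3

for r in (0.0, 1.0, 4.685, 10.0, 100.0):
    print(r, rho_quadratic(r), rho_absolute(r), round(rho_bisquare(r), 4))
```

The quadratic loss grows without bound, the absolute loss grows linearly, and the bisquare stops growing once the residual exceeds c, so a gross outlier contributes no more than a moderate one.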

Suggestion
• These robust analyses may not take the place of standard ANOVA analyses in this context.
• Rather, we believe that the robust analyses should be undertaken as an adjunct to the standard analyses.

Questions?

Thank You