Praktische Statistik für Umwelt- und Geowissenschaftler (Practical Statistics for Environmental and Earth Scientists), 26/06/2008

Uni-/bivariate problems: choosing a test
Check the prerequisites first:
• Independence (runs test / phase-frequency test)
• Normal distribution (KS test / chi-square test)
• Outliers (Dixon test / Chebyshev's theorem)
If the prerequisites are met, use parametric methods; if not, use non-parametric methods.
• Distribution test: KS test / chi-square test
• Comparison of a mean with the population parameter: one-sample t-test; chi-square test
• Comparison of 2 independent samples: two-sample t-test with F-test / Levene test (parametric); U-test (non-parametric)
• Comparison of 2 paired samples: paired-samples t-test (parametric); Wilcoxon test (non-parametric)
• Comparison of k independent samples: analysis of variance (ANOVA) (parametric); H-test (non-parametric)
• Multiple comparisons: Bonferroni correction, Šidák-Bonferroni correction, post hoc tests
• Analysis of association: Pearson correlation / regression analysis (parametric); Spearman rank correlation (non-parametric)

Categorization of multivariate methods
Categories of data analysis: reduction, classification, data relationships, data mining.
Methods shown: Principal Component Analysis, Factor Analysis, Non-linear PCA, Correspondence Analysis, Homogeneity Analysis, Multidimensional Scaling, Linear Mixture Analysis, Procrustes Analysis, Discriminant Analysis, Hierarchical Cluster Analysis, K-Means, Multiple Regression, Principal Component Regression, Partial Least Squares 1 and 2, Canonical Analysis, Artificial Neural Networks (ANN), Support Vector Machines (SVM).

Procedure for statistical testing:
a) State the hypotheses H0 and H1
b) Decide between a one-sided and a two-sided question
c) Select the test procedure
d) Fix the significance level (type I and type II errors)
e) Run the test
f) Interpret the result

Type I and type II errors
Decision based on the sample versus what holds in the population:
• H0 holds, decision for H0: correct, with probability 1 - α
• H0 holds, decision for H1: α error (type I), P(H1|H0) = α
• H1 holds, decision for H0: β error (type II), P(H0|H1) = β
• H1 holds, decision for H1: correct, with probability 1 - β

Determining error probabilities
Let $\bar{x}$ be the mean of a sample of unknown origin, normally distributed by the central limit theorem. Two hypotheses compete:
• The sample comes from the Eifel
• The sample comes from the Hunsrück

Test: one-sample Gauss test (z-test) with $z = (\bar{x} - \mu_0) / (\sigma / \sqrt{n})$
• The observed value cuts off 0.62% of the normal distribution (p-value = probability of error).
• At α = 5% the one-sided critical value is z ≈ 1.65, so H0 must be rejected.
• The p-value gets smaller with a larger difference, a smaller standard deviation and a larger n.
[Figure: standard normal density with the critical value for α = 5% (z ≈ 1.65)]
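
The one-sample Gauss test can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own calculation: the sample mean, µ0, σ and n below are hypothetical values chosen so that z = 2.5, which reproduces a one-sided p-value of about 0.62%.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(x_bar, mu0, sigma, n):
    """Return z, the one-sided (upper-tail) p-value and the two-sided p-value."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal distribution
    p_one_sided = 1.0 - std_normal.cdf(z)
    p_two_sided = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return z, p_one_sided, p_two_sided


# Hypothetical numbers (not from the slides): they give z = 2.5
z, p_one, p_two = one_sample_z_test(x_bar=105.0, mu0=100.0, sigma=10.0, n=25)
reject_h0 = p_one < 0.05  # one-sided test at alpha = 5%
```

The two-sided p-value is twice the one-sided one, which is why the same observation is rejected by a narrower margin in a two-sided test.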

Question: which $\bar{x}$ must be exceeded in order to just reject H0 at α = 5%?
z = 1.65 cuts off exactly 5% on the right-hand side of the standard normal distribution, so the critical value is $\bar{x}_{crit} = \mu_0 + 1.65\,\sigma/\sqrt{n}$.

Two-sided test: z = ±1.96 cuts off exactly 2.5% on each side of the standard normal distribution.
• H0 is rejected by a narrower margin.
• The choice between a one-sided and a two-sided test must be made in advance.
[Figure: standard normal density with two-sided rejection regions]

The β error
• It can only be determined for a specific H1.
• We test whether the sample is compatible with the parameter of the Eifel samples.
• The observed value cuts off 10.6% on the left-hand side of the standard normal distribution.
• If one decides for H0 on the basis of this result, one commits a β error with a probability of 10.6%, i.e. one rejects H1 ("the sample comes from the Eifel") although it is true.

Power of a test
• The β error probability is the probability of rejecting H1 although a difference exists.
• 1 - β is the probability of deciding in favour of H1 when H1 holds.
Determining the power: we have found the value of $\bar{x}$ from which the test just becomes significant ("the sample comes from the Eifel").

Determining the power
• β probability: 0.0179
• Power: 1 - β = 1 - 0.0179 = 0.9821
• The probability that, at the chosen significance level (α = 5%), we rightly decide in favour of H1 is 98.21%.
Determinants of the power:
• 1 - β decreases as the difference µ0 - µ1 gets smaller.
• 1 - β increases with growing n.
• 1 - β decreases with growing variability of the variable.
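
The three determinants of the power can be checked numerically. The sketch below assumes a one-sided (upper-tail) one-sample z-test with known σ; the function name `z_test_power` and all numbers are hypothetical and do not reproduce the Eifel/Hunsrück example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power 1 - beta of an upper-tail z-test of H0: mu = mu0 against H1: mu = mu1 > mu0."""
    se = sigma / sqrt(n)  # standard error of the mean
    # Smallest sample mean that is just significant at level alpha
    x_crit = mu0 + NormalDist().inv_cdf(1.0 - alpha) * se
    # beta: probability of staying below x_crit although H1 is true
    beta = NormalDist(mu1, se).cdf(x_crit)
    return 1.0 - beta


base = z_test_power(mu0=100.0, mu1=104.0, sigma=8.0, n=16)
smaller_difference = z_test_power(mu0=100.0, mu1=102.0, sigma=8.0, n=16)
larger_n = z_test_power(mu0=100.0, mu1=104.0, sigma=8.0, n=64)
larger_sigma = z_test_power(mu0=100.0, mu1=104.0, sigma=16.0, n=16)
```

Comparing the four values shows the pattern listed above: a smaller difference or a larger spread lowers the power, a larger n raises it.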

Why multivariate statistics?
Remember:
• Fancy statistics do not make up for poor planning
• Design is more important than analysis

Categorization of multivariate methods
• Prediction methods: use some variables to predict unknown or future values of other variables.
• Description methods: find human-interpretable patterns that describe the data.
From Fayyad et al., Advances in Knowledge Discovery and Data Mining, 1996

Multiple Linear Regression Analysis: The General Linear Model
A general linear model can be:
• a straight-line model
• a quadratic model (second-order model)
• a model with more than one independent variable
[Figure: response surface E(y_i) over (x_i1, x_i2), with axes x1, x2 and y and intercept β0 = 10]

Multiple Linear Regression Analysis
$y = x_1 + x_2 - x_1 + 2x_1^2 + 2x_2^2$

Multiple Linear Regression Analysis: Parameter Estimation
The goal of an estimator is to provide an estimate of a particular parameter based on the data. There are several ways to characterize estimators:
• Bias: an unbiased estimator has an expected value equal to the true value, so each parameter is neither consistently over- nor underestimated. (An estimator that converges to the true value with a large enough sample size is called consistent.)
• Likelihood: the maximum likelihood (ML) estimator is the one that makes the observed data most likely. ML estimators are not always unbiased for small N.
• Efficiency: an estimator with lower variance is more efficient, in the sense that it is likely to be closer to the true value over samples. The "best" estimator is the one with minimum variance of all estimators.

Multiple Linear Regression Analysis
A linear model can be written as $y = X\beta + \varepsilon$, where:
• $y$ is an N-dimensional column vector of observations
• $\beta$ is a (k+1)-dimensional column vector of unknown parameters
• $\varepsilon$ is an N-dimensional random column vector of unobserved errors
Matrix $X$ is the N × (k+1) matrix whose row i is $(1, x_{i1}, \dots, x_{ik})$. The first column of $X$ is the vector of ones, so that the first coefficient is the intercept.
The unknown coefficient vector is estimated by minimizing the residual sum of squares: $\hat{\beta} = \arg\min_\beta (y - X\beta)^T (y - X\beta) = (X^T X)^{-1} X^T y$
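
As an illustration of the least-squares estimate, the normal equations $(X^T X)\beta = X^T y$ can be solved directly. The sketch below uses only the Python standard library; the helper names and the small data set are hypothetical, and the data are constructed without noise from y = 1 + 2·x1 - x2 so that the result is easy to check.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(x_rows, y):
    """OLS coefficients (intercept first) from the normal equations X'X b = X'y."""
    X = [[1.0] + list(row) for row in x_rows]  # first column of ones -> intercept
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)


# Hypothetical, noise-free data generated from y = 1 + 2*x1 - x2
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 - x2 for x1, x2 in x_rows]
beta = ols(x_rows, y)  # approximately [1, 2, -1]
```

Solving the normal equations is fine for small, well-conditioned problems; statistical packages typically use a QR decomposition instead because it is numerically more stable.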

Multiple Linear Regression Analysis: Model assumptions
The OLS estimator can be considered the best linear unbiased estimator (BLUE) of $\beta$, provided some basic assumptions regarding the error term are satisfied:
• Mean of errors is zero: $E(\varepsilon_i) = 0$
• Errors have a constant variance: $Var(\varepsilon_i) = \sigma^2$
• Errors from different observations are independent of each other: $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
• Errors follow a normal distribution (needed for the t- and F-tests rather than for the BLUE property)
• Errors are uncorrelated with the explanatory variables: $Cov(x_{ij}, \varepsilon_i) = 0$

Multiple Linear Regression Analysis: Interpreting a multiple regression model
For a multiple regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, $\beta_1$ should be interpreted as the change in y when a unit change is observed in x1 and x2 is kept constant. This statement is not very clear when x1 and x2 are not independent.
• Misunderstanding: $\beta_i$ always measures the effect of $x_i$ on E(y), independent of other x variables.
• Misunderstanding: a statistically significant $\beta$ value establishes a cause-and-effect relationship between x and y.
[Figure: diagram of Y with the predictors X1 and X2]

Multiple Linear Regression Analysis: Explanatory power of β
• If the model is useful, at least one estimated $\beta$ must be ≠ 0.
• But wait: what is the chance of having one estimated $\beta$ significant if I have 2 random x?
  – For each $\beta$, prob(b ≠ 0) = 0.05, i.e. b is declared significant by chance
  – The chance that at least one happens to be b ≠ 0 is: prob(b1 ≠ 0 or b2 ≠ 0) = 1 - prob(b1 = 0 and b2 = 0) = 1 - (0.95)^2 = 0.0975
• Implication?
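
The same calculation generalizes to m tests. The sketch below assumes the m tests are independent (as in the slide's calculation) and also shows the per-test levels implied by the Bonferroni and Šidák corrections; the function names are illustrative only.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false positive in m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test level that keeps the familywise rate at or below alpha."""
    return alpha / m


def sidak_alpha(alpha, m):
    """Per-test level that gives exactly alpha for m independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


fwer_two_predictors = familywise_error_rate(0.05, 2)    # the slide's 0.0975
fwer_ten_predictors = familywise_error_rate(0.05, 10)   # grows quickly with m
```

One possible answer to "Implication?" is that individual t-tests should be backed by an overall test or by a corrected per-test level.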

Multiple Linear Regression Analysis
R² (multiple correlation squared): variation in Y accounted for by the set of predictors.
• Adjusted R²: the adjustment takes into account the size of the sample and the number of predictors, so that the value is a better estimate of the population value.
  Adjusted R² = R² - k / (n - k - 1) * (1 - R²)
  where n = number of observations and k = number of independent variables (equivalently (p - 1) / (n - p) with p = k + 1 parameters). Accordingly, for a given R², a smaller n or a larger k decreases the adjusted R², while a larger n or a smaller k increases it.
• The F-test in the ANOVA table judges whether the explanatory variables in the model adequately describe the outcome variable.
• The t-test of each partial regression coefficient: a significant t indicates that the variable in question influences the Y response while controlling for the other explanatory variables.

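Multiple Linear Regression Analysis: ANOVA
Source of variance   SS    df     MS
Regression           SSR   p-1    MSR = SSR/(p-1)
Error                SSE   n-p    MSE = SSE/(n-p)
Total                SST   n-1
In matrix form: $SSR = b^T X^T y - \frac{1}{n} y^T J y$, $SSE = y^T y - b^T X^T y$, $SST = y^T y - \frac{1}{n} y^T J y$, where J is an n × n matrix of 1s and p is the number of estimated parameters (including the intercept).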

Multiple Linear Regression Analysis
The R² statistic measures the overall contribution of the Xs. The hypothesis to test is then:
H0: $\beta_1 = \dots = \beta_k = 0$
H1: at least one parameter is nonzero
Since R² has no convenient probability distribution of its own, the F statistic is used instead.

Multiple Linear Regression Analysis: F-statistic
$F = MSR / MSE$, compared with the F distribution with p-1 and n-p degrees of freedom.
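
How R², adjusted R² and the overall F statistic follow from the ANOVA sums of squares can be sketched as below. The function assumes that k counts the predictors without the intercept, so that p = k + 1; the input numbers are hypothetical.

```python
def regression_summary(sst, sse, n, k):
    """R^2, adjusted R^2 and overall F from the total and error sums of squares.

    n: number of observations, k: number of predictors (intercept not counted).
    """
    p = k + 1                      # number of estimated parameters
    ssr = sst - sse                # regression sum of squares
    r2 = ssr / sst
    adj_r2 = r2 - k / (n - k - 1) * (1.0 - r2)
    f_stat = (ssr / (p - 1)) / (sse / (n - p))   # MSR / MSE
    return r2, adj_r2, f_stat


# Hypothetical values: SST = 100, SSE = 20, 23 observations, 2 predictors
r2, adj_r2, f_stat = regression_summary(sst=100.0, sse=20.0, n=23, k=2)
```

The F value would then be compared with the critical value of the F distribution with p-1 = 2 and n-p = 20 degrees of freedom, taken from a table or a statistics package.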

Multiple Linear Regression Analysis: How many variables should be included in the model?
Basic strategies:
• Sequential forward
• Sequential backward
• Forced entry (all variables at once)
The first two strategies determine a suitable number of explanatory variables using the semi-partial correlation as criterion and a partial F-statistic, which is calculated from the residual sums of squares of the restricted (RSS1) and unrestricted (RSS) models; k and k1 denote the number of lags of the unrestricted and restricted model, and N is the number of observations.
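
A partial F-statistic of this kind can be sketched as follows. The slide's own formula is not reproduced here; the version below is the common textbook form and assumes that both models contain an intercept, so that the unrestricted model has N - k - 1 residual degrees of freedom. All numbers are hypothetical.

```python
def partial_f(rss_restricted, rss_unrestricted, k_restricted, k_unrestricted, n_obs):
    """Partial F-statistic for comparing a restricted with an unrestricted model.

    Assumption: both models include an intercept, so the unrestricted model
    has n_obs - k_unrestricted - 1 residual degrees of freedom.
    """
    df_num = k_unrestricted - k_restricted
    df_den = n_obs - k_unrestricted - 1
    f_stat = ((rss_restricted - rss_unrestricted) / df_num) / (rss_unrestricted / df_den)
    return f_stat, df_num, df_den


# Hypothetical values: adding one variable lowers the RSS from 50 to 40
f_stat, df_num, df_den = partial_f(50.0, 40.0, k_restricted=2, k_unrestricted=3, n_obs=24)
```

In a forward selection the candidate variable would be kept if this F value exceeds the critical value for (df_num, df_den) degrees of freedom.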

Multiple Linear Regression Analysis: The semi-partial correlation
• Measures the relationship between a predictor and the outcome, controlling for the relationship between that predictor and any others already in the model.
• It measures the unique contribution of a predictor to explaining the variance of the outcome.
[Figure: diagram of the variables Y, X and Z]

Multiple Linear Regression Analysis: Testing the regression coefficients
An unbiased estimator for the error variance $\sigma^2$ is $s^2 = e^T e / (N - k - 1)$, where $e = y - X\hat{\beta}$ is the vector of residuals.
The regression coefficients are tested for significance under the null hypothesis $H_0: \beta_i = 0$ using a standard t-test: $t_i = \hat{\beta}_i / (s \sqrt{c_{ii}})$ with N - k - 1 degrees of freedom, where $c_{ii}$ denotes the ith diagonal element of the matrix $(X^T X)^{-1}$.
$s \sqrt{c_{ii}}$ is also referred to as the standard error of a regression coefficient.

Multiple Linear Regression Analysis: Which X contributes the most to the prediction of Y?
• The relative sizes of the bs cannot be interpreted, because each is relative to its variable's scale, but the βs (betas; standardized bs) can be interpreted.
• a is the intercept, i.e. the predicted Y when all predictors are zero; it is zero when Y and the predictors are standardized.

Multiple Linear Regression Analysis: Can the regression equation be generalized to other data?
This can be evaluated by:
• Randomly separating the data set into two halves: estimate the regression equation with one half, apply it to the other half and see how well it predicts
• Cross-validation
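
The split-half check can be sketched for the simplest case of one predictor. The data below are simulated (a hypothetical line y = 1 + 2x with Gaussian noise), and the root mean squared error (RMSE) is used as one possible measure of how well the equation predicts.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(intercept, slope, xs, ys):
    """Root mean squared prediction error of the fitted line on (xs, ys)."""
    return (sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


rng = random.Random(0)  # fixed seed so the split is reproducible
xs = [float(i) for i in range(40)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]  # simulated data

indices = list(range(len(xs)))
rng.shuffle(indices)
train, test = indices[:20], indices[20:]  # two random halves

intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
rmse_train = rmse(intercept, slope, [xs[i] for i in train], [ys[i] for i in train])
rmse_test = rmse(intercept, slope, [xs[i] for i in test], [ys[i] for i in test])
```

If the error on the held-out half is much larger than on the fitting half, the equation does not generalize well.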

Multiple Linear Regression Analysis: Residual analysis
[Figure]

Multiple Linear Regression Analysis: The revised Levene's test
• Divide the residuals into two (or more) groups based on the level of x.
• The variances and the means of the two groups are supposed to be equal. A standard t-test can be used to test the difference in means (in the modified Levene test, the means of the absolute deviations of the residuals from their group median). A large t indicates non-constant variance.
[Figure: residuals e plotted against x or E(y)]

Multiple Linear Regression Analysis: Detecting outliers and influential observations
• Influential points are those whose exclusion causes a major change in the fitted line.
• "Leave-one-out" cross-validation.
• If |e_i| > 4s, the observation is considered an outlier.
• A true outlier should not be removed, but should be explained.
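
The 4s rule can be sketched as below. The sketch assumes that s is the residual standard error (the square root of SSE divided by the residual degrees of freedom); the residuals are hypothetical and only serve to show the mechanics.

```python
def flag_outliers(residuals, n_params, threshold=4.0):
    """Return s and the indices of residuals with |e_i| > threshold * s.

    Assumption: s is the residual standard error sqrt(SSE / (n - n_params)).
    """
    n = len(residuals)
    s = (sum(e * e for e in residuals) / (n - n_params)) ** 0.5
    flagged = [i for i, e in enumerate(residuals) if abs(e) > threshold * s]
    return s, flagged


# Hypothetical residuals: 29 small ones and one large one at index 29
residuals = [0.5 if i % 2 == 0 else -0.5 for i in range(29)] + [6.0]
s, flagged = flag_outliers(residuals, n_params=2)
```

Because the suspicious point itself inflates s, such a rule only triggers in reasonably large samples; a flagged observation should then be examined and explained rather than silently dropped.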

Multiple Linear Regression Analysis: Generalized Least Squares
Example of a generalized least-squares model, which can be used instead of OLS regression in the case of autocorrelated error terms (e.g. in distributed lag models).

Multiple Linear Regression Analysis: SPSS example
[Figure]

Multiple Linear Regression Analysis: SPSS example, model evaluation
[Figure]

Multiple Linear Regression Analysis: SPSS example, model evaluation
Studying the residuals helps to detect whether:
• The model is nonlinear in function
• An x is missing
• One or more assumptions about $\varepsilon$ are violated
• There are outliers

ANOVA: ANalysis Of VAriance
• ANOVA (one-way)
• ANOVA (two-way)
• MANOVA

ANOVA: Comparing more than two groups
• ANOVA deals with situations with one observation per object, and three or more groups of objects
• The most important question is as usual: do the numbers in the groups come from the same population, or from different populations?

ANOVA: One-way ANOVA example
• Assume "treatment results" from 13 soil plots from three different regions:
  – Region A: 24, 26, 31, 27
  – Region B: 29, 31, 30, 36, 33
  – Region C: 29, 27, 34, 26
• H0: the treatment results are from the same population of results
• H1: they are from different populations

ANOVA: Comparing the groups
• Averages within groups:
  – Region A: 27
  – Region B: 31.8
  – Region C: 29
• Total average: 383/13 ≈ 29.46
• Variance around the mean matters for comparison.
• We must compare the variance within the groups to the variance between the group means.

ANOVA: Variance within and between groups
• Sum of squares within groups: $SSW = \sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$
• Sum of squares between groups: $SSG = \sum_{i=1}^{K} n_i (\bar{x}_i - \bar{x})^2$
• The number of observations and the sizes of the groups have to be taken into account!

ANOVA: Adjusting for group sizes
$MSW = SSW / (n - K)$ and $MSG = SSG / (K - 1)$; both are estimates of the population variance of the error under H0.
n: number of observations, K: number of groups
• If the populations are normal, with the same variance, then we can show that under the null hypothesis $MSG / MSW \sim F_{K-1,\,n-K}$
• Reject at significance level α if $MSG / MSW > F_{K-1,\,n-K,\,\alpha}$

ANOVA: Continuing the example
• SSW = 94.8 and SSG ≈ 52.43, so MSW = 94.8/10 = 9.48 and MSG ≈ 52.43/2 ≈ 26.22
• F = MSG/MSW ≈ 2.77, which is below the critical value $F_{2,10,0.05} \approx 4.10$
• H0 cannot be rejected
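
The soil-plot example can be recomputed with a short script. The data are the three regions from the example; the critical value 4.10 for F with 2 and 10 degrees of freedom at α = 5% is taken from a standard F table rather than computed, and the function name is illustrative.

```python
def one_way_anova(groups):
    """Return SSG, SSW and the F ratio MSG/MSW for a list of groups."""
    all_values = [x for group in groups for x in group]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(group) / len(group) for group in groups]
    # Within-group sum of squares: deviations from each group's own mean
    ssw = sum((x - m) ** 2 for group, m in zip(groups, group_means) for x in group)
    # Between-group sum of squares: group means around the grand mean, weighted by size
    ssg = sum(len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means))
    msg, msw = ssg / (k - 1), ssw / (n - k)
    return ssg, ssw, msg / msw


regions = [
    [24, 26, 31, 27],       # Region A
    [29, 31, 30, 36, 33],   # Region B
    [29, 27, 34, 26],       # Region C
]
ssg, ssw, f_ratio = one_way_anova(regions)

F_CRIT_2_10 = 4.10  # tabulated critical value for alpha = 5%, df = (2, 10)
reject_h0 = f_ratio > F_CRIT_2_10
```

With an F ratio of roughly 2.77 the script arrives at the same decision as the slide: H0 is not rejected.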

ANOVA table
Source of variation   Sum of squares   Deg. of freedom   Mean squares   F ratio
Between groups        SSG              K-1               MSG            MSG/MSW
Within groups         SSW              n-K               MSW
Total                 SST              n-1
NOTE: SST = SSG + SSW

ANOVA: When to use which method
In situations where we have one observation per object and want to compare two or more groups:
• Use non-parametric tests if you have enough data
  – For two groups: Mann-Whitney U-test (Wilcoxon rank sum)
  – For three or more groups: Kruskal-Wallis
• If data analysis indicates that the assumption of normally distributed independent errors is OK
  – For two groups: t-test (equal or unequal variances assumed)
  – For three or more groups: ANOVA

ANOVA: Two-way ANOVA (without interaction)
• In two-way ANOVA, data fall into categories in two different ways: each observation can be placed in a table.
• Example: both the type of fertilization and the crop type should influence soil properties.
• Sometimes we are interested in studying both categories; sometimes the second category is used only to reduce unexplained variance. It is then called a blocking variable.

ANOVA: Sums of squares for two-way ANOVA
Assume K categories, H blocks, and one observation $x_{ij}$ for each category i and each block j, so we have n = KH observations.
• Mean for category i: $\bar{x}_{i\cdot} = \frac{1}{H} \sum_{j=1}^{H} x_{ij}$
• Mean for block j: $\bar{x}_{\cdot j} = \frac{1}{K} \sum_{i=1}^{K} x_{ij}$
• Overall mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{K} \sum_{j=1}^{H} x_{ij}$

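ANOVA: Sums of squares for two-way ANOVA
• Between groups: $SSG = H \sum_{i=1}^{K} (\bar{x}_{i\cdot} - \bar{x})^2$
• Between blocks: $SSB = K \sum_{j=1}^{H} (\bar{x}_{\cdot j} - \bar{x})^2$
• Error: $SSE = \sum_{i=1}^{K} \sum_{j=1}^{H} (x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x})^2$
• Total: $SST = \sum_{i=1}^{K} \sum_{j=1}^{H} (x_{ij} - \bar{x})^2 = SSG + SSB + SSE$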

ANOVA table for two-way data
Source of variation   Sums of squares   Deg. of freedom   Mean squares               F ratio
Between groups        SSG               K-1               MSG = SSG/(K-1)            MSG/MSE
Between blocks        SSB               H-1               MSB = SSB/(H-1)            MSB/MSE
Error                 SSE               (K-1)(H-1)        MSE = SSE/((K-1)(H-1))
Total                 SST               n-1
Test for between-groups effect: compare MSG/MSE to $F_{K-1,\,(K-1)(H-1),\,\alpha}$
Test for between-blocks effect: compare MSB/MSE to $F_{H-1,\,(K-1)(H-1),\,\alpha}$
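
The entries of the two-way table can be computed with a short sketch. It assumes one observation per cell, with rows as categories and columns as blocks; the 3 × 3 table of values is hypothetical.

```python
def two_way_anova(table):
    """Two-way ANOVA without interaction; rows = categories, columns = blocks."""
    K, H = len(table), len(table[0])
    n = K * H
    grand = sum(sum(row) for row in table) / n
    row_means = [sum(row) / H for row in table]
    col_means = [sum(table[i][j] for i in range(K)) / K for j in range(H)]
    ssg = H * sum((m - grand) ** 2 for m in row_means)
    ssb = K * sum((m - grand) ** 2 for m in col_means)
    sse = sum(
        (table[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(K) for j in range(H)
    )
    sst = sum((x - grand) ** 2 for row in table for x in row)
    mse = sse / ((K - 1) * (H - 1))
    return {
        "SSG": ssg, "SSB": ssb, "SSE": sse, "SST": sst,
        "F_groups": (ssg / (K - 1)) / mse,
        "F_blocks": (ssb / (H - 1)) / mse,
    }


# Hypothetical measurements: 3 categories (rows) x 3 blocks (columns)
table = [
    [10.0, 12.0, 14.0],
    [11.0, 14.0, 15.0],
    [13.0, 15.0, 19.0],
]
result = two_way_anova(table)
```

The two F ratios are then compared with the critical values named under the table; the check SST = SSG + SSB + SSE is a useful sanity test for any implementation.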

ANOVA: Two-way ANOVA (with interaction)
• The setup above assumes that the blocking variable influences outcomes in the same way in all categories (and vice versa)
• Interaction between the blocking variable and the categories can be checked by extending the model with an interaction term

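ANOVA: Sums of squares for two-way ANOVA (with interaction)
Assume K categories, H blocks, and L observations $x_{ij1}, x_{ij2}, \dots, x_{ijL}$ for each category i and each block j, so we have n = KHL observations.
• Mean for category i: $\bar{x}_{i\cdot\cdot} = \frac{1}{HL} \sum_{j=1}^{H} \sum_{l=1}^{L} x_{ijl}$
• Mean for block j: $\bar{x}_{\cdot j\cdot} = \frac{1}{KL} \sum_{i=1}^{K} \sum_{l=1}^{L} x_{ijl}$
• Mean for cell ij: $\bar{x}_{ij\cdot} = \frac{1}{L} \sum_{l=1}^{L} x_{ijl}$
• Overall mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{K} \sum_{j=1}^{H} \sum_{l=1}^{L} x_{ijl}$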

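ANOVA: Sums of squares for two-way ANOVA (with interaction)
• Between groups: $SSG = HL \sum_{i=1}^{K} (\bar{x}_{i\cdot\cdot} - \bar{x})^2$
• Between blocks: $SSB = KL \sum_{j=1}^{H} (\bar{x}_{\cdot j\cdot} - \bar{x})^2$
• Interaction: $SSI = L \sum_{i=1}^{K} \sum_{j=1}^{H} (\bar{x}_{ij\cdot} - \bar{x}_{i\cdot\cdot} - \bar{x}_{\cdot j\cdot} + \bar{x})^2$
• Error: $SSE = \sum_{i=1}^{K} \sum_{j=1}^{H} \sum_{l=1}^{L} (x_{ijl} - \bar{x}_{ij\cdot})^2$
• Total: $SST = \sum_{i=1}^{K} \sum_{j=1}^{H} \sum_{l=1}^{L} (x_{ijl} - \bar{x})^2 = SSG + SSB + SSI + SSE$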

ANOVA table for two-way data (with interaction)
Source of variation   Sums of squares   Deg. of freedom   Mean squares               F ratio
Between groups        SSG               K-1               MSG = SSG/(K-1)            MSG/MSE
Between blocks        SSB               H-1               MSB = SSB/(H-1)            MSB/MSE
Interaction           SSI               (K-1)(H-1)        MSI = SSI/((K-1)(H-1))     MSI/MSE
Error                 SSE               KH(L-1)           MSE = SSE/(KH(L-1))
Total                 SST               n-1
Test for interaction: compare MSI/MSE with $F_{(K-1)(H-1),\,KH(L-1),\,\alpha}$
Test for block effect: compare MSB/MSE with $F_{H-1,\,KH(L-1),\,\alpha}$
Test for group effect: compare MSG/MSE with $F_{K-1,\,KH(L-1),\,\alpha}$

ANOVA: Notes on ANOVA
• All analysis of variance (ANOVA) methods are based on the assumptions of normally distributed and independent errors
• The same problems can be described using the regression framework. We get exactly the same tests and results!
• There are many extensions beyond those mentioned

MANOVA: Uses multiple DVs
                    ANOVA                 MANOVA
Predictors (IVs)    Multiple, discrete    Multiple, discrete
Criterion (DV(s))   Single, continuous    Multiple, continuous
• Various measures of soil properties: Corg, Cmic, N, pH, …
• Various outcome measures following different types of categories: fertilization, point in time, crop type, …

MANOVA
• Multiple DVs could be analysed using multiple ANOVAs, but:
  – The familywise (FW) error rate increases with each ANOVA
  – Scores on the DVs are likely correlated: they are non-independent and taken from the same subjects
  – It is hard to interpret the results if multiple ANOVAs are significant
• MANOVA solves this by conducting only one overall test
  – Creates a 'composite' DV
  – Tests for significance of the composite DV

MANOVA
• The composite DV is a linear combination of the DVs
  – i.e., a discriminant function, or root
  – The weights maximally separate the groups on the composite DV
C = W1 Y1 + W2 Y2 + W3 Y3 + … + Wn Yn
where C is a subject's score on the composite DV, Yi are the scores on each of the DVs, and Wi are the weights, one for each DV.
A composite DV is required for each main effect and interaction.

MANOVA
• Considering the DVs together can enhance power:
  a. Frequency distributions show considerable overlap between groups on the individual DVs
  b. The ellipses, which reflect the DVs in combination, show less overlap
  c. Small differences on each DV combine to make a larger multivariate difference

MANOVA
• In ANOVA, the sums of squared deviations are partitioned: $SS_T = SS_A + SS_B + SS_{A \times B} + SS_{S/AB}$
• In MANOVA, the sums-of-squares and cross-products (SSCP) matrices are partitioned: $S_T = S_D + S_{Tr} + S_{D \times Tr} + S_{S(DTr)}$
• The SSCP matrices (S) are analogous to the SS
  – An SSCP matrix is a squared deviation that also reflects correlations among the DVs

MANOVA: Scores and means in MANOVA are vectors
• Y: scores for each subject
• T and D: row and column marginals
• GM: the grand mean
• DTr: the average scores of subjects within cells

MANOVA
• The deviation score for the first subject is a vector with the elements 25.89 and 20.78.
• The squared deviation is obtained by multiplying by the transpose:
  – SS are on the diagonal: (25.89)^2 ≈ 670.3 and (20.78)^2 ≈ 431.8
  – Cross-products are on the off-diagonals: (25.89)(20.78) ≈ 538.0

MANOVA
• The squaring of a matrix is carried out by multiplying it by its transpose
• The transpose is obtained by flipping the matrix about its diagonal: $(A')_{ij} = A_{ji}$
• To multiply, the ijth element of the resulting matrix is obtained as the sum of products of the ith row of A and the jth column of A'
• For a column vector a, the transpose a' is a row vector, and the product a a' is a square matrix with the elements $a_i a_j$
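
Squaring a deviation vector by multiplying it with its transpose can be sketched in plain Python. The two deviation scores 25.89 and 20.78 are the ones quoted for the first subject; the helper name `outer` is illustrative.

```python
def outer(a, b):
    """Product of the column vector a with the row vector b' (outer product)."""
    return [[x * y for y in b] for x in a]


deviation = [25.89, 20.78]           # deviation scores of the first subject
sscp = outer(deviation, deviation)   # 2 x 2 matrix of squares and cross-products

sum_sq_first = sscp[0][0]    # about 670.3
sum_sq_second = sscp[1][1]   # about 431.8
cross_product = sscp[0][1]   # about 538.0, identical to sscp[1][0]
```

Summing such matrices over all subjects gives the SSCP matrix, in the same way that summing squared deviations gives a sum of squares.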

MANOVA
• Main effects in ANOVA vs. MANOVA
• The interaction
• The error term
[Equations: sums of squares (ANOVA) and SSCP matrices (MANOVA) for each of these terms]

MANOVA
• In ANOVA, variance estimates (MS) are obtained from the SS for significance testing using the F-statistic
• In MANOVA, variance estimates (determinants) are obtained from the SSCP matrices for significance testing, e.g. using Wilks' Lambda (Λ)
ANOVA   MANOVA
SS    ~ SSCP
MS    ~ |SSCP|
F     ~ Λ
Note that F and Λ are inverse to one another: a strong effect gives a large F but a small Λ.

MANOVA
• The determinant of a 2 x 2 matrix is given by: $\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$
• The determinants required to test the interaction are $|S_{S(DTr)}|$ and $|S_{D \times Tr} + S_{S(DTr)}|$
• Wilks' Lambda for the interaction is obtained by: $\Lambda = |S_{S(DTr)}| / |S_{D \times Tr} + S_{S(DTr)}|$

MANOVA
• If the effect is small, then Λ approaches 1.0
  – Here $S_{D \times Tr}$ was small, and Λ was 0.91
• Eta squared for MANOVA is: $\eta^2 = 1 - \Lambda_{effect} = 1 - 0.91 = 0.09$
• The interaction accounts for only 9% of the variance in the group means on the composite DV
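
Wilks' Lambda and the corresponding eta squared can be sketched for two DVs, where all SSCP matrices are 2 x 2. The matrices below are hypothetical and are not the ones behind the value 0.91; the sketch uses the usual definition Λ = |S_error| / |S_effect + S_error|.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def add2(a, b):
    """Element-wise sum of two 2 x 2 matrices."""
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def wilks_lambda(s_effect, s_error):
    """Wilks' Lambda = |S_error| / |S_effect + S_error|."""
    return det2(s_error) / det2(add2(s_effect, s_error))


# Hypothetical SSCP matrices for an effect and for the error term
s_error = [[500.0, 100.0], [100.0, 400.0]]
s_effect = [[20.0, 10.0], [10.0, 15.0]]

lam = wilks_lambda(s_effect, s_error)
eta_squared = 1.0 - lam  # share of variance attributed to the effect
```

A small effect matrix leaves the denominator close to the numerator, so Λ stays near 1 and eta squared near 0.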

MANOVA: SPSS example
[Figure]

Discriminant Analysis
• Discriminant analysis is used to predict group memberships from a set of continuous predictors
• Analogy to MANOVA: in MANOVA, linearly combined DVs are created to answer the question of whether groups can be separated. The same "DVs" can be used to predict group membership.

Discriminant Analysis: What is the goal of discriminant analysis?
– Perform dimensionality reduction "while preserving as much of the class discriminatory information as possible".
– Seek directions along which the classes are best separated.
– Take into consideration the scatter within classes as well as the scatter between classes.

Discriminant Analysis
MANOVA and discriminant analysis (DA) are mathematically identical but differ in emphasis:
– DA is usually concerned with the grouping of objects (classification) and with testing how well objects were classified (one grouping variable, one or more predictor variables)
– Discriminant functions correspond to the canonical variates of a canonical correlation analysis between the groups on one side and the predictors on the other side.
– MANOVA is applied to test whether groups differ significantly from each other (one or more grouping variables, one or more predictor variables)

Discriminant Analysis: Assumptions
– Sample size: a small number of samples might lead to overfitting. If there are more DVs than objects in any cell, the cell's covariance matrix becomes singular and cannot be inverted. If there are only a few more cases than DVs, equality of covariance matrices is likely to be rejected. With a small objects/DV ratio, power is likely to be very small.
– Multivariate normality: the means of the various DVs in each cell and all linear combinations of them are normally distributed.
– Absence of outliers: significance assessment is very sensitive to outlying cases.
– Homogeneity of covariance matrices: DA is relatively robust to violations of this assumption if inference is the focus of the analysis, but not in classification.

Discriminant Analysis: Assumptions
– For classification purposes, DA is highly influenced by violations of the homogeneity-of-covariance assumption, since subjects will tend to be classified into the groups with the largest variance.
– Homogeneity of class variances can be assessed by plotting the discriminant function scores for the first discriminant functions pairwise.
– LDA assumes linear relationships between all predictors within each group. Violations tend to reduce power rather than increase alpha.
– Absence of multicollinearity/singularity in each cell of the design: avoid redundant predictors.

Discriminant Analysis: Interpreting a two-group discriminant function
In the two-group case, discriminant function analysis is analogous to multiple regression; the two-group discriminant analysis is also called Fisher linear discriminant analysis. In general, in the two-group case we fit a linear equation of the type:
c = a + d1*x1 + d2*x2 + ... + dm*xm
where a is a constant, d1 through dm are regression coefficients and c is the predicted class. The interpretation of the results of a two-group problem is straightforward and closely follows the logic of multiple regression: the variables with the largest (standardized) regression coefficients are the ones that contribute most to the prediction of group membership.

Discriminant Analysis: Discriminant functions for multiple groups
When there are more than two groups, we can estimate more than one discriminant function. For instance, when there are three groups, there exists one function for discriminating between group 1 and groups 2 and 3 combined, and another function for discriminating between group 2 and group 3.
Canonical analysis: in a multiple-group discriminant analysis, the first function is defined such that it provides the most overall discrimination between groups, the second provides the second most, and so on. All functions are independent or orthogonal. Computationally, a canonical correlation analysis is performed that determines the successive functions and canonical roots. The number of functions that can be calculated is min(number of groups - 1, number of variables).

Discriminant Analysis: Eigenvalues
Eigenvalues can be interpreted as the proportion of variance accounted for by the correlation between the respective canonical variates. Successive eigenvalues will be of smaller and smaller size. First, the weights that maximize the correlation of the two sum scores are computed. After this first root has been extracted, the weights that produce the second largest correlation between sum scores are found, subject to the constraint that the next set of sum scores does not correlate with the previous one, and so on.
Canonical correlations: if the square root of the eigenvalues is taken, the resulting numbers can be interpreted as correlation coefficients. Because the correlations pertain to the canonical variates, they are called canonical correlations.

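Discriminant Analysis
Suppose there are C classes.
• Let $\mu_i$ be the mean vector of class i, i = 1, 2, …, C
• Let $M_i$ be the number of samples in class i, and let $M = \sum_{i=1}^{C} M_i$ be the total number of samples
• Within-class scatter matrix: $S_w = \sum_{i=1}^{C} \sum_{x \in \text{class } i} (x - \mu_i)(x - \mu_i)^T$
• Between-class scatter matrix: $S_b = \sum_{i=1}^{C} M_i (\mu_i - \mu)(\mu_i - \mu)^T$
where $\mu = \frac{1}{M} \sum_{x} x$ is the mean of the entire data set.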

Discriminant Analysis: Methodology
– LDA computes a transformation $y = U^T x$ (U: projection matrix) that maximizes the between-class scatter while minimizing the within-class scatter: $\max_U |\tilde{S}_b| / |\tilde{S}_w| = \max_U |U^T S_b U| / |U^T S_w U|$
– The determinants are products of eigenvalues.
– $\tilde{S}_b$, $\tilde{S}_w$: scatter matrices of the projected data y

Discriminant Analysis: Linear transformation implied by LDA
– The LDA solution is given by the eigenvectors of the generalized eigenvalue problem: $S_b u_k = \lambda_k S_w u_k$
– The linear transformation is given by a matrix U whose columns are the eigenvectors of the above problem.
– Important: since $S_b$ has at most rank C-1, the maximum number of eigenvectors with non-zero eigenvalues is C-1 (i.e., the maximum dimensionality of the subspace is C-1)

Discriminant Analysis: Does $S_w^{-1}$ always exist?
– If $S_w$ is non-singular, we can obtain a conventional eigenvalue problem by writing: $S_w^{-1} S_b u_k = \lambda_k u_k$
– In practice, $S_w$ is often singular when more variables than cases are involved in the analysis (M << N, with M cases and N variables)
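
For two classes the eigenvalue problem has a single solution, which is proportional to $S_w^{-1}(\mu_1 - \mu_2)$. The sketch below computes this Fisher direction for two-dimensional data using only the standard library; the function names and the eight data points are hypothetical, and a singular $S_w$ is reported as an error instead of being regularized.

```python
def mean_vec(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, mu):
    """2 x 2 scatter matrix: sum of (x - mu)(x - mu)' over the points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class1, class2):
    """Two-class LDA direction w = Sw^-1 (mu1 - mu2) for 2-D data."""
    mu1, mu2 = mean_vec(class1), mean_vec(class2)
    s1, s2 = scatter(class1, mu1), scatter(class2, mu2)
    sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


# Hypothetical 2-D samples for two classes
class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (4.0, 5.0)]
class2 = [(4.0, 1.0), (5.0, 2.0), (6.0, 2.0), (7.0, 4.0)]

w = fisher_direction(class1, class2)
proj1 = [w[0] * x + w[1] * y for x, y in class1]  # discriminant scores, class 1
proj2 = [w[0] * x + w[1] * y for x, y in class2]  # discriminant scores, class 2
```

In this example the two classes overlap on each single coordinate but are fully separated on the projected scores, which illustrates why the within-class scatter is taken into account.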
