Association, log-linear analysis and canonical correlation analysis
Chapter 9
Statistics for Marketing & Consumer Research
Copyright © 2008 - Mario Mazzocchi

Association between qualitative variables
• Association is a generic term referring to the relationship between two variables
• Correlation measures refer strictly to quantitative variables
• Thus, association generally refers to qualitative variables
• Two qualitative variables are said to be associated when changes in one variable are accompanied by changes in the other variable (i.e. they are not independent). For example, education is generally associated with job position.
• Association measures for categorical variables are based on tables of frequencies, also termed contingency tables

Contingency tables
• Contingency tables (two-way frequency tables) show the joint frequencies of two categorical variables
• The marginal totals, that is, the row and column totals of the contingency table, represent the univariate frequency distribution of each of the two variables
• If the variables are independent, one would expect the distribution of frequencies across the internal cells of the contingency table to depend only on the marginal totals and the sample size
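As an illustration of how a contingency table and its marginal totals are built from raw observations, here is a minimal Python sketch. The function name `contingency_table` and the survey answers are hypothetical and are not taken from the Trust data-set.

```python
from collections import Counter

def contingency_table(pairs):
    """Cross-tabulate (row_category, column_category) pairs into joint counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    row_totals = [sum(row) for row in table]          # marginal totals by row
    col_totals = [sum(col) for col in zip(*table)]    # marginal totals by column
    return rows, cols, table, row_totals, col_totals

# Hypothetical answers: (education, job position)
answers = [("degree", "manager"), ("degree", "manager"), ("degree", "clerk"),
           ("school", "clerk"), ("school", "clerk"), ("school", "manager")]
rows, cols, table, row_totals, col_totals = contingency_table(answers)
```

The internal cells hold the joint frequencies, while `row_totals` and `col_totals` are the univariate distributions of the two variables.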

Contingency table (frequencies)

Independent variables
• In probability terms, two events are regarded as independent when their joint probability is the product of the probabilities of the two individual events:
  Prob(X=a, Y=b) = Prob(X=a) Prob(Y=b)
• Similarly, two categorical variables are independent when the joint probability of any two categorical outcomes is equal to the product of the probabilities of the individual outcomes for each variable

Expected frequencies under independence
• Thus, the frequencies within the contingency table should not be too different from these expected values:
  $n_{ij}^* = \frac{n_{i0}\, n_{0j}}{n_{00}}$, or in relative terms $f_{ij}^* = f_{i0}\, f_{0j}$
  where
• $n_{ij}$ and $f_{ij}$ are the absolute and relative frequencies, respectively
• $n_{i0}$ and $n_{0j}$ (or $f_{i0}$ and $f_{0j}$) are the marginal totals for row i and column j, respectively
• $n_{00}$ is the sample size (hence the total relative frequency $f_{00}$ equals one)
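The expected frequencies under independence can be sketched in a few lines of Python: each expected count is the product of the row and column totals divided by the sample size. The function name and the counts below are hypothetical.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / sample size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

observed = [[30, 10], [20, 40]]  # hypothetical counts
expected = expected_frequencies(observed)
```

With these counts the row totals are 40 and 60 and the column totals are 50 and 50, so the expected table is [[20, 20], [30, 30]].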

Independence and association – testing
• The further the empirical frequencies are from the frequencies expected under independence, the more the two categorical variables are associated
• Thus, a synthetic measure of association is given by
  $\chi^2 = \sum_i \sum_j \frac{(n_{ij} - n_{ij}^*)^2}{n_{ij}^*}$
  where $n_{ij}^* = n_{i0}\, n_{0j} / n_{00}$ is the frequency expected under independence

The Chi-square statistic
• The more distant the actual joint frequencies are from the expected ones, the larger the Chi-square statistic
• Under the independence assumption, the Chi-square statistic has a known probability distribution (with (r-1)(c-1) degrees of freedom for a table with r rows and c columns), so that its empirical values can be associated with a probability value to test independence
• The observed frequencies may differ from the expected values $f_{ij}^*$ because of random error, so the discrepancy can be tested using a statistical tool, the Chi-square distribution
• As usual, the basic principle is to measure the probability that the discrepancy between the expected and observed values is due to randomness only
• If this probability value (from the theoretical Chi-square distribution) is very low (below the significance threshold), one rejects the null hypothesis of independence between the two variables and proceeds assuming some association
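The test described above can be sketched in plain Python. This is a minimal illustration, not the SPSS routine: the function names and counts are hypothetical, and the p-value uses the standard closed-form recurrence for the upper tail of the Chi-square distribution with integer degrees of freedom.

```python
import math

def chi_square_statistic(table):
    """Pearson Chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi_square_p_value(stat, df):
    """Upper-tail probability of the Chi-square distribution (df >= 1)."""
    if stat <= 0:
        return 1.0
    half = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)                # exact value for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))     # exact value for df = 1
    while a < df / 2.0:                            # step up two df at a time
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

observed = [[30, 10], [20, 40]]  # hypothetical counts
stat, df = chi_square_statistic(observed)
p_value = chi_square_p_value(stat, df)
reject_independence = p_value < 0.05
```

For this hypothetical table the statistic is 50/3 with one degree of freedom, and the p-value falls well below 0.05, so independence would be rejected.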

Other association measures
• Contingency coefficient: ranges from zero (independence) to values close to one for strong association, but its value depends on the shape (number of rows and columns) of the contingency table
• Cramér's V: bounded between zero and one, it does not suffer from the above shortcoming (but strong associations may translate into relatively low values, below 0.5)
• Goodman and Kruskal's Lambda: for strictly nominal variables, it compares predictions obtained for one of the variables using two different methods, one which only considers the marginal frequency distribution of that variable, the other which picks the most likely values after considering the distribution of the other variable
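The first two measures can be computed directly from the Chi-square statistic. The sketch below assumes the usual textbook definitions, C = sqrt(chi2 / (chi2 + N)) and V = sqrt(chi2 / (N * min(r-1, c-1))); the function names and counts are hypothetical.

```python
import math

def pearson_chi_square(table):
    """Pearson Chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum((cell - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(table) for j, cell in enumerate(row))

def contingency_coefficient(table):
    chi2 = pearson_chi_square(table)
    n = sum(sum(row) for row in table)
    return math.sqrt(chi2 / (chi2 + n))

def cramers_v(table):
    chi2 = pearson_chi_square(table)
    n = sum(sum(row) for row in table)
    smaller_dim = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * smaller_dim))

observed = [[30, 10], [20, 40]]   # hypothetical counts
perfect = [[50, 0], [0, 50]]      # hypothetical perfectly associated table
c_obs, v_obs = contingency_coefficient(observed), cramers_v(observed)
c_perfect, v_perfect = contingency_coefficient(perfect), cramers_v(perfect)
```

In the perfectly associated two by two table Cramér's V reaches one while the contingency coefficient stays near 0.71, which illustrates the dependence on table shape mentioned above.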

Other association measures
• Uncertainty coefficient: like Goodman and Kruskal's Lambda, but considers the reduction in the prediction error rather than the rate of correct predictions
• Ordinal variables:
  • Gamma statistic (between minus one and one, zero indicates independence)
  • Somers' d statistic, an adjustment of the Gamma statistic to account for the direction of the relationship
  • Kendall's Tau b and Tau c statistics, for square and rectangular tables, respectively
• These statistics check all pairs of values assumed by the two variables to see if (a) a category increase in one variable leads to a category increase in the second one (positive association); or (b) the opposite happens (negative association); or (c) the ordering of one variable is independent of the ordering of the other
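The pair-checking logic behind the ordinal measures can be sketched for the Gamma statistic, assuming the standard definition Gamma = (C - D) / (C + D), where C and D are the numbers of concordant and discordant pairs. The function name and the table are hypothetical; rows and columns are assumed to be listed in increasing category order.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D) for a table with ordered rows and columns."""
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):        # cells in a higher row category
                for l in range(n_cols):
                    if l > j:                     # both variables increase
                        concordant += table[i][j] * table[k][l]
                    elif l < j:                   # one increases, the other decreases
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)

ordinal_table = [[20, 10, 5],
                 [10, 20, 10],
                 [5, 10, 20]]   # hypothetical counts
gamma = goodman_kruskal_gamma(ordinal_table)
```

Pairs tied on either variable are ignored by Gamma; Somers' d and Kendall's Tau differ in how they bring those ties back into the denominator.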

Directional vs. symmetric measures
• Directional measures (e.g. Somers' d) assume that the change in one variable depends on the change in the other variable (there is a direction)
• Symmetric measures assume no direction
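To see what "direction" means in practice, the sketch below uses Goodman and Kruskal's Lambda (rather than Somers' d) because it is short to compute: Lambda is the proportional reduction in prediction errors for one variable once the other is known, so swapping the roles of the two variables generally changes its value. The function name and counts are hypothetical.

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the column variable from the row variable."""
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(col_totals)
    errors_without = n - max(col_totals)              # always guess the modal column
    errors_with = n - sum(max(row) for row in table)  # guess the modal column per row
    return (errors_without - errors_with) / errors_without

def transpose(table):
    return [list(col) for col in zip(*table)]

observed = [[30, 10], [20, 40]]  # hypothetical counts
lambda_columns_given_rows = goodman_kruskal_lambda(observed)
lambda_rows_given_columns = goodman_kruskal_lambda(transpose(observed))
```

Here the two directions give 0.4 and 0.25, so the choice of dependent variable matters; a symmetric measure would return a single value for the table.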

Association measures in SPSS
• Click here to see the list of available statistics

Chi-square test in SPSS
• Output: contingency table and Chi-square test
• As the p-value is above 0.05, the hypothesis of independence cannot be rejected at the 5% significance level (95% confidence level)

Other association measures (symmetric)

Directional association measures

More than two variables: three-way contingency tables

Log-linear analysis
• The objective of log-linear analysis is to explore the association between more than two categorical variables, check whether associations are significant and explore how the variables are associated
• Log-linear analysis can be applied by considering a general log-linear model

Log-linear analysis: saturated model
• Consider the three variables of the three-way contingency table: Trust (in government), Gender and Country
• The frequency of each cell in the table can be rewritten as:
  $n_{ijk} = \theta\, u_G\, u_T\, u_C\, u_{GT}\, u_{GC}\, u_{CT}\, u_{GCT}$
• where:
  • $n_{ijk}$ is the frequency for trust level i, gender j and country k
  • $u_G$ is the main effect of Gender (and similarly $u_T$ and $u_C$ for Trust and Country)
  • $u_{GT}$ is the interaction effect of Gender and Trust (and similarly for $u_{GC}$ and $u_{CT}$)
  • $u_{GCT}$ is the interaction effect of Gender, Country and Trust
  • $\theta$ is a scale parameter which depends on the total number of observations
• The frequency of each cell is fully explained when all of the main and interaction effects are considered (the model is saturated)

Interpretation of the model
• The u terms represent the main and interaction effects and can be interpreted as the expected relative frequencies
• For example, in a two by two contingency table with no interaction, one would have $n_{ij} = N f_{i0} f_{0j}$
• Instead, if there is dependence (relevant interaction), the frequencies of a two by two contingency table are exactly explained (this is in fact a saturated model) by
  $n_{ij} = N f_{i0} f_{0j} \left( \frac{f_{ij}}{f_{i0} f_{0j}} \right)$
  where the term between brackets reflects the frequency explained by the interaction term and is one under independence
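The two by two case can be checked numerically. The sketch below computes, for each cell, the ratio of the observed relative frequency to the product of the marginal relative frequencies, which plays the role of the interaction term; the function name and counts are hypothetical.

```python
def interaction_factors(table):
    """Ratio f_ij / (f_i0 * f_0j) for every cell; all ones under independence."""
    n = sum(sum(row) for row in table)
    row_share = [sum(row) / n for row in table]
    col_share = [sum(col) / n for col in zip(*table)]
    return [[(cell / n) / (row_share[i] * col_share[j])
             for j, cell in enumerate(row)]
            for i, row in enumerate(table)]

observed = [[30, 10], [20, 40]]      # hypothetical counts with association
independent = [[20, 20], [30, 30]]   # hypothetical counts without association
factors = interaction_factors(observed)
flat_factors = interaction_factors(independent)
```

Multiplying N, the two marginal shares and the factor for a cell reproduces the observed count exactly, which is what makes the model saturated.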

The log-linear model
• By taking logarithms one moves to a linear rather than multiplicative form
• The saturated model is not very useful, as it fits the data perfectly and does not tell much about the relevance of each of the effects
• Thus, log-linear analysis checks whether simplified log-linear models are as good as the saturated model in predicting the table frequencies
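As a sketch of the first point, assume the saturated model for the Trust, Gender and Country table is written as a scale parameter $\theta$ multiplied by the main effects and interaction effects (the u terms). This layout is an assumption, since the original equation is not reproduced in the text. Taking logarithms then turns the product into a sum:

```latex
\log n_{ijk} = \log\theta + \log u_G + \log u_T + \log u_C
             + \log u_{GT} + \log u_{GC} + \log u_{CT} + \log u_{GCT}
```

Each term on the right-hand side can then be treated like a coefficient in a linear model, which is what makes it possible to drop terms and compare the simplified models with the saturated one.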

What log-linear analysis does
1. Computes the main and interaction effects for the saturated model
2. Simplifies the saturated model by deleting (according to a given rule) some of the main and interaction effects and obtains estimates for all of the main and interaction effects left in the simplified model
3. Compares the simplified model with the benchmark model
  • If the simplified model performs well, it goes back to step 2 and attempts further simplification
  • Otherwise it stops

Simplified models
• For example, suppose that the three-way interaction term is omitted; the log-linear model becomes:
  $\log n_{ijk} = \log\theta + \log u_G + \log u_T + \log u_C + \log u_{GT} + \log u_{GC} + \log u_{CT} + \varepsilon_{ijk}$
  where $\theta$ is the scale parameter
• Now there is an error term ($\varepsilon_{ijk}$)
• The effects cannot be computed exactly, but can be estimated through a regression-like model, where
  • the dependent variable is the (logarithm of the) cell frequency
  • the explanatory variables are a set of dummy variables with value one when a main effect or interaction is relevant to that cell of the contingency table and zero otherwise
  • the estimated coefficients are the (logarithms of the) corresponding main or interaction effects
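One way to see what fitting the model without the three-way term involves is iterative proportional fitting, sketched below. This is a different technique from the regression-like estimation described above: it rescales a table until its three two-way margins match the observed ones, and the result is then compared with the data through the likelihood-ratio statistic G2. The function names and counts are hypothetical, and all two-way margins are assumed to be positive.

```python
import math

def fit_without_three_way_term(n, max_iter=500, tol=1e-10):
    """Fitted counts for the model with all two-way interactions, no three-way term."""
    I, J, K = len(n), len(n[0]), len(n[0][0])
    m = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    for _ in range(max_iter):
        largest_change = 0.0
        for i in range(I):                      # match the (i, j) margins
            for j in range(J):
                ratio = sum(n[i][j]) / sum(m[i][j])
                largest_change = max(largest_change, abs(ratio - 1.0))
                m[i][j] = [value * ratio for value in m[i][j]]
        for i in range(I):                      # match the (i, k) margins
            for k in range(K):
                ratio = (sum(n[i][j][k] for j in range(J))
                         / sum(m[i][j][k] for j in range(J)))
                largest_change = max(largest_change, abs(ratio - 1.0))
                for j in range(J):
                    m[i][j][k] *= ratio
        for j in range(J):                      # match the (j, k) margins
            for k in range(K):
                ratio = (sum(n[i][j][k] for i in range(I))
                         / sum(m[i][j][k] for i in range(I)))
                largest_change = max(largest_change, abs(ratio - 1.0))
                for i in range(I):
                    m[i][j][k] *= ratio
        if largest_change < tol:                # margins already match: stop
            break
    return m

def likelihood_ratio(n, m):
    """G2 = 2 * sum(n * log(n / m)); cells with zero observed count are skipped."""
    return 2.0 * sum(obs * math.log(obs / fit)
                     for n_i, m_i in zip(n, m)
                     for n_ij, m_ij in zip(n_i, m_i)
                     for obs, fit in zip(n_ij, m_ij) if obs > 0)

# Hypothetical counts indexed as observed[trust][gender][country]
observed = [[[30, 20], [25, 25]],
            [[10, 40], [15, 35]]]
fitted = fit_without_three_way_term(observed)
g2 = likelihood_ratio(observed, fitted)
```

A small G2 relative to its degrees of freedom (one for a two by two by two table) would suggest that the simplified model predicts the cell frequencies about as well as the saturated model.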

Hierarchical log-linear analysis
Proceeds hierarchically backward:
• Delete the highest-order interaction first ($u_{GCT}$)
• Delete lower-order interactions ($u_{GC}$, $u_{CT}$, $u_{GT}$), one by one, two together, three altogether
• Delete main effects ($u_G$, $u_T$, $u_C$), one by one, two together, three altogether

Hierarchical LLA in SPSS
• Select the categorical variables and define their range of values
• Select backward elimination for hierarchical LLA

Options
• Provides the (exact) estimates for the saturated model
• Provides the association table (useful for deleting terms)

Output
• K-way effects: K=1 main effects; K=2 2-way interactions; K=3 3-way interaction
• One table tests the effect of deleting that k-way order effect only
• The other table tests the effect of deleting that k-way order effect and all effects of a higher order
• The 3-way interaction can be omitted, but other effects seem to be relevant

Deletion of terms
• Now it is possible to look within a given k-way class (partial association table)
• Deletion of these terms does not make the prediction of the contingency table cells worse compared to the model with the term

Specification search
• This term can be eliminated
• No more terms can be eliminated

Further steps
• The hierarchical procedure stops when it cannot eliminate all effects for a given order
• However, the partial association table showed that the main effect for country might be non-relevant
• It may be desirable to test another model where that main effect is eliminated

Further steps
• Select the variables here
• Click here to define the model

The model
• Specify the model by inserting the 2-way interaction terms retained from the hierarchical analysis and deleting the main effect for q64

Output
• The model is still acceptable after deleting the country main effect

And the winner is...
• This model explains the contingency table cells almost as well as the saturated (exact) model
• Thus, (a) the interaction among country, trust level and gender; and (b) the interaction between trust level and gender are not relevant

Parameter estimates
• These can be regarded as effect sizes (how relevant are these terms? comparisons are allowed!): check the Z statistic
• Check the SPSS output (click on OPTIONS to ask for estimates)
• Odds ratios (the ratios between the Z estimates for different cells) indicate the ratio of the probabilities of ending up in a cell compared to the one chosen as a benchmark

Odds-ratio example
• Compare UK females (Z=3.97) with German females (Z=1.02)
• The ratio is about four (3.97/1.02 ≈ 3.9), which means that:
  • the interaction between being female and from the UK is about four times more important than the interaction between being female and from Germany in explaining departure from a flat distribution
  • the effect is positive (it increases frequencies)
• This would suggest that in contingency tables it is more likely to find UK females than German females after accounting for all other effects

Canonical Correlation Analysis (CCA) (1)
• This technique allows one to explore the relationship between a set of dependent variables and a set of explanatory variables
• Multiple regression analysis can be seen as a special case of canonical correlation analysis where there is a single dependent variable
• CCA is applicable to both metric and non-metric variables

CCA (2)
• Link with correlation analysis: canonical correlation is the method which maximizes the correlation between two sets of variables rather than individual variables
• Example: relation between attitudes towards chicken and general food lifestyles in the Trust data-set
  • Attitudes towards chicken are measured through a set of variables which include taste, perceived safety, value for money, etc. (items in q12)
  • Lifestyle measurement is based on agreement with statements like “I purchase the best quality food I can afford” or “I am afraid of things that I have never eaten before” (items in q25)

Canonical variates and canonical correlation
• CCA relates two sets of variables
• The technique therefore needs to combine the variables within each set to obtain two composite measures which can be correlated
• In standard canonical correlation analysis this synthesis consists of a linear combination of the original variables for each set, leading to the estimation of canonical variates or linear composites
• The bivariate correlation between the two canonical variates is the canonical correlation

Canonical correlation equations
Given
1) m dependent variables, $y_1, y_2, \dots, y_m$
2) k independent variables, $x_1, x_2, \dots, x_k$
the objective is to estimate several (say c) canonical variates as follows:
  $YS_1 = a_{11} y_1 + a_{12} y_2 + \dots + a_{1m} y_m$ and $XS_1 = b_{11} x_1 + b_{12} x_2 + \dots + b_{1k} x_k$
  ...
  $YS_c = a_{c1} y_1 + a_{c2} y_2 + \dots + a_{cm} y_m$ and $XS_c = b_{c1} x_1 + b_{c2} x_2 + \dots + b_{ck} x_k$
such that the (canonical) correlation between the canonical variates $YS_1$ and $XS_1$ is the highest, followed by the correlation between $YS_2$ and $XS_2$, and so on.
Furthermore, the extracted canonical variates are not correlated with each other, so that $CORR(YS_i, YS_j) = 0$ and $CORR(XS_i, XS_j) = 0$ for any i≠j, which also implies $CORR(YS_i, XS_j) = 0$.

Canonical functions
• The bivariate linear relationship between variates, YSi=f(XSi), is the i-th canonical function
• The maximum number of canonical functions c (pairs of canonical variates) is equal to m or k, whichever is smaller
• CCA estimates the canonical coefficients a and b in such a way that they maximize the canonical correlation between the two canonical variates
• The coefficients are usually normalized so that each canonical variate has a variance of one
• The method can be generalized to deal with partial canonical correlation (controlling for other sets of variables) and non-linear canonical correlation (where the canonical variates show a non-linear relationship)
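For intuition, the canonical correlations can be computed by hand in the smallest non-trivial case, two variables in each set. The sketch below relies on the standard result that the squared canonical correlations are the eigenvalues of Syy^-1 Syx Sxx^-1 Sxy, where the S matrices are covariance matrices within and between the sets. It is restricted to two by two matrices so that no linear algebra library is needed; the function names and data are hypothetical, and this is not the SPSS macro.

```python
import math

def covariance(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)

def cov_matrix(first, second):
    """2x2 matrix of covariances between two pairs of variables."""
    return [[covariance(u, v) for v in second] for u in first]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def multiply(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def canonical_correlations(y_set, x_set):
    """Canonical correlations (largest first) for two sets of two variables."""
    s_yy, s_xx = cov_matrix(y_set, y_set), cov_matrix(x_set, x_set)
    s_yx, s_xy = cov_matrix(y_set, x_set), cov_matrix(x_set, y_set)
    m = multiply(multiply(inverse_2x2(s_yy), s_yx),
                 multiply(inverse_2x2(s_xx), s_xy))
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    eigenvalues = [(trace + root) / 2.0, (trace - root) / 2.0]
    # Clamp to [0, 1] to absorb rounding error before taking square roots.
    return [math.sqrt(min(max(ev, 0.0), 1.0)) for ev in eigenvalues]

# Hypothetical data: y1 is an exact linear combination of x1 and x2
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y1 = [a + b for a, b in zip(x1, x2)]
y2 = [1, -1, 2, 0, 1, -2]
rho = canonical_correlations([y1, y2], [x1, x2])
```

Because y1 is built as x1 + x2, the first canonical correlation is one; the number of correlations returned (two) equals the size of the smaller set, as stated above.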

Output elements
• Canonical loadings: linear correlations between each of the original variables and their respective canonical variate
• Cross-loadings: correlations between each of the original variables and the opposite canonical variate
• Eigenvalues (or canonical roots): squared canonical correlations; they represent how much of the variability is shared by the two canonical variates of each canonical correlation
• Canonical scores: value of the canonical function for each of the observations, based on the canonical variates
• Canonical redundancy index: measures how much of the variance in one set of original variables is explained by the canonical variate of the other set

Canonical correlation analysis in SPSS
• There is no menu-driven routine for CCA
• A macro routine written through the command (syntax) editor is necessary

Canonical correlation macro
• Indicate here the path to the SPSS directory
• Run the program
• List the variables of the two sets

Output
• Values of the canonical correlation
• The first 3 correlations are different from 0 at a 95% confidence level

Canonical coefficients for the 1st set of canonical variates
• Canonical variate

Canonical coefficients for the 2nd set of canonical variates
• Canonical variate

Loadings (correlations)

Cross-loadings

Redundancy analysis: % of variance explained