
Multivariate Statistical Data Analysis with Its Applications
Hua-Kai Chiou, Ph.D., Assistant Professor
Department of Statistics, NDMC
hkchiou@rs590.ndmc.edu.tw
September, 2005

Agenda
1. Introduction
2. Examining Your Data
3. Sampling & Estimation
4. Hypothesis & Testing
5. Multiple Regression Analysis
6. Logistic Regression
7. Multivariate Analysis of Variance
8. Principal Components Analysis
9. Factor Analysis
10. Cluster Analysis
11. Discriminant Analysis
12. Multidimensional Scaling
13. Canonical Correlation Analysis
14. Conjoint Analysis
15. Structural Equation Modeling

1 Introduction

Some Basic Concepts of MVA
• What is Multivariate Analysis (MVA)?
• Impact of the Computer Revolution
• Multivariate Analysis Defined
• Measurement Scales
• Types of Multivariate Techniques

• Dependence technique – the objective is prediction of the dependent variable(s) by the independent variable(s), e.g., regression analysis.
• Dependent variable – presumed effect of, or response to, a change in the independent variable(s).
• Dummy variable – nonmetrically measured variable transformed into a metric variable by assigning 1 or 0 to a subject, depending on whether it possesses a particular characteristic.
• Effect size – estimate of the degree to which the phenomenon being studied (e.g., correlation or difference in means) exists in the population.
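The dummy-variable definition can be illustrated with a minimal Python sketch. The function name `dummy_code` and the firm-size data are hypothetical and chosen only for illustration; they are not part of the original slides.

```python
def dummy_code(values, category):
    """Return 1 for subjects that possess the characteristic, else 0."""
    return [1 if v == category else 0 for v in values]


# Hypothetical nonmetric variable: size class of five customer firms.
firm_size = ["large", "small", "small", "large", "small"]

# The nonmetric labels become a 0/1 variable usable in metric techniques.
is_large = dummy_code(firm_size, "large")
print(is_large)  # [1, 0, 0, 1, 0]
```

A variable with more than two categories would need one such 0/1 column per category, less one reference category.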

• Indicator – single variable used in conjunction with one or more other variables to form a composite measure.
• Interdependence technique – classification of statistical techniques in which the variables are not divided into dependent and independent sets (e.g., factor analysis).
• Metric data – also called quantitative data, interval data, or ratio data, these measurements identify or describe subjects (or objects) not only on the possession of an attribute but also by the amount or degree to which the subject may be characterized by the attribute. For example, a person’s age and weight are metric data.

• Multicollinearity – extent to which a variable can be explained by the other variables in the analysis. As multicollinearity increases, it complicates the interpretation of the variate as it is more difficult to ascertain the effect of any single variable, owing to their interrelationships.
• Nonmetric data – also called qualitative data.
• Power – probability of correctly rejecting the null hypothesis when it is false, that is, correctly finding a hypothesized relationship when it exists. Determined as a function of (1) the statistical significance level (α) set by the researcher for a Type I error, (2) the sample size used in the analysis, and (3) the effect size being examined.
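The dependence of power on α, sample size, and effect size can be sketched numerically. The example below assumes the simplest possible setting, a two-sided one-sample z-test with known unit variance and a standardized effect size; the function name `z_test_power` is hypothetical, and other tests would need their own power formulas.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test (assumed known unit variance)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    shift = effect_size * n ** 0.5              # mean of the test statistic under H1
    # Probability that the statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Power rises with sample size for a fixed effect size and alpha.
for n in (10, 30, 100):
    print(n, round(z_test_power(0.5, n), 3))
```

When the effect size is zero the function returns α itself, which is the Type I error rate.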

• Practical significance – means of assessing multivariate analysis results based on their substantive findings rather than their statistical significance. Whereas statistical significance determines whether the result is attributable to chance, practical significance assesses whether the result is useful.
• Reliability – extent to which a variable or set of variables is consistent in what it is intended to measure. Reliability relates to the consistency of the measure(s).
• Validity – extent to which a measure or set of measures correctly represents the concept of study. Validity is concerned with how well the concept is defined by the measure(s).

• Type I error – probability of incorrectly rejecting the null hypothesis.
• Type II error – probability of incorrectly failing to reject the null hypothesis, that is, the chance of not finding a correlation or mean difference when it does exist.
• Variate – linear combination of variables formed in the multivariate technique by deriving empirical weights applied to a set of variables specified by the researcher.
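The variate definition can be written as a weighted sum. The symbols below are assumptions introduced for illustration: X_1, ..., X_n are the variables specified by the researcher and w_1, ..., w_n are the weights derived by the multivariate technique.

```latex
\[
  \text{Variate value} = w_1 X_1 + w_2 X_2 + \cdots + w_n X_n
\]
```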

The Relationship between Multivariate Dependence Methods
• Analysis of Variance (ANOVA): dependent (metric), independent (nonmetric)
• Multivariate Analysis of Variance (MANOVA): dependent (metric), independent (nonmetric)
• Canonical Correlation: (metric, nonmetric)
• Discriminant Analysis: dependent (nonmetric), independent (metric)
• Multiple Regression Analysis: dependent (metric), independent (metric, nonmetric)
• Conjoint Analysis: dependent (metric, nonmetric), independent (nonmetric)
• Structural Equation Modeling: dependent (metric), independent (metric, nonmetric)

What type of relationship is being examined?
• Dependence – How many variables are being predicted?
  – Multiple relationships of dependent and independent variables: Structural Equation Modeling
  – Several dependent variables in a single relationship – What is the measurement scale of the dependent variable?
    Metric: Canonical correlation analysis; Multivariate analysis of variance (MANOVA)
    Nonmetric: Canonical correlation analysis with dummy variables
  – One dependent variable in a single relationship – What is the measurement scale of the dependent variable?
    Metric: Multiple regression; Conjoint analysis
    Nonmetric: Multiple discriminant analysis; Linear probability models
• Interdependence – Is the structure of relationships among:
  – Variables: Factor analysis
  – Cases/Respondents: Cluster analysis
  – Objects – How are the attributes measured?
    Metric: Multidimensional scaling
    Nonmetric: Multidimensional scaling; Correspondence analysis

A Structured Approach to Multivariate Model Building
Stage 1: Define the research problem, objectives, and multivariate technique to be used
Stage 2: Develop the analysis plan
Stage 3: Evaluate the assumptions underlying the multivariate technique
Stage 4: Estimate the multivariate model and assess overall model fit
Stage 5: Interpret the variate(s)
Stage 6: Validate the multivariate model

2 Examining Your Data

HATCO Case
• Primary Database
  – This example investigates a business-to-business case from existing customers of HATCO.
  – The primary database consists of 100 observations on 14 separate variables.
• Three types of information were collected:
  – The perceptions of HATCO, 7 attributes (X1–X7);
  – The actual purchase outcomes, 2 specific measures (X9, X10);
  – The characteristics of the purchasing companies, 5 characteristics (X8, X11–X14).

Table 2.1 Description of Database Variables (Hair et al., 1998)

Fig 2.1 Scatter Plot Matrix of Metric Variables (Hair et al., 1998)

Fig 2.2 Examples of Multivariate Graphical Displays (Hair et al., 1998)

Missing Data
• A missing data process is any systematic event external to the respondent (e.g., data entry errors or data collection problems) or action on the part of the respondent (such as refusal to answer) that leads to missing values.
• The impact of missing data is detrimental not only through its potential “hidden” biases of the results but also in its practical impact on the sample size available for analysis.

• Understanding the missing data
  – Ignorable missing data
  – Remediable missing data
• Examining the pattern of missing data
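Examining the pattern of missing data usually starts with simple counts per variable and per case. The sketch below assumes missing values are stored as `None`; the function name `missing_pattern` and the four records are hypothetical and are not the pretest data referenced in the tables.

```python
def missing_pattern(rows):
    """Count missing values (None) per variable and per case."""
    variables = list(rows[0].keys())
    per_variable = {v: sum(r[v] is None for r in rows) for v in variables}
    per_case = [sum(r[v] is None for v in variables) for r in rows]
    # Cases with no missing values are the ones a listwise analysis keeps.
    complete_cases = sum(1 for c in per_case if c == 0)
    return per_variable, per_case, complete_cases


# Hypothetical records; None marks a missing value.
data = [
    {"X1": 4.1, "X2": 0.6, "X3": None},
    {"X1": None, "X2": 3.0, "X3": 6.3},
    {"X1": 2.4, "X2": 1.6, "X3": 6.8},
    {"X1": None, "X2": None, "X3": 5.3},
]
per_variable, per_case, complete_cases = missing_pattern(data)
print(per_variable)    # {'X1': 2, 'X2': 1, 'X3': 1}
print(per_case)        # [1, 1, 0, 2]
print(complete_cases)  # 1
```

Here only one of four cases is complete, which shows how quickly missing values can shrink the sample available for analysis.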

Table 2.2 Summary Statistics of Pretest Data (Hair et al., 1998)

Table 2.3 Assessing the Randomness of Missing Data through Group Comparisons of Observations with Missing versus Valid Data (Hair et al., 1998)

Table 2.4 Assessing the Randomness of Missing Data through Dichotomized Variable Correlations and the Multivariate Test for Missing Completely at Random (MCAR) (Hair et al., 1998)

Table 2.5 Comparison of Correlations Obtained with All-Available (Pairwise), Complete Case (Listwise), and Mean Substitution Approaches (Hair et al., 1998)

Table 2.6 Results of the Regression and EM Imputation Methods (Hair et al., 1998)

Outliers
• Four classes of outliers:
  – Procedural error
  – Extraordinary event that can be explained
  – Extraordinary observations that have no explanation
  – Observations that fall within the ordinary range of values on each of the variables but are unique in their combination of values across the variables
• Detecting outliers
  – Univariate detection
  – Bivariate detection
  – Multivariate detection

Outlier detection
• Univariate detection threshold:
  – For small samples, standardized variable values beyond ±2.5
  – For larger samples, standardized variable values beyond ±3 or ±4
• Bivariate detection threshold:
  – A confidence ellipse covering between 50 and 90 percent of the bivariate normal distribution
• Multivariate detection:
  – The Mahalanobis distance D²
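The univariate threshold and the Mahalanobis distance can be sketched in plain Python. The function names and the data are hypothetical, and the Mahalanobis version is restricted to two variables so that the covariance matrix can be inverted by hand; a real analysis would use a matrix library and all metric variables.

```python
from statistics import mean, stdev


def univariate_outliers(values, threshold=2.5):
    """Indices whose standardized value exceeds the threshold in absolute size."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs((v - m) / s) > threshold]


def mahalanobis_d2(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    d2 = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2


# Hypothetical scores: the last value is extreme on a single variable.
scores = [4, 5, 5, 6, 5, 4, 6, 5, 5, 4, 6, 20]
print(univariate_outliers(scores))  # [11]

# Hypothetical pairs: the last case is ordinary on each variable
# but unusual in its combination of values.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 1.0]
d2 = mahalanobis_d2(xs, ys)
print(d2.index(max(d2)))  # 7
```

The second example corresponds to the fourth class of outlier: neither value of the last case is extreme alone, yet its D² is the largest in the sample.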

Table 2.7 Identification of Univariate and Bivariate Outliers (Hair et al., 1998)

Fig 2.3 Graphical Identification of Bivariate Outliers (Hair et al., 1998)

Table 2.8 Identification of Multivariate Outliers (Hair et al., 1998)

Testing the Assumptions of Multivariate Analysis
• Graphical analyses of normality
  – Kurtosis refers to the peakedness or flatness of the distribution compared with the normal distribution.
  – Skewness appears in a normal probability plot as an arc, either above or below the diagonal.
• Statistical tests of normality
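Skewness and kurtosis can also be computed directly and converted to approximate z statistics. The sketch below uses moment-based estimates and the common large-sample approximations z = skewness / sqrt(6/N) and z = kurtosis / sqrt(24/N); the function names and data are hypothetical, and statistical packages may apply small-sample corrections that give slightly different values.

```python
from statistics import mean


def skewness_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a normal curve)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


def normality_z(values):
    """Approximate z statistics for skewness and kurtosis."""
    n = len(values)
    skew, kurt = skewness_kurtosis(values)
    return skew / (6 / n) ** 0.5, kurt / (24 / n) ** 0.5


# Hypothetical right-skewed data: most values small, one long tail value.
sample = [1, 1, 2, 2, 2, 3, 3, 4, 5, 12]
skew, kurt = skewness_kurtosis(sample)
z_skew, z_kurt = normality_z(sample)
print(round(skew, 3), round(kurt, 3))
print(round(z_skew, 3), round(z_kurt, 3))
```

A z value that exceeds the chosen critical value (for example ±1.96 at the .05 level) suggests a departure from normality on that characteristic.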

Fig 2.4 Normal Probability Plots and Corresponding Univariate Distribution (Hair et al., 1998)

Homoscedasticity vs. Heteroscedasticity
• Homoscedasticity is an assumption related primarily to dependence relationships between variables.
• Although the dependent variables must be metric, this concept of an equal spread of variance across independent variables can be applied whether the independent variables are metric or nonmetric.
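For a nonmetric independent variable, a first look at homoscedasticity is to compare the variance of the metric dependent variable across groups. This is only a descriptive sketch with hypothetical names and data; a formal test of equal variances would normally follow.

```python
from collections import defaultdict
from statistics import variance


def variance_by_group(groups, values):
    """Sample variance of the dependent variable within each group."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    return {g: variance(vs) for g, vs in by_group.items()}


# Hypothetical data: both groups have the same mean but very different spread.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
values = [10.0, 11.0, 9.0, 10.0, 4.0, 16.0, 1.0, 19.0]

spread = variance_by_group(groups, values)
print(spread)
print(max(spread.values()) / min(spread.values()))  # ratio of largest to smallest variance
```

A ratio close to 1 is consistent with homoscedasticity; a large ratio, as in this example, points to heteroscedasticity.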

Fig 2.5 Scatter Plots of Homoscedastic and Heteroscedastic Relationships (Hair et al., 1998)

Fig 2.6 Normal Probability Plots of Metric Variables (Hair et al., 1998)

Table 2.9 Distributional Characteristics, Testing for Normality, and Possible Remedies (Hair et al., 1998)

Fig 2.7 Transformation of X2 (Price Level) to Achieve Normality (Hair et al., 1998)

Table 2.10 Testing for Homoscedasticity (Hair et al., 1998)

3 Sampling Distribution

Understanding sampling distributions
• A histogram is constructed from a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a rectangle located above the interval.
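The step from scores to a frequency table to a histogram can be sketched as follows. The function name `frequency_table`, the interval settings, and the scores are hypothetical; intervals are treated as half-open, so a score equal to an upper limit falls in the next interval.

```python
def frequency_table(scores, start, width, bins):
    """Count scores in `bins` equal-width intervals [lo, hi) starting at `start`."""
    counts = [0] * bins
    for s in scores:
        k = int((s - start) // width)
        if 0 <= k < bins:  # scores outside the covered range are ignored
            counts[k] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Hypothetical test scores.
scores = [52, 67, 71, 74, 78, 81, 83, 85, 88, 94]
table = frequency_table(scores, start=50, width=10, bins=5)

# Text histogram: bar length stands in for rectangle height.
for lo, hi, count in table:
    print(f"{lo}-{hi}: {'#' * count}")
```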

• A bar graph is much like a histogram, differing in that the columns are separated from each other by a small distance. Bar graphs are commonly used for qualitative variables.

What is a normal distribution?
• Normal distributions are a family of distributions that have the same general shape. They are symmetric with scores more concentrated in the middle than in the tails. Normal distributions are sometimes described as bell shaped. The height of a normal distribution can be specified mathematically in terms of two parameters: the mean (μ) and the standard deviation (σ).
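For reference, the height of the normal curve at a score x is given by the standard density formula, written here with μ for the mean and σ for the standard deviation:

```latex
\[
  f(x) = \frac{1}{\sigma\sqrt{2\pi}}
         \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
\]
```

Changing μ shifts the curve along the X-axis, while changing σ makes it wider and flatter or narrower and taller.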
