Analysis of Categorical Data
Dr Siti Azrin Binti Ab Hamid
Unit Biostatistics and Research Methodology
Outline
• Types of categorical analysis
• Steps to analysis
Overview of univariable analysis
Dependent variable      Independent variable  Number of groups in independent variable  Parametric test  Non-parametric test
Numerical (one)         -                     -                                         One-sample t     Sign test
Numerical (one)         Categorical           2 groups (independent)                    Independent t    Mann-Whitney
Numerical (one)         Categorical           2 groups (dependent)                      Paired t         Signed rank test
Numerical (one)         Categorical           > 2 groups (independent)                  One-way ANOVA    Kruskal-Wallis
Categorical (2 groups)  Categorical           2 groups (independent)                    -                Chi-square test, Fisher exact test
Categorical (2 groups)  Categorical           2 groups (dependent)                      -                McNemar test
Introduction
• Categorical data analysis deals with discrete data that can be organized into categories.
• The data are organized into a contingency table.
Types of categorical data analysis
Data                                         Statistical test
One proportion                               Chi-square goodness of fit
Two proportions, independent sample          Pearson chi-square / Fisher exact
Two proportions, dependent sample            McNemar test
Stratified sampling to control confounder    Mantel-Haenszel test
Hypothesis testing
Step 1: State the hypotheses
Step 2: Set the significance level
Step 3: Check the assumptions
Step 4: Perform the statistical analysis
Step 5: Make interpretation
Step 6: Draw conclusion
Contingency table
• Consists of two columns and two rows.
• Cells are labeled A through D.
• Columns and rows are added for labels.
• Row: independent variable / exposure / risk factors
• Column: dependent variable / outcome
Example of contingency table
             CHD present   CHD absent   Total
Smoker       138           32           170
Nonsmoker    263           105          368
Total        401           137          538
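The margins of the example table can be rechecked with a few lines of code. This is a minimal sketch, assuming rows hold the exposure (smoking status) and columns hold the outcome (CHD status); the dictionary layout and the helper name `margins` are illustrative choices, not part of the slides.

```python
# Observed counts from the example table.
# Assumption: rows = exposure (smoking status), columns = outcome (CHD status).
observed = {
    "Smoker": {"CHD present": 138, "CHD absent": 32},
    "Nonsmoker": {"CHD present": 263, "CHD absent": 105},
}


def margins(table):
    """Return (row_totals, column_totals, grand_total) for a nested-dict table."""
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for col, count in cells.items():
            col_totals[col] = col_totals.get(col, 0) + count
    return row_totals, col_totals, sum(row_totals.values())


row_totals, col_totals, grand_total = margins(observed)
print("Row totals:", row_totals)
print("Column totals:", col_totals)
print("Grand total:", grand_total)
```

The row totals, column totals and grand total should agree with the Total row and Total column of the table.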
Pearson Chi-square
• To test the association between two categorical variables
• Independent sample
• Result of test:
  - Not significant: no association
  - Significant: an association
Research Question
• Is estrogen receptor status associated with breast cancer status?
• Data: Breast cancer.sav
Step 1: State the hypothesis
• H0: There is no association between estrogen receptor and breast cancer status.
• HA: There is an association between estrogen receptor and breast cancer status.
Step 2: Set the significance level
• α = 0.05
Step 3: Check the assumption
1. The two variables are independent (independent samples).
2. The two variables are categorical.
3. Cells with an expected count of < 5:
   - in > 20% of cells: Fisher exact test
   - in < 20% of cells: Pearson Chi-square
$\text{Expected count} = \dfrac{\text{Row total} \times \text{Column total}}{\text{Grand total}}$
Variable   Breast Ca: Died   Breast Ca: Alive   Total
ER -ve     310               28                 338
ER +ve     508               23                 531
Total      818               51                 869
Step 3: Check the assumption
Variable   Breast Ca: Died     Breast Ca: Alive   Total
ER -ve     310 (E = 318.2)     28 (E = 19.8)      338
ER +ve     508 (E = 499.8)     23 (E = 31.2)      531
Total      818                 51                 869
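The expected-count rule can be sketched in code. This is a minimal illustration that assumes the observed table is stored as a list of rows; the function names `expected_counts` and `share_of_small_cells` are hypothetical helpers, not taken from any statistics package.

```python
def expected_counts(observed):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def share_of_small_cells(expected, threshold=5):
    """Fraction of cells whose expected count is below the threshold."""
    cells = [e for row in expected for e in row]
    return sum(e < threshold for e in cells) / len(cells)


# Rows: ER -ve, ER +ve. Columns: Died, Alive (counts from the slide).
observed = [[310, 28], [508, 23]]
expected = expected_counts(observed)
small = share_of_small_cells(expected)

# Decision rule from the slide: more than 20% of cells with E < 5 -> Fisher exact.
test_name = "Fisher exact" if small > 0.20 else "Pearson chi-square"

for row in expected:
    print([round(e, 1) for row_e in [row] for e in row_e])
print(f"{small:.0%} of cells have expected count < 5 -> {test_name}")
```

For this table no cell has an expected count below 5, so the rule points to the Pearson chi-square test.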
Step 4: Statistical test
• Calculate the Chi-square value:
  $\chi^2 = \sum \dfrac{(O - E)^2}{E} = 5.897$
  $df = (R - 1)(C - 1) = (2 - 1)(2 - 1) = 1$
  From the chi-square table, the p-value lies between 0.01 and 0.02.
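The chi-square calculation can be reproduced as a rough check. This sketch uses only the Python standard library; the p-value helper relies on the identity that, for 1 degree of freedom only, the upper-tail chi-square probability equals `erfc(sqrt(x / 2))`. The hand calculation on the slide uses expected counts rounded to one decimal place, so the unrounded statistic computed here comes out slightly smaller (about 5.84 rather than 5.897).

```python
import math


def pearson_chi_square(observed):
    """Return (statistic, df) for an R x C table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            statistic += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


def chi_square_p_value_df1(statistic):
    """Upper-tail p-value; valid ONLY when df = 1."""
    return math.erfc(math.sqrt(statistic / 2))


# Rows: ER -ve, ER +ve. Columns: Died, Alive (counts from the slide).
observed = [[310, 28], [508, 23]]
statistic, df = pearson_chi_square(observed)
p_value = chi_square_p_value_df1(statistic)
print(f"chi-square = {statistic:.3f}, df = {df}, p = {p_value:.3f}")
```

The resulting p-value falls between 0.01 and 0.02, in line with the table lookup.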
Step 4: Statistical test
[Figure: numbered steps 1 to 10]
Step 5: Interpretation
• p value = 0.016 < 0.05: reject H0, accept HA
Step 6: Conclusion
• There is a significant association between estrogen receptor and breast cancer status using the Pearson Chi-square test (p = 0.016).
Fisher's Exact Test
• To test the association between two categorical variables
• Independent sample
• Sample sizes are small
Research Question
• Is gender associated with coronary heart disease?
• Data: CHD data.sav
Step 1: State the hypothesis
• H0: There is no association between gender and coronary heart disease.
• HA: There is an association between gender and coronary heart disease.
Step 2: Set the significance level
• α = 0.05
Step 3: Check the assumption
1. The two variables are independent (independent samples).
2. The two variables are categorical.
3. Cells with an expected count of < 5:
   - in > 20% of cells: Fisher exact test
   - in < 20% of cells: Pearson Chi-square
$\text{Expected count} = \dfrac{\text{Row total} \times \text{Column total}}{\text{Grand total}}$
Variable   Coronary Heart Disease: Presence   Coronary Heart Disease: Absent   Total
Male       15                                 5                                20
Female     10                                 0                                10
Total      25                                 5                                30
Step 3: Check the assumption
Variable   Coronary Heart Disease: Presence   Coronary Heart Disease: Absent   Total
Male       15 (E = 16.7)                      5 (E = 3.3)                      20
Female     10 (E = 8.3)                       0 (E = 1.7)                      10
Total      25                                 5                                30
2 cells (50%) have an expected count < 5.
Step 4: Statistical test
• Calculate the Chi-square value:
  $\chi^2 = \sum \dfrac{(O - E)^2}{E} = 3.0968$
  $df = (R - 1)(C - 1) = (2 - 1)(2 - 1) = 1$
  From the chi-square table, the p-value lies between 0.05 and 0.1.
Step 4: Statistical test
[Figure: numbered steps 1 to 10]
Step 5: Interpretation
• p value = 0.140 > 0.05: do not reject H0
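The Fisher exact p-value for the gender and CHD table can be sketched from first principles. This is a minimal illustration, assuming the common two-sided definition that sums the hypergeometric probabilities of all tables (with the same margins) that are no more likely than the observed one; the function name `fisher_exact_two_sided` is a hypothetical helper and statistical packages may use other two-sided conventions.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)   # smallest feasible top-left count
    highest = min(row1, col1)      # largest feasible top-left count
    p_observed = prob(a)
    # Sum every table that is as probable as, or less probable than, the
    # observed one (small relative tolerance guards against float ties).
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Rows: Male, Female. Columns: CHD present, CHD absent (counts from the slide).
observed = [[15, 5], [10, 0]]
p_value = fisher_exact_two_sided(observed)
print(f"Fisher exact two-sided p = {p_value:.3f}")
```

Under that definition the result rounds to the p-value reported on the slide.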
Step 6: Conclusion
• There is no significant association between gender and coronary heart disease using Fisher's Exact test (p = 0.140).
McNemar Test
• Categorical data
• Dependent sample
  - Matched sample
  - Cross over design
  - Before & after (same subject)
• To determine whether the row and column marginal frequencies are equal (marginal homogeneity)
Hypotheses
• The null hypothesis of marginal homogeneity states that the two marginal probabilities for each outcome are the same.
  H0: PB = PC
  HA: PB ≠ PC
• A & D = concordant pairs
• B & C = discordant pairs
• A discordant pair is a pair with different outcomes.
Research Question
• Is type of mastectomy associated with the 5-year survival proportion in patients with breast cancer?
• The sample consisted of breast cancer patients:
  - matched for age (same decade of age)
  - same clinical condition
• Data: breast ca.sav
Step 1: State the hypothesis
• H0: There is no association between type of mastectomy and 5-year survival proportion in patients with breast cancer.
• HA: There is an association between type of mastectomy and 5-year survival proportion in patients with breast cancer.
Step 2: Set the significance level
• α = 0.05
Step 3: Check the assumption
1. The two variables are dependent.
2. The two variables are categorical.
Step 4: Statistical test
• $\chi^2 = \dfrac{(|b - c| - 1)^2}{b + c} = \dfrac{(|0 - 8| - 1)^2}{0 + 8} = 6.125$
• $df = (R - 1)(C - 1) = (2 - 1)(2 - 1) = 1$
• Calculated $\chi^2$ > tabulated $\chi^2$
* $\chi^2 = \dfrac{(|b - c| - 0.5)^2}{b + c}$
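The McNemar calculation depends only on the discordant pairs b and c, which a short sketch makes explicit. The function name `mcnemar` and its `correction` parameter are illustrative, not a library API. The exact two-sided binomial p-value is included as an assumption about how a small-sample p-value could be obtained; the slides do not state which method produced their reported p-value.

```python
from math import comb


def mcnemar(b, c, correction=1.0):
    """Return (chi_square, exact_p) from the discordant pair counts b and c.

    chi_square uses the continuity-corrected form (|b - c| - correction)^2 / (b + c).
    exact_p is the two-sided binomial p-value under H0, where each discordant
    pair is equally likely to fall in cell b or cell c.
    Requires b + c > 0.
    """
    n = b + c
    chi_square = (abs(b - c) - correction) ** 2 / n
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    exact_p = min(1.0, 2 * tail)
    return chi_square, exact_p


# Discordant pairs from the slide: b = 0, c = 8.
chi_square, exact_p = mcnemar(0, 8)
print(f"chi-square = {chi_square:.3f}, exact two-sided p = {exact_p:.4f}")
```

Passing `correction=0.5` gives the alternative corrected form shown on the slide.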
Step 4: Statistical test
[Figure: numbered steps 1 to 9]
Step 5: Interpretation
• p value = 0.008 < 0.05: reject H0, accept HA
Step 6: Conclusion
• There is an association between type of mastectomy and 5-year survival proportion in patients with breast cancer using the McNemar test (p = 0.008).
Cochran Mantel-Haenszel Test
• The test compares the probability of an event among independent groups in stratified samples.
• The stratification factor can be study center, gender, race, age group, obesity status or disease severity.
• It gives a stratified statistical analysis of the relationship between exposure and disease, after controlling for a confounder (strata variable).
• The data are arranged in a series of associated 2 × 2 contingency tables.
Research Question
• Is the type of treatment associated with response to treatment among migraine patients after controlling for gender?
• Confounder: gender
Gender   Measure                  Active   Placebo
Female   No of patients           27       25
Female   No of better response    16       5
Male     No of patients           28       26
Male     No of better response    12       7
Step 1: 2 x 2 contingency table
Strata              Treatment   Better   Same   Total
Strata 1 (Female)   Active      16       11     27
Strata 1 (Female)   Placebo     5        20     25
Strata 2 (Male)     Active      12       16     28
Strata 2 (Male)     Placebo     7        19     26
Step 2: Check the assumption
1. Random sampling
2. Stratified sampling
Step 3: State the hypothesis
• H0: There is no association between type of treatment and response of treatment among female and male migraine patients.
• HA: There is an association between type of treatment and response of treatment among female and male migraine patients.
Step 4: Statistical test
• Compute the expected frequency for each stratum:
  $e_i = \dfrac{(a_i + b_i)(a_i + c_i)}{n_i}$
• Compute the variance for each stratum:
  $v_i = \dfrac{(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)}{n_i^2 (n_i - 1)}$
• Compute the Mantel-Haenszel statistic:
  $\chi^2_{MH} = \dfrac{\left(\sum a_i - \sum e_i\right)^2}{\sum v_i}$
Step 4: Statistical test
• Compute the expected frequency for each stratum:
  $e_i = \dfrac{(a_i + b_i)(a_i + c_i)}{n_i}$
  $e_1 = \dfrac{(16 + 11)(16 + 5)}{52} = 10.9038$
  $e_2 = \dfrac{(12 + 16)(12 + 7)}{54} = 9.8519$
Step 4: Statistical test
• Compute the variance for each stratum:
  $v_i = \dfrac{(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)}{n_i^2 (n_i - 1)}$
  $v_1 = \dfrac{(16 + 11)(5 + 20)(16 + 5)(11 + 20)}{(52)^2 (52 - 1)} = 3.1865$
  $v_2 = \dfrac{(12 + 16)(7 + 19)(12 + 7)(16 + 19)}{(54)^2 (54 - 1)} = 3.1325$
Step 4: Statistical test
• Compute the Mantel-Haenszel statistic:
  $\chi^2_{MH} = \dfrac{\left(\sum a_i - \sum e_i\right)^2}{\sum v_i} = \dfrac{\left((16 + 12) - (10.9038 + 9.8519)\right)^2}{3.1865 + 3.1325} = 8.3051 \approx 8.31$
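The stratum-by-stratum arithmetic can be rechecked in code. This is a minimal sketch of the uncorrected Mantel-Haenszel statistic exactly as written on the slides (no continuity correction), with each stratum stored as a tuple (a, b, c, d); the function name `mantel_haenszel_chi_square` is a hypothetical helper. The p-value helper is valid for 1 degree of freedom only.

```python
import math


def mantel_haenszel_chi_square(tables):
    """Uncorrected Mantel-Haenszel chi-square for a list of (a, b, c, d) strata."""
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        sum_a += a
        sum_e += (a + b) * (a + c) / n                  # expected count e_i
        sum_v += ((a + b) * (c + d) * (a + c) * (b + d)
                  / (n ** 2 * (n - 1)))                 # variance v_i
    return (sum_a - sum_e) ** 2 / sum_v


def chi_square_p_value_df1(statistic):
    """Upper-tail p-value; valid ONLY when df = 1."""
    return math.erfc(math.sqrt(statistic / 2))


# Each stratum: (active better, active same, placebo better, placebo same).
strata = {
    "Female": (16, 11, 5, 20),
    "Male": (12, 16, 7, 19),
}
chi_square_mh = mantel_haenszel_chi_square(strata.values())
p_value = chi_square_p_value_df1(chi_square_mh)
print(f"MH chi-square = {chi_square_mh:.4f}, p = {p_value:.3f}")
```

Software output may differ slightly if a continuity correction is applied by default.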
Step 4: Statistical test
• Compute the odds ratio:
  $OR_{MH} = \dfrac{\sum (a_i d_i / n_i)}{\sum (b_i c_i / n_i)} = \dfrac{(16 \times 20 / 52) + (12 \times 19 / 54)}{(11 \times 5 / 52) + (16 \times 7 / 54)} = 3.313$
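The pooled odds ratio can be computed alongside the stratum-specific odds ratios to see what is being combined. This is a minimal sketch with each stratum stored as (a, b, c, d); the function names are hypothetical helpers, and the stratum-specific values are shown only for comparison.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a single 2 x 2 table: (a * d) / (b * c)."""
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(tables):
    """Pooled odds ratio: sum(a_i * d_i / n_i) / sum(b_i * c_i / n_i)."""
    numerator = denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Each stratum: (active better, active same, placebo better, placebo same).
strata = {
    "Female": (16, 11, 5, 20),
    "Male": (12, 16, 7, 19),
}
for name, cells in strata.items():
    print(f"{name}: OR = {odds_ratio(*cells):.3f}")

or_mh = mantel_haenszel_odds_ratio(strata.values())
print(f"Mantel-Haenszel OR = {or_mh:.3f}")
```

The pooled estimate lies between the two stratum-specific odds ratios, as expected for a weighted combination.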
Step 4: Statistical test
Data: Migraine.sav
[Figure: numbered steps 1 to 6]
Step 5: Interpretation
• Mantel-Haenszel statistic:
  $\chi^2_{MH} = \dfrac{\left(\sum a_i - \sum e_i\right)^2}{\sum v_i} = \dfrac{\left((16 + 12) - (10.9038 + 9.8519)\right)^2}{3.1865 + 3.1325} = 8.3051 \approx 8.31$
• Calculated value > tabulated value: reject H0
Step 5: Interpretation
• Test of homogeneity of the odds ratios (Breslow-Day, *Tarone's adjusted): H0: OR1 = OR2 (the association is homogeneous).
• Test of conditional independence: H0: OR1 = 1 and OR2 = 1 (conditionally independent).
• The large p-value for the Breslow-Day test (p = 0.222) indicates no significant gender difference in the odds ratios.
Step 6: Conclusion
• There is a significant association between type of treatment and response of treatment among female and male migraine patients (p = 0.004).
• We estimate that female and male patients who receive active treatment have 3.31 times the odds of a better migraine response compared with patients who receive placebo.