Analysis of Categorical Data
Dr Siti Azrin Binti Ab Hamid
Unit Biostatistics and Research Methodology

Outline
• Types of categorical analysis
• Steps to analysis

Overview of univariable analysis
Dependent variable | Independent variable | Number of groups in independent variable | Parametric test | Non-parametric test
Numerical (one) | - | - | One-sample t | Sign test
Numerical | Categorical | 2 groups (independent) | Independent t | Mann-Whitney
Numerical | Categorical | 2 groups (dependent) | Paired t | Signed rank test
Numerical | Categorical | > 2 groups (independent) | One-way ANOVA | Kruskal-Wallis
Categorical (2 groups) | Categorical | 2 groups (independent) | - | Chi-square test / Fisher exact test
Categorical (2 groups) | Categorical | 2 groups (dependent) | - | McNemar test

Introduction
• Categorical data analysis deals with discrete data that can be organized into categories.
• The data are organized into a contingency table.

Types of categorical data analysis
Data | Statistical test
One proportion | Chi-square goodness of fit
Two proportions, independent sample | Pearson chi-square / Fisher exact
Two proportions, dependent sample | McNemar test
Stratified sampling to control confounder | Mantel-Haenszel test

Hypothesis testing
Step 1: State the hypotheses
Step 2: Set the significance level
Step 3: Check the assumptions
Step 4: Perform the statistical analysis
Step 5: Make interpretation
Step 6: Draw conclusion

Contingency table
• Consists of two columns and two rows.
• Cells are labeled A through D.
• Columns and rows are added for labels.
• Row: independent variable / exposure / risk factors
• Column: dependent variable / outcome

Example of contingency table
Variable | Smoker | Nonsmoker | Total
CHD present | 138 | 263 | 401
CHD absent | 32 | 105 | 137
Total | 170 | 368 | 538

Pearson Chi-square
• To test the association between two categorical variables
• Independent sample
• Result of test:
  - Not significant: no association
  - Significant: an association

Research Question
• Is estrogen receptor status associated with breast cancer status?
• Data: Breast cancer.sav

Step 1: State the hypothesis
• H0: There is no association between estrogen receptor and breast cancer status.
• HA: There is an association between estrogen receptor and breast cancer status.

Step 2: Set the significance level
• α = 0.05

Step 3: Check the assumptions
1. The two samples are independent
2. The two variables are categorical
3. Cells with an expected count of < 5:
   - more than 20% of cells: Fisher exact test
   - less than 20% of cells: Pearson Chi-square
Expected count = (Row total × Column total) / Grand total
Variable | Breast Ca: Died | Breast Ca: Alive | Total
ER -ve | 310 | 28 | 338
ER +ve | 508 | 23 | 531
Total | 818 | 51 | 869

Step 3: Check the assumptions
Variable | Breast Ca: Died | Breast Ca: Alive | Total
ER -ve | 310 (E = 318.2) | 28 (E = 19.8) | 338
ER +ve | 508 (E = 499.8) | 23 (E = 31.2) | 531
Total | 818 | 51 | 869
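The expected counts in this table can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `expected_counts` is made up for illustration, and the observed counts are the ones shown on the slide.

```python
# Observed counts from the slide: rows = ER status, columns = (died, alive).
observed = [
    [310, 28],  # ER -ve
    [508, 23],  # ER +ve
]


def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
rounded = [[round(e, 1) for e in row] for row in expected]
print(rounded)  # [[318.2, 19.8], [499.8, 31.2]]
```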

Step 4: Statistical test
• Calculate the Chi-square value
χ² = ∑ (O − E)² / E = 5.897
df = (R − 1)(C − 1) = (2 − 1)(2 − 1) = 1
From the chi-square table (df = 1), the p value lies between 0.01 and 0.02
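As a cross-check of the hand calculation, the sketch below computes the Pearson chi-square statistic from unrounded expected counts. With exact expected counts the statistic comes out at about 5.84; the slide's 5.897 is what you get when the expected counts are first rounded to one decimal place. The helper names are illustrative, and the p value uses the closed form that holds only for 1 degree of freedom.

```python
import math

# Observed counts from the slide: rows = ER status, columns = (died, alive).
observed = [[310, 28], [508, 23]]


def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for an R x C table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count
            chi2 += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def p_value_df1(chi2):
    """Upper-tail p value of a chi-square variable with 1 df only."""
    return math.erfc(math.sqrt(chi2 / 2))


chi2, df = pearson_chi_square(observed)
p = p_value_df1(chi2)
print(round(chi2, 2), df, round(p, 3))
```

The p value rounds to 0.016, which falls inside the 0.01 to 0.02 range read from the table.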

Step 4: Statistical test
[Figure: numbered steps 1 to 10]

Step 5: Interpretation
p value = 0.016 < 0.05: reject H0, accept HA

Step 6: Conclusion
• There is a significant association between estrogen receptor and breast cancer status using the Pearson Chi-square test (p = 0.016).

Fisher’s Exact Test
• To test the association between two categorical variables
• Independent sample
• Sample sizes are small

Research Question
• Is gender associated with coronary heart disease?
• Data: CHD data.sav

Step 1: State the hypothesis
• H0: There is no association between gender and coronary heart disease.
• HA: There is an association between gender and coronary heart disease.

Step 2: Set the significance level
• α = 0.05

Step 3: Check the assumptions
1. The two samples are independent
2. The two variables are categorical
3. Cells with an expected count of < 5:
   - more than 20% of cells: Fisher exact test
   - less than 20% of cells: Pearson Chi-square
Expected count = (Row total × Column total) / Grand total
Variable | CHD: Presence | CHD: Absent | Total
Male | 15 | 5 | 20
Female | 10 | 0 | 10
Total | 25 | 5 | 30

Step 3: Check the assumptions
Variable | CHD: Presence | CHD: Absent | Total
Male | 15 (E = 16.7) | 5 (E = 3.3) | 20
Female | 10 (E = 8.3) | 0 (E = 1.7) | 10
Total | 25 | 5 | 30
2 cells (50%) have an expected count < 5
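The 20% rule for choosing between the two tests can be sketched as follows. The function name and the `threshold` parameter are illustrative, not part of any statistical package; the counts are those of the CHD table.

```python
# Observed counts from the slide: rows = gender, columns = (CHD present, CHD absent).
observed = [
    [15, 5],  # male
    [10, 0],  # female
]


def share_of_small_expected(table, threshold=5):
    """Fraction of cells whose expected count is below the threshold."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    return sum(1 for e in expected if e < threshold) / len(expected)


share = share_of_small_expected(observed)
# Rule from the slide: more than 20% of cells with expected count < 5 -> Fisher.
chosen_test = "Fisher exact test" if share > 0.20 else "Pearson Chi-square"
print(f"{share:.0%} of cells have expected count < 5: use {chosen_test}")
```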

Step 4: Statistical test
• Calculate the Chi-square value
χ² = ∑ (O − E)² / E = 3.0968
df = (R − 1)(C − 1) = (2 − 1)(2 − 1) = 1
From the chi-square table (df = 1), the p value lies between 0.05 and 0.1

Step 4: Statistical test
[Figure: numbered steps 1 to 10]

Step 5: Interpretation
p value = 0.140 > 0.05: do not reject H0
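For readers who want to see where a Fisher exact p value comes from, here is a minimal sketch based on the hypergeometric distribution. It assumes the common two-sided definition (sum the probabilities of all tables with the same margins that are no more likely than the observed one); the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small tolerance so ties with the observed table are counted.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Male: 15 present, 5 absent; female: 10 present, 0 absent.
p = fisher_exact_two_sided(15, 5, 10, 0)
print(f"{p:.3f}")  # 0.140
```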

Step 6: Conclusion
• There is no significant association between gender and coronary heart disease using Fisher’s Exact test (p = 0.140).

McNemar Test
• Categorical data
• Dependent sample
  - Matched sample
  - Cross-over design
  - Before & after (same subject)
• To determine whether the row and column marginal frequencies are equal (marginal homogeneity)

Hypotheses
• The null hypothesis of marginal homogeneity states that the two marginal probabilities for each outcome are the same
H0: P_B = P_C
HA: P_B ≠ P_C
A & D = concordant pairs
B & C = discordant pairs
A discordant pair is a pair with different outcomes

Research Question
• Is the type of mastectomy associated with the 5-year survival proportion in patients with breast cancer?
• The sample consisted of breast cancer patients
  - matched for age (same decade of age)
  - same clinical condition
• Data: breast ca.sav

Step 1: State the hypothesis
• H0: There is no association between type of mastectomy and 5-year survival proportion in patients with breast cancer.
• HA: There is an association between type of mastectomy and 5-year survival proportion in patients with breast cancer.

Step 2: Set the significance level
• α = 0.05

Step 3: Check the assumptions
1. The two samples are dependent (paired)
2. The two variables are categorical

Step 4: Statistical test
• χ² = (|b − c| − 1)² / (b + c) = (|0 − 8| − 1)² / (0 + 8) = 6.125
• df = (R − 1)(C − 1) = (2 − 1)(2 − 1) = 1
Calculated χ² > tabulated χ²
* χ² = (|b − c| − 0.5)² / (b + c)
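The McNemar calculation uses only the discordant cells b and c, which the sketch below makes explicit. The second helper computes an exact two-sided binomial p value on the discordant pairs; for b = 0 and c = 8 it gives about 0.008. Treating that as the method behind the software output is an assumption, and both function names are illustrative.

```python
from math import comb


def mcnemar_chi_square(b, c, correction=1.0):
    """Continuity-corrected McNemar statistic on the discordant cells."""
    return (abs(b - c) - correction) ** 2 / (b + c)


def mcnemar_exact_p(b, c):
    """Two-sided exact binomial p value: discordant pairs ~ Binomial(b + c, 0.5)."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


b, c = 0, 8  # discordant pairs from the slide
chi2 = mcnemar_chi_square(b, c)
p_exact = mcnemar_exact_p(b, c)
print(chi2)               # 6.125
print(round(p_exact, 3))  # 0.008
```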

Step 4: Statistical test
[Figure: numbered steps 1 to 9]

Step 5: Interpretation
p value = 0.008 < 0.05: reject H0, accept HA

Step 6: Conclusion
• There is an association between type of mastectomy and 5-year survival proportion in patients with breast cancer using the McNemar test (p = 0.008).

Cochran Mantel-Haenszel Test
• The test compares the probability of an event among independent groups in stratified samples.
• The stratification factor can be study center, gender, race, age group, obesity status or disease severity.
• It gives a stratified statistical analysis of the relationship between exposure and disease, after controlling for a confounder (the strata variable).
• The data are arranged in a series of associated 2 × 2 contingency tables.

Research Question
• Is the type of treatment associated with treatment response among migraine patients after controlling for gender?
• Confounder: gender
Gender | Measure | Active | Placebo
Female | No. of patients | 27 | 25
Female | No. with better response | 16 | 5
Male | No. of patients | 28 | 26
Male | No. with better response | 12 | 7

Step 1: 2 × 2 contingency table
Stratum | Treatment | Better | Same | Total
Stratum 1: Female | Active | 16 | 11 | 27
Stratum 1: Female | Placebo | 5 | 20 | 25
Stratum 2: Male | Active | 12 | 16 | 28
Stratum 2: Male | Placebo | 7 | 19 | 26

Step 2: Check the assumptions
1. Random sampling
2. Stratified sampling

Step 3: State the hypothesis
• H0: There is no association between type of treatment and response of treatment among female and male migraine patients.
• HA: There is an association between type of treatment and response of treatment among female and male migraine patients.

Step 4: Statistical test
• Compute the expected frequency for each stratum
e_i = (a_i + b_i)(a_i + c_i) / n_i
• Compute the variance for each stratum
v_i = (a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i) / (n_i² (n_i − 1))
• Compute the Mantel-Haenszel statistic
χ²_MH = (∑a_i − ∑e_i)² / ∑v_i

Step 4: Statistical test
• Compute the expected frequency for each stratum
e_i = (a_i + b_i)(a_i + c_i) / n_i
e_1 = (16 + 11)(16 + 5) / 52 = 10.9038
e_2 = (12 + 16)(12 + 7) / 54 = 9.8519

Step 4: Statistical test
• Compute the variance for each stratum
v_i = (a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i) / (n_i² (n_i − 1))
v_1 = (16 + 11)(5 + 20)(16 + 5)(11 + 20) / ((52)² (52 − 1)) = 3.1865
v_2 = (12 + 16)(7 + 19)(12 + 7)(16 + 19) / ((54)² (54 − 1)) = 3.1325
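The per-stratum quantities can be checked with a short script. This is a sketch only: each stratum is written as a tuple (a, b, c, d) with a = active/better, b = active/same, c = placebo/better, d = placebo/same, and the function name is made up for illustration.

```python
# One (a, b, c, d) tuple per stratum, taken from the 2 x 2 tables on the slides.
strata = {
    "female": (16, 11, 5, 20),
    "male": (12, 16, 7, 19),
}


def expected_and_variance(a, b, c, d):
    """Expected value and variance of cell a under the null hypothesis."""
    n = a + b + c + d
    e = (a + b) * (a + c) / n
    v = (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return e, v


results = {name: expected_and_variance(*cells) for name, cells in strata.items()}
for name, (e, v) in results.items():
    print(name, round(e, 4), round(v, 4))
# female 10.9038 3.1865
# male 9.8519 3.1325
```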

Step 4: Statistical test
• Compute the Mantel-Haenszel statistic
χ²_MH = (∑a_i − ∑e_i)² / ∑v_i
= ((16 + 12) − (10.9038 + 9.8519))² / (3.1865 + 3.1325)
= 8.3051 = 8.31
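Putting the pieces together, the sketch below computes the Mantel-Haenszel statistic in the form used on the slide (no continuity correction) and a p value from the 1-df closed form. The function name and the tuple layout (a, b, c, d) per stratum are assumptions made for this example.

```python
import math

# (a, b, c, d) per stratum: active/better, active/same, placebo/better, placebo/same.
strata = [(16, 11, 5, 20), (12, 16, 7, 19)]


def mantel_haenszel_chi_square(tables):
    """(sum a_i - sum e_i)^2 / sum v_i, without continuity correction."""
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        sum_a += a
        sum_e += (a + b) * (a + c) / n
        sum_v += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return (sum_a - sum_e) ** 2 / sum_v


chi2_mh = mantel_haenszel_chi_square(strata)
# Upper-tail p value; this closed form is valid for 1 degree of freedom only.
p = math.erfc(math.sqrt(chi2_mh / 2))
print(round(chi2_mh, 2), round(p, 3))  # 8.31 0.004
```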

Step 4: Statistical test
• Compute the odds ratio
OR_MH = ∑(a_i d_i / n_i) / ∑(b_i c_i / n_i)
= ((16 × 20 / 52) + (12 × 19 / 54)) / ((11 × 5 / 52) + (16 × 7 / 54))
= 3.313
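The pooled odds ratio is a ratio of two weighted sums, which is easy to verify numerically. As before, this is a sketch with an illustrative function name and the (a, b, c, d) layout assumed for each stratum.

```python
# (a, b, c, d) per stratum: active/better, active/same, placebo/better, placebo/same.
strata = [(16, 11, 5, 20), (12, 16, 7, 19)]


def mantel_haenszel_odds_ratio(tables):
    """Pooled odds ratio: sum(a*d/n) / sum(b*c/n) over the strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


or_mh = mantel_haenszel_odds_ratio(strata)
print(round(or_mh, 3))  # 3.313
```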

Step 4: Statistical test
Data: Migraine.sav
[Figure: numbered steps 1 to 6]

Step 5: Interpretation
• Mantel-Haenszel statistic
χ²_MH = (∑a_i − ∑e_i)² / ∑v_i
= ((16 + 12) − (10.9038 + 9.8519))² / (3.1865 + 3.1325)
= 8.3051 = 8.31
Calculated value > tabulated value: reject H0

Step 5: Interpretation
Homogeneity of the association (* Tarone's adjusted):
H0: OR_1 = OR_2 (association homogeneous)
Conditional independence:
H0: OR_1 = 1 and OR_2 = 1 (conditionally independent)
The large p value for the Breslow-Day test (p = 0.222) indicates no significant gender difference in the odds ratios.

Step 6: Conclusion
• There is a significant association between type of treatment and response of treatment among female and male migraine patients (p = 0.004).
• We estimate that patients, female or male, who receive the active treatment have 3.31 times the odds of a better migraine response compared with patients who receive placebo.