Association Rule Mining in Type 2 Diabetes Risk Prediction
Gyorgy J. Simon, Dept. of Health Sciences Research, Mayo Clinic
SHARPn Summit 2012
Outline
• Introduction
• Modeling Diabetes Risk
  – Association Rule Mining
• Results
  – Diabetes Disease Network Reconstruction
  – Diabetes Risk Prediction
• Applicability to SHARP
Diabetes
• In the US, 25.8 million people (8% of the population) suffer from Diabetes Mellitus
  – Type 2 Diabetes Mellitus (DM)
• DM leads to significant medical complications
• Effective preventive treatments exist
  – Identifying subpopulations at risk is important
• Pre-Diabetes (PreDM) is a condition that precedes DM
  – Fasting glucose 100-125 mg/dL
• Identify sets of risk factors that significantly increase the risk of developing diabetes in a pre-diabetic population
  – Risk factors:
    • Co-morbid diseases: obesity, cardiac and vascular conditions
    • Vitals, lab test results, medications, co-morbid conditions
• 85k Mayo patients, 1999-2004, with research consent
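Design
[Figure: cohort flow diagram. Study period 1/1/1999 to 12/31/2004, follow-up through 7/2010. Counts shown: Normal 84,708, 44,156 and 43,809; PreDM 23,828 and 21,826; DM 16,664 and 19,013; transition counts 2,002, 424 and 347. The counts are consistent with 2,002 of the 23,828 PreDM patients and 347 of the 44,156 Normal patients progressing to DM during follow-up (16,664 + 2,002 + 347 = 19,013).]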
Data
PID   OB   HTN   …   Glucose   Age   FUT   DM
001   Y    Y     …   110       55    1.8   Y
002              …   115       19    2.5   N
…
(OB, HTN, … are co-morbidity columns.)
• Follow-up Time (FUT): time since PreDM Dx
• Co-morbidities: recorded before the elevated glucose measurement
  – Hypertension, hyperlipidemia, obesity, various cardiac and vascular diseases
• Age and follow-up time (FUT) are predictive of DM
  – They are not modifiable, so we need to compensate for them
• Goal is different from high-throughput phenotyping
  – None of the patients have the disease
  – Predict the risk that patients progress to DM
Computational Model
• Goal: find sets of clinical factors (level 2) that are associated with elevated risk of DM
• Level 1: unmodifiable “nuisance” factors (age, sex, …)
• Level 2: clinical factors of interest (bmi, HTN, hdl, statin, tobacco, …)
• Level 3: glucose, the “definition” of DM
• [Figure: the three levels of factors feed an unknown disease mechanism that leads to the DM Dx]
• We have to adjust for level 1 factors before we can assess the effect of level 2 factors!
Modeling Approaches
1. Logistic regression / Survival analysis
   – No ability to discover interactions
2. Decision Trees / Random Forest / Gradient-boosted Trees
   – Greedy approach to discover interactions
   – No ability to compensate for age and follow-up time (FUT)
3. Association Rule Mining (ARM)
   – Specifically designed to discover interactions
   – No ability to compensate for age and FUT
Proposed: Regression Analysis + Association Rule Mining
• Remove the effect of age, gender and FUT
• Find associations between the risk factors and the DM risk not explained by age and FUT
Simon et al. AMIA 2011
Overview
• 1st phase: regression modeling (survival model or logistic regression)
  – $O$: observed number of DM incidents
  – $E_1$: expected number of DM incidents based on age and sex only
  – 1st phase residual: $R_1 = O - E_1$
• 2nd phase: association rule mining on co-morbidities
  – $E_2$: expected number of DM incidents based on co-morbidities only (after adjusting for age and sex)
  – 2nd phase residual: $R_2 = O - (E_1 + E_2) = R_1 - E_2$
• 3rd phase: glucose
  – $E_3$: expected number of DM incidents based on glucose (after adjusting for everything else)
• Final prediction: $E = E_1 + E_2 + E_3$
[Figure: example patient table (PID, DM, Age, FUT, co-morbidities such as Obese and HTN, Glucose) annotated with the residuals $R_1$ and $R_2$]
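The residual bookkeeping across phases can be sketched with a few toy patients. This is only an illustration: the patient values are invented, and the way `phase2_expected` estimates $E_2$ (the mean 1st phase residual among patients covered by one itemset) is an assumption made for the sketch, not the method described in the cited papers.

```python
# Toy records: O is the observed DM outcome (1 = progressed) and E1 is the
# phase-1 expected risk from the regression model. All values are invented.
patients = [
    {"pid": "001", "O": 1, "E1": 0.10, "items": {"OB", "HTN"}},
    {"pid": "002", "O": 0, "E1": 0.02, "items": set()},
    {"pid": "003", "O": 1, "E1": 0.12, "items": {"OB", "HTN"}},
    {"pid": "004", "O": 0, "E1": 0.08, "items": {"OB", "HTN"}},
]

def phase1_residuals(rows):
    """R1 = O - E1 for every patient."""
    return {r["pid"]: r["O"] - r["E1"] for r in rows}

def phase2_expected(rows, r1, itemset):
    """Assumed estimator: E2 is the mean R1 of patients having the itemset,
    and 0 for everyone else."""
    covered = [r["pid"] for r in rows if itemset <= r["items"]]
    mean_r1 = sum(r1[p] for p in covered) / len(covered) if covered else 0.0
    return {r["pid"]: (mean_r1 if r["pid"] in covered else 0.0) for r in rows}

r1 = phase1_residuals(patients)
e2 = phase2_expected(patients, r1, {"OB", "HTN"})
# R2 = R1 - E2, which equals O - (E1 + E2)
r2 = {pid: r1[pid] - e2[pid] for pid in r1}

for row in patients:
    pid = row["pid"]
    print(pid, round(r1[pid], 3), round(e2[pid], 3), round(r2[pid], 3))
```

With this estimator the 2nd phase residuals of the covered patients sum to zero, which is one way to see that the itemset has absorbed the risk left unexplained by the 1st phase.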
Association Rule Mining
• Originated in the analysis of sales data
• Items (columns): co-morbid conditions
• Transactions (rows): patients
• Itemsets: sets of co-morbid conditions
• Goal: find all itemsets (sets of conditions) that frequently co-occur in patients
  – One of those conditions should be DM
• Support: number of transactions in which the itemset I appears
  – Support({OB, HTN, IHD}) = 3
• Frequent: an itemset I is frequent if support(I) > minsup
[Table: example patient-by-condition matrix (patients 001 to 005; columns OB, HTN, IHD, …, DM) with Y marking a recorded condition. Figure legend: X marks an infrequent itemset.]
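The definitions of support and frequent itemsets can be made concrete with a small sketch. The transactions below are invented (chosen so that the support of {OB, HTN, IHD} is 3, echoing the slide), and the enumeration is brute force rather than an efficient miner such as Apriori; `support` and `frequent_itemsets` are hypothetical helper names.

```python
from itertools import combinations

# Invented toy data: patient id -> set of recorded conditions (items).
transactions = {
    "001": {"OB", "HTN", "IHD", "DM"},
    "002": {"OB", "HTN", "IHD"},
    "003": {"OB", "HTN", "IHD", "DM"},
    "004": {"HTN"},
    "005": {"OB", "HL", "DM"},
}

def support(itemset, transactions):
    """Number of transactions (patients) that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def frequent_itemsets(transactions, minsup):
    """Brute-force enumeration of all itemsets with support > minsup."""
    all_items = sorted(set().union(*transactions.values()))
    result = {}
    for k in range(1, len(all_items) + 1):
        for combo in combinations(all_items, k):
            s = support(set(combo), transactions)
            if s > minsup:
                result[frozenset(combo)] = s
    return result

frequent = frequent_itemsets(transactions, minsup=2)
print(support({"OB", "HTN", "IHD"}, transactions))
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Brute force is exponential in the number of items, so it only suits toy data; real miners prune using the fact that every subset of a frequent itemset is itself frequent.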
Distributional Association Rule Mining
• Distributional Association Rules associate an itemset with a continuous outcome
[Figure: example table of patients (PID) with items A, B, C, D and a continuous outcome R (values such as .40, .38, .39, .41 and .00, .01, .02), alongside histograms of R; x-axis R from 0 to 0.45, y-axis frequency]
• Application to diabetes: find all sets I of co-morbid conditions such that the distribution of the risk R is significantly different between the patient populations with and without I
Simon et al., KDD 2011a
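A minimal sketch of comparing the outcome R between patients with and without an itemset follows. The slide does not say which test is used for this comparison, so the sketch swaps in an exact permutation test on the difference in means purely as an illustration. The R values echo numbers visible on the slide, but their assignment to the two groups is assumed, and `exact_permutation_pvalue` is a hypothetical helper name.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_pvalue(r_with, r_without):
    """Two-sided exact permutation test on the difference in mean R.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(r_with) + list(r_without)
    k = len(r_with)
    observed = abs(mean(r_with) - mean(r_without))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), k):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Assumed grouping: patients having itemset I versus patients without it.
r_with_itemset = [0.40, 0.38, 0.39, 0.41]
r_without_itemset = [0.00, 0.01, 0.02, 0.00]

p_value = exact_permutation_pvalue(r_with_itemset, r_without_itemset)
print(p_value)
```

With four patients per group there are only 70 possible splits, so the smallest attainable two-sided p-value is 2/70; realistic cohorts would need a sampled permutation test or a parametric test instead.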
Why Association Rule Mining?
Challenge: Solution
• Interactions: designed to discover associations
• Missing data: asymmetry in items
  – Absence of an item does not mean that the risk factor was not present
• Clinical question: directly extracts sets of risk factors
  – Allows for differences in modeling for prediction and for disease mechanism discovery
• Computational efficiency: efficient algorithms exist
Diabetes Disease Network Reconstruction
• Metabolic syndrome: DM + cardiac/vascular diseases
• Use Association Rule Mining to map out the relationships between DM and other metabolic syndrome diseases
  – Also measure their effect on DM progression risk
• Predictors: age, sex, FUT; co-morbid disease Dx
• 1st phase: survival model
• 2nd phase: ARM
Results
• 37 Distributional Association Rules were discovered
• 11 are significant (Poisson test; Bonferroni-adjusted 5% level)

Sup    Cases   P-value   RR     Itemset
7116   819     2.0e-7    1.32   HTN
4729   560     1.7e-8    1.45   OB
8612   964     2.6e-8    1.31   HL
1980   291     1.9e-9    1.78   HTN, OB
4171   534     1.5e-8    1.47   HTN, HL
553    85      8.3e-4    1.86   OB, IHD
2434   335     4.3e-9    1.68   OB, HL
382    66      7.7e-4    2.08   HTN, OB, IHD
1271   204     2.8e-8    1.93   HTN, OB, HL
470    76      7.2e-4    1.93   OB, IHD, HL
339    61      6.1e-4    2.15   HTN, OB, IHD, HL

Legend: OB = Obesity, HTN = Hypertension, IHD = Ischemic Heart Disease, HL = Hyperlipidemia
• Interpretation: patients with HTN, OB, IHD and HL have an age- and FUT-adjusted RR of 2.15 for DM
• Effect of age and FUT adjustment
  – The entire PreDM population has an 8.4% chance of DM
  – Without age and FUT adjustment, the above population has 61/339 = 17.9%
  – With age and FUT adjustment, $1 - (1 - 0.084)^{2.15} = 17.2\%$
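The arithmetic on this slide can be checked with a short sketch. It takes the baseline PreDM risk of 0.084 from the slide's own formula and the counts and p-values from the results table. Dividing the 5% level by the 37 discovered rules is an assumption about how the Bonferroni adjustment was applied.

```python
# Values taken from the slide for the itemset {HTN, OB, IHD, HL}.
baseline_risk = 0.084      # DM risk in the entire PreDM population
relative_risk = 2.15       # age- and FUT-adjusted RR of the itemset
cases, sup = 61, 339       # DM cases and subpopulation size

# Raw proportion of DM cases in the subpopulation (no adjustment).
unadjusted_risk = cases / sup

# Adjusted risk as on the slide: 1 - (1 - baseline)^RR.
adjusted_risk = 1 - (1 - baseline_risk) ** relative_risk

# Assumption: Bonferroni correction over the 37 discovered rules.
bonferroni_threshold = 0.05 / 37
p_values = [2.0e-7, 1.7e-8, 2.6e-8, 1.9e-9, 1.5e-8, 8.3e-4,
            4.3e-9, 7.7e-4, 2.8e-8, 7.2e-4, 6.1e-4]
all_significant = all(p < bonferroni_threshold for p in p_values)

print(f"unadjusted: {unadjusted_risk:.4f}")
print(f"adjusted:   {adjusted_risk:.4f}")
print(f"threshold:  {bonferroni_threshold:.5f}, all significant: {all_significant}")
```

Under that assumed threshold of about 0.00135, all eleven listed p-values pass, which agrees with the slide's count of significant rules.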
Results
[Figure: diabetes disease network. Each node is labeled with its condition(s), subpopulation size and (relative risk). Nodes shown: IHD, 2366 (1.16) [p-value .11]; HTN, OB, IHD, 382 (2.08); HTN, IHD, HL, 1210 (1.36) [p-value .015].]
Legend: OB = Obesity, HTN = Hypertension, IHD = Ischemic Heart Disease, HL = Hyperlipidemia
DM Progression Risk Prediction
• Predicting the probability of progression to DM within 4.5 years
• Predictors: age, sex, co-morbid Dx, laboratory results and medication orders
• 1st phase: spline logistic regression to adjust for age and sex
• 2nd phase: ARM
• 3rd phase: linear regression using glucose
Machine Learned Indices
• Comparison to machine learning methods
  – Gradient Boosted Trees (GBM): 10,000 trees
  – Linear Model (LM)
  – Random Forest (RF): 275-325 trees
  – Association Rule Mining (ARM): 100 rules
• 10-fold CV repeated 50 times
• Same predictive performance but more interpretable model
[Figure: C-statistic by method]
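The evaluation protocol named on this slide (C-statistic under 10-fold cross-validation repeated 50 times) can be sketched with the standard library alone. This is not the authors' evaluation code; `c_statistic` and `repeated_kfold` are hypothetical helper names, and the pairwise loop is quadratic, so it only suits small examples.

```python
import random

def c_statistic(scores, outcomes):
    """Probability that a random case scores higher than a random non-case.

    Ties count as half. For a binary outcome this equals the area under
    the ROC curve.
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    concordant = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))

def repeated_kfold(n, k=10, repeats=50, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield train, test

# Toy check: one of the four case/non-case pairs is ordered the wrong way.
print(c_statistic([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
print(sum(1 for _ in repeated_kfold(100)))
```

Each model would be fit on `train` and scored on `test` for every split, and the resulting C-statistics summarized per method.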
Traditional Indices
• Performance similar to San Antonio (Refit)
• ARM readily provides a justification as to why the risk is high
• Proposed method places the patient on a path in the diabetes network
Clinical Validation
• Work in progress
• Apply the rules to both normo-glycemic and PreDM patients
• Each point is a rule
• Patterns are similar for lower-risk subpopulations
• For high-RR rules, the risk of DM is higher for PreDM patients
High-Throughput Phenotyping (HTP)
• We can use the association rules as an HTP algorithm
  – Discover the rules with ARM
  – Validate the rules with an expert clinician

High-throughput Phenotyping            DM Risk Assessment
Does the patient currently have DM?    Will the patient progress to DM in 4.5 yrs? (Interventions are possible)
Binary decision (DM or not)            Probability of diabetes (can be dichotomized into DM/no DM)
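One way to picture rules acting as a phenotyping-style algorithm is to match a patient's conditions against the rule itemsets, turn the matched rule into a probability, and dichotomize it. The sketch below is hypothetical throughout: the four rules and their RRs come from the results table and the 0.084 baseline from the results slide, but selecting the highest-RR matching rule, the `1 - (1 - baseline)^RR` conversion for a single patient, and the 0.15 cut-off are assumptions made for illustration.

```python
# (itemset, RR) pairs copied from the results table; only a subset is used.
RULES = [
    (frozenset({"OB"}), 1.45),
    (frozenset({"HTN", "OB"}), 1.78),
    (frozenset({"HTN", "OB", "IHD"}), 2.08),
    (frozenset({"HTN", "OB", "IHD", "HL"}), 2.15),
]
BASELINE_RISK = 0.084   # PreDM population risk from the results slide
THRESHOLD = 0.15        # hypothetical cut-off for dichotomizing

def best_rule(conditions):
    """Return the matching (itemset, RR) with the highest RR, or None."""
    matches = [(items, rr) for items, rr in RULES if items <= conditions]
    return max(matches, key=lambda m: m[1]) if matches else None

def progression_risk(conditions):
    """Assumed conversion of the matched RR into a probability of DM."""
    rule = best_rule(conditions)
    rr = rule[1] if rule else 1.0
    return 1 - (1 - BASELINE_RISK) ** rr

def flag_high_risk(conditions):
    """Dichotomize the probability into high risk / not high risk."""
    return progression_risk(conditions) >= THRESHOLD

for patient in ({"HTN", "OB", "IHD", "HL"}, {"OB"}, set()):
    print(sorted(patient), round(progression_risk(patient), 3), flag_high_risk(patient))
```

The matched itemset doubles as the justification for the flag, which is the interpretability advantage the slides attribute to ARM.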
Acknowledgment
• Peter W. Li, PhD, Health Sciences Research, Mayo Clinic, MN
• Pedro J. Caraballo, MD, Internal Medicine, Mayo Clinic, MN
• M. Regina Castro, MD, Division of Endocrinology and Metabolism, Mayo Clinic, MN
• Terry M. Therneau, PhD, Health Sciences Research, Mayo Clinic, MN
• Vipin Kumar, PhD, Department of Computer Science, University of Minnesota
References
• Vemuri P, Simon G, Kantarci K, Whitwell J, Senjem M, Przybelski S, Gunter J, Josephs K, Knopman D, Boeve B, Ferman T, Dickson D, Parisi J, Petersen R, Jack C. Antemortem differential diagnosis of dementia pathology using structural MRI: Differential-STAND. NeuroImage, 2010.
• Caraballo P, Li P, Simon G. Use of Association Rule-mining to Assess Diabetes Risk in Patients with Impaired Fasting Glucose. AMIA, 2011.
• Simon G, Kumar V, Li P. A Simple statistical model and association rule filtering. In Proc. ACM International Conference on Data Mining and Knowledge Discovery (KDD), 2011.
• Simon G, Li P, Jack C, Vemuri P. Understanding Atrophy Trajectories in Alzheimer’s Disease Using Association Rules on MRI images. In Proc. ACM International Conference on Data Mining and Knowledge Discovery (KDD), 2011.