i2b2 Rheumatoid Arthritis DBP
Defining RA in the electronic health record for future studies
Elizabeth Karlson, MD
Associate Professor of Medicine
Harvard Medical School
Brigham and Women’s Hospital
Background: Partners Resources
• i2b2: “Informatics for Integrating Biology and the Bedside”
• RPDR: “Research Patient Data Repository”
• Natural Language Processing (HiTEX)
• Gold standard dataset:
  – Training set: 500 manual chart reviews
  – Validation set: 400 manual chart reviews
Coded data
• ICD-9 codes for RA
• ICD-9 codes for related phenotypes
  – Lupus (SLE), psoriatic arthritis (PsA), juvenile inflammatory arthritis (JIA)
• Lab results for RA-related antibodies
  – Rheumatoid factor (RF), anti-CCP
• Medications
  – Physician entry, e-scripts
NLP Concepts
NLP queries
– Rheumatoid arthritis
– RA-related antibodies
  • Anti-CCP/RF/seropositive
  • Result coded as positive/negative
– RA medications
  • Coded as any mention
– Radiographs: RA erosions
  • Coded as any erosion
Approach to develop RA cohort
Classification algorithm
Step 1: Develop gold standard training set
Step 2: Identify variables important for predicting RA
Step 3: Develop algorithm
Chart review results
• RA Mart, N=32,000
  – ICD9 = 714.xxx OR
  – CCP test ordered
• Manual chart review for 500 patients
  – 20% validation rate
  – Definite RA = 100
  – Possible/no RA = 400
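The screening rule and the validation rate on this slide can be sketched in a few lines. This is only an illustration: the record layout, the patient IDs, and the helper name `in_ra_mart` are hypothetical and do not come from the RPDR.

```python
# Hypothetical record layout: a list of ICD-9 codes per patient plus a flag
# for whether an anti-CCP test was ever ordered.
def in_ra_mart(icd9_codes, ccp_test_ordered):
    """Screening rule from the slide: any ICD-9 714.xxx code OR a CCP test ordered."""
    has_ra_code = any(code.startswith("714") for code in icd9_codes)
    return has_ra_code or ccp_test_ordered


# Made-up example patients, for illustration only.
patients = {
    "p1": (["714.0", "401.9"], False),  # RA billing code, no CCP test
    "p2": (["250.00"], True),           # no RA code, CCP test ordered
    "p3": (["710.0"], False),           # neither criterion met
}
mart = sorted(pid for pid, (codes, ccp) in patients.items() if in_ra_mart(codes, ccp))

# Share of reviewed charts confirmed as definite RA (counts from the slide).
definite_ra, reviewed = 100, 500
confirmed_fraction = definite_ra / reviewed
```

The rule is deliberately broad, which is consistent with only 20% of the reviewed charts being confirmed as definite RA.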
Comparison of NLP to manual chart review
• Precision of NLP queries
  – Methotrexate
  – Etanercept
  – CCP+
  – Seropositive
  – Erosion
100%  98.7%  96%  88%
Approach to develop RA cohort
Classification algorithm
Step 2: Define variables (Vivian Gainer, Sergey Goryachev, Qing Zeng-Treitler, Shawn Murphy)
• Codified data
  – ICD9 billing codes
  – Electronic medication prescription
  – CCP, RF lab results
• Narrative data extracted using natural language processing (NLP), i.e., from physician notes, radiology reports
  – Erosions
  – RF positive, CCP positive, seropositive
  – RA medications
Approach to develop RA cohort
Classification algorithm
Step 3: Develop algorithm (Tianxi Cai)
• Penalized logistic regression with adaptive LASSO
• Parsimonious predictors selected based on BIC
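For reference, a textbook form of the two ingredients named on this slide is written out below. The slide does not give the exact variant, the initial estimator, or the tuning details used for the RA algorithm, so the symbols here (weights w_j, exponent γ, initial estimate β̃, tuning parameter λ) are assumptions taken from the standard formulation, with y_i the chart-review RA label and x_i the codified and NLP predictors for patient i.

```latex
% Logistic log-likelihood for n patients (intercept absorbed into x_i)
\ell(\beta) = \sum_{i=1}^{n} \left[ y_i\, x_i^{\top}\beta
              - \log\!\left(1 + e^{x_i^{\top}\beta}\right) \right]

% Adaptive LASSO: L1 penalty with data-driven weights from an initial fit
\hat{\beta}_{\lambda} = \arg\min_{\beta}
  \left\{ -\ell(\beta) + \lambda \sum_{j=1}^{p} w_j\, |\beta_j| \right\},
\qquad w_j = \frac{1}{|\tilde{\beta}_j|^{\gamma}}, \quad \gamma > 0

% BIC used to choose \lambda; df_\lambda = number of nonzero coefficients
\mathrm{BIC}(\lambda) = -2\,\ell(\hat{\beta}_{\lambda}) + \mathrm{df}_{\lambda}\,\log n
```

Predictors with a small initial estimate receive a large weight and are more likely to be shrunk to exactly zero, which is what yields a parsimonious variable set once λ is chosen to minimize BIC.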
Model                          RA     PPV (%)   Sensitivity (%)   Difference in PPV
Algorithms
  Narrative + Codified         3585   94        63                reference
  Codified only                3046   88        51                6
  NLP only                     3341   89        56                5
Published administrative codified criteria
  ≥ 3 ICD9 RA                  7960   56        80                38
  ≥ 1 ICD9 RA + med            7799   45        66                49
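The "Difference in PPV" column is the PPV of the narrative + codified algorithm minus the PPV of each comparison model. The sketch below recomputes that column from the values on the slide and also shows how PPV and sensitivity are obtained from chart-review counts. The helper names and the short model labels are illustrative, not taken from the source.

```python
def ppv(true_pos, false_pos):
    """Positive predictive value: confirmed cases among those the model flags."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Share of true cases that the model flags."""
    return true_pos / (true_pos + false_neg)


# (patients classified as RA, PPV %, sensitivity %) as listed on the slide.
models = {
    "Narrative + Codified": (3585, 94, 63),
    "Codified only": (3046, 88, 51),
    "NLP only": (3341, 89, 56),
    ">=3 ICD9 RA": (7960, 56, 80),
    ">=1 ICD9 RA + med": (7799, 45, 66),
}

reference = "Narrative + Codified"
reference_ppv = models[reference][1]

# Percentage-point drop in PPV relative to the reference algorithm.
ppv_difference = {
    name: reference_ppv - model_ppv
    for name, (_, model_ppv, _) in models.items()
    if name != reference
}
```

The published code-count criteria flag more than twice as many patients as the reference algorithm, but at a PPV of 56% or lower.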
Top 5 predictive variables for RA

Variable                    Standardized regression coefficient   Standard error
NLP rheumatoid arthritis    1.11                                  0.48
NLP seropositive            0.74                                  0.26
ICD9 RA normalized          0.71                                  0.23
ICD9 RA                     0.66                                  0.44
NLP erosions                0.46                                  0.29
i2b2 RA cohort

Characteristics    i2b2 RA, n=3,585   CORRONA*, n=7,971
Age, mean (SD)     57.5 (17.5)        58.9 (13.4)
Women (%)          79.9               74.5
Anti-CCP+ (%)      63                 N/A
RF+ (%)            74.4               72.1
Erosions (%)       59.2               52.8
MTX use (%)        59.5               52.8
TNFi use (%)       32.6               22.6

*Consortium of Rheumatology Researchers of North America
Liao, et al., Arthritis Care & Research 2010
i2b2 Virtual RA Cohort Studies
• Case-control cohort
  – ~4,000 RA cases
  – ~13,000 matched non-RA controls
    • Age, gender, race and health care utilization
• Samples collected from 1500 cases/1500 controls for genotyping
  – Genetic risk score predicts RA with same magnitude as in GWAS (Kurreeman, 2010)
  – CAD outcomes in RA cases being validated in i2b2
• Pharmacogenetics Research Network (PGRN)
i2b2 RA Project:
• Selected codified data from RPDR
• Performed NLP queries for RA features
• Developed algorithm based on: coded + NLP data
Liao, 2010
PGRN Methods:
• Select codified data from RPDR (meds)
• Perform NLP queries for RA disease activity features
• Develop algorithm(s) based on: meds + NLP data
PGRN Specific Aims
• Aim 1: Define RA disease activity level in the EMR
• Aim 2: Develop an algorithm to predict RA disease activity from EMR data
• Aim 3: Define temporal relations between RA medications and disease activity to define treatment response in RA
Background
• In RA, disease activity score (DAS28) is considered the gold standard tool to evaluate disease activity and response to treatment in clinical practice
• DAS28 has 2 components:
  – Disease activity level
  – Change in disease activity level
• Disease activity level scored as low, moderate, high
• Disease activity change scored as low, moderate, high
Van Gestel AM et al. Arthritis Rheum 1996; 39:34-40
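As a concrete illustration of how a DAS28 value maps to an activity level, the sketch below uses the widely cited four-variable DAS28-ESR formula and the commonly used cut-offs (below 2.6 remission, up to 3.2 low, up to 5.1 moderate, above 5.1 high). Neither the formula nor the cut-offs appear on these slides, so treat both as assumptions; the function and parameter names are illustrative.

```python
import math


def das28_esr(tender28, swollen28, esr_mm_hr, global_health_0_100):
    """Four-variable DAS28-ESR (assumed standard form; esr_mm_hr must be > 0)."""
    return (
        0.56 * math.sqrt(tender28)      # tender joint count, 28 joints
        + 0.28 * math.sqrt(swollen28)   # swollen joint count, 28 joints
        + 0.70 * math.log(esr_mm_hr)    # natural log of ESR in mm/hr
        + 0.014 * global_health_0_100   # patient global health, 0-100 scale
    )


def activity_level(score):
    """Map a DAS28 score to a category using commonly cited cut-offs."""
    if score < 2.6:
        return "remission"
    if score <= 3.2:
        return "low"
    if score <= 5.1:
        return "moderate"
    return "high"


# Made-up example visit: 4 tender joints, 2 swollen joints, ESR 20, global health 40.
score = das28_esr(tender28=4, swollen28=2, esr_mm_hr=20, global_health_0_100=40)
level = activity_level(score)
```

Because the score needs joint counts, a lab value, and a patient-reported measure from the same visit, it is rarely available as a single coded field, which is the motivation for extracting its components from notes.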
Research Methods
• Construct a virtual cohort of RA patients (N=5906)
• Review charts for disease activity (document level)
  – Remission
  – Low
  – Moderate
  – High
  – Indeterminate
  – Remission/Low vs. High/Moderate
• Annotate charts for disease activity features (Knowtator)
  – Disease_disorder
  – Symptoms (reported pain, stiffness, swelling)
  – Signs (objective tenderness, limited range of motion, synovitis)
  – Anatomic site (relations with signs and symptoms)
  – RA medication signature
  – RA labs, level of inflammation (CRP, ESR)
  – Patient functioning (activities of daily living)
NLP Methods
• Move from keyword matching in i2b2 to ontology mapping in PGRN
• Customize cTAKES for
  – RA medications
  – RA anatomic sites
• Find relations between entities
• Define new modules
  – RA medication changes (start/stop)
  – Reasons to stop medications
  – Lab values
  – Patient functioning status
NLP Analytic Approaches
1 - Internal gold standard datasets
  – N=200 BWH annotated notes
  – N=200 MGH annotated notes
2 - Analyses
  – Study whether MD summary (1-3 sentences) predicts disease activity
  – SVM: construct vectors based on features and relations to predict disease activity
  – Bag of concepts to predict disease activity
3 - External gold standard datasets
  – DAS28 scores from standardized tool at MGH matched to clinical note
  – DAS28 scores from BRASS matched to clinical note
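A bag-of-concepts representation turns each annotated note into a vector of concept counts, which can then be fed to a classifier such as an SVM. The sketch below shows only the vectorization step. The concept labels and note IDs are made up for illustration and are not the project's annotation schema.

```python
from collections import Counter


def bag_of_concepts(note_concepts, vocabulary):
    """Count how often each vocabulary concept occurs in one note."""
    counts = Counter(note_concepts)
    return [counts[concept] for concept in vocabulary]


# Hypothetical concept annotations extracted from two clinical notes.
notes = {
    "note_a": ["synovitis", "swelling", "swelling", "elevated_crp"],
    "note_b": ["no_synovitis", "normal_crp"],
}

# Fixed, sorted vocabulary so every note maps to a vector of the same length.
vocabulary = sorted({concept for concepts in notes.values() for concept in concepts})
vectors = {
    note_id: bag_of_concepts(concepts, vocabulary)
    for note_id, concepts in notes.items()
}
```

Negated findings are kept as separate concepts here ("no_synovitis"), since collapsing them with their positive form would blur the difference between active and quiescent disease.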
Future work
• Define temporal relations between anti-TNF medication use (e.g., new starts) and pre- and post-start disease activity to define response to therapy
  – Construct disease activity timeline (patient level)
  – Construct medication timeline (patient level)
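One way to picture the two patient-level timelines is to line up a medication start date against dated disease activity assessments and compare the last assessment before the start with the first one after it. This is a minimal sketch under assumed definitions: the ordering of levels, the rule for "improved", and all names and dates are hypothetical, and the slides do not specify how response will actually be defined.

```python
from datetime import date

# Assumed ordering of document-level disease activity categories.
LEVELS = {"remission": 0, "low": 1, "moderate": 2, "high": 3}


def activity_around_start(start, assessments):
    """Return (last level before start, first level after start).

    assessments is a list of (date, level) pairs; assessments dated exactly
    on the start date are ignored because their ordering is ambiguous.
    """
    ordered = sorted(assessments)
    before = [level for day, level in ordered if day < start]
    after = [level for day, level in ordered if day > start]
    return (before[-1] if before else None, after[0] if after else None)


def improved(pre, post):
    """True if activity dropped after the start; None if either side is missing."""
    if pre is None or post is None:
        return None
    return LEVELS[post] < LEVELS[pre]


# Made-up patient timeline, for illustration only.
anti_tnf_start = date(2010, 3, 1)
assessments = [
    (date(2010, 6, 1), "low"),
    (date(2009, 12, 1), "moderate"),
    (date(2010, 2, 1), "high"),
]
pre, post = activity_around_start(anti_tnf_start, assessments)
responded = improved(pre, post)
```

A real definition would likely also constrain how long before and after the start an assessment may fall, since a visit a few days after a new start says little about response.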
Use NLP to define temporal sequence of medication start and adverse event
Questions?