Bayesian Biosurveillance Using Multiple Data Streams
Weng-Keen Wong, Greg Cooper, Denver Dash*, John Levander, John Dowling, Bill Hogan, Mike Wagner
RODS Laboratory, University of Pittsburgh
*Intel Research
This research was supported in part by grants from the National Science Foundation (IIS-0325581), the Defense Advanced Research Projects Agency (F30602-01-20550), and the Pennsylvania Department of Health (ME-01-737).
2004 University of Pittsburgh
Over-the-Counter (OTC) Data Being Collected by the National Retail Data Monitor (NRDM)
• 19,000 stores
• 50% market share nationally
• >70% market share in large cities
ED Chief Complaint Data Being Collected by RODS
Chief Complaint ED Records for Allegheny County
Date / Time Admitted | Age | Gender | Home Zip | Work Zip | Chief Complaint
Nov 1, 2004 3:02 | 20-30 | Male | 15213 | | Shortness of breath
Nov 1, 2004 3:09 | 70-80 | Female | 15132 | 15213 | Fever
: | : | : | : | : | :
Objective
Using the ED and OTC data streams, detect a disease outbreak in a given region as quickly and accurately as possible.
Our Approach
Population-wide ANomaly Detection and Assessment (PANDA)
• A unique detection algorithm that models each individual in the population
• Combines ED and OTC data streams
• Focuses on detecting an outdoor aerosolized release of an anthrax-like agent in Allegheny County
PANDA: Population-wide Anomaly Detection and Assessment
Uses a causal Bayesian network
[Figure: Bayesian network with nodes Home Location of Person, Location of Anthrax Release, Anthrax Infection of Person, Visit of Person to ED]
Bayesian network: a graphical model representing the joint probability distribution of a set of random variables
PANDA: Population-wide Anomaly Detection and Assessment
Uses a causal Bayesian network
[Figure: Bayesian network with nodes Home Location of Person, Location of Anthrax Release, Anthrax Infection of Person, Visit of Person to ED]
The arrows convey conditional independence relationships among the variables. They also represent causal relationships.
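As a rough illustration of how such a network encodes a joint distribution, the sketch below factors the four variables on this slide into a product of local conditional probabilities and computes a posterior over the release location by enumeration. The arrow structure (home location and release location as parents of infection, infection as parent of ED visit), the zip codes, and every probability value are assumptions made for illustration; they are not PANDA's actual structure or parameters.

```python
# Hypothetical priors; all numbers are illustrative only.
P_HOME = {"15213": 0.5, "15132": 0.5}                      # P(home location)
P_RELEASE = {"none": 0.98, "15213": 0.01, "15132": 0.01}   # P(release location)


def p_infection(infected, home, release):
    """P(infection | home, release): assumed nonzero only if the release is in the home zip."""
    p_true = 0.8 if release == home else 0.0
    return p_true if infected else 1.0 - p_true


def p_ed_visit(visit, infected):
    """P(ED visit | infection): assumed higher for infected people."""
    p_true = 0.6 if infected else 0.01
    return p_true if visit else 1.0 - p_true


def joint(home, release, infected, visit):
    """Joint probability as the product of each node's conditional given its parents."""
    return (P_HOME[home] * P_RELEASE[release]
            * p_infection(infected, home, release)
            * p_ed_visit(visit, infected))


def posterior_release(home, visit):
    """P(release location | home, ED visit), summing out the unobserved infection node."""
    unnorm = {r: sum(joint(home, r, i, visit) for i in (True, False))
              for r in P_RELEASE}
    total = sum(unnorm.values())
    return {r: v / total for r, v in unnorm.items()}


post = posterior_release("15213", True)
```

With these made-up numbers, observing an ED visit by a resident of zip 15213 raises the posterior probability of a release in that zip above its prior, which is the kind of evidence propagation the network is meant to support.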
Outline
1. Introduction
2. Model
3. Inference
4. Conclusions
The Generic PANDA Model for Non-Contagious Diseases
[Figure: nodes Population Risk Factors, Population Disease Exposure (PDE), Population-Wide Evidence, and Person Model subnetworks]
A Special Case of the Generic Model
[Figure: nodes Anthrax Release, Location of Release, Time of Release, OTC Sales for Region, and Person Model subnetworks]
Each person in the population is represented as a subnetwork in the overall model.
The Person Model
[Figure: person-model Bayesian network. Node labels: Location of Release; Time of Release; Age Decile; Gender; Home Zip; Anthrax Infection; Non-ED Acute Respiratory Infection; ED Acute Respiratory Infection; Other ED Disease; Respiratory from Anthrax; Respiratory CC from Other; Respiratory CC; Respiratory CC When Admitted; ED Admit from Anthrax; ED Admit from Other; ED Admission; Daily OTC Purchase; Last 3 Days OTC Purchase; OTC Sales for Region]
Why Population Based?
1. Representational power
   • Background knowledge about spatial, temporal, demographic, and symptom information can be coherently represented in a single model
   • Spatial, temporal, demographic, and symptom evidence can be combined to derive a posterior probability of a disease outbreak
2. Representational flexibility
   • New types of knowledge and evidence can be readily incorporated into the model
Hypothesis: A population-based approach will achieve better detection performance than non-population-based approaches.
Computational Cost of a Population-Wide Approach?
~1.4 million people in Allegheny County, Pennsylvania
Equivalence Classes
The ~1.4 M people in the modeled population can be partitioned into approximately 24,240 equivalence classes.
The Person Model
[Figure: person-model Bayesian network, with the same node labels as the earlier Person Model slide]
Equivalence Class Example:
Age Decile | Gender | Home Zip | Respiratory Chief Comp. | Date Admitted
20-30 | Male | 15213 | Yes | Today
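The grouping idea can be sketched in a few lines of Python: people who share the same values of the attributes in the example row are indistinguishable to the model, so they can be counted once per class rather than once per person. The `Person` record, the placeholder values used for people with no ED visit, and the tiny sample population are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Person(NamedTuple):
    """Attributes taken from the equivalence class example row (hypothetical encoding)."""
    age_decile: str
    gender: str
    home_zip: str
    respiratory_cc: str   # "Yes", "No", or "N/A" when there was no ED visit (assumed coding)
    date_admitted: str    # "Today", an earlier day, or "Never" (assumed coding)


def equivalence_classes(people):
    """Map each distinct attribute combination to the number of people sharing it."""
    return Counter(people)


# A toy population: three people fall into two equivalence classes.
population = [
    Person("20-30", "Male", "15213", "Yes", "Today"),
    Person("20-30", "Male", "15213", "Yes", "Today"),
    Person("70-80", "Female", "15132", "N/A", "Never"),
]
classes = equivalence_classes(population)
```

Inference can then iterate over the classes and their counts instead of over individuals, which is what makes a population of about 1.4 million people tractable when only about 24,240 classes are distinct.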
Inference
[Figure: nodes Anthrax Release, Location of Release, Time of Release, OTC Sales for Region, and Person Model subnetworks]
Derive P(Anthrax Release = true | OTC Sales Data & ED Data)
Inference
AR = Anthrax Release
PDE = Population Disease Exposure
ED = ED Data
OTC = OTC Counts
Key term in deriving P(AR | OTC, ED):
P(OTC, ED | PDE) = P(OTC | ED, PDE) P(ED | PDE)
• P(OTC | ED, PDE): contribution of OTC counts
• P(ED | PDE): contribution of ED data
Details in: Cooper GF, Dash DH, Levander J, Wong W-K, Hogan W, Wagner M. Bayesian Biosurveillance of Disease Outbreaks. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2004.
Inference
AR = Anthrax Release
PDE = Population Disease Exposure
ED = ED Data
OTC = OTC Counts
Key term in deriving P(AR | OTC, ED):
P(OTC, ED | PDE) = P(OTC | ED, PDE) P(ED | PDE)
The focus of the remainder of this talk: P(OTC | ED, PDE)
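To make the role of the key term concrete, here is a minimal sketch of how the two likelihood factors could be combined with a prior through Bayes' rule. It assumes a finite set of PDE hypotheses and works in log space for numerical stability; the hypothesis names, prior, and likelihood values are invented for illustration, and the actual derivation is the one in the cited paper.

```python
import math


def posterior_over_pde(prior, log_lik_ed, log_lik_otc_given_ed):
    """Return P(PDE | OTC, ED) for each hypothesis.

    Uses P(OTC, ED | PDE) = P(OTC | ED, PDE) * P(ED | PDE), then Bayes' rule.
    All three arguments are dicts keyed by hypothesis name.
    """
    log_unnorm = {
        h: math.log(prior[h]) + log_lik_ed[h] + log_lik_otc_given_ed[h]
        for h in prior
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_unnorm.values())
    weights = {h: math.exp(v - m) for h, v in log_unnorm.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# Hypothetical numbers, for illustration only.
prior = {"no_release": 0.99, "release": 0.01}
log_lik_ed = {"no_release": math.log(1e-4), "release": math.log(5e-4)}
log_lik_otc = {"no_release": math.log(0.002), "release": math.log(0.01)}

posterior = posterior_over_pde(prior, log_lik_ed, log_lik_otc)
```

Under this reading, P(AR = true | OTC, ED) would be the total posterior mass of the PDE hypotheses that involve a release.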
The PANDA OTC Model
Model the OTC purchases for each equivalence class $E_i$ as a binomial distribution.
$E_i \sim \mathrm{Binomial}(N_{E_i}, P_{E_i})$
The PANDA OTC Model
Model the OTC purchases for each equivalence class $E_i$ as a binomial distribution.
$E_i \sim \mathrm{Binomial}(N_{E_i}, P_{E_i})$
• $N_{E_i}$: number of people in equivalence class $E_i$
• $P_{E_i}$: probability of an OTC cough medication purchase during the previous 3 days by each person in equivalence class $E_i$
The PANDA OTC Model
Model the OTC purchases for each equivalence class $E_i$ as a binomial distribution.
Approximate the binomial distribution as a normal distribution.
$E_i \sim \mathrm{Binomial}(N_{E_i}, P_{E_i}) \approx \mathrm{Normal}(\mu_{E_i}, \sigma^2_{E_i})$
The PANDA OTC Model
Model the OTC purchases for each equivalence class $E_i$ as a binomial distribution.
Approximate the binomial distribution as a normal distribution.
$E_i \sim \mathrm{Binomial}(N_{E_i}, P_{E_i}) \approx \mathrm{Normal}(\mu_{E_i}, \sigma^2_{E_i})$
$\mu_{E_i} = N_{E_i} \times P_{E_i}$
$\sigma^2_{E_i} = N_{E_i} \times P_{E_i} \times (1 - P_{E_i})$
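The moment matching on this slide is easy to check numerically. The sketch below computes the mean and variance of the approximating normal for one equivalence class and compares the exact binomial probability with the normal density at a single count. The class size of 1,000 and purchase probability of 0.1 are hypothetical values, not estimates from the talk.

```python
import math


def normal_approximation(n, p):
    """Mean and variance of the normal that approximates Binomial(n, p)."""
    if n < 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n >= 0 and 0 <= p <= 1")
    mean = n * p
    variance = n * p * (1.0 - p)
    return mean, variance


def binomial_pmf(k, n, p):
    """Exact binomial probability; fine for small illustrative n, not tuned for huge n."""
    return math.comb(n, k) * (p ** k) * ((1.0 - p) ** (n - k))


def normal_pdf(x, mean, variance):
    """Density of Normal(mean, variance) at x (second parameter is the variance)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


# Hypothetical equivalence class: 1,000 people, each buying with probability 0.1.
mu, sigma2 = normal_approximation(1000, 0.1)
exact = binomial_pmf(100, 1000, 0.1)
approx = normal_pdf(100, mu, sigma2)
```

For a class of this size the two values agree closely, which is the usual justification for replacing the binomial with a normal when the class is large and the purchase probability is not extreme.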
The PANDA OTC Model
P(OTC sales = X | ED, PDE)
Recall that: P(OTC, ED | PDE) = P(OTC | ED, PDE) P(ED | PDE)
Example
Equivalence Class 1 ~ Normal(100, 100)
Age Decile | Gender | Home Zip | Respiratory Chief Comp. | Date Admitted
50-60 | Male | 15213 | Yes | Today
Example
Equivalence Class 1 ~ Normal(100, 100)
Equivalence Class 2 ~ Normal(150, 225)
Age Decile | Gender | Home Zip | Respiratory Chief Comp. | Date Admitted
50-60 | Male | 15213 | Yes | Today
50-60 | Female | 15213 | Yes | Today
Example
Equivalence Class 1 ~ Normal(100, 100)
Equivalence Class 2 ~ Normal(150, 225)
Age Decile | Gender | Home Zip | Respiratory Chief Comp. | Date Admitted
50-60 | Male | 15213 | Yes | Today
50-60 | Female | 15213 | Yes | Today
If these were the only 2 equivalence classes in the county, then County Cough & Cold OTC ~ Normal(100+150, 100+225)
Example
Now suppose 260 units are sold in the county.
P(OTC Sales = 260 | ED Data, PDE) = Normal(260; 250, 325) = 0.001231
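The example can be replayed in a short sketch: the per-class normals are summed (means add and, assuming the classes are independent, variances add) and the resulting county-level density is evaluated at the observed count. The sketch reads the second parameter of Normal(·, ·) as a variance, since that is what makes the parameters add as shown. Under that reading the density at 260 comes out near 0.019 rather than the 0.001231 printed on the slide, so the slide's figure may rest on a different parameterization or scaling; the sketch does not try to reproduce it.

```python
import math


def sum_independent_normals(components):
    """Combine (mean, variance) pairs of independent normals into one (mean, variance)."""
    mean = sum(m for m, _ in components)
    variance = sum(v for _, v in components)
    return mean, variance


def normal_pdf(x, mean, variance):
    """Density of Normal(mean, variance) at x (second parameter is the variance)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


# The two equivalence classes from the example slides.
county_mean, county_var = sum_independent_normals([(100.0, 100.0), (150.0, 225.0)])

# Likelihood of observing 260 units sold in the county.
density = normal_pdf(260.0, county_mean, county_var)
```

In the full model the same sum would run over all equivalence classes, with each class's mean and variance depending on the PDE hypothesis and the ED data being conditioned on.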
Inference Timing
Machine: P4, 3 Gigahertz, 2 GB RAM
Model | Initialization Time (seconds) | Each hour of data (seconds)
ED model | 55 | 5
ED and OTC model | 229 | 5
Challenges in Population-Wide Modeling Include …
• Obtaining good parameter estimates to use in modeling (e.g., the probability of an OTC cough medication purchase given an acute respiratory illness)
• Modeling time and space in a way that is both useful and computationally tractable
• Modeling contagious diseases
Conclusions
• PANDA is a multivariate algorithm that can combine multiple data streams
• Modeling each individual in the population is computationally feasible
• An evaluation of this approach using simulations is in progress
Thank you
http://www.cbmi.pitt.edu/panda/