Bayesian Network Anomaly Pattern Detection for Disease Outbreaks

Weng-Keen Wong (Carnegie Mellon University)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)

The Problem
Suppose we have real-time access to Emergency Department data from hospitals around a city (with patient confidentiality preserved):

| Primary Key | Date | Time | Hospital | ICD9 | Prodrome | Gender | Age | Home Location | Work Location | Many more… |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 6/1/03 | 9:12 | 1 | 781 | Fever | M | 20s | NE | ? | … |
| 101 | 6/1/03 | 10:45 | 1 | 787 | Diarrhea | F | 40s | NE | NE | … |
| 102 | 6/1/03 | 11:03 | 1 | 786 | Respiratory | F | 60s | NE | N | … |
| : | : | : | | | | | | | | |

From this data, can we detect if a disease outbreak is happening? How early can we detect it?
The question we’re really asking: what’s strange about recent events?

Traditional Approaches
What about using traditional anomaly detection?
• Typically assumes the data is generated by a model
• Finds individual data points that have low probability with respect to this model
• These outliers have rare attributes or combinations of attributes
• We need to identify anomalous patterns, not isolated data points

Traditional Approaches
What about monitoring aggregate daily counts of certain attributes?
• We’ve now turned multivariate data into univariate data
• Lots of algorithms have been developed for monitoring univariate data:
– Time series algorithms
– Regression techniques
– Statistical Quality Control methods
• Need to know a priori which attributes to form daily aggregates for!
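To make the univariate approach concrete, here is a minimal sketch of a Statistical Quality Control style check on one daily count. The function name `control_chart_alarm`, the threshold multiplier `k`, and the daily counts are hypothetical illustrations, not part of the original slides.

```python
from statistics import mean, stdev

def control_chart_alarm(history, today, k=3.0):
    """Return True if today's count exceeds mean + k * stdev of the history."""
    upper_limit = mean(history) + k * stdev(history)
    return today > upper_limit

# Hypothetical daily counts of respiratory cases over the previous two weeks.
respiratory_counts = [31, 28, 35, 30, 29, 33, 27, 32, 30, 34, 29, 31, 28, 33]

print(control_chart_alarm(respiratory_counts, today=32))
print(control_chart_alarm(respiratory_counts, today=50))
```

Note that the attribute being counted (respiratory cases) had to be chosen in advance, which is exactly the limitation raised in the last bullet.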

Traditional Approaches
What if we don’t know what attributes to monitor?
What if we want to exploit the spatial, temporal and/or demographic characteristics of the epidemic to detect the outbreak as early as possible?

One Possible Approach
Recent records (from today):

| Primary Key | Date | Time | Gender | Age | … |
|---|---|---|---|---|---|
| 100 | 8/24/03 | 9:12 | M | Child | … |
| 101 | 8/24/03 | 10:45 | M | Senior | … |
| : | : | : | : | : | : |

Baseline records (from 7 days ago):

| Primary Key | Date | Time | Gender | Age | … |
|---|---|---|---|---|---|
| 2164 | 8/17/03 | 13:05 | F | Senior | … |
| 2165 | 8/17/03 | 13:57 | F | Senior | … |
| : | : | : | : | : | : |

Both sets combined, with a Source label:

| Primary Key | Date | Time | … | Source |
|---|---|---|---|---|
| 100 | 8/24/03 | 9:12 | … | Recent |
| 101 | 8/24/03 | 10:45 | … | Recent |
| : | : | : | : | : |
| 2164 | 8/17/03 | 13:05 | … | Baseline |
| 2165 | 8/17/03 | 13:57 | … | Baseline |
| : | : | : | : | : |

Idea: Can use association rules to find patterns in today’s records that weren’t there in past data.
Find which rules predict unusually high proportions in recent records when compared to the baseline, e.g.
52/200 records from “recent” have Gender = Male AND Age = Senior
90/180 records from “baseline” have Gender = Male AND Age = Senior
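As an illustration of how a rule is compared across the two sets, the sketch below counts the records matching a rule in each. The helper names `matches` and `rule_counts` and the toy records are assumptions made for this example; the records are constructed only to reproduce the counts quoted above.

```python
def matches(record, rule):
    """True if the record satisfies every attribute = value component of the rule."""
    return all(record.get(attr) == value for attr, value in rule.items())

def rule_counts(records, rule):
    """Return (number of matching records, total number of records)."""
    return sum(matches(r, rule) for r in records), len(records)

# Hypothetical toy records built to reproduce the counts quoted on the slide.
senior_male = {"Gender": "Male", "Age": "Senior"}
other = {"Gender": "Female", "Age": "Child"}
recent = [senior_male] * 52 + [other] * 148
baseline = [senior_male] * 90 + [other] * 90

rule = {"Gender": "Male", "Age": "Senior"}
for name, records in (("recent", recent), ("baseline", baseline)):
    hits, total = rule_counts(records, rule)
    print(f"{hits}/{total} records from {name} match the rule ({hits / total:.0%})")
```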

Which rules do we report?
• Search over all rules with at most 2 components
• For each rule, form a 2 x 2 contingency table, e.g.

| | Count_Recent | Count_Baseline |
|---|---|---|
| Home Location = NW | 48 | 45 |
| Home Location ≠ NW | 86 | 220 |

• Perform Fisher’s Exact Test to get a p-value for each rule (call this the score)
• Report the rule with the lowest score
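The scoring step can be sketched with a small pure-Python Fisher's exact test built on the hypergeometric distribution. This version is one-sided (it asks whether the rule is over-represented in the recent records), which is an assumption of this example; the slides do not say which tail is used. The function name `fisher_exact_greater` is hypothetical.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the 2 x 2 table

                      recent  baseline
        rule holds       a        b
        rule fails       c        d

    i.e. the probability, with all margins fixed, of seeing at least `a`
    rule-matching records among the recent records.
    """
    row1, row2 = a + b, c + d      # records where the rule holds / fails
    col1 = a + c                   # number of recent records
    n = row1 + row2
    x_max = min(row1, col1)
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, x_max + 1))
    return tail / comb(n, col1)

# Contingency table from the slide: Home Location = NW versus not NW.
score = fisher_exact_greater(48, 45, 86, 220)
print(f"score = {score:.3g}")
```

A library routine such as SciPy's `scipy.stats.fisher_exact` could be used instead; the hand-written version is shown only to keep the example dependency-free.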

Problem #1: Multiple Hypothesis Testing
• Can’t interpret the rule scores as p-values
• Suppose we reject the null hypothesis when score < $\alpha$, where $\alpha = 0.05$
• For a single hypothesis test, the probability of making a false discovery = $\alpha$
• Suppose we do 1000 tests, one for each possible rule
• Probability(false discovery) could be as bad as: $1 - (1 - 0.05)^{1000} \gg 0.05$
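The arithmetic behind the last bullet can be checked with a short sketch. It assumes the tests are independent, which is the worst case the bound describes; the function name `familywise_false_discovery` is hypothetical.

```python
def familywise_false_discovery(alpha, num_tests):
    """Probability of at least one false discovery among independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests

for num_tests in (1, 10, 100, 1000):
    prob = familywise_false_discovery(0.05, num_tests)
    print(f"{num_tests:5d} tests: P(false discovery) = {prob:.4f}")
```

With 1000 tests the probability is indistinguishable from 1, so the lowest raw score is almost guaranteed to fall below 0.05 even when nothing unusual is happening.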

Solution: Randomization Test
[Figure: list of cases C2 to C15 with dates between Aug 16 and Aug 24, 2003, illustrating the date field being shuffled]
• Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DBRand
• Find the rule with the best score on DBRand.

Randomization Test
Repeat the procedure on the previous slide for 1000 iterations. Determine how many scores from the 1000 iterations are better than the original score.
If the original score placed in the top 1% of the 1000 scores from the randomization test, we would be impressed and an alert should be raised.
Corrected p-value of the rule is: # better scores / # iterations
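The randomization test can be sketched as follows. Shuffling the date field is modelled here as shuffling a recent/baseline label across records. For brevity the search covers single-component rules only and scores them with a one-sided Fisher's exact test; both are simplifying assumptions, as are the names `best_score` and `corrected_p_value`. Ties are counted as "better", which is a conservative choice not stated on the slide.

```python
import random
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    tail = sum(comb(row1, x) * comb(row2, col1 - x)
               for x in range(a, min(row1, col1) + 1))
    return tail / comb(row1 + row2, col1)

def best_score(records, is_recent):
    """Lowest score over all single-component rules (attribute = value)."""
    rules = {(attr, val) for r in records for attr, val in r.items()}
    n_recent = sum(is_recent)
    n_base = len(records) - n_recent
    best = 1.0
    for attr, val in rules:
        a = sum(1 for r, rec in zip(records, is_recent) if rec and r.get(attr) == val)
        b = sum(1 for r, rec in zip(records, is_recent) if not rec and r.get(attr) == val)
        best = min(best, fisher_greater(a, b, n_recent - a, n_base - b))
    return best

def corrected_p_value(records, is_recent, iterations=1000, seed=0):
    """Fraction of shuffled datasets whose best score is at least as good."""
    rng = random.Random(seed)
    original = best_score(records, is_recent)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)          # plays the role of DBRand
        if best_score(records, labels) <= original:
            better += 1
    return better / iterations

# Hypothetical data: 40 recent records (30 with fever), 200 baseline (20 with fever).
records = ([{"Prodrome": "Fever"}] * 30 + [{"Prodrome": "Other"}] * 10
           + [{"Prodrome": "Fever"}] * 20 + [{"Prodrome": "Other"}] * 180)
is_recent = [True] * 40 + [False] * 200
print(corrected_p_value(records, is_recent, iterations=200))
```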

Problem #2: A Changing Baseline
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249)
• Baseline is affected by temporal trends in health care data, e.g.:
– Seasonal effects in temperature and weather
– Day of Week effects
– Holidays
• Choosing the wrong baseline distribution can affect the detection time and false positive rate

Solution: Bayesian Network
1. Learn a Bayesian network from all historical data using Optimal Reinsertion [Moore and Wong 2003]
2. Generate the baseline given today’s environment

Environmental Attributes
Divide the data into two types of attributes:
• Environmental attributes: attributes that cause trends in the data, e.g. day of week, season, weather, flu levels
• Response attributes: all other non-environmental attributes

Environmental Attributes
When learning the Bayesian network structure, do not allow environmental attributes to have parents. Why?
• We are not interested in predicting their distributions
• Instead, we use them to predict the distributions of the response attributes
Side Benefit: We can speed up the structure search by avoiding DAGs that assign parents to the environmental attributes
[Figure: environmental attribute nodes Season, Day of Week, Weather, Flu Level]
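The structural constraint amounts to rejecting any candidate edge that points into an environmental attribute. The sketch below shows how much this shrinks the set of candidate edges for a small hypothetical attribute set. The response attributes and the helper `edge_allowed` are assumptions for illustration, and this is not the Optimal Reinsertion search itself.

```python
from itertools import permutations

ENVIRONMENTAL = {"Season", "Day of Week", "Weather", "Flu Level"}
RESPONSE = {"Gender", "Age", "Prodrome"}   # hypothetical response attributes

def edge_allowed(parent, child, environmental=ENVIRONMENTAL):
    """An edge parent -> child is allowed only if the child is not environmental."""
    return child not in environmental

all_attributes = sorted(ENVIRONMENTAL | RESPONSE)
all_edges = list(permutations(all_attributes, 2))
allowed_edges = [(p, c) for p, c in all_edges if edge_allowed(p, c)]

print(f"{len(allowed_edges)} of {len(all_edges)} candidate edges remain")
```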

Generate Baseline Given Today’s Environment
Suppose we know the following for today:

| Season | Day of Week | Weather | Flu Level |
|---|---|---|---|
| Winter | Monday | Snow | High |

We fill in these values for the environmental attributes in the learned Bayesian network.
We sample 10000 records from the Bayesian network and make this data set the baseline.
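A minimal sketch of the sampling step follows. Because the environmental attributes have no parents, fixing them and sampling the remaining attributes in topological order yields samples from the conditional distribution. The two-node response structure and every probability below are hypothetical stand-ins for a learned network.

```python
import random

# Hypothetical conditional probability tables (not learned from real data).
PRODROME_GIVEN_FLU = {                      # P(Prodrome | Flu Level)
    "High": {"Respiratory": 0.5, "Fever": 0.3, "Diarrhea": 0.2},
    "Low": {"Respiratory": 0.2, "Fever": 0.3, "Diarrhea": 0.5},
}
AGE_GIVEN_PRODROME = {                      # P(Age | Prodrome)
    "Respiratory": {"Child": 0.3, "Adult": 0.3, "Senior": 0.4},
    "Fever": {"Child": 0.5, "Adult": 0.3, "Senior": 0.2},
    "Diarrhea": {"Child": 0.4, "Adult": 0.4, "Senior": 0.2},
}

def draw(dist, rng):
    """Draw one value from a {value: probability} table."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]

def sample_baseline(environment, n=10000, seed=0):
    """Sample n records with the environmental attributes held at today's values."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        record = dict(environment)          # environmental attributes are fixed
        record["Prodrome"] = draw(PRODROME_GIVEN_FLU[record["Flu Level"]], rng)
        record["Age"] = draw(AGE_GIVEN_PRODROME[record["Prodrome"]], rng)
        records.append(record)
    return records

today = {"Season": "Winter", "Day of Week": "Monday",
         "Weather": "Snow", "Flu Level": "High"}
baseline = sample_baseline(today)
print(len(baseline), baseline[0])
```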

What’s Strange About Recent Events (WSARE)
1. Obtain Recent and Baseline datasets
2. Search for rule with best score
3. Determine p-value of best scoring rule
4. If p-value is less than threshold, signal alert

Simulator

Results on Simulation

Results on Actual ED Data from 2001
1. Sat 2001-02-13: SCORE = -0.00000004 PVALUE = 0.0000
   14.80% ( 74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
   7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13: SCORE = -0.00000464 PVALUE = 0.0000
   12.42% ( 58/467) of today's cases have Respiratory Syndrome = True
   6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed 2001-06-30: SCORE = -0.00000013 PVALUE = 0.0000
   1.44% ( 9/625) of today's cases have 100 <= Age < 110
   0.08% ( 8/10000) of baseline have 100 <= Age < 110
4. Sun 2001-08-08: SCORE = -0.00000007 PVALUE = 0.0000
   83.80% (481/574) of today's cases have Unknown Syndrome = False
   74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu 2001-12-02: SCORE = -0.00000087 PVALUE = 0.0000
   14.71% ( 70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
   7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
6. Thu 2001-12-09: SCORE = -0.0000 PVALUE = 0.0000
   8.58% ( 38/443) of today's cases have Hospital ID = 1 and Viral Syndrome = True
   2.40% (240/10000) of baseline have Hospital ID = 1 and Viral Syndrome = True

Related Work
• Deviations between models induced by two datasets [Ganti, Gehrke and Ramakrishnan]
• Emerging Patterns [Dong and Li]
• Mining Surprising Patterns using Temporal Description Length [Chakrabarti, Sarawagi and Dom]
• Contrast sets [Bay and Pazzani]
• Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance [Brossette et al.]
• Spatial Scan Statistic [Kulldorff]

Conclusion
• One approach to biosurveillance: one algorithm monitoring millions of signals derived from multivariate data instead of hundreds of univariate detectors
• WSARE is best used as a general purpose safety net in combination with other detectors
• Careful evaluation of statistical significance
• Modeling historical data with Bayesian Networks to allow conditioning on unique features of today
Software: http://www.autonlab.org/