FDA Project: Anomaly and Temporal Pattern Detection
Brigham Anderson, Robin Sabhnani, Adam Goode, Alice Zheng, Artur Dubrawski
Copyright © 2006, Brigham S. Anderson

OUTLINE
• Client
• Data
• Solutions
  • Anomaly Detector
  • Temporal Pattern Detector

The Players
• eLEXNET: Electronic Laboratory Exchange Network
• NBIS: National Bio-surveillance Integration System
• The Department of Homeland Security has “asked” the FDA to submit relevant eLEXNET data to NBIS.

[Diagram: Auton Lab, National Bio-surveillance Integration System, Electronic Laboratory Exchange Network, SAIC?]

Scenarios
• Scenario #1: Anchovy + Mercury

Anchovy/Mercury Summarization Report to FDA analyst…

Scenarios
• Scenario #1: Anchovy + Mercury
• Scenario #2: OJ + Salmonella

OK, so what does the data look like?

DATA
• Samples of food products
  • Sample ID
  • Collection Date
  • Product Code (on the order of 10,000 different products)
  • Country Code
  • Zip Code
  • Reason Collected
  • Human Illness?

DATA
• Each Sample consists of multiple Tests
  • “Analyte” (an estimated 5,000 different analytes)
  • Detection
  • Lab ID
  • Test Method
  • …

Example
Sample #223591: 2/18/2005, Coffee/Tea
• Analyte: Salmonella spp, Detect: Negative
• Analyte: Staphylococcus aureus, Detect: Negative
• Analyte: Bacillus cereus, Detect: Negative

Data
• (Show spreadsheet)

Data
• Time span: 1999-present
• Number of records: 300K to 1M?
• Missing data? …Only a few in the sample datasets provided.
• Different types of tests:
  • Microbials
  • Mycotoxins
  • Pesticides
  • Dyes
  • …

Data Stream
• About 1200 Microbial tests submitted per week
• Tests are not submitted regularly!

• Anomaly Detector
• Temporal Pattern Detector

What is an Anomaly?
• An irregularity that cannot be explained by simple domain models and knowledge
• Anomaly detection only needs to learn from examples of “normal” system behavior.
  • Classification, on the other hand, would need examples labeled “normal” and “not-normal”

Anomaly Detectors in Practice
• Monitoring computer networks for attacks.
• Looking for suspicious activity in bank transactions.
• Detecting unusual eBay selling/buying behavior.

Simple FDA Anomaly Detection
• GIVEN:
  • 1 test = 1 record
  • The relevant features of a test are
    • Product
    • Analyte
    • Detect
• PROBLEM:
  • For each test, compute P(product, analyte, detect) and explain it.

Simple Anomaly Detector
• Suppose we estimate all the probabilities from data:

P(Product, Analyte, Detect) =
  P(Meat, EColi, N)      = 0.00021
  P(Meat, EColi, Y)      = 0.00007
  P(Meat, Salmonella, N) = 0.00010
  P(Meat, Salmonella, Y) = 0.00005
  …
  P(Apple, Vibrosa, N), P(Apple, Vibrosa, Y), P(Apple, Listeria, N), P(Apple, Listeria, Y): only three values survive for these four entries (0.00000, 0.00020, 0.00002)

Simple Anomaly Detector
How likely is <meat, Salmonella, Y>?
Could not be easier! Just look up the entry in the joint probability table (JPT)!
Smaller numbers are more anomalous because the model is more surprised to see them.

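The lookup idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the records are a made-up toy list, and `build_jpt` and `score` are hypothetical names, not anything from the project.

```python
from collections import Counter

# Hypothetical toy records: one (product, analyte, detect) tuple per test.
tests = [
    ("Meat", "EColi", "N"),
    ("Meat", "EColi", "N"),
    ("Meat", "EColi", "Y"),
    ("Meat", "Salmonella", "N"),
    ("Apple", "Listeria", "N"),
    ("Apple", "Listeria", "N"),
    ("Apple", "Listeria", "N"),
    ("Apple", "Listeria", "Y"),
]

def build_jpt(records):
    """Estimate P(product, analyte, detect) as relative frequencies."""
    counts = Counter(records)
    total = sum(counts.values())
    return {triplet: n / total for triplet, n in counts.items()}

def score(jpt, triplet):
    """Look up a triplet; a triplet never seen in the data scores 0.0."""
    return jpt.get(triplet, 0.0)

jpt = build_jpt(tests)
# Lowest probability first, i.e. most anomalous first.
ranked = sorted(jpt, key=jpt.get)
```

Here `score(jpt, ("Meat", "EColi", "N"))` is 2/8, while any unseen triplet scores exactly 0, which hints at why a raw table becomes a problem once the number of possible triplets is large.
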
Estimating P(product, analyte, detect)
• There are ~10,000 products.
• There are ~5,000 analytes.
• There are 2 detection outcomes.
…so there are ~100M possible triplets.
• We cannot directly estimate P(product, analyte, detect) from the data…

P(product, analyte, detect) = P(product) P(analyte | product) P(detect | product, analyte)
e.g., P(Anchovy, Mercury, Y) = P(Anchovy) P(Mercury | Anchovy) P(Y | Anchovy, Mercury)

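A minimal Python sketch of the factorization above, assuming plain count-based estimates. The records are invented (Histamine is a hypothetical analyte used only to fill out the toy data), and `fit_factored` and `joint` are illustrative names.

```python
from collections import Counter

# Hypothetical toy records: (product, analyte, detect).
records = [
    ("Anchovy", "Mercury", "Y"),
    ("Anchovy", "Mercury", "N"),
    ("Anchovy", "Mercury", "N"),
    ("Anchovy", "Histamine", "N"),
    ("Apple", "Patulin", "N"),
    ("Apple", "Patulin", "N"),
    ("Apple", "Patulin", "N"),
    ("Apple", "Patulin", "N"),
]

def fit_factored(records):
    """Count the three tables used by the chain-rule factorization."""
    n_product = Counter(p for p, a, d in records)
    n_pair = Counter((p, a) for p, a, d in records)
    n_triplet = Counter(records)
    return len(records), n_product, n_pair, n_triplet

def joint(model, product, analyte, detect):
    """P(product) * P(analyte | product) * P(detect | product, analyte)."""
    total, n_product, n_pair, n_triplet = model
    if n_pair[(product, analyte)] == 0:
        return 0.0  # pair never observed, so no conditional can be formed
    p_product = n_product[product] / total
    p_analyte = n_pair[(product, analyte)] / n_product[product]
    p_detect = n_triplet[(product, analyte, detect)] / n_pair[(product, analyte)]
    return p_product * p_analyte * p_detect

model = fit_factored(records)
p = joint(model, "Anchovy", "Mercury", "Y")  # (4/8) * (3/4) * (1/3)
```

With raw counts the product simply collapses back to count/total, so the factorization by itself buys nothing; its value is that each of the three tables can be smoothed or coarsened separately.
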
• Product (~10,000 values): 10,000 x 1 vector
• Analyte (~5,000 values): 10,000 x 5,000 matrix
• Detect (2 values): 10,000 x 5,000 x 2 matrix

• Two ways we handle insufficient data:
  • Aggregate Products into “Industries”
  • Dirichlet priors on the conditional probability tables (CPTs)

• Industry (~50 values, in place of ~10,000 Products): 50 x 1 vector
• Analyte (~5,000 values): 50 x 5,000 matrix
• Detect (2 values): 50 x 5,000 x 2 matrix

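The effect of aggregating into industries is easy to check with arithmetic. The sketch below uses the approximate sizes quoted in the slides; `cpt_sizes` is a hypothetical helper name.

```python
def cpt_sizes(n_parent, n_analytes=5_000, n_detect=2):
    """Number of cells in each table for a given top-level variable size."""
    return {
        "parent": n_parent,                          # P(parent)
        "analyte": n_parent * n_analytes,            # P(analyte | parent)
        "detect": n_parent * n_analytes * n_detect,  # P(detect | parent, analyte)
    }

product_level = cpt_sizes(10_000)   # ~10,000 products
industry_level = cpt_sizes(50)      # ~50 industries

# Ratio of the largest tables at the two levels.
shrink = product_level["detect"] // industry_level["detect"]
```

At the product level the largest table has 100,000,000 cells, against a data set estimated at 300K to 1M records; at the industry level it has 500,000 cells, a 200-fold reduction.
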
Least Anomalous in 2005 (Anomaly Score)

Most Anomalous in 2005 (Anomaly Score)

Dirichlet priors
How we add Dirichlet priors:
1. Before learning the CPTs, assume that we’ve seen every possible combination exactly “once”.
2. Continue learning the network.

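The "seen once" recipe amounts to adding one pseudo-count to every cell before normalizing. A minimal sketch for a single conditional table, assuming made-up counts and a hypothetical helper `smoothed_cpt`:

```python
from collections import Counter

def smoothed_cpt(pair_counts, parents, children, pseudo=1.0):
    """P(child | parent) with `pseudo` imaginary observations per cell."""
    cpt = {}
    for parent in parents:
        row_total = sum(pair_counts[(parent, c)] for c in children)
        denom = row_total + pseudo * len(children)
        for child in children:
            cpt[(parent, child)] = (pair_counts[(parent, child)] + pseudo) / denom
    return cpt

# Hypothetical counts of (industry, detect) for one analyte.
counts = Counter({("Seafood", "Y"): 1, ("Seafood", "N"): 3})
cpt = smoothed_cpt(counts, parents=["Seafood", "Fruit"], children=["Y", "N"])
```

"Seafood" moves from the raw 1/4 to (1+1)/(4+2) = 1/3, and "Fruit", which has no data at all, falls back to the uniform 1/2 rather than an undefined 0/0.
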
Which Abstraction Level?
• There are about three levels of detail for a given product…
  E.g., Seafood → Anchovy → Smoked Anchovy
• Currently, use P(Mercury | Seafood)
• …should we use P(Mercury | Anchovy) instead?
• …but what if we’ve only seen 4 Anchovy/Mercury tests? Do we use that to estimate P(Mercury | Anchovy)?

Which Abstraction Level?
• There are about three levels of detail for a given product…
  E.g., Seafood → Anchovy → Smoked Anchovy
• IDEA:
  1. Build one anomaly detector for each level.
  2. Test each sample at all three levels.
  3. Choose the most anomalous score.
Are you insane? Maybe not… At the lower levels, the anomaly score will tend to be dominated by the prior (and thus produce high probabilities).

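The three-step idea can be sketched as follows. Everything here is hypothetical: the level names, the probability values, and the helper `most_anomalous` are illustrative only.

```python
# Hypothetical per-level probability tables, keyed by (name, analyte, detect).
tables = {
    "industry": {("Seafood", "Mercury", "Y"): 0.002},
    "product": {("Anchovy", "Mercury", "Y"): 0.0004},
    "subproduct": {("Smoked Anchovy", "Mercury", "Y"): 0.01},
}

# One sample, described at all three levels of detail.
sample = {
    "industry": "Seafood",
    "product": "Anchovy",
    "subproduct": "Smoked Anchovy",
    "analyte": "Mercury",
    "detect": "Y",
}

def most_anomalous(sample, tables):
    """Score the sample at every level and keep the lowest probability."""
    scores = {}
    for level, table in tables.items():
        key = (sample[level], sample["analyte"], sample["detect"])
        scores[level] = table[key]
    best_level = min(scores, key=scores.get)
    return best_level, scores[best_level]

level, prob = most_anomalous(sample, tables)
```

In this toy case the product level wins with 0.0004. The most detailed level reports a higher probability, matching the point that sparse levels are pulled toward the prior and so rarely produce the winning score.
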
• Anomaly Detector
• Temporal Pattern Detector

What is a Temporal Pattern?
• How do we find the Orange Juice + Salmonella pattern?
• This is not a daily scan; it is “on-demand”.

What is a Temporal Pattern?
BASIC PROBLEM #1: Check each product/analyte pair in the last t weeks against the previous t’ weeks for unusual “behavior”.
BASIC SOLUTION: Chi-square test for each product/analyte pair:

            Detects   Non-Detects
Recent      O11       O12
Baseline    O21       O22

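A minimal sketch of the 2x2 test for one product/analyte pair, using only the standard library. It assumes the plain Pearson statistic without continuity correction; the counts are invented and `chi_square_2x2` is an illustrative name.

```python
import math

def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic and p-value (1 degree of freedom)."""
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # degenerate table: no evidence either way
    stat = n * (o11 * o22 - o12 * o21) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the chi-square tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: recent window vs. baseline window.
stat, p_value = chi_square_2x2(o11=8, o12=12, o21=10, o22=190)
same_stat, same_p = chi_square_2x2(5, 15, 50, 150)  # identical detect rates
```

The chi-square approximation is unreliable when expected cell counts are small, which is likely for rare product/analyte pairs; an exact test may be the safer choice there.
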
2005 vs. 2003-2004, Microbials only

What is a Temporal Pattern?
BASIC PROBLEM #2: Check each product/analyte pair in the last t weeks for any interval of unusual behavior.
BASIC SOLUTION: Chi-square test for each product/analyte pair for each interval. (Bootstrap to get baseline distribution of best chi-square.)

           Detects                  Non-Detects                  Duration
Inside     #detects_inside (O11)    #non-detects_inside (O12)    #weeks_inside (O13)
Outside    #detects_outside (O21)   #non-detects_outside (O22)   #weeks_outside (O23)

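One way to read the interval scan is sketched below. It is a simplification under stated assumptions: it uses only the detect/non-detect columns (the Duration column is ignored), and the bootstrap resamples whole weeks with replacement, which is one plausible reading rather than a scheme the slides confirm. All names and counts are hypothetical.

```python
import random

def chi2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table (0.0 if degenerate)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def best_interval(weeks):
    """weeks: list of (detects, non_detects). Returns (stat, start, end)."""
    tot_d = sum(d for d, _ in weeks)
    tot_n = sum(n for _, n in weeks)
    best = (0.0, 0, 0)
    for start in range(len(weeks)):
        in_d = in_n = 0
        for end in range(start, len(weeks)):
            in_d += weeks[end][0]
            in_n += weeks[end][1]
            stat = chi2(in_d, in_n, tot_d - in_d, tot_n - in_n)
            if stat > best[0]:
                best = (stat, start, end)
    return best

def bootstrap_p_value(weeks, n_rounds=200, seed=0):
    """Share of resampled series whose best statistic matches or beats ours."""
    rng = random.Random(seed)
    observed = best_interval(weeks)[0]
    hits = 0
    for _ in range(n_rounds):
        resampled = [rng.choice(weeks) for _ in weeks]
        if best_interval(resampled)[0] >= observed:
            hits += 1
    return (hits + 1) / (n_rounds + 1)

# Hypothetical series: a three-week burst of detects in weeks 5 to 7.
weeks = [(1, 29)] * 5 + [(12, 18)] * 3 + [(1, 29)] * 4
stat, start, end = best_interval(weeks)
p = bootstrap_p_value(weeks)
```

Taking the maximum over all intervals inflates the statistic, so it cannot be compared against the usual chi-square table; the resampling step supplies a reference distribution for that maximum.
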
Patulin mycotoxin tests on Fruits

All years? Microbials only
