PROGRAMS IN HOMELAND SECURITY AT DIMACS Fred S. Roberts, DIMACS Director

THE FOUNDING OF DIMACS: THE NSF SCIENCE AND TECHNOLOGY CENTERS PROGRAM The STC program was launched by the White House and the National Academy of Sciences in 1988 in order to increase the economic competitiveness of the U.S. NSF ran a nationwide competition. The rules: * cutting-edge research * education and knowledge transfer * university-industry partnerships

THE FOUNDING OF DIMACS Because of the increasing importance of discrete mathematics and theoretical computer science, especially in the fields of telecommunications and computing, four institutions (Rutgers University, Princeton University, AT&T Bell Labs, and Bell Communications Research (Bellcore)) each developed strong research groups in these fields. Under the leadership of Rutgers, they came together to found DIMACS and entered the STC competition. There were more than 800 preproposals and more than 300 proposals, in all fields of science, and 11 winners.

The DIMACS Partners Today: Rutgers University, Princeton University, AT&T Labs, Bell Labs (Lucent Technologies), NEC Laboratories America, Telcordia Technologies. Affiliates: Avaya Labs, HP Labs, IBM Research, Microsoft Research, Stevens Institute of Technology.

WHO IS DIMACS? • There are about 250 scientists affiliated with DIMACS, called permanent members. • Most are from the partner and affiliated organizations. • They include many of the world’s leaders in discrete mathematics and theoretical computer science and their applications. • They also include statisticians, biologists, psychologists, chemists, epidemiologists, and engineers. • None are paid by DIMACS, but they join in DIMACS projects.

Outline: A Selection of DIMACS Projects • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy

The Bioterrorism Sensor Location Problem

• Early warning is critical in defense against terrorism. • This is a crucial factor underlying the government’s plans to place networks of sensors/detectors to warn of a bioterrorist attack. (Image: The BASIS System – Salt Lake City)

Locating Sensors is not Easy • Sensors are expensive. • How do we select them and where do we place them to maximize “coverage,” expedite an alarm, and keep the cost down? • Approaches that improve upon existing, ad hoc location methods could save countless lives in the case of an attack and also money in capital and operational costs.

Two Fundamental Problems • Sensor Location Problem (SLP): – Choose an appropriate mix of sensors – Decide where to locate them for best protection and early warning

Two Fundamental Problems • Pattern Interpretation Problem (PIP): When sensors set off an alarm, help public health decision makers decide: – Has an attack taken place? – What additional monitoring is needed? – What was its extent and location? – What is an appropriate response?

The SLP: What is a Measure of Success of a Solution? • A modeling problem. • Needs to be made precise. • Many possible formulations.

The SLP: What is a Measure of Success of a Solution? • Identify and ameliorate false alarms. • Defend against a “worst case” attack or an “average case” attack. • Minimize time to first alarm? (Worst case? Average case?) • Maximize “coverage” of the area. – Minimize geographical area not covered – Minimize size of population not covered – Minimize probability of missing an attack

The SLP: What is a Measure of Success of a Solution? • Cost: Given a mix of available sensors and a fixed budget, what mix will best accomplish our other goals?

The SLP: What is a Measure of Success of a Solution? • It’s hard to separate the goals. • Even a small number of sensors might detect an attack if there is no constraint on time to alarm. • Without budgetary restrictions, a lot more can be accomplished.

The Sensor Location Problem • Approach is to develop new algorithmic methods. • We are building on approaches to other modeling problems, seeing if they can be modified for the sensor location context. • This is a multi-criteria modeling problem, and it seems hopeless to try to find “optimal solutions.” • We will be happy with “efficient” algorithms that find “good” solutions.

Algorithmic Approaches I: Greedy Algorithms

Greedy Algorithms • Find the most important location first and locate a sensor there. • Find the second-most important location. • Etc. • Builds on earlier mathematical work at the Institute for Defense Analyses (Grotte, Platt). • “Steepest ascent approach.” • No guarantee of an “optimal” or best solution. • In practice, gets pretty close to the optimal solution.
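The greedy idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Institute for Defense Analyses method: "importance" is taken to be the uncovered population within a fixed sensor radius, and the function name, the candidate sites, and the population figures are all hypothetical.

```python
from math import hypot

def greedy_sensor_placement(candidates, demand_points, radius, k):
    """Pick up to k candidate sites; each step takes the site that covers the
    most not-yet-covered population (a greedy, steepest-ascent step).

    candidates:    list of (x, y) sites where a sensor could go (hypothetical)
    demand_points: dict mapping (x, y) -> population at that point (hypothetical)
    """
    uncovered = dict(demand_points)
    chosen = []
    for _ in range(k):
        best_site, best_gain = None, 0
        for site in candidates:
            if site in chosen:
                continue
            gain = sum(pop for (px, py), pop in uncovered.items()
                       if hypot(px - site[0], py - site[1]) <= radius)
            if gain > best_gain:
                best_site, best_gain = site, gain
        if best_site is None:  # no remaining site covers anyone new
            break
        chosen.append(best_site)
        uncovered = {p: pop for p, pop in uncovered.items()
                     if hypot(p[0] - best_site[0], p[1] - best_site[1]) > radius}
    # Return the chosen sites and the population still left uncovered
    return chosen, sum(uncovered.values())

# Made-up city: five population points, four candidate sites, two sensors
population = {(0, 0): 100, (1, 0): 50, (5, 5): 80, (6, 5): 20, (10, 0): 10}
sites = [(0, 0), (5, 5), (10, 0), (3, 3)]
placement, missed = greedy_sensor_placement(sites, population, radius=1.5, k=2)
```

As the slide notes, nothing here guarantees an optimal placement; the sketch only shows the "most important location first" loop.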

Algorithmic Approaches II: Variants of Classic Location and Clustering Methods

Algorithmic Approaches II: Variants of Classic Location and Clustering Methods • Location theory: locate facilities (sensors) to be used by users located in a region. • Cluster analysis: Given points in a metric space, partition them into groups or clusters so points within clusters are relatively close. • Clusters correspond to points covered by a facility (sensor).

Variants of Classic Location and Clustering Methods • k-median clustering: Given k sensors, place them so each point in the city is within x feet of a sensor. • Complications: More dimensions: location affects sensitivity, wind strength enters, sensors have different characteristics, etc. • This higher-dimensional k-median clustering problem is hard! Best-known algorithms are due to Rafail Ostrovsky.
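To make the "every point within x feet of a sensor" goal concrete, here is a minimal sketch. It is not one of the k-median algorithms cited above: it swaps in the simple farthest-first heuristic usually associated with the closely related k-center objective, works in the plain 2-D plane, and uses made-up coordinates and hypothetical function names.

```python
from math import dist

def farthest_first(points, k):
    """Choose k sensor sites from `points`: start anywhere, then repeatedly
    add the point that is farthest from all sites chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    return centers

def coverage_radius(points, centers):
    """Largest distance from any point to its nearest sensor."""
    return max(min(dist(p, c) for c in centers) for p in points)

# Made-up city points; check whether 3 sensors put everyone within x = 1.5
city = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 8)]
sensors = farthest_first(city, k=3)
within_x = coverage_radius(city, sensors) <= 1.5
```

The complications listed on the slide (wind, sensitivity, differing sensor types) are exactly what this flat-distance sketch leaves out.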

Variants of Classic Location and Clustering Methods • Further complications make this even more challenging: – Different costs of different sensors – Restrictions on where we can place different sensors – Is it better to have every point within x feet of some sensor or every point within y feet of at least three sensors (y > x)? • Approximation methods due to Chuzhoy, Ostrovsky, and Rabani and to Guha, Tardos, and Shmoys are relevant.

Algorithmic Approaches III: Variants of Highway Sensor Network Algorithms

Variants of Highway Sensor Network Algorithms • Sensors located along highways and nearby pathways measure atmospheric and road conditions. • Muthukrishnan et al. have developed very efficient algorithms for sensor location. • Based on “bichromatic clustering” and “bichromatic facility location” (color nodes corresponding to sensors red, nodes corresponding to sensor messages blue).

Variants of Highway Sensor Network Algorithms • These algorithms apply to situations with many more sensors than the bioterrorism sensor location problem. • As BT sensor technology changes, we can envision a myriad of miniature sensors distributed around a city, making this work all the more relevant.

Algorithmic Approaches IV: Building on Equipment Placing Algorithms

Building on Equipment Placing Algorithms • The “Node Placement Problem” is the problem of determining the locations, or nodes, at which to install certain types of networking equipment. • “Coverage” and cost are major considerations. • Researchers at Telcordia Technologies have studied variations of this problem arising from broadband access technologies.

The Broadband Access Node Placement Problem • There are inherent range limitations that drive placement. • E.g., a customer for DSL service must be within xx feet of an assigned multiplexer. • Multiplexer = sensor. • Problem solved using dynamic programming algorithms (Tamra Carpenter, Martin Eiger, David Shallcross, Paul Seymour).

The Broadband Access Node Placement Problem: Complications • Restrictions on types of equipment that can be placed at a given node. • Constraints on how far a signal from a given piece of equipment can travel. • Cost and profit maximization considerations. • Relevance of work on general integer programming, the knapsack cover problem, and local access network expansion problems.

The Pattern Interpretation Problem

The Pattern Interpretation Problem • It will be up to the Decision Maker to decide how to respond to an alarm from the sensor network.

The Pattern Interpretation Problem • Little has been done to develop analytical models for rapid evaluation of a positive alarm or pattern of alarms from a sensor network. • How can this pattern be used to minimize false alarms? • Given an alarm, what other surveillance measures can be used to confirm an attack, locate areas of major threat, and guide public health interventions?

The Pattern Interpretation Problem (PIP) • Close connection to the SLP. • How we interpret a pattern of alarms will affect how we place the sensors. • The same simulation models used to place the sensors can help us in tracing back from an alarm to a triggering attack.

Approaching the PIP: Minimizing False Alarms

Approaching the PIP: Minimizing False Alarms • One approach: Redundancy. Require two or more sensors to make a detection before an alarm is considered confirmed.

Approaching the PIP: Minimizing False Alarms • Portal Shield: requires two positives for the same agent during a specific time period. • Redundancy II: Place two or more sensors at or near the same location. Require two proximate sensors to give off an alarm before we consider it confirmed. • Redundancy drawbacks: cost, delay in confirming an alarm.
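A redundancy rule of this kind can be sketched as follows. This is not Portal Shield's actual logic: the requirement that the two positives come from different sensors, the time window, and all names and sample detections are assumptions made for illustration.

```python
def confirmed_agents(detections, window):
    """Return the agents reported by at least two different sensors within
    `window` time units of each other.

    detections: list of (time, sensor_id, agent) tuples (hypothetical format)
    """
    hits_by_agent = {}
    for time, sensor, agent in sorted(detections):
        hits_by_agent.setdefault(agent, []).append((time, sensor))

    confirmed = set()
    for agent, hits in hits_by_agent.items():
        for i, (t1, s1) in enumerate(hits):
            for t2, s2 in hits[i + 1:]:
                if t2 - t1 > window:
                    break  # hits are time-sorted, so later ones are too late
                if s1 != s2:
                    confirmed.add(agent)
    return confirmed

# Made-up detections: (minute, sensor, agent)
log = [(0, "A", "anthrax"), (3, "B", "anthrax"),
       (1, "A", "plague"), (20, "C", "plague"), (2, "A", "anthrax")]
alarms = confirmed_agents(log, window=5)
```

The delay drawback from the slide shows up directly: no alarm is confirmed until the second sensor reports.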

Approaching the PIP: Using Decision Rules • Existing sensors come with a specified sensitivity level and sound an alarm when the number of particles collected is above a threshold.

Approaching the PIP: Using Decision Rules • Alternative decision rule: alarm if two sensors reach 90% of threshold, three reach 75% of threshold, etc. • One approach: use clustering algorithms for sounding an alarm based on a given distribution of clusters of sensors reaching a percentage of threshold.
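The alternative rule can be written down directly. In this minimal sketch the (count, fraction) pairs for two and three sensors come from the slide, while the single-sensor pair at 100%, the function name, and the sample readings are assumptions; the sketch also ignores where the sensors are, which the clustering approach would take into account.

```python
def network_alarm(readings, threshold,
                  rule=((1, 1.00), (2, 0.90), (3, 0.75))):
    """Alarm if, for any (min_sensors, fraction) pair in `rule`, at least
    min_sensors readings reach fraction * threshold."""
    for min_sensors, fraction in rule:
        reaching = sum(1 for r in readings if r >= fraction * threshold)
        if reaching >= min_sensors:
            return True
    return False

# Made-up particle counts against a per-sensor threshold of 100
two_near_threshold = network_alarm([95, 92, 10], threshold=100)
three_lower = network_alarm([80, 78, 76, 5], threshold=100)
one_near_threshold = network_alarm([95, 60, 60], threshold=100)
```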

Approaching the PIP: Using Decision Rules • When sensors are to be used jointly, the rules for “tuning” each sensor should be optimized to take advantage of the fact that each is part of a network. • The optimal tuning depends on the decision rule applied to reach an overall decision given the sensor inputs.

Approaching the PIP: Using Decision Rules • Prior work along these lines in missile detection (Cherikh and Kantor).

Approaching the PIP: Using Decision Rules • Most work has concentrated on the case of stochastic independence of information available at two sensors, an assumption clearly violated in BT sensor location problems. • Even with stochastic independence, finding “optimal” decision rules is nontrivial. • Recent promising approaches of Paul Kantor: study fusion of multiple methods for monitoring message streams.

Approaching the PIP: Spatio-Temporal Mining of Sensor Data

Approaching the PIP: Spatio-Temporal Mining of Sensor Data • Sensors provide observations of the state of the world localized in space and time. • Finding trends in data from individual sensors: time series data mining. • PIP: detecting general correlations in multiple time series of observations. • This has been studied in statistics, database theory, knowledge discovery, data mining. • Complications: proximity relationships based on geography; complex chronological effects.

Approaching the PIP: Spatio-Temporal Mining of Sensor Data • Sensor technology is evolving rapidly. • It makes sense to consider idealized settings where data are collected continuously and communicated instantly. • Then, modern methods of spatio-temporal data mining due to Muthukrishnan and others are relevant.

Approaching the PIP: Triggering Other Methods of Surveillance • One type of BT surveillance cannot be considered in isolation. • Question: How can the pattern of sensor warnings guide other biosurveillance methods? • Increased syndromic surveillance? • Change threshold for alarm in syndromic surveillance? • Increased attention to E.R. visits in a certain region?

Approaching the PIP: Triggering Other Methods of Surveillance • Decreased threshold for alarm from subway worker absenteeism levels?

Approaching the PIP: Triggering Other Methods of Surveillance • If there is an initial alarm, each sensor may be read more often. • How do we pick the sensors to read more frequently? • This is “adaptive biosensor engagement.” • Methods of bichromatic combinatorial optimization may be relevant. • As for the SLP, sensors get one color, sensor messages another. • Relevance of work of Muthukrishnan.

Outline • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy

Port of Entry Inspection Algorithms (in collaboration with Los Alamos National Laboratory)

Port of Entry Inspection Algorithms • Goal: Find ways to intercept illicit nuclear materials and weapons destined for the U.S. via the maritime transportation system. • Aim: Develop decision support algorithms that will help us to “optimally” intercept illicit materials and weapons. • Find inspection schemes that minimize total “cost,” including “cost” of false positives and false negatives.

Sequential Decision Making Problem • Stream of entities arrives at a port. • Decision Maker needs to decide which to inspect and which to subject to increasingly stringent inspection based on outcomes of previous inspections. • Our approach: “decision logics” and combinatorial optimization methods. • Builds on approach of Stroud and Saeger and a large literature in sequential decision making.

Sequential Decision Making Problem • Entities arriving are to be classified into categories. • Simple case: 0 = “ok”, 1 = “suspicious”. • Observations are made. • Inspection scheme: specifies which observations are to be made based on previous observations. • Entities have attributes a0, a1, …, an, each in a number of states. • Sample attributes: – Does the ship’s manifest set off an “alarm”? – Does the container give off neutron or gamma emission above threshold? – Does a radiograph image come up positive? – Does an induced fission test come up positive?

Sequential Decision Making Problem • Simplest Case: Attributes are in state 0 or 1. • Then: Entity is a binary string like 011001. • Then: Classification is a decision function F that assigns each binary string to a category. • If there are two categories, 0 and 1, F is a boolean function. • Example: F(000) = F(111) = 1, F(abc) = 0 otherwise. This classifies an entity as positive iff it has none of the attributes or all of them.
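The example decision function can be tabulated directly. This is a minimal sketch; the name `F` follows the slide, while the tuple encoding of an entity and the `truth_table` variable are choices made here for illustration.

```python
from itertools import product

def F(bits):
    """The slide's example: category 1 iff the entity has none of the
    attributes or all of them; category 0 otherwise."""
    if all(b == 0 for b in bits) or all(b == 1 for b in bits):
        return 1
    return 0

# Truth table over all 3-attribute entities, keyed by binary string
truth_table = {"".join(map(str, bits)): F(bits)
               for bits in product((0, 1), repeat=3)}
positives = sorted(s for s, category in truth_table.items() if category == 1)
```

With n attributes there are 2^n entities and 2^(2^n) possible boolean functions, which hints at why the search problems on the following slides become hard quickly.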

Sequential Decision Making Problem • Different problems depending on whether or not F is known. Assume first that F is known. • Given an entity, test its attributes until we know enough to calculate the value of F. • An inspection scheme tells us in which order to test the attributes to minimize cost. • Even this simplified problem is hard computationally.

Binary Decision Tree Approach • We assume we have sensors to measure presence or absence of attributes. • Build a tree: • Nodes are sensors or categories (0 or 1). • Label nodes with the attribute the sensor measures or the number of the category. • Category nodes are “leaves” of the tree: nodes with only one neighbor. • Two arcs exit from each sensor node, labeled left and right. • Take the right arc when the sensor says the attribute is present, left arc otherwise.

Binary Decision Tree Approach • We reach category 1 from the root only through the path a0 to a1 to 1. • Thus, an entity is classified in category 1 iff it has both attributes. • The binary decision tree corresponds to the boolean function F(11) = 1, F(10) = F(01) = F(00) = 0. (Figure 1)

Binary Decision Tree Approach • We reach category 1 from the root by: a0 L to a1 R to a2 R to 1, or a0 R to a2 R to 1. • An entity is classified in category 1 iff it has a1 and a2 and not a0, or it has a0 and a2 and possibly a1. • Corresponding boolean function: F(111) = F(101) = F(011) = 1, F(abc) = 0 otherwise. (Figure 2)

Binary Decision Tree Approach • This binary decision tree corresponds to the same boolean function F(111) = F(101) = F(011) = 1, F(abc) = 0 otherwise. However, it has one fewer observation node. So, it is more efficient if all observations are equally costly and equally likely. (Figure 3)
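Two trees for the same boolean function can be compared mechanically. In this minimal sketch, `FIG2` encodes the two paths to category 1 described for Figure 2, with every other branch assumed to end in category 0. `SMALLER` is only one possible three-node tree for the same function; since the figures are not reproduced here, it is an assumption and may differ from the actual Figure 3. The tuple encoding and function names are also hypothetical.

```python
from itertools import product

# A tree is a leaf category (0 or 1) or a tuple (attribute_index, left, right);
# take `right` when the attribute is present (1), `left` otherwise.
FIG2 = (0, (1, 0, (2, 0, 1)), (2, 0, 1))   # a0 at root; a2 appears twice
SMALLER = (2, 0, (0, (1, 0, 1), 1))        # assumed alternative: test a2 first

def classify(tree, bits):
    """Follow the tree for one entity and return its category."""
    while not isinstance(tree, int):
        attr, left, right = tree
        tree = right if bits[attr] else left
    return tree

def observation_nodes(tree):
    """Count sensor (non-leaf) nodes in the tree."""
    if isinstance(tree, int):
        return 0
    _, left, right = tree
    return 1 + observation_nodes(left) + observation_nodes(right)

def same_function(t1, t2, n):
    """True iff both trees classify every n-bit entity identically."""
    return all(classify(t1, b) == classify(t2, b)
               for b in product((0, 1), repeat=n))

positives = {b for b in product((0, 1), repeat=3) if classify(FIG2, b) == 1}
equivalent = same_function(FIG2, SMALLER, 3)
```

Exhaustive comparison like this is fine for three attributes; it is the search over all possible trees, not the check of one tree, that becomes infeasible.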

Binary Decision Tree Approach • Even if the boolean function F is fixed, the problem of finding the “optimal” binary decision tree for it is NP-complete. • For small n, can try to solve it by brute force enumeration. • But even for n = 4, not practical. (n = 4 at Port of Long Beach-Los Angeles) • Seeking heuristic algorithms, approximations to optimal. • Making special assumptions about the boolean function F. • Example: For so-called “monotone” boolean functions, integer programming formulations give promising heuristics.

Cost Functions • Above analysis: Only uses number of sensors. • Using a sensor has a cost: – Unit cost of inspecting one item with it – Fixed cost of purchasing and deploying it – Delay cost from queuing up at the sensor station • How many nodes of the decision tree are actually visited during an average inspection? Depends on “distribution” of entities.
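The dependence on the distribution of entities can be made concrete by computing the expected unit cost of an inspection. This is a sketch under stated assumptions: only unit costs are counted (fixed and delay costs are left out), entities are uniformly distributed, and the two trees, the cost vector, and the function name are hypothetical illustrations of a tree that tests a0 first and one that tests a2 first.

```python
from itertools import product

# Tree encoding: leaf category (0 or 1) or (attribute_index, left, right);
# go right when the attribute is present.
TREE_A = (0, (1, 0, (2, 0, 1)), (2, 0, 1))  # tests a0 first
TREE_B = (2, 0, (0, (1, 0, 1), 1))          # tests a2 first

def expected_cost(tree, unit_cost, distribution):
    """Expected total unit cost of the sensors actually visited.

    unit_cost:    cost of using the sensor for each attribute index
    distribution: dict mapping entity bit-tuples to probabilities
    """
    total = 0.0
    for bits, prob in distribution.items():
        node, cost = tree, 0.0
        while not isinstance(node, int):
            attr, left, right = node
            cost += unit_cost[attr]
            node = right if bits[attr] else left
        total += prob * cost
    return total

uniform = {bits: 1 / 8 for bits in product((0, 1), repeat=3)}
cost_a = expected_cost(TREE_A, [1, 1, 1], uniform)
cost_b = expected_cost(TREE_B, [1, 1, 1], uniform)
```

Changing `uniform` to a skewed distribution, or raising the unit cost of one sensor, can change which tree is cheaper, which is the point of the slide.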

Cost Functions • Cost of false positive: Cost of additional tests. • If it means opening the container, it’s very expensive. • Cost of false negative: Complex issue.

Complications • Sensor errors: probabilistic approach. • More than two values of an attribute (present, absent, present with 75% probability, …). • Partially defined boolean functions (inferring the boolean function from observations). • In this case, machine learning approaches are promising: – Bayesian binary regression – Splitting strategies – Pruning learned decision trees

Outline • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy

Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages

OBJECTIVE: Monitor huge communication streams, in particular streams of textualized communication, to automatically detect pattern changes and "significant" events. Motivation: monitoring email traffic, news, communiques, faxes, voice intercepts (with speech recognition).

TECHNICAL APPROACHES: • Given stream of text in any language. • Decide whether "new events" are present in the flow of messages. • Event: new topic or topic with unusual level of activity. • Initial Problem: Retrospective or “Supervised” Event Identification: Classification into preexisting classes. Given example messages on events/topics of interest, algorithm detects instances in the stream.

TECHNICAL APPROACHES: SUPERVISED FILTERING • Batch filtering: Given examples of relevant documents up front. • Adaptive filtering: Examples accumulate over time; the system must decide whether to bother the analyst for guidance, and “pays” for information about relevance as the process moves along.

MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR “UNSUPERVISED” FILTERING • Classes change: new classes appear or existing classes change meaning. • A difficult problem in statistics. • Recent new CS approaches, “Semi-supervised Learning”: • Algorithm suggests a possible new event/topic. • Human analyst labels it and determines its significance.

COMPONENTS OF AUTOMATIC MESSAGE PROCESSING (1) Compression of Text – increase speed, reduce memory/disk use. (2) Representation of Text – convert text to a form amenable to computation and statistical analysis. (3) Matching Scheme – compute similarity between texts. (4) Learning Method – create profiles of events/topics from known examples. (5) Fusion Scheme – combine multiple filtering techniques to increase accuracy.

COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II • These distinctions are somewhat arbitrary. • Many approaches to message processing overlap several of these components; our techniques usually address more than one component. • Project Premise: Existing methods don’t exploit the full power of the 5 components, synergies among them, and/or an understanding of how to apply them to text data.

COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - III • Our approach is to develop/explore methods for each component and then to combine them. • In the first phase of the project, we did over 5000 complete experiments with different combinations of methods.

Nearest Neighbor (kNN) Classifiers • Route message by: – Finding the k most similar training messages (neighbors) – Assigning it to the classes that are most common among neighbors (using weighting by distance) • kNN classifiers studied since 1958, for text since early 90’s – Moderately effective for text; has been considered inefficient; finding neighbors is slow • But, finding neighbors only needs to be done once – No matter how many classes (even if huge) – So: for large number of topics, maybe more efficient than one-classifier-per-topic approaches

Speeding up kNN • Can finding neighbors be made fast enough to make kNN practical? • Worked on fast implementation • Store text and classes sparsely (Representation) – Store class labels sparsely – Arrange computations to do work proportional only to number of class labels in neighbors, not total number of classes • Search engine heuristics use the in-memory inverted file (Matching) – Use inverted file (group by word, not by document) – Retain only high impact terms within each document, or within each inverted list – Compute similarities using only inverted lists for the few words occurring in test document 73
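
A small sketch of the inverted-file idea follows. It assumes raw term counts stand in for "impact" and that pruning keeps the top few terms per document; both choices, and all names, are illustrative rather than taken from the project.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs, keep_terms=2):
    """Map each word to a list of (doc_id, weight) postings.

    Only the keep_terms highest-count terms of every document are retained,
    which shrinks the index at some cost in accuracy.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        counts = Counter(text.lower().split())
        for term, weight in counts.most_common(keep_terms):
            index[term].append((doc_id, weight))
    return index

def nearest(index, query, k=2):
    """Score documents by reading only the inverted lists of the query words."""
    scores = defaultdict(float)
    for term, query_weight in Counter(query.lower().split()).items():
        for doc_id, doc_weight in index.get(term, []):
            scores[doc_id] += query_weight * doc_weight
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:k]

# Hypothetical toy collection.
DOCS = [
    "flu flu flu outbreak outbreak city",
    "cargo cargo port port inspection",
    "flu vaccine vaccine vaccine shortage shortage",
]
index = build_inverted_index(DOCS, keep_terms=2)
hits = nearest(index, "flu outbreak")
```

In this toy run the third document drops out of the "flu" list because "flu" is not among its two highest-count terms, which is exactly the kind of approximation the pruning heuristic accepts.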

kNN: Results • Great reduction in size of inverted index and great increase in speed of classification • Slight additional cost in effectiveness • Effectiveness slightly below our best methods (Bayesian probit and logistic classifiers) • Compressed index 90% smaller than original index with only 7-12% loss in effectiveness (macro-F1) • Approximate matching is 10 to 100 times faster with only 2-10% loss in effectiveness (macro-F1) • Ours are the first large-scale experiments on search engine heuristics for neighbor lookup in kNN • Partnership between theoreticians and practitioners. 74
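
For readers unfamiliar with the effectiveness measure quoted above, macro-F1 averages the F1 score of each class with equal weight. The sketch below uses a single-label toy example for brevity; in a routing task with many topics the same per-class computation would be applied to each topic's yes/no decisions. The data and names are illustrative.

```python
def macro_f1(true_labels, predicted_labels):
    """Average the per-class F1 scores, giving every class equal weight."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

truth = ["health", "health", "health", "ports"]
guess = ["health", "health", "ports", "ports"]
score = macro_f1(truth, guess)
```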

Bayesian Methods • Bayesian statistical methods place “prior” probability distributions on all unknowns, and then compute “posterior” distribution for the unknowns conditional on the knowns. Thomas Bayes 75
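
In symbols, writing θ for the unknowns and D for the knowns (notation assumed here, not used on the slides), the posterior is obtained from the prior by Bayes' rule:

```latex
\[
  p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
  \qquad
  p(D) \;=\; \int p(D \mid \theta)\, p(\theta)\, d\theta .
\]
```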

Bayesian Methods • Zhang and Oles (2001): developed an efficient optimization algorithm for logistic regression (10,000 dimensions) and achieved excellent predictive performance. • The Bayesian approach explicitly incorporates prior knowledge about model complexity (“regularization”) • We extended the Bayesian approach to incorporate a prior requirement for sparsity. • Logistic regression has one parameter per dimension; our sparse model sets many of these to zero; handles hundreds of thousands of parameters efficiently. • Resulting sparse models produce outstanding accuracy and ultra-fast predictions with no ad hoc feature selection 76
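
One way a sparsity prior can zero out parameters is sketched below. It assumes a Laplace (double-exponential) prior on each weight, under which the posterior mode coincides with L1-penalized logistic regression; the slides do not name the prior, so this is an assumption. The optimizer is plain proximal gradient descent, chosen for brevity and not claimed to be the project's algorithm. All names and data are hypothetical.

```python
import math

def soft_threshold(value, amount):
    """Shrink value toward zero by amount, returning exactly 0.0 inside the band."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def sparse_logistic_map(X, y, penalty=0.1, step=0.5, iterations=500):
    """Posterior-mode weights for logistic regression under a Laplace prior.

    X is a list of feature rows, y a list of 0/1 labels. No intercept term.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iterations):
        grad = [0.0] * d
        for row, label in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (p - label) * row[j] / n
        # Gradient step on the likelihood, then shrinkage from the prior.
        w = [soft_threshold(wj - step * gj, step * penalty)
             for wj, gj in zip(w, grad)]
    return w

# Feature 0 determines the label; feature 1 is only weakly related to it.
X = [[1, 1], [1, -1], [1, 1], [-1, 1], [-1, -1], [-1, -1]]
y = [1, 1, 1, 0, 0, 0]
weights = sparse_logistic_map(X, y)
```

In this toy problem the weakly related feature receives a weight of exactly zero, so it could be skipped entirely at prediction time.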

Bayesian Methods: Sample Results • We have implemented several efficient variants, e.g., probit, informative priors. • Publicly released software; over 1000 downloads • Compared to Zhang & Oles, our implementation: – Eliminates ad hoc feature selection – Often uses less than 1% of the features at prediction time – Is publicly available • Accuracy: as good as the best results ever published. • In sum, we have a sparseness-inducing Bayesian approach that produces dramatically simpler models with no loss in accuracy 77

Streaming Data Analysis • Motivated by need to make decisions about data during an initial scan as data “stream by” • Recent development of theoretical CS algorithms • Algorithms motivated by intrusion detection, transaction applications, time series transactions 78

Streaming Text Data: “Historic” Data Analysis • The accumulation of text messages is massive over time • A lot of streaming research is focused on ongoing or current analyses • It is a great challenge to use only summarized historic data and see if a currently emerging phenomenon had precursors occurring in the past • We are working on a novel architecture for historic and posterior analyses via small summaries - “sketches” 79

Streaming Analysis Tool: CM Sketch • Theoretical: We have developed the CM Sketch, which uses $(1/\varepsilon)\log(1/\delta)$ space to approximate the data distribution with error at most $\varepsilon$ and probability of success at least $1-\delta$. – All other previously known sample or sketch methods use space at least $1/\varepsilon^2$. – CM Sketch is an order of magnitude better. • Practical: A few tens of KB give an accurate summary of large data: Create summaries of data that allow historic queries to find – Heavy Hitters (Most Frequent Items) – Quantiles of a Distribution (Median, Percentiles, etc.) – Items with large changes 80
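
A compact Count-Min style sketch is shown below as an illustration of how such a summary can work. The table has about (1/ε) columns and log(1/δ) rows, matching the space bound above; the specific hash construction, the class name, and the parameter defaults are assumptions made for this example rather than details from the slides.

```python
import hashlib
import math
import random

class CountMinSketch:
    """Approximate frequency counts in a small fixed-size table."""

    def __init__(self, epsilon=0.01, delta=0.01, seed=0):
        self.width = math.ceil(math.e / epsilon)       # columns, about 1/epsilon
        self.depth = math.ceil(math.log(1.0 / delta))  # rows, about log(1/delta)
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.prime = (1 << 61) - 1
        rng = random.Random(seed)
        # One (a, b) pair per row for hashes of the form (a*key + b) mod prime.
        self.hashes = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(self.depth)]

    def _columns(self, item):
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        for a, b in self.hashes:
            yield ((a * key + b) % self.prime) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        """Never undercounts; collisions can only inflate a row, so take the minimum."""
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

sketch = CountMinSketch(epsilon=0.01, delta=0.01)
for word in ["flu", "port", "flu", "cargo", "flu"]:
    sketch.add(word)
flu_estimate = sketch.estimate("flu")
```

With non-negative updates the estimate is always at least the true count, and with probability at least 1−δ it exceeds the true count by no more than ε times the total number of items added. Heavy-hitter and quantile queries can be built on top of such point estimates.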

Outline • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy 81

Large-scale Automated Author Identification 82

Statistical Analysis of Text • Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems • First (?) is Thomas C. Mendenhall in 1887 83

• Hamilton versus Madison: the Federalist Papers • Mosteller and Wallace (1963) used Naïve Bayes with a Poisson and Negative Binomial model • Good predictive performance 84

Some Background • Identification technologies important for homeland security and in the legal system • Author attribution for textual artifacts using “topic independent” stylometric features has a long history • Historical focus on small numbers of authors and low-dimensional representations via function words 85

Author ID Project Objectives • Application of state-of-the-art statistical and computing technologies to authorship attribution • Work with very high-dimensional document representations • Focus on providing working solutions to particular problems 86

Author ID Project Focus Goal: Identification of Authors From Large Collection of Objects • traditional disputed authorship (choose among k known authors) • clustering of “putative” authors (e.g., internet handles: termin8r, heyr, KaMaKaZie) • document pair analysis: Were two documents written by the same author? • odd-man-out: Were these documents written by one of this set of authors or by someone else? 87

Representation • Long tradition in stylometry that seeks a small number of textual characteristics that distinguish the texts of authors from one another (Burrows, Holmes, Binongo, Hoover, Mosteller & Wallace, McMenamin, Tweedie, etc.) • Typically use “function words” (a, with, as, were, all, would, etc.) followed by PCA & cluster analysis • Function words aim to be “topic-independent” • Hoover (2003) shows that using all high-frequency words does a better job than function words alone 88
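
As a concrete picture of a function-word representation, the sketch below turns a text into a vector of relative frequencies over the six example words listed above. The tokenizer and function name are illustrative assumptions; a real study would use a much longer word list and then apply PCA or clustering to the resulting vectors.

```python
import re
from collections import Counter

# The example function words quoted on the slide; real lists are far longer.
FUNCTION_WORDS = ["a", "with", "as", "were", "all", "would"]

def function_word_profile(text):
    """Relative frequency of each function word among all tokens in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]

profile = function_word_profile("All were as a group would be")
```

Because the vector ignores content words, two texts on different topics by the same author can still land close together.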

Idiosyncratic Usage • Idiosyncratic usage less formalized in the literature (misspellings, repeated neologisms, etc.) but apparently useful. For example, Foster’s unmasking of Klein as the author of “Primary Colors”: – “Klein and Anonymous loved unusual adjectives ending in -y and -inous: cartoony, chunky, crackly, dorky, snarly, …, slimetudinous, vertiginous, …” – “Both Klein and Anonymous added letters to their interjections: ahh, aww, naww.” – “Both Klein and Anonymous loved to coin words beginning in hyper-, mega-, post-, quasi-, and semi-, more than all others put together” – “Klein and Anonymous use ‘riffle’ to mean rifle or rustle, a usage for which the OED provides no instance in the past thousand years” 89

Odd-Man Out Were these documents written by one of this set of authors or by someone else? • Training data contains documents by given set of authors • Test data contains documents by some set of authors including some not in original set • Bayesian hierarchical model incorporates prior knowledge that model parameters for different authors differ from each other • Initial success on small-scale simulated examples • Generalizations for more than one new author 90

Some Results • Created largest-ever (?) feature set including function words, suffixes, POS tags, lengths, spelling errors, common English errors, grammatical errors, phrases, idiosyncratic usage, n-grams, etc. • Extensive experiments for 1-of-K and “odd-man-out” • New 1.2 million message Listserv corpus, 82,000 authors 91

Some Results - II • Developed general purpose feature extraction software for author attribution • Bayesian Multinomial Regression Software extends our highly scalable, sparse, BBR software (MMS Project) to the multi-class case 92

Outline • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy 93

“Special Focus” on Computational and Mathematical Epidemiology smallpox 94

Components of a Special Focus • Working Groups • Tutorials • Workshops • Visitor Programs • Graduate Student Programs • Postdoc Programs • Dissemination 95

A Sampling of Working Groups WG’s on Large Data Sets: • Adverse Event/Disease Reporting, Surveillance & Analysis • Data Mining and Epidemiology WG’s on Analogies between Computers and Humans: • Analogies between Computer Viruses/Immune Systems and Human Viruses/Immune Systems • Distributed Computing, Social Networks, and Disease Spread Processes 96

WG’s on Methods/Tools of Theoretical CS • Phylogenetic Trees and Rapidly Evolving Diseases • Order-Theoretic Aspects of Epidemiology WG’s on Computational Methods for Analyzing Large Models for Spread/Control of Disease • Spatio-temporal and Network Modeling of Diseases • Methodologies for Comparing Vaccination Strategies 97

WG’s on Mathematical Sciences Methodologies • Mathematical Models and Defense Against Bioterrorism • Predictive Methodologies for Infectious Diseases • Statistical, Mathematical, and Modeling Issues in the Analysis of Marine Diseases 98

A Sampling of Workshops on Modeling of Infectious Diseases • The Pathogenesis of Infectious Diseases • Models/Methodological Problems of Botanical Epidemiology WS on Modeling of Non-Infectious Diseases • Disease Clusters 99

Workshops on Evolution and Epidemiology • Genetics and Evolution of Pathogens • The Epidemiology and Evolution of Influenza • The Evolution and Control of Drug Resistance • Models of Co-Evolution of Hosts and Pathogens 100

Workshops on Methodological Issues • Capture-recapture Models in Epidemiology • Spatial Epidemiology and Geographic Information Systems • Ecologic Inference • Combinatorial Group Testing 101

Outline • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy 102

The DIMACS Working Group on Adverse Event/Disease Reporting, Surveillance, and Analysis 103

Working Group on Adverse Event/Disease Reporting, Surveillance, and Analysis • Health surveillance a core activity in public health • Concerns about bioterrorism have attracted attention to new surveillance methods: – OTC drug sales – Subway worker absenteeism – Ambulance dispatches • Spawns need for novel statistical methods for surveillance of multiple data streams. • WG coordinated closely with National Syndromic Surveillance Conferences 104

New Data Types for Public Health Surveillance • Managed care patient encounter data • Pre-diagnostic/chief complaint (text data) • Over-the-counter sales transactions – Drug store – Grocery store • 911 emergency calls • Ambulance dispatch data • Absenteeism data • ED discharge summaries • Prescription/pharmaceuticals • Adverse event reports 105

Farzad Mostashari 106

New Analytic Methods and Approaches • Spatial-temporal scan statistics • Statistical process control (SPC) • Bayesian applications • Market-basket association analysis • Text mining • Rule-based surveillance • Change-point techniques 107

Subgroup on Privacy & Confidentiality of Health Data • Privacy concerns are a major stumbling block to public health surveillance, in particular bioterrorism surveillance. • Challenge: produce anonymous data specific enough for research. • Exploring ways to remove identifiers (Social Security number, telephone number, zip code) from data sets. • Exploring ways to aggregate or remove information from data sets. • Partnerships with cryptographers • Exploring methods of combinatorial optimization 108

Outline • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy 109

Bioterrorism Working Group anthrax 110

Bioterrorism Working Group • Biosurveillance • Evolution • Modeling Bioterror Response Logistics • Computer Science Challenges • Agroterrorism 111

Modeling Bioterror Response Logistics Exploring Discrete Optimization/Queueing • size of stockpiles of vaccines • allocation of medications • analysis of bottlenecks in treatment facilities • transportation schedules 1947 smallpox vaccination queue, NYC 112

Agroterrorism • Subgroup just starting • Interest in plant diseases • Partnership with the National Plant Diagnostic Network • Emphasis on Data Mining and Epidemiology 113

Outline • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy 114

Working Group on Modeling Social Responses to Bioterrorism • Models of the spread of infectious disease commonly assume passive bystanders and rational actors who will comply with health authorities. • It is not clear how well this assumption applies to situations like a bioterrorist attack using smallpox or plague. 115

Working Group on Modeling Social Responses to Bioterrorism Interdisciplinary group is discussing incorporating social behavior into models, building models of public health decision-making, risk communication. Some Issues • Movement • Compliance • Rumor • Subcultural differences • Indirect economic effects • Social stigmata • Panic How do you measure the indirect cost of an attack? 116

Outline • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy 117

Predicting Disease Outbreaks from Remote Sensing and Media Data • Outbreaks of disease in other parts of the world have the capacity to affect the security of the US • Joint project with the Imaging Science and Information Systems Center at Georgetown University Medical School (ISIS Center) 118

Predicting Disease Outbreaks from Remote Sensing and Media Data • Recent work has shown that it’s possible to predict disease outbreaks in distant parts of the world using remotely sensed satellite data. • SARS and heightened avian flu in the Pacific Rim appeared following temperature anomalies in China. • Could we have anticipated this given enviro-climatic information? 119

Predicting Disease Outbreaks from Remote Sensing and Media Data • Rift Valley Fever epidemic in 1997/8 in East Africa occurred following heavy flooding related to El Niño • Flooding in Venezuela in 1995 resulted in a multi-pathogen outbreak. 120

Predicting Disease Outbreaks from Remote Sensing and Media Data • Indications and warnings can alert US responders to bioevents in faraway places. • Disease that can result in social disruptions can be detected in open source media reports even if there is no official reporting of this. 121

Predicting Disease Outbreaks from Remote Sensing and Media Data • A model developed at the ISIS Center at Georgetown predicts social disruptions due to disease based on keyword “hit counts” from text-based sources (media reports). • DIMACS Project goal: Use media model to develop ways to predict social disruptions from disease from remote sensing enviro-climatic data. • We will be using remote sensing data indicating increased Normalized Difference Vegetation Index (NDVI). 122
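
For reference, NDVI is conventionally computed per pixel from near-infrared and red reflectance. The symbols below follow common remote-sensing usage and are not taken from the slides:

```latex
\[
  \mathrm{NDVI} \;=\; \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}},
  \qquad -1 \le \mathrm{NDVI} \le 1 .
\]
```

Higher values indicate denser green vegetation, so a rise in NDVI after heavy rainfall is the kind of enviro-climatic signal the project intends to relate to later media reports.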

Predicting Disease Outbreaks from Remote Sensing and Media Data • Project Premise: We can use enviro-climatic indices such as NDVI coupled with disease-related social disruption predictors from media data delayed by several months to validate the enviro-climatic indicators as predictors. • Approach: Machine Learning • Project waiting to get started 123

Predicting Disease Outbreaks from Remote Sensing and Media Data • The approach is similar to ones used by members of the DIMACS team to estimate probability of a match between remotely sensed signals and a signature that has been observed before. This work has been applied to face recognition and explosive detection. 124

Outline • Bioterrorism Sensor Location • Port of Entry Inspection Algorithms • Monitoring Message Streams • Author Identification • Computational and Mathematical Epidemiology • Adverse Event/Disease Reporting/Surveillance/Analysis • Bioterrorism Working Group • Modeling Social Responses to Bioterrorism • Predicting Disease Outbreaks from Remote Sensing and Media Data • Communication Security and Information Privacy 125

“Special Focus” on Communication Security and Information Privacy 126

“Special Focus” on Communication Security and Information Privacy Working Groups • Privacy-Preserving Data Mining • Usable Privacy and Security Software • Data De-Identification, Combinatorial Optimization, Graph Theory, and the Stat-OR Interface • Intrusion Detection and Network Security Management Systems 127

“Special Focus” on Communication Security and Information Privacy A Selection of Workshops • Software Security • Applied Cryptography and Network Security • Large-scale Internet Attacks • Mobile and Wireless Security • Security of Web Services and E-Commerce • Database Security: Query Authorization and Information Inference 128

Working Group on Analogies between Computer Viruses and Biological Viruses • Can ideas for defending against biological viruses lead to ideas for defending against computer viruses? • Concern about large gap between initial time of attack and implementation of defensive strategies • “Public health” approach: Once a virus has infected a machine, it tries to connect it to as many computers as possible, as fast as possible. A “throttle” limits rate at which a computer can connect to new computers. 129
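
The throttle idea can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not a description of any deployed system: it remembers a few recently contacted hosts, lets traffic to those through immediately, and releases connections to new hosts at a fixed rate per clock tick. The class name, parameters, and tick-based clock are all hypothetical.

```python
from collections import deque

class ConnectionThrottle:
    """Limit how quickly one machine may start talking to new hosts."""

    def __init__(self, max_new_per_tick=1, memory=4):
        self.max_new_per_tick = max_new_per_tick
        self.recent = deque(maxlen=memory)  # hosts contacted recently
        self.pending = deque()              # new hosts waiting for release

    def request(self, host):
        """Return True if the connection may proceed now, False if it is queued."""
        if host in self.recent:
            return True
        if host not in self.pending:
            self.pending.append(host)
        return False

    def tick(self):
        """Advance the clock one step and release a limited number of new hosts."""
        released = []
        while self.pending and len(released) < self.max_new_per_tick:
            host = self.pending.popleft()
            self.recent.append(host)
            released.append(host)
        return released

throttle = ConnectionThrottle(max_new_per_tick=1)
burst = [throttle.request(h) for h in ["h1", "h2", "h3"]]  # worm-like burst
first_release = throttle.tick()
repeat_ok = throttle.request("h1")
```

Normal traffic, which mostly revisits a small set of hosts, is barely delayed, while a burst of connections to many new hosts builds up a long queue that both slows the spread and serves as a visible warning sign.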
