
Sensor and Graph Mining
Christos Faloutsos
Carnegie Mellon University & IBM
www.cs.cmu.edu/~christos
School of Computer Science, Carnegie Mellon
USC 04

Joint work with
• Anthony Brockwell (CMU/Stat)
• Deepayan Chakrabarti (CMU)
• Spiros Papadimitriou (CMU)
• Chenxi Wang (CMU)
• Yang Wang (CMU)

Outline
• Introduction - motivation
• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions

Introduction
• Sensor devices
  – Temperature, weather measurements
  – Road traffic data
  – Geological observations
  – Patient physiological data
• Embedded devices
  – Network routers
  – Intelligent (active) disks

Introduction
• Limited resources
  – Memory
  – Bandwidth
  – Power
  – CPU
• Remote environments
  – No human intervention

Introduction – problem definition
• Given a semi-infinite stream of values (time series) x_1, x_2, …, x_t, …
• Find patterns, forecasts, outliers…

Introduction
• E.g.,
[Figure: time-series plot annotated “Periodicity? (twice daily)”, “‘Noise’??”, “Periodicity? (daily)”]

Introduction
• Can we capture these patterns
  – automatically
  – with limited resources?
[Figure: time-series plot annotated “Periodicity? (twice daily)”, “‘Noise’??”, “Periodicity? (daily)”]

Related work
Statistics: Time series forecasting
• Main problem: “[…] The first step in the analysis of any time series is to plot the data [and inspect the graph]” [Brockwell 91]
• Typically:
  – Resource intensive
  – Cannot update online
• AR(I)MA and seasonal variants
• ARFIMA, GARCH, …

Related work
Databases: Continuous Queries
• Typically, different focus:
  – “Compression”
  – Not generative models
• Largely orthogonal problem…
  – Gilbert, Guha, Indyk et al. (STOC 2002)
  – Garofalakis, Gibbons (SIGMOD 2002)
  – Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003)
  – Gehrke, Korn, et al. (SIGMOD 2001), Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002)
  – Guha, Koudas (ICDE 2003)
  – Datar, Gionis, Indyk et al. (SODA 2002)
  – Madden+ [SIGMOD 02], [SIGMOD 03]

Goals
• Adapt and handle arbitrary periodic components
• No human intervention/tuning
Also:
• Single pass over the data
• Limited memory (logarithmic)
• Constant-time update

Outline
• Introduction - motivation
• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions

Wavelets
“Straight” signal
[Figure: signal x_t and panels I_1 through I_8, each plotted against time t]

Wavelets
Introduction – Haar
[Figure: Haar decomposition of x_t on time and frequency axes, with panels W_{1,1} … W_{1,4}; W_{2,1}, W_{2,2}; W_{3,1}; V_{4,1}]
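
A minimal sketch of the Haar decomposition pictured above, assuming an orthonormal Haar basis, a signal length that is a power of two, and the sign convention (left minus right) for detail coefficients. The function name `haar_transform` and the sample signal are illustrative, not taken from the talk.

```python
import numpy as np

def haar_transform(x):
    """Return (details, smooth) for a signal whose length is a power of two.

    details[l - 1] holds the level-l detail coefficients (the W panels);
    smooth is the single coarsest scaling coefficient (the V panel).
    """
    v = np.asarray(x, dtype=float)
    details = []
    while len(v) > 1:
        left, right = v[0::2], v[1::2]
        details.append((left - right) / np.sqrt(2.0))  # detail (difference) part
        v = (left + right) / np.sqrt(2.0)              # smooth (average) part
    return details, v[0]

signal = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 6.0, 2.0]
details, smooth = haar_transform(signal)
# The transform is orthonormal, so the coefficients carry the same energy as the signal.
energy = sum(float(np.sum(d ** 2)) for d in details) + smooth ** 2
```

With 8 points this yields 4, 2, and 1 detail coefficients at levels 1 to 3, plus one smooth coefficient.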

Wavelets
• So?
• Wavelets compress many real signals well…
  – Image compression and processing
  – Vision; Astronomy, seismology, …
• Wavelet coefficients can be updated as new points arrive [Kotidis+]
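
One way to see why coefficients can be updated as points arrive: keep a single unpaired value per level and cascade upward whenever a pair completes. This is a hedged sketch, not the algorithm of [Kotidis+]; the class name `StreamingHaar` and its fields are hypothetical. The `emitted` list is kept only for demonstration; the working state is `pending`, which holds one slot per level and therefore grows as O(lg N).

```python
import math

class StreamingHaar:
    """Incremental orthonormal Haar transform over an unbounded stream."""

    def __init__(self):
        self.pending = []  # pending[l]: unpaired smooth value at level l, or None
        self.emitted = []  # (level, detail coefficient), in order of completion

    def update(self, x):
        value, level = float(x), 0
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                self.pending[level] = value  # wait for the partner value
                return
            left = self.pending[level]
            self.pending[level] = None
            self.emitted.append((level + 1, (left - value) / math.sqrt(2.0)))
            value = (left + value) / math.sqrt(2.0)  # pass the smooth part upward
            level += 1

sh = StreamingHaar()
for point in [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 6.0, 2.0]:
    sh.update(point)
```

Each arriving point triggers a cascade whose length is amortized constant, since half of all points stop at level 0, a quarter at level 1, and so on.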

Wavelets
Correlations
[Figure: Haar decomposition of x_t (panels W_{1,1} … W_{1,4}; W_{2,1}, W_{2,2}; W_{3,1}; V_{4,1}) with an “=” sign marking equal coefficients]

Wavelets
Correlations
[Figure: Haar decomposition of x_t on time and frequency axes, with panels W_{1,1} … W_{1,4}; W_{2,1}, W_{2,2}; W_{3,1}; V_{4,1}]

Main idea
Correlations
• Wavelets are good…
• …we can do even better
  – One number…
  – …and the fact that they are equal/correlated

Proposed method
[Figure: windows of wavelet coefficients W_{l,t-2}, W_{l,t-1}, W_{l,t} and W_{l',t'-2}, W_{l',t'-1}, W_{l',t'}]
W_{l,t} ≈ β_{l,1} W_{l,t-1} + β_{l,2} W_{l,t-2} + …
W_{l',t'} ≈ β_{l',1} W_{l',t'-1} + β_{l',2} W_{l',t'-2} + …
Small windows suffice… (k ~ 4)

More details…
• Update of wavelet coefficients (incremental)
• Update of linear models (incremental; RLS)
• Feature selection (single-pass)
  – Not all correlations are significant
  – Throw away the insignificant ones: very important!!
[see paper]
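
The incremental linear-model update can be illustrated with textbook recursive least squares (RLS). This is a generic sketch under stated assumptions, not the paper's implementation: the class name, the initial scale `delta`, and the optional forgetting factor are all illustrative choices.

```python
import numpy as np

class RecursiveLeastSquares:
    """Textbook RLS: update k regression weights in O(k^2) per sample."""

    def __init__(self, k, delta=1e3, forgetting=1.0):
        self.w = np.zeros(k)        # current weight estimate
        self.P = delta * np.eye(k)  # running inverse-covariance estimate
        self.lam = forgetting       # 1.0 means no forgetting

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)
        error = y - self.w @ x      # prediction error before the update
        self.w = self.w + gain * error
        self.P = (self.P - np.outer(gain, Px)) / self.lam
        return error

# Synthetic check: recover known weights from noise-free samples.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
rls = RecursiveLeastSquares(k=2)
for _ in range(200):
    x = rng.normal(size=2)
    rls.update(x, true_w @ x)
```

No past samples are stored: the state is one k-vector and one k-by-k matrix per model, which is what makes a small window (k ~ 4) cheap.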

Complexity
• Model update
  – Space: O(lg N + m k^2) = O(lg N)
  – Time: O(k^2) = O(1)
• Where
  – N: number of points (so far)
  – k: number of regression coefficients; fixed
  – m: number of linear models; O(lg N)
[see paper]

Outline
• Introduction - motivation
• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions

Setup
• First half used for model estimation
• Models applied forward to forecast entire second half
• AR, Seasonal AR (SAR): R
  – Simplest possible estimation – no maximum likelihood estimation (MLE), etc.
• … vs. Python scripts
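
To make the protocol concrete, here is a hedged Python sketch of the train-on-first-half, forecast-second-half setup using a plain least-squares AR(p) fit. It is not the R code used for the baselines; `fit_ar`, `forecast_ar`, and the sine test series are illustrative assumptions.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares AR(p): x[t] ~ c[0]*x[t-1] + ... + c[p-1]*x[t-p]."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    # Column i holds the lag-(i+1) values aligned with targets x[p:].
    X = np.column_stack([x[p - i - 1 : n - i - 1] for i in range(p)])
    coeffs, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return coeffs

def forecast_ar(history, coeffs, steps):
    """Iterate the fitted model forward, feeding forecasts back in."""
    p = len(coeffs)
    buf = list(np.asarray(history, dtype=float)[-p:])
    out = []
    for _ in range(steps):
        nxt = float(np.dot(coeffs, buf[::-1]))  # most recent value first
        out.append(nxt)
        buf = buf[1:] + [nxt]
    return np.array(out)

t = np.arange(200)
series = np.sin(2 * np.pi * t / 20)
train, held_out = series[:100], series[100:]
coeffs = fit_ar(train, p=2)
pred = forecast_ar(train, coeffs, steps=len(held_out))
```

A noise-free sine obeys an exact AR(2) recurrence, so this toy case forecasts well; the slides that follow show that AR fails on the harder signals.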

Results
Synthetic data – Triangle pulse
• Triangle pulse
• AR captures wrong trend (or none)
• Seasonal AR (SAR) estimation fails

Results
Synthetic data – Mix
• Mix (sine + square pulse)
• AR captures wrong trend (or none)
• Seasonal AR estimation fails

Results
Real data – Automobile (filtered)
• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales

Results
Real data – Automobile
• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales
• AR fails to capture any trend (average)
• Seasonal AR estimation fails

Results
Real data – Automobile
• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales
• AWSOM spots periodicities, automatically

Results
Real data – Automobile
• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales
• Generation with identified noise

Results
Real data – Sunspot
• Sunspot intensity
  – Slightly time-varying “period”
• AR captures wrong trend (average)
• Seasonal ARIMA
  – Captures immediate wrong downward trend
  – Requires human to determine seasonal component period (fixed)

Results
Real data – Sunspot
• Sunspot intensity
  – Slightly time-varying “period”
Estimation: 40 minutes (R) vs. 9 seconds (Python)

Variance
[Figure: variance vs. scale plot, annotated “~Hurst exponent” and “~1 hour”]
• Variance (log-power) vs. scale:
  – “Noise” diagnostic (if decreasing linear…)
  – Can use to estimate noise parameters

Running time
[Figure: time (t) vs. stream size (N)]

Space requirements
Equal total number of model parameters

Conclusion
No human:
✓ Adapt and handle arbitrary periodic components
✓ No human intervention/tuning
Limited resources:
✓ Single pass over the data
✓ Limited memory (logarithmic)
✓ Constant-time update

Outline
• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
  – Motivation & problem definition
  – Related work
  – Main idea
  – Experiments
• Conclusions

Introduction
[Figures: Internet Map [lumeta.com]; Food Web [Martinez ’91]; Protein Interactions [genomebiology.com]; Friendship Network [Moody ’01]]
► Graphs are ubiquitous

Introduction
• What can we do with graph analysis?
  – Immunization
  – Information dissemination
  – Network value of a customer [Domingos+]
[Figure: “Needle exchange” networks of drug users [Weeks et al. 2002], with “bridges” labeled]

Problem definition
• Q1: How does a virus spread across an arbitrary network?
• Q2: Will it create an epidemic?
• (In a sensor setting, with a ‘gossip’ protocol, will a rumor/query spread?)

Framework
• Susceptible-Infected-Susceptible (SIS) model
  – Cured nodes immediately become susceptible
[Figure: state diagram. “Susceptible/healthy” → “Infected & infectious” (infected by neighbor); “Infected & infectious” → “Susceptible/healthy” (cured internally)]

The model
• (virus) Birth rate β: probability that an infected neighbor attacks
• (virus) Death rate δ: probability that an infected node heals
[Figure: node N with neighbors N1, N2, N3; labels “Healthy”, “Infected”, “Prob. β”, “Prob. δ”]
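
The SIS dynamics can be sketched as a discrete-time simulation. The assumptions here are mine, not the talk's: time advances in synchronous steps, each infected neighbor attacks independently with probability β, infections and cures in a step are computed from the state at the start of that step, and the function name `simulate_sis` is hypothetical.

```python
import numpy as np

def simulate_sis(adj, beta, delta, infected0, steps, seed=0):
    """Return the number of infected nodes after each step (including step 0)."""
    rng = np.random.default_rng(seed)
    A = np.asarray(adj)
    n = A.shape[0]
    infected = np.zeros(n, dtype=bool)
    infected[list(infected0)] = True
    counts = [int(infected.sum())]
    for _ in range(steps):
        infected_neighbors = A @ infected.astype(int)
        # A healthy node escapes only if every infected neighbor fails to attack.
        p_attacked = 1.0 - (1.0 - beta) ** infected_neighbors
        newly = ~infected & (rng.random(n) < p_attacked)
        cured = infected & (rng.random(n) < delta)
        infected = (infected & ~cured) | newly  # cured nodes are susceptible again
        counts.append(int(infected.sum()))
    return counts

# Tiny star: node 0 is the hub, nodes 1..4 are spokes.
star = np.zeros((5, 5), dtype=int)
star[0, 1:] = 1
star[1:, 0] = 1
counts = simulate_sis(star, beta=1.0, delta=0.0, infected0=[0], steps=2)
```

With β = 1 and δ = 0 the hub infects every spoke in one step; with β = 0 and δ = 1 the infection dies out after one step.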

Epidemic threshold τ
Defined as the value of τ such that, if β/δ < τ, an epidemic cannot happen
Thus,
• given a graph
• compute its epidemic threshold

Epidemic threshold τ
What should τ depend on?
• avg. degree? and/or highest degree?
• and/or variance of degree?
• and/or determinant of the adjacency matrix?

Basic Homogeneous Model
Homogeneous graphs [Kephart-White ’91, ’93]
• Epidemic threshold = 1/<k>
• Homogeneous connectivity <k>, i.e., all nodes have ~same degree (unrealistic)

Power-law Networks
• Model for Barabási-Albert networks
  – [Pastor-Satorras & Vespignani, ’01, ’02]
  – Epidemic threshold = <k> / <k^2>
  – for BA type networks, with only γ = 3 (γ = power-law exponent, i.e., the slope)
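
The degree-moment threshold <k> / <k^2> depends only on the degree sequence, so it is easy to compute for any graph. A small sketch follows; the function name is hypothetical, and the star with 99 spokes is used purely as an input on which the arithmetic can be checked by hand.

```python
import numpy as np

def degree_moment_threshold(adj):
    """Return <k> / <k^2> for an undirected graph given as an adjacency matrix."""
    degrees = np.asarray(adj).sum(axis=1).astype(float)
    return degrees.mean() / np.mean(degrees ** 2)

# Star with one hub (degree 99) and 99 spokes (degree 1 each).
d = 99
star = np.zeros((d + 1, d + 1), dtype=int)
star[0, 1:] = 1
star[1:, 0] = 1
tau_degree = degree_moment_threshold(star)
# <k> = 198/100 = 1.98 and <k^2> = (99^2 + 99)/100 = 99, so the ratio is 0.02.
```

The formula was derived for BA-type networks, so applying it to a star is outside its stated scope.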

Epidemic threshold
• Homogeneous graphs: 1/<k>
• BA (γ=3): <k> / <k^2>
• more complicated graphs: ?
• arbitrary, REAL graphs: ?
• how many parameters??

Epidemic threshold
• [Theorem] We have no epidemic if
    β/δ < τ = 1/λ_{1,A}
  where
  – β: attack prob.
  – δ: recovery prob.
  – τ: epidemic threshold
  – λ_{1,A}: largest eigenvalue of adj. matrix A
Proof: [Wang+03]
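
Evaluating the theorem's condition takes a single eigenvalue computation. The sketch below assumes an undirected graph (symmetric adjacency matrix), which is what makes `numpy.linalg.eigvalsh` applicable; the function names and the triangle example are illustrative.

```python
import numpy as np

def epidemic_threshold(adj):
    """Return tau = 1 / lambda_1 for a symmetric adjacency matrix."""
    A = np.asarray(adj, dtype=float)
    lam1 = np.max(np.linalg.eigvalsh(A))  # largest eigenvalue of A
    return 1.0 / lam1

def no_epidemic(adj, beta, delta):
    """True when beta/delta falls below the threshold."""
    return beta / delta < epidemic_threshold(adj)

# Triangle (complete graph on 3 nodes): eigenvalues are 2, -1, -1, so tau = 0.5.
triangle = np.array([[0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 0]])
tau = epidemic_threshold(triangle)
```

For large sparse graphs, an iterative eigensolver would be the more practical choice than a dense decomposition.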

Epidemic threshold for various networks
• sanity checks / older results:
• Homogeneous networks
  – λ_{1,A} = <k>; τ = 1/<k>
  – where <k> = average degree
  – This is the same result as that of Kephart & White!
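
The homogeneous case can be checked numerically on a regular graph, where every node has exactly the same degree. The ring lattice used here is my choice of example, and `ring_lattice` is a hypothetical helper; it assumes k is even and n > k.

```python
import numpy as np

def ring_lattice(n, k):
    """Adjacency matrix: each node links to its k/2 nearest neighbors on each side."""
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            A[i, j] = A[j, i] = 1
    return A

A = ring_lattice(n=20, k=4)
avg_degree = A.sum(axis=1).mean()
lam1 = np.max(np.linalg.eigvalsh(A.astype(float)))
# For a k-regular graph the all-ones vector is an eigenvector with eigenvalue k,
# and no eigenvalue can exceed the maximum degree, so lam1 equals <k>.
```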

Epidemic threshold for various networks
• sanity checks / older results:
• Star networks
  – λ_{1,A} = sqrt(d); τ = 1/sqrt(d)
  – where d = the degree of the central node
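
A quick numerical check of the star result, using d = 99 spokes to match the star graph in the experiments. This is only a sketch; the variable names are illustrative.

```python
import numpy as np

d = 99                              # degree of the central node
A = np.zeros((d + 1, d + 1))
A[0, 1:] = 1.0                      # hub is node 0
A[1:, 0] = 1.0
lam1 = np.max(np.linalg.eigvalsh(A))
tau = 1.0 / lam1
# lam1 should equal sqrt(99), so tau is roughly 0.1005.
```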

Epidemic threshold for various networks
• sanity checks / older results:
• Infinite, power-law networks
  – λ_{1,A} = ∞; τ = 0: *any* virus has a chance! [Barabasi et al]
• Finite power-law networks
  – τ = 1/λ_{1,A}

Outline
• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
  – Motivation & problem definition
  – Related work
  – Main idea
  – Experiments
• Conclusions

Experiments
• 2 graphs
  – Star network: one “hub” + 99 “spokes”
  – “Oregon” Internet AS graph:
    • 10,900 nodes, 31,180 edges
    • topology.eecs.umich.edu/data.html
• More in our paper: [SRDS ’03]

Experiments (Star)
[Figure: three cases. β/δ > τ (above threshold); β/δ = τ (at the threshold); β/δ < τ (below threshold)]

Experiments (Oregon)
[Figure: three cases. β/δ > τ (above threshold); β/δ = τ (at the threshold); β/δ < τ (below threshold)]

Our prediction vs. previous prediction
[Figure: number of infected nodes vs. β/δ for the Oregon and Star graphs; predictions labeled “PL3” and “Our”]
• our predictions are more accurate

Conclusions
We found an epidemic threshold
✓ that applies to any network topology
✓ and it depends only on one parameter of the graph

Overall conclusions
• Automatic stream mining: AWSOM
• Graphs and virus propagation: eigenvalue

Ongoing / related work
• Streams
  – how to find hidden variables on multiple streams [w/ Spiros and Jimeng Sun]
  – ‘network tomography’ [w/ Airoldi+]
• Graphs
  – graph partitioning [w/ Deepay+]
  – important subgraphs [w/ Tomkins + McCurley]
  – graph generators [RMAT, w/ Deepay]

Thank you!
Contact info:
christos@cs.cmu.edu
spapadim@cs.cmu.edu
deepay@cs.cmu.edu

Main References
• Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining, VLDB 2003, Berlin, Germany, Sept. 2003.
• [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: an Eigenvalue Viewpoint, SRDS 2003, Florence, Italy.

Additional References
• C. Faloutsos, K. McCurley, A. Tomkins: Connection Subgraphs, SIAM-DM 2004 workshop on link analysis
• D. Chakrabarti, Y. Zhan, C. Faloutsos: RMAT: A recursive graph generator, SIAM-DM 2004
• Edoardo Airoldi, Christos Faloutsos: iFilter: Network tomography using particle filters (submitted)