CMU SCS Bio-informatics, Graph and Stream mining Christos Faloutsos CMU IC '07 C. Faloutsos
CONGRATULATIONS!
Outline • Problem definition / Motivation • Biological image mining • Graphs and power laws • Streams and forecasting • [Scalability: Gb and Tb of data…] • Conclusions
Motivation • Data mining: ~ find patterns (rules, outliers) • Trends in fly embryo gene expression? • What do real graphs look like? • What do (numerical) streams look like?
FEMine: Mining Fly Embryos
Problem • Given – fly embryo images – and their labels (e.g., 1 h, 2 h, radiated, etc.) • Find patterns and trends • Ultimate goal: how genes affect each other.
With • Eric Xing (CMU CS) • Bob Murphy (CMU – Bio) • Tim Pan (CMU -> Google) • Andre Balan (U. Sao Paulo)
Outline • Problem definition / Motivation • Biological image mining • Graphs and power laws • Streams and forecasting • [Scalability: Gb and Tb of data…] • Conclusions
Graphs - why should we care? • Internet Map [lumeta.com] • Food Web [Martinez ’91] • Protein Interactions [genomebiology.com] • Friendship Network [Moody ’01]
Joint work with • Dr. Deepayan Chakrabarti (CMU/Yahoo R.L.)
Problem: network and graph mining • What does the Internet look like? • What does the web look like? • What constitutes a ‘normal’ social network? • What is ‘normal’/‘abnormal’? • Which patterns/laws hold?
Graph mining • Are real graphs random?
Laws and patterns • Are real graphs random? NO!! • Diameter • in- and out-degree distributions • other (surprising) patterns
Laws – degree distributions • Q: avg degree is ~3 - what is the most probable degree? • [Plots: count vs. degree, with degree 3 marked]
Solution: • The plot is linear in log-log scale [FFF’99] • freq = degree^(-2.15) • [Plot: frequency vs. outdegree, Nov ’97; exponent = slope, O = -2.15]
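A minimal sketch of how the exponent on this slide can be read off: a power law such as freq = degree^(-2.15) is a straight line in log-log scale, so a least-squares line through (log degree, log freq) recovers the exponent as its slope. The data below are synthetic (an assumption, not the Nov ’97 measurements), and `loglog_slope` is a hypothetical helper name.

```python
import math


def loglog_slope(degrees, freqs):
    """Least-squares slope of log(freq) against log(degree)."""
    xs = [math.log(d) for d in degrees]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard simple-regression slope: covariance over variance.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic, noise-free power law (assumed data, for illustration only).
degrees = list(range(1, 101))
freqs = [d ** -2.15 for d in degrees]

slope = loglog_slope(degrees, freqs)
print(f"estimated exponent: {slope:.2f}")
```

On real, noisy counts a plain least-squares fit in log-log scale can be biased; the sketch only illustrates why the slope is the exponent.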
But: • Q1: How about graphs from other domains? • Q2: How about temporal evolution?
The Peer-to-Peer Topology [Jovanovic+] • Frequency versus degree • Number of adjacent peers follows a power-law
More power laws: • citation counts (citeseer.nj.nec.com, 6/2001) • [Plot: log(count) vs. log(#citations); ‘Ullman’ marked]
More power laws: • web hit counts [w/ A. Montgomery] • [Plot: Web Site Traffic (Zipf), log(count) vs. log(in-degree); labels: users, sites, ‘ebay’]
But: • Q1: How about graphs from other domains? • Q2: How about temporal evolution?
Time evolution • with Jure Leskovec (CMU) • and Jon Kleinberg (Cornell) (‘best paper’ KDD’05)
Evolution of the Diameter • Prior work on Power Law graphs hints at slowly growing diameter: – diameter ~ O(log N) • What is happening in real data? • Diameter shrinks over time – As the network grows the distances between nodes slowly decrease
Diameter – ArXiv citation graph • Citations among physics papers • 1992 – 2003 • One graph per year • [Plot: diameter vs. time [years]]
Diameter – “Patents” • Patent citation network • 25 years of data • [Plot: diameter vs. time [years]]
Temporal Evolution of the Graphs • N(t) … nodes at time t • E(t) … edges at time t • Suppose that N(t+1) = 2 * N(t) • Q: what is your guess for E(t+1) =? 2 * E(t) • A: over-doubled! – But obeying the ‘Densification Power Law’
Densification – Physics Citations • Citations among physics papers • 2003: – 29,555 papers, 352,807 citations • [Plot: E(t) vs. N(t); exponent 1.69]
Densification – Patent Citations • Citations among patents granted • 1999: – 2.9 million nodes – 16.5 million edges • Each year is a datapoint • [Plot: E(t) vs. N(t); exponent 1.66]
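Worked arithmetic for the "over-doubled" answer, as a sketch under two assumptions: that the Densification Power Law has the form E(t) ∝ N(t)^a, and that the numbers 1.69 and 1.66 on the two plots are the fitted exponents a. Under that reading, doubling the nodes multiplies the edges by 2^a. The function name is hypothetical.

```python
def edge_growth_factor(node_growth, exponent):
    """Factor by which E(t) grows when N(t) grows by node_growth,
    assuming E(t) is proportional to N(t) ** exponent."""
    return node_growth ** exponent


# Exponents as read off the two slides (assumed to be the fitted values).
physics = edge_growth_factor(2.0, 1.69)
patents = edge_growth_factor(2.0, 1.66)
# Exponent 1 would be the naive guess: edges merely double.
naive = edge_growth_factor(2.0, 1.0)

print(f"physics citations: edges grow by about {physics:.2f}x")
print(f"patent citations:  edges grow by about {patents:.2f}x")
print(f"naive guess:       edges grow by {naive:.2f}x")
```

Both factors come out a little above 3, which is what "over-doubled" means here: any exponent above 1 makes the edge count grow faster than the node count.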
Outline • Problem definition / Motivation • Biological image mining • Graphs and power laws – Time evolving graphs + tensors • Streams and forecasting • [Scalability: Gb and Tb of data…] • Conclusions
Tensors for time evolving graphs • [Jimeng Sun+ KDD’06] • [ “ , SDM’07] • [CF, Kolda, Sun, SDM’07 tutorial]
Application: Network Anomaly Detection • Anomaly detection • Data – TCP flows collected at CMU backbone – 500 GB with compression – <source, destination, port #> – 1200 timestamps (hours) – ‘Tensor’ • [Figure: tensor with source and destination axes]
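To make the ‘Tensor’ bullet concrete, here is a minimal sketch of how flow records could be arranged into one sparse <source, destination, port #> slice per hourly timestamp. The record layout, the addresses, and the name `build_flow_tensor` are assumptions for illustration; the slides do not show the actual data format or the analysis applied to the tensor.

```python
from collections import defaultdict


def build_flow_tensor(records):
    """Count flows per (source, destination, port) cell, one slice per hour.

    records: iterable of (hour, source, destination, port) tuples
    (a hypothetical layout). Only non-zero cells are stored.
    """
    tensor = defaultdict(lambda: defaultdict(int))
    for hour, src, dst, port in records:
        tensor[hour][(src, dst, port)] += 1
    return tensor


# Made-up flow records, for illustration only.
records = [
    (0, "10.0.0.1", "10.0.0.9", 80),
    (0, "10.0.0.1", "10.0.0.9", 80),
    (0, "10.0.0.2", "10.0.0.9", 22),
    (1, "10.0.0.1", "10.0.0.7", 443),
]

tensor = build_flow_tensor(records)
for hour in sorted(tensor):
    print(hour, dict(tensor[hour]))
```

A sparse dictionary is used because most <source, destination, port #> combinations never occur in a given hour.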
with • Jimeng Sun • Hui Zhang • Yinglian Xie • (Dave Anderson)
Network anomaly detection • Identify when and where anomalies occurred. • The prominent difference between normal and abnormal periods is mainly due to unusual scanning activity (confirmed by the campus admin). • [Figures: error vs. time (hour); source vs. destination for ‘Normal’ and ‘Abnormal’ periods, with scanners marked]
Outline • Problem definition / Motivation • Biological image mining • Graphs and power laws • Streams and forecasting • [Scalability: Gb and Tb of data…] • Conclusions
Why care about streams? • Sensor devices – Temperature, weather measurements – Road traffic data – Geological observations – Patient physiological data – Chlorine in drinking water (*) • Embedded devices – Network routers
SPIRIT / InteMon • http://warsteiner.db.cs.cmu.edu/demo/intemon.jsp • http://localhost:8080/demo/graphs.jsp • self-* storage system (PDL/CMU) • with Jimeng Sun (CMU/CS) • Evan Hoke (CMU/CS-ug) • Prof. Greg Ganger (CMU/CS/ECE) • John Strunk (CMU/ECE)
self-* system @ CMU • >200 nodes • 40 racks of computing equipment • 774 kW of power • target: 1 PetaByte • goal: self-correcting, self-securing, self-monitoring, self-...
System Architecture
Data center room monitoring • Abnormal dehumidification and reheating cycle is identified • [Plots: Temperature, Humidity]
Outline • Problem definition / Motivation • Biological image mining • Graphs and power laws • Streams and forecasting • [Scalability: Gb and Tb of data…] • Conclusions
Our emphasis: scalability • Gb (->Tb) of data • Dream: to exploit ‘map-reduce’/hadoop by re-designing D.M. algorithms for such an environment, within the D.I.S.C. effort (Data Intensive Scientific Computing)
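As an illustration of what re-designing a data mining task for a map-reduce environment can look like, the sketch below computes an out-degree distribution in two map-reduce passes. It is a single-machine toy in plain Python, not Hadoop code and not the authors' implementation; the `map_reduce` helper, the mapper and reducer names, and the edge list are all hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy in-memory map-reduce: map, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def edge_to_source(edge):
    """Pass 1 mapper: emit (source node, 1) for every edge."""
    src, _dst = edge
    yield src, 1


def degree_to_bucket(item):
    """Pass 2 mapper: emit (degree, 1) for every (node, degree) pair."""
    _node, degree = item
    yield degree, 1


def total(_key, values):
    """Reducer shared by both passes: sum the emitted ones."""
    return sum(values)


# Made-up directed edge list, for illustration only.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "a")]

out_degree = map_reduce(edges, edge_to_source, total)
degree_hist = map_reduce(out_degree.items(), degree_to_bucket, total)
print(out_degree)
print(degree_hist)
```

The point of the design is that each mapper looks at one record at a time and each reducer at one key at a time, so both passes can in principle be spread over many machines.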
D.I.S.C. • Randy Bryant (SCS Dean) • Dave O’Hallaron (dir. of INTEL Pittsburgh) • Garth Gibson • Greg Ganger • ++ • INTEL: 15 nodes; Yahoo: ~100 nodes, running hadoop
DM for Tera- and Peta-bytes • Two-way street: • <- DM can use such infrastructures to find patterns • -> DM can help such infrastructures become self-healing, self-adjusting, ‘self-*’
Conclusions • Biological images, graphs & streams pose fascinating problems • self-similarity, fractals and power laws work, when other methods fail! • SCALABILITY: creates fascinating research problems!
Contact info • christos@cs.cmu.edu • www.cs.cmu.edu/~christos • Wean Hall 7107 • Ph#: x 8.1457 • and, again: WELCOME!