CMU SCS Big graph data analytics Christos Faloutsos
CMU SCS Big (graph) data analytics Christos Faloutsos CMU
CMU SCS CONGRATULATIONS! IC '13 C. Faloutsos 2
Outline
• Q+A
• Problem definition / Motivation
• Graphs and power laws
• Anomaly/fraud detection
• Conclusions
Q+A
• Are you recruiting? How many? -> Maybe, ~1
• How many do you have? -> 4 (+5 postdocs)
• How frequently do you meet them? -> 1/week
• What is your advising style? -> results
• How do you feel about summer internships? -> Yes/Maybe (FB, MSR, IBM, ++)
Outline
• Problem definition / Motivation
• Graphs and power laws
– Patterns
– Scalability and ‘hadoop’
• Anomaly detection
• Conclusions
Motivation
• Data mining: ~ find patterns (rules, outliers)
• What do real graphs look like? Anomalies?
– Virus/influence propagation
• Time series / environmental monitoring
[Figure: temperature in datacenter]
Graphs - why should we care?
[Figures: Food Web [Martinez ’91]; Internet Map [lumeta.com]; ~1 B users, $10-$100 B revenue]
Tensors: Graphs on steroids
• Tensors (= multi-dimensional arrays)
– Predicates (subject, verb, object) in a knowledge base
– “Eric Clapton plays guitar”
– “Barack Obama is the president of U.S.”
• NELL (Never Ending Language Learner) data: modes of size (48 M), (26 M), (26 M); nonzeros = 144 M
• Vagelis Papalexakis (CMU-CS); Tom Mitchell (CMU/CS-MLD)
Open House, 2013 C. Faloutsos (CMU)
Concept Discovery
• Concept Discovery in Knowledge Base
– NP 1: Internet, file, data
– NP 2: Protocol, software, suite
‘NeuroSemantics’
• >200 GB total
Experiments
• GigaTensor solves 100x larger problem
[Figure: GigaTensor vs. Tensor Toolbox; axes (I), (J), (K); annotations: "100x", "Out of Memory"; number of nonzeros = I / 50]
Problem #1 - network and graph mining
• What does the Internet look like?
• What does Facebook look like?
• What is ‘normal’/‘abnormal’?
• Which patterns/laws hold?
– To spot anomalies (rarities), we have to discover patterns
– Large datasets reveal patterns/anomalies that may be invisible otherwise…
Graph mining
• Are real graphs random?
Laws and patterns
• NO!!
• Diameter
• In- and out-degree distributions
• Other (surprising) patterns
Outline
• Problem definition / Motivation
• Graphs and power laws
– Patterns
– Scalability and ‘hadoop’
• Anomaly/Fraud detection
• Conclusions
S1 – degree distributions
• Q: avg degree is ~3 - what is the most probable degree?
[Figure: two count-vs-degree plots, with degree 3 marked on the x-axis]
Solution
• The plot is linear in log-log scale [FFF’99]
• freq = degree^(-2.15)
[Figure: frequency vs. outdegree in log-log scale, Nov ’97; exponent = slope, O = -2.15]
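A minimal sketch of how the slope in a log-log degree plot can be estimated. It is not the procedure of [FFF’99]: it is an ordinary least-squares line through (log degree, log count) points, which is only a rough fit. The function name `loglog_slope` and the synthetic histogram are assumptions made for illustration.

```python
import math


def loglog_slope(hist):
    """Least-squares slope of log(count) against log(degree).

    `hist` maps a degree value to how many nodes have that degree.
    """
    pts = [(math.log(d), math.log(c)) for d, c in hist.items() if d > 0 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx


# Synthetic (hypothetical) histogram that follows count = 10000 * degree^(-2.15).
synthetic = {d: 10000 * d ** -2.15 for d in range(1, 200)}

slope = loglog_slope(synthetic)                       # close to -2.15
most_probable = max(synthetic, key=synthetic.get)     # degree with the highest count
print(f"slope = {slope:.2f}, most probable degree = {most_probable}")
```

On such a skewed distribution the most probable degree is the smallest one, far below the average, which is the point of the question on the previous slide.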
Solution #S.2: Triangle ‘Laws’
• Real social networks have a lot of triangles
– Friends of friends are friends
• Any patterns?
Triangle Law: #S.2 [Tsourakakis ICDM 2008]
• X-axis: degree; Y-axis: mean # triangles
• n friends -> ?? triangles
• n friends -> ~n^1.6 triangles
[Figure: panels for Reuters, SN, Epinions]
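A small sketch of the quantity plotted on that slide: for each degree, the mean number of triangles a node of that degree participates in. This is a brute-force, in-memory illustration on a toy graph, not the method of [Tsourakakis ICDM 2008]; the helper names and the edge list are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def triangles_per_node(adj):
    """For each node, count pairs of its neighbors that are themselves connected."""
    return {
        node: sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        for node, nbrs in adj.items()
    }


def mean_triangles_by_degree(adj):
    """Average triangle count over all nodes sharing the same degree."""
    tri = triangles_per_node(adj)
    sums, counts = defaultdict(int), defaultdict(int)
    for node, nbrs in adj.items():
        sums[len(nbrs)] += tri[node]
        counts[len(nbrs)] += 1
    return {d: sums[d] / counts[d] for d in sums}


# Toy graph with two triangles: (1, 2, 3) and (3, 4, 5).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (3, 5), (5, 6)]
adj = build_adjacency(edges)
tri = triangles_per_node(adj)
by_degree = mean_triangles_by_degree(adj)
print(tri)
print(by_degree)
```

Plotting `by_degree` in log-log scale for a real network is what would reveal a power-law slope such as the ~1.6 quoted above; the toy graph is far too small to show it.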
Triangle counting for large graphs?
• Anomalous nodes in Twitter (~3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]
And many more patterns…
• Diameter: SHRINKS with size!
• #nodes vs #edges (power law(!))
• # conn. components (power law, too)
• Contact/phone-call duration (log-logistic)
• Total node weight vs # edges (superlinear/power law)
• …
Outline
• Problem definition / Motivation
• Graphs and power laws
– Patterns and anomalies
– Scalability and ‘hadoop’
• Anomaly/fraud detection
• Conclusions
Scalability
• Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture”, IEEE Micro 2003]
• Yahoo: 5 PB of data [Fayyad, KDD’07]
• Problem: machine failures, on a daily basis
• How to parallelize data mining tasks, then?
• A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/
details
[Figure: map/reduce data flow. The user program forks a master, which assigns map and reduce tasks. Mappers read the input splits (Split 0, Split 1, Split 2) from HDFS and write locally; reducers do a remote read and sort, then write Output File 0 and Output File 1.]
• By default: 3-way replication
• Late/dead machines: ignored, transparently (!)
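The data flow above can be imitated in a few lines of plain Python. This is a single-process sketch of the map, shuffle/sort, and reduce phases, not the Hadoop API; it has no replication or failure handling, and the task (node degrees from an edge list) and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_edge(edge):
    """Map phase: each edge contributes one unit of degree to both endpoints."""
    u, v = edge
    yield (u, 1)
    yield (v, 1)


def shuffle(pairs):
    """Shuffle phase: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_degree(key, values):
    """Reduce phase: sum the partial counts for one node."""
    return key, sum(values)


def run_job(splits):
    """Run mappers over every input split, then shuffle and reduce."""
    intermediate = []
    for split in splits:          # on a cluster, each split would go to its own mapper
        for edge in split:
            intermediate.extend(map_edge(edge))
    grouped = shuffle(intermediate)
    return dict(reduce_degree(k, vs) for k, vs in sorted(grouped.items()))


# Two hypothetical input splits of an edge list.
splits = [[("a", "b"), ("a", "c")], [("b", "c"), ("c", "d")]]
degrees = run_job(splits)
print(degrees)
```

Because mappers only see their own split and reducers only see one key at a time, the same three functions can be spread over many machines, which is what makes the pattern attractive for graph-scale data.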
Outline
• Problem definition / Motivation
• Graphs and power laws
– Patterns
– Scalability and ‘hadoop’
• Anomaly/Fraud detection
• Conclusions
E-bay Fraud detection
• w/ Polo Chau & Shashank Pandit, CMU [WWW’07]
E-bay Fraud detection - NetProbe
App-store fraud
• “Opinion Fraud Detection in Online Reviews using Network Effects”, Leman Akoglu, Rishi Chandy, CF, ICWSM’13
Problem
• Given
– user-product review network
– review sign (+/-)
• Classify
– objects into type-specific classes:
  users: ‘honest’ / ‘fraudster’
  products: ‘good’ / ‘bad’
  reviews: ‘genuine’ / ‘fake’
• No side data! (e.g., timestamp, review text)
Formulation: BP
[Figure labels: User, honest; Product, bad; review signs +/–; Before, After]
Top scorers
[Figure: users vs. products; + positive (4-5) rating, o negative (1-2) rating]
‘Fraud-bot’ member reviews
• Same developer!
• Duplicated text!
• Same day activity!
Outline
• Problem definition / Motivation
• Graphs and power laws
– Patterns and anomalies
– Scalability and ‘hadoop’
• Anomaly/fraud detection
• Streams, spikes, environment, data center monitoring
• Conclusions
Datacenter Monitoring & Management (Lei Li)
• Goal: save energy in data centers
– US alone, $7.4 B power consumption (2011)
• Challenge:
– 1 TB per day
– Complex cyber-physical systems
[Figure: temperature in datacenter]
Spike forecasting (Yasuko Matsubara)
– Forecast not only tail-part, but also rise-part!
[Figure panels: (1) First spike; (2) Release date; (3) Two weeks before release]
Environmental data
[Figures: Temperatures, April; Temp. and pressure over time; Sao Paulo, Brazil]
Open research questions
• Patterns/anomalies for time-evolving graphs (call graph, 3 M people x 6 mo)
• Patterns/anomalies given node attributes
• Graph understanding / attribution
• …
• How is the human brain wired?
Contact info
• www.cs.cmu.edu/~christos
• GHC 8019
• Ph#: x8.1457
• www.cs.cmu.edu/~christos/TALKS/13ic/
• FYI: Course: 15-826, Tu-Th 1:30-3:00
• and, again, WELCOME!