CMU SCS Large Graph Mining – Patterns, Tools and Cascade analysis Christos Faloutsos CMU

CMU SCS Thank you! • Brian Gallagher • Jan Winfield LLNL, Jan 2013 C. Faloutsos (CMU) 2

Roadmap
• Introduction – Motivation
  – Why ‘big data’
  – Why (big) graphs?
• Problem#1: Patterns in graphs
• Problem#2: Tools
• Problem#3: Scalability
• Conclusions

Why ‘big data’
• Why?
• What is the problem definition?
• What are the major research challenges?

Main message: Big data: often > experts
• ‘Super Crunchers: Why Thinking-By-Numbers is the New Way To Be Smart’ by Ian Ayres, 2008
• Google won the machine translation competition 2005
• http://www.itl.nist.gov/iad/mig//tests/mt/2005/doc/mt05eval_official_results_release_20050801_v3.html

Problem definition – big picture
• Tera/Peta-byte data -> Analytics -> Insights, outliers
• Main emphasis in this talk

Problem definition – big picture: (my personal) rules of thumb
If the data
• fits in memory -> R, matlab, scipy
• fits on a single disk -> RDBMS (sqlite3, mysql, postgres)
• needs multiple (<100-1000) disks -> parallel RDBMS (Vertica, TeraData)
• needs multiple (>1000) disks -> hadoop, pig

(Free) Resource for graphs
• Open source system for mining huge graphs: PEGASUS project (PEta GrAph mining System)
• www.cs.cmu.edu/~pegasus
• Apache license for s/w
• code and papers

Research challenges
• The usual ones from data mining
  – Data cleansing
  – Feature engineering
  – …
• PLUS
  – Scalability ( < O(N**2))
  – Real data *disobey* textbook assumptions (uniformity, independence, Gaussian, Poisson) with huge performance implications

Graphs - why should we care?
• Food Web [Martinez ’91]
• >$10B revenue, >0.5B users
• Internet Map [lumeta.com]
• IR: bi-partite graphs (doc-terms): documents D1 ... DN, terms T1 ... TM
• web: hyper-text graph
• ... and more:
• ‘viral’ marketing
• web-log (‘blog’) news propagation
• computer network security: email/IP traffic and anomaly detection
• ...
• Subject-verb-object -> graph
• Many-to-many db relationship -> graph

Outline
• Introduction – Motivation
• Problem#1: Patterns in graphs
  – Static graphs
  – Weighted graphs
  – Time evolving graphs
• Problem#2: Tools
• Problem#3: Scalability
• Conclusions

Problem #1 - network and graph mining
• What does the Internet look like?
• What does Facebook look like?
• What is ‘normal’/‘abnormal’?
• Which patterns/laws hold?
  – To spot anomalies (rarities), we have to discover patterns
  – Large datasets reveal patterns/anomalies that may be invisible otherwise…

Graph mining: laws and patterns
• Are real graphs random?
• A: NO!!
  – Diameter
  – in- and out- degree distributions
  – other (surprising) patterns
• So, let’s look at the data

Solution# S.1
• Power law in the degree distribution [SIGCOMM 99]
• Plot: internet domains, log(degree) vs. log(rank); labeled points att.com, ibm.com; slope -0.82
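The slope on a log(degree) vs. log(rank) plot can be estimated with an ordinary least-squares line fit. The following is a minimal sketch under stated assumptions: the function name `rank_exponent` and the synthetic degree list are illustrative and do not come from the talk, and a plain least-squares fit is only one of several ways to estimate a power-law exponent.

```python
import math

def rank_exponent(degrees):
    """Least-squares slope of log(degree) versus log(rank).

    Rank 1 is the highest-degree node. Zero-degree nodes are skipped
    because log(0) is undefined.
    """
    ranked = sorted((d for d in degrees if d > 0), reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(d) for d in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical data: degrees that follow degree = 1000 * rank^(-0.82) exactly.
synthetic = [1000.0 * r ** -0.82 for r in range(1, 201)]
slope = rank_exponent(synthetic)
```

On real degree sequences the points scatter around the line, so the fitted slope is an estimate rather than an exact recovery.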

Solution# S.2: Eigen Exponent E
• A2: power law in the eigenvalues of the adjacency matrix
• Plot (May 2001): eigenvalue vs. rank of decreasing eigenvalue; Exponent = slope, E = -0.48
• [Mihail, Papadimitriou ’02]: slope is ½ of rank exponent

But: How about graphs from other domains?

More power laws:
• web hit counts [w/ A. Montgomery]
• Plot: Web Site Traffic; Count (log scale) vs. in-degree (log scale); annotations: Zipf, ``ebay’’, users, sites

epinions.com
• who-trusts-whom [Richardson + Domingos, KDD 2001]
• Plot: count vs. user (out) degree; annotation: trusts-2000-people

And numerous more
• # of sexual contacts
• Income [Pareto] – ‘80-20 distribution’
• Duration of downloads [Bestavros+]
• Duration of UNIX jobs (‘mice and elephants’)
• Size of files of a user
• …
• ‘Black swans’

Roadmap
• Introduction – Motivation
• Problem#1: Patterns in graphs
  – Static graphs
    • degree, diameter, eigen,
    • triangles
    • cliques
  – Weighted graphs
  – Time evolving graphs
• Problem#2: Tools

Solution# S.3: Triangle ‘Laws’
• Real social networks have a lot of triangles
  – Friends of friends are friends
• Any patterns?

Triangle Law: #S.3 [Tsourakakis ICDM 2008]
• Plots: HEP-TH, ASN, Epinions; X-axis: # of participating triangles; Y-axis: count (~ pdf)

Triangle Law: #S.4 [Tsourakakis ICDM 2008]
• Plots: Reuters, SN, Epinions; X-axis: degree; Y-axis: mean # triangles
• n friends -> ~n^1.6 triangles
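Both triangle plots start from the number of triangles each node participates in. A minimal in-memory sketch of that count follows; the function name `triangles_per_node` and the toy edge list are illustrative assumptions, and this direct neighbor-pair check is not the method used in the cited papers, which target graphs far too large for this approach.

```python
from itertools import combinations

def triangles_per_node(edges):
    """Count, for every node, the triangles it participates in.

    The graph is treated as undirected; self-loops are ignored.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {}
    for node, nbrs in adj.items():
        # A pair of neighbors that are themselves connected closes a triangle.
        counts[node] = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return counts

# Hypothetical toy graph with two triangles: {1, 2, 3} and {3, 4, 5}.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5), (5, 6)]
counts = triangles_per_node(toy_edges)
total_triangles = sum(counts.values()) // 3  # each triangle is seen from 3 nodes
```

Plotting degree against the mean of these per-node counts, on log-log axes, gives the kind of plot summarized in #S.4.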

Triangle counting for large graphs?
• Anomalous nodes in Twitter (~3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

Observations on weighted graphs?
• A: yes - even more ‘laws’!
• M. McGlohon, L. Akoglu, and C. Faloutsos. Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD 2008

Observation W.1: Fortification
• Q: How do the weights of nodes relate to degree?
• More donors, more $? (diagram: donations of $10, $5, $7; candidates ‘Reagan’, ‘Clinton’)

Observation W.1: fortification: Snapshot Power Law
• Weight: super-linear on in-degree
• exponent ‘iw’: 1.01 < iw < 1.26
• More donors, even more $
• Plot (Orgs-Candidates): In-weights ($) vs. Edges (# donors)
• e.g. John Kerry, $10M received, from 1K donors

Problem: Time evolution
• with Jure Leskovec (CMU -> Stanford)
• and Jon Kleinberg (Cornell – sabb. @ CMU)

T.1 Evolution of the Diameter
• Prior work on Power Law graphs hints at slowly growing diameter:
  – diameter ~ O(log N)
• What is happening in real data?
• Diameter shrinks over time

T.1 Diameter – “Patents”
• Patent citation network
• 25 years of data
• @1999
  – 2.9 M nodes
  – 16.5 M edges
• Plot: diameter vs. time [years]

T.2 Temporal Evolution of the Graphs
• N(t) … nodes at time t
• E(t) … edges at time t
• Suppose that N(t+1) = 2 * N(t)
• Q: what is your guess for E(t+1) =? 2 * E(t)
• A: over-doubled!
  – But obeying the ``Densification Power Law’’

T.2 Densification – Patent Citations
• Citations among patents granted
• @1999
  – 2.9 M nodes
  – 16.5 M edges
• Each year is a datapoint
• Plot: E(t) vs. N(t); slope 1.66
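To see what "over-doubled" means numerically, assume the densification power law takes the form E(t) proportional to N(t)^a and read the 1.66 on the patent plot as the exponent a. The helper name `edge_growth_factor` is illustrative only.

```python
def edge_growth_factor(node_growth_factor, exponent):
    """Factor by which edges grow when nodes grow by node_growth_factor,
    assuming E(t) is proportional to N(t) ** exponent."""
    return node_growth_factor ** exponent

# Doubling the nodes with the exponent read off the patent plot (assumed a = 1.66).
factor = edge_growth_factor(2.0, 1.66)

# Baseline: an exponent of 1 would mean edges merely double.
linear_factor = edge_growth_factor(2.0, 1.0)
```

Under these assumptions, doubling the nodes multiplies the edges by a bit more than three, compared with exactly two for linear growth.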

T.3: popularity over time
• Post popularity drops-off – exponentially? (diagram: post @t, in-links @t + lag)
• Plot: # in links (log) vs. days after post (log); lag: days after post
• POWER LAW! Exponent? -1.6
  – close to -1.5: Barabasi’s stack model
  – and like the zero-crossings of a random walk

-1.5 slope
• J. G. Oliveira & A.-L. Barabási. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005).
• Plot: Prob(RT > x) (log) vs. Response time (log)

Roadmap
• Introduction – Motivation
• Problem#1: Patterns in graphs
• Problem#2: Tools
  – Belief Propagation
  – Tensors
  – Spike analysis
• Problem#3: Scalability
• Conclusions

E-bay Fraud detection
• w/ Polo Chau & Shashank Pandit, CMU [www’07]

E-bay Fraud detection - NetProbe
• Compatibility matrix over the states F, A, H (entries as extracted: 99%, 99%, 49%, 49%); heterophily

GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries
• U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos
• KDD’12

Background: Tensor
• Tensors (=multi-dimensional arrays) are everywhere
  – Hyperlinks & anchor text [Kolda+, 05] (figure: URL 1 x URL 2 x Anchor Text, with anchor terms C#, C++, Java)
  – Sensor stream (time, location, type)
  – Predicates (subject, verb, object) in knowledge base
    • “Eric Clapton plays guitar”
    • “Barack Obama is the president of U.S.”
    • NELL (Never Ending Language Learner) data: dimensions labeled (48M), (26M), (26M); Nonzeros = 144M
  – Anomaly Detection in Computer networks: (Time-stamp, IP-source, IP-destination)

all I learned on tensors: from
• Nikos Sidiropoulos, UMN
• Tamara Kolda, Sandia Labs (tensor toolbox)

Problem Definition
• How to decompose a billion-scale tensor?
  – Corresponds to SVD in 2D case
• Q1: Dominant concepts/topics?
• Q2: Find synonyms to a given noun phrase?
• (and how to scale up: |data| > RAM)
• NELL (Never Ending Language Learner) data: dimensions labeled (48M), (26M), (26M); Nonzeros = 144M

Experiments
• GigaTensor solves 100x larger problem
• Plot: GigaTensor vs. Tensor Toolbox (Out of Memory) over tensor dimensions (I), (J), (K); Number of nonzero = I / 50

A1: Concept Discovery
• Concept Discovery in Knowledge Base

A2: Synonym Discovery

Rise and fall patterns in social media
• Meme (# of mentions in blogs)
  – short phrases sourced from U.S. politics in 2008
  – “you can put lipstick on a pig”
  – “yes we can”
• Can we find a unifying model, which includes these patterns?
  – four classes on YouTube [Crane et al. ’08]
  – six classes on Meme [Yang et al. ’11]
• Answer: YES!
• We can represent all patterns by a single model, in Matsubara+ SIGKDD 2012

Main idea - SpikeM
• 1. Un-informed bloggers (uninformed about rumor), at time n=0
• 2. External shock at time nb (e.g., breaking news)
• 3. Infection (word-of-mouth), from time n=nb+1
• Infectiveness of a blog-post at age n:
  – β: strength of infection (quality of news)
  – Decay function
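The three ingredients above can be turned into a small simulation. The slide's own equation did not survive extraction, so the update rule below is an assumption: it follows one reading of the published SpikeM base model, with a power-law decay whose exponent -1.5 is borrowed from the "-1.5 slope" slide earlier in the talk. The function name `spikem_like` and all parameter values are hypothetical.

```python
def spikem_like(total_bloggers, beta, shock_time, shock_size, steps, noise=0.0):
    """Newly informed bloggers per time tick under an assumed SpikeM-style update.

    Returns a list where entry n is the number of bloggers informed at tick n.
    """
    uninformed = float(total_bloggers)      # ingredient 1: everyone starts un-informed
    new_informed = [0.0] * (steps + 1)
    for n in range(steps):
        influence = 0.0
        for t in range(shock_time, n + 1):
            # ingredient 2: the external shock acts once, at tick shock_time
            source = new_informed[t] + (shock_size if t == shock_time else 0.0)
            age = n + 1 - t
            # ingredient 3: infectiveness decays with age (assumed exponent -1.5)
            influence += source * beta * age ** -1.5
        delta = uninformed * influence + noise
        delta = min(max(delta, 0.0), uninformed)   # cannot inform more than remain
        new_informed[n + 1] = delta
        uninformed -= delta
    return new_informed

# Hypothetical parameters chosen only to produce a visible rise and fall.
series = spikem_like(total_bloggers=1000, beta=0.001, shock_time=5,
                     shock_size=10.0, steps=60)
peak_tick = max(range(len(series)), key=series.__getitem__)
```

Nothing happens before the shock; afterwards the count rises while many bloggers are still un-informed and falls once that pool is depleted and older posts lose infectiveness.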

SpikeM - with periodicity (details)
• Full equation of SpikeM, with periodicity
• Bloggers change their activity over time (e.g., daily, weekly, yearly)
• Plot: activity vs. Time n; annotations: noon, Peak, 3am, Dip

Details
• Analysis – exponential rise and power-law fall
  – Rise-part: SI -> exponential; SpikeM -> exponential
  – Fall-part: SI -> exponential; SpikeM -> power law
• Plots: Lin-log and Log-log

Tail-part forecasts
• SpikeM can capture tail part

“What-if” forecasting
• e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date
• Plot annotations: (1) First spike, (2) Release date, (3) Two weeks before release
• SpikeM can forecast upcoming spikes

Roadmap
• Introduction – Motivation
• Problem#1: Patterns in graphs
• Problem#2: Tools
• Problem#3: Scalability – PEGASUS
  – Diameter
  – Connected components
• Conclusions

Scalability
• Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003]
• Yahoo: 5Pb of data [Fayyad, KDD’07]
• Problem: machine failures, on a daily basis
• How to parallelize data mining tasks, then?
• A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/

Roadmap – Algorithms & results
• Table rows: Degree Distr., Pagerank, Diameter/ANF, Conn. Comp, Triangles, Visualization
• Table columns: Centralized, Hadoop/PEGASUS
• Cell entries as extracted (cell positions not recoverable): old, old, old, HERE, old, done, started

HADI for diameter estimation
• Radius Plots for Mining Tera-byte Scale Graphs. U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10
• Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B)
• Our HADI: linear on E (~10B)
  – Near-linear scalability wrt # machines
  – Several optimizations -> 5x faster
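For contrast with HADI, the exact baseline on a small in-memory graph is one breadth-first search per node. The sketch below shows that baseline only; it is not HADI, whose approximation scheme is not described on the slide. The names `hop_counts` and `node_radii` and the toy path graph are illustrative assumptions.

```python
from collections import deque

def hop_counts(adj, source):
    """Breadth-first search: hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def node_radii(adj):
    """Radius of each node: hops needed to reach its farthest reachable node."""
    return {s: max(hop_counts(adj, s).values()) for s in adj}

# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
radii = node_radii(path)
diameter = max(radii.values())
```

A radius plot is then a histogram of these per-node radii. Running one search per node is what becomes prohibitive at a billion nodes.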

YahooWeb graph (120Gb, 1.4B nodes, 6.6B edges)
• Largest publicly available graph ever studied.
• Radius plot (Count vs. Radius): 14 (dir.), ~7 (undir.); earlier estimate: 19+ [Barabasi+], ~1999, ~1M nodes
• 7 degrees of separation (!)
• Diameter: shrunk
• Q: Shape?
• effective diameter: surprisingly small.
• Multi-modality (?!): probably mixture of cores.
• Radius Plot of GCC of YahooWeb.
• Conjecture (plot labels: DE, EN, BR; ~7)

Generalized Iterated Matrix Vector Multiplication (GIMV)
• PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. (ICDM) 2009, Miami, Florida, USA. Best Application Paper (runner-up).
• (details) Matrix – vector Multiplication (iterated) covers:
  – PageRank
  – proximity (RWR)
  – Diameter
  – Connected components
  – (eigenvectors, Belief Prop., …)
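One way to read "generalized" is that the multiply and add of an ordinary matrix-vector product are replaced by pluggable operations. The sketch below applies that idea to connected components, where every node repeatedly takes the smallest label among itself and its neighbors. The slide does not name the operations, so `combine2`, `combine_all`, and `assign` are assumed names, and this single-machine loop is an illustration rather than the PEGASUS implementation.

```python
def gimv_step(entries, vector, combine2, combine_all, assign):
    """One generalized matrix-vector product.

    entries: sparse matrix as (row, col, value) triples.
    vector:  dict mapping node -> current value.
    """
    partial = {i: [] for i in vector}
    for i, j, m_ij in entries:
        partial[i].append(combine2(m_ij, vector[j]))   # replaces multiply
    return {
        # combine_all replaces the row sum; assign merges old and new values
        i: assign(vector[i], combine_all(partial[i])) if partial[i] else vector[i]
        for i in vector
    }

def connected_components(nodes, undirected_edges):
    """Label every node with the smallest node id in its component."""
    entries = [(u, v, 1) for u, v in undirected_edges]
    entries += [(v, u, 1) for u, v in undirected_edges]
    labels = {n: n for n in nodes}
    while True:
        updated = gimv_step(entries, labels,
                            combine2=lambda m, v: v,   # pass the neighbor's label
                            combine_all=min,           # smallest neighbor label
                            assign=min)                # keep the smaller label
        if updated == labels:                          # converged
            return labels
        labels = updated

# Hypothetical toy graph: components {1, 2, 3}, {4, 5}, and the singleton {6}.
labels = connected_components(range(1, 7), [(1, 2), (2, 3), (4, 5)])
```

Swapping in multiplication, summation, and a damping step for the three operations would turn the same loop into a PageRank-style iteration, which is the appeal of expressing many algorithms as one primitive.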

Example: GIM-V At Work
• Connected Components – 4 observations (plot: Count vs. Size):
  – 1) 10K x larger than next
  – 2) ~0.7B singleton nodes
  – 3) SLOPE!
  – 4) Spikes! 300-size cmpt X 500. Why? 1100-size cmpt X 65. Why?
  – suspicious financial-advice sites (not existing now)

GIM-V At Work
• Connected Components over Time
• LinkedIn: 7.5M nodes and 58M edges
• Stable tail slope after the gelling point

OVERALL CONCLUSIONS – low level:
• Several new patterns (fortification, triangle-laws, conn. components, etc)
• New tools:
  – belief propagation, gigaTensor, etc
• Scalability: PEGASUS / hadoop

OVERALL CONCLUSIONS – high level
• BIG DATA: Large datasets reveal patterns/outliers that are invisible otherwise

Graph Analytics (diagram), surrounded by: Theory & Algo., Comp. Systems, ML & Stats., Biology, Physics, Social Science, Econ.

CMU SCS References • Leman Akoglu, Christos Faloutsos: RTG: A Recursive Realistic Graph Generator

CMU SCS References • Leman Akoglu, Christos Faloutsos: RTG: A Recursive Realistic Graph Generator Using Random Typing. ECML/PKDD (1) 2009: 13 -28 • Deepayan Chakrabarti, Christos Faloutsos: Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38(1): (2006) LLNL, Jan 2013 C. Faloutsos (CMU) 119

CMU SCS References • Deepayan Chakrabarti, Yang Wang, Chenxi Wang, Jure Leskovec, Christos Faloutsos:

CMU SCS References • Deepayan Chakrabarti, Yang Wang, Chenxi Wang, Jure Leskovec, Christos Faloutsos: Epidemic thresholds in real networks. ACM Trans. Inf. Syst. Secur. 10(4): (2008) • Deepayan Chakrabarti, Jure Leskovec, Christos Faloutsos, Samuel Madden, Carlos Guestrin, Michalis Faloutsos: Information Survival Threshold in Sensor and P 2 P Networks. INFOCOM 2007: 1316 -1324 LLNL, Jan 2013 C. Faloutsos (CMU) 120

References
• Christos Faloutsos, Tamara G. Kolda, Jimeng Sun: Mining large graphs and streams using matrix and tensor tools. Tutorial, SIGMOD Conference 2007: 1174

References
• T. G. Kolda and J. Sun. Scalable Tensor Decompositions for Multi-aspect Data Mining. In: ICDM 2008, pp. 363-372, December 2008.

References
• Jure Leskovec, Jon Kleinberg and Christos Faloutsos: Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD 2005 (Best Research paper award).
• Jure Leskovec, Deepayan Chakrabarti, Jon M. Kleinberg, Christos Faloutsos: Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication. PKDD 2005: 133-145

References
• Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, Christos Faloutsos, "Rise and Fall Patterns of Information Diffusion: Model and Implications", KDD’12, pp. 6-14, Beijing, China, August 2012

References
• Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM, Minneapolis, Minnesota, Apr 2007.
• Jimeng Sun, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos, GraphScope: Parameter-free Mining of Large Time-evolving Graphs, ACM SIGKDD Conference, San Jose, CA, August 2007

References
• Jimeng Sun, Dacheng Tao, Christos Faloutsos: Beyond streams and graphs: dynamic tensor analysis. KDD 2006: 374-383

References
• Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan, Fast Random Walk with Restart and Its Applications, ICDM 2006, Hong Kong.
• Hanghang Tong, Christos Faloutsos, Center-Piece Subgraphs: Problem Definition and Fast Solutions, KDD 2006, Philadelphia, PA

References
• Hanghang Tong, Christos Faloutsos, Brian Gallagher, Tina Eliassi-Rad: Fast best-effort pattern matching in large attributed graphs. KDD 2007: 737-746

Project info & ‘thanks’
www.cs.cmu.edu/~pegasus
Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab

Cast
• Akoglu, Leman
• Beutel, Alex
• Chau, Polo
• Kang, U
• Koutra, Danae
• McGlohon, Mary
• Papalexakis, Vagelis
• Prakash, Aditya
• Tong, Hanghang

Thanks to LLNL colleagues
• Brian Gallagher
• Keith Henderson

Take-home message
Tera/Peta-byte data → Analytics → Insights, outliers
Big data reveal insights that would be invisible otherwise (even to experts)