Tools for Large Graph Mining Deepayan Chakrabarti Thesis
Tools for Large Graph Mining - Deepayan Chakrabarti. Thesis Committee: Christos Faloutsos, Chris Olston, Guy Blelloch, Jon Kleinberg (Cornell) 1
Introduction: Internet Map [lumeta.com], Food Web [Martinez '91], Protein Interactions [genomebiology.com], Friendship Network [Moody '01] ► Graphs are ubiquitous 2
Introduction • What can we do with graphs? • How quickly will a disease spread on this graph? "Needle exchange" networks of drug users [Weeks et al. 2002] 3
Introduction • What can we do with graphs? • How quickly will a disease spread on this graph? • Who are the "strange bedfellows"? • Who are the key people? … "Key" terrorist in the hijacker network [Krebs '01] ► Graph analysis can have great impact 4
Graph Mining: Two Paths Specific applications General issues • Node grouping • Realistic graph generation • Viral propagation • Graph patterns and “laws” • Frequent pattern mining • Graph evolution over time? • Fast message routing 5
Our Work Specific applications General issues • Node grouping • Realistic graph generation • Viral propagation • Graph patterns and “laws” • Frequent pattern mining • Graph evolution over time? • Fast message routing 6
Our Work • Node Grouping: Find "natural" partitions and outliers automatically. • Viral Propagation: Will a virus spread and become an epidemic? • Graph Generation: How can we mimic a given real-world graph? 7
Roadmap Focus of this talk Specific applications 1 • Node grouping 2 • Viral propagation General issues 3 • Realistic graph generation • Graph patterns and “laws” Find “natural” partitions and outliers automatically 4 Conclusions 8
Node Grouping [KDD 04]: Simultaneously group customers and products, or documents and words, or users and preferences … (Customers × Products → Customer Groups × Product Groups) 9
Node Grouping [KDD 04]: Row and column groups • need not be along a diagonal, and • need not be equal in number (both are fine) 10
Motivation § Visualization § Summarization § Detection of outlier nodes and edges § Compression, and others… 11
Node Grouping Desiderata: 1. Simultaneously discover row and column groups 2. Fully Automatic: No “magic numbers” 3. Scalable to large matrices 4. Online: New data should not require full recomputations 12
Closely Related Work • Information-Theoretic Co-clustering [Dhillon+/2003]: number of row and column groups must be specified. Desiderata: ✓ Simultaneously discover row and column groups ✗ Fully Automatic: No "magic numbers" ✓ Scalable to large graphs ✓ Online 13
Other Related Work • K-means and variants [Pelleg+/2000, Hamerly+/2003]: do not cluster rows and columns simultaneously • "Frequent itemsets" [Agrawal+/1994]: user must specify "support" • Information Retrieval [Deerwester+/1990, Hofmann/1999]: choosing the number of "concepts" • Graph Partitioning [Karypis+/1998]: number of partitions, measure of imbalance between clusters 14
What makes a cross-association "good"? 1. Similar nodes are grouped together 2. As few groups as necessary ⇒ a few, homogeneous blocks ⇒ Good Compression implies Good Clustering 15
Main Idea: Good Compression implies Good Clustering. For a binary matrix split into row and column groups, let density p_i^1 = % of dots in block i. Total Encoding Cost = Σ_i size_i · H(p_i^1) (Code Cost) + cost of describing n_i^1, n_i^0 and the groups (Description Cost) 16
Examples: Total Encoding Cost = Σ_i size_i · H(p_i^1) (Code Cost) + cost of describing n_i^1, n_i^0 and groups (Description Cost). One row group, one column group: high code cost, low description cost. m row groups, n column groups: low code cost, high description cost. 17
Why is this better? What makes a cross-association "good"? For the good clustering, the Total Encoding Cost = Σ_i size_i · H(p_i^1) (Code Cost) + cost of describing n_i^1, n_i^0 and groups (Description Cost) is low. 18
Formal problem statement Given a binary matrix, Re-organize the rows and columns into groups, and Choose the number of row and column groups, to Minimize the total encoding cost. 19
Formal problem statement Note: No Parameters Given a binary matrix, Re-organize the rows and columns into groups, and Choose the number of row and column groups, to Minimize the total encoding cost. 20
Algorithms l = 5 col groups k = 5 row groups k=1, l=2 k=2, l=3 k=3, l=4 k=4, l=5 21
Algorithms (l=5, k=5): Start with initial matrix → Find good groups for fixed k and l ↔ Choose better values for k and l → Lower the encoding cost → Final cross-association 22
Fixed k and l (l=5, k=5): Start with initial matrix → Find good groups for fixed k and l ↔ Choose better values for k and l → Lower the encoding cost → Final cross-association 23
Fixed k and l • Re-assign: for each row x, re-assign it to the row group which minimizes the code cost. 1. Row re-assigns 2. Column re-assigns 3. and repeat … 24
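The re-assign step above can be sketched in code. This is an illustrative reading of the slide, not the thesis implementation; the function names and the eps-smoothing of block densities are my own assumptions. Each row is moved to the row group whose block densities encode that row in the fewest bits:

```python
import numpy as np

def row_reassign(A, row_labels, col_labels, k, l):
    """One pass of the re-assign step: move each row to the row group
    whose block densities encode that row most cheaply (in bits)."""
    eps = 1e-9
    # n1[g, j] = ones, n[g, j] = total cells in block (row group g, col group j)
    n1 = np.zeros((k, l))
    n = np.zeros((k, l))
    for j in range(l):
        cols = (col_labels == j)
        n[:, j] = [np.sum(row_labels == g) * cols.sum() for g in range(k)]
        for g in range(k):
            n1[g, j] = A[np.ix_(row_labels == g, cols)].sum()
    p = (n1 + eps) / (n + 2 * eps)          # smoothed block densities
    new_labels = row_labels.copy()
    for x in range(A.shape[0]):
        best, best_cost = row_labels[x], np.inf
        for g in range(k):
            cost = 0.0
            for j in range(l):
                ones = A[x, col_labels == j].sum()
                zeros = (col_labels == j).sum() - ones
                # bits to code row x's slice under block (g, j)'s density
                cost += -ones * np.log2(p[g, j]) - zeros * np.log2(1 - p[g, j])
            if cost < best_cost:
                best, best_cost = g, cost
        new_labels[x] = best
    return new_labels
```

A misplaced row lands back in the group whose density pattern matches it, which is what lowers the code cost at each pass.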
Choosing k and l (l=5, k=5): Start with initial matrix → Find good groups for fixed k and l ↔ Choose better values for k and l → Lower the encoding cost → Final cross-association 25
Row groups Choosing k and l Column groups Split: 1. Find the most “inhomogeneous” group. 2. Remove the rows/columns which make it inhomogeneous. 3. Create a new group for these rows/columns. 26
Algorithms (l=5, k=5): Start with initial matrix → Find good groups for fixed k and l (Re-assigns) ↔ Choose better values for k and l (Splits) → Lower the encoding cost → Final cross-association 27
Experiments l = 5 col groups k = 5 row groups “Customer-Product” graph with Zipfian sizes, no noise 28
Experiments l = 8 col groups k = 6 row groups “Quasi block-diagonal” graph with Zipfian sizes, noise=10% 29
Experiments l = 3 col groups k = 2 row groups “White Noise” graph: we find the existing spurious patterns 30
Experiments: "CLASSIC" documents-words matrix • 3,893 documents • 4,303 words • 176,347 "dots". Combination of 3 sources: • MEDLINE (medical) • CISI (info. retrieval) • CRANFIELD (aerodynamics) 31
Documents Experiments Words “CLASSIC” graph of documents & words: k=15, l=19 32
Experiments insipidus, alveolar, aortic, death, … blood, disease, clinical, cell, … MEDLINE (medical) “CLASSIC” graph of documents & words: k=15, l=19 33
Experiments providing, studying, records, development, … abstract, notation, works, construct, … MEDLINE (medical) CISI (Information Retrieval) “CLASSIC” graph of documents & words: k=15, l=19 34
Experiments shape, nasa, leading, assumed, … MEDLINE (medical) CISI (Information Retrieval) CRANFIELD (aerodynamics) “CLASSIC” graph of documents & words: k=15, l=19 35
Experiments paint, examination, fall, raise, leave, based, … MEDLINE (medical) CISI (Information Retrieval) CRANFIELD (aerodynamics) “CLASSIC” graph of documents & words: k=15, l=19 36
Experiments: "GRANTS" NSF Grant Proposals • 13,297 documents • 5,298 words in abstract • 805,063 "dots" 37
NSF Grant Proposals Experiments Words in abstract “GRANTS” graph of documents & words: k=41, l=28 38
Experiments encoding, characters, bind, nucleus The Cross-Associations refer to topics: • Genetics “GRANTS” graph of documents & words: k=41, l=28 39
Experiments coupling, deposition, plasma, beam The Cross-Associations refer to topics: • Genetics • Physics “GRANTS” graph of documents & words: k=41, l=28 40
Experiments manifolds, operators, harmonic The Cross-Associations refer to topics: • Genetics • Physics • Mathematics • … “GRANTS” graph of documents & words: k=41, l=28 41
Experiments Time (secs) Splits Re-assigns Number of “dots” Linear on the number of “dots”: Scalable 42
Summary of Node Grouping Desiderata: ü Simultaneously discover row and column groups ü Fully Automatic: No “magic numbers” ü Scalable to large matrices ü Online: New data does not need full recomputation 43
Extensions n We can use the same MDL-based framework for other problems: 1. Self-graphs 2. Detection of outlier edges 44
Extension #1 [PKDD 04] • Self-graphs, such as • co-authorship graphs • social networks • the Internet, and the World Wide Web (Bipartite graph: Customers × Products; Self-graph: Authors × Authors) 45
Extension #1 [PKDD 04] • Self-graphs: rows and columns represent the same nodes, so row re-assigns affect column re-assigns… 46
Experiments: DBLP dataset • 6,090 authors in: SIGMOD, ICDE, VLDB, PODS, ICDT • 175,494 co-citation or co-authorship links 47
Experiments: k=8 author groups found (one group contains Stonebraker, DeWitt, Carey) 48
Extension #2 [PKDD 04] • Outlier edges • Which links should not exist? (illegal contact/access?) • Which links are missing? (missing data?) 49
Extension #2 [PKDD 04] Nodes Node Groups Outlier edges Nodes Outliers Node Groups Deviations from “normality” Lower quality compression Find edges whose removal maximally reduces cost 50
Roadmap Specific applications 1 • Node grouping 2 • Viral propagation General issues 3 • Realistic graph generation • Graph patterns and “laws” Will a virus spread and become an epidemic? 4 Conclusions 51
The SIS (or "flu") model • (Virus) birth rate β: probability that an infected neighbor attacks • (Virus) death rate δ: probability that an infected node heals • Cured = Susceptible [Figure: undirected network with healthy and infected nodes] 52
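The SIS dynamics on this slide can be simulated directly. A minimal Monte-Carlo sketch with synchronous updates; the function and variable names are illustrative, not from the thesis:

```python
import numpy as np

def sis_step(A, infected, beta, delta, rng):
    """One synchronous SIS step: each infected neighbor attacks with
    probability beta; each infected node heals with probability delta."""
    n = len(infected)
    # attacks[i, j] = True if infected neighbor j successfully attacks i
    attacks = (rng.random((n, n)) < beta) & (A > 0) & infected[None, :]
    newly_infected = attacks.any(axis=1) & ~infected
    healed = infected & (rng.random(n) < delta)
    return (infected & ~healed) | newly_infected
```

Iterating `sis_step` from a small seed set shows the birth/death competition of the next slide: the infection either dies out or settles at an endemic level.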
The SIS (or “flu”) model n n Competition between virus birth and death Epidemic or extinction? q depends on the ratio β/δ q but also on the network topology Epidemic or Extinction Example of the effect of network topology 53
Epidemic threshold n The epidemic threshold τ is the value such that q If β/δ < τ there is no epidemic q where β = birth rate, and δ = death rate 54
Previous models. Question: What is the epidemic threshold? Answer #1: 1/<k> [Kephart and White '91, '93], BUT homogeneity assumption: all nodes have the same degree (but most graphs have power laws). Answer #2: <k>/<k²> [Pastor-Satorras and Vespignani '01], BUT mean-field assumption: all nodes of the same degree are equally affected (but susceptibility should depend on position in network too). 55
The full solution is intractable! • The full Markov Chain has 2^N states: intractable, so a simplification is needed. • Independence assumption: probability that two neighbors are infected = product of individual probabilities of infection. This is a point estimate of the full Markov Chain. 56
Our model • A non-linear dynamical system (NLDS) which makes no assumptions about the topology. Probability p_{i,t} of node i being infected at time t, with adjacency matrix A: 1 − p_{i,t} = [1 − p_{i,t−1} + δ·p_{i,t−1}] · ∏_{j=1..N} (1 − β·A_{ji}·p_{j,t−1}) (healthy at time t−1, or infected but cured; and no infection received from any other node) 57
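The NLDS update above transcribes almost verbatim into code. A vectorized sketch of one iteration (illustrative only; names are my own):

```python
import numpy as np

def nlds_step(A, p, beta, delta):
    """One NLDS step:
    1 - p_i(t) = [1 - p_i(t-1) + delta*p_i(t-1)] * prod_j (1 - beta*A[j,i]*p_j(t-1))."""
    B = 1.0 - beta * A * p[:, None]       # B[j, i] = 1 - beta * A[j, i] * p_j
    zeta = B.prod(axis=0)                 # prob. node i receives no infection
    return 1.0 - (1.0 - p + delta * p) * zeta
```

Iterating this from any starting probability vector shows the threshold behavior of the next slide: below threshold the vector decays to zero, above it it converges to an endemic fixed point.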
Epidemic threshold • [Theorem 1] We have no epidemic if: β/δ < τ = 1/λ_{1,A}, where β = (virus) birth rate, δ = (virus) death rate, and λ_{1,A} = largest eigenvalue of the adjacency matrix A. ► λ_{1,A} alone decides viral epidemics! 58
Recall the definition of eigenvalues: A·x = λ·x. λ_{1,A} = largest eigenvalue ≈ size of the largest "blob" 59
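This "blob" intuition is easy to check numerically: a clique of size s has largest eigenvalue s − 1, while a star on N nodes has largest eigenvalue √(N−1). A quick illustration (not from the thesis; helper names are my own):

```python
import numpy as np

def largest_eigenvalue(A):
    """Largest eigenvalue of a symmetric adjacency matrix
    (eigvalsh returns eigenvalues in ascending order)."""
    return np.linalg.eigvalsh(A)[-1]

def star(n):
    """Adjacency matrix of a star graph: node 0 linked to all others."""
    A = np.zeros((n, n))
    A[0, 1:] = A[1:, 0] = 1.0
    return A
```

So for the 100-node star of the following experiment, the epidemic threshold is τ = 1/λ_{1,A} ≈ 1/√99 ≈ 0.1.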
Experiments (100-node Star) • β/δ > τ (above threshold) • β/δ = τ (close to the threshold) • β/δ < τ (below threshold) 60
Experiments (Oregon): 10,900 nodes and 31,180 edges • β/δ > τ (above threshold) • β/δ = τ (at the threshold) • β/δ < τ (below threshold) 61
Extensions • This dynamical-systems framework can be exploited further: 1. The rate of decay of the infection 2. Information survival thresholds in sensor/P2P networks 62
Extension #1 n n Below the threshold: How quickly does an infection die out? [Theorem 2] Exponentially quickly 63
Experiment (10K Star Graph): number of infected nodes (log-scale) vs time-steps (linear-scale) is linear on a log-lin scale ⇒ exponential decay. "Score" s = (β/δ)·λ_{1,A} = "fraction" of threshold 64
Experiment (Oregon Graph): number of infected nodes (log-scale) vs time-steps (linear-scale) is linear on a log-lin scale ⇒ exponential decay. "Score" s = (β/δ)·λ_{1,A} = "fraction" of threshold 65
Extension #2 n Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden] • Sensors gain new information • but they may die due to harsh environment or battery failure • so they occasionally try to transmit data to nearby sensors • and failed sensors are occasionally replaced. • Under what conditions does the information survive? 68
Extension #2 • [Theorem 1] The information dies out exponentially quickly if [condition involving the failure rate of sensors, the resurrection rate, the retransmission rate, and the largest eigenvalue of the "link quality" matrix] 69
Roadmap Specific applications 1 • Node grouping 2 • Viral propagation General issues 3 • Realistic graph generation • Graph patterns and "laws" How can we generate a "realistic" graph that mimics a given real-world graph? 4 Conclusions 70
Experiments (Clickstream bipartite graph): Count vs In-degree of websites; Clickstream (•) vs R-MAT (×). Some personal webpage at one extreme; Yahoo, Google and others at the other. 71
Experiments (Clickstream bipartite graph): Count vs Out-degree of users; Clickstream (•) vs R-MAT (×). Email-checking surfers at one extreme; "all-night" surfers at the other. 72
Experiments (Clickstream bipartite graph) Count vs Out-degree Singular value vs Rank Count vs In-degree Hop-plot Left “Network value” Right “Network value” ►R-MAT can match real-world graphs 73
Roadmap Specific applications 1 • Node grouping 2 • Viral propagation General issues 3 • Realistic graph generation • Graph patterns and “laws” 4 Conclusions 74
Conclusions • Two paths in graph mining: • Specific applications: Viral Propagation (non-linear dynamical system; epidemic depends on the largest eigenvalue), Node Grouping (MDL-based approach for automatic grouping) • General issues: Graph Patterns (marks of "realism" in a graph), Generators (R-MAT, a scalable generator matching many of the patterns) 75
Software http://www-2.cs.cmu.edu/~deepay/#Sw • Cross-Associations: to find natural node groups. Used by an "anonymous" large accounting firm; by Intel Research, Cambridge, UK; at UC Riverside (net intrusion detection); at the University of Porto, Portugal. • NetMine: to extract graph patterns quickly + build realistic graphs. Used by Northrop Grumman Corp. • F4: a non-linear time series forecasting package. 76
===CROSS-ASSOCIATIONS=== • Why simultaneous grouping? • Differences from co-clustering and others? • Other parameter-fitting criteria? • Cost surface • Exact cost function • Exact complexity, wall-clock times • Soft clustering • Different weights for code and description costs? • Precision-recall for CLASSIC • Inter-group "affinities" • Collaborative filtering and recommendation systems? • CA versus bipartite cores • Extras • General comments on CA communities 77
===Viral Propagation=== n n n n Comparison with previous methods Accuracy of dynamical system Relationship with full Markov chain Experiments on information survival threshold Comparison with Infinite Particle Systems Intuition behind the largest eigenvalue Correlated failures 78
===R-MAT=== n n n Graph patterns Generator desiderata Description of R-MAT Experiments on a directed graph R-MAT communities via Cross-Associations? R-MAT versus tree-based generators 79
===Graphs in general=== n n Relational learning Graph Kernels 80
Simultaneous grouping is useful Sparse blocks, with little in common between rows Grouping rows first would collapse these two into one! Index 81
Cross-Associations ≠ Co-clustering! Information-theoretic co-clustering: 1. Lossy compression. 2. Approximates the original matrix, while trying to minimize KL-divergence. 3. The number of row and column groups must be given by the user. Cross-Associations: 1. Lossless compression. 2. Always provides complete information about the matrix, for any number of row and column groups. 3. Chosen automatically using the MDL principle. Index 82
Other parameter-fitting methods • The Gap statistic [Tibshirani+ '01]: minimize the "gap" of log-likelihood of intra-cluster distances from the expected log-likelihood. But it • needs a distance function between graph nodes • needs a "reference" distribution • needs multiple MCMC runs to remove "variance due to sampling" ⇒ more time. Index 83
Other parameter-fitting methods • Stability-based method [Ben-Hur+ '02, '03]: run clustering multiple times on samples of data, for several values of k; for low k, clustering is stable; for high k, unstable; choose this transition point. But it • needs many runs of the clustering algorithm • invites arguments over the definition of the transition point. Index 84
Precision-Recall for CLASSIC Index 85
Cost surface (total cost) Surface plot l Contour plot k l k With increasing k and l: Total cost decays very rapidly initially, but then starts increasing slowly Index 86
Cost surface (code cost only) Surface plot l Contour plot k l k With increasing k and l: Code cost decays very rapidly Index 87
Encoding Cost Function: Total encoding cost = Description cost + Code cost, where Description cost = log*(k) + log*(l) (cluster numbers) + N·log(N) + M·log(M) (row/col order) + Σ_i log(a_i) + Σ_j log(b_j) (cluster sizes) + Σ_i Σ_j log(a_i·b_j + 1) (block densities), and Code cost = Σ_i Σ_j a_i·b_j·H(p_{i,j}). Index 88
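The cost function above can be sketched directly in code. This follows the slide term by term; the helper names are my own, and the exact thesis formula may differ in small details (e.g., how cluster sizes are coded):

```python
import numpy as np

def log_star(k):
    """Rissanen's universal code length for an integer:
    log2(k) + log2(log2(k)) + ... over the positive terms."""
    total, x = 0.0, float(k)
    while True:
        x = np.log2(x)
        if x <= 0:
            break
        total += x
    return total

def H(p):
    """Binary Shannon entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def total_cost(A, row_labels, col_labels):
    """Total encoding cost of a cross-association, following the slide:
    description cost (group counts, order, sizes, densities) + code cost."""
    N, M = A.shape
    k, l = row_labels.max() + 1, col_labels.max() + 1
    a = np.array([np.sum(row_labels == i) for i in range(k)])   # row-group sizes
    b = np.array([np.sum(col_labels == j) for j in range(l)])   # col-group sizes
    desc = log_star(k) + log_star(l)                            # cluster numbers
    desc += N * np.log2(N) + M * np.log2(M)                     # row/col order
    desc += np.log2(a).sum() + np.log2(b).sum()                 # cluster sizes
    desc += sum(np.log2(a[i] * b[j] + 1)
                for i in range(k) for j in range(l))            # block densities
    code = 0.0
    for i in range(k):
        for j in range(l):
            block = A[np.ix_(row_labels == i, col_labels == j)]
            code += block.size * H(block.mean())                # a_i*b_j*H(p_ij)
    return desc + code
```

On a clean block-diagonal matrix, the correct grouping beats the trivial one-group encoding, which is exactly the signal the search exploits.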
Complexity of CA • O(E·(k² + l²)), ignoring the number of re-assign iterations, which is typically low. Index 89
Complexity of CA: Time / Σ(k+l) vs number of edges is linear. Index 90
Inter-group distances: two groups are "close" if merging them does not increase the cost by much; distance(i, j) = relative increase in cost on merging i and j. Index 91
Inter-group distances (e.g., 5.5, 4.5, 5.1 between Grp 1, Grp 2, Grp 3): two groups are "close" if merging them does not increase the cost by much; distance(i, j) = relative increase in cost on merging i and j. Index 92
Experiments: author groups Grp 1 … Grp 8 (e.g., Stonebraker, DeWitt, Carey). Inter-group distances can aid in visualization. Index 93
Collaborative filtering and recommendation systems • Q: If someone likes a product X, will (s)he like product Y? A: Check if others who liked X also liked Y. • The focus is on distances between people (typically cosine similarity), not on clustering. Index 94
CA and bipartite cores: related but different. A 3×2 bipartite core (hubs and authorities). Kumar et al. [1999] say that bipartite cores correspond to communities. Index 95
CA and bipartite cores: related but different • CA finds two communities there: one for hubs, and one for authorities. • We gracefully handle cases where a few links are missing. • CA considers connections between all sets of clusters, not just two sets. • Not every node need belong to a non-trivial bipartite core. • CA is (informally) a generalization. Index 96
Comparison with soft clustering • Soft clustering: each node belongs to each cluster with some probability • Hard clustering: one cluster per node Index 97
Comparison with soft clustering: 1. Far more degrees of freedom ⇒ parameter fitting is harder 2. Algorithms can be costlier ⇒ hard clustering is better for exploratory data analysis 3. Some real-world problems require hard clustering, e.g., fraud detection for accountants Index 98
Weights for code cost vs description cost • Total = 1·(code cost) + 1·(description cost); physical meaning: total number of bits • Total = α·(code cost) + β·(description cost); physical meaning: number of encoding bits under some prior Index 99
Formula for re-assigns: for each row x … [figure: re-assignment formula over row and column groups] Index 100
Choosing k and l l=5 k=5 Split: 1. Find the row group R with the maximum entropy per row 2. Choose the rows in R whose removal reduces the entropy per row in R 3. Send these rows to the new row group, and set k=k+1 Index 101
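The split step above can be sketched as follows, using entropy per row as the homogeneity measure. Illustrative only; the helper names are my own and the thesis code may differ:

```python
import numpy as np

def entropy(p):
    """Binary entropy in bits, clipped away from 0 and 1 for stability."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def group_entropy_per_row(A, rows, col_labels, l):
    """Total block-entropy bits of a row group, divided by its row count."""
    if rows.sum() == 0:
        return 0.0
    total = 0.0
    for j in range(l):
        block = A[np.ix_(rows, col_labels == j)]
        total += block.size * entropy(block.mean())
    return total / rows.sum()

def split_step(A, row_labels, col_labels, l):
    """1. Find the row group R with maximum entropy per row.
    2. Move out each row whose removal lowers R's entropy per row.
    3. Those rows form the new group k (so k becomes k+1)."""
    k = row_labels.max() + 1
    ent = [group_entropy_per_row(A, row_labels == g, col_labels, l)
           for g in range(k)]
    g = int(np.argmax(ent))
    new_labels = row_labels.copy()
    for x in np.where(row_labels == g)[0]:
        keep = (new_labels == g) & (np.arange(len(new_labels)) != x)
        if group_entropy_per_row(A, keep, col_labels, l) < ent[g]:
            new_labels[x] = k
    return new_labels
```

A row whose pattern clashes with its group inflates the group's per-row entropy, so it is exactly the row this step peels off into the new group.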
Experiments: Epinions dataset • 75,888 users • 508,960 "dots", one "dot" per "trust" relationship. k=19 user groups found; small dense "core". Index 102
Comparison with previous methods • Our threshold subsumes the homogeneous model (proof) • We are more accurate than the mean-field-assumption model. Index 103
Comparison with previous methods 10 K Star Graph Index 104
Comparison with previous methods Oregon Graph Index 105
Accuracy of dynamical system 10 K Star Graph Index 106
Accuracy of dynamical system Oregon Graph Index 107
Accuracy of dynamical system 10 K Star Graph Index 108
Accuracy of dynamical system Oregon Graph Index 109
Relationship with full Markov Chain • The full Markov Chain is of the form: Prob(infection at time t) = X_{t−1} + Y_{t−1} − Z_{t−1}, with Z_{t−1} the non-linear component. • The independence assumption leads to a point estimate for Z_{t−1} ⇒ a non-linear dynamical system: still non-linear, but now tractable. Index 110
Experiments: Information survival • INTEL sensor map (54 nodes) • MIT sensor map (40 nodes) • and others… Index 111
Experiments: Information survival INTEL sensor map Index 112
Survival threshold on INTEL Index 113
Survival threshold on INTEL Index 114
Experiments: Information survival MIT sensor map Index 115
Survival threshold on MIT Index 116
Survival threshold on MIT Index 117
Infinite Particle Systems • "Contact Process" ≈ SIS model • Differences: infinite graphs (so only the questions asked are different); very specific topologies (lattices, trees) • Exact thresholds have not been found for these; proving the existence of thresholds is important • Our results match those on the finite line graph [Durrett+ '88] Index 118
Intuition behind the largest eigenvalue n n Approximately size of the largest “blob” Consider the special case of a “caveman” graph Largest eigenvalue = 4 Index 119
Intuition behind the largest eigenvalue • Approximately the size of the largest "blob". Largest eigenvalue = 4.016 Index 120
Graph Patterns • Power Laws: Count vs Out-degree, Count vs In-degree. The "epinions" graph with 75,888 nodes and 508,960 edges. Index 121
Graph Patterns • Power Laws and deviations (DGX/Lognormals [Bi+ '01]): Count vs In-degree. Index 123
Graph Patterns • Power Laws and deviations • Small-world: effective diameter in the hop-plot (# reachable pairs vs hops) • "Community" effect … Index 124
Graph Generator Desiderata • Power Laws and deviations • Small-world • "Community" effect … Other desiderata: • Few parameters • Fast parameter-fitting • Scalable graph generation • Simple extension to undirected, bipartite and weighted graphs. Most current graph generators fail to match some of these. Index 125
The R-MAT generator [SIAM DM'04] • Intuition: the "80-20 law" • Subdivide the 2^n × 2^n adjacency matrix (From/To) and choose one quadrant with probability (a, b, c, d), e.g., a (0.5), b (0.1), c (0.15), d (0.25) Index 126
The R-MAT generator [SIAM DM'04] • Subdivide the adjacency matrix and choose one quadrant with probability (a, b, c, d) • Recurse till we reach a 1×1 cell, where we place an edge, and repeat for all edges. • Intuition: the "80-20 law" Index 127
The R-MAT generator [SIAM DM'04] • Only 3 parameters: a, b and c (d = 1 − a − b − c). • We have a fast parameter-fitting algorithm. • Intuition: the "80-20 law" Index 128
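Putting the R-MAT slides together, a minimal edge sampler looks like this (illustrative, not the NetMine implementation; names are my own):

```python
import numpy as np

def rmat_edge(n, a, b, c, d, rng):
    """Drop one edge into a 2^n x 2^n adjacency matrix by recursively
    choosing a quadrant with probability (a, b, c, d) down to a 1x1 cell."""
    row = col = 0
    for level in range(n):
        r = rng.random()
        half = 1 << (n - 1 - level)   # side length of the current quadrant
        if r < a:                     # top-left: keep row, col
            pass
        elif r < a + b:               # top-right
            col += half
        elif r < a + b + c:           # bottom-left
            row += half
        else:                         # bottom-right
            row += half
            col += half
    return row, col

def rmat(n, num_edges, a=0.5, b=0.1, c=0.15, d=0.25, seed=0):
    """Sample num_edges (row, col) pairs; skewed (a,b,c,d) yields the
    80-20-style skew the slides describe."""
    rng = np.random.default_rng(seed)
    return [rmat_edge(n, a, b, c, d, rng) for _ in range(num_edges)]
```

With the default skewed probabilities, edges pile up in the top-left region of the matrix, giving the power-law degree distributions the experiments match.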
Experiments (Epinions directed graph) Effective Diameter Count vs Indegree Eigenvalue vs Rank Count vs Outdegree “Network value” Hop-plot Count vs Stress ►R-MAT matches directed graphs Index 129
R-MAT communities and Cross-Associations • R-MAT builds communities in graphs, and Cross-Associations finds them. Relationship? • R-MAT builds a hierarchy of communities, while CA finds a flat set of communities • Linkage in the sizes of communities found by CA: when the R-MAT parameters are very skewed, the community sizes for CA are skewed, and vice versa Index 130
R-MAT and tree-based generators • Recursive splitting in R-MAT ≈ following a tree from root to leaf. • Relationship with other tree-based generators [Kleinberg '01, Watts+ '02]? • The R-MAT tree has edges as leaves; the others have nodes • Tree-distance between nodes is used to connect nodes in other generators, but what does tree-distance between edges mean? Index 131
Comparison with relational learning Relational Learning (typical) Graph Mining (typical) 1. Aims to find small structure/patterns at the local level 1. Emphasis on global aspects of large graphs 2. Labeled nodes and edges 2. Unlabeled graphs 3. Semantics of labels are important 3. More focused on topological structure and properties 4. Algorithms are typically costlier 4. Scalability is more important Index 132
===OTHER WORK=== 133
Other Work • Time Series Prediction [CIKM 2002] • We use the fractal dimension of the data • This is related to chaos theory and Lyapunov exponents… 134
Other Work n Logistic Parabola Time Series Prediction [CIKM 2002] 135
Other Work n Lorenz attractor Time Series Prediction [CIKM 2002] 136
Other Work n Laser fluctuations Time Series Prediction [CIKM 2002] 137
Other Work • Adaptive histograms with error guarantees [+ Ashraf Aboulnaga, Yufei Tao, Christos Faloutsos] • Maintain count probabilities for buckets under insertions and deletions • to give statistically correct query result-size estimation • and query feedback • +… [Figure: Prob. vs Salary histogram] 138
Other Work • User-personalization: Patent number 6,611,834 (IBM) • Relevance feedback in multimedia image search: filed for patent (IBM) • Building 3D models using robot camera and rangefinder data [ICML 2001] 139
===EXTRAS=== 140
Conclusions • Two paths in graph mining: • Specific applications: Viral Propagation (resilience testing, information dissemination, rumor spreading), Node Grouping (automatically grouping nodes AND finding the correct number of groups). References: 1. Fully automatic Cross-Associations, by Chakrabarti, Papadimitriou, Modha and Faloutsos, in KDD 2004. 2. AutoPart: Parameter-free graph partitioning and outlier detection, by Chakrabarti, in PKDD 2004. 3. Epidemic spreading in real networks: an eigenvalue viewpoint, by Wang, Chakrabarti, Wang and Faloutsos, in SRDS 2003. 141
Conclusions • Two paths in graph mining: • Specific applications • General issues: Graph Patterns (marks of "realism" in a graph), Generators (R-MAT, a fast, scalable generator matching many of the patterns). References: 1. R-MAT: A recursive model for graph mining, by Chakrabarti, Zhan and Faloutsos, in SIAM Data Mining 2004. 2. NetMine: New mining tools for large graphs, by Chakrabarti, Zhan, Blandford, Faloutsos and Blelloch, in the SIAM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy. 142
Other References • F4: Large Scale Automated Forecasting using Fractals, by D. Chakrabarti and C. Faloutsos, in CIKM 2002. • Using EM to Learn 3D Models of Indoor Environments with Mobile Robots, by Y. Liu, R. Emery, D. Chakrabarti, W. Burgard and S. Thrun, in ICML 2001. • Graph Mining: Laws, Generators and Algorithms, by D. Chakrabarti and C. Faloutsos, under submission to ACM Computing Surveys. 143
References --- graphs 1. R-MAT: A recursive model for graph mining, by D. Chakrabarti, Y. Zhan and C. Faloutsos, in SIAM Data Mining 2004. 2. Epidemic spreading in real networks: an eigenvalue viewpoint, by Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, in SRDS 2003. 3. Fully automatic Cross-Associations, by D. Chakrabarti, S. Papadimitriou, D. Modha and C. Faloutsos, in KDD 2004. 4. AutoPart: Parameter-free graph partitioning and outlier detection, by D. Chakrabarti, in PKDD 2004. 5. NetMine: New mining tools for large graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SIAM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy. 144
Roadmap Specific applications 1 • Node grouping 2 • Viral propagation General issues 3 • Realistic graph generation • Graph patterns and “laws” 4 Other Work 5 Conclusions 145
Experiments (Clickstream bipartite graph): hop-plot (# reachable pairs vs hops), Clickstream vs R-MAT. 148
Graph Generation • Important for: • simulations of new algorithms • compression using a good graph generation model • insight into the graph formation process • Our R-MAT (Recursive MATrix) generator can match many common graph patterns. 149
Recall the definition of eigenvalues: A·x = λ_A·x, where λ_A = eigenvalue of A and λ_{1,A} = largest eigenvalue. β/δ < τ = 1/λ_{1,A} 150
Tools for Large Graph Mining Deepayan Chakrabarti Carnegie Mellon University 151