Graph Processing
COS 518: Advanced Computer Systems, Lecture 12
Mike Freedman
[Content adapted from K. Jamieson and J. Gonzalez]

Graphs are Everywhere
• Collaborative filtering: Netflix (users × movies)
• Social networks
• Probabilistic analysis
• Text analysis: Wiki (docs × words)

Concrete Examples
• Label Propagation
• PageRank

Label Propagation Algorithm
• Social arithmetic: estimate my interests as a weighted mix of what I list on my profile (50%), what Sue Ann likes (40%), and what Carlos likes (10%)
  [Figure: social graph around Me. Profile: 80% Cameras, 20% Biking; Sue Ann: 50% Cameras, 50% Biking; Carlos: 30% Cameras, 70% Biking. Combined estimate: I Like 60% Cameras, 40% Biking]
• Recurrence algorithm: Likes[i] = Σ_{j ∈ Friends[i]} W_ij × Likes[j]
  – iterate until convergence
• Parallelism:
  – Compute all Likes[i] in parallel (a sketch follows below)
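
To make the recurrence concrete, here is a minimal Python sketch of the update Likes[i] = Σ_j W_ij × Likes[j]. The dict-based graph representation, the function name label_propagation, and the fixed iteration count are illustrative assumptions, not from the slides; the example data mirrors the figure.

# Hypothetical sketch of the label-propagation recurrence from this slide.
# W[i][j] is the weight user i places on user j (rows sum to 1; the self-loop
# models "what I list on my profile"). Names and data are illustrative.

def label_propagation(W, likes, iters=20):
    """Iterate Likes[i] = sum_j W[i][j] * Likes[j]."""
    for _ in range(iters):
        new = {}
        for i in W:
            topics = {t for j in W[i] for t in likes[j]}
            # Weighted average of my profile and my friends' current estimates.
            new[i] = {t: sum(w * likes[j].get(t, 0.0) for j, w in W[i].items())
                      for t in topics}
        likes = new  # each Likes[i] above is independent, so parallelizable
    return likes

# Example mirroring the slide: 50% my profile, 40% Sue Ann, 10% Carlos.
W = {"me": {"me": 0.5, "sue": 0.4, "carlos": 0.1},
     "sue": {"sue": 1.0}, "carlos": {"carlos": 1.0}}
likes = {"me": {"cameras": 0.8, "biking": 0.2},
         "sue": {"cameras": 0.5, "biking": 0.5},
         "carlos": {"cameras": 0.3, "biking": 0.7}}
print(label_propagation(W, likes)["me"])  # roughly 63% cameras, 37% biking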

PageRank Algorithm
• The PageRank of u depends on the PageRank of every page linking to u, divided by the number of links on each of those pages
• Recurrence algorithm: PR[u] = Σ_{v ∈ B_u} PR[v] / L[v]
  – iterate until convergence
• Parallelism:
  – Compute all PR[u] in parallel (a sketch follows below)
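
Here is a minimal Python sketch of this recurrence. It follows the slide's formula exactly, so it omits the damping factor that production PageRank adds; the three-page example graph is an illustrative assumption.

# Sketch of PR[u] = sum over v in B_u of PR[v] / L[v], iterated toward a
# fixed point. Real PageRank also mixes in a damping factor; omitted here
# to match the slide's recurrence. The example graph is made up.

def pagerank(out_links, iters=50):
    """out_links maps each page to the list of pages it links to."""
    n = len(out_links)
    pr = {u: 1.0 / n for u in out_links}        # uniform starting ranks
    for _ in range(iters):
        new = dict.fromkeys(out_links, 0.0)
        for v, targets in out_links.items():
            for u in targets:
                new[u] += pr[v] / len(targets)  # contribution PR[v] / L[v]
        pr = new  # every PR[u] is a sum of independent terms, so parallelizable
    return pr

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))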

Properties of Graph-Parallel Algorithms
• Dependency graph: what I like depends on what my friends like
• Factored computation
• Iterative computation

Map-Reduce for Data-Parallel ML
• Excellent for large data-parallel tasks!
• Data-parallel (MapReduce): feature extraction, algorithm tuning, basic data processing
• Graph-parallel (MapReduce?): label propagation, lasso, kernel methods, tensor factorization, deep belief networks, belief propagation, PageRank, neural networks

Problem: Data Dependencies
• MapReduce doesn't efficiently express data dependencies: it assumes independent data rows
  – User must code substantial data transformations
  – Costly data replication (see the sketch below)
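
To see why, consider (hypothetically) writing one PageRank iteration as a single MapReduce round: reducers need both the incoming rank contributions and the adjacency list to emit the next iteration's input, so the mapper must push the entire graph structure through the shuffle on every iteration. The record format and function names below are assumptions, not from the slides.

# One PageRank iteration as a hypothetical MapReduce round. Input records:
# (page u, (PR[u], [pages u links to])). Note the mapper re-emits the
# adjacency list itself: the graph is replicated through the shuffle on
# every single iteration.

def map_fn(u, record):
    rank, targets = record
    yield u, ("graph", targets)                 # costly structural replication
    for v in targets:
        yield v, ("rank", rank / len(targets))  # send PR[u] / L[u] to each link

def reduce_fn(u, values):
    targets, rank = [], 0.0
    for kind, payload in values:
        if kind == "graph":
            targets = payload                   # recover the adjacency list
        else:
            rank += payload                     # sum incoming PR[v] / L[v]
    return u, (rank, targets)                   # next iteration's input record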

Iterative Algorithms
• MapReduce doesn't efficiently express iterative algorithms
  [Figure: across iterations, CPUs 1-3 each process a data partition, then block at a barrier until the slowest processor finishes; a sketch of this pattern follows]
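
The barrier is why stragglers hurt: every iteration lasts as long as its slowest worker. A small Python sketch of the pattern, with made-up worker delays:

# Bulk-synchronous iteration: workers compute, then block at a barrier until
# the slowest one arrives. Worker delays are made-up numbers for illustration.
import threading
import time

NUM_WORKERS, ITERATIONS = 3, 2
barrier = threading.Barrier(NUM_WORKERS)

def worker(delay):
    for _ in range(ITERATIONS):
        time.sleep(delay)   # "compute" phase; the largest delay is the straggler
        barrier.wait()      # everyone waits here for the slowest processor

threads = [threading.Thread(target=worker, args=(0.1 * (w + 1),))
           for w in range(NUM_WORKERS)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"elapsed ~{time.time() - start:.1f}s (about iterations x slowest worker)")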

MapAbuse: Iterative MapReduce
• Only a subset of data needs computation in each iteration, yet MapReduce reprocesses all of it
  [Figure: the same barrier diagram, with every CPU re-reading its full partition even though little of it changed]

MapAbuse: Iterative MapReduce
• System is not optimized for iteration
  [Figure: each iteration launches a fresh job, paying a startup penalty per CPU and a disk penalty to write and re-read all data between iterations; a driver-loop sketch follows]
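
A hypothetical driver loop makes both penalties visible: each iteration submits a brand-new job (startup penalty) and round-trips the full dataset through distributed storage (disk penalty). The jar name, class name, and HDFS paths below are placeholders, not a real pipeline.

# Anti-pattern sketch: iterate by chaining full MapReduce jobs. Every loop
# trip pays job-startup cost and writes/re-reads all state on disk.
# "pagerank.jar", "PageRankIter", and the HDFS paths are hypothetical.
import subprocess

ITERATIONS = 10
src = "hdfs:///graph/iter0"

for i in range(1, ITERATIONS + 1):
    dst = f"hdfs:///graph/iter{i}"
    subprocess.run(          # startup penalty: a fresh job for every pass
        ["hadoop", "jar", "pagerank.jar", "PageRankIter", src, dst],
        check=True)
    src = dst                # disk penalty: the next pass re-reads everything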

ML Tasks Beyond Data-Parallelism
• Data-parallel (MapReduce): feature extraction, cross validation, computing sufficient statistics
• Graph-parallel:
  – Graphical models: Gibbs sampling, belief propagation, variational optimization
  – Collaborative filtering: tensor factorization
  – Semi-supervised learning: label propagation, CoEM
  – Graph analysis: PageRank, triangle counting

This week's lectures
• Graph processing
  – Why the relationships, sampling, and iteration common in graph processing are not well served by MapReduce
  – How to take a graph-centric processing perspective
• Machine learning
  – Graph systems address only one class of ML algorithms
  – What other systems are needed, particularly given ML's heavy focus on iterative algorithms

Today's readings
• Streaming is about unbounded data sets, not particular execution engines
• Streaming is in fact a strict superset of batch; the Lambda Architecture is destined for retirement
• Needs of good streaming systems: correctness and tools for reasoning about time
• Differences between event time and processing time, and the challenges they impose

Today's readings
• What are the major data processing approaches for bounded and unbounded data?
• Challenges/needs for unbounded data include:
  – time-agnostic processing, approximation, windowing by processing time, windowing by event time
• Key mechanisms (in Cloud Dataflow):
  – Watermarks: ideal vs. heuristic
  – Triggers
  – Discarding vs. accumulating + retracting