Graph Algorithms Ch. 5 Lin and Dyer

Graphs
• Are everywhere
• Manifest in the flow of emails, connections on social networks, bus or flight routes
• Social graphs: Twitter friends and followers
• Take a look at Jon Kleinberg's page and book, "Networks, Crowds, and Markets: Reasoning About a Highly Connected World"

Graph algorithms
– Graph search and path planning: shortest path to a node
– Graph clustering: dividing the graph into smaller, related clusters
– Minimum spanning tree: a tree that connects all the nodes at minimum total edge cost
– Bipartite graph matching: divide the graph into two sets and match between them, e.g. job seekers and employers
– Maximum flow: designate a source and a sink, then determine the maximum flow between the two, e.g. transportation
– Identifying special nodes: authoritative nodes; containing the spread of diseases (the Broad Street water pump in London, cholera, and the beginnings of epidemiology)

Graph Representations
[Figure: directed graph on nodes n1, n2, n3, n4, n5]
How do you represent this visual diagram as data?

Simple, Baseline Data Structure
(i) Adjacency matrix

      n1  n2  n3  n4  n5
n1     0   1   0   1   0
n2     0   0   1   0   1
n3     0   0   0   1   0
n4     0   0   0   0   1
n5     1   1   1   0   0

• Good for linear algebra
• But most web-link and social-network graphs are sparse (x/100000), and the space requirement is O(n^2)

(ii) Adjacency lists

n1 [n2, n4]
n2 [n3, n5]
n3 [n4]
n4 [n5]
n5 [n1, n2, n3]
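To make the space trade-off concrete, here is a minimal Python sketch that expands the adjacency lists from this slide into the dense matrix. The names `ADJ_LISTS` and `to_matrix` are illustrative, not from the book.

```python
# Adjacency lists copied from the slide; node names are plain strings.
ADJ_LISTS = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}


def to_matrix(adj_lists):
    """Build a dense 0/1 adjacency matrix (O(n^2) cells) from adjacency lists."""
    nodes = sorted(adj_lists)
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, targets in adj_lists.items():
        for dst in targets:
            matrix[index[src]][index[dst]] = 1
    return nodes, matrix


nodes, matrix = to_matrix(ADJ_LISTS)
edge_count = sum(map(sum, matrix))
for name, row in zip(nodes, matrix):
    print(name, row)
print("matrix cells:", len(nodes) ** 2, "edges stored in the lists:", edge_count)
```

The matrix always stores n^2 cells, while the lists store one entry per edge, which is why the lists are the usual choice for sparse graphs.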

Problem definition: intuition
• Input: a graph as adjacency lists (vertices and edges), the edge distances w, and a starting vertex
• Output (goal): label each node/vertex with its shortest distance from the starting node

Single-source shortest path problem
• Sequential solution: Dijkstra's algorithm (5.2)

Dijkstra(G, w, s)   // G graph, w edge distances, s starting node
    d[s] ← 0
    for all other vertices v: d[v] ← ∞
    Q ← {V}   // Q is a priority queue keyed on distance
    while Q ≠ ∅
        u ← min(Q)   // node with the minimum d value
        for all vertices v in u.adjacencyList
            if d[v] > d[u] + w[u, v]
                d[v] ← d[u] + w[u, v]   // shorter path to v found through u
        mark u and remove it from Q

At each iteration of the while loop, the algorithm expands the node with the shortest distance and updates the distances of all nodes reachable from it.
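The pseudocode above can be sketched in Python with the standard `heapq` module as the priority queue. This is a minimal sketch, not the book's code: the four-node graph is a hypothetical example (the edges of the slide's figure are not recoverable from the text), and stale queue entries are skipped instead of being removed in place.

```python
import heapq


def dijkstra(graph, source):
    """graph maps node -> list of (neighbor, weight); weights must be non-negative."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    done = set()
    while heap:
        d_u, u = heapq.heappop(heap)      # node with the minimum tentative distance
        if u in done:
            continue                      # stale entry left over from an earlier update
        done.add(u)
        for v, w in graph[u]:
            if dist[v] > d_u + w:         # relaxation step from the pseudocode
                dist[v] = d_u + w
                heapq.heappush(heap, (dist[v], v))
    return dist


# Hypothetical toy graph, used only for illustration.
toy_graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 7)],
    "d": [],
}
dist = dijkstra(toy_graph, "a")
print(dist)
```

Note that the `dist` table and the heap are exactly the global state that the next slide identifies as the obstacle to a direct MapReduce port.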

Sample graph: let's apply algorithm 5.2
[Figure: weighted directed graph on nodes n1 to n5; n1 is the source with distance 0; edge weights range from 1 to 10]

Issues
• The algorithm is sequential
• It needs to keep global state (the distance table and the priority queue), which is not possible with MR
• Let's see how we can handle this graph problem for parallel processing with MR

Parallel Breadth First
• Assume a distance of 1 for all edges (a simplifying assumption); later we will extend it to other distances

Issues in processing a graph in MR
• Goal: start from a given node and label all the nodes in the graph so that we can determine the shortest distance
• Representation of the graph (and, of course, generation of a synthetic graph)
• Determining the <key, value> pair
• Iterating through various stages of processing and intermediate data
• When to terminate the execution

Input data format for MR
• Node: nodeId, distanceLabel, adjacency list {nodeId, distance}
• This is one split
• Input is text; parse it to determine the <key, value> pair
• From mapper to reducer there are two types of <key, value> pairs:
  – <nodeid n, Node N>
  – <nodeid n, distance-until-now label>
• Need to keep the termination condition in the Node class
• Terminate the MR iterations when none of the labels change (the graph has reached a steady state), when all the nodes have been labeled with their minimum distance, or on other conditions tracked with counters
• Now let's look at the algorithm given in the book
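As an illustration of the parsing step, the following sketch turns one text record into a <key, value> pair. It assumes the layout used in the sample data later in the deck, "id distance adjacency", with neighbor ids separated by colons; the `Node` class and `parse_node` function are hypothetical names.

```python
from dataclasses import dataclass, field

INFINITY = 10000  # large stand-in for "not reached yet", as in the sample data


@dataclass
class Node:
    node_id: int
    distance: int
    adjacency: list = field(default_factory=list)


def parse_node(line):
    """Parse a record such as '3 10000 2:4:5' into (node id, Node)."""
    parts = line.split()
    node_id, distance = int(parts[0]), int(parts[1])
    adjacency = []
    if len(parts) > 2:
        # A trailing colon, as in '2:3:', yields an empty token that is skipped.
        adjacency = [int(token) for token in parts[2].split(":") if token]
    return node_id, Node(node_id, distance, adjacency)


key, node = parse_node("3 10000 2:4:5")
print(key, node)
```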

Mapper
Class Mapper
    method map(nid n, Node N)
        d ← N.distance
        emit(nid n, N)   // type 1: pass the node structure along
        for all m in N.adjacencyList
            emit(nid m, d + 1)   // type 2: tentative distance to neighbor m

Reducer
Class Reducer
    method reduce(nid m, [d1, d2, d3, ...])
        dmin ← ∞   // or a large number
        Node M ← null
        for all d in [d1, d2, ...]
            if IsNode(d) then M ← d   // type 1 value: recover the node structure
            else if d < dmin then dmin ← d   // type 2 value: track the smallest distance
        if dmin < M.distance then M.distance ← dmin   // update only when a shorter distance arrived
        emit(nid m, Node M)
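The mapper and reducer can be simulated in plain Python to watch the labels settle. This is a sketch under stated assumptions, not Hadoop code: a dictionary stands in for the shuffle phase, nodes are plain dicts, the graph is the sample data from the trace slides, and the reducer keeps a node's current label when no shorter distance arrives (without that guard the source could lose its distance of 0). All function names are illustrative.

```python
from collections import defaultdict

INF = 10000  # large stand-in for infinity, matching the sample data


def bfs_map(nid, node):
    yield nid, node                          # type 1: the node structure itself
    for m in node["adj"]:
        yield m, node["distance"] + 1        # type 2: tentative distance to m


def bfs_reduce(nid, values):
    node, dmin = None, INF
    for v in values:
        if isinstance(v, dict):
            node = v
        elif v < dmin:
            dmin = v
    # Keep the current label unless a shorter distance arrived.
    return nid, {"distance": min(node["distance"], dmin), "adj": node["adj"]}


def run_iteration(graph):
    grouped = defaultdict(list)              # stands in for shuffle and sort
    for nid, node in graph.items():
        for key, value in bfs_map(nid, node):
            grouped[key].append(value)
    return dict(bfs_reduce(nid, vals) for nid, vals in grouped.items())


graph = {
    1: {"distance": 0, "adj": [2, 3]},
    2: {"distance": INF, "adj": [3, 4]},
    3: {"distance": INF, "adj": [2, 4, 5]},
    4: {"distance": INF, "adj": [5]},
    5: {"distance": INF, "adj": [1, 4]},
}

iterations = 0
while True:
    updated = run_iteration(graph)
    iterations += 1
    changed = any(updated[n]["distance"] != graph[n]["distance"] for n in graph)
    graph = updated
    if not changed:                          # termination: no label changed
        break

distances = {n: graph[n]["distance"] for n in sorted(graph)}
print(distances, "after", iterations, "iterations")
```

In a real job the "no label changed" test would typically be reported through a counter that the driver reads between iterations; here a direct comparison of the two graphs plays that role.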

Trace with sample data
(node id, distance label, adjacency list; 10000 is the large number standing in for ∞)

1   0       2:3:
2   10000   3:4:
3   10000   2:4:5
4   10000   5:
5   10000   1:4

Intermediate data (after iteration 1)

1   0       2:3:
2   1       3:4:
3   1       2:4:5
4   10000   5:
5   10000   1:4

Intermediate data (after iteration 2)

1   0   2:3:
2   1   3:4:
3   1   2:4:5
4   2   5:
5   2   1:4

Final data

1   0   2:3:
2   1   3:4:
3   1   2:4:5
4   2   5:
5   2   1:4

Sample data: the three stages side by side

Node   Adjacency   Initial   After iteration 1   After iteration 2
1      2:3:        0         0                   0
2      3:4:        10000     1                   1
3      2:4:5       10000     1                   1
4      5:          10000     10000               2
5      1:4         10000     10000               2

PageRank
• The "25 Billion Dollar" algorithm (a huge matrix and eigenvector problem)
• Larry Page and Sergey Brin (Stanford Ph.D. students)
• Rajeev Motwani and Terry Winograd (Stanford professors)

Consider this web problem
• Nodes 1, 2, 3, 4 with ranks x1, x2, x3, x4
[Figure: four web pages, nodes 1 to 4, connected by links]
• Problem: how do we calculate the ranks, or "influence", of these linked web nodes?
• Solution: treat it as a linear algebra problem. Write the linear equations and solve the equation system.
• Let's do just that for this network

Linear Algebra problem

\begin{aligned}
x_1 &= \tfrac12 x_2 + \tfrac12 x_3 = 0\,x_1 + \tfrac12 x_2 + \tfrac12 x_3 + 0\,x_4 \\
x_2 &= \tfrac12 x_1 + 0\,x_2 + 0\,x_3 + \tfrac12 x_4 \\
x_3 &= \tfrac12 x_1 + 0\,x_2 + 0\,x_3 + \tfrac12 x_4 \\
x_4 &= 0\,x_1 + \tfrac12 x_2 + \tfrac12 x_3 + 0\,x_4
\end{aligned}

• The web link problem develops into the problem of finding an eigenvector of a square matrix.
• We seek the eigenvector x with eigenvalue 1 of the link matrix A, that is, Ax = 1·x = x. Let's do that.

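Solve the square link matrix for the eigenvector

A x = x, \qquad
A = \begin{bmatrix}
0 & \tfrac12 & \tfrac12 & 0 \\
\tfrac12 & 0 & 0 & \tfrac12 \\
\tfrac12 & 0 & 0 & \tfrac12 \\
0 & \tfrac12 & \tfrac12 & 0
\end{bmatrix}, \qquad
x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}

• Solve Ax = x (eigenvalue 1) for x1 = ?, x2 = ?, x3 = ?, x4 = ?
• Transpose the matrix, etc.
• Now scale the problem to billions of nodes?!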
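For this small system the answer can be found by hand: the equations for x2 and x3 have the same right-hand side, so x2 = x3; likewise x1 = x4; substituting into the first equation gives x1 = x2. All four ranks are therefore equal, and normalizing them to sum to 1 gives 1/4 each. The short sketch below checks that result with exact fractions; the matrix `A` is read off the four equations, and `matvec` is a hypothetical helper name.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Row i holds the coefficients of the equation for x_i.
A = [
    [0, half, half, 0],
    [half, 0, 0, half],
    [half, 0, 0, half],
    [0, half, half, 0],
]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


x = [Fraction(1, 4)] * 4          # candidate eigenvector, normalized to sum to 1
is_fixed_point = matvec(A, x) == x
print(is_fixed_point)
```

Solving by hand works for four pages; at web scale the same fixed point has to be approached iteratively, which is what the MapReduce formulation does.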

General idea
• Consider the World Wide Web with all its links.
• Now imagine a random web surfer who visits a page and clicks a link on that page,
• and repeats this forever.
• PageRank is a measure of how frequently a page will be encountered.
• In other words, it is a probability distribution over the nodes in the graph, representing the likelihood that a random walk over the link structure will arrive at a particular node.

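PageRank Formula
• $P(n) = \alpha \frac{1}{|G|} + (1 - \alpha) \sum_{m \in L(n)} \frac{P(m)}{C(m)}$, where |G| is the number of nodes in the graph, α is the random-jump factor, L(n) is the set of pages that link to n, and C(m) is the number of outgoing links of page m (the form given in Lin and Dyer, Ch. 5).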

Example
• Figure 5.7
• Let's assume alpha is zero
• Let's look at the MR

Mapper for PageRank ("divider")
Class Mapper
    method map(nid n, Node N)
        p ← N.pagerank / |N.adjacencyList|
        emit(nid n, N)   // pass the node structure along
        for all m in N.adjacencyList
            emit(nid m, p)   // equal share of n's PageRank for each neighbor

Reducer for PageRank ("aggregator")
Class Reducer
    method reduce(nid m, [p1, p2, p3, ...])
        Node M ← null
        s ← 0
        for all p in [p1, p2, ...]
            if p is a Node then M ← p   // recover the node structure
            else s ← s + p   // sum the incoming PageRank shares
        M.pagerank ← s
        emit(nid m, Node M)
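One iteration of the divider and aggregator can be simulated in plain Python. This is a sketch under stated assumptions: alpha is zero, as in the example slide; the graph borrows the five-node adjacency lists from the Graph Representations slide, since Figure 5.7 is not reproduced in the text; every node has at least one outgoing link, so no dangling-node handling is needed; a dictionary stands in for the shuffle phase. All function names are illustrative.

```python
from collections import defaultdict


def pr_map(nid, node):
    yield nid, node                                  # pass the node structure along
    share = node["rank"] / len(node["adj"])          # the "divider" step
    for m in node["adj"]:
        yield m, share


def pr_reduce(nid, values):
    node, total = None, 0.0
    for v in values:
        if isinstance(v, dict):
            node = v
        else:
            total += v                               # the "aggregator" step
    return nid, {"rank": total, "adj": node["adj"]}


def pagerank_iteration(graph):
    grouped = defaultdict(list)                      # stands in for shuffle and sort
    for nid, node in graph.items():
        for key, value in pr_map(nid, node):
            grouped[key].append(value)
    return dict(pr_reduce(nid, vals) for nid, vals in grouped.items())


adj = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}
# Start from the uniform distribution: every node holds 1/5 of the mass.
graph = {n: {"rank": 1 / len(adj), "adj": targets} for n, targets in adj.items()}
graph = pagerank_iteration(graph)
ranks = {n: graph[n]["rank"] for n in sorted(graph)}
print({n: round(r, 3) for n, r in ranks.items()})
```

Because no node is dangling here, the total mass stays at 1 after the iteration; n4 and n5 each end up with about 0.3, since each collects a full 0.2 from a node with a single outgoing link plus a 0.1 share.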

Let's trace with sample data
[Figure: sample graph with nodes 1, 2, 3, 4]

Discussion
• How to account for dangling nodes: a dangling node has incoming links but no outgoing links
  – Simply redistribute its PageRank to all nodes
  – One iteration then requires the PageRank computation plus a redistribution of the "unused" PageRank
• PageRank is iterated until convergence: when is convergence reached?
• A probability distribution over a very large network means the individual PageRank values can underflow; use log-based computation
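The two practical points above can be sketched as small helpers. This is one common way to handle them, offered as an assumption and not as the deck's own code: `redistribute` spreads the mass lost at dangling nodes evenly over all nodes (with an optional random-jump factor `alpha`), and `log_add` adds two probabilities that are stored as logarithms. Both function names are hypothetical.

```python
import math


def redistribute(ranks, alpha=0.0):
    """Second pass: spread the missing PageRank mass evenly over all nodes.

    ranks maps node -> PageRank after one map/reduce pass; whatever is
    missing from a total of 1 was lost at dangling nodes.
    """
    n = len(ranks)
    missing = 1.0 - sum(ranks.values())
    return {node: alpha / n + (1 - alpha) * (missing / n + p)
            for node, p in ranks.items()}


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without leaving the log domain."""
    if a < b:
        a, b = b, a                     # make a the larger value
    return a + math.log1p(math.exp(b - a))


fixed = redistribute({"a": 0.5, "b": 0.2})
print(fixed)
print(log_add(math.log(0.25), math.log(0.5)), math.log(0.75))
```

Working with logarithms turns the reducer's sum into repeated `log_add` calls, so values far too small for ordinary floating point, such as exp(-1000), can still be combined.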