
SGD ON HADOOP FOR BIG DATA & HUGE MODELS
Alex Beutel
Based on work done with Abhimanu Kumar, Vagelis Papalexakis, Partha Talukdar, Qirong Ho, Christos Faloutsos, and Eric Xing

Outline
1. When to use SGD for distributed learning
2. Optimization
   • Review of DSGD
   • SGD for Tensors
   • SGD for ML models: topic modeling, dictionary learning, MMSB
3. Hadoop
   1. General algorithm
   2. Setting up the MapReduce body
   3. Reducer communication
   4. Distributed normalization
   5. “Always-On SGD”: how to deal with the straggler problem
4. Experiments
5. Looking Forward: Graphical Models and the Cloud

When distributed SGD is useful
Scale: 1 billion users on Facebook; 300 million photos uploaded to Facebook per day; 400 million tweets per day
• Collaborative Filtering: predict movie preferences
• Dictionary Learning: remove noise or missing pixels from images
• Tensor Decomposition: find communities in temporal graphs
• Topic Modeling: what are the topics of webpages, tweets, or status updates?

Gradient Descent

Stochastic Gradient Descent (SGD)

DSGD for Matrices (Gemulla, 2011) X ≈ UV, where X is Users × Movies, U is Users × Genres, and V is Genres × Movies

DSGD for Matrices (Gemulla, 2011) X ≈ UV: blocks of X that share no rows or columns are independent!

DSGD for Matrices (Gemulla, 2011) Independent Blocks

DSGD for Matrices (Gemulla, 2011) Partition your data & model into d × d blocks Results in d=3 strata Process strata sequentially, process blocks in each stratum in parallel
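The blocking on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the helper names `block_of` and `matrix_strata` are hypothetical, and the diagonal-shift layout is just one common way to pick d blocks that share no row block or column block.

```python
def block_of(i, j, n_rows, n_cols, d):
    """Map entry (i, j) of an n_rows x n_cols matrix to its (row_block, col_block)."""
    return (i * d // n_rows, j * d // n_cols)


def matrix_strata(d):
    """Return d strata of d blocks each.

    Within one stratum every block has a different row block and a different
    column block, so the blocks touch disjoint rows of U and columns of V.
    """
    return [[(b, (b + s) % d) for b in range(d)] for s in range(d)]


strata = matrix_strata(3)
for s, blocks in enumerate(strata, start=1):
    print(f"stratum {s}: {blocks}")
```

Processing the strata one after another visits every one of the d × d blocks exactly once per pass over the data.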

TENSORS

What is a tensor? • Tensors are used for structured data with more than 2 dimensions • Think of one as a 3D matrix. For example, "Derek Jeter plays baseball" gives Subject (Derek Jeter), Verb (plays), Object (baseball)

Tensor Decomposition X ≈ factors U, V, W

Tensor Decomposition X ≈ factors U, V, W [Figure: pairs of blocks of X marked "Independent" and "Not Independent"]

Tensor Decomposition

For d = 3 blocks per stratum, we require d² = 9 strata
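The count of d² strata can be checked by enumerating them. The sketch below is an assumption about how the strata might be laid out (shifting the second and third mode independently); the names `tensor_strata` and `independent` are hypothetical.

```python
from itertools import combinations


def tensor_strata(d):
    """Return d*d strata, each holding d blocks (u_block, v_block, w_block)."""
    return [[(b, (b + s1) % d, (b + s2) % d) for b in range(d)]
            for s1 in range(d) for s2 in range(d)]


def independent(block_a, block_b):
    """Two blocks are independent if they differ in every mode."""
    return all(x != y for x, y in zip(block_a, block_b))


strata = tensor_strata(3)
all_independent = all(independent(a, b)
                      for blocks in strata
                      for a, b in combinations(blocks, 2))
print(len(strata), "strata of", len(strata[0]), "blocks; independent:", all_independent)
```

Each of the d³ blocks appears in exactly one stratum, which is why d blocks per stratum forces d² strata.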

Coupled Matrix + Tensor Decomposition [Figure: tensor X with modes Subject, Verb, Object, coupled with matrix Y, which adds a Document mode]

Coupled Matrix + Tensor Decomposition X, Y ≈ factors U, V, W, A

Coupled Matrix + Tensor Decomposition

CONSTRAINTS & PROJECTIONS

Example: Topic Modeling Words Topics Documents

Constraints

How to enforce? Projections • Example: Non-negative

More projections • Sparsity (soft thresholding): • Simplex • Unit ball
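The formulas for these projections did not survive on the slides, so the sketch below uses the standard textbook definitions rather than the talk's own. It assumes "unit ball" means the L2 ball, and the simplex projection uses the common sort-based method; the threshold `lam` and all function names are hypothetical.

```python
import math


def project_nonnegative(x):
    """Clip negative entries to zero."""
    return [max(0.0, v) for v in x]


def soft_threshold(x, lam):
    """Shrink every entry toward zero by lam (encourages sparsity)."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in x]


def project_unit_ball(x):
    """Rescale x onto the L2 unit ball if it lies outside."""
    norm = math.sqrt(sum(v * v for v in x))
    return list(x) if norm <= 1.0 else [v / norm for v in x]


def project_simplex(x):
    """Euclidean projection onto {p : p >= 0, sum(p) = 1}, sort-based."""
    theta, cumsum = 0.0, 0.0
    for i, u in enumerate(sorted(x, reverse=True), start=1):
        cumsum += u
        t = (cumsum - 1.0) / i
        if u - t > 0:          # holds for a prefix of the sorted values
            theta = t
    return [max(v - theta, 0.0) for v in x]


print(project_nonnegative([-1.0, 2.0]))
print(soft_threshold([3.0, -0.5, -2.0], 1.0))
print(project_unit_ball([3.0, 4.0]))
print(project_simplex([2.0, 0.0]))
```

In projected SGD, one of these functions would be applied to the affected parameters after each gradient step.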

Dictionary Learning • Learn a dictionary of concepts and a sparse reconstruction • Useful for fixing noise and missing pixels of images • Constraints: sparse encoding; dictionary within unit ball

Mixed Membership Network Decomp. • Used for modeling communities in graphs (e.g. a social network) • Constraints: simplex, non-negative

IMPLEMENTING ON HADOOP

High level algorithm (Stratum 1, Stratum 2, Stratum 3, …)
for Epoch e = 1 … T do
  for Subepoch s = 1 … d² do
    Take the set of blocks in stratum s
    for block b = 1 … d in parallel do
      Run SGD on all points in block b
    end
  end
end
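The loop structure above can be sketched on a single machine, with threads standing in for the parallel workers. This is only an illustration: `run_training` and the `run_sgd_on_block` callback are hypothetical names, and the stratum layout (shifting the second and third block index) is an assumption.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_training(T, d, run_sgd_on_block):
    """T epochs, each made of d*d sequential subepochs of d parallel blocks."""
    for _epoch in range(T):
        for s in range(d * d):                       # subepochs run one after another
            s1, s2 = divmod(s, d)
            stratum = [(b, (b + s1) % d, (b + s2) % d) for b in range(d)]
            # Blocks in one stratum differ in every mode, so they can run in parallel.
            with ThreadPoolExecutor(max_workers=d) as pool:
                list(pool.map(run_sgd_on_block, stratum))
            # Leaving the "with" block acts as the barrier between subepochs.


visited = []                                         # stand-in for real SGD work
run_training(T=2, d=3, run_sgd_on_block=visited.append)
counts = Counter(visited)
print(len(counts), "distinct blocks, each visited", set(counts.values()), "times")
```

A real `run_sgd_on_block` would update only the parameter slices that belong to its block.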

Bad Hadoop Algorithm: Subepoch 1. Mappers → Reducers; each reducer runs SGD on its block and updates its parameters: (U2, V1, W3), (U3, V2, W1), (U1, V3, W2)

Bad Hadoop Algorithm: Subepoch 2. Mappers → Reducers; each reducer runs SGD on its block and updates its parameters: (U2, V1, W2), (U3, V2, W3), (U1, V3, W1)

Hadoop Challenges • MapReduce is typically very bad for iterative algorithms • T × d² iterations • Sizable overhead per Hadoop job • Little flexibility

High Level Algorithm [Figure: tensor partitioned into blocks along U1 to U3, V1 to V3, W1 to W3; blocks in this stratum: (U1, V1, W1), (U2, V2, W2), (U3, V3, W3)]

High Level Algorithm [Figure: same partitioned tensor; blocks in this stratum: (U1, V1, W3), (U2, V2, W1), (U3, V3, W2)]

High Level Algorithm [Figure: same partitioned tensor; blocks in this stratum: (U1, V1, W2), (U2, V2, W3), (U3, V3, W1)]

Hadoop Algorithm
Mappers: process points, mapping each point to its block with the necessary info to order
Partition & Sort: use Partitioner, KeyComparator, GroupingComparator
Reducers

Hadoop Algorithm
Mappers: process points, mapping each point to its block with the necessary info to order
Partition & Sort
Reducers

Hadoop Algorithm
Mappers: process points, mapping each point to its block with the necessary info to order
Partition & Sort
Reducers: each runs SGD on its block, then updates (U1, V1, W1), (U2, V2, W2), (U3, V3, W3)

Hadoop Algorithm
Mappers: process points, mapping each point to its block with the necessary info to order
Partition & Sort
Reducers: each runs SGD on its block, then updates (U1, V1, W1), (U2, V2, W2), (U3, V3, W3)
HDFS sits between the reducers

Hadoop Summary 1. Use mappers to send data points to the correct reducers in order 2. Use reducers as machines in a normal cluster 3. Use HDFS as the communication channel between reducers
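Point 1 can be illustrated with a small single-process simulation of the map, partition, and sort steps. This is a sketch under stated assumptions, not the talk's Hadoop code: it assumes reducer r owns U block r, that the key is a (reducer, subepoch) pair, and that partitioning looks only at the reducer id while sorting orders by subepoch. `map_point` and `shuffle` are hypothetical names.

```python
from collections import defaultdict


def map_point(point, dims, d):
    """Emit ((reducer, subepoch), point) for a tensor entry (i, j, k, value)."""
    i, j, k, _value = point
    bi, bj, bk = i * d // dims[0], j * d // dims[1], k * d // dims[2]
    reducer = bi                                      # assumption: reducer r owns U block r
    subepoch = ((bj - bi) % d) * d + (bk - bi) % d    # stratum that contains this block
    return (reducer, subepoch), point


def shuffle(points, dims, d):
    """Group records by reducer and order each reducer's input by subepoch."""
    per_reducer = defaultdict(list)
    for point in points:
        (reducer, subepoch), value = map_point(point, dims, d)
        per_reducer[reducer].append((subepoch, value))   # partition on reducer id only
    for records in per_reducer.values():
        records.sort(key=lambda rec: rec[0])             # sort on subepoch
    return dict(per_reducer)


dims, d = (3, 3, 3), 3
points = [(i, j, k, 1.0) for i in range(3) for j in range(3) for k in range(3)]
inputs = shuffle(points, dims, d)
for reducer, records in sorted(inputs.items()):
    print(reducer, [subepoch for subepoch, _ in records])
```

Because every reducer receives its points already ordered by subepoch, it can stream through its input once and treat each change of subepoch as the moment to exchange parameters.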

Distributed Normalization [Figure: Words, Topics, and Documents, with parameter blocks π1, β1, π2, β2, π3, β3]

Distributed Normalization
• σ(b) is a k-dimensional vector, summing the terms of β_b
• Transfer σ(b) to all machines
• Each machine calculates σ from σ(1), σ(2), σ(3)
• Normalize β_b using σ
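The normalization steps can be sketched as follows. The exact formulas are missing from the slide, so this assumes that each β block is a list of rows with k topic columns, that σ is the element-wise sum of the per-machine vectors σ(b), and that normalizing means dividing each column by its global sum. The function names are hypothetical.

```python
def local_sums(beta_block):
    """sigma(b): k-dimensional vector of column sums of one machine's block."""
    k = len(beta_block[0])
    return [sum(row[t] for row in beta_block) for t in range(k)]


def normalize_blocks(beta_blocks):
    """Normalize every block so each topic column sums to 1 across all machines."""
    sigmas = [local_sums(block) for block in beta_blocks]   # computed on each machine
    sigma = [sum(parts) for parts in zip(*sigmas)]          # after exchanging sigma(b)
    return [[[v / s for v, s in zip(row, sigma)] for row in block]
            for block in beta_blocks]


beta_blocks = [
    [[1.0, 2.0], [3.0, 2.0]],   # block held by machine 1
    [[4.0, 4.0]],               # block held by machine 2
]
normalized = normalize_blocks(beta_blocks)
print(normalized)
```

Only the k-dimensional vectors σ(b) cross the network; the blocks themselves never leave their machines.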

Barriers & Stragglers
Mappers: process points, mapping each point to its block with the necessary info to order
Partition & Sort
Reducers: each runs SGD on its block, then updates (U1, V1, W1), (U2, V2, W2), (U3, V3, W3), with HDFS between the reducers
Reducers that finish early are wasting time waiting!

Solution: “Always-On SGD”
For each reducer:
1. Run SGD on all points in current block Z
2. Check if other reducers are ready to sync
3. If not ready to sync: instead of waiting, shuffle points in Z and decrease step size, run SGD on points in Z again, then check again
4. If ready to sync: sync parameters and get new block Z
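The per-reducer loop can be sketched as below. This is a minimal illustration rather than the talk's implementation: `sgd_pass` and `others_ready` are hypothetical callbacks, and the multiplicative `decay` is an assumption, since the slide only says the step size decreases.

```python
import random


def always_on_sgd(points, params, sgd_pass, others_ready, step, decay=0.9, seed=0):
    """One pass over block Z, then extra passes until the other reducers can sync."""
    rng = random.Random(seed)
    points = list(points)
    sgd_pass(points, params, step)          # first SGD pass over block Z
    extra_passes = 0
    while not others_ready():               # stragglers still working: keep going
        rng.shuffle(points)                 # revisit old points in a new order
        step *= decay                       # assumption: geometric step-size decay
        sgd_pass(points, params, step)
        extra_passes += 1
    return params, extra_passes             # caller now syncs and fetches a new block


def mean_sgd_pass(points, params, step):
    """Toy model: SGD on 0.5 * (w - x)^2, which pulls w toward the data mean."""
    for x in points:
        params["w"] += step * (x - params["w"])


polls = iter([False, False, True])          # other reducers become ready on the 3rd check
params, extra = always_on_sgd([1.0, 2.0, 3.0], {"w": 0.0}, mean_sgd_pass,
                              lambda: next(polls), step=0.5)
print(params, extra)
```

In a Hadoop setting, `others_ready` would presumably check HDFS for the other reducers' parameter files; that detail is not shown here.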

“Always-On SGD”
Mappers: process points, mapping each point to its block with the necessary info to order
Partition & Sort
Reducers: each runs SGD on its block, then updates (U1, V1, W1), (U2, V2, W2), (U3, V3, W3), with HDFS between the reducers
Reducers that finish early run SGD on old points again!

“Always-On SGD” [Figure: timeline for Reducers 1 to 4. Legend: read parameters from HDFS; first SGD pass of block Z; extra SGD updates; write parameters to HDFS]

EXPERIMENTS

FlexiFaCT (Tensor Decomposition) Convergence

FlexiFaCT (Tensor Decomposition) Scalability in Data Size

FlexiFaCT (Tensor Decomposition) Scalability in Tensor Dimension: handles up to 2 billion parameters!

FlexiFaCT (Tensor Decomposition) Scalability in Rank of Decomposition: handles up to 4 billion parameters!

FlexiFaCT (Tensor Decomposition) Scalability in Number of Machines

Fugue (Using “Always-On SGD”) Dictionary Learning: Convergence

Fugue (Using “Always-On SGD”) Community Detection: Convergence

Fugue (Using “Always-On SGD”) Topic Modeling: Convergence

Fugue (Using “Always-On SGD”) Topic Modeling: Scalability in Data Size

Fugue (Using “Always-On SGD”) Topic Modeling: Scalability in Rank

Fugue (Using “Always-On SGD”) Topic Modeling: Scalability over Machines

Fugue (Using “Always-On SGD”) Topic Modeling: Number of Machines

Fugue (Using “Always-On SGD”)

Key Points • Flexible method for tensors & ML models • Can run on stock Hadoop by using HDFS for communication • When waiting for slower machines, run updates on old data again

FURTHER SYSTEM IMPROVEMENTS Joint work with Markus Weimer, Tom Minka, Yordan Zaykov and Vijay Narayanan at Microsoft

Task Pipelining [Figure: timeline of Load Block, Run Model on Block, and Save Block tasks for successive blocks]

Block scheduling

Block scheduling No Free Block!

Block scheduling

Asynchronous scheduling 1 column bigger 1 row bigger [Zhuang, 2013]

Asynchronous scheduling

Asynchronous high-latency scheduling

Asynchronous high-latency scheduling But t parallel tasks with queue length q require (qt+1)² blocks

Asynchronous high-latency scheduling

Asynchronous high-latency scheduling Squaring the number of blocks squares the queue length
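A quick worked example of the block-count arithmetic, assuming the (qt+1)² formula from the earlier slide. The function names are hypothetical, and `queue_length` simply inverts that formula.

```python
from math import isqrt


def blocks_needed(q, t):
    """Blocks required for t parallel tasks, each with a queue of length q."""
    return (q * t + 1) ** 2


def queue_length(blocks, t):
    """Longest queue that a given number of blocks supports for t tasks."""
    return (isqrt(blocks) - 1) // t


t, q = 4, 2
blocks = blocks_needed(q, t)
print(blocks, queue_length(blocks, t), queue_length(blocks ** 2, t))
```

With t = 4 and q = 2, squaring the block count raises the supported queue length to q²t + 2q, so it grows roughly quadratically.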

Asynchronous high-latency scheduling

Pipelining Without With

Limited impact of failures

BAYESIAN GRAPHICAL MODELS Joint work with Markus Weimer, Tom Minka, Yordan Zaykov and Vijay Narayanan

Matrix Factorization X ≈ UV, where X is Users × Movies, U is Users × Genres, and V is Genres × Movies

Bayesian MF [Figure: prior αU over the factors; observed ratings 5, 4, 1, 5, 4]

Graphical Models

Belief Propagation [Figure: graphical model with αU, αV, a dot-product factor, and observed rating 5, annotated "Updates!"]

Parallel Inference [Figure: dot-product factors for observed ratings 5 and 4]

Partition data & parameters (Gemulla, 2011) Partition your data & model into d × d blocks Results in d=3 strata Process strata sequentially, process blocks in each stratum in parallel

Parallel Inference [Figure: αU and αV with dot-product factors for observed ratings 5 and 4]

Questions? Alex Beutel abeutel@cs.cmu.edu http://alexbeutel.com