How to Escape Saddle Points Efficiently?
Rong Ge, Duke University
IPAM, Optimization and Optimal Control for Complex Energy and Property Landscapes
Based on joint works with Chi Jin, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan

“Simple” Objectives
• Many interesting problems have simple objectives.
• Simple algorithms (such as gradient descent) can have new, stronger guarantees.

Outline
Machine learning, non-convex optimization, and saddle points
• Why are saddle points ubiquitous in machine learning?
• Why is it enough to handle saddle points?
How to escape saddle points efficiently

Why non-convex?
• Many machine learning problems are non-convex:
  • Find the best clustering
  • Learn the best neural network
  • Find communities in social networks
• In many cases, we don’t have other scalable algorithms.

How to Optimize Non-convex Problems?
• In theory: NP-hard in the worst case! Better to avoid it.
• In (machine learning) practice: stochastic gradient descent (SGD) + tuning. Why does this work?
• Hope: tractable for real-life instances? What properties can we use?

Convex Optimization: Geometry → Algorithm
• Gradient Descent (stochastic, accelerated, …)
• Newton’s Algorithm (trust region, cubic regularization, …)
• …
• Can we find a clean geometric property for non-convex functions?

Symmetry → Saddle Points
• Problem asks for multiple components, but the components have no ordering.
• Solution = k centers (e.g., clustering).

Symmetry → Saddle Points
• Problem asks for multiple components, but the components have no ordering.
• [Figure: an optimal solution (a) with components x1, x2, x3; an equivalent solution (b) with the same components permuted; their convex combination (a+b)/2.]

“Strict Saddle” Functions [G Huang Jin Yuan’ 15]
• It can be easy to find a local minimum even with saddle points present (and sometimes all local minima are permutations of the global minimum).
• Requirement: every point is in one of three cases:
  • Has a large gradient
  • Near a saddle point: the Hessian has a negative eigenvalue
  • Near a local minimum (within a strongly-convex ball)
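The three cases above are usually stated with explicit constants; a hedged restatement of the strict-saddle definition along the lines of [G Huang Jin Yuan’ 15] (the constant names α, γ, ε, δ here are illustrative):

```latex
% f is (\alpha, \gamma, \epsilon, \delta)-strict saddle if every point x satisfies at least one of:
\begin{align*}
&\text{(1) large gradient:} && \|\nabla f(x)\| \ge \epsilon, \\
&\text{(2) strict saddle:}  && \lambda_{\min}\big(\nabla^2 f(x)\big) \le -\gamma, \\
&\text{(3) near a local minimum:} && \|x - x^\star\| \le \delta \text{ for a local minimum } x^\star,
   \text{ with } f \text{ $\alpha$-strongly convex on a ball around } x^\star.
\end{align*}
```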

Simple Example: Top eigenvector
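A common way to cast the top-eigenvector problem as a strict-saddle objective (an assumption here; the slide’s own formulation may differ):

```latex
% Assume M is symmetric PSD with a strictly largest eigenpair (\lambda_1, v_1).
\begin{align*}
f(x) &= \tfrac14 \, \| M - x x^{\top} \|_F^2, \\
\nabla f(x) &= \big(\|x\|^2 I - M\big)\, x, \qquad
\nabla^2 f(x) = \|x\|^2 I + 2\, x x^{\top} - M .
\end{align*}
% Critical points are x = 0 and x = \pm\sqrt{\lambda_i}\, v_i.  For every i \neq 1,
\begin{align*}
v_1^{\top}\, \nabla^2 f\big(\sqrt{\lambda_i}\, v_i\big)\, v_1 = \lambda_i - \lambda_1 < 0 ,
\end{align*}
% so each non-optimal critical point has a direction of strictly negative curvature, while the
% only local minima are the global minima x = \pm\sqrt{\lambda_1}\, v_1.  Note the symmetry:
% both signs are optimal, and their average x = 0 is itself a saddle point (\nabla^2 f(0) = -M).
```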

What problems are strict saddle?
• Eigenvector, generalized eigenvector
• Some tensor problems [G Huang Jin Yuan’ 15]
• Community detection / synchronization [Bandeira, Boumal, Voroninski’ 16]
• Dictionary learning [Sun Qu Wright’ 15]
• Matrix completion [G Lee Ma’ 16]
• Matrix sensing [Bhojanapalli, Neyshabur, Srebro’ 16]
• Asymmetric versions, sparse PCA [G Jin Zheng’ 17]

Outline
Machine learning, non-convex optimization, and saddle points
• Why are saddle points ubiquitous in machine learning?
• Why is it enough to handle saddle points?
How to escape saddle points efficiently

Setting
• Want to optimize a function f(x)
• f(x) has Lipschitz gradient
• f(x) has Lipschitz Hessian
• Goal: find a local minimum of f(x). (Recall: in many cases this is also a global minimum.)
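Written out, with constant names chosen here for illustration (they may differ from the talk’s notation):

```latex
% Smoothness and Hessian-Lipschitz assumptions:
\begin{align*}
\|\nabla f(x) - \nabla f(y)\| \le \ell \,\|x - y\|, \qquad
\|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho \,\|x - y\| .
\end{align*}
% The goal is typically formalized as finding an \epsilon-second-order stationary point:
\begin{align*}
\|\nabla f(x)\| \le \epsilon
\qquad \text{and} \qquad
\lambda_{\min}\big(\nabla^2 f(x)\big) \ge -\sqrt{\rho\,\epsilon} .
\end{align*}
```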

Gradient Descent
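A standard statement of the gradient-descent update and its first-order guarantee (not necessarily the exact statement on the slide):

```latex
% Gradient descent with step size \eta \le 1/\ell:
\begin{align*}
x_{t+1} = x_t - \eta\, \nabla f(x_t) .
\end{align*}
% For \ell-smooth f, this finds a point with \|\nabla f(x)\| \le \epsilon within
% O\big(\ell\,(f(x_0) - f^\star)/\epsilon^2\big) iterations, but such a point may still be a
% saddle point rather than a local minimum.
```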

What can we do at saddle points?
• Rely on second-order (Hessian) information.
• Find the negative eigenvector, go along that direction.
• Can we do this without computing the Hessian?
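To make the second-order idea concrete, here is a minimal NumPy sketch of one Hessian-based escape step; the helper name, thresholds, and step size are illustrative choices, not the talk’s algorithm:

```python
import numpy as np

def hessian_escape_step(grad, hess, x, eta=0.1, grad_tol=1e-6, curvature_tol=1e-6):
    """One illustrative second-order step (hypothetical helper, not the talk's algorithm).

    If the gradient is still large, take an ordinary gradient step.  Otherwise,
    look for a direction of negative curvature (an eigenvector of the Hessian
    with a negative eigenvalue) and move along it to escape the saddle point.
    """
    g = grad(x)
    if np.linalg.norm(g) > grad_tol:
        return x - eta * g                    # ordinary gradient descent step

    H = hess(x)                               # full d x d Hessian: expensive in high dimensions
    eigvals, eigvecs = np.linalg.eigh(H)      # eigenvalues in ascending order
    if eigvals[0] < -curvature_tol:           # negative curvature present -> strict saddle
        direction = eigvecs[:, 0]             # most negative eigendirection (sign arbitrary here)
        return x + eta * direction
    return x                                  # (approximate) second-order stationary point

# Toy usage on f(x) = 0.5 * (x1^2 - x2^2), which has a saddle point at the origin.
grad = lambda x: np.array([1.0, -1.0]) * x
hess = lambda x: np.diag([1.0, -1.0])
x = np.zeros(2)                               # exactly at the saddle: the gradient gives no signal
print(hessian_escape_step(grad, hess, x))     # steps along the negative-curvature direction
```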

Our Result
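The result’s exact rate is not reproduced here; the algorithm behind it, perturbed gradient descent, can be sketched roughly as follows (the thresholds, perturbation radius, and step size are placeholder values, not the paper’s constants):

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, eta=0.01, r=0.1,
                               grad_threshold=1e-3, t_threshold=50,
                               max_iters=10_000, seed=0):
    """Rough sketch of perturbed gradient descent (placeholder constants, not the paper's).

    Run ordinary gradient descent, but whenever the gradient is small and no
    perturbation has been added recently, add noise sampled uniformly from a
    small ball.  Near a strict saddle the noise gives the iterate a component
    along the negative-curvature direction, which later gradient steps amplify.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    last_perturb = -(t_threshold + 1)
    for t in range(max_iters):
        if np.linalg.norm(grad(x)) <= grad_threshold and t - last_perturb > t_threshold:
            noise = rng.normal(size=x.shape)
            noise *= r * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(noise)
            x = x + noise                      # uniform sample from the ball of radius r
            last_perturb = t
        x = x - eta * grad(x)                  # plain gradient descent step
    return x

# Toy usage: f(x) = 0.5*x1^2 - 0.5*x2^2 + 0.25*x2^4 has a strict saddle at the origin
# and local (here also global) minima at (0, 1) and (0, -1).
grad = lambda x: np.array([x[0], -x[1] + x[1] ** 3])
print(perturbed_gradient_descent(grad, np.zeros(2)))   # ends up near (0, 1) or (0, -1)
```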

Previous Results

Intuition: constant-Hessian case
• Take-away: if x0 has a nonzero projection on the negative eigendirection, then gradient descent can escape!
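A one-line calculation behind this take-away, for the quadratic (constant-Hessian) model around the saddle:

```latex
% Quadratic model f(x) = \tfrac12 x^{\top} H x, where H has eigenvalue -\gamma < 0
% with eigenvector e_1.  Gradient descent with step size \eta is a linear recursion:
\begin{align*}
x_{t+1} = x_t - \eta H x_t = (I - \eta H)\, x_t
\quad\Longrightarrow\quad
\langle e_1, x_t \rangle = (1 + \eta\gamma)^t \,\langle e_1, x_0 \rangle .
\end{align*}
% Any nonzero component of x_0 along e_1 grows geometrically, so gradient descent escapes;
% it stays stuck only on the measure-zero set \{x_0 : \langle e_1, x_0\rangle = 0\}.
```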

Difficulty

Tight analysis for Gradient Descent
• [Figure] Green: region where gradient descent gets stuck.
• The shape of the stuck region is complicated.
• Idea: prove the volume of the stuck region is small, without knowing where the region is!

Tight analysis for Gradient Descent
• [Figure labels: “Stuck at saddle”, “Must be able to escape!”]
• Key observation: the width of the stuck region is small.
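A sketch of why the width is small, paraphrasing the coupling argument from the corresponding paper (details and constants omitted here):

```latex
% Let e_1 be the most-negative-curvature direction at the saddle, and compare two perturbed
% starting points that differ only along e_1:
\begin{align*}
u_0 = w_0 + r\, e_1 .
\end{align*}
% The component of the difference u_t - w_t along e_1 grows geometrically under gradient descent,
% so the two sequences cannot both remain near the saddle for long: at least one of them escapes.
% Hence the stuck region has width at most r along e_1, its volume is small, and a random
% perturbation lands in it only with small probability.
```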

Summary
• Saddle points are ubiquitous in machine learning problems because of symmetry.
• For many problems, all local minima are global, so we only need to worry about saddle points.
• Perturbed gradient descent can find second-order stationary points as efficiently as first-order ones.

Open Problems
• What other problems are strict saddle?
• Extend to “not-so-simple” functions?
• Can we design new objectives, or modify old objectives, to make them strict saddle?
• Can we analyze other popular algorithms?
  • Stochastic gradient descent
  • Acceleration

Thank You!