Math & Stat Department Colloquium, Georgetown University, March 20, 2020

Security and Privacy for Distributed Optimization and Learning
Nitin Vaidya, Georgetown University

Goals
- Background
- Problem formulation
- Intuition
No theorems/proofs

Rendezvous

Averaging

Machine Learning
- Data is distributed across different agents (Agent 1, Agent 2, Agent 3, Agent 4)
- Collaborate to learn

Classification G. Renee Guzlas

Gradient Method
Iterates: x[0], x[1], x[2], x[3]
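The iterates named on this slide can be reproduced with a minimal sketch of the gradient method. The cost function, starting point, and step size below are illustrative assumptions, not values from the talk.

```python
def gradient_descent(grad, x0, step, iters):
    """Return the list of iterates x[0], x[1], ..., x[iters]."""
    xs = [x0]
    for _ in range(iters):
        # Move against the gradient evaluated at the current iterate.
        xs.append(xs[-1] - step * grad(xs[-1]))
    return xs

# Hypothetical cost f(x) = (x - 2)^2 with gradient 2(x - 2); optimum at x = 2.
iterates = gradient_descent(lambda x: 2.0 * (x - 2.0), x0=0.0, step=0.25, iters=3)
print(iterates)  # [0.0, 1.0, 1.5, 1.75]
```

Each step halves the distance to the optimum for this particular cost and step size.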

Distributed Optimization

1984

Architectures [figure with a server]

Distributed Optimization: iterative algorithm
- Each agent maintains local estimate of optimum x
- Local estimates shared with neighbors in each iteration
- Local estimates converge to optimum

Figure: local estimates x1, x2, x3. Example based on [Nedic and Ozdaglar, 2009]

Distributed Optimization: in the limit as t → ∞
- Consensus: all agents converge to the same estimate
- Optimality
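The iterative scheme described above (share estimates with neighbors, then update locally) can be sketched as a consensus step followed by a local gradient step, in the spirit of [Nedic and Ozdaglar, 2009]. The quadratic local costs, the weight matrix, and the 1/t step size are assumptions chosen for illustration.

```python
def distributed_gradient_descent(grads, weights, x0, iters=2000):
    """Each agent mixes neighbors' estimates, then takes a local gradient step."""
    n = len(grads)
    x = list(x0)
    for t in range(1, iters + 1):
        step = 1.0 / t  # diminishing step size (an assumption of this sketch)
        # Consensus step: weighted average of the estimates agent i receives.
        mixed = [sum(weights[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Local step: each agent uses only the gradient of its own cost.
        x = [mixed[i] - step * grads[i](x[i]) for i in range(n)]
    return x

# Hypothetical local costs f_i(x) = (x - c_i)^2; their sum is minimized at mean(c) = 3.
targets = [1.0, 2.0, 6.0]
grads = [lambda x, c=c: 2.0 * (x - c) for c in targets]
# Doubly stochastic weights for a fully connected 3-agent network (assumed).
weights = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
estimates = distributed_gradient_descent(grads, weights, x0=[0.0, 0.0, 0.0])
print(estimates)
```

With these assumptions all three estimates end up close to one another (consensus) and close to 3 (optimality), even though no agent ever sees another agent's cost function.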

Why does this work?

Architectures [figure with a server]

Parameter Server

Many Variations
… stochastic optimization
… asynchronous
… gradient compression
… acceleration
… shared memory

Architectures [figure with a server]

Challenges
- Fault-tolerant distributed optimization
  f1(x) + f2(x) + f3(x): how to optimize if agents inject bogus information?
- Privacy-preserving distributed optimization
  How to collaborate without revealing own cost function?

Fault-Tolerant Optimization 2015 …

Byzantine Fault Model
- No constraint on misbehavior of faulty agents

Rendezvous

Machine Learning
Faulty agent can adversely affect model parameters

Fault-Tolerance
- What should be the objective of fault-tolerant optimization?
- Optimize over only good agents … set G

Parameter Server

Is this achievable? It depends:
- Independent functions
- “Enough” redundancy

Independent Functions [figure; labels: a, b, c]
Provably impossible to compute

Is this achievable?
- Independent functions: approximate
- “Enough” redundancy: exact

An Example of Redundancy

n = 6 (number of agents), t = 1 (faulty agents)
G. Renee Guzlas

Parameter Server

Norm Filter
- Exact optimum computed despite faulty agents

Another Example of Redundancy
- Machine learning
- Agents draw samples from identical data distribution
- Filter on stochastic gradients
[figure labels: Agent 1, Agent 2]

Is this achievable?
- Independent functions: approximate
- “Enough” redundancy: exact

How to Approximate
- Ideal goal: equal weight for all non-faulty agents
- Approximation: unequal weights

Results
- For each faulty agent, a good agent may be ignored (weight 0)
- But the remaining good agents can get almost uniform importance

Results
- Cannot compute

Parameter Server

Privacy-Preserving Optimization 2016 …

Communication Leaks Information
- Server can use gradients to infer polynomial cost functions (up to a constant)
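To see why gradients leak a polynomial cost function up to a constant, consider the simplest case of a scalar quadratic. The sketch below is an illustration under that assumption; the function name `infer_quadratic` and the private cost are hypothetical.

```python
def infer_quadratic(obs1, obs2):
    """Recover a and b of f(x) = a*x^2 + b*x + c from two (x, gradient) pairs.

    The gradient is 2*a*x + b, so two observations at distinct points fix
    a and b. The constant c never appears in any gradient and stays hidden.
    """
    (x1, g1), (x2, g2) = obs1, obs2
    a = (g1 - g2) / (2.0 * (x1 - x2))
    b = g1 - 2.0 * a * x1
    return a, b

# Hypothetical private cost f(x) = 3x^2 - 4x + 7, so the agent reports 6x - 4.
private_grad = lambda x: 6.0 * x - 4.0
a, b = infer_quadratic((0.0, private_grad(0.0)), (1.0, private_grad(1.0)))
print(a, b)  # 3.0 -4.0
```

A server that sees gradients at enough distinct estimates can do the same for higher-degree polynomials by fitting the gradient and integrating.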

Related Work
- Cryptographic methods
- Transformation methods
- Differential privacy: Query → Database + Noise → Perturbed Output

Our Approach
- Motivated by secret sharing & differential privacy
- Add cancellable noise

Multiple Parameter Servers [figure: Server 1 and Server 2 with a consensus step]

Improving Privacy [figure: Server 1, Server 2]
ε1 + ε2 = 0 over time
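One possible reading of cancellable noise with two servers is sketched below: an agent perturbs what it sends to each server with noise terms that sum to zero, so neither upload alone reveals the gradient but their sum does. The splitting rule, the noise distribution, and the function name are assumptions of this sketch, not the construction from the talk.

```python
import random

def split_with_cancellable_noise(gradient, rng, scale=10.0):
    """Split a gradient into two noisy uploads whose noise terms cancel."""
    eps1 = [rng.uniform(-scale, scale) for _ in gradient]
    eps2 = [-e for e in eps1]  # eps1 + eps2 = 0 in every coordinate
    to_server1 = [g / 2.0 + e for g, e in zip(gradient, eps1)]
    to_server2 = [g / 2.0 + e for g, e in zip(gradient, eps2)]
    return to_server1, to_server2

gradient = [3.0, -1.0]  # hypothetical true gradient held by one agent
share1, share2 = split_with_cancellable_noise(gradient, random.Random(1))
recovered = [s1 + s2 for s1, s2 in zip(share1, share2)]
print(share1, share2, recovered)
```

Once the servers combine what they received (for example in a consensus step), the noise cancels and optimality is not affected.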

Convex Sum of Non-Convex Functions [figure: Server 1, Server 2]
- “Privacy” if at least one server is non-adversarial
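A toy illustration of the slide title: two functions that are each non-convex can still sum to a convex function. The specific split below is an assumption made for illustration and is not the construction used in the talk.

```python
import math

def f1(x):
    # Hypothetical piece sent to Server 1: non-convex because of the sine term.
    return 0.5 * x * x + 2.0 * math.sin(2.0 * x)

def f2(x):
    # Hypothetical piece sent to Server 2: the sine term has the opposite sign.
    return 0.5 * x * x - 2.0 * math.sin(2.0 * x)

def total(x):
    # The sine terms cancel, leaving the convex cost x^2.
    return f1(x) + f2(x)

def second_derivative(f, x, h=1e-3):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

print(second_derivative(f1, math.pi / 4))   # negative: f1 is not convex
print(second_derivative(f2, -math.pi / 4))  # negative: f2 is not convex
print(second_derivative(total, 1.0))        # about 2: the sum is convex
```

A server that sees only one piece sees a function that looks nothing like the agent's true cost.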

Fault-tolerance: gradient filters
Privacy: cancellable noise

Decentralized Control/Optimization
Distributed Computing
Picture from Wikipedia

Lili Su Shripad Gade Dimitrios Pylorof Nirupam Gupta Shuo Liu

Thanks! disc.georgetown.domains

Net-X: Multi-Channel Mesh
Theory to Practice
- Capacity bounds
- Insights on protocol design
- OS improvements
- Software architecture
- Net-X testbed
[figure: nodes A to F with fixed and switchable channels; Linux box stack: User Applications, IP Stack, ARP, Multi-channel protocol, Channel Abstraction Module, Interface Device Driver; CSL]

Hajnal 1958: Weak ergodicity of nonhomogeneous Markov chains
DeGroot 1974: Reaching a consensus

Distributed Computing
- 1980: Pease, Shostak, Lamport: Byzantine consensus
- 1983: Fischer, Lynch, Paterson: Asynchronous consensus impossibility result
- 1986: Dolev et al.: Approximate Byzantine consensus

Decentralized Control
- Tsitsiklis 1984
- Jadbabaie, Lin, Morse 2003: Flocking problem
- Nedich, Ozdaglar 2009

Distributed Optimization
- Each agent maintains an estimate
- Local estimates shared with neighbors & updated
- Estimates converge to optimum

Figure: local estimates x1, x2, x3. Example based on [Nedic and Ozdaglar, 2009]

Distributed Optimization: as time → ∞
- Consensus: all agents converge to same estimate
- Optimality

t faulty agents
- Discard smallest t and largest t gradients
- Average the rest
[figure: example with t = 1]
What does this achieve?
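The filter on this slide (discard the smallest t and largest t gradients, average the rest) can be sketched for scalar gradients as follows. The sample values are hypothetical, with one agent reporting a bogus gradient.

```python
def trimmed_mean(gradients, t):
    """Discard the t smallest and t largest values, then average the rest."""
    if len(gradients) <= 2 * t:
        raise ValueError("need more than 2t gradients to trim t from each end")
    kept = sorted(gradients)[t:len(gradients) - t]
    return sum(kept) / len(kept)

# Hypothetical reports from n = 6 agents; the last one is bogus.
reports = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]
print(trimmed_mean(reports, t=1))  # 3.5
```

The bogus value 1000.0 is filtered away, but so is the honest report 1.0, which is why a good agent may end up ignored.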

Our Ideal Goal

Independent Functions: how good an approximation?
- Instead of uniform weights, the filter achieves unequal weights

Independent Functions
- Cost functions of faulty nodes “filtered away”
- At most t good cost functions also filtered away
- Nearly uniform importance to the remaining costs

Independent Functions: how good an approximation?
[figure values: 0 0 ¼ ¼ 0 0; ⅛ ⅛ ¼ ¼ 0 0]

Good News: cost functions often naturally redundant
- Data sets at different agents may be drawn from same distribution
  In expectation, all cost functions are identical
- Observations by different agents conditioned on the same ground truth

Linear Regression

Distributed Linear Regression [figure with a server]

2t-Redundancy … Linear Regression

Privacy Techniques
- Differential privacy … add noise ε
- Optimality compromised due to the noise
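A small sketch of why added noise compromises optimality: when an agent perturbs every gradient it reports, the iterates keep moving around the optimum instead of settling on it. The cost function, step size, and Laplace noise scale are assumptions for illustration and are not calibrated to any privacy budget.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def descend(grad, x0, step, iters, noise_scale=0.0, rng=None):
    """Gradient descent in which the reported gradient may be perturbed."""
    x = x0
    for _ in range(iters):
        g = grad(x)
        if noise_scale > 0.0:
            g += laplace_noise(noise_scale, rng)  # agent perturbs what it reports
        x -= step * g
    return x

grad = lambda x: 2.0 * (x - 2.0)  # hypothetical cost (x - 2)^2, optimum at 2
clean = descend(grad, x0=0.0, step=0.25, iters=200)
noisy = descend(grad, x0=0.0, step=0.25, iters=200,
                noise_scale=1.0, rng=random.Random(0))
print(abs(clean - 2.0), abs(noisy - 2.0))
```

The noiseless run lands on the optimum, while the noisy run ends at some distance from it that depends on the noise scale and step size.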