Neural Networks as Universal Approximators, Yotam Amar and Heli Ben Hamu



























Neural Networks as Universal Approximators Yotam Amar and Heli Ben Hamu Advanced Reading in Deep-Learning and Vision Spring 2017
Motivation
Practical uses:
• Replicating black-box functions (e.g. decryption functions)
• More effective implementation
• Visualization tools for neural networks
• Optimization of functions via backpropagation through the approximation
Theoretical uses:
• Understanding NNs as a hypothesis class
• What can NNs be used for (besides vision)?
Outline:
• Universal Approximation Theorem: the good and the bad
• Barron Functions
• Multilayer Barron Theorem
• Matlab Demo
Universal Approximation Theorem
Single Hidden Layer NN: input layer → hidden layer → output neuron
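The single-hidden-layer architecture can be sketched in a few lines of Python with NumPy; the layer widths and the logistic activation below are illustrative choices, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    # logistic sigmoidal activation
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b, alpha, beta):
    # hidden layer: N units, unit i computes sigma(w_i . x + b_i)
    h = sigmoid(W @ x + b)
    # single output neuron: linear combination of the hidden activations
    return alpha @ h + beta

rng = np.random.default_rng(0)
n, N = 3, 5                    # input dimension, hidden width (arbitrary)
W = rng.normal(size=(N, n))
b = rng.normal(size=N)
alpha = rng.normal(size=N)
beta = 0.0

y = forward(rng.normal(size=n), W, b, alpha, beta)
print(float(y))
```

Sums of exactly this form, alpha·sigma(Wx + b) + beta, are the object the approximation theorems below are about.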
Quick Review: Measure Theory (set of all outcomes, measure)
Notations
Definitions
Theorem 1 [Cybenko 89’]
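The theorem statement itself lived in the slide images. For reference, the result usually quoted from Cybenko (1989) is the following (standard wording, not recovered from the slides):

```latex
% sigma is sigmoidal: sigma(t) -> 1 as t -> +infinity and
% sigma(t) -> 0 as t -> -infinity.
% For continuous sigmoidal sigma, finite sums of the form
G(x) = \sum_{i=1}^{N} \alpha_i \, \sigma\!\left( w_i^{\top} x + b_i \right)
% are dense in C([0,1]^n): for every f in C([0,1]^n) and every
% epsilon > 0 there exist N, alpha_i, w_i, b_i such that
\sup_{x \in [0,1]^n} \left| f(x) - G(x) \right| < \varepsilon .
```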
Proof of Theorem 1 [Cybenko 89’]
Theorem 1 [Cybenko 89’]
Lemma 1 [Cybenko 89’]
Theorem 2 [Cybenko 89’]
The Good and the Bad
Can We Classify? The previous result shows that we can approximate continuous functions to any desired precision. Classification, however, requires approximating a discontinuous function, namely a decision function. Can we still do it? YES!
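A quick numerical illustration (not from the slides) of why the answer is yes in the average-error sense: a single steep sigmoidal unit matches the decision function 1[x > 0.5] everywhere except a thin strip around the jump, so the mean error shrinks as the unit gets steeper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(0.0, 1.0, 10_001)
target = (xs > 0.5).astype(float)   # discontinuous decision function

for steepness in (10.0, 100.0, 1000.0):
    approx = sigmoid(steepness * (xs - 0.5))
    l1_error = np.mean(np.abs(target - approx))
    print(f"steepness={steepness:7.1f}  mean abs error={l1_error:.5f}")
```

The sup-norm error stays 0.5 at the jump no matter what, which is exactly why the classification result is stated in terms of measure rather than uniform approximation.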
What is the catch?
• No guarantee on the size of the hidden layer
• The curse of dimensionality: the number of nodes needed usually depends on the dimension
• No optimization guarantees
Are there functions with better guarantees? What about multiple layers? New results for Barron functions:
• The approximation error does not depend on the dimension
• One Barron function can be approximated with one hidden layer
• A composition of n Barron functions can be approximated with n hidden layers
• Still no optimization guarantees
Barron Functions Theorem
Barron functions
Barron Theorem [Bar 93]
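As with the other theorem slides, the statement was an image; the commonly quoted form of Barron's bound (standard wording, not recovered from the slides) is:

```latex
% Barron's constant: the first moment of the Fourier magnitude of f,
C_f = \int_{\mathbb{R}^n} \lVert \omega \rVert \, \bigl| \hat{f}(\omega) \bigr| \, d\omega .
% For every f with C_f < \infty, every probability measure \mu on the
% ball B_r = \{ x : \lVert x \rVert \le r \}, and every N \ge 1, there is
% a single-hidden-layer network f_N with N sigmoidal units such that
\int_{B_r} \bigl( f(x) - f_N(x) \bigr)^2 \, \mu(dx) \;\le\; \frac{(2 r C_f)^2}{N} .
```

Note that the right-hand side depends on N and on C_f but not explicitly on the dimension n, which is the "does not depend on the dimension" claim from the previous section.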
Common Activation Functions (plots were on the slide images; equations and ranges below are the standard definitions)

Name         Equation                         Range
Binary Step  0 if x < 0, 1 if x >= 0          {0, 1}
Logistic     1 / (1 + e^-x)                   (0, 1)
TanH         (e^x - e^-x) / (e^x + e^-x)      (-1, 1)
ArcTan       arctan(x)                        (-pi/2, pi/2)
ReLU         max(0, x)                        [0, inf)
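The activations in the table are one-liners in NumPy; a small sketch for reference (the trailing underscores just avoid shadowing NumPy's own names):

```python
import numpy as np

def binary_step(x): return np.where(np.asarray(x) < 0, 0.0, 1.0)
def logistic(x):    return 1.0 / (1.0 + np.exp(-np.asarray(x)))
def tanh_(x):       return np.tanh(x)
def arctan_(x):     return np.arctan(x)
def relu(x):        return np.maximum(0.0, np.asarray(x))

x = np.array([-2.0, 0.0, 2.0])
for name, f in [("binary step", binary_step), ("logistic", logistic),
                ("tanh", tanh_), ("arctan", arctan_), ("relu", relu)]:
    print(f"{name:12s} {f(x)}")
```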
Multilayer Barron Theorem
Multilayer Barron Theorem [2017]
Multilayer Barron Theorem [2017] Why is this important? Because there are complex functions that are compositions of Barron functions.
Multilayer Barron Theorem [2017] What’s next?
Matlab Demo
fitnet vs. parametric curve fitting
Curve fitting:
• Advantage: if the model of the function is known, we only need to estimate its parameters.
• Disadvantage: we need a model before we can begin fitting.
fitnet:
• Advantage: approximates well without any model of the function.
• Disadvantage: sometimes we want to learn from the parameters of the model, but the NN is a black box.
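The Matlab demo itself is not included in the deck; a rough Python analogue of the comparison follows, on hypothetical data. The "network" side is simplified: it uses a single hidden layer of random tanh units with a least-squares output layer, whereas fitnet also trains the hidden weights.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-1.0, 1.0, 200)
ys = 1.5 * np.sin(2.0 * xs) + 0.01 * rng.normal(size=xs.size)  # toy data

# --- Parametric curve fitting: assumes the model y = a*sin(2x) is known,
#     so least squares recovers the single parameter directly.
basis = np.sin(2.0 * xs)
a = float(np.dot(basis, ys) / np.dot(basis, basis))
ls_mse = float(np.mean((a * basis - ys) ** 2))

# --- Network fitting: no model assumed. One hidden layer of tanh units
#     with random input weights; only the output layer is fit here.
H = 16
W1 = rng.normal(size=(H, 1))
b1 = rng.normal(size=(H, 1))
Z = np.tanh(W1 * xs[None, :] + b1)           # hidden activations, (H, 200)
A = np.vstack([Z, np.ones((1, xs.size))]).T  # add an output-bias column
w, *_ = np.linalg.lstsq(A, ys, rcond=None)
nn_mse = float(np.mean((A @ w - ys) ** 2))

print(f"parametric fit: a={a:.3f}, mse={ls_mse:.5f}")
print(f"network fit:    mse={nn_mse:.5f}")
```

Both fits reach a small error, but only the parametric one hands back an interpretable parameter (a ≈ 1.5); the network's weights say nothing readable about the underlying function, which is the black-box disadvantage above.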
Questions
References
• George Cybenko, 1989, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems
• Andrew R. Barron, 1993, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory
• Walter Rudin, 1991, Functional Analysis, McGraw-Hill (Hahn-Banach Theorem)
• Riesz Representation Theorem, Wolfram MathWorld
• Holden Lee, Rong Ge, Andrej Risteski, Tengyu Ma, Sanjeev Arora, On the ability of neural nets to express distributions
Theorem 2 [Cybenko 89’]
Proof of Theorem 1 [Cybenko 89’]
Theorem 3 [Cybenko 89’]
Multilayer Barron Theorem [2017] Why is this important? A network with multiple hidden layers of poly(n) width can approximate functions that require an exponential number of nodes with a single hidden layer.