Sparse Modeling of Data and its Relation to Deep Learning
Michael Elad
Computer Science Department, The Technion - Israel Institute of Technology, Haifa 32000, Israel
October 31st to November 1st, 2019

This Lecture Presents …
A Theoretical Explanation of Deep-Learning (DL) Architectures based on Sparse Data Modeling
Context:
o A theoretical explanation for DL has become the holy grail of the data sciences, and this event is all about it
o There is a growing volume of such contributions
o Our work presents another chapter in this “growing book” of knowledge
o The overall dream: a coherent and complete theory for deep learning
Michael Elad, The Computer-Science Department, The Technion

Who Needs Theory? We All Do!!
… because a theory …
o … could bring the next rounds of ideas to this field, breaking existing barriers and opening new opportunities
o … could map clearly the limitations of existing DL solutions, and point to key features that control their performance
o … could remove the feeling many of us have that DL is “dark magic”, turning it into a solid scientific discipline
Ali Rahimi (NIPS 2017 Test-of-Time Award): “Machine learning has become alchemy”
Yann LeCun: Understanding is a good thing … but another goal is inventing methods. In the history of science and technology, engineering preceded theoretical understanding:
§ Lens & telescope → Optics
§ Steam engine → Thermodynamics
§ Airplane → Aerodynamics
§ Radio & Comm. → Info. Theory
§ Computer → Computer Science

A Theory for DL?
[Diagram labels: Architecture, Data, Algorithms]
§ Raja Giryes (TAU): Studied the architecture of DNN in the context of their ability to give distance-preserving embedding of signals
§ Stephane Mallat (ENS) & Joan Bruna (NYU): Proposed the scattering transform and emphasized the treatment of invariances in the input data
§ Richard Baraniuk & Ankit Patel (RICE): Offered a generative probabilistic model for the data, showing how classic architectures and learning algorithms relate to it
§ Gitta Kutyniok (TU) & Helmut Bolcskei (ETH): Studied the ability of DNN architectures to approximate families of functions
§ Rene Vidal (JHU): Explained the ability to optimize the typical non-convex objective and yet get to a global minimum
§ Stefano Soatto’s team (UCLA): Analyzed the Stochastic Gradient Descent (SGD) algorithm, connecting it to the Information Bottleneck objective

Where Are We in this Map? What About You?
[Diagram: the map of the previous slide (Giryes; Mallat & Bruna; Baraniuk & Patel; Kutyniok & Bolcskei; Vidal; Soatto’s team), with the labels Architecture, Data, Algorithms]
§ Eran Malach (SGD, generalization, deep generative model)
§ Haim Sompolinsky (data manifold, geometry)
§ Sanjeev Arora (loss func. connectivity, optimization & generalization)
§ Tomaso Poggio (approximation, optimization, generalization)
§ Jeffrey Pennington (rotation-based NN, batch-normalization)
§ Surya Ganguli (point networks, dynamics of learning)
§ Naftali Tishby (information bottleneck)
§ Yasaman Bahri (training & generalization)
Our work? We start by modeling the data and show how it reflects on the choice of the architectures and on their expected performance

Interesting Observation
o Languages used: Signal Processing, Control Theory, Information Theory, Harmonic Analysis, Sparse Representation, Quantum Physics, PDE, Machine learning, Theoretical CS, Neuroscience, …
  Ron Kimmel: “DL is a dark monster covered with mirrors. Everyone sees his reflection in it …”
  David Donoho: “… these mirrors are taken from Cinderella's story, telling each that he is the most beautiful”
o Today’s talk is on our proposed theoretical view … and our theory is the best
[Diagram labels: Architecture, Data, Algorithms]
Yaniv Romano, Vardan Papyan, Jeremias Sulam, Aviad Aberdam

This Lecture: More Specifically
Sparsity-Inspired Models:
§ Sparseland: Sparse Representation Theory
§ CSC: Convolutional Sparse Coding
§ ML-CSC: Multi-Layered Convolutional Sparse Coding
Deep-Learning
Another underlying idea that accompanies us: generative modeling of data sources enables
o A systematic algorithm development, and
o A theoretical analysis of their performance

Our eventual goal in today’s talk is to present the …
Multi-Layered Convolutional Sparse Modeling
So, let's use this as our running title, parse it into words, and explain each of them

Multi-Layered Convolutional Sparse Modeling

Our Data is Structured
[Figure labels: Stock Market, Text Documents, Matrix Data, Biological Signals, Social Networks, Still Images, Videos, Seismic Data, Radar Imaging, Traffic info, Voice Signals, 3D Objects, Medical Imaging]
o We are surrounded by various diverse sources of massive information
o Each of these sources has an internal structure, which can be exploited
o This structure, when identified, is the engine behind the ability to process data
o How to identify structure? Using models

Models
o A model: a mathematical description of the underlying signal of interest, describing our beliefs regarding its structure
o The following is a partial list of commonly used models for images: Principal-Component-Analysis, Gaussian-Mixture, Markov Random Field, Laplacian Smoothness, DCT concentration, Wavelet Sparsity, Piece-Wise-Smoothness, C2-smoothness, Besov-Spaces, Total-Variation, Beltrami-Flow
o Good models should be simple while matching the signals [Diagram labels: Simplicity, Reliability]
o Models are almost always imperfect

What is this Talk all About? Data Models and Their Use
o Almost any task in data processing requires a model: this is true for denoising, deblurring, super-resolution, inpainting, compression, anomaly-detection, sampling, recognition, separation, and more
o Sparse and Redundant Representations offer a new and highly effective model, which we call Sparseland
o We shall describe this model and descendant versions of it that lead all the way to … deep-learning

Multi-Layered Convolutional Sparse Modeling

A New Emerging Model
[Diagram: Sparseland, with the following labels around it]
Fields: Machine Learning, Signal Processing, Mathematics, Approximation Theory, Wavelet Theory, Multi-Scale Analysis, Signal Transforms, Linear Algebra, Optimization Theory
Tasks: Semi-Supervised Learning, Interpolation, Inference (solving inverse problems), Compression, Recognition, Source-Separation, Prediction, Clustering, Segmentation, Sensor-Fusion, Classification, Summarizing, Denoising, Anomaly detection, Identification, Synthesis

The Sparseland Model
o Task: model image patches of size 8×8 pixels
o We assume that a dictionary of such image patches is given, containing 256 atom images
o The Sparseland model assumption: every image patch can be described as a linear combination of few atoms
[Figure: a patch formed as a weighted sum (Σ) of atoms with weights α1, α2, α3]

The Sparseland Model
Properties of this model: Sparsity and Redundancy
o We start with an 8-by-8 pixel patch and represent it using 256 numbers: this is a redundant representation
o However, out of those 256 elements in the representation, only 3 are non-zeros: this is a sparse representation
o Bottom line in this case: the 64 numbers representing the patch are replaced by 6 (3 for the indices of the non-zeros, and 3 for their entries)
[Figure: a patch formed as a weighted sum (Σ) of atoms with weights α1, α2, α3]

Chemistry of Data
We could refer to the Sparseland model as the chemistry of information:
o Our dictionary stands for the Periodic Table containing all the elements
o Our model follows a similar rationale: every molecule is built of few elements
[Figure: a patch formed as a weighted sum (Σ) of atoms with weights α1, α2, α3]

Sparseland: A Formal Description
[Figure: the model M, in which a dictionary (labeled with the dimensions n and m) multiplies a sparse vector]
o The sparse vector is generated with few nonzeros at arbitrary locations and values
o This is a generative model that describes how (we believe) signals are created
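
The generative view above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the random Gaussian dictionary, the Gaussian nonzero values, and the names sample_sparseland, D, alpha and k are hypothetical choices, not taken from the slides. The sizes (64 pixels, 256 atoms, 3 nonzeros) follow the 8×8 patch example given earlier.

```python
import numpy as np

def sample_sparseland(D, k, rng):
    """Draw one signal from a Sparseland-style generative model (illustrative).

    D   : dictionary with one atom per column (n x m)
    k   : number of nonzeros in the sparse vector
    rng : numpy random Generator
    """
    n, m = D.shape
    alpha = np.zeros(m)
    # Arbitrary locations: k distinct atoms chosen uniformly at random.
    support = rng.choice(m, size=k, replace=False)
    # Arbitrary values: Gaussian here, purely as an assumption.
    alpha[support] = rng.standard_normal(k)
    return D @ alpha, alpha

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))   # stand-in for a learned dictionary
D /= np.linalg.norm(D, axis=0)       # normalize the atoms (columns)
x, alpha = sample_sparseland(D, k=3, rng=rng)
```

Here x plays the role of the 64-pixel patch and alpha the 256-long representation with only 3 nonzeros.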

Difficulties with Sparseland
o Problem 1: Given a signal, how can we find its atom decomposition?
o A simple example:
§ There are 2000 atoms in the dictionary
§ The signal is known to be built of 15 atoms, giving C(2000, 15) ≈ 2.4e37 possibilities
§ If each of these takes 1 nano-sec to test, this will take ~7.5e20 years to finish!
o So, are we stuck?
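
The figure of roughly 7.5e20 years can be checked with a short calculation. The sketch below assumes that "possibilities" means the number of ways to choose the 15 participating atoms out of 2000, and that a year has 365.25 days; the variable names are illustrative.

```python
import math

n_atoms, k = 2000, 15

# Number of candidate supports: choose which 15 of the 2000 atoms are active.
n_subsets = math.comb(n_atoms, k)

# One nanosecond per candidate, converted to years.
seconds = n_subsets * 1e-9
years = seconds / (365.25 * 24 * 3600)

print(f"{n_subsets:.3e} candidate supports, about {years:.2e} years")
```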

Atom Decomposition Made Formal
[Figure: dictionary with dimension labels n and m]
§ L0: counting the number of non-zeros in the vector
§ This is a projection onto the Sparseland model
§ These problems are known to be NP-hard
Approximation Algorithms:
§ Relaxation methods: Basis-Pursuit
§ Greedy methods: Thresholding
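
The formulas of this slide did not survive extraction. As a hedged reminder, the atom-decomposition problems are usually written as below; the symbols x, y, D, α and ε are assumptions based on the standard sparse-representation literature, not recovered from the slide.

```latex
(P_0):\quad \min_{\alpha} \|\alpha\|_0 \ \ \text{s.t.} \ \ x = D\alpha
\qquad\qquad
(P_0^{\epsilon}):\quad \min_{\alpha} \|\alpha\|_0 \ \ \text{s.t.} \ \ \|D\alpha - y\|_2 \le \epsilon
```

The first form targets an exact representation of a clean signal x; the second allows a deviation of up to ε from a noisy measurement y.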

Pursuit Algorithms
Approximation Algorithms: Basis Pursuit
Change the L0 into L1, and then the problem becomes convex and manageable
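
One common way to solve the L1-relaxed problem in its penalized (Lagrangian) form is the iterative soft-thresholding algorithm (ISTA). The slide does not name a solver, so the sketch below is only one possible choice; the names ista, soft_threshold, lam and n_iter are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Entry-wise soft thresholding: shrink magnitudes by t, zeroing small entries."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(D, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - D a||_2^2 + lam * ||a||_1 by iterative soft thresholding."""
    # Step size 1/L, where L is the largest eigenvalue of D^T D.
    L = np.linalg.norm(D, 2) ** 2
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - y)              # gradient of the quadratic term
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha
```

Each iteration is a gradient step on the data-fidelity term followed by a shrinkage step, which is what makes the convex L1 problem "manageable" in practice.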

Difficulties with Sparseland
o There are various pursuit algorithms
o Here is an example using the Basis Pursuit (L1): [Figure: a patch formed as a weighted sum (Σ) of atoms with weights α1, α2, α3]
o Surprising fact: many of these algorithms are accompanied by theoretical guarantees for their success, if the unknown is sparse enough

The Mutual Coherence
o Compute [Figure: a matrix product, shown with an = sign], assuming normalized columns
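
The slide's formula was lost in extraction. Under the usual definition, which is assumed here, the mutual coherence is the largest absolute off-diagonal entry of the Gram matrix of the column-normalized dictionary. A minimal sketch, with the hypothetical name mutual_coherence:

```python
import numpy as np

def mutual_coherence(D):
    """Largest absolute inner product between two distinct normalized atoms."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)  # normalize the columns
    G = np.abs(Dn.T @ Dn)                              # absolute Gram matrix
    np.fill_diagonal(G, 0.0)                           # ignore self-products
    return G.max()
```

An orthonormal dictionary gives a coherence of 0, and the value grows toward 1 as atoms become more similar to each other.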

Basis-Pursuit Success
Donoho, Elad & Temlyakov (‘06)
[Figure labels: M, +]

Difficulties with Sparseland
o Problem 2: Given a family of signals, how do we find the dictionary to represent it well?
o Solution: Learn! Gather a large set of signals (many thousands), and find the dictionary that sparsifies them
o Such algorithms were developed in the past 10 years (e.g., K-SVD), and their performance is surprisingly good
o We will not discuss this matter further in this talk due to lack of time

Difficulties with Sparseland
o Problem 3: Why is this model suitable to describe various sources? E.g., is it good for images? Audio? Stocks? …
o General answer: Yes, this model is extremely effective in representing various sources
§ Theoretical answer: clear connection to other models
§ Empirical answer: in a large variety of signal and image processing (and later machine learning) tasks, this model has been shown to lead to state-of-the-art results

Difficulties with Sparseland?
o Problem 1: Given an image patch, how can we find its atom decomposition?
o Problem 2: Given a family of signals, how do we find the dictionary to represent it well?
o Problem 3: Is this model flexible enough to describe various sources? E.g., is it good for images? Audio? …
[Stamp across the slide: ALL ANSWERED POSITIVELY AND CONSTRUCTIVELY]

Sparseland for Image Processing
o When handling images, Sparseland is typically deployed on small overlapping patches due to the desire to train the model to fit the data better
o The model assumption is: each patch in the image is believed to have a sparse representation w.r.t. a common local dictionary
o What is the corresponding global model? This brings us to … the Convolutional Sparse Coding (CSC)

Multi-Layered Convolutional Sparse Modeling
Joint work with Yaniv Romano, Vardan Papyan, Jeremias Sulam
1. V. Papyan, J. Sulam, and M. Elad, Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding, IEEE Trans. on Signal Processing, Vol. 65, No. 21, Pages 5687-5701, November 2017.

Convolutional Sparse Coding (CSC)
This model emerged in 2005-2010, developed and advocated by Yann LeCun and others. It serves as the foundation of Convolutional Neural Networks

CSC in Matrix Form

The CSC Dictionary
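
The figures for the matrix form of CSC were lost in extraction. As a rough sketch of the idea, the global CSC dictionary can be built by placing all shifts of a small set of local filters as columns, so that multiplying it by a representation vector equals a sum of convolutions. The circular boundary handling, the interleaved column order, and the names csc_dictionary, filters and N are assumptions made for this illustration.

```python
import numpy as np

def csc_dictionary(filters, N):
    """Build an N x (N*m) convolutional dictionary from m filters of length n.

    Column i*m + j holds filter j shifted to start at sample i, with
    circular wrap-around at the boundary (an illustrative choice).
    """
    m, n = filters.shape
    D = np.zeros((N, N * m))
    for i in range(N):                    # spatial shift
        rows = (i + np.arange(n)) % N     # samples covered by this shift
        for j in range(m):                # filter index
            D[rows, i * m + j] = filters[j]
    return D

rng = np.random.default_rng(0)
filters = rng.standard_normal((2, 3))     # m = 2 filters of length n = 3
D = csc_dictionary(filters, N=16)
```

Every column has at most n nonzeros, so the matrix is banded (up to the wrap-around), which is the structure that the local, patch-based analysis exploits.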

Why CSC?
[Figure labels: =, Ω (stripe-dictionary), stripe vector]

Classical Sparse Theory for CSC?
The classical Sparseland Theory does not provide good explanations for the CSC model

Moving to Local Sparsity: Stripes
The main question we aim to address is this: can we generalize the vast theory of Sparseland to this new notion of local sparsity? For example, could we provide guarantees for success for pursuit algorithms?
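
The slide's definition of local sparsity was lost in extraction. One way to make it concrete, assuming the stripe-based measure used in the CSC literature, is to count the nonzeros inside every stripe (all coefficients of atoms that overlap a given patch) and take the maximum. The sketch below assumes m filters of length n, a circular boundary, coefficients stored shift by shift, and N ≥ 2n−1; the name stripe_l0_inf is hypothetical.

```python
import numpy as np

def stripe_l0_inf(Gamma, m, n):
    """Maximal number of nonzeros over all stripes of the representation Gamma.

    Gamma stores m coefficients per spatial shift (index i*m + j).
    The stripe of patch i covers shifts i-(n-1), ..., i+(n-1).
    """
    N = Gamma.size // m
    nnz_per_shift = np.count_nonzero(Gamma.reshape(N, m), axis=1)
    offsets = np.arange(-(n - 1), n)      # the 2n-1 shifts touching a patch
    return max(int(nnz_per_shift[(i + offsets) % N].sum()) for i in range(N))
```

A representation can have many nonzeros globally and still be very sparse in every stripe, which is the regime the local theory aims to cover.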

Success of the Basis Pursuit
Papyan, Sulam & Elad (‘17)

Multi-Layered Convolutional Sparse Modeling
Yaniv Romano, Vardan Papyan, Jeremias Sulam
2. V. Papyan, Y. Romano, and M. Elad, Convolutional Neural Networks Analyzed via Convolutional Sparse Coding, Journal of Machine Learning Research, Vol. 18, Pages 1-52, July 2017.
3. V. Papyan, Y. Romano, J. Sulam, and M. Elad, Theoretical Foundations of Deep Learning via Sparse Representations, IEEE Signal Processing Magazine, Vol. 35, No. 4, Pages 72-89, June 2018.

From CSC to Multi-Layered CSC
o Convolutional sparsity (CSC) assumes an inherent structure is present in natural signals
o We propose to impose the same structure on the representations themselves
o Multi-Layer CSC (ML-CSC)
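
A toy two-layer version of this idea can be sketched as follows: the signal is a dictionary times a sparse representation, and that representation is itself a second dictionary times an even sparser vector. Dense matrices stand in for convolutional dictionaries, and all sizes, the sparse second-layer dictionary, and the names sparse_columns, D1, D2, gamma1 and gamma2 are assumptions made for illustration only.

```python
import numpy as np

def sparse_columns(rows, cols, nnz_per_col, rng):
    """Random dictionary whose columns each have only nnz_per_col nonzeros."""
    D = np.zeros((rows, cols))
    for j in range(cols):
        idx = rng.choice(rows, size=nnz_per_col, replace=False)
        D[idx, j] = rng.standard_normal(nnz_per_col)
    return D

rng = np.random.default_rng(0)
D1 = rng.standard_normal((64, 128))       # first-layer dictionary (dense)
D2 = sparse_columns(128, 256, 4, rng)     # second-layer dictionary (sparse atoms)

gamma2 = np.zeros(256)                    # deepest representation: 3 nonzeros
gamma2[rng.choice(256, size=3, replace=False)] = rng.standard_normal(3)

gamma1 = D2 @ gamma2                      # intermediate representation, still sparse
x = D1 @ gamma1                           # the generated signal
```

Because each atom of D2 has only 4 nonzeros, gamma1 has at most 12 nonzeros, so the same sparsity structure holds at both layers.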

Intuition: From Atoms to Molecules
[Figure labels: atoms, molecules, cells, tissue, body-parts, …]

Intuition: From Atoms to Molecules

A Small Taste: Model Training (MNIST)
MNIST Dictionary:
• D1: 32 filters of size 7×7, with stride of 2 (dense)
• D2: 128 filters of size 5×5×32, with stride of 1, 99.09% sparse
• D3: 1024 filters of size 7×7×128, 99.89% sparse

ML-CSC: Pursuit

A Small Taste: Pursuit
[Figure labels, in order: x; 94.51% sparse (213 nnz); 99.52% sparse (30 nnz); 99.51% sparse (5 nnz)]

ML-CSC: The Simplest Pursuit

Consider this for Solving the DCP
o Layered Thresholding (LT):
o Now let’s take a look at how a Convolutional Neural Network (CNN) operates:
The layered (soft nonnegative) thresholding and the CNN forward pass algorithm are the very same thing!
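
The equivalence stated on this slide can be illustrated numerically. In the sketch below, dense matrices stand in for convolutional dictionaries; the identification of the CNN weights with the transposed dictionaries and of the biases with the negated thresholds is the assumption being demonstrated, and all function names are hypothetical.

```python
import numpy as np

def soft_nonneg_threshold(z, beta):
    """Soft nonnegative thresholding: subtract beta, then clip at zero."""
    return np.maximum(z - beta, 0.0)

def relu(z):
    return np.maximum(z, 0.0)

def layered_thresholding(x, dictionaries, thresholds):
    """Estimate the representations layer by layer with one thresholding step each."""
    gamma = x
    for D, beta in zip(dictionaries, thresholds):
        gamma = soft_nonneg_threshold(D.T @ gamma, beta)
    return gamma

def cnn_forward(x, weights, biases):
    """Plain feed-forward pass: linear map, bias, ReLU."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h
```

Calling cnn_forward with weights equal to the transposed dictionaries and biases equal to minus the thresholds returns the same vector as layered_thresholding.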

Theoretical Path
[Figure labels: M, A]
Armed with this view of a generative source model, we may ask new and daring theoretical questions

Success of the Layered-THR
Papyan, Romano & Elad (‘17)
The stability of the forward pass is guaranteed if the underlying representations are locally sparse and the noise is locally bounded
Problems:
1. Contrast
2. Error growth
3. Error even if no noise

Layered Basis Pursuit (BP)
o We chose the Thresholding algorithm due to its simplicity, but we do know that there are better pursuit methods. How about using them?
o Let's use the Basis Pursuit instead …
Deconvolutional networks [Zeiler, Krishnan, Taylor & Fergus ‘10]

Success of the Layered BP
Papyan, Romano & Elad (‘17)
Problems:
1. Contrast
2. Error growth
3. Error even if no noise

Layered Iterative Thresholding
Note that our suggestion implies that groups of layers share the same dictionaries
Can be seen as a very deep residual neural network [He, Zhang, Ren, & Sun ‘15]
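
A minimal sketch of the layered scheme: each layer solves an L1-penalized problem by iterative soft thresholding, taking the previous layer's estimate as its input. If every iteration is viewed as one network layer, consecutive layers within a group reuse the same dictionary, which matches the remark above. Dense matrices stand in for convolutional dictionaries, and the names layered_ista, lams and n_iter are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Entry-wise soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(D, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - D g||_2^2 + lam * ||g||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    gamma = np.zeros(D.shape[1])
    for _ in range(n_iter):
        # Every pass through this loop reuses the same dictionary D.
        gamma = soft_threshold(gamma - step * D.T @ (D @ gamma - y), step * lam)
    return gamma

def layered_ista(x, dictionaries, lams, n_iter=200):
    """Run one pursuit per layer, feeding each estimate to the next layer."""
    estimate = x
    estimates = []
    for D, lam in zip(dictionaries, lams):
        estimate = ista(D, estimate, lam, n_iter)
        estimates.append(estimate)
    return estimates
```

With n_iter set to 1 and zero initialization, each layer reduces to a single thresholding step, which is the (signed) counterpart of the layered thresholding discussed earlier.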

Where are the Labels?
Answer 1:
o We do not need labels because everything we show refers to the unsupervised case, in which we operate on signals, not necessarily in the context of recognition
We presented the ML-CSC as a machine (M) that produces signals X

Where are the Labels?
We presented the ML-CSC as a machine (M) that produces signals X

What About Learning?
Sparseland (Sparse Representation Theory), CSC (Convolutional Sparse Coding), ML-CSC (Multi-Layered Convolutional Sparse Coding)
All these models rely on proper Dictionary Learning Algorithms to fulfil their mission:
§ Sparseland: We have unsupervised and supervised such algorithms, and a beginning of theory to explain how these work
§ CSC: We have few and only unsupervised methods, and even these are not fully stable/clear
§ ML-CSC: Two algorithms were proposed, unsupervised and supervised

Time to Conclude

This Talk
[Diagram stages: The desire to model data; Sparseland; Novel View of Convolutional Sparse Coding; Multi-Layer Convolutional Sparse Coding; A novel interpretation and theoretical understanding of CNN]
o We spoke about the importance of models in signal/image processing and described Sparseland in detail
o We presented a theoretical study of the CSC model and how to operate locally while getting global optimality
o We propose a multi-layer extension of CSC, shown to be tightly connected to CNN
o The ML-CSC was shown to enable a theoretical study of CNN, along with new insights
Take Home Message 1: Generative modeling of data sources enables algorithm development along with theoretically analyzing algorithms’ performance
Take Home Message 2: The Multi-Layer Convolutional Sparse Coding model could be a new platform for understanding and developing deep-learning solutions

Fresh from the Oven
My team’s work proceeds along the above-described line of thinking:
4. J. Sulam, V. Papyan, Y. Romano, and M. Elad, Multi-Layer Convolutional Sparse Modeling: Pursuit and Dictionary Learning, IEEE Trans. on Signal Proc., Vol. 66, No. 15, Pages 4090-4104, August 2018.
5. A. Aberdam, J. Sulam, and M. Elad, Multi Layer Sparse Coding: the Holistic Way, SIAM Journal on Mathematics of Data Science (SIMODS), Vol. 1, No. 1, Pages 46-77.
6. J. Sulam, A. Aberdam, A. Beck, and M. Elad, On Multi-Layer Basis Pursuit, Efficient Algorithms and Convolutional Neural Networks, to appear in IEEE T-PAMI.
7. Y. Romano, A. Aberdam, J. Sulam, and M. Elad, Adversarial Noise Attacks of Deep Learning Architectures – Stability Analysis via Sparse Modeled Signals, to appear in JMIV.
8. E. Zisselman, J. Sulam, and M. Elad, A Local Block Coordinate Descent Algorithm for the CSC Model, CVPR 2019.
9. I. Rey-Otero, J. Sulam, and M. Elad, Variations on the CSC model, submitted to IEEE Transactions on Signal Processing.
10. D. Simon and M. Elad, Rethinking the CSC model for Natural Images, NIPS 2019.
11. M. Scetbon, P. Milanfar, and M. Elad, Deep K-SVD Denoising, submitted to IEEE-TPAMI.

On a Personal Note …
Disclaimer: I am biased, so take my words with a grain of salt …
Conjecture (Elad ‘19): Sparse modeling of data is at the heart of Deep-Learning architectures, and as such it is one of the main avenues for developing theoretical foundations for this field.
My research activity (past, present & future) is dedicated to establishing this connection and addressing various aspects of it (applicative & theoretical)

A New Massive Open Online Course

Questions?
More on these (including these slides and the relevant papers) can be found at http://www.cs.technion.ac.il/~elad