Machine Learning: Neural Networks, Support Vector Machines
Georg Dorffner, Section for Artificial Intelligence and Decision Support, CeMSIIS – Medical University of Vienna

Machine Learning – possible definitions
• Computer programs that improve with experience (Mitchell 1997) (artificial intelligence)
• Finding non-trivial structure in data based on examples (pattern recognition, data mining)
• Estimating a model from data that describes the data (statistical data analysis)

Some prerequisites
• Features – describe the cases of a problem; measurements, data
• Learner (version space) – a class of models
• Learning rule – an algorithm that finds the best model
• Generalisation – the model is supposed to describe new data well

Example learner: Perceptron
• Features: 2 numerical values (drawn as points in a plane)
• Task: separate them into 2 classes (white and black)
• Learner (version space): straight line through the origin
• Learning rule:
  – Take the normal vector of the line
  – Add the point vector of a falsely classified example
  – Turn the line such that the new vector becomes its normal vector
  – Repeat until everything is correctly classified
• Generalisation: new points are correctly classified
• Convergence is guaranteed if the problem is solvable (Rosenblatt 1962)
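As a concrete illustration of this learning rule, here is a minimal Python sketch of a perceptron through the origin; the function name, the numpy representation and the ±1 labels are illustrative assumptions, not part of the slides.

```python
import numpy as np

def train_perceptron(X, t, max_epochs=100):
    """Rosenblatt-style perceptron through the origin.
    X: (n, 2) array of points, t: labels in {-1, +1}; returns the weight (normal) vector."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, t_i in zip(X, t):
            if np.sign(w @ x_i) != t_i:   # falsely classified example
                w = w + t_i * x_i         # add its (signed) point vector to the normal vector
                errors += 1
        if errors == 0:                   # everything correctly classified
            break
    return w
```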

Types of learning
• Supervised learning
  – Classes of all training samples are known ("labeled data")
  – Find the relationship with the input
  – Examples: medical diagnosis, forecasting
• Unsupervised learning
  – Classes are not known ("unlabeled data")
  – Find inherent structure in the data
  – Examples: segmentation, visualisation
• Reinforcement learning
  – Find relationships based on global feedback
  – Examples: robot arm control, learning to play games

Neural networks: the simple mathematical model
(Figure: a unit (neuron) with inputs weighted by w1, w2, …, wi, net input xj, transfer function f, and activation/output yj)
• Propagation rule (net input):
  – Weighted sum
  – Euclidean distance
• Transfer function f:
  – Threshold function (McCulloch & Pitts)
  – Linear function
  – Sigmoid function
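The slide's formulas are not reproduced in this extraction; a minimal statement of the weighted-sum propagation rule with a sigmoid transfer function, using the figure's symbols, is:

\[
x_j = \sum_i w_{ij}\, y_i \qquad\text{(net input as weighted sum)},
\]
\[
y_j = f(x_j), \qquad f(x) = \frac{1}{1 + e^{-x}} \quad\text{(sigmoid transfer function)}.
\]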

Perceptron as neural network (figure source: neuron.eng.wayne.edu)
• Inputs are random "feature" detectors
• Binary codes
• The perceptron learns the classification
• Learning rule = weight adaptation
• A model of perception / object recognition
• But it can solve only linearly separable problems

Multilayer perceptron (MLP)
• 2 (or more) layers (= sets of connections):
  – Output units (typically linear)
  – Hidden units (typically sigmoid)
  – Input units

Learning rule (weight adaptation): Backpropagation
• Generalised delta rule (hidden-layer quantities yhid, xhid, Whid; output-layer quantities yout, xout, Wout)
• The error is propagated back
• This yields a "pseudo-error" for the hidden units
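The formulas of the generalised delta rule are lost in the extraction; a standard reconstruction, with learning rate η and targets t_k (the quadratic error of the following slide is assumed), is:

\[
\delta_k^{\mathrm{out}} = \bigl(y_k^{\mathrm{out}} - t_k\bigr)\, f'\!\bigl(x_k^{\mathrm{out}}\bigr), \qquad
\delta_j^{\mathrm{hid}} = f'\!\bigl(x_j^{\mathrm{hid}}\bigr) \sum_k w_{kj}^{\mathrm{out}}\, \delta_k^{\mathrm{out}},
\]
\[
\Delta w_{kj}^{\mathrm{out}} = -\eta\, \delta_k^{\mathrm{out}}\, y_j^{\mathrm{hid}}, \qquad
\Delta w_{ji}^{\mathrm{hid}} = -\eta\, \delta_j^{\mathrm{hid}}\, y_i^{\mathrm{in}}.
\]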

Backpropagation as gradient descent
• Define the (quadratic) error for pattern l
• Minimize this error
• Change the weights in the direction opposite to the gradient (the partial derivative with respect to each weight)
• The chain rule leads to backpropagation
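A minimal numpy sketch of one such gradient-descent step for a 1-hidden-layer MLP with sigmoid hidden units, linear outputs and the quadratic error E = ½ Σ (y − t)²; the array shapes and the learning rate eta are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(X, T, W_hid, W_out, eta=0.1):
    """One batch gradient-descent step on the quadratic error.
    X: (n, d) inputs, T: (n, k) targets; returns the updated weight matrices."""
    # forward pass
    x_hid = X @ W_hid            # net input of the hidden units
    y_hid = sigmoid(x_hid)       # sigmoid hidden activations
    y_out = y_hid @ W_out        # linear output units

    # backward pass (generalised delta rule via the chain rule)
    delta_out = y_out - T                                      # dE/dx_out for linear outputs
    delta_hid = (delta_out @ W_out.T) * y_hid * (1.0 - y_hid)  # "pseudo-error" of the hidden units

    # change weights opposite to the gradient
    W_out = W_out - eta * (y_hid.T @ delta_out)
    W_hid = W_hid - eta * (X.T @ delta_hid)
    return W_hid, W_out
```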

Limits of backpropagation
• Gradient descent can get stuck in a local minimum (depends on the initial values)
• It is not guaranteed that backpropagation will find an existing solution
• Further problems: slow, can oscillate
• Remedies: conjugate gradient, quasi-Newton methods

The power of NN: arbitrary classifications
• Each hidden unit separates the space into 2 halves (perceptron)
• The output units work like an "AND"
• Sigmoids: smooth transitions

Example
• MLP with 5 hidden and 2 output units
• Linear transfer function at the output
• Quadratic error

MLP to produce probabilities
• An MLP can approximate the Bayes posterior
• Activation function: softmax
• Prior probabilities: class distribution in the training set
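The softmax activation itself is not reproduced in the extraction; for K output units with net inputs x_k it is:

\[
y_k = \frac{e^{x_k}}{\sum_{k'=1}^{K} e^{x_{k'}}}, \qquad \sum_{k} y_k = 1,
\]

so the outputs can be read as posterior class probabilities.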

Regression
• To model the data generator: estimate the joint distribution
• Likelihood: a distribution with expected value f(xi)

Gaussian noise
• Likelihood: Gaussian noise around the model output f(xi) (see below)
• Maximizing L = minimizing −log L (constant terms, including p(x), can be dropped)
• This corresponds to the quadratic error (see backpropagation)
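The likelihood formulas are missing from the extraction; a standard reconstruction with Gaussian noise of variance σ² around the network output f(x) is:

\[
L = \prod_i p(t_i \mid x_i)\, p(x_i), \qquad
p(t_i \mid x_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\frac{\bigl(t_i - f(x_i)\bigr)^2}{2\sigma^2}\right),
\]
\[
-\log L = \frac{1}{2\sigma^2} \sum_i \bigl(t_i - f(x_i)\bigr)^2 + \text{const},
\]

which is the quadratic error up to constants.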

Gradient of the error function
• Optimization is based on gradient information: a contribution from the error function and a contribution from the network
• Backpropagation (following Bishop 1995): efficient computation of the gradient (the network's contribution): O(W) instead of O(W²), see p. 146 f.
• It is independent of the chosen error function

Gradient descent
• Simplest method: change the weights in direct proportion to the gradient (the classical "backpropagation" of the NN literature)
• End point after 100 steps: [-1.11, 1.25], approx. 2900 flops
• Slow; oscillations and even divergence are possible

Line search
• Goal: step all the way to the minimum along the chosen direction
• Approximation by a parabola (3 points)
• Possibly repeat 2-3 times
• End point after 100 steps: [0.78, 0.61], approx. 47000 flops
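A minimal Python sketch of the parabolic approximation, assuming an error function err(), a current weight vector w and a search direction d (all names are illustrative, not from the slides):

```python
import numpy as np

def parabolic_line_search(err, w, d, alphas=(0.0, 0.5, 1.0)):
    """Fit a parabola to err(w + a*d) at three step sizes and jump to its minimum."""
    values = [err(w + a * d) for a in alphas]
    c2, c1, _ = np.polyfit(alphas, values, 2)   # err(w + a d) ~ c2*a^2 + c1*a + c0
    a_min = -c1 / (2.0 * c2)                    # vertex of the parabola
    return w + a_min * d
```

In practice this is repeated 2-3 times, as the slide notes.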

Conjugate gradients
• Problem with line search: the new gradient is normal (orthogonal) to the old one
• Take a search direction that preserves the minimization along the previous direction (directions dt, dt+1 at weights wt, wt+1)
• A much more targeted procedure; end point after 18 steps: [0.99, 0.99], approx. 11200 flops
• Variant: scaled conjugate gradient
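The direction-update formula is not reproduced; a standard conjugate-gradient update (Fletcher–Reeves form, with g_t the gradient at step t) reads:

\[
d_{t+1} = -g_{t+1} + \beta_t\, d_t, \qquad
\beta_t = \frac{\lVert g_{t+1} \rVert^2}{\lVert g_t \rVert^2},
\]

followed by a line search along the new direction d_{t+1}.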

MLP as universal function approximator
• Example: 1 input, 1 output, 5 hidden units
• An MLP can approximate arbitrary functions (Hornik et al. 1990)
• Through superposition of sigmoids, which are stretched, mirrored and moved (bias)
• Complexity arises by combining simple elements
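Written out for the 1-input example above (σ the sigmoid, linear output), the superposition is:

\[
f(x) = \sum_{j=1}^{5} v_j\, \sigma\!\bigl(w_j x + b_j\bigr) + b_0,
\]

where each weight w_j stretches or mirrors its sigmoid and the bias b_j moves it.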

Overfitting
• If there is too little training data, the NN tries to model the noise (figure: 50 samples, 15 hidden units)
• Overfitting: worse performance on new data (the quadratic error gets bigger)

Avoiding overfitting
• As much data as possible (good coverage of the distribution)
• Model (network) as small as possible
• More generally: regularisation (= limiting the effective number of degrees of freedom):
  – Several training runs, averaged
  – A penalty for large networks, e.g. a weight penalty (sketched below)
  – "Pruning" (removing connections)
  – Early stopping
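The penalty term on the slide is not reproduced; the usual choice is weight decay, which adds the squared weights to the error function:

\[
\tilde{E} = E + \lambda \sum_i w_i^2,
\]

where λ controls the strength of the regularisation.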

The important steps in practice
Owing to their power and characteristics, neural networks require a sound and careful strategy:
1. Data inspection (visualisation)
2. Data preprocessing
3. Feature selection
4. Model selection (pick the best network size)
5. Comparison with simpler methods
6. Testing on independent data
7. Interpretation of results

Model selection
• Strategy for the optimal choice of model complexity:
  – Start small (e.g. 1 or 2 hidden units)
  – n-fold cross-validation
  – Add hidden units one by one
  – Accept as long as there is a significant improvement (statistical test)
• No regularization necessary: overfitting is captured by cross-validation (averaging)
• Too many hidden units → too large a variance → no statistical significance
• The same method can also be used for feature selection ("wrapper")
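A minimal sketch of this grow-and-validate strategy using scikit-learn (the library, the 5-fold split and the use of the mean CV score instead of a formal significance test are assumptions for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def select_hidden_units(X, t, max_hidden=10, cv=5):
    """Grow the network one hidden unit at a time and keep the size
    with the best cross-validated score."""
    best_size, best_score = 1, -np.inf
    for h in range(1, max_hidden + 1):
        model = MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000)
        score = cross_val_score(model, X, t, cv=cv).mean()
        if score > best_score:
            best_size, best_score = h, score
    return best_size, best_score
```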

Support Vector Machines: returning to the perceptron
• Advantages of the (linear) perceptron:
  – A global solution is guaranteed (no local minima)
  – Easy to solve / optimize
• Disadvantage:
  – Restricted to linearly separable problems
• Idea:
  – Transform the data into a high-dimensional space, such that the problem becomes linearly separable

Mathematical formulation of the perceptron learning rule
• Perceptron (1 output)
• Targets ti = +1/−1; the decision is based on the inner product (dot product)
• The data is described in terms of inner products ("dual form")
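The equations themselves are lost in the extraction; a standard dual form of the perceptron (through the origin, with α_i counting how often example i triggered an update) is:

\[
y(x) = \operatorname{sign}\!\left(\sum_i \alpha_i\, t_i\, \langle x_i, x \rangle\right), \qquad t_i \in \{+1, -1\},
\]

so both learning and classification touch the data only through inner products ⟨x_i, x⟩.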

Kernels
• The goal is a transformation xi → Φ(xi) such that the problem becomes linearly separable (the target space can be high-dimensional)
• Kernel: a function that can be written as an inner product of Φs, K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩
• Φ does not have to be explicitly known

Example: polynomial kernel
• 2 dimensions (worked out below)
• The kernel is indeed an inner product of vectors after the transformation ("preprocessing")
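The worked example is missing from the extraction; the usual 2-dimensional illustration with a quadratic polynomial kernel (the slide may use a different degree) is:

\[
K(x, z) = \langle x, z \rangle^2 = (x_1 z_1 + x_2 z_2)^2
= \bigl\langle \bigl(x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2\bigr),\ \bigl(z_1^2,\ \sqrt{2}\, z_1 z_2,\ z_2^2\bigr) \bigr\rangle
= \langle \Phi(x), \Phi(z) \rangle.
\]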

The effect of the "kernel trick"
• Use of the kernel, e.g. a polynomial kernel ⟨x, z⟩ raised to the 5th power
• 16 × 16-dimensional vectors (e.g. pixel images), 5th-degree polynomial: feature-space dimension ≈ 10^10
  – Naively, an inner product of two vectors of that dimension
• The calculation is instead done in the low-dimensional space:
  – Inner product of two 256-dimensional vectors
  – Raised to the power of 5

Large margin classifier
• High-dimensional space: overfitting is easily possible
• Solution: search for the decision border (hyperplane) with the largest distance to the closest points
• Optimization: minimize ‖w‖ so that this distance (the margin) becomes maximal, subject to a boundary condition on every training point (sketched below)
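The optimization problem itself is not reproduced; the standard large-margin formulation with labels t_i ∈ {+1, −1} is:

\[
\min_{w, b}\ \tfrac{1}{2}\lVert w \rVert^2 \qquad \text{subject to} \qquad t_i\bigl(\langle w, x_i \rangle + b\bigr) \ge 1 \quad \forall i,
\]

where the margin equals 2/‖w‖, so minimizing ‖w‖ maximizes the margin.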

Optimization of the large margin classifier
• Quadratic optimization problem; a Lagrange-multiplier approach leads to the "dual" form (sketched below)
• Important: the data again appears only in terms of inner products
• The kernel trick can therefore be used again
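The dual problem is missing from the extraction; its standard (hard-margin) form is:

\[
\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, t_i t_j\, \langle x_i, x_j \rangle
\qquad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_i \alpha_i t_i = 0,
\]

and replacing ⟨x_i, x_j⟩ by a kernel K(x_i, x_j) gives the nonlinear SVM.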

Support vectors
• Support vectors: the points at the margin (closest to the decision border)
• They determine the solution; all other points could be omitted
(Figure labels: kernel function, back projection, support vectors)
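A minimal scikit-learn sketch (the library and the toy data are not from the slides) that fits an SVM with a polynomial kernel and reads off the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# toy data: two Gaussian blobs in 2-D, labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(+1, 0.5, (20, 2))])
t = np.array([-1] * 20 + [+1] * 20)

# SVM with a 5th-degree polynomial kernel (the kernel choice is illustrative)
clf = SVC(kernel="poly", degree=5, C=1.0).fit(X, t)

print(clf.support_vectors_)        # the points at the margin that determine the solution
print(clf.predict([[0.0, 0.0]]))   # classify a new point
```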

Summary
• Neural networks are powerful machine learners for numerical features, initially inspired by neurophysiology
• Nonlinearity arises through the interplay of simpler learners (perceptrons)
• A statistical/probabilistic framework is most appropriate
• Learning = maximum likelihood: minimizing an error function with an efficient gradient-based method (e.g. conjugate gradient)
• The power comes with downsides (overfitting) → careful validation is necessary
• Support vector machines are an interesting alternative; they simplify the learning problem through the "kernel trick"