Learning and Memory
[Adapted from Neural Basis of Thought and Language, Jerome Feldman, Spring 2007, feldman@icsi.berkeley.edu]
[Figure: cortical map labeling the motor cortex, somatosensory cortex, sensory associative cortex, pars opercularis, visual associative cortex, Broca’s area, visual cortex, primary auditory cortex, and Wernicke’s area]

Learning and Memory
• Declarative
  - Episodic: memory of a situation
  - Semantic: general facts
• Non-Declarative
  - Procedural: skills

Learning and Memory
There are two different types of learning:
• Skill Learning
• Fact and Situation Learning
  - General Fact Learning
  - Episodic Learning
The processes underlying skill (procedural) learning are partially different from those underlying fact/situation (declarative) learning.

Skill and Fact Learning involve different mechanisms
• Certain brain injuries involving the hippocampal region of the brain render their victims incapable of learning any new facts or new situations.
  - But these people can still learn new skills, including relatively abstract skills like solving puzzles.
• Fact learning can be single-instance based; skill learning requires repeated exposure to stimuli.

Short term memory
• How do we remember someone’s telephone number just after they tell us, or the words in this sentence?
• Short term memory is known to have a different biological basis than long term memory of either facts or skills.
• We now know that this kind of short term memory depends upon ongoing electrical activity in the brain.
• You can keep something in mind by rehearsing it, but this will interfere with your thinking about anything else.

Long term memory
• But we do recall memories from decades past. These long term memories are known to be based on structural changes in the synaptic connections between neurons.
  - Such permanent changes require the construction of new protein molecules and their establishment in the membranes of the synapses connecting neurons, and this can take several hours.
• Thus there is a huge time gap between short term memory, which lasts only a few seconds, and the building of long term memory, which takes hours to accomplish.
• In addition to bridging the time gap, the brain needs mechanisms for converting the content of a memory from electrical to structural form.

Situational Memory
• Think about an old situation that you still remember well. Your memory will include multiple modalities: vision, emotion, sound, smell, etc.
• The standard theory is that memories in each particular modality activate much of the brain circuitry from the original experience.
• There is general agreement that the hippocampal area contains circuitry that can bind together the various aspects of an important experience into a coherent memory. This process is believed to involve calcium-based long term potentiation (LTP).

Dreaming and Memory
• There is general agreement, and considerable evidence, that dreaming involves simulating experiences and is important in consolidating memory.

Models of Learning
• Hebbian ~ coincidence
• Recruitment ~ one trial
• Supervised ~ correction (backprop)
• Reinforcement ~ delayed reward
• Unsupervised ~ similarity

Hebb’s Rule
• The key idea underlying theories of neural learning goes back to the Canadian psychologist Donald Hebb and is called Hebb’s rule.
• From an information processing perspective, the goal of the system is to increase the strength of the neural connections that are effective.

Hebb (1949)
“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”
From: The Organization of Behavior.

Hebb’s rule
• Each time that a particular synaptic connection is active, see if the receiving cell also becomes active.
• If so, the connection contributed to the success (firing) of the receiving cell and should be strengthened.
• If the receiving cell was not active in this time period, our synapse did not contribute to that success and should be weakened.
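A minimal sketch of this rule in code (Python/NumPy; the learning rate and the symmetric weakening step are illustrative assumptions, not details from the slides):

```python
import numpy as np

def hebbian_update(w, pre, post, eta=0.01):
    """One application of Hebb's rule to the synapses of one receiving cell.

    w    : weight vector, one entry per synapse
    pre  : 0/1 vector, which synapses were active in this time period
    post : 0 or 1, whether the receiving cell fired
    """
    if post:
        w = w + eta * pre    # active synapse, cell fired: strengthen
    else:
        w = w - eta * pre    # active synapse, cell silent: weaken
    return w

w = np.array([0.5, 0.5, 0.5])
w = hebbian_update(w, pre=np.array([1, 0, 1]), post=1)
print(w)   # [0.51, 0.5, 0.51]
```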

LTP and Hebb’s Rule
• Hebb’s Rule: neurons that fire together wire together.
  [Diagram: synchronized firing strengthens a synapse; unsynchronized firing weakens it]
• Long Term Potentiation (LTP) is the biological basis of Hebb’s Rule.
• Calcium channels are the key mechanism.

Chemical realization of Hebb’s rule
• It turns out that there are elegant chemical processes that realize Hebbian learning at two distinct time scales:
  - Early Long Term Potentiation (LTP)
  - Late LTP
• These provide the temporal and structural bridge from short term electrical activity, through intermediate memory, to long term structural changes.

Calcium Channels Facilitate Learning
• In addition to the synaptic channels responsible for neural signaling, there are also calcium-based channels that facilitate learning.
  - As Hebb suggested, when a receiving neuron fires, chemical changes take place at each synapse that was active shortly before the event.

Long Term Potentiation (LTP)
• These changes make each of the winning synapses more potent for an intermediate period, lasting from hours to days (LTP).
• In addition, repetition of a pattern of successful firing triggers additional chemical changes that lead, in time, to an increase in the number of receptor channels associated with successful synapses: the requisite structural change for long term memory.
  - There are also related processes for weakening synapses, and for strengthening pairs of synapses that are active at about the same time.

The Hebb rule is found with long term potentiation (LTP) in the hippocampus
[Figure: Schaffer collateral pathway onto pyramidal cells; 1 sec stimuli at 100 Hz]

During normal low-frequency transmission, glutamate interacts with NMDA, non-NMDA (AMPA), and metabotropic receptors. With high-frequency stimulation, calcium comes in.

Enhanced Transmitter Release
[Figure: enhanced transmitter release at the synapse; AMPA receptors]

Early and late LTP
(Kandel ER, Schwartz JH and Jessell TM (2000) Principles of Neural Science. New York: McGraw-Hill.)
A. Experimental setup for demonstrating LTP in the hippocampus: the Schaffer collateral pathway is stimulated to cause a response in pyramidal cells of CA1.
B. Comparison of EPSP size in early and late LTP, with the early phase evoked by a single train and the late phase by 4 trains of pulses.

Computational Models based on Hebb’s rule
• The activity-dependent tuning of the developing nervous system, as well as post-natal learning and development, are well modeled by Hebb’s rule.
• Explicit memory in mammals appears to involve LTP in the hippocampus.
• Many computational systems for modeling incorporate versions of Hebb’s rule:
  - Winner-Take-All: units compete to learn, i.e. to update their weights; the processing element with the largest output is declared the winner, and lateral inhibition suppresses its competitors.
  - Recruitment Learning: learning triangle nodes; LTP in episodic memory formation.
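The ‘at’/‘to’ walkthrough on the next slides can be captured in a few lines. Here is a minimal winner-take-all sketch (Python/NumPy; the near-uniform starting weights, the learning rate, and the move-toward-input form of the Hebbian update are illustrative assumptions):

```python
import numpy as np

# two category nodes x three letter nodes (a, t, o); near-uniform start
W = np.array([[0.34, 0.33, 0.33],
              [0.33, 0.33, 0.34]])
letters = {"a": 0, "t": 1, "o": 2}

def present(word, eta=0.5):
    """Activate the letter units, let the category units compete,
    and apply Hebb's rule to the winner only."""
    x = np.zeros(3)
    for ch in word:
        x[letters[ch]] = 1.0
    winner = int(np.argmax(W @ x))    # competition resolves: largest output wins
    W[winner] = (1 - eta) * W[winner] + eta * x / x.sum()  # move winner toward input
    return winner

print(present("at"))   # one category node wins and is tuned toward 'at'
print(present("to"))   # learning 'at' lowered that node's response to 'to',
                       # so the other node wins and comes to represent 'to'
```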

WTA: Stimulus ‘at’ is presented
[Diagram for this and the following slides: category nodes 1 and 2 above letter nodes a, t, o]

Competition starts at the category level

Competition resolves

Hebbian learning takes place
Category node 2 now represents ‘at’

Presenting ‘to’ leads to activation of category node 1

Category 1 is established through Hebbian learning as well
Category node 1 now represents ‘to’

Hebb’s rule is not sufficient
• What happens if the neural circuit fires perfectly, but the result is very bad for the animal, like eating something sickening?
  - A pure invocation of Hebb’s rule would strengthen all participating connections, which can’t be good.
  - On the other hand, it isn’t right to weaken all the active connections involved; much of the activity was just recognizing the situation. We would like to change only those connections that led to the wrong decision.
• No one knows how to specify a learning rule that will change exactly the offending connections when an error occurs.
  - Computer systems, and presumably nature as well, rely upon statistical learning rules that tend to make the right changes over time. More in later lectures.

Hebb’s rule is insufficient
[Diagram: a tastebud node (“tastes rotten”) and the actions “eats food” and “drinks water” all connected to “gets sick”]
• Should you “punish” all the connections?

Models of Learning
• Hebbian ~ coincidence
• Recruitment ~ one trial
• Supervised ~ correction (backprop)
• Reinforcement ~ delayed reward
• Unsupervised ~ similarity

Recruiting connections
• Given that LTP involves synaptic strength changes, and Hebb’s rule involves coincident-activation-based strengthening of connections:
  - How can connections between two nodes be recruited using Hebb’s rule?

Recruitment Learning
[Diagram: node X linked to node Y through K intermediate layers of N units each, with fan-out B per unit and F = B/N]
• Suppose we want to link up node X to node Y.
• The idea is to pick nodes in the middle to link them up.
• Can we be sure that we can find a path to get from X to Y? The point is, with a fan-out of 1000, if we allow 2 intermediate layers, we can almost always find a path.

[Diagram frames: candidate paths from X to Y through the intermediate layers]

Finding a Connection
P = (1 − F)^(B^K) = probability of NO link between X and Y
• N = number of units in a “layer”
• B = number of randomly outgoing units per unit
• F = B/N, the branching factor
• K = number of intermediate layers (2 in the example)

Probability of no link, for B = 1000:
K \ N    10^6      10^7      10^8
0        .999      .9999     .99999
1        .367      .905      .989
2        10^-440   10^-44    10^-5

# Paths = (1 − P_{K−1}) · N · F = (1 − P_{K−1}) · B
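These entries follow directly from the formula; a quick check in code (Python, with B = 1000 as assumed above, working in log10 because the K = 2 values underflow ordinary floats):

```python
import math

B = 1000   # fan-out per unit, as assumed on the earlier slide
for K in (0, 1, 2):
    for N in (10**6, 10**7, 10**8):
        F = B / N
        # P = (1 - F) ** (B ** K); compute log10(P) to avoid underflow
        log10_P = (B ** K) * math.log10(1 - F)
        print(f"K={K}, N=1e{round(math.log10(N))}: P = 10^{log10_P:.1f}")
# K = 0 gives P just below 1; K = 1 gives ~.367, .905, .989;
# K = 2 gives ~1e-434, 1e-43, 1e-4, matching the table up to the slide's rounding.
```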

Finding a Connection in Random Networks
For networks with N nodes and branching factor B, there is a high probability of finding good links.

Recruiting a Connection in Random Networks
1. Activate the two nodes to be linked.
2. Have nodes with double activation strengthen their active synapses (Hebb).
3. There is evidence for a “now print” signal based on LTP (episodic memory).
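A toy version of steps 1 and 2 (Python/NumPy; the random-graph sizes, initial weights, and boost value are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N, B = 200, 20                      # toy network: 200 units, fan-out 20
adj = np.zeros((N, N), dtype=bool)  # adj[i, j]: synapse from unit i to unit j
for i in range(N):
    adj[i, rng.choice(N, B, replace=False)] = True
w = adj * 0.1                       # weak initial weights on existing synapses

def recruit(x, y, boost=1.0):
    """Recruit a two-step path x -> m -> y: activate x and y, then let
    'doubly activated' middle nodes strengthen their active synapses."""
    middles = np.where(adj[x] & adj[:, y])[0]  # hit by x AND projecting to y
    w[x, middles] += boost                     # strengthen both hops (Hebb)
    w[middles, y] += boost
    return middles

print(recruit(0, 1))  # indices of recruited intermediate nodes (may be empty)
```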

Recruiting triangle nodes
• Let’s say we are trying to remember a green circle.
• Currently there are weak connections between concepts (dotted lines).
[Diagram: a has-color node weakly linked to blue and green; a has-shape node weakly linked to round and oval]

Strengthen these connections
• …and you end up with this picture.
[Diagram: a “Green circle” node with strengthened links through has-color to green and through has-shape to round]

[Diagram frames: triangle nodes binding Has-color → Green and Has-shape → Round; activating the triangle node makes GREEN and ROUND active]

Models of Learning
• Hebbian ~ coincidence
• Recruitment ~ one trial
• Supervised ~ correction (backprop)
• Reinforcement ~ delayed reward
• Unsupervised ~ similarity

Back Propagation

Supervised Learning - Backprop
• How do we train the weights of the network?
• Basic concepts:
  - Use a continuous, differentiable activation function (sigmoid)
  - Use the idea of gradient descent on the error surface
  - Extend to multiple layers

Backpropagation Algorithm
[Diagram: “activations” flow forward through the layers; “errors” propagate backward]

Backprop
• To learn on data which is not linearly separable:
  - Build multiple-layer networks (hidden layer)
  - Use a sigmoid squashing function instead of a step function

Tasks
• Unconstrained pattern classification
• Credit assessment
• Digit classification
• Function approximation
• Learning control
• Stock prediction

Sigmoid Squashing Function
[Diagram: a unit with inputs y0 = 1 (bias), y1, …, yn, weights w0, …, wn, and a sigmoid output]
The unit computes x = ∑i wi yi and outputs y = 1/(1 + e^−x).

Gradient Descent on an error surface
[Figure: error surface over the weight space, with descent toward a minimum]

Learning Rule – Gradient Descent on Root Mean Square (RMS) error
• Learn wi’s that minimize the squared error
  E = ½ ∑i∈O (ti − yi)², where O = output layer

Gradient Descent
Gradient: ∇E = [∂E/∂w0, …, ∂E/∂wn]
Training rule: Δw = −η ∇E, i.e. Δwi = −η ∂E/∂wi

Backpropagation Algorithm
• Generalization to multiple layers and multiple output units

An informal account of BackProp
For each pattern in the training set:
1. Compute the error at the output nodes
2. Compute Δw for each weight in the 2nd layer
3. Compute delta (the generalized error expression) for the hidden units
4. Compute Δw for each weight in the 1st layer
After amassing Δw for all weights, change each weight a little bit, as determined by the learning rate.
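This account can be made concrete in a few dozen lines (Python/NumPy; a 2-3-1 network on XOR, with the seed, learning rate, and epoch count as illustrative assumptions; as a later slide notes, such training can occasionally stall in a local minimum):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # input patterns
T = np.array([[0], [1], [1], [0]], dtype=float)              # targets (XOR)

W1 = rng.normal(0, 1, (2, 3)); b1 = np.zeros(3)   # input -> hidden weights
W2 = rng.normal(0, 1, (3, 1)); b2 = np.zeros(1)   # hidden -> output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

eta = 0.5
for epoch in range(20000):
    h = sigmoid(X @ W1 + b1)                       # forward: activations flow forward
    y = sigmoid(h @ W2 + b2)
    delta_out = (T - y) * y * (1 - y)              # error at the output nodes
    delta_hid = (delta_out @ W2.T) * h * (1 - h)   # generalized error, hidden units
    W2 += eta * h.T @ delta_out; b2 += eta * delta_out.sum(axis=0)
    W1 += eta * X.T @ delta_hid; b1 += eta * delta_hid.sum(axis=0)

print(np.round(y, 2))   # should end up close to [[0], [1], [1], [0]]
```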

The output layer
[Diagram: output unit i (activation yi, target ti) fed by hidden unit j through wij; layer j is fed through wjk]
E = ½ ∑i (ti − yi)²
Δwij = η δi yj, with δi = (ti − yi) yi (1 − yi)
where η is the learning rate and yi (1 − yi) is just the derivative of the sigmoid.

The hidden layer
[Diagram: the same network; the error for hidden unit j collects the deltas of the output units i it feeds]
E = ½ ∑i (ti − yi)²
δj = yj (1 − yj) ∑i wij δi, and Δwjk = η δj yk

Let’s just do an example
• Inputs i1 = 0, i2 = 0, bias b = 1; weights w01 = 0.8, w02 = 0.6, w0b = 0.5; learning rate η = 0.5.
• Net input: x0 = 0 · 0.8 + 0 · 0.6 + 1 · 0.5 = 0.5
• Output: y0 = 1/(1 + e^−0.5) = 0.6224
• E = ½ ∑i (ti − yi)² = ½ (t0 − y0)²; with target t0 = 0 for the (i1, i2) = (0, 0) row of the training table, E = ½ (0 − 0.6224)² = 0.1937
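A quick check of the arithmetic (Python; the variable names are hypothetical):

```python
import math

w01, w02, w0b = 0.8, 0.6, 0.5        # weights, including the bias weight
i1, i2, b = 0.0, 0.0, 1.0            # the (0, 0) input row, bias input 1
x0 = w01 * i1 + w02 * i2 + w0b * b   # net input = 0.5
y0 = 1 / (1 + math.exp(-x0))         # sigmoid output ~ 0.6224
E = 0.5 * (0 - y0) ** 2              # error against target t0 = 0 ~ 0.1937
print(x0, round(y0, 4), round(E, 4))
```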

Momentum term
• The speed of learning is governed by the learning rate.
  - If the rate is low, convergence is slow.
  - If the rate is too high, error oscillates without reaching the minimum.
• Momentum tends to smooth out small weight-error fluctuations:
  - momentum accelerates the descent in steady downhill directions;
  - momentum has a stabilizing effect in directions that oscillate in time.
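In update form (a common formulation with momentum coefficient α; the slides do not spell out the equation, so this is a sketch):

```python
def momentum_step(w, grad, dw_prev, eta=0.1, alpha=0.9):
    """Gradient-descent step with momentum:
    dw = -eta * grad + alpha * dw_prev, then w += dw."""
    dw = -eta * grad + alpha * dw_prev   # alpha blends in the previous step
    return w + dw, dw                    # new weights, and dw for the next call

# usage: w, dw = momentum_step(w, grad, dw)   (with dw starting at 0)
```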

Convergence
• May get stuck in local minima
• Weights may diverge
• …but works well in practice
• Representation power:
  - 2-layer networks: any continuous function
  - 3-layer networks: any function

Local Minimum
• Use a random component
• Simulated annealing

Overfitting and generalization
• Too many hidden nodes tend to overfit

Overfitting in ANNs
• Early stopping: stop training when error goes up on a validation set
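A skeletal version of that rule (Python; train_epoch and validation_error are hypothetical stand-ins for a training pass and a held-out-set evaluation, and the patience window is an assumption):

```python
def train_with_early_stopping(train_epoch, validation_error, patience=5):
    """Train until validation error stops improving for `patience` epochs."""
    best_err, bad_epochs = float("inf"), 0
    while bad_epochs < patience:
        train_epoch()                # one backprop pass over the training set
        err = validation_error()     # error on held-out data, never used for updates
        if err < best_err:
            best_err, bad_epochs = err, 0
        else:
            bad_epochs += 1          # validation error went up (or stalled)
    return best_err
```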

Stopping criteria
• Sensible stopping criteria:
  - Total mean squared error change: backprop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
  - Generalization-based criterion: after each epoch the NN is tested for generalization. If the generalization performance is adequate, then stop. If this stopping criterion is used, the part of the training set used for testing the network’s generalization will not be used for updating the weights.

Architectural Considerations
What is the right size network for a given job? How many hidden units?
• Too many: no generalization
• Too few: no solution
Possible answer: constructive algorithms, e.g. Cascade Correlation (Fahlman & Lebiere 1990), etc.

Network Topology
• The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error.
• Two types of adaptive algorithms can be used:
  - start from a large network and successively remove some nodes and links until network performance degrades;
  - begin with a small network and introduce new neurons until performance is satisfactory.

Supervised vs Unsupervised Learning
• Backprop is supervised
  - requires a ‘target’
  - how realistic is that?
• Hebbian learning is unsupervised
  - but limited in power
• How can we combine the power of backprop with the idea of unsupervised learning?

Autoassociative Networks
• Network trained to reproduce the input at the output layer (a copy of the input serves as the target)
• Non-trivial if the number of hidden units is smaller than the number of inputs/outputs
• Forced to develop compressed representations of the patterns
• Hidden unit representations may reveal natural kinds (e.g. vowels vs consonants)
• The problem of an explicit teacher is circumvented
[Diagram: input layer feeding a narrower hidden layer feeding an output layer of the same size as the input]
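A minimal sketch in this spirit (Python/NumPy, reusing the backprop update from earlier; the 8-3-8 layer sizes, seed, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.eye(8)                       # 8 one-hot patterns; the input is also the target
W1 = rng.normal(0, 0.5, (8, 3)); b1 = np.zeros(3)   # bottleneck of 3 hidden units
W2 = rng.normal(0, 0.5, (3, 8)); b2 = np.zeros(8)

sigmoid = lambda x: 1 / (1 + np.exp(-x))

for epoch in range(20000):
    h = sigmoid(X @ W1 + b1)                 # compressed representation
    y = sigmoid(h @ W2 + b2)
    d_out = (X - y) * y * (1 - y)            # "copy of input as target"
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 += 0.5 * h.T @ d_out; b2 += 0.5 * d_out.sum(axis=0)
    W1 += 0.5 * X.T @ d_hid; b1 += 0.5 * d_hid.sum(axis=0)

print(np.round(h, 1))   # the 3 hidden units develop distinct codes for the 8 patterns
```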

Problems and Networks
• Some problems have natural “good” solutions
• Solving a problem may be possible by providing the right armoury of general-purpose tools
• Networks are general purpose tools
• Choice of network type, training, architecture, etc. greatly influences the chances of successfully solving a problem
• Tailoring tools for a specific job vs exploiting a general purpose learning mechanism

Summary
• Multiple-layer feed-forward networks:
  - Replace the step function with a sigmoid (differentiable) function
  - Learn weights by gradient descent on an error function
  - Backpropagation algorithm for learning
  - Avoid overfitting by early stopping

ALVINN drives 70 mph on highways

Use MLP Neural Networks when …
• (vectored) real inputs, (vectored) real outputs
• You’re not interested in understanding how it works
• Long training times are acceptable
• Short execution (prediction) times are required
• Robustness to noise in the dataset is needed

Applications of FFNN
Classification, pattern recognition:
• FFNN can be applied to tackle non-linearly separable learning problems:
  - recognizing printed or handwritten characters
  - face recognition
  - classification of loan applications into credit-worthy and non-credit-worthy groups
  - analysis of sonar or radar signals to determine the nature of the source
Regression and forecasting:
• FFNN can be applied to learn non-linear functions (regression) and functions whose input is a sequence of measurements over time (time series).

Extensions of Backprop Nets
• Recurrent architectures
• Backprop through time

Elman Nets & Jordan Nets Output 1 Hidden Context Input α Hidden Context Updating

Elman Nets & Jordan Nets Output 1 Hidden Context Input α Hidden Context Updating the context as we receive input n n n In Jordan nets we model “forgetting” as well The recurrent connections have fixed weights You can train these networks using good ol’ backprop Input

Recurrent Backprop
[Diagram: a three-node recurrent network (nodes a, b, c with weights w1–w4) unrolled for 3 iterations into a layered feed-forward network]
• We pretend to step through the network one iteration at a time.
• Backprop as usual, but average equivalent weights (e.g. all 3 unrolled copies of an edge are equivalent).
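The tying step can be made concrete (Python/NumPy; grads_per_copy is a hypothetical stand-in for the per-iteration gradients that ordinary backprop produces on the unrolled net; averaging the gradients keeps every copy of a shared weight equal, one common way to realize the slide’s “average equivalent weights”):

```python
import numpy as np

# Ordinary backprop on the unrolled net yields one gradient per copy
# of each shared weight (one per unrolled iteration).
grads_per_copy = [np.array([0.2, -0.1]),     # iteration 1
                  np.array([0.4, -0.3]),     # iteration 2
                  np.array([0.3, -0.2])]     # iteration 3

tied_grad = np.mean(grads_per_copy, axis=0)  # one averaged update for all copies
w_shared = np.array([0.5, 0.5])
w_shared -= 0.1 * tied_grad                  # every copy gets the same new value
print(w_shared)                              # [0.47, 0.52]
```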