Chapter 6: Neural Network Implementations


Neural Network Implementations
• Back-propagation networks
• Learning vector quantizer networks
• Kohonen self-organizing feature map networks
• Evolutionary multi-layer perceptron networks

The Iris Data Set
• Consists of 150 four-dimensional vectors (50 plants of each of three Iris species)
• Features are: sepal length, sepal width, petal length, and petal width
• We are working with scaled values in the range [0, 1]
• Example patterns (four scaled inputs each, with a one-of-three target code such as 1 0 0 appended):
0.637500 0.437500 0.875000 0.400000
0.787500 0.412500 0.175000 0.587500
0.750000 0.025000 0.175000 0.312500

Implementation Issues
• Topology
• Network initialization and normalization
• Feedforward calculations
• Supervised adaptation versus unsupervised adaptation
• Issues in evolving neural networks

Topology
• Pattern of PEs and interconnections
• Direction of data flow
• PE activation functions
Back-propagation uses at least three layers; LVQ and SOFM use two.

Definition: Neural Network Architecture
Specifications sufficient to build, train, test, and operate a neural network.

Back-propagation Networks
• Software on web site
• Topology
• Network input
• Feedforward calculations
• Training
• Choosing network parameters
• Running the implementation

Elements of an artificial neuron (PE)
• Set of connection weights
• Linear combiner
• Activation function

Back-propagation Network Structure

Back-propagation network input
• Number of inputs depends on the application
• Don’t combine parameters unnecessarily
• Inputs are usually continuous valued over the range [0, 1]
• Type float in C++: 24-bit significand, 8-bit exponent; about 7 decimal places of precision
• Scaling is usually used as a preprocessing tool
• Usually scale on like groups of channels (amplitude, time)
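The slides do not show the exact scaling used for the Iris data, so the following is only an illustration of the kind of preprocessing described above: a min-max scaler that maps one channel of raw values onto [0, 1] (function and variable names are assumptions, not the book's source):

```c
/* Min-max scale one channel of raw input values onto [0, 1].
   A generic illustration, not the book's exact preprocessing. */
static void scale_channel(double *x, int n)
{
    double lo = x[0], hi = x[0];
    for (int i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    if (hi > lo)                      /* avoid dividing by zero on a flat channel */
        for (int i = 0; i < n; i++)
            x[i] = (x[i] - lo) / (hi - lo);
}
```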

Feedforward Calculations
• Input PEs distribute the signal forward along multiple paths
• Fully connected, in general
• No feedback loops, not even self-feedback
• An additive sigmoid PE is used in our implementation
Activation of the ith hidden PE: the weighted input sum passed through fn(·), the sigmoid function; PE 0 is the bias PE.

Sigmoid Activation Function

Feedforward calculations, cont’d.
• The sigmoid function performs a job similar to an electronic amplifier (its gain is the slope)
• Once the hidden-layer activations are calculated, the outputs are calculated the same way:

Training by Error Back-propagation
Error per pattern, and the error signal Error_signal_kj for each output PE: we derived these using the chain rule.

Backpropagation Training, Cont’d.
• Weights must be initialized before they can be updated.
• Often (usually) randomized over [-0.3, 0.3]
• Two ways to update weights:
  On-line, or “single pattern,” adaptation
  Off-line, or epoch, adaptation (we use this in our back-prop)

Updating Output Weights
Basic weight update method: but this tends to get caught in local minima. So, introduce a “momentum” term α in [0, 1] (the update includes bias weights).
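The weight change with momentum can be sketched as follows; this is the generic delta-rule-with-momentum form, since the slide's exact equation is not reproduced in this transcript:

```c
/* Weight change for one connection with momentum:
   delta_w(t) = eta * delta * input + alpha * delta_w(t-1)
   where delta is the back-propagated error signal for the PE,
   eta the learning rate, and alpha the momentum in [0, 1]. */
static double weight_change(double eta, double delta, double input,
                            double alpha, double prev_change)
{
    return eta * delta * input + alpha * prev_change;
}
```

The momentum term keeps a fraction of the previous change, which helps the update roll through small local minima.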

Updating Hidden Weights
As derived previously. So:
Note: the δ’s are calculated one pattern at a time, and are calculated using the “old” weights.

Keep in mind… In offline training:
• The deltas are calculated pattern by pattern, while the weights are updated once per epoch.
• The values for η and α are usually assigned to the entire network, and left constant after good values are found.
• When the δ’s are calculated for the hidden layer, the old (existing) weights are used.

Kohonen Networks
• Probably second only to back-propagation in number of applications
• A rigorous mathematical derivation has not appeared
• Seem to be more biologically oriented than most paradigms
• Reduce the dimensionality of inputs
• We’ll consider LVQI, LVQII, and self-organizing feature maps

Initial Weight Settings
1. Randomize weights over [0, 1].
2. Normalize the weights.
Note: Randomization often places weights in the centroid area of the problem space.

Preprocessing Alternatives
1. Transform each variable onto [-1, 1]
2. Then normalize by:
a. Dividing each vector component by the total vector length, or by
b. “Z-axis normalization” with a “synthetic” variable, or by
c. Assigning a fixed interval (perhaps 0.1 or 1/n, whichever is smaller) to a synthetic variable that is the scale factor in (a), scaled to the fixed interval

Euclidean Distance for the jth PE and the kth pattern

Distance Measures
l = 1: Hamming distance
l = 2: Euclidean distance
l = 3: ???

Weight Updating
• Weights are adjusted in the neighborhood only
• Sometimes the learning rate decays over iterations, where z = total number of iterations
• Rule of thumb: the number of training iterations should be about 500 times the number of output PEs
• Some people start out with eta = 1 or near 1
• The initial neighborhood should include most or all of the output PE field
• Options exist for the configuration of the output slab: ring, cylindrical surface, cube, etc.
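The decay formula itself is not reproduced in this transcript; one schedule of the kind described (an assumption, not necessarily Kohonen's exact form) decays eta linearly with the iteration count t:

```c
/* Linearly decaying learning rate: eta(t) = eta0 * (1 - t/z),
   where z is the total number of iterations.
   An assumed form for illustration only. */
static double eta_schedule(double eta0, int t, int z)
{
    return eta0 * (1.0 - (double)t / (double)z);
}
```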

Error Measurement
• Unsupervised, so no “right” or “wrong”
• Two approaches – pick or mix:
  • Define error as mean error vector length
  • Define error as max error vector length (adding a PE when this is large could improve performance)
• Convergence metric: max_error_vector_length/eta (best when epoch training is used)

Learning Vector Quantizers: Outline
• Introduction
• Topology
• Network initialization and input
• Unsupervised training calculations
• Giving the network a conscience
• LVQII
• The LVQI implementation

Learning Vector Quantization: Introduction
• Related to SOFM
• Several versions exist, both supervised and unsupervised
• LVQI is unsupervised; LVQII is supervised (I & II do not correspond to Kohonen’s notation)
• Related to perceptrons and the delta rule; however:
  • Only one (winner) PE’s weights are updated
  • Depending on version, updating is done for correct and/or incorrect classification
  • The weight-updating method is analogous to the metric used to pick the winning PE
  • Network weight vectors approximate the density function of the input

LVQ-I Network Topology

LVQI Network Initialization and Input
• LVQI clusters input data
• More common to input raw (preprocessed) data
• Input vectors are usually normalized, but sometimes it is better not to
• Initial normalization of weight vectors is almost always done, but in various ways
• In our implementation, for p PEs in the output layer, the first p patterns are chosen randomly to initialize the weights

Weight and Input Vector Initialization: (a) before, (b) after input vector normalization

LVQ Version I - Unsupervised Training
• Present one pattern at a time, and select the winning output PE based on minimum Euclidean distance
• Update the winning PE’s weights
• Continue until weight changes are acceptably small or the maximum number of iterations is reached
• Ideally, the output will reflect the probability distribution of the input
• But what if we want to characterize the decision hypersurface more accurately?
• It is important to have training patterns near the decision hypersurface
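One LVQ-I training step can be sketched as: pick the winner by minimum (squared) Euclidean distance, then move only that PE's weight vector toward the input (names are illustrative; the update form w_new = w_old + eta*(a - w_old) is the standard unsupervised LVQ rule):

```c
/* Index of the output PE whose weight vector (row j of the flat
   array w, each row of length dim) is closest to input pattern a. */
static int find_winner(const double *a, const double *w, int n_pe, int dim)
{
    int best = 0;
    double best_d = -1.0;
    for (int j = 0; j < n_pe; j++) {
        double d = 0.0;
        for (int i = 0; i < dim; i++) {
            double diff = a[i] - w[j * dim + i];
            d += diff * diff;         /* squared distance suffices for argmin */
        }
        if (best_d < 0.0 || d < best_d) { best_d = d; best = j; }
    }
    return best;
}

/* LVQ-I update: move only the winner's weights toward the input. */
static void lvq1_update(double *w_winner, const double *a, int dim, double eta)
{
    for (int i = 0; i < dim; i++)
        w_winner[i] += eta * (a[i] - w_winner[i]);
}
```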

Giving the Network a Conscience
• The optimal 1/n representation by each output PE is unlikely (without some “help”)
• This is especially serious when the initial weights don’t reflect the probability distribution of the input patterns
• DeSieno developed a method for adding a conscience to the network
In the example: with no conscience, given a uniform distribution of input patterns, w7 will win about half of the time, and the other weights about 1/12 of the time each.

Conscience Equations

Conscience Parameters
• Conscience factor fj with initial value 1/n (so initial bias values are all 0)
• Bias factor γ set to approximately 10
• Constant β set to about 0.0001 (set β so that conscience factors don’t reflect noise in the data)

Example of Conscience
If there are 5 output PEs, then 1/n = 0.2 = all initial fj values.
Biases are 0 initially, and the first winner is selected based on the minimum Euclidean distance.
The conscience factors are then updated:
Winner’s fj = 0.2 + 0.0001(1.0 - 0.2) = 0.20008
All others’ fj = 0.2 - 0.00002 = 0.19998
Winner’s bj = -0.0008; all others’ bj = 0.0002
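The DeSieno conscience updates in this example can be sketched directly; the slide's numbers fall out of these two functions (names are illustrative):

```c
/* Update the conscience factors after a win: f_j moves toward 1 for
   the winner and toward 0 for everyone else, at rate beta:
   f_j += beta * (y_j - f_j), y_j = 1 for the winner, else 0. */
static void update_conscience(double *f, int n, int winner, double beta)
{
    for (int j = 0; j < n; j++) {
        double y = (j == winner) ? 1.0 : 0.0;
        f[j] += beta * (y - f[j]);
    }
}

/* Bias term used when selecting winners: b_j = gamma * (1/n - f_j).
   Frequent winners get a negative bias and become harder to pick. */
static double conscience_bias(double f_j, int n, double gamma)
{
    return gamma * (1.0 / (double)n - f_j);
}
```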

Probability Density Function (shows regions of equal area)

Learning: No Conscience (A = 0.03 for 16,000 iterations)

Learning: With Conscience (A = 0.03 for 16,000 iterations)

With Conscience, Better Weight Allocation

LVQ - Version II - Supervised
• Instantiate the first p pattern vectors ak as the weights wji
• The relative numbers of weights assigned by class must correspond to the a priori probabilities of the classes
• Assume pattern Ak belongs to class Cr and the winning PE’s weight vector belongs to class Cs; the winning PE’s weights are then updated. For all other PEs, no weight changes are made.
• This LVQ version reduces misclassifications
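The supervised update for the winning PE can be sketched as: move the weight vector toward the input when the winner's class matches the pattern's class (Cr == Cs), and away from it otherwise (a sketch of the rule as stated; names are illustrative):

```c
/* LVQ-II update for the winning PE only: reinforce a correct
   classification (class_r == class_s), punish an incorrect one. */
static void lvq2_update(double *w_winner, const double *a, int dim,
                        double eta, int class_r, int class_s)
{
    double sign = (class_r == class_s) ? 1.0 : -1.0;
    for (int i = 0; i < dim; i++)
        w_winner[i] += sign * eta * (a[i] - w_winner[i]);
}
```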

Evolving Neural Networks: Outline
• Introduction and definitions
• Artificial neural networks
• Adaptation and computational intelligence
• Advantages and disadvantages of previous approaches
• Using particle swarm optimization (PSO)
• An example application
• Conclusions

Introduction
• Neural networks are very good at some problems, such as mapping input vectors to outputs
• Evolutionary algorithms are very good at other problems, such as optimization
• Hybrid tools are possible that are better than either approach by itself
• Review articles on evolving neural networks: Schaffer, Whitley, and Eshelman (1992); Yao (1995); and Fogel (1998)
• Evolutionary algorithms are usually used to evolve network weights, but sometimes used to evolve structures and/or learning algorithms

Typical Neural Network

More Complex Neural Network

Evolutionary Algorithms (EAs) Applied to Neural Network Attributes
• Network connection weights
• Network topology (structure)
• Network PE transfer function
• Network learning algorithms

Early Approaches to Evolve Weights
• Bremermann (1968) suggested optimizing weights in multilayer neural networks.
• Whitley (1989) used a GA to learn weights in a feedforward network; used for relatively small problems.
• Montana and Davis (1989) used a “steady state” GA to train a 500-weight neural network.
• Schaffer (1990) evolved a neural network with better generalization performance than one designed by a human.

Evolution of Network Architecture
• Most work has focused on evolving the network’s topological structure
• Less has been done on evolving processing element (PE) transfer functions
• Very little has been done on evolving topological structure and PE transfer functions simultaneously

Examples of Approaches
• Indirect coding schemes:
  Evolve parameters that specify the network topology
  Evolve the number of PEs and/or the number of hidden layers
• Evolve developmental rules to construct the network topology
• Stork et al. (1990) evolved both network topology and PE transfer functions (Hodgkin-Huxley equation) for a neuron in the tail-flip circuitry of the crayfish (only 7 PEs)
• Koza and Rice (1991) used genetic programming to find weights and topology. They encoded a tree structure of Lisp S-expressions in the chromosome.

Examples of Approaches, Cont’d.
• Optimization of EA operators used to evolve neural networks (optimize hill-climbing capabilities of GAs)
• Summary:
  • Few quantitative comparisons with other approaches are typically given (speed of computation, performance, generalization, etc.)
  • Comparisons should be between the best available approaches (fast EAs versus fast NNs, for example)

Advantages of Previous Approaches
• EAs can be used to train neural networks with non-differentiable PE transfer functions.
• Not all PE transfer functions in a network need to be the same.
• EAs can be used when error gradient or other error information is not available.
• EAs can perform a global search in a problem space.
• The fitness of a network evolved by an EA can be defined in a way appropriate for the problem. (The fitness function does not have to be continuous or differentiable.)

Disadvantages of Previous Approaches
• GAs do not generally seem to be better than the best gradient methods, such as quickprop, at training weights.
• Evolution of network topology is often done in ways that result in discontinuities in the search space (e.g., removing and inserting connections and PEs). Networks must therefore be retrained, which is computationally intensive.
• Representation of weights in a chromosome is difficult:
  • Order of weights?
  • Encoding method?
  • Custom-designed genetic operators?

Disadvantages of Previous Approaches, Cont’d.
Permutation problem (also known as the competing conventions problem or isomorphism problem): multiple chromosome configurations can represent equivalent optimum solutions. Example: various permutations of hidden PEs can represent equivalent networks. We believe, as does Hancock (1992), that this problem is not as severe as reported. (In fact, it may be an advantage.)

Evolving Neural Networks with Particle Swarm Optimization
• Evolve a neural network capable of being a universal approximator, such as a back-propagation or radial basis function network.
• In back-propagation, the most common PE transfer function is the sigmoidal function: output = 1/(1 + e^(-input))
• Eberhart, Dobbins, and Simpson (1996) first used PSO to evolve network weights (replacing the back-propagation learning algorithm)
• PSO can also be used to indirectly evolve the structure of a network. An added benefit is that preprocessing of the input data becomes unnecessary.

Evolving Neural Networks with Particle Swarm Optimization, Cont’d.
• Evolve both the network weights and the slopes of the sigmoidal transfer functions of the hidden and output PEs.
• If the transfer function now is output = 1/(1 + e^(-k*input)), then we are evolving k in addition to evolving the weights.
• The method is general, and can be applied to other topologies and other transfer functions.
• Flexibility is gained by allowing slopes to be positive or negative. A change in sign for the slope is equivalent to a change in the signs of all input weights.

Evolving the Network Structure with PSO
• If the evolved slope is sufficiently small, the sigmoidal output can be clamped to 0.5 and the hidden PE can be removed. The weight from the bias PE to each PE in the next layer is increased by one-half the value of the weight from the PE being removed to that next-layer PE. PEs are thus pruned, reducing network complexity.
• If the evolved slope is sufficiently high, the sigmoid transfer function can be replaced by a step transfer function. This works with large negative or positive slopes. Network computational complexity is thus reduced.
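The pruning step for a near-zero slope can be sketched as follows: the PE's output is effectively frozen at 0.5, so fold half of each outgoing weight into the next layer's bias weights, after which the PE can be deleted (names are illustrative):

```c
/* Prune a hidden PE whose evolved slope is ~0: its sigmoid output is
   a constant 0.5, so add 0.5 * (outgoing weight) to the bias weight
   of each next-layer PE; the PE can then be removed. */
static void prune_flat_pe(double *bias_weights, const double *out_weights,
                          int n_next)
{
    for (int j = 0; j < n_next; j++)
        bias_weights[j] += 0.5 * out_weights[j];
}
```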

Evolving the Network Structure with PSO, Cont’d.
• Since slopes can evolve to large values, input normalization is generally not needed. This simplifies the application process and shortens development time.
• The PSO process is continuous, so neural network evolution is also continuous. No sudden discontinuities exist such as those that plague other approaches.

Example Application: the Iris Data Set
• Introduced by Anderson (1935), popularized by Fisher (1936)
• 150 records total; 50 of each of 3 varieties of iris flowers
• Four attributes in each record:
  • sepal length
  • sepal width
  • petal length
  • petal width
• We used both normalized and unnormalized versions of the data set; all 150 patterns were used to evolve a neural network. The issue of generalization was thus not addressed.

Example Application, Continued
• Values of -k*input > 100 resulted in clamping the PE transfer output to zero, to avoid computational overflow.
• The normalized version of the data set was first used to test the concept of evolving both weights and slopes. Next we looked at the threshold value of slope at which the sigmoidal transfer function could be transitioned into a step function without significant loss in performance.

Performance Variations with Slope Thresholds

Discussion of Example Application
• The average number of errors was 2.15 out of 150 with no slope threshold. (This is a good result for this data set.)
• Accuracy degrades gracefully until the slope threshold decreases to 4.
• The preliminary indication is that slopes can be evolved, and that a slope threshold of about 10 to 20 would be reasonable for this problem.
• Other data sets are being examined.
• More situations with slopes near zero are being tested.

Un-normalized Data Set Results
One set of runs; 40 runs of 1000 generations

Number correct:                  149  148  147  146  145  144  100   99
Number of runs with this
number correct:                   11   16    6    3    1    1    1    1

A good solution was obtained in 38 of 40 runs. The average number correct was 145. Ignoring the two worst solutions, an average of only 2 mistakes.

Examples of Recent Applications
• Scheduling (integrated automated container terminal)
• Manufacturing (product content combination optimization)
• Figure of merit for electric vehicle battery pack
• Optimizing reactive power and voltage control
• Medical analysis/diagnosis (Parkinson’s disease and essential tremor)
• Human performance prediction (cognitive and physical)

Conclusions
• A brief review of applying EC techniques to evolving neural networks was presented. Advantages and disadvantages were summarized.
• A new methodology for using particle swarm optimization to evolve network weights and structures was presented.
• The methodology seems to overcome the first four disadvantages discussed.
• We believe that multimodality is a help rather than a hindrance with EAs (including PSO).
• The Iris Data Set was used as an example of the new approach.

The BP Software
An implementation of a fully connected feed-forward network.
• main() routine
• BP_Start_Up() reads parameters from the input (run) file and allocates memory
• BP_Clean_Up() stores results in the output file and deallocates memory
• bp_state_handler() is the most important part of the BP state machine
Output PEs can be linear or sigmoid; hidden PEs are always sigmoid.
The number of layers and the number of PEs per layer can be specified.

Back-prop. State Transition Diagram

BP Software, Cont’d.
Enumeration data types are used for:
• NN operating mode (train or recall)
• PE function type
• Nature of the layer (input, hidden, output)
• Training mode (offline or online)
• States in the state machine

Enumeration Data Types for All NNs

Enumeration Data Types for Back-prop.

BP Software, Cont’d.
Structure data types are used for:
• PE configuration
• Network configuration
• Environment and training parameters
• Network architecture
• Pattern configuration

Structure Data Type Example
The structure data type BP_Arch_Type defines the network architecture:
• Number of layers
• Pointer to the number of PEs in hidden layers
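A sketch of what BP_Arch_Type might look like in C; only the two fields named on the slide are included, and the field names themselves are assumptions, not the book's actual declarations:

```c
/* Network architecture: number of layers, and a pointer to an array
   holding the number of PEs in each hidden layer (assumed field names). */
typedef struct {
    int  num_layers;        /* e.g., 3 means one hidden layer */
    int *hidden_pe_counts;  /* PEs per hidden layer           */
} BP_Arch_Type;
```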

BP State Handler
• Total of 15 states
• The most important part of the state machine
• Routes the program to the proper state

Running the BP Software
To run, you need bp.exe and a run file, such as iris_bp.run.
First train, then test. For example:
To train, run: bp iris_bpr.run
  You will get: bp_res.txt (weights of the trained net)
  You will see (or you can redirect with >filename1): error values for each iteration
To test, run: bp iris_bps.run
  You will get: bp_test.txt (summary of correct patterns)
  You will see (or >filename2): detailed results (I run bp iris_bps.run >irisres.txt)

Sample BP Run File
0           0=train 1=test
0           if train, 0=batch 1=sequential
0.075       learning rate
0.15        momentum rate
0.01        error termination criterion (not implemented)
10000       max number of generations
99          number of training patterns
3           number of layers (3 -> one hidden layer)
4           number of PEs in hidden layer
150         total number of patterns in pattern file
4           dimension of input
3           dimension of output
iris.dat    data file

Choosing BP Network Parameters
How many hidden PEs? Guess/estimate. (This is only a “rule of thumb.”)

Choosing BP Network Parameters
• Too few hidden PEs, and the network won’t generalize or won’t train
• Too many hidden PEs, and the net will “memorize”
• Assign one output PE per class
• It is probably best to start with low values for η and α
• Avoid getting stuck on an error value that’s too high, maybe 0.06 or 0.08 SSE/pattern/PE
• I often try values of η between 0.02 and 0.20, and α in [0.01, 0.10]

The Kohonen Network Implementations
The learning vector quantization (LVQ) software implementation is presented first. The self-organizing feature map (SOFM) is presented next.

LVQ Software
General definitions (in the BP section) are still valid. New data types are defined in the enumeration and structure data type code.
Enumeration types: the network can be trained randomly or sequentially, and can use (or not use) a conscience (described later).
Structure types: establish the PE type and define environment parameters such as training parameters, the flag for conscience, and the number of clusters, which is the number of output PEs.

LVQ Software, Cont’d.

LVQ Software, Cont’d.
• main() routine
• LVQ_Start_Up() reads parameters from the input (run) file and allocates memory
• LVQ_Main_Loop() is the primary part of the implementation
• LVQ_Clean_Up() stores results in the output file and de-allocates memory
The LVQ implementation has 13 states.

LVQ State Diagram for Training Mode

LVQ Software, Cont’d.
Output PEs are linear. Weights (from all inputs to an output) are normalized. The Euclidean distance is calculated between the input vector and each weight vector. The output PE with the smallest distance between input and weight vectors is selected as the winner. The weight vector of the winning PE is updated, then the learning rate is updated. If conscience is used, the conscience factor is updated.

LVQ Run File
0           0=train, 1=test
0           0=random pattern selection, 1=sequential
0.3         initial learning rate
0.999       shrinking factor
10          bias factor (gamma)
0.0001      beta
0.001       training termination criterion
500         max number of iterations
99          number of training patterns
1           1=conscience
6           max number of clusters
150         total number of patterns
4           input dimension
3           output dimension
iris.dat    data file

LVQ Results File Example
0.789628 0.573990 0.213485 0.038044   weights to first output PE (first cluster)
0.696514 0.335583 0.592744 0.225625
0.727000 0.299744 0.589254 0.185483
0.808415 0.529362 0.254345 0.039350
0.207525 0.075463 0.130591 0.966532
0.760180 0.348239 0.524717 0.159773   sixth cluster weights

LVQ Test File Example
Cluster  Class 0  Class 1  Class 2
----------------------------------
   0        0        0       26
   1        0       25        0
   2        0       22        6
   3       29        0        0
   4       21        0        0
   5        0        3       18

Class 0: clusters 3 and 4
Class 1: clusters 1 and 2
Class 2: clusters 0 and 5
141 out of 150 clustered “correctly”

Self Organizing Feature Maps
• An extension of LVQ; use LVQ features such as the conscience
• Also developed by Teuvo Kohonen
• Utilize slabs of PEs
• Incorporate the concept of a neighborhood
• Primary features of the input cause corresponding local responses in the output PE field
• Are non-linear mappings of the input space onto the output PE space (field)

SOFM Slab of PEs
• PEs in a slab have similar attributes.
• The slab has a fixed topology.
• Most slabs are two-dimensional.

Hexagonal Slab of PEs

SOFM Network Model
• More likely to use raw data as input to the SOFM.
• Kohonen often initializes weight vectors to be between 0.4 and 0.6 in length.
• The winning output PE has the minimum Euclidean distance between input and weight vectors. (A conscience can be used.)

SOFM Weight Updating
Weight updates are made to the winning PE and its neighborhood. The learning coefficient and the neighborhood both shrink over time. Sometimes a decay schedule is used, where z = total number of iterations and t is the iteration index.
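The exact shrinking schedule is not reproduced in this transcript; one plausible form (an assumption, with illustrative names) shrinks the neighborhood radius linearly in t/z:

```c
/* Linearly shrinking neighborhood radius: starts at r0 and reaches
   zero at iteration z. An assumed schedule for illustration only. */
static int shrink_radius(int r0, int t, int z)
{
    double r = (double)r0 * (1.0 - (double)t / (double)z);
    return (int)(r + 0.5);   /* round to the nearest integer radius */
}
```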

SOFM Neighborhood Types

Hats: Sombrero, Stovepipe hat, Chef’s hat

SOFM Phases of Learning
Two phases of learning in the Kohonen SOFM:
1. Topological ordering, where the weight vectors order themselves.
2. Convergence, in which fine tuning occurs.

SOFM Hints
• Rule of thumb: the number of training iterations should be about 500 times the number of output PEs.
• Some people start out with eta near 1.0.
• The initial neighborhood should include most or all of the output PE slab.
• Options exist for the configuration of the output slab: ring, cylindrical surface, cube, etc.

SOFM Error Measurement
Unsupervised, so no right or wrong. Two approaches – pick or mix:
• Define error as mean error vector length
• Define error as max error vector length (adding a PE when this is large could improve performance)
A convergence metric could be max_error_vector_length/eta (best when epoch training is used).

SOFM Advantages
• Can do real-time non-parametric pattern classification
• Don’t need to know classes a priori
• Does nearest-neighbor-like classifications
• Relatively simple paradigm
• Can deal with many classes
• Can handle high-dimensionality inputs

SOFM Disadvantages
• Long training time
• Can’t add new classes without retraining
• Hard to figure out how to implement
• Not good with parameterized data
• Must normalize input patterns (?)

SOFM Applications
• Speech processing
• Image processing
• Data compression
• Combinatorial optimization
• Robot control
• Sensory mapping
• Preprocessing

SOFM Run File

0         Training/recall (0 = train; 1 = recall)
0         Training mode (if training, 0 = random)
0.3       Learning rate
0.999     Shrinking coefficient
10        Bias factor
0.0001    Beta
0.001     Training error criterion for termination
500       Maximum number of generations
99        Number of patterns used for training
1         1 = conscience; 0 = no conscience
1         Initial width of neighborhood
1         Initial height of neighborhood
4         Output slab height
4         Output slab width
0         Neighborhood function type (0 = chef hat)
150       Total number of patterns
4         Input dimension
3         Output dimension
iris.dat  Data file for patterns

SOFM Weights File (one row per output PE, from the first PE to the last)

0.762695 0.409230 0.477594 0.150768   (weights from inputs to first output PE)
0.744240 0.379303 0.521246 0.174752
0.776556 0.443671 0.428612 0.128095
0.757758 0.397594 0.492467 0.158740
0.778668 0.421259 0.446406 0.130147
0.758743 0.376317 0.507493 0.158574
0.765185 0.391521 0.488735 0.149472
0.748811 0.363523 0.527234 0.170756
0.769893 0.418876 0.460760 0.139670
0.731007 0.357475 0.549875 0.188358
0.784809 0.461558 0.398094 0.112071
0.745425 0.374062 0.523810 0.173326
0.785969 0.437032 0.421340 0.117167
0.752147 0.362909 0.524794 0.164813
0.771124 0.401549 0.473482 0.141214
0.736854 0.345378 0.551698 0.182727   (weights from inputs to last output PE)

SOFM Test Results

PE (row col)   Class 0   Class 1   Class 2
00 00              0         1         0
00 01              0         0         0
00 02              0         0         0
00 03              0         1         0
01 00              0         0         0
01 01             50         0         0
01 02              0         0         0
01 03              0         1         0
02 00              0         3         0
02 01              0         0         0
02 02              0         4         0
02 03              0         1         0
03 00              0         7        25
03 01              0         3         0
03 02              0        14         0
03 03              0        15        25

Also output is the cluster assignment for each pattern.

Attributes Needed to Specify a Kohonen SOFM • Number and configuration of input PEs • Number and configuration of output PEs • Dimensionality of output slab (1, 2, 3, etc.) • Geometry of output slab (square or hexagonal neighborhood, wraparound or not) • Neighborhood definition as a function of time • Learning coefficient as a function of time and space • Initialization of weights • Preprocessing (normalization) and presentation (random or not) of inputs • Method to select winner (Euclidean distance or dot product)

Summary of SOFM Process
Allocate storage
Read weights and patterns
Loop through iterations
  Loop through patterns
    Compute activations
    Find winning PE
    Adapt weights of winner and its neighborhood
  Shrink neighborhood size
  Reduce learning coefficient eta
  If eta <= 0, break
Write final weights
Write activation values
Free storage

Evolutionary Back-Propagation Implementation • A merger of the back-propagation implementation and the PSO implementation • PSO is used only to evolve weights (not slopes of sigmoid functions) • BP is used only in recall mode; the outputs are used to evaluate fitness for each particle (candidate set of weights)

Evolutionary BP, Cont’d. • Both BP and PSO start-up and clean-up routines are included • Length of individual particles is calculated from dimensions in input file • Particle elements correspond to individual weights • BP recall is run for each particle after each iteration of PSO to evaluate fitness (error) • The BP network is the “problem” for PSO to solve

Main Routine for Evolutionary Back-Prop

#include <stdio.h>
#include <stdlib.h>

static void main_start_up(char *psoDataFile, char *bpDataFile);
static void main_clean_up(void);

int main (int argc, char *argv[])
{
    // check command line
    if (argc != 3)
    {
        printf("Usage: exe_file pso_run_file bp_run_file\n");
        exit(1);
    }
    // initialize, run the PSO main loop, then clean up
    main_start_up(argv[1], argv[2]);
    PSO_Main_Loop();
    main_clean_up();
    return 0;
}

static void main_start_up (char *psoDataFile, char *bpDataFile)
{
    BP_Start_Up(bpDataFile);
    PSO_Start_Up(psoDataFile);
}

static void main_clean_up (void)
{
    PSO_Clean_Up();
    BP_Clean_Up();
}

Running the Evolutionary BP Network Implementation • Need the executable file pso_nn.exe • Need two run files, such as pso.run and bp.run • The PSO run file is the same as for a single PSO, except that the particle length is not specified • The BP run file is short; only the information needed for recall

Example bp.run:
3         # of layers
4         # hidden PEs
150       # patterns
4         # inputs
3         # outputs
iris.dat  # data file

PSO Run File

1      // num of psos
0      // pso_update_pbest_each_cycle_flag
1      // total cycles of running PSOs
1      // optimization type: min or max (max no. correct)
17     // evaluation function (17 calls BP weights from PSO)
1      // inertia weight update method
0      // initialization type: sym/asym
-10.0 5 10 200   // left/right initialization range, maximum velocity, maximum position, max number of generations
30     // population size
0.9    // initial inertia weight
0      // boundary flag (boundaries follow if flag is 1)

BP_RES.TXT Output File

Weights from inputs to first hidden PE (bias first)
  …
Weights from inputs to last hidden PE (bias first)
Weights from first hidden to first output PE (bias first)
  …
Weights from last hidden to last output PE (bias first)

BP_RES.TXT Output File

-2.555491 -3.560039  2.198371  8.452043 -0.000573   (weights to first hidden PE, bias first)
-4.703630  6.440988  8.627151 -3.195024  0.699212
-1.443098 -6.584295  0.430629  2.237892  0.960514
-5.099212 -3.314713  0.362337 -8.708467 -3.981537   (weights to 4th hidden PE, bias first)
-5.676066  2.128347 -1.152100  5.140296 -3.994824   (weights to first output PE, bias first)
 4.449585 -2.012187  0.222005 -3.648189 -1.876380
 7.973076  6.194356 -0.598305 -6.768669 -11.408623  (weights to 3rd output PE, bias first)

Example Application: the Iris Data Set • Introduced by Anderson (1935), popularized by Fisher (1936) • 150 records total; 50 of each of 3 varieties of iris flowers • Four attributes in each record: sepal length, sepal width, petal length, petal width • We used both normalized and unnormalized versions of the data set; all 150 patterns were used to evolve a neural network, so the issue of generalization was not addressed.

Example Application, Continued • Values of -k*input > 100 resulted in clamping the PE transfer output to zero, to avoid computational overflow. • The normalized version of the data set was first used to test the concept of evolving both weights and slopes. • Next we looked at the threshold value of the slope at which the sigmoidal transfer function could be transitioned into a step function without significant loss in performance.

Performance Variations with Slope Thresholds For each threshold value, 40 runs of 1000 generations were made on the 150-pattern data set.

Discussion of Example Application • The average number of errors was 2.15 out of 150 with no slope threshold. (This is a good result for this data set.) • Accuracy degrades gracefully until the slope threshold decreases to 4. • Preliminary indication is that slopes can be evolved, and that a slope threshold of about 10 to 20 would be reasonable for this problem. • Other data sets are being examined. • More situations with slopes near zero are being tested.

Un-normalized Data Set Results One set of runs; 40 runs of 1000 generations

Number correct:                149  148  147  146  145  144  100  99
Runs with this number correct:  11   16    6    3    1    1    1   1

A good solution was obtained in 38 of 40 runs. The average number correct was 145. Ignoring the two worst solutions, there was an average of only 2 mistakes.

Examples of Recent Applications • Scheduling (integrated automated container terminal) • Manufacturing (product content combination optimization) • Figure of merit for electric vehicle battery packs • Optimizing reactive power and voltage control • Medical analysis/diagnosis (Parkinson’s disease and essential tremor) • Human performance prediction (cognitive and physical)