Algorithm Implementation in FPGAs Demonstrated Through Neural Network Inversion on the SRC-6e • MSECE Thesis Presentation • Paul D. Reynolds

Algorithm Hardware Implementation • Algorithms – One Output for each Input – Purely Combinational • Problems – Too Large to be Directly Implemented – Timing Issues • Solution – Clocked Design – Repeated Use of Hardware

Implementation Hardware • SRC-6e Reconfigurable Computer – 2 Pentium III Processors • 1 GHz – 2 Xilinx XC2V6000 FPGAs • 100 MHz • 144 Multipliers • 144 Block RAMs – 6 Memory Blocks • 4 MB each

Hardware Architecture

SRC-6e Development Environment • main.c – C – Executes on Pentium Processors – Command Line Interface – Hardware accessed as a Function

SRC-6e Development Environment • hardware.mc – Modified C or FORTRAN – Executes in Hardware – Controls Memory Transfer – One for each FPGA used – Can be used for the entire code or together with hardware description functions

Hardware Description: VHDL and Verilog • Reasons for Use – To avoid C-compiler idiosyncrasies • Latency added to certain loops • 16-bit multiplies converted to 32-bit multiplies – More control • Fixed-point multiplication with truncation • Pipelines and parallel execution simpler – IP Cores Usable • More efficient implementation

Neural Network and Inversion Example

Problem Background • To determine the optimal sonar setup to maximize the ensonification of a grid of water. • Influences on ensonification: – Environmental Conditions: Temperature, Wind Speed – Bathymetry: Bottom Type, Shape of Bottom – Sonar System • Total of 27 different factors accounted for

Ensonification Example • 15 by 80 pixel grid • Red: High signal-to-interference ratio • Blue: Low signal-to-interference ratio • Bottom: No signal

Original Solution • Take current conditions • Match to previous optimum sonar setups with similar conditions • Run acoustic model using current conditions and previous optimum setups • Use sonar setup with highest signal to interference ratio

New Problem • Problem: – One acoustic model run took tens of seconds • Solution – Train a Neural Network on the acoustic model (APL & University of Washington)

Neural Network Overview • Inspired by the human ability to recognize patterns. • Mathematical structure able to mimic a pattern • Trained using known data – Show the network several examples and identify each example – The network learns the pattern – Show the network a new case and let the network identify it.

Neural Network Structure • Each neuron is the squashed sum of the inputs to that neuron • A squash is a non-linear function that restricts outputs to between 0 and 1 • Each arrow is a weight times a neuron output [Figure: layered network diagram labeled INPUTS, WEIGHT LAYER, NEURON, OUTPUTS]

Ensonification Neural Network • Taught using examples from the acoustical model. • Recognizes a pattern between the 27 given inputs and the 15 by 80 grid output • 27-40-50-70-1200 Architecture • Squash = [equation not recovered]
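As a rough illustration of the squashed-sum structure, the sketch below evaluates a 27-40-50-70-1200 network in plain Python. It assumes the logistic sigmoid as the squash (the slides only state that outputs are restricted to between 0 and 1) and one bias per neuron. The random weights and the names `squash`, `make_network`, and `forward` are placeholders, not the trained ensonification network.

```python
import math
import random

LAYER_SIZES = [27, 40, 50, 70, 1200]  # architecture from the slides

def squash(s):
    """Assumed squash: logistic sigmoid, output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-s))

def make_network(sizes, seed=0):
    """Placeholder random weights: each neuron has one bias plus one weight per input."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

def forward(network, inputs):
    """Propagate layer by layer: each neuron is the squashed weighted sum of its inputs."""
    activations = list(inputs)
    for layer in network:
        activations = [
            squash(w[0] + sum(wi * a for wi, a in zip(w[1:], activations)))
            for w in layer
        ]
    return activations

network = make_network(LAYER_SIZES)
grid = forward(network, [0.5] * 27)   # 1200 outputs, one per pixel of the 15 by 80 grid
```

The layers have to be evaluated one after another because each layer consumes the previous layer's outputs, while the neurons within a layer are independent of each other.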

Did the neural network solve the problem? • Yes: – Neural network acoustic model approximation: 1 ms • However: – Same method of locating the best setup: • Run many possible setups through the neural network • Choose the best – Problem: • Better, but still not real time

How to find a good setup solution: Particle Swarm Optimization • Idea – Several Particles Wandering over a Fitness Surface • Math – $x_{k+1} = x_k + v_k$ – $v_{k+1} = v_k + \mathrm{rand} \cdot w_1 (G_b - x_k) + \mathrm{rand} \cdot w_2 (P_b - x_k)$ • Theory – Momentum pushes particles around the surface – Pulled towards Personal Best – Pulled towards Global Best – Eventually particles oscillate around Global Best

Particle Swarm - Math • Next Position = Current Position + Current Velocity: $x_{k+1} = x_k + v_k$ • Next Velocity = Current Velocity + Global Pull + Personal Pull: $v_{k+1} = v_k + \mathrm{rand} \cdot w_1 (G_b - x_k) + \mathrm{rand} \cdot w_2 (P_b - x_k)$
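The two update equations can be sketched for a single particle as follows. The function name `pso_step`, the weight values, and the example positions are illustrative assumptions. `rand` is taken here to be a fresh uniform draw in [0, 1) for each term and each dimension, which is the usual convention but is not stated on the slide.

```python
import random

def pso_step(x, v, personal_best, global_best, w1, w2, rng=random):
    """One update of a single particle, applied per dimension.

    The position moves by the current velocity; the velocity gains a randomly
    scaled pull towards the global best (w1) and the personal best (w2).
    """
    new_x = [xi + vi for xi, vi in zip(x, v)]
    new_v = [
        vi + rng.random() * w1 * (gb - xi) + rng.random() * w2 * (pb - xi)
        for xi, vi, pb, gb in zip(x, v, personal_best, global_best)
    ]
    return new_x, new_v

# Example: a two-dimensional particle (all values are illustrative).
x, v = pso_step(x=[0.0, 1.0], v=[0.5, -0.5],
                personal_best=[1.0, 1.0], global_best=[2.0, 0.0],
                w1=0.1, w2=0.1)
```

Both pulls are computed from the current position $x_k$, so the position and velocity updates can be evaluated side by side.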

Particle Swarm in Operation Link to Particle Swarm file – in Quicktime

Particle Swarm Optimization • Swarm – 27 Inputs to Neural Network, Sonar System Setup • Fitness Surface – Calculated from neural network output • Two Options – Match a desired output • Sum of the difference from desired output • Minimize the difference – Maximize signal to interference ratio in an area • Ignore output in undesired locations
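A minimal sketch of the two fitness options, under stated assumptions: the difference is taken as an absolute per-pixel difference, and the area of interest is represented by a boolean mask. The names `match_fitness` and `area_fitness` are hypothetical.

```python
def match_fitness(output, desired):
    """Option 1: sum of absolute differences from a desired output (lower is better)."""
    return sum(abs(o - d) for o, d in zip(output, desired))

def area_fitness(output, mask):
    """Option 2: total signal inside the area of interest (higher is better).

    Pixels where the mask is False are ignored entirely.
    """
    return sum(o for o, keep in zip(output, mask) if keep)

# Tiny illustrative "grids" of three pixels.
print(match_fitness([0.2, 0.8, 0.4], [0.5, 0.5, 0.5]))
print(area_fitness([0.2, 0.8, 0.4], [True, False, True]))
```

The first option is minimized by the swarm and the second is maximized, so a shared optimizer would need one of them negated.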

New Problem Enters • Time for a 100k-step particle swarm on a 2.2 GHz Pentium: nearly 2 minutes • Desire a real-time version • Solution: Implement the neural network and particle swarm optimization in parallel on reconfigurable hardware

Three Design Stages • Activation Function Design – Sigmoid not efficient to calculate • Neural Network Design – Parallel Design • Particle Swarm Optimization – Hardware Implementation

Activation Function Design • Fixed Point Design • Sigmoid Accuracy Level • Weight Accuracy Level

Fixed Point Design • Compared with Floating Point – Easier – Less Area – Faster • Data Range of -50 to 85 – 2’s Complement – 7 integer bits – 1 sign bit • Fractional Portion – Sigmoid outputs are less than 1 – Some number of fractional bits

Sigmoid Accuracy Level

Weight Accuracy Level

Total Accuracy

Fixed Point Results • 16-bit Number – 1 Sign Bit – 7 Integer Bits – 8 Fractional Bits • Advantages – 18x18 multipliers – 64-bit input banks
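The 16-bit format (1 sign bit, 7 integer bits, 8 fractional bits) can be modeled in software as below. The helper names and the saturating conversion are assumptions for illustration. The multiply truncates the extra fractional bits, which matches the "fixed point multiplication with truncation" mentioned earlier in the deck.

```python
FRAC_BITS = 8                      # 8 fractional bits
SCALE = 1 << FRAC_BITS             # 256
MIN_RAW, MAX_RAW = -(1 << 15), (1 << 15) - 1   # 16-bit two's complement range

def to_fixed(value):
    """Convert a real number to the raw 16-bit integer, saturating at the range limits."""
    raw = int(round(value * SCALE))
    return max(MIN_RAW, min(MAX_RAW, raw))

def to_float(raw):
    """Convert a raw fixed-point integer back to a real number."""
    return raw / SCALE

def fixed_mul(a, b):
    """Multiply two fixed-point numbers, truncating the extra fractional bits."""
    product = a * b                # 16 fractional bits after the multiply
    return product >> FRAC_BITS    # arithmetic shift drops the low 8 bits

w = to_fixed(1.5)      # raw 384
x = to_fixed(-2.25)    # raw -576
y = fixed_mul(w, x)    # raw -864, which represents -3.375
```

The representable range is -128 to just under 128 in steps of 1/256, which covers the stated data range of -50 to 85. This sketch does not saturate the product; a hardware design would have to decide how to handle overflow.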

Activation Function Approximation • Compared 4 Designs – Look-up Table – Shift and Add – CORDIC – Taylor Series

Look-up Table • Advantages – Unlimited Accuracy – Short Latency of 3 • Disadvantages – An entirely on-chip design is desired – Might use memory needed for weights
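A look-up table can be sketched as one precomputed entry per possible input. The version below assumes a logistic sigmoid, a 16-bit input with 8 fractional bits, and an output with 8 fractional bits. Indexing the table by every raw input value is the simplest possible layout and is an assumption, not the layout used in the thesis.

```python
import math

FRAC_BITS = 8
SCALE = 1 << FRAC_BITS

def build_sigmoid_lut():
    """Precompute the sigmoid for every 16-bit input (raw values -32768..32767)."""
    lut = []
    for raw in range(-(1 << 15), 1 << 15):
        y = 1.0 / (1.0 + math.exp(-raw / SCALE))
        lut.append(int(round(y * SCALE)))      # output also has 8 fractional bits
    return lut

SIGMOID_LUT = build_sigmoid_lut()

def sigmoid_lut(raw):
    """Look up the sigmoid of a raw fixed-point input: one table read, no arithmetic."""
    return SIGMOID_LUT[raw + (1 << 15)]
```

A full table like this has 65,536 entries, which illustrates the memory concern raised on the slide. Because the sigmoid saturates for large inputs, a practical table would likely cover only a narrow input range.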

Look-up Table

Shift and Add • $Y(x) = 2^{-n} x + b$ • Advantages – Small Design – Short Latency of 5 • Disadvantages – Piecewise Outputs – Limited Accuracy
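To show what a piecewise $2^{-n} x + b$ approximation looks like, the sketch below uses the segment boundaries of the well-known PLAN approximation of the logistic sigmoid. This is a swapped-in example and not necessarily the segmentation used in the thesis. Inputs and outputs are raw integers with 8 fractional bits, so 256 represents 1.0.

```python
FRAC_BITS = 8
ONE = 1 << FRAC_BITS   # 1.0 in fixed point (256)

def sigmoid_shift_add(raw):
    """Piecewise-linear sigmoid on a raw fixed-point input using only shifts and adds.

    Segments follow the PLAN approximation (an assumption, not the thesis design):
      0     <= |x| < 1     : y = x/4  + 0.5
      1     <= |x| < 2.375 : y = x/8  + 0.625
      2.375 <= |x| < 5     : y = x/32 + 0.84375
      |x| >= 5             : y = 1
    Negative inputs use the symmetry y(-x) = 1 - y(x).
    """
    mag = abs(raw)
    if mag < 256:            # |x| < 1
        y = (mag >> 2) + 128
    elif mag < 608:          # |x| < 2.375
        y = (mag >> 3) + 160
    elif mag < 1280:         # |x| < 5
        y = (mag >> 5) + 216
    else:
        y = ONE
    return y if raw >= 0 else ONE - y
```

Every slope is a power of two, so each multiplication becomes a right shift. The price is the limited accuracy listed on the slide: with these segments the error stays around two hundredths at worst.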

Shift and Add

CORDIC • Computation – Divide Argument By 2 – Series of Rotations • Sinh(x) • Cosh(x) – Division for Tanh(x) – Shift and Add for Result

CORDIC • Advantages – Unlimited Accuracy – Real Calculation • Disadvantages – Long Latency of 50 – Large Design
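The rotation scheme can be sketched with hyperbolic CORDIC as below, in floating point for readability. Two points are assumptions: the slide's "divide argument by 2" is read here as the identity sigmoid(x) = 0.5 + 0.5 tanh(x/2), and the iteration count of 16 is arbitrary. The basic algorithm only converges for arguments up to roughly 1.1 in magnitude, so a full design would need range handling that this sketch omits.

```python
import math

def cordic_tanh(theta, iterations=16):
    """Approximate tanh(theta) with hyperbolic CORDIC rotations, for |theta| up to about 1.1.

    x and y converge to scaled cosh(theta) and sinh(theta); the common scale
    factor cancels in the final division y / x.
    """
    x, y, z = 1.0, 0.0, theta
    repeats = {4, 13}                     # iterations repeated to guarantee convergence
    for i in range(1, iterations + 1):
        for _ in range(2 if i in repeats else 1):
            d = 1.0 if z >= 0 else -1.0
            shift = 2.0 ** -i             # a right shift by i bits in hardware
            x, y = x + d * y * shift, y + d * x * shift
            z -= d * math.atanh(shift)    # a constant table entry in hardware
    return y / x

def cordic_sigmoid(x):
    """Sigmoid via the identity sigmoid(x) = 0.5 + 0.5 * tanh(x / 2)."""
    return 0.5 + 0.5 * cordic_tanh(x / 2.0)
```

Each rotation depends on the previous one, and the result still needs a division, which is in line with the long latency and large design listed as disadvantages.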

CORDIC

Taylor Series • $Y(x) = a + b(x - x_0) + c(x - x_0)^2$ • Advantages – Unlimited Accuracy • Average – Latency of 10 – Medium Size Design • Disadvantages – 3 multipliers
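A second-order expansion of the form $a + b(x - x_0) + c(x - x_0)^2$ can be sketched as below for the logistic sigmoid. The expansion-point spacing `STEP` and the function names are assumptions. In hardware the coefficients would come from a small table instead of being recomputed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def taylor_coefficients(x0):
    """Return (a, b, c) of the second-order expansion of the sigmoid around x0."""
    s = sigmoid(x0)
    a = s                                       # value at x0
    b = s * (1.0 - s)                           # first derivative at x0
    c = s * (1.0 - s) * (1.0 - 2.0 * s) / 2.0   # half the second derivative at x0
    return a, b, c

STEP = 0.5   # assumed spacing of expansion points

def sigmoid_taylor(x):
    """Evaluate y = a + b*d + c*d*d around the nearest expansion point x0."""
    x0 = round(x / STEP) * STEP
    a, b, c = taylor_coefficients(x0)       # would be a small table in hardware
    d = x - x0
    return a + b * d + c * (d * d)          # three multiplies: d*d, b*d, c*(d*d)
```

The three products in the return statement correspond to the three multipliers listed as the disadvantage. Accuracy improves as the expansion points are placed closer together, at the cost of a larger coefficient table.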

Taylor Series

Neural Network Design • Desired – 27-40-50-70-1200 Architecture – Maximum Parallel Design – Entirely On-Chip Design • Limitations – 92,000 16-bit weights in 144 RAMB16s – Layers are Serial – 144 18x18 Multipliers
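The figure of roughly 92,000 weights can be checked from the architecture. The sketch assumes one bias per non-input neuron, which the slides do not state, and uses the 1024 x 16-bit data capacity of a RAMB16 block.

```python
LAYER_SIZES = [27, 40, 50, 70, 1200]

# One weight per connection between consecutive layers.
connection_weights = sum(n_in * n_out
                         for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]))

# One bias per non-input neuron (assumed; the slides do not mention biases).
bias_weights = sum(LAYER_SIZES[1:])

total_weights = connection_weights + bias_weights

# A RAMB16 holds 16 Kbit of data, i.e. 1024 words of 16 bits.
WORDS_PER_RAMB16 = 1024
capacity = 144 * WORDS_PER_RAMB16

print(connection_weights, bias_weights, total_weights, capacity)
# prints: 90580 1360 91940 147456
```

Under that assumption the total is 91,940 weights, close to the quoted 92,000, and it fits in the 147,456 words available across 144 block RAMs.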

Neural Network Design • Initial Test Design – Serial Pipeline – One Multiply per Clock – 92,000 Clocks – 1 ms, equivalent to the PC

Test Output [Figure: FPGA output compared with real output]

Neural Network Design • Maximum Parallel Version – 71 Multiplies in Parallel – Zero weight padding – Treat all layers as the same length, 71 – 25 clock wait for Pipeline – Total 1475 clocks per Network Evaluation • 15 microseconds • 60,000 Network Evaluations per Second
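The quoted timings follow from the clock counts and the 100 MHz FPGA clock. The short calculation below reproduces them; the variable names are illustrative.

```python
CLOCK_HZ = 100e6                 # FPGA clock from the hardware slide

serial_clocks = 92_000           # initial test design: one multiply per clock
parallel_clocks = 1475           # maximum parallel version

serial_time = serial_clocks / CLOCK_HZ        # seconds per network evaluation
parallel_time = parallel_clocks / CLOCK_HZ

evaluations_per_second = 1.0 / parallel_time
speedup = serial_clocks / parallel_clocks

print(f"serial:   {serial_time * 1e3:.2f} ms")
print(f"parallel: {parallel_time * 1e6:.2f} us")
print(f"rate:     {evaluations_per_second:,.0f} per second")
print(f"speedup:  {speedup:.1f}x")
```

The serial design works out to 0.92 ms and the parallel design to 14.75 microseconds, or about 67,800 evaluations per second, in line with the rounded figures of 1 ms, 15 microseconds, and 60,000 per second on the slides.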

Neural Network Design

Particle Swarm Optimization • 2 Chips in SRC – Particle Swarm • Controls inputs • Sends to Fitness Chip • Receives a fitness back – Fitness Function • Calculates Network • Compares to Desired Output

Particle Swarm Implementation • Problem: randomness – $v_{k+1} = v_k + \mathrm{rand} \cdot w_1 (G_b - x_k) + \mathrm{rand} \cdot w_2 (P_b - x_k)$ • Solutions – Remove randomness • $v_{k+1} = v_k + w_1 (G_b - x_k) + w_2 (P_b - x_k)$ – Linear Feedback Shift Register – Squared Decimal Implementation
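A linear feedback shift register is a common way to produce pseudo-random bits in hardware with only a register and a few XOR gates. The sketch below uses a standard 16-bit Fibonacci LFSR with taps 16, 14, 13, 11. The register width, taps, and seed are assumptions, since the slides do not give the configuration used in the thesis.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one bit (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr16_stream(seed=0xACE1):
    """Yield successive 16-bit states; the seed must be non-zero."""
    state = seed
    while True:
        state = lfsr16_step(state)
        yield state

def lfsr_period(seed=0xACE1):
    """Count the steps until the register returns to its seed."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps
```

Dividing a state by 65536 gives a value in (0, 1) that could stand in for `rand` in the velocity update. With these taps the sequence visits every non-zero 16-bit state before repeating.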

Random vs. Deterministic – Blue Random – Green/Red

Linear Feedback Shift Register

Squared Decimal

Randomness Results • Standard Conventional Swarm Error – 1.9385 units per pixel • Deterministic Swarm Error – 2.3587 units per pixel • LFSR Swarm Error – 2.3522 units per pixel • Squared Decimal Error – 2.3694 units per pixel

Randomness Results • The gain from randomness is not significant. – Deterministic method used. • All much higher than conventional swarm – Approximated Network – Approximation Error between Networks • 1.423 units per pixel – Deterministic error on approximated network • 1.8055 units per pixel

Particle Swarm Chip • 10 Agents – Preset Starting Points and Velocities – 8 from Previous Data, Random Velocities – 1 at maximum range, aimed down – 1 at minimum range, aimed up • Restrictions – Maximum Velocity – Range

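Update Equation Implementation • $x_{k+1} = x_k + v_k$ • $v_{k+1} = v_k + w_1 (G_b - x_k) + w_2 (P_b - x_k)$ [Figure: block diagram of the update datapath. Inputs X, V, P, and G for each agent and dimension, with limits Xmax, Xmin, and Vmax. The datapath computes X+V, P-X, G-X, and V + 1/8(P-X) + 1/16(G-X); compare stages against the limits produce the new X and the new V.]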
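The deterministic update with the weights 1/8 and 1/16 shown in the diagram can be sketched with integer arithmetic, where the weights become right shifts. The function names, the symmetric velocity limit, and the example values are assumptions; only the weights and the presence of compare stages come from the slide.

```python
def clamp(value, low, high):
    """Model of a compare stage: limit value to the range [low, high]."""
    return max(low, min(high, value))

def update_dimension(x, v, p, g, x_min, x_max, v_max):
    """Update one dimension of one agent using integer (fixed-point) arithmetic.

    The weights 1/8 and 1/16 are applied as right shifts by 3 and 4 bits.
    """
    new_x = clamp(x + v, x_min, x_max)
    new_v = clamp(v + ((p - x) >> 3) + ((g - x) >> 4), -v_max, v_max)
    return new_x, new_v

# Illustrative raw values with 8 fractional bits (256 represents 1.0).
new_x, new_v = update_dimension(x=256, v=64, p=512, g=768,
                                x_min=0, x_max=1024, v_max=128)
```

Choosing power-of-two weights removes the multipliers from the update entirely, leaving only adders, shifters, and comparators.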

Results – Output Matching • 100k-iteration PSO -> 1.76 s [Figure: swarm output compared with real output]

Particle Swarm – Area Specific • 100k-iteration PSO -> 1.76 s

ANY QUESTIONS?