Algorithm Implementation in FPGAs Demonstrated Through Neural Network Inversion on the SRC-6E

MSECE Thesis Presentation
Paul D. Reynolds

Algorithm Hardware Implementation
• Algorithms
  – One Output for each Input
  – Purely Combinational
• Problems
  – Too Large to be Directly Implemented
  – Timing Issues
• Solution
  – Clocked Design
  – Repeated Use of Hardware

Implementation Hardware
• SRC-6E Reconfigurable Computer
  – 2 Pentium III Processors (1 GHz)
  – 2 Xilinx XC2V6000 FPGAs (100 MHz, 144 multipliers, 144 Block RAMs)
  – 6 Memory Blocks (4 MB each)

Hardware Architecture

SRC-6E Development Environment
• main.c
  – C
  – Executes on Pentium Processors
  – Command Line Interface
  – Hardware accessed as a Function
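
A minimal sketch of what that split looks like from the host side. The entry point nn_fpga and its argument list are hypothetical stand-ins for the generated hardware call, not actual Carte API names:

    /* Hypothetical host-side main.c: ordinary C on the Pentiums,
     * with the FPGA design invoked like a function. */
    #include <stdio.h>

    void nn_fpga(short *inputs, short *outputs, int n);  /* hypothetical */

    int main(void)
    {
        short in[27], out[15 * 80];
        /* ... read the 27 setup/environment inputs from the command line ... */
        nn_fpga(in, out, 27);          /* hardware accessed as a function */
        printf("first pixel: %d\n", out[0]);
        return 0;
    }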

SRC-6E Development Environment
• hardware.mc
  – Modified C or FORTRAN
  – Executes in Hardware
  – Controls Memory Transfer
  – One for each FPGA used
  – Can contain the entire design or call hardware description functions

Hardware Description: VHDL and Verilog
• Reasons for Use
  – To avoid C-compiler idiosyncrasies
    • Latency added to certain loops
    • 16-bit multiplies converted to 32-bit multiplies
  – More control
    • Fixed-point multiplication with truncation (see the sketch below)
    • Pipelines and parallel execution simpler
  – IP Cores Usable
    • More efficient implementation
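
What "multiplication with truncation" means, sketched in plain C; the HDL equivalent keeps the product inside one block multiplier. FRAC = 8 anticipates the format chosen later in the deck:

    /* 16-bit fixed-point multiply with truncation: the full 32-bit
     * product is shifted right by the fraction width and the low
     * bits are simply discarded (no rounding). */
    #include <stdint.h>

    #define FRAC 8                     /* fractional bits, chosen later */

    static int16_t fx_mul(int16_t a, int16_t b)
    {
        int32_t p = (int32_t)a * (int32_t)b;   /* one 18x18 multiplier */
        return (int16_t)(p >> FRAC);           /* truncate back to 16 bits */
    }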

Neural Network and Inversion Example

Problem Background
• Goal: determine the optimal sonar setup to maximize the ensonification of a grid of water
• Influences on ensonification:
  – Environmental Conditions: Temperature, Wind Speed
  – Bathymetry: Bottom Type, Shape of Bottom
  – Sonar System
  – 27 different factors accounted for in total

Ensonification Example
• 15 by 80 pixel grid
• Red: High signal-to-interference ratio
• Blue: Low signal-to-interference ratio
• Bottom: No signal

Original Solution
• Take current conditions
• Match to previous optimum sonar setups with similar conditions
• Run the acoustic model using current conditions and previous optimum setups
• Use the sonar setup with the highest signal-to-interference ratio

New Problem
• Problem: one acoustic model run took tens of seconds
• Solution: train a neural network on the acoustic model (APL, University of Washington)

Neural Network Overview
• Inspired by the human ability to recognize patterns
• A mathematical structure able to mimic a pattern
• Trained using known data:
  – Show the network several examples and identify each example
  – The network learns the pattern
  – Show the network a new case and let the network identify it

Neural Network Structure
• Each neuron is the squashed sum of the inputs to that neuron
• A squash is a non-linear function that restricts outputs to between 0 and 1
• Each arrow is a weight times a neuron output
(figure: layer diagram, inputs at the bottom feeding weight layers and neurons up to the outputs)
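
What one neuron computes, as a plain float C sketch; the sigmoid squash is confirmed later in the deck, and the bias term is an assumption:

    #include <math.h>

    /* Squash: non-linear, restricts outputs to between 0 and 1. */
    static double squash(double x) { return 1.0 / (1.0 + exp(-x)); }

    /* A neuron: the squashed sum of its weighted inputs (plus a bias). */
    static double neuron(const double *in, const double *w, int n, double bias)
    {
        double sum = bias;
        for (int i = 0; i < n; i++)
            sum += w[i] * in[i];          /* each arrow: weight * output */
        return squash(sum);
    }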

Ensonification Neural Network
• Taught using examples from the acoustic model
• Recognizes a pattern between the 27 given inputs and the 15 by 80 grid output
• 27-40-50-70-1200 Architecture
• Squash: the sigmoid, 1/(1 + e^-x)

Did the neural network solve the problem?
• Yes:
  – Neural network acoustic model approximation: 1 ms
• However, the method of locating the best setup is the same:
  – Run many possible setups through the neural network
  – Choose the best
• Problem: better, but still not real time

How to Find a Good Setup: Particle Swarm Optimization
• Idea: several particles wandering over a fitness surface
• Math
  – x_{k+1} = x_k + v_k
  – v_{k+1} = v_k + rand*w1*(Gb - x_k) + rand*w2*(Pb - x_k)
• Theory
  – Momentum pushes particles around the surface
  – Pulled towards the Personal Best
  – Pulled towards the Global Best
  – Eventually particles oscillate around the Global Best

Particle Swarm – Math
• Next Position = Current Position + Current Velocity
  x_{k+1} = x_k + v_k
• Next Velocity = Current Velocity + Global Pull + Personal Pull
  v_{k+1} = v_k + rand*w1*(Gb - x_k) + rand*w2*(Pb - x_k)
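
The same update in scalar C for one particle and one dimension; the weight values w1, w2 are parameters, and the position update uses the pre-update velocity as the equations state:

    #include <stdlib.h>

    static double frand(void) { return rand() / (RAND_MAX + 1.0); }

    /* One swarm step. gb/pb are the global and personal bests. */
    static void pso_step(double *x, double *v, double pb, double gb,
                         double w1, double w2)
    {
        double v_k = *v;                        /* v(k), used for x(k+1) */
        *v = v_k + frand() * w1 * (gb - *x)     /* pull toward global   */
                 + frand() * w2 * (pb - *x);    /* pull toward personal */
        *x = *x + v_k;                          /* x(k+1) = x(k) + v(k) */
    }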

Particle Swarm in Operation
(link to Particle Swarm QuickTime video)

Particle Swarm Optimization
• Swarm: the 27 inputs to the neural network (the sonar system setup)
• Fitness Surface: calculated from the neural network output
• Two Options
  – Match a desired output (see the sketch below)
    • Sum of the differences from the desired output
    • Minimize the difference
  – Maximize signal-to-interference ratio in an area
    • Ignore output in undesired locations
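
A sketch of the first option; the sum-of-absolute-differences form is an assumption, since the slide says only "sum of the difference":

    #include <math.h>

    #define GRID (15 * 80)               /* ensonification grid size */

    /* Fitness for "match a desired output": accumulated difference
     * from the desired grid; the swarm minimizes this value. */
    static double fitness_match(const double *out, const double *desired)
    {
        double err = 0.0;
        for (int i = 0; i < GRID; i++)
            err += fabs(out[i] - desired[i]);
        return err;
    }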

Particle Swarm in Operation
(link to Particle Swarm QuickTime video)

New Problem Enters
• Time for a 100k-step particle swarm on a 2.2 GHz Pentium: nearly 2 minutes
• A real-time version is desired
• Solution: implement the neural network and particle swarm optimization in parallel on reconfigurable hardware

Three Design Stages
• Activation Function Design
  – Sigmoid not efficient to calculate
• Neural Network Design
  – Parallel Design
• Particle Swarm Optimization
  – Hardware Implementation

Activation Function Design
• Fixed Point Design
• Sigmoid Accuracy Level
• Weight Accuracy Level

Fixed Point Design
• Versus Floating Point
  – Easier
  – Less Area
  – Faster
• Data Range of -50 to 85
  – 2's Complement
  – 1 sign bit
  – 7 integer bits
• Fractional Portion
  – Sigmoid outputs less than 1
  – Some number of fractional bits

Sigmoid Accuracy Level

Weight Accuracy Level

Total Accuracy

Fixed Point Results
• 16-bit Number
  – 1 Sign Bit
  – 7 Integer Bits
  – 8 Fractional Bits
• Advantages (see the sketch below)
  – Matches the 18x18 multipliers
  – Four values pack into each 64-bit input bank word
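
Both advantages in one sketch: a 16-bit operand fits the Virtex-II 18x18 block multipliers directly, and four weights pack into each 64-bit word of an onboard memory bank. The packing order is assumed:

    #include <stdint.h>

    /* Four 16-bit fixed-point values per 64-bit memory-bank word. */
    static uint64_t pack4(const int16_t w[4])
    {
        return  (uint64_t)(uint16_t)w[0]
             | ((uint64_t)(uint16_t)w[1] << 16)
             | ((uint64_t)(uint16_t)w[2] << 32)
             | ((uint64_t)(uint16_t)w[3] << 48);
    }

    static int16_t unpack(uint64_t word, int i)    /* i = 0..3 */
    {
        return (int16_t)(uint16_t)(word >> (16 * i));
    }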

Activation Function Approximation
• Compared 4 Designs
  – Look-up Table
  – Shift and Add
  – CORDIC
  – Taylor Series

Look-up Table
• Advantages
  – Unlimited Accuracy
  – Short Latency of 3
• Disadvantages
  – An entirely on-chip design is desired
  – Might use memory needed for the weights
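
A sketch of a full-range table indexed by the raw 16-bit fixed-point code. At 2^16 entries of 16 bits it occupies 128 KB, which is exactly the "might use memory needed for weights" concern on this chip:

    #include <math.h>
    #include <stdint.h>

    static uint16_t lut[1 << 16];        /* one entry per 16-bit input code */

    static void lut_init(void)
    {
        for (long i = 0; i < (1L << 16); i++) {
            double x = (int16_t)i / 256.0;                /* decode Q7.8 */
            lut[i] = (uint16_t)(256.0 / (1.0 + exp(-x))); /* encode Q7.8 */
        }
    }

    static uint16_t sigmoid_lut(int16_t x) { return lut[(uint16_t)x]; }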

Look-up Table

Shift and Add
• Y(x) = 2^-n * x + b
• Advantages
  – Small Design
  – Short Latency of 5
• Disadvantages
  – Piecewise Outputs
  – Limited Accuracy
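
A sketch of the idea using the published PLAN approximation's segment constants as a stand-in for the thesis's own table: each segment's slope is a power of two, so the multiply becomes a shift in hardware.

    /* Piecewise sigmoid with power-of-two slopes (PLAN constants).
     * In hardware, 0.25x, 0.125x, 0.03125x are shifts, not multiplies. */
    static double sigmoid_shift_add(double x)
    {
        double ax = x < 0 ? -x : x, y;
        if      (ax < 1.0)   y = 0.25 * ax + 0.5;
        else if (ax < 2.375) y = 0.125 * ax + 0.625;
        else if (ax < 5.0)   y = 0.03125 * ax + 0.84375;
        else                 y = 1.0;
        return x < 0 ? 1.0 - y : y;      /* sigmoid(-x) = 1 - sigmoid(x) */
    }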

Shift and Add

CORDIC
• Computation
  – Divide Argument by 2
  – Series of Rotations
    • Sinh(x)
    • Cosh(x)
  – Division for Tanh(x)
  – Shift and Add for Result
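
A float-level sketch of that flow. The halved argument comes from the standard identity sigmoid(x) = (1 + tanh(x/2))/2; the schedule repeats i = 4 and i = 13 as hyperbolic CORDIC requires, and converges only for |x/2| below roughly 1.1, so this is illustrative rather than a full-range design:

    #include <math.h>

    /* Hyperbolic CORDIC, rotation mode: drives the residual angle z
     * to zero while (x, y) converge to (cosh z0, sinh z0). */
    static double cordic_tanh(double z)
    {
        static const int sch[] = {1,2,3,4,4,5,6,7,8,9,10,11,12,13,13,14};
        const int n = sizeof sch / sizeof sch[0];
        double K = 1.0;                         /* total rotation gain */
        for (int j = 0; j < n; j++)
            K *= sqrt(1.0 - pow(2.0, -2.0 * sch[j]));

        double x = 1.0 / K, y = 0.0;
        for (int j = 0; j < n; j++) {
            double t = pow(2.0, -sch[j]);
            double d = (z >= 0.0) ? 1.0 : -1.0; /* rotate toward z = 0 */
            double xn = x + d * y * t;
            double yn = y + d * x * t;
            z -= d * atanh(t);
            x = xn; y = yn;
        }
        return y / x;                   /* tanh = sinh / cosh (division) */
    }

    /* Shift and add for the result: (1 + t) / 2 is an add and a shift. */
    static double sigmoid_cordic(double x)
    {
        return 0.5 * (1.0 + cordic_tanh(0.5 * x));
    }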

CORDIC
• Advantages
  – Unlimited Accuracy
  – Real Calculation
• Disadvantages
  – Long Latency of 50
  – Large Design

CORDIC

Taylor Series
• Y(x) = a + b(x - x0) + c(x - x0)^2
• Advantages
  – Unlimited Accuracy
• Average
  – Latency of 10
  – Medium Size Design
• Disadvantages
  – 3 multipliers
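
A sketch with segment centers every 0.5 (granularity assumed). In hardware, a, b, c would come from a small per-segment table; the products b*dx, dx*dx, and c*dx^2 account for the three multipliers on the slide:

    #include <math.h>

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

    /* Second-order approximation about the nearest segment center x0.
     * Coefficients are derived on the fly here; hardware stores them. */
    static double sigmoid_taylor(double x)
    {
        double x0 = floor(2.0 * x + 0.5) / 2.0;  /* nearest multiple of 0.5 */
        double s  = sigmoid(x0);
        double a  = s;                            /* f(x0)     */
        double b  = s * (1.0 - s);                /* f'(x0)    */
        double c  = 0.5 * b * (1.0 - 2.0 * s);    /* f''(x0)/2 */
        double dx = x - x0;
        return a + b * dx + c * dx * dx;          /* 3 multiplies */
    }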

Taylor Series

Neural Network Design
• Desired
  – 27-40-50-70-1200 Architecture
  – Maximum Parallel Design
  – Entirely On-Chip Design
• Limitations
  – 92,000 16-bit weights in 144 RAMB16s
  – Layers are Serial
  – 144 18x18 Multipliers
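
A quick check of that weight count for the 27-40-50-70-1200 architecture (bias terms assumed): 27*40 + 40*50 + 50*70 + 70*1200 = 1,080 + 2,000 + 3,500 + 84,000 = 90,580 weights, plus 40 + 50 + 70 + 1,200 = 1,360 biases, or roughly 92,000 values; at 16 bits each that is about 1.47 Mbit against the 144 RAMB16s' 144 * 18 Kbit = 2.59 Mbit.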

Neural Network Design
• Initial Test Design
  – Serial Pipeline
  – One Multiply per Clock
  – 92,000 Clocks (0.92 ms at 100 MHz)
  – ~1 ms, equivalent to the PC

Test Output
(figures: side-by-side comparisons of FPGA output and real output)

Neural Network Design
• Maximum Parallel Version
  – 71 Multiplies in Parallel
  – Zero-weight padding
  – Treat all layers as the same length, 71
  – 25-clock wait for the Pipeline
  – Total: 1475 clocks per Network Evaluation
    • 15 microseconds
    • 60,000 Network Evaluations per Second
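
Sanity check at the 100 MHz FPGA clock: 1,475 clocks * 10 ns = 14.75 µs per evaluation, i.e. up to about 67,000 evaluations per second from the pipeline itself; the quoted 60,000 per second is consistent with this once some control and transfer overhead is presumably allowed for.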

Neural Network Design

Particle Swarm Optimization
• 2 Chips in the SRC
  – Particle Swarm chip
    • Controls inputs
    • Sends to the Fitness Chip
    • Receives a fitness back
  – Fitness Function chip
    • Calculates the Network
    • Compares to the Desired Output

Particle Swarm Implementation
• Problem: randomness
  – v_{k+1} = v_k + rand*w1*(Gb - x_k) + rand*w2*(Pb - x_k)
• Solutions
  – Remove the randomness
    • v_{k+1} = v_k + w1*(Gb - x_k) + w2*(Pb - x_k)
  – Linear Feedback Shift Register (see the sketch below)
  – Squared Decimal Implementation
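
A sketch of the LFSR option. The 16-bit Galois form with tap mask 0xB400 (a standard maximal-length polynomial) is one plausible choice, not necessarily the thesis's:

    #include <stdint.h>

    /* 16-bit Galois LFSR: one shift and a conditional XOR per step,
     * cycling through all 2^16 - 1 nonzero states. */
    static uint16_t lfsr_step(uint16_t s)
    {
        return (uint16_t)((s >> 1) ^ ((s & 1u) ? 0xB400u : 0u));
    }

    /* Scale a state to [0, 1) to stand in for 'rand' in the update. */
    static double lfsr_rand(uint16_t s) { return s / 65536.0; }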

Random vs. Deterministic
(plot: blue = random swarm; green/red = deterministic runs)

Linear Feedback Shift Register

Squared Decimal

Randomness Results
• Standard Conventional Swarm Error: 1.9385 units per pixel
• Deterministic Swarm Error: 2.3587 units per pixel
• LFSR Swarm Error: 2.3522 units per pixel
• Squared Decimal Error: 2.3694 units per pixel

Randomness Results
• The gain from randomness is not significant
  – Deterministic method used
• All errors are much higher than the conventional swarm's
  – The hardware swarm runs on the approximated network
  – Approximation error between the networks: 1.423 units per pixel
  – Deterministic error measured on the approximated network: 1.8055 units per pixel

Particle Swarm Chip
• 10 Agents
  – Preset Starting Points and Velocities
    • 8 from Previous Data, Random Velocities
    • 1 at maximum range, aimed down
    • 1 at minimum range, aimed up
• Restrictions
  – Maximum Velocity
  – Range

Update Equation Implementation
(block diagram: per dimension, X + V produces the new X, clamped by a compare to the Xmin..Xmax range; V + 1/8*(P - X) + 1/16*(G - X) produces the new V, clamped by a compare to Vmax)
  x_{k+1} = x_k + v_k
  v_{k+1} = v_k + w1*(Gb - x_k) + w2*(Pb - x_k)
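
The same datapath as a fixed-point C sketch: the 1/8 and 1/16 pulls become arithmetic right shifts and the compares become clamps. Per the equations above, the pulls use the old position:

    #include <stdint.h>

    static int16_t clamp(int32_t v, int32_t lo, int32_t hi)
    {
        return (int16_t)(v < lo ? lo : (v > hi ? hi : v));
    }

    /* One dimension of the deterministic update; >> 3 and >> 4 realize
     * the 1/8 and 1/16 weights (arithmetic shift assumed on negatives). */
    static void update_dim(int16_t *x, int16_t *v, int16_t pb, int16_t gb,
                           int16_t xmin, int16_t xmax, int16_t vmax)
    {
        int32_t xn = (int32_t)*x + *v;                /* X + V          */
        int32_t vn = (int32_t)*v + ((pb - *x) >> 3)   /* 1/8 * (P - X)  */
                                 + ((gb - *x) >> 4);  /* 1/16 * (G - X) */
        *x = clamp(xn, xmin, xmax);                   /* compare: range */
        *v = clamp(vn, -vmax, vmax);                  /* compare: Vmax  */
    }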

Results – Output Matching
• 100k-iteration PSO -> 1.76 s
(figures: SWARM output vs. REAL output comparisons)

Particle Swarm – Area Specific
• 100k-iteration PSO -> 1.76 s
(figures: area-specific results)

ANY QUESTIONS?