Reconfigurable Computing: Current Status and Potential for Spacecraft Computing Systems
Rod Barto
NASA/GSFC Office of Logic Design, Spacecraft Digital Electronics
3312 Moonlight, El Paso, Texas 79904
127-MAPLD 2005
Reconfigurable Computing is…
• A design methodology by which computational components can be arranged in several ways to perform various computing tasks
• Two types of reconfigurable computing:
  – Static, i.e., the computing system is configured before launch
  – Dynamic, i.e., the computing system can be reconfigured after launch
Static Reconfigurability
• Several examples exist, e.g., Cray
• Typically processing modules connected by an intercommunication mechanism, e.g., Ethernet
• Goals are
  – To reduce system development costs
  – To provide higher-performance computing
Dynamic Reconfigurability (DR)
• Processing modules that can be reconfigured in flight
• Goal is to provide processing support for algorithms that do not map well onto general-purpose computers, using reduced amounts of hardware
Outline of Paper
1. Discuss the computation of a series of algorithms on general-purpose, special-purpose, and DR computers
2. Calculate the execution time of an image processing algorithm on a concept DR computer
3. Compare the reconfiguration time of a Xilinx FPGA with the algorithm execution time calculated in Section 2
4. Obtain an extremely rough estimate of image processing algorithm execution time on a flight computer
5. Conclude that the DR computer described offers higher performance than does the flight computer
Section 1: Algorithm Execution on General Purpose (GP), Special Purpose (SP), and DR Computers
Processing Example
Input → f1 → f2 → … → fn → Output
• A computing function is the composition of n algorithms executed serially
• Can be executed on a general-purpose computer (GP) or a special-purpose computer (SP)
Execution on a GP Computer
Input → f1 → f2 → … → fn → Output
• Processing time of each stage = t_i, i = 1..n
• Total processing time = latency time = t_1 + t_2 + … + t_n
• The GP computer must execute processing stages sequentially, and cannot exploit parallelism in the overall computing function
Processing on an SP Processor
Input → f1 → f2 → … → fn → Output
• Each stage is an independently operating processor designed specifically for the algorithm it executes
• Processing time of each stage = t_i, i = 1..n
• Results appear at a rate of one per max(t_i), i = 1..n
• Latency time = t_1 + t_2 + … + t_n
• Performance increase comes from two factors:
  – Pipelining of constituent algorithms, exploiting parallelism
  – Processors being designed specifically for their algorithms
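The GP and SP timing relations on the last two slides can be sketched in a few lines of Python (illustrative only; the stage times below are made-up values, not from the paper):

```python
def gp_total_time(t):
    # General-purpose computer: stages run sequentially,
    # so total time and latency are both the sum of the stage times.
    return sum(t)

def sp_rate_and_latency(t):
    # Special-purpose pipeline: once full, one result per max(t_i),
    # with latency equal to the sum of the stage times.
    return max(t), sum(t)

# Hypothetical stage times t_1..t_3 (arbitrary units)
t = [2, 5, 3]
print(gp_total_time(t))        # sequential total: 10
print(sp_rate_and_latency(t))  # (result interval, latency): (5, 10)
```

The pipeline's advantage shows up in throughput: the GP machine delivers one result per 10 units, the SP pipeline one per 5.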
Processing on a DR Computer
Input → fodd / feven → Output
• Two processing elements alternately process and reconfigure, i.e., fodd executes one algorithm while feven reconfigures for the next algorithm, etc.
DR Computer Processing Flow
fodd:  f1 | R | f3 | R | … | fn | R
feven:  R | f2 | R | f4 | …        (time →; R = reconfiguration)
• Results appear at a rate of one per t_1 + t_2 + … + t_n (assuming each reconfiguration is hidden behind the other PE's execution)
• Latency = t_1 + t_2 + … + t_n
• Performance increase comes from configuring processors specifically for the algorithm they are executing
• Do not get an increase from exploiting parallelism
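A minimal event model of the alternating execute/reconfigure flow above (a sketch with made-up times, not from the slides; fodd is assumed pre-configured for f1, matching the diagram):

```python
def dr_finish_time(t, R):
    # t[i]: execution time of algorithm f_(i+1); R: reconfiguration time.
    # ready[pe]: time at which each PE is configured for its next algorithm.
    ready = [0.0, R]   # fodd pre-configured; feven reconfigures during f1
    finish = 0.0       # time the previous algorithm's output is available
    for i, ti in enumerate(t):
        pe = i % 2
        start = max(finish, ready[pe])   # need prior result AND a ready PE
        finish = start + ti
        ready[pe] = finish + R           # PE then reconfigures for its next algorithm
    return finish

print(dr_finish_time([3, 3, 3, 3], R=1))  # R hidden: total = sum(t) = 12.0
print(dr_finish_time([3, 3, 3, 3], R=5))  # R > t_i: reconfiguration leaks through
```

When R is shorter than each execution time, reconfiguration is fully hidden and the total matches the sum of the t_i; when R dominates, stalls appear, which is exactly the comparison Section 3 makes for the convolution example.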
Section 2: Execution Time of an Image Processing Algorithm on a Concept DR Computer
DR Computer Concept
RAM 0 → FPGA 0 → RAM 1 → FPGA 1 → RAM 0
• RAM 0 is source for FPGA 0, destination for FPGA 1, etc.
• Processing elements are implemented in FPGAs
• FPGA 0 and FPGA 1 alternately process and reconfigure, as previously discussed
• Input and output not shown
Algorithm Example: 3 x 3 Image Convolution
[Figure: rows move pixel-serially from the source RAM through row registers (image width in pixels) feeding a 3 x 3 convolution processor; rows are parallel-shifted up through the register stack, and results are written to the destination RAM.]
• Shifting in one row at a time pixel-serially, and parallel-shifting into the upper three row registers, the rows are shifted around through the convolution processor
• All the row registers and processing are inside the FPGA
• The results are written to the destination RAM after a latency of three row reads
Convolution Operation
Pixel array (3 x 3 neighborhood of row i, column j):
  P(i-1, j-1)  P(i-1, j)  P(i-1, j+1)
  P(i,   j-1)  P(i,   j)  P(i,   j+1)
  P(i+1, j-1)  P(i+1, j)  P(i+1, j+1)
Convolution mask:
  m11  m12  m13
  m21  m22  m23
  m31  m32  m33
Used, for example, to compute the intensity gradient (derivative) at pixel (i, j)
Result = P(i-1, j-1)*m11 + P(i-1, j)*m12 + P(i-1, j+1)*m13 + … + P(i+1, j+1)*m33
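The formula above can be written directly as a software reference model (not the FPGA implementation); this pure-Python sketch skips border pixels for simplicity:

```python
def convolve3x3(image, mask):
    # Reference model of the slide's formula:
    # Result(i, j) = sum over di, dj in {-1, 0, 1} of
    #               P(i+di, j+dj) * m(di+2, dj+2)
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            acc = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    acc += image[i + di][j + dj] * mask[di + 1][dj + 1]
            out[i][j] = acc
    return out

# A mask with only m22 = 1 reproduces the center pixel
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(convolve3x3(img, identity)[1][1])  # 5
```

A gradient mask (e.g., a Sobel kernel) drops into `mask` unchanged, which is the "intensity gradient" use the slide mentions.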
Convolution Calculation
• Each neighborhood pixel P(i+di, j+dj) is multiplied by its corresponding mask coefficient m(di+2, dj+2), and the nine products are summed to form Result(i, j)
• Arithmetic processing may require some pipelining
Convolution Timing
• Total time = latency + processing = 20.971 msec
  – This assumes we can get pixels into the FPGA at a 20 nsec/pixel rate
  – Latency = time to read 3 rows: 1024 pixels * 3 rows * 20 nsec/pixel = 61 usec
  – Processing = time to stream remaining 1021 rows through and process: 1024 * 1021 * 20 nsec = 20.910 msec
• Larger convolutions (e.g., 7 x 7) have longer latencies, but the same computation time
• Calculation is for a mono image; a stereo image would take twice as long
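The timing arithmetic on this slide can be reproduced from its stated assumptions (1024 x 1024 image, 20 nsec/pixel into the FPGA, 3-row latency):

```python
PIXEL_NS = 20          # assumed pixel input rate, nsec/pixel
WIDTH = HEIGHT = 1024  # assumed image dimensions, pixels
K = 3                  # 3 x 3 convolution

latency_ns = WIDTH * K * PIXEL_NS                 # read first 3 rows
processing_ns = WIDTH * (HEIGHT - K) * PIXEL_NS   # stream remaining 1021 rows
total_ms = (latency_ns + processing_ns) / 1e6

print(latency_ns / 1e3, "usec")  # 61.44 usec (~61 usec on the slide)
print(total_ms, "msec")          # ~20.97 msec, matching the slide's 20.971
```

Note that the K-row latency term is what grows for larger masks (7 x 7 reads 7 rows first), while the streaming term is essentially unchanged, which is the slide's point about larger convolutions.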
Section 3: Comparing the Reconfiguration Time of a Xilinx FPGA With the Algorithm Execution Time Calculated in Section 2
DR Computer Processing Element: Virtex-4 LX FPGA
• Eight versions: XC4VLX15, -25, -40, -60, -80, -100, -160, -200
• Logic hierarchically arranged:
  – 2 flip-flops per slice
  – 4 slices per CLB

  Device   CLBs        FFs
  -15      64 x 24     12,288
  -200     192 x 116   178,176
Time to Configure FPGA
• FPGA configuration sequence (signals: PROG_B, INIT_B, CCLK, DONE)
• Total configuration time = Tpl + Tconfig
Configuration Timing: Tpl
• Tpl = 0.5 usec/frame
• A "frame" is a unit of configuration RAM
• The Tpl period clears configuration RAM

  Device   Frames   Tpl
  -15      3740     1.87 msec
  -200     40108    20.1 msec
Configuration Timing: Tconfig
• FPGA programmed by bitstream
• CCLK (programming clock) can run at 100 MHz
• Parallel mode loads 8 bits per CCLK

  Device   Bitstream length   Parallel loads   Tconfig
  -15      4,765,138          595,648          5.956 msec
  -200     48,722,432         6,090,304        60.903 msec
Total Configuration Time

  Device   Tpl, msec   Tconfig, msec   Total configuration time, msec
  -15      1.87        5.956           7.826
  -200     20.054      60.903         80.957

• Plus some extra time amounting to a few CCLK cycles (@ 10 nsec each)
Processing and Reconfiguration Time Comparison
• Convolution execution is faster than reconfiguration
  – Convolution = 21 msec mono, 42 msec stereo
  – Reconfiguration = 81 msec
  – Assuming -200 device
• Processing shown is well within the FPGA's capabilities
• More complex algorithms may require use of FPGA performance features
  – Much higher internal clock rates
  – Large internal RAM
  – Dedicated arithmetic support in the -SX series
• What this shows is that it's reasonable to consider alternating execution and reconfiguration of two FPGAs
Section 4: An Extremely Rough Estimate of Image Processing Algorithm Execution Time on a Flight Computer
GP Computing Performance Estimate
• DANGER: really rough estimate!
• Based on data from this paper:
  – "Stereo Vision and Rover Navigation Software for Planetary Exploration", Steven B. Goldberg, Indelible Systems; Mark Maimone, Larry Matthies, JPL; 2002 IEEE Aerospace Conference
  – Available at robotics.jpl.nasa.gov/people/mwm/visnavsw/aero.pdf
  – Describes processing and algorithms to be used on 2004 Rover missions, and Rover requirements
Published Vision Algorithm Timing
• Timed on a Pentium III 700 MHz CPU, 32K L1 cache, 256K L2 cache, 512M RAM, Win2K
• Algorithms explicitly timed (names from paper):

  Algorithm                             C code      Vector
  Difference of Gaussian and Decimate   4229 msec   2047 msec
  Prepare Next Row                      5559 msec   1163 msec
  Inner Loop                            4360 msec   2928 msec
  Compute Sub-Pixel                     667 msec    871 msec

• The Gaussian and most vision algorithms involve neighborhood operations that are comparable to an image convolution of some size
Flight Computer Performance
• Flight processor is the RAD6000
• GESTALT navigation algorithm timed on 3 processors:

  Processor                     Execution Time Multiplier
  Pentium III, 500 MHz, Linux   1
  Sparc 300 MHz, Solaris        3.0-3.5
  RAD6000, 20 MHz, VxWorks      7.7-8.7

• Assume that the RAD6000 takes 7 times as long as the 500 MHz Pentium
Final Performance Estimate
• Assume RAD6000 time = 7 times the 500 MHz Pentium time
• Assume 500 MHz Pentium time = 7/5 = 1.4 times the 700 MHz Pentium time
• Then, RAD6000 time is 1.4 * 7 = 9.8 times the 700 MHz Pentium time
• Vision algorithm timing can be estimated as follows:

  Algorithm                             Fastest time   RAD6000 time
  Difference of Gaussian and Decimate   2047 msec      20,060 msec
  Prepare Next Row                      1163 msec      11,397 msec
  Inner Loop                            2928 msec      28,694 msec
  Compute Sub-Pixel                     667 msec       6536 msec

• Remember: This is a really rough estimate!!
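The scaling on this slide is just one multiplier applied to the fastest published 700 MHz Pentium times; a sketch (times in msec, from the table above):

```python
SCALE = 7 * (700 / 500)  # RAD6000 vs. 700 MHz Pentium: 9.8x

# Fastest published times on the 700 MHz Pentium III, msec
fastest_ms = {
    "Difference of Gaussian and Decimate": 2047,
    "Prepare Next Row": 1163,
    "Inner Loop": 2928,
    "Compute Sub-Pixel": 667,
}
rad6000_ms = {name: t * SCALE for name, t in fastest_ms.items()}

for name, t in rad6000_ms.items():
    print(f"{name}: {t:.0f} msec")
```

Note this treats performance as purely clock-proportional (hence "really rough"): cache sizes, memory systems, and instruction sets all differ between the processors.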
Section 5: Conclusions
What We Have Shown
• We have shown that the concept DR computer presented executes a 3 x 3 neighborhood-type algorithm "a lot" faster than it appears that a RAD6000 executes what are probably a bunch of neighborhood algorithms.
• The reader is cautioned not to try to quantify what "a lot" means based on the data given here.
• But it's a good enough estimate to tell us that this is worth looking into in more detail.
Conclusions
• A Xilinx-based DR computer shows promise for performance enhancement of a vision system
• By extension, the DR computer shows promise for the performance enhancement of other algorithms