
A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing
Mike Butts
mike@ambric.com

Ambric Introduction
- Fabless semiconductor company
- Founded in 2003 in Beaverton, Oregon
- Veteran team: 60+ employees and growing
- Production silicon: August 2007
- Product releases (chip, IDE, applications, board): January 2008
Copyright © 2003-2007 Ambric, Inc.

Ambric Objectives
- Maximum possible performance and performance/watt for embedded and accelerated applications
  - streaming media, image processing, networking, software radio
  - superior to FPGAs, DSPs, multicores, even approaching ASICs
- Reasonable and reliable application development
  - write software not hardware, with reliable reuse
- Hardware and software scalability to track Moore’s Law
  - future silicon processes
  - development productivity

Outline
- Structural Object Programming Model
- Architecture
- Chip
- Tools
- Applications
- University Program

What’s in a Good Programming Model?
- What should be in a programming model?
  - What is familiar, productive, scalable to any size and speed?
- Software languages (C, Java, ...) are familiar and productive
  - Array of sequential processors

    for (int c = min; c <= max; c += 2*inc) {
      int fac;
      for (fac = 3; fac <= max; fac += 2) {
        if (c % fac == 0) break;
      }
      if (c == fac) {
        primes.write(c);
      } else primes.write(0);
    }

- Block diagrams are familiar, scalable, encapsulated and hierarchical
  - Reconfigurable interconnect
  [Diagram: pg1-pg4, fifo1-fifo4, join1-join3, pl, Out]
- Strict encapsulation and hierarchy with standard interfaces enable strong design reuse, which is necessary for scalable development cost.

Structural Object Programming Model
- Objects are software programs running concurrently on an asynchronous array of Ambric processors and memories
- Objects exchange data and control through a structure of self-synchronizing asynchronous Ambric channels
- Objects are mixed and matched hierarchically to create new objects, snapped together through a simple common interface
- Easier development, high performance and scalability
[Diagram: an application built from objects 1-7; labels: asynchronous Ambric channel, leaf object running on an Ambric processor, composite object]

Ambric Channels
[Diagram: two objects linked by a chain of registers, with v (valid) signals running downstream and a (accept) signals running upstream between stages]
- Chains of Ambric registers form Ambric channels
  - Word-wide, unidirectional, point-to-point, strictly ordered
  - Objects throttle their channels with the inter-stage Ambric protocol
    * v (valid) downstream, a (accept) upstream
  - Fully encapsulated, fully scalable for control and data between objects
- Objects linked through channels are asynchronous to each other
  - Each operates when it can, on its own, according to its channels
  - Objects are synchronized with one another only through channels
- Globally Asynchronous Locally Synchronous (GALS) clocking
  - Physically scalable, no low-skew long wires
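The valid/accept handshake described on this slide can be sketched in plain Java. This is only an illustrative software model under stated assumptions: the class names `RegisterStage` and `ChannelModel`, the one-word-per-stage buffering, and the `step()` time step are hypothetical and are not taken from Ambric's hardware or tools.

```java
/** One register stage holding at most one word (hypothetical model, not Ambric RTL). */
class RegisterStage {
    private Integer word;                       // null means the stage is empty

    boolean valid()  { return word != null; }   // "v": a word is offered downstream
    boolean accept() { return word == null; }   // "a": the stage can take a word from upstream

    void put(int w) {
        if (!accept()) throw new IllegalStateException("stage full");
        word = w;
    }

    int take() {
        if (!valid()) throw new IllegalStateException("stage empty");
        int w = word;
        word = null;
        return w;
    }
}

/** A channel modeled as a chain of register stages; word order is preserved. */
class ChannelModel {
    private final RegisterStage[] stages;

    ChannelModel(int length) {
        stages = new RegisterStage[length];
        for (int i = 0; i < length; i++) stages[i] = new RegisterStage();
    }

    /** Sender side: returns false (the sender must stall) if the first stage is not accepting. */
    boolean send(int w) {
        if (!stages[0].accept()) return false;
        stages[0].put(w);
        return true;
    }

    /** Receiver side: returns null (the receiver must stall) if the last stage is not valid. */
    Integer receive() {
        RegisterStage last = stages[stages.length - 1];
        return last.valid() ? last.take() : null;
    }

    /** One time step: each word advances by one stage if the next stage accepts it. */
    void step() {
        // Walk from the downstream end so that a word moves at most one stage per step.
        for (int i = stages.length - 1; i > 0; i--) {
            if (stages[i].accept() && stages[i - 1].valid()) {
                stages[i].put(stages[i - 1].take());
            }
        }
    }
}
```

In this model neither end needs to know the channel length: the sender only sees "accept" at the first stage and the receiver only sees "valid" at the last one.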

SOPM Realized in Silicon
[Diagram: Object (CPU, memory, ...) -> Channel -> Object (CPU, memory, ...); sending side synch: stall if channel’s not accepting; receiving side synch: stall if channel’s not valid]
- Objects exchange data and control through a structure of Ambric channels
  - Each stage has forward and backward flow control, and buffering
- Standard interface between objects
  - Encapsulation, reuse
- Self-synchronizing on each transfer
- Asynchronous system
  - Channels can be any length or speed: no scheduling, no timing closure
  - Easier on tools, easier to program, easier to debug, reliable

Model of Computation: not quite CSP
[Figure: slides from “A Brief Tutorial on Models of Computation”, The MESCAL Team, UC Berkeley, Fall 2001]
- Ambric MoC was inspired by CSP, but is not quite CSP.
  - Message passing is buffered, not strictly synchronous.

Model of Computation: Process Network
[Figure: slides from “A Brief Tutorial on Models of Computation”, The MESCAL Team, UC Berkeley, Fall 2001]
- Ambric MoC is a Process Network with bounded FIFOs.
  - FIFO-like primitive register, streaming RAMs for bigger FIFOs.
  - Channels carry data and control, and strictly preserve sequence.
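A process network with bounded FIFOs can be approximated on an ordinary JVM with threads and `java.util.concurrent.ArrayBlockingQueue`. The sketch below is an analogy, not Ambric code: the class name `BoundedProcessNetwork`, the `END` marker, and the doubling stage are hypothetical choices made for illustration. A full queue blocks the writer and an empty queue blocks the reader, which mirrors the stalling behavior of a bounded channel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedProcessNetwork {
    static final int END = Integer.MIN_VALUE;   // hypothetical end-of-stream marker

    /** Runs source -> doubler -> sink over two bounded FIFOs and returns what the sink saw. */
    static List<Integer> run(int count, int fifoDepth) {
        BlockingQueue<Integer> a = new ArrayBlockingQueue<>(fifoDepth);
        BlockingQueue<Integer> b = new ArrayBlockingQueue<>(fifoDepth);

        Thread source = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) a.put(i);   // blocks while the FIFO is full
                a.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread doubler = new Thread(() -> {
            try {
                for (int w = a.take(); w != END; w = a.take()) b.put(2 * w);   // take() blocks while empty
                b.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        doubler.start();

        List<Integer> out = new ArrayList<>();
        try {
            for (int w = b.take(); w != END; w = b.take()) out.add(w);
            source.join();
            doubler.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return out;
    }
}
```

Because each queue is first-in, first-out with one writer and one reader, the output order does not depend on thread scheduling or on the FIFO depth; only throughput would.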

Traditional vs. Ambric Processors
- Traditional processor architecture
  - Primary: register-memory hierarchy
  - Secondary: communication
  [Diagram: Regs, ALU, RAM, I/O]
- Ambric processor architecture
  - Primary: communicate through channels
  - All data goes through channels
    * Memory
    * Registers
    * Inter-processor streams
    * Instruction streams to reduce local storage
  - Channels synchronize all events
  [Diagram: Channels, RAM, ALU, Regs]


Ambric SR Processor
- Simple 32-bit Streaming RISC
- Mainly for fast small utility objects:
  - complex addressing, complex fork/join, pack/unpack, serialize/deserialize
- Ambric channels
  - 1 input, 1 output per instruction
  - Instruction fields select inputs, outputs just like selecting registers
- One ALU: 32b or dual 16b ops
- 8 general registers
- 16-bit instructions for code density
  - Zero-overhead looping
- 64-word local code/data RAM
  - 128 instructions
- Three-stage Ambric channel datapath
[Diagram: input channels from interconnect; ALU (adder, logical, mux); 64-word RAM; R0, R1-7, PC, IR; output channels to interconnect]
[16-bit instruction formats: ALU op, mem r/w, branch, loopstart; fields include opcode, src1, src2, condition, count, out, dest, address, offset]

Ambric SRD Processor
- Streaming RISC with DSP extensions
- 32-bit instructions
  - 256 in local RAM, more from RU
  - Zero-overhead looping
- Multiple ALUs capture instruction-level parallelism
  - 2 ALUs in series, a parallel third, with individual instruction fields.
  - For stream processing, superior code density to VLIW
- 3 ALUs
  - 32b, dual 16b, quad 8b ops
  - 3rd ALU alongside is iterative, pipelined, for MAC, SAD.
  - One 32b*8b or two 16b*8b per cycle, 64-bit accumulator, rounding
- RU read-write channels
- 3-stage Ambric channel datapath
[Diagram: input channels from interconnect; ALUS (adder, shifter, permute, logical), ALUM and ALUE (labels: mux, permute, logical, mul, +, abs+/-, msb, 64, rnd, prev); R1-20; 256-word RAM; PC, IR; RU R/W port; RU instruction port; output channels to interconnect]
[32-bit instruction formats: ALUS op, ALUM op, ALUE op, mem r/w, branch, loopstart; fields include base, src1, src2, condition, out, dest, address, offset, count]

Ambric Compute Unit, RAM Unit
Compute Unit (CU)
- Two SRD 32-bit CPUs
- Two SR 32-bit CPUs
- Channel interconnect
  - CPU-CPU is dynamic under instruction control
  - CU-Neighbor, CU-Distant are statically configured
RAM Unit (RU)
- Four 2 KB RAM banks
- RU engines turn RAM regions into channels
  - FIFO and random access
- Engines dynamically connect to banks through channels with arbitration
[Diagram: CU with two SR CPUs and two SRD CPUs, each with local RAM, on a configurable and dynamic interconnect of Ambric channels with links to neighbor CU, distant, and other RU; RU with Inst, RW and str engines on a dynamically arbitrating interconnect to 2 KB RAM banks, with links to the other CU; every link shown is an Ambric channel]
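One way to picture an RU engine turning a RAM region into a FIFO channel is a ring buffer laid over a slice of a shared bank. The sketch below is a software analogy under assumptions: the class name `RamFifoEngine`, its method names, and the representation of a 2 KB bank as 512 32-bit words in an `int[]` are hypothetical and are not drawn from Ambric documentation.

```java
/** A FIFO laid over the region [base, base + size) of a shared RAM bank (hypothetical model). */
class RamFifoEngine {
    private final int[] ram;   // the bank, e.g. 512 words for a 2 KB bank of 32-bit words
    private final int base;    // first word of the region
    private final int size;    // region length in words
    private int head;          // offset of the oldest word within the region
    private int count;         // number of words currently stored

    RamFifoEngine(int[] ram, int base, int size) {
        if (size <= 0 || base < 0 || base + size > ram.length) {
            throw new IllegalArgumentException("region does not fit in the bank");
        }
        this.ram = ram;
        this.base = base;
        this.size = size;
    }

    boolean accept() { return count < size; }   // room for another word from upstream
    boolean valid()  { return count > 0; }      // a word is available for downstream

    /** Returns false when the region is full, so the writer would have to stall. */
    boolean write(int w) {
        if (!accept()) return false;
        ram[base + (head + count) % size] = w;
        count++;
        return true;
    }

    /** Returns null when the region is empty, so the reader would have to stall. */
    Integer read() {
        if (!valid()) return null;
        int w = ram[base + head];
        head = (head + 1) % size;
        count--;
        return w;
    }
}
```

Several such engines could share one `int[]` bank by using disjoint regions, which is the sense in which a region, rather than a whole bank, becomes a channel in this sketch.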

Bric and Interconnect
- Bric is the silicon building block
  - Two Compute Units (CUs)
  - Two RAM Units (RUs)
- Core is an array of brics
- Hierarchical Interconnect
  - Neighbor CU-CU channels
  - Switched inter-bric network
  - No wires longer than a bric
[Diagram: bric layout with CU (SR, SRD) and RU (RAM) blocks]

Neighbor Channels
- Neighbor channels connect neighbor CUs N/S/E/W
  - 1 channel each way
- Each channel is 32 bits wide @ up to 9.6 Gbps
[Diagram: array of brics with neighbor channels between adjacent CUs]
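The 9.6 Gbps figure is consistent with one 32-bit word moving per clock at the 300 MHz rate quoted on the Performance Metrics slide. The one-word-per-cycle transfer rate is an assumption made here for the check, not something this slide states:

```latex
\[
32\ \text{bits/word} \times 300 \times 10^{6}\ \text{words/s}
  = 9.6 \times 10^{9}\ \text{bits/s} = 9.6\ \text{Gbps}
\]
```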

Distant Channels
- Configurable network for longer routes
  - one bric per hop
- Each bric has a switch
  - connected to CUs
- Switches are interconnected by bric-long channels
  - four channels each way
  - up to 9.6 Gbps
- Total interconnect bisection bandwidth is 792 Gbps
[Diagram: array of brics with a switch in each bric and distant channels between switches]


Am2045 Chip
- 130 nm standard-cell ASIC
- 180 million transistors
  - 45 brics, 1.03 teraOPS
  - 336 32-bit processors
  - 7.1 Mbits dist. SRAM
  - 8 µ-engine VLIW accelerators
- High-bandwidth I/O
  - PCI Express
  - DDR2-400 x 2
  - 128 bits GPIO
  - Serial flash, JTAG, µP I/O
- Package
  - 31 x 31 mm
  - 896 balls
  - Flip-chip
[Diagram: chip floorplan with bric array (Compute Units, RAM Units with 2K RAMs), µEng., DDR2 controller, PCIe, Flash/UPI, JTAG and GPIO blocks]

Performance Metrics
- Am2045 @ 300 MHz:
  - 1.03 trillion operations per second (8-bit, 16-bit Sum of Abs. Diff.)
    * 60 GMACS (16x16, 32-bit sum)
  - 792 Gbps interconnect bisection bandwidth
  - 26 Gbps DRAM + 16 Gbps high-speed serial + 13 Gbps parallel

Kernel               Instances @ Rate Each              Notes
32-tap FIR filters   168 @ 4.7 Msps ... 5 @ 223 Msps    16-bit data
Dot Product          168 @ 200 Msps                     16 bits in, 32-bit sum
Maximum Value        168 @ 343 Msps                     n=100, 16 bits, 2 wide
Saturators           168 @ 600 Msps                     signed 16 to unsigned 8, 2 wide
Viterbi ACS          336 @ 600 Mbps                     16-bit
1K point FFT         84 @ 8.8 Msps                      complex 16-bit radix-2
AES                  56 @ 181 Mbps                      feedback modes
AES                  7 @ 1.1 Gbps                       non-feedback modes

Ambric Development Boards
- Am2045 Software Development Board
  - 1 production Am2045 + SDRAM
  - PCI Express interface to host
  - For rapid software development and application acceleration
- Am2045 Integrated Development Board
  - 1 production Am2045 + SDRAM
  - 4 32-bit GPIO connectors, USB
  - Stand-alone capable on the benchtop, or in a PCIe slot with PC cover off
  - Serial Flash, power connector
  - For embedded development


Ambric Tool Chain
- Eclipse IDE (Integrated Design Env.)
  - All tools in the open IDE
- Structure
  - Conceive your application as a structure of objects and the messages they exchange
  - Divide-and-conquer using hierarchy
- Reuse
  - Encapsulated library objects
- Code and Test
  - Write your new objects in Java or Assembler
  - Verify with functional simulation
- Realize on HW
  - Compile each object separately
  - Run mapper-router, configure chip
  - Debug, profile and tune performance
[Flow diagram: Library, Simulate, Compile Each, Map & Route, Debug and Tune on HW]

Structural Programming in aStruct
[Diagram: pg1-pg4 (primes) -> fifo1-fifo4 (in, out) -> join1-join3 (l, r, out) -> pl (ins, outs) -> primesOut]
- Graphical or textual entry
  - menus or text
- (Not shown) definition of channel interfaces, object parameters
- Hierarchical, modular

    binding PrimeMakerImpl implements PrimeMaker {
      PrimeGen pg1 = {min = 3, increment = 4, max = IPrimeMaker.max};
      PrimeGen pg2 = {min = 5, increment = 4, max = IPrimeMaker.max};
      PrimeGen pg3 = {min = 7, increment = 4, max = IPrimeMaker.max};
      PrimeGen pg4 = {min = 9, increment = 4, max = IPrimeMaker.max};
      Fifo fifo1 = {max_size = fifoSize};
      Fifo fifo2 = {max_size = fifoSize};
      Fifo fifo3 = {max_size = fifoSize};
      Fifo fifo4 = {max_size = fifoSize};
      AltWordJoin join1;
      AltWordJoin join2;
      AltWordJoin join3;
      PrimeList pl;
      channel
        c0  = {pg1.primes, fifo1.in},
        c1  = {pg2.primes, fifo2.in},
        c2  = {pg3.primes, fifo3.in},
        c3  = {pg4.primes, fifo4.in},
        c4  = {fifo1.out, join1.l},
        c5  = {fifo2.out, join1.r},
        c6  = {join1.out, join2.l},
        c7  = {fifo3.out, join2.r},
        c8  = {join2.out, join3.l},
        c9  = {fifo4.out, join3.r},
        c10 = {join3.out, pl.ins},
        c11 = {pl.outs, primesOut};
    }
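The binding is essentially a netlist of point-to-point channels. As a rough illustration, the twelve channels can be transcribed into plain Java data and checked for the point-to-point property, meaning no port is used by more than one channel. The class name `PrimeMakerNetlist` and the checker are hypothetical; port names follow the instance names declared in the binding, and nothing here reflects how the aStruct tools represent a design internally.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PrimeMakerNetlist {
    /** Each entry is {source port, sink port}, transcribed from the binding on the slide. */
    static final List<String[]> CHANNELS = List.of(
        new String[] {"pg1.primes", "fifo1.in"},
        new String[] {"pg2.primes", "fifo2.in"},
        new String[] {"pg3.primes", "fifo3.in"},
        new String[] {"pg4.primes", "fifo4.in"},
        new String[] {"fifo1.out", "join1.l"},
        new String[] {"fifo2.out", "join1.r"},
        new String[] {"join1.out", "join2.l"},
        new String[] {"fifo3.out", "join2.r"},
        new String[] {"join2.out", "join3.l"},
        new String[] {"fifo4.out", "join3.r"},
        new String[] {"join3.out", "pl.ins"},
        new String[] {"pl.outs", "primesOut"}
    );

    /** True when every port appears in at most one channel, i.e. channels are point-to-point. */
    static boolean isPointToPoint(List<String[]> channels) {
        Set<String> seen = new HashSet<>();
        for (String[] channel : channels) {
            for (String port : channel) {
                if (!seen.add(port)) return false;   // port already used by another channel
            }
        }
        return true;
    }
}
```

A check like this is the kind of structural property that can be verified from the netlist alone, before any object code is compiled or simulated.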

Standard Language for Objects
[Diagram: pg1-pg4 (primes) -> fifo1-fifo4 (in, out) -> join1-join3 (l, r, out) -> pl (ins, outs) -> primesOut]
- Program primitive objects in Java
  - Strict subset of standard Java
    * static memory
  - Classes define the channels
- Language-agnostic
  - C, etc. to follow

    public void PrimeGen(OutputStream<Integer> primes) {
      for (int candidate = min; candidate <= max; candidate += 2*increment) {
        int factor;
        for (factor = 3; factor <= max; factor += 2) {
          if (candidate % factor == 0) break;
        }
        if (candidate == factor) {      // is prime
          primes.write(candidate);      // write out
        } else primes.write(0);
      }
    }
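The PrimeGen loop can be tried on a desktop JVM with a stand-in for the output channel. In the sketch below, `java.util.function.IntConsumer` replaces the slide's `OutputStream<Integer>` channel type, and `min`, `increment` and `max` are passed as arguments instead of being object parameters. Both substitutions, along with the class name `PrimeGenSketch`, are assumptions made so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

class PrimeGenSketch {
    /** Mirrors the slide's loop; `out` stands in for the channel's write(). */
    static void primeGen(int min, int increment, int max, IntConsumer out) {
        for (int candidate = min; candidate <= max; candidate += 2 * increment) {
            int factor;
            // Trial division by odd numbers; stops at the first divisor found.
            for (factor = 3; factor <= max; factor += 2) {
                if (candidate % factor == 0) break;
            }
            // The first divisor is the candidate itself only when the candidate is prime.
            out.accept(candidate == factor ? candidate : 0);
        }
    }

    /** Collects the words one generator would write, in order. */
    static List<Integer> collect(int min, int increment, int max) {
        List<Integer> words = new ArrayList<>();
        primeGen(min, increment, max, words::add);
        return words;
    }
}
```

For example, `PrimeGenSketch.collect(3, 4, 30)` tests the candidates 3, 11, 19 and 27 and yields `[3, 11, 19, 0]`. A word is written for every candidate, with 0 marking a non-prime.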

Debugging a Massively-Parallel Application
- Debugger/Profiler in Eclipse IDE
  - JTAG, PCI Express, etc.
  - Integrated with source code
- Debug network in silicon
  - Strictly separate network that can’t deadlock
  - Transparently observes and controls processors and channels
  - Every processor can trap, run a watchdog timer
- Add debug objects
  - Program unused processors and RAMs for debug tools
  - No effect on function or performance, thanks to Ambric channels
[Diagram: debug master linked by the debug network to the SRD and SR processors in each CU; application objects 1-7 with a fork feeding a debug object]

Source & Graphical Debugging
- Profile
  - Identify processing bottlenecks, deadlock and utilization problems
- Find
  - Seamlessly access debugger features from graphical view
  - Symbolic references, single step and breakpoints
  - Special features like ‘channel monitoring’


Library Objects
- Video Compression
  - Motion Estimation
    * Full-search
    * Hierarchical-search
  - H.264 I-frame Decoder Module
    * Deblocking Filter
    * Inverse Transform
    * Intra-prediction
    * Macroblock assembler
    * Motion Compensation
    * Cache-controller
    * CABAC decode
  - DV Decoder Module
  - DVCPRO-HD modules
    * Variable length codec
    * Forward & Inverse DCT
- Signal Processing
  - FFT radix-2, radix-4
  - FIR, IIR Filters
  - Vector Saturation
  - Maximum value search
  - Dot Product
  - Matrix Math
- Communications
  - Turbo CTC
  - Viterbi
  - AES encryption
  - Regular expression search
- Pixel Processing
  - HD Video scaler
  - HD De-interlacer

HD H.264 Decoder Structure
[Block diagram: H.264 compressed video bytestream -> Bytestream Decoder, Non-VCL NAL Unit Handler, VCL Decoder/Buffer, Decoding Syntax Element Prediction, Intra Prediction, Inter Prediction, Inverse Transform, Macroblock Assembler, De-blocking Filter, External Memory Controller, Frame Buffer Control, Display Process, Write Request -> YCbCr video picture output; memory labels: External 10 MB, Internal 12 KB, External 17.5 MB, Internal 44 KB]
- HD H.264, High Profile Level 4
- Chip resources required
  - 27 brics with 100% of RUs used and 80% of associated CUs used
- 1 video stream per chip with 18 brics left over for scaling, etc.
- No global state machine
- Self-synchronizing
- Encapsulated, modular, re-usable code
- Asynchronous operation ranging from 100 MHz to 333 MHz saves power

WiMAX Base Station Block Diagram
[Block diagram, labels listed in the order they appear on the slide]
- Media Access Control (MAC) Layer
- TX path (18 Mbps in, TX 80 Msps out)
  - Blocks: CTC Turbo Encoders, Channel Multiplexer, 1K iFFT Modulator, I/Q 40-tap Filters
  - Resource labels: 7½ CU, 2 RU; ½ CU, 3 RU; 1½ CU, 1 RU; 16 CU
- Synchronization
  - Resource label: 1 CU
- RX path (RX 80 Msps in, 12 Mbps out)
  - Blocks: CTC Turbo Decoders, Channel Demultiplexer, 1K FFT Demodulator, I/Q 40-tap Filters
  - Resource labels: 21 CU, 21 RU; ½ CU, 1 RU; 17 CU
- Total: 67 CU, 31 RU = 34 brics = 80% Am2045


Ambric University Program
- Ambric offers development tools, documentation, hardware, and limited technical support, at no cost, to University Program partners.
- Benefits to University
  - Access to a real massively-parallel embedded-systems architecture
  - Very high performance execution with energy efficiency
  - Easier and faster software-only development effort
  - Real execution target for research tools, languages, etc.
- Benefits to Ambric
  - Real development experience with more application areas
  - Promote innovative tools, methodologies, libraries
  - Get to know the best future graduates
- Current Members as of January 2008
  - U. Wash EE Dept.: Prof. Scott Hauck
  - Portland State U. ECE Dept.: Prof. Dan Hammerstrom
  - Halmstad U. CERES (Sweden): Prof. Bertil Svenson

www. . com
Publications
- "Synchronization through Communication in a Massively Parallel Processor Array", Mike Butts, IEEE Micro, Sept/Oct 2007.
- "A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing", Mike Butts, Anthony Mark Jones, Paul Wasson, IEEE Symposium on Field-Programmable Custom Computing Machines, April 2007.

Outline
- Structural Object Programming Model
- Architecture
- Chip
- Tools
- Applications
- University Program
- Additional Background Material

The Energy Efficiency of Parallelism
- Strict power budgets at all levels limit power scalability
  - 1 W handheld, 10 W portable, 100 W desktop/server
- Minimize energy per operation
  - Lower voltage: slower but far less power
- Get performance back with parallelism
  - This makes the most power-efficient use of silicon area
- Example:
  - One cool processor: 75% speed, 42% power*
  - Two in parallel: 150% speed, 84% power
- The catch is making parallelism practical. Ambric’s programming model opens this door.
  - High performance made scalable
* $\text{power} \propto v^2 f,\ f \propto v \Rightarrow \text{power} \propto v^3$
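The example figures on this slide can be reproduced with a short calculation. It assumes the usual dynamic-power approximation for CMOS (power proportional to voltage squared times frequency), clock frequency scaling linearly with voltage, and ideal speedup from running two processors in parallel. These are modeling assumptions rather than measured Ambric data:

```latex
\[
P \propto V^{2} f, \qquad f \propto V \;\Longrightarrow\; P \propto V^{3}
\]
\[
\text{one processor at } 0.75\,V:\quad \text{speed} = 0.75,\qquad P = 0.75^{3} \approx 0.42
\]
\[
\text{two in parallel}:\quad \text{speed} = 2 \times 0.75 = 1.5,\qquad P = 2 \times 0.42 = 0.84
\]
```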

20th Century Reconfigurable Computing
[Diagram: array of "4-LUT + FF" cells; 1987, 2 µm CMOS, 0.5 mm²]
- In 1987, a great new way to spend 0.5 mm² of silicon was:
  - A 4-LUT, a flip-flop, and reconfigurable wires
- But the FPGA was never an ideal computing platform:
  - RTL productivity is not scaling with Moore’s Law
  - High-level synthesis had limited success
  - Developer must be mindful of HW issues such as timing closure
- RC developer must be application expert, SW and HW engineer

21st Century Reconfigurable Computing
[Diagram: array of "CPU + RAM" cells; 2007, 130 nm CMOS, 0.5 mm²]
- What is the best way to use 0.5 mm² of silicon today?
  - 32-bit CPU, several KB of RAM, and reconfigurable buses
- So just fab a chip with CPUs and buses, and throw it over the wall at the programmers. Not likely to succeed!
- Pick a good programming model for reconfigurable computing first. Then build silicon and tools to implement that model.