RAMP Infrastructure
Krste Asanovic, UC Berkeley
RAMP Tutorial, ISCA/FCRC, San Diego, June 10, 2007
RAMP: An infrastructure to build simulators using FPGAs
Run Target Model on Host Platform
[Figure: a target model (CPUs, interconnect network, DRAM) is mapped onto the host platform; the mapping step is labeled "Hard Work".]
Reduce, Reuse, Recycle
- Reduce effort to build target models
  - Users just build components, infrastructure handles connections (the RDL Compiler)
- Reuse components by having good abstractions
  - Across different target models
  - Across different host platforms
    - XUP, Calinx, BEE2, BEE3, also Altera (see Greg)
- Recycle existing IP for use as simulation models
  - Commercial processor RTL is its own model
RAMP Target Models
[Figure: Unit A, Unit B, and Unit C connected by a FIFO channel and a pipeline channel.]
Units
- Relatively large chunks of functionality
  - e.g., processor + L1 cache
- User-written in some HDL or software
Channels
- Point-to-point, unidirectional, two kinds:
  - FIFO channel: flow-controlled interface
  - Pipeline channel: simple shift register, bits drop off the end
- Generated by RAMP infrastructure
Target FIFO Channel Parameters
[Figure: FIFO channel with D, RDY, ENQ signals on the sending side and D, RDY, DEQ signals on the receiving side; parameters are Datawidth, Buffering, Forward Latency, and Reverse Latency.]
- Need buffering of at least (Forward + Reverse) latency to get full bandwidth over link
- RAMP infrastructure instantiates channel with desired parameters
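The buffering requirement above can be checked with a small cycle-level model. This is only a sketch under stated assumptions: the function name `fifo_channel_throughput` is hypothetical, flow control is modeled with credits (one credit per buffer slot), both latencies are at least one cycle, and the receiver dequeues every word as soon as it arrives. It is not the channel implementation RAMP generates.

```python
from collections import deque


def fifo_channel_throughput(buffering, fwd_latency, rev_latency, cycles=10_000):
    """Fraction of cycles in which a word is delivered (hypothetical model).

    Assumes credit-based flow control, latencies >= 1, a sender that always
    has data, and a receiver that dequeues immediately.
    """
    credits = buffering          # one credit per buffer slot at the receiver
    data_in_flight = deque()     # arrival cycle of each word travelling forward
    credits_in_flight = deque()  # arrival cycle of each credit travelling back
    delivered = 0
    for now in range(cycles):
        # Credits that have completed the reverse path return to the sender.
        while credits_in_flight and credits_in_flight[0] <= now:
            credits_in_flight.popleft()
            credits += 1
        # Words that have completed the forward path are dequeued at once,
        # which frees a slot and sends a credit back.
        while data_in_flight and data_in_flight[0] <= now:
            data_in_flight.popleft()
            delivered += 1
            credits_in_flight.append(now + rev_latency)
        # The sender enqueues one word per cycle whenever it holds a credit.
        if credits > 0:
            credits -= 1
            data_in_flight.append(now + fwd_latency)
    return delivered / cycles


if __name__ == "__main__":
    for slots in (1, 2, 4, 8):
        print(slots, fifo_channel_throughput(slots, fwd_latency=2, rev_latency=2))
```

In this model, with forward and reverse latencies of 2 cycles each, 4 slots keep the link busy every cycle, while 2 slots give roughly half the bandwidth because the sender idles waiting for credits.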
Target Pipeline Channel Parameters
[Figure: pipeline channel with D input and D output; parameters are Datawidth and Forward Latency.]
- Only recommended for expert use in target models
- (Should use FIFO channels and latency-insensitive protocols in target design)
RAMP Description Language (RDL) [Greg Gibeling, UCB]
[Figure: Target: Unit A, Unit B, Unit C. RDLC maps the target onto the host: Unit A and Unit B on FPGA 1, Unit C on FPGA 2, each inside a generated unit wrapper; generated links carry channels.]
- User describes target model topology, channel parameters, and (manual) mapping to host platform FPGAs using RDL
- Compiler (RDLC) generates configurations
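To make the three inputs concrete (topology, channel parameters, manual mapping), here is a minimal sketch in Python. It is not RDL syntax and not RDLC's algorithm: the `Channel` class, the `split_channels` helper, the channel connections, and all parameter values are hypothetical, chosen only to mirror the Unit A/B/C example. It shows the one decision any such tool has to make from the mapping, namely which channels stay inside an FPGA and which must be carried by a link between FPGAs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    """Hypothetical description of one target channel."""
    src: str
    dst: str
    kind: str         # "fifo" or "pipeline"
    datawidth: int    # bits
    fwd_latency: int  # target cycles


# Target topology and channel parameters (illustrative values only).
channels = [
    Channel("UnitA", "UnitB", "fifo", datawidth=32, fwd_latency=1),
    Channel("UnitB", "UnitC", "pipeline", datawidth=32, fwd_latency=4),
]

# Manual mapping of units to host FPGAs, as in the example figure.
mapping = {"UnitA": "FPGA1", "UnitB": "FPGA1", "UnitC": "FPGA2"}


def split_channels(channels, mapping):
    """Separate channels that stay on one FPGA from those that cross FPGAs."""
    local, cross = [], []
    for ch in channels:
        same_fpga = mapping[ch.src] == mapping[ch.dst]
        (local if same_fpga else cross).append(ch)
    return local, cross


if __name__ == "__main__":
    local, cross = split_channels(channels, mapping)
    print("on-chip channels:", [(c.src, c.dst) for c in local])
    print("inter-FPGA channels:", [(c.src, c.dst) for c in cross])
```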
Virtual Target Clock
Virtualized RTL Improves FPGA Resource Usage
- RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance
- Example 1: Multiported register file
  - For example, Sun Niagara has 3 read ports and 2 write ports to 6 KB of register storage
  - If RTL mapped directly, requires 48K flip-flops
    - Slow cycle time, large area
  - If mapping into block RAMs (one read + one write per cycle), takes 3 host cycles and 3 x 2 KB block RAMs
    - Faster cycle time (~3X) and far fewer resources
- Example 2: Large L2/L3 caches
  - Current FPGAs only have ~1 MB of on-chip SRAM
  - Use on-chip SRAM to build cache of active piece of L2/L3 cache, stall target cycle if access misses and fetch data from off-chip DRAM
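The numbers in Example 1 follow from simple arithmetic, sketched below. The function names are hypothetical, and the model assumes only what the slide states: each block RAM holds 2 KB and supports one read and one write per host cycle.

```python
import math


def direct_mapping_flip_flops(storage_bytes):
    """One flip-flop per stored bit when the RTL is mapped directly."""
    return storage_bytes * 8


def block_ram_mapping(storage_bytes, read_ports, write_ports, bram_bytes,
                      reads_per_host_cycle=1, writes_per_host_cycle=1):
    """Host cycles per target cycle and block RAM count (hypothetical model).

    Port accesses are time-multiplexed over the block RAM's single read and
    single write port, so the busier kind of port sets the host-cycle count.
    """
    host_cycles = max(math.ceil(read_ports / reads_per_host_cycle),
                      math.ceil(write_ports / writes_per_host_cycle))
    block_rams = math.ceil(storage_bytes / bram_bytes)
    return host_cycles, block_rams


if __name__ == "__main__":
    storage = 6 * 1024  # 6 KB of register storage
    print(direct_mapping_flip_flops(storage))  # 49152, i.e. 48K flip-flops
    print(block_ram_mapping(storage, read_ports=3, write_ports=2,
                            bram_bytes=2 * 1024))  # (3, 3)
```

The three read ports dominate: three reads at one per host cycle need three host cycles, and the two writes fit within the same three cycles.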
Start/Done Timing Interface
[Figure: a wrapper surrounds the unit; the wrapper drives Start and the inputs In1 and In2, and the unit drives Out and Done.]
- Wrapper generated by RDL asserts “Start” on the physical FPGA cycle when the inputs to the unit are ready for the next target cycle
- Unit asserts “Done” when it finishes the target cycle and its outputs are ready
- Unit can take variable amount of time
- Unvirtualized RTL unit can connect “Done” to “Start” (but must not clock until “Start”)
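The Start/Done handshake can be illustrated with a small software model. This is a sketch, not generated wrapper code: `MultiCycleUnit`, `run_wrapper`, and the cost function are hypothetical, and the wrapper is simplified so that inputs for the next target cycle are always available as soon as the unit is idle.

```python
class MultiCycleUnit:
    """Hypothetical unit that needs a variable number of host cycles
    to complete one target cycle."""

    def __init__(self, cost_fn, compute_fn):
        self.cost_fn = cost_fn        # host cycles needed for given inputs
        self.compute_fn = compute_fn  # target-cycle behavior
        self.remaining = 0
        self.latched = None

    def host_clock(self, start, inputs):
        """Advance one host cycle. Returns (done, output)."""
        if start:
            # Start: latch the inputs and begin a new target cycle.
            self.latched = inputs
            self.remaining = self.cost_fn(inputs)
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                # Done: the target cycle is finished, outputs are valid.
                return True, self.compute_fn(self.latched)
        return False, None


def run_wrapper(unit, input_stream):
    """Drive Start when the unit is idle; collect outputs on Done."""
    inputs = list(input_stream)
    outputs, host_cycles, next_input, busy = [], 0, 0, False
    while len(outputs) < len(inputs):
        start = not busy
        current = inputs[next_input] if start else None
        if start:
            next_input += 1
            busy = True
        done, out = unit.host_clock(start, current)
        host_cycles += 1
        if done:
            outputs.append(out)
            busy = False
    return outputs, host_cycles


if __name__ == "__main__":
    unit = MultiCycleUnit(cost_fn=lambda ins: 1 + ins[0] % 3,
                          compute_fn=lambda ins: ins[0] + ins[1])
    print(run_wrapper(unit, [(1, 2), (3, 4), (5, 6)]))  # ([3, 7, 11], 6)
```

Three target cycles complete in six host cycles here (2 + 1 + 3), which is the sense in which the target clock is decoupled from the host clock. A unit whose cost is always one host cycle corresponds to the unvirtualized case where Done follows Start directly.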
Distributed Timing Models
Distributed Timing Example
[Figure: Target: Unit A sends data (D) to Unit B over a pipeline channel of latency L. Host: Unit A and Unit B each have their own Start/Done interface and are connected through ENQ/DEQ, RDY, and D signals.]
Pipeline target channel implemented as distributed FIFO with at least L buffers
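A short simulation can show why a host FIFO is enough to preserve the target latency L even when the two units advance at different host rates. The sketch below makes assumptions that are not on the slide: the FIFO is preloaded with L empty tokens, each unit is randomly stalled on some host cycles, and the names (`simulate_pipeline_channel`, the stall probabilities) are hypothetical.

```python
import random
from collections import deque


def simulate_pipeline_channel(latency, buffers, target_cycles, seed=0):
    """Values Unit B sees on each of its target cycles (hypothetical model).

    Unit A sends its own target-cycle number every target cycle. The host
    FIFO starts with `latency` empty tokens (an assumption of this sketch),
    so B's target cycle t consumes what A produced in target cycle t - L.
    """
    assert latency >= 1 and buffers >= latency
    rng = random.Random(seed)
    fifo = deque([None] * latency)
    a_target_cycle = 0
    received = []
    while len(received) < target_cycles:
        # Model variable host time: each unit may be stalled this host cycle.
        a_ready = rng.random() < 0.7
        b_ready = rng.random() < 0.5
        # Sample FIFO status before either side acts, like registered RDYs.
        can_enq = len(fifo) < buffers
        can_deq = len(fifo) > 0
        if b_ready and can_deq:
            received.append(fifo.popleft())  # B completes one target cycle
        if a_ready and can_enq:
            fifo.append(a_target_cycle)      # A completes one target cycle
            a_target_cycle += 1
    return received


if __name__ == "__main__":
    print(simulate_pipeline_channel(latency=3, buffers=4, target_cycles=10))
```

Whatever the seed, B observes three empty tokens followed by 0, 1, 2, and so on: host-level stalls change how long the simulation takes, but not the target-level timing.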
Timing Target FIFO Channel
[Figure: Target: a FIFO channel of latency L with D, RDY, ENQ on the sending side and D, RDY, DEQ on the receiving side, built from credit control logic, with data (D) flowing forwards and credits flowing backwards.]
- Can build timed credit-based flow control (CBFC) FIFO inside target model, using pipeline channels for communicating data forwards and credits backwards
- But this puts two CBFCs in series (one in target unit, one hidden in host implementation of pipeline channels)
- RDL can generate a unified FIFO that merges both of these behind the FIFO interface
Other Automatically Generated Networks
- Control network has workstation as master and every unit as slave device
  - Memory-mapped interface with block transfers
  - Used for initialization, stats gathering, debugging, and monitoring
- Units can connect to DRAM resources outside of timed target channels
  - Used to support emulation and virtualization state
- Units can communicate with each other outside of timed target channels
  - Support arbitrary communication, e.g., for distributed stats gathering
Wide Variety of RAMP Simulators
Simulator Design Choices
- Structural Analog versus Highly Virtualized
- Functional-only versus Functional+Timing
- Timing via (virtual) RTL design versus separate functional and timing models
- Hybrid software/hardware simulators
We’re trying to build layers of abstractions that are useful to all types of simulator
Also, trying to make modules in different styles interoperate
Effective Abstractions Hide Details
…But Provide Inter-Operability
Work in Progress: Stay Tuned