RAMP Blue: A Message-Passing Many-Core System in FPGAs
ISCA Tutorial/Workshop, June 10th, 2007
John Wawrzynek, Alex Krasnov, Dan Burke
© 2006 Regents University of California. All Rights Reserved
Introduction
• RAMP Blue is one of several initial design drivers for the RAMP project.
• RAMP Blue is sibling to other RAMP design driver projects:
  1) RAMP Red: Port of existing transactional cache system to FPGA PowerPC cores
  2) RAMP Blue: Message passing distributed memory system using an existing FPGA-optimized soft core
  3) RAMP White: Cache coherent multiprocessor system with full featured soft-core
• Goal of building a class of large (500-1K processor core) distributed memory message passing many-core systems on FPGAs
RAMP Tutorial and Workshop, ISCA 2007
Version Highlights
• V1.0:
  – Used 8 BEE2 modules
  – 4 user FPGAs of each module held 100 MHz Xilinx MicroBlaze soft cores running uClinux
  – 8 cores per FPGA, 256 cores total
  – Dec 06: 256 cores running benchmark suite of UPC NAS Parallel Benchmarks
• V3.0:
  – Uses 16 BEE2 modules
  – 12 cores per FPGA, 768 cores total
  – Cores running at 90 MHz
  – Demo today
• Future versions
  – Use newer BEE3 FPGA platform
  – Support for other processor
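The core totals above follow from multiplying modules, user FPGAs per module, and cores per FPGA. A minimal sketch of that arithmetic; the function name is illustrative, and the count of 4 user FPGAs per module for V3.0 is assumed to carry over from V1.0:

```python
def total_cores(modules: int, user_fpgas_per_module: int, cores_per_fpga: int) -> int:
    """Cores in the system = modules x user FPGAs per module x cores per FPGA."""
    return modules * user_fpgas_per_module * cores_per_fpga


# V1.0: 8 BEE2 modules, 4 user FPGAs each, 8 cores per FPGA
v1_cores = total_cores(8, 4, 8)

# V3.0: 16 BEE2 modules, 12 cores per FPGA
# (4 user FPGAs per module is an assumption carried over from V1.0)
v3_cores = total_cores(16, 4, 12)

print(v1_cores, v3_cores)  # 256 768
```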
The Berkeley Team
Dave Patterson & Students of CS 252, Spring 2006: Brainstorming and help with initial implementation
Andrew Schultz: Original RAMP Blue design and implementation (graduated)
Pierre-Yves Droz: BEE2 Detailed Design, Gateware blocks (graduated)
Chen Chang: BEE2 High-level Design (graduated, now staff)
Greg Gibeling: RAMP Description Language (RDL)
Jue Sun: RDL version of RAMP Blue
Alex Krasnov: RAMP Blue design, implementation, debugging
Dan Bonachea: UPC MicroBlaze port
Dan Burke: Hardware Platform, gateware, RAMP support
RAMP Blue Objectives
• Primary objective is to experiment and learn lessons on building large scale manycore architectures on FPGA platforms
  – Issues of FPGA implementation of processor cores
  – NOC implementation and emulation
  – Physical issues (power, cooling, packaging) of large FPGA arrays
  – Set directions for parameterization and instrumentation of manycore architectures
• Starting point for research in distributed memory (old style “super computers”) and cluster style architectures (“internet in a box”)
• Proof of concept for RAMP - Technical imperative is to fit as many cores as possible in the system.
• Develop and provide gateware blocks for other projects.
• Application driven: run off-the-shelf, message passing, scientific codes and benchmarks (provide existing tests
RAMP Blue Hardware
BEE2: Berkeley Emulation Engine 2
• Completed Dec. 2004 (14 x 17 inch 22-layer PCB)
• 5 Virtex II FPGAs, 20 banks DDR2-400 memory, 18 10 Gbps conn.
• Administration/maintenance ports:
  – 10/100 Enet
  – HDMI/DVI
  – USB
• ~$6K (w/o FPGAs or DRAM or enclosure)
Xilinx Virtex 2 Pro 70 FPGA
• 130 nm, ~70K logic cells
• 1704 package with 996 user I/O pins
• 2 PowerPC 405 cores
• 326 dedicated multipliers (18-bit)
• 5.8 Mbit on-chip SRAM
• 20 x 3.125-Gbit/s duplex serial comm. links (MGTs)
BEE2 Module Details
• Four independent 200 MHz (DDR2-400) SDRAM interfaces per FPGA
• FPGA to FPGA connects using LVCMOS
  – Designed to run at 300 MHz
  – Tested and routinely used at 200 MHz (SDR), single ended
• “Infiniband” 10 Gb/s links use “XAUI” (IEEE 802.3ae 10GbE specification) communications core for the physical layer interface.
• Hardware will support others, ex: Xilinx “Aurora” standard, or ad hoc interfaces.
[Figure labels: 2VP70 User FPGAs, 2VP70 Control FPGA]
BEE2 Module Design
[Figure labels: FPGAs, DRAM, 10 GigE ports, Compact Flash Card, DVI/HDMI, 10/100 Enet, USB]
BEE2 Module Packaging
• Custom 2U rack mountable chassis
• Custom power supply (550 W), 80 A @ 5 volts + 12 volts
• Adequate fans for cooling max power draw (300 W)
• Typical operation ~150 W
Physical Network Topology
• The 10 Gb/s off-module serial links (4 per user FPGA) permit a wide variety of network topologies.
  – Example Topology: 3-D Mesh of FPGAs
• In RAMP Blue
  – Serial links are used to implement an all-to-all module topology
  – FPGAs on-module are connected in a ring
• Therefore each FPGA is at most 4 on-module links + 1 serial link connection away from any other in the system.
  + Minimizes dependence on use of serial links: latency in 10’s of cycles versus 2-3 cycles for on-module links (ideal)
  - Scales only to 17 modules total
[Figure labels: BEE2 Module, MGT Link]
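The 17-module limit and the "4 on-module links + 1 serial link" bound can be checked with a little arithmetic. The sketch below assumes 4 user FPGAs per module joined in a bidirectional ring and one dedicated serial link per pair of modules; the function names are illustrative and not part of the RAMP Blue gateware.

```python
def max_modules(user_fpgas_per_module: int, serial_links_per_fpga: int) -> int:
    """All-to-all needs one serial link to every other module, so a module
    with L off-module links can have at most L peers, i.e. L + 1 modules."""
    links_per_module = user_fpgas_per_module * serial_links_per_fpga
    return links_per_module + 1


def worst_case_hops(user_fpgas_per_module: int) -> tuple[int, int]:
    """Return (on-module links, serial links) on the longest path.

    Assumes a bidirectional ring: the farthest FPGA is ring_size // 2 hops
    away. A packet may cross the ring once on the source module (to reach
    the FPGA owning the right serial link) and once on the destination
    module, with exactly one serial link in between.
    """
    ring_diameter = user_fpgas_per_module // 2
    return 2 * ring_diameter, 1


print(max_modules(4, 4))   # 17
print(worst_case_hops(4))  # (4, 1)
```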
16 module wiring diagram
[Figure labels: module 0, module 14, module 15; User FPGA 1, User FPGA 2, User FPGA 4]
Which processor core to pick?
• Long-term RAMP goal is replaceable CPU and L1 cache, rest infrastructure unchanged (router, memory controller, …)
• What do you need from a CPU in long run?
  – Standard ISA (binaries, libraries, …), simple (area), 64-bit (coherency), DP Fl.Pt. (apps), verification suites
• ISAs we’ve considered so far
  – “Synthesizable” versions of Power 405 (32b), SPARC v8 (32b), SPARC v9 (Niagara, 64b), SPARC Leon
  – Xilinx MicroBlaze (32b) simple RISC processor optimized for FPGAs
• RAMP Blue uses MicroBlaze
  – Highest density (least logic block usage) core we know of
  – GCC, uClinux support
  – Limitations: 32-bit addressing, no MMU (virtual memory support)
  – Source is not openly available from Xilinx (although has been made available to some under NDA). Xilinx “black-boxes” MicroBlaze - we will do same in our releases.
Which soft-core to pick?
MicroBlaze V4 Characteristics
• 3-stage, RISC designed for implementation on FPGAs
  – Takes full advantage of FPGA unique features (e.g. fast carry chains) and addresses FPGA shortcomings (e.g. lack of CAMs in cache)
  – Short pipeline minimizes need for large multiplexers to implement bypass logic
• Maximum clock rate of 100 MHz (~0.5 MIPS/MHz) on Virtex-II Pro FPGAs
• Split I and D cache with configurable size, direct mapped (we use 2 KB $I, 8 KB $D)
• Fast hardware multiplier/divider (off), optional hardware barrel shifter (off), optional single precision floating point unit (off).
• Up to 8 independent fast simplex links (FSLs) with ISA support
• Configurable hardware debugging support (watch/breakpoints): MDM (off), trace interface used
Other RAMP Blue Requirements
• Require design and implementation of gateware and software for implementing multiple MicroBlazes with uClinux on BEE2 modules
  – Sharing of DDR2 memory system
  – Communication with and bootstrapping user FPGAs
  – Debugging and control from control FPGA
• “On-chip network” for MicroBlaze to MicroBlaze communication
  – Communication on-chip, FPGA to FPGA on board, and board to board
• Double precision floating point unit for scientific codes
• Built from existing tools (RDL not available at time) but fit RAMP Design Framework (RDF) guidelines for
Node Architecture
Memory System
• Requires sharing memory channel with a configurable number of MicroBlaze cores
  – No coherence, each DIMM is partitioned, but bank management keeps cores from fighting with each other
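One way to picture the "each DIMM is partitioned" statement is a static split of the DIMM address range into equal, non-overlapping regions, one per core. This is only a sketch of that idea: the DIMM size, the equal-split policy, and the function names are assumptions, and the bank management mentioned on the slide is not modeled.

```python
def partition_dimm(dimm_bytes: int, n_cores: int) -> list[tuple[int, int]]:
    """Split one DIMM into n_cores equal, disjoint (base, size) regions."""
    if n_cores <= 0 or dimm_bytes % n_cores != 0:
        raise ValueError("DIMM size must divide evenly among the cores")
    size = dimm_bytes // n_cores
    return [(core * size, size) for core in range(n_cores)]


def owner_of(addr: int, dimm_bytes: int, n_cores: int) -> int:
    """Return the index of the core whose partition contains addr."""
    if not 0 <= addr < dimm_bytes:
        raise ValueError("address outside the DIMM")
    return addr // (dimm_bytes // n_cores)


# Hypothetical example: a 1 GiB DIMM shared by 4 cores
GIB = 1 << 30
for core, (base, size) in enumerate(partition_dimm(GIB, 4)):
    print(f"core {core}: base=0x{base:08x} size={size >> 20} MiB")
```

Because the regions are disjoint, no core can read another core's data through the shared channel, which is why no coherence mechanism is needed in this picture.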
Control Communication
• Communication channel from control PowerPC to individual MicroBlaze required for bootstrapping and debugging
  – Gateware provides general purpose, low-speed network
  – Software provides character and Ethernet abstraction on channel
  – Linux kernel is sent over channel and NFS file systems can be mounted
  – Linux console channel allows debugging messages and control
  – Duplicated: one is console,
Double Precision FPU
• Due to size of FPU, sharing is crucial to meeting resource budget
• Implemented with Xilinx CoreGen library FP components
• Shared FPU much like reservation stations in microarchitecture with MicroBlaze issuing instructions
Network Characteristics
• Interconnect fits within the RDF model
  – Network interface uses simple FSL channels, currently programmed I/O, but could/should be DMA
• Source routing between nodes (dimension-order style routing, non-adaptive, link failure intolerant)
  – FPGA-FPGA and board-to-board link failure rare (1 error / 1M packets)
  – CRC checks for corruption, software (GASNet/UDP) uses acks/time-outs and retransmits for reliable transport.
• Topology of interconnect is full cross-bar on chip, ring on module, and all-to-all connection of module-to-module links
• Encapsulated Ethernet packets with source routing information prepended
• Cut through flow control with virtual channels for
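Source routing with prepended route information can be sketched as follows: the sender writes the full list of output ports in front of the Ethernet frame, each switch consumes the first port, and the destination checks a CRC over the frame. The header layout here (one hop-count byte, one byte per port, a trailing CRC-32 from Python's zlib) is purely an assumption for illustration; the actual RAMP Blue packet format and CRC are not given on the slide. Retransmission on CRC failure is left to software (GASNet/UDP), as the slide states, and is not shown.

```python
import zlib

CRC_LEN = 4  # bytes of CRC-32 appended after the frame (assumed layout)


def encapsulate(route: list[int], frame: bytes) -> bytes:
    """Build [hop count][port per hop ...][frame][CRC-32 of frame]."""
    if len(route) > 255:
        raise ValueError("route too long for a one-byte hop count")
    header = bytes([len(route)]) + bytes(route)  # bytes() rejects ports > 255
    return header + frame + zlib.crc32(frame).to_bytes(CRC_LEN, "big")


def forward(packet: bytes) -> tuple[int, bytes]:
    """At a switch: pop the next output port and return (port, new packet)."""
    hops = packet[0]
    if hops == 0:
        raise ValueError("route exhausted: packet is already at its destination")
    port = packet[1]
    return port, bytes([hops - 1]) + packet[2:]


def deliver(packet: bytes) -> bytes:
    """At the destination: verify the CRC and return the original frame."""
    if packet[0] != 0:
        raise ValueError("packet still has hops remaining")
    frame, crc = packet[1:-CRC_LEN], packet[-CRC_LEN:]
    if zlib.crc32(frame).to_bytes(CRC_LEN, "big") != crc:
        raise ValueError("CRC mismatch: frame corrupted in transit")
    return frame


# Usage: a three-hop route through hypothetical switch ports 2, 0, 5
pkt = encapsulate([2, 0, 5], b"ethernet frame bytes")
while pkt[0] != 0:
    port, pkt = forward(pkt)
    print("forward on port", port)
print(deliver(pkt))
```

Covering only the frame with the CRC lets switches strip route bytes without recomputing the checksum; that is a design choice of this sketch, not a claim about the real gateware.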
Network Implementation
• Switch (8b wide) provides connections between all on-chip cores, 2 FPGA-FPGA links, and 4 module links
Early “Ideal” RAMP Blue Floorplan
• 16 Xilinx MicroBlazes per 2VP70 FPGA
• Each group of 4 clustered around a memory interface
• FP integrated with each core
8 Core Layout
• Floorplanning everything may lead to suboptimal results
• Floorplanning nothing leads to inconsistency from build to build
• FPU size/efficiency shared block
• Scaling up to multiple cores per FPGA is primarily constrained by resources
  – This version implemented 8 cores/FPGA using roughly 86% of the slices (but only slightly more than half of the LUTs/FFs)
• Sixteen cores fit on each FPGA only without infrastructure (switch, FPU, etc)
[Figure labels: MB, Switch, FPU]
12 Core Layout
• Used floorplanning for FPU and switch placement
• Uses roughly 93% of logic blocks, 55% BRAMs.
• Place and route to 100 MHz not practical (many PAR builds). Currently running at 90 MHz.
[Figure labels: MB, Switch, FPU]
Software (1)
• Development:
  – All development was done with standard Xilinx FPGA tools (EDK, ISE)
  – First version was done without RDL, newer versions use RDL (but without cycle accurate time synchronization between units: no time dilation possible)
• System:
  – Each node in the cluster boots its own copy of uClinux and mounts a file system from an external NFS file system (for applications)
  – The Unified Parallel C (UPC) shared memory abstraction over messages framework was ported to uClinux
    • The main porting effort with UPC, adapting to its transport layer, GASNet, was eased by using the GASNet UDP
Software (2)
• System:
  – Floating point integration is achieved via modification to the GCC SoftFPU backend to emit code to interact with FPU
  – Integrated HW timer for performance monitoring and GASNet, added with special ISA opcodes and GCC tools support
• Application:
  – Runs UPC (Unified Parallel C) version of a subset of NAS (NASA Advanced Supercomputing) Parallel Benchmarks (all class S, to date)
    CG Conjugate Gradient, IS Integer Sort: 512 cores
    EP Embarrassingly Parallel, MG Multi-Grid: on 512, 768 cores
Implementation Issues
• Building such a large design exposed several difficult bugs in both hardware and gateware development
  – Reliable low-level physical SDRAM controller has been a major challenge
  – A few MicroBlaze bugs in both gateware and GCC tool-chain required time to track down (race conditions, OS bugs, GCC backend bugs)
  – RAMP Blue pushed the use of BEE2 modules to new levels - previously most aggressive users were for Radio Astronomy
    • Memory errors exposed memory controller calibration loop errors (tracked down to PAR problems).
    • DIMM socket mechanical integrity problems
• Long “recompile” (FPGA place and route) times (3
Future Work / Opportunities
• Processor/network interface currently very inefficient
  – DMA support should replace programmed I/O approach
• Many of the features for a useful RAMP currently missing
  – Time dilation (ex: change relative speed of network, processor, memory)
  – Extensive HW supported monitoring
  – Virtual memory, other CPU/ISA models
• Other network topologies
• Good starting point for processor + HW-accelerator architectures
• We are very interested in working with others
Conclusions
• RAMP Blue represents one of the first steps to developing a robust RAMP infrastructure for more complicated parallel systems
  – Much of the RAMP Blue gateware is directly applicable to future systems
  – We are learning important lessons about required debugging/insight capabilities
  – New bugs and reliability issues were exposed in BEE2 platform and gateware to help influence future RAMP hardware platforms and characteristics for robust software/gateware infrastructure
• RAMP Blue also represents the largest soft-core, FPGA based computing system ever built!
For more information …
1. Chen Chang, John Wawrzynek, and Robert W. Brodersen. BEE2: A High-End Reconfigurable Computing System. IEEE Design and Test of Computers, Mar/Apr 2005 (Vol. 22, No. 2).
2. J. Wawrzynek, D. Patterson, M. Oskin, S. Lu, C. Kozyrakis, J. C. Hoe, D. Chiou, and K. Asanovic. RAMP: A Research Accelerator for Multiple Processors. IEEE Micro, Mar/Apr 2007.
3. Arvind, Krste Asanovic, Derek Chiou, James C. Hoe, Christoforos Kozyrakis, Shih-Lien Lu, Mark Oskin, David Patterson, Jan Rabaey, and John Wawrzynek. RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform. Technical Report UCB/CSD-05-1412, 2005. http://ramp.eecs.berkeley.edu/Publications/ramp-nsf2005.pdf
4. Andrew Schultz. RAMP Blue: Design and Implementation of a Message Passing Multi-processor System on the BEE2. Master's Report, 2006. http://ramp.eecs.berkeley.edu/Publications/Andrew Schultz Masters.pdf
For more information …
5. Xilinx. MicroBlaze Processor Reference Guide. http://www.xilinx.com/ise/embedded/mb ref guide.pdf
6. John Williams. MicroBlaze uClinux Project Home Page. http://www.itee.uq.edu.au/jwilliams/mblaze-uclinux/
7. Alex Krasnov, Andrew Schultz, John Wawrzynek, Greg Gibeling, and Pierre-Yves Droz. RAMP Blue: A Message-Passing Many-Core System in FPGAs. FPL International Conference on Field Programmable Logic and Applications, Aug 27-29, 2007, Amsterdam, Holland. Soon to be available on RAMP website.
8. Greg Gibeling, Andrew Schultz, and Krste Asanovic. The RAMP Architecture and Description Language. Technical report, 2005. http://ramp.eecs.berkeley.edu/Publications/RAMP Documentation.pdf
9. Jue Sun. RAMP Blue in RDL. Master's Report, May 2007. http://ramp.eecs.berkeley.edu/Jue Sun Masters.pdf
Questions?
RAMP Blue Demonstration
• NASA Advanced Supercomputing (NAS) Parallel Benchmarks (all class S).
• UPC versions (C plus shared-memory abstraction)
  CG Conjugate Gradient
  EP Embarrassingly Parallel
  IS Integer Sort
  MG Multi Grid
[Figure labels: Virtual network traffic, Physical network traffic]