BEE3 Update
Chuck Thacker, John Davis
Microsoft Research
2 March 2008
Outline
• BEE3 Overview
• BEE3 Status
• BEE3 Gateware
• Moving forward
BEE3 System
BEE3 Package
BEE3 Tidbits
• Design uses essentially every pin on the chip.
• Design was done to be “PC-like” to leverage PC economies:
  – PWB is about half the area of BEE2.
  – PWB is 18 layers rather than 22 for BEE2.
  – Uses PC power and peripherals.
• System is divided into a main board plus a separate (and separately designed) Control Board.
  – Allowed designs to proceed in parallel at Celestica and BWRC, and reduced the risk of having to spin the (expensive) main board.
  – Control board has JTAG, Flash for bitstreams, and boot flash for each FPGA. The system can operate without it.
• The use of pros for PCB and mechanical design was an enormous win.
  – Celestica’s design was 100% correct, and five systems worked with only one problem (which was easily corrected).
  – Took (probably) half the time to produce something much more manufacturable and robust (and therefore cheaper).
BEE3 Subsystems
BEE3 Control Board
Project Participants and Roles
• Microsoft Research (Silicon Valley)
  – Funds, manages system engineering, does some gateware
• Celestica (Ottawa and Shanghai)
  – Did main board engineering, prototype fabrication
  – Microsoft has a very deep relationship with Celestica
• BEECube
  – Builds and delivers functioning systems
• Function Engineering (Palo Alto)
  – Did thermal and mechanical engineering
• Xilinx (San Jose)
  – Provides FPGAs for academic machines
  – Provides FPGA application expertise
• RAMP Group (BWRC)
  – Control board, basic software
• RAMP Community
  – Uses the systems for research
  – Expanding to industrial users (e.g., us)
BEE3 Status
• All subsystems work!
• Board spin is required to correct MGT placement.
  – 10 Gbit channels require long routing.
    • Due to lack of information from Xilinx, not Celestica’s error.
  – Respin is in progress. ETA for final board is 1 May.
BEE3 Gateware
• Today, consists primarily of test and characterization routines.
  – Much of this was ported from BEE2, although some is new:
    – DDR2 Controller
    – Control RISC
• MS designs use a minimal subset of the Xilinx tool suite:
  – Just ISE, ChipScope, and (soon) Data2MEM.
  – May need EDK, but not yet.
DDR2 Controller
• Largest piece of new Gateware.
  – 5 Modules, ~2000 lines of Verilog.
• Supports two 4 GB DIMMs/channel, 2 channels per FPGA.
• Transfers are DDR 400 (5 ns clock) with -2.
• Supports only x4 registered DIMMs
  – Unbuffered DIMMs can’t work because of address/control loading.
• Handles all initialization, refresh, and calibration (semi) automatically.
  – Keeps track of up to 16 open banks/controller.
• Calibration is fast (768 clocks).
  – So can be done at frequent intervals or in response to single errors.
• Primary user commands are Read and Write:
  – Both deal with 36-byte blocks. Simple FIFO interfaces.
• Each channel is about 3% of the LX110T LUTs (no BRAMs).
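The figures on this slide can be cross-checked with a little arithmetic. The sketch below is not from the original deck: it combines the 5 ns clock, 36-byte block, and 768-clock calibration quoted here with the six I/O banks of three 4-bit lanes described on the controller organization slide. It assumes that a block moves as consecutive full-width transfers and that peak bandwidth counts every bit on the bus.

```python
# Back-of-the-envelope numbers for one DDR2 channel, using figures from the slides.
CLOCK_NS = 5.0             # from the slide: DDR 400, 5 ns clock
TRANSFERS_PER_CLOCK = 2    # double data rate: one transfer per clock edge
IO_BANKS = 6               # from the slides: six replicated I/O pin banks
LANES_PER_BANK = 3         # from the slides: three 4-bit lanes per bank
BITS_PER_LANE = 4          # x4 DRAM devices
BLOCK_BYTES = 36           # from the slide: Read/Write block size
CALIBRATION_CLOCKS = 768   # from the slide

bus_bits = IO_BANKS * LANES_PER_BANK * BITS_PER_LANE       # 72 bits wide
clock_mhz = 1000.0 / CLOCK_NS                               # 200 MHz
mega_transfers = clock_mhz * TRANSFERS_PER_CLOCK            # 400 MT/s

# Assumption: a block is sent as back-to-back full-width transfers.
transfers_per_block = (BLOCK_BYTES * 8) // bus_bits         # 4 transfers
clocks_per_block = transfers_per_block / TRANSFERS_PER_CLOCK

# Assumption: peak rate counts all 72 bits of each transfer.
peak_mbytes_per_s = mega_transfers * bus_bits / 8           # 3600 MB/s
calibration_us = CALIBRATION_CLOCKS * CLOCK_NS / 1000.0     # 3.84 microseconds

print(f"bus width:        {bus_bits} bits")
print(f"transfers/block:  {transfers_per_block} ({clocks_per_block:g} clocks)")
print(f"peak bandwidth:   {peak_mbytes_per_s:g} MB/s per channel")
print(f"calibration time: {calibration_us:g} us")
```

Under these assumptions a 36-byte block is exactly four transfers on a 72-bit bus, and a full calibration takes under 4 microseconds, which is consistent with the claim that it can be repeated frequently.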
DDR Controller Organization
• Centralized main controller
  – Main control FSM
  – Address Fifo (64 × 30-bit command/addresses)
  – Open bank CAMs
  – Clock generation, timing limit enforcement
• Six replicated copies of the I/O pin bank logic:
  – Read and Write Fifos for 24 data bits (three 4-bit lanes, with one RAM chip/lane on each DIMM).
  – Calibration state machine, so that all 6 banks can calibrate in parallel.
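To illustrate what tracking open banks buys, here is a toy software model of an open-bank lookup. It is a sketch, not the controller's Verilog: the class name `OpenBankTable`, the one-open-row-per-bank policy, and the command strings are illustrative assumptions. Only the limit of 16 open banks per controller comes from the slides.

```python
class OpenBankTable:
    """Toy model of an open-bank lookup: bank number -> currently open row."""

    def __init__(self, num_banks=16):      # 16 open banks/controller, per the slides
        self.num_banks = num_banks
        self.open_row = {}                 # hypothetical stand-in for the CAMs

    def access(self, bank, row):
        """Return the DRAM command sequence needed to reach (bank, row)."""
        if not 0 <= bank < self.num_banks:
            raise ValueError("bank out of range")
        current = self.open_row.get(bank)
        if current == row:
            return ["READ/WRITE"]          # row hit: column access only
        commands = []
        if current is not None:
            commands.append("PRECHARGE")   # close the row that is open now
        commands.append("ACTIVATE")        # open the requested row
        commands.append("READ/WRITE")
        self.open_row[bank] = row
        return commands

    def refresh(self):
        """DDR2 refresh needs every bank precharged, so forget all open rows."""
        self.open_row.clear()


table = OpenBankTable()
print(table.access(0, 5))   # first touch of bank 0
print(table.access(0, 5))   # same row again: hit
print(table.access(0, 6))   # different row in the same bank: miss
```

The point of the lookup is that a hit skips the precharge and activate steps, so repeated accesses to an open row cost only the column access.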
DDR Controller (simplified)
Control RISC (TC4)
• 36 bits (memories are 36n bits wide)
• Harvard architecture
  – 1K instruction memory (1 BRAM)
  – 1K data memory (1 BRAM)
  – 256-register, 3-port register file (2 BRAMs)
• Very small (~100 slices) “Tiny Computer”
  – All instructions execute in three 5 ns phases. No pipelining.
• Assembler, no C compiler. Sigh…
• So far: DRAM initialization, DRAM calibration, Control shell with UART interface.
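A quick calculation shows how these numbers fit together. This is a sketch rather than anything from the original deck: it reads "1K" as 1024 words and assumes the 36 Kbit block RAM size of the Virtex-5 family, to which the LX110T belongs.

```python
# Rough sizing of the TC4 from the figures on the slide.
PHASE_NS = 5                    # from the slide
PHASES_PER_INSTRUCTION = 3      # from the slide; no pipelining
WORD_BITS = 36                  # from the slide
MEMORY_WORDS = 1024             # assumption: "1K" means 1024 words
REGISTERS = 256                 # from the slide
BRAM_BITS = 36 * 1024           # assumption: 36 Kbit Virtex-5 block RAM

instruction_ns = PHASE_NS * PHASES_PER_INSTRUCTION       # 15 ns per instruction
instructions_per_second = 1e9 / instruction_ns
memory_bits = MEMORY_WORDS * WORD_BITS                   # 36864 bits
regfile_bits = REGISTERS * WORD_BITS                     # 9216 bits per copy

print(f"instruction time: {instruction_ns} ns")
print(f"instruction rate: {instructions_per_second / 1e6:.1f} million/s")
print(f"1K x 36 memory fills one BRAM: {memory_bits == BRAM_BITS}")
print(f"register file contents: {regfile_bits} bits")
```

With these assumptions each 1K x 36 memory fills exactly one block RAM, which matches the one-BRAM figures on the slide. The register file contents are far smaller than one BRAM, so the two BRAMs are presumably there for the port count rather than for capacity, though the slide does not say so.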
TC4
Next Steps
• Use Data2MEM to speed up TC4 edit, assemble, load cycle time:
  – Currently takes 30 minutes, since we regenerate cores and rebuild entire design.
  – Should be a couple of minutes.
• Add DDR2 test system (LFSRs) to do full-speed testing with random addresses and data. Should be rock solid.
• Use Xilinx PlanAhead to lock the design so that it can be used as a component in larger designs.
• Develop an on-chip interconnect to allow multiple DDR2 requesters without needing huge cross-chip busses.
• Use BEE3 in our own research programs
  – A couple have already started.
  – This is the fun part. Building it was just work.
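The LFSR-based memory test mentioned above can be sketched in software. This is an illustration of the general technique, not the planned gateware: the 16-bit width, the tap mask `0xB400` (a commonly cited maximal-length choice), the seeds, and the `ModelMemory` stand-in are all assumptions. Real hardware would need wider generators to cover 30-bit addresses and 36-byte blocks.

```python
def galois_lfsr16(seed, taps=0xB400):
    """Yield successive states of a 16-bit Galois LFSR (right-shifting form)."""
    if not 0 < seed <= 0xFFFF:
        raise ValueError("seed must be a nonzero 16-bit value")
    state = seed
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps       # apply the feedback taps when a 1 shifts out
        yield state


class ModelMemory:
    """Hypothetical stand-in for the DRAM under test; can fake one bad address."""

    def __init__(self, faulty_addr=None):
        self.cells = {}
        self.faulty_addr = faulty_addr

    def write(self, addr, data):
        self.cells[addr] = data

    def read(self, addr):
        data = self.cells[addr]
        return data ^ 1 if addr == self.faulty_addr else data  # flip bit 0 on fault


def run_lfsr_test(mem, count, addr_mask=0x3FF, addr_seed=0xACE1, data_seed=0x1D0F):
    """Write pseudo-random data to pseudo-random addresses, then read back."""
    addrs = galois_lfsr16(addr_seed)
    datas = galois_lfsr16(data_seed)
    expected = {}                        # last value written to each address
    for _ in range(count):
        addr = next(addrs) & addr_mask
        data = next(datas)
        mem.write(addr, data)
        expected[addr] = data
    errors = 0
    addrs = galois_lfsr16(addr_seed)     # same seed replays the same addresses
    for _ in range(count):
        addr = next(addrs) & addr_mask
        if mem.read(addr) != expected[addr]:
            errors += 1
    return errors


print("errors on good memory:", run_lfsr_test(ModelMemory(), 1000))
```

Because the sequence is fully determined by the seed, the checker needs no stored copy of the address stream. The small scoreboard here only handles repeated addresses, where the last write wins.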
Questions?