A Parameterizable FPGA Prototype of a Vector-Thread Processor
Jared Casper, Ronny Krashinsky, Christopher Batten, Krste Asanović
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
SCALE Vector-Thread Processor: Key Features
  – 4 lanes, 4 clusters
  – Cluster for indexed accesses
  – 4 segment address generators
  – 4 VLDQs
  – VRU includes throttle logic, refill address generator
[Figure: block diagram; labels include Control Proc, Vector Execution Unit, Lane 0 to Lane 3, Stride, SEG, VRU, Throttle Logic, Refill Unit]
SCALE Cache: Key Features
  – Two-cycle hit latency
  – Four 8 KB banks
  – 32-way associative
  – 32 B cache lines
  – 16 B/cycle per bank
  – Four 16 B segment buffers per bank
[Figure: cache block diagram; labels include Cache Arbiter and Crossbar, Memory Port Arbiter and Crossbar, and per-bank Tag, Data, MSHR, and Seg Buf blocks]
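The cache figures above imply a few derived quantities that the slide does not state. The following is a minimal sketch of that arithmetic, assuming each 8 KB bank is organized as conventional sets of 32 ways; the class and property names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Cache parameters as listed on the slide (names are illustrative)."""
    banks: int = 4
    bank_bytes: int = 8 * 1024
    ways: int = 32
    line_bytes: int = 32
    bank_bytes_per_cycle: int = 16

    @property
    def total_bytes(self) -> int:
        # Total capacity across all banks.
        return self.banks * self.bank_bytes

    @property
    def lines_per_bank(self) -> int:
        # Number of cache lines held by one bank.
        return self.bank_bytes // self.line_bytes

    @property
    def sets_per_bank(self) -> int:
        # Assumes a conventional set-associative layout within a bank.
        return self.lines_per_bank // self.ways

    @property
    def peak_bytes_per_cycle(self) -> int:
        # Peak bandwidth if every bank delivers data in the same cycle.
        return self.banks * self.bank_bytes_per_cycle


geom = CacheGeometry()
print(geom.total_bytes, geom.lines_per_bank, geom.sets_per_bank,
      geom.peak_bytes_per_cycle)
```

Under these assumptions the totals come to 32 KB of capacity, 256 lines and 8 sets per bank, and a peak of 64 B per cycle, which agrees with the 32 KB and 4 x 128 b per cycle figures quoted for the prototype chip.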
SCALE Prototype Chip
• Prototype SCALE processor in development
  – Control processor: MIPS, 1 instr/cycle
  – VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
  – Primary I/D cache: 32 KB, 4 x 128 b per cycle, non-blocking
  – DRAM: 64 b, 200 MHz DDR2 (64 b at 400 Mb/s: 3.2 GB/s)
  – Estimated 10 mm² in 0.18 μm, 400 MHz (25 FO4)
• Cycle-level, execution-driven C++ microarchitectural simulator
  – Detailed VTU and memory system model
[Figure: chip floorplan, 4 mm x 2.5 mm; labels include Control Processor (PC, CP0, Mult/Div), four Lanes each made up of Clusters (ctrl, RF, shftr, ALU, IQC, latch, mux), Memory Unit, Crossbar, Cache Tags, four 8 KB Cache Banks, and Memory Interface / Cache Control]
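The bandwidth figures on this slide can be checked with a short calculation. This is a sketch only: it assumes that "400 Mb/s" is the per-pin data rate of the 64-bit DDR2 interface (two transfers per 200 MHz clock) and that the cache sustains its full 4 x 128 b per cycle at the estimated 400 MHz core clock. The helper name is hypothetical.

```python
def peak_bytes_per_second(width_bits: int, transfers_per_second: int) -> int:
    """Peak bandwidth of an interface that moves width_bits per transfer."""
    return width_bits * transfers_per_second // 8


# DRAM: 64-bit bus, 200 MHz DDR2, so two transfers per clock (400 M transfers/s).
dram_bw = peak_bytes_per_second(64, 2 * 200_000_000)

# Cache: 4 x 128 b per cycle at the estimated 400 MHz core clock (assumption).
cache_bw = peak_bytes_per_second(4 * 128, 400_000_000)

print(dram_bw)              # 3200000000 bytes/s, i.e. 3.2 GB/s
print(cache_bw)             # 25600000000 bytes/s, i.e. 25.6 GB/s
print(cache_bw // dram_bw)  # 8
```

The DRAM result reproduces the 3.2 GB/s quoted on the slide. Under the stated assumptions, peak cache bandwidth is eight times peak DRAM bandwidth.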
SCALE Prototype Board
• Single Xilinx Virtex-II FPGA
  – Configured via direct JTAG connection or System ACE
• Multiple memory chips
  – Six Micron DDR2 SDRAMs
  – Two Micron Mobile SDRAMs
  – One Micron RLDRAM
  – One Samsung SRAM
• Two logic analyzer connections
• Multiple separate power islands
• Attached to custom test baseboard
  – Sixteen independently measurable power supplies
  – Byte-serial connection to a Linux PC
Module Placement
• Reduce the risk of the final custom chip implementation
  – Allow early rapid prototyping of many of the system interactions
• Provide a parameterizable prototype for architectural experiments
Testing Setup
Status
• Completed work
  – Single-issue, seven-stage pipeline MIPS processor core
    • Mapped to the board and passes our MIPS verification test suite
    • Will form the SCALE control processor
  – DDR2 memory controllers
    • Tested in isolation using simple memory traffic generators
• Work in progress
  – Cache subsystem
  – Vector-thread unit
Advantages of Using an FPGA
• Rapid full-system simulation of a large variety of designs
  – Allows extensive characterization of the design space
• Parameterization allows exploration of various tradeoffs
  – Cache parameters and replacement policies
  – Prefetch strategies
  – DRAM access scheduling policies and power-down modes
  – DRAM types (e.g., DDR2 vs. Mobile DRAM)
• Fast emulation system for SCALE software development
• Allows thorough debugging before going to silicon
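One way to picture the design-space exploration described above is as a sweep over a few parameter axes, with each resulting point corresponding to one FPGA build. The sketch below is illustrative only: the axis names and candidate values are hypothetical apart from the 8 KB bank size and the two DRAM types named on the slides, and it says nothing about how the prototype's parameters are actually expressed.

```python
from itertools import product

# Hypothetical parameter axes; only 8 KB banks, DDR2, and Mobile DRAM
# appear in the slides. Everything else is an illustrative assumption.
BANK_KB = [4, 8, 16]
REPLACEMENT_POLICIES = ["lru", "random"]
DRAM_TYPES = ["DDR2", "Mobile DRAM"]


def design_points():
    """Yield one configuration dict per combination of parameter values."""
    for bank_kb, policy, dram in product(BANK_KB, REPLACEMENT_POLICIES, DRAM_TYPES):
        yield {"bank_kb": bank_kb, "replacement": policy, "dram": dram}


points = list(design_points())
print(len(points))  # 12 candidate configurations
for point in points[:3]:
    print(point)
```

Even three small axes produce twelve configurations, and each added axis multiplies the count, so evaluation speed per configuration largely determines how much of the space can be characterized.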