Improving Pipelined Soft Processors with Multithreading Martin Labrecque

Improving Pipelined Soft Processors with Multithreading
Martin Labrecque, Gregory Steffan
ECE Dept., University of Toronto
Presented at RAAW 2006, Orlando, FL

Processors and FPGAs
• FPGAs increasingly implement SoCs, with CPUs
• Soft processors: processors in the FPGA fabric
[Figure: FPGA containing a processor and custom logic]
Soft processors are:
• Easier to program than HDL
• Customizable

Soft processors in Embedded Systems
What do designers care about? Minimizing area? Matching frequency? Hitting a performance target?
• Area efficiency: a combined metric
$$\text{Area efficiency} = \frac{\text{Performance}}{\text{Area}} = \frac{\text{Instr. Count} \times \text{Frequency}}{\text{Cycle Count} \times \text{Area}}$$
We trade off 4 criteria (soft processor power is related to area)
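As a rough illustration of the combined metric, the sketch below computes area efficiency from the four quantities on the slide. The function names, argument names and units (MHz, logic elements) are assumptions made for this example and do not come from the talk.

```python
def area_efficiency(instr_count, cycle_count, freq_mhz, area_les):
    """Instructions per second per logic element (LE).

    Follows the slide's metric: (instr. count x frequency) / (cycle count x area).
    Units are an assumption of this sketch: frequency in MHz, area in LEs.
    """
    if cycle_count <= 0 or area_les <= 0:
        raise ValueError("cycle_count and area_les must be positive")
    ipc = instr_count / cycle_count
    return ipc * freq_mhz * 1e6 / area_les


def relative_gain(new, base):
    """Fractional improvement of `new` over `base` (0.33 means +33%)."""
    return (new - base) / base
```

A processor that retires 1,000,000 instructions in 2,000,000 cycles at 100 MHz on 1000 LEs scores 50,000 instructions per second per LE under this definition.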

Multithreading
$$\frac{\text{Million Instr.} \times \text{Frequency}}{\text{\# Cycles} \times \text{Area}}$$
Replace processor stalls
• Fill them with instructions from other threads
• When to switch threads?
  • Every instruction (e.g. Sun's Niagara)
  • Convenient technique for in-order processors
Fine-grained multithreading: 1 instr. per thread in round-robin

Avoiding processor stall cycles
BEFORE: traditional execution, 3 stages
[Figure: pipeline timing diagram (stages F, E, W over time) with stall cycles]
• Data and control hazards create stall cycles
AFTER: multithreaded execution, 3 stages
[Figure: pipeline timing diagram (stages F, E, W over time) interleaving Thread 1, Thread 2 and Thread 3]
• Multithreading: execute streams of independent instructions
• Ideally, eliminates all stalls
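The before/after diagrams can be approximated with a toy issue model. This is a minimal sketch, not the SPREE simulator: it assumes the worst case where every instruction depends on the previous instruction of its own thread and there is no forwarding, so a thread can issue again only after its previous instruction has left the pipeline. The function name and the model itself are assumptions of this example.

```python
def simulate_round_robin(num_threads, num_stages, instrs_per_thread):
    """Return (issue_cycles, bubbles) for a strict round-robin issue schedule.

    Assumed worst case: each instruction depends on its thread's previous
    instruction and results are only visible after num_stages cycles.
    Pipeline drain after the last issue is not counted.
    """
    remaining = [instrs_per_thread] * num_threads
    last_issue = [None] * num_threads
    cycle = 0
    bubbles = 0
    while any(remaining):
        t = cycle % num_threads  # thread that owns this issue slot
        ready = last_issue[t] is None or cycle - last_issue[t] >= num_stages
        if remaining[t] > 0 and ready:
            remaining[t] -= 1
            last_issue[t] = cycle
        else:
            bubbles += 1  # stall: nothing enters the pipeline this cycle
        cycle += 1
    return cycle, bubbles


single = simulate_round_robin(num_threads=1, num_stages=3, instrs_per_thread=4)
multi = simulate_round_robin(num_threads=3, num_stages=3, instrs_per_thread=4)
```

Under these assumptions the single-threaded run wastes two of every three issue slots, while three threads on three stages fill every slot, which is the ideal case the slide describes.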

How useful is multithreading?
• Commercial SPs are single-threaded (Nios II, MicroBlaze)
• Fort et al. [FCCM'06] have shown that a multithreaded SP is smaller than multiple SPs, with some performance degradation
• We go further by showing that the area efficiency of a multithreaded SP is GREATER THAN the area efficiency of a single-threaded SP
Not straightforward; here is how we did it

Outline
• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading

Single-Threaded Processor (simplified)
[Figure: datapath with PC, +4 incrementer, instruction memory, register array, ALU, data memory, forwarding lines and hazard detection logic]

2-Threaded Processor (simplified)
[Figure: datapath with PC, +4 incrementer, instruction memory, register array, ALU, data memory, control and hazard detection logic]
• Replicate state for each thread
• Simplify control logic

Additional storage for multiple threads
• N x program counters, registers and data memory
• More efficiently done in FPGA than in ASIC
• Increase memory size while preserving frequency
Multithreading builds on the strengths of FPGAs
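One way to picture the replicated state is a single memory holding every thread's registers, addressed by thread id and register number. The sketch below is illustrative only: the class name, the 32-register default and the flat addressing scheme are assumptions of this example, not details taken from the slides.

```python
class ThreadedState:
    """Per-thread program counters plus one flat register array.

    Assumed layout for this sketch: thread t, register r lives at index
    t * regs_per_thread + r, so more threads just means a deeper memory.
    """

    def __init__(self, num_threads, regs_per_thread=32):
        self.num_threads = num_threads
        self.regs_per_thread = regs_per_thread
        self.pc = [0] * num_threads
        self.regs = [0] * (num_threads * regs_per_thread)

    def _index(self, thread, reg):
        if not 0 <= thread < self.num_threads:
            raise IndexError("thread id out of range")
        if not 0 <= reg < self.regs_per_thread:
            raise IndexError("register number out of range")
        return thread * self.regs_per_thread + reg

    def read(self, thread, reg):
        return self.regs[self._index(thread, reg)]

    def write(self, thread, reg, value):
        self.regs[self._index(thread, reg)] = value


state = ThreadedState(num_threads=2)
state.write(0, 5, 7)  # thread 0, register 5
state.write(1, 5, 9)  # same register number, different thread
```

Both threads use register 5 without interfering, because the thread id selects a separate region of the same memory.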

Outline
• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading

Measurement Infrastructure
• Benchmarks: MiBench, Dhrystone 2.1, RATES, XiRisc
• Processor RTL: single-thread processors from the SPREE System [FPGA'06]
• Modelsim RTL simulator
• Quartus II 5.0 CAD software, Stratix 1S40C5
• Measured: 1. Cycle Count, 2. Resource Usage, 3. Clock Frequency, 4. Power
We can measure area/performance/energy accurately

Evaluation methodology
• Same benchmark running on all threads
  • Some mixed-benchmark results are in the paper
• Run until completion of the last thread
• Same instruction space
• We present results with fixed-latency on-chip RAM
• We are implementing a solution for off-chip RAM

Processors: 3, 5 and 7 stages
• Pipe 3: F/D, R/EX/M, WB (1174 LEs, 78.3 MHz)
• Pipe 5: F, D, R/EX1, EX2/M, WB (1283 LEs, 86.79 MHz)
• Pipe 7: F, D, R, EX1, EX2/M, EX3/WB1, WB2 (1557 LEs, 100.59 MHz)
Legend: F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback
Best of each pipeline depth generated by SPREE
By default: thread count = number of pipeline stages

Area efficiency results
[Figure: area-efficiency gains of 33% (3-stage), 77% (5-stage) and 106% (7-stage)]
• Area efficiency is most improved with deeper pipelines
• 3- and 7-stage processors have similar area efficiency

IPC results for 3, 5 and 7 stages
[Figure: IPC versus single-threaded processor; ideal IPC = 1]
• 24%, 45% and 104% more instructions per cycle, respectively

Improvements to the Baseline Multithreaded Soft Processors
• Optimize away unpipelined multi-cycle paths
• Selection of architectural features
  1) Multiplier implementation
  2) Number of registers
  3) Number of threads
Combination of techniques optimizing area efficiency

1 - Changing multiplication support
[Figure: register file, Hi/Lo registers, multiplier and MUX]
• Default MIPS has Hi/Lo registers
• 3-operand multiplies (Nios II and MicroBlaze)
  – Two instructions compute the high and low parts
  – Avoids replicating Hi and Lo register support
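The idea of splitting a multiply into two 3-operand instructions can be sketched as two pure functions, one per half of the 64-bit product. The names `mul_lo` and `mul_hi` are hypothetical and are not the mnemonics of any particular ISA; signed 32-bit operands are an assumption of this example.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul_lo(a, b):
    """Low 32 bits of the signed 64-bit product (hypothetical instruction)."""
    return (to_signed32(a) * to_signed32(b)) & MASK32


def mul_hi(a, b):
    """High 32 bits of the signed 64-bit product (hypothetical instruction)."""
    return ((to_signed32(a) * to_signed32(b)) >> 32) & MASK32
```

Because each function writes an ordinary destination register, no dedicated Hi/Lo state has to be duplicated per thread; the cost is that a full 64-bit product takes two instructions.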

2 - Reducing the register file
• Not all registers are utilized [RAAW'06]
• Many threads can combine the savings
• Results in saved memory blocks
[Figure: shared register file with per-thread ranges ending at N-k and 2N-2k]
• Applicable to the 5-stage processor
• Slightly increases cycle count due to increased register pressure
• Allows area and frequency improvements
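A back-of-the-envelope sketch shows how small per-thread savings can add up to a whole memory block. The block capacity of 128 words and the thread and register counts below are assumptions chosen for illustration, and the model ignores any duplication of the register file needed for multiple read ports.

```python
import math


def register_blocks(num_threads, regs_per_thread, regs_removed, block_words):
    """Memory blocks needed to hold all threads' registers in one array.

    block_words is an assumed block capacity in register-sized words.
    """
    if not 0 <= regs_removed < regs_per_thread:
        raise ValueError("regs_removed must be in [0, regs_per_thread)")
    words = num_threads * (regs_per_thread - regs_removed)
    return math.ceil(words / block_words)


full = register_blocks(5, 32, 0, 128)     # 160 words
reduced = register_blocks(5, 32, 7, 128)  # 125 words
```

With these assumed numbers, removing 7 registers per thread brings five threads from 160 words to 125 words, which fits in one block instead of two.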

Reducing the Number of Threads
• Usually: # threads = # pipeline stages
• Last stage: writeback to non-conflicting register
[Figure: 3-stage pipeline timing diagram (stages F, E, W over time) for Thread 1, Thread 2 and Thread 3]
• Positive effect on the 5- and 7-stage processors
• Helps meet processing latency deadlines (shorter round-robin)
• Gives designers more flexibility
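The "shorter round-robin" point can be made concrete with a simple estimate. This sketch assumes strict round-robin issue with no stalls, so each thread gets one issue slot every `num_threads` cycles; the function name and the example numbers are illustrative assumptions, not measurements from the talk.

```python
def per_thread_time_us(instr_count, num_threads, freq_mhz):
    """Time for one thread to issue instr_count instructions, in microseconds.

    Assumes strict round-robin with no stalls: one slot per thread every
    num_threads cycles.
    """
    cycles = instr_count * num_threads
    return cycles / freq_mhz


five_threads = per_thread_time_us(1000, 5, 100.0)
four_threads = per_thread_time_us(1000, 4, 100.0)
```

Under these assumptions, dropping from five threads to four at the same clock shortens the time one thread needs for 1000 instructions from 50 to 40 microseconds, at the cost of running fewer threads concurrently.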

Conclusions
• Multithreaded SPs outperform single-threaded SPs
  • Assumes independent threads
  • Assumes use of on-chip memory
  • 33%, 77% and 106% increase in area efficiency
• Demonstrated that benefits increase with pipeline depth
• Techniques to optimize away unpipelined multi-cycle paths
• Selection and combination of architectural features
  • Multiplier support
  • Number of threads
  • Number of registers
Commercial FPGA makers should have a multithreaded SP

Long term goals
• Multiple multithreaded soft processors
• Research using off-chip memory hierarchy
• Study of synchronization mechanisms
• Make it easy to target and scale up for non-HW people
• Experimental testbed: NetFPGA
  – Virtex-II Pro
  – 4 x 1 Gbps Ethernet
  – PCI board
  – 64 MB DDR2 DRAM
• Stanford/Xilinx platform
• Collaboration with network researchers
• Perform real high-bandwidth experiments

Thank you
Martin Labrecque (martinl@eecg.utoronto.ca), Gregory Steffan
ECE Dept., University of Toronto

Where do threads come from?
• Event processing, e.g. multiple sources of interrupts
• Packet processing, e.g. CAN, RS-485, Ethernet, etc.
• Systems handling requests, e.g. bus controllers
For now, we consider independent threads

SPREE vs Nios II [IEEE TCAD'07]
[Figure: SPREE and Nios II processors compared; annotations: faster, smaller]

Architectural Parameters Used in SPREE
• Multiplication support: hardware FU or software routine
• Shifter implementation: flip-flops, multiplier, or LUTs
• Pipelining: depth (2-7 stages), forwarding lines
We focus on core microarchitecture (for now)

Contributions on Multithreaded Soft Processors
• Multithreaded SPs dominate single-threaded processors in area and IPC
• Demonstrated that these benefits increase with the # of pipeline stages
• Explained techniques to optimize away unpipelined multi-cycle paths
• Selection of architectural features
  • Number of threads
  • Number of registers
  • Multiplier support
Combination of techniques that optimize area efficiency

Unpipelined Multicycle Paths
• Example of a 3-stage pipeline with multicycle paths on loads, stores, shifts and multiplies
  ST: F/D, R/EX, EX, WB
  MT: F/D, R/EX, M, WB
• Not practical in ST because of hazard detection
• Important source of IPC improvement

Changing multiplication support
[Figure: results for the 3-stage, 5-stage and 7-stage processors]
• For multithreaded SPs, 3-operand multiplies always win

Reducing the Number of Threads
• Positive effect on the 5- and 7-stage processors

SPREE System (Soft Processor Rapid Exploration Environment)
■ Input: processor description (ISA and datapath)
  ■ Made of hand-coded components
■ SPREE System
  1. Verify ISA against datapath
  2. Datapath Instantiation
  3. Control Generation
■ Output: synthesizable Verilog RTL

Multithreading
$$\frac{\text{Million Instr.} \times \text{Frequency}}{\text{\# Cycles} \times \text{Area}}$$
Replace processor stalls
• Fill them with instructions from other threads
• When to switch threads?
  • Multiple techniques
  • Most common: every instruction (e.g. Sun's Niagara)
[Figure: interleaved instructions from T1, T2 and T3 in the pipeline over time]
Fine-grained multithreading: 1 instr. per thread in round-robin

Experimental Testbed: NetFPGA
  – Virtex-II Pro
  – 4 x 1 Gbps Ethernet
  – PCI board
  – 64 MB DDR2 DRAM
• Stanford/Xilinx platform
• Collaboration with network researchers
• Perform real high-bandwidth experiments

Removed load and branch delay slots in the code