Application-Specific Customization of Soft Processor Microarchitecture, Peter Yiannacouras

Application-Specific Customization of Soft Processor Microarchitecture
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
University of Toronto, Electrical and Computer Engineering

Processors and FPGA Systems
• Processors are the "heart" of FPGA systems
  [Diagram: soft processor alongside UART, custom logic, memory interface, and Ethernet]
• Performs coordination and even computation
    - Better processors => less hardware to design
We seek improvement through customization

Enablers for customizing soft processors
1. FPGA reconfigurability
    - No hardware cost for altering a design
2. Applications differ in architectural requirements
    - Can specialize architecture for each application
3. A soft processor might be used to run either:
    a) A single application
    b) A single class of applications
    c) Many applications, but can be reconfigured
We want to evaluate effectiveness of specialization

Research Goals
1. Investigate "Application-tuning"
    - Tune microarchitecture to favour an application
    - Preserve general purpose functionality
2. Investigate "Instruction-set Subsetting"
    - Sacrifice general purpose functionality
    - Eliminate hardware not required by application
Investigate efficiency through real implementations

SPREE System (Soft Processor Rapid Exploration Environment)
■ Input: Processor description (ISA description, datapath)
■ SPREE system
    1. Verify ISA against datapath
    2. Datapath instantiation
    3. Control generation
        ■ Multi-cycle/variable-cycle FUs
        ■ Multiplexer select signals
        ■ Interlocking
        ■ Branch handling
■ Output: Synthesizable Verilog (RTL)

Back-end Infrastructure
• Inputs: RTL; benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
• Tools: Modelsim (RTL simulator); Quartus II 5.0 (CAD software), targeting a Stratix 1S40C5
• Measurements:
    1. Cycle count
    2. Resource usage
    3. Clock frequency
    4. Power
We can measure area/performance/energy accurately

Exploration of Architectural Customizations
1. Architectural tuning
2. Instruction-set subsetting

What exactly are we tuning?
• Hardware vs software multiplication
• Shifter implementation
• Pipelining
    - Depth
    - Organization
    - Forwarding
• Not ISA (we use MIPS-I)
We focus on core microarchitecture

Comparison to Altera's Nios II
• Has three variations:
    - Nios II/e: unpipelined, no HW multiplier
    - Nios II/s: 5-stage, with HW multiplier
    - Nios II/f: 6-stage, dynamic branch prediction
• Caveats: not a completely fair comparison
    - Very similar but tweaked ISA
    - Nios II supports exceptions, OS, and caches
• We do not, and save on the hardware costs
We believe the comparison is meaningful

SPREE vs Nios II
[Plot: SPREE and Nios II processors compared on speed ("faster") and area ("smaller"). Annotated SPREE design: 3-stage pipe, HW multiply, multiply-based shifter]
Competitive while allowing more customization

1. Architectural Tuning Experiment
• Hardware vs software multiplication
• Shifter implementation
• Pipelining
    - Depth
    - Organization
    - Forwarding
What is the best overall (general purpose) configuration?
What are the best per-application (application-tuned) configurations?
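The two questions in this experiment amount to two different selections over the same table of measurements. A minimal sketch in Python, assuming a table of efficiency values per configuration and benchmark; the configuration names, benchmark names, and numbers are made up for illustration, and ranking by geometric mean is an assumption, not necessarily the averaging used in this work:

```python
from math import prod

# Hypothetical efficiency (performance per area) of each configuration on each
# benchmark. Illustrative values only, not measurements from the talk.
efficiency = {
    "3-stage, HW multiply": {"bench_a": 1.00, "bench_b": 0.90, "bench_c": 1.10},
    "5-stage, HW multiply": {"bench_a": 1.20, "bench_b": 0.70, "bench_c": 1.00},
    "3-stage, SW multiply": {"bench_a": 0.80, "bench_b": 1.30, "bench_c": 0.90},
}


def geomean(values):
    """Geometric mean of a non-empty iterable of positive numbers."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def best_general_purpose(table):
    """Single configuration with the highest mean efficiency over all benchmarks."""
    return max(table, key=lambda cfg: geomean(table[cfg].values()))


def best_per_application(table):
    """For each benchmark, the configuration that is most efficient on it."""
    benchmarks = next(iter(table.values())).keys()
    return {b: max(table, key=lambda cfg: table[cfg][b]) for b in benchmarks}


general = best_general_purpose(efficiency)
tuned = best_per_application(efficiency)
```

With these sample numbers the general-purpose pick is one configuration for all benchmarks, while each benchmark prefers a different configuration; the gap between the two selections is what application tuning recovers.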

Performance per Area of All Processors
[Chart]
14.1% improvement over general purpose, 30% for some applications

2. Instruction-set Subsetting
• SPREE automatically removes
    - Unused connections
    - Unused components
• Reduce processor by reducing the ISA
• Can create application-specific processor
    - Eliminate unused parts of the ISA
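The idea behind subsetting can be sketched as a set computation: collect the instructions an application actually uses, then keep only the hardware those instructions need. The Python below is a rough illustration only. The instruction-to-component map is hypothetical and far coarser than a real datapath, and it is not SPREE's actual representation or algorithm:

```python
# Hypothetical map from a few MIPS-I instructions to the datapath components
# each one needs. Component names are invented for this sketch.
REQUIRES = {
    "addu": {"regfile", "alu"},
    "lw":   {"regfile", "alu", "data_mem"},
    "sw":   {"regfile", "alu", "data_mem"},
    "sll":  {"regfile", "shifter"},
    "mult": {"regfile", "multiplier", "hi_lo"},
    "mflo": {"regfile", "hi_lo"},
    "beq":  {"regfile", "alu", "branch_unit"},
}


def used_instructions(program):
    """Collect opcodes from disassembly lines such as 'addu $t0, $t1, $t2'."""
    return {line.split()[0] for line in program if line.strip()}


def subset(program):
    """Return (components to keep, components that can be removed)."""
    used = used_instructions(program)
    unknown = used - set(REQUIRES)
    if unknown:
        raise ValueError(f"no component map for: {sorted(unknown)}")
    needed = set().union(*(REQUIRES[op] for op in used))
    everything = set().union(*REQUIRES.values())
    return needed, everything - needed


program = [
    "addu $t0, $t1, $t2",
    "lw   $t3, 0($t0)",
    "beq  $t3, $zero, done",
]
keep, removable = subset(program)
```

For this sample program the shifter, multiplier, and HI/LO registers are never needed, so a subsetted processor could drop them at the cost of no longer running programs that use `sll`, `mult`, or `mflo`.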

Instruction-set Usage of Benchmark Set
[Chart]
• Applications do not use the complete ISA
Strong potential for hardware reduction

Fraction of Area Reduction from Instruction-set Subsetting
[Chart]
Area reduced by 60% in some, 25% on average

Combining Application Tuning and Instruction-set Subsetting
[Chart; annotated value: 33.2%]
Efficiency gain: subsetting 16%, combined 24.5%

Summary of Presented Architectural Conclusions
• Application tuning: 14% average efficiency gain
    - Will only increase as we explore more architectures
• Instruction-set subsetting
    - Up to 60% area & energy savings
    - 16% average efficiency gain
• Combined application tuning & subsetting
    - 24.5% average efficiency gain

General Purpose vs App-tuned vs Nios II
• Choose best Nios II overall and per application
[Chart; annotated value: 17%]
SPREE customizations allow 17% better efficiency than Nios II

Future Work
• Consider other exciting architectural axes
    - Branch prediction, aggressive forwarding
    - ISA changes
    - Datapaths (e.g. VLIW)
    - Caches and memory hierarchy
• Compiler assistance
    - Can improve tuning & subsetting

Metrics for Measurement
• Efficiency: performance per area
• Performance: MIPS
• Area: equivalent Stratix Logic Elements (LEs)
    - Relative silicon areas used for RAMs/multipliers
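These metrics can be written out as a small calculation. The sketch below assumes the usual definition of MIPS (millions of instructions per second) derived from instruction count, cycle count, and clock frequency; the function names and any numbers passed to them are illustrative and do not come from the talk. Converting RAMs and multipliers to equivalent LEs is left out because the conversion factors are not given here.

```python
def mips(instructions, cycles, clock_mhz):
    """Millions of instructions per second.

    Run time is cycles / (clock_mhz * 1e6) seconds, so
    MIPS = instructions / run_time / 1e6 = (instructions / cycles) * clock_mhz.
    """
    return instructions / cycles * clock_mhz


def efficiency(instructions, cycles, clock_mhz, area_les):
    """Performance per area: MIPS per equivalent logic element."""
    return mips(instructions, cycles, clock_mhz) / area_les


def percent_gain(tuned, baseline):
    """Relative improvement of one efficiency value over another, in percent."""
    return (tuned / baseline - 1.0) * 100.0
```

For example, a processor that retires 1,000,000 instructions in 2,000,000 cycles at 80 MHz delivers 40 MIPS; at 1,000 equivalent LEs that is 0.04 MIPS per LE.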

Energy Impact of Subsetting
[Chart]
Up to 60% energy savings and 25% on average

What exactly are we tuning?
• HW multiply FU
• Shifter type
• Pipelining
    - Depth
    - Organization
    - Forwarding
Are we tuning enough?
[Diagram labels: Instruction Set; Microarchitecture; Control; Pipeline; Datapath; Reg File; FUs; ISA Extensions (Tensilica, Stretch); Memory Hierarchy]

Performance per Area of All Processors
[Chart]
14.1% improvement over general purpose, 30% for some applications

Processors and FPGA Designs
[Diagram: FPGA containing a soft processor, UART, custom logic, memory interface, and Ethernet]
Our goal is to explore customization of soft processors