
Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File
Stephen Hines, Gary Tyson, and David Whalley
Computer Science Dept., Florida State University
June 8-16, 2007

Instruction Packing
- Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
- Allow multiple instruction fetches from the IRF by packing instruction references together
  - Tightly packed: multiple IRF references
  - Loosely packed: piggybacks an IRF reference onto an existing instruction
- Facilitate parameterization of some instructions using an Immediate Table (IMM)
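The packing idea above can be sketched in a few lines. This is only an illustration, not the authors' compiler pass: the function name `pack`, the mnemonic strings, the example IRF contents and the greedy left-to-right policy are all assumptions made here. The limit of five references per tightly packed word comes from the T-type format; parameters (IMM), branches and register windows are ignored.

```python
MAX_TIGHT = 5  # a T-type word has five 5-bit instruction slots


def pack(stream, irf):
    """Greedily pack a list of instruction names (illustrative only).

    irf maps an instruction name to its IRF index. Returns a list of
    fetched words: ("tight", [indices]), ("loose", insn, index) or
    ("plain", insn).
    """
    words = []
    i = 0
    while i < len(stream):
        insn = stream[i]
        if insn in irf:
            # Tightly packed: gather up to five consecutive IRF references.
            run = []
            while i < len(stream) and stream[i] in irf and len(run) < MAX_TIGHT:
                run.append(irf[stream[i]])
                i += 1
            words.append(("tight", run))
        else:
            i += 1
            if i < len(stream) and stream[i] in irf:
                # Loosely packed: piggyback the next IRF reference.
                words.append(("loose", insn, irf[stream[i]]))
                i += 1
            else:
                words.append(("plain", insn))
    return words


# Hypothetical IRF contents and instruction stream.
irf = {"addu": 1, "lw": 2}
stream = ["addu", "lw", "mult", "addu", "jal"]
packed = pack(stream, irf)
print(len(stream), "instructions ->", len(packed), "fetched words")
```

In this toy stream, five instructions are delivered by three fetched words, which is the source of both the code size and the fetch energy savings.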

Execution of IRF Instructions
[Figure: Executing a Tightly Packed Param4c Instruction. Instruction Fetch Stage: PC, Instruction Cache, packed instruction latched in IF/ID. First Half of Instruction Decode Stage: IRF (insn 1 to insn 4, selected via IRWP) and IMM (imm 3), both feeding the Instruction Decoder.]

Outline
- Introduction
- IRF and Instruction Packing Overview
- Integrating an IRF with an L0 I-Cache
- Decoupling Instruction Fetch
- Experimental Evaluation
- Related Work
- Conclusions & Future Work

MIPS+IRF Instruction Formats
T-type: opcode (6 bits) | inst1 (5) | inst2 (5) | inst3 (5) | inst4/param (5) | s (1) | inst5/param (5)
R-type: opcode (6 bits) | rs/shamt (5) | rt (5) | rd (5) | function (6) | inst (5)
I-type: opcode (6 bits) | rs (5) | rt (5) | immediate (11) | inst (5)
J-type: opcode (6 bits) | win | immediate (24)
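To make the T-type layout concrete, here is a minimal sketch of field extraction from a 32-bit word. The field widths are taken from the slide; the bit positions (fields laid out from the most significant bit down, in the order listed) and the names `T_FIELDS`, `encode` and `decode` are assumptions for illustration.

```python
# Field widths from the T-type format; MSB-first ordering is an assumption.
T_FIELDS = [
    ("opcode", 6),
    ("inst1", 5),
    ("inst2", 5),
    ("inst3", 5),
    ("inst4_param", 5),
    ("s", 1),
    ("inst5_param", 5),
]


def encode(values, fields=T_FIELDS):
    """Pack named field values into one 32-bit word."""
    assert sum(width for _, width in fields) == 32
    word = 0
    for name, width in fields:
        word = (word << width) | (values[name] & ((1 << width) - 1))
    return word


def decode(word, fields=T_FIELDS):
    """Split a 32-bit word into its named fields."""
    assert sum(width for _, width in fields) == 32
    out = {}
    shift = 32
    for name, width in fields:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out


# Hypothetical field values, chosen only to exercise the round trip.
example = {"opcode": 0x3F, "inst1": 1, "inst2": 2, "inst3": 3,
           "inst4_param": 4, "s": 1, "inst5_param": 5}
word = encode(example)
print(hex(word), decode(word) == example)
```

Each 5-bit slot can name one of 32 IRF (or IMM) entries, which matches the 32-entry IRF window used in the evaluation.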

Previous Work in IRF
- Register Windowing + Loop Cache (MICRO 2005)
- Compiler Optimizations (CASES 2006)
  - Instruction Selection
  - Register Renaming
  - Instruction Scheduling

Integrating an IRF with an L0 I-Cache
- L0 or Filter Caches
  - Small and direct-mapped
    - Fast hit time
    - Low energy per access
    - Higher miss rate than L1
  - 256 B L0 I-cache, 8 B line size [Kin 97]
    - Fetch energy reduced 68%
    - Cycle time increased 46%!!!
- IRF reduces code size, while L0 only focuses on energy reduction at the cost of performance
- IRF can alleviate performance penalty associated with L0 cache misses, due to overlapping fetch
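The higher miss rate of a small direct-mapped cache is easy to see in a toy model. The sketch below uses the 256 B / 8 B-line geometry quoted from [Kin 97]; the class name `DirectMappedCache`, the 4-byte instruction size and the address sequence are assumptions for illustration, and no timing or energy is modeled.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks only tags (illustrative)."""

    def __init__(self, size_bytes=256, line_bytes=8):
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes  # 32 lines for 256 B / 8 B
        self.tags = [None] * self.num_lines

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        block = addr // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        hit = self.tags[index] == tag
        self.tags[index] = tag
        return hit


l0 = DirectMappedCache()
# Two sequential 8-byte lines, then address 256, which conflicts with 0.
results = [l0.access(addr) for addr in (0, 4, 8, 12, 256, 0)]
print(results)
```

With 8-byte lines and 4-byte instructions, straight-line code misses on every second fetch when the cache is cold, and any two addresses 256 bytes apart evict each other. Fetching fewer, denser words is one way an IRF can reduce the number of such accesses.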

L0 Cache Miss Penalty
[Figure: pipeline timing diagram over cycles 1-9 for Insn 1 through Insn 4; each instruction passes through IF, ID, EX, M, WB.]

Overlapping Fetch with an IRF
[Figure: pipeline timing diagram over cycles 1-9. Insn 1: IF ID EX M WB. Pack 2a: IFab IDa EXa Ma WBa. Pack 2b: IDb EXb Mb WBb (fetched together with Pack 2a in IFab). Insn 3: IF ID EX M WB.]
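A rough way to see why packing hides an L0 miss is a tiny timing model. The sketch below is an assumption-laden simplification, not the simulator used in the study: fetches are issued back to back and never stall on a full queue, decode handles one instruction per cycle, a miss adds one fetch cycle, and three stages (EX, M, WB) follow decode. The function name `last_writeback_cycle` is hypothetical.

```python
def last_writeback_cycle(fetches, back_end_stages=3):
    """Cycle in which the final instruction writes back.

    fetches is a list of (instructions_in_word, miss_penalty_cycles).
    Assumes back-to-back fetch and single-instruction decode.
    """
    fetch_done = 0   # cycle in which the current fetch completes
    last_decode = 0  # cycle in which the previous instruction decoded
    for n_insns, miss_penalty in fetches:
        fetch_done += 1 + miss_penalty
        for _ in range(n_insns):
            # Decode needs the word fetched and the decoder free.
            last_decode = max(fetch_done + 1, last_decode + 1)
    return last_decode + back_end_stages


# Four instructions, the third fetch misses in the L0 (1 extra cycle).
unpacked = [(1, 0), (1, 0), (1, 1), (1, 0)]
# Same four instructions, with the second and third packed into one word;
# the miss now falls on the fetch of the fourth instruction.
packed = [(1, 0), (2, 0), (1, 1)]

print(last_writeback_cycle(unpacked), last_writeback_cycle(packed))
```

Under these assumptions the packed sequence finishes one cycle earlier, because the second instruction of the pack is decoded while the missing fetch is still being serviced.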

Decoupling Instruction Fetch
- Instruction bandwidth in a pipeline is usually uniform (fetch, decode, issue, commit, …)
  - Artificially limits the effective design space
- Front-end throttling improves energy utilization by reducing the fetch bandwidth in areas of low ILP
- IRF can provide virtual front-end throttling
  - Fetch fewer instructions every cycle, but allow multiple issue of packed instructions
  - Areas of high ILP are often densely packed
  - Lower ILP for infrequently executed sections of code

Out-of-order Pipeline Configurations

Experimental Evaluation
- MiBench embedded benchmark suite: 6 categories representing common tasks for various domains
- SimpleScalar MIPS/PISA architectural simulator
  - Wattch/Cacti extensions for modeling energy consumption (inactive portions of pipeline only dissipate 10% of normal energy when using cc3 clock gating)
- VPO (Very Portable Optimizer) targeted for SimpleScalar MIPS/PISA
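The clock-gating assumption in the energy model can be illustrated with a one-line calculation. The sketch below encodes only what the slide states (an idle unit dissipates 10% of its normal energy); the function name `gated_energy`, the per-cycle energy unit and the activity numbers are hypothetical, and the actual Wattch model is more detailed.

```python
def gated_energy(active_cycles, total_cycles, energy_per_active_cycle,
                 idle_fraction=0.10):
    """Energy of one pipeline unit when idle cycles cost idle_fraction."""
    idle_cycles = total_cycles - active_cycles
    return (active_cycles * energy_per_active_cycle
            + idle_cycles * energy_per_active_cycle * idle_fraction)


# Hypothetical unit that is busy for 60 of 100 cycles.
with_gating = gated_energy(60, 100, 1.0)
without_gating = gated_energy(60, 100, 1.0, idle_fraction=1.0)
print(with_gating, without_gating)
```

Under this model, any technique that leaves the fetch logic idle more often (for example, by delivering several instructions per fetch) shows up directly as lower energy.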

L0 Study Configuration Data
Parameter               Low-Power In-order Embedded Processor
I-Fetch Queue           4 entries
Branch Predictor        Bimodal, 128 entries, 3 cycle penalty
Fetch/Decode/Issue      Single instruction
RUU size                8
LSQ size                8
L1 Data Cache           16 KB, 256 lines, 16 B line, 4-way s.a., 1 cycle hit
L1 Instruction Cache    16 KB, 256 lines, 16 B line, 4-way s.a., 1/2 cycle hit
L0 Instruction Cache    256 B, 32 lines, 8 B line, direct mapped, 1 cycle hit
Memory Latency          32 cycles
IRF/IMM                 4 windows, 32-entry IRF (128 total), 32-entry IMM, 1 branch/pack

Execution Efficiency for L0 I-Caches

Energy Efficiency for L0 I-Caches

Decoupled Fetch Configurations
Parameter                  High-end Out-of-order Embedded Processor
I-Fetch Queue              4/8 entries
Branch Predictor           Bimodal, 2048 entries, 3 cycle penalty
Fetch Width                1/2/4
Decode/Issue/Commit Width  1/2/3/4
RUU size                   16
LSQ size                   8
L1 Data Cache              32 KB, 512 lines, 16 B line, 4-way s.a., 1 cycle hit
L1 Instruction Cache       32 KB, 512 lines, 16 B line, 4-way s.a., 1 cycle hit
Unified L2 Cache           256 KB, 1024 lines, 64 B line, 4-way s.a., 6 cycle hit
Memory Latency             32 cycles
IRF/IMM                    4 windows, 32-entry IRF (128 total), 32-entry IMM, 1 branch/pack

Execution Efficiency for Asymmetric Pipeline Bandwidth

Energy Efficiency for Asymmetric Pipeline Bandwidth

Energy-Delay² for Asymmetric Pipeline Bandwidth
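For readers unfamiliar with the metric, energy-delay-squared weights execution time more heavily than energy: it is the product of energy and the square of delay. A minimal sketch follows; the function name and the normalized numbers are hypothetical and are not results from this study.

```python
def energy_delay_squared(energy, delay):
    """Energy x delay^2; lower is better."""
    return energy * delay ** 2


# Values normalized to a baseline of energy = 1.0, delay = 1.0.
baseline = energy_delay_squared(1.0, 1.0)
# Hypothetical design point: 10% less energy, 4% longer execution time.
candidate = energy_delay_squared(0.90, 1.04)
print(baseline, candidate)
```

Because delay is squared, a configuration that saves energy by narrowing fetch only improves this metric if the slowdown stays small, which is why it is a useful lens on asymmetric pipeline bandwidth.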

Related Work
- L-caches: subdivide instruction cache, such that one portion contains the most frequently accessed code
- Loop Caches: capture simple loop behaviors and replay instructions
- Zero Overhead Loop Buffers (ZOLB)
- Pipeline gating / Front-end throttling: stall fetch when in areas of low IPC

Conclusions and Future Work
- IRF can alleviate fetch bottlenecks from L0 I-Cache misses or branch mispredictions
  - Increased IPC of L0 system by 6.75%
  - Further decreased energy of L0 system by 5.78%
- Decoupling fetch provides a wider spectrum of design points to be evaluated (energy/performance)
- Future Topics
  - Can we pack areas where L0 is likely to miss?
  - IRF + encrypted or compressed I-Caches
  - IRF + asymmetric frequency clustering (of pipeline backend functional units)

The End
Questions?


Energy Consumption

Static Code Size

Conclusions & Future Work
- Compiler optimizations targeted specifically for IRF can further reduce energy (12.2% → 15.8%), code size (16.8% → 28.8%) and execution time
- Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication
- As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations
- Register targeting and loop unrolling should also be explored with instruction packing
- Enhanced parameterization techniques


Instruction Redundancy
- Profiled largest benchmark in each of six MiBench categories
- Most frequent 32 instructions comprise 66.5% of total dynamic and 31% of total static instructions
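The kind of measurement behind these percentages can be sketched as a frequency count over an instruction trace. The function name `top_k_coverage` and the toy trace are assumptions for illustration; a real profile would count fully specified instructions (opcode plus operands) from a benchmark run.

```python
from collections import Counter


def top_k_coverage(trace, k=32):
    """Fraction of the trace covered by its k most frequent instructions."""
    if not trace:
        return 0.0
    counts = Counter(trace)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(trace)


# Toy dynamic trace: three distinct instructions, heavily skewed.
trace = ["addu $2,$2,$3"] * 6 + ["lw $4,0($5)"] * 3 + ["jal foo"]
print(top_k_coverage(trace, k=2))
```

Running the same count over the static instruction list instead of the dynamic trace gives the static figure, and the k most frequent entries are natural candidates for the 32-entry IRF.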

Compilation Framework
