Physical Register Inlining (PRI)

Mikko H. Lipasti (1), Brian Mestan (2), and Erika Gunadi (1)
(1) Department of Electrical and Computer Engineering, University of Wisconsin-Madison
(2) IBM Microelectronics, IBM Corporation, Austin, TX
http://www.ece.wisc.edu/~pharm

Demand for Large Register Files
[Pipeline diagram: Fetch, Dcd, Rnm, Sched, Disp, RF, Exe, Retire, Commit; annotated "Instruction Window"]
• Deeper pipeline
• Increasing pressure on register file
• Lots of attention / prior work

Challenges with Scaling Register Files
• Additional pipe stages needed for access
  • Increases branch misprediction penalty
  • Increases scheduling misprediction penalty
  • Requires additional bypass logic
• Further increases pipeline depth
  • Increases the demand for more registers

Physical Register Lifetime
[Figure: physical register lifetime, shown for width 8 and width 4]
• Managed inefficiently

Prior Work
• Register file caching [Swenson et al. 1988, Zalamea et al. 2000, Postiff et al. 2001, Cruz et al. 2000, Borch et al. 2002]
• Late allocation [Gonzalez et al. 1998, Monreal et al. 1999]
• Efficient management
  • Early deallocation [Moudgill et al. 1993]
  • Program semantics [Martin et al. 1997, Lo et al. 1999]
  • Checkpointing [Martinez et al. 2002, Akkary et al. 2003]
  • Value-based optimizations [Jourdan et al. 1998]

Early Deallocation
• Moudgill et al. 1993
• Focused on "last read to release"
• Avoid waiting for the next writer to commit
• Deallocate registers as soon as:
  • Complete (complete flag)
  • Unmapped (unmap flag)
  • No outstanding readers (reference counter)
• Still requires next writer to enter the window
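The three release conditions above can be sketched as a small predicate. This is only an illustration: the `PhysRegState` class and its field names are hypothetical and are not taken from the slides or from Moudgill et al.

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    """Hypothetical per-register bookkeeping for early deallocation."""
    complete: bool = False   # producer has written its result
    unmapped: bool = False   # a later writer has renamed the same logical register
    readers: int = 0         # renamed consumers that have not yet read the value


def can_release(reg: PhysRegState) -> bool:
    """All three conditions from the list above must hold."""
    return reg.complete and reg.unmapped and reg.readers == 0


p7 = PhysRegState(readers=2)           # two consumers were renamed against p7
p7.complete = True                     # producer executes
p7.readers -= 1                        # first consumer reads
one_reader_left = can_release(p7)      # one reader left and still mapped
p7.readers -= 1                        # last consumer reads
still_mapped = can_release(p7)         # next writer has not entered the window
p7.unmapped = True                     # next writer is renamed
released = can_release(p7)
```

The middle check shows the limitation named in the last bullet: even with no readers left, the register stays allocated until the next writer is renamed.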

Physical Register Inlining
• Exploits narrow operands: a sizable fraction of operands can be stored in fewer than 8 bits [Canal et al. 2000]
  • Often fewer bits than needed to specify a physical register
• Store the value instead of the pointer
• Stores narrow values in the map table
• Reduces physical register lifetime
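A minimal sketch of "store the value instead of the pointer", assuming a map entry that is either a pointer or an inlined value. The `Pointer` and `Inline` types, the helper names, and the signed interpretation of the narrow field are assumptions for illustration; the 7-bit width is the integer setting listed on the machine model slide.

```python
from dataclasses import dataclass
from typing import Dict, Union

INLINE_BITS = 7  # assumed width; the machine model slide lists 7 bits for integer PRI


@dataclass(frozen=True)
class Pointer:
    preg: int            # index into the physical register file


@dataclass(frozen=True)
class Inline:
    value: int           # narrow value held directly in the map entry


MapEntry = Union[Pointer, Inline]


def fits_inline(value: int, bits: int = INLINE_BITS) -> bool:
    """True if value is representable as a signed `bits`-bit integer."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def read_operand(entry: MapEntry, prf: Dict[int, int]) -> int:
    """Return the operand value, touching the register file only for pointers."""
    if isinstance(entry, Inline):
        return entry.value
    return prf[entry.preg]


prf = {12: 1_000_000, 13: 5}
rename_map: Dict[str, MapEntry] = {"r1": Pointer(12), "r2": Pointer(13)}

# p13 holds 5, which is narrow: inline it, after which p13 can be released.
if fits_inline(prf[13]):
    rename_map["r2"] = Inline(prf[13])
    del prf[13]          # models releasing the physical register
```

After the inlining step, a consumer of r2 gets its operand from the map entry and p13 no longer needs to stay allocated, which is where the shorter register lifetime comes from.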

Operand Significance
[Figure: operand significance graph]
• FP graph is also in the paper; it exploits 0.0/1.0 (54%)

Outline
• Motivation
• Prior Work
• Physical Register Inlining
  • Quick Microarchitectural Review
  • Modifications Needed
  • PRI + early deallocation
• Experiments
• Conclusions

Microarchitectural Review
• Register Rename/Map Tables
  • Maps logical names to physical names
  • Removes false name dependences
  • Two common types: RAM and CAM
• CAM map is positional
  • Not suitable for storing values
[Figure: RAM map and CAM map layouts, labeled with logical reg #, phys reg #, and a valid bit (V)]
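How a RAM-style map removes false name dependences can be shown with a toy renamer. This is a sketch under simplifying assumptions: the `rename` function is hypothetical, instructions are plain tuples, and the supply of physical registers is treated as unlimited.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str]  # (destination, source 1, source 2), logical names


def rename(program: List[Instr], num_logical: int = 4) -> List[Instr]:
    """Rename logical registers to fresh physical registers, in program order."""
    # RAM-style map: one entry per logical register, holding a physical name.
    ram_map: Dict[str, str] = {f"r{i}": f"p{i}" for i in range(num_logical)}
    next_free = num_logical
    renamed = []
    for dst, src1, src2 in program:
        # Sources are looked up first so they see the previous mapping.
        s1, s2 = ram_map[src1], ram_map[src2]
        ram_map[dst] = f"p{next_free}"
        next_free += 1
        renamed.append((ram_map[dst], s1, s2))
    return renamed


program = [
    ("r3", "r1", "r2"),   # writes r3
    ("r0", "r3", "r1"),   # reads r3 (true dependence, must be preserved)
    ("r3", "r1", "r1"),   # writes r3 again (WAW with the first, WAR with the second)
]
renamed = rename(program)
```

After renaming, the two writers of r3 target different physical registers, so only the true dependence from the first instruction to the second remains.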

Microarchitectural Review
• Allocating and Freeing Physical Registers
  • Allocates physical register at decode; map table entry is updated
  • Releases physical register when next writer is committed
• Checkpoint and Recovery of Register Map
  • Optimization to reduce branch misprediction penalty
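The baseline allocate, release, and checkpoint behavior described above can be sketched as follows. The `Renamer` class and its method names are hypothetical, and the recovery scheme (popping wrong-path writers off the tail) is one simple way to model it, not necessarily how the evaluated machine does it.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class Renamer:
    """Hypothetical renamer: RAM map, free list, and map checkpoints."""

    def __init__(self, num_logical: int, num_physical: int) -> None:
        self.map: Dict[int, int] = {r: r for r in range(num_logical)}
        self.free: Deque[int] = deque(range(num_logical, num_physical))
        # In-flight writers in program order: (new preg, preg it displaced).
        self.in_flight: Deque[Tuple[int, int]] = deque()
        self.renamed_count = 0
        # Each checkpoint saves the map and the rename count at that point.
        self.checkpoints: List[Tuple[Dict[int, int], int]] = []

    def rename_dest(self, logical: int) -> int:
        """Allocate a physical register and update the map entry."""
        new = self.free.popleft()
        self.in_flight.append((new, self.map[logical]))
        self.map[logical] = new
        self.renamed_count += 1
        return new

    def commit_oldest(self) -> int:
        """Commit the oldest writer, releasing the register it displaced."""
        _, displaced = self.in_flight.popleft()
        self.free.append(displaced)
        return displaced

    def checkpoint(self) -> None:
        self.checkpoints.append((dict(self.map), self.renamed_count))

    def recover(self) -> None:
        """Restore the latest checkpoint and reclaim wrong-path registers."""
        saved_map, saved_count = self.checkpoints.pop()
        while self.renamed_count > saved_count:
            new, _ = self.in_flight.pop()
            self.free.appendleft(new)
            self.renamed_count -= 1
        self.map = saved_map


r = Renamer(num_logical=4, num_physical=8)
first = r.rename_dest(3)        # r3 -> p4, displacing p3
r.checkpoint()                  # taken at a predicted branch
wrong_path = r.rename_dest(3)   # r3 -> p5 on the speculative path
r.recover()                     # misprediction: map and free list roll back
freed = r.commit_oldest()       # first writer commits, releasing p3
```

Note that p3 stays allocated from the moment p4 is assigned until the writer of p4 commits, even if nothing reads p3 in between; that idle interval is what PRI and early deallocation try to shrink.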

Modifications to Data Flow
[Pipeline diagram: Fetch, Dcd, Rnm, Queue, Sched, Disp, RF, Exe, Retire, Commit, with the Map, Payload RAM, ALU, and a "Narrow?" check]
• Execution stage must allow both operands to be read from payload RAM
  • Already supports one immediate operand
• Sign extension between payload RAM and the ALU input
• Narrow checking logic to verify if the operands are narrow
• Narrow datapath back to the map table
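The narrow check and the sign extension named above can be sketched at the bit level. The 64-bit word width and the function names are assumptions for illustration; the 7-bit field matches the integer setting on the machine model slide.

```python
WORD_BITS = 64     # assumed datapath width
INLINE_BITS = 7    # assumed inline width (integer setting in the machine model)
WORD_MASK = (1 << WORD_BITS) - 1


def is_narrow(word: int, bits: int = INLINE_BITS) -> bool:
    """Check that the sign bit and every bit above it are all equal."""
    upper = word >> (bits - 1)                     # sign bit and everything above
    all_ones = (1 << (WORD_BITS - bits + 1)) - 1
    return upper == 0 or upper == all_ones


def truncate(word: int, bits: int = INLINE_BITS) -> int:
    """Keep only the low `bits` bits, as sent back on the narrow datapath."""
    return word & ((1 << bits) - 1)


def sign_extend(narrow: int, bits: int = INLINE_BITS) -> int:
    """Expand a `bits`-bit field back to a full WORD_BITS-bit pattern."""
    sign = 1 << (bits - 1)
    value = (narrow ^ sign) - sign                 # signed integer value
    return value & WORD_MASK                       # back to an unsigned bit pattern


minus_three = (-3) & WORD_MASK                     # two's complement pattern of -3
round_trip = sign_extend(truncate(minus_three))
```

Words are modeled as unsigned 64-bit patterns, so a value is narrow exactly when truncating and sign-extending it returns the original pattern.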

Modifications to Map Table
• Registers are freed from both the retire/wb stage and the commit stage
• Must tolerate duplicate deallocations of the same physical register
  • Once as narrow, again when the next writer commits
• Map entries need to be writable from the rename stage and the retire/wb stage
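One way to make deallocation tolerant of duplicates is to keep the free list as a bit per register, so that releasing twice is idempotent. This is an assumed design for illustration; the slides do not say how the free list is organized, and `FreeBitmap` is a hypothetical name.

```python
class FreeBitmap:
    """Hypothetical free list kept as one bit per physical register."""

    def __init__(self, num_physical: int) -> None:
        self.is_free = [True] * num_physical

    def allocate(self) -> int:
        preg = self.is_free.index(True)   # raises ValueError if none are free
        self.is_free[preg] = False
        return preg

    def release(self, preg: int) -> None:
        # Setting an already-set bit is harmless, so a second release of the
        # same register does not create a duplicate free-list entry.
        self.is_free[preg] = True

    def free_count(self) -> int:
        return sum(self.is_free)


fl = FreeBitmap(4)
a = fl.allocate()
b = fl.allocate()
fl.release(a)        # released early, when the value turned out to be narrow
fl.release(a)        # released again when the next writer commits
```

This sketch only covers a second release that arrives while the register is still free. If the register has been reallocated in between, the late release would have to be suppressed by some other mechanism, which is outside what this example models.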

Stale Pointer Problem
[Diagram: pointers into the PRF are held by the MAP, the checkpoint copies, the ROB, and the IssueQ]
• Deallocating physical registers early makes these pointers stale
  • Equivalent to the garbage collection issue
• Two choices
  • Delay deallocation until the pointers are no longer valid (refcount)
  • Update all pointers (ideal IPC)
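The second choice, updating every pointer, can be sketched as a sweep over all structures that hold register pointers. The `Ptr` and `Imm` types and the `inline_everywhere` helper are hypothetical; in hardware this would be an associative update rather than a loop, which is why the slide treats it as the ideal case.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Ptr:
    preg: int


@dataclass
class Imm:
    value: int


Operand = Union[Ptr, Imm]


def inline_everywhere(preg: int, value: int, structures: List[List[Operand]]) -> int:
    """Replace every pointer to `preg` with the narrow value; return the count."""
    replaced = 0
    for entries in structures:
        for i, operand in enumerate(entries):
            if isinstance(operand, Ptr) and operand.preg == preg:
                entries[i] = Imm(value)
                replaced += 1
    return replaced


rename_map = [Ptr(2), Ptr(9)]
checkpoint = [Ptr(2), Ptr(4)]
issue_queue = [Ptr(2), Ptr(7)]
count = inline_everywhere(2, 5, [rename_map, checkpoint, issue_queue])
```

Once no structure holds a pointer to p2, the register can be released immediately. The reference-counting alternative avoids the sweep by leaving the pointers in place and holding the register until the count of outstanding pointers reaches zero.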

Map Table Checkpoints Problem
• Map table checkpoints need to be updated when a narrow operand is written
• Lazy update
  • Complex, but not cycle-time critical
• Checkpoint reference counting
  • Similar to Akkary et al.
  • Delays deallocation, reduces IPC benefit slightly

Example of WAR Violation
    Load p1 <= MEM[p7]
    And  p2 <= p3 & p4      (narrow)
    Add  p5 <= p1 + p2      (WAR violation with the following write to p2)
    Or   p2 <= p8 & p9
• Rare, but frequent enough to affect performance
• Must have efficient solution
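The hazard in this instruction sequence can be replayed with a toy register file. The concrete values, the `free_list`, and the execution order are assumptions chosen to expose the problem; the last instruction keeps the `&` operator exactly as the slide writes it.

```python
prf = {1: None, 2: None, 3: 0b0110, 4: 0b0101, 7: 0, 8: 0xFF, 9: 0x3C}
memory = {0: 100}

# And p2 <= p3 & p4 executes and produces a narrow result.
prf[2] = prf[3] & prf[4]          # 0b0100 == 4
expected_add_input = prf[2]

# PRI inlines the value in the map table and releases p2 right away.
free_list = [2]

# The last instruction is renamed next and receives the just-released p2 as
# its destination. It executes before Add, which is still waiting on the load.
or_dest = free_list.pop()
prf[or_dest] = prf[8] & prf[9]    # overwrites p2 with 0x3C == 60

# The load finally returns and Add issues, still holding the stale pointer p2.
prf[1] = memory[prf[7]]
stale_sum = prf[1] + prf[2]                 # uses the younger instruction's result
correct_sum = prf[1] + expected_add_input   # what program order requires
```

Add was renamed before p2 was inlined, so it carries a pointer rather than the value. Either delaying the release of p2 until Add has read it, or rewriting Add's operand to the inlined value, removes the violation.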

Rename Table WAW Hazards
[Diagram: pipeline stages Fetch, Decode, Execute, Retire, Commit. r3 = r1 + r2 is renamed to p4 = p1 + p2 and r3 = r1 & r2 to p5 = p1 & p2; the MAP entry for r3 goes from p3 to p4 to p5; the ROB records destinations p4 and p5; the narrow result of p4 heading back to the map entry for r3 is marked "WAW!"]
• WAW hazards
  • Writes narrow value to a remapped map entry
  • Must ensure that the map entry has not been remapped
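The check that "the map entry has not been remapped" can be sketched as a compare before the narrow write-back. The `write_back_narrow` helper and the `Ptr`/`Imm` types are hypothetical names; the register numbers follow the r3/p4/p5 example on the slide, and the result values are made up.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Ptr:
    preg: int


@dataclass(frozen=True)
class Imm:
    value: int


def write_back_narrow(rename_map: Dict[str, Union[Ptr, Imm]],
                      logical: str, producer: int, value: int) -> bool:
    """Inline `value` only if `logical` still points at the producing register."""
    if rename_map[logical] == Ptr(producer):
        rename_map[logical] = Imm(value)
        return True
    return False     # entry was remapped by a younger writer; leave it alone


rename_map: Dict[str, Union[Ptr, Imm]] = {"r3": Ptr(3)}
rename_map["r3"] = Ptr(4)     # r3 = r1 + r2 renamed to p4
rename_map["r3"] = Ptr(5)     # r3 = r1 & r2 renamed to p5 before p4 completes
blocked = write_back_narrow(rename_map, "r3", producer=4, value=6)
after_blocked = rename_map["r3"]
allowed = write_back_narrow(rename_map, "r3", producer=5, value=2)
```

The first write-back is dropped because r3 already names p5, so the younger mapping survives. The second succeeds because p5 is still the current mapping.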

Integrating PRI with Early Deallocation
• Not all operands are narrow
• Reduces register lifetime further
• Adds unmap flags and complete flags [Moudgill et al. 1993]
[Figure: register lifetime at width 4 for baseline, PRI, and PRI+ER]

Machine Model
• 4-wide fetch, issue, commit
• 512-entry ROB, 256-entry LSQ
• 32-entry scheduler
• 64 physical registers
• Speculative scheduling with selective recovery
• Combined bimodal branch predictor
• 32 KB IL1, 32 KB DL1, 512 KB L2
• 7 bits PRI for integer, 1 bit PRI for FP

Speed Up for Integer Benchmarks
• PRI (checkpoint + reference counting) performs substantially better than previous work
• The reference + checkpoint counting scheme performs close to the ideal case (ideal + lazy)
• Combining PRI and ER increases performance further

PRF Occupancy for Int. Benchmarks
• PRI reduces register file pressure more than the previous work (ER)
• Combining PRI and ER reduces the pressure further

Speed Up for FP Benchmarks
• Ammp: physical registers are not the performance bottleneck
• Art: a lot of narrow operands to exploit
• Wupwise: few narrow operands

Conclusion
• PRI can lead to substantial performance improvement for both integer and FP benchmarks
• Ideal update of stale pointers provides marginal benefit
  • Reference + checkpoint counting is the best choice

Future Work
• Interaction of PRI with delayed register allocation (virtual physical registers) [Gonzalez et al. 1998]
• Interaction of PRI with software-based techniques to deallocate dead registers
  • PRI enables a binary-compatible mechanism for the compiler to tell the hardware that a register is dead
  • The compiler can simply insert a load immediate of a narrow value into any register that seems dead

Questions? Thank you
