Physical Register Inlining (PRI)
- Slides: 26
Physical Register Inlining (PRI)
Mikko H. Lipasti (1), Brian Mestan (2), and Erika Gunadi (1)
(1) Department of Electrical and Computer Engineering, University of Wisconsin-Madison
(2) IBM Microelectronics, IBM Corporation, Austin, TX
http://www.ece.wisc.edu/~pharm
Demand for Large Register Files
[Figure: pipeline stages Fetch, Dcd, Rnm, Sched, Disp, RF, Exe, Retire, Commit, with the instruction window marked]
- Deeper pipeline
  - Increasing pressure on the register file
  - Lots of attention / prior work
Challenges with Scaling Register Files
- Additional pipe stages needed for access
  - Increases branch misprediction penalty
  - Increases scheduling misprediction penalty
  - Requires additional bypass logic
- Further increases pipeline depth
  - Increases the demand for more registers
Physical Register Lifetime
[Figure: physical register lifetime for machine widths 4 and 8]
- Managed inefficiently
Prior Work
- Register file caching [Swenson et al. 1988, Zalamea et al. 2000, Postiff et al. 2001, Cruz et al. 2000, Borch et al. 2002]
- Late allocation [Gonzalez et al. 1998, Monreal et al. 1999]
- Efficient management
  - Early deallocation [Moudgill et al. 1993]
  - Program semantics [Martin et al. 1997, Lo et al. 1999]
  - Checkpointing [Martinez et al. 2002, Akkary et al. 2003]
  - Value-based optimizations [Jourdan et al. 1998]
Early Deallocation
- Moudgill et al. 1993
- Focused on "last read to release"
- Avoid waiting for the next writer to commit
- Deallocate registers as soon as they are:
  - Complete (complete flag)
  - Unmapped (unmap flag)
  - Free of outstanding readers (reference counter)
- Still requires the next writer to enter the window
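The release condition listed above can be sketched as a small piece of per-register bookkeeping. This is only an illustration under assumed names (`PhysRegState`, `can_release`, and the field names are hypothetical, not taken from Moudgill et al. or from the PRI paper); real hardware would keep these flags alongside the register file or map table.

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    """Per-physical-register state for early deallocation (hypothetical names)."""
    complete: bool = False  # the producer has written its result
    unmapped: bool = False  # a later writer has renamed the same logical register
    readers: int = 0        # renamed consumers that have not yet read the value

    def can_release(self) -> bool:
        # All three conditions from the slide must hold at the same time.
        return self.complete and self.unmapped and self.readers == 0
```

Note that `unmapped` only becomes true once the next writer has been renamed, which is the limitation the last bullet points out.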
Physical Register Inlining
- Exploits narrow operands: a sizable fraction of operands can be stored in fewer than 8 bits [Canal et al. 2000]
  - Often fewer bits than are needed to specify a physical register
- Store the value instead of the pointer
  - Stores narrow values in the map table
  - Reduces physical register lifetime
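What "narrow" means can be made concrete with a short sketch. It assumes signed two's-complement values and a 7-bit inline field (the integer width quoted on the Machine Model slide); the function names are hypothetical and chosen only for illustration.

```python
INLINE_BITS = 7  # assumed inline field width for integer operands


def is_narrow(value: int, bits: int = INLINE_BITS) -> bool:
    """True if a signed value fits in `bits` bits of two's complement."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def pack(value: int, bits: int = INLINE_BITS) -> int:
    """Keep only the low `bits` bits, as they would be stored in a map entry."""
    return value & ((1 << bits) - 1)


def sign_extend(field: int, bits: int = INLINE_BITS) -> int:
    """Recover the signed value from a stored `bits`-wide field."""
    sign = 1 << (bits - 1)
    return (field ^ sign) - sign
```

For any value that passes `is_narrow`, `sign_extend(pack(v))` returns `v` again, which is why the value can live in the map entry without a physical register backing it.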
Operand Significance
[Figure: operand significance distribution]
- The paper also has an FP graph; it exploits 0.0/1.0 (54%)
Outline
- Motivation
- Prior Work
- Physical Register Inlining
  - Quick Microarchitectural Review
  - Modifications Needed
  - PRI + early deallocation
- Experiments
- Conclusions
Microarchitectural Review
- Register rename/map tables
  - Map logical names to physical names
  - Remove false name dependences
- Two common types: RAM and CAM
  - CAM map is positional
  - Not suitable for storing values
[Figure: RAM map and CAM map organizations, relating logical reg # to phys reg #, with a valid bit (V) per entry]
Microarchitectural Review
- Allocating and freeing physical registers
  - Allocates a physical register at decode; the map table entry is updated
  - Releases the physical register when the next writer is committed
- Checkpoint and recovery of register map
  - Optimization to reduce branch misprediction penalty
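The baseline allocate-and-release policy reviewed above can be sketched with a RAM-style map and a free list. This is a minimal model under stated assumptions: the class and method names are hypothetical, logical register i initially maps to physical register i, and checkpointing is left out.

```python
from collections import deque


class Renamer:
    """RAM-style rename map with conventional release at next-writer commit."""

    def __init__(self, num_logical: int, num_physical: int):
        # Assumption: logical register i starts out mapped to physical register i.
        self.map = list(range(num_logical))
        self.free = deque(range(num_logical, num_physical))

    def rename(self, dst: int):
        """Allocate a new physical register for logical `dst`.

        Returns (new, prev); `prev` travels with the instruction so that it
        can be released when this writer commits.
        """
        if not self.free:
            raise RuntimeError("stall: no free physical register")
        new = self.free.popleft()
        prev = self.map[dst]
        self.map[dst] = new
        return new, prev

    def commit(self, prev: int) -> None:
        # Only when the next writer commits is the old mapping returned.
        self.free.append(prev)
```

The gap between `rename` and the matching `commit` is the register lifetime that early deallocation and PRI both try to shorten.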
Modifications to Data Flow
[Figure: pipeline Fetch, Dcd, Rnm, Queue, Sched, Disp, RF, Exe, Retire, Commit, showing the map table, payload RAM, ALU, and a "Narrow?" check]
- Execution stage must allow both operands to be read from payload RAM
  - Already supports one immediate operand
- Sign extension between payload RAM and the ALU input
- Narrow checking logic to verify if the operands are narrow
- Narrow datapath back to the map table
Modifications to Map Table
- Registers freed from the retire/wb stage and the commit stage
- Tolerant of duplicate deallocations of the same physical register
  - Once as narrow, again at next write commit
- Map entries need to be writable from the rename stage and the retire/wb stage
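One way to picture a map entry that can hold either a pointer or an inlined value is a payload field plus a mode bit. The sketch below is an assumption made for illustration (the `MapEntry` layout and `read_operand` helper are hypothetical, not the encoding used in the paper), and it does not model the duplicate-deallocation handling.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    is_value: bool  # False: payload is a physical register number
                    # True:  payload is an inlined narrow value
    payload: int


def read_operand(entry: MapEntry, prf: list) -> int:
    """Return the operand value for a consumer renamed through `entry`."""
    if entry.is_value:
        # Inlined: the value comes with the instruction, no register file read.
        return entry.payload
    return prf[entry.payload]
```

A consumer renamed after the inlining sees `is_value` set and never touches the physical register, which is what allows that register to be freed early.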
Stale Pointer Problem
[Figure: pointers into the PRF held by the MAP, checkpoint copies, ROB, and IssueQ]
- Deallocating physical registers early makes these pointers stale
  - Equivalent to the garbage collection issue
- Two choices
  - Delay deallocation until the pointers are no longer valid (refcount)
  - Update all pointers (ideal IPC)
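The first choice, delaying deallocation with a reference count, can be sketched as follows. The names are hypothetical, and which structures contribute to the count (issue queue entries, ROB entries, map checkpoints) is a design decision the sketch leaves open.

```python
class RefCountedReg:
    """A physical register that is freed only after every pointer to it is gone."""

    def __init__(self):
        self.refs = 0                 # outstanding pointers to this register
        self.release_pending = False  # value was inlined; register may be freed
        self.free = False

    def add_pointer(self) -> None:
        self.refs += 1

    def drop_pointer(self) -> None:
        self.refs -= 1
        self._try_free()

    def request_release(self) -> None:
        # Called once the narrow value has been written into the map table.
        self.release_pending = True
        self._try_free()

    def _try_free(self) -> None:
        if self.release_pending and self.refs == 0:
            self.free = True
```

The register stays allocated while any pointer remains, so no pointer ever becomes stale; the cost is that deallocation happens later than in the update-all-pointers case.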
Map Table Checkpoints Problem
- Map table checkpoints need to be updated when a narrow operand is written
  - Lazy update
    - Complex, but not cycle-time critical
  - Checkpoint reference counting
    - Similar to Akkary et al.
    - Delays deallocation, reduces IPC benefit slightly
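Example of WAR Violation
Load p1 <= MEM[p7]
And  p2 <= p3 & p4    (narrow)
Add  p5 <= p1 + p2
Or   p2 <= p8 & p9    (WAR violation)
The And result is narrow, so p2 is freed early and can be reallocated to the Or, which may then overwrite p2 before the Add has read it.
- Rare, but frequent enough to affect performance
- Must have an efficient solution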
Rename Table WAW Hazards
[Figure: pipeline (Fetch, Decode, Execute, Retire, Commit) with r3 = r1 + r2 renamed to p4 = p1 + p2 and r3 = r1 & r2 renamed to p5 = p1 & p2; the MAP entry for r3 steps through p3, p4, p5; ROB (Dst) holds p4 and p5; the narrow result for p4 causes a WAW]
- WAW hazards
  - Writes narrow value to a remapped map entry
  - Must ensure that the map entry has not been remapped
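The required check can be sketched as a guard on the narrow write-back: the value is inlined only if the map entry still points at the producer's physical register. The dictionary-based map, the `("ptr", n)` / `("val", v)` tuples, and the function name are hypothetical choices made for this illustration.

```python
def try_inline(map_table: dict, logical: int, producer_preg: int,
               value: int, bits: int = 7) -> bool:
    """Inline `value` into the map entry for `logical` if it is safe to do so.

    Entries are ("ptr", preg) or ("val", value). Returns True if inlined.
    """
    lo, hi = -(1 << (bits - 1)), 1 << (bits - 1)
    if not (lo <= value < hi):
        return False  # not narrow: keep using the physical register
    if map_table.get(logical) != ("ptr", producer_preg):
        return False  # entry was remapped by a younger writer: skip (WAW)
    map_table[logical] = ("val", value)
    return True
```

With the figure's example, the entry for r3 already holds p5 when the narrow result for p4 arrives, so `try_inline(m, 3, 4, value)` leaves the map untouched.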
Integrating PRI with Early Deallocation
- Not all operands are narrow
- Reduces register lifetime further
- Adds unmap flags and complete flags [Moudgill et al. 1993]
[Figure: register lifetime at width 4 for baseline, PRI, and PRI+ER]
Machine Model
- 4-wide fetch, issue, commit
- 512-entry ROB, 256-entry LSQ
- 32-entry scheduler
- 64 physical registers
- Speculative scheduling with selective recovery
- Combined bimodal branch predictor
- 32 KB IL1, 32 KB DL1, 512 KB L2
- 7 bits PRI for integer, 1 bit PRI for FP
Speed Up for Integer Benchmarks
- PRI (checkpoint + reference counting) performs substantially better than previous work
- The reference + checkpoint counting scheme performs close to the ideal case (ideal + lazy)
- Combining PRI and ER increases performance further
PRF Occupancy for Int. Benchmarks
- PRI reduces register file pressure more than the previous work (ER)
- Combining PRI and ER reduces the pressure further
Speed Up for FP Benchmarks
- Ammp: physical registers are not the performance bottleneck
- Art: a lot of narrow operands to exploit
- Wupwise: few narrow operands
Conclusion
- PRI can lead to substantial performance improvement for both integer and FP benchmarks
- Ideal update of stale pointers provides marginal benefit
  - Reference + checkpoint counting is the best choice
Future Work
- Interaction of PRI with delayed register allocation (virtual physical registers) [Gonzalez et al. 1998]
- Interaction of PRI with software-based techniques to deallocate dead registers
  - PRI enables a binary-compatible mechanism for the compiler to tell the hardware that a register is dead
  - The compiler can simply insert a load immediate of a narrow value to any register that seems dead
Questions? Thank you