- Slides: 49
IA-64 Register Model: Stack & Rotation Dale Morris Architect Hewlett Packard Co.
Philosophy
- Large register files
  - Most processors have lots of registers
- Explicit control over register renaming
  - Most processors have register renaming
- IA-64 makes the register names SW-visible and makes the renaming explicit
Outline
- Register Stack
  - Register Stack Engine
- Register Rotation
  - Loop Branches
  - Modulo-Scheduling of Loops
- Summary
Register Stack
- Motivation:
  - Automatic save/restore of GRs on procedure call/return
  - Cache traffic reduction
  - Latency hiding of register spill/fill
General Registers (figure): r0 to r31 are static; r32 to r127 are stacked.
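GR Stack Frame (figure): the current frame starts at r32 and holds the locals (which include the inputs) followed by the outputs. Stacked registers above the frame, up to r127, are illegal to access; r0 to r31 are static. The Current Frame Marker (CFM) records the size of locals (sol) and the size of frame (sof).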
GR Stack Frame - Example (figure): with CFM sol=14, sof=21, the locals are r32 to r45 and the outputs are r46 to r52.
GR Stack Frame - Call (figure): br.call renames the caller's outputs r46 to r52 as the callee's r32 to r38. The new CFM is sol=0, sof=7; the caller's marker (sol=14, sof=21) is kept in the PFM.
GR Stack Frame - Allocate (figure): alloc resizes the callee's frame to sol=16, sof=19, giving locals r32 to r47 (the inputs come first) and outputs r48 to r50. The PFM still holds sol=14, sof=21.
GR Stack Frame - Return (figure): br.ret restores the CFM from the PFM (sol=14, sof=21), so the caller again sees locals r32 to r45 and outputs r46 to r52.
Instructions
- br.call
  - Copies CFM to PFM
  - Creates new frame with only output regs
  - Saves local regs from previous frame
- alloc
  - Resizes current frame
  - Saves PFM to a GR
Instructions (cont.)
- mov to PFS
  - Restores PFM from a GR
- br.ret
  - Restores CFM from PFM
  - Restores local regs for previous frame
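The frame-marker bookkeeping done by br.call, alloc, and br.ret can be sketched as a small Python model. This is only an illustration of the sol/sof arithmetic from the frame example (sol=14, sof=21); the names FrameMarker, br_call, alloc, br_ret, and layout are hypothetical, and the model ignores register contents, the RSE, and the real alloc operand format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameMarker:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (locals + outputs)


def br_call(cfm):
    """Model br.call: copy CFM to PFM; the new frame holds only the caller's outputs."""
    pfm = cfm
    new_cfm = FrameMarker(sol=0, sof=cfm.sof - cfm.sol)
    return new_cfm, pfm


def alloc(cfm, sol, sof):
    """Model alloc (simplified, hypothetical signature): resize the current frame."""
    if not 0 <= sol <= sof <= 96:
        raise ValueError("need 0 <= sol <= sof <= 96 stacked registers")
    return FrameMarker(sol=sol, sof=sof)


def br_ret(pfm):
    """Model br.ret: restore CFM from PFM."""
    return pfm


def layout(cfm, base=32):
    """Return (locals, outputs) as inclusive register ranges, or None if empty."""
    locals_ = (base, base + cfm.sol - 1) if cfm.sol else None
    outputs = (base + cfm.sol, base + cfm.sof - 1) if cfm.sof > cfm.sol else None
    return locals_, outputs


caller = FrameMarker(sol=14, sof=21)
callee, pfm = br_call(caller)        # callee sees r32..r38 as its frame
callee = alloc(callee, sol=16, sof=19)
restored = br_ret(pfm)               # caller's frame marker comes back
```

Calling layout(caller) gives locals r32 to r45 and outputs r46 to r52, which matches the frame example; after the call the same seven output registers are addressed as r32 to r38.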
Leaf Procedure Optimization
- No need to save/restore PFM
- Can always use scratch static GRs
- Can omit alloc if:
  - Not many registers needed
  - Register rotation not needed
Register Stack Engine (RSE)
- Automatically spills/fills registers to/from memory as needed
- Registers saved on a Backing Store Stack
- Spills/fills NaT bits as well
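Reg Stack & Backing Store (figure): A calls B calls C. The physical stacked registers hold proc. A (sol_a), proc. B (sol_b), and the current frame of proc. C (sof_c), with unallocated registers on either side; a call moves the current frame one way through the physical registers and a return moves it back. The RSE loads/stores registers between the physical stacked registers and the Backing Store, which holds older frames such as those of proc. A's ancestors.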
Register Stack: Summary
- Exposes register renaming to SW
- Avoids register spill when few registers are needed
- Hides register spill/fill
- Programmable sizes
  - only use as many registers as you need
Register Rotation
- Motivation:
  - pipeline-schedule loops onto HW
  - remove extraneous work from loop
  - minimize start-up overhead
  - small code footprint
  - maximum computational throughput with few instructions
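GR Stack Frame w/ Rotation (figure): the rotating region, of size sor, starts at r32 and overlays the locals and outputs of the current frame (sol, sof); r0 to r31 remain static. The Current Frame Marker (CFM) holds rrb.pr, rrb.fr, and rrb.gr in addition to sol and sof.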
GR Rotation
- Size of rotating region is a multiple of 8
- Rotating region overlays current frame
  - Starts at r32
  - Overlay allows rotation & stack renaming in a single level of adders
  - Must copy input registers before loop
FR Rotation (figure): f0 to f31 are static; f32 to f127 rotate. The upper 3/4 of the register file rotates.
Predicate Rotation (figure): p0 to p15 are static; p16 to p63 rotate. The upper 3/4 of the register file rotates.
Register Rotation & RRB
- Separate Rotating Register Base for each: GRs, FRs, PRs
- Loop branches decrement all rotating register bases (RRB)
- Instructions contain a "virtual" register number
  - RRB + virtual register number = physical register number (wrapping within the rotating region, so 127 follows 32)
- Animation (figure): a loop copies the words "Palm Springs is Sunny" with one load and one store per pass
  - Step 1, RRB=0: ld1 R35 puts "Palm" in physical register 35
  - Step 2, RRB=0: st1 R35 stores "Palm"; ld2 R34 puts "Springs" in physical register 34
  - Step 3, RRB=-1: st2 R35 stores "Springs" from physical register 34; ld3 R34 puts "is" in physical register 33
  - Step 4, RRB=-2: st3 R35 stores "is" from physical register 33; ld4 R34 puts "Sunny" in physical register 32
  - Step 5, RRB=-3: st4 R35 stores "Sunny" from physical register 32
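The renaming rule "RRB + virtual register number = physical register number" can be sketched in Python as follows. This is a minimal model, assuming a rotating region that runs from register 32 to register 127 as in the animation; the function name physical_register and its base/size parameters are made up for this illustration, and for GRs the real region size is set by sor.

```python
def physical_register(virtual, rrb, base=32, size=96):
    """Map a virtual register number to a physical one under rotation.

    Registers outside [base, base + size) are static and map to themselves.
    Inside the rotating region the RRB is added modulo the region size,
    so the register "below" base wraps around to the top of the region.
    """
    if not base <= virtual < base + size:
        return virtual
    return base + (virtual - base + rrb) % size


# Each loop branch decrements RRB, so the value loaded through R34 in one
# pass is read through R35 in the next pass.
written = physical_register(34, rrb=-1)   # pass with RRB = -1 writes R34
read = physical_register(35, rrb=-2)      # next pass (RRB = -2) reads R35
```

Here written and read name the same physical register, which is why the loop body needs no explicit copy from R34 to R35.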
Loop Branches
- br.cloop uses LC for simple, non-pipelined loops
  - decrements LC and loops until LC is 0
- br.ctop uses LC and EC for pipelined counted loops
- br.wtop uses branch predicate and EC for pipelined "while" loops
- br.cexit, br.wexit used for unrolled, pipelined loops
br.ctop
- Function (simplified):
    if (LC>0)      {LC--; pr[63]=1; rrb--; loop;}
    else if (EC>1) {EC--; pr[63]=0; rrb--; loop;}
    else           {EC--; pr[63]=0; rrb--; fall_through;}
- LC counts main loop iterations
- EC counts pipeline stages for drain
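The simplified br.ctop function can be traced with a short Python model. This is a sketch of the simplified slide semantics only (it omits the EC=0 case of the real instruction); br_ctop and trace are hypothetical helper names.

```python
def br_ctop(lc, ec, rrb):
    """One execution of the simplified br.ctop.

    Returns (lc, ec, rrb, pr63, taken), where pr63 is the value written to
    pr[63] (it becomes p16 after rotation) and taken says whether the
    branch loops back.
    """
    if lc > 0:                        # main iterations still to start
        return lc - 1, ec, rrb - 1, 1, True
    if ec > 1:                        # draining the pipeline
        return lc, ec - 1, rrb - 1, 0, True
    return lc, ec - 1, rrb - 1, 0, False   # last stage: fall through


def trace(lc, ec):
    """Run br.ctop until it falls through; return the state after each branch."""
    rrb, rows = 0, []
    while True:
        lc, ec, rrb, pr63, taken = br_ctop(lc, ec, rrb)
        rows.append((lc, ec, rrb, pr63, taken))
        if not taken:
            return rows


rows = trace(lc=3, ec=4)   # 4 iterations, 4 pipeline stages
```

With LC=3 and EC=4 the loop body runs 7 times: 4 passes that start new iterations plus 3 passes that drain the pipeline.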
Software Pipelining
- Overlapping execution of different loop iterations (figure: sequential vs. overlapped iterations)
- More iterations in same amount of time
Software Pipelining
- Traditional architectures use loop unrolling
  - High overhead: extra code for loop body, prologue, and epilogue
- Synergistic use of IA-64 features:
  - Full predication
  - Special branches
  - Register rotation: removes loop copy overhead
  - Predicate rotation: removes prologue & epilogue
- Especially useful for integer code with a small number of loop iterations
Pipelined Loop Example
- DAXPY inner loop
  - dy[i] = dy[i] + (da * dx[i])
  - 2 loads, 1 fma, 1 store / iteration
- Machine assumptions
  - can do 2 loads, 1 store, 1 fma, 1 br / cycle
  - load latency of 2 clocks
  - fma latency of 1 clock
Example: Pipeline
- Each column represents 1 source iteration
- Operations of one iteration (figure): load dx, dy; tmp = dy + da * dx; store dy
Example Code
    .rotf dx[3], dy[3], tmp[2]
    mov ar.lc = 3            // #iterations-1
    mov ar.ec = 4            // #stages
    mov pr.rot = 0x10000 ;;
looptop:
    (p16) ldfd dx[0] = [dxsp], 8
    (p16) ldfd dy[0] = [dysp], 8
    (p18) fma.d tmp[0] = da, dx[2], dy[2]
    (p19) stfd [dydp] = tmp[1], 8
    br.ctop looptop ;;
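To see how the rotating predicates and registers cooperate in this loop, the code above can be mimicked in Python. This is a behavioral sketch, not a cycle-accurate model: rotation is modeled by shifting short lists (a value written to dx[0] is read as dx[1] after one branch and dx[2] after two), the wrap-around of physical registers is ignored, results go to a copy of dy, and the function name daxpy_pipelined is hypothetical.

```python
def daxpy_pipelined(da, dx_mem, dy_mem):
    """Model the 4-stage rotating-register DAXPY loop (needs >= 1 element)."""
    result = list(dy_mem)              # stands in for memory written via [dydp]
    lc, ec = len(dx_mem) - 1, 4        # mov ar.lc = #iterations-1; mov ar.ec = 4
    p = [True, False, False, False]    # p16..p19 after mov pr.rot = 0x10000
    dx, dy, tmp = [None] * 3, [None] * 3, [None] * 2
    load_i = store_i = passes = 0
    while True:
        passes += 1
        if p[0]:                       # (p16) ldfd dx[0], dy[0]
            dx[0], dy[0] = dx_mem[load_i], dy_mem[load_i]
            load_i += 1
        if p[2]:                       # (p18) fma.d tmp[0] = da, dx[2], dy[2]
            tmp[0] = da * dx[2] + dy[2]
        if p[3]:                       # (p19) stfd [dydp] = tmp[1]
            result[store_i] = tmp[1]
            store_i += 1
        # br.ctop, simplified semantics
        if lc > 0:
            lc, pr63, taken = lc - 1, True, True
        elif ec > 1:
            ec, pr63, taken = ec - 1, False, True
        else:
            ec, pr63, taken = ec - 1, False, False
        # Rotation: the value written to pr[63] becomes p16, x[0] becomes x[1], ...
        p = [pr63] + p[:-1]
        dx, dy, tmp = [None] + dx[:-1], [None] + dy[:-1], [None] + tmp[:-1]
        if not taken:
            return result, passes


result, passes = daxpy_pipelined(2.0, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

The single loop body serves as prologue, kernel, and epilogue: the predicates switch the loads off once LC reaches 0 and keep the fma and store running until the pipeline has drained.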
Loop Execution Sequence
- Initialization: LC=3, EC=4, RRB=0; p16=1, p17 to p19 = 0
- Pass 1 (p16 set): ldx, ldy execute. Branch 1: LC 3 -> 2, RRB 0 -> -1
- Pass 2 (p16, p17 set): ldx, ldy execute. Branch 2: LC 2 -> 1, RRB -1 -> -2
- Pass 3 (p16 to p18 set): ldx, ldy, fma execute. Branch 3: LC 1 -> 0, RRB -2 -> -3
- Pass 4 (p16 to p19 set): ldx, ldy, fma, st execute. Branch 4: LC=0, EC 4 -> 3, RRB -3 -> -4
- Pass 5 (p17 to p19 set): fma, st execute. Branch 5: EC 3 -> 2, RRB -4 -> -5
- Pass 6 (p18, p19 set): fma, st execute. Branch 6: EC 2 -> 1, RRB -5 -> -6
- Pass 7 (p19 set): st executes. Branch 7: EC 1 -> 0, RRB -6 -> -7, fall through
Pipelining & Latency
- Suppose we change the latencies
  - load latency of 6 clocks
  - fma latency of 4 clocks
Example: New Pipeline
- Each column represents 1 source iteration
- Operations of one iteration (figure): load dx, dy; tmp = dy + da * dx; store dy
Updated Loop
    .rotf dx[7], dy[7], tmp[5]
    mov ar.lc = 3            // #iterations-1
    mov ar.ec = 11           // #stages
    mov pr.rot = 0x10000 ;;
looptop:
    (p16) ldfd dx[0] = [dxsp], 8
    (p16) ldfd dy[0] = [dysp], 8
    (p22) fma.d tmp[0] = da, dx[6], dy[6]
    (p26) stfd [dydp] = tmp[4], 8
    br.ctop looptop ;;
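Only the stage predicates, rotating-register counts, and EC differ between the two versions of the loop, and they follow from the latencies. The Python sketch below derives them, assuming one loop pass per clock (so a latency of n clocks means the consumer sits n stages later); pipeline_parameters and loop_passes are hypothetical helper names, not part of any toolchain.

```python
def pipeline_parameters(load_latency, fma_latency):
    """Derive the loop parameters for the DAXPY example from the latencies."""
    fma_stage = load_latency                   # stages after the load stage
    store_stage = load_latency + fma_latency   # stages after the load stage
    return {
        "ec": store_stage + 1,                 # total number of stages
        "fma_predicate": 16 + fma_stage,       # loads are guarded by p16
        "store_predicate": 16 + store_stage,
        "rot_dx_dy": load_latency + 1,         # size of dx[] and dy[]
        "rot_tmp": fma_latency + 1,            # size of tmp[]
    }


def loop_passes(iterations, ec):
    """Passes through the loop body: one per iteration plus the drain."""
    return iterations + ec - 1


original = pipeline_parameters(load_latency=2, fma_latency=1)
updated = pipeline_parameters(load_latency=6, fma_latency=4)
```

For the updated latencies this gives EC=11, p22 for the fma, p26 for the store, dx[7]/dy[7], and tmp[5], matching the updated loop, while the loop body itself stays the same size.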
Rotation: Summary
- Loop pipelining maximizes performance; minimizes overhead
  - Avoids code expansion of unrolling and code explosion of prologue and epilogue
  - Smaller code means fewer cache misses
  - Greater performance improvements in higher latency conditions
- Reduced overhead allows S/W pipelining of small loops with unknown trip counts
  - Typical of integer scalar codes
Register Model Summary
- GR Stack
  - Overlap call/ret operations with real work
  - RSE hides spills/fills
- GR, FR, PR Rotation
  - General acceleration for all types of loops
- SW-visible resources
  - Large named register files & renaming
- HW simplicity and explicit control