IA-64 Register Model: Stack & Rotation
Dale Morris, Architect, Hewlett Packard Co.

Philosophy
• Large register files
  – Most processors have lots of registers
• Explicit control over register renaming
  – Most processors have register renaming
• IA-64 makes the register names SW-visible & makes the renaming explicit

Outline
• Register Stack
  – Register Stack Engine
• Register Rotation
  – Loop Branches
  – Modulo-Scheduling of Loops
• Summary

Register Stack
• Motivation:
  – Automatic save/restore of GRs on procedure call/return
  – Cache traffic reduction
  – Latency hiding of register spill/fill

General Registers
[Figure: the general register file. r0-r31 are static; r32-r127 are stacked.]

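GR Stack Frame
[Figure: r0-r31 are static. The current frame starts at r32: locals (including inputs) come first, followed by outputs. Stacked registers above the frame, up to r127, are illegal. The Current Frame Marker (CFM) holds the size of locals (sol) and the size of frame (sof).]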

GR Stack Frame - Example
[Figure: the current frame is r32-r52, with locals in r32-r45 and outputs in r46-r52.]
CFM: sol = 14, sof = 21

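GR Stack Frame - Call
[Figure: the caller's frame is r32-r52 with outputs in r46-r52. After the call, the new frame is r32-r38 and contains only the caller's seven output registers, renamed to start at r32.]
                   sol  sof
CFM before call     14   21
CFM after call       0    7
PFM before call      x    x
PFM after call      14   21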

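GR Stack Frame - Allocate
[Figure: after the call, the frame is r32-r38 (outputs of the caller, which are the inputs of the callee). After alloc, the frame is r32-r50, with locals (including the inputs) in r32-r47 and outputs in r48-r50.]
                   sol  sof
CFM before call     14   21
CFM after call       0    7
CFM after alloc     16   19
PFM after call      14   21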

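GR Stack Frame - Return
[Figure: on return, the callee's frame (r32-r50) is discarded and the caller's frame is restored as r32-r52, with outputs in r46-r52.]
                   sol  sof
CFM before call     14   21
CFM after call       0    7
CFM after alloc     16   19
CFM after return    14   21
PFM after return    14   21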

Instructions
• br.call
  – Copies CFM to PFM
  – Creates new frame with only output regs
  – Saves local regs from previous frame
• alloc
  – Resizes current frame
  – Saves PFM to a GR

Instructions (cont.)
• mov to PFS
  – Restores PFM from a GR
• br.ret
  – Restores CFM from PFM
  – Restores local regs for previous frame
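The call, alloc and return steps above can be sketched as a small Python model. This is only an illustration of the CFM/PFM bookkeeping using the numbers from the example slides; the class name `RegisterStackModel`, the `base` field and the method names are assumptions made for this sketch, and the model ignores the finite number of physical registers and the Register Stack Engine.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

FIRST_STACKED = 32  # r0-r31 are static; every frame starts at r32


@dataclass(frozen=True)
class FrameMarker:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (locals + outputs)


class RegisterStackModel:
    """Toy model (hypothetical names): tracks CFM, PFM and where r32 points."""

    def __init__(self, sol: int, sof: int) -> None:
        self.cfm = FrameMarker(sol, sof)
        self.pfm: Optional[FrameMarker] = None  # shown as "x x" on the slides
        self.base = 0  # assumption: stacked-register index currently named r32

    def br_call(self) -> None:
        self.pfm = self.cfm                    # copy CFM to PFM
        self.base += self.cfm.sol              # caller's locals drop out of view
        outputs = self.cfm.sof - self.cfm.sol
        self.cfm = FrameMarker(0, outputs)     # new frame: caller's outputs only

    def alloc(self, sol: int, sof: int) -> Optional[FrameMarker]:
        if not 0 <= sol <= sof:
            raise ValueError("need 0 <= sol <= sof")
        self.cfm = FrameMarker(sol, sof)       # resize the current frame
        return self.pfm                        # value the program keeps in a GR

    def mov_to_pfs(self, saved: FrameMarker) -> None:
        self.pfm = saved                       # restore PFM from the saved GR

    def br_ret(self) -> None:
        if self.pfm is None:
            raise RuntimeError("no previous frame to return to")
        self.cfm = self.pfm                    # restore CFM from PFM
        self.base -= self.cfm.sol              # caller's locals are visible again

    def stacked_index(self, reg: int) -> int:
        """Index into the stacked registers for register name `reg`."""
        if not FIRST_STACKED <= reg < FIRST_STACKED + self.cfm.sof:
            raise IndexError(f"r{reg} is outside the current frame")
        return self.base + (reg - FIRST_STACKED)


def walkthrough() -> List[Tuple[int, int]]:
    """Replay the example slides and return (sol, sof) after each step."""
    rs = RegisterStackModel(sol=14, sof=21)
    steps = [(rs.cfm.sol, rs.cfm.sof)]
    rs.br_call()
    steps.append((rs.cfm.sol, rs.cfm.sof))
    saved = rs.alloc(sol=16, sof=19)
    steps.append((rs.cfm.sol, rs.cfm.sof))
    rs.mov_to_pfs(saved)
    rs.br_ret()
    steps.append((rs.cfm.sol, rs.cfm.sof))
    return steps
```

In this model the caller's first output register (r46) and the callee's r32 resolve to the same stacked index, which is how parameters are passed without copying.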

Leaf Procedure Optimization
• No need to save/restore PFM
• Can always use scratch static GRs
• Can omit alloc if:
  – Not many registers needed
  – Register rotation not needed

Register Stack Engine
• Automatically spills/fills registers to/from memory as needed
• Registers saved on a Backing Store Stack
• Spills/fills NaT bits as well

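Reg Stack & Backing Store
[Figure: A calls B calls C. The physical stacked registers hold the locals of proc. A (sol_A), the locals of proc. B (sol_B) and the current frame of proc. C (sof_C), with unallocated registers on either side; calls move the current frame one way and returns move it the other. RSE loads/stores move registers between the physical stacked registers and the backing store, which holds the frames of proc. A's ancestors.]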
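The spill/fill behavior in the figure can be illustrated with a deliberately simplified Python model. Everything here is an assumption made for the sketch: the class `ToyRSE`, its `push`/`pop` methods, and the eager spill policy are hypothetical, frames are treated as non-overlapping lists of values, and NaT bits are ignored. The slides do not describe when the real engine performs its loads and stores.

```python
from typing import List


class ToyRSE:
    """Hypothetical eager spill/fill model of a register stack."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity            # number of physical stacked registers
        self.resident: List[str] = []       # register values, oldest first
        self.backing_store: List[str] = []  # spilled values, oldest first

    def push(self, values: List[str]) -> None:
        """Call + alloc: add a frame, spilling the oldest registers if needed."""
        if len(values) > self.capacity:
            raise ValueError("frame larger than the physical register file")
        self.resident.extend(values)
        overflow = len(self.resident) - self.capacity
        if overflow > 0:
            self.backing_store.extend(self.resident[:overflow])  # RSE stores
            del self.resident[:overflow]

    def pop(self, callee_size: int, caller_size: int) -> List[str]:
        """Return: drop the callee's frame, fill the caller's if it was spilled."""
        del self.resident[len(self.resident) - callee_size:]
        missing = caller_size - len(self.resident)
        if missing > 0:
            start = len(self.backing_store) - missing
            self.resident[:0] = self.backing_store[start:]       # RSE loads
            del self.backing_store[start:]
        return self.resident[len(self.resident) - caller_size:]


# A calls B calls C with only 8 physical registers available.
rse = ToyRSE(capacity=8)
frame_a = [f"a{i}" for i in range(5)]
frame_b = [f"b{i}" for i in range(5)]
frame_c = [f"c{i}" for i in range(4)]
for frame in (frame_a, frame_b, frame_c):
    rse.push(frame)
spilled_while_in_c = list(rse.backing_store)
back_in_b = rse.pop(callee_size=4, caller_size=5)
back_in_a = rse.pop(callee_size=5, caller_size=5)
```

The point of the sketch is that the program never issues the stores and loads itself: each procedure sees its registers intact on return, however deep the call chain went.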

Register Stack: Summary
• Exposes register renaming to SW
• Avoids register spill when few needed
• Hides register spill/fill
• Programmable sizes
  – only use as many registers as you need

Outline
• Register Stack
  – Register Stack Engine
• Register Rotation
  – Loop Branches
  – Modulo-Scheduling of Loops
• Summary

Register Rotation
• Motivation:
  – pipeline-schedule loops onto HW
  – remove extraneous work from loop
  – minimize start-up overhead
  – small code footprint
  – maximum computational throughput with few instructions

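GR Stack Frame w/ Rotation
[Figure: the same frame layout (static r0-r31, then locals and outputs from r32 up to sof), with a rotating region of size sor (Size of Rotating) starting at r32. The Current Frame Marker (CFM) is shown with the fields rrb.pr, rrb.fr, rrb.gr, sol and sof.]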

GR Rotation
• Size of rotating region is a multiple of 8
• Rotating region overlays current frame
  – Starts at r32
  – Overlay allows rotation & stack renaming in a single level of adders
  – Must copy input registers before loop

FR Rotation
[Figure: the floating-point register file. f0-f31 are static; f32-f127 rotate.]
• Upper 3/4 of register file rotates

Predicate Rotation
[Figure: the predicate register file. p0-p15 are static; p16-p63 rotate.]
• Upper 3/4 of register file rotates

Register Rotation & RRB
• Separate Rotating Register Base for each: GRs, FRs, PRs
• Loop branches decrement all register rotating bases (RRB)
• Instructions contain a “virtual” register number
  – RRB + virtual register number = physical register number

[Animation, five frames: the words "Palm Springs is Sunny" are copied through rotating registers. The first load targets R35; after that, each step stores from R35 and loads the next word into R34, and the loop branch between steps decrements RRB. Physical register numbers wrap from 32 to 127.]

Step  RRB  Instructions        Physical register written  Stored so far
1      0   ld1 R35             35: Palm                   (nothing yet)
2      0   st1 R35, ld2 R34    34: Springs                Palm
3     -1   st2 R35, ld3 R34    33: is                     Palm Springs
4     -2   st3 R35, ld4 R34    32: Sunny                  Palm Springs is
5     -3   st4 R35             (none)                     Palm Springs is Sunny
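The renaming rule "RRB + virtual register number = physical register number" can be tried out with a short Python sketch that replays the Palm Springs animation. The wrap-around arithmetic and the region r32-r127 (96 registers) are assumptions taken from the figure, where physical numbers wrap from 32 to 127; `physical` and `copy_words` are names invented for this illustration.

```python
from typing import Dict, List, Tuple

ROT_BASE = 32   # rotating region starts at r32
ROT_SIZE = 96   # assumption: r32-r127 rotate, as the 32 -> 127 wrap suggests


def physical(virtual: int, rrb: int) -> int:
    """Apply RRB to a virtual register number, wrapping inside the region."""
    if not ROT_BASE <= virtual < ROT_BASE + ROT_SIZE:
        return virtual  # static registers are not renamed
    return ROT_BASE + (virtual - ROT_BASE + rrb) % ROT_SIZE


def copy_words(words: List[str]) -> Tuple[List[str], int]:
    """Replay the animation: ld into R34, st from R35, rotate between steps."""
    regs: Dict[int, str] = {}
    stored: List[str] = []
    rrb = 0
    regs[physical(35, rrb)] = words[0]           # ld1 R35 (before the loop)
    for i in range(len(words)):
        stored.append(regs[physical(35, rrb)])   # st R35: word from last step
        if i + 1 < len(words):
            regs[physical(34, rrb)] = words[i + 1]  # ld R34: next word
            rrb -= 1                             # loop branch decrements RRB
    return stored, rrb


stored, final_rrb = copy_words(["Palm", "Springs", "is", "Sunny"])
```

The loop body always names R34 and R35, yet each word lands in a different physical register; the value loaded as R34 in one step is read as R35 in the next, with no copy instruction in between.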

Loop Branches
• br.cloop uses LC for simple, non-pipelined loops
  – decrements LC and loops until LC is 0
• br.ctop uses LC and EC for pipelined counted loops
• br.wtop uses branch predicate and EC for pipelined “while” loops
• br.cexit, br.wexit used for unrolled, pipelined loops

br.ctop
• Function (simplified):
    if (LC>0)      {LC--; pr[63]=1; rrb--; loop; }
    else if (EC>1) {EC--; pr[63]=0; rrb--; loop; }
    else           {EC--; pr[63]=0; rrb--; fall_through; }
• LC counts main loop iterations
• EC counts pipeline stages for drain
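The simplified br.ctop function above translates directly into Python. This is a sketch of the slide's three-way rule only; it assumes EC is at least 1 on entry, which the simplified rule does not check, and the names `CtopResult`, `br_ctop` and `count_passes` are made up for the example.

```python
from typing import NamedTuple


class CtopResult(NamedTuple):
    lc: int
    ec: int
    rrb: int
    p63: int      # value written to pr[63] before the rotation
    taken: bool   # True = loop, False = fall through


def br_ctop(lc: int, ec: int, rrb: int) -> CtopResult:
    """One execution of br.ctop, following the simplified slide rule."""
    if lc > 0:
        return CtopResult(lc - 1, ec, rrb - 1, 1, True)   # main iterations
    if ec > 1:
        return CtopResult(lc, ec - 1, rrb - 1, 0, True)   # draining the pipeline
    return CtopResult(lc, ec - 1, rrb - 1, 0, False)      # last stage done


def count_passes(lc: int, ec: int) -> int:
    """How many times the loop body runs for the given LC and EC."""
    rrb = 0
    passes = 0
    while True:
        passes += 1
        lc, ec, rrb, _, taken = br_ctop(lc, ec, rrb)
        if not taken:
            return passes
```

With LC = 3 and EC = 4, `count_passes(3, 4)` gives 7: four passes that start new source iterations (LC = 3 means four iterations) and three more that only drain the stages still in flight.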

Software Pipelining
• Overlapping execution of different loop iterations
[Figure: sequential execution of loop iterations vs. overlapped execution.]
• More iterations in same amount of time

Software Pipelining
• Traditional architectures use loop unrolling
  – High overhead: extra code for loop body, prologue, and epilogue
• Synergistic use of IA-64 features:
  – Full Predication
  – Special branches
  – Register rotation: removes loop copy overhead
  – Predicate rotation: removes prologue & epilogue
Especially Useful for Integer Code With Small Number of Loop Iterations

Pipelined Loop Example
• DAXPY inner loop
  – dy[i] = dy[i] + (da * dx[i])
  – 2 loads, 1 fma, 1 store / iteration
• Machine assumptions
  – can do 2 loads, 1 store, 1 fma, 1 br / cycle
  – load latency of 2 clocks
  – fma latency of 1 clock

Example: Pipeline
• Each column represents 1 source iteration
[Figure: pipeline diagram. Each source iteration goes through: load dx, dy; tmp = dy + da * dx; store dy.]

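Example Code
    .rotf dx[3], dy[3], tmp[2]
    mov ar.lc = 3                // #iterations-1
    mov ar.ec = 4                // #stages
    mov pr.rot = 0x10000 ;;      // bit 16 set: p16 = 1, other rotating predicates = 0
looptop:
    (p16) ldfd dx[0] = [dxsp], 8             // stage 1: load dx
    (p16) ldfd dy[0] = [dysp], 8             // stage 1: load dy
    (p18) fma.d tmp[0] = da, dx[2], dy[2]    // stage 3: loads issued 2 passes earlier
    (p19) stfd [dydp] = tmp[1], 8            // stage 4: fma result from 1 pass earlier
    br.ctop looptop ;;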

Loop Execution Sequence
[Animation, eight frames: the loop body ((p16) ldx, (p16) ldy, (p18) fma, (p19) st) is shown next to the rotating predicates as br.ctop executes seven times. The state recorded on each frame:]

Step            LC      EC      RRB       pr[63] written  Result
Initialization  3       4       0                         p16 = 1; p17, p18, p19, p63 = 0
Branch 1        3 -> 2  4       0 -> -1   1               loop
Branch 2        2 -> 1  4       -1 -> -2  1               loop
Branch 3        1 -> 0  4       -2 -> -3  1               loop
Branch 4        0       4 -> 3  -3 -> -4  0               loop
Branch 5        0       3 -> 2  -4 -> -5  0               loop
Branch 6        0       2 -> 1  -5 -> -6  0               loop
Branch 7        0       1 -> 0  -6 -> -7  0               fall through
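To see how the rotating predicates switch the stages on and off during this sequence, the example loop can be simulated in Python. This is a functional sketch only: it assumes `.rotf` assigns dx, dy and tmp to consecutive rotating registers starting at f32, applies RRB with wrap-around inside f32-f127 and p16-p63, treats every instruction as completing within its pass, and uses the simplified br.ctop rule from the earlier slide. The function names are invented for the illustration.

```python
from typing import Dict, List, Tuple


def rotate(virtual: int, rrb: int, base: int, size: int) -> int:
    """Map a virtual rotating register number to a physical one."""
    return base + (virtual - base + rrb) % size


def pipelined_daxpy(da: float, dx: List[float],
                    dy: List[float]) -> Tuple[List[float], List[List[str]]]:
    n = len(dx)
    if n == 0 or n != len(dy):
        raise ValueError("dx and dy must be non-empty and the same length")
    DX, DY, TMP = 32, 35, 38        # assumed layout for dx[3], dy[3], tmp[2]
    fr: Dict[int, float] = {}       # physical FR number -> value
    pr = [0] * 64
    pr[16] = 1                      # mov pr.rot = 0x10000
    lc, ec, rrb = n - 1, 4, 0       # mov ar.lc / mov ar.ec
    src = dst = 0                   # stand-ins for the post-incremented pointers
    out = list(dy)
    trace: List[List[str]] = []

    def p(v: int) -> int:           # read a rotating predicate
        return pr[rotate(v, rrb, 16, 48)]

    def f(v: int) -> int:           # physical number of a rotating FR
        return rotate(v, rrb, 32, 96)

    while True:
        active = []
        if p(16):                   # (p16) ldfd dx[0], dy[0]
            fr[f(DX)] = dx[src]
            fr[f(DY)] = dy[src]
            src += 1
            active.append("ld")
        if p(18):                   # (p18) fma.d tmp[0] = da, dx[2], dy[2]
            fr[f(TMP)] = fr[f(DY + 2)] + da * fr[f(DX + 2)]
            active.append("fma")
        if p(19):                   # (p19) stfd [dydp] = tmp[1]
            out[dst] = fr[f(TMP + 1)]
            dst += 1
            active.append("st")
        trace.append(active)
        # br.ctop, simplified as on the slide
        if lc > 0:
            lc, p63, taken = lc - 1, 1, True
        elif ec > 1:
            ec, p63, taken = ec - 1, 0, True
        else:
            ec, p63, taken = ec - 1, 0, False
        pr[rotate(63, rrb, 16, 48)] = p63   # becomes p16 after the rotation
        rrb -= 1
        if not taken:
            return out, trace


result, trace = pipelined_daxpy(2.0, [1.0, 2.0, 3.0, 4.0],
                                [10.0, 20.0, 30.0, 40.0])
```

Each entry of `trace` lists the stages that fired in one pass. The first passes fill the pipeline, the last ones drain it, and the same five-instruction loop body serves as prologue, kernel and epilogue.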

Pipelining & Latency
• Suppose we change the latencies
  – load latency of 6 clocks
  – fma latency of 4 clocks

Example: New Pipeline
• Each column represents 1 source iteration
[Figure: pipeline diagram for the new latencies. Each source iteration goes through: load dx, dy; tmp = dy + da * dx; store dy.]

Updated Loop
    .rotf dx[7], dy[7], tmp[5]
    mov ar.lc = 3                // #iterations-1
    mov ar.ec = 11               // #stages
    mov pr.rot = 0x10000 ;;
looptop:
    (p16) ldfd dx[0] = [dxsp], 8
    (p16) ldfd dy[0] = [dysp], 8
    (p22) fma.d tmp[0] = da, dx[6], dy[6]
    (p26) stfd [dydp] = tmp[4], 8
    br.ctop looptop ;;
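Only the stage predicates, the rotating register counts and EC differ between the two versions of the loop, and all of them follow from the latencies. The Python sketch below shows that arithmetic under the assumption, consistent with both examples, that a new source iteration starts every cycle; `pipeline_plan` and its dictionary keys are names chosen for the illustration.

```python
from typing import Dict


def pipeline_plan(load_latency: int, fma_latency: int,
                  first_pred: int = 16) -> Dict[str, int]:
    """Derive the loop parameters from the latencies (one new iteration/cycle)."""
    fma_offset = load_latency                  # fma waits for the loads
    store_offset = load_latency + fma_latency  # store waits for the fma
    return {
        "ec": store_offset + 1,                # number of stages
        "load_pred": first_pred,               # (p16) ldfd
        "fma_pred": first_pred + fma_offset,   # (p18) or (p22) fma.d
        "store_pred": first_pred + store_offset,  # (p19) or (p26) stfd
        "rotf_dx": load_latency + 1,           # dx[0] .. dx[load_latency]
        "rotf_dy": load_latency + 1,
        "rotf_tmp": fma_latency + 1,           # tmp[0] .. tmp[fma_latency]
    }


original = pipeline_plan(load_latency=2, fma_latency=1)
updated = pipeline_plan(load_latency=6, fma_latency=4)
```

The loop body keeps the same five instructions in both cases; longer latencies only push the consumers to later predicates and deeper rotating registers, and raise the number of drain stages.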

Rotation: Summary
• Loop pipelining maximizes performance; minimizes overhead
  – Avoids code expansion of unrolling and code explosion of prologue and epilogue
  – Smaller code means fewer cache misses
  – Greater performance improvements in higher latency conditions
• Reduced overhead allows S/W pipelining of small loops with unknown trip counts
  – Typical of integer scalar codes

Outline
• Register Stack
  – Register Stack Engine
• Register Rotation
  – Loop Branches
  – Modulo-Scheduling of Loops
• Summary

Register Model Summary
• GR Stack
  – Overlap call/ret operations with real work
  – RSE hides spills/fills
• GR, FR, PR Rotation
  – General acceleration for all types of loops
• SW-visible resources
  – Large named register files & renaming
• HW simplicity and explicit control
