IA-64 Register Model: Stack & Rotation
Dale Morris, Architect, Hewlett Packard Co.

Philosophy
• Large register files
  – Most processors have lots of registers
• Explicit control over register renaming
  – Most processors have register renaming
• IA-64 makes the register names SW-visible & makes the renaming explicit

Outline
• Register Stack
  – Register Stack Engine
• Register Rotation
  – Loop Branches
  – Modulo-Scheduling of Loops
• Summary

Register Stack
• Motivation:
  – Automatic save/restore of GRs on procedure call/return
  – Cache traffic reduction
  – Latency hiding of register spill/fill

General Registers
• r32-r127: Stacked
• r0-r31: Static

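GR Stack Frame
[Figure: the general register file, r0-r127. r0-r31 are static. The current frame starts at r32 and holds the locals (including the inputs) followed by the outputs; stacked registers above the frame are illegal to access. Size of locals (sol) spans the locals; size of frame (sof) spans locals plus outputs. The Current Frame Marker (CFM) holds sol and sof.]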

GR Stack Frame - Example
[Figure: a frame with locals in r32-r45 and outputs in r46-r52.]
CFM: size of locals (sol) = 14, size of frame (sof) = 21

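GR Stack Frame - Call
[Figure: the caller's frame (loc r32-r45, out r46-r52) and, after the call, the callee's frame, which contains only the caller's outputs, renamed to r32-r38.]
CFM (sol, sof): 14, 21 -> call -> 0, 7
PFM (sol, sof): x, x -> call -> 14, 21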

GR Stack Frame - Allocate
[Figure: after alloc, the callee's frame is resized to loc r32-r47 and out r48-r50. The caller's outputs (r32-r38 in the callee) are the callee's inputs.]
CFM (sol, sof): 14, 21 -> call -> 0, 7 -> alloc -> 16, 19
PFM (sol, sof): x, x -> call -> 14, 21

GR Stack Frame - Return
[Figure: after the return, the caller's original frame is visible again (loc r32-r45, out r46-r52).]
CFM (sol, sof): 14, 21 -> call -> 0, 7 -> alloc -> 16, 19 -> return -> 14, 21
PFM (sol, sof): x, x -> call -> 14, 21

Instructions
• br.call
  – Copies CFM to PFM
  – Creates new frame with only output regs
  – Saves local regs from previous frame
• alloc
  – Resizes current frame
  – Saves PFM to a GR

Instructions (cont.)
• mov to PFS
  – Restores PFM from a GR
• br.ret
  – Restores CFM from PFM
  – Restores local regs for previous frame
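The frame bookkeeping described on the Instructions slides can be sketched as a toy model. This is a minimal illustration, not the architected behavior: the class name `RegisterStack`, the `base` field, and the simplified `alloc(sol, sof)` signature are assumptions made here, and register contents are not modeled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (locals + outputs)


class RegisterStack:
    """Toy model of CFM/PFM updates only (hypothetical helper)."""

    def __init__(self, sol, sof):
        self.cfm = Frame(sol, sof)
        self.pfm = None   # shown as "x x" on the slides: not yet defined
        self.base = 0     # assumed name: stack offset of the frame's r32

    def br_call(self):
        # Copy CFM to PFM; the new frame holds only the caller's outputs.
        self.pfm = self.cfm
        self.base += self.cfm.sol
        self.cfm = Frame(0, self.pfm.sof - self.pfm.sol)

    def alloc(self, sol, sof):
        # Resize the current frame; hand back PFM so it can be kept in a GR.
        self.cfm = Frame(sol, sof)
        return self.pfm

    def mov_to_pfs(self, saved):
        # Restore PFM from the value kept in a GR.
        self.pfm = saved

    def br_ret(self):
        # Restore CFM from PFM; the caller's locals come back into view.
        self.cfm = self.pfm
        self.base -= self.cfm.sol


rs = RegisterStack(14, 21)   # the example frame from the slides
rs.br_call()                 # CFM becomes (0, 7), PFM becomes (14, 21)
saved = rs.alloc(16, 19)     # CFM becomes (16, 19)
rs.mov_to_pfs(saved)
rs.br_ret()                  # CFM is (14, 21) again
```

The sequence at the bottom replays the call, alloc, and return steps of the GR Stack Frame example.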

Leaf Procedure Optimization
• No need to save/restore PFM
• Can always use scratch static GRs
• Can omit alloc if:
  – Not many registers needed
  – Register rotation not needed

Register Stack Engine
• Automatically spills/fills registers from memory as needed
• Registers saved on a Backing Store Stack
• Spills/fills NaT bits as well

Reg Stack & Backing Store
[Figure: A calls B calls C. The physical stacked registers hold proc. A (sol_A), proc. B (sol_B), and proc. C as the current frame (sof_C), with unallocated registers on either side; arrows mark the call and return directions. RSE loads/stores move registers between the physical stacked registers and the Backing Store, which holds proc. A's ancestors.]
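The spill/fill behavior in the figure can be approximated with a counting model. This is a rough sketch under stated assumptions: 96 physical stacked registers, spills only when a new frame does not fit, and fills only when a return needs registers that were spilled. The class name `RegisterStackEngine` and the `demo` helper are hypothetical; register values and NaT bits are not modeled.

```python
class RegisterStackEngine:
    """Counts registers only; values and NaT bits are not modeled."""

    def __init__(self, n_phys=96):
        self.n_phys = n_phys      # physical stacked registers (assumed 96)
        self.sol = self.sof = 0   # current frame
        self.resident = 0         # ancestors' locals still in physical registers
        self.backing_store = 0    # ancestors' locals spilled to memory
        self.callers = []         # (sol, sof) of each caller, innermost last
        self.spills = self.fills = 0

    def alloc(self, sol, sof):
        self.sol, self.sof = sol, sof
        overflow = self.resident + self.sof - self.n_phys
        if overflow > 0:          # spill the oldest ancestor registers
            self.resident -= overflow
            self.backing_store += overflow
            self.spills += overflow

    def call(self):
        self.callers.append((self.sol, self.sof))
        self.resident += self.sol                    # caller's locals become ancestors
        self.sol, self.sof = 0, self.sof - self.sol  # callee sees only the outputs

    def ret(self):
        sol, sof = self.callers.pop()
        missing = sol - self.resident
        if missing > 0:           # caller's locals were spilled: fill them back
            self.backing_store -= missing
            self.resident += missing
            self.fills += missing
        self.resident -= sol
        self.sol, self.sof = sol, sof


def demo(depth, sol, out):
    """Nest `depth` calls, each frame with `sol` locals and `out` outputs."""
    rse = RegisterStackEngine()
    for _ in range(depth):
        rse.alloc(sol, sol + out)
        rse.call()
    for _ in range(depth):
        rse.ret()
    return rse.spills, rse.fills
```

With small frames, `demo(3, 8, 4)` never touches memory; with large frames, `demo(3, 40, 10)` overflows the assumed 96 registers and the model spills and later fills the same number of registers.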

Register Stack: Summary
• Exposes register renaming to SW
• Avoids register spill when few needed
• Hides register spill/fill
• Programmable sizes
  – only use as many registers as you need

Register Rotation
• Motivation:
  – pipeline-schedule loops onto HW
  – remove extraneous work from loop
  – minimize start-up overhead
  – small code footprint
  – maximum computational throughput with few instructions

GR Stack Frame w/ Rotation
[Figure: the GR stack frame as before (static r0-r31; locals and outputs from r32, bounded by sol and sof), with a rotating region of Size of Rotating (sor) starting at r32. The Current Frame Marker (CFM) carries rrb.pr, rrb.fr, and rrb.gr along with the sizes sor, sol, and sof.]

GR Rotation
• Size of rotating region multiple of 8
• Rotating region overlays current frame
  – Starts at r32
  – Overlay allows rotation & stack renaming in a single level of adders
  – Must copy input registers before loop

FR Rotation
• f32-f127: Rotating
• f0-f31: Static
• Upper 3/4 of register file rotates

Predicate Rotation
• p16-p63: Rotating
• p0-p15: Static
• Upper 3/4 of register file rotates

Register Rotation & RRB
• Separate Rotating Register Base for each: GRs, FRs, PRs
• Loop branches decrement all register rotating bases (RRB)
• Instructions contain a “virtual” register number
  – RRB + virtual register number = physical register number
[Animation: the words “Palm Springs is Sunny” pass through the rotating registers, one word per load.]
Step  Instructions       RRB  Physical registers              Output so far
1     ld1 R35             0   35: Palm
2     st1 R35, ld2 R34    0   35: Palm, 34: Springs           Palm
3     st2 R35, ld3 R34   -1   35: Palm, 34: Springs, 33: is   Palm Springs
4     st3 R35, ld4 R34   -2   34: Springs, 33: is, 32: Sunny  Palm Springs is
5     st4 R35            -3   33: is, 32: Sunny               Palm Springs is Sunny
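The renaming rule (RRB + virtual register number = physical register number) can be replayed in a few lines. This sketch assumes an FR-style rotating region of 96 registers starting at register 32, with wrap-around inside that region; the function name `physical` and the step encoding are illustrative, not part of the original slides.

```python
ROT_BASE, ROT_SIZE = 32, 96   # assumed rotating region: registers 32-127


def physical(virtual, rrb):
    """Map a virtual register number to a physical one for a given RRB."""
    if virtual < ROT_BASE:
        return virtual        # static registers are never renamed
    return ROT_BASE + (virtual - ROT_BASE + rrb) % ROT_SIZE


words = ["Palm", "Springs", "is", "Sunny"]
source = iter(words)
regs = {}     # physical register number -> contents
output = []   # what the stores have written so far

# (RRB, instructions) for each animation step, as transcribed from the slides.
steps = [
    (0, [("ld", 35)]),
    (0, [("st", 35), ("ld", 34)]),
    (-1, [("st", 35), ("ld", 34)]),
    (-2, [("st", 35), ("ld", 34)]),
    (-3, [("st", 35)]),
]

for rrb, instructions in steps:
    for op, virtual in instructions:
        p = physical(virtual, rrb)
        if op == "ld":
            regs[p] = next(source)   # load the next word into the register
        else:
            output.append(regs[p])   # store the word the register holds
```

Every store names the same virtual register R35, yet each one reads a different physical register because RRB has moved in between.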

Loop Branches
• br.cloop uses LC for simple, non-pipelined loops
  – decrements LC and loops until LC is 0
• br.ctop uses LC and EC for pipelined counted loops
• br.wtop uses branch predicate and EC for pipelined “while” loops
• br.cexit, br.wexit used for unrolled, pipelined loops

br.ctop
• Function (simplified):
  – if (LC>0)      {LC--; pr[63]=1; rrb--; loop;}
    else if (EC>1) {EC--; pr[63]=0; rrb--; loop;}
    else           {EC--; pr[63]=0; rrb--; fall_through;}
• LC counts main loop iterations
• EC counts pipeline stages for drain
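The simplified br.ctop function above can be written out and traced directly. This is a sketch of the simplified form only (it does not handle a loop entered with EC already 0), and the names `LoopState`, `br_ctop`, and `run` are chosen here for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    lc: int       # loop count: main loop iterations remaining
    ec: int       # epilog count: pipeline stages left to drain
    rrb: int = 0  # rotating register base


def br_ctop(s):
    """Apply one br.ctop; return (taken, value written to pr[63])."""
    if s.lc > 0:
        s.lc -= 1
        p63, taken = 1, True
    elif s.ec > 1:
        s.ec -= 1
        p63, taken = 0, True
    else:
        s.ec -= 1
        p63, taken = 0, False
    s.rrb -= 1    # every case rotates the registers
    return taken, p63


def run(lc, ec):
    """Trace (p63, LC, EC, RRB, taken) after each branch until fall-through."""
    s = LoopState(lc, ec)
    trace = []
    while True:
        taken, p63 = br_ctop(s)
        trace.append((p63, s.lc, s.ec, s.rrb, taken))
        if not taken:
            return trace
```

Calling `run(3, 4)` gives seven branches: three with pr[63]=1 while LC counts down, then four with pr[63]=0 while EC drains, the last of which falls through.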

Software Pipelining
• Overlapping execution of different loop iterations
[Figure: two loop schedules compared side by side (“vs.”)]
• More iterations in same amount of time

Software Pipelining
• Traditional architectures use loop unrolling
  – High overhead: extra code for loop body, prologue, and epilogue
• Synergistic use of IA-64 features:
  – Full Predication
  – Special branches
  – Register rotation: removes loop copy overhead
  – Predicate rotation: removes prologue & epilogue
Especially Useful for Integer Code With Small Number of Loop Iterations

Pipelined Loop Example
• DAXPY inner loop
  – dy[i] = dy[i] + (da * dx[i])
  – 2 loads, 1 fma, 1 store / iteration
• Machine assumptions
  – can do 2 loads, 1 store, 1 fma, 1 br / cycle
  – load latency of 2 clocks
  – fma latency of 1 clock

Example: Pipeline
• Each column represents 1 source iteration
[Figure: pipeline diagram; the operations in each column are: load dx, dy; tmp = dy + da * dx; store dy]

Example Code
        .rotf dx[3], dy[3], tmp[2]
        mov ar.lc = 3                      // #iterations-1
        mov ar.ec = 4                      // #stages
        mov pr.rot = 0x10000 ;;            // p16 = 1, p17-p63 = 0
looptop:
(p16)   ldfd  dx[0] = [dxsp], 8            // stage 0: load dx[i], post-increment
(p16)   ldfd  dy[0] = [dysp], 8            // stage 0: load dy[i], post-increment
(p18)   fma.d tmp[0] = da, dx[2], dy[2]    // stage 2: da * dx + dy
(p19)   stfd  [dydp] = tmp[1], 8           // stage 3: store result
        br.ctop looptop ;;

Loop Execution Sequence
Loop body: (p16) ldx, (p16) ldy, (p18) fma, (p19) st, br.ctop
Initialization: p16 = 1; p17, p18, p19, p63 = 0
Step      pr[63] written  LC      EC      RRB       Result
Init                      3       4       0
Branch 1  1               3 -> 2  4       0 -> -1   loop
Branch 2  1               2 -> 1  4       -1 -> -2  loop
Branch 3  1               1 -> 0  4       -2 -> -3  loop
Branch 4  0               0       4 -> 3  -3 -> -4  loop
Branch 5  0               0       3 -> 2  -4 -> -5  loop
Branch 6  0               0       2 -> 1  -5 -> -6  loop
Branch 7  0               0       1 -> 0  -6 -> -7  fall through
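The whole sequence can be checked with a small functional simulation of the example loop. This is a sketch under several assumptions: `.rotf` is taken to assign dx, dy, tmp to consecutive rotating registers starting at f32; memory is modeled as Python lists with pointers as indices; and latencies are ignored, so only the renaming and predication are exercised. The function names are illustrative.

```python
FR_BASE, FR_SIZE = 32, 96   # f32-f127 rotate
PR_BASE, PR_SIZE = 16, 48   # p16-p63 rotate


def phys(virtual, rrb, base, size):
    """Virtual-to-physical mapping inside one rotating region."""
    return base + (virtual - base + rrb) % size


def daxpy_pipelined(da, dx, dy):
    """Run the 4-stage example loop; return (new dy, number of loop passes)."""
    fr = [0.0] * 128
    pr = [0] * 64
    DX, DY, TMP = 32, 35, 38            # assumed layout: dx[3], dy[3], tmp[2]
    lc, ec, rrb = len(dx) - 1, 4, 0     # mov ar.lc / mov ar.ec
    pr[16] = 1                          # mov pr.rot = 0x10000
    mem_dy = list(dy)
    dxsp = dysp = dydp = 0
    passes = 0

    def f(v):                           # physical FR index for a virtual FR
        return phys(v, rrb, FR_BASE, FR_SIZE)

    def p(v):                           # value of a virtual predicate
        return pr[phys(v, rrb, PR_BASE, PR_SIZE)]

    while True:
        passes += 1
        if p(16):                       # (p16) ldfd dx[0], ldfd dy[0]
            fr[f(DX)] = dx[dxsp]
            dxsp += 1
            fr[f(DY)] = mem_dy[dysp]
            dysp += 1
        if p(18):                       # (p18) fma.d tmp[0] = da, dx[2], dy[2]
            fr[f(TMP)] = da * fr[f(DX + 2)] + fr[f(DY + 2)]
        if p(19):                       # (p19) stfd [dydp] = tmp[1]
            mem_dy[dydp] = fr[f(TMP + 1)]
            dydp += 1
        # br.ctop (simplified form)
        if lc > 0:
            lc -= 1
            p63, taken = 1, True
        elif ec > 1:
            ec -= 1
            p63, taken = 0, True
        else:
            ec -= 1
            p63, taken = 0, False
        pr[phys(63, rrb, PR_BASE, PR_SIZE)] = p63
        rrb -= 1
        if not taken:
            return mem_dy, passes
```

For example, `daxpy_pipelined(2.0, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])` takes seven passes for four iterations, matching the seven branches in the sequence.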

Pipelining & Latency
• Suppose we change the latencies
  – load latency of 6 clocks
  – fma latency of 4 clocks

Example: New Pipeline
• Each column represents 1 source iteration
[Figure: pipeline diagram; the operations in each column are: load dx, dy; tmp = dy + da * dx; store dy]

Updated Loop
        .rotf dx[7], dy[7], tmp[5]
        mov ar.lc = 3                      // #iterations-1
        mov ar.ec = 11                     // #stages
        mov pr.rot = 0x10000 ;;            // p16 = 1, p17-p63 = 0
looptop:
(p16)   ldfd  dx[0] = [dxsp], 8            // stage 0: load dx[i], post-increment
(p16)   ldfd  dy[0] = [dysp], 8            // stage 0: load dy[i], post-increment
(p22)   fma.d tmp[0] = da, dx[6], dy[6]    // stage 6: da * dx + dy
(p26)   stfd  [dydp] = tmp[4], 8           // stage 10: store result
        br.ctop looptop ;;
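The only things that change between the two versions of the loop are the stage predicates, the rotating register counts, and EC. Both versions are consistent with a simple rule, sketched below under the assumption that one loop pass issues per clock, so a stage is one clock long. The function name `pipeline_params` and the dictionary keys are illustrative.

```python
def pipeline_params(load_latency, fma_latency):
    """Derive the loop's stage numbers from the two latencies (in clocks)."""
    fma_stage = load_latency                  # fma waits for the loads
    store_stage = load_latency + fma_latency  # store waits for the fma
    return {
        "fma_pred": 16 + fma_stage,           # loads use p16 (stage 0)
        "store_pred": 16 + store_stage,
        "ec": store_stage + 1,                # number of stages
        "rotf_dx": fma_stage + 1,             # dx[], dy[] live from load to fma
        "rotf_dy": fma_stage + 1,
        "rotf_tmp": fma_latency + 1,          # tmp[] lives from fma to store
        "fma_src_index": fma_stage,           # dx[k], dy[k] read by the fma
        "store_src_index": fma_latency,       # tmp[k] read by the store
    }


first = pipeline_params(2, 1)     # p18, p19, EC=4, dx[3], tmp[2]
updated = pipeline_params(6, 4)   # p22, p26, EC=11, dx[7], tmp[5]
```

The loop body itself keeps the same five instructions in both cases; only these parameters move.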

Rotation: Summary
• Loop pipelining maximizes performance; minimizes overhead
  – Avoids code expansion of unrolling and code explosion of prologue and epilogue
  – Smaller code means fewer cache misses
  – Greater performance improvements in higher latency conditions
• Reduced overhead allows S/W pipelining of small loops with unknown trip counts
  – Typical of integer scalar codes

Register Model Summary
• GR Stack
  – Overlap call/ret operations with real work
  – RSE hides spills/fills
• GR, FR, PR Rotation
  – General acceleration for all types of loops
• SW-visible resources
  – Large named register files & renaming
• HW simplicity and explicit control

IA-64 Register Model: Stack & Rotation Dale Morris Architect Hewlett Packard Co.