Compiling for the Intel® Itanium™ Architecture: Compiler Tricks
Steve Skedzielewski, Intel Corporation

Agenda
• Architecture Principles
• Compiler Bag of Tricks
  – Speculation
  – Predication
  – Branching
  – Loop Generation

Traditional Architectures: Limited Parallelism
[Diagram: original source code -> compiler -> sequential machine code -> hardware, which parallelizes the code across multiple functional units]
• Execution units available, but used inefficiently
• Today's processors are often 60% idle

Itanium™ Architecture: Explicit Parallelism
[Diagram: original source code -> Itanium™ compiler, which views a wider scope -> parallel machine code -> hardware with multiple functional units]
• More efficient use of execution resources
• Increases parallel execution

Itanium™ Architecture Principles
• Explicit parallelism:
  – Instruction level parallelism (ILP) in machine code
  – Compiler schedules across a wide scope
• Enhanced ILP:
  – Predication, speculation, software pipelining, ...
• Compatibility:
  – Across all Itanium™ processor family members
  – IA-32 in hardware and PA-RISC through instruction mapping
• Massive resources:
  – Many registers
  – Many functional units

Speculation Review
    Traditional Architectures: instr 1; instr 2; ...; br (barrier); load; use
    Itanium™ Architecture:     ld.s; instr 1; instr 2; br; chk.s; use
Advances a load, even above a branch
• Memory latency is a major performance bottleneck in today's systems
  – CPU to memory gap increasing

Speculating Uses
    Itanium™ Architecture: ld.s; instr 1; instr 2; br; chk.s; use
• Uses of speculative data can also be executed speculatively
  – Distinguishes speculation from simple prefetch
Enables further parallelism

Introducing the NaT ("Not a Thing")
    Itanium™ Architecture:
    ld.s        ; exception detection
    instr 1
    instr 2     ; propagate exception
    br
    chk.s       ; exception delivery
    use
• NaT is the GR's 65th bit that indicates:
  – whether or not an exception has occurred
  – when a branch to recovery code is required
• NaT set during ld.s, tested by chk.s

Propagation
• All computations propagate NaTs, which reduces the number of checks
    ld8.s  r3 = (r9)
    ld8.s  r4 = (r10)
    shladd r6 = r3, 3, r4
    ld8.s  r5 = (r6)
    p1, p2 = cmp(...)
    chk.s  r5              ; needs only one chk on result
    sub    r7 = r5, r2
• Cmp propagates "false" when writing predicates
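The propagation rule above can be sketched in C. This is only a toy model, not the hardware's behavior in detail: the `Reg` type and the `load_spec`, `shladd`, `sub_reg`, and `needs_recovery` helpers are names invented for illustration, and a NULL pointer stands in for any load whose exception gets deferred.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A general register modeled as a 64-bit value plus its NaT flag. */
typedef struct {
    int64_t value;
    bool nat;   /* set when a speculative load deferred an exception */
} Reg;

/* Models ld8.s: a NULL address (hypothetical stand-in for a faulting
 * access) yields a NaT instead of trapping. */
static Reg load_spec(const int64_t *addr) {
    Reg r = {0, true};
    if (addr != NULL) {
        r.value = *addr;
        r.nat = false;
    }
    return r;
}

/* Models shladd dst = a, count, b: (a << count) + b.
 * The result is a NaT if either source is a NaT. */
static Reg shladd(Reg a, unsigned count, Reg b) {
    assert(count < 64);  /* shift must stay below the register width */
    Reg r;
    r.value = (int64_t)(((uint64_t)a.value << count) + (uint64_t)b.value);
    r.nat = a.nat || b.nat;
    return r;
}

/* Models sub dst = a, b with the same propagation rule. */
static Reg sub_reg(Reg a, Reg b) {
    Reg r;
    r.value = (int64_t)((uint64_t)a.value - (uint64_t)b.value);
    r.nat = a.nat || b.nat;
    return r;
}

/* Models chk.s: true means "branch to recovery code". */
static bool needs_recovery(Reg r) {
    return r.nat;
}
```

Because every operation ORs the NaT flags of its sources, a single `needs_recovery` call on the final result covers all the speculative loads that fed it.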

Exception Deferral: More Than Skin Deep
• Costly exceptions can be deferred
• OS can control deferral of:
  – Page faults
  – Protection violations
  – …
• NaTs enable deferral with recovery
    Home block:    ld.s; instr 1; instr 2; uses; br; chk.s
    Recovery code: ld; uses; br home
Enables aggressive code motion at compile time

Store Barrier
    Traditional Architectures: instr 1; instr 2; ...; store (*) (barrier); load (*); use
Traditional architectures limited by the store barrier

Introducing Data Speculation
• Compiler can issue a load prior to a preceding, possibly-conflicting store
    Traditional Architectures: instr 1; instr 2; ...; st8 (barrier); ld8; use
    Itanium™ Architecture:     ld8.a; instr 1; instr 2; st8; ld.c; use
Unique to Itanium™ Architecture

Data Speculation
• Uses can be speculated
    With ld.c:     ld8.a; instr 1; instr 2; st8; ld.c; use
    With chk.a:    ld8.a; instr 1; use; instr 2; st8; chk.a
    Recovery code: ld8; uses; br home
Synergy with control speculation increases performance

Architectural Support for Data Speculation
• Instructions
  – ld.a: advanced loads
  – ld.c: check loads
  – chk.a: advanced load checks
• Speculative advanced loads: ld.sa is an advanced load with deferral
• ALAT: HW structure containing outstanding advanced loads

Advanced Load Address Table (ALAT)
• ld.a inserts entries
• Conflicting stores remove entries
  – Also: ld.c.clr, chk.a.clr
• Presence of entry indicates success
  – chk.a branches when no entry is found
[Diagram: ld.a reg# = ... writes a (reg #, address) entry into the table; a st compares its address against the entries; chk.a reg# looks up its register number]
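The ALAT rules above (ld.a inserts, a conflicting store removes, a check looks for the entry) can be sketched as a small table in C. This is an illustrative model under loud assumptions: the capacity `ALAT_ENTRIES`, the exact-address match, the evict-slot-0 policy, and all function names are invented here and say nothing about how a real implementation sizes or indexes its table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ALAT_ENTRIES 8   /* hypothetical capacity chosen for the sketch */

typedef struct {
    bool valid;
    int reg;          /* target register of the advanced load */
    uintptr_t addr;   /* address the advanced load read from */
} AlatEntry;

typedef struct {
    AlatEntry e[ALAT_ENTRIES];
} Alat;

/* ld.a: record (reg, addr), replacing any older entry for reg. */
static void alat_advanced_load(Alat *t, int reg, uintptr_t addr) {
    assert(t != NULL);
    size_t slot = 0;      /* assumption: evict slot 0 when the table is full */
    bool found = false;
    for (size_t i = 0; i < ALAT_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].reg == reg) {
            slot = i;     /* reuse the entry already held by this register */
            break;
        }
        if (!t->e[i].valid && !found) {
            slot = i;     /* remember the first free slot */
            found = true;
        }
    }
    t->e[slot].valid = true;
    t->e[slot].reg = reg;
    t->e[slot].addr = addr;
}

/* Store: remove every entry whose address conflicts.
 * The sketch treats only an exact address match as a conflict. */
static void alat_store(Alat *t, uintptr_t addr) {
    assert(t != NULL);
    for (size_t i = 0; i < ALAT_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].addr == addr) {
            t->e[i].valid = false;
        }
    }
}

/* chk.a / ld.c: an entry that is still present means the advanced
 * load was not overwritten, so the speculation succeeded. */
static bool alat_check(const Alat *t, int reg) {
    assert(t != NULL);
    for (size_t i = 0; i < ALAT_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].reg == reg) {
            return true;
        }
    }
    return false;
}
```

A `false` result from `alat_check` corresponds to the case where chk.a branches to recovery code, or ld.c reloads the value.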

Speculation Benefits
• Reduces impact of memory latency
• Improves code with many cache accesses
  – Large databases
  – Operating systems
• Gives scheduling flexibility

Predication
[Diagram: Traditional Architectures: cmp branches to separate then and else blocks. Itanium™ Architecture: cmp sets p1 and p2, which guard the then and else code]
• Converts branches to conditional execution
  – Executes multiple paths simultaneously
• Exposes parallelism and reduces critical path
  – Better utilizes wider machines
  – Reduces mispredicted branches

Complex Transformations
• Mark from SPEC CPU95 130.li
• Low ILP in each block
• Highly mispredicted branch
Not your simple if-then-else

Complex Transformations
[Diagram: set p1 = true on entry; blocks are guarded by p1 or p2, and each sets p1 or p2 based upon the next path]
• One loop back branch, always taken
• Utilizes machine width
Global control flow reduction

Upward Code Movement
Original:
    cmp.unc.eq p1, p2 = r1, r2
    :
    (p1) br --> label
    :
    ld  r4 = [r3]
    add r5 = r4, 1
With control speculation:
    cmp.unc.eq p1, p2 = r1, r2
    :
    ld.s r4 = [r3]
    add  r5 = r4, 1
    :
    (p1) br --> label
    chk.s r4, rec
• Speculate both the load and the use
• Depending upon deferral mode, the add could cause cache miss

Upward Code Movement
Original:
    cmp.unc.eq p1, p2 = r1, r2
    :
    (p1) br --> label
    :
    ld  r4 = [r3]
    add r5 = r4, 1
With predication:
    cmp.unc.eq p1, p2 = r1, r2
    :
    (p2) ld  r4 = [r3]
    (p2) add r5 = r4, 1
    :
    (p1) br --> label
• Predicate with fall-thru predicate
• Motion bounded by compare
Predication can avoid speculative side effects

Downward Code Movement
[Diagram: main trace from A to merge block C, with compensation block B on the side path]
• Predication enables downward code movement from A to C without compensation code in B
• Use predication to merge sparse code in compensation block with code in merge block

Code Motion Tradeoffs
[Diagram: blocks A, B, C, D, annotated with downward code motion and upward code motion]
• Slots available in hot path
• Predicate region formation occurs before scheduling
• Predication can pull instructions from lower weight path
• Scheduler can move instructions from above and below
Solutions
• Heuristic formation
• Preschedule information
• Reverse if-conversion

Introducing Parallel Compares
• Three new types of compares:
  – AND: both target predicates set FALSE if compare is false
  – OR: both target predicates set TRUE if compare is true
  – ANDOR: if true, sets one TRUE, sets other FALSE
[Diagram: compares A, B, C, D chained one after another versus evaluated side by side]
Reduces critical path

Method of Use
Or predicate
• Initially clear predicate
• All true compares will set
• All false compares do nothing
    cycle 0: cmp.unc.ne p1 = r0, r0
    cycle 1: cmp.or.eq  p1 = 40, r7
             cmp.or.eq  p1 = 9, r7
And predicate
• Initially set predicate
• All true compares do nothing
• All false compares will clear
    cycle 0: cmp.unc.eq p1 = r0, r0
    cycle 1: cmp.and.ge p1 = 48, r6
             cmp.and.lt p1 = 58, r6

Parallel Compare Example
    if (c1 && c2 && c3 && c4) then_code else else_code
Itanium™ Architecture code:
    cycle 0: cmp.unc.eq   p1, p2 = r0, 0
    cycle 1: cmp.and.orcm p1, p2 = c1
             cmp.and.orcm p1, p2 = c2
             cmp.and.orcm p1, p2 = c3
             cmp.and.orcm p1, p2 = c4
    cycle 2: (p1) then_code
             (p2) else_code
Significant control height reduction

Predication Benefits
• Reduces branches and mispredict penalties
• Parallel compares further reduce critical paths
• Greatly improves code with hard to predict branches
• Works in tandem with speculation
• Traditional architectures' "bolt-on" approach can't efficiently approximate predication
  – Cmove: 39% more instructions, 23% slower performance*
  – All instructions need predication
* Source: S. Mahlke, 1995

Branch Instruction
[Diagram: a 128-bit bundle (bits 127 to 0) holding the branch, Instruction 1, Instruction 0, and the template; the 41-bit branch instruction carries the QP and a 21-bit IP offset]
• Two basic branch formats
  – Relative: IP := IP + Offset21
  – Indirect: IP := BR[I]
  – 8 branch registers for efficient branch execution
  – Call/return linking through branch registers
• Loop branches with 64-bit loop count register (LC)
  – Enables perfect branch prediction of counted loops
  – Traditional architectures always mispredict last iteration
  – Important for low trip count loops

Branch Predicates
    Unconditional branch:  (p0) br target;
    Conditional branches:  cmp p1 = cond
                           (p1) br target;
• Compare and branch can be in same cycle
• Compiler-directed static prediction augments dynamic prediction
  – Reduced false mispredicts due to aliasing
  – Frees space in H/W predictor
  – Can give hint for dynamic predictor

8 Queens Example
    if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Unconditional compares, by cycle:
    1: R1=&b[j]; R3=&a[i+j]; R5=&c[i-j+7]
    2: ld R2=[R1]; ld.s R4=[R3]; ld.s R6=[R5]
    4: P1,P2 <- cmp.unc(R2==true)
    5: (p1) chk.s R4; (p1) P3,P4 <- cmp.unc(R4==true)
    6: (p3) chk.s R6; (p3) P5,P6 <- cmp.unc(R6==true)
    7: (p5) br then
       else
[Diagram: 8 queens control flow: P1, P3, P5 lead toward Then; P2, P4, P6 lead to Else]

Eight Queens Example
    if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Parallel compares, by cycle:
    1: R1=&b[j]; R3=&a[i+j]; R5=&c[i-j+7]; p1 <- true
    2: ld R2=[R1]; ld R4=[R3]; ld R6=[R5]
    4: p1,p2 <- cmp.and(R2==true); p1,p2 <- cmp.and(R4==true); p1,p2 <- cmp.and(R6==true)
       (p1) br then
       else
Major reduction in control flow

Multi-way Branch
W/o speculation:
    ld8 r6 = (ra)
    (p1) br exit1
    ld8 r7 = (rb)
    (p3) br exit2
    ld8 r8 = (rc)
    (p5) br exit3
Hoisting loads:
    ld8.s r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    chk r6, rec0
    (p1) br exit1
    chk r7, rec1
    (p3) br exit2
    chk r8, rec2
    (p5) br exit3
Multi-way branch:
    chk r6, rec0
    (p2) chk r7, rec1
    (p4) chk r8, rec2
    { (p1) br exit1
      (p3) br exit2
      (p5) br exit3 }
[Diagram: control flow through predicates P1 to P6]
• Separate branches take 3 branch cycles; the multi-way branch takes 1 branch cycle
• Multi-way branches: more than 1 branch in a single cycle
• Allows n-way branching
Supports aggressive speculation

Multi-way Branch
W/o predication:
    cmp p1, p2 = c1
    cmp p3, p4 = c2
    cmp p5, p6 = c3
    :
    st [r10] =
    (p1) br exit1
    st [r11] =
    (p3) br exit2
    st [r12] =
    (p5) br exit3
With predication:
    (p2) st [r11] =
    (p4) st [r12] =
    (p1) br exit1
    (p3) br exit2
    (p5) br exit3
Predication and multi-way increase ILP

Loop Example
Convert string to uppercase:
    for (i = 0; i < len; i++) {
        if (IS_LOWERCASE(line[i]))
            newline[i] = CNVT_TO_UPPERCASE(line[i]);
        else
            newline[i] = line[i];
    }
After macro expansion:
    for (i = 0; i < len; i++) {
        if (line[i] >= 'a' && line[i] <= 'z')
            newline[i] = line[i] - 32;
        else
            newline[i] = line[i];
    }
Typical integer-type loop
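For reference, here is a compilable C version of the loop, next to a source-level analogue of the if-converted form the deck builds toward. The function names are invented for this sketch, and the second function only imitates predication in C; a C compiler is free to generate either version with or without branches.

```c
#include <assert.h>
#include <stddef.h>

/* Branchy form: the loop from the slide after macro expansion. */
static void to_upper_branchy(const char *line, char *newline, size_t len) {
    assert(line != NULL && newline != NULL);
    for (size_t i = 0; i < len; i++) {
        if (line[i] >= 'a' && line[i] <= 'z')
            newline[i] = (char)(line[i] - 32);
        else
            newline[i] = line[i];
    }
}

/* If-converted form: both compares feed one predicate p1, and the
 * subtraction takes effect only when p1 is set. The loop body has no
 * if/else at the source level. */
static void to_upper_predicated(const char *line, char *newline, size_t len) {
    assert(line != NULL && newline != NULL);
    for (size_t i = 0; i < len; i++) {
        char c = line[i];
        int p1 = 1;                  /* cmp p1 = true           */
        p1 &= (c > 96);              /* cmp.and p1 = (c > 96)   */
        p1 &= (c < 123);             /* cmp.and p1 = (c < 123)  */
        c = (char)(c - 32 * p1);     /* (p1) sub c = c, 32      */
        newline[i] = c;
    }
}
```

Both functions produce the same output for ASCII input; the second mirrors the compare and predicated-subtract sequence used in the assembly on the following slides.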

Loop Assembly Code
Traditional architecture (5 cycles per iteration, 40 cycles for 8 iterations):
    loop:   ld  c = [ra], 1
            bgt c, 96 bottom
            blt c, 123 bottom
            sub c = c, 32
    bottom: st  [rb] = c, 1
            blt ra, end loop
Itanium™ Architecture (4 cycles per iteration, 32 cycles for 8 iterations):
    loop:   ld  c = [ra], 1
            cmp p1 = true
            cmp.and p1 = (c > 96)
            cmp.and p1 = (c < 123)
            (p1) sub c = c, 32
            st  [rb] = c, 1
            br.cloop
Fewer branches and no mispredictions. Still low ILP.

Unroll for ILP
Unroll twice:
            ld  c = [ra], 1
    loop:   ld  d = [ra], 1
            bgt c, 115, b1
            blt c, 96, b1
            sub c = c, 36
    b1:     st  [rb] = c, 1
            beq rb, end, exit
            ld  c = [ra], 1
            bgt d, 115, b2
            blt d, 96, b2
            sub d = d, 36
    b2:     st  [rb] = d, 1
            blt rb, end, loop
[Diagram: cycle-by-cycle schedule of the unrolled loop]
• 8 iterations in 33 cycles
• 1.2x performance improvement
• Code size: 2x
• Won't gain by unrolling more

Software Pipelining
• Overlapping execution of different loop iterations vs. whole loop computation in one cycle
• More iterations in same amount of time

Software Pipelining
    Cycle 1: ld
    Cycle 2: ld  cmps
    Cycle 3: ld  cmps  ?sub
    Cycle 4: ld  cmps  ?sub  st    (kernel)
• Input enters at ld; output leaves at st
• Data transferred from one functional unit to the next
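The overlap in the table above can be emulated in plain C. This is a sketch only: each outer-loop pass stands for one pass through the kernel, and in pass `t` the four stages work on iterations `t`, `t-1`, `t-2`, and `t-3`. The function name and the four-slot buffer are assumptions made for illustration; the buffer plays the role that registers handing values from stage to stage play on the machine.

```c
#include <assert.h>
#include <stddef.h>

enum { STAGES = 4 };   /* ld, cmps, ?sub, st */

/* Uppercases n characters with a 4-stage software pipeline and
 * returns the number of kernel passes executed. */
static size_t pipelined_upper(const char *in, char *out, size_t n) {
    assert(in != NULL && out != NULL);
    char val[STAGES] = {0};     /* value in flight, slot = iteration % STAGES */
    int lower[STAGES] = {0};    /* "is lowercase" predicate for that value    */
    size_t passes = 0;

    for (size_t t = 0; t < n + STAGES - 1; t++) {
        /* st: iteration t-3 leaves the pipe */
        if (t >= 3 && t - 3 < n) {
            out[t - 3] = val[(t - 3) % STAGES];
        }
        /* ?sub: iteration t-2 is converted only if its predicate is set */
        if (t >= 2 && t - 2 < n) {
            size_t k = (t - 2) % STAGES;
            if (lower[k]) val[k] = (char)(val[k] - 32);
        }
        /* cmps: iteration t-1 computes its predicate */
        if (t >= 1 && t - 1 < n) {
            size_t k = (t - 1) % STAGES;
            lower[k] = (val[k] > 96 && val[k] < 123);
        }
        /* ld: iteration t enters the pipe */
        if (t < n) {
            val[t % STAGES] = in[t];
        }
        passes++;
    }
    return passes;
}
```

With `n` iterations and four stages the sketch makes `n + 3` passes: the first three fill the pipe and the last three drain it. It counts kernel passes only and does not model any setup outside the loop.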

Introducing Rotating Registers
• GR 32-127, FR 32-127 can rotate
• Separate Rotating Register Base for each: GRs, FRs
• Loop branches decrement all rotating register bases (RRB)
• Instructions contain a "virtual" register number
  – RRB + virtual register number = physical register number
• References
  – "Overlapped Loop Support in the Cydra 5", Dehnert et al., 1989
  – "Code Generation Schemas for Modulo-Scheduled Loops", Rau et al., MICRO-25, 1992
Allows painless transfer of data between stages
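The mapping rule "RRB + virtual register number = physical register number" needs a wrap-around once the sum leaves the rotating region. A minimal C sketch follows, assuming the region starts at `base` and holds `size` registers; the function name and both parameters are illustrative choices, not architectural names.

```c
#include <assert.h>

/* Maps a virtual register number to a physical one.
 *   virt: register number written in the instruction
 *   rrb:  rotating register base (0, then decremented by loop branches)
 *   base: first register of the rotating region (32 on the slide)
 *   size: number of registers in the rotating region (assumption) */
static int physical_reg(int virt, int rrb, int base, int size) {
    assert(size > 0 && virt >= base && virt < base + size);
    int offset = (virt - base + rrb) % size;  /* C's % can go negative */
    if (offset < 0) {
        offset += size;
    }
    return base + offset;
}
```

With a six-register region r32 to r37, as the animation slides draw it, virtual r34 maps to r34 at RRB = 0, r33 at RRB = -1, and wraps to r37 at RRB = -3. A value written as r34 in one pass is therefore read as r35 in the next pass without any copy instruction.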

Pipelined Loop
Kernel code:
    loop:
    s1:  ld r34 = [ra], 1
         cmp p1 = true
    s2:  cmp.and p1 = (r35>96)
         cmp.and p1 = (r35<123)
    s3:  (p1) sub r36 = r36, 32
    s4:  st [rb] = r37, 1
         br.ctop loop
RRB = 0. Physical register file (virtual and physical numbers coincide):
    ld:   r34 = xx
    cmp:  r35 = xx
    sub:  r36 = xx
    st:   r37 = xx

Fill the pipe...
Input string: G o _ G r e y h
Each step executes a prologue stage of the kernel, then performs a loop branch:
• Decrement lc
• Rotate registers by decrementing RRB
Physical register file at each step, with the virtual register each unit names in parentheses:
    RRB =  0: r34 = G (ld r34);  r35 = xx (cmp r35);  r36 = xx (sub r36);  r37 = xx (st r37)
    RRB = -1: r33 = o (ld r34);  r34 = G (cmp r35);   r35 = xx (sub r36);  r36 = xx (st r37)
    RRB = -2: r32 = _ (ld r34);  r33 = o (cmp r35);   r34 = G (sub r36);   r35 = xx (st r37)
Kernel code:
    loop: ld r34 = [ra], 1
          cmp p1 = true
          cmp.and p1 = (r35>96)
          cmp.and p1 = (r35<123)
          (p1) sub r36 = r36, 32
          st [rb] = r37, 1
          br.ctop loop

Execute the Kernel
Execute kernel: whole iteration per cycle
    RRB = -3: r37 = G (ld r34);  r32 = _ (cmp r35);  r33 = o (sub r36);  r34 = G (st r37)
    RRB = -4: r36 = r (ld r34);  r37 = G (cmp r35);  r32 = _ (sub r36);  r33 = O (st r37)
    RRB = -5: r35 = e (ld r34);  r36 = r (cmp r35);  r37 = G (sub r36);  r32 = _ (st r37)
Kernel code as written from the RRB = -2 step on, using rotating predicates:
    loop: ld r34 = [ra], 1
          cmp p16 = true
          cmp.and p16 = (r35>96)
          cmp.and p16 = (r35<123)
          (p17) sub r36 = r36, 32
          st [rb] = r37, 1
          br.ctop loop

Pipelining Overhead
Prologue and epilogue are bad
• Code size expansion
• Overhead not good for low trip count loops
  – cache performance
[Diagram: kernel and epilogue code blocks]
Can we avoid prologue and epilogue?

Prologue Code
    Cycle 1: ld
    Cycle 2: ld  cmps
    Cycle 3: ld  cmps  ?sub
    Cycle 4: ld  cmps  ?sub  st    (kernel)
Incrementally turn on functional units

Avoid Pro and Epilogues
• Have enable bit on each functional unit
• Enablers are initialized to off
• Feed through a sequence of bits of length dependent upon loop count and pipe depth
[Diagram: the loop count feeds a unit enabler in front of ld, cmp<, cmp>, sub, st; physical register file r34 to r37, all xx]

Revisiting Rotating Predicate Registers
• PR 16-63 can rotate, with separate Rotating Register Base
• Loop branches decrement all rotating register bases (RRB)
• Instructions contain a "virtual" predicate register number
  – RRB + virtual register number = physical register number
• Some predicates control pipeline stages: stage predicates
• Qualifying predicates can still be in the loop
Complete loop code:
    loop:
    s1:  (p16) ld r34 = [ra], 1
    s2:  (p17) cmp.unc p20 = true
               cmp.and p21 = (r35>96)
               cmp.and p21 = (r35<123)
    s3:  (p22) sub r36 = r36, 32
    s4:  (p19) st [rb] = r37, 1
               br.ctop loop

How does this work
RRB = 0
    loop: (p16) ld r34 = [ra], 1
          (p17) cmp.unc p20 = true
                cmp.and p21 = (r35>96)
                cmp.and p21 = (r35<123)
          (p22) sub r36 = r36, 32
          (p19) st [rb] = r37, 1
                br.ctop loop
• (p16), (p17), (p19): stage predicates
• (p22): qualifying predicate
Physical register file: r34 = G (ld); r35 = xx (cmp); r36 = xx (sub); r37 = xx (st)

Auto Predicate Generation
Initialize
• lc to trip count
• ec to epilogue count
• p16 to true
Loop branches
• Rotate predicates by decrementing RRB
• When lc > 0: decrement lc, set p16 = true
• When lc = 0: decrement ec, set p16 = false
• Fall through when ec = 0
[Diagram: lc and ec feed a predicate generator that writes p16; RRB]
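The loop-branch rules above can be sketched as a small state machine in C. The `LoopState` struct and both function names are invented for illustration. Two assumptions are worth flagging: the sketch starts lc at trip count minus one, because the first kernel pass is already running when the first loop branch executes, and it leaves RRB unchanged on the final fall-through, as the last animation slide states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned lc;   /* loop count: iterations still to be started */
    unsigned ec;   /* epilogue count: stages still to be drained */
    int rrb;       /* rotating register base */
    bool p16;      /* first stage predicate, written by the loop branch */
} LoopState;

/* One loop branch following the slide's rules. Returns true if taken. */
static bool loop_branch(LoopState *s) {
    assert(s != NULL);
    bool taken;
    if (s->lc > 0) {
        s->lc--;
        s->p16 = true;       /* start another iteration */
        taken = true;
    } else {
        if (s->ec > 0) {
            s->ec--;
        }
        s->p16 = false;      /* no new iteration: the pipe is draining */
        taken = (s->ec > 0);
    }
    if (taken) {
        s->rrb--;            /* rotate registers and predicates */
    }
    return taken;
}

/* Counts kernel passes for a loop with the given trip count and number
 * of pipeline stages. Optionally reports the final RRB. */
static unsigned kernel_passes(unsigned trip_count, unsigned stages,
                              int *final_rrb) {
    assert(trip_count > 0 && stages > 0);
    LoopState s = { trip_count - 1, stages, 0, true };
    unsigned passes = 1;     /* the first pass runs before any branch */
    while (loop_branch(&s)) {
        passes++;
    }
    if (final_rrb != NULL) {
        *final_rrb = s.rrb;
    }
    return passes;
}
```

For 8 iterations and 4 stages the sketch makes 11 kernel passes and ends with RRB = -10, which lines up with the animation frames running from RRB = 0 to RRB = -10.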

Fill the pipe again... / Chunking thru kernel / Draining the pipe
Complete loop code, unchanged on every step:
    loop: (p16) ld r34 = [ra], 1
          (p17) cmp.unc p20 = true
                cmp.and p21 = (r35>96)
                cmp.and p21 = (r35<123)
          (p22) sub r36 = r36, 32
          (p19) st [rb] = r37, 1
                br.ctop loop
Physical register seen by each unit at each step (input string G o _ G r e y h):
    Fill the pipe again...
    RRB =  0:  ld r34 = G;   cmp r35 = xx;  sub r36 = xx;  st r37 = xx
    RRB = -1:  ld r33 = o;   cmp r34 = G;   sub r35 = xx;  st r36 = xx
    RRB = -2:  ld r32 = _;   cmp r33 = o;   sub r34 = G;   st r35 = xx
    Chunking thru kernel
    RRB = -3:  ld r37 = G;   cmp r32 = _;   sub r33 = o;   st r34 = G
    RRB = -4:  ld r36 = r;   cmp r37 = G;   sub r32 = _;   st r33 = O
    RRB = -5:  ld r35 = e;   cmp r36 = r;   sub r37 = G;   st r32 = _
    RRB = -6:  ld r34 = y;   cmp r35 = e;   sub r36 = r;   st r37 = G
    RRB = -7:  ld r33 = h;   cmp r34 = y;   sub r35 = e;   st r36 = r
    Draining the pipe
    RRB = -8:  ld r32 = xx;  cmp r33 = h;   sub r34 = Y;   st r35 = E
    RRB = -9:  ld r33 = xx;  cmp r34 = xx;  sub r35 = H;   st r36 = Y
    RRB = -10: ld r32 = xx;  cmp r33 = xx;  sub r34 = xx;  st r35 = H
At RRB = -10: fall through the loop, don't rotate.
[Diagram: stage predicates switch the units on during the fill and off during the drain; the stored output accumulates beside the register file in upper case]

Example Summary
• 8 iterations in 12 cycles
• 2.6x speedup of initial code
• 2.75x over unrolled traditional
• No code expansion
• No mispredicts (4x, 1 10 cycle miss)
• Minimal register usage
[Diagram: complete loop code and the final register file state at RRB = -10]

Software Pipelining
• Itanium™ architecture features support SWP
  – Full predication
  – Special branch handling features
  – Register rotation: removes loop copy overhead
  – Predicate rotation/generation: removes prologue & epilogue
• Traditional architectures use loop unrolling
  – High overhead: extra code for loop body, prologue, and epilogue
Especially useful for integer code with small number of loop iterations

Compiler Bag of Tricks
• Predication
  – Removes branches and mispredictions
  – Enables aggressive code motion
  – Parallel compares increase parallelism
• Speculation
  – Hides memory latency
  – Enables aggressive code motion
  – Control speculation over branches
  – Data speculation over stores
  – Compiler-controlled recovery code

Compiler Bag of Tricks
• Rich branch architecture
  – Multi-way branches increase ILP
  – Loop branches
  – Static direction hints assist prediction
• S/W pipelining support with minimal overhead encourages broad usage
  – Performance for small integer loops with unknown trip counts as well as monster FP loops

BACKUP

8 Queens Example
    if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Parallel compares, by cycle:
    1: R1=&b[j]; R3=&a[i+j]; R5=&c[i-j+7]; p1 <- true
    2: ld R2=[R1]; ld R4=[R3]; ld R6=[R5]
    4: p1,p2 <- cmp.and(R2==true); p1,p2 <- cmp.and(R4==true); p1,p2 <- cmp.and(R6==true)
    5: (p1) br then
       else
[Diagram: 8 queens control flow: the chain through P1 to P6 collapses to P1 = true -> Then, P1 = false -> Else]
Reduced from 7 cycles to 5

Five Predicate Compare Types
• (qp) p1, p2 <- cmp.relation
  – if (qp) { p1 = relation; p2 = !relation; }
• (qp) p1, p2 <- cmp.relation.unc
  – p1 = qp & relation; p2 = qp & !relation;
• (qp) p1, p2 <- cmp.relation.and
  – if (qp & (relation == FALSE)) { p1 = 0; p2 = 0; }
• (qp) p1, p2 <- cmp.relation.or
  – if (qp & (relation == TRUE)) { p1 = 1; p2 = 1; }
• (qp) p1, p2 <- cmp.relation.or.andcm
  – if (qp & (relation == TRUE)) { p1 = 1; p2 = 0; }
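The five definitions above translate almost directly into C. The sketch below is a transcription of the slide's pseudo-code, not of the instruction encoding; the `CmpType` enum, `cmp_write`, and `and_chain` are names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    CMP_NORMAL,     /* cmp.relation          */
    CMP_UNC,        /* cmp.relation.unc      */
    CMP_AND,        /* cmp.relation.and      */
    CMP_OR,         /* cmp.relation.or       */
    CMP_OR_ANDCM    /* cmp.relation.or.andcm */
} CmpType;

/* Updates the two target predicates as the slide defines each type.
 * qp is the qualifying predicate; relation is the compare result. */
static void cmp_write(CmpType type, bool qp, bool relation,
                      bool *p1, bool *p2) {
    assert(p1 != NULL && p2 != NULL);
    switch (type) {
    case CMP_NORMAL:
        if (qp) { *p1 = relation; *p2 = !relation; }
        break;
    case CMP_UNC:   /* writes both targets even when qp is false */
        *p1 = qp && relation;
        *p2 = qp && !relation;
        break;
    case CMP_AND:
        if (qp && !relation) { *p1 = false; *p2 = false; }
        break;
    case CMP_OR:
        if (qp && relation) { *p1 = true; *p2 = true; }
        break;
    case CMP_OR_ANDCM:
        if (qp && relation) { *p1 = true; *p2 = false; }
        break;
    }
}

/* Evaluates c[0] && c[1] && ... with AND-type compares. The order of
 * the compares does not matter, which is why they can issue together. */
static bool and_chain(const bool *conds, size_t n) {
    assert(conds != NULL);
    bool p1 = true;   /* initially set predicate */
    bool p2 = true;
    for (size_t i = 0; i < n; i++) {
        cmp_write(CMP_AND, true, conds[i], &p1, &p2);
    }
    return p1;
}
```

The AND type never sets a predicate and the OR type never clears one, so several such compares can target the same predicate in one cycle without their order affecting the result.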

Control Speculation Summary
• All loads have a speculative form that sets the NaT bit when deferring exceptions
• Computational instructions propagate NaTs
• OS controls deferral of faults but supported directly in HW: "no-fault speculation"
  – Minimizes overhead of data that is not used
• Chk more effective than non-faulting load

More complex example
Killtime loop in m88ksim:
    for (i = 0; i < 32; i++)
        comptime[i] -= MIN(comptime[i], time)
Initial loop:
    loop: ld r5 = [r10], 4
          cmp p1, p2 = r5, r32
          (p1) br side
          sub r5 = r5, r32
          st [addr] = r5, 4
          br cloop
    side: add t = 0, r0
          st4 [addr] = t, 4
          br cloop
Pipelined loop:
    loop: (p16) ld r36 = [r10], 4
          (p18) cmp p21, p23 = r38, r32
          (p22) sub r37 = r0, 0
          (p24) sub r38 = r38, r32
          (p20) st [r11] = r40, 4
                br.ctop loop

Software Pipelining Benefits
• Loop pipelining maximizes performance; minimizes overhead
  – High applicability
  – Minimum code size: fewer cache misses
  – Reduced register usage
  – Greater performance improvements in higher latency conditions
• Reduced overhead allows S/W pipelining of small loops with unknown trip counts
  – Good for integer scalar codes

Memory Address Modes
• Register indirect is only address mode
  – Memory address comes from a general register
  – No add in critical memory access path
• Post-increment provided for efficient address arithmetic
  – Can add 9-bit signed immediate value, or a value from a general register
  – Uses idle ALU resources
  – Avoids extra add instructions
Benefits vector floating point code

Memory Address Modes
• Load instructions
  – (qp) ld{1,2,4,8} r1 = [r3]            no post-inc
  – (qp) ld{1,2,4,8} r1 = [r3], imm9
  – (qp) ld{1,2,4,8} r1 = [r3], r2
• Store instructions
  – (qp) st{1,2,4,8} [r3] = r2            no post-inc
  – (qp) st{1,2,4,8} [r3] = r2, imm9