Compiling for the Intel Itanium Architecture: Compiler Tricks
- Slides: 79
Compiling for the Intel® Itanium™ Architecture: Compiler Tricks
Steve Skedzielewski, Intel Corporation

Agenda
- Architecture Principles
- Compiler Bag of Tricks
  - Speculation
  - Predication
  - Branching
  - Loop Generation

Traditional Architectures: Limited Parallelism
- Flow: original source code -> compiler -> sequential machine code -> hardware, which parallelizes the code across multiple functional units
- Execution units available, but used inefficiently
- Today's processors are often 60% idle

Itanium™ Architecture: Explicit Parallelism
- Flow: original source code -> compiler -> parallel machine code -> hardware with multiple functional units
- The Itanium™ compiler views a wider scope
- More efficient use of execution resources
- Increases parallel execution

Itanium™ Architecture Principles
- Explicit parallelism:
  - Instruction level parallelism (ILP) in machine code
  - Compiler schedules across a wide scope
- Enhanced ILP:
  - Predication, speculation, software pipelining, ...
- Compatibility:
  - Across all Itanium™ processor family members
  - IA-32 in hardware and PA-RISC through instruction mapping
- Massive resources:
  - Many registers
  - Many functional units

Speculation Review
Traditional architectures (the branch is a barrier to the load):
  instr 1
  instr 2
  ...
  br
  load
  use
Itanium™ architecture:
  ld.s
  instr 1
  instr 2
  br
  chk.s
  use
- ld.s advances a load, even above a branch
- Memory latency is a major performance bottleneck in today's systems
  - CPU-to-memory gap is increasing

Speculating Uses
Itanium™ architecture:
  ld.s
  instr 1
  instr 2
  br
  chk.s
  use
- Uses of speculative data can also be executed speculatively
  - This distinguishes speculation from simple prefetch
- Enables further parallelism

Introducing the NaT ("Not a Thing")
Itanium™ architecture:
  ld.s       ; exception detection
  instr 1
  instr 2    ; propagate exception
  br
  chk.s      ; exception delivery
  use
- NaT is the GR's 65th bit. It indicates:
  - whether or not an exception has occurred
  - when a branch to recovery code is required
- NaT is set during ld.s and tested by chk.s

Propagation
- All computations propagate NaTs, which reduces the number of checks
  ld8.s  r3 = (r9)
  ld8.s  r4 = (r10)
  shladd r6 = r3, 3, r4
  ld8.s  r5 = (r6)
  p1, p2 = cmp(...)
  chk.s  r5              ; needs only one chk, on the result
  sub    r7 = r5, r2
- cmp propagates "false" when writing predicates

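A minimal Python sketch of the propagation idea above, not the hardware definition. The `Reg` class, the dictionary standing in for memory, and the rule that an unmapped address is the only deferred fault are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    """A general register: a value plus its NaT ("Not a Thing") bit."""
    value: int = 0
    nat: bool = False


def ld_s(memory, addr):
    """Speculative load: a faulting access sets NaT instead of raising."""
    if addr.nat or addr.value not in memory:
        return Reg(nat=True)
    return Reg(memory[addr.value])


def shladd(a, count, b):
    """Shift-left-and-add; the result is NaT if either source is NaT."""
    return Reg((a.value << count) + b.value, a.nat or b.nat)


def chk_s(reg):
    """Return True when control must branch to recovery code."""
    return reg.nat


def speculative_chain(memory, r9, r10):
    """The slide's sequence: three ld8.s and one shladd, checked once."""
    r3 = ld_s(memory, r9)
    r4 = ld_s(memory, r10)
    r6 = shladd(r3, 3, r4)
    r5 = ld_s(memory, r6)
    return r5, chk_s(r5)
```

Because every step forwards the NaT bit, a fault in the first load is still visible at the single check on r5.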
Exception Deferral: More Than Skin Deep
- Costly exceptions can be deferred
- OS can control deferral of:
  - Page faults
  - Protection violations
  - ...
- NaTs enable deferral with recovery
Home block:
  ld.s
  instr 1
  instr 2
  uses
  br
  chk.s      ; branches to recovery code when needed
Recovery code:
  ld
  uses
  br home
- Enables aggressive code motion at compile time

Store Barrier
Traditional architectures (the store is a barrier to the load):
  instr 1
  instr 2
  ...
  store (*)
  load (*)
  use
- Traditional architectures are limited by the store barrier

Introducing Data Speculation
- Compiler can issue a load prior to a preceding, possibly-conflicting store
Traditional architectures (st8 is a barrier):
  instr 1
  instr 2
  ...
  st8
  ld8
  use
Itanium™ architecture:
  ld8.a
  instr 1
  instr 2
  st8
  ld.c
  use
- Unique to the Itanium™ architecture

Data Speculation
- Uses can be speculated
Load speculated, use in place:
  ld8.a
  instr 1
  instr 2
  st8
  ld.c
  use
Load and use both speculated:
  ld8.a
  instr 1
  use
  instr 2
  st8
  chk.a      ; branches to recovery code when needed
Recovery code:
  ld8
  uses
  br home
- Synergy with control speculation increases performance

Architectural Support for Data Speculation
- Instructions
  - ld.a: advanced loads
  - ld.c: check loads
  - chk.a: advanced load checks
- Speculative advanced loads: ld.sa is an advanced load with deferral
- ALAT: hardware structure containing outstanding advanced loads

Advanced Load Address Table (ALAT)
- ld.a inserts entries
- Conflicting stores remove entries
  - Also: ld.c.clr, chk.a.clr, ...
- Presence of an entry indicates success
  - chk.a branches when no entry is found
Diagram: each ALAT entry pairs a register number with an address. "ld.a reg# = ..." writes an entry, "chk.a reg#" looks one up by register number, and "st" is compared against the addresses.

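The ALAT behaviour listed above can be sketched in a few lines of Python. This is a toy model under stated assumptions: exact-address matching, unlimited capacity, and the class and method names are all hypothetical; the real table is a small hardware structure with its own matching rules.

```python
class ALAT:
    """Toy ALAT: maps a target register name to the address it loaded from."""

    def __init__(self, memory):
        self.memory = memory      # dict: address -> value (assumed model)
        self.entries = {}         # dict: register name -> address

    def ld_a(self, reg, addr):
        """Advanced load: read memory and insert an entry for reg."""
        self.entries[reg] = addr
        return self.memory[addr]

    def st(self, addr, value):
        """Store: write memory and remove every entry for that address."""
        self.memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def chk_a(self, reg):
        """Return True when no entry is found (branch to recovery code)."""
        return reg not in self.entries

    def ld_c(self, reg, addr, speculative_value):
        """Check load: keep the early value on a hit, reload on a miss."""
        if reg in self.entries:
            return speculative_value
        return self.memory[addr]
```

A store to an unrelated address leaves the entry in place, so the early value stands; a store to the same address removes it, and the check load picks up the new value.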
Speculation Benefits
- Reduces impact of memory latency
- Improves code with many cache accesses
  - Large databases
  - Operating systems
- Gives scheduling flexibility

Agenda
- Architecture Principles
- Compiler Bag of Tricks
  - Speculation
  - Predication
  - Branching
  - Loop Generation

Predication
Diagram: on traditional architectures a cmp branches to separate "then" and "else" blocks; on the Itanium™ architecture the cmp sets predicates p1 and p2, which guard the "then" and "else" instructions.
- Converts branches to conditional execution
  - Executes multiple paths simultaneously
- Exposes parallelism and reduces critical path
  - Better utilizes wider machines
  - Reduces mispredicted branches

Complex Transformations
- mark from SPEC CPU95 130.li
- Low ILP in each block
- Highly mispredicted branch
- Not your simple if-then-else

Complex Transformations
- Set p1 = true on entry; each block is guarded by p1 or p2, and sets p1 or p2 based upon the next path
- One loop back branch, always taken
- Utilizes machine width
- Global control flow reduction

Upward Code Movement
Before:
       cmp.unc.eq p1, p2 = r1, r2
       :
  (p1) br --> label
       :
       ld  r4 = [r3]
       add r5 = r4, 1
After:
       cmp.unc.eq p1, p2 = r1, r2
       :
       ld.s r4 = [r3]
       add  r5 = r4, 1
       :
  (p1) br --> label
       chk.s r4, rec
- Speculate both the load and the use
- Depending upon deferral mode, the add could cause cache miss

Upward Code Movement
Before:
       cmp.unc.eq p1, p2 = r1, r2
       :
  (p1) br --> label
       :
       ld  r4 = [r3]
       add r5 = r4, 1
After:
       cmp.unc.eq p1, p2 = r1, r2
       :
  (p2) ld  r4 = [r3]
  (p2) add r5 = r4, 1
       :
  (p1) br --> label
- Predicate with the fall-through predicate
- Motion is bounded by the compare
- Predication can avoid speculative side effects

Downward Code Movement
Diagram: the main trace runs from block A to merge block C; B is a compensation block on the side path into C.
- Predication enables downward code movement from A to C without compensation code in B
- Use predication to merge sparse code in the compensation block with code in the merge block

Code Motion Tradeoffs
Diagram: blocks A, B, C, D, with downward code motion and upward code motion into the hot path.
- Slots available in hot path
- Predicate region formation occurs before scheduling
- Predication can pull instructions from the lower weight path
- Scheduler can move instructions from above and below
- Solutions
  - Heuristic formation
  - Preschedule information
  - Reverse if-conversion

Introducing Parallel Compares
- Three new types of compares:
  - AND: both target predicates set FALSE if compare is false
  - OR: both target predicates set TRUE if compare is true
  - ANDOR: if true, sets one TRUE, sets other FALSE
- Diagram: the serial chain of tests A, B, C, D becomes tests evaluated side by side
- Reduces critical path

Method of Use
Or predicate
- Initially clear predicate
- All true compares will set
- All false compares do nothing
  cycle 0:  cmp.unc.ne p1 = r0, r0
  cycle 1:  cmp.or.eq  p1 = 40, r7
            cmp.or.eq  p1 = 9, r7
And predicate
- Initially set predicate
- All true compares do nothing
- All false compares will clear
  cycle 0:  cmp.unc.eq p1 = r0, r0
  cycle 1:  cmp.and.ge p1 = 48, r6
            cmp.and.lt p1 = 58, r6

Parallel Compare Example
  if (c1 && c2 && c3 && c4)
    then_code
  else
    else_code
Itanium™ architecture code:
  cycle 0:       cmp.unc.eq   p1, p2 = r0, 0
  cycle 1:       cmp.and.orcm p1, p2 = c1
                 cmp.and.orcm p1, p2 = c2
                 cmp.and.orcm p1, p2 = c3
                 cmp.and.orcm p1, p2 = c4
  cycle 2:  (p1) then_code
            (p2) else_code
- Significant control height reduction

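Why the four compares can share a cycle: each one either leaves the predicates alone or forces them to the same final values, so their order does not matter. The Python sketch below checks this exhaustively. It assumes the and.orcm behaviour "a false condition clears p1 and sets p2", which is my reading of the compare type; the function names are illustrative.

```python
from itertools import permutations, product


def cmp_and_orcm(p1, p2, cond):
    """and.orcm compare (assumed semantics): false clears p1 and sets p2."""
    if not cond:
        return False, True
    return p1, p2


def and_chain(conds):
    """Initialise p1 = true, p2 = false, then apply every compare."""
    p1, p2 = True, False          # cmp.unc.eq p1, p2 = r0, 0
    for cond in conds:
        p1, p2 = cmp_and_orcm(p1, p2, cond)
    return p1, p2


def order_independent(n=4):
    """Check every truth assignment and every ordering of n conditions."""
    for conds in product([False, True], repeat=n):
        expected = (all(conds), not all(conds))
        for order in permutations(conds):
            if and_chain(order) != expected:
                return False
    return True
```

`order_independent()` returning True is what lets the hardware treat the writes to p1 and p2 as simultaneous rather than as a dependent chain.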
Predication Benefits
- Reduces branches and mispredict penalties
- Parallel compares further reduce critical paths
- Greatly improves code with hard-to-predict branches
- Works in tandem with speculation
- Traditional architectures' "bolt-on" approach can't efficiently approximate predication
  - Cmove: 39% more instructions, 23% slower performance*
  - All instructions need predication
* Source: S. Mahlke, 1995

Agenda
- Architecture Principles
- Compiler Bag of Tricks
  - Speculation
  - Predication
  - Branching
  - Loop Generation

Branch Instruction
Diagram: a 128-bit bundle (bit 127 down to bit 0) holds a 41-bit branch instruction, Instruction 1, Instruction 0, and the template. The branch instruction carries a 21-bit IP-offset and the qualifying predicate (QP).
- Two basic branch formats
  - Relative: IP := IP + Offset21
  - Indirect: IP := BR[I]
  - 8 branch registers for efficient branch execution
  - Call/return linking through branch registers
- Loop branches with 64-bit loop count register (LC)
  - Enables perfect branch prediction of counted loops
  - Traditional architectures always mispredict the last iteration
  - Important for low trip count loops

Branch Predicates
Unconditional branch:
  (p0) br target;
Conditional branch:
       cmp p1 = cond
  (p1) br target;
- Compare and branch can be in the same cycle
- Compiler-directed static prediction augments dynamic prediction
  - Reduced false mispredicts due to aliasing
  - Frees space in the hardware predictor
  - Can give a hint for the dynamic predictor

8 Queens Example
  if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Unconditional compares:
  cycle 1:       R1 = &b[j]    R3 = &a[i+j]    R5 = &c[i-j+7]
  cycle 2:       ld R2 = [R1]  ld.s R4 = [R3]  ld.s R6 = [R5]
  cycle 4:       P1, P2 <- cmp.unc(R2 == true)
  cycle 5:  (p1) chk.s R4
            (p1) P3, P4 <- cmp.unc(R4 == true)
  cycle 6:  (p3) chk.s R6
            (p3) P5, P6 <- cmp.unc(R6 == true)
  cycle 7:  (P5) br then
            else
8 queens control flow: P1, P3, and P5 continue toward Then; P2, P4, and P6 lead to Else.

Eight Queens Example
  if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
  cycle 1:       R1 = &b[j]    R3 = &a[i+j]    R5 = &c[i-j+7]
                 p1 <- true
  cycle 2:       ld R2 = [R1]  ld R4 = [R3]    ld R6 = [R5]
  cycle 4:       p1, p2 <- cmp.and(R2 == true)
                 p1, p2 <- cmp.and(R4 == true)
                 p1, p2 <- cmp.and(R6 == true)
            (p1) br then
            else
- Major reduction in control flow

Multi-way Branch w/o Speculation
Original (3 branch cycles; control flow uses predicates P1 to P6):
       ld8 r6 = (ra)
  (p1) br exit1
       ld8 r7 = (rb)
  (p3) br exit2
       ld8 r8 = (rc)
  (p5) br exit3
Hoisting loads:
       ld8.s r6 = (ra)
       ld8.s r7 = (rb)
       ld8.s r8 = (rc)
       chk r6, rec0
  (p1) br exit1
       chk r7, rec1
  (p3) br exit2
       chk r8, rec2
  (p5) br exit3
Multi-way branch (1 branch cycle):
     { chk r6, rec0
  (p2) chk r7, rec1
  (p4) chk r8, rec2 }
     { (p1) br exit1
       (p3) br exit2
       (p5) br exit3 }
- Multi-way branches: more than 1 branch in a single cycle
- Allows n-way branching
- Supports aggressive speculation

Multi-way Branch w/o Predication
Original:
       cmp p1, p2 = c1
       cmp p3, p4 = c2
       cmp p5, p6 = c3
       :
       st [r10] =
  (p1) br exit1
       st [r11] =
  (p3) br exit2
       st [r12] =
  (p5) br exit3
With predication:
  (p2) st [r11] =
  (p4) st [r12] =
  (p1) br exit1
  (p3) br exit2
  (p5) br exit3
- Predication and multi-way branches increase ILP

Agenda
- Architecture Principles
- Compiler Bag of Tricks
  - Speculation
  - Predication
  - Branching
  - Loop Generation

Loop Example
Convert string to uppercase:
  for (i = 0; i < len; i++) {
    if (IS_LOWERCASE(line[i]))
      newline[i] = CNVT_TO_UPPERCASE(line[i]);
    else
      newline[i] = line[i];
  }
After macro expansion:
  for (i = 0; i < len; i++) {
    if (line[i] >= 'a' && line[i] <= 'z')
      newline[i] = line[i] - 32;
    else
      newline[i] = line[i];
  }
- Typical integer-type loop

Loop Assembly Code
Traditional architecture (5 cycles per iteration, 40 cycles for 8 iterations):
  loop:   ld  c = [ra], 1
          bgt c, 96 bottom
          blt c, 123 bottom
          sub c = c, 32
  bottom: st  [rb] = c, 1
          blt ra, end loop
Itanium™ architecture (4 cycles per iteration, 32 cycles for 8 iterations):
  loop:      ld  c = [ra], 1
             cmp p1 = true
             cmp.and p1 = (c > 96)
             cmp.and p1 = (c < 123)
        (p1) sub c = c, 32
             st  [rb] = c, 1
             br.cloop
- Fewer branches and no mispredictions. Still low ILP.

Unroll for ILP
Unroll twice:
        ld  c = [ra], 1
  loop: ld  d = [ra], 1
        bgt c, 115, b1
        blt c, 96, b1
        sub c = c, 36
  b1:   st  [rb] = c, 1
        beq rb, end, exit
        ld  c = [ra], 1
        bgt d, 115, b2
        blt d, 96, b2
        sub d = d, 36
  b2:   st  [rb] = d, 1
        blt rb, end, loop
- 8 iterations in 33 cycles
- 1.2x performance improvement
- Code size: 2x
- Won't gain by unrolling more

Software Pipelining
- Overlapping execution of different loop iterations, versus whole loop computation in one cycle
- More iterations in the same amount of time

Software Pipelining
Input enters at ld; output leaves at st. Stages active in each cycle:
  cycle 1:  ld
  cycle 2:  ld  cmps
  cycle 3:  ld  cmps  ?sub
  cycle 4:  ld  cmps  ?sub  st    (kernel)
  then:         cmps  ?sub  st
                      ?sub  st
- Data is transferred from one functional unit to the next

Introducing Rotating Registers
- GR 32-127 and FR 32-127 can rotate
- Separate rotating register base for each: GRs, FRs
- Loop branches decrement all rotating register bases (RRB)
- Instructions contain a "virtual" register number
  - RRB + virtual register number = physical register number
- References
  - "Overlapped Loop Support in the Cydra 5", Dehnert et al., 1989
  - "Code Generation Schemas for Modulo-Scheduled Loops", Rau et al., MICRO-25, 1992
- Allows painless transfer of data between stages

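The mapping "RRB + virtual register number = physical register number" needs a wrap-around at the edge of the rotating region. A small Python sketch follows; the six-register region r32 to r37 is an assumption taken from the illustration used in the walkthrough slides, and the function name is hypothetical.

```python
FIRST_ROTATING = 32   # first rotating general register


def physical(virtual, rrb, size=6):
    """Map a virtual register number to a physical one, wrapping in the region."""
    return FIRST_ROTATING + (virtual - FIRST_ROTATING + rrb) % size


def rename_trace(virtual, rotations, size=6):
    """Physical register behind `virtual` after 0, 1, 2, ... loop branches."""
    return [physical(virtual, -n, size) for n in range(rotations + 1)]
```

For example, `rename_trace(34, 3)` gives `[34, 33, 32, 37]`: a value written through virtual r34 is found one virtual register higher (r35, then r36, then r37) after each loop branch, with no copy instruction.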
Pipelined Loop
Kernel code (stages s1 to s4):
  loop:
  s1        ld  r34 = [ra], 1
  s2        cmp p1 = true
            cmp.and p1 = (r35 > 96)
            cmp.and p1 = (r35 < 123)
  s3   (p1) sub r36 = r36, 32
  s4        st  [rb] = r37, 1
            br.ctop loop
Physical register file (RRB = 0; RRB + virtual register = physical register):
  ld          r34 = xx
  cmp<, cmp>  r35 = xx
  sub         r36 = xx
  st          r37 = xx

Fill the pipe...
Input string: G o _ G r e y h. The kernel code is unchanged. Each frame executes one prologue stage and then performs a loop branch, which decrements lc and rotates the registers by decrementing RRB. Physical registers seen by each unit:
  RRB   ld (virtual r34)   cmp (virtual r35)   sub (virtual r36)   st (virtual r37)
   0    r34 = G            r35 = xx            r36 = xx            r37 = xx
  -1    r33 = o            r34 = G             r35 = xx            r36 = xx

Fill the pipe...
Execute prologue stage. Input string: G o _ G r e y h.
  RRB   ld (virtual r34)   cmp (virtual r35)   sub (virtual r36)   st (virtual r37)
  -2    r32 = _            r33 = o             r34 = G             r35 = xx
Kernel code:
  loop:       ld  r34 = [ra], 1
              cmp p16 = true
              cmp.and p16 = (r35 > 96)
              cmp.and p16 = (r35 < 123)
        (p17) sub r36 = r36, 32
              st  [rb] = r37, 1
              br.ctop loop

Execute the Kernel
Execute kernel: a whole iteration per cycle. Input string: G o _ G r e y h. The kernel code is unchanged.
  RRB   ld (virtual r34)   cmp (virtual r35)   sub (virtual r36)   st (virtual r37)   Output
  -3    r37 = G            r32 = _             r33 = o             r34 = G            G
  -4    r36 = r            r37 = G             r32 = _             r33 = O            G O
  -5    r35 = e            r36 = r             r37 = G             r32 = _            G O

Pipelining Overhead
Prologue and epilogue (the code around the kernel) are bad:
- Code size expansion
- Overhead not good for low trip count loops
  - Cache performance
- Can we avoid prologue and epilogue?

Prologue Code
  cycle 1:  ld
  cycle 2:  ld  cmps
  cycle 3:  ld  cmps  ?sub
  cycle 4:  ld  cmps  ?sub  st    (kernel)
- Incrementally turn on functional units

Avoid Prologues and Epilogues
- Have an enable bit on each functional unit
- Enablers are initialized to off
- Feed through a sequence of bits whose length depends upon loop count and pipe depth
Diagram: a unit enabler, driven by the loop count, gates ld (r34 = xx), cmp<, cmp> (r35 = xx), sub (r36 = xx), and st (r37 = xx) through kernel and epilogue.

Revisiting Rotating Predicate Registers
- PR 16-63 can rotate, with a separate rotating register base
- Loop branches decrement all rotating register bases (RRB)
- Instructions contain a "virtual" predicate register number
  - RRB + virtual register number = physical register number
- Some predicates control pipeline stages: stage predicates
- Qualifying predicates can still be in the loop
Complete loop code:
  loop:
  s1  (p16) ld  r34 = [ra], 1
  s2  (p17) cmp.unc p20 = true
            cmp.and p21 = (r35 > 96)
            cmp.and p21 = (r35 < 123)
  s3  (p22) sub r36 = r36, 32
  s4  (p19) st  [rb] = r37, 1
            br.ctop loop

How does this work
RRB = 0. The complete loop code is unchanged. (p16), (p17), and (p19) are stage predicates; (p22) on the sub is the qualifying predicate.
Physical register file:
  ld          r34 = G
  cmp<, cmp>  r35 = xx
  sub         r36 = xx
  st          r37 = xx

Auto Predicate Generation
Initialize:
- lc to trip count
- ec to epilogue count
- p16 to true
Loop branches:
- Rotate predicates by decrementing RRB
- When lc > 0: decrement lc, set p16 = true
- When lc = 0: decrement ec, set p16 = false
- Fall through when ec = 0
Diagram: lc and ec feed a predicate generator, which writes p16; RRB rotates the predicates.

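The counter rules above can be sketched as a small generator of stage predicates. One assumption differs from the slide's simplified wording: following my understanding of br.ctop, the sketch starts lc at trip count minus one and ec at the number of stages, and it expects a trip count of at least one. The function name is illustrative.

```python
def stage_predicates(trip_count, stages):
    """Return the stage predicates (p16, p17, ...) seen by each kernel pass."""
    lc = trip_count - 1               # assumed initial value of the loop count
    ec = stages                       # assumed initial value of the epilogue count
    preds = [True] + [False] * (stages - 1)   # p16 starts true
    history = []
    while True:
        history.append(tuple(preds))
        if lc > 0:                    # more iterations to start
            lc -= 1
            new_p16 = True
        elif ec > 1:                  # draining: start nothing new
            ec -= 1
            new_p16 = False
        else:                         # last pass done: fall through
            break
        # Rotation: every predicate moves one stage down the pipe.
        preds = [new_p16] + preds[:-1]
    return history
```

With 8 iterations and 4 stages this yields 11 kernel passes: the first with only p16 set (prologue), the last with only p19 set (epilogue), all produced by the same kernel code.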
Fill the pipe again...
The complete loop code is unchanged; the stage predicates turn the stages on one at a time. Physical registers seen by each unit:
  RRB   ld         cmp<, cmp>   sub        st
   0    r34 = G    r35 = xx     r36 = xx   r37 = xx
  -1    r33 = o    r34 = G      r35 = xx   r36 = xx
  -2    r32 = _    r33 = o      r34 = G    r35 = xx

Chunking thru kernel
The complete loop code is unchanged. Physical registers seen by each unit, and the output so far:
  RRB   ld         cmp<, cmp>   sub        st         Output
  -3    r37 = G    r32 = _      r33 = o    r34 = G
  -4    r36 = r    r37 = G      r32 = _    r33 = O    G
  -5    r35 = e    r36 = r      r37 = G    r32 = _    G O
  -6    r34 = y    r35 = e      r36 = r    r37 = G    G O
  -7    r33 = h    r34 = y      r35 = e    r36 = r    G O G

Draining the pipe
The complete loop code is unchanged. Physical registers seen by each unit, and the output so far:
  RRB   ld         cmp<, cmp>   sub        st         Output
  -8    r32 = xx   r33 = h      r34 = Y    r35 = E    G O G R
  -9    r33 = xx   r34 = xx     r35 = H    r36 = Y    G O G R E
  -10   r32 = xx   r33 = xx     r34 = xx   r35 = H    G O G R E Y
At RRB = -10: fall through the loop, don't rotate.

Example Summary
- 8 iterations in 12 cycles
- 2.6x speedup of initial code
- 2.75x over unrolled traditional
- No code expansion
- No mispredicts (4x, 1 10 cycle miss)
- Minimal register usage
Final state: RRB = -10, output G O G R E Y H. The complete loop code is unchanged.

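The whole walkthrough can be replayed in a short Python simulation. This is a sketch under stated assumptions, not the hardware definition: the rotating region has 8 registers, compare results rotate in a side array next to the data instead of through p20 to p22, lc starts at trip count minus one, and the count covers kernel passes only, not loop setup. All names are illustrative.

```python
def pipelined_upper(text, region=8):
    """Simulate the 4-stage kernel (ld, cmp, sub, st) with register rotation."""
    if not text:
        return "", 0
    stages = 4
    data = [None] * region        # rotating general registers (r32 upward)
    lower = [False] * region      # compare results, rotated with the data
    preds = [True, False, False, False]   # stage predicates p16..p19
    lc, ec = len(text) - 1, stages
    rrb, src, passes = 0, 0, 0
    out = []

    def slot(virtual):
        """Index of the physical slot behind a virtual register number."""
        return (virtual - 32 + rrb) % region

    while True:
        passes += 1
        if preds[3]:                                  # s4: st [rb] = r37
            out.append(data[slot(37)])
        if preds[2] and lower[slot(36)]:              # s3: sub r36 = r36, 32
            data[slot(36)] = chr(ord(data[slot(36)]) - 32)
        if preds[1]:                                  # s2: compares on r35
            lower[slot(35)] = "a" <= data[slot(35)] <= "z"
        if preds[0]:                                  # s1: ld r34 = [ra], 1
            data[slot(34)] = text[src]
            src += 1
        # br.ctop: update counters, then rotate registers and predicates.
        if lc > 0:
            lc -= 1
            new_p16 = True
        elif ec > 1:
            ec -= 1
            new_p16 = False
        else:
            break
        rrb -= 1
        preds = [new_p16] + preds[:-1]
    return "".join(out), passes
```

`pipelined_upper("Go_Greyh")` returns `("GO_GREYH", 11)`: the eight characters come out in order after 11 kernel passes, which matches the frames RRB = 0 through RRB = -10 in the walkthrough.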
Software Pipelining
- Itanium™ architecture features support SWP
  - Full predication
  - Special branch handling features
  - Register rotation: removes loop copy overhead
  - Predicate rotation/generation: removes prologue and epilogue
- Traditional architectures use loop unrolling
  - High overhead: extra code for loop body, prologue, and epilogue
- Especially useful for integer code with a small number of loop iterations

Compiler Bag of Tricks
- Predication
  - Removes branches and mispredictions
  - Enables aggressive code motion
  - Parallel compares increase parallelism
- Speculation
  - Hides memory latency
  - Enables aggressive code motion
  - Control speculation over branches
  - Data speculation over stores
  - Compiler-controlled recovery code

Compiler Bag of Tricks
- Rich branch architecture
  - Multi-way branches increase ILP
  - Loop branches
  - Static direction hints assist prediction
- S/W pipelining support with minimal overhead encourages broad usage
  - Performance for small integer loops with unknown trip counts as well as monster FP loops

BACKUP

8 Queens Example
  if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Parallel compares:
  cycle 1:       R1 = &b[j]    R3 = &a[i+j]    R5 = &c[i-j+7]
                 p1 <- true
  cycle 2:       ld R2 = [R1]  ld R4 = [R3]    ld R6 = [R5]
  cycle 4:       p1, p2 <- cmp.and(R2 == true)
                 p1, p2 <- cmp.and(R4 == true)
                 p1, p2 <- cmp.and(R6 == true)
  cycle 5:  (p1) br then
            else
8 queens control flow: the tests on P1/P2, P3/P4, and P5/P6 collapse into one decision; P1 = true leads to Then, P1 = false leads to Else.
- Reduced from 7 cycles to 5

Five Predicate Compare Types
- (qp) p1, p2 <- cmp.relation
  - if (qp) { p1 = relation; p2 = !relation; }
- (qp) p1, p2 <- cmp.relation.unc
  - p1 = qp & relation; p2 = qp & !relation;
- (qp) p1, p2 <- cmp.relation.and
  - if (qp & (relation == FALSE)) { p1 = 0; p2 = 0; }
- (qp) p1, p2 <- cmp.relation.or
  - if (qp & (relation == TRUE)) { p1 = 1; p2 = 1; }
- (qp) p1, p2 <- cmp.relation.or.andcm
  - if (qp & (relation == TRUE)) { p1 = 1; p2 = 0; }

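The five rules above translate directly into one Python function. This is a sketch that follows the slide's pseudo-code line by line; the function name and the string labels for the compare types (with "none" standing for the plain compare) are hypothetical.

```python
def compare(ctype, qp, relation, p1, p2):
    """Return the new (p1, p2) after one predicate-writing compare."""
    if ctype == "none":        # writes both targets only when qp is true
        return (relation, not relation) if qp else (p1, p2)
    if ctype == "unc":         # always writes; clears both when qp is false
        return (qp and relation, qp and not relation)
    if ctype == "and":         # a false relation clears both targets
        return (False, False) if qp and not relation else (p1, p2)
    if ctype == "or":          # a true relation sets both targets
        return (True, True) if qp and relation else (p1, p2)
    if ctype == "or.andcm":    # a true relation sets p1 and clears p2
        return (True, False) if qp and relation else (p1, p2)
    raise ValueError(f"unknown compare type: {ctype}")
```

The difference between the first two types shows up when qp is false: `compare("none", False, True, True, True)` leaves the targets at `(True, True)`, while `compare("unc", False, True, True, True)` forces them to `(False, False)`.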
Control Speculation Summary
- All loads have a speculative form that sets the NaT bit when deferring exceptions
- Computational instructions propagate NaTs
- OS controls deferral of faults, but deferral is supported directly in hardware: "no-fault speculation"
  - Minimizes overhead of data that is not used
- chk is more effective than a non-faulting load

More complex example
Killtime loop in m88ksim:
  for (i = 0; i < 32; i++)
    comptime[i] -= MIN(comptime[i], time);
Initial loop:
  loop:      ld  r5 = [r10], 4
             cmp p1, p2 = r5, r32
        (p1) br  side
             sub r5 = r5, r32
             st  [addr] = r5, 4
             br  cloop
  side:      add t = 0, r0
             st4 [addr] = t, 4
             br  cloop
Pipelined loop:
  loop: (p16) ld  r36 = [r10], 4
        (p18) cmp p21, p23 = r38, r32
        (p22) sub r37 = r0, 0
        (p24) sub r38 = r38, r32
        (p20) st  [r11] = r40, 4
              br.ctop loop

Software Pipelining Benefits
- Loop pipelining maximizes performance and minimizes overhead
  - High applicability
  - Minimum code size: fewer cache misses
  - Reduced register usage
  - Greater performance improvements in higher latency conditions
- Reduced overhead allows S/W pipelining of small loops with unknown trip counts
  - Good for integer scalar codes

Memory Address Modes
- Register indirect is the only address mode
  - Memory address comes from a general register
  - No add in the critical memory access path
- Post-increment provided for efficient address arithmetic
  - Can add a 9-bit signed immediate value, or a value from a general register
  - Uses idle ALU resources
  - Avoids extra add instructions
- Benefits vector floating point code

Memory Address Modes
- Load instructions
  - (qp) ld{1, 2, 4, 8} r1 = [r3]          (no post-inc)
  - (qp) ld{1, 2, 4, 8} r1 = [r3], imm9
  - (qp) ld{1, 2, 4, 8} r1 = [r3], r2
- Store instructions
  - (qp) st{1, 2, 4, 8} [r3] = r2          (no post-inc)
  - (qp) st{1, 2, 4, 8} [r3] = r2, imm9