IA-64 Architecture Innovations
John Crawford, Architect & Intel Fellow, Intel Corporation
Jerry Huck, Manager & Lead Architect, Hewlett Packard Co.

Agenda
• Architecture Principles
• Predication & Speculation
• Branch Architecture
• Software Pipelining

Traditional Architectures: Limited Parallelism
[Figure: Original Source Code, Compiler, Sequential Machine Code, Hardware (parallelized code, multiple functional units)]
• Execution Units Available, Used Inefficiently
• Today’s Processors often 60% Idle

IA-64 Architecture: Explicit Parallelism
[Figure: Original Source Code, IA-64 Compiler (Compiler Views Wider Scope), Parallel Machine Code, Hardware (multiple functional units)]
• More efficient use of execution resources
• Increases Parallel Execution

IA-64 Principles
• Explicitly parallel:
  – Instruction level parallelism (ILP) in machine code
  – Compiler schedules across a wider scope
• Enhanced ILP:
  – Predication, Speculation, Software pipelining, ...
• Fully compatible:
  – Across all IA-64 family members
  – IA-32 in hardware and PA-RISC through instruction mapping
  – Inherently scalable
• Massively resourced:
  – Many registers
  – Many functional units

Predication
[Figure: Traditional Architectures: cmp branches to separate then and else blocks. IA-64: cmp sets p1 and p2, which guard the then and else code.]
• Removes branches, converts to predicated execution
  – Executes multiple paths simultaneously
• Increases performance by exposing parallelism and reducing critical path
  – Better utilization of wider machines
  – Reduces mispredicted branches
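The if-conversion idea on this slide can be sketched in Python. This is only an illustrative model, not IA-64 code: the helper `predicated` and the `abs_diff_*` functions are hypothetical names invented for the example. One compare writes two complementary predicates, both paths are issued, and only the operation whose predicate is true commits its result.

```python
def predicated(qp, op, old):
    """Commit op() only when the qualifying predicate qp is true; otherwise keep old."""
    return op() if qp else old


def abs_diff_branchy(a, b):
    # Traditional form: a conditional branch selects one of two paths.
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a, b):
    # Predicated form: the compare produces p1 and its complement p2.
    p1 = a > b
    p2 = not p1
    r = 0
    # Both paths are issued with no branch; only one of them commits.
    r = predicated(p1, lambda: a - b, r)  # models: (p1) sub r = a, b
    r = predicated(p2, lambda: b - a, r)  # models: (p2) sub r = b, a
    return r
```

Both functions compute the same value; the predicated version simply has no control-flow decision between the compare and the result.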

Predication Review
• Two kinds of normal compares
  – Regular
  – Unconditional (nested IF’s)
Regular: p3 is set just once
    (p1) p3 = ...
    (p2) p3 = ...
    (p3) ...
Unconditional: p3 and p4 are AND’ed with p2
    p1, p2 <- ...
    (p2) p3, p4 <- cmp.unc ...
    (p3) ...        // p2 & p3
    (p4) ...        // p2 & p4
Opportunity for Even More Parallelism

Introducing Parallel Compares
• Three new types of compares:
  – AND: both target predicates set FALSE if compare is false
  – OR: both target predicates set TRUE if compare is true
  – ANDOR: if true, sets one TRUE, sets other FALSE
[Figure: compare dependence diagrams with nodes A, B, C, D]
Reduces Critical Path

Eight Queens Example
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Unconditional Compares
  Cycle 1:  R1=&b[j]    R3=&a[i+j]    R5=&c[i-j+7]
  Cycle 2:  ld R2=[R1]    ld.s R4=[R3]    ld.s R6=[R5]
  Cycle 4:  P1, P2 <- cmp.unc(R2==true)
  Cycle 5:  (p1) chk.s R4    (p1) P3, P4 <- cmp.unc(R4==true)
  Cycle 6:  (p3) chk.s R6    (p3) P5, P6 <- cmp.unc(R6==true)
  Cycle 7:  (p5) br then
  else
[Figure: 8 queens control flow: P1, P3, P5 lead toward Then; P2, P4, P6 lead to Else]

Eight Queens Example
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Parallel Compares
  Cycle 1:  R1=&b[j]    R3=&a[i+j]    R5=&c[i-j+7]    p1 <- true
  Cycle 2:  ld R2=[R1]    ld R4=[R3]    ld R6=[R5]
  Cycle 4:  p1, p2 <- cmp.and(R2==true)
            p1, p2 <- cmp.and(R4==true)
            p1, p2 <- cmp.and(R6==true)
  Cycle 5:  (p1) br then
  else
[Figure: 8 queens control flow: P1=true leads to Then, P1=false leads to Else]
Reduced from 7 cycles to 5

Five Predicate Compare Types
• (qp) p1, p2 <- cmp.relation
  – if(qp) {p1 = relation; p2 = !relation};
• (qp) p1, p2 <- cmp.relation.unc
  – p1 = qp&relation; p2 = qp&!relation;
• (qp) p1, p2 <- cmp.relation.and
  – if(qp & (relation==FALSE)) { p1=0; p2=0; }
• (qp) p1, p2 <- cmp.relation.or
  – if(qp & (relation==TRUE)) { p1=1; p2=1; }
• (qp) p1, p2 <- cmp.relation.or.andcm
  – if(qp & (relation==TRUE)) { p1=1; p2=0; }
Tbit (Test Bit) Also Sets Predicates
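The five update rules above translate directly into small functions. The Python below is a sketch of the slide's pseudocode only; the function names are made up for illustration, and `queens_test` is a hypothetical helper that applies three AND-type compares the way the Eight Queens slide does. The initial value of p2 is an assumption, since the slide only initializes p1.

```python
def cmp_regular(qp, rel, p1, p2):
    # if(qp) {p1 = relation; p2 = !relation}
    return (rel, not rel) if qp else (p1, p2)


def cmp_unc(qp, rel, p1, p2):
    # p1 = qp & relation; p2 = qp & !relation (always writes both targets)
    return (bool(qp and rel), bool(qp and not rel))


def cmp_and(qp, rel, p1, p2):
    # if(qp & (relation == FALSE)) {p1 = 0; p2 = 0}
    return (False, False) if (qp and not rel) else (p1, p2)


def cmp_or(qp, rel, p1, p2):
    # if(qp & (relation == TRUE)) {p1 = 1; p2 = 1}
    return (True, True) if (qp and rel) else (p1, p2)


def cmp_or_andcm(qp, rel, p1, p2):
    # if(qp & (relation == TRUE)) {p1 = 1; p2 = 0}
    return (True, False) if (qp and rel) else (p1, p2)


def queens_test(b_j, a_ij, c_ij):
    """Evaluate b && a && c with three AND-type compares on the same targets."""
    p1, p2 = True, True  # slide sets p1 <- true; p2's start value is assumed
    for value in (b_j, a_ij, c_ij):
        p1, p2 = cmp_and(True, bool(value), p1, p2)
    return p1
```

An AND-type compare either clears its targets or leaves them alone, so the three compares give the same answer in any order. That is what lets them share one cycle.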

Predication Benefits
• Reduces branches and mispredict penalties
  – 50% fewer branches and 37% faster code*
• Parallel compares further reduce critical paths
• Greatly improves code with hard to predict branches
  – Large server apps: capacity limited
  – Sorting, data mining: large database apps
  – Data compression
• Traditional architectures’ “bolt-on” approach can’t efficiently approximate predication
  – Cmove: 39% more instructions, 23% slower performance*
  – Instructions must all be speculative
* Source: S. Mahlke, 1995

Speculation Review
Traditional Architectures:  instr 1, instr 2, ..., br (Barrier), Load, use
IA-64:                      ld.s, instr 1, instr 2, br, chk.s, use
• Allows elevation of load, even above a branch
• Memory latency is a major performance bottleneck in today’s systems
  – CPU to memory gap increasing

Hoisting Uses
IA-64:  ld.s, instr 1, instr 2, br, chk.s, use
• The uses of speculative data can also be executed speculatively
  – distinguishes speculation from simple prefetch
Enables Further Parallelism

Introducing the NaT (“Not a Thing”)
IA-64:  ld.s (Exception Detection), instr 1, instr 2 (Propagate Exception), br, chk.s (Exception Delivery), use
• NaT is the GR’s 65th bit that indicates:
  – whether or not an exception has occurred
  – branch to fixup code required
• NaT set during ld.s, checked by chk.s

Propagation
• All computation instructions propagate NaTs to reduce number of checks
    ld8.s r3 = (r9)
    ld8.s r4 = (r10)
    add r6 = r3, r4
    ld8.s r5 = (r6)
    p1, p2 = cmp(...)     // Cmp propagates “false” when writing predicates
    chk.s r5              // Allows single chk on result
    sub r7 = r5, r2
• RISC architectures require more instructions for equivalent integrity
  – e.g., non faulting load
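A small Python model can make the propagation rule concrete. It is a sketch under stated assumptions: `Reg`, `ld_s`, `add`, `chk_s` and `run_chain` are hypothetical names, memory is a plain dictionary, and any address missing from it stands in for a deferred fault. The chain mirrors the load, add, load sequence on the slide and needs only one check at the end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False  # models the register's extra NaT bit


def ld_s(memory, addr):
    """Speculative load: a fault (or a NaT address) yields NaT instead of trapping."""
    if addr.nat or addr.value not in memory:
        return Reg(0, True)
    return Reg(memory[addr.value])


def add(a, b):
    # Computation propagates NaT from either source operand.
    return Reg(a.value + b.value, a.nat or b.nat)


def chk_s(reg):
    """Return True when the check would branch to recovery code."""
    return reg.nat


def run_chain(memory, addr9, addr10):
    r3 = ld_s(memory, Reg(addr9))
    r4 = ld_s(memory, Reg(addr10))
    r6 = add(r3, r4)
    r5 = ld_s(memory, r6)  # address itself may already be NaT
    return r5, chk_s(r5)   # a single check covers the whole chain
```

With `memory = {100: 8, 200: 16, 24: 7}`, `run_chain(memory, 100, 200)` loads 7 with no recovery needed, while an unmapped second address makes the final check fire even though nothing trapped along the way.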

Exception Deferral: More Than Skin Deep
• Deferral allows the efficient delay of costly exceptions
• OS controlled deferral by hardware of:
  – Page faults
  – Protection violations
  – …
• NaTs enable deferral with recovery
• Efficiently support structured exception handling in C/C++
Home Block:     ld.s, instr 1, instr 2, uses, br, chk.s
Recovery code:  ld, uses, br home
Complete Solution for Exception Management

Control Speculation Summary
• All loads have a speculative form that sets the NaT bit when deferring exceptions
• Computational instructions propagate NaTs
• OS controls deferral of faults but supported directly in HW: “no-fault speculation”
  – Minimizes overhead of data that is not used
• Chk more effective than non-faulting load

Store Barrier
Traditional Architectures:  instr 1, instr 2, ..., Store(*) (Barrier), Load(*), use
• Traditional architectures limited by the Store Barrier

Introducing Data Speculation
• Compiler can issue a load prior to a preceding, possibly-conflicting store
Traditional Architectures:  instr 1, instr 2, ..., st8 (Barrier), ld8, use
IA-64:                      ld8.a, instr 1, instr 2, st8, ld.c, use
• Feature unique to IA-64

Data Speculation
• Uses can be hoisted
Load hoisted:   ld8.a, instr 1, instr 2, st8, ld.c, use
Use hoisted:    ld8.a, instr 1, use, instr 2, st8, chk.a
Recovery code:  ld8, uses, br home
• Synergy with control speculation yields greater performance

Advanced Load Address Table - ALAT
• ld.a inserts entries.
• Conflicting stores remove entries
  – Also: ld.c.clr, chk.a.clr,
• Presence of entry indicates success
  – chk.a branches when no entry is found
[Figure: ld.a reg# = ... inserts a (reg#, Address) entry into the table; st looks up Address; chk.a reg# looks up reg#]
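The insert, invalidate, check cycle of the ALAT can be modeled in Python. This is a deliberately simplified sketch: the class and function names are hypothetical, entries are keyed by register name, and a conflict is an exact address match. Real hardware tracks access sizes and may compare only part of the address, which this model ignores.

```python
class ALAT:
    """Toy model: maps a target register name to its advanced-load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, reg, addr):
        # An advanced load inserts (or replaces) the entry for its target register.
        self.entries[reg] = addr

    def store(self, addr):
        # A store removes every entry whose address conflicts with it.
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def check(self, reg):
        # Presence of the entry means the speculation succeeded.
        return reg in self.entries


def load_above_store(memory, load_addr, store_addr, store_value):
    """Hoist a load above a possibly conflicting store, then check it."""
    alat = ALAT()
    r1 = memory[load_addr]            # ld8.a r1 = [load_addr]
    alat.ld_a("r1", load_addr)
    memory[store_addr] = store_value  # st8 [store_addr] = store_value
    alat.store(store_addr)
    if not alat.check("r1"):          # ld.c r1: reload only if the entry is gone
        r1 = memory[load_addr]
    return r1
```

When the addresses differ, the early load's value is used as is. When they collide, the check reloads, so the result always matches what the original program order would have produced.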

Architectural Support for Data Speculation
• Instructions
  – ld.a: advanced loads
  – ld.c: check loads
  – chk.a: advanced load checks
• Speculative advanced load, ld.sa, is an advanced load with deferral
• ALAT: HW structure containing outstanding advanced loads

Speculation Benefits
• Reduces impact of memory latency
  – Study demonstrates performance improvement of 79% when combined with predication*
• Greatest improvement to code with many cache accesses
  – Large databases
  – Operating systems
• Scheduling flexibility enables new levels of performance headroom
* August et al., 1998

Agenda
✓ Architecture Principles
✓ Predication & Speculation
• Branch Architecture
• Software Pipelining

Branch Instruction
[Figure: 128-bit bundle, bits 127 to 0, holding a 41-bit branch instruction (fields: QP, Branch, 21-bit IP-Offset), Instruction 1, Instruction 0, and Template]
• Two basic branch formats
  – Relative: IP := IP + Offset21
  – Indirect: IP := BR[I]
  – 8 branch registers for efficient branch execution
  – Call/Return linking through branch registers
• Loop branches with 64-bit loop count register (LC)
  – Enables perfect branch prediction of counted loops
  – Traditional architectures always mispredict last iteration
  – Incurs misprediction stall costing many cycles
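The claim about counted loops can be illustrated with a rough simulation. The Python sketch below compares a 2-bit saturating counter, chosen here only as a representative history-based predictor (an assumption, not something the slides name), against a prediction derived from the loop counter itself. All function names are hypothetical.

```python
def loop_outcomes(trip_count):
    """Loop-closing branch outcomes: taken trip_count - 1 times, then not taken."""
    return [True] * (trip_count - 1) + [False]


def mispredicts_two_bit(outcomes, state=3):
    """Count misses of a 2-bit saturating counter (states 0..3, >= 2 predicts taken)."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses


def mispredicts_loop_count(trip_count):
    """Predict 'taken' exactly while LC != 0, the same test the loop branch applies."""
    lc = trip_count - 1
    misses = 0
    for taken in loop_outcomes(trip_count):
        if (lc != 0) != taken:
            misses += 1
        if lc != 0:
            lc -= 1
    return misses
```

Running a four-iteration loop ten times costs the counter-based predictor one miss per run, always on the exit, while the LC-based prediction never misses because the exit is known from the register value.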

Branch Predicates
Conditional branches:    (p1) BR #label_A;    P1=true goes to A, P1=false goes to B
Unconditional branches:  (p0) BR #label_A;    p0 is “always true”, goes to A
• Compiler directed static prediction augments dynamic prediction
  – Better predict highly correlated branches (always/never taken)
  – Frees space in H/W predictor
  – Can give hint for dynamic predictor

Compare & Branch in Same Cycle
Queens Loop: Parallel Compares & Compare-branch
  Cycle 1:  R1=&b[j]    R3=&a[i+j]    R5=&c[i-j+7]    p1 <- true
  Cycle 2:  ld R2=[R1]    ld R4=[R3]    ld R6=[R5]
  Cycle 4:  p1, p2 <- cmp.and(R2==true)
            p1, p2 <- cmp.and(R4==true)
            p1, p2 <- cmp.and(R6==true)
            (p1) br then
  else
From 5 Cycles Down to 4

Multi-way Branch
w/o Speculation:
    ld8 r6 = (ra)
    (p1) br exit1
    ld8 r7 = (rb)
    (p3) br exit2
    ld8 r8 = (rc)
    (p5) br exit3
Hoisting Loads:
    ld8.s r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
  Checks and branches in sequence (3 branch cycles):
    chk r6, rec0
    (p1) br exit1
    chk r7, rec1
    (p3) br exit2
    chk r8, rec2
    (p5) br exit3
  IA-64 (1 branch cycle):
    { chk r6, rec0
      (p2) chk r7, rec1
      (p4) chk r8, rec2 }
    { (p1) br exit1
      (p3) br exit2
      (p5) br exit3 }
[Figure: control flow through predicates P1 to P6]
• Multiway branches: more than 1 branch in a single cycle
• Allows n-way branching
Supports Aggressive Speculation

Software Pipelining
• Overlapping execution of different loop iterations
[Figure: sequential loop iterations vs. overlapped loop iterations]
• More iterations in same amount of time

Software Pipelining
• IA-64 features that make this possible
  – Full Predication
  – Special branch handling features
  – Register rotation: removes loop copy overhead
  – Predicate rotation: removes prologue & epilogue
• Traditional architectures use loop unrolling
  – High overhead: extra code for loop body, prologue, and epilogue
Especially Useful for Integer Code With Small Number of Loop Iterations

Basic Loop Example
for (i=0; i<n; i++) { *b++ = *a++; }   /* Mem. Copy */
Basic Copy Loop (3 ops)
    // setup ra/rb/lc, check n != 0
    .label loop
    { ld8 r35 = [ra], 8 }
    { st8 [rb] = r35, 8
      br.cloop #loop }
Execution (Cycles)
    Cycle:  1    2    3    4    5    6    7    8
    Op:     ld1  st1  ld2  st2  ld3  st3  ld4  st4    (br.cloop closes each iteration)
• Simple, non-overlapping iterations
  – 2 cycles per iteration
  – 3 operations in loop body

Loop Support: Unrolling
Unrolled Copy Loop (10 ops)
    // Test for loop count 0, 1
    ld8 r34 = [ra], 8
    .label loop
    ld8 r35 = [ra], 8
    st8 [rb] = r34, 8
    br.cle #e-exit
    ld8 r34 = [ra], 8
    st8 [rb] = r35, 8
    br.cloop #loop
    st8 [rb] = r34, 8
    br #thru
    .label e-exit
    st8 [rb] = r35, 8
    .label thru
Execution cycles
    Cycle:  1         2     3     4     5
    ld:     ld1       ld2   ld3   ld4
    st:               st1   st2   st3   st4
    Phase:  Prologue  Main loop (br.cle, br.cloop, br.cle)  Epilogue
• Overlapped iterations
  – 1 cycle per word
  – 1.6X performance improvement
  – 3.3X code expansion
Incurs Code Expansion Penalties
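The 1.6X and 3.3X figures follow from the slide's own counts for a four-word copy, and a few lines of Python reproduce the arithmetic. The function names are hypothetical, and the `n + 1` cycle formula is an assumption generalized from the single four-word timeline shown above (4 words in 5 cycles).

```python
def cycles_basic(n_words):
    # Non-overlapped loop: one load cycle plus one store cycle per word.
    return 2 * n_words


def cycles_overlapped(n_words):
    # Overlapped loop: one cycle per word plus one extra cycle to drain the last store.
    return n_words + 1


def speedup(n_words):
    return cycles_basic(n_words) / cycles_overlapped(n_words)


def code_expansion(ops_new, ops_basic=3):
    # Ratio of static operation counts relative to the 3-op basic loop.
    return ops_new / ops_basic
```

For four words this gives 8 cycles against 5, a ratio of 1.6, and the 10-op unrolled body against the 3-op basic loop gives roughly 3.3. Under the same assumed formula the speedup approaches 2 as the word count grows.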

Software Register Renaming
Traditional Architecture, registers R32 R33 R34 R35 ...
    Step 1:  ld1 r34
    Step 2:  st1 r34    ld2 r35
    Step 3:  ld3 r34    st2 r35
    Step 4:  st3 r34    ld4 r35
    Step 5:  st4 r35

Introducing Rotating Registers
• GR 32-127, FR 32-127 can rotate
• Separate Rotating Register Base for each: GRs, FRs
• Loop branches decrement all register rotating bases (RRB)
• Instructions contain a “virtual” register number
  – RRB + virtual register number = physical register number
IA-64 example, copying the words “Palm Springs is Sunny”:
    Step 1:  ld1 R35               RRB=0     35: Palm
    Step 2:  st1 R35    ld2 R34    RRB=0     35: Palm   34: Springs                stored so far: Palm
    Step 3:  st2 R35    ld3 R34    RRB=-1    35: Palm   34: Springs   33: is       stored so far: Palm Springs
    Step 4:  st3 R35    ld4 R34    RRB=-2    34: Springs   33: is   32: Sunny      stored so far: Palm Springs is
    Step 5:  st4 R35               RRB=-3    33: is   32: Sunny                    stored so far: Palm Springs is Sunny
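The renaming rule "RRB + virtual register number = physical register number" can be tried out in Python. The sketch below assumes the whole GR32 to GR127 range rotates, as the slide states, and uses hypothetical helper names. It replays the "Palm Springs is Sunny" copy with one load into virtual r34 and one store from virtual r35 per iteration.

```python
ROT_BASE, ROT_SIZE = 32, 96  # GR32..GR127 form the rotating region


def physical(virtual, rrb):
    """Map a virtual register number to a physical one, wrapping inside the region."""
    if virtual < ROT_BASE:
        return virtual  # static registers are not renamed
    return ROT_BASE + (virtual - ROT_BASE + rrb) % ROT_SIZE


def rotating_copy(words):
    """Copy a non-empty list using only virtual r34 (load) and r35 (store)."""
    gr, out, rrb = {}, [], 0
    gr[physical(35, rrb)] = words[0]       # prologue: ld r35
    for nxt in words[1:]:
        gr[physical(34, rrb)] = nxt        # ld r34
        out.append(gr[physical(35, rrb)])  # st r35
        rrb -= 1                           # loop branch decrements RRB
    out.append(gr[physical(35, rrb)])      # epilogue: st r35
    return out, rrb
```

After each decrement, the value just loaded through r34 is visible as r35, so no copy instruction is needed between the load stage and the store stage. Note that `physical(32, -1)` wraps to 127, matching the 127 label that appears in the slide's register column once RRB goes negative.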

Loop Support: Rotating Registers
Software Pipelined Copy Loop (5 ops)
    // setup ra/rb/lc/ec, check n > 2
    { ld8 r35 = [ra], 8 }
    .label loop
    { ld8 r34 = [ra], 8
      st8 [rb] = r35, 8
      br.ctop #loop }
    { st8 [rb] = r35, 8 }
Execution cycles
    Cycle:  1         2     3     4     5
    ld:     ld1       ld2   ld3   ld4
    st:               st1   st2   st3   st4
    Phase:  Prologue  Main loop (br.ctop, br.ctop)  Epilogue
• Modulo Scheduled Iterations
  – 1 cycle per word
  – 1.6X performance improvement
  – additional upside for higher latency conditions
  – 1.7X code expansion

Introducing Rotating Predicate Registers
• PR 16-63 can rotate, with separate Rotating Register Base
• Loop branches decrement all register rotating bases (RRB)
• Instructions contain a “virtual” predicate register number
  – RRB + virtual register number = physical register number
Code:
    (p16) ld R34
    (p17) st R35
IA-64 example, one row per animation step (ops newly shown at that step):
    Initialize:    LC=3  EC=2  RRB=0     ld1
    Branch 1:      LC=2  EC=2  RRB=-1    ld2, st1
    Branch 2:      LC=1  EC=2  RRB=-2    ld3, st2
    Branch 3:      LC=0  EC=2  RRB=-3    ld4, st3
    Branch 4:      LC=0  EC=1  RRB=-4    ld (shown without an iteration number)
    Fall Through:  LC=0  EC=0  RRB=-5    st4
[Figure: predicate register column PR16, PR17, PR18 and PR63, PR62, ... shifting by one position at each step]

Loop Support: Rotating Predicates
Software Pipelined Copy Loop (3 ops)
    // setup ra/rb/lc/ec, check n > 1
    .label loop
    { (p16) ld8 r34 = [ra], 8
      (p17) st8 [rb] = r35, 8
      br.ctop #loop }
Execution cycles
    Cycle:  1     2     3     4     5
    ld:     ld1   ld2   ld3   ld4   ld
    st:     st    st1   st2   st3   st4
    All five cycles run the main loop (br.ctop)
• Software Pipelined Mem. Copy
  – 1 cycle per word
  – 1.6X performance improvement
  – no code expansion
Efficient Loop, Efficient Code Size
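The way LC, EC and the rotating predicates drive this three-op loop can be simulated in Python. This is a sketch rather than a faithful machine model: `pipelined_copy` is a hypothetical name, rotation is modeled by shifting a two-element predicate list and copying r34 into r35, and the branch rule is written from the LC/EC sequence on the preceding slides (LC counts down first, then EC, with a 0 rotated into p16 once LC is exhausted).

```python
def pipelined_copy(src):
    """Run the (p16) ld / (p17) st / br.ctop loop over src; return (dst, trace, (lc, ec))."""
    if not src:
        return [], [], (0, 0)
    lc, ec = len(src) - 1, 2      # setup: LC = n - 1, EC = number of pipeline stages
    p16, p17 = True, False        # only the load stage is enabled at first
    r34 = r35 = None
    dst, trace, i = [], [], 0
    while True:
        ops = []
        if p17:                   # (p17) st8 [rb] = r35, 8
            dst.append(r35)
            ops.append("st")
        if p16:                   # (p16) ld8 r34 = [ra], 8
            r34 = src[i]
            i += 1
            ops.append("ld")
        trace.append(ops)
        # br.ctop: decide whether to loop and which value enters the first stage
        if lc != 0:
            lc, new_p16, taken = lc - 1, True, True
        elif ec > 1:
            ec, new_p16, taken = ec - 1, False, True
        else:
            ec, new_p16, taken = ec - 1, False, False
        # rotation: old p16 becomes p17, and the value loaded as r34 becomes r35
        p16, p17 = new_p16, p16
        r35 = r34
        if not taken:
            return dst, trace, (lc, ec)
```

For a four-word source the trace has five passes: a load alone, three passes with both a store and a load, and a final store alone, ending with LC=0 and EC=0. The same loop body serves as prologue, kernel and epilogue, which is why no extra code is needed.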

Software Pipelining Benefits
• Loop pipelining maximizes performance; minimizes overhead
  – Avoids code expansion of unrolling and code explosion of prologue and epilogue
  – Smaller code means fewer cache misses
  – Greater performance improvements in higher latency conditions
• Reduced overhead allows S/W pipelining of small loops with unknown trip counts
  – Typical of integer scalar codes

Reviewing What’s New:
• Parallel compares
• Tbit
• NaT bits
• Deferral
• Hoisting uses
• Propagation
• Branch instructions
• Static prediction
• Advanced loads
• ALAT
• Loop branches
• LC register
• EC register
• Multiway branch
• Branch registers
• Register rotation
• Predicate rotation
• RRB

Summary
• Speculation reduces memory latency impact
  – IA-64 removes recovery from critical path
  – Benefits applications with poor cache locality: server applications, OS
• Predication removes branches
  – Parallel compares increase parallelism
  – Benefits complex control flow: large databases
• S/W pipelining support with minimal overhead enables broad usage
  – Performance for small integer loops with unknown trip counts as well as monster FP loops