
IA-64 and Intel® Itanium™ Processor Micro-architecture Overview
Korea University, School of Electrical, Electronic and Radio Engineering
Lynn Choi (최 린)

Table of Contents
• Itanium™ Introduction
• IA-64 Architecture Features
  – Instruction Format
  – Predication
  – Control Speculation
  – Data Speculation
  – Register Stack
  – Software Pipelining
• Itanium™ Micro-architecture
  – Itanium™ Pipeline
  – Front-End Pipeline
    • Instruction fetch & branch prediction
  – Back-End Pipeline
    • Instruction dispersal network
    • FP, speculation and IA-32 support
  – Caches and Platform Architecture
• IA-64 Roadmap

2021-06-05, Korea University (고려대학교), Itanium Micro-architecture Overview

Itanium™ Processor
• The first implementation of Intel Architecture 64 (IA-64)
  – Both IA-64 and Itanium were jointly developed by Intel and HP
    • Intel: high-frequency, deeply pipelined out-of-order (OOO) superscalar processors
    • HP: VLIW, wide-issue parallel hardware with novel ISA/compiler support
  – EPIC (Explicitly Parallel Instruction Computing) architecture
    • LIW: 3 instruction syllables encoded in a 128-bit instruction bundle
    • A sequence of parallel instruction groups specified by the compiler
    • Aggressive control and data speculation
  – Strong compiler/micro-architecture interaction
    • Explicit dependences let the compiler expose more parallelism, while fully interlocked hardware provides compatibility across implementations
    • Compiler-directed prefetching, static branch prediction, software allocation of caches/TLB
• 0.18 µm process technology, massively resourced
  – 6-way superscalar processor
  – 9 issue ports: 2 MEM, 2 INT, 2 FP, and 3 BRANCH
  – Large register files: 128 GPRs, 128 FPRs, 64 PRs, and 8 BRs
• Full IA-32 compatibility through direct execution of IA-32 code
• Targeted at the server and high-end workstation markets
  – Large address space, flexible page sizes up to 256 MB, and high TPC performance

Instruction Format
• Instruction syllables, bundles, and templates
  – 4 instruction types: M, I, F, B
• Explicit parallel semantics
  – Program = a sequence of parallel instruction groups
    • The compiler can group any number of independent instructions
    • Breaks the sequential instruction paradigm
    • Simplifies hardware by removing dynamic dependence checking
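
The bundle layout can be sketched in a few lines of Python. This is an illustration only, not a decoder: it assumes the architectural layout of a 5-bit template in the low bits followed by three 41-bit instruction slots (5 + 3 × 41 = 128), and the helper names `pack_bundle` and `unpack_bundle` and the template value used in the test are hypothetical.

```python
TEMPLATE_BITS = 5   # template field selects the unit type of each slot
SLOT_BITS = 41      # each instruction syllable occupies 41 bits
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a template and three syllables into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three syllables")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each syllable must fit in 41 bits")
        # Slot i starts right after the template and the earlier slots.
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots
```

Because the template names the unit type of every slot, hardware can route syllables to execution units without decoding them first.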

Predication
• Predicated execution of virtually all instructions
  – (p) add r1 = r2, r3
    • If p is true, a normal add is performed; otherwise the instruction behaves as a NOP
  – 64 1-bit predicate registers
  – Advantages of predicated execution
    • Removes branches
      – Converts control dependences to data dependences
      – Reduces misprediction penalties
    • Increases the size of basic blocks
      – Code from both the taken and not-taken paths can be scheduled in the same cycle
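
If-conversion can be illustrated with a toy interpreter. The sketch below is a simplified model, not IA-64 semantics in full: the tuple encoding of instructions and the names `run` and `select_add_or_sub` are hypothetical, and only `add`, `sub`, and a two-target `cmp.lt` are modeled. Predicate register p0 is treated as always true.

```python
def run(program, regs, preds):
    """Execute straight-line code; each instruction is gated by a predicate."""
    for qp, op, dst, a, b in program:
        if not preds[qp]:
            continue  # predicate false: the instruction acts as a NOP
        if op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "sub":
            regs[dst] = regs[a] - regs[b]
        elif op == "cmp.lt":
            p_true, p_false = dst  # the compare writes a complementary pair
            preds[p_true] = regs[a] < regs[b]
            preds[p_false] = not preds[p_true]
        else:
            raise ValueError(f"unknown op {op}")
    return regs


def select_add_or_sub(x, y):
    """if (x < y) r1 = x + y; else r1 = x - y; written without a branch."""
    regs = {"r1": 0, "r2": x, "r3": y}
    preds = [False] * 64
    preds[0] = True  # p0 is hardwired to true
    program = [
        (0, "cmp.lt", (1, 2), "r2", "r3"),  # cmp.lt p1, p2 = r2, r3
        (1, "add", "r1", "r2", "r3"),       # (p1) add r1 = r2, r3
        (2, "sub", "r1", "r2", "r3"),       # (p2) sub r1 = r2, r3
    ]
    return run(program, regs, preds)["r1"]
```

Both arms appear in the instruction stream and only one takes effect, so there is no branch to mispredict.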

Control Speculation
• Loads incur high latency
  – Need to schedule loads as early as possible
  – Two barriers: branches and stores
• Control speculation: move loads above branches
  – However, loads can cause exceptions
  – Separate load behavior from exception behavior
    • A speculative load (ld.s) initiates the load operation and detects exceptions
    • On an exception, hardware propagates an exception token (NaT, stored with the destination register) from ld.s to chk.s
    • A speculative check (chk.s) delivers the exception detected by ld.s

Control Speculation
• Speculating the uses of a control-speculative load further increases ILP
  – Dependent instructions following the load can also be speculated above branches
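
The ld.s / chk.s protocol can be modeled in a few lines. This is a behavioral sketch under stated assumptions: memory is a Python dict, a missing key stands in for a faulting address, and the names `NAT`, `ld_s`, `add`, `chk_s`, and `hoisted` are all hypothetical.

```python
NAT = object()  # stand-in for the NaT ("Not a Thing") token


def ld_s(memory, addr):
    """Speculative load: on a fault, defer the exception by returning NaT."""
    return memory.get(addr, NAT)


def add(a, b):
    """Dependent instructions propagate NaT instead of raising."""
    return NAT if (a is NAT or b is NAT) else a + b


def chk_s(value, recovery):
    """Speculation check: run the recovery code only if NaT arrived."""
    return recovery() if value is NAT else value


def hoisted(memory, addr, take_path):
    t = ld_s(memory, addr)  # load hoisted above the branch
    u = add(t, 1)           # its use is speculated as well
    if not take_path:
        return None         # result unused: a deferred fault is never seen
    # Recovery re-executes the load non-speculatively, so a real fault
    # (here a KeyError) is delivered at the original program point.
    return chk_s(u, lambda: memory[addr] + 1)
```

A fault on the hoisted load is raised only when the guarded path actually executes, which is what makes the hoisting safe.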

Data Speculation
• Move loads above potentially overlapping stores
  – However, loads and previous stores can conflict
    • When a load and an earlier store overlap (access the same memory location), the load must wait for the store because of the RAW dependence
  – IA-64 enables data speculation with ld.a and ld.c/chk.a, backed by the ALAT
    • ld.a performs a normal load and inserts the address into the ALAT
    • ALAT (Advanced Load Address Table)
      – Any intervening store removes the overlapping entries from the ALAT
    • The advanced load check (ld.c) checks the ALAT for a violation and reissues the load if necessary
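
A minimal model of the ALAT makes the check concrete. It is a sketch under simplifying assumptions: entries are keyed by destination register name, overlap means an exact address match, capacity is unbounded, and the class and method names are hypothetical.

```python
class Alat:
    """Toy Advanced Load Address Table: register name -> load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, memory, reg, addr):
        """Advanced load: do the load and remember its address."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Store: write memory and invalidate any overlapping entries."""
        memory[addr] = value
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def ld_c(self, memory, reg, addr, speculative_value):
        """Check load: keep the early value if the entry survived,
        otherwise reissue the load."""
        if reg in self.entries:
            return speculative_value
        return memory[addr]
```

When no store hits the address, the check costs nothing beyond the table lookup; when one does, the reload returns the stored value.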

Data Speculation
• Uses of speculative data can be further speculated
• Control and data speculation can also be combined
  – Schedule loads across branches and across stores at the same time
  – Speculative advanced loads: ld.sa combines the semantics of ld.a and ld.s

Register Stack
• Procedure call overhead
  – Stack area in memory to save/restore procedure context
  – Need to spill registers to memory on procedure call
  – Need to fill registers from memory on procedure return
• GR register stack
  – The register stack is used to save/restore procedure contexts across calls
  – Explicit allocation of stack frames
    • Effective use of 96 registers
    • Allocate only what is needed: programmable frame size of up to 96 registers
  – Effective parameter passing
    • Overlapping stack frames avoid parameter copying
  – Mechanism implemented by renaming register addresses
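
Frame overlap by renaming can be sketched as follows. This is a simplified model: it assumes stacked registers start at r32, ignores wraparound and spilling, and uses hypothetical names (`RegisterStack`, `alloc`, `call`, `ret`). On a call the callee's frame starts at the caller's first output register, so arguments are passed without copying.

```python
class RegisterStack:
    FIRST_STACKED = 32  # r32 is the first stacked general register

    def __init__(self, n_phys=96):
        self.phys = [0] * n_phys
        self.base = 0    # physical index that r32 currently maps to
        self.sol = 0     # size of locals (inputs + locals)
        self.sof = 0     # size of frame (inputs + locals + outputs)
        self.saved = []  # caller frame markers

    def alloc(self, n_in, n_local, n_out):
        sof = n_in + n_local + n_out
        if self.base + sof > len(self.phys):
            raise OverflowError("frame does not fit (no spilling modeled)")
        self.sol, self.sof = n_in + n_local, sof

    def _phys_index(self, r):
        if not self.FIRST_STACKED <= r < self.FIRST_STACKED + self.sof:
            raise IndexError(f"r{r} is outside the current frame")
        return self.base + (r - self.FIRST_STACKED)

    def read(self, r):
        return self.phys[self._phys_index(r)]

    def write(self, r, value):
        self.phys[self._phys_index(r)] = value

    def call(self):
        """Rename so the caller's outputs become the callee's r32, r33, ..."""
        self.saved.append((self.base, self.sol, self.sof))
        n_out = self.sof - self.sol
        self.base += self.sol
        self.sol, self.sof = 0, n_out

    def ret(self):
        """Restore the caller's frame marker."""
        self.base, self.sol, self.sof = self.saved.pop()
```

For example, a caller that allocates two locals and one output writes the argument to r34; after `call()` the callee reads the same physical register as r32.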

Register Stack (figure)

Register Stack Engine (RSE)
• Automatically saves/restores stack registers without software intervention
  – Avoids explicit spill/fill
  – Provides the illusion of infinite physical registers
  – Overflow: alloc needs more registers than are available
  – Underflow: return needs to restore a frame saved in memory
• RSE uses unused memory bandwidth (cycle stealing) to perform register spill and fill operations in the background
  – Eliminates stack management overhead
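
Overflow and underflow handling can be illustrated with a deliberately coarse model. The sketch below spills and fills whole frames, does it synchronously, and does not model frame overlap; real hardware works in the background and at register granularity. All names are hypothetical.

```python
class RegisterStackEngine:
    def __init__(self, n_phys):
        self.n_phys = n_phys
        self.resident = []       # frames held in registers, oldest first
        self.backing_store = []  # frames spilled to memory
        self.spills = 0
        self.fills = 0

    def _used(self):
        return sum(len(frame) for frame in self.resident)

    def call(self, frame_size):
        """Allocate a frame; on overflow, spill the oldest frames."""
        if frame_size > self.n_phys:
            raise ValueError("frame larger than the register file")
        while self._used() + frame_size > self.n_phys:
            self.backing_store.append(self.resident.pop(0))
            self.spills += 1
        self.resident.append([0] * frame_size)

    def ret(self):
        """Drop the current frame; on underflow, fill the caller's frame."""
        self.resident.pop()
        if not self.resident and self.backing_store:
            self.resident.append(self.backing_store.pop())
            self.fills += 1

    def current(self):
        return self.resident[-1]
```

The program only ever calls and returns; the spill and fill counters move without any explicit save or restore code, which is the illusion of an unbounded register stack.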

Software Pipelining Support
• High-performance loops without code size overhead
  – No prologue and epilogue code
• Rotating registers
  – Provide automatic renaming
• Rotating predicates (stage predicates)
  – Unify prolog, kernel, and epilog
• Loop control registers (LC, EC)
• Loop branches
  – Counted loop (br.ctop)
  – While loop (br.wtop)
• Especially valuable for integer loops with small trip counts

Software Pipelining Example

Original loop:

    L1: ld4 r4 = [r5], 4 ;;     // cycle 0
        add r7 = r4, r9 ;;      // cycle 2
        st4 [r6] = r7, 4        // cycle 3
        br.cloop L1 ;;

Software-pipelined loop:

    L1: (p16) ld4 r32 = [r5], 4     // cycle 0
        (p18) add r35 = r34, r9     // cycle 0
        (p19) st4 [r6] = r36, 4     // cycle 0
        br.ctop L1 ;;

Schedule (figure): successive iterations overlap. The prolog fills the pipeline (ld, then add, then st), the kernel issues ld, add, and st together, and the epilog drains the remaining add and st.

Register and predicate rotation (each column is one physical register; its name advances by one per iteration):

    Iteration   General registers       Stage predicates
    1           r32 r33 r34 r35 ...     p16=1 p17=0 p18=0 p19=0
    2           r33 r34 r35 r36 ...     p16=1 p17=1 p18=0 p19=0
    3           r34 r35 r36 r37 ...     p16=1 p17=1 p18=1 p19=0
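
The pipelined kernel can be traced with a short simulation. This is a sketch under stated assumptions: only 8 rotating registers and 8 rotating predicates are modeled, post-increment addressing is replaced by list indexing, LC starts at N - 1 and EC at the number of stages (4), and the function name `run_pipelined_loop` is hypothetical. The br.ctop step follows the architectural rule: keep injecting a true stage predicate while LC > 0, then drain while EC > 1.

```python
from collections import deque


def run_pipelined_loop(src, r9, stages=4):
    """Simulate the (p16) ld4 / (p18) add / (p19) st4 kernel on a list."""
    if not src:
        return []
    rot = deque([0] * 8)       # rot[i] models r(32 + i)
    pred = deque([False] * 8)  # pred[i] models p(16 + i)
    pred[0] = True             # p16 = 1 on entry
    lc, ec = len(src) - 1, stages
    load_idx = 0
    dst = []
    while True:
        # One instruction group: read all sources before any write.
        p16, p18, p19 = pred[0], pred[2], pred[3]
        r34, r36 = rot[2], rot[4]
        if p16:                 # (p16) ld4 r32 = [r5], 4
            rot[0] = src[load_idx]
            load_idx += 1
        if p18:                 # (p18) add r35 = r34, r9
            rot[3] = r34 + r9
        if p19:                 # (p19) st4 [r6] = r36, 4
            dst.append(r36)
        # br.ctop: pick the next stage predicate and decide whether to loop.
        if lc > 0:
            lc -= 1
            next_p16, taken = True, True
        elif ec > 1:
            ec -= 1
            next_p16, taken = False, True
        else:
            ec -= 1
            next_p16, taken = False, False
        rot.rotate(1)           # the value in r32 is now visible as r33
        pred.rotate(1)
        pred[0] = next_p16
        if not taken:
            return dst
```

A three-element input runs the kernel six times (N + stages - 1): three iterations with loads and three more that drain the add and store stages.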

IA-64 Architecture Performance Features
• Parallelism - groups of independent instructions
• Predication - reduces branches, enhancing ILP
• Control speculation - breaks the branch barrier, increasing ILP
• Data speculation - breaks data dependences, increasing ILP
• Control & data speculation - addresses memory latency
• Stack/RSE - reduces procedure call overhead
• Loop support - yields performance without overhead

Itanium™ - Maximizing Compiler/HW Synergy (figure)

10-stage In-Order Pipeline (figure)

FE Pipeline – Instruction Fetch
• Software prefetching
  – Streaming prefetch of large blocks on branch hints
    • For a long sequential instruction stream, up to a predicted-taken branch
  – Early prefetch of small blocks if the branch is likely to be taken
    • The brp instruction prefetches the targets of IP-relative branches
      – A prefetch vector marks the request as useful only when the subsequent branches follow a given pattern (e.g., cancel unless TT)
    • The move-to-br instruction prefetches the targets of indirect branches
• I-fetch of 32 bytes/clock feeds an 8-bundle decoupling buffer
  – Hides instruction cache misses and bubbles

FE Pipeline – Branch Predictors
• Dynamic predictors
  – 1-cycle TARs (target address registers) for important branches
  – 2-cycle, 512-entry local 2-level predictors for branch direction
    • Multi-way branch prediction: up to 3 predictions per cycle
  – 2-cycle, 64-entry target address cache
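
A local two-level direction predictor can be sketched generically. Only the 512-entry size comes from the slide; the 4-bit history length, the 2-bit saturating counters, the indexing by bundle address, and every name below are assumptions made for illustration, not a description of the Itanium implementation.

```python
class LocalTwoLevelPredictor:
    def __init__(self, entries=512, history_bits=4):
        self.entries = entries
        self.mask = (1 << history_bits) - 1
        # Level 1: per-branch history of recent outcomes.
        self.history = [0] * entries
        # Level 2: per-branch table of 2-bit counters, one per history
        # pattern, initialized to "weakly not taken".
        self.counters = [[1] * (1 << history_bits) for _ in range(entries)]

    def _index(self, pc):
        return (pc >> 4) % self.entries  # bundles are 16 bytes

    def predict(self, pc):
        i = self._index(pc)
        return self.counters[i][self.history[i]] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        h = self.history[i]
        c = self.counters[i][h]
        self.counters[i][h] = min(3, c + 1) if taken else max(0, c - 1)
        self.history[i] = ((h << 1) | int(taken)) & self.mask
```

Because the counter is selected by the branch's own recent history, a strictly alternating branch becomes fully predictable after a short warm-up, which a single 2-bit counter cannot achieve.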

Instruction Delivery
• Stop bits eliminate dependency checking
• Templates simplify routing

Industry-leading Floating Point Performance
• 4 FMACs deliver 8 SP FLOPs/cycle or 4 EP/DP FLOPs/cycle
  – 2 extended-precision FMACs + 2 single-precision FMACs
  – Peak performance of 3 GFLOPs (EP) or 6 GFLOPs (SP)
  – Also perform integer multiply and software divide/square root
    • Software divide breaks a single divide into several FMA operations
    • Slightly greater latency for each divide, but much greater throughput
  – Balanced with plenty of operand bandwidth from registers/memory
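
Software divide can be illustrated with Newton-Raphson refinement of a reciprocal, where every step is a multiply-add. This is a generic sketch, not the Itanium divide sequence: the linear seed stands in for a hardware reciprocal approximation, `fma` is a plain Python stand-in that rounds twice (a real FMA rounds once), and the iteration count and function names are assumptions.

```python
import math


def fma(a, b, c):
    """Stand-in for a fused multiply-add (rounds twice, unlike hardware)."""
    return a * b + c


def reciprocal_seed(b):
    """Rough estimate of 1/b for b > 0, relative error at most 1/17."""
    m, e = math.frexp(b)  # b = m * 2**e with 0.5 <= m < 1
    y = 48.0 / 17.0 - (32.0 / 17.0) * m
    return math.ldexp(y, -e)


def software_divide(a, b, iterations=4):
    """Compute a / b using only multiply-add steps after the seed."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    sign = -1.0 if b < 0 else 1.0
    b = abs(b)
    y = reciprocal_seed(b)
    for _ in range(iterations):
        e = fma(-b, y, 1.0)  # error term: e = 1 - b*y
        y = fma(y, e, y)     # refine: y = y + y*e (error squares each step)
    q = a * y                # first quotient estimate
    r = fma(-b, q, a)        # remainder: r = a - b*q
    return sign * fma(r, y, q)  # final correction: q + r*y
```

Each divide is a chain of dependent multiply-adds, so its latency is longer than a dedicated divider's, but independent divides can be interleaved in the pipelined FMACs, which is where the throughput advantage comes from.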

Speculation Hardware
• Control speculation support
  – Memory exceptions are delivered with the data as tokens (NaTs)
  – NaTs propagate through subsequent instructions like source data
• Data speculation support
  – ALAT tracks up to 32 outstanding advanced loads
  – Indexed by register ID; each entry keeps a partial physical address tag

IA-32 Compatibility
• Direct execution of IA-32 binary code
  – Sharing the I/D caches and execution core increases area efficiency
  – A dynamic scheduler optimizes performance
• Full, efficient IA-32 instruction compatibility in hardware

Itanium™ Processor Block Diagram (figure)

Itanium™ Software Development
• SDK: compilers, linkers, libraries, debuggers, IA-64 OS, Merced simulator

Itanium™ Processor Status
• Itanium™ first silicon (1H 1999)
• Samples shipped to OEMs (mid-1999)
• Comprehensive functional validation (2H 1999)
  – Testing includes 7 OSs and many key scientific/enterprise apps
    • 64-bit Windows, Linux, Sun Solaris, HP-UX, SGI IRIX, Compaq UNIX, ...
  – Multiple Intel and OEM test platform configurations (2 to 64 CPUs)
• Compiler progress
  – Almost 100% of functional tests passing
  – Exceeding performance targets
• Development tools progress
  – SDK delivered to key OEMs, OSVs, and tool vendors
  – Full SDK with OS, compiler, and tools to select ISVs in Q1 2000
• Performance testing/tuning (1H 2000)
• Production scheduled for 2H 2000

IA-64 Roadmap (figure)