12. Multithreaded Processors Dezső Sima Fall 2006 D. Sima, 2006

Overview
• 1 Introduction
• 2 Overview
• 3 Coarse grain multithreading
• 4 Fine grain multithreading
• 5 Simultaneous multithreading

1. Introduction (1)
Aim of multithreading: to raise performance compared to superscalar execution or multitasking by increasing parallelism during execution.
Thread: a flow of control.
Main features of multithreading. Threads
• belong to the same process,
• usually share a common address space (otherwise multiple virtual-to-real address translation paths need to be maintained in parallel),
• are executed simultaneously (overlapped or in parallel).
Thread management:
• creation, control and termination of threads,
• maintaining multiple sets of thread states,
• context switching between threads.

1. Introduction (2)
Implementation of multithreading (while executing multithreaded apps/OSs)
• Software implementation: execution of multithreaded apps/OSs on a single-threaded processor by time sharing. Multiple threads are maintained concurrently by the OS (multithreaded OSs).
• Hardware implementation: execution of multithreaded apps/OSs on a multithreaded processor concurrently. Multiple threads are maintained concurrently by the processor (multithreaded processors).
Fast context switching between threads is required.

1. Introduction (3)
Basic options to implement multithreaded processors:
• Multicore processors (SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing)
• Multithreaded cores
[Figure: a chip with several cores, each with L2/L3, versus a chip with an MT core with L2/L3; both attach to L3/memory]

1. Introduction (4)
Requirement of software multithreading: maintaining multiple thread states concurrently by the OS, including PC, FX/FP registers, state registers.
Core enhancements needed in case of multithreaded cores:
• Maintaining multiple thread states, including PC, architectural registers, state registers (in case of merged architectural and rename registers, by providing appropriately large FX/FP file sizes).
• Maintaining multiple thread microstates, pertaining to rename mappings, the RAS (Return Address Stack), the ROB, etc.
• Providing increased sizes for scarce or sensitive resources, such as the instruction buffer, store queue, etc.

1. Introduction (5)
• Multicore processors: additional complexity ~ (60 – 80) %, additional gain (in general-purpose apps) ~ (60 – 80) %
• Multithreaded cores: additional complexity ~ (2 – 10) %, additional gain (in general-purpose apps) ~ (0 – 30) %

1. Introduction (6)
Multithreaded OSs:
• Windows NT
• OS/2
• Unix w/Posix
• most OSs developed from the 1990s on

Principle of sequential, multitask and multithreaded programming
[Figure: process/thread management example. Sequential programming: a single process P1. Multitask programming: P1 spawns processes P2 and P3 via fork() and exec(). Multithreaded programming: processes P1 to P3 are created via fork()/CreateProcess() and exec(), threads T1 to T6 are created via CreateThread() and synchronized with join().]
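The create/join pattern of the multithreaded case can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's standard threading module rather than the CreateThread()/fork() calls named on the slide, and the function name run_workers and the shared counter are hypothetical. It shows that all threads of one process see the same address space and that the parent waits for them with join().

```python
import threading

def run_workers(n_threads: int, increments: int) -> int:
    """Start n_threads threads that all update one shared counter, then join them."""
    counter = {"value": 0}          # lives in the address space shared by all threads
    lock = threading.Lock()         # protects the shared counter

    def worker() -> None:
        for _ in range(increments):
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:               # thread creation
        t.start()
    for t in threads:               # join(): wait until every thread has terminated
        t.join()
    return counter["value"]
```

Because the counter is shared, the result equals n_threads * increments; separate processes created with fork() would each have updated a private copy instead.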

Execution of sequential, multitask and multithreaded programs
• Sequential programs. Description: single process on a single processor. Key advantage: no issues with parallel programs. Key issue: sequential bottleneck.
• Multitask programs. Description: multiple processes on a single processor using time sharing. Key advantages: multiple programs with quasi-parallel execution, private address spaces. Key issue: solutions for fast context switching.
• Multithreaded programs, software implementation. Description: multithreaded software on a single-threaded processor by time sharing. Key advantages: shared process address spaces, faster intra-process context switches. Key issues: thread state management and context switching.
• Multithreaded programs, hardware implementation. Description: multithreaded software on a multithreaded processor. Key advantages: true parallel execution, shared process address spaces, near linear speedup, fastest intra-process context switches. Key issues: thread state management, thread scheduling.

Implementation of multiprocessing and multithreading (2)
Compared for sequential programs, multitask programs and multithreaded programs (software and hardware implementation):
• OS support: legacy OS; traditional Unix; most modern OSs (Windows NT/2000, OS/2, Unix + POSIX).
• Performance level: low-medium; higher.
• Software development: no API-level support; process life cycle management API; process and thread life cycle management API; explicit threading API, OpenMP.

2. Overview
2.1 Thread scheduling while implementing software multithreading on a traditional superscalar processor
The execution of a new thread is initiated by a context switch, which saves the state of the suspended thread and loads the state of the thread to be executed.
Figure 2.1: Thread scheduling in a traditional superscalar processor
Source: Mazzucco P., „Fundamentals of Multithreading”, http://www.slcentral.com/articles/01/6/multithreading

Thread scheduling in CMPs
Cores execute different threads independently.
Figure 2.2: Thread scheduling in a CMP
Source: Mazzucco P., „Fundamentals of Multithreading”, http://www.slcentral.com/articles/01/6/multithreading

2. Overview
Thread scheduling in multithreaded cores: Coarse grain MT
Threads are switched by means of rapid, HW-supported context switches.
Figure 2.3: Thread scheduling in a 4-way coarse grained multithreaded processor
Source: Mazzucco P., „Fundamentals of Multithreading”, http://www.slcentral.com/articles/01/6/multithreading

2. Overview
Thread scheduling in multithreaded cores: Coarse grain MT, Fine grain MT
The hardware thread scheduler chooses a thread in each cycle, and instructions from this thread are dispatched/issued in this cycle.
Figure 2.4: Thread scheduling in a 4-way fine grained multithreaded processor
Source: Mazzucco P., „Fundamentals of Multithreading”, http://www.slcentral.com/articles/01/6/multithreading

2. Overview
Thread scheduling in multithreaded cores: Coarse grain MT, Fine grain MT, Simultaneous MT (SMT)
Available instructions (chosen according to an appropriate selection policy, such as the priority of the threads) are dispatched/issued for execution in each cycle.
SMT: proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington).
Figure 2.5: Thread scheduling in a 4-way simultaneous multithreaded processor
Source: Mazzucco P., „Fundamentals of Multithreading”, http://www.slcentral.com/articles/01/6/multithreading
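The difference between the three scheduling schemes can be made concrete with a toy issue-slot model. The sketch below is an illustration only, not a model of any real processor: the function issued_per_cycle, the ready matrix (number of issuable instructions per thread and cycle) and the zero-cost thread switch are all assumptions, and for SMT a fixed priority order (lower thread id first) stands in for the selection policy.

```python
from typing import List

def issued_per_cycle(policy: str, ready: List[List[int]], width: int = 4) -> List[List[int]]:
    """Return, for every cycle, the thread ids that occupy the issue slots.

    ready[c][t] is the number of instructions thread t could issue in cycle c
    (0 means the thread is stalled). Thread switches are assumed to be free.
    """
    n_threads = len(ready[0])
    current = 0                      # thread currently run by the coarse grain policy
    result: List[List[int]] = []
    for cycle, avail in enumerate(ready):
        slots: List[int] = []
        if policy == "coarse":
            # Stay on the current thread until it stalls, then switch to the next ready one.
            if avail[current] == 0:
                for step in range(1, n_threads + 1):
                    cand = (current + step) % n_threads
                    if avail[cand] > 0:
                        current = cand
                        break
            slots = [current] * min(width, avail[current])
        elif policy == "fine":
            # One thread per cycle, rotating the starting point every cycle.
            for step in range(n_threads):
                cand = (cycle + step) % n_threads
                if avail[cand] > 0:
                    slots = [cand] * min(width, avail[cand])
                    break
        elif policy == "smt":
            # Slots of one cycle may be filled from several threads.
            for t in range(n_threads):
                take = min(avail[t], width - len(slots))
                slots.extend([t] * take)
        else:
            raise ValueError(f"unknown policy: {policy}")
        result.append(slots)
    return result
```

With ready = [[2, 1], [0, 3], [1, 1]], the coarse and fine grain policies each issue 6 instructions over the three cycles, while the SMT policy issues 8, because only SMT can fill the slots left empty by one thread with instructions of another thread in the same cycle.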

2.2 Overview of multithreaded cores (1)
Superscalars, RISCs:
• IBM RS64 IV (SStar) (2000): single core, 2T, 0.18 µm / 44 mtrs
• IBM POWER5 (2004): dual core, 2T, 0.13 µm / 276 mtrs
• DEC/Compaq Alpha 21464 (V8) (2003): single core, 4T, 0.13 µm / 250 mtrs
• Sun UltraSPARC T1 (Niagara) (2005): 8 cores / 4T, 0.09 µm / 279 mtrs
Figure 2.6: Multithreaded cores (1)

2.2 Overview of multithreaded cores (2)
Superscalars, CISCs (Intel):
• Pentium 4 (Northwood) (2002): single core, 0.13 µm / 55 mtrs
• Pentium EE 840 (4/2005): dual core, 0.09 µm / 230 mtrs
• Pentium EE 955/965 (Presler) (4/2005): dual core, 0.065 µm / 2*188 mtrs
VLIWs (Intel):
• Montecito (2006?): dual core, 2*Itanium 2 (Madison), 0.09 µm / 1730 mtrs
Figure 2.7: Multithreaded cores (2)

2.2 Overview of multithreaded cores (3)
Underlying core(s):
• Scalar core(s): SUN UltraSPARC T1 (Niagara) (2005), up to 8 cores, 4 threads
• Superscalar core(s): IBM RS64 IV (SStar) (2000), 2-way; Pentium 4 (2002), 2-way; DEC 21464 (2003), 4-way; IBM POWER5 (2005), dual-core/2-way; Pentium EE 840 (2005), dual-core/2-way; Pentium EE 955/965 (2005), dual-core/2-way
• VLIW core(s): SUN MAJC 5200 (2000), quad-core/4-way (dedicated use); Intel Montecito (2006?), dual-core/2-way

3. Coarse grain multithreading
3.1 Overview (1)
Thread scheduling in multithreaded cores: Coarse grain MT, Fine grain MT, Simultaneous MT (SMT)

3. Coarse grain multithreading
3.1 Overview (2)
Coarse grain MT:
• Scalar based: (none listed)
• Superscalar based: IBM RS64 IV (SStar) (2000), 2T
• VLIW based: SUN MAJC 5200 (2000), quad-core/4T (dedicated use); Intel Montecito (2006?), dual-core/2T

3.2 Case example 1: IBM RS64 IV (1)
Microarchitecture: 4-way superscalar, dual-threaded.
Used in IBM’s iSeries and pSeries commercial servers.
Optimized for commercial server workloads, such as on-line transaction processing, Web serving and ERP (Enterprise Resource Planning).
Instruction fetch width: 8 instr./cycle
Architectural state:
• GPRs, FPRs, CR (condition reg.), CTR (count reg.),
• special-purpose privileged-mode regs., such as the MSR (machine state reg.),
• status and control regs., such as the T priority.
Each T executes in its own effective address space, so units used for address translation, such as the SRs (Segment Address Regs.), need to be duplicated.
Duplicated resources: ~ +5 % chip area
Both single-threaded and multithreaded modes of execution are supported.

3.2 Case example 1: IBM RS64 IV (2)
IERAT: effective to real address translation cache (2 x 64 entries)
Figure 3.1: Microarchitecture of IBM’s RS64 IV
Source: Borkenhagen J. M. et al., „A multithreaded PowerPC processor for commercial servers”, IBM J. Res. Develop., Vol. 44, No. 6, Nov. 2000, pp. 885-898

3.2 Case example 1: IBM RS64 IV (3)
Aim: commercial workloads, characterized by
• large working sets and frequently occurring task switches,
• need for large L1 caches,
• high cache miss rates.
Thread switching (strongly simplified):
Two Ts are implemented: a foreground T and a background T. The foreground T executes until a long-latency event occurs, such as a cache miss or an IERAT miss. Subsequently, a T switch is performed and the background T begins to execute. After the miss is serviced, a T switch back to the foreground T occurs.
The Thread Switch Buffer holds up to 8 instructions from the background T, to eliminate the latency of the I$.
Threads can be allocated different priorities by explicit instructions.
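The foreground/background scheme described above can be pictured with a very small timeline model. This is a strongly simplified sketch and not the RS64 IV's actual switch logic: the function timeline is hypothetical, one instruction issues per cycle, the thread switch itself costs no cycles, and the background thread is assumed never to miss.

```python
from typing import Iterable

def timeline(foreground: Iterable[bool], miss_latency: int) -> str:
    """Build a per-cycle trace: 'F' = foreground thread runs, 'B' = background thread runs.

    foreground holds one flag per foreground instruction; True marks an
    instruction that causes a long-latency miss.
    """
    trace = []
    for causes_miss in foreground:
        trace.append("F")                      # the foreground instruction issues
        if causes_miss:
            # While the miss is serviced, the background thread uses the pipeline.
            trace.extend("B" * miss_latency)
    return "".join(trace)
```

For example, timeline([False, True, False], 3) yields "FFBBBF": the three cycles that a single-threaded processor would spend waiting for the miss are used by the background thread.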

3.2 Case example 1: IBM RS64 IV (4)
Figure 3.2: Thread switch on data cache miss in IBM’s RS64 IV
Source: Borkenhagen J. M. et al., „A multithreaded PowerPC processor for commercial servers”, IBM J. Res. Develop., Vol. 44, No. 6, Nov. 2000, pp. 885-898

3.2 Case example 2: SUN MAJC 5200 (1)
Aim: dedicated use, high-end graphics, networking with wire-speed computational demands.
Microarchitecture:
• up to 4 processors on a die,
• each processor has 4 FUs (Functional Units); 3 of them are identical, one is enhanced,
• each FU has its private logic and register set (e.g. 32 or 64 regs.),
• the 4 FUs of a processor share a set of global regs. (e.g. 64 regs.),
• all registers are unified (not split into FX/FP files),
• any FU can process any data type.
Each processor is a 4-wide VLIW and can be 4-way multithreaded.

3.2 Case example 2: SUN MAJC 5200 (2)
Figure 3.3: General view of SUN’s MAJC 5200
Source: “MAJC Architecture Tutorial”, Whitepaper, Sun Microsystems, Inc.

3.2 Case example 2: SUN MAJC 5200 (3)
Figure 3.4: The principle of private, unified register files associated with each FU
Source: “MAJC Architecture Tutorial”, Whitepaper, Sun Microsystems, Inc.

3.2 Case example 2: SUN MAJC 5200 (4)
Threading: each processor with its 4 FUs can be operated in a 4-way multithreaded mode (called Vertical Multithreading by Sun).
Implementation of 4-way multithreading: by executing each T by one of the 4 FUs („Vertical multithreading”).
Thread switch: following a cache miss, the processor saves the T state and begins to process the next T.
Example: comparison of program execution without and with multithreading on a 4-wide VLIW.
Considered program:
• it consists of 100 instructions,
• on average 2.5 instrs./cycle are executed,
• a cache miss occurs after every 20 instructions.
Latency of serving a cache miss: 75 cycles.
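The numbers of this example allow a back-of-the-envelope estimate of how much of the miss latency multithreading can hide. The sketch below is an assumption-laden estimate, not the calculation behind Sun's figures: the function utilization is hypothetical, thread switches are taken to be free, and every thread is assumed to behave like the considered program. Executing 20 instructions at 2.5 instrs./cycle takes 8 cycles, after which the thread waits 75 cycles.

```python
def utilization(n_threads: int,
                instrs_per_miss: int = 20,
                ipc: float = 2.5,
                miss_latency: int = 75) -> float:
    """Fraction of cycles in which the processor does useful work."""
    burst = instrs_per_miss / ipc        # cycles of work between two misses (8)
    period = burst + miss_latency        # one thread: 8 busy cycles + 75 waiting cycles
    # Other threads can fill the waiting time, but never more than the whole period.
    busy = min(n_threads * burst, period)
    return busy / period
```

Under these assumptions a single thread keeps the processor busy in 8 of every 83 cycles (about 10 %), whereas four threads raise this to 32 of 83 cycles (about 39 %). The latency is still not fully hidden, because the three other threads supply only 24 cycles of work during a 75-cycle miss.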

3.2 Case example 2: SUN MAJC 5200 (5)
Figure 3.5: Execution for subsequent cache misses in a single-threaded processor
Source: “MAJC Architecture Tutorial”, Whitepaper, Sun Microsystems, Inc.

3.2 Case example 2: SUN MAJC 5200 (6)
Figure 3.6: Execution for subsequent cache misses in SUN’s MAJC 5200
Source: “MAJC Architecture Tutorial”, Whitepaper, Sun Microsystems, Inc.

3.2 Case example 3: Intel Montecito (1)
Aim: high-end servers
Main differences between Itanium 2 and Montecito:
• split L2 caches,
• larger unified L3 cache,
• duplicated architectural states maintained.
Additional support of dual-threading:
• the branch prediction structures provide T tagging,
• per-thread return stack structures,
• per-thread ALATs (Advanced Load Address Tables).
Additional core area needed: ~ 2 %.

3.2 Case example 3: Intel Montecito (2)
Figure 3.7: Microarchitecture of Intel’s Itanium 2
Source: McNairy C., „Itanium 2”, IEEE Micro, March/April 2003, Vol. 23, No. 2, pp. 44-55

3.2 Case example 3: Intel Montecito (3)
Figure 3.8: Microarchitecture of Intel’s Montecito (ALAT: Advanced Load Address Table)
Source: McNairy C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20

3.2 Case example 3: Intel Montecito (4)
Thread switches: 5 event types cause thread switches, such as L3 cache misses and programmed switch hints.
Total switch penalty: 15 cycles
Example for thread switching: if the control logic detects that a thread does not make progress, a thread switch is initiated.

3.2 Case example 3: Intel Montecito (5)
Figure 3.9: Thread switch in Intel’s Montecito vs. single thread execution
Source: McNairy C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20

4. Fine grain multithreading
4.1 Overview (1)
Thread scheduling in multithreaded cores: Coarse grain MT, Fine grain MT, Simultaneous MT (SMT)

4. Fine grain multithreading
4.1 Overview (2)
Fine grain MT:
• Round robin selection policy: scalar based, superscalar based, VLIW based (no examples listed)
• Priority based selection policy: scalar based: SUN UltraSPARC T1 (Niagara) (2005), up to 8 cores/4T; superscalar based and VLIW based (no examples listed)

4.2 Case example: SUN UltraSPARC T1 (1)
Aim: commercial server applications, such as
• web serving,
• transaction processing,
• ERP (Enterprise Resource Planning),
• DSS (Decision Support Systems).
Characteristics of commercial server applications:
• large working sets,
• poor locality of memory references,
• high cache miss rates,
• low prediction accuracy for data-dependent branches.
Memory latency strongly limits performance, so multithreading is used to hide memory latency.

4.2 Case example: SUN UltraSPARC T1 (3)
Figure 4.3: Block diagram of SUN’s UltraSPARC T1
Source: Kongetira P. et al., „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29

4.2 Case example: SUN UltraSPARC T1 (2)
Structure:
• 8 scalar cores, 4-way multithreaded each,
• all 32 threads share an L2 cache of 3 MB, built up of 4 banks,
• 4 memory channels with on-chip DDR2 memory controllers.
It runs under Solaris.

4.2 Case example: SUN UltraSPARC T1 (4)
Figure 4.3: SUN’s UltraSPARC T1 chip
Source: www.princeton.edu/~jdonald/research/hyperthreading/romanescu_niagara.pdf

4.2 Case example: SUN UltraSPARC T1 (6)
Figure 4.3: Microarchitecture of the core of SUN’s UltraSPARC T1
Source: Kongetira P. et al., „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29

4.2 Case example: SUN UltraSPARC T1 (5)
Processor Elements (Sparc pipes):
• scalar FX-units, 6-stage pipeline,
• all Processor Elements share a single FP-unit.
Each thread of a processor element has its private:
• PC-logic,
• register file,
• instruction buffer,
• store buffer.
No thread switch penalty!

4.2 Case example: SUN UltraSPARC T1 (7)
Thread switch: threads are switched on a per-cycle basis.
Selection of threads: in the thread select pipeline stage, the thread select multiplexer selects a thread from the set of available threads in each clock cycle and issues the subsequent instr. of this thread into the pipeline for execution.
Thread selection policy: least recently used.
Threads become unavailable due to:
• long-latency instructions, such as loads, branches, multiplies, divides,
• pipeline stalls because of cache misses, traps, resource conflicts.
Example 1:
• all 4 threads are available.
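The least recently used selection among available threads can be sketched as follows. This is a minimal software illustration, not the T1's hardware logic: the class name LruThreadSelector and the representation of availability as a list of booleans are assumptions made here.

```python
from typing import List, Optional, Sequence

class LruThreadSelector:
    """Each cycle, pick the available thread that was selected least recently."""

    def __init__(self, n_threads: int = 4) -> None:
        # Front of the list = least recently selected thread.
        self.order: List[int] = list(range(n_threads))

    def select(self, available: Sequence[bool]) -> Optional[int]:
        for thread in self.order:
            if available[thread]:
                # The chosen thread becomes the most recently used one.
                self.order.remove(thread)
                self.order.append(thread)
                return thread
        return None                    # no thread available: the cycle stays empty
```

When all four threads are available the selector degenerates to a round robin (0, 1, 2, 3, 0, ...). When only threads 0 and 2 are available it alternates between these two, and it returns None in a cycle in which every thread is unavailable.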

4.2 Case example: SUN UltraSPARC T1 (8)
Figure 4.3: Thread switch in SUN’s UltraSPARC T1 when all threads are available
Source: Kongetira P. et al., „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29

4.2 Case example: SUN UltraSPARC T1 (9)
Example 2:
• there are only 2 threads available,
• speculative execution of instructions following a load.
(Data referenced by a load instruction arrive in the 3rd cycle after decoding, assuming a cache hit. So, after issuing a load, the thread becomes unavailable for the next two cycles.)

4.2 Case example: SUN UltraSPARC T1 (10)
Figure 4.3: Thread switch in SUN’s UltraSPARC T1 when only two threads are available (the add instruction from thread t0 is speculatively switched into the pipeline, assuming a cache hit)
Source: Kongetira P. et al., „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29

5. Simultaneous multithreading
5.1 Overview (1)
Thread scheduling in multithreaded cores: Coarse grain MT, Fine grain MT, Simultaneous MT (SMT)

5. Simultaneous multithreading
5.1 Overview (2)
Simultaneous MT:
• Scalar based: (none listed)
• Superscalar based: Pentium 4 (2002), 2T; DEC 21464 (2003), 4T; IBM POWER5 (2005), dual-core/2T; Pentium EE 840 (2005), dual-core/2T; Pentium EE 955/965 (2005), dual-core/2T
• VLIW based: (none listed)

5.2 Case example 1: Intel Pentium 4 / HT (2)
Figure 5.1: Intel Pentium 4 and the visible processor resources duplicated to support hyperthreading technology. Hyperthreading requires duplication of additional miscellaneous pointers and control logic, but these are too small to point out.
Source: Koufaty D. and Marr D. T., „Hyperthreading Technology in the Netburst Microarchitecture”, IEEE Micro, Vol. 23, No. 2, March-April 2003, pp. 56-65

5.2 Case example 1: Intel Pentium 4 / HT (3)
Figure 5.2: SMT pipeline in Intel’s Pentium 4/HT
Source: Marr D. T. et al., „Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Vol. 06, Issue 01, Feb. 14, 2002, pp. 4-16

5.2 Case example 1: Intel Pentium 4 / HT (1)
Intel designates SMT as Hyperthreading (HT).
Introduced in the Northwood-based DP- and MP-server cores in 2/2002 and 3/2002, respectively (called the Prestonia and Foster MP cores), followed by the Northwood core for desktops in 11/2002.
Additions for implementing MT:
• Duplicated architectural state, including
  • the instruction pointer,
  • the general purpose regs.,
  • the control regs.,
  • the APIC (Advanced Programmable Interrupt Controller) regs.,
  • some machine state regs.
• Further enhancements to support MT (thread microstate):
  • TC (Trace Cache) entries are tagged,
  • the BHB (Branch History Buffer) is duplicated,
  • the Global History Table is tagged,
  • the RAS (Return Address Stack) is duplicated,
  • the rename tables are duplicated,
  • the ROB is tagged.
Additional chip area required for MT: less than 5 %.
Single thread/dual thread modes: to prevent single-thread performance degradation, partitioned resources are recombined in single-thread mode.

5.2 Case example 2: Alpha 21464 (V8) (2)
Figure 5.3: SMT pipeline in the Alpha 21464 (V8)
Source: Mukherjee S., „The Alpha 21364 and 21464 Microprocessors”, http://www.compaq.com

5.2 Case example 2: Alpha 21464 (V8) (1)
8-way superscalar, scheduled for 2003, but canceled in June 2001 in favour of the Itanium line. In 2001 all Alpha intellectual property rights were sold to Intel.
Core enhancements for 4-way multithreading:
• Providing replicated (4x) thread states for: PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
  Alpha 21264: 80 GPRs, 80 FPRs
  Alpha 21464: 512
• Providing replicated (4x) thread microstates for: Register Maps.
Additional core area needed for SMT: ~ 6 %
Source: Preston R. P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading”, Proc. ISSCC, 2002, pp. 334-243

5.2 Case example 3: IBM POWER5 (2)
Figure 5.14: POWER4 and POWER5 system structures
Source: Kalla R., Sinharoy B., Tendler J. M., „IBM Power5 chip: A Dual-core Multithreaded Processor”, IEEE Micro, Vol. 24, No. 2, March-April 2004, pp. 40-47

5.2 Case example 3: IBM POWER5 (1)
POWER5 enhancements vs. the POWER4:
• on-chip memory control,
• separate L3/memory attachment,
• dual threaded.

5.2 Case example 3: IBM POWER5 (3)
Figure 5.3: Microarchitecture of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003

5.2 Case example 3: IBM POWER5 (4)
Figure 5.3: IBM POWER5 chip
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003

5.2 Case example 3: IBM POWER5 (6)
Figure 5.3: SMT pipeline of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003

5.2 Case example 3: IBM POWER5 (5)
Core enhancements for multithreading:
• Providing duplicated thread states for: PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
  POWER4: 80 GPRs, 72 FPRs
  POWER5: 120
• Providing duplicated thread microstates for: Return Address Stack, Group Completion (ROB).
• Providing increased (duplicated) sizes for scarce or sensitive resources, such as: Instruction Buffer, Store Queue.
Additional core area needed for SMT: ~ 10 %

5.2 Case example 3: IBM POWER5 (7)
Unbalanced execution of threads (an enhancement of the single mode/dual mode thread execution model):
• threads have 8 priority levels (0...7), controlled by HW/SW,
• the decode rate of each thread is controlled according to the associated priority.
Figure 5.3: Unbalanced execution of threads in IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
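How a priority difference could translate into decode rates can be illustrated with a small calculation. The slide does not state the allocation rule, so the rule used here is an assumption made for illustration: with priorities X and Y, the lower-priority thread receives 1 out of R = 2^(|X − Y| + 1) decode cycles and the other thread receives the rest. The function decode_shares is hypothetical, a larger number is taken to mean a higher priority, and any special handling of the lowest priority levels is ignored.

```python
from typing import Tuple

def decode_shares(prio_a: int, prio_b: int) -> Tuple[float, float]:
    """Return the fraction of decode cycles given to thread A and thread B."""
    for prio in (prio_a, prio_b):
        if not 0 <= prio <= 7:
            raise ValueError("priority levels range from 0 to 7")
    # Assumed rule: the lower-priority thread gets 1 out of r decode cycles.
    r = 2 ** (abs(prio_a - prio_b) + 1)
    low, high = 1 / r, (r - 1) / r
    return (high, low) if prio_a >= prio_b else (low, high)
```

Under this assumed rule, equal priorities split the decode cycles evenly (R = 2), while priorities 6 and 4 give the favoured thread 7 of every 8 decode cycles (R = 8).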

5.2 Case example 3: IBM POWER5 (8)
Development effort:
• Concept phase: ~ 10 persons / 4 months
• High level design phase: ~ 50 persons / 6 months
• Implementation phase: ~ 200 persons / 12-18 months
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003