
Simultaneous Multithreading (SMT)
• An evolutionary processor architecture, originally introduced in 1995 by Dean Tullsen at the University of Washington, that aims at reducing resource waste in wide-issue processors.
• SMT has the potential of greatly enhancing superscalar processor computational capabilities by:
  – Exploiting thread-level parallelism (TLP): simultaneously issuing, executing, and retiring instructions from different threads during the same cycle.
  – Providing multiple hardware contexts, hardware thread scheduling, and context-switching capability.
  – Providing effective long-latency hiding.
EECC 722 - Shaaban, Lec # 2, Fall 2005, 9-5-2005


SMT Issues
• SMT CPU performance gain potential.
• Modifications to superscalar CPU architecture necessary to support SMT.
• SMT performance evaluation vs. fine-grain multithreading, superscalar, and chip multiprocessors.
• Ref. papers: SMT-1, SMT-2.
• Hardware techniques to improve SMT performance:
  – Optimal level-one cache configuration for SMT.
  – SMT thread instruction fetch and issue policies.
  – Instruction recycling (reuse) of decoded instructions.
• Software techniques:
  – Compiler optimizations for SMT.
  – Software-directed register deallocation.
  – Operating system behavior and optimization.
• SMT support for fine-grain synchronization.
• SMT as a viable architecture for network processors.
• Current SMT implementation: Intel's Hyper-Threading (2-way SMT) microarchitecture and performance in compute-intensive workloads.


Microprocessor Architecture Trends
(Figure: evolution of microprocessor architectures, trending toward CMPs and SMT/CMPs.)


Evolution of Microprocessors
(Figure: Multi-cycle → Pipelined (single issue) → Multiple issue (CPI < 1): Superscalar/VLIW/SMT; 1 GHz to ?? GHz. Source: John P. Chen, Intel Labs.)
• Single-issue processor = scalar processor.
• Instructions Per Cycle (IPC) = 1/CPI.


Microprocessor Frequency Trend
(Figure: processor frequency and gate delays/clock vs. year, 1987-2005, for Intel, IBM PowerPC, and DEC processors; frequency scales by 2x per generation, e.g. Alpha 21064/21164/21264 and the Pentium family.)
• Frequency doubles each generation.
• The number of gate delays per clock is reduced by 25%.
• This leads to deeper pipelines with more stages.
Reality check: clock frequency scaling is slowing down! (Did silicon finally hit the wall?) Why?
1 - Power leakage.
2 - Clock distribution delays.
Result: deeper pipelines, longer stalls, and higher CPI (lowers effective performance per cycle); e.g. the Intel Pentium 4 E has 30+ pipeline stages.


Parallelism in Microprocessor VLSI Generations
(Figure: chip-level parallel processing exploiting thread-level parallelism (TLP).)
• Multiple micro-operations per cycle (superscalar).
• Simultaneous multithreading (SMT): e.g. Intel's Hyper-Threading.
• Chip multiprocessors (CMPs): e.g. IBM POWER4.


CPU Architecture Evolution: Single-Threaded, Single-Issue Pipeline
• Traditional 5-stage integer pipeline.
• Increases throughput: ideal CPI = 1.


CPU Architecture Evolution: Superscalar Architectures
• Fetch, issue, execute, etc. more than one instruction per cycle (CPI < 1).
• Limited by instruction-level parallelism (ILP).


Superscalar Architectures: Issue Slot Waste Classification
• Empty or wasted issue slots can be classified as either vertical waste or horizontal waste:
  – Vertical waste is introduced when the processor issues no instructions in a cycle.
  – Horizontal waste occurs when not all issue slots can be filled in a cycle.
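The two waste categories can be expressed as a toy slot counter (a hypothetical sketch, assuming an 8-wide issue processor as in the examples that follow):

```python
# Toy classifier for issue-slot waste, assuming an 8-wide issue processor.
ISSUE_WIDTH = 8

def classify_waste(issued_per_cycle):
    """Given the number of instructions issued each cycle, split the wasted
    slots into vertical waste (completely empty cycles) and horizontal
    waste (leftover slots in partially filled cycles)."""
    vertical = horizontal = 0
    for issued in issued_per_cycle:
        if issued == 0:
            vertical += ISSUE_WIDTH          # whole cycle lost
        else:
            horizontal += ISSUE_WIDTH - issued  # leftover slots this cycle
    return vertical, horizontal

# Five cycles: two are completely empty, three are partially or fully filled.
print(classify_waste([3, 0, 8, 1, 0]))  # (16, 12)
```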


Sources of Unused Issue Cycles in an 8-issue Superscalar Processor
(Figure: breakdown of wasted issue slots by cause.)
• Single-threaded average issue rate: 1.5 instructions/cycle.
• "Processor busy" represents the utilized issue slots; all others represent wasted issue slots.
• 61% of the wasted cycles are vertical waste; the remainder are horizontal waste.
• Workload: SPEC92 benchmark suite.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403. (SMT-1)


Single-Threaded Superscalar Architectures
(Figure: all possible causes of wasted issue slots, and the latency-hiding or latency-reducing traditional techniques that can reduce the number of cycles wasted by each cause.)
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403. (SMT-1)


Advanced CPU Architectures: Fine-grain or Traditional Multithreaded Processors
• Multiple HW contexts (PC, SP, and registers).
• Only one context or thread issues instructions each cycle.
• Performance limited by instruction-level parallelism (ILP) within each individual thread:
  – Can reduce some of the vertical issue-slot waste.
  – No reduction in horizontal issue-slot waste.
• Example architecture: the Tera Computer System.
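A minimal sketch of the issue rule above (hypothetical model, assumed 4-wide core): only one thread issues per cycle, so a completely empty cycle occurs only when every thread is stalled (vertical waste is reduced), but unfilled slots within the chosen thread remain (horizontal waste is untouched).

```python
# Hypothetical fine-grain multithreading issue rule, assumed 4-wide core.
ISSUE_WIDTH = 4

def issue_fine_grain(ready_per_thread):
    """Issue from the first thread that has ready instructions.
    Returns the number of instructions issued this cycle."""
    for ready in ready_per_thread:
        if ready > 0:
            return min(ready, ISSUE_WIDTH)  # one thread uses the whole width
    return 0  # vertical waste only when every thread is stalled

print(issue_fine_grain([0, 2, 4]))  # 2 -- thread 1 chosen, 2 slots still wasted
print(issue_fine_grain([0, 0, 0]))  # 0 -- full vertical waste
```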


Fine-grain or Traditional Multithreaded Processors: The Tera Computer System
• The Tera computer system is a shared-memory multiprocessor that can accommodate up to 256 processors.
• Each Tera processor is fine-grain multithreaded:
  – Each processor can issue one 3-operation Long Instruction Word (LIW) every 3 ns cycle (333 MHz) from among as many as 128 distinct instruction streams (hardware threads), thereby hiding up to 128 cycles (384 ns) of memory latency.
  – In addition, each stream can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor.
  – A stream implements a load/store architecture with three addressing modes and 31 general-purpose 64-bit registers.
  – The instructions are 64 bits wide and can contain three operations: a memory reference operation (M-unit operation, or M-op for short), an arithmetic or logical operation (A-op), and a branch or simple arithmetic or logical operation (C-op).
Source: http://www.cscs.westminster.ac.uk/~seamang/PAR/tera_overview.html
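The latency-hiding figure quoted above follows directly from the clock and stream counts (all numbers taken from the slide):

```python
# Back-of-envelope check of the Tera figures quoted above.
cycle_time_ns = 3            # 333 MHz clock -> 3 ns cycle
hw_streams = 128             # distinct instruction streams per processor
hidden_cycles = hw_streams   # one slot per stream before a stream's turn recurs
hidden_ns = hidden_cycles * cycle_time_ns
print(hidden_cycles, hidden_ns)  # 128 384
```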


Advanced CPU Architectures: VLIW: Intel/HP IA-64 Explicitly Parallel Instruction Computing (EPIC)
• Strengths:
  – Allows for a high level of instruction-level parallelism (ILP).
  – Takes much of the dependency analysis out of HW and places the focus on smart compilers.
• Weaknesses:
  – Limited by the instruction-level parallelism (ILP) in a single thread.
  – Difficulty keeping functional units (FUs) busy (control hazards).
  – Static FU scheduling limits performance gains.
  – The resulting overall performance depends heavily on compiler performance.


Advanced CPU Architectures: Single-Chip Multiprocessor
• Strengths:
  – Create a single processor block and duplicate it.
  – Exploits thread-level parallelism.
  – Takes much of the dependency analysis out of HW and places the focus on smart compilers.
• Weaknesses:
  – Performance within each processor is still limited by individual thread performance (ILP).
  – High power requirements using current VLSI processes.


Advanced CPU Architectures: Single-Chip Multiprocessor
(Figure.)


SMT: Simultaneous Multithreading
• Multiple hardware contexts running at the same time (HW context: registers, PC, SP, etc.).
• Reduces both horizontal and vertical waste by having multiple threads keep the functional units busy during every cycle.
• Builds on top of current, time-proven advancements in CPU design: superscalar execution, dynamic scheduling, hardware speculation, dynamic HW branch prediction, multiple levels of cache, hardware prefetching, etc.
• Enabling technology: VLSI logic density on the order of hundreds of millions of transistors per chip.
  – The potential performance gain is much greater than the increase in chip area and power consumption needed to support SMT.


SMT
• With multiple threads running, penalties from long-latency operations, cache misses, and branch mispredictions will be hidden:
  – Reduction of both horizontal and vertical waste, and thus an improved instructions-per-cycle (IPC) rate.
• Functional units are shared among all contexts during every cycle:
  – More complicated register read and writeback stages.
• More threads issuing to functional units results in higher resource utilization.
• CPU resources may have to be resized to accommodate the additional demands of the multiple threads running:
  – e.g. caches, TLBs, branch prediction tables, rename registers.


SMT: Simultaneous Multithreading
(Figure.)


The Power of SMT
(Figure: issue-slot diagrams over time (processor cycles) for a superscalar, a traditional multithreaded, and a simultaneous multithreaded processor.)
• Rows of squares represent instruction issue slots in a given cycle.
• A box with number x: an instruction issued from thread x.
• An empty box: the slot is wasted.
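The figure's point can be sketched as a toy per-cycle issue model (hypothetical, assumed 4-wide core): the superscalar draws from one thread only, while SMT fills leftover slots from the other ready threads.

```python
# Toy per-cycle issue model (assumed 4-wide core, illustrative only).
ISSUE_WIDTH = 4

def issue_superscalar(ready_per_thread):
    # a superscalar draws from a single thread only
    return min(ready_per_thread[0], ISSUE_WIDTH)

def issue_smt(ready_per_thread):
    # SMT greedily fills the remaining slots from every ready thread
    slots = ISSUE_WIDTH
    issued = 0
    for ready in ready_per_thread:
        take = min(ready, slots)
        issued += take
        slots -= take
    return issued

ready = [1, 3, 2]                  # three threads with 1, 3, 2 ready instructions
print(issue_superscalar(ready))    # 1 -- 3 slots wasted (horizontal waste)
print(issue_smt(ready))            # 4 -- all slots filled
```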


SMT Performance Example

Inst  Code                Description       Functional unit
A     LUI   R5, 100       R5 = 100          Int ALU
B     FMUL  F1, F2, F3    F1 = F2 x F3      FP ALU
C     ADD   R4, 8         R4 = R4 + 8       Int ALU
D     MUL   R3, R4, R5    R3 = R4 x R5      Int mul/div
E     LW    R6, R4        R6 = (R4)         Memory port
F     ADD   R1, R2, R3    R1 = R2 + R3      Int ALU
G     NOT   R7, R7        R7 = !R7          Int ALU
H     FADD  F4, F1, F2    F4 = F1 + F2      FP ALU
I     XOR   R8, R1, R7    R8 = R1 XOR R7    Int ALU
J     SUBI  R2, R1, 4     R2 = R1 - 4       Int ALU
K     SW    ADDR, R2      (ADDR) = R2       Memory port

Available functional units:
• 4 integer ALUs (1-cycle latency)
• 1 integer multiplier/divider (3-cycle latency)
• 3 memory ports (2-cycle latency, assuming cache hits)
• 2 FP ALUs (5-cycle latency)
• Assume all functional units are fully pipelined.


SMT Performance Example (continued)
• 2 additional cycles for SMT to complete program 2.
• Throughput:
  – Superscalar: 11 inst / 7 cycles = 1.57 IPC
  – SMT: 22 inst / 9 cycles = 2.44 IPC
  – SMT is 2.44/1.57 = 1.55 times faster than the superscalar for this example.
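The arithmetic above can be checked directly:

```python
# Reproducing the slide's throughput arithmetic.
superscalar_ipc = 11 / 7   # 11 instructions in 7 cycles
smt_ipc = 22 / 9           # two program copies (22 instructions) in 9 cycles
print(round(superscalar_ipc, 2))   # 1.57
print(round(smt_ipc, 2))           # 2.44
# Speedup, using the rounded figures the slide quotes:
print(round(2.44 / 1.57, 2))       # 1.55
```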


Modifications to Superscalar CPUs Necessary to Support SMT
• Multiple program counters, and some mechanism by which the fetch unit selects one each cycle (thread instruction fetch policy).
• A separate return stack for each thread for predicting subroutine return destinations.
• Per-thread instruction retirement, instruction queue flush, and trap mechanisms.
• A thread ID with each branch target buffer entry to avoid predicting phantom branches.
• A larger register file, to support logical registers for all threads plus additional registers for register renaming (may require additional pipeline stages).
• A higher available main memory fetch bandwidth may be required.
• A larger data TLB with more entries to compensate for the increased virtual-to-physical address translations.
• Improved caches to offset the cache performance degradation due to cache sharing among the threads and the resulting reduced locality:
  – e.g. private per-thread vs. shared L1 caches.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202. (SMT-2)


Current Implementations of SMT
• Intel's recent implementation of Hyper-Threading (HT) Technology (2-thread SMT) in its current P4 processor family.
• IBM POWER5: dual cores, each 2-thread SMT.
• The Alpha EV8 (4-thread SMT), originally scheduled for production in 2001, is currently on indefinite hold :(
• A number of special-purpose processors targeted towards network processor (NP) applications.
• Current technology has the potential for 4-8 simultaneous threads:
  – Based on transistor count and design complexity.


A Base SMT Hardware Architecture
(Figure.)
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202. (SMT-2)


Example SMT vs. Superscalar Pipeline
(Figure: the pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines. Based on the Alpha 21164.)
• Two extra pipeline stages are added for register read/write to account for the size increase of the register file.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202. (SMT-2)


Intel Hyper-Threaded (2-way SMT) P4 Processor Pipeline
(Figure.)
Source: Intel Technology Journal, Volume 6, Number 1, February 2002. (SMT-8)


Intel Hyper-Threaded (2-way SMT) P4 Out-of-order Execution Engine: Detailed Pipeline
(Figure.)
Source: Intel Technology Journal, Volume 6, Number 1, February 2002. (SMT-8)


SMT Performance Comparison
• Instruction throughput (IPC) from simulations by Eggers et al. at the University of Washington, using both multiprogramming and parallel workloads:

Multiprogramming workload:
Threads   Superscalar   Traditional Multithreading   SMT
1         2.7           2.6                          3.1
2         -             3.3                          3.5
4         -             3.6                          5.7
8         -             2.8                          6.2

Parallel workload:
Threads   Superscalar   MP2   MP4   Traditional Multithreading   SMT
1         3.3           2.4   1.5   3.3                          3.3
2         -             4.3   2.6   4.1                          4.7
4         -             -     4.2   4.2                          5.6
8         -             -     -     3.5                          6.1


Possible Machine Models for an 8-way Multithreaded Processor
• The following machine models for a multithreaded CPU that can issue 8 instructions per cycle differ in how threads use issue slots and functional units:
• Fine-Grain Multithreading:
  – Only one thread issues instructions each cycle, but it can use the entire issue width of the processor. This hides all sources of vertical waste, but does not hide horizontal waste.
• SM: Full Simultaneous Issue:
  – This is a completely flexible simultaneous multithreaded superscalar: all eight threads compete for each of the 8 issue slots each cycle. This is the least realistic model in terms of hardware complexity, but it provides insight into the potential of simultaneous multithreading. The following models each represent restrictions to this scheme that decrease hardware complexity.
• SM: Single Issue, SM: Dual Issue, and SM: Four Issue:
  – These three models limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle. For example, in an SM: Dual Issue processor each thread can issue a maximum of 2 instructions per cycle; therefore, a minimum of 4 threads would be required to fill the 8 issue slots in one cycle.
• SM: Limited Connection:
  – Each hardware context is directly connected to exactly one of each type of functional unit. For example, if the hardware supports eight threads and there are four integer units, each integer unit could receive instructions from exactly two threads. The partitioning of functional units among threads is thus less dynamic than in the other models, but each functional unit is still shared (the critical factor in achieving high utilization).
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403. (SMT-1)
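The per-thread issue limits above imply a minimum thread count to fill the machine each cycle, as a quick sketch shows (model names as defined above):

```python
import math

# Assumed 8-wide machine, as in the models described above.
ISSUE_WIDTH = 8

def min_threads_to_fill(per_thread_issue_limit):
    """Fewest threads needed to occupy all 8 issue slots in one cycle."""
    return math.ceil(ISSUE_WIDTH / per_thread_issue_limit)

print(min_threads_to_fill(1))  # 8 -- SM: Single Issue
print(min_threads_to_fill(2))  # 4 -- SM: Dual Issue (as stated above)
print(min_threads_to_fill(4))  # 2 -- SM: Four Issue
```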


Comparison of Multithreaded CPU Models' Complexity
(Figure: a comparison of key hardware complexity features of the various models; H = high complexity.)
The comparison takes into account:
  – the number of ports needed for each register file,
  – the dependence checking for a single thread to issue multiple instructions,
  – the amount of forwarding logic,
  – and the difficulty of scheduling issued instructions onto functional units.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403. (SMT-1)


Simultaneous vs. Fine-Grain Multithreading Performance
(Figure: instruction throughput (IPC) as a function of the number of threads. (a)-(c) show the throughput by thread priority for particular models, and (d) shows the total throughput for all threads for each of the six machine models. The lowest segment of each bar is the contribution of the highest-priority thread to the total throughput.)
• Workload: SPEC92.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403. (SMT-1)


Simultaneous Multithreading vs. Single-Chip Multiprocessing
(Figure: results for the multiprocessor (MP) vs. simultaneous multithreading (SM) comparisons.)
• The multiprocessor always has one functional unit of each type per processor.
• In most cases the SM processor has the same total number of each FU type as the MP.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403. (SMT-1)


Impact of Level 1 Cache Sharing on SMT Performance
(Figure: results for the simulated cache configurations, shown relative to the throughput, in instructions per cycle, of the 64s.64p configuration.)
• The caches are specified as: [total I-cache size in KB][private or shared].[D-cache size in KB][private or shared]. For instance, 64p.64s has eight private 8 KB I-caches and a shared 64 KB data cache.
• The best overall performance of the configurations considered is achieved by 64s.64s (shared 64 KB instruction cache, shared 64 KB data cache).
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403. (SMT-1)


The Impact of Increased Multithreading on Some Low-Level Metrics for the Base SMT Architecture
(Figure.)
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202. (SMT-2)


Possible SMT Thread Instruction Fetch Scheduling Policies
• Round Robin:
  – Instructions from Thread 1, then Thread 2, then Thread 3, etc. (e.g. RR.1.8: each cycle, one thread fetches up to eight instructions; RR.2.4: each cycle, two threads fetch up to four instructions each).
• BR-Count:
  – Give highest priority to those threads that are least likely to be on a wrong path, by counting the branch instructions that are in the decode stage, the rename stage, and the instruction queues, favoring those with the fewest unresolved branches.
• MISS-Count:
  – Give priority to those threads that have the fewest outstanding data cache misses.
• ICount:
  – Highest priority is assigned to the thread with the lowest number of instructions in the static portion of the pipeline (decode, rename, and the instruction queues).
• IQPOSN:
  – Give lowest priority to those threads with instructions closest to the head of either the integer or floating-point instruction queues (the oldest instruction is at the head of the queue).
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202. (SMT-2)
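As an illustration, the ICOUNT selection rule above amounts to a simple argmin over per-thread front-end occupancy (hypothetical helper; the counts are illustrative):

```python
# Hypothetical sketch of the ICOUNT heuristic: fetch from the thread holding
# the fewest instructions in the pre-issue stages (decode, rename, IQs).
def icount_pick(front_end_counts):
    """front_end_counts[t] = instructions thread t currently has in the
    static portion of the pipeline; the lowest count wins fetch priority."""
    return min(range(len(front_end_counts)), key=lambda t: front_end_counts[t])

print(icount_pick([12, 3, 7, 9]))  # 1 -- thread 1 is least likely to clog the IQ
```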


Instruction Throughput for Round Robin Instruction Fetch Scheduling
(Figure.)
• Workload: SPEC92.
• The best overall instruction throughput is achieved using round robin RR.2.8 (in each cycle, two threads each fetch a block of 8 instructions).
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202. (SMT-2)


Instruction Throughput and Thread Fetch Policy
(Figure.)
• All other fetch heuristics provide speedup over round robin.
• Instruction count (ICOUNT.2.8) provides the most improvement: 5.3 instructions/cycle vs. 2.5 for the unmodified superscalar.
• ICOUNT: highest priority is assigned to the thread with the lowest number of instructions in the static portion of the pipeline (decode, rename, and the instruction queues).
• Workload: SPEC92.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202. (SMT-2)


Low-Level Metrics for Round Robin 2.8 and ICOUNT 2.8
(Figure.)
• ICOUNT improves on the performance of round robin by 23%, by reducing instruction queue (IQ) clog through selecting a better mix of instructions to queue.


Possible SMT Instruction Issue Policies
• OLDEST FIRST: issue the oldest instructions (those deepest into the instruction queue); the default.
• OPT LAST and SPEC LAST: issue optimistic and speculative instructions after all others have been issued.
• BRANCH FIRST: issue branches as early as possible in order to identify mispredicted branches quickly.
• Instruction issue bandwidth is not a bottleneck in SMT, as shown above.
• The ICOUNT.2.8 fetch policy was used for all of the issue policies above.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202. (SMT-2)
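The default OLDEST FIRST policy can be sketched as a sort by instruction age; the (sequence_number, thread_id) tuple shape is an assumption for illustration:

```python
# Sketch of OLDEST_FIRST issue selection. Instructions are modeled as
# (sequence_number, thread_id) tuples -- an assumed, simplified shape.
def oldest_first(ready_instructions, issue_width=8):
    """Pick up to issue_width ready instructions, oldest (lowest seq) first."""
    return sorted(ready_instructions, key=lambda inst: inst[0])[:issue_width]

print(oldest_first([(5, 0), (1, 1), (3, 0)], issue_width=2))
# [(1, 1), (3, 0)]
```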


SMT: Simultaneous Multithreading
• Strengths:
  – Overcomes the limitations imposed by low single-thread instruction-level parallelism.
  – Multiple threads running will hide individual control hazards (branch mispredictions).
• Weaknesses:
  – Additional stress placed on the memory hierarchy.
  – Control unit complexity.
  – Sizing of resources (caches, branch prediction tables, TLBs, etc.).
  – Accessing registers (32 integer + 32 FP for each HW context):
    • Some designs devote two clock cycles to both register reads and register writes.