Chapter 6 Future Processors to use Coarse-Grain Parallelism

Future processors to use coarse-grain parallelism
• Chip multiprocessors (CMPs) or multiprocessor chips
  – integrate two or more complete processors on a single chip,
  – every functional unit of a processor is duplicated.
• Simultaneous multithreaded (SMT) processors
  – store multiple contexts in different register sets on the chip,
  – the functional units are multiplexed between the threads,
  – instructions of different contexts are executed simultaneously.

Principal chip multiprocessor alternatives
• Symmetric multiprocessor (SMP)
• Distributed shared memory multiprocessor (DSM)
• Message-passing shared-nothing multiprocessor

Organizational principles of multiprocessors
[Figure: three organizations. Symmetric multiprocessor (SMP): processors access a global memory over the interconnection, forming one shared address space. Distributed-shared-memory (DSM) multiprocessor: memory is physically distributed as local memories but still forms one shared address space. Message-passing (shared-nothing) multiprocessor: distributed address spaces; processors with local memory communicate by send/receive over the interconnection.]

Typical SMP
[Figure: each processor with a primary and a secondary cache, attached via a bus to the global memory.]

Shared memory candidates for CMPs: shared main memory and shared secondary cache
[Figure: left, two processors with private primary and secondary caches sharing the global memory; right, two processors with private primary caches sharing the secondary cache and the global memory.]

Shared memory candidates for CMPs: shared primary cache
[Figure: two processors sharing the primary cache, the secondary cache, and the global memory.]

Grain levels for CMPs
• multiple processes in parallel
• multiple threads from a single application, which implies a common address space for all threads
• extracting threads of control dynamically from a single instruction stream (see last chapter: multiscalar, trace processors, ...)

Texas Instruments TMS320C80 Multimedia Video Processor
[Figure: four Advanced DSPs (0-3), each with parameter RAM, three data RAMs, an I-cache, and local (L), global (G), and instruction (I) ports, connected over 32- and 64-bit paths to a master processor (MP) with FPU, a transfer controller (TC), a video controller (VC), and a test access port (TAP).]

Hydra: a single-chip multiprocessor
[Figure: a single chip with four CPUs, each with primary I-cache, primary D-cache, and its own memory controller, connected through centralized bus arbitration mechanisms to the on-chip secondary cache, a Rambus memory interface to off-chip DRAM main memory, an off-chip L3 interface to an SRAM cache array, and a DMA/I/O bus interface to I/O devices.]

Conclusions on CMP
• Usually, a CMP will feature:
  – separate L1 I-cache and D-cache per on-chip CPU
  – and an optional unified L2 cache.
• If the CPUs always execute threads of the same process, the L2 cache organization is simplified, because different processes do not have to be distinguished.
• Recently announced commercial processors with CMP hardware:
  – IBM POWER4 processor with two processors on a single die,
  – Sun MAJC-5200 with two processors on a die (each processor a four-threaded block-interleaving VLIW).

Multithreaded processors
Aim: latency tolerance
• What is the problem? Load access latencies measured on an AlphaServer 4100 SMP with four 300 MHz Alpha 21164 processors are:
  – 7 cycles for a primary-cache miss that hits in the on-chip L2 cache of the 21164 processor,
  – 21 cycles for an L2-cache miss that hits in the (board-level) L3 cache,
  – 80 cycles for a miss that is served by the memory, and
  – 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory.
• Multithreaded processors are able to bridge such latencies by switching to another thread of control, in contrast to chip multiprocessors.
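
How many threads does latency bridging take? A minimal back-of-envelope sketch (an illustration using the standard analytic multithreading model, not a calculation from the slides): if a thread computes for an average run length of R cycles before stalling for L cycles, and a context switch costs C cycles, the stall is fully hidden once the other N - 1 threads supply enough work, i.e., (N - 1)(R + C) >= L. The run length R = 10 used below is an assumed value.

```c
#include <math.h>
#include <stdio.h>

/* Threads needed so that (N-1)*(R+C) >= L, i.e., the other threads'
 * work covers one thread's stall of L cycles (illustrative model).   */
static int threads_to_hide(double R, double L, double C) {
    return (int)ceil(1.0 + L / (R + C));
}

int main(void) {
    /* Alpha 21164 latencies quoted on the slide; switch cost C = 0 as
     * in cycle-by-cycle interleaving; assumed run length R = 10.      */
    const double latencies[] = { 7.0, 21.0, 80.0, 125.0 };
    for (int i = 0; i < 4; i++)
        printf("L = %3.0f cycles -> %d threads\n",
               latencies[i], threads_to_hide(10.0, latencies[i], 0.0));
    return 0;
}
```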

Multithreaded processors
• Multithreading:
  – provide several program counters (and usually several register sets) on chip,
  – fast context switching by switching to another thread of control.
[Figure: four hardware thread contexts, each with its own register set, program counter (PC), and processor status register (PSR).]
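
As a minimal sketch of the replicated per-thread state (an illustration, not the layout of any particular processor), the hardware contexts can be pictured as an on-chip array; "context switching" is then just selecting another entry, with nothing to save or restore:

```c
#include <stdint.h>

#define NUM_THREADS 4
#define NUM_REGS    32

/* Per-thread hardware context, replicated on chip (illustrative). */
typedef struct {
    uint64_t regs[NUM_REGS];   /* register set     */
    uint64_t pc;               /* program counter  */
    uint64_t psr;              /* processor status */
} hw_context;

static hw_context ctx[NUM_THREADS];  /* all contexts live on chip, */
static int current = 0;              /* so switching is just this: */

/* Fast "context switch": no registers are spilled to memory; the
 * processor simply fetches from another context's PC next cycle.  */
static inline void switch_to(int thread) { current = thread; }
```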

Approaches of multithreaded processors
• Cycle-by-cycle interleaving
  – An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.
• Block interleaving
  – The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch.
• Simultaneous multithreading
  – Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor;
  – it combines wide superscalar instruction issue with multithreading.

Comparison of multithreading with non-multithreading approaches:
[Figure: occupied pipeline slots over time for (a) single-threaded scalar, (b) cycle-by-cycle interleaving multithreaded scalar, (c) block-interleaving multithreaded scalar.]

Comparison of multithreading with non-multithreading approaches:
[Figure: occupied issue slots over time for (a) superscalar, (b) VLIW, (c) cycle-by-cycle interleaving superscalar, (d) cycle-by-cycle interleaving VLIW.]

Comparison of multithreading with non-multithreading approaches:
[Figure: occupied issue slots over time for simultaneous multithreading (SMT) and a chip multiprocessor (CMP).]

Cycle-by-cycle interleaving
• The processor switches to a different thread after each instruction fetch.
• Pipeline hazards cannot arise, so the processor pipeline can be built easily, without the necessity of complex forwarding paths.
• Context-switching overhead is zero cycles.
• Memory latency is tolerated by not scheduling a thread until the memory transaction has completed.
• Requires at least as many threads as pipeline stages in the processor.
• Degrades single-thread performance if not enough threads are present.
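
A minimal sketch of the per-cycle thread selection (an illustration of the principle, not a specific machine): threads are fetched in round-robin order, and a thread waiting on memory is simply skipped, so with at least as many ready threads as pipeline stages no two instructions of the same thread are in flight together.

```c
#define PIPE_STAGES 8
#define NUM_THREADS 8   /* >= PIPE_STAGES for full pipeline utilization */

/* A thread is ready unless it waits on an outstanding memory access. */
static int ready[NUM_THREADS];

/* Cycle-by-cycle interleaving: pick the next ready thread in strict
 * round-robin order; stalled threads are skipped until their memory
 * transaction completes.                                             */
static int next_thread(int last) {
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (last + i) % NUM_THREADS;
        if (ready[t])
            return t;
    }
    return -1;  /* no thread ready: pipeline bubble this cycle */
}
```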

Cycle-by-cycle interleaving: improving single-thread performance
• The dependence look-ahead technique adds several bits to each instruction format in the ISA.
  – The scheduler feeds instructions of the same thread that are neither data- nor control-dependent successively into the pipeline.
• The interleaving technique proposed by Laudon et al. adds caching and full pipeline interlocks to the cycle-by-cycle interleaving approach.
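
A hedged sketch of the look-ahead idea (field name and width are illustrative assumptions, not the actual encoding): the compiler records in each instruction how many of the immediately following instructions of the same thread are independent of it, and the scheduler may keep that thread in the pipeline for that many additional cycles.

```c
#include <stdint.h>

/* Illustrative instruction word with a small look-ahead field set by
 * the compiler: the number of immediately following instructions of
 * the same thread that depend neither on this instruction's data nor
 * on its control outcome (3-bit width assumed for illustration).     */
typedef struct {
    uint32_t opcode;
    uint8_t  lookahead;  /* 0..7 independent successors */
} instr;

/* The scheduler may issue this many instructions of the same thread
 * back to back before it must interleave other threads again.        */
static int issue_budget(const instr *i) { return 1 + i->lookahead; }
```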

Tera MTA cycle-by-cycle interleaving technique
• employs the dependence look-ahead technique
• VLIW ISA (3-issue)
• The processor switches context every cycle (3 ns cycle period) among as many as 128 distinct threads, thereby hiding up to 128 cycles (384 ns) of memory latency.
• 128 register sets

Tera processing element
[Figure: register state for 128 threads feeds the A- and C-operation execution pipeline (~16 ticks) and the M-unit; instruction fetch and issue logic with I-cache; memory is accessed asynchronously through the network (~70 ticks average latency); operations report Done/Fail/OK.]

Tera MTA
[Figure: up to 256 computational processors (CP) and up to 256 I/O processors (IOP) are connected by a 3-D toroidal interconnection network to up to 512 memory units (MU) and up to 512 I/O caches (IOC).]

Block interleaving
• Executes a single thread until it reaches a situation that triggers a context switch.
• Typical switching event: the instruction execution reaches a long-latency operation or a situation where a latency may arise.
• Compared to the cycle-by-cycle interleaving technique, a smaller number of threads is needed.
• A single thread can execute at full speed until the next context switch; single-thread performance is similar to the performance of a comparable processor without multithreading.
• IBM NorthStar processors are two-threaded 64-bit PowerPCs with switch-on-cache-miss, implemented in departmental computers (eServers) of IBM since 10/98 (revealed at MTEAC-4, Dec. 2000).
• Recent announcement (Oct. 1999): Sun MAJC-5200, two processors on a die (each processor a four-threaded block-interleaving VLIW).
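
A minimal sketch of the switch-on-cache-miss flavor of block interleaving (an illustration with an assumed cache hook, not IBM's implementation): the current thread runs at full speed, and another hardware context is swapped in only when a memory access misses the cache.

```c
#include <stdbool.h>

#define NUM_THREADS 2   /* e.g., two-threaded like the IBM NorthStar */

static int current = 0;

/* Assumed hook: reports whether the instruction just executed missed
 * in the data cache.                                                 */
extern bool dcache_missed(void);

/* Block interleaving, switch-on-cache-miss: keep executing the current
 * thread until a miss occurs; only then switch contexts, so a single
 * thread runs at full speed between misses.                           */
static void after_instruction(void) {
    if (dcache_missed())
        current = (current + 1) % NUM_THREADS;
}
```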

Interleaving techniques
• multithreading
  – cycle-by-cycle interleaving
  – block interleaving
    – static
      – explicit-switch
      – implicit-switch (switch-on-load, switch-on-store, switch-on-branch, ...)
    – dynamic
      – switch-on-cache-miss
      – switch-on-signal (interrupt, trap, ...)
      – switch-on-use (lazy-cache-miss)
      – conditional-switch (explicit with condition)

Rhamma
[Figure: block diagram of the Rhamma processor, a block-interleaving multithreaded processor.]

Komodo microcontroller
Goal: develop a multithreaded embedded real-time Java microcontroller.
• Java processor core
  – bytecode as machine language: portability across all platforms,
  – dense machine code, important for embedded applications,
  – fast bytecode execution in hardware, microcode, and traps.
• Interrupts activate interrupt service threads (ISTs) instead of interrupt service routines (ISRs)
  – extremely fast context switch,
  – no blocking of interrupt services.
• Switch-on-signal technique enhanced to very fine-grain switching due to hardware-implemented real-time scheduling algorithms (FPP, EDF, LLF, guaranteed percentage)
  – hard real-time requirements fulfilled (see the EDF sketch below).
• For more information see:
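
As a hedged illustration of one of the named policies (a generic earliest-deadline-first sketch, not Komodo's hardware implementation), the scheduler selects, at instruction grain, the ready interrupt service thread with the nearest deadline:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_ISTS 4

/* Per-IST state as a hardware scheduler might track it (illustrative). */
typedef struct {
    bool     ready;     /* activated by an interrupt   */
    uint32_t deadline;  /* absolute deadline in cycles */
} ist_state;

/* Earliest deadline first (EDF): among the ready interrupt service
 * threads, pick the one whose deadline is nearest.  Evaluated in
 * hardware, this allows very fine-grain (per-cycle) switching.     */
static int edf_pick(const ist_state ist[NUM_ISTS]) {
    int best = -1;
    for (int t = 0; t < NUM_ISTS; t++)
        if (ist[t].ready && (best < 0 || ist[t].deadline < ist[best].deadline))
            best = t;
    return best;  /* -1 if no IST is ready */
}
```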

Komodo microcontroller
[Figure: block diagram of the Komodo microcontroller.]

Nanothreading and microthreading: multithreading in the same register set
• Nanothreading (DanSoft processor) dismisses full multithreading for a nanothread that executes in the same register set as the main thread.
  – The nanothread needs only a 9-bit PC and some simple control logic, and it resides in the same page as the main thread.
  – Whenever the processor stalls on the main thread, it automatically begins fetching instructions from the nanothread.
• The microthreading technique (Bolychevsky et al. 1996) is similar to nanothreading: all threads share the same register set and the same run-time stack, but the number of threads is not restricted to two.
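
A hedged sketch of the nanothreading fetch selection (illustrative only, not the DanSoft design; the stall hook is an assumption): the nanothread owns only a 9-bit PC within the main thread's page and fills the main thread's stall cycles.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t main_pc;
static uint16_t nano_pc;   /* only a 9-bit PC for the nanothread */

/* Assumed hook: true while the main thread is stalled. */
extern bool main_thread_stalled(void);

/* Fetch address selection: the nanothread shares the register set and
 * resides in the same page as the main thread; it is fetched only in
 * the main thread's stall cycles.                                     */
static uint64_t next_fetch(uint64_t page_base) {
    if (main_thread_stalled())
        return page_base + (nano_pc & 0x1FF);  /* 9-bit offset */
    return main_pc;
}
```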

Simultaneous multithreading (SMT)
• The SMT approach combines wide superscalar instruction issue with the multithreading approach
  – by providing several register sets on the processor
  – and issuing instructions from several instruction queues simultaneously.
• The issue slots of a wide-issue processor can be filled by operations of several threads.
• Latencies occurring in the execution of single threads are bridged by issuing operations of the remaining threads loaded on the processor (see the sketch below).
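
A minimal sketch of this slot filling (an illustration of the principle, not a concrete design): each thread contributes ready instructions from its queue, so an empty queue in one thread (a latency) leaves the issue slots to the others.

```c
#define NUM_THREADS 8
#define ISSUE_WIDTH 8

/* ready_insns[t]: ready-to-issue instructions in thread t's queue. */
static int ready_insns[NUM_THREADS];

/* Fill one cycle's issue slots from all threads: a stalled thread has
 * an empty queue and is bridged by issuing from the remaining ones.   */
static int issue_one_cycle(void) {
    int issued = 0;
    for (int t = 0; t < NUM_THREADS && issued < ISSUE_WIDTH; t++) {
        while (ready_insns[t] > 0 && issued < ISSUE_WIDTH) {
            ready_insns[t]--;   /* dispatch one instruction of thread t */
            issued++;
        }
    }
    return issued;
}
```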

Simultaneous multithreading (SMT): hardware organization (1)
• SMT processors can be organized in two ways.
• First: instructions of different threads share all buffer resources in an extended superscalar pipeline.
  – Thus SMT adds minimal hardware complexity to conventional superscalars;
  – hardware designers can focus on building a fast single-threaded superscalar and add multithread capability on top.
  – The complexity added to a superscalar by multithreading comprises a thread tag for each internal instruction representation, multiple register sets, and the ability of the fetch and retire units to fetch and retire instructions of different threads.

Simultaneous multithreading (SMT): hardware organization (2)
• Second: replicate all internal buffers of a superscalar such that each buffer is bound to a specific thread.
  – The issue unit is able to issue instructions of different instruction windows simultaneously to the FUs.
  – This adds more changes to the superscalar processor organization,
  – but leads to a natural partitioning of the instruction window (similar to CMP)
  – and simplifies the issue and retire stages.

Simultaneous multithreading (SMT)
• The SMT fetch unit can take advantage of the interthread competition for instruction bandwidth in two ways:
  – First, it can partition the fetch bandwidth among the threads and fetch from several threads each cycle. Goal: increase the probability of fetching only non-speculative instructions.
  – Second, the fetch unit can be selective about which threads it fetches.
• The main drawback of simultaneous multithreading may be that it complicates the instruction issue stage, which is always central to the multiple threads.
• A functional partitioning, as demanded for processors of the 10^9-transistor era, is therefore not easily reached.
• No simultaneous multithreaded processors exist to date, only simulations. General opinion: SMT will be in next-generation microprocessors.

SMT at the Universities of Washington and San Diego
• Hypothetical out-of-order issue superscalar microprocessor that resembles the MIPS R10000 and HP PA-8000.
• 8 threads and an 8-issue superscalar organization are assumed.
• Eight instructions are decoded, renamed, and fed to either the integer or the floating-point instruction window.
• Unified buffers are used.
• When operands become available, up to 8 instructions are issued out of order per cycle, executed, and retired.
• Each thread can address 32 architectural integer (and floating-point) registers. These registers are renamed onto a large physical register file of 356 physical registers.

SMT at the Universities of Washington and San Diego
[Figure: pipeline with instruction fetch (fetch unit, PC, I-cache), instruction decode with register renaming, and instruction issue from separate floating-point and integer instruction queues into the execution pipelines: floating-point register file and floating-point units; integer register file, integer and load/store units, and D-cache.]

SMT at the Universities of Washington and San Diego: instruction fetching schemes
• Basic: round-robin, the RR.2.8 fetching scheme, i.e., in each cycle, two times 8 instructions are fetched in round-robin policy from two different threads,
  – superior to other schemes like RR.1.8, RR.4.2, and RR.2.4.
• Other fetch policies:
  – The BRCOUNT scheme gives highest priority to those threads that are least likely to be on a wrong path.
  – The MISSCOUNT scheme gives priority to the threads that have the fewest outstanding D-cache misses.
  – The IQPOSN policy gives lowest priority to the oldest instructions by penalizing those threads with instructions closest to the head of either the integer or the floating-point queue.
  – The ICOUNT feedback technique gives highest fetch priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages (a selection sketch follows below).
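
A minimal sketch of ICOUNT.2.8 thread selection (an illustration of the policy, not the simulator's code): each cycle, the fetch unit picks the two threads with the fewest instructions in the front-end stages and fetches up to eight instructions from each.

```c
#define NUM_THREADS 8

/* icount[t]: thread t's instructions currently in the decode, rename,
 * and instruction-queue stages (maintained by the front end).         */
static int icount[NUM_THREADS];

/* ICOUNT.2.8: select the two threads with the lowest counts; up to 8
 * instructions are then fetched from each.  Threads that clog the
 * queues keep a high count and are automatically throttled.           */
static void icount_pick2(int *first, int *second) {
    *first = *second = -1;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (*first < 0 || icount[t] < icount[*first]) {
            *second = *first;
            *first  = t;
        } else if (*second < 0 || icount[t] < icount[*second]) {
            *second = t;
        }
    }
}
```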

SMT at the Universities of Washington and San Diego: instruction fetching schemes
• The ICOUNT policy proved superior!
• The ICOUNT.2.8 fetching strategy reached an IPC of about 5.4 (RR.2.8 reached only about 4.2).
• Most interestingly, the best fetching strategy addresses neither mispredicted branches nor blocking due to cache misses alone, but a mix of both and perhaps some other effects.
• Recently, simultaneous multithreading has been evaluated with
  – SPEC95,
  – database workloads,
  – and multimedia workloads,
  all achieving roughly a threefold IPC increase with an eight-threaded SMT over a single-threaded superscalar with similar resources.

SMT processor with multimedia enhancement: combining SMT and multimedia
• Start with a wide-issue superscalar general-purpose processor.
• Enhance it by simultaneous multithreading.
• Enhance it by multimedia unit(s):
  – utilization of subword parallelism (data-parallel instructions, SIMD),
  – saturation arithmetic (see the sketch below),
  – additional arithmetic, masking and selection, reordering and conversion instructions.
• Enhance it by additional features useful for multimedia processing, e.g., on-chip RAM memory and special cache techniques.
• For more information see: http://goethe.ira.uka.de/people/ungerer/smt-mm/SM-MM-processor.htm
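
To make the multimedia terms concrete, here is a hedged sketch of saturation arithmetic on subwords (a generic illustration, not the proposed processor's instruction set): a 64-bit register is treated as four 16-bit lanes, and each lane clamps to the representable range instead of wrapping around.

```c
#include <stdint.h>

/* Saturating add of one signed 16-bit subword: out-of-range results
 * clamp to INT16_MAX/INT16_MIN instead of wrapping around.           */
static int16_t sat_add16(int16_t a, int16_t b) {
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Subword parallelism: apply the operation to all four 16-bit lanes of
 * a 64-bit word (sequential here; in parallel in a multimedia unit).  */
static uint64_t padd16_sat(uint64_t x, uint64_t y) {
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        int16_t a = (int16_t)(x >> (16 * lane));
        int16_t b = (int16_t)(y >> (16 * lane));
        r |= (uint64_t)(uint16_t)sat_add16(a, b) << (16 * lane);
    }
    return r;
}
```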

The SMT multimedia processor model
[Figure: block diagram of the SMT multimedia processor model.]

Maximum processor configuration: IPCs of 8-threaded, 8-issue cases
• Initial maximum configuration: 2.28
• 16-entry reservation stations for thread, global, and local load/store units (instead of 256): 2.96
• One common 256-entry reservation station unit for all integer/multimedia units (instead of 256-entry reservation stations each): 3.27
• Loads and stores may pass blocked load/stores of other threads: 4.1
• Highest-priority-first, non-speculative-instruction-first, non-saturated-first strategies for issue, dispatch, and retire stages: 4.34
• 32-entry reorder buffer (instead of 256): 4.69
• Second local load/store unit (because of 20.1% local load/stores): 6.07 (6.32 with dynamic branch prediction)

IPC of the "maximum" processor
[Figure: IPC with on-chip RAM and two local load/store units; 4 MB I-cache, D-cache fill burst rate of 6:2:2:2.]

More realistic processor
[Figure: IPC with a D-cache fill burst rate of 32:4:4:4 and an issue bandwidth of 8.]

Speedup
[Figure: speedup of the realistic and the maximum processor: a threefold speedup.]

IPC performance of SMT and CMP (1)
[Figure: SPEC92 simulations, Tullsen et al. vs. Sigmund and Ungerer.]

IPC performance of SMT and CMP (2)
[Figure: SPEC95 simulations, Eggers et al.]
– CMP2: 2 processors, 4-issue superscalar, 2*(1,4)
– CMP4: 4 processors, 2-issue superscalar, 4*(1,2)
– SMT: 8-threaded, 8-issue superscalar, 1*(8,8)

IPC performance of SMT and CMP
[Figure: SPEC95 simulations; performance is given relative to a single 2-issue superscalar as the baseline processor, Hammond et al.]

Comments on the simulation results [Hammond et al.]
• A CMP of eight 2-issue processors outperforms a 12-issue superscalar and a 12-issue, 8-threaded SMT processor on four SPEC95 benchmark programs (hand-parallelized for CMP and SMT).
• The CMP achieved higher performance than the SMT due to a total of 16 issue slots instead of the SMT's 12 issue slots.
• Hammond et al. argue that the design complexity of a 16-issue CMP is similar to that of a 12-issue superscalar or a 12-issue SMT processor.

SMT vs. multiprocessor chip [Eggers et al.]
• SMT obtained better speedups than the (CMP) chip multiprocessors, in contrast to the results of Hammond et al.!
  – Eggers et al. compared an 8-issue, 8-threaded SMT with four 2-issue CMP processors; Hammond et al. compared a 12-issue, 8-threaded SMT with eight 2-issue CMP processors.
• Eggers et al.:
  – Speedups on the CMP were hindered by the fixed partitioning of its hardware resources across the processors.
  – In the CMP, processors were idle when thread-level parallelism was insufficient.
  – Exploiting large amounts of instruction-level parallelism in the unrolled loops of individual threads was not possible due to the CMP processors' smaller issue bandwidth.
• An SMT processor dynamically partitions its resources among threads, and can therefore respond well to variations in both types of parallelism.

Conclusions
• The performance race between SMT and CMP is not yet decided.
• CMP is easier to implement, but only SMT has the ability to hide latencies.
• A functional partitioning is not easily reached within an SMT processor due to the centralized instruction issue.
  – A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.
  – A combination of simultaneous multithreading with the CMP may be superior.
  – We favor a CMP consisting of moderately equipped (e.g., 4-threaded, 4-issue superscalar) SMTs.
• Future research: combine the SMT or CMP organization with the ability to create threads with compiler support or fully dynamically out of a single thread
  – thread-level speculation,
  – close to multiscalar.