Introduction to Multiprocessors and Thread Level Parallelism
Outline

• Motivation
• Multiprocessors
  - SISD, SIMD, MIMD, and MISD
  - Memory organization
  - Communication mechanisms
• Multithreading

Motivation

Instruction-Level Parallelism (ILP), i.e., all we have covered so far:
  - simple pipelining
  - dynamic scheduling: scoreboarding and Tomasulo's algorithm
  - dynamic branch prediction
  - multiple-issue architectures: superscalar, VLIW
  - hardware-based speculation
  - compiler techniques and software approaches

Bottom line: there just aren't enough instructions that can actually be executed in parallel!
  - instruction issue: limit on maximum issue count
  - branch prediction: imperfect
  - number of registers: finite
  - functional units: limited in number
  - data dependencies: hard to detect dependencies through memory

So, What Do We Do?

Key idea: increase the number of running processes
  - multiple processes at a given "point" in time
    · i.e., at the granularity of one (or a few) clock cycles
    · it is not sufficient to have multiple processes at the OS level!

Two approaches:
  - multiple CPUs, each executing a distinct process
    · "multiprocessors" or "parallel architectures"
  - a single CPU executing multiple processes ("threads")
    · "multi-threading" or "thread-level parallelism"

Taxonomy of Parallel Architectures

Flynn's classification:
  - SISD: single instruction stream, single data stream
    · uniprocessor
  - SIMD: single instruction stream, multiple data streams (see the sketch after this list)
    · same instruction executed by multiple processors
    · each processor has its own data memory
    · examples: multimedia processors, vector architectures
  - MISD: multiple instruction streams, single data stream
    · successive functional units operate on the same stream of data
    · rarely found in general-purpose commercial designs
    · special-purpose stream processors (digital filters, etc.)
  - MIMD: multiple instruction streams, multiple data streams
    · each processor has its own instruction and data streams
    · most popular form of parallel processing
      – single-user: high performance for one application
      – multiprogrammed: running many tasks simultaneously (e.g., servers)
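
To make the SIMD category concrete, here is a minimal sketch using the GCC/Clang vector extensions (my choice of notation; this is compiler-specific C, not something the slides prescribe). The single `+` below compiles to one vector instruction that operates on four data elements at once:

```c
/* SIMD in miniature: one instruction stream, multiple data streams.
 * Uses GCC/Clang vector extensions (compiler-specific, illustrative). */
typedef float v4sf __attribute__((vector_size(16)));  /* four packed floats */

v4sf add4(v4sf a, v4sf b)
{
    return a + b;  /* a single vector add updates all four lanes */
}
```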

Multiprocessor: Memory Organization

Centralized, shared-memory multiprocessor:
  - usually few processors
  - processors share a single memory and bus
  - relies on large caches

Multiprocessor: Memory Organization

Distributed-memory multiprocessor:
  - can support large processor counts
    · cost-effective way to scale memory bandwidth
    · works well if most accesses are to the local memory node
  - requires an interconnection network
    · communication between processors becomes more complicated and slower

Multiprocessor: Hybrid Organization

• Use a distributed-memory organization at the top level
• Each node may itself be a shared-memory multiprocessor (2-8 processors)

Communication Mechanisms

• Shared-memory communication (see the sketch after this list)
  - around for a long time, so well understood and standardized
    · memory-mapped
  - ease of programming when communication patterns are complex or vary dynamically
  - better use of bandwidth when items are small
  - problem: cache coherence is harder
    · use "snoopy" and other coherence protocols
• Message-passing communication
  - simpler hardware, because keeping caches coherent is easier
  - communication is explicit, hence simpler to understand
    · focuses programmer attention on communication
  - synchronization is naturally associated with communication
    · fewer errors due to incorrect synchronization
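
The contrast is easiest to see in code. Below is a minimal sketch of shared-memory communication, assuming POSIX threads (my choice of API, not the slides'): the "send" is an ordinary store to shared memory and the "receive" an ordinary load, with a mutex and condition variable supplying the synchronization that a message-passing send/receive pair would bundle in automatically.

```c
/* Compile with: cc shared.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static int shared_value;  /* the datum communicated through shared memory */
static int ready = 0;     /* has the producer written yet? */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    pthread_mutex_lock(&lock);
    shared_value = 42;            /* a plain store is the "send" */
    ready = 1;
    pthread_cond_signal(&cond);   /* synchronization is explicit and separate */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!ready)                /* wait for the producer's store */
        pthread_cond_wait(&cond, &lock);
    printf("received %d\n", shared_value);  /* a plain load is the "receive" */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

In a message-passing system the same exchange would be a single explicit send/receive pair, which is exactly why the slide credits message passing with simpler synchronization.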

Multi-threading

Performance Beyond a Single Thread

• Motivation:
  - some applications have much higher natural parallelism
    · e.g., database or scientific workloads
  - explicit thread-level parallelism or data-level parallelism
• What is a thread?
  - a process with its own instructions and data
  - a thread may be a process, part of a parallel program of multiple processes, or an independent program
  - each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• What is data-level parallelism?
  - performing identical operations on lots of data (see the sketch below)
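
A one-function sketch of data-level parallelism (the function and its name are illustrative, not from the slides): every iteration applies the identical operation to different data, so the iterations are independent and a vector unit or multiple processors can execute many of them at once.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]: the same operation applied over lots of data.
 * No iteration depends on any other, which is what makes the loop a
 * candidate for SIMD or multi-threaded execution. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```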

Multithreading

Threads: multiple processes that share code and data (and much of their address space)
  · recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code

Multithreading: exploiting thread-level parallelism within a processor
  - fine-grain multithreading
    · switch between threads on each instruction!
  - coarse-grain multithreading
    · switch to a different thread only if the current thread has a costly stall
      – e.g., switch only on a level-2 cache miss

Thread-Level Parallelism (TLP)

• ILP vs. TLP
  - ILP exploits implicit parallel operations within a loop or straight-line code segment
  - TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
    · each thread needs its own PC and its own register file (see the sketch below)
• Goal: use multiple instruction streams to improve
  - throughput of computers that run many programs
  - execution time of multi-threaded programs
• TLP can be more cost-effective to exploit than ILP
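
What "its own PC and its own register file" amounts to can be sketched as a per-thread context record. The struct below is illustrative (field names and the register count are my assumptions); everything not listed, such as caches and functional units, remains shared:

```c
#include <stdint.h>

#define NUM_ARCH_REGS 32  /* illustrative architectural register count */

/* State a multithreaded processor replicates per hardware thread. */
struct hw_thread_context {
    uint64_t pc;                   /* private program counter */
    uint64_t regs[NUM_ARCH_REGS];  /* private architectural register file */
    /* caches, functional units, and memory stay shared across threads */
};
```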

Multithreaded Execution

• Multithreading: multiple threads share the functional units of one processor via overlapping
  - the processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
  - memory is shared through the virtual-memory mechanisms, which already support multiple processes
  - hardware support makes a thread switch fast; much faster than a full process switch (which costs hundreds to thousands of clock cycles)
• When to switch?
  - alternate instructions per thread (fine grain)
  - when a thread is stalled, perhaps on a cache miss, another thread can be executed (coarse grain)

Fine-Grain Multithreading

Fine-grain multithreading:
  - switch between threads on each instruction!
  - multiple threads are executed in an interleaved manner
  - the interleaving is usually round-robin (see the sketch below)
  - the CPU must be capable of switching threads on every cycle!
    · fast, frequent switches
  - main disadvantage:
    · slows down the execution of individual threads
    · that is, latency is traded for better throughput
  - example: Sun's Niagara
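
A toy, cycle-by-cycle model of the round-robin interleaving (entirely illustrative; real hardware selects among ready threads and carries far more state than a bare PC):

```c
#include <stdio.h>

#define NTHREADS 4

int main(void)
{
    unsigned pc[NTHREADS] = {0};  /* one private PC per hardware thread */

    for (int cycle = 0; cycle < 8; cycle++) {
        int t = cycle % NTHREADS;  /* a different thread every cycle */
        printf("cycle %d: issue from thread %d (pc=%u)\n", cycle, t, pc[t]);
        pc[t] += 4;                /* "execute" one instruction */
    }
    return 0;
}
```

Note how each individual thread advances only once every NTHREADS cycles: exactly the per-thread slowdown the slide lists as the main disadvantage.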

Coarse-Grain Multithreading

Coarse-grain multithreading:
  - switch only if the current thread has a costly stall (see the sketch below)
    · e.g., a level-2 cache miss
  - can accommodate slightly costlier switches
  - less likely to slow down an individual thread
    · a thread is switched "off" only when it has a costly stall
  - main disadvantage:
    · limited ability to overcome throughput losses
      – shorter stalls are ignored, and there may be plenty of those
    · instructions are issued from a single thread
      – every switch involves emptying and restarting the instruction pipeline
    · hence, better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
  - example: IBM AS/400
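
The switch-on-stall policy as another toy model (the miss probability and refill penalty are invented for illustration). The point to notice is that every switch pays a pipeline-refill cost, which is why only long stalls are worth switching on:

```c
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS       4
#define REFILL_PENALTY 3  /* cycles to empty and restart the pipeline */

int main(void)
{
    int current = 0;
    int cycle = 0;

    while (cycle < 30) {
        if (rand() % 10 == 0) {  /* pretend this instruction missed in L2 */
            current = (current + 1) % NTHREADS;
            printf("cycle %d: L2 miss, switch to thread %d, refill %d cycles\n",
                   cycle, current, REFILL_PENALTY);
            cycle += REFILL_PENALTY;  /* switching is not free */
        } else {
            printf("cycle %d: issue from thread %d\n", cycle, current);
        }
        cycle++;
    }
    return 0;
}
```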

Simultaneous Multithreading (SMT)

Example: recent Pentium processors with "Hyperthreading"

Key idea: exploit ILP across multiple threads! (see the sketch below)
  - i.e., convert thread-level parallelism into more ILP
  - exploit the following features of modern processors:
    · multiple functional units
      – modern processors typically have more functional units available than a single thread can utilize
    · register renaming and dynamic scheduling
      – multiple instructions from independent threads can coexist and co-execute!
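
A last toy model, this time of SMT issue (the issue width and per-thread ready counts are invented): within a single cycle the issue slots are filled from whichever threads have ready instructions, rather than from one thread only.

```c
#include <stdio.h>

#define NTHREADS    2
#define ISSUE_WIDTH 4

int main(void)
{
    int ready[NTHREADS] = {3, 2};  /* instructions ready per thread this cycle */
    int slots = ISSUE_WIDTH;

    /* Fill the cycle's issue slots from any thread with ready work:
     * thread-level parallelism is converted into more ILP. */
    for (int t = 0; t < NTHREADS && slots > 0; t++) {
        int n = ready[t] < slots ? ready[t] : slots;
        slots -= n;
        printf("issue %d instruction(s) from thread %d\n", n, t);
    }
    printf("%d issue slot(s) left idle\n", slots);
    return 0;
}
```

Compare this with fine- and coarse-grain multithreading, which both issue from only one thread in any given cycle.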

Multithreaded Categories

[Figure: issue slots over time (one row per processor cycle) for five organizations: superscalar, coarse-grained multithreading, fine-grained multithreading, simultaneous multithreading, and multiprocessing; shading distinguishes Threads 1-5 and idle slots.]

SMT: Design Challenges

• Dealing with a large register file
  - needed to hold multiple contexts
• Maintaining low overhead on the clock cycle
  - fast instruction issue: choosing what to issue
  - instruction commit: choosing what to commit
  - keeping cache conflicts within acceptable bounds

Example: Power 4

Single-threaded predecessor to the Power 5. Eight execution units in the out-of-order engine, each of which may issue an instruction every cycle.

Power 4 vs. Power 5

[Figure: Power 4 and Power 5 pipeline diagrams; the Power 5 adds 2 commits and 2 fetches (PCs) with 2 initial decodes.]

Power 5 Data Flow

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck.

Power 5 Performance

• Measured on 8-processor IBM servers
• Single-thread (ST) baseline with 8 threads
• SMT with 16 threads
• Note the few cases with a performance loss

Changes in Power 5 to Support SMT

• Increased the associativity of the L1 instruction cache and the instruction address translation buffers
• Added per-thread load and store queues
• Increased the size of the L2 (1.92 vs. 1.44 MB) and L3 caches
• Added separate instruction prefetch and buffering per thread
• Increased the number of virtual registers from 152 to 240
• Increased the size of several issue queues
• The Power 5 core is about 24% larger than the Power 4 core because of the addition of SMT support