Lecture 6: Multicore Systems

Multicore Computers (chip multiprocessors)
• Combine two or more processors (cores) on a single piece of silicon
• Each core consists of ALU, registers, pipeline hardware, L1 instruction and data caches
• Multithreading is used

Pollack’s Rule
• Performance increase is roughly proportional to the square root of the increase in complexity:
  performance ∝ √complexity
• Power consumption increase is roughly linearly proportional to the increase in complexity:
  power consumption ∝ complexity

Pollack’s Rule

  complexity   power   performance
      1          1          1
      4          4          2
     25         25          5

• 100s of low-complexity cores, each operating at very low power
• Ex: Four small cores (complexity 1 each): complexity 4 × 1 = 4, power 4 × 1 = 4, performance 4 × 1 = 4
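
A minimal C sketch of the arithmetic behind this example (the complexity budget of 4, the four small cores, and the assumption of perfectly parallel work come from the table above; nothing else is added):

#include <stdio.h>
#include <math.h>

/* Pollack's Rule sketch: performance ~ sqrt(complexity), power ~ complexity.
   Compare one large core of complexity C with N small cores of complexity C/N. */
int main(void) {
    double C = 4.0;   /* total complexity (and power) budget */
    int N = 4;        /* number of small cores */
    double big_perf   = sqrt(C);           /* one big core: sqrt(4) = 2 */
    double small_perf = N * sqrt(C / N);   /* four small cores: 4 * sqrt(1) = 4, if fully parallel */
    printf("one big core:   power %.0f, performance %.1f\n", C, big_perf);
    printf("%d small cores: power %.0f, performance %.1f\n", N, C, small_perf);
    return 0;
}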

Increasing CPU Performance: Manycore Chip
• Composed of hybrid cores
  • Some general purpose
  • Some graphics
  • Some floating point

Exascale Systems
• Board composed of multiple manycore chips sharing memory
• Rack composed of multiple boards
• A room full of these racks
• Millions of cores
• Exascale systems (10^18 FLOP/s)

Moore’s Law Reinterpreted
• Number of cores per chip doubles every 2 years
• Number of threads of execution doubles every 2 years

Shared Memory MIMD
(Diagram: processors P connected to a shared memory over a bus)
• Single address space
• All processes have access to the pool of shared memory

Shared Memory MIMD
(Diagram: control units (CU) feeding instructions to processing elements (PE), all exchanging data with a shared memory)
• Each processor executes different instructions asynchronously, using different data
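
As a rough C illustration of the single-address-space idea (this code is not from the slides): two POSIX threads update different halves of one global array, and the main thread then reads the results directly, with no explicit data transfer between processors.

#include <pthread.h>
#include <stdio.h>

#define N 8
static int shared_data[N];           /* lives in the single shared address space */

static void *worker(void *arg) {
    int half = *(int *)arg;          /* 0 -> first half, 1 -> second half */
    for (int i = half * N / 2; i < (half + 1) * N / 2; i++)
        shared_data[i] = i * i;      /* both threads write the same array */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < N; i++)
        printf("%d ", shared_data[i]);   /* main thread reads the shared results */
    printf("\n");
    return 0;
}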

Symmetric Multiprocessors (SMP)
(Diagram: processors, each with L1 and L2 caches, connected by a system bus to main memory and I/O)
• MIMD
• Shared memory
• UMA

Symmetric Multiprocessors (SMP)
Characteristics:
• Two or more similar processors
• Processors share the same memory and I/O facilities
• Processors are connected by a bus or other internal connection scheme, such that memory access time is the same for each processor
• All processors share access to I/O devices
• All processors can perform the same functions
• The system is controlled by an integrated operating system that provides interaction between processors and their programs

Symmetric Multiprocessors (SMP)
Operating system:
• Provides tools and functions to exploit the parallelism
• Schedules processes or threads across all of the processors
• Takes care of
  • scheduling of threads and processes on processors
  • synchronization among processors
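
A hedged C sketch of what this looks like from user level (not from the slides; it assumes a Unix-like OS for _SC_NPROCESSORS_ONLN and an OpenMP compiler): the program asks how many processors are online and starts one thread team, leaving the placement of threads on processors to the operating system and the runtime.

#include <stdio.h>
#include <unistd.h>   /* sysconf */
#include <omp.h>

int main(void) {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);   /* processors the OS can schedule on */
    printf("online processors: %ld\n", ncpu);

    /* The OS / OpenMP runtime decides which core runs each thread. */
    #pragma omp parallel
    {
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}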

Multicore Computers
(Diagram: CPU core 1 … CPU core n, each with its own L1-I and L1-D caches; L2 cache, main memory, and I/O)
Dedicated L1 cache (ARM11 MPCore)

Multicore Computers
(Diagram: CPU core 1 … CPU core n, each with its own L1-I, L1-D, and L2 caches; main memory and I/O)
Dedicated L2 cache (AMD Opteron)

Multicore Computers
(Diagram: CPU core 1 … CPU core n, each with its own L1-I and L1-D caches, sharing one L2 cache; main memory and I/O)
Shared L2 cache (Intel Core Duo)

Multicore Computers
(Diagram: CPU core 1 … CPU core n, each with its own L1-I, L1-D, and L2 caches, sharing an L3 cache; main memory and I/O)
Shared L3 cache (Intel Core i7)

Multicore Computers
Advantages of shared L2 cache:
• Reduced overall miss rate
  • A thread on one core may cause a frame to be brought into the cache; a thread on another core may then access the same location that has already been brought into the cache
• Data shared by multiple cores is not replicated
• The amount of shared cache allocated to each core may be dynamic
• Interprocessor communication is easy to implement
Advantages of dedicated L2 cache:
• Each core can access its private cache more rapidly
L3 cache:
• When the amount of memory and the number of cores grow, an L3 cache provides better performance

Multicore Computers
On-chip interconnects:
• Bus
• Crossbar
Off-chip communication (CPU-to-CPU or I/O):
• Bus-based

Multicore Computers (chip multiprocessors)
• Combine two or more processors (cores) on a single piece of silicon
• Each core consists of ALU, registers, pipeline hardware, L1 instruction and data caches
• Multithreading is used

Multicore Computers: Multithreading
A multithreaded processor provides a separate PC for each thread (hardware multithreading)
• Implicit multithreading
  • Concurrent execution of multiple threads extracted from a single sequential program
• Explicit multithreading
  • Execute instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines

Multicore Computers: Explicit Multithreading
• Fine-grained multithreading (interleaved multithreading)
  • Processor deals with two or more thread contexts at a time
  • Switching from one thread to another at each clock cycle
• Coarse-grained multithreading (blocked multithreading)
  • Instructions of a thread are executed sequentially until an event that causes a delay (e.g., a cache miss) occurs
  • This event causes a switch to another thread
• Simultaneous multithreading (SMT)
  • Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
  • Thread-level parallelism is combined with instruction-level parallelism (ILP)
• Chip multiprocessing (CMP)
  • Each processor of a multicore system handles separate threads

Coarse-grained, Fine-grained, Simultaneous Multithreading, CMP

GPUs (Graphics Processing Units)
Characteristics of GPUs:
• GPUs are accelerators for CPUs
• SIMD
• GPUs have many parallel processors and many concurrent threads (i.e., 10 or more cores; 100s or 1000s of threads per core)
• The CPU-GPU combination is an example of heterogeneous computing
• GPGPU (general-purpose GPU): using a GPU to perform applications traditionally handled by the CPU
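
As a sketch of the GPGPU idea in the C/OpenMP style used later in these slides (this example is not from the slides and assumes a compiler with OpenMP target-offload support): a data-parallel loop is offloaded to an attached accelerator, and it falls back to the host CPU if no device is available.

#include <stdio.h>

#define N 1000000

static float x[N], y[N];   /* arrays shared between host and device via map clauses */

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }   /* initialize on the host */

    /* Offload a SIMD-friendly loop to the accelerator: one logical thread per element. */
    #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}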

GPUs

GPUs: Core Complexity
• Out-of-order execution
• Dynamic branch prediction
• Larger pipelines for higher clock rates
→ More circuitry → High performance

GPUs
Complex cores are preferable for:
• Highly instruction-parallel numeric applications
• Floating-point applications
A large number of simple cores is preferable when:
• The application’s serial part is small

Cache Performance
• Intel Core i7

Roofline Performance Model
Arithmetic intensity is the ratio of floating-point operations in a program to the number of data bytes accessed by the program from main memory:

  Arithmetic intensity = floating-point operations / number of data bytes = FLOPs/byte
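
For example, each iteration of the STREAM Triad kernel shown at the end of these slides (a[j] = b[j] + scalar*c[j]) performs 2 floating-point operations while touching three 8-byte doubles, so, counting only that compulsory traffic, its arithmetic intensity is roughly 2/24 ≈ 0.083 FLOPs/byte.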

Roofline Performance Model

  Attainable GFLOPs/second = min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)

Roofline Performance Model
• Peak floating-point performance is given by the hardware specifications of the computer (FLOPs/second)
  • For multicore chips, peak performance is the collective performance of all the cores on the chip, so multiply the peak per chip by the number of chips
• Peak memory performance is also given by the hardware specifications of the computer (Mbytes/second)
• The maximum floating-point performance that the memory system of the computer can support for a given arithmetic intensity can be plotted as
  Peak memory bandwidth × Arithmetic intensity: (bytes/second) × (FLOPs/byte) = FLOPs/second
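
A minimal C sketch of this calculation (the peak floating-point rate and peak bandwidth below are placeholder assumptions, not values from the lecture): it evaluates the roofline formula for a few arithmetic intensities.

#include <stdio.h>

/* Roofline: attainable GFLOP/s = min(peak_bw * AI, peak_flops). */
static double roofline(double peak_gflops, double peak_bw_gbs, double ai) {
    double memory_bound = peak_bw_gbs * ai;   /* GB/s * FLOP/byte = GFLOP/s */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    double peak_gflops = 100.0;   /* assumed peak floating-point performance */
    double peak_bw_gbs = 25.0;    /* assumed peak memory bandwidth */
    double ai[] = {0.083, 0.5, 2.0, 8.0};   /* arithmetic intensities in FLOPs/byte */
    for (int i = 0; i < 4; i++)
        printf("AI %.3f -> attainable %.1f GFLOP/s\n",
               ai[i], roofline(peak_gflops, peak_bw_gbs, ai[i]));
    return 0;
}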

Roofline Performance Model
• The roofline sets an upper bound on performance
• The roofline of a computer does not vary by benchmark kernel

Stream Benchmark
• A synthetic benchmark
• Measures the performance of long vector operations
• The kernels have no temporal locality and they access arrays that are larger than the cache size
• http://www.cs.virginia.edu/stream/ref.html

#define N 2000000
/* ... (declarations of the double arrays a, b, c of length N elided) */

void tuned_STREAM_Copy() {
    int j;
#pragma omp parallel for
    for (j = 0; j < N; j++)
        c[j] = a[j];
}

void tuned_STREAM_Scale(double scalar) {
    int j;
#pragma omp parallel for
    for (j = 0; j < N; j++)
        b[j] = scalar * c[j];
}

void tuned_STREAM_Add() {
    int j;
#pragma omp parallel for
    for (j = 0; j < N; j++)
        c[j] = a[j] + b[j];
}

void tuned_STREAM_Triad(double scalar) {
    int j;
#pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];
}
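
Note that STREAM reports memory bandwidth rather than a FLOP rate: with 8-byte doubles, Copy and Scale move 16 bytes per loop iteration and Add and Triad move 24 bytes, and the reported MB/s is essentially those byte counts times N divided by the measured time of each kernel.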