NVIDIAS FERMI THE FIRST COMPLETE GPU COMPUTING ARCHITECTURE































- Slides: 31
NVIDIA’S FERMI: THE FIRST COMPLETE GPU COMPUTING ARCHITECTURE A WHITE PAPER BY PETER N. GLASKOWSKY Presented by: Ahmad Hammad Course: CSE 661 - Fall 2011
Outline Ø Ø Ø Introduction What is GPU Computing? Fermi The Programming Model Ø The Streaming Multiprocessor Ø The Cache and Memory Hierarchy Ø Ø Conclusion
Introduction Traditional microprocessor technology see diminishing returns. Improvement in clock speeds and architectural sophistication is slowing Focus has shifted to multicore designs. These too are reaching practical limits for personal computing.
Introduction (2) CPUs are optimized for applications where work done by limited number of threads Threads exhibit high data locality Mix of different operations High percentage of conditional branches. CPUs are inefficient for high-performance computing applications Integer and floating-point execution units is small Most of the CPU space, complexity and heat it generates, devoted to: Caches, instruction decoders, branch predictors, and other features to enhance single-threaded performance.
Introduction (3) GPU design aims applications with multiple threads dominated by long sequences of computational instructions. GPUs are much better at thread handling, data caching, virtual memory management, flow control, and other CPU-like features. CPUs will never go away, but GPUs deliver more cost-effective and energyefficient performance.
Introduction (4) The key GPU design goal is to maximize floating-point throughput. Most of the circuitry within each core is dedicated to computation, rather than speculative features power consumed by GPUs goes into the application’s actual algorithmic work.
What is GPU Computing? Use of a graphics processing unit to do general purpose scientific and engineering computing. GPU computing not a replacement for CPU computing. Each approach has advantages for certain kinds of software. CPU and GPU work together in a heterogeneous co-processing computing model. The sequential part of the application runs on the CPU the computationally-intensive part is accelerated by the GPU.
What is GPU Computing? From the user’s perspective, the application just runs faster because of using the GPU to boost performance.
Fermi Code name for NVIDIA’s next-generation CUDA architecture consists of 16 streaming multiprocessors (SMs) each consisting of 32 cores each can execute one floating-point or integer instruction per clock. The SMs are supported by a second-level cache Host interface Giga. Thread scheduler Multiple DRAM interfaces.
Fermi Code name for NVIDIA’s next-generation CUDA architecture
The Programming Model Complexity of the Fermi architecture is managed by multi-level programming model allows software developers to focus on algorithm design No need to know details about mapping algorithm to the hardware improve productivity
The Programming Model Kernels In NVIDIA’s CUDA software platform the computational elements of algorithms called kernels Kernels can be written in the C language extended with additional keywords to express parallelism directly Once compiled, kernels consist of many threads that execute the same program in parallel
The Programming Model Thread Blocks
The Programming Model Warps Thread blocks are divided into warps of 32 threads. The warp is the fundamental unit of dispatch within a single SM. Two warps from different thread blocks can be issued and executed concurrently increase hardware utilization and energy efficiency. Thread blocks are grouped into grids each executes a unique kernel
The Programming Model At any one time, the entire Fermi device is dedicated to a single application. an application may include multiple kernels. Fermi supports simultaneous execution of multiple kernels from the same application Each kernel distributed to one or more SMs This capability increases the utilization of the device
The Programming Model Giga. Thread Switching from one application to another needs 25µsec Short enough to maintain high utilization even when running multiple applications This switching is managed by Giga. Thread (hardware thread scheduler) Manages 1, 536 simultaneously active threads for each streaming multiprocessor across 16 kernels.
The Programming Model Languages Fermi support C-language FORTRAN (with independent solutions from The Portland Group and NOAA) Java, Matlab, and Python Fermi brings new instruction level support for C++ previously unsupported on GPUs will make GPU computing more widely available than ever.
Supported software platforms NVIDIA’s own CUDA development environment The Open. CL standard managed by the Khronos Group Microsoft’s Direct Compute API.
The Streaming Multiprocessor Comprise 32 cores each: can perform floating-point and integer operations 16 load-store units for memory operations four special-function units 64 K of local SRAM split between cache and local memory.
The Streaming Multiprocessor core Floating-point operations follow the IEEE 7542008 floating-point standard. Each core can perform one single-precision fused multiply-add operation in each clock period one double-precision fused multiply-add FMA in two clock periods. no rounding off in the intermediate result Fermi performs more than 8× as many doubleprecision operations per clock than previous GPU generations
The Streaming Multiprocessor core FMA support increases the accuracy and performance of other mathematical operations division and square root extended-precision arithmetic interval arithmetic Linear algebra. The integer ALU supports the usual mathematical and logical operations including values. multiplication, on both 32 -bit and 64 -bit
The Streaming Multiprocessor Memory operations Handled by a set of 16 load-store units in each SM. load/store instructions refer to memory in terms of two-dimensional arrays providing addresses in terms of x and y values. Data can be converted from one format to another as it passes between DRAM and the core registers at the full rate. examples of optimizations unique to GPUs
The Streaming Multiprocessor four Special Function Units handle special operations such as sin, cos and exp Four of these operations can be issued per cycle in each SM.
The Streaming Multiprocessor execution blocks Within Fermi SM has four execution blocks Cores are divided into two execution blocks: 16 cores each. One block offer 16 load-store units One block of the four SFUs, In each cycle, 32 instructions can be dispatched from one or two warps to these blocks. Two cycles to execute the 32 instructions on the cores or load/store units. 32 special-function instructions can issued in single cycle takes eight cycles to complete on the four SFUs. (32/4
This figure shows how instructions are issued to the execution blocks.
The Cache and Memory Hierarchy L 1 Fermi architecture provides local memory in each SM, can be split Shared memory First-level (L 1) cache for global memory references. The local memory is 64 K in size Split 16 K/48 K or 48 K/16 K between L 1 cache and shared memory. Depends on How much shared memory is needed, how predictable the kernel’s accesses to global memory are likely to be.
The Cache and Memory Hierarchy L 2 Fermi come with an L 2 cache 768 KB in size for a 512 -core chip. Covers GPU local DRAM as well as system memory. The L 2 cache subsystem implements: Set of memory read-modify-write atomic operations Managing access to data shared across thread blocks or kernels. atomic operations are 5× to 20× faster than on previous GPUs using conventional synchronization methods.
The Cache and Memory Hierarchy DRAM The final stage of the local memory hierarchy. Fermi provides six 64 -bit DRAM channels that support SDDR 3 and GDDR 5 DRAMs. Up to 6 GB of GDDR 5 DRAM can be connected to the chip.
Error Correcting Code ECC Fermi is the first GPU to provide ECC protects DRAM, register files, shared memories, L 1 and L 2 caches. The level of protection is known as SECDED: single (bit) error correction, double error detection. Instead of each 64 -bit memory channel carrying eight extra bits for ECC information NVIDIA has a secrete solution for packing the ECC bits into reserved lines of memory.
Conclusion CPUs is the best for dynamic workloads with short sequences of computational operations and unpredictable control flow. workloads dominated by computational work performed within a simpler control flow need GPU