NVIDIA Fermi Architecture
Joseph Kider, University of Pennsylvania
CIS 565 - Fall 2011

Administrivia
- Project checkpoint on Monday

Sources
- Patrick Cozzi, Spring 2011
- NVIDIA CUDA Programming Guide
- CUDA by Example
- Programming Massively Parallel Processors

G80, GT200, and Fermi
- November 2006: G80
- June 2008: GT200
- March 2010: Fermi (GF100)
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

New GPU Generation
- What are the technical goals for a new GPU generation?
  - Improve existing application performance. How?
  - Advance programmability. In what ways?

Fermi: What's More?
- More total cores (SPs) – not more SMs, though
- More registers: 32K per SM
- More shared memory: up to 48K per SM
- More Special Function Units (SFUs)

Fermi: What's Faster?
- Faster double precision: 8x over GT200
- Faster atomic operations: 5-20x. What for? (see the sketch below)
- Faster context switches
  - Between applications: 10x
  - Between graphics and compute, e.g., OpenGL and CUDA
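One classic use for fast atomics is histogramming, where thousands of threads race to bump shared counters. A minimal sketch, assuming a hypothetical 256-bin byte histogram (the kernel and buffer names are illustrative, not from the slides):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread classifies one byte and increments the matching bin.
// atomicAdd makes the read-modify-write indivisible, so concurrent
// increments from different warps are never lost.
__global__ void histogram(const unsigned char* data, int n, unsigned* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}

int main() {
    const int n = 1 << 20;
    unsigned char* d_data;
    unsigned* d_bins;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, 256 * sizeof(unsigned));
    cudaMemset(d_data, 7, n);                  // dummy input: every byte is 7
    cudaMemset(d_bins, 0, 256 * sizeof(unsigned));

    histogram<<<(n + 255) / 256, 256>>>(d_data, n, d_bins);

    unsigned h_bins[256];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
    printf("bin 7 = %u\n", h_bins[7]);         // expect 1048576
    cudaFree(d_data);
    cudaFree(d_bins);
    return 0;
}
```

On GT200 this kind of contended global atomic was often a serialization bottleneck; per the slide, Fermi makes it 5-20x faster.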

Fermi: What's New?
- L1 and L2 caches. For compute or graphics?
- Dual warp scheduling
- Concurrent kernel execution (see the stream sketch after this list)
- C++ support
- Full IEEE 754-2008 support in hardware
- Unified address space
- Error Correcting Code (ECC) memory support
- Fixed function tessellation for graphics
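Concurrent kernel execution is exposed through CUDA streams. A minimal sketch with two hypothetical, independent kernels; on Fermi, kernels launched into different non-default streams may run side by side when each leaves SMs idle:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float* x) { int i = blockIdx.x * blockDim.x + threadIdx.x; x[i] += 1.0f; }
__global__ void kernelB(float* y) { int i = blockIdx.x * blockDim.x + threadIdx.x; y[i] *= 2.0f; }

int main() {
    float *x, *y;
    cudaMalloc(&x, 1024 * sizeof(float));
    cudaMalloc(&y, 1024 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Small grids in different streams can overlap on Fermi;
    // pre-Fermi GPUs would run them back to back.
    kernelA<<<4, 256, 0, s1>>>(x);
    kernelB<<<4, 256, 0, s2>>>(y);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```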

G80, GT200, and Fermi
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

GT200 and Fermi
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Block Diagram (GF100)
- 16 SMs
- Each with 32 cores
  - 512 total cores
- Each SM hosts up to 48 warps, or 1,536 threads
- In flight: up to 24,576 threads
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
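These counts are consistent with each other: a warp is 32 threads, so the chip-wide in-flight figure is just the per-SM limit scaled up.

```latex
48~\tfrac{\text{warps}}{\text{SM}} \times 32~\tfrac{\text{threads}}{\text{warp}} = 1536~\tfrac{\text{threads}}{\text{SM}},
\qquad
16~\text{SMs} \times 1536~\tfrac{\text{threads}}{\text{SM}} = 24{,}576~\text{threads in flight}
```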

Fermi SM
- Why 32 cores per SM instead of 8? Why not more SMs?
  - G80: 8 cores per SM
  - GT200: 8 cores per SM
  - GF100: 32 cores per SM

Fermi SM
- Dual warp scheduling. Why?
- 32K registers
- 32 cores
  - Floating point and integer unit per core
- 16 load/store units
- 4 SFUs
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Fermi SM
- 16 SMs × 32 cores/SM = 512 floating point operations per cycle
- Why not in practice? (see the worked numbers below)
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
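As a back-of-the-envelope check, assume a full 16-SM GF100 at roughly the GTX 480's 1.4 GHz shader clock, with each core issuing one fused multiply-add (2 flops) per cycle; the clock and the FMA counting are assumptions here, only the 512-core figure comes from the slide.

```latex
512~\text{cores} \times 2~\tfrac{\text{flops}}{\text{core}\cdot\text{cycle}} \times 1.4~\text{GHz}
\approx 1.4~\text{peak single-precision TFLOPS}
```

Real kernels rarely approach this peak: they stall on global memory bandwidth, lose lanes to warp divergence, or run too few warps to hide latency.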

Fermi SM
- Each SM has 64 KB of on-chip memory, split as either
  - 48 KB shared memory / 16 KB L1 cache, or
  - 16 KB shared memory / 48 KB L1 cache
- Configurable by the CUDA developer (see the sketch below)
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
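The split is requested per kernel through the runtime API. A minimal sketch, assuming a hypothetical kernel that stages data through a large shared-memory tile (12,288 floats = 48 KB, the full budget under the shared-heavy split):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that stages data through shared memory.
// The tile is declared at the full 48 KB; only blockDim.x entries
// are used here, the size just illustrates the budget.
__global__ void scaleKernel(const float* in, float* out) {
    __shared__ float tile[12288];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = 2.0f * tile[threadIdx.x];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Request 48 KB shared memory / 16 KB L1 for this kernel;
    // cudaFuncCachePreferL1 asks for the opposite split.
    cudaFuncSetCacheConfig(scaleKernel, cudaFuncCachePreferShared);

    scaleKernel<<<n / 256, 256>>>(in, out);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A kernel with scattered global loads and little reuse of staged data would instead prefer the 48 KB L1 configuration.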

Fermi Dual Warp Scheduling
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf

Fermi Caches
Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi: Unified Address Space
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Fermi: Unified Address Space
- 64-bit virtual addresses
- 40-bit physical addresses (currently)
- CUDA 4: shared address space with the CPU. Why? (see the sketch below)
  - No explicit CPU/GPU copies
  - Direct GPU-GPU copies
  - Direct I/O device to GPU copies
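With unified virtual addressing (CUDA 4 or later, Fermi, 64-bit OS), host and device pointers live in one address space, so the runtime can infer copy directions from the pointer values. A minimal sketch (buffer names are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int n = 256;
    float* host;
    float* dev;
    // Pinned host memory participates in the unified address space.
    cudaHostAlloc(&host, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&dev, n * sizeof(float));
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    // With UVA the runtime can tell host pointers from device
    // pointers, so cudaMemcpyDefault replaces the explicit
    // cudaMemcpyHostToDevice / cudaMemcpyDeviceToHost flags.
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDefault);

    printf("host[%d] = %f\n", n - 1, host[n - 1]);
    cudaFreeHost(host);
    cudaFree(dev);
    return 0;
}
```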

Fermi ECC
- ECC protected: register file, L1, L2, DRAM
- Uses redundancy to ensure data integrity against cosmic rays flipping bits
  - For example, 64 bits is stored as 72 bits (see the toy sketch below)
- Fixes single-bit errors, detects multiple-bit errors
- What are the applications?
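To make the redundancy idea concrete, here is a toy Hamming(7,4) encoder/corrector: 3 check bits protect 4 data bits, and a nonzero syndrome points directly at the flipped bit. This illustrates only the principle; actual GPU/DRAM ECC uses a wider SECDED code over 64-bit words (the 64-to-72-bit storage on the slide), not this scheme.

```cuda
#include <cstdio>

// Toy Hamming(7,4): 4 data bits -> 7-bit codeword (positions 1..7,
// parity bits at positions 1, 2, 4). Corrects any single-bit error.
static unsigned encode(unsigned d) {           // d: 4 data bits d0..d3
    unsigned d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    unsigned p1 = d0 ^ d1 ^ d3;                // covers positions 1,3,5,7
    unsigned p2 = d0 ^ d2 ^ d3;                // covers positions 2,3,6,7
    unsigned p4 = d1 ^ d2 ^ d3;                // covers positions 4,5,6,7
    // bit k of the word holds codeword position k+1
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
           (d1 << 4) | (d2 << 5) | (d3 << 6);
}

static unsigned correct(unsigned c) {          // c: 7-bit codeword
    unsigned b[8];
    for (int k = 1; k <= 7; ++k) b[k] = (c >> (k - 1)) & 1;
    unsigned s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    unsigned s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    unsigned s4 = b[4] ^ b[5] ^ b[6] ^ b[7];
    unsigned syndrome = s1 | (s2 << 1) | (s4 << 2);
    if (syndrome) c ^= 1u << (syndrome - 1);   // flip the bad bit
    return c;
}

int main() {
    unsigned word  = 0xB;                      // 4 data bits: 1011
    unsigned code  = encode(word);
    unsigned hit   = code ^ (1u << 5);         // cosmic ray flips position 6
    unsigned fixed = correct(hit);
    printf("sent %02x, received %02x, corrected %02x\n", code, hit, fixed);
    return 0;
}
```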

Fermi Tessellation
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi Tessellation
- Fixed function hardware on each SM for graphics:
  - Texture filtering
  - Texture cache
  - Tessellation
  - Vertex fetch / attribute setup
  - Stream output
  - Viewport transform. Why?
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Observations
- Becoming easier to port CPU code to the GPU
  - Recursion, fast atomics, L1/L2 caches, faster global memory (see the recursion sketch below)
- In fact, GPUs are starting to look like CPUs
  - Beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics
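For example, device-side recursion, which was illegal before Fermi, simply works on compute capability 2.0 (compile with, e.g., nvcc -arch=sm_20). A toy sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A __device__ function may now call itself; pre-Fermi GPUs had no
// call stack, so the compiler rejected (or had to fully inline) this.
__device__ unsigned long long factorial(unsigned n) {
    return (n <= 1) ? 1ull : n * factorial(n - 1);
}

__global__ void factKernel(unsigned long long* out, unsigned n) {
    *out = factorial(n);
}

int main() {
    unsigned long long *d_out, h_out;
    cudaMalloc(&d_out, sizeof(h_out));
    factKernel<<<1, 1>>>(d_out, 10);
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("10! = %llu\n", h_out);             // 3628800
    cudaFree(d_out);
    return 0;
}
```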