CIS 565 GPU Programming and Architecture Original Slides

CIS 565: GPU Programming and Architecture
Original slides by: Suresh Venkatasubramanian
Updates by: Joseph Kider and Patrick Cozzi

Administrivia
• Meeting
  • Monday and Wednesday, 6:00-7:30 pm
  • Moore 212
  • Recorded lectures upon request
• Website: http://www.seas.upenn.edu/~cis565/

Administrivia
• Instructor: Patrick Cozzi
  • Timeline: '99-'03, '00, '02, '03, '04-Present, '06-'08, '11
  • Penn State, Intel, IBM, AGI, Penn, 3D Engine Design for Virtual Globes
  • Email: pjcozzi _at_ siggraph.org (include [CIS 565] in subject)
• Office Hours
  • Monday and Wednesday, 7:30-9:00 pm
  • Location: SIG Lab

Administrivia
• Teaching Assistant
  • Jonathan McCaffrey (jmccaf _at_ seas.upenn.edu)
  • Office Hours: Tuesday and Thursday, 3-4:30 pm
  • Office Location: SIG Lab

Administrivia
• Prerequisites
  • CIS 460: Introduction to Computer Graphics
  • CIS 501: Computer Architecture
  • Most important: C/C++ and OpenGL

CIS 534: Multicore Programming and Architecture
• Course Description
  This course is a pragmatic examination of multicore programming and the hardware architecture of modern multicore processors. Unlike the sequential single-core processors of the past, utilizing a multicore processor requires programmers to identify parallelism and write explicitly parallel code. Topics covered include: the relevant architectural trends and aspects of multicores, approaches for writing multicore software by extracting data parallelism (vectors and SIMD), thread-level parallelism, and task-based parallelism, efficient synchronization, and program profiling and performance tuning. The course focuses primarily on mainstream shared-memory multicores with some coverage of graphics processing units (GPUs). Cluster-based supercomputing is not a focus of this course. Several programming assignments and a course project will provide students first-hand experience with programming, experimentally analyzing, and tuning multicore software. Students are expected to have a solid understanding of computer architecture and strong programming skills (including experience with C/C++).
• We will not overlap very much

What is GPU (Parallel) Computing?
• Parallel computing: using multiple processors to…
  • More quickly perform a computation, or
  • Perform a larger computation in the same time
• PROGRAMMER expresses parallelism
• Clusters of computers: MPI, networks, cloud computing…
• Shared memory multiprocessor: called “multicore” when on the same chip
• GPU: graphics processing units
NOT COVERED CIS 534 MULTICORE COURSE FOCUS CIS 565
Slide courtesy of Milo Martin
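As a rough illustration of "PROGRAMMER expresses parallelism", the sketch below partitions a computation by hand, maps the pieces onto workers, and combines the partial results. The function names and the choice of four workers are assumptions made for this example. Python threads do not speed up CPU-bound work, so the sketch shows only the decomposition pattern, not a performance result.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, num_chunks):
    """Split data into contiguous slices of near-equal size."""
    size = (len(data) + num_chunks - 1) // num_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """Work done independently on one chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, num_workers=4):
    """Partition the input, map chunks onto workers, then combine."""
    if not data:
        return 0
    chunks = chunked(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)
```

The same three steps (partition, independent work, combine) reappear on the GPU, where the "workers" are hardware threads rather than a thread pool.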

Administrivia
• Course Overview
  • System and GPU architecture
  • Real-time graphics programming with OpenGL and GLSL
  • General purpose programming with CUDA and OpenCL
  • Problem domain: up to you
  • Hands-on

Administrivia
• Goals
  • Program massively parallel processors:
    • High performance
    • Functionality and maintainability
    • Scalability
  • Gain knowledge
    • Parallel programming principles and patterns
    • Processor architecture features and constraints
    • Programming API, tools, and techniques

Administrivia
• Grading
  • Homeworks (4-5): 40%
  • Paper Presentation: 10%
  • Final Project: 40% + 5%
  • Final: 10%

Administrivia
• Bonus days: five per person
  • No-questions-asked one-day extension
  • Multiple bonus days can be used on the same assignment
  • Can be used for most, but not all, assignments
• Strict late policy: if not turned in by
  • 11:59 pm of due date: 25% deduction
  • 2 days late: 50%
  • 3 days late: 75%
  • 4 or more days: 100%
• Add a Readme when using bonus days

Administrivia
• Academic Honesty
  • Discussion with other students, past or present, is encouraged
  • Any reference to assignments from previous terms or web postings is unacceptable
  • Any copying of non-trivial code is unacceptable
    • Non-trivial = more than a line or so
    • Includes reading someone else’s code and then going off to write your own.

Administrivia
• Academic Honesty
  • Penalties for academic dishonesty:
    • Zero on the assignment for the first occasion
    • Automatic failure of the course for repeat offenses

Administrivia
• Textbook: None
• Related graphics books:
  • Graphics Shaders
  • OpenGL Shading Language
  • GPU Gems 1-3
• Related general GPU books:
  • Programming Massively Parallel Processors
  • Patterns for Parallel Programming

Administrivia
• Do I need a GPU?
  • Yes: NVIDIA GeForce 8 series or higher
  • No
    • Moore 100b: NVIDIA GeForce 9800s
    • SIG Lab: NVIDIA GeForce 8800s, two GeForce 480s, and one Fermi Tesla

Administrivia
• Demo: What GPU do I have?
• Demo: What version of OpenGL/CUDA/OpenCL does it support?
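The slides do not say how the demo answers these questions. One possible way to answer the first one on a machine with the NVIDIA driver installed is to ask the `nvidia-smi` command-line tool, as sketched below. The helper names and the sample row in the usage note are hypothetical. The sketch reports the GPU name and driver version only; it does not report OpenGL, CUDA, or OpenCL versions.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, driver_version' CSV rows into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 2:
            gpus.append({"name": fields[0], "driver": fields[1]})
    return gpus


def list_nvidia_gpus():
    """Return GPUs reported by nvidia-smi, or [] if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10)
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_gpu_csv(result.stdout)
```

For example, `parse_gpu_csv("GeForce GTX 480, 260.19.21")` yields one entry with that name and driver string; the row is made up for illustration.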

Aside: This class is about 3 things
• PERFORMANCE
• Ok, not really
  • Also about correctness, “-abilities”, etc.
• Nitty-gritty real-world wall-clock performance
• No proofs!
Slide courtesy of Milo Martin

Exercise
• Parallel Sorting
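The slide does not say which algorithm the exercise has in mind. As one illustrative possibility, odd-even transposition sort is easy to reason about in parallel: every phase performs compare-and-swap steps on disjoint pairs, so all swaps within a phase are independent. The sketch below runs the phases sequentially on the CPU to show the structure only.

```python
def odd_even_transposition_sort(values):
    """Sort a sequence using n phases of independent compare-and-swap steps."""
    data = list(values)
    n = len(data)
    for phase in range(n):
        # Even phases pair (0,1), (2,3), ...; odd phases pair (1,2), (3,4), ...
        start = phase % 2
        # Each pair touches disjoint elements, so on parallel hardware
        # every compare-and-swap in this phase could run at the same time.
        for i in range(start, n - 1, 2):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data
```

With one processor per pair, the n phases give roughly O(n) parallel time, compared with O(n log n) sequential time for a typical comparison sort.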

Credits
• David Kirk (NVIDIA)
• Wen-mei Hwu (UIUC)
• David Luebke
• Wolfgang Engel
• Etc. etc.

What is a GPU?
• GPU: Graphics Processing Unit
• Processor that resides on your graphics card.
• GPUs allow us to achieve the unprecedented graphics capabilities now available in games

What is a GPU?
• Demo: NVIDIA GTX 400
• Demo: Triangle throughput

Why Program the GPU?
Chart from: http://ixbtlabs.com/articles3/video/cuda-1-p1.html

Why Program the GPU?
• Compute
  • Intel Core i7: 4 cores, 100 GFLOP
  • NVIDIA GTX 280: 240 cores, 1 TFLOP
• Memory Bandwidth
  • System Memory: 60 GB/s
  • NVIDIA GT200: 150 GB/s
• Install Base
  • Over 200 million NVIDIA G80s shipped
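The figures on this slide can be turned into ratios with a few lines of arithmetic. The sketch below takes the slide's peak numbers at face value and assumes 1 TFLOP = 1000 GFLOP. The "FLOPs per byte" figure is derived here for illustration and does not appear on the slide.

```python
# Peak figures quoted on the slide.
CPU_GFLOP = 100.0        # Intel Core i7
GPU_GFLOP = 1000.0       # NVIDIA GTX 280 (1 TFLOP, assuming 1000 GFLOP)
CPU_BANDWIDTH_GBS = 60.0   # system memory
GPU_BANDWIDTH_GBS = 150.0  # NVIDIA GT200
CPU_CORES = 4
GPU_CORES = 240

compute_ratio = GPU_GFLOP / CPU_GFLOP                      # 10.0
bandwidth_ratio = GPU_BANDWIDTH_GBS / CPU_BANDWIDTH_GBS    # 2.5
core_ratio = GPU_CORES / CPU_CORES                         # 60.0

# Peak arithmetic per byte of memory traffic (GFLOP per GB/s = FLOP per byte).
cpu_flops_per_byte = CPU_GFLOP / CPU_BANDWIDTH_GBS
gpu_flops_per_byte = GPU_GFLOP / GPU_BANDWIDTH_GBS
```

Under these numbers, peak compute grows 10x while bandwidth grows only 2.5x, so a kernel needs about four times as many arithmetic operations per byte fetched to keep the GPU busy as it would on the CPU.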

How did this happen?
• Games demand advanced shading
• Fast GPUs = better shading
• Need for speed = continued innovation
• The gaming industry has overtaken the defense, finance, oil and healthcare industries as the main driving factor for high performance processors.

GPU = Fast co-processor?
• GPU speed increasing at cubed-Moore’s Law.
• This is a consequence of the data-parallel streaming aspects of the GPU.
• GPUs are cheap! Put a couple together, and you can get a super-computer.
• So can we use the GPU for general-purpose computing?
NYT May 26, 2003: TECHNOLOGY; From PlayStation to Supercomputer for $50,000: National Center for Supercomputing Applications at University of Illinois at Urbana-Champaign builds supercomputer using 70 individual Sony Playstation 2 machines; project required no hardware engineering other than mounting Playstations in a rack and connecting them with high-speed network switch

Yes! Wealth of applications
• Data Analysis
• Motion Planning
• Particle Systems
• Voronoi Diagrams
• Force-field simulation
• Molecular Dynamics
• Graph Drawing
• Geometric Optimization
• Physical Simulation
• Matrix Multiplication
• Database queries
• Conjugate Gradient
• Sorting and Searching
• Range queries
• Image Processing
• Signal Processing
• Finance
• Optimization
• Planning
• Radar, Sonar, Oil Exploration
• … and graphics too!!

When does “GPU=fast co-processor” work?
Real-time visualization of complex phenomena
• The GPU (like a fast parallel processor) can simulate physical processes like fluid flow, n-body systems, molecular dynamics
• In general: Massively Parallel Tasks

When does “GPU=fast co-processor” work?
Interactive data analysis
• For effective visualization of data, interactivity is key

When does “GPU=fast co-processor” work?
Rendering complex scenes
• Procedural shaders can offload much of the expensive rendering work to the GPU. Still not the Holy Grail of “80 million triangles at 30 frames/sec*”, but it helps.
* Alvy Ray Smith, Pixar.
Note: NVIDIA Quadro 5000 is calculated to push 950 million triangles per second: http://www.nvidia.com/object/product-quadro-5000-us.html
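To put the two triangle figures on the same scale, the sketch below converts the "Holy Grail" target into triangles per second. It assumes the target means 80 million triangles in every frame, which is an interpretation of the slide rather than something it states.

```python
def triangles_per_second(triangles_per_frame, frames_per_second):
    """Throughput needed to draw a fixed triangle count every frame."""
    return triangles_per_frame * frames_per_second


# Assumption: the target is 80 million triangles per frame at 30 frames/sec.
target_rate = triangles_per_second(80_000_000, 30)

# Quadro 5000 figure quoted in the slide note.
quoted_rate = 950_000_000

fraction_of_target = quoted_rate / target_rate
triangles_per_frame_at_30fps = quoted_rate / 30
```

Under that assumption the target is 2.4 billion triangles per second, so the quoted 950 million per second is a little under 40% of it, or about 31.7 million triangles per frame at 30 frames/sec.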

Stream Programming
• A stream is a sequence of data (could be numbers, colors, RGBA vectors, …)
• A kernel is a (fragment) program that runs on each element of a stream, generating an output stream (pixel buffer).
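A minimal CPU-side sketch of these two definitions: the stream is a list of RGBA tuples and the kernel is a function applied to each element independently. The names `run_kernel` and `to_grayscale` are invented for this example. A real fragment program would be written in a shading language and run on the GPU.

```python
def run_kernel(kernel, stream):
    """Apply the kernel to every element independently, producing an output stream."""
    return [kernel(element) for element in stream]


def to_grayscale(rgba):
    """Example kernel: map one RGBA color to a gray RGBA color."""
    r, g, b, a = rgba
    gray = (r + g + b) / 3.0
    return (gray, gray, gray, a)


input_stream = [(0.0, 0.5, 1.0, 1.0), (1.0, 1.0, 1.0, 0.5)]
output_stream = run_kernel(to_grayscale, input_stream)
```

Because the kernel reads only its own element and writes only its own output, the elements can be processed in any order or all at once, which is what makes the model a good fit for parallel hardware.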

Stream Programming
• Kernel = vertex/fragment shader
• Input stream = stream of vertices, primitives, or fragments
• Output stream = frame buffer or other buffer (transform feedback)
• Multiple kernels = multi-pass rendering sequence on the GPU.
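The "multiple kernels = multi-pass" idea can be sketched on the CPU as a chain in which each pass reads the buffer written by the previous pass. The three kernels below (`scale`, `bias`, `clamp01`) are hypothetical stand-ins for shaders.

```python
def run_pass(kernel, stream):
    """One pass: run a single kernel over the whole input stream."""
    return [kernel(element) for element in stream]


def run_passes(kernels, stream):
    """Multi-pass sequence: each pass consumes the previous pass's output buffer."""
    for kernel in kernels:
        stream = run_pass(kernel, stream)
    return stream


def scale(x):
    return x * 2.0


def bias(x):
    return x + 1.0


def clamp01(x):
    return min(max(x, 0.0), 1.0)


result = run_passes([scale, bias, clamp01], [-1.0, -0.25, 0.5])
```

Each extra pass costs another full trip through the pipeline, which is why the number of kernels shows up as a cost factor for stream programs.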

To program the GPU, one must think of it as a (parallel) stream processor.

What is the cost of a stream program?
• Number of kernels
• Complexity of kernel
  • More complexity takes longer to move data through a rendering pipeline
• Number of memory accesses
  • Readbacks from the GPU to main memory are expensive, and so is transferring data to the GPU.
  • Non-local memory access is expensive
• Number of branches
  • Divergent branches are expensive
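Why divergent branches are expensive can be shown with a toy cost model. It assumes lanes execute in lockstep groups, and that a group whose lanes disagree on a branch must run both paths one after the other. The group size of 4 and the step costs are made-up numbers for illustration, not properties of any particular GPU.

```python
def group_cost(conditions, cost_if_true, cost_if_false):
    """Steps needed by one lockstep group of lanes to evaluate a branch."""
    cost = 0
    if any(conditions):
        cost += cost_if_true    # at least one lane needs the 'then' path
    if not all(conditions):
        cost += cost_if_false   # at least one lane needs the 'else' path
    return cost


def total_cost(conditions, group_size, cost_if_true, cost_if_false):
    """Sum the branch cost over consecutive lockstep groups."""
    groups = [conditions[i:i + group_size]
              for i in range(0, len(conditions), group_size)]
    return sum(group_cost(g, cost_if_true, cost_if_false) for g in groups)


coherent = [True] * 4 + [False] * 4   # each group agrees on the branch
divergent = [True, False] * 4         # every group is split

coherent_cost = total_cost(coherent, 4, 10, 6)
divergent_cost = total_cost(divergent, 4, 10, 6)
```

In this model the same eight lanes and the same four-and-four split cost 16 steps when neighbouring lanes agree and 32 steps when they alternate, so arranging data so that neighbours take the same path pays off.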

What will this course cover ?

1. Stream Programming Principles
• OpenGL programmable pipeline
• The principles of stream hardware
• How do we program with streams?

2. Shaders and Effects
• How do we compute complex effects found in today’s games?
• Examples:
  • Parallax Mapping
  • Reflections
  • Skin and Hair
  • Particle Systems
  • Deformable Mesh
  • Morphing
  • Animation

3. GPGPU / GPU Computing
• How do we use the GPU as a fast co-processor?
• GPGPU Languages: CUDA and OpenCL
• High Performance Computing
• Numerical methods and linear algebra:
  • Inner products
  • Matrix-vector operations
  • Matrix-matrix operations
  • Sorting
  • Fluid Simulations
  • Fast Fourier Transforms
  • Graph Algorithms
  • And More…
• At what point does the GPU become faster than the CPU for matrix operations? For other operations?
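As a taste of the first topic in that list, an inner product splits naturally into a data-parallel map (elementwise products) followed by a tree reduction (pairwise sums). The sketch below runs both steps sequentially on the CPU to show the structure. It is an illustration under those assumptions, not the course's CUDA or OpenCL implementation.

```python
def inner_product(a, b):
    """Inner product as an elementwise map followed by a pairwise tree reduction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Map step: every product is independent of the others.
    partial = [x * y for x, y in zip(a, b)]
    if not partial:
        return 0
    # Reduce step: each round adds disjoint pairs and halves the list, so
    # n values need about log2(n) rounds rather than n - 1 sequential adds.
    while len(partial) > 1:
        if len(partial) % 2 == 1:
            partial.append(0)  # pad so every element has a partner
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]
```

For example, `inner_product([1, 2, 3], [4, 5, 6])` forms the products 4, 10, 18 and reduces them in two rounds.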

4. Optimizations
• How do we use the full potential of the GPU?
• What tools are there to analyze the performance of our algorithms?
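The slides do not name specific tools here. The simplest starting point, before reaching for a profiler, is to measure wall-clock time directly. The helper below is a hypothetical sketch that times a function several times and keeps the best run, which reduces noise from other activity on the machine.

```python
import time


def best_wall_clock_seconds(func, *args, repeats=5):
    """Run func(*args) several times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


seconds, value = best_wall_clock_seconds(sum, range(1000))
```

When timing GPU work from the host, the measurement should also include any waiting for the device to finish, since many GPU calls return before the work completes.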

What we want you to get out of this course!
1. Understanding of the GPU as a graphics pipeline
2. Understanding of the GPU as a high performance compute device
3. Understanding of GPU architectures
4. Programming in GLSL, CUDA, and OpenCL
5. Exposure to many core graphics effects performed on GPUs
6. Exposure to many core parallel algorithms performed on GPUs