NVIDIA GeForce
Ryan Hendrixson, Ryan Schubert, Allison Walthall

What Does a GPU Actually Do?
• Historically, GPUs have progressed from:
  – Acting simply as a frame buffer
  – Doing vertex transformations and pixel color calculations
  – Now even being programmable
• In the simplest sense, a modern GPU implements a 3D rendering pipeline

3D Rendering Pipeline (direct illumination)
3D Geometric Primitives → Modeling Transformation → Lighting → Viewing Transformation → Projection Transformation → Clipping → Scan Conversion → Image
This is a pipelined sequence of operations that draws a 3D primitive into a 2D image.

3D Rendering Pipeline (direct illumination)
3D Geometric Primitives
  • Modeling Transformation: transform into 3D world coordinate system
  • Lighting: illuminate according to lighting and reflectance
  • Viewing Transformation: transform into 3D camera coordinate system
  • Projection Transformation: transform into 2D screen coordinate system
  • Clipping: clip primitives outside camera's view
  • Scan Conversion: draw pixels
Image
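
The transformation stages can be sketched in a few lines. This is only an illustration, not how any particular GPU implements them: the helper names (`mat_vec`, `translation`, `perspective`, `project_vertex`) are made up for this example, the projection matrix follows the usual OpenGL convention, lighting is left out, and clipping is reduced to a per-vertex test.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    """4x4 matrix that moves a point by (tx, ty, tz)."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective projection matrix (assumed convention)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [[f / aspect, 0, 0, 0],
            [0, f, 0, 0],
            [0, 0, (far + near) / (near - far), 2 * far * near / (near - far)],
            [0, 0, -1, 0]]

def project_vertex(vertex, model, view, proj, width, height):
    """Run one vertex through the modeling, viewing and projection stages.

    Returns pixel coordinates, or None if the vertex is outside the view.
    """
    world = mat_vec(model, list(vertex) + [1.0])  # modeling transformation
    camera = mat_vec(view, world)                 # viewing transformation
    x, y, z, w = mat_vec(proj, camera)            # projection transformation
    if w <= 0 or not all(-w <= c <= w for c in (x, y, z)):
        return None                               # simplified clipping test
    ndc_x, ndc_y = x / w, y / w                   # perspective divide
    return ((ndc_x + 1) * 0.5 * width, (1 - ndc_y) * 0.5 * height)

model = translation(0, 0, -5)        # place the object 5 units in front of the camera
view = translation(0, 0, 0)          # camera at the origin (identity)
proj = perspective(90, 640 / 480, 0.1, 100)
print(project_vertex([0.0, 0.0, 0.0], model, view, proj, 640, 480))
```

A vertex at the object's origin lands in the middle of the 640×480 image; a vertex behind the camera returns None.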

Modern OpenGL Pipeline
• Stages: Application (CPU) → Vertex Processor → Assembly & Rasterization → Pixel Processor → Video Memory (Textures); everything after the application runs on the GPU
• Data passed along: Graphics State and Vertices (3D) → Xformed, Lit Vertices (2D) → Fragments (pre-pixels) → Final pixels (Color, Depth)
• Render-to-texture returns final pixels to video memory
• Programmable Vertex Processor
• Programmable Fragment (Pixel) Processor

OpenGL vs. DirectX
OpenGL                  DirectX
Just graphics           Graphics, multimedia, etc.
Standard C interfaces   C++ interfaces
State machine           Object oriented
Multiple platforms      Windows
Academic use            PC games

Possible GPU Performance Bottlenecks
• CPU/Bus Bound
  – Simply not able to send enough vertices to the card to keep it busy
• Vertex Bound
  – The vertex processing engine is fully loaded, while the fragment engine just waits and grabs data as soon as it's ready
• Pixel Bound
  – The fragment engine is fully loaded, so the vertex engine has to wait before sending more data

Early History
• 1993: NVIDIA founded
• 1997: RIVA
• 1998: RIVA TNT
• 1999: GeForce 256 (NV10)

GeForce 256 (NV10)
• Transformation and lighting
• DDR and SDR
• HDTV compliant
• Hardware alpha-blending
• 4 pixel pipelines at 120 MHz
• Fill rate: 480 megapixels/second
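
The fill rate follows directly from the two figures above it. A minimal check, assuming each pipeline writes one pixel per clock (the function name is made up for this example):

```python
def fill_rate_mpixels(pixel_pipelines, core_clock_mhz):
    """Peak fill rate in megapixels/second, assuming one pixel per pipeline per clock."""
    return pixel_pipelines * core_clock_mhz

# GeForce 256 figures from the slide: 4 pipelines at 120 MHz
print(fill_rate_mpixels(4, 120))
```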

GeForce 2
• 2000: GeForce 2 GTS
  – Doubled the pixel fill rate
  – Quadrupled the texel fill rate
  – Increased clock speed
  – Multi-texturing
  – S3TC, MPEG-2, FSAA

Anti-Aliasing
[Figure: the same image without anti-aliasing and with anti-aliasing]

GeForce 2
• 2000: GeForce 2 MX
  – Halved the number of pixel pipelines, making it cost effective
  – TwinView
  – Compatible with Macs

GeForce 2
• August 2000: GeForce 2 Ultra
• November 2000: GeForce 2 Go
• December 2000: NVIDIA buys 3dfx
• January 2001: Apple selects the GeForce 2 MX as the default high-end graphics solution for the Power Mac G4

GeForce 3
• 2001: GeForce 3 (NV20)
  – 240 MHz core / 500 MHz memory
  – 57 million transistors
  – 46-76 gigaflops
  – Vertex shader technology
  – Pixel shader technology
  – Lightspeed Memory Architecture

Lightspeed Memory Architecture

GeForce 4
• 2002: GeForce 4 Ti (NV25) and MX (NV17)
  – Ti:
    § 4200, 4400, 4600, and 4800 versions
    § 63 million transistors
    § Chip clock 225-300 MHz
    § Memory clock 500-650 MHz
    § 75-100 million vertices/second

GeForce FX
• November 2002: GeForce FX (NV30)
  – 16 variations for different price ranges
  – 125 million transistors
  – 8 pixels/clock
  – 1 TMU per pipe (16 textures/unit)
  – 128-bit memory interface
  – 128 MB/256 MB memory size support

GeForce 6 series (NV40)
• 6200; 6600 GT and Ultra; 6800 GT, Ultra, and Ultra Extreme
  – Core clock speed 450 MHz
  – Memory clock speed 600 MHz
  – Vertex shader units: 6 four-wide FP32 vector MADDs per clock cycle
  – Pixel shader units: 16 four-wide FP32 vector MADDs per clock cycle

GeForce 6 series
• Superscalar 16-pipe architecture
• CineFX 3.0 engine
• All operations done in FP32 precision per component
• 200 gigaflops (compare this to the Itanium's 6.4 gigaflops)

General Diagram (6800/NV40)

TurboCache
• Uses PCI Express bandwidth to render directly to system memory
• Card needs less memory
• Performance boost while lowering cost
• TurboCache Manager dynamically allocates from main memory
• Local memory is used to cache data and to deliver peak performance when needed

TurboCache

NV40 Vertex Processor
An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle.

NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer checks; the resulting sets of 4 pixels (quads) are passed on to the fragment units.

Programmable 2D and Video Processor
• Can be used for video decoding and encoding (IDCT, deinterlacing, color model transformations, etc.)

Why the NV40 series was better
• Massive parallelism
• Scalability
  – Lower-end products have fewer pixel pipes and fewer vertex shader units
• Computation power
  – 222 million transistors
  – First to comply with Microsoft's DirectX 9 spec
• Dynamic branching in pixel shaders

Dynamic Branching
• Helps detect whether a pixel needs shading
• Instruction flow is handled in groups of pixels
• Specify branch granularity (the number of consecutive pixels that take the same branch)
• Better distribution of blocks of pixels between the different quad engines
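
Why granularity matters can be shown with a toy cost model. Everything here is an assumption made for illustration (the function name, the cost values, and the rule that a block pays for the expensive path whenever any of its pixels needs it); it is not a description of the NV40 scheduler.

```python
def shading_cost(needs_shading, granularity, cheap=1, expensive=10):
    """Total cost when pixels branch together in blocks of `granularity`.

    Assumed model: if any pixel in a block needs shading, every pixel in
    that block pays the expensive path; otherwise all take the cheap path.
    """
    total = 0
    for start in range(0, len(needs_shading), granularity):
        block = needs_shading[start:start + granularity]
        per_pixel = expensive if any(block) else cheap
        total += per_pixel * len(block)
    return total

# 16 consecutive pixels, only two of which actually need shading
mask = [False] * 6 + [True] * 2 + [False] * 8
for g in (1, 4, 16):
    print(g, shading_cost(mask, g))
```

Under this model, coarser granularity drags more pixels through the expensive path, which is why a branch only pays off when neighbouring pixels tend to agree.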

Dynamic Branching

GeForce 7 series
                     7800 GT    7800 GTX
Price                $449       $600
Vertex units         7          8
Pixel pipelines      20         24
Clock speed          400 MHz    430 MHz
Memory clock speed   500 MHz    600 MHz

GeForce 7800
• 302 million transistors
• 200 gigaflops of multiply/add calculations per second
• 128-bit floating point precision through the entire rendering pipeline
• Fill rate: 10.3 gigatexels
• 860 million vertices/sec

GeForce 7800

ALU Units in Pixel Processor
• Sub-unit 1:
  – NV40: handles texture data, and can issue a vector MUL instruction or use its mini-ALU to issue a non-vector instruction
  – G70: same, but can also issue a multiply/add
• Sub-unit 2:
  – NV40: can issue a vector multiply/add instruction or use its own mini-ALU to issue a non-vector instruction
  – G70: same

GeForce 6 vs. GeForce 7
• ALU units
  – G70: 24 ALU units
  – NV40: 16 ALU units
• Register file: same size
• Texture samplers are the same, but when fetching large textures in preparation for filtering, the G70's samplers have less latency pulling those textures out of memory

GeForce 6 vs. GeForce 7 (speculative)
• Increased L2 texture cache (to around 12 KB)
• Better cache re-use with larger textures, decompressing those larger textures into L1 faster
• Possibly finer granularity in cache access by the GPU, which would reduce texture bandwidth and speed up rendering

GeForce 6 vs. GeForce 7
• 33% more vertex units (8 vs. 6), each with more performance
• Improved vertex fetch unit (unconfirmed by NVIDIA)
• Triangle setup and rasteriser optimized via a new raster pattern (again unconfirmed by NVIDIA)

General Diagram (7800/G70)

32-bit IEEE floating-point throughout pipeline (NV40)
• Framebuffer
• Textures
• Fragment processor
• Vertex processor
• Interpolants
• GeForce 7800 (G70) supports 128-bit precision through the entire pipeline!

Hardware supports several other data types
• Fragment processor also supports:
  – 16-bit "half" floating point
  – 12-bit fixed point
  – These may be faster than 32-bit on some hardware
• Framebuffer/textures also support:
  – A large variety of fixed-point formats
  – E.g., classical 8 bits per component
  – These formats use less memory bandwidth than FP32
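
The trade-off between bandwidth and precision is easy to quantify. A small sketch, assuming four components per pixel and byte-aligned formats only (12-bit fixed point is left out because it does not map onto whole bytes); the names are made up for this example:

```python
import struct

BYTES_PER_COMPONENT = {"fp32": 4, "fp16": 2, "fixed8": 1}

def framebuffer_bytes(width, height, fmt, components=4):
    """Memory touched when every pixel of a buffer is written once."""
    return width * height * components * BYTES_PER_COMPONENT[fmt]

def to_half_and_back(x):
    """Round-trip a value through IEEE 16-bit 'half' storage."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

for fmt in ("fp32", "fp16", "fixed8"):
    print(fmt, framebuffer_bytes(1024, 768, fmt))

# Precision lost by storing 0.1 as a half float
print(to_half_and_back(0.1))
```

An 8-bit-per-component buffer moves a quarter of the data of an FP32 one, at the cost of range and precision.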

How are current GPUs different from CPUs?
• A GPU is a stream processor
  – Multiple programmable processing units
  – Connected by data flows
[Diagram: Application → Vertex Processor → Assembly & Rasterization → Fragment Processor → Framebuffer Operations → Framebuffer; Textures]

How are current GPUs different from CPUs?
• Optimized for 4-vector arithmetic
  – Useful for graphics: colors, vectors, texcoords
  – Easy way to get high performance/cost
  – SIMD/MIMD

GPU Memory Model vs. CPU's
• Much more restricted memory access
  – Allocate/free memory only before computation
  – Limited memory access during computation (kernel)
    § Registers: read/write
    § Local memory: does not exist
    § Global memory: read-only during computation; write-only at end of computation (pre-computed address)
    § Disk access: does not exist
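
These restrictions amount to "gather but no scatter". A CPU-side sketch of that model, with hypothetical names (`run_kernel`, `box_blur`) chosen only for this example: the kernel can read any input element, but its single result always goes to its own output index.

```python
def run_kernel(kernel, inputs, output_size):
    """Apply `kernel` once per output element, stream-style.

    The kernel may read any element of `inputs` (gather), but it can only
    return the value for its own, pre-computed output index (no scatter).
    """
    read_only = tuple(inputs)  # global memory is read-only during computation
    return [kernel(i, read_only) for i in range(output_size)]

def box_blur(i, data):
    """Average an element with its neighbours, clamping at the edges."""
    left = data[max(i - 1, 0)]
    right = data[min(i + 1, len(data) - 1)]
    return (left + data[i] + right) / 3.0

print(run_kernel(box_blur, [3.0, 6.0, 9.0], 3))
```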

GPU Memory Model
• Where is GPU data stored?
  – Vertex buffer
  – Frame buffer
  – Texture
[Diagram: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture feeds the Fragment Processor and, on VS 3.0 GPUs, the Vertex Processor]

GPGPU and Motivation
• GPUs are fast…
  – Itanium: 6.4 GFLOPS
  – GeForce 7800: 200 GFLOPS
• GPUs are getting faster, faster
  – CPUs: annual growth 1.5×, decade growth 60×
  – GPUs: annual growth > 2.0×, decade growth > 1000×
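
The decade figures are just the annual factors compounded over ten years, which a short check confirms (the function name is made up for this example):

```python
def decade_growth(annual_factor, years=10):
    """Compound an annual growth factor over a number of years."""
    return annual_factor ** years

print(round(decade_growth(1.5), 1))  # CPUs: close to the quoted 60x
print(decade_growth(2.0))            # GPUs: just over 1000x
```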

Motivation: Computational Power
[Figure: computational power, GPU vs. CPU. Courtesy Naga Govindaraju]

GPGPU
• Good for inherently parallel applications
• Rapidly evolving ISA and hardware architecture
  – Largely secret
• Can't simply "port" code written for the CPU!

Programs are Shaders
• Bound by the specific hardware profile:
  – E.g., different cards support different hardware features, OpenGL has different restrictions than DirectX, etc.
• Hardware profiles change fairly drastically as new GPUs are developed
  – New profiles typically only add features, so backwards compatibility is generally preserved (but not always)

Vertex processor
• 256 instructions per program originally (effectively higher with branching)
  – Now up to 65535 instructions
• Executes on all vertices
• Outputs new vertices, texture coordinates, etc.

Fragment Processor Flow Chart

Fragment processor has flexible texture mapping
• Memory is accessible through texture reads
• Texture reads are just another instruction
• Allows computed texture coordinates, nested to arbitrary depth
• Allows multiple uses of a single texture unit

Additional fragment processor capabilities
• Read access to window-space position
• Read/write access to fragment Z
• Built-in derivative instructions
  – Partial derivatives w.r.t. screen-space x or y
  – Useful for anti-aliasing
• Conditional fragment-kill instruction
• Multiple FP formats supported
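
A screen-space derivative can be approximated by differencing a value between neighbouring pixels. The sketch below assumes this is done within a 2×2 pixel quad; that assumption, and the function name, are for illustration only and are not taken from the slides.

```python
def quad_derivatives(quad):
    """Approximate d/dx and d/dy of a value over a 2x2 pixel quad.

    `quad` is [[top_left, top_right], [bottom_left, bottom_right]].
    """
    ddx = quad[0][1] - quad[0][0]  # difference between horizontal neighbours
    ddy = quad[1][0] - quad[0][0]  # difference between vertical neighbours
    return ddx, ddy

# A texture coordinate that changes by 0.5 per pixel in x and 0.25 per pixel in y
print(quad_derivatives([[0.0, 0.5], [0.25, 0.75]]))
```

Large derivatives mean the value changes quickly from pixel to pixel, which is the cue a shader can use to filter more aggressively.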

Fragment processor limitations
• Originally no branching
  – Dynamic branching is now supported (but it's still costly)
• No indexed reads from registers
  – Use texture reads instead
• No memory writes

Branching Instruction Costs (GeForce 6800)

Fragment shaders
• Originally very limited in size (only 96 instructions), now expanded to 65535+ instructions
• New cards support dynamic branching (but it still incurs some performance penalty)
• Now have the ability to output to multiple render targets

CineFX 4.0 Engine
• A redesigned vertex shader unit reduces the time to set up and perform geometry processing.
• A new pixel shader unit design carries out twice as many floating-point operations and greatly accelerates other mathematical operations to increase throughput.
• An advanced texture unit incorporates new hardware algorithms and better caching to speed up filtering and blending operations.

Vertex Shaders
• The 7800 has 8 vertex shaders
• The Triangle Setup stage turns the vertex points into a triangle
• It also determines mathematically the rasterization for each triangle
• Accelerating triangle setup increases the total throughput of the 3D pipeline

Theoretical Rasterization Pattern of a Triangle

New Pixel Shader – MADD
• Multiply and accumulate are commonly used math operations in 3D graphics
• MADD stands for Multiply-ADD
• The 7800 can do twice as many MADD operations as previous GPUs could
• This allows developers to create much more complex visual effects
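
A MADD computes a × b + c as one operation. As a small illustration of why it is so common in graphics, a 4-component dot product is just a chain of four MADDs (function names chosen for this example):

```python
def madd(a, b, c):
    """Multiply-add: a * b + c."""
    return a * b + c

def dot4(u, v):
    """4-component dot product written as a chain of MADDs."""
    acc = 0.0
    for x, y in zip(u, v):
        acc = madd(x, y, acc)
    return acc

print(dot4([1, 2, 3, 4], [5, 6, 7, 8]))
```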

Transparency Adaptive Supersampling
• Takes extra passes over thin-lined objects such as chain-link fences or trees to enhance quality
• Pixels inside a polygon are usually not touched by anti-aliasing methods
• With this, a key set is devised, and those pixels are anti-aliased, creating a smoother image

Transparency Adaptive Supersampling

Transparency Adaptive Multisampling
• Higher levels of performance, because it uses one texel to determine other subpixel values
• Not as high quality

Supporting the Future
• The 7800 is already set up to support the new Microsoft Longhorn OS, with some of the following advancements:
  – Video post-processing
  – Real-time desktop compositing
  – Seamless multiple 3D applications
  – Accelerated antialiased text rendering
  – Special effects and animation

Accelerated Graphics Port (AGP)
• AGP is superior to PCI because it provides a dedicated pathway between the slot and the processor
• Uses sideband addressing
• PCI must load a texture from the hard drive into the system's RAM, then from RAM into the GPU's framebuffer
• AGP can read textures directly from system RAM by "tricking" the CPU into believing the textures are in the framebuffer, when they are really in main memory

PCI Express
• Based on the PCI system, allowing for backwards compatibility
• Uses 1-bit, bi-directional lanes (PCI used a shared bus)
• Each lane supports 250 MB/s in each direction (4 GB/s total for an x16 link)
  – AGP is only 2 GB/s
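
The 4 GB/s total is the per-lane figure multiplied by 16 lanes. A quick check, with a hypothetical helper name:

```python
def pcie_bandwidth_mb_s(lanes, per_lane_mb_s=250):
    """Peak bandwidth in one direction, in MB/s, using the per-lane figure above."""
    return lanes * per_lane_mb_s

print(pcie_bandwidth_mb_s(16))  # x16 graphics slot
print(pcie_bandwidth_mb_s(1))   # x1 link
```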

Scalable Link Interface (SLI)
• Takes advantage of PCI Express, which allows more than one discrete graphics device on the same PCI host
• Allows two of the same GeForce GPUs to run in one machine, thus "sharing" the load
• There are two modes for this:
  – Split-frame Rendering (SFR)
  – Alternate-frame Rendering (AFR)

Split-frame Rendering
• Each GPU renders a portion of the screen, split horizontally
• No extra latency
• Not necessarily evenly split
  – SFR is load balanced, so it splits the frame by the amount of work, not by size
• A large amount of overhead is involved, limiting the maximum speedup to around 1.8×
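
Splitting by work rather than by size can be sketched as follows. This assumes a per-row work estimate is available (for example from the previous frame); how the real driver measures and balances load is not described in the slides, and `balanced_split` is a made-up name.

```python
def balanced_split(row_work):
    """Return the row index that splits a frame into two parts of roughly
    equal work: rows [0, split) go to GPU 0, the remaining rows to GPU 1."""
    total = sum(row_work)
    best_split, best_diff = 0, total
    top = 0
    for split in range(1, len(row_work)):
        top += row_work[split - 1]
        diff = abs(top - (total - top))  # imbalance between the two GPUs
        if diff < best_diff:
            best_split, best_diff = split, diff
    return best_split

# Cheap sky in the top rows, expensive ground in the bottom rows
print(balanced_split([1, 1, 1, 1, 4, 4, 4, 4]))
```

Here the split lands below the middle of the frame, so the GPU that draws the cheap rows gets more of them.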

Alternate-frame Rendering
• Avoids all the overhead problems of SFR
• Many buffer swaps
• Reliant on the speed of the processor
• Can cause latency issues
• Recommended mode by NVIDIA

GeForce Go 7800 GTX
• The mobile version of the 7800 GTX
• Everything from the desktop release has been carried over
• Can switch between x1 and x16 lanes of PCI Express
• Uses PowerMizer 6.0, which allows this chip to operate in the same envelope as its predecessor, the 6800

GeForce Go 7800 – Power Issues
• Power consumption and package are the same as the 6800 Ultra chip, so notebook designers do not have to change much about their thermal designs
• Dynamic clock scaling can run as slow as 16 MHz
  – This is true for the engine, memory, and pixel clocks
• Heavier use of clock gating than the desktop version
• Runs at voltages lower than any other mobile performance part
• Regardless, you won't get much battery-based runtime for a 3D game

Questions?