Attila Research Group
attila.ac.upc.edu
Computer Architecture Department, Universitat Politècnica de Catalunya (UPC)

Attila Project
• Started 2003
• Research on GPUs
  – Focus on the microarchitecture
  – Use real games as workloads
  – Analyze bandwidth/latency/threading tradeoffs
• Spent large fraction of time developing tools
• Currently three PhDs in progress
• Funding from
  – CICYT / Ministry of Education, Spain
  – Intel (2) (1)
• 2 students spent 6 months with ATI

Attila Team
• Faculty
  – Agustin Fernandez
• 3 Ph.D. Students
  – Victor Moya
  – Carlos González
  – Jordi Roca
  -- Hired by Intel / VCG '06
  -- 6 months internship at ATI (Jun '07)
• Master Thesis
  – Chema Solis: DX9 Driver Development
• Alumni
  – David Abella: DX9 Player and PIX reader
  – Christian Perez: Color Compression in Attila
• Industrial Advisor
  – Roger Espasa, Intel VCG

Attila Facts
• Simulation time
  – 1 frame @ 1280x1024 per hour
• Lines of code
  – Simulator: 142697 lines
  – Library, driver and trace tools: 217266 lines
    • ACDL: 37791 lines
    • OpenGL: 35960 lines
    • D3D9: 17348 lines

Attila Publications
• Conference Papers
  – Workload Characterization of 3D Games. Jordi Roca, Victor Moya, Carlos González, Chema Solis, Agustín Fernández and Roger Espasa. IEEE International Symposium on Workload Characterization (IISWC-2006), pp. -, January 2006.
  – ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2006), March 2006.
  – Shader Performance Analysis on a Modern GPU Architecture. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. The 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38), November 2005.
  – A Single (Unified) Shader GPU Microarchitecture for Embedded Systems. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. 2005 International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2005), November 2005.
• Master Thesis
  – Caracterización e implementación de algoritmos de compresión en la GPU ATILA [Characterization and implementation of compression algorithms in the ATILA GPU] (Text in Spanish). Christian Perez. Master Thesis for the Graduate Studies, January 2008.
  – Extensión a Direct3D del driver de un simulador de GPU [Direct3D extension of a GPU simulator's driver] (Text in Spanish). Chema Solis. Master Thesis for the Graduate Studies, July 2007.
  – Librería Direct3D [Direct3D library] (Text in Catalan). David Abella. Master Thesis for the Graduate Studies, July 2007.
  – Shader generation and compilation for a programmable GPU (Text in Spanish). Jordi Roca. Master Thesis for the Graduate Studies, July 2005.
  – Support tools for a 3D graphics processor simulation framework (Text in Spanish). Carlos González. Master Thesis for the Graduate Studies, June 2004.

Outline
• Attila Tracing Environment
• Attila Architecture & Simulator
• Current Research
  – Shaders
  – Memory Hierarchy
  – Micropolygons
  – DX9 Driver Development

Supported workloads
• Doom 3
• UT 2004
• Quake 4
• Riddick
• Prey
• Half Life 2
• … and upcoming D3D games

Attila Tracing Environment: Collect, Verify, Simulate, Analyze
[Diagram of the tracing environment; labels grouped by phase]
• Collect: OGL/D3D App, OGL/D3D Capturer or Microsoft PIX Capturer, Trace, API Stats
• Verify: OGL/D3D Player or Attila PIX Player, Vendor OGL/D3D Driver, ATI R600/NVIDIA G80, Framebuffer
• Simulate: ATTILA OGL/D3D Driver, ATTILA Simulator, Framebuffer, CHECK
• Analyze: Signal Traffic, µ-Arch Statistics, Internal traces (mem, $, …), Signal Trace Visualizer (detailed cycle-to-cycle visualization)

API Capturers (Collect phase)
• Capture API calls from a real game
• Gather API level statistics

API Players (Verify phase)
• Trace checking/integrity
• Batch-to-batch playing (helps debug)
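The capture and batch-by-batch playback idea can be illustrated with a minimal sketch. It is not the code of the OGL/D3D capturers or players: `FakeAPI`, `Capturer`, `split_batches` and `play` are hypothetical names, and a "batch" is assumed here to be the run of calls that ends at a draw call.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAPI:
    """Hypothetical stand-in for a graphics API; it only logs what was called."""
    log: list = field(default_factory=list)

    def bind_texture(self, tex_id):
        self.log.append(("bind_texture", (tex_id,)))

    def draw(self, vertex_count):
        self.log.append(("draw", (vertex_count,)))


class Capturer:
    """Forwards every call to the wrapped API and records it in a trace."""

    def __init__(self, api):
        self._api = api
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not found on the Capturer itself.
        target = getattr(self._api, name)

        def recorded(*args):
            self.trace.append((name, args))
            return target(*args)

        return recorded


def split_batches(trace, draw_call="draw"):
    """Group trace entries into batches, each ending at a draw call (assumption)."""
    batches, current = [], []
    for entry in trace:
        current.append(entry)
        if entry[0] == draw_call:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # trailing state changes without a draw
    return batches


def play(trace, api, max_batches=None):
    """Replay a trace batch by batch; stop early to inspect a single batch."""
    played = 0
    for batch in split_batches(trace):
        if max_batches is not None and played >= max_batches:
            break
        for name, args in batch:
            getattr(api, name)(*args)
        played += 1
    return played
```

Replaying only the first few batches with `max_batches` is the kind of stepping that makes a rendering bug easier to localize.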

Simulation (Simulate phase)
• Attila Drivers
  – AOGL (90%)
  – AD3D9 (60%)
• Attila Simulator
  – Detailed cycle-to-cycle simulation
  – 20 Boxes modeling 100-deep pipeline
  – Execute@Execute: Functionality embedded at each pipeline stage

Simulation output (Analyze phase)
• Micro-architectural statistics
• Traffic for cache, mem, …
• Signal trace (input for STV tool)
• Debug simulation performance

Attila Drivers
• OpenGL driver
  – 200 API calls supported.
  – 80% OpenGL 2.0 fixed functionality
• DirectX 9 driver
  – About 50 calls supported.
  – 60% API functionality.
[Diagram: Attila OpenGL Driver (GLLIB) and Attila DX9 Driver (D3DLIB), HAL, ATTILA Architecture]

Unified Driver Architecture
• Currently stalled due to lack of resources
• Runs basic traces
  – Non-textured torus with simple vertex shader.
[Diagram: AOGL*, ACDLX, AGL/ES, ADX9*, ADX10, AREY, ACDL, HAL, ATTILA Architecture]

Attila Architecture
[Block diagram: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Distributor, Scheduler, multiple Shader, ROP and Memory Controller units]
• Unified shaders, multithreaded
• GDDR4 detailed protocol, selectable memory schedulers…

Attila Simulator Implementation Using Boxes & Signals
• Data-driven & cycle-accurate
[Diagram of boxes connected by signals]
• Command Processor
• STREAMER/VERTEX FETCH: Streamer Fetch, Streamer Output Cache, Streamer Loader, Streamer Commit
• Primitive Assembly, Clipper, Triangle Setup, Fragment Generator, Hierarchical Z
• Z Stencil Test, Interpolator, Fragment FIFO
• SHADER: Shader Fetch, Shader Decode Execute
• Texture Unit, Color Write, DAC, Memory Controller
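A minimal sketch of the boxes-and-signals idea, assuming that a signal is a wire with a fixed latency and that every box is clocked once per cycle. The class names (`Signal`, `Producer`, `Stage`, `Sink`) and the `clock` method are illustrative assumptions, not the simulator's actual interfaces.

```python
from collections import deque


class Signal:
    """A wire with fixed latency: data written at cycle c is readable from c + latency."""

    def __init__(self, name, latency=1):
        if latency < 1:
            raise ValueError("latency must be at least one cycle")
        self.name = name
        self.latency = latency
        self._in_flight = deque()  # entries are (arrival_cycle, data)

    def write(self, cycle, data):
        self._in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        """Return the oldest datum that has arrived by this cycle, or None."""
        if self._in_flight and self._in_flight[0][0] <= cycle:
            return self._in_flight.popleft()[1]
        return None


class Producer:
    """Box that emits one queued item per cycle."""

    def __init__(self, out, items):
        self.out = out
        self.items = deque(items)

    def clock(self, cycle):
        if self.items:
            self.out.write(cycle, self.items.popleft())


class Stage:
    """Box that reads one input per cycle, applies fn, and forwards the result."""

    def __init__(self, inp, out, fn):
        self.inp, self.out, self.fn = inp, out, fn

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.out.write(cycle, self.fn(data))


class Sink:
    """Box that records when each datum arrives."""

    def __init__(self, inp):
        self.inp = inp
        self.received = []

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.received.append((cycle, data))


def simulate(boxes, cycles):
    """Clock every box once per cycle; boxes interact only through signals."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)
```

Because every signal carries at least one cycle of latency, nothing written in a cycle is visible in that same cycle, so the order in which boxes are clocked does not change the result. That property is what lets each box be written and tested in isolation.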

Lots of configurable parameters

GPU Unit              Params  Examples
COMMAND PROCESSOR          1  Batch pipelining
MEMORY CONTROLLER         42  Size, channels and banks (number and interleaving)
STREAMER                  13  Fetched indices and attributes per cycle
PRIMITIVE ASSEMBLY         4  Assembled triangles per cycle
CLIPPER                    5  Clipping latency
SETUP + RASTERIZER        43  MSAA samples/cycle, Enabled HZ
UNIFIED SHADER UNIT       39  Fetch Instrs/cycle, temp regs, scalar ALU
TEXTURE CACHE             19  Line size, ways, port width
ROP (Z + COLOR)           47  Compression, cache size
DAC                        9  Refresh rate
TOTAL                    222

Statistics – High Level
• API level
• µ-arch level
• “Workload Characterization of 3D Games”, IEEE International Symposium on Workload Characterization (IISWC 2006)

Statistics – Zooming In
[Chart labels: Light 0 (Stencil pass, Shading pass), Light 1 (Stencil pass, Shading pass)]
• Fine-grain stats at configurable intervals, e.g. every 100, 1K, 10K or 100K execution cycles.
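One way to picture interval-based statistics is a collector that bins per-cycle events into fixed-size windows. The sketch below is an assumption about how such binning could work; `WindowedStats` and the statistic name `"fragments"` are hypothetical and not taken from the simulator.

```python
from collections import Counter, defaultdict


class WindowedStats:
    """Accumulates named counters per window of `window_cycles` cycles."""

    def __init__(self, window_cycles):
        if window_cycles <= 0:
            raise ValueError("window_cycles must be positive")
        self.window_cycles = window_cycles
        self._windows = defaultdict(Counter)  # window index -> Counter of stats

    def record(self, cycle, stat, amount=1):
        """Add `amount` to `stat` in the window that contains `cycle`."""
        self._windows[cycle // self.window_cycles][stat] += amount

    def series(self, stat):
        """Per-window totals for one statistic, with zeros for idle windows."""
        if not self._windows:
            return []
        last = max(self._windows)
        return [self._windows.get(w, Counter())[stat] for w in range(last + 1)]


stats = WindowedStats(window_cycles=100)
stats.record(5, "fragments", 4)
stats.record(99, "fragments", 1)
stats.record(100, "fragments", 2)
stats.record(350, "fragments", 7)
print(stats.series("fragments"))  # [5, 2, 0, 7]
```

A smaller window exposes short phases such as the per-light stencil and shading passes, at the cost of more output data per simulated frame.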

Statistics – Cycle Level

GPU Memory Hierarchy Optimizations
Carlos González
cgonzale@ac.upc.edu

Previous Work
1. Initial Attila's Boxes & Signals framework
2. Tracing Framework
   – GLInterceptor & GLPlayer Tools
   – OpenGL Driver for Attila
   – Signal Trace Visualizer tool
3. New highly-detailed Memory Controller for Attila
4. Internship at ATI (6 months, '07)
   – Work mainly focused on the MC block
   – Analysis of bandwidth and latency by means of simulation techniques
   – Some contributions to the initial system
     • Mechanisms to pinpoint sources of latency and analyze bandwidth over time slices

Remarks on today's GPUs
• Tremendous bandwidth available
  – Core 2: 12 GB/sec vs. NVIDIA G80: > 100 GB/sec
• But…
  – Dozens of clients accessing memory simultaneously
  – Unbalanced and inefficient scheduling of memory transactions can lead to poor performance
    • Workload imbalance: total available BW decreases
    • Inefficient scheduling: latency increases (DDR protocol overhead)
• Overall performance degradation
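The effect of workload imbalance on available bandwidth can be illustrated with a deliberately simplified model: channels work in parallel, each at the same peak rate, and a transfer finishes only when the busiest channel finishes. The function name and the numbers below are illustrative assumptions, and DRAM protocol overhead is ignored.

```python
def effective_bandwidth(bytes_per_channel, channel_bw):
    """Aggregate bandwidth achieved when channels carry unequal traffic.

    bytes_per_channel: traffic routed to each channel (any consistent unit)
    channel_bw: peak bandwidth of one channel (same unit per second)
    """
    total = sum(bytes_per_channel)
    if total == 0:
        return 0.0
    # Channels run in parallel, so the busiest channel sets the finish time.
    finish_time = max(bytes_per_channel) / channel_bw
    return total / finish_time


# Four hypothetical channels of 25 GB/s each (100 GB/s peak in total).
print(effective_bandwidth([100, 100, 100, 100], 25.0))  # 100.0
print(effective_bandwidth([250, 50, 50, 50], 25.0))     # 40.0
```

In this toy case, sending 62.5% of the traffic to one of four channels cuts the usable bandwidth to 40% of peak, which is why bank and channel mapping matters.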

Thesis Goals
1. Optimize bank mapping and load balancing among memory channels. Also, propose multiple separated address spaces (per client)
2. Propose efficient memory controller scheduling algorithms
   • Also: measure DRAM chip consumption of our proposals
3. Propose new cache hierarchies for ROP and Texture units
4. Research on interconnection topologies

Some experiments…

Channel Interleaving Analysis
Some config. parameters of the experiment
• 8 channels of 32-bit
• 8 banks per 32-bit IO chip
• Bank interleaving fixed to 256
• 4 unified shaders (4x)

Memory Scheduling Analysis
Some config. parameters of the simulation
• 4 channels of 64-bit
• 8 banks per 32-bit IO chip
• Channel interleaving = 256 bytes
• Bank interleaving = 1024 bytes
• 4 unified shaders (4x)
• Texture cache line (L1) = 64 bytes
• Texture cache ways (L1) = 16
• Texture cache lines (L1) = 16
• Color and Zstencil caches: 4 ways, line size = 256 bytes, 16 cache lines

[Chart data labels: 17.8%, 17.6%, 17.1%, 13.6%]
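To make the interleaving parameters concrete, the sketch below maps a byte address to a (channel, bank) pair using the values listed for the memory scheduling experiment (4 channels, 8 banks, 256-byte channel interleaving, 1024-byte bank interleaving). The mapping scheme itself is an assumption chosen for illustration; the memory controller's real address mapping is not described on the slide, and `map_address` and `channel_histogram` are hypothetical helpers.

```python
from collections import Counter


def map_address(addr, channels=4, banks=8,
                channel_interleave=256, bank_interleave=1024):
    """Map a byte address to (channel, bank) under a simple interleaving scheme."""
    chunk = addr // channel_interleave
    channel = chunk % channels
    # Address as seen inside the selected channel, with the channel bits removed.
    local = (chunk // channels) * channel_interleave + addr % channel_interleave
    bank = (local // bank_interleave) % banks
    return channel, bank


def channel_histogram(addresses, **config):
    """Count how many of the given accesses land on each channel."""
    return Counter(map_address(a, **config)[0] for a in addresses)


# A linear sweep spreads evenly over the channels...
print(channel_histogram(range(0, 32768, 64)))
# ...while a 1024-byte stride (channels * channel_interleave) hits only channel 0.
print(channel_histogram(range(0, 16384, 1024)))
```

The strided case is the kind of access pattern that produces the workload imbalance discussed earlier, and it is one reason to study different interleaving granularities.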

Micropolygon Rendering
Jordi Roca
jroca@ac.upc.edu

Past work
1. OpenGL Fixed Function to ARB vp/fp 1.0 translator.
2. Workload Characterization of 3D Games (IISWC '06):
   – Extensive analysis of current games in terms of both API call and µarchitectural level stats.
3. Multi-GPU performance evaluation project (during the 2007 internship at ATI):
   – Hybrid SFR/AFR modes.
   – Alternatives for RTT surface synchronization.
   – Scaling of current PCIe BW.
   (A related paper is currently submitted to IISWC 2008.)

Micropolygon rendering
• Understanding and characterizing the pipeline backend imbalance caused by very small polygons.
  – Newer games tend to render outdoor scenes, so many polygons project to only a few pixels.
• Synthetic micropolygon test: fills the screen with pixel-aligned 1-pixel quads.
  – Raster input: 1 triangle/clock
  – Raster output: 15/16 empty slots/clock (high-end cards).
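The 15/16 figure follows from simple arithmetic if the rasterizer is assumed to offer 16 fragment slots per clock and each incoming triangle covers a single pixel. The sketch below works that out; the 16-slot width is inferred from the fraction on the slide, and `raster_slot_utilization` is a hypothetical helper.

```python
from fractions import Fraction


def raster_slot_utilization(pixels_per_triangle, triangles_per_clock=1,
                            slots_per_clock=16):
    """Fraction of rasterizer output slots filled per clock (capped at 1)."""
    produced = min(pixels_per_triangle * triangles_per_clock, slots_per_clock)
    return Fraction(produced, slots_per_clock)


used = raster_slot_utilization(pixels_per_triangle=1)
print(used)      # 1/16
print(1 - used)  # 15/16 of the slots stay empty
print(raster_slot_utilization(pixels_per_triangle=16))  # 1
```

Under these assumptions the backend runs at 1/16 of its peak fragment rate on single-pixel triangles, which is the imbalance the proposals on the following slide target.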

Research on:
• Proposal #1: µpolygon grid traversal scheme:
  – An alternative rasterization path to detect and efficiently traverse grids of adjacent pixel-size primitives:
    • Fill backend slots combining fragments of different primitives.
    • Reuse triangle setup and traversal computations for pixel-proximate primitives.
• Proposal #2: Dynamic balancing of rasterization workload:
  – Assign & schedule shader threads for rasterization.

DX9 Driver Development
Chema Solís
csolis@ac.upc.edu

Project target
[Diagram: D3D application, D3D9 Trace, Microsoft D3D9, ATTILA D3D9 driver]
• The project target is to use D3D9 games as workloads for the ATTILA GPU simulator.
• Two main tasks:
  – Trace D3D9 calls executed by the games.
  – Build a D3D9 driver on top of the GPU simulator.

PixRun Player
• Executes traces of D3D9 calls captured by Microsoft PIX.
• Analyzes how the game is using D3D9.

D3D9 Driver
• D3D9 functionality is being added progressively.
• The driver is close to supporting commercial games.

Unified Shader Architecture
Victor Moya
vmoya@ac.upc.edu

Unified Shader Architecture
• Evaluated the performance of a unified vertex and fragment shader architecture on legacy applications
  – Evaluated area vs. performance
• Evaluated the performance of implementing Triangle Setup on the shader for embedded GPU architectures
• Evaluated the bottlenecks of GPU architectures with a high shader ALU to texture ratio.

Current Research
• Evaluate thread and resource scheduling in a unified shader architecture
• Implementing blending on the shader