OCELOT PTX EMULATOR SCHOOL OF ELECTRICAL AND COMPUTER

Overview Ocelot PTX Emulator Multicore-Backend NVIDIA AMD GPU Backend SCHOOL OF ELECTRICAL AND COMPUTER

3 Execution Model NVIDIA’s PTX Execution Model Parallel thread Execution (PTX) Explicit memory hierarchy

4 The Ocelot PTX Emulator Abstract machine model Performs functional simulation of the PTX

Emulator Implementation Serialized execution of CTAs Implements the CUDA execution model semantics CUDA does

Ocelot Source Code: PTX Emulator Device Backend • ocelot/ • executive/ • • •

7 Trace Generator Interface Emulator broadcasts events to event trace analyzers during execution Events

8 Trace Generator Interface // ocelot/trace/interface/Trace. Generator. h // // Base class for generating

9 Trace. Event Object Captures execution of a dynamic instruction, including Device pointer to

trace: : Trace. Event // file ocelot/trace/interface/Trace. Event. h class Trace. Event { public:

11 PTX Emulator – CUDA Debugging – Race Detection // file: race. Condition. cu

12 PTX Emulator – CUDA Debugging- Illegal Memory Accesses // file: memory. Check. cu

Interactive Debugger Interactive command-line debugger implemented using Trace. Generator interface $. /Test. Cuda. Sequence

Performance Tuning Identify critical regions • Memory demand • Floating-point intensity • Shared memory

15 Example: Inter-thread Data Flow • Which kernels exchange computed results through shared memory?

16 Current Trace Generators Function Instruction Integrated Debugger Kernel Dimension Measures control flow uniformity

17 Using Trace Generators Implement override methods: initialize( ), event( ), post. Event( ),

18 Configuring & Executing Applications Controls Ocelot’s initial state Located in application’s startup directory

19 Dynamic Instruction Count Example: Thread Load Imbalance //! Computes number of dynamic instructions

EXAMPLE: CONTROL-FLOW DIVERGENCE SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Control-Flow Divergence in PTX Emulator facilitates customizable handlers for control flow divergence Currently implements:

class Reconvergence. Mechanism // ocelot/executive/interface/Reconvergence. Mechanism. h // virtual void eval_Reconverge(CTAContext &context, class Reconvergence.

Example: Immediate Post-Dominator Reconvergence // ocelot/executive/interface/Reconvergence. Mechanism. h // CTAContext& get. Context(); size_t stack.

Example: Immediate Post-Dominator Reconvergence // ocelot/executive/implementation/Reconvergence. Mechanism. cpp // void executive: : Reconvergence. IPDOM:

Example: Immediate Post-Dominator Reconvergence void executive: : Reconvergence. IPDOM: : eval_Reconverge( executive: : CTAContext

Applications: Thread Frontiers Evaluate the impact of novel warp reconvergence mechanisms on unstructured control-flow

Applications: Thread Frontiers // ocelot/executive/interface/Reconvergence. Mechanism. h // class Reconvergence. TFSorted. Stack: public Reconvergence.

Applications: Thread Frontiers // ocelot/executive/implementation/Reconvergence. Mechanism. cpp // executive: : CTAContext& executive: : Reconvergence.

Applications: Thread Frontiers SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Summary: Ocelot PTX Emulator Used Can for instruction level analysis be attached to trace

Slides: 30

Download presentation

OCELOT: PTX EMULATOR SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 1

Overview Ocelot PTX Emulator Multicore-Backend NVIDIA AMD GPU Backend SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 2

3 Execution Model NVIDIA’s PTX Execution Model Parallel thread Execution (PTX) Explicit memory hierarchy Cooperative thread arrays (CTAs) and ordering constraints Array of multiprocessors each executing a CTA (coarse grain) SIMD multiprocessors (fine grain) Single instruction multiple thread (SIMT) execution Enables hardware to exploit control flow uniformity and data locality SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

4 The Ocelot PTX Emulator Abstract machine model Performs functional simulation of the PTX execution model Access to complete machine state Enables detailed performance evaluation/debugging Program correctness checking Alignment checks, out of bounds etc. Workload modeling characterization and Trace generation to drive architecture simulators PTX 3. 0 (Fermi) support Timing SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY information not available 4

Emulator Implementation Serialized execution of CTAs Implements the CUDA execution model semantics CUDA does not guarantee concurrent CTA execution! Implements Multiple abstract reconvergence mechanisms are implemented: IPDOM, Barrier, Thread frontiers Software Texture support for special functions: sampling 1 D, 2 D, 3 D, cube Nearest, linear New instructions may be prototyped SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Ocelot Source Code: PTX Emulator Device Backend • ocelot/ • executive/ • • • interface/Emulator. Device. h interface/Emulated. Kernel. h interface/Emulator. Call. Stack. h interface/Cooperative. Thread. Array. h interface/Texture. Operations. h • trace/ • interface/Trace. Event. h • interface/Trace. Generator. h SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 6

7 Trace Generator Interface Emulator broadcasts events to event trace analyzers during execution Events provide detailed device state: PC, activity mask, operand data, thread ID, etc. Used for error checking, instrumentation, and simulation SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 7

8 Trace Generator Interface // ocelot/trace/interface/Trace. Generator. h // // Base class for generating traces class Trace. Generator { public: Trace. Generator(); virtual ~Trace. Generator(); // called when a traced kernel is launched to // retrieve some parameters from the kernel virtual void initialize( const executive: : Executable. Kernel& kernel); // Called whenever an event takes place. virtual void event(const Trace. Event & event); // called when an event is committed virtual void post. Event(const Trace. Event & event); // Called when a kernel is finished. There will // be no more events for this kernel. virtual void finish(); }; SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 8

9 Trace. Event Object Captures execution of a dynamic instruction, including Device pointer to access device state Kernel grid, CTA dimensions, parameter values PC Set and internal representation of PTX instruction of active threads executing instruction Memory Branch addresses and size of transfer target(s) and diverging threads SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 9

trace: : Trace. Event // file ocelot/trace/interface/Trace. Event. h class Trace. Event { public: // ID of the block that generated the event ir: : Dim 3 block. Id; // PC index into Emulated. Kernel's packed instruction sequence ir: : PTXU 64 PC; // Depth of call stack [i. e. number of contexts on the runtime stack] ir: : PTXU 32 context. Stack. Size; // Instruction const pointer to instruction pointed to by PC const ir: : PTXInstruction* instruction; // Bit mask of active threads that executed this instruction Bit. Mask active; // Taken thread mask in case of a branch Bit. Mask taken; // Fall through thread mask in case of a branch Bit. Mask fallthrough; // Vector of memory addresses possibly generated for this instruction U 64 Vector memory_addresses; // Vector of sizes of memory operations possibly issued by this // instruction ir: : PTXU 32 memory_size; // Dimensions of the kernel grid that generated the event ir: : Dim 3 grid. Dim; // Dimensions of the kernel block that generated the event ir: : Dim 3 block. Dim; // Captures just events related to thread reconvergence Reconvergence. Trace. Event reconvergence; }; SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

11 PTX Emulator – CUDA Debugging – Race Detection // file: race. Condition. cu __global__ void race. Condition(int *A) { __shared__ int Shared. Mem[64]; Shared. Mem[thread. Idx. x] = A[thread. Idx. x]; // no synchronization barrier! A[thread. Idx. x] = Shared. Mem[64 - thread. Idx. x]; // line 9 - faulting load }. . . race. Condition<<< dim 3(1, 1), dim 3(64, 1) >>>( valid. Ptr ); . . . ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z 13 race. Condition. Pi" with exception: ==Ocelot== [PC 15] [thread 0] [cta 0] ld. shared. s 32 %r 14, [%r 13 + 252] - Shared memory race condition, address 0 xfc was previously written by thread 63 without a memory barrier in between. ==Ocelot== Near race. Condition. cu: 9: 0 SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 11

12 PTX Emulator – CUDA Debugging- Illegal Memory Accesses // file: memory. Check. cu __global__ void bad. Memory. Reference(int *A) { A[thread. Idx. x] = 0; // line 3 - faulting store } int main() { int *invalid. Ptr = 0 x 0234; // arbitrary pointer does not refer // int *valid. Ptr = 0; to an existing memory allocation cuda. Malloc((void **)&valid. Ptr, sizeof(int)*64); bad. Memory. Reference<<< dim 3(1, 1), dim 3(64, 1) >>>( invalid. Ptr ); return 0; } ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z 18 bad. Memory. Reference. Pi" with exception: ==Ocelot== [PC 5] [thread 0] [cta 0] st. global. s 32 [%r 4 + 0], %r 0 - Global memory access 0 x 234 is not within any allocated or mapped range. ==Ocelot== Nearby Device Allocations ==Ocelot== [0 x 12 fa 2 e 0] - [0 x 12 fa 3 e 0] (256 bytes) ==Ocelot== Near memory. Check. cu: 3: 0 SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 12

Interactive Debugger Interactive command-line debugger implemented using Trace. Generator interface $. /Test. Cuda. Sequence A_gpu = 0 x 16 dcbe 0 (ocelot-dbg) Attaching debugger to kernel 'sequence' (ocelot-dbg) watch global address 0 x 16 dcbe 4 s 32[3] set #1: watch global address 0 x 16 dcbe 4 s 32[3] - 12 bytes (ocelot-dbg) continue st. global. s 32 [%r 11 + 0], %r 7 watchpoint #1 - CTA (0, 0) thread (1, 0, 0) - store to 0 x 16 dcbe 4 4 bytes old value = -1 new value = 2 thread (2, 0, 0) - store to 0 x 16 dcbe 8 4 bytes old value = -1 new value = 4 thread (3, 0, 0) - store to 0 x 16 dcbec 4 bytes old value = -1 new value = 6 break on watchpoint (ocelot-dbg) • Enables • Inspection of application state • Single-stepping of instructions • Breakpoints and watchpoints • Faults in Memory. Checker and Race. Detector invoke ocelotdbg automatically SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Performance Tuning Identify critical regions • Memory demand • Floating-point intensity • Shared memory bank conflicts • . . for (int offset = 1; offset < n; offset *= 2) { // line 61 pout = 1 - pout; pin = 1 - pout; __syncthreads(); temp[pout*n+thid] = temp[pin*n+thid]; if (thid >= offset) { temp[pout*n+thid] += temp[pin*n+thid - offset]; } }. . SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

15 Example: Inter-thread Data Flow • Which kernels exchange computed results through shared memory? • Track id of producer thread • Ensure threads are well synchronized • Optionally ignore uses of shared memory to transfer working sets SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 15

16 Current Trace Generators Function Instruction Integrated Debugger Kernel Dimension Measures control flow uniformity and branch divergence Static and dynamic instruction count GDB-like interface Kernel grid and block dimensions Machine Attributes Observe and record machine characteristics Branch Memory Checker Memory Race Detector Parallelism Performance Bound Shared Computation Warp Synchronous Working set size, memory intensity, memory efficiency i) Bounds checks, ii) alignment checks, and iii) uninitialized loads (shared memory) Race conditions on shared memory MIMD and SIMD parallelism limits Compute and memory throughput Extent of data flow among threads Hot-paths/regions for warp synchronous execution SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 16

17 Using Trace Generators Implement override methods: initialize( ), event( ), post. Event( ), finish( ) Add to Ocelot runtime explicitly: or, ocelot: : add. Trace. Generator( ) add to trace: : Trace. Configuration Online Link Trace. Generator interface analysis or serialize event traces applications with libocelot. Trace. so SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 17

18 Configuring & Executing Applications Controls Ocelot’s initial state Located in application’s startup directory trace: { Edit configure. ocelot memory. Checker: { enabled: true, check. Initialization: false trace specifies which trace generators are initially attached }, race. Detector: { executive controls device properties enabled: true, ignore. Irrelevant. Writes: true trace: memory. Checker – ensures race. Detector - enforces synchronized access to. shared debugger - interactive debugger }, debugger: { enabled: true, kernel. Filter: "_Z 13 scalar. Prod. GPUPf. S_S_ii", executive: devices: List of Ocelot backend devices that are enabled emulated – Ocelot PTX emulator (trace generators) always. Attach: false }, }, executive: { devices: [ "emulated" ], Additional devices: nvidia – execution on NVIDIA GPUs llvm – efficient execution of PTX on multicore CPU amd – translation to AMD IL for PTX on AMD RADEON GPU SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY } }

19 Dynamic Instruction Count Example: Thread Load Imbalance //! Computes number of dynamic instructions for each thread class Thread. Load. Imbalance: public trace: : Trace. Generator { • Mandelbrot (CUDA SDK) public: std: : vector< size_t > dynamic. Instructions; // For each dynamic instruction, increment counters of each // thread that executes it virtual void event(const Trace. Event & event) { if (!dynamic. Instructions. size()) dynamic. Instructions. resize(event. active. size(), 0); for (int i = 0; i < event. active. size(); i++) { if (event. active[i]) dynamic. Instructions[i]++; } } }; SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 19

EXAMPLE: CONTROL-FLOW DIVERGENCE SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 20

Control-Flow Divergence in PTX Emulator facilitates customizable handlers for control flow divergence Currently implements: Immediate post-dominator (ipdom) Barrier divergence Thread frontiers, sorted stack Thread frontiers, GEN 6 Assumes warp is CTA wide Abstract handlers implement potentially divergent control instructions eg. Bra, Bar, Exit, Vote Executing instructions drives Trace. Generators Reconvergence affects active threads, dynamic instruction count, and instruction trace Analysis tools can group threads into warps of arbitrary size SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 21

class Reconvergence. Mechanism // ocelot/executive/interface/Reconvergence. Mechanism. h // virtual void eval_Reconverge(CTAContext &context, class Reconvergence. Mechanism { const ir: : PTXInstruction &instr) = 0; public: Reconvergence. Mechanism(Cooperative. Thread. Array *cta); virtual void eval_Exit(CTAContext &context, virtual ~Reconvergence. Mechanism(); const ir: : PTXInstruction &instr) = 0; virtual void initialize() = 0; virtual void eval_Vote(CTAContext &context, const ir: : PTXInstruction &instr); virtual void eval. Predicate(CTAContext &context) = 0; virtual bool next. Instruction(CTAContext &context, virtual bool eval_Bra(CTAContext &context, const ir: : PTXInstruction &instr, const boost: : dynamic_bitset<> & branch, const ir: : PTXInstruction: : Opcode &) = 0; const boost: : dynamic_bitset<> & fallthrough) = 0; virtual CTAContext& get. Context() = 0; virtual void eval_Bar(CTAContext &context, } const ir: : PTXInstruction &instr) = 0; SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 22

Example: Immediate Post-Dominator Reconvergence // ocelot/executive/interface/Reconvergence. Mechanism. h // CTAContext& get. Context(); size_t stack. Size() const; class Reconvergence. IPDOM: public Reconvergence. Mechanism { void push(CTAContext&); public: void pop(); Reconvergence. IPDOM(Cooperative. Thread. Array *cta); ~Reconvergence. IPDOM(); std: : vector<CTAContext> runtime. Stack; std: : vector<int> pc. Stack; void initialize(); void eval. Predicate(CTAContext &context); bool eval_Bra(CTAContext &context, unsigned int reconverge. Events; }; PTXInstruction &instr, dynamic_bitset<> & branch, dynamic_bitset<> & fallthrough); void eval_Bar(CTAContext &context, PTXInstruction &instr); void eval_Reconverge(CTAContext &context, PTXInstruction &instr); void eval_Exit(CTAContext &context, PTXInstruction &instr); bool next. Instruction(CTAContext &context, PTXInstruction &instr, PTXInstruction: : Opcode &); SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 23

Example: Immediate Post-Dominator Reconvergence // ocelot/executive/implementation/Reconvergence. Mechanism. cpp // void executive: : Reconvergence. IPDOM: : eval_Bar(executive: : CTAContext &context, const ir: : PTXInstruction &instr) { if (context. active. count() < context. active. size()) { // deadlock - not all threads reach synchronization barrier #if REPORT_BAR report(" Bar called - " << context. active. count() << " of " << context. active. size() << " threads active"); #endif std: : stringstream message; message << "barrier deadlock: n"; throw Runtime. Exception(message. str(), context. PC, instr); } } SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 24

Example: Immediate Post-Dominator Reconvergence void executive: : Reconvergence. IPDOM: : eval_Reconverge( executive: : CTAContext &context, const ir: : PTXInstruction &instr) { if(runtime. Stack. size() > 1) { if(pc. Stack. back() == context. PC) { pc. Stack. pop_back(); runtime. Stack. pop_back(); ++reconverge. Events; } else { context. PC++; } } void executive: : Reconvergence. IPDOM: : eval_Exit(executive: : CTAContext &context, const ir: : PTXInstruction &instr) { eval_Bar(context, instr); context. running = false; } SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 25

Applications: Thread Frontiers Evaluate the impact of novel warp reconvergence mechanisms on unstructured control-flow graphs Approach: Control the layout of basic blocks Select threads with highest priority PC Model priority queue in hardware Evaluation: measure impact on Activity factor - SIMD utilization Dynamic instruction count Effective memory bandwidth Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu, Sudhakar Yalamanchili. SIMD Reconvergence at Thread Frontiers. 44 th International Symposium on Microarchitecture (MICRO 44). December 2011. SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 26

Applications: Thread Frontiers // ocelot/executive/interface/Reconvergence. Mechanism. h // class Reconvergence. TFSorted. Stack: public Reconvergence. Mechanism { public: Reconvergence. TFSorted. Stack(Cooperative. Thread. Array *cta); ~Reconvergence. TFSorted. Stack(); //. . . omitted typedef std: : map<int, CTAContext> Runtime. Stack; typedef std: : vector<Runtime. Stack> Stack. Vector; Stack. Vector stack; unsigned int reconverge. Events; }; SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 27

Applications: Thread Frontiers // ocelot/executive/implementation/Reconvergence. Mechanism. cpp // executive: : CTAContext& executive: : Reconvergence. TFSorted. Stack: : get. Context() { bool executive: : Reconvergence. TFSorted. Stack: : next. Instruction( executive: : CTAContext &context, PTXInstruction &instr, return stack. back(). begin()->second; PTXInstruction: : Opcode &opcode) { } // advance to next instruction if the current instruction void executive: : Reconvergence. TFSorted. Stack: : eval_Exit( // wasn't a branch executive: : CTAContext &context, PTXInstruction &instr) { if (opcode != ir: : PTXInstruction: : Bra if (stack. back(). size() == 1) { && opcode != ir: : PTXInstruction: : Call context. running = false; && opcode != ir: : PTXInstruction: : Ret) { } context. PC++; else { } throw Runtime. Exception("not all threads hit the exit: ", context. PC, instr); } return context. running; } } SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 28

Applications: Thread Frontiers SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 29

Summary: Ocelot PTX Emulator Used Can for instruction level analysis be attached to trace generators and trace analyzers Simple Forms processing and filtering of machine state the basis for a range of productivity tools Correctness tools Debugging tools Workload characterization Drives (Part instruction and address traces to MACSIM 2 of this tutorial) SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 30