GPU baseline architecture and GPGPU-Sim
Presented by 王建飞, 2017.9.28

A typical GPGPU:
Related terminology:
• GPC: graphics processing cluster, a cluster of SMs
• SM: streaming multiprocessor
• SIMT core: single instruction, multiple threads (compare with SIMD?)
On-chip memory:
• RF: register file, large
• L1D cache: private, weakly coherent
• Shared memory: programmer-controlled

Runtime of GPGPU 1:

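Runtime of GPGPU 2:
• Scheduler: LRR (loose round-robin), GTO (greedy-then-oldest)
• SIMT stack: divergent threads reconverge at the post-dominator
• Operand collector: reads source operands from the RF
• Lanes: SP, SFU, MEM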
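The two scheduling policies named above can be sketched as follows. This is a minimal host-side illustration, not GPGPU-Sim's scheduler code: the `Warp` struct and the functions `pick_lrr` and `pick_gto` are hypothetical names, and "ready" is reduced to a single flag per warp.

```cpp
#include <cassert>
#include <vector>

// Hypothetical warp record for this sketch (not GPGPU-Sim's data structure).
struct Warp {
    int id;      // warp ID within the SIMT core
    int age;     // smaller value = older warp (e.g. launch order)
    bool ready;  // true if the warp's next instruction can issue this cycle
};

// Loose round-robin (LRR): start just after the warp that issued last and
// take the first ready warp found. Pass -1 as last_issued before any issue.
// Returns an index into `warps`, or -1 if no warp is ready.
int pick_lrr(const std::vector<Warp>& warps, int last_issued) {
    const int n = static_cast<int>(warps.size());
    assert(last_issued >= -1 && last_issued < n);
    for (int step = 1; step <= n; ++step) {
        const int idx = (last_issued + step) % n;
        if (warps[idx].ready) return idx;
    }
    return -1;
}

// Greedy-then-oldest (GTO): keep issuing from the last warp while it is
// ready; once it stalls, fall back to the oldest ready warp.
// Returns an index into `warps`, or -1 if no warp is ready.
int pick_gto(const std::vector<Warp>& warps, int last_issued) {
    const int n = static_cast<int>(warps.size());
    if (last_issued >= 0 && last_issued < n && warps[last_issued].ready)
        return last_issued;
    int best = -1;
    for (int i = 0; i < n; ++i) {
        if (!warps[i].ready) continue;
        if (best < 0 || warps[i].age < warps[best].age) best = i;
    }
    return best;
}
```

With four warps of which warp 1 is stalled, `pick_lrr(warps, 0)` moves on to warp 2, while `pick_gto(warps, 0)` stays on warp 0 for as long as it remains ready.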

A typical code study 1:
• Constant: gridDim.x, blockDim.x
• Variable: blockIdx.x, threadIdx.x
• __global__: called from the host
• __device__: called from the device
Source: CUDA by Example; blocksPerGrid = 32, threadsPerBlock = 256
So: gridDim.x = 32, blockDim.x = 256
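To make the index arithmetic concrete, the sketch below reproduces it on the host in plain C++, so it can be checked without a GPU. The struct `LaunchConfig` and both function names are hypothetical stand-ins for CUDA's built-in variables. The grid-stride loop is an assumption about the access pattern, since the kernel shown on the slide is not reproduced here.

```cpp
#include <cassert>
#include <vector>

// Host-side stand-ins for CUDA's built-in launch dimensions
// (hypothetical struct, used only to check the arithmetic).
struct LaunchConfig {
    unsigned grid_dim_x;   // gridDim.x: number of blocks in the grid
    unsigned block_dim_x;  // blockDim.x: number of threads per block
};

// Global index of a thread: blockIdx.x * blockDim.x + threadIdx.x.
unsigned global_thread_id(const LaunchConfig& cfg,
                          unsigned block_idx_x, unsigned thread_idx_x) {
    assert(block_idx_x < cfg.grid_dim_x && thread_idx_x < cfg.block_dim_x);
    return block_idx_x * cfg.block_dim_x + thread_idx_x;
}

// Elements one thread visits in a grid-stride loop over n elements:
// it starts at its global index and advances by blockDim.x * gridDim.x.
std::vector<unsigned> elements_for_thread(const LaunchConfig& cfg,
                                          unsigned block_idx_x,
                                          unsigned thread_idx_x,
                                          unsigned n) {
    std::vector<unsigned> visited;
    const unsigned stride = cfg.block_dim_x * cfg.grid_dim_x;
    for (unsigned i = global_thread_id(cfg, block_idx_x, thread_idx_x);
         i < n; i += stride) {
        visited.push_back(i);
    }
    return visited;
}
```

With gridDim.x = 32 and blockDim.x = 256, the launch has 32 * 256 = 8192 threads, so the last thread (blockIdx.x = 31, threadIdx.x = 255) has global index 8191 and the stride of the loop is 8192.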

A typical code study 2:

GPGPU-Sim: a cycle-level GPU performance simulator that focuses on "GPU computing" (general-purpose computation on GPUs)
• Replaces the CUDA API and supplies a configurable GPU model
• Simulation model: functional simulation (cuda-sim.h/cc) and timing simulation (shader.h/cc)
• gpu-cache.h/cc: cache model
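As a rough idea of what a cache model in a timing simulator keeps track of, here is a toy set-associative cache with LRU replacement that stores tags only. It is a sketch under stated assumptions and is not taken from gpu-cache.h/cc: the class `ToyCache`, its members, and the allocate-on-miss policy are all hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy set-associative cache with LRU replacement (tags only, no data).
// All names are hypothetical; this is not the interface of gpu-cache.h.
class ToyCache {
public:
    ToyCache(unsigned num_sets, unsigned ways, unsigned line_bytes)
        : num_sets_(num_sets), ways_(ways), line_bytes_(line_bytes),
          lines_(static_cast<std::size_t>(num_sets) * ways) {
        assert(num_sets > 0 && ways > 0 && line_bytes > 0);
    }

    // Look up an address and return true on a hit. On a miss, an invalid
    // way (or else the least recently used way) of the set is refilled.
    bool access(std::uint64_t addr) {
        ++now_;
        const std::uint64_t block = addr / line_bytes_;
        const std::size_t set = static_cast<std::size_t>(block % num_sets_);
        const std::uint64_t tag = block / num_sets_;
        Line* victim = &lines_[set * ways_];
        for (unsigned w = 0; w < ways_; ++w) {
            Line& line = lines_[set * ways_ + w];
            if (line.valid && line.tag == tag) {
                line.last_use = now_;
                ++hits_;
                return true;
            }
            // Track the replacement candidate while scanning the set.
            if (victim->valid &&
                (!line.valid || line.last_use < victim->last_use)) {
                victim = &line;
            }
        }
        victim->valid = true;
        victim->tag = tag;
        victim->last_use = now_;
        ++misses_;
        return false;
    }

    unsigned hits() const { return hits_; }
    unsigned misses() const { return misses_; }

private:
    struct Line {
        bool valid = false;
        std::uint64_t tag = 0;
        std::uint64_t last_use = 0;  // logical time of the last access
    };
    unsigned num_sets_, ways_, line_bytes_;
    std::vector<Line> lines_;
    std::uint64_t now_ = 0;
    unsigned hits_ = 0, misses_ = 0;
};
```

A timing model would additionally charge a latency per hit or miss and track outstanding misses; the sketch leaves that out to keep the lookup and replacement logic visible.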

Simulation line:
• register_set: temporary buffer for instructions
• m_fu: sp, sfu, ldst_unit
Reference: 1. GPGPU-Sim manual; 2. NVIDIA Fermi/Kepler architecture whitepapers

Instruction Set Architecture:
• PTX: Parallel Thread eXecution, a pseudo-assembly instruction set
• ptxas: assembles PTX into SASS (strength reduction, instruction scheduling, register allocation)
• SASS: a native GPU ISA
• PTXPlus: extends PTX with the features required to provide a one-to-one mapping to SASS

Instruction Set Architecture:

Instruction Set Architecture:
// SASS
S2R R0, SR_CTAid_X;
S2R R2, SR_Tid_X;
// PTX
mov.u32 %r3, %ctaid.x;
mov.u32 %r5, %tid.x;
// PTXPlus
mad.lo.u16 $r0, %ctaid.x, 0x00000200, $r0;
mov.u16 $r4.lo, 0x0000;

Thanks