GPU baseline architecture and GPGPU-Sim
Presented by 王建飞, 2017.9.28
A typical GPGPU:
Related terminology:
• GPC: graphics processing cluster, a cluster of SMs
• SM: streaming multiprocessor
• SIMT core: single instruction, multiple threads (compare with SIMD)
On-chip memory:
• RF: register file, large
• L1D cache: private to each SM, weakly coherent
• Shared memory: programmer-controlled
Runtime of GPGPU 1:
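Runtime of GPGPU 2:
• Scheduler: LRR (loose round-robin), GTO (greedy-then-oldest)
• SIMT stack: tracks branch divergence; diverged threads reconverge at the branch's immediate post-dominator
• Operand collector: reads instruction operands from the RF
• Lanes: SP, SFU, MEM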
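The two warp scheduling policies named on this slide can be sketched in a few lines. This is a minimal illustration in Python, not GPGPU-Sim's implementation: the function names `pick_lrr` and `pick_gto`, the `ready` list (one flag per warp), and the `oldest_first` ordering are all assumptions made for the example.

```python
def pick_lrr(ready, last_issued):
    """Loose round-robin: scan warps in circular order, starting just after
    the warp that issued last, and return the first ready one (or None)."""
    n = len(ready)
    for offset in range(1, n + 1):
        warp = (last_issued + offset) % n
        if ready[warp]:
            return warp
    return None


def pick_gto(ready, last_issued, oldest_first):
    """Greedy-then-oldest: keep issuing from the same warp while it is ready;
    once it stalls, fall back to the oldest ready warp (or None)."""
    if last_issued is not None and ready[last_issued]:
        return last_issued
    for warp in oldest_first:  # hypothetical age order, oldest warp first
        if ready[warp]:
            return warp
    return None


# Four warps; warp 2 is stalled (for example, waiting on memory).
ready = [True, True, False, True]
oldest_first = [0, 1, 2, 3]

print(pick_lrr(ready, last_issued=1))                # 3: skips stalled warp 2
print(pick_gto(ready, last_issued=1, oldest_first=oldest_first))  # 1: stays greedy
print(pick_gto(ready, last_issued=2, oldest_first=oldest_first))  # 0: oldest ready
```

LRR spreads issue slots evenly across warps, while GTO tends to let one warp run ahead, which can improve cache locality for that warp.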
A typical code study 1:
• Constant during a kernel launch: gridDim.x, blockDim.x
• Varies per block / per thread: blockIdx.x, threadIdx.x
• __global__: runs on the device, called from the host
• __device__: runs on the device, called from the device
Source: CUDA by Example
blocksPerGrid = 32, threadsPerBlock = 256
So: gridDim.x = 32, blockDim.x = 256
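To make the roles of these built-in variables concrete, the sketch below emulates on the host, in Python, how each thread derives a global index from blockIdx.x, threadIdx.x, and blockDim.x. The grid-stride loop (each thread advancing by blockDim.x * gridDim.x) is an assumption about how such a kernel typically walks an array larger than the grid; the function names and the array length 20000 are hypothetical.

```python
GRID_DIM_X = 32    # blocksPerGrid from the slide
BLOCK_DIM_X = 256  # threadsPerBlock from the slide


def global_thread_id(block_idx_x, thread_idx_x):
    """Mirror of the usual CUDA expression
    threadIdx.x + blockIdx.x * blockDim.x."""
    return thread_idx_x + block_idx_x * BLOCK_DIM_X


def elements_for_thread(block_idx_x, thread_idx_x, n):
    """Indices one thread would visit in a grid-stride loop over n elements
    (assumed access pattern, not taken from the slides)."""
    stride = BLOCK_DIM_X * GRID_DIM_X  # total threads in the grid: 8192
    return list(range(global_thread_id(block_idx_x, thread_idx_x), n, stride))


print(global_thread_id(1, 0))              # 256: first thread of block 1
print(global_thread_id(31, 255))           # 8191: last thread of the grid
print(elements_for_thread(0, 0, 20000))    # [0, 8192, 16384]
```

With 32 blocks of 256 threads, the grid has 8192 threads in total, so every element of a longer array is still visited exactly once across all threads.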
A typical code study 2:
GPGPU-Sim:
• A cycle-level GPU performance simulator that focuses on "GPU computing" (general-purpose computation on GPUs)
• Replaces the CUDA API and supplies a configurable GPU
• Simulation model: functional simulation (cuda-sim.h/cc) and timing simulation (shader.h/cc)
• gpu-cache.h/cc: cache model
Simulation pipeline:
• register_set: temporary buffer that holds instructions between pipeline stages
• m_fu: functional units (sp, sfu, ldst_unit)
Reference: 1. GPGPU-Sim manual; 2. NVIDIA Fermi/Kepler architecture whitepapers
Instruction Set Architecture:
• PTX: Parallel Thread Execution, a pseudo-assembly instruction set
• ptxas: compiles PTX to SASS (strength reduction, instruction scheduling, register allocation)
• SASS: the native GPU ISA
• PTXPlus: extends PTX with the features required to provide a one-to-one mapping to SASS
Instruction Set Architecture:
Instruction Set Architecture:
// SASS
S2R R0, SR_CTAid_X;
S2R R2, SR_Tid_X;
// PTX
mov.u32 %r3, %ctaid.x;
mov.u32 %r5, %tid.x;
// PTXPlus
mad.lo.u16 $r0, %ctaid.x, 0x00000200, $r0;
mov.u16 $r4.lo, 0x0000;
Thanks