Jhe-Yu Liou
NCKU Computer Architecture and System Laboratory

From ESL to Silicon

Outline
- From ESL to Silicon
- From SIMD to SIMT
- The Challenge of SIMT
- Conclusion

From ESL to Silicon: CPU
- ARMv5 instruction-compatible dual core.
- Full-system verification capability
  - Boots Linux kernel 2.6.
- Ongoing focus on power consumption and cache efficiency.

From ESL to Silicon: Network and Virtualization
- A CASL hypervisor, also called a Virtual Machine Monitor (VMM).
- A timing-approximate TCP/IP offload engine.
- Full-system verification with
  - multiple guest OSes
  - communication with another computer over a real Ethernet interface.

From ESL to Silicon: GPU
- Supports OpenGL ES 1.1
  - Fixed-function pipeline
- Both a pure C model and a gate-level RTL model are available.
- Full-system simulation environment with QEMU
  - Debian, kernel 2.6.18

From SIMD to SIMT: SIMD
- SIMD: single instruction, multiple data. Often referred to as vector-based instructions.
  - High-level languages:
    - OpenGL Shading Language
    - Template-based vector library in C++ (SSE-like instruction wrapper)
  - Low-level languages:
    - x86 SSE or AVX instruction set
    - ARM NEON instruction set
    - Nvidia NVGP4 pseudo assembly language

From SIMD to SIMT: SIMD example

OpenGL Shading Language program (fragment):

    vec3 c1, c2;
    vec3 eyeVector = vec3(0.0, 0.0, 1.0);
    c1 = texture2D(ColorMap, UV).rgb;
    c2 = texture2D(NormalMap, UV).rgb;
    c2 = c2 * 2.0 - vec3(1.0, 1.0, 1.0);
    color = c1 * dot(c2, eyeVector);

NVGP4 pseudo assembly:

    TEMP R0, R1;
    TEX.F R0.xyz, fragment.attrib[0], texture[0], 2D;
    TEX.F R1.z, fragment.attrib[0], texture[1], 2D;
    MUL.F R1.xyz, R1.z, R0;
    MAD.F result_color0.xyz, R1, {2}.x, -R0;
    END

What MUL.F R1.xyz, R1.z, R0 does per component:

    R1.x = R1.z * R0.x
    R1.y = R1.z * R0.y
    R1.z = R1.z * R0.z
    R1.w = (nothing)
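The per-component behaviour of the MUL.F instruction can be sketched in a few lines of Python. This is only an illustrative model, not the simulator's code: the helper `mul_f`, the register values, and the treatment of `R1.z` as a scalar broadcast to every lane are assumptions made for the example.

```python
def mul_f(dst, write_mask, a, b):
    """Component-wise multiply of two 4-lane registers.

    Only the lanes named in write_mask (a subset of "xyzw") are written;
    the remaining lanes of dst keep their old values.
    """
    out = list(dst)
    for i, lane in enumerate("xyzw"):
        if lane in write_mask:
            out[i] = a[i] * b[i]
    return out


# Hypothetical register contents, chosen only for illustration.
r0 = [0.5, 0.25, 2.0, 9.0]
r1 = [7.0, 7.0, 4.0, 7.0]

# MUL.F R1.xyz, R1.z, R0: the scalar R1.z is broadcast to all four lanes.
r1_z = [r1[2]] * 4
r1 = mul_f(r1, "xyz", r1_z, r0)
print(r1)  # [2.0, 1.0, 8.0, 7.0]
```

The w lane does no useful work here, which is the kind of idle lane the following slides count as lost utilization.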

From SIMD to SIMT: SIMD conclusion
- The programmer has to take care of vector operations and their optimization wherever possible.
  - It is not realistic to depend fully on the compiler.
  - Execution power may be wasted when only some of the vector elements are needed.

From SIMD to SIMT: SIMD conclusion (cont.)
- Our OpenGL GPU simulator shows only about 30% to 70% utilization in the NVGP4 vector-based shader core.

Part of the POM fragment shader in NVGP4 assembly:

    Opcodes:  DP3.F  RSQ.F  MUL.F  DIV.F  RSQ.F  MUL.F  MOV.F

    Operands:
      R0.x, frag.attrib[1];
      R0.x, R0.x;
      R1.xyz, R0.x, frag.attrib[1];
      R2.xy, R1.z;
      R0.x, R0.y;
      R2.zw, R2.xyxy, {-0.0057142857}.x;
      R0.xyz, R0.x, frag.attrib[2];
      R2.xy, frag.attrib[0];
      R3.xy, {0}.x;
      R3.z, {1}.x;
      R1.w, {0}.x;
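A rough way to see where a figure in that range can come from is to count how many of the four lanes each instruction actually writes. The sketch below uses the destination write masks that appear in the shader excerpt above. Treating "lanes written / lanes available" as utilization is an assumption of this example, not necessarily the metric the simulator reports, and it ignores that an instruction such as DP3 reads more lanes than it writes.

```python
# Destination write masks taken from the shader excerpt, in order.
dest_masks = ["x", "x", "xyz", "xy", "x", "zw", "xyz", "xy", "xy", "z", "w"]

LANES = 4  # a 4-lane (xyzw) vector unit


def lane_utilization(masks, lanes=LANES):
    """Fraction of lane slots that produce a result (simplified proxy)."""
    used = sum(len(m) for m in masks)
    return used / (len(masks) * lanes)


util = lane_utilization(dest_masks)
print(f"{util:.1%}")  # 43.2%
```

Under this simplified count, 19 of the 44 lane slots are used, which falls inside the 30% to 70% range quoted above.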

From SIMD to SIMT: SIMT
- SIMT: single instruction, multiple threads
  - The term was introduced by Nvidia.
- SIMT is a hardware execution model.
  - It is not an instruction set.
  - It still falls within the concept of SIMD.
- It works within a narrow domain of programming models
  - such as OpenGL and OpenCL.

From SIMD to SIMT: SIMT execution model
Using OpenGL as an example:
- 19200 vertices => 19200 threads
- Shader program => lighting calculation
- Each thread has its own data => position, normal vector, etc.
- Every thread executes the same program, each on its own data.
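The programming-model side of this can be sketched as one function applied to many independent records. The shader body (a simple Lambert diffuse term), the light direction, and the vertex data below are hypothetical and chosen only to keep the example short.

```python
NUM_VERTICES = 19200  # one thread per vertex, as on the slide


def lighting_shader(position, normal, light_dir=(0.0, 0.0, 1.0)):
    """Hypothetical per-vertex program: Lambert diffuse intensity.

    position is unused by this toy shader but is part of each thread's data.
    """
    n_dot_l = sum(n * l for n, l in zip(normal, light_dir))
    return max(n_dot_l, 0.0)


# Each "thread" owns its own position and normal (made-up data).
vertices = [
    {
        "position": (float(i), 0.0, 0.0),
        "normal": (0.0, 0.0, 1.0) if i % 2 == 0 else (0.0, 1.0, 0.0),
    }
    for i in range(NUM_VERTICES)
]

# Same program, different data: conceptually 19200 threads.
results = [lighting_shader(v["position"], v["normal"]) for v in vertices]
print(len(results), results[0], results[1])  # 19200 1.0 0.0
```

The programmer writes only `lighting_shader`; how the 19200 invocations are grouped and scheduled is left to the hardware.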

From SIMD to SIMT: SIMT execution model
- One instruction is decoded for multiple execution units.
  - Example: 16 execution units
- The programmer cares about the programming model rather than individual operations.
  - The hardware controls the execution flow and thread dispatch.

[Figure: one instruction decoder and a shared register pool feeding several execution units (EUs)]
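A minimal sketch of "decode once, execute on many units" follows. The tiny two-operation instruction format, the register names, and the function `run_simt` are inventions for this example and do not describe any real ISA or the lab's hardware.

```python
NUM_EUS = 16  # execution units sharing one decoder


def run_simt(program, registers):
    """Run a program in lockstep over all threads.

    program:   list of (op, dst, src_a, src_b) tuples
    registers: one register dict per thread (one thread per EU here)
    Returns the number of instruction decodes performed.
    """
    decodes = 0
    for op, dst, a, b in program:      # fetched and decoded once...
        decodes += 1
        for regs in registers:         # ...then applied by every EU
            if op == "add":
                regs[dst] = regs[a] + regs[b]
            elif op == "mul":
                regs[dst] = regs[a] * regs[b]
            else:
                raise ValueError(f"unknown op: {op}")
    return decodes


# Each thread starts with a different r0, the same r1.
registers = [{"r0": float(t), "r1": 2.0, "r2": 0.0} for t in range(NUM_EUS)]
program = [("mul", "r2", "r0", "r1"), ("add", "r2", "r2", "r1")]

decodes = run_simt(program, registers)
print(decodes, registers[3]["r2"])  # 2 8.0
```

Two decodes drive 32 arithmetic operations, which is where the decode and control hardware is amortized across the execution units.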

From SIMD to SIMT: SIMD instruction + SIMT model?
- SIMD and SIMT can coexist.

[Figure: instruction decoder and shared register pool feeding vector execution units alongside EUs]

From SIMD to SIMT: scalar SIMT vs SIMD + SIMT

[Figure: decoder and register pool feeding scalar EUs on one side and vector execution units on the other]

16 scalar cores vs four 4-lane vector cores?

From SIMD to SIMT: scalar SIMT vs SIMD + SIMT
- Using scalar cores in SIMT
  - Advantages
    - Higher execution-unit utilization.
    - A vector-based instruction can simply be split into scalar operations.
  - Disadvantage
    - More threads executing at the same time means more per-thread data kept in registers.
- We have an ongoing project with ITRI
  - HSAIL-based SIMT shader processor
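The utilization argument can be made concrete with a toy cycle count. The model below is a deliberate simplification and rests on assumptions not taken from the slides: a vector core spends one cycle per instruction regardless of how many lanes are written, a scalar core spends one cycle per written component, and dependencies, memory latency, and register pressure are all ignored.

```python
import math


def vector_cycles(masks, threads, cores=4):
    """Four 4-lane vector cores: one cycle per instruction per thread."""
    return math.ceil(threads / cores) * len(masks)


def scalar_cycles(masks, threads, cores=16):
    """Sixteen scalar cores: one cycle per written component per thread."""
    return math.ceil(threads / cores) * sum(len(m) for m in masks)


# Hypothetical instruction stream, described only by its write masks.
masks = ["x", "xyz", "xy", "xyzw"]
threads = 16

v = vector_cycles(masks, threads)
s = scalar_cycles(masks, threads)
print(v, s)  # 16 10
```

Both configurations have 16 lanes in total, but in this model the scalar design finishes in 10 cycles instead of 16 because no lane idles on a partial write mask. The price, as noted above, is that 16 threads rather than 4 must keep their state in registers at once.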

The Challenge of SIMT
- How does SIMT execute a branch instruction?
  - Which instruction is the target when conditional execution is involved?

[Figure: threads 1 to 4, some holding R0 = 0 and some R0 = 1, all reaching the same branch]

    if (R0 == 1) R1 = 0;
    else R1 = 1;
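The problem can be stated in a few lines: each thread evaluates the condition on its own R0, so a single shared program counter has no single correct target. The per-thread R0 values below are hypothetical, since the slide only shows that both 0 and 1 occur.

```python
# Hypothetical R0 values for threads 1 to 4.
r0 = [0, 1, 1, 0]

# Each thread evaluates the same condition on its own data.
taken = [value == 1 for value in r0]

# The branch is uniform only if every thread agrees on the outcome.
uniform = all(taken) or not any(taken)

print(taken)    # [False, True, True, False]
print(uniform)  # False
```

When `uniform` is true the hardware can simply follow the one agreed path; when it is false, as here, the threads have diverged.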

The Challenge of SIMT: divergence
- Divergence
  - Threads go in different directions and are no longer uniform.
- Naive solution: go through every code path.

    if (R0 == 1) {
        R1 = 0;
        if (R7 == 1)
            R2 = 1;
    }
    else
        R1 = 1;

[Figure: the three code paths executed one after another in Cycle 1, Cycle 2, and Cycle 3]
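A minimal sketch of the naive approach uses an active mask per code path: every path that at least one thread needs is executed in turn, and threads outside the mask sit idle. The function name `run_naive`, the per-thread register values, and counting one pass per path are assumptions of this example.

```python
def run_naive(r0, r7):
    """Execute the nested branch for all threads, one code path per pass."""
    n = len(r0)
    r1 = [None] * n
    r2 = [None] * n
    passes = 0

    # Path 1: threads where R0 == 1 execute R1 = 0.
    outer = [v == 1 for v in r0]
    if any(outer):
        passes += 1
        for t in range(n):
            if outer[t]:
                r1[t] = 0

        # Path 2: of those, threads where R7 == 1 execute R2 = 1.
        inner = [outer[t] and r7[t] == 1 for t in range(n)]
        if any(inner):
            passes += 1
            for t in range(n):
                if inner[t]:
                    r2[t] = 1

    # Path 3: the remaining threads execute R1 = 1.
    else_mask = [not m for m in outer]
    if any(else_mask):
        passes += 1
        for t in range(n):
            if else_mask[t]:
                r1[t] = 1

    return r1, r2, passes


# Hypothetical register values for four threads.
r1, r2, passes = run_naive(r0=[1, 1, 0, 0], r7=[1, 0, 1, 0])
print(r1, r2, passes)  # [0, 0, 1, 1] [1, None, None, None] 3
```

With these values all three paths are needed, so the group takes three passes even though no single thread executes more than two of them. If every thread had R0 == 1 and R7 == 0, one pass would suffice.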

The Challenge of SIMT: divergence
- This is why traditional ray tracing still relies on cluster servers (CPUs).
  - It is a photorealistic rendering method.
  - It can be highly parallel (millions to billions of rays are cast around the scene).
  - Each ray cast (thread) diverges heavily in its photon collision, reflection, refraction, and diffusion computations.

[Figure: ray-traced image, trace level 50]

The Challenge of SIMT: divergence
- There are still some solutions for divergence.
  - Hardware
    - Use more instruction issue units and an aggressive scheduler.
    - This is a trade-off between silicon area and computation power.
  - Software
    - Avoid branch operations (depends on the hardware and instruction architecture).
      - Branch-free: a = clamp((b - c) * 99999.0, 0.0, 1.0)
      - Branching equivalent: if ((b - c) > 0) a = 1.0; else a = 0.0;
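The software trick can be checked directly. The sketch below reimplements `clamp` in Python (GLSL provides it as a built-in) and compares the two forms; the function names are chosen for this example only.

```python
def clamp(x, lo, hi):
    """Same semantics as GLSL clamp for lo <= hi."""
    return max(lo, min(x, hi))


def select_branchless(b, c):
    # Scale the difference so that any clearly positive value saturates at 1.0.
    return clamp((b - c) * 99999.0, 0.0, 1.0)


def select_branching(b, c):
    return 1.0 if (b - c) > 0 else 0.0


for b, c in [(2.0, 1.0), (1.0, 2.0), (1.0, 1.0)]:
    print(select_branchless(b, c), select_branching(b, c))

# The two forms differ when 0 < b - c < 1/99999:
print(0.0 < select_branchless(1.000001, 1.0) < 1.0)  # True
```

The branch-free form is therefore an approximation: it agrees with the branch except for very small positive differences, where it returns a fractional value. Whether that is acceptable depends on the shader.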

Conclusion
- There are many trade-offs between utilization and computation power.
  - Deeper data parallelism or a more aggressive scheduler?
  - The answer is strongly application specific.

If you are interested in graphics:
- Web: http://lioujheyu.synology.me/~git/ogles1_1/
- Git: http://lioujheyu.synology.me/~git/index.cgi/ogles1_1.git/