GPU Concurrency: Weak Behaviours and Programming Assumptions
Tyler Sorensen
Adviser: Jade Alglave
University College London
WPLI 2015, April 12, 2015
Based on our ASPLOS '15 paper:
GPU Concurrency: Weak Behaviours and Programming Assumptions
Jade Alglave 1,2, Mark Batty 3, Alastair F. Donaldson 4, Ganesh Gopalakrishnan 5, Jeroen Ketema 4, Daniel Poetzl 6, Tyler Sorensen 1,5, John Wickerson 4
1 University College London, 2 Microsoft Research, 3 University of Cambridge, 4 Imperial College London, 5 University of Utah, 6 University of Oxford
Intel Core i7 4500 CPU
Nvidia Tesla C2075 GPU
Roadmap
• what happened to the pony (background)
• how we found the bug (methodology)
• how we are able to fix the pony (contribution)
What happened to the pony?
• the visualization bugs are due to weak memory behaviours on GPUs
Weak memory models
• consider the test known as message passing (mp); an instance of this test appears in the pony code
• the test consists of: an initial state (x and y are memory locations), thread ids, a program for each thread id, and an assertion (a question about the final state of registers)
Message passing (mp) test
• tests how to implement a handshake idiom
[diagram, annotated in steps: Data, Flag, Stale Data]
the assertion cannot be satisfied by any interleaving of the two threads; this is known as Lamport's sequential consistency (or SC)
Weak memory models
• can we assume the assertion will never pass? No!
Weak memory models
• Alglave and Maranget report that this assertion is satisfied 41 million times out of 5 billion test runs on a Tegra 2 ARM processor 1
1 http://diy.inria.fr/cats/tables.html
Weak memory models
• what happened?
• architectures implement weak memory models, where the hardware is allowed to re-order certain memory instructions
• weak memory models can allow weak behaviours (executions that do not correspond to any interleaving)
GPU memory models
• what type of memory model do current GPUs implement?
• documentation is sparse:
  • CUDA has 1 page + 1 example
  • PTX has 1 page + 0 examples
  • given in English prose
• we need to know this if we are to write correct GPU programs!
GPU programming
[diagram: threads are grouped into CTAs (CTA 0, CTA 1, …, CTA n); each CTA has its own shared memory; all CTAs access a common global memory; within CTAs, threads are grouped into warps (32 threads per warp in Nvidia GPUs)]
Roadmap
• what happened to the pony (background)
• how we found the bug (methodology)
• how we are able to fix the pony (contribution)
Methodology
[diagram: GPU litmus tests are run on GPU hardware and simulated against a formal model; the results are compared]
GPU tests
• GPU litmus test considerations:
  • PTX instructions
  • what memory region (shared or global) are x and y in?
  • are T0 and T1 in the same CTA or different CTAs?
• these choices are recorded in a scope tree, e.g.:
Scope Tree (device (cta T0) (cta T1)), x: global, y: global
Running tests
• we extend the litmus CPU testing tool of Alglave and Maranget to run GPU tests
• given a GPU litmus test, it generates executable CUDA or OpenCL code for the test
Heuristics
• memory stress: extra threads read and write to scratch memory
[diagram: T0 and T1 run the test program while extra threads 1…n loop, reading or writing the scratchpad]
Heuristics
• random threads: randomize the location of the testing threads
Heuristics
# of weak behaviours in 100,000 runs for different heuristics on a Nvidia Tesla C2075:

test   | none | random threads | memory stress | memory stress + random threads
gpu-mp | 0    | 0              | 139           | 522
How we found the pony bug
• this is the idiom, and these are the heuristics, that exposed the bug!

test   | none | random threads | memory stress | memory stress + random threads
gpu-mp | 0    | 0              | 139           | 522
Roadmap
• what happened to the pony (background)
• how we found the bug (methodology)
• how we are able to fix the pony (contribution)
GPU fences
• PTX provides 2 fences to disallow reading stale data:
• membar.cta – gives ordering intra-CTA
• membar.gl – gives ordering over the whole device
GPU fences
• test amended with a parameterizable fence
Scope Tree (device (cta T0) (cta T1)), x: global, y: global
GPU fences
# of weak behaviours in 100,000 runs for different fences on a Nvidia Tesla C2075:

test   | none | membar.cta | membar.gl
gpu-mp | 3380 | 2          | 0
How do we fix the pony
• by adding fences to the code
[image: Tesla C2075 Nvidia GPU, with fences]
GPU testing campaign
• we extend the diy CPU litmus test generation tool of Alglave and Maranget to generate GPU tests
• it generates litmus tests based on cycles
• it enumerates the tests over the GPU thread and memory hierarchy
GPU testing campaign
• using our tools, we generated and ran 10,930 tests over 5 Nvidia chips:

chip        | year | architecture
GTX 750 Ti  | 2014 | Maxwell
GTX Titan   | 2013 | Kepler
GTX 660     | 2012 | Kepler
GTX 540M    | 2011 | Fermi
Tesla C2075 | 2011 | Fermi
GPU testing campaign
• results are hosted at: http://virginia.cs.ucl.ac.uk/sunflowers/asplos15/flat.html
Modeling
• we extended the CPU axiomatic memory modeling tool herd, of Alglave and Maranget, to GPUs
• we developed an axiomatic memory model for PTX which is able to simulate all of our tests
• our model is sound with respect to all of our hardware observations
Modeling
• demo of web interface
More results
• surprising and buggy behaviours observed:
• GPU mutex implementations allow stale data to be read (found in the CUDA by Example book and other academic papers 1,2); this led to an erratum issued by Nvidia
• hardware re-orders loads from the same address on Nvidia Fermi and Kepler
• some testing on AMD GPUs
1 J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs", CoRR, 2011, http://arxiv.org/pdf/1110.4623.pdf
2 B. He and J. X. Yu, "High-throughput transaction executions on graphics processors", PVLDB 2011
Related work (CPU memory models)
• Alglave et al. have done extensive work on testing and modeling CPUs (notably IBM Power and ARM) and created the tools diy, litmus, and herd, which we extended for this work
• Collier tested CPU memory models using the ARCHTEST tool
Related work (GPU memory models)
• Hower et al. have proposed several SC-for-race-free language-level memory models for GPUs
Questions?
project page: http://virginia.cs.ucl.ac.uk/sunflowers/asplos15/
[images: Intel Core i7 4500 CPU; Nvidia Tesla C2075 GPU (with fences)]
CUDA by Example
[images: Intel Core i7 4500 CPU; Nvidia Tesla C2075 GPU; Nvidia Tesla C2075 GPU (with fences)]
Read-after-Read Hazard
Ignore after this
Results
• surprising and buggy behaviours observed:
• SC-per-location violations on NVIDIA Fermi and Kepler architectures: todo: add CORR test
Limitations
• warps: we do not test intra-warp behaviours, as the lock-step behaviour of warps is not compatible with some of our heuristics
• grids: we do not test inter-grid behaviours, as we did not find any examples in the literature
GPU programming
• GPUs are SIMT (Single Instruction, Multiple Thread)
• Nvidia GPUs may be programmed using CUDA or OpenCL
Roadmap
• background and motivation
• approach
  • GPU tests
  • running tests
  • modeling
Heuristics
• two additional heuristics:
• synchronization: testing threads synchronize immediately before running the test program
• general bank conflicts: generate memory accesses that conflict with the accesses of the memory stress heuristic
Challenges
• the PTX optimizing assembler may reorder or remove instructions
• we developed a tool, optcheck, which compares the litmus test with the binary and checks for optimizations
GPU tests
• concrete GPU test:

T0               | T1
st.cg.s32 [x], 1 | ld.cg.s32 r1, [y]
st.cg.s32 [y], 1 | ld.cg.s32 r2, [x]

ScopeTree (grid (cta (warp T0) (warp T1)))
x: shared, y: global
exists (1:r1=1 /\ 1:r2=0)
GPU programming: explicit hierarchical concurrency model
• thread hierarchy: thread, warp, CTA (Cooperative Thread Array), grid
• memory hierarchy: shared memory, global memory
GPU background
• a GPU is a highly parallel co-processor
• currently found in devices from tablets to top supercomputers
• not just used for visualization anymore!
Images from Wikipedia [15, 16, 17]
References
[1] L. Lamport, "How to make a multiprocessor computer that correctly executes multi-process programs", Trans. Comput., 1979.
[2] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Litmus: Running tests against hardware", TACAS 2011.
[3] J. Alglave, L. Maranget, and M. Tautschnig, "Herding cats: modelling, simulation, testing, and data-mining for weak memory", TOPLAS 2014.
[4] NVIDIA, "CUDA C programming guide, version 6 (July 2014)", http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
[5] NVIDIA, "Parallel Thread Execution ISA: Version 4.0 (Feb. 2014)", http://docs.nvidia.com/cuda/parallel-thread-execution
[6] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Fences in weak memory models (extended version)", FMSD 2012.
[7] J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Addison-Wesley Professional, 2010.
References
[8] J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs", CoRR, 2011, http://arxiv.org/pdf/1110.4623.pdf
[9] B. He and J. X. Yu, "High-throughput transaction executions on graphics processors", PVLDB 2011.
[10] W. W. Collier, "Reasoning About Parallel Architectures", Prentice-Hall, Inc., 1992.
[11] D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Sequential consistency for heterogeneous-race-free", MSPC 2013.
[12] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models", ASPLOS 2014.
[13] T. Sorensen, G. Gopalakrishnan, and V. Grover, "Towards shared memory consistency models for GPUs", ICS 2013.
[14] W.-m. W. Hwu, "GPU Computing Gems Jade Edition", Morgan Kaufmann Publishers Inc., 2011.
References
[15] http://en.wikipedia.org/wiki/Samsung_Galaxy_S5
[16] http://en.wikipedia.org/wiki/Titan_(supercomputer)
[17] http://en.wikipedia.org/wiki/Barnes_Hut_simulation
Message passing (mp) test
• tests how to implement a handshake idiom
• found in the Octree code for the pony visualization
Methodology
• empirically explore the hardware memory model implemented on deployed NVIDIA and AMD GPUs
• develop hardware memory model testing tools for GPUs
• analyze classic (i.e. CPU) memory model properties and communication idioms in CUDA applications
• run large families of tests on GPUs as a basis for modeling and bug hunting
Running tests
• however, unlike on CPUs, simply running the tests did not yield any weak memory behaviours on Nvidia chips!
• we developed heuristics to run tests under a variety of stress to expose weak behaviours