TOWARDS A SOFTWARE TRANSACTIONAL MEMORY FOR GRAPHICS PROCESSORS
Daniel Cederman, Philippas Tsigas and Muhammad Tayyab

Overview: Software Transactional Memory (STM) has long been suggested as a possible model for parallel programming. However, its practicality is debated. By exploring how to design an STM for GPUs, we want to bring this debate over to the graphics processor domain.

Introduction

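Problem Scenario: Binary Tree. [Figure: a binary tree into which the key 12 is inserted.]

Problem Scenario: Binary Tree. [Figure: the keys 12 and 17 are inserted into the same tree at the same time; the outcome is marked "?".]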

Pseudocode: Step 1: find position in tree; Step 2: insert element. The tree might change between the two steps.

Software Transactional Memory: STMs provide a construct that divides code into transactions that are guaranteed to be executed atomically.

Modified Pseudocode: Step 1: atomic { find position in tree; insert element; } Just one step!
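As a rough illustration of the modified pseudocode, the sketch below emulates the `atomic { ... }` block on the CPU in C++. It assumes the simplest possible realization, a single global spinlock, which is not one of the STM designs evaluated in this talk. The names `atomically`, `Node`, `Tree` and `g_lock` are hypothetical and chosen only for this example; nodes are never freed, to keep the sketch short.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-in for `atomic { ... }`: one global spinlock.
// Assumption: this is the coarsest realization, not the STM from the talk.
static std::atomic<bool> g_lock{false};

template <typename F>
void atomically(F&& body) {
    while (g_lock.exchange(true, std::memory_order_acquire)) { /* busy wait */ }
    body();
    g_lock.store(false, std::memory_order_release);
}

struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    explicit Node(int k) : key(k) {}
};

struct Tree {
    Node* root = nullptr;

    void insert(int key) {
        atomically([&] {
            // Find position in tree.
            Node** slot = &root;
            while (*slot != nullptr) {
                slot = (key < (*slot)->key) ? &(*slot)->left : &(*slot)->right;
            }
            // Insert element. No other insert can run between the two parts.
            *slot = new Node(key);
        });
    }

    // In-order traversal, used only to inspect the result.
    void collect(const Node* n, std::vector<int>& out) const {
        if (n == nullptr) return;
        collect(n->left, out);
        out.push_back(n->key);
        collect(n->right, out);
    }

    std::vector<int> sorted() const {
        std::vector<int> out;
        collect(root, out);
        return out;
    }
};
```

Because the search and the insert sit inside one `atomically` call, the tree cannot change between them, which is the property the slide asks for.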

Scenario with Transactions: [Figure: Transaction A inserts 12 and Transaction B inserts 17 into the same tree concurrently.]

Scenario with Transactions: [Figure: the two concurrent transactions; labels: Committed, Transaction B.]

Scenario with Transactions: [Figure: the two concurrent transactions; labels: Committed, Aborted.]

STM Design: Progress Guarantees; Lock Scheme; Lock Granularity; Log or Undo-Log; Conflict Detection Time; …

One Lock: No concurrency. Busy waiting. Convoying. Done yet?

Multiple Locks: Better concurrency. Difficult. Deadlocks. [Figure: Process A and Process B acquiring locks on items labeled with letters.]

Dynamic Locks: [Figure: tree diagrams with the keys 10, 15, 20, 25 and 30 next to a transaction log; recoverable log entries include Read 20, Read 30 and Write 20.]

Two STM Designs. A Blocking STM: transactions that fail to acquire a lock are aborted; busy waits, but avoids deadlocks. A Non-Blocking STM: transactions that fail to acquire a lock can steal it or abort the other transaction; no busy wait; based on the STM by Tim Harris and Keir Fraser, "Language support for lightweight transactions", OOPSLA 2003.
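The blocking design's rule (abort when a lock cannot be acquired) can be sketched on the CPU in C++ as follows. This is an illustration under assumptions, not the implementation from the talk: the lock table `g_locks`, its size `kNumLocks`, and the helpers `try_acquire_all` and `release_all` are hypothetical names, and the lock indices passed in are assumed to be distinct.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical lock table: one flag per lockable object.
constexpr std::size_t kNumLocks = 8;
static std::atomic<bool> g_locks[kNumLocks];

// Try to take every lock in `wanted` (assumed to hold distinct indices).
// On the first failure, release what was taken and return false so the
// caller can abort the transaction and retry it later.
bool try_acquire_all(const std::vector<std::size_t>& wanted) {
    for (std::size_t i = 0; i < wanted.size(); ++i) {
        assert(wanted[i] < kNumLocks);
        if (g_locks[wanted[i]].exchange(true, std::memory_order_acquire)) {
            for (std::size_t j = 0; j < i; ++j) {
                g_locks[wanted[j]].store(false, std::memory_order_release);
            }
            return false;
        }
    }
    return true;
}

// Release the locks held by a finished (committed or aborted) transaction.
void release_all(const std::vector<std::size_t>& held) {
    for (std::size_t idx : held) {
        g_locks[idx].store(false, std::memory_order_release);
    }
}
```

A transaction never waits while holding a lock, so no cycle of waiting transactions can form; the cost is that the caller has to retry in a loop, which is the busy waiting the slide mentions.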

Lock Granularity: Object, Word, Stripe. [Figure: Object A and Object B, each with the fields int var1; int var2; int var3; int var4; int var5;] Tradeoff between false conflicts and the number of locks. A wide memory bus makes it quick to copy an object.
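To make the three granularities concrete, here is a small C++ sketch of how an address could be mapped to a lock under each scheme. The table size, the word and stripe widths, and the names `word_lock`, `stripe_lock` and `ObjectA` are all assumptions made for illustration; the talk does not give these details.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNumLocks = 1024;   // hypothetical lock-table size
constexpr std::size_t kWordBytes = 4;     // one 32-bit word
constexpr std::size_t kStripeBytes = 64;  // hypothetical stripe width

// Word granularity: neighbouring words map to different lock-table slots.
constexpr std::size_t word_lock(std::uintptr_t addr) {
    return (addr / kWordBytes) % kNumLocks;
}

// Stripe granularity: all words in the same 64-byte stripe share one lock.
constexpr std::size_t stripe_lock(std::uintptr_t addr) {
    return (addr / kStripeBytes) % kNumLocks;
}

// Object granularity: the lock lives in the object and covers every field.
struct ObjectA {
    std::atomic<bool> lock{false};
    int var1 = 0, var2 = 0, var3 = 0, var4 = 0, var5 = 0;
};
```

Coarser mappings need fewer locks but report a conflict whenever two transactions touch different words under the same lock, which is the false-conflict side of the tradeoff.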

Log or Undo-Log. Log changes: better with high contention; no visibility problem. Log original content (undo-log): optimistic; faster when no conflict; problem with visibility. The visibility problem is hard to handle in a non-managed environment.
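The difference between the two logging schemes can be sketched in single-threaded C++ as below. Locking and conflict detection are deliberately left out, and the class names `WriteLogTx` and `UndoLogTx` are hypothetical; the sketch only shows where a written value lives before commit and what an abort has to do.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Variant 1: log the changes. Writes stay private until commit.
class WriteLogTx {
    std::unordered_map<int*, int> pending_;
public:
    int read(int* addr) const {
        auto it = pending_.find(addr);
        return it != pending_.end() ? it->second : *addr;  // see own writes
    }
    void write(int* addr, int value) { pending_[addr] = value; }
    void commit() {
        for (const auto& entry : pending_) *entry.first = entry.second;
        pending_.clear();
    }
    void abort() { pending_.clear(); }  // shared memory was never touched
};

// Variant 2: log the original content. Writes go straight to memory.
class UndoLogTx {
    std::vector<std::pair<int*, int>> undo_;
public:
    int read(const int* addr) const { return *addr; }
    void write(int* addr, int value) {
        undo_.emplace_back(addr, *addr);  // save the old value first
        *addr = value;                    // visible to others before commit
    }
    void commit() { undo_.clear(); }      // nothing to copy: the fast path
    void abort() {
        // Restore in reverse order so the oldest saved value wins.
        for (auto it = undo_.rbegin(); it != undo_.rend(); ++it) {
            *it->first = it->second;
        }
        undo_.clear();
    }
};
```

With the undo-log, other threads can observe a value that is later rolled back, which is the visibility problem named on the slide; with the write log, commit and abort swap costs.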

Conflict Detection Time. Acquire locks immediately: conflicts detected early; more false conflicts. Acquire locks at commit time: late conflict detection; locks are held for a shorter time. Holding locks longer will lead to more transactions aborting long transactions in the non-blocking STM.

CUDA Specifics: Minimal use of processor local memory -> Better left to the main application. SIMD instructions used where possible -> Mostly used to coalesce reads and writes. No recursion -> Rewritten to use a stack.
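As an example of rewriting recursion to use a stack, the C++ sketch below walks a binary tree in order with an explicit fixed-size stack and a caller-provided output buffer, so it needs neither recursion nor dynamic allocation. The names `Node`, `in_order` and the depth bound `kMaxDepth` are hypothetical, and the code is a CPU-side illustration rather than the CUDA code from the talk.

```cpp
#include <cassert>
#include <cstddef>

struct Node {
    int key;
    Node* left;
    Node* right;
};

constexpr std::size_t kMaxDepth = 64;  // hypothetical bound on tree height

// In-order traversal with an explicit stack instead of recursion.
// Writes at most `cap` keys to `out` and returns how many were written.
std::size_t in_order(const Node* root, int* out, std::size_t cap) {
    const Node* stack[kMaxDepth];
    std::size_t top = 0;
    std::size_t n = 0;
    const Node* cur = root;
    while ((cur != nullptr || top > 0) && n < cap) {
        // Descend to the leftmost node, remembering the path.
        while (cur != nullptr) {
            assert(top < kMaxDepth);
            stack[top++] = cur;
            cur = cur->left;
        }
        cur = stack[--top];   // visit the most recently remembered node
        out[n++] = cur->key;
        cur = cur->right;     // then continue in its right subtree
    }
    return n;
}
```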

More CUDA Specifics: No functions -> Rewritten to use goto statements. Possible scheduler problems -> Bad scheduling can lead to infinite waiting time. No dynamic memory during runtime -> Allocates much memory.

Experiments

Experiments: Queue (enqueue/dequeue) -> Many conflicts expected. Hash-map (inserts) -> Fewer conflicts, but longer transactions as more items are inserted. Binary Tree (inserts) -> Fewer conflicts as the tree grows.

Experiments: Skip-List (insert/lookup/remove) -> Scales similarly to the tree. Comparison with lock-free skip-list -> H. Sundell and P. Tsigas, "Scalable and lock-free concurrent dictionaries", SAC '04.

Contention Levels: We performed the experiments using different levels of contention. High Contention (no pause between transactions). Low Contention (500 ms of work distributed randomly).

Backoff Strategies: Lowers contention by waiting before aborted transactions are tried again. Increases the probability that at least one transaction is successful. [Figure: waiting time versus number of attempts for the None/Static, Linear and Exponential strategies.]
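The backoff strategies named above differ only in how the waiting time grows with the number of failed attempts. The C++ sketch below shows one plausible form of each; the base delay, the cap, the abstract time unit and the names `Backoff` and `backoff_delay` are assumptions for illustration, since the talk does not state the constants it used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

enum class Backoff { None, Static, Linear, Exponential };

constexpr std::uint64_t kBaseDelay = 100;      // hypothetical time unit
constexpr std::uint64_t kMaxDelay = 1u << 20;  // hypothetical upper bound

// Time to wait before retry number `attempt` (1 = first retry).
std::uint64_t backoff_delay(Backoff strategy, unsigned attempt) {
    switch (strategy) {
        case Backoff::None:
            return 0;
        case Backoff::Static:
            return kBaseDelay;  // same wait after every abort
        case Backoff::Linear:
            return std::min<std::uint64_t>(kBaseDelay * attempt, kMaxDelay);
        case Backoff::Exponential: {
            // Double the wait per attempt; clamp the shift to avoid overflow.
            unsigned shift = std::min(attempt > 0 ? attempt - 1 : 0u, 20u);
            return std::min<std::uint64_t>(kBaseDelay << shift, kMaxDelay);
        }
    }
    return 0;
}
```

A retry loop would call `backoff_delay` after each abort and idle for the returned time before starting the transaction again.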

Hardware: Nvidia GTX 280 with 30 multiprocessors. 1-60 thread blocks. 32 threads in each block.

Backoff Strategy (Queue, Full Contention): [Figure: average number of aborts per transaction for the non-blocking STM with no, linear and exponential backoff; y-axis from 0 to 16000.]

Blocking Queue (Full Contention): [Figure: total number of operations per millisecond versus thread blocks (1 to 60); series: No Backoff, Linear Backoff, Exponential Backoff; y-axis from 0 to 30.]

Queue (Full Contention): [Figure: total number of operations per millisecond versus thread blocks (1 to 60); series: Non-Blocking, Blocking; y-axis from 0 to 45.]

Binary Tree (Full Contention): [Figure: total number of operations per millisecond versus thread blocks (1 to 60); legend labels: Non-Blocking, Linear; y-axis from 0 to 300.]

Hash-map (Full Contention): [Figure: total number of operations per millisecond versus thread blocks (1 to 60); series: Non-Blocking, Blocking; y-axis from 0 to 140.]

Hash-Map Aborts (Full Contention): [Figure: average number of aborts per transaction for the non-blocking STM; y-axis from 0 to 1200; value labels 1121 and 57.] Long and expensive transactions!

Skip-List (Full Contention): [Figure: total number of operations per millisecond versus thread blocks (1 to 60); series: Non-Blocking, Blocking; y-axis from 0 to 90.]

Lock-Free Skip-List (Full Contention): [Figure: total number of operations per millisecond versus thread blocks (1 to 60); legend labels: Non-Blocking, Lock-Free, Linear; y-axis from 0 to 2000.]

Conclusion: Software Transactional Memory has attracted the interest of many researchers in recent years. We have evaluated a blocking and a non-blocking STM design on a graphics processor. The non-blocking design scaled better than the blocking one. For simplicity you pay in performance. The debate is still open!

Thank you! For more information: http://www.cs.chalmers.se/~dcs