Code Generation and Optimization for Transactional Memory Construct

Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

Cheng Wang, *Wei-Yu Chen, Youfeng Wu, Bratin Saha, Ali Adl-Tabatabai
Programming Systems Lab, Microprocessor Technology Labs, Intel Corporation
*Computer Science Division, University of California, Berkeley

Motivation
• Existing Transactional Memory (TM) constructs focus on managed languages
• Efficient software transactional memory (STM) takes advantage of managed language features
  – Optimistic Versioning (direct update memory with backup)
  – Optimistic Read (invisible read)
• Challenges in Unmanaged Language (e.g., C)
  – Consistency
    • No type safety, first-class exception handling
  – Function call
    • No just-in-time compilation
  – Stack rollback
    • Stack alias
  – Conflict detection
    • Not object oriented

Contributions
• First to introduce a comprehensive transactional memory construct to the C programming language
  – Transaction, function called within transaction, transaction rollback, …
• First to support transactions in a production-quality optimizing C compiler
  – Code generation, optimization, indirect function calls, …
• Novel STM algorithm and API that supports an optimizing compiler in an unmanaged environment
  – Quiescent transaction, stack rollback, …

Outline
• TM Language Construct
• STM Runtime
• Code Generation and Optimization
• Experimental Results
• Related Work
• Conclusion

TM Language Constructs

• Atomic block:

    #pragma tm_atomic
    {
        stmt1;
        stmt2;
    }

• Nested atomic block with user abort:

    #pragma tm_atomic
    {
        stmt1;
        #pragma tm_atomic
        {
            stmt2;
            …
            tm_abort();
        }
    }

• Functions callable inside a transaction:

    #pragma tm_function
    int foo(int);
    int bar(int);

    #pragma tm_atomic
    {
        foo(3);    // OK
        bar(10);   // ERROR
    }
    foo(2)   // OK
    bar(1)   // OK

Consistency Problem

Thread 1:

    #pragma tm_atomic
    {
        if(tq->free) {
            for(temp1 = tq->free;
                temp1->next && …;
                temp1 = temp1->next);
            task_struct[p_id].loc_free = tq->free;
            tq->free = temp1->next;
            temp1->next = NULL;
            …
        }
    }

Thread 2 runs the same transaction with temp2 in place of temp1.

• Slide annotations: "Not NULL" (at the tq->free check), "shared free list", "local free list", "NULL", "Memory Fault"
• Solution: timestamp-based aggressive consistency checking

Inconsistency Caused by Privatization

Thread 1:

    #pragma tm_atomic
    {
        if(tq->free) {
            for(temp1 = tq->free;
                temp1->next && …;
                temp1 = temp1->next);
            task_struct[p_id1].loc_free = tq->free;
            tq->free = temp1->next;
            temp1->next = NULL;
            …
        }
    }
    temp1 = task_struct[p_id1].loc_free;
    /* process temp */
    task_struct[p_id1].loc_free = temp1->next;
    temp1->next = NULL;

Thread 2 runs the same code with temp2 and p_id2 in place of temp1 and p_id1.

• Slide annotations: "Not NULL", "NULL", "Memory Fault"
• Solution: Quiescent Transaction

Quiescent Transaction

Thread 1:

    #pragma tm_atomic
    {
        if(tq->free) {
            for(temp1 = tq->free;
                temp1->next && …;
                temp1 = temp1->next);
            task_struct[p_id1].loc_free = tq->free;
            tq->free = temp1->next;
            temp1->next = NULL;
            …
        }
    }
    temp1 = task_struct[p_id1].loc_free;
    /* process temp */
    task_struct[p_id1].loc_free = temp1->next;
    temp1->next = NULL;

Thread 2 runs the same code with temp2 and p_id2 in place of temp1 and p_id1.

• Slide annotations: "Not NULL", "Quiescent" (at the end of a transaction), "Consistency Checking Fail" (in the other transaction)
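The slide only names the mechanism, so the following is a minimal sketch of one plausible quiescence test, not the runtime's actual algorithm. It assumes that each thread publishes whether it is inside a transaction and the global timestamp at which it last validated, and that a committed transaction may continue into non-transactional code only once no other active transaction is still running on a snapshot older than its commit. The names thread_state, is_quiescent and demo_quiescence are hypothetical, and the sketch is single-threaded (no atomics or memory fences).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread state published to other threads. */
typedef struct {
    int active;               /* 1 while the thread is inside a transaction */
    uint32_t local_timestamp; /* global timestamp at its last validation */
} thread_state;

/* Returns 1 when thread `self`, which committed at commit_ts, may proceed:
   no other active transaction still runs on a snapshot older than the commit. */
static int is_quiescent(const thread_state *threads, int n, int self,
                        uint32_t commit_ts) {
    for (int i = 0; i < n; i++) {
        if (i == self) continue;
        if (threads[i].active && threads[i].local_timestamp < commit_ts)
            return 0;
    }
    return 1;
}

/* Replays the two-thread scenario from the slide in a single thread. */
static int demo_quiescence(void) {
    thread_state threads[2] = { {1, 0}, {1, 0} };
    uint32_t commit_ts = 1;      /* thread 0 commits: global timestamp 0 -> 1 */
    threads[0].active = 0;

    /* Thread 1 is still active with local timestamp 0, so thread 0 waits. */
    int before = is_quiescent(threads, 2, 0, commit_ts);

    /* Thread 1 runs its consistency check, fails, and aborts. */
    threads[1].active = 0;
    int after = is_quiescent(threads, 2, 0, commit_ts);

    return before == 0 && after == 1;
}
```

In a real runtime the committing thread would spin or yield on this test before touching the data it just privatized.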

TM Runtime Issues (Stack Rollback)

    #pragma tm_atomic
    {
        …
        foo()
        …
        // abort
    }

    foo()
    {
        int a;
        …
        bar(&a)
        …
    }

    bar(int *p)
    {
        …
        *p …
        …
    }

• Stack diagram labels: "foo()", "a", "rollback a", "Stack Crash"
• Solution: Selective Stack Rollback
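A minimal sketch of how a selective rollback might look, under the assumption that the runtime knows which address range held stack frames created after the transaction began (such as foo()'s local a) and skips undo-log entries that fall inside it. The types undo_entry and undo_log and the function names are hypothetical. A plain array stands in for the dead stack region so that the example stays well-defined C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int *addr; int old_value; } undo_entry;

typedef struct {
    undo_entry entries[16];
    size_t n;
} undo_log;

/* Record the current value of *addr before a transactional write. */
static void undo_log_add(undo_log *log, int *addr) {
    if (log->n < 16) {
        log->entries[log->n].addr = addr;
        log->entries[log->n].old_value = *addr;
        log->n++;
    }
}

/* Restore logged values newest-first, skipping any address inside
   [dead_lo, dead_hi), the region assumed to hold frames pushed after the
   transaction started. Returns the number of entries restored. */
static size_t rollback_selective(undo_log *log, const int *dead_lo,
                                 const int *dead_hi) {
    size_t restored = 0;
    for (size_t i = log->n; i-- > 0; ) {
        int *a = log->entries[i].addr;
        uintptr_t p = (uintptr_t)a;
        if (p >= (uintptr_t)dead_lo && p < (uintptr_t)dead_hi)
            continue;                 /* dead stack slot: do not touch */
        *a = log->entries[i].old_value;
        restored++;
    }
    log->n = 0;
    return restored;
}

static int demo_stack_rollback(void) {
    int shared = 1;                    /* stands in for shared data */
    int fake_stack[4] = {0, 0, 0, 0};  /* stands in for frames pushed in the txn */
    undo_log log = { .n = 0 };

    undo_log_add(&log, &shared);        shared = 2;
    undo_log_add(&log, &fake_stack[1]); fake_stack[1] = 7;  /* *p = … in bar() */

    fake_stack[1] = 99;  /* foo() returned; the slot is reused by other code */

    size_t restored = rollback_selective(&log, fake_stack, fake_stack + 4);
    return restored == 1 && shared == 1 && fake_stack[1] == 99;
}
```

A non-selective rollback would have written the stale value 0 back into the reused slot, which corresponds to the "Stack Crash" label on the slide.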

Optimization Issues (Redundant Barrier)

Source code:

    #pragma tm_atomic
    {
        a = b + 1;
        …;          // may alias a or b
        a = b + 1;
    }

Generated code for a = b + 1:

    desc = stmGetTxnDesc();
    rec1 = IRComputeTxnRec(&b);
    ver1 = IRRead(desc, rec1);
    t = b;
    IRCheckRead(desc, rec1, ver1);
    desc = stmGetTxnDesc();
    rec2 = IRComputeTxnRec(&a);
    IRWrite(desc, rec2);
    IRUndoLog(desc, &a);
    a = t + 1;

• Slide annotation: "not redundant"
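To make the barrier sequence concrete, here is a toy, single-threaded model of it. Everything prefixed toy_ is hypothetical and only mirrors the shape of the calls on the slide. The mapping from an address to its transaction record by 64-byte line is an assumption, loosely based on the "cache-line based conflict detection" mentioned in the conclusion. The ownership step (IRWrite) is omitted because nothing contends for the record here.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t *addr; uint32_t old; } toy_undo;
typedef struct { toy_undo undo[8]; int n_undo; } toy_desc;

#define TOY_NRECS 16
static uint32_t toy_recs[TOY_NRECS];   /* one version word per record */

/* Map an address to its record; the 64-byte granularity is an assumption. */
static uint32_t *toy_compute_txn_rec(const void *addr) {
    return &toy_recs[((uintptr_t)addr >> 6) % TOY_NRECS];
}

static uint32_t toy_read(const uint32_t *rec) { return *rec; }

static int toy_check_read(const uint32_t *rec, uint32_t ver) {
    return *rec == ver;                /* version unchanged since toy_read */
}

static void toy_undo_log(toy_desc *d, uint32_t *addr) {
    if (d->n_undo < 8) {
        d->undo[d->n_undo].addr = addr;
        d->undo[d->n_undo].old = *addr;
        d->n_undo++;
    }
}

/* Direct update with backup: abort replays the undo log newest-first. */
static void toy_abort(toy_desc *d) {
    while (d->n_undo > 0) {
        d->n_undo--;
        *d->undo[d->n_undo].addr = d->undo[d->n_undo].old;
    }
}

/* Lowered form of `a = b + 1`. Returns 0 if the read of b was inconsistent. */
static int toy_assign_a_from_b(toy_desc *d, uint32_t *a, const uint32_t *b) {
    uint32_t *rec1 = toy_compute_txn_rec(b);
    uint32_t ver1 = toy_read(rec1);
    uint32_t t = *b;
    if (!toy_check_read(rec1, ver1)) return 0;
    toy_undo_log(d, a);   /* ownership acquisition omitted in this sketch */
    *a = t + 1;
    return 1;
}

static int demo_barriers(void) {
    toy_desc d = { .n_undo = 0 };
    uint32_t a = 10, b = 4;
    int ok = toy_assign_a_from_b(&d, &a, &b);
    uint32_t after_write = a;          /* updated in place */
    toy_abort(&d);                     /* rollback restores the old value */
    return ok && after_write == 5 && a == 10;
}
```

If the statement between the two assignments may write a or b, the second `a = b + 1` needs its own barriers, which is why the slide marks them as not redundant.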

Experiment Setup
• Target System
  – 16-way IBM eServer xSeries 445, 2.2 GHz Xeon
  – Linux 2.4.20, icc v9.0 (with STM), -O3
• Benchmarks
  – 3 synthetic concurrent data structure benchmarks
    • Hashtable, btree, avltree
  – 8 SPLASH-2 benchmarks
    • 4 SPLASH-2 benchmarks spend little time in critical sections
  – Fine-grained lock vs. coarse-grained lock vs. STM
    • Coarse-grained lock: replace all locks with a single global lock
    • STM:
      – Replace all lock sections with transactions
      – Put non-transactional conflicting accesses in transactions

Hashtable

[Figure: hashtable (80% update); time (seconds) vs. threads; series: fine lock, coarse lock, manual stm, compiler stm, compiler stm (no consistency)]

• STM scales similarly to the fine-grained lock
• Manual and compiler STM have comparable performance

FMM

[Figure: FMM; time (seconds) vs. threads; series: fine lock, stm, coarse lock, no consistency]

• STM is much better than the coarse-grained lock

Splash 2

[Figure: raytrace; time (seconds) vs. threads; series: fine lock, stm, coarse lock]

• STM can be more scalable than locks

Optimization Benefits

[Figure: Normalized Execution Time for barnes, cholesky, fmm, raytrace, radiosity, geo-mean; series: no opt, barrier elim, inlining, full opt, no consistency, lock]

• The overhead is within 15%, with an average of only 6.4%

Related Work
• Transactional Memory
  – [Herlihy, ISCA 93]
  – [Ananian, HPCA 05], [Rajwar, ISCA 05], [Moore, HPCA 06], [Hammond, ASPLOS 04], [McDonald, ISCA 06], [Saha, MICRO 06]
• Software Transactional Memory
  – [Shavit, PODC 95], [Herlihy, PODC 03], [Harris, ASPLOS 04]
• Prior work on TM constructs in managed languages
  – [Adl-Tabatabai, PLDI 06], [Harris, PLDI 06], [Carlstrom, PLDI 06], [Ringenburg, ICFP 05]
• Efficient STM
  – [Saha, PPoPP 06]
• Timestamp-based approach
  – [Dice, DISC 06], [Riegel, DISC 06]

Conclusion
• We solve the key STM compiler problems for unmanaged languages
  – Aggressive consistency checking
  – Static function cloning
  – Selective stack rollback
  – Cache-line based conflict detection
• We developed a highly optimized STM compiler
  – Efficient register rollback
  – Barrier elimination
  – Barrier inlining
• We evaluated our STM compiler with well-known parallel benchmarks
  – The optimized STM compiler can achieve most of the hand-coded benefits
  – There are opportunities for future performance tuning and enhancement

Questions ?

STM Runtime API

    TxnDesc* stmGetTxnDesc();
    uint32 stmStart(TxnDesc*, TxnMemento*);
    uint32 stmStartNested(TxnDesc*, TxnMemento*);
    void stmCommit(TxnDesc*);
    void stmCommitNested(TxnDesc*);
    void stmUserAbort(TxnDesc*);
    void stmAbort(TxnDesc*);
    uint32 stmValidate(TxnDesc*);
    uint32* stmComputeTxnRec(uint32* addr);
    uint32 stmRead(TxnDesc*, uint32* txnRec);
    void stmCheckRead(TxnDesc*, uint32* txnRec, uint32 version);
    void stmWrite(TxnDesc*, uint32* txnRec);
    void stmUndoLog(TxnDesc*, uint32* addr, uint32 size);

Data Structures

Example 1

    #pragma tm_atomic
    {
        t = head;
        head = t->next;
    }
    … = *t;

    #pragma tm_atomic
    {
        s = head;
        *s = …;
    }

Example 2

    #pragma tm_atomic
    {
        t = head;
        head = t->next;
    }
    … = *t;

    #pragma tm_atomic
    {
        s = head;
        *s = …;
        head = s->next;
    }

Example 3

    #pragma tm_atomic
    {
        t = head;
        head = t->next;
    }
    *t = …;

    #pragma tm_atomic
    {
        s = head;
        … = *s;
        head = s->next;
    }

Optimization Issues (Register Checkpointing)

[Three-column code figure: Source Code, Checkpointing Code, Optimized Code]

• Source Code: a #pragma tm_atomic region involving the statements t1 = 0; t2 = t1 + t2; t1 = t3; t3 = 1;
• Checkpointing Code: t2_bkup = t2; followed by while(setjmp(…)) { t2 = t2_bkup; } and the region bracketed by stmStart(…) and stmCommit(…)
• Optimized Code: the same t2 checkpoint and stmStart(…) / stmCommit(…) bracket, annotated "Abort" and "can not recover"
• Checkpointing all the live-in local data does not work with compiler optimizations across the transaction boundary

TimeStamp based Consistency Checking

Thread 1:

    #pragma tm_atomic
    {
        if(tq->free) {
            for(temp1 = tq->free;
                temp1->next && …;
                temp1 = temp1->next);
            task_struct[p_id].loc_free = tq->free;
            tq->free = temp1->next;
            temp1->next = NULL;
            …
        }
    }

Thread 2 runs the same transaction with temp2 in place of temp1.

• Slide annotations: Global Timestamp 0 → 1; record versions "Version 0" and "Version 1"; "Local Timestamp 0" (shown for each thread)
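The annotations suggest a scheme in which every commit advances a global timestamp and stamps the records it wrote, while a reader revalidates its read set whenever the global timestamp has moved past its own local timestamp. The following is a minimal single-threaded sketch of that idea. All names (txn_rec, txn_desc, txn_read and so on) are hypothetical and are not the runtime API from the slides, and a real implementation would need atomic operations.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t global_timestamp = 0;  /* advanced by every committing writer */

typedef struct {
    uint32_t version;        /* timestamp of the last commit that wrote here */
} txn_rec;

typedef struct {
    uint32_t local_timestamp;      /* global timestamp at last validation */
    const txn_rec *read_set[8];
    uint32_t read_versions[8];
    int n_reads;
} txn_desc;

static void txn_begin(txn_desc *d) {
    d->local_timestamp = global_timestamp;
    d->n_reads = 0;
}

/* Returns 1 if every record read so far still has the version first seen. */
static int txn_validate(txn_desc *d) {
    for (int i = 0; i < d->n_reads; i++)
        if (d->read_set[i]->version != d->read_versions[i])
            return 0;
    d->local_timestamp = global_timestamp;
    return 1;
}

/* Read barrier: log the read, then revalidate if any commit has happened
   since the last validation. A return of 0 means the caller must abort
   before using the value it read. */
static int txn_read(txn_desc *d, const txn_rec *r) {
    if (d->n_reads == 8) return 0;     /* sketch: fixed capacity, treat as abort */
    d->read_set[d->n_reads] = r;
    d->read_versions[d->n_reads] = r->version;
    d->n_reads++;
    if (global_timestamp != d->local_timestamp)
        return txn_validate(d);
    return 1;
}

static void writer_commit(txn_rec *r) {
    global_timestamp++;
    r->version = global_timestamp;
}

static int demo_consistency(void) {
    txn_rec free_list = {0}, next_ptr = {0};
    txn_desc t1;
    global_timestamp = 0;
    txn_begin(&t1);                        /* local timestamp 0 */
    int ok1 = txn_read(&t1, &free_list);   /* sees version 0 */
    writer_commit(&free_list);             /* other thread: global 0 -> 1 */
    int ok2 = txn_read(&t1, &next_ptr);    /* timestamp moved: validation fails */
    return ok1 == 1 && ok2 == 0;
}
```

The point of checking at every read is that the doomed transaction stops before it dereferences a pointer taken from an inconsistent snapshot.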

Checkpointing Approach

    normal entry:   t2_bkup = t2;   t3_bkup = t3
    retry entry:    t2 = t2_bkup;   t3 = t3_bkup

Transaction region (the slide shows it on both sides of an arrow labeled "Optimization"):

    t1 = 0
    #pragma tm_atomic
    {
        t2 = t1 + t2;
        t1 = t3;
        …
        t3 = 1;
    }
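The two entry points can be imitated in portable C with setjmp and longjmp, as a sketch only: the normal entry saves the live-in locals that the region overwrites, and the retry entry restores them before the region runs again. The names retry_point, maybe_abort and demo_checkpoint are hypothetical, and the abort is simulated by a counter. Locals that are modified between setjmp and longjmp are declared volatile, because ISO C otherwise leaves their values indeterminate after the jump.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf retry_point;
static int attempts;

/* Simulated abort: the first attempt jumps back to the retry entry. */
static void maybe_abort(void) {
    if (attempts++ == 0)
        longjmp(retry_point, 1);
}

static int demo_checkpoint(void) {
    volatile int t2 = 5, t3 = 7;   /* live-in values modified by the region */
    volatile int t2_bkup, t3_bkup;
    int t1;

    attempts = 0;
    t2_bkup = t2;                  /* normal entry: take the checkpoint */
    t3_bkup = t3;
    if (setjmp(retry_point) != 0) {
        t2 = t2_bkup;              /* retry entry: restore the checkpoint */
        t3 = t3_bkup;
    }

    t1 = 0;                        /* statements from the slide */
    t2 = t1 + t2;
    t1 = t3;
    t3 = 1;
    maybe_abort();                 /* aborts once, then lets the region finish */

    /* Without restoring t3, the retry would have read t3 == 1 into t1. */
    return t1 == 7 && t2 == 5 && t3 == 1;
}
```

The sketch uses an if where the slides use a while(setjmp(…)) loop; both rerun the restore code each time control re-enters through the retry path.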

Function Clone

• Source Code

    #pragma tm_function
    void foo(…) { … }

    #pragma tm_atomic
    {
        foo();
        (*fp)();
    }

• STM Code

    <foo-4>:   &foo_tm           // points to transactional version
    <foo>:                       // normal version
               no-op marker      // unique marker
               …                 // normal code
    <foo_tm>:                    // transactional version
               …                 // code for transaction

    foo_tm();
    if(*fp == "no-op marker")
        (**(fp-4))();            // call foo_tm
    else
        handle non-TM binary
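Portable C cannot place data in front of a function's entry point, so the following sketch models the layout with a struct instead: the word before the entry holds the address of the transactional clone, and the first word at the entry is the marker. The struct fn_image, the constant TM_MARKER and every function name are hypothetical. The fallback for an unmarked function simply calls the normal code, which is an assumption, since the slide only says "handle non-TM binary".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*fn_t)(int);

#define TM_MARKER 0x544D4E4Fu      /* hypothetical unique marker value */

/* Model of the code layout: clone pointer, then the marker at the entry. */
typedef struct {
    fn_t tm_clone;     /* corresponds to <foo-4>: &foo_tm */
    uint32_t marker;   /* corresponds to the first word at <foo> */
    fn_t normal;       /* stands in for the normal code after the marker */
} fn_image;

static int foo(int x)    { return x + 1; }     /* normal version */
static int foo_tm(int x) { return x + 100; }   /* stands in for the clone */
static int legacy(int x) { return x - 1; }     /* compiled without TM support */

static const fn_image foo_image    = { foo_tm, TM_MARKER, foo };
static const fn_image legacy_image = { 0, 0, legacy };

/* Indirect call inside a transaction: `entry` plays the role of fp. */
static int call_in_txn(const uint32_t *entry, int arg) {
    const fn_image *img = (const fn_image *)
        ((const char *)entry - offsetof(fn_image, marker));
    if (*entry == TM_MARKER)
        return img->tm_clone(arg);   /* marker found: call the clone */
    return img->normal(arg);         /* assumed fallback for non-TM code */
}

static int demo_function_clone(void) {
    int direct   = foo_tm(1);        /* direct call: rewritten at compile time */
    int indirect = call_in_txn(&foo_image.marker, 1);
    int fallback = call_in_txn(&legacy_image.marker, 1);
    return direct == 101 && indirect == 101 && fallback == 0;
}
```

The marker test is what lets a transaction tell at run time whether a function reached through a pointer has a transactional version at all.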

• STM is much better than the coarse-grained lock (fine lock ???)