Hybrid Transactional Memory
Reza Sherafat, Prof. Cristiana Amza
University of Toronto
Dec 4, 2006
Quick Background Review
• A transaction is a sequence of operations that, as a whole, is performed atomically.
• Life cycle of a transaction:
– Initialization: start a transaction by storing the current state;
– Execution: open objects for read/write;
  • Data modifications are hidden from others;
  • Watch for conflicts;
– Termination: end the transaction
  • Successful completion (commit): the modifications take effect and become visible to other threads; or
  • Unsuccessful completion (abort): discard the modifications.
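The life cycle above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration only: the `Transaction` class and its method names are hypothetical (not taken from the paper), and conflict detection is left out entirely.

```python
class Transaction:
    """Hypothetical sketch: writes are buffered privately until commit."""

    def __init__(self, store):
        self.store = store      # shared state: dict of name -> value
        self.writes = {}        # speculative modifications, hidden from others
        self.status = "ACTIVE"  # initialization: the transaction has started

    def read(self, key):
        # A transaction sees its own speculative writes first.
        if key in self.writes:
            return self.writes[key]
        return self.store[key]

    def write(self, key, value):
        self.writes[key] = value  # the shared store is untouched until commit

    def commit(self):
        self.store.update(self.writes)  # modifications take effect
        self.status = "COMMITTED"

    def abort(self):
        self.writes.clear()  # discard modifications
        self.status = "ABORTED"


store = {"x": 1}

t1 = Transaction(store)
t1.write("x", 2)
hidden = store["x"]        # the speculative write is not visible in the store
t1.abort()
after_abort = store["x"]   # the old value is still the valid one

t2 = Transaction(store)
t2.write("x", 3)
t2.commit()
after_commit = store["x"]  # the committed value is now visible
```

Buffering writes in a private dictionary is only one way to hide modifications; the slides that follow describe keeping versions in a hardware buffer or in copied objects instead.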
Outline
• Motivations
• Hybrid Transactional Memory
• Implementation
• Evaluations
• Conclusions
Motivations
• In parallel programs we must protect concurrent access to shared data.
• Locks are widely used, but several problems are associated with them:
– Performance (speedup)
  • Overhead of locking (wait time, acquire, release)
  • Granularity (hard to balance wait time against overhead)
  • Over-serialization
– Programming
  • Hard for programmers to write and debug
  • Deadlocks are hard to avoid
– Other problems
  • Priority inversion
  • A process may crash while holding a lock
Transactional Memory (TM)
• Main idea: non-blocking execution
– Execute each concurrent transaction speculatively;
– Apply changes when the transaction completes successfully.
• Non-conflicting access to shared objects within transactions is allowed:
– Once a conflict is detected, the transaction rolls back and its state is restored (abort).
• TM support is provided through an API:
– Start a transaction
– Abort/commit a transaction
– Wrap objects in TM objects
• Properties of transactions:
– Atomic: a transaction is like a single unit (all-or-nothing)
– Serializable: concurrent transactions are performed in some serial order
– Obstruction-freedom: guarantees progress of one process in the absence of contention
– No deadlock
Conflicting Access to Shared Data
• Conflicts in accessing shared data may result in data inconsistencies.
• A conflict happens when an object that has been accessed (read or written) by other transactions is updated before those transactions commit.
– Multiple readers are allowed
– Only one writer is allowed at a time
• The system ensures that transactions that access data don’t conflict. If no conflicts occur, the transactions are serializable.
• Conflict resolution: once a conflict is detected, we can get a serializable execution by aborting all but one of the conflicting transactions.
• Speculative modifications of aborted transactions are discarded; the values from before the transaction started remain valid.
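The reader/writer rule and the "abort all but one" resolution can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, and letting the first-listed transaction win is an arbitrary policy chosen for the example, not something the slides specify.

```python
def conflicts(access_a, access_b):
    """Two accesses to the same object conflict unless both are reads."""
    return access_a == "write" or access_b == "write"


def resolve(accesses):
    """Pick a winner among transactions that accessed one object.

    `accesses` maps a transaction id to "read" or "write". The first-listed
    transaction wins (an assumed policy); every transaction that conflicts
    with the winner is aborted.
    """
    winner, *others = accesses
    aborted = [t for t in others if conflicts(accesses[winner], accesses[t])]
    return winner, aborted


# Two readers coexist; the writer conflicts with the winning reader.
readers_win = resolve({"t1": "read", "t2": "read", "t3": "write"})

# A winning writer forces every other transaction to abort.
writer_wins = resolve({"t1": "write", "t2": "read", "t3": "read"})
```

In both cases the surviving transactions are either all readers or a single writer, which matches the rule on the slide.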
Hybrid TM
Each approach must implement the TM semantics: start transaction, open object, detect conflicts, abort, commit.
• Hardware-based approaches:
– Bounded number of locations
– Maintain versions in cache → low overhead
• Software-based approaches:
– Unbounded number of locations can be accessed within a transaction
– Slow due to the overhead of maintaining multiple copies
– Potentially orders of magnitude slower
• Hybrid: combines the benefits of both approaches
– High performance (unless the transaction exceeds HW limits)
– Support for unlimited transactional objects
– Handles simultaneous data access from HW/SW modes
Implementations
• Two modes for executing transactions: HW vs. SW.
• In general, HW mode is preferred (it is faster), unless we run out of resources.
• Naïve approach: the system has a universal mode of operation.
• A better approach: transactions have two modes to choose from.
– Each transaction separately chooses its mode of operation when it starts.
– Better performance and utilization of system resources
• Other policies may also be applied to choose the mode: if the transaction fails a number of times (e.g., 3), then start it in SW mode.
• Pure HW/SW implementations must be tailored so that they can coexist.
– Objects may be accessed simultaneously by transactions in HW and SW modes.
– Interoperability is a must.
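The per-transaction mode choice with a retry threshold might look like the sketch below. Everything here is illustrative: `run_transaction`, the two exception types, and the convention that the transaction body receives the mode as a string are assumptions made for the example, not an API from the paper.

```python
class HardwareLimitExceeded(Exception):
    """Hypothetical signal: the transaction does not fit in HW resources."""


class Conflict(Exception):
    """Hypothetical signal: the transaction was aborted by a conflict."""


def run_transaction(body, max_hw_attempts=3):
    """Try HW mode first; fall back to SW mode after repeated failures."""
    for _ in range(max_hw_attempts):
        try:
            return body("HW"), "HW"
        except HardwareLimitExceeded:
            break      # retrying in HW cannot help, switch to SW now
        except Conflict:
            continue   # aborted by a conflict, retry in HW mode
    while True:        # SW mode: retry until the transaction commits
        try:
            return body("SW"), "SW"
        except Conflict:
            continue


attempts = []

def body(mode):
    attempts.append(mode)
    if mode == "HW":
        raise Conflict()  # pretend every HW attempt is aborted
    return 42

result = run_transaction(body)
```

With the threshold of 3 mentioned on the slide, the body is attempted three times in HW mode before it is started in SW mode.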
Hardware TM
A HW-TM scheme that can be used for the hybrid implementation; it relies on the standard cache coherence protocol and some additional components.
• The cache coherence protocol handles data consistency across multiple processors:
– Only one processor has permission to write to a cache line;
– No processor can read a line that another processor has permission to write to.
• Additional components on each processor store speculative data and check for conflicts:
– ISA extensions
  • Instructions for: transactional begin, commit, abort, load/store, etc.
– Additional components on the processor chip (in parallel with the L1 cache)
  • Transactional buffer: old,
  • Transactional state table: state of the contexts (threads) running on the processor
• All memory accesses within a transaction are done transactionally.
HW-TM
• The old field keeps speculative values
• Transactional semantics:
– Start transaction: Transactional state for that context is set to SELECT, ALL.
– Abort: the exception flag is set, the corresponding read/write bits are cleared, and speculatively written data is invalidated
– Commit: update the transactional state.
– Detect conflicts: read/write bit vector
• If the exception flag is set, any attempt by the transaction to commit or load/store results in a trap that is handled by the exception handler.
Question: How is abort implemented across multiple processors? Through the cache coherence protocol (CCP)!
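One plausible reading of the read/write bit vectors is sketched below as a software model. It is an assumption-laden illustration, not the hardware design: `TxBuffer` and its methods are hypothetical names, Python sets stand in for bit vectors (one bit per context), and the rule that a remote read conflicts only with local writers is inferred from the coherence rules on the previous slide.

```python
class TxBuffer:
    """Hypothetical model of per-line read/write bit vectors."""

    def __init__(self, num_contexts):
        self.read_bits = {}    # line address -> contexts that read the line
        self.write_bits = {}   # line address -> contexts that wrote the line
        self.exception = [False] * num_contexts  # per-context exception flag

    def load(self, ctx, line):
        self.read_bits.setdefault(line, set()).add(ctx)

    def store(self, ctx, line):
        self.write_bits.setdefault(line, set()).add(ctx)

    def remote_access(self, line, is_write):
        """Coherence request from another processor: abort conflicting contexts."""
        victims = set(self.write_bits.get(line, ()))   # local writers always conflict
        if is_write:
            victims |= self.read_bits.get(line, set())  # a remote write also hits readers
        for ctx in victims:
            self.abort(ctx)
        return victims

    def abort(self, ctx):
        self.exception[ctx] = True  # later commit/load/store would trap
        for bits in (self.read_bits, self.write_bits):
            for holders in bits.values():
                holders.discard(ctx)  # clear this context's read/write bits


buf = TxBuffer(num_contexts=2)
buf.load(0, 0x40)    # context 0 reads line 0x40
buf.store(1, 0x80)   # context 1 writes line 0x80

no_conflict = buf.remote_access(0x40, is_write=False)  # remote read vs. local read
victims = buf.remote_access(0x40, is_write=True)       # remote write vs. local read
```

Invalidation of speculatively written data is not modeled here; only the flag and bit-vector bookkeeping from the slide is shown.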
Quick Review of DSTM
[Figure: before accessing an object within a transaction. Labels: object pointer; two locators, each with a state pointer and old/new fields; state; object contents (valid); copy of the object contents; modify.]
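The DSTM open-for-write step reviewed on this slide can be sketched as below. This is a simplified, single-threaded model with hypothetical class names: a real DSTM installs the new locator with an atomic compare-and-swap and consults a contention manager, whereas this sketch uses a plain assignment and always aborts the active owner.

```python
import copy


class Tx:
    def __init__(self, status="ACTIVE"):
        self.status = status


class Locator:
    def __init__(self, owner, old, new):
        self.owner, self.old, self.new = owner, old, new


class TMObject:
    def __init__(self, value):
        # Start with a locator whose owner has already committed.
        self.locator = Locator(Tx("COMMITTED"), None, value)

    def current(self):
        """The valid version: `new` if the owner committed, otherwise `old`."""
        loc = self.locator
        return loc.new if loc.owner.status == "COMMITTED" else loc.old

    def open_write(self, tx):
        loc = self.locator
        if loc.owner is tx:
            return loc.new                 # already opened by this transaction
        if loc.owner.status == "ACTIVE":
            loc.owner.status = "ABORTED"   # assumed policy: the newcomer wins
        valid = self.current()
        # Redirect the object to a new locator holding a private copy.
        self.locator = Locator(tx, valid, copy.deepcopy(valid))
        return self.locator.new


obj = TMObject([1, 2])
t1 = Tx()
private = obj.open_write(t1)
private.append(3)                # modify the copy, not the valid version
before_commit = list(obj.current())
t1.status = "COMMITTED"          # a single status change publishes the copy
after_commit = list(obj.current())
```

The point of the indirection is that commit and abort are each a single write to the transaction's status, which switches which of `old` and `new` counts as valid.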
Software TM
• Uses a locator similar to DSTM:
– Redirection and object copying.
• The locator also keeps track of the readers.
– As opposed to local hash tables that store the last data value in each read transaction.
– This helps early abort and avoids validation when committing.
• A locator object in Hybrid-TM consists of:
– Valid field
– Write state (one)
– Read state (multiple)
– Old/new objects
– Object size
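A locator with the fields listed above, together with reader tracking, could be modeled as follows. This is a hedged sketch: the field names mirror the slide, but the types, the `open_read` and `abort_readers` helpers, and the treatment of `valid` as a plain flag are assumptions; arbitration between a reader and an active writer is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class State:
    status: str = "ACTIVE"


@dataclass
class HybridLocator:
    valid: bool                     # meaning not detailed on the slide; a plain flag here
    write_state: Optional[State]    # at most one writer
    old: object                     # version valid before the writer commits
    new: object                     # writer's private copy
    size: int                       # object size
    read_states: list = field(default_factory=list)  # multiple readers


def open_read(loc, reader):
    """Register the reader in the locator so later writers can find it."""
    loc.read_states.append(reader)
    committed = loc.write_state is not None and loc.write_state.status == "COMMITTED"
    return loc.new if committed else loc.old


def abort_readers(loc):
    """A writer aborts registered readers early, so readers do not have to
    re-validate what they read when they commit."""
    for reader in loc.read_states:
        if reader.status == "ACTIVE":
            reader.status = "ABORTED"
    loc.read_states.clear()


loc = HybridLocator(valid=True, write_state=None, old="v0", new=None, size=2)
r1, r2 = State(), State()
seen = open_read(loc, r1)
open_read(loc, r2)
r2.status = "COMMITTED"   # r2 finished before any writer arrived
abort_readers(loc)        # a writer opens the object
```

Only the still-active reader is aborted; a reader that already committed is unaffected.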
Putting Things Together
• Transactions in HW may conflict with those in SW, and vice versa.
– Opening an object in HW:
  • [Read the TMObject pointer transactionally]
  • Abort all conflicting HW/SW transactions
– Opening an object in SW:
  • Create a state object, and load it transactionally
  • Abort conflicting HW/SW transactions
– Hardware aborts hardware
  • A load/store (transactional by default) causes an abort
– Software aborts hardware
  • When SW opens a TMObject, it assigns it a new locator. Since the HW transaction has read the object transactionally, it is aborted.
– Hardware aborts software
  • When HW opens a TMObject, it writes ABORTED to the state of each transaction holding this object
– Software aborts software
  • Write ABORTED to the states reached through the reader/writer pointers.
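Three of the four abort paths above (software aborts hardware, hardware aborts software, software aborts software) can be simulated with the toy model below. All names are hypothetical, and the sketch treats every open as conflicting with every earlier one, which is a simplifying assumption. Hardware-aborts-hardware is not modeled because it is handled by the coherence protocol itself.

```python
class SWState:
    def __init__(self):
        self.status = "ACTIVE"


class HWTx:
    def __init__(self):
        self.aborted = False


class TMObject:
    def __init__(self):
        self.hw_readers = set()  # HW transactions that read the pointer transactionally
        self.sw_states = []      # states of SW transactions holding the object


def hw_open(obj, hw):
    # Hardware aborts software: write ABORTED to each holder's state.
    for state in obj.sw_states:
        if state.status == "ACTIVE":
            state.status = "ABORTED"
    obj.sw_states.clear()
    obj.hw_readers.add(hw)       # the pointer is now in the HW read set


def sw_open(obj, state):
    # Software aborts hardware: installing a new locator overwrites a pointer
    # that HW transactions have read, so those transactions abort.
    for hw in obj.hw_readers:
        hw.aborted = True
    obj.hw_readers.clear()
    # Software aborts software: write ABORTED through the recorded states.
    for other in obj.sw_states:
        if other.status == "ACTIVE":
            other.status = "ABORTED"
    obj.sw_states = [state]


obj = TMObject()
hw1, sw1, sw2, hw2 = HWTx(), SWState(), SWState(), HWTx()
hw_open(obj, hw1)
sw_open(obj, sw1)   # software aborts hardware: hw1
sw_open(obj, sw2)   # software aborts software: sw1
hw_open(obj, hw2)   # hardware aborts software: sw2
```

After this sequence only `hw2` is still running; each earlier transaction was aborted by the one that opened the object after it.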
Software aborts Hardware
[Figure: conflict detected by the threads in hardware mode. Thread 1 and Thread 2 (HW mode) modify the object contents in place; Thread 3 (SW mode) copies the object contents and modifies the copy. Labels: object pointer; two locators, each with a state pointer and old/new fields; state; object contents.]
Evaluations
• Three microbenchmarks
– VR: small critical section (overhead of starting/committing transactions)
– HT: simultaneous lookup operations (per-object overhead of transactions)
– GU: coarse-grained locking vs. transactional memory
• For each case, two scenarios: low and high contention
• Compare four synchronization implementations
– Lock
– Pure Hardware Transactional Memory
– Pure Software Transactional Memory
– Hybrid Transactional Memory
Evaluations (Hybrid Execution)
• In all cases of hybrid execution, the ratio of SW-mode to HW-mode transactions is very small.
• This is due to the large size of the transactional buffer relative to the size of the transactional objects. (Is this realistic?)
• Since HW mode is used in most transactions, this does not give a good view of the overhe
Evaluations (VR)
• As the number of processors grows, contention does not grow significantly
– This is because the transactions are too small (conflicts rarely happen)
Evaluations (HT)
• It is true that several lookup operations can be performed simultaneously, however those o
– This seems to be significant for transactions of slightly longer duration
– The lock performance is better.
• The paper claims that similar behavior would be achieved by reader-writer locks;
– I expect they would perform much better, since once underway concurrent operations
Evaluations (GU)
• Why does the execution time decrease in the lock implementation from GU-low to GU-h
• It is usually the inverse!
– Do locks have back-offs?
Conclusions
• Transactional memory outperforms lock-based synchronization in most cases
• The Hybrid Transactional Memory approach gives a good balance between the scalability of SW and the performance of HW
– Requires only modest hardware support (transactional buffer, state table)
– Within system limits: good performance for most transactions
– Exceeding system limits: falls back to software mode when a transaction cannot complete within the hardware bounds
• More needs to be done to ensure progress.
Questions?!
• Nested transactions?
• Additional limits for the HW:
– Contexts
• Hybrid has limitations:
– Uses transactional buffer
• I am not sure how the non-blocking mechanism is implemented across multiple processors.