Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Ravi Rajwar and James R. Goodman
Presented by Yang Liu, CPS 221, Spring 2008
Outline
- Why want to elide locks?
- How to elide locks? – SLE
- Why does SLE work correctly?
- How to implement SLE?
- How is performance improved?
Why want to elide locks?
- Locks force serialization
- Not all locks are required
How to elide locks? - Atomicity conditions
- Within a speculatively executing critical section
  - Read data is not modified by another thread before section ends
  - Written data is not accessed by another thread before section ends
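The two atomicity conditions above can be read as a conflict check between the section's read/write sets and accesses by other threads. The following is a minimal software sketch of that check, not the hardware mechanism; the class name, method names, and the "read"/"write" encoding are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SpeculativeSection:
    """Addresses touched inside one speculatively executed critical section."""
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)

    def read(self, addr):
        self.read_set.add(addr)

    def write(self, addr):
        self.write_set.add(addr)

    def violated_by(self, remote_op, addr):
        """Return True if another thread's access breaks the atomicity conditions.

        remote_op is "read" or "write" (hypothetical encoding).
        """
        if remote_op == "write":
            # Condition 1: data we read must not be modified by another thread.
            # Condition 2: data we wrote must not be accessed (here: written).
            return addr in self.read_set or addr in self.write_set
        # Condition 2: data we wrote must not be accessed (here: read).
        return addr in self.write_set
```

Note that a remote read of data the section has only read is not a violation, which is what lets several threads speculate through the same critical section at once.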
How to elide locks? - Eliding silent pairs
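A store pair is "silent" when the release store restores the value the lock held before the acquire store, so skipping both stores leaves the lock location unchanged. A minimal single-threaded sketch of that observation follows; the FREE/HELD encodings and all function names are made up for illustration.

```python
FREE, HELD = 0, 1  # hypothetical lock encodings

def run_with_lock(memory, lock_addr, body):
    """Conventional execution: both stores of the pair are performed."""
    memory[lock_addr] = HELD   # acquire store: FREE -> HELD
    body(memory)
    memory[lock_addr] = FREE   # release store: restores the original value

def run_elided(memory, lock_addr, body):
    """Elided execution: neither store is performed, the lock is only read."""
    assert memory[lock_addr] == FREE  # the lock must be observed free
    body(memory)

def deposit(memory):
    memory["balance"] += 10

locked = {"lock": FREE, "balance": 0}
elided = {"lock": FREE, "balance": 0}
run_with_lock(locked, "lock", deposit)
run_elided(elided, "lock", deposit)
print(locked == elided)  # same final memory state either way
```

Because the lock is never written in the elided run, other threads keep seeing it as free and can enter the critical section concurrently.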
How to elide locks? - Steps
- Predict lock release store (silent pairs)
- Predict atomicity and elide lock acquire
- Execute critical section speculatively and buffer results
- If atomicity cannot be maintained, trigger misspeculation, recover and explicitly acquire lock
- If lock release store is seen, elide lock release, commit state and exit section
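The steps above can be sketched as a small single-threaded model. This is an illustration under stated assumptions, not the hardware design: `conflicting_addrs` is a hypothetical stand-in for addresses another thread touches during the section, and the function and exception names are invented here.

```python
FREE, HELD = 0, 1  # hypothetical lock encodings

class Misspeculation(Exception):
    """Raised when the atomicity conditions cannot be maintained."""

def sle_execute(memory, lock_addr, body, conflicting_addrs=frozenset()):
    """Run body(load, store) as a critical section guarded by memory[lock_addr].

    Returns "elided" if the section committed without acquiring the lock,
    or "locked" if it fell back to an explicit acquire.
    """
    buffer = {}  # speculative stores, invisible to memory until commit

    def spec_load(addr):
        if addr in conflicting_addrs:
            raise Misspeculation(addr)
        return buffer.get(addr, memory.get(addr, 0))

    def spec_store(addr, value):
        if addr in conflicting_addrs:
            raise Misspeculation(addr)
        buffer[addr] = value

    # Predict a silent pair and elide the acquire: the lock is read, not written.
    if memory[lock_addr] == FREE:
        try:
            body(spec_load, spec_store)  # execute speculatively, buffering results
            memory.update(buffer)        # release store seen: elide it and commit
            return "elided"
        except Misspeculation:
            buffer.clear()               # recover: discard speculative state

    # Fallback: acquire the lock explicitly. A real lock would spin until it
    # is FREE; this single-threaded model assumes it already is.
    memory[lock_addr] = HELD
    body(lambda a: memory.get(a, 0), lambda a, v: memory.__setitem__(a, v))
    memory[lock_addr] = FREE
    return "locked"

def increment(load, store):
    store(100, load(100) + 1)

mem = {0: FREE, 100: 5}
print(sle_execute(mem, 0, increment), mem[100])
print(sle_execute(mem, 0, increment, conflicting_addrs={100}), mem[100])
```

In both runs the critical section takes effect exactly once; only the path taken differs.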
How to elide locks? - Example [figure: example code sequence with the Lock Acquire and Lock Release marked]
Why does SLE work correctly?
- Two predictions
  - Predict lock release store
    - Resolved by monitoring memory location of the stores
  - Predict memory operation atomicity
    - Resolved by checking atomicity conditions using cache coherence mechanisms
How to implement SLE? – Four Aspects
- Initiating speculation
  - Filter, index, confidence metric
- Buffering speculative state
  - Speculative register state
    - ROB
    - Register checkpointing
  - Speculative memory state
    - Augmented write-buffers
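One way to picture the "filter, index, confidence metric" for initiating speculation is a small PC-indexed table of saturating counters. The sketch below is an assumption-laden illustration: the counter range, threshold, optimistic initial value, update policy, and the `is_candidate_acquire` flag standing in for the filter are all hypothetical. The default of 32 entries matches the predictor size used in the evaluation.

```python
class LockPredictor:
    """PC-indexed table of saturating confidence counters (illustrative only)."""

    def __init__(self, entries=32, max_conf=3, threshold=2):
        self.entries = entries
        self.max_conf = max_conf
        self.threshold = threshold
        # Assumption: start optimistic so a new lock site is tried speculatively.
        self.table = [max_conf] * entries

    def _index(self, pc):
        # Index: drop the low bits (assumes 4-byte instructions), then wrap.
        return (pc >> 2) % self.entries

    def should_elide(self, pc, is_candidate_acquire):
        # Filter: only instructions flagged as candidate lock acquires qualify.
        if not is_candidate_acquire:
            return False
        # Confidence metric: speculate only when the counter is high enough.
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, success):
        """Raise confidence after a committed elision, lower it after a misspeculation."""
        i = self._index(pc)
        if success:
            self.table[i] = min(self.max_conf, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

With these defaults, two consecutive misspeculations at one lock site turn elision off for that site until a later success raises the counter again.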
How to implement SLE? – Four Aspects
- Misspeculation conditions and detection
  - Atomicity violations
    - ROB
    - Register checkpointing with access bit
  - Violations due to limited resources
    - Finite cache/write-buffer/ROB size
    - Uncached accesses or events
- Committing speculative memory state
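Buffering speculative stores, aborting when the finite buffer fills, and committing at the end of the section can be modeled in software as follows. This is a rough sketch: the class, its capacity, and its method names are hypothetical, and the `commit` here is atomic only in the sense that nothing else runs in this single-threaded model.

```python
class BufferOverflow(Exception):
    """Speculation must abort: the finite write buffer is full."""

class SpeculativeWriteBuffer:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}  # address -> latest speculative value

    def store(self, addr, value):
        # Repeated stores to one address merge into the same entry,
        # so only a store to a new address can overflow the buffer.
        if addr not in self.entries and len(self.entries) >= self.capacity:
            raise BufferOverflow(addr)
        self.entries[addr] = value

    def load(self, addr, memory):
        # Speculative loads must see the section's own buffered stores first.
        return self.entries.get(addr, memory.get(addr, 0))

    def commit(self, memory):
        # Make all buffered stores visible together, then empty the buffer.
        memory.update(self.entries)
        self.entries.clear()

    def discard(self):
        # Misspeculation: drop the buffered stores; memory is untouched.
        self.entries.clear()
```

A caller would treat `BufferOverflow` like any other misspeculation: call `discard()` and re-execute the section with the lock explicitly acquired.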
How to implement SLE?
How is performance improved? - Evaluation methodology
- Three multiprocessor systems
  - CMP/SMP/DSM
- A simple microbenchmark and six applications
- Single register checkpoint
- 32-entry lock predictor indexed by PC
How is performance improved? - Microbenchmark result
How is performance improved? - Application result
Some thoughts
- The idea is good
  - Removes unnecessary serialization
  - Makes programmers’ work easier
- No results for using the ROB
- Why pad the benchmarks to reduce false sharing?
  - Shouldn’t this be done by SLE?
- Slides: 19