On-the-Fly Garbage Collection Using Sliding Views, Erez Petrank

On-the-Fly Garbage Collection Using Sliding Views
Erez Petrank
Technion – Israel Institute of Technology
Joint work with Yossi Levanoni, Hezi Azatchi, and Harel Paz
Erez Petrank, GC via Sliding Views

Garbage Collection
• The user allocates space dynamically; the garbage collector automatically frees the space when it is “no longer needed”.
• Usually “no longer needed” = unreachable by a path of pointers from the program's local references (roots).
• The programmer does not have to decide when to free an object. (No memory leaks, no dereferencing of freed objects.)
• Built into Java and C#.

Garbage Collection: Two Classic Approaches
• Tracing [McCarthy 1960]: trace reachable objects, reclaim objects not traced. Traditional wisdom: good.
• Reference counting [Collins 1960]: keep a reference count for each object, reclaim objects with count 0. Traditional wisdom: problematic.

What (Was) Bad about RC?
• Does not reclaim cycles.
• A heavy overhead on pointer modifications.
• Traditional belief: “Cannot be used efficiently with parallel processing”.
[Figure: objects A and B.]

What’s Good about RC?
• Reference counting work is proportional to the work on creations and modifications.
    • Can tracing deal with tomorrow’s huge heaps?
• Reference counting has good locality.
• The challenge:
    • RC overhead on pointer modification seems too expensive.
    • RC seems impossible to “parallelize”.

Garbage Collection
• Today’s advanced environments: multiprocessors + large memories.
• Dealing with multiprocessors: single-threaded stop-the-world.

Garbage Collection
• Today’s advanced environments: multiprocessors + large memories.
• Dealing with multiprocessors: parallel collection, concurrent collection.

Terminology (stop the world, parallel, concurrent, …)
[Figure: timelines of program and GC activity for four collector styles: Stop-the-World, Parallel (STW), Concurrent, On-the-Fly.]

Benefits & Costs
[Figure: timelines of program and GC activity for Stop-the-World, Parallel (STW), Concurrent, and On-the-Fly collection, annotated with informal pause times (200 ms, 20 ms, 2 ms) and a throughput loss of 10-20%.]

This Talk
• Introduction: RC and tracing, coping with SMPs.
• RC introduction and the parallelization problem.
• Main focus: a novel concurrent reference counting algorithm (suitable for Java).
• Concurrent made on-the-fly, based on “sliding views”.
• Extensions: cycle collection, mark and sweep, generations, age-oriented.
• Implementation and measurements on Jikes.
    • Extremely short pauses, good throughput.

Basic Reference Counting
• Each object has an rc field; new objects get o.rc := 1.
• When a pointer p that points to o1 is modified to point to o2, execute: o2.rc++, o1.rc--.
• If then o1.rc == 0:
    • Delete o1.
    • Decrement o.rc for all children o of o1.
    • Recursively delete objects whose rc is decremented to 0.
[Figure: pointer p redirected from o1 to o2.]
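
The update rule and the recursive deletion on this slide can be sketched in C. This is a minimal single-threaded illustration under stated assumptions, not the talk's implementation: the names `Object`, `rc_new`, `rc_release`, `rc_update`, and `rc_demo`, the fixed `MAX_CHILDREN` fan-out, and the `freed_count` counter are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CHILDREN 2   /* hypothetical fixed fan-out, for brevity */

typedef struct Object {
    int rc;                               /* reference count */
    struct Object *child[MAX_CHILDREN];   /* outgoing pointers */
} Object;

static int freed_count = 0;   /* instrumentation only: objects reclaimed */

/* New objects start with rc = 1, as on the slide. */
Object *rc_new(void) {
    Object *o = calloc(1, sizeof *o);
    assert(o != NULL);
    o->rc = 1;
    return o;
}

/* Drop one reference to o; once the count reaches 0, release the
 * children recursively and free the object. */
void rc_release(Object *o) {
    if (o == NULL || --o->rc > 0)
        return;
    for (int i = 0; i < MAX_CHILDREN; i++)
        rc_release(o->child[i]);
    free(o);
    freed_count++;
}

/* Write barrier for "*slot = o2": o2.rc++, then o1.rc--. */
void rc_update(Object **slot, Object *o2) {
    Object *o1 = *slot;
    if (o2 != NULL)
        o2->rc++;          /* increment first so that o1 == o2 is safe */
    *slot = o2;
    rc_release(o1);
}

/* Builds root -> a -> b, clears the root, and returns how many
 * objects were reclaimed as a result. */
int rc_demo(void) {
    Object *root = NULL;
    Object *a = rc_new();            /* a.rc = 1 (the local variable a) */
    Object *b = rc_new();            /* b.rc = 1 (the local variable b) */
    rc_update(&root, a);             /* a.rc = 2 */
    rc_update(&a->child[0], b);      /* b.rc = 2 */
    rc_release(a);                   /* local a dropped: a.rc = 1 */
    rc_release(b);                   /* local b dropped: b.rc = 1 */
    freed_count = 0;
    rc_update(&root, NULL);          /* a.rc = 0, then b.rc = 0 */
    return freed_count;
}
```

Calling `rc_demo()` clears the only remaining pointer to `a`, which cascades to `b`. Note that every pointer store pays for two counter updates, which is the overhead the following slides set out to remove.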

An Important Term:
• A write barrier is a piece of code executed with each pointer update.
• “p ← o2” implies:
    Read p; (see o1)
    p ← o2;
    o2.rc++; o1.rc--;
[Figure: pointer p redirected from o1 to o2.]

Deferred Reference Counting
• Problem: the overhead on updating program variables (locals) is too high.
• Solution [Deutsch & Bobrow 76]:
    • Don’t update rc for local variables (roots).
    • “Once in a while”: collect all objects with o.rc = 0 that are not referenced from local variables.
• Deferred RC reduces overhead by 80%. Used in most modern RC systems.
• Still, the “heap” write barrier is too costly.
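
A minimal sketch of the deferred scheme, assuming a tiny fixed-size heap that the collector scans in full (real systems typically keep a table of zero-count candidates instead). All names here (`Obj`, `heap_write`, `local_write`, `deferred_collect`, `deferred_demo`, `HEAP_SIZE`) are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP_SIZE 4   /* hypothetical tiny heap, for illustration only */

typedef struct Obj {
    int rc;              /* counts references from heap slots only */
    bool live;           /* false once the object has been reclaimed */
    struct Obj *field;   /* single outgoing heap pointer */
} Obj;

static Obj heap[HEAP_SIZE];

/* Heap write: pays the reference-counting write barrier. */
void heap_write(Obj **slot, Obj *target) {
    if (target != NULL) target->rc++;
    if (*slot != NULL) (*slot)->rc--;
    *slot = target;
}

/* Local (root) write: deferred RC does no counting here. */
void local_write(Obj **local, Obj *target) {
    *local = target;
}

static bool in_roots(const Obj *o, Obj *const roots[], size_t nroots) {
    for (size_t i = 0; i < nroots; i++)
        if (roots[i] == o) return true;
    return false;
}

/* "Once in a while": reclaim every object with rc == 0 that no root
 * references. Returns the number of objects reclaimed. */
int deferred_collect(Obj *const roots[], size_t nroots) {
    int reclaimed = 0;
    bool progress = true;
    while (progress) {               /* repeat until no new zero counts */
        progress = false;
        for (size_t i = 0; i < HEAP_SIZE; i++) {
            Obj *o = &heap[i];
            if (!o->live || o->rc != 0 || in_roots(o, roots, nroots))
                continue;
            if (o->field != NULL) o->field->rc--;   /* release the child */
            o->field = NULL;
            o->live = false;
            reclaimed++;
            progress = true;
        }
    }
    return reclaimed;
}

void deferred_demo(int *first, int *second) {
    assert(first != NULL && second != NULL);
    for (size_t i = 0; i < HEAP_SIZE; i++)
        heap[i] = (Obj){ 0, true, NULL };
    Obj *x = NULL;
    local_write(&x, &heap[0]);               /* heap[0].rc stays 0 */
    heap_write(&heap[0].field, &heap[1]);    /* heap[1].rc = 1 */
    Obj *roots[] = { x };
    *first = deferred_collect(roots, 1);     /* heap[2] and heap[3] go */
    local_write(&x, NULL);
    *second = deferred_collect(NULL, 0);     /* heap[0], then heap[1] */
}
```

In the demo, `heap[0]` survives the first collection with a count of 0 because a root still references it; it is reclaimed only after the local is cleared.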

Multithreaded RC?
Traditional wisdom: the write barrier must be synchronized!

Multithreaded RC?
• Problem 1: reference-count updates must be atomic.
• Fortunately, this can be easily solved: each thread logs the required updates in a local buffer, and the collector applies all the updates during GC (as a single thread).

Multithreaded RC?
• Problem 1: reference-count updates must be atomic.
• Problem 2: parallel updates confuse the counters:
    Thread 1: Read A.next; (see B)   A.next ← C;   B.rc--; C.rc++
    Thread 2: Read A.next; (see B)   A.next ← D;   B.rc--; D.rc++
[Figure: object A with its next pointer, and objects B, C, and D.]
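
Problem 2 can be reproduced deterministically by replaying one bad interleaving in a single thread. The sketch below is only an illustration: the `Node` and `Counts` types, the function `run_race_demo`, and the initial counts (each of B, C, and D starts with one reference) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node { int rc; struct Node *next; } Node;
typedef struct { int b_rc, c_rc, d_rc; } Counts;

/* Replays the interleaving from the slide: both threads read A.next
 * before either of them writes it. */
Counts run_race_demo(void) {
    Node b = { 1, NULL };   /* referenced by A.next */
    Node c = { 1, NULL };   /* assumed: one reference held elsewhere */
    Node d = { 1, NULL };   /* assumed: one reference held elsewhere */
    Node a = { 1, &b };

    Node *seen1 = a.next;   /* Thread 1: Read A.next (sees B) */
    Node *seen2 = a.next;   /* Thread 2: Read A.next (sees B) */
    assert(seen1 == &b && seen2 == &b);

    a.next = &c;            /* Thread 1: A.next <- C */
    a.next = &d;            /* Thread 2: A.next <- D (overwrites C) */

    seen1->rc--; c.rc++;    /* Thread 1: B.rc--, C.rc++ */
    seen2->rc--; d.rc++;    /* Thread 2: B.rc--, D.rc++ */

    return (Counts){ b.rc, c.rc, d.rc };
}
```

After the replay, B has been decremented twice and ends at -1 instead of 0, and C ends at 2 instead of 1 even though A never ends up pointing to it. Only D's count is correct.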

Known Multithreaded RC
• [DeTreville 1990, Bacon et al. 2001]:
    • Compare & swap for each pointer modification.
    • Each thread records its updates in a buffer.

To Summarize Problems…
• Write barrier overhead is high.
    • Even with deferred RC.
• Using RC with multithreading seems to bear a high synchronization cost.
    • A lock or “compare & swap” with each pointer update.

Reducing RC Overhead:
• We start by looking at the “parent’s point of view”.
• We are counting rc for the child, but rc changes when a parent’s pointer is modified.
[Figure: a parent object pointing to a child object.]

An Observation
• Consider a pointer p that takes the following values between GCs: O0, O1, O2, …, On.
• All RC algorithms perform 2n operations: O0.rc--; O1.rc++; O1.rc--; O2.rc++; O2.rc--; … ; On.rc++;
• But only two operations are needed: O0.rc-- and On.rc++.
[Figure: pointer p moving across objects O0, O1, O2, O3, O4, …, On.]

Use of Observation
Between two garbage collections (time runs downward):
    p ← O1; (record p’s previous value O0)
    p ← O2; (do nothing)
    …
    p ← On; (do nothing)
Garbage collection: for each modified slot p:
• Read p to get On, read the records to get O0.
• O0.rc--, On.rc++
Only the first modification of each pointer is logged.
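
A minimal sketch of this coalescing, assuming one dirty flag and one logged value per slot (a later slide notes that the actual algorithm logs whole objects). The names `Slot`, `slot_write`, `slot_collect`, and `coalescing_demo` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Obj { int rc; } Obj;

typedef struct Slot {
    Obj *value;        /* current contents of the pointer p */
    bool dirty;        /* set by the first modification since the last GC */
    Obj *logged_old;   /* value of p at the last GC, valid when dirty */
} Slot;

/* Write barrier: only the first modification is logged. */
void slot_write(Slot *p, Obj *new_value) {
    if (!p->dirty) {
        p->logged_old = p->value;   /* record O0 */
        p->dirty = true;
    }
    p->value = new_value;           /* later stores "do nothing" extra */
}

/* At GC time: O0.rc-- and On.rc++ for a modified slot.
 * Returns the number of counter operations performed. */
int slot_collect(Slot *p) {
    int ops = 0;
    if (!p->dirty)
        return 0;
    if (p->logged_old != NULL) { p->logged_old->rc--; ops++; }
    if (p->value != NULL)      { p->value->rc++;      ops++; }
    p->dirty = false;
    return ops;
}

/* p takes the values O0..O4 between two collections. */
int coalescing_demo(void) {
    Obj o[5] = { {1}, {0}, {0}, {0}, {0} };   /* O0 is referenced by p */
    Slot p = { &o[0], false, NULL };
    for (int i = 1; i < 5; i++)
        slot_write(&p, &o[i]);                /* four pointer stores */
    int ops = slot_collect(&p);
    assert(o[0].rc == 0 && o[4].rc == 1);     /* O0 released, On counted */
    assert(o[1].rc == 0 && o[2].rc == 0 && o[3].rc == 0);
    return ops;
}
```

Four stores would cost eight counter updates under the basic scheme; here the collector performs two, and the intermediate objects O1 to O3 are never touched.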

Some Technical Remarks
• When a pointer is first modified, it is marked “dirty” and its previous value is logged.
• We actually log each object that gets modified (and not just a single pointer).
    • Reason 1: we don’t want a dirty bit per pointer.
    • Reason 2: an object’s pointers tend to be modified together.
• Only non-null pointer fields are logged.
• New objects are “born dirty”.

Effects of Optimization
• RC work significantly reduced:
    • The number of logging & counter updates is reduced by a factor of 100-1000 for typical Java benchmarks!

Elimination of RC Updates
Benchmark    No. of stores    No. of “first” stores    Ratio of “first” stores
jbb          71,011,357       264,115                  1/269
Compress     64,905           51                       1/1273
Db           33,124,780       30,696                   1/1079
Jack         135,174,775      1,546                    1/87435
Javac        22,042,028       535,296                  1/41
Jess         26,258,107       27,333                   1/961
Mpegaudio    5,517,795        51                       1/108192

Effects of Optimization
• RC work significantly reduced:
    • The number of logging & counter updates is reduced by a factor of 100-1000 for typical Java benchmarks!
• Write barrier overhead dramatically reduced.
    • The vast majority of the write barriers run a single “if”.
• Last but not least: the task has changed! We need to record the first update.

Reducing Synch. Overhead
Our second contribution:
• A carefully designed write barrier (together with an observation) does not require any synchronization operation.

The write barrier
    void Update(Object **slot, Object *new) {
        Object *old = *slot;        /* read the current value first */
        if (!IsDirty(slot)) {       /* first update since the last collection */
            log(slot, old);         /* record the slot and its previous value */
            SetDirty(slot);
        }
        *slot = new;                /* perform the actual store */
    }
Observation: if two threads
    1. invoke the write barrier in parallel, and
    2. both log an old value,
then both record the same old value.

Running Write Barrier Concurrently
Thread 1:
    void Update(Object **slot, Object *new) {
        Object *old = *slot;
        if (!IsDirty(slot)) {
            /* If we got here, Thread 2 has not yet set the dirty bit, */
            /* and thus has not yet modified the slot. */
            log(slot, old);
            SetDirty(slot);
        }
        *slot = new;
    }
Thread 2:
    void Update(Object **slot, Object *new) {
        Object *old = *slot;
        if (!IsDirty(slot)) {
            /* If we got here, Thread 1 has not yet set the dirty bit, */
            /* and thus has not yet modified the slot. */
            log(slot, old);
            SetDirty(slot);
        }
        *slot = new;
    }

Concurrent Algorithm:
• Use the write barrier with program threads.
• To collect:
    • Stop all threads.
    • Scan roots (local variables).
    • Get the buffers with modified slots.
    • Clear all dirty bits.
    • Resume threads.
    • For each modified slot:
        • decrement rc for the old value (written in the buffer),
        • increment rc for the current value (“read heap”).
    • Reclaim non-local objects with rc 0.

Timeline
• Between “Stop threads” and “Resume threads”: scan roots; get buffers; erase dirty bits.
• After “Resume threads”: decrement values in read buffers; increment “current” values; collect dead objects.

Timeline
• Between “Stop threads” and “Resume threads”: scan roots; get buffers; erase dirty bits.
• After “Resume threads”: decrement values in read buffers; increment “current” values; collect dead objects.
• Unmodified current values are in the heap. Modified ones are in the new buffers.
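
The collection cycle and the rule about “current” values can be sketched as a single-threaded simulation. This is an illustration under several assumptions, not the talk's implementation: slots are standalone (objects have no fields, so reclamation is not recursive), root scanning is modeled by an `is_root` flag that the caller sets, and all names (`Slot`, `LogBuffer`, `update`, `collect_begin`, `collect_finish`, `cycle_demo`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOG 8   /* hypothetical buffer capacity */

typedef struct Obj { int rc; bool is_root; bool freed; } Obj;
typedef struct Slot { Obj *value; bool dirty; } Slot;
typedef struct { Slot *slot; Obj *old; } LogEntry;
typedef struct { LogEntry e[MAX_LOG]; size_t n; } LogBuffer;

static LogBuffer current_log;   /* buffer the write barrier appends to */

/* Write barrier: log the first update of a slot, then store. */
void update(Slot *slot, Obj *new_value) {
    if (!slot->dirty) {
        assert(current_log.n < MAX_LOG);
        current_log.e[current_log.n++] = (LogEntry){ slot, slot->value };
        slot->dirty = true;
    }
    slot->value = new_value;
}

/* While "threads are stopped": take the buffer, clear the dirty bits. */
LogBuffer collect_begin(void) {
    LogBuffer taken = current_log;
    for (size_t i = 0; i < taken.n; i++)
        taken.e[i].slot->dirty = false;
    current_log.n = 0;
    return taken;
}

/* Value the slot held when its dirty bit was cleared: the heap value
 * if unmodified since then, otherwise the value in the new buffer. */
static Obj *view_value(const Slot *slot) {
    if (slot->dirty) {
        for (size_t i = 0; i < current_log.n; i++)
            if (current_log.e[i].slot == slot)
                return current_log.e[i].old;
    }
    return slot->value;
}

/* After "resume": apply the counter updates, then reclaim non-root
 * objects whose count is 0. Returns the number reclaimed. */
int collect_finish(const LogBuffer *taken, Obj *objs, size_t nobjs) {
    int reclaimed = 0;
    for (size_t i = 0; i < taken->n; i++) {
        Obj *old = taken->e[i].old;
        Obj *cur = view_value(taken->e[i].slot);
        if (old != NULL) old->rc--;
        if (cur != NULL) cur->rc++;
    }
    for (size_t j = 0; j < nobjs; j++) {
        if (!objs[j].freed && objs[j].rc == 0 && !objs[j].is_root) {
            objs[j].freed = true;
            reclaimed++;
        }
    }
    return reclaimed;
}

void cycle_demo(int *first, int *second) {
    /* A is referenced by slot s; C is held in a local variable (root). */
    Obj objs[3] = { {1, false, false}, {0, false, false}, {0, true, false} };
    Slot s = { &objs[0], false };
    current_log.n = 0;
    update(&s, &objs[1]);                      /* s: A -> B, logs (s, A) */
    LogBuffer taken = collect_begin();
    update(&s, &objs[2]);                      /* after resume: logs (s, B) */
    *first = collect_finish(&taken, objs, 3);  /* A.rc--, B.rc++ */
    assert(objs[0].freed && objs[1].rc == 1 && !objs[2].freed);
    objs[2].is_root = false;                   /* the local is dropped */
    taken = collect_begin();
    *second = collect_finish(&taken, objs, 3); /* B.rc--, C.rc++ */
    assert(objs[1].freed && objs[2].rc == 1);
}
```

In the first cycle the slot has already been overwritten with C by the time the collector processes it, so the increment goes to B, the value found in the new buffer. C is accounted for one cycle later and survives in the meantime only because a root references it.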

Concurrent Algorithm:
• Use the write barrier with program threads.
• To collect:
    • Stop all threads. (Goal 2: stop one thread at a time.)
    • Scan roots (local variables).
    • Get the buffers with modified slots.
    • Clear all dirty bits. (Goal 1: clear dirty bits during program run.)
    • Resume threads.
    • For each modified slot:
        • decrease rc for the old value (written in the buffer),
        • increase rc for the current value (“read heap”).
    • Reclaim non-local objects with rc 0.

The Sliding Views “Framework”
• Develop a concurrent algorithm.
    • There is a short time in which all the threads are stopped simultaneously to perform some task.
• Avoid stopping the threads together. Instead, stop one thread at a time.
• Tricky part: “fix” the problems created by this modification.
• Idea borrowed from the distributed computing community [Lamport].

Graphically
[Figure: two panels plotting heap address against time. “A Snapshot” is taken at a single time t; “A Sliding View” spans the interval from t1 to t2.]

Fixing Correctness
• The way to do this in our algorithm is to use snooping:
    • While collecting the roots, record objects that get a new pointer.
    • Do not reclaim these objects.
• No details…
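
The slide gives no details, so the following is only an illustration of the idea under assumptions: a single global `snoop_enabled` flag, a `snooped` mark stored in the object, and a scenario in which the object's count is still 0 in this cycle because the store falls after the slot was viewed. All names (`update_snooping`, `may_reclaim`, `snoop_demo`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Obj { int rc; bool snooped; } Obj;

static bool snoop_enabled = false;   /* on while roots are being collected */

/* Write barrier with snooping (first-update logging omitted here):
 * while roots are collected, remember every object that gets a new
 * pointer. */
void update_snooping(Obj **slot, Obj *new_value) {
    *slot = new_value;
    if (snoop_enabled && new_value != NULL)
        new_value->snooped = true;
}

/* An object may be reclaimed only if its count is 0, no root was seen
 * pointing to it, and it was not snooped. */
bool may_reclaim(const Obj *o, bool seen_in_roots) {
    return o->rc == 0 && !seen_in_roots && !o->snooped;
}

/* Thread 2 moves its only reference to x from a local variable into
 * the heap after thread 1 was scanned and before thread 2 is scanned.
 * Returns true if the collector would reclaim x. */
bool snoop_demo(bool use_snooping) {
    Obj x = { 0, false };   /* assumed: rc not yet updated in this cycle */
    Obj *heap_slot = NULL;
    Obj *local_t2 = &x;
    bool seen_in_roots = false;

    snoop_enabled = use_snooping;            /* root collection starts */
    /* Collector scans thread 1: it holds no reference to x. */
    update_snooping(&heap_slot, local_t2);   /* thread 2 publishes x */
    local_t2 = NULL;                         /* and drops its local */
    seen_in_roots = (local_t2 == &x);        /* scan of thread 2 misses x */
    snoop_enabled = false;                   /* root collection ends */

    assert(heap_slot == &x);                 /* x is reachable from the heap */
    return may_reclaim(&x, seen_in_roots);
}
```

Without snooping, `snoop_demo(false)` reports that the reachable object would be reclaimed, which is the kind of error that stopping one thread at a time can introduce. With snooping enabled the object is protected for this cycle.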

Cycles Collection
• Our initial solution: use a tracing algorithm infrequently.
• More about this tracing collector and about cycle collectors later…

Performance Measurements
• Implementation for Java on the Jikes Research JVM.
• Compared collectors:
    • Jikes parallel stop-the-world (STW)
    • Jikes concurrent RC (Jikes concurrent)
• Benchmarks:
    • SPECjbb2000: a server benchmark that simulates business-like transactions.
    • SPECjvm98: a client benchmark, a suite of mostly single-threaded benchmarks.

Pause Times vs. STW

Pause Times vs. Jikes Concurrent

SPECjbb2000 Throughput

SPECjvm98 Throughput

SPECjbb2000 Throughput

A Glimpse into Subsequent Work: SPECjbb2000 Throughput

Subsequent Work
• Cycle Collection [CC’05]
• A Mark and Sweep Collector [OOPSLA’03]
• A Generational Collector [CC’03]
• An Age-Oriented Collector [CC’05]

Related Work
• It’s not clear where to start…
• RC, concurrent, generational, etc…
• Some more relevant work was mentioned.

Conclusions
• A study of concurrent garbage collection with a focus on RC.
• Novel techniques obtaining short pauses and high efficiency.
• The best approach: age-oriented collection with concurrent RC for old and concurrent tracing for young.
• Implementation and measurements on Jikes demonstrate non-obtrusiveness and high efficiency.

Project Building Blocks
• A novel reference counting algorithm.
• State-of-the-art cycle collection.
• Generational: RC (for old) and tracing (for young).
• A concurrent tracing collector.
• An age-oriented collector: fitting generations with concurrent collectors.