
Incorporating Generations into a Modern Reference Counting Garbage Collector
Hezi Azatchi
Advisor: Erez Petrank

Outline
• Background
  - Garbage collection
    - Reference counting
    - Mark&Sweep
  - Improving tracing using generations
• On-the-fly sliding-view garbage collectors
• Our generational on-the-fly algorithms
• Results
• Summary

Background: Reference Counting
The Reference Counting Algorithm [Collins 1960]
• Each object has an RC field.
• New objects get o.RC := 1.
• When a pointer p that points to o1 is modified to point to o2, we do:
  - o1.RC--, o2.RC++.
  - If o1.RC == 0:
    - Delete o1.
    - Decrement o.RC for all children o of o1.
    - Recursively delete objects whose RC is decremented to 0.
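
The steps above can be sketched in a few lines of Python. This is only an illustration, not the talk's implementation: the `Obj` class, the `freed` list, and the function names are hypothetical, and an object here starts with `rc = 0` and reaches 1 on its first store, which stands in for the slide's "new objects get RC := 1".

```python
class Obj:
    """Heap object with a reference count and named pointer fields."""
    def __init__(self, name):
        self.name = name
        self.rc = 0
        self.fields = {}

freed = []  # names of reclaimed objects, kept only for illustration

def release(o):
    """Decrement o.rc; at zero, reclaim o and recursively release its children."""
    o.rc -= 1
    if o.rc == 0:
        freed.append(o.name)
        for child in o.fields.values():
            if child is not None:
                release(child)
        o.fields.clear()

def update(holder, field, new):
    """Pointer store holder.field := new with eager RC maintenance."""
    old = holder.fields.get(field)
    if new is not None:
        new.rc += 1  # increment first, so storing the same object is safe
    holder.fields[field] = new
    if old is not None:
        release(old)

root = Obj("root")
root.rc = 1  # assumed to be pinned by a global root
o1, o2, o3 = Obj("o1"), Obj("o2"), Obj("o3")
update(root, "p", o1)    # p points to o1
update(o1, "next", o3)   # o1 points to o3
update(root, "p", o2)    # p now points to o2: o1 dies, then o3
```

After the last store, `freed` holds `o1` followed by `o3`, which is the recursive deletion described on the slide.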

Background: Reference Counting
3 years later… [Harold McBeth 1963]
• The reference counting algorithm does not reclaim cycles!
• But:
  - It turns out that “normal” programs do not use too many cycles.
  - So other methods (such as mark and sweep) are run only occasionally to collect the cycles.

Background: Reference Counting
Deferred Reference Counting
• Problem: RC algorithms prescribe an action for each pointer operation.
• Solution [Deutsch & Bobrow, 1976]:
  - Don’t update RC for locals.
  - Put objects with RC=0 in a Zero-Count Table (ZCT).
  - “Once in a while”: collect all the objects in the ZCT with o.RC=0 that are not referenced from local roots.
• Deferred RC reduces overhead by 80%. Used in most modern RC systems.
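
A minimal sketch of deferred reference counting, assuming a single thread and a hypothetical `Obj` class. Only heap stores go through the barrier, stores to locals are never counted, and `collect` is handed the set of objects currently referenced from local roots.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.rc = 0        # counts heap references only, never locals
        self.fields = {}

zct = set()  # Zero-Count Table: objects whose heap RC is 0

def alloc(name):
    o = Obj(name)
    zct.add(o)  # a new object has no heap references yet
    return o

def heap_store(holder, field, new):
    """Barrier for heap slots only; stores to locals are not counted."""
    old = holder.fields.get(field)
    if new is not None:
        new.rc += 1
        zct.discard(new)
    holder.fields[field] = new
    if old is not None:
        old.rc -= 1
        if old.rc == 0:
            zct.add(old)

def collect(local_roots):
    """Reclaim ZCT entries with RC 0 that no local root references."""
    reclaimed = []
    pending = [o for o in zct if o not in local_roots]
    while pending:
        o = pending.pop()
        if o.rc != 0 or o in local_roots or o not in zct:
            continue
        zct.discard(o)
        reclaimed.append(o.name)
        for child in o.fields.values():
            if child is None:
                continue
            child.rc -= 1
            if child.rc == 0:
                zct.add(child)
                pending.append(child)
        o.fields.clear()
    return reclaimed

a, b, c = alloc("a"), alloc("b"), alloc("c")
heap_store(a, "next", b)   # b leaves the ZCT
reclaimed = collect({c})   # only c is still held by a local variable
```

Here `a` is unreachable from locals and has a zero count, so it is reclaimed together with `b`, while `c` stays in the ZCT until a later collection.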

Background: Mark and Sweep
The Mark-Sweep Algorithm [McCarthy 1960]
• Traverse and mark live objects, starting from the roots (globals, stacks).
• White (unmarked) objects can be reclaimed.
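
As a rough illustration of the two phases, the sketch below models the heap as a dictionary from object name to child names (a hypothetical encoding, not the talk's data structure). The example includes a cycle to show that tracing, unlike reference counting, reclaims it.

```python
def mark_sweep(heap, roots):
    """heap: dict name -> list of child names; roots: iterable of names.
    Returns (live, garbage) as sets of names."""
    marked = set()
    stack = list(roots)
    while stack:                      # mark phase
        name = stack.pop()
        if name in marked:
            continue
        marked.add(name)
        stack.extend(heap[name])
    garbage = set(heap) - marked      # sweep phase: everything left unmarked
    return marked, garbage

heap = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"]}  # c and d form a cycle
live, garbage = mark_sweep(heap, ["a"])
```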

Background: Generational GC
Generational Garbage Collection [Ungar, 1984]
• Weak generational hypothesis: “most objects die young”.
• Segregate objects by age into two or more regions of the heap called generations.
• Objects are first allocated in the youngest generation, but are promoted into an older generation if they survive long enough.
✓ Most pauses are short (for young generation GC).
✓ Collection effort is concentrated where there is garbage.
✓ Better locality.

Background: Generational GC, Inter-Generational Pointers
• Pointers from the old generation to the young generation must be part of the root set of the young generation.
[Figure: globals and stack pointing into the old and young generations]
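
The role of inter-generational pointers can be sketched as follows. The heap encoding and the names `minor_collect` and `remembered` are assumptions made for the example: a minor collection traces only young objects, so old objects that point into the young generation have to be supplied as extra roots.

```python
def minor_collect(heap, young, roots, remembered):
    """heap: name -> child names; young: set of young object names;
    roots: stack/global roots; remembered: old objects pointing to young ones.
    Returns the set of young objects that are garbage."""
    stack = [r for r in roots if r in young]
    for old in remembered:            # inter-generational pointers act as roots
        stack.extend(c for c in heap[old] if c in young)
    marked = set()
    while stack:
        n = stack.pop()
        if n in marked:
            continue
        marked.add(n)
        stack.extend(c for c in heap[n] if c in young)
    return young - marked

heap = {"old1": ["y1"], "y1": ["y2"], "y2": [], "y3": []}
young = {"y1", "y2", "y3"}
with_igps = minor_collect(heap, young, roots=[], remembered=["old1"])
without_igps = minor_collect(heap, young, roots=[], remembered=[])
```

Without the remembered old object, `y1` and `y2` would be reclaimed by mistake even though `old1` still references them.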


Background: Note Interesting Properties
• Mark&sweep is good with a low fraction of live objects, thus it “fits” the young generation, which has a low fraction of live objects.
• RC does not depend on the amount of live space, thus it “fits” the old generation, which does have a large amount of live space.
• Thus a combination of RC for the old generation and mark&sweep for the young may be good!
• This is exactly what we tried:
  - On a modern platform (SMP).
  - With advanced modern on-the-fly collectors.

Background: Terminology
[Figure labels: Mutators, Collector Threads]

On-the-Fly Sliding-View Algorithms
Levanoni & Petrank, OOPSLA 2001

Levanoni-Petrank Algorithms: Motivation for RC
• Reference counting work is proportional to the work on creations and modifications.
  - Can tracing deal with tomorrow’s huge heaps?
• Reference counting has good locality.
• The challenge:
  - RC write barriers seem too expensive.
  - RC seems impossible to “parallelize”.

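Levanoni-Petrank Algorithms: Motivation
Multithreaded RC?
• Problem 1: reference-count updates must be atomic.
• Problem 2: parallel updates confuse the counters:
  - Thread 1: read A.next (sees B); A.next ← C; B.RC--; C.RC++
  - Thread 2: read A.next (sees B); A.next ← D; B.RC--; D.RC++
  - B.RC is decremented twice although only one reference to B was removed, and both C.RC and D.RC are incremented although A.next ends up pointing to only one of them.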
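
The interleaving in Problem 2 can be replayed deterministically. The sketch below is single-threaded and uses plain dictionaries as a stand-in for the heap and the counters (an assumption made for the example); it executes the two threads' steps in the problematic order.

```python
rc = {"B": 1, "C": 0, "D": 0}
heap = {"A.next": "B"}

# Both threads read A.next before either one writes it.
t1_seen = heap["A.next"]   # thread 1 sees B
t2_seen = heap["A.next"]   # thread 2 sees B

# Thread 1 completes its update.
heap["A.next"] = "C"
rc[t1_seen] -= 1
rc["C"] += 1

# Thread 2 completes its update using its stale read.
heap["A.next"] = "D"
rc[t2_seen] -= 1
rc["D"] += 1
```

The counters end up as B = -1, C = 1, D = 1, although the only reference that exists is A.next pointing to D.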

Levanoni-Petrank Algorithms: Motivation
First Multithreaded RC [DeTreville]
• Lock the heap for each pointer modification.
• Each thread records its updates in a buffer.
• Once in a while (snapshot-like):
  - The GC thread processes all buffers to update the reference counts.
  - It reclaims all objects with RC 0 that are not local.

Levanoni-Petrank Algorithms: Motivation
To Summarize…
• Overhead on the write barrier is considered high.
  - Even with the deferred RC of Deutsch & Bobrow.
• Using reference counting concurrently with program threads seems to bear a high synchronization cost.
  - Lock or “compare & swap” for each pointer update.

Levanoni-Petrank Algorithms: Improving the Write-Barrier Overhead
• Consider a pointer p that takes the following values between GCs: O0, O1, O2, …, On.
• Out of the 2n operations
    O0.RC--; O1.RC++; O1.RC--; O2.RC++; O2.RC--; …; On.RC++
  only two are needed: O0.RC-- and On.RC++.
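
The cancellation can be checked mechanically. In this sketch the helper names are hypothetical: `eager_deltas` applies every decrement and increment in the sequence, `coalesced_deltas` applies only the first decrement and the last increment, and both give the same net change.

```python
from collections import Counter

def eager_deltas(values):
    """Net RC change when every update old -> new is counted."""
    d = Counter()
    for old, new in zip(values, values[1:]):
        d[old] -= 1
        d[new] += 1
    return {k: v for k, v in d.items() if v != 0}

def coalesced_deltas(values):
    """Net RC change using only the first and the last value of the slot."""
    d = Counter()
    d[values[0]] -= 1
    d[values[-1]] += 1
    return {k: v for k, v in d.items() if v != 0}

history = ["O0", "O1", "O2", "O3"]   # values of p between two collections
eager = eager_deltas(history)
coalesced = coalesced_deltas(history)
```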

The write barrier
  Procedure Update(p: Pointer, new: Object)
    prev := *p
    if !Dirty(p) then
      log <p, prev>      // into local log buffer
      Dirty(p) := True
    *p := new
Over time, between collections:
• p ← O1 (record p’s previous value O0)
• p ← O2 (do nothing)
• …
• p ← On (do nothing)
Collection time: for each modified slot p:
• Read p to get On, read the records to get O0.
• O0.RC--, On.RC++
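
A single-threaded sketch of this barrier and of the collection-time update. The `Slot` and `Mutator` classes and `apply_log` are hypothetical names, and the sketch ignores the concurrency issues the real algorithm has to handle; it only shows that n updates to one slot produce one log entry and two RC operations.

```python
class Slot:
    def __init__(self, value=None):
        self.value = value
        self.dirty = False

class Mutator:
    def __init__(self):
        self.log = []  # local buffer of (slot, previous value) pairs

    def update(self, slot, new):
        prev = slot.value
        if not slot.dirty:            # only the first store since the last GC logs
            self.log.append((slot, prev))
            slot.dirty = True
        slot.value = new

def apply_log(log, rc):
    """Collection time: one decrement and one increment per modified slot."""
    for slot, prev in log:
        if prev is not None:
            rc[prev] = rc.get(prev, 0) - 1
        if slot.value is not None:
            rc[slot.value] = rc.get(slot.value, 0) + 1
        slot.dirty = False
    log.clear()

rc = {"O0": 1}
p = Slot("O0")
m = Mutator()
for target in ("O1", "O2", "O3"):
    m.update(p, target)
logged_entries = len(m.log)   # a single entry, despite three stores
apply_log(m.log, rc)
```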

Levanoni-Petrank Algorithms: The “Snapshot” (Concurrent) RC Algorithm
• Use the write barrier with program threads.
• Take a snapshot:
  - Stop all threads
  - Scan roots (locals)
  - Get the buffers with modified slots
  - Clear all dirty bits
  - Resume threads
• Then run the collector:
  - For each modified slot:
    - decrease RC for the previous snapshot value (read buffer),
    - increase RC for the current snapshot value (“read heap”).
  - Reclaim non-local objects with RC 0.

Levanoni-Petrank Algorithms: The General Picture
[Figure: slots p1..p7 in the heap at collection k and in the heap at collection k+1, with modifications recorded in between]
Use the list of modifications to update the reference counts.

Levanoni-Petrank Algorithms: The “Snapshot” Tracing (Mark&Sweep) Collector
• Use the write barrier with program threads.
• Take a snapshot:
  - Stop all threads
  - Scan roots (locals)
  - Get the buffers with modified slots
  - Clear all dirty bits
  - Resume threads
• Then run the collector:
  - Mark via the current snapshot:
      for each reachable slot s
        if (!s.dirty) then “read heap” else “read buffer”
        recursively mark s’s value
  - Sweep all non-local objects that are not marked.
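
The "read heap or read buffer" rule can be sketched as below. The dictionaries `heap`, `logged` and `slots_of` and the set `dirty` are a hypothetical encoding: a slot that has been modified since the snapshot is dirty, and its snapshot-time value is the one the mutator logged.

```python
def snapshot_value(slot, heap, dirty, logged):
    """Value the slot held when the snapshot was taken."""
    return logged[slot] if slot in dirty else heap[slot]

def mark(roots, slots_of, heap, dirty, logged):
    """Trace the snapshot, not the current heap."""
    marked, stack = set(), list(roots)
    while stack:
        o = stack.pop()
        if o is None or o in marked:
            continue
        marked.add(o)
        for s in slots_of.get(o, ()):
            stack.append(snapshot_value(s, heap, dirty, logged))
    return marked

# At snapshot time A.next pointed to B; afterwards a mutator stored C there,
# logging the old value and setting the dirty bit.
slots_of = {"A": ["A.next"]}
heap = {"A.next": "C"}
dirty = {"A.next"}
logged = {"A.next": "B"}
marked = mark(["A"], slots_of, heap, dirty, logged)
```

The trace reaches `B`, the snapshot value, rather than `C`, which was stored only after the snapshot.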

Levanoni-Petrank Algorithms: Intermediate Concurrent Algorithm
Properties:
• Snapshot oriented, concurrent (not so bad…).
Pause time:
• Stop all threads.
• Clear all dirty bits.
• Mark roots of all threads.
Pause time goal:
• Stop one thread at a time to mark its own local roots!
• The goal: an on-the-fly algorithm with a low throughput cost.

Levanoni-Petrank Algorithms: Collecting On-the-Fly
What if we stop one thread at a time?
• Take a sliding view:
  - For each thread t:
    - Stop t
    - Scan roots (locals)
    - Get the buffers with modified slots
    - Resume t
  - Clear all dirty bits
• Then run the collector:
  - For each modified slot:
    - decrease RC for the previous snapshot value (read buffer),
    - increase RC for the current snapshot value (“read heap”).
  - Reclaim non-local objects with RC 0.
• Several problems to be solved…

Levanoni-Petrank Algorithms: The New Picture, Using Sliding Views
Read information from one thread at a time (while the other threads run): no snapshot.
[Figure: heap slots p1..p7 and the list of modifications, between the sliding view of the heap at collection k and the sliding view of the heap at collection k+1]

Levanoni-Petrank Algorithms: Danger in Sliding Views
Program does:
• P1 ← O
• P2 ← O
• P1 ← NULL
The sliding view reads P2 (NULL) before the store to P2, and reads P1 (NULL) after P1 is cleared.
Problem: the reachability of O is not noticed!
Solution: if a pointer to O has been stored during the sliding-view phase, do not reclaim O (and its descendants).

Levanoni-Petrank Algorithms: The Sliding Views Collector
• Take a sliding view:
  - Start snooping
  - For each thread t:
    - Stop t
    - Scan roots (locals)
    - Get the buffers with modified slots
    - Resume t
  - Stop snooping
  - Clear all dirty bits
• Then run the collector:
  - For each modified slot:
    - decrease RC for the previous snapshot value (read buffer),
    - increase RC for the current snapshot value (“read heap”).
  - Reclaim non-local objects with RC 0.
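
Snooping can be illustrated with a small single-threaded replay of the danger scenario (P1 ← O, P2 ← O, P1 ← NULL) read at two different moments. The `Collector` class, `barrier_store` and `reclaimable` are hypothetical names for this sketch, and the reference count is computed directly from the view rather than from log buffers.

```python
class Collector:
    def __init__(self):
        self.snooping = False
        self.snooped = set()

def barrier_store(collector, heap, slot, new):
    heap[slot] = new
    if collector.snooping and new is not None:
        collector.snooped.add(new)   # remember objects stored while the view is taken

def reclaimable(rc, local_roots, snooped):
    return {o for o, n in rc.items()
            if n == 0 and o not in local_roots and o not in snooped}

c = Collector()
heap = {"P1": None, "P2": None}
c.snooping = True
barrier_store(c, heap, "P1", "O")
view = {"P2": heap["P2"]}            # the view reads P2 too early: NULL
barrier_store(c, heap, "P2", "O")
barrier_store(c, heap, "P1", None)
view["P1"] = heap["P1"]              # the view reads P1 too late: NULL
c.snooping = False

rc = {"O": sum(v == "O" for v in view.values())}   # the view sees no reference to O
unsafe = reclaimable(rc, set(), set())
safe = reclaimable(rc, set(), c.snooped)
```

Without snooping, `O` would be reclaimed although P2 still points to it; with snooping it is kept.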

Levanoni-Petrank Algorithms: Implementation for Java
• Based on Sun’s JDK 1.2.2 for Windows NT.
• Main features:
  - 2-bit RC field per object (à la [Wise et al.])
  - A custom allocator for on-the-fly RC
• Benchmarks:
  - Server benchmarks:
    - SPECjbb2000: simulates business-like transactions in a large firm
    - MTRT: a multi-threaded ray tracer
  - Client benchmarks:
    - SPECjvm98: a suite of mostly single-threaded client benchmarks

Levanoni-Petrank Algorithms: Improved RC
How many RC updates are eliminated?

Benchmark    No. of stores    No. of “first” stores    Ratio of “first” stores
jbb             71,011,357                  264,115    1/269
Compress            64,905                       51    1/1273
Db              33,124,780                   30,696    1/1079
Jack           135,174,775                    1,546    1/87435
Javac           22,042,028                  535,296    1/41
Jess            26,258,107                   27,333    1/961
mpegaudio        5,517,795                       51    1/108192

Levanoni-Petrank Algorithms: SPECjbb max pause time (figure)

Levanoni-Petrank Algorithms: SPECjbb throughput (figure)

Levanoni-Petrank Algorithms: MTRT throughput (figure)

This Work: Sliding Views Algorithms with Generations

This Work: Generational Algorithms
Motivation
• Investigate how generations integrate with reference counting on a multiprocessor.
• The goal: get larger throughput.
  - Tracing work is proportional to the amount of live objects, and by the weak generational hypothesis “many objects die young”.
  - RC does not depend on the amount of live space. The old generation has a high fraction of live objects.
  - Algorithms match their generations.
  - Work is concentrated where the garbage is.
  - Better locality; the working set is smaller.
• Note: similar pauses expected.

This Work: Generational Algorithms
Design issues:
• Two generations.
• Two collection types: minor and full.
• Each object that has survived a collection is promoted.
  - Simplifies the implementation.
  - Lowers the overhead of handling inter-generational pointers.
• The heap is partitioned logically.
  - In an on-the-fly collector, object copying is very difficult if not impossible.
  - An object is promoted by marking it as old.

This Work: Generational Algorithms
Design issues:
• Promotion is done by the collector.
• Collection triggering:
  - A minor collection is triggered every X bytes of allocation.
  - A full collection is triggered when the heap occupancy grows to more than Y%.
• Two local buffers for each mutator:
  1. “young-objects” buffer: pointers to new objects.
  2. “old-objects” buffer.
• The young generation processed by this cycle:
  - All local “young-objects” buffers from the previous cycle.
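
The triggering rule can be written as a small decision function. The function name, the parameter names and the numbers in the example are hypothetical (the slides do not give X or Y), and checking the full-collection condition first is an assumption of this sketch.

```python
def choose_collection(bytes_since_last_gc, heap_used, heap_size, x_bytes, y_percent):
    """Return 'full', 'minor', or None according to the two triggers."""
    if heap_used * 100 > heap_size * y_percent:   # occupancy above Y%
        return "full"
    if bytes_since_last_gc >= x_bytes:            # X bytes allocated since last GC
        return "minor"
    return None

# Illustrative values only: X = 4 MB, Y = 75%, heap of 100 units.
minor = choose_collection(5_000_000, 30, 100, 4_000_000, 75)
full = choose_collection(5_000_000, 80, 100, 4_000_000, 75)
nothing = choose_collection(1_000, 30, 100, 4_000_000, 75)
```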

This Work: Generational Algorithms
Log modified objects instead of modified slots
• Update(A.p1, C)
• Update(A.p2, C)
• Update(A.p2, D)
[Figure: heap objects]
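
A sketch of object-level logging, assuming a per-object dirty flag and a hypothetical `Obj` class: the first store into an object logs the object together with a copy of all its slots, and later stores into any slot of the same object log nothing.

```python
class Obj:
    def __init__(self, name, **slots):
        self.name = name
        self.slots = dict(slots)
        self.dirty = False

def update(log, obj, slot, new):
    if not obj.dirty:
        # Log the whole object once, with its slot values before the first change.
        log.append((obj, dict(obj.slots)))
        obj.dirty = True
    obj.slots[slot] = new

log = []
A = Obj("A", p1="B", p2=None)
update(log, A, "p1", "C")
update(log, A, "p2", "C")
update(log, A, "p2", "D")
```

The three updates from the slide leave a single entry for `A` in the log, holding the old values of both `p1` and `p2`.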

This Work: Generational Algorithms
“young-objects” buffer and “old-objects” buffer roles
1. o1.next := new(256);
2. Update(*o1.next, o1);
3. Update(o1.next, o2);
[Figure: a mutator timeline across cycles K, K+1 and K+2, showing the new object, o1 and o2 in the heap and the entries placed in the “old-objects” buffer]

This Work: Generational Algorithms
Three On-the-Fly Generational Algorithms
• Reference counting for both collections.
• Reference counting for the young (minor) collection.
  - Tracing for the major collection.
• Reference counting for the major collection.
  - Tracing for the minor collection.
  - Expected to be the best.

This Work: Generational Algorithms
Agenda
• No time to present all algorithms.
  - Only the major-RC (the best) algorithm will be presented.
• Go over several interesting difficulties:
  - Issues for major RC collections:
    - Efficiently find the inter-generational pointers.
    - Prepare the buffers for the major reference counting.
  - Issues for minor RC collections:
    - Efficient promotion with minor RC.
    - Snoop selectively.
    - No need to accurately update all objects’ RCs.

This Work: Generational Algorithms
Reference Counting for the Major Collection Algorithm
• Expected to be the best.
• Uses mark and sweep for minor collections.
• Uses RC for the major collections.

The minor collection: mark&sweep
• Take a sliding view:
  - Start snooping
  - For each thread t:
    - Stop t
    - Scan roots (locals)
    - Get the buffers with modified slots
    - Resume t
  - Stop snooping
  - Clear all dirty bits
• Then run the collector:
  - Find the inter-generational pointers and add them to the root set.
  - Mark via the sliding view:
      for each reachable slot s
        if (!s.dirty) then “read heap” else “read buffer”
        recursively mark s’s value
  - Sweep non-local, unmarked objects; promote survivors.
  - Prepare buffers for the major collection.

The major collection: RC
• Take a sliding view:
  - Start snooping
  - For each thread t:
    - Stop t
    - Scan roots (locals)
    - Get the buffers with modified slots
    - Resume t
  - Stop snooping
  - Clear all dirty bits
• Then run the collector:
  - For each modified slot (those in the current sliding-view buffers or in the prepared major buffers):
    - decrease RC for the previous snapshot value (read buffer),
    - increase RC for the current snapshot value (“read heap”).
  - Reclaim non-local objects with RC 0; promote survivors.

This Work: Generational Algorithms
Issues for major RC collections
• Young generation:
  - How do we find inter-generational pointers (for the mark&sweep of the young generation) efficiently?
  - Provide the major RC collection with consistent buffers.

This Work: Generational Algorithms
Inter-generational pointers are “given for free”
• Observation: old objects that point to young objects must have been modified since the previous collection, because the young objects did not exist before.
• Thus: all inter-generational pointers must be logged in the “old-objects” local buffers.
• Does this get all inter-generational pointers?
  - We must note some race conditions due to the non-atomic sliding view.

The first race: “intra sliding-view update”
[Timeline figure with columns Collector, Mutator A, Mutator B, spanning cycles K-1, K and K+1]
• Collector: take a sliding view.
• Cooperate: Stop, Mark-Roots, Read-Buffers, Resume.
• p := new(16). That new object is logged to the young generation of the next cycle.
• Update(o.next, *p)
• Cooperate: Stop, Mark-Roots, Read-Buffers, Resume.
The “inter-generational pointer” is logged in the buffers of cycle K-1; these buffers won’t be available in this cycle!

The second race: “update before clear”
[Timeline figure with columns Collector, Mutator A, Mutator B, spanning cycles K-1, K and K+1]
• Collector: take a sliding view.
• Both mutators cooperate: Stop, Mark-Roots, Read-Buffers, Resume.
• p := new(16)
• Update(o.next, *p): reads o.next = x, reads o.Dirty = true.
• Collector: Clear-Dirty-Marks.
The inter-generational pointer was created and not logged to any buffer!
If we don’t traverse through the inter-generational pointer to the new object, it might be swept mistakenly in this cycle!

This Work: Generational Algorithms
Solution to both races:
• Record into “IGPs_buffer” all objects that are involved in an update to a young object during the following uncertainty period:
  - While taking the sliding view and until the end of clear-dirty-marks.
• The true inter-generational set is contained in the following set:
  - {union over all mutators’ old-objects buffers} ∪ {IGPs_buffer}
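
The containment can be sketched as a set union followed by a filter. The function names and the heap encoding are assumptions for the example; the point is that the union is a superset, so candidates that turn out to hold no young pointer simply contribute no roots.

```python
def igp_candidates(old_object_buffers, igps_buffer):
    """Union of all mutators' old-objects buffers and the IGPs buffer."""
    candidates = set(igps_buffer)
    for buf in old_object_buffers:
        candidates.update(buf)
    return candidates

def young_roots_from(candidates, heap, young):
    """Young objects referenced by any candidate; these join the minor root set."""
    return {c for o in candidates for c in heap[o] if c in young}

heap = {"o": ["y1"], "q": ["old2"], "s": ["y2"], "y1": [], "y2": [], "old2": []}
young = {"y1", "y2"}
candidates = igp_candidates([["o"], ["q"]], ["s"])   # q is a false positive
roots = young_roots_from(candidates, heap, young)
```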

This Work: Generational Algorithms
Full RC collection buffers preparation
• The mutators log objects to their local “young-objects” and “old-objects” buffers.
• The collector logs part of these logged objects to the “major-new-objects” buffer and to the “major-old-objects” buffer.

This Work: Generational Algorithms
Which objects to log to the major buffers?
• Only objects that will be alive in the next major collection.
• Use an OldDirty flag (for each logged object) to avoid logging the same object multiple times.
• Logging to the “major-new-objects” buffer:
  - Log only objects that were promoted (this is known at the young-generation sweep phase).
  - No object’s children are logged (because the object did not exist in the previous major cycle, thus its children did not reference any object).

This Work: Generational Algorithms
Logging to the “major-old-objects” buffer
• The parent objects that are logged into the young generation’s “old-objects” buffers are old. (Why?)
  - Thus they can be logged to the major buffer. (They will survive.)
  - Their children may be swept, thus log only children that were promoted (only after the sweep phase).

This Work: Generational Algorithms
Issues for RC minor collection
• Efficient promotion with minor RC collections.
• Reference counting for the young generation algorithm. Advantages:
  - The RC field need not be accurate.
  - Selectively snoop only young objects.

This Work: Generational Algorithms
Efficient promotion with minor RC collections
• Recall reclamation with deferred RC:
  - Go over the suspects (young generation).
  - If RC=0 and not local: reclaim (recursively).
  - Otherwise: need to promote?
• A problem:
  - We cannot promote objects immediately, because recursive reclamation may delete them later.
  - Thus, we may promote only at the end of the reclamation phase.
• Bad (simple) solution: traverse the young generation twice.

This Work: Generational Algorithms
Efficient promotion with minor RC collections
• Our solution:
  - Each object in the young generation whose RC>0 is marked as a “pendingPromotion” object, which is still treated as young.
  - Zero the “pendingPromotion” bitmap at the end, thus promoting all the surviving objects.
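
The single-pass idea can be sketched as follows. The function `minor_rc_sweep` and its data encoding are hypothetical, and leaving locally referenced objects with RC 0 in the young generation is an assumption of this sketch. An object marked pending early in the pass can still be reclaimed later, when recursive reclamation drops its count to zero.

```python
def minor_rc_sweep(young, rc, children, local_roots):
    """young: list of young object names; rc: name -> count;
    children: name -> child names. Returns (promoted, reclaimed)."""
    pending = set()     # stands in for the pendingPromotion bitmap
    reclaimed = set()

    def reclaim(o):
        reclaimed.add(o)
        pending.discard(o)   # it may have been marked pending earlier in the pass
        for c in children.get(o, ()):
            rc[c] -= 1
            if (rc[c] == 0 and c in young
                    and c not in local_roots and c not in reclaimed):
                reclaim(c)

    for o in young:          # a single pass over the young generation
        if o in reclaimed:
            continue
        if rc[o] == 0 and o not in local_roots:
            reclaim(o)
        elif rc[o] > 0:
            pending.add(o)   # still treated as young until the end
    promoted = set(pending)  # clearing the bitmap promotes all survivors at once
    return promoted, reclaimed

young = ["a", "b", "c"]
rc = {"a": 1, "b": 0, "c": 1}      # a is referenced only by b, which is garbage
children = {"b": ["a"]}
promoted, reclaimed = minor_rc_sweep(young, rc, children, set())
```

In the example, `a` is visited first and marked pending, then reclaimed when `b` is deleted, so only `c` is promoted.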

This Work: Generational Algorithms
Reference counting for the young generation
• No need to decrement previous objects’ RCs.
  - They were pointed to at the previous collection.
  - Each was either promoted or reclaimed; thus the young-generation collection should not fix their RC, and the full tracing collection will handle them.
• Further improvement: after the sliding view is taken, there is no need to log the object’s children.
• Should we snoop selectively on a minor collection?
  - Yes: the minor collection can reclaim only young objects.

Implementation
• The original and all three garbage collectors were implemented in Jikes, a research Java Virtual Machine and JIT compiler developed at the IBM T.J. Watson Research Center.
• The entire system, including the collector itself, is written in Java (extended with unsafe primitives to access raw memory).
• The reference collector: SIGPLAN-2001, “Java without the Coffee Breaks: A Nonintrusive Multiprocessor Garbage Collector” by David F. Bacon et al.
• Measurements were taken on a 4-way IBM Netfinity 8500R server with 550 MHz Intel Pentium III Xeon processors and 2 GBytes of physical memory.

Best algorithm, “RC for full”: SPECjvm98 on multiprocessor (figure)

Best algorithm, “RC for full”: SPECjvm98 on uniprocessor (figure)

Best algorithm, “RC for full”: SPECjbb2000 on multiprocessor (figure)

Best algorithm, “RC for full”: _227_mtrt on multiprocessor (figure)

Best algorithm, “RC for minor”: max pause time (figure)

Second algorithm, “RC for minor”: _227_mtrt on multiprocessor (figure)

Second algorithm, “RC for minor”: SPECjbb00 on multiprocessor (figure)

This Work: Summary
• We’ve presented (for the first time) an incorporation of generations into an on-the-fly reference counting algorithm.
• We have implemented three algorithms in Jikes, IBM’s research JVM with a JIT compiler.
• It turns out that incorporating generations (for all algorithms):
  - Improves efficiency.
  - Keeps the short pause times of the original.
• The algorithm that uses reference counting in the old generation did better than the others (as expected).