
Making Generational GC Thread-Safe #10317

Closed
ArchRobison opened this issue Feb 24, 2015 · 29 comments
@ArchRobison
Contributor

This issue is for discussing how to make the new GC (#8699) thread-safe. A GC expert should check the arguments here. For background, see the elegant state transition diagram in src/gc.c.

I'm assuming that garbage collection operates in stop-the-world mode, and for now, that the mark/sweep phase is single-threaded. Each thread can have its own thread-local remset, with occasional duplication across the sets because of the race described below.

Key question: what state transitions can race against each other? If I understand the GC correctly, the only racy transitions arise from a write barrier for the same parent, and that transition is GC_MARKED --> GC_QUEUED, which changes the low two bits of the memory location from 01 to 10. If this is indeed the only racy transition, then a plain load, modify, plain store sequence will suffice, since all racing participants will be attempting to write the same value to the location. To be portable C11, we should use relaxed atomic loads/stores instead of the current assignments into bitfields, but that's a purist nitpick.
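For concreteness, here is a minimal C11 sketch of that barrier under the assumptions above (the names gc_wb_sketch and remset_push, and the exact bit packing, are hypothetical, not the actual src/gc.c code):

    #include <stdatomic.h>
    #include <stdint.h>

    #define GC_MARKED 0x1  /* low bits 01 */
    #define GC_QUEUED 0x2  /* low bits 10 */

    extern void remset_push(void *parent);  /* hypothetical: push onto this thread's remset */

    /* Write barrier: if the parent is already marked, move it to the queued
     * state.  Every racing thread stores the same value, so relaxed atomic
     * loads/stores suffice; duplicates across thread-local remsets are fine. */
    static void gc_wb_sketch(_Atomic uintptr_t *gc_bits, void *parent)
    {
        uintptr_t bits = atomic_load_explicit(gc_bits, memory_order_relaxed);
        if ((bits & 0x3) == GC_MARKED) {
            atomic_store_explicit(gc_bits, (bits & ~(uintptr_t)0x3) | GC_QUEUED,
                                  memory_order_relaxed);
            remset_push(parent);
        }
    }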

The possibility of a race will require relaxing assertions that check for duplicate enqueuing of roots. There will also need to be a memory_order_acq_rel fence between normal operation and garbage collection, but that is probably already implied by the synchronization between the two phases.

Possible problem: gc_wb_buf appears to engage in other transitions. If so, which of these might race against each other? Are these transitions always on different gc_bits than the ones affected by gc_wb?

@stevengj stevengj added the domain:parallelism Parallel or distributed computation label Feb 25, 2015
@ViralBShah
Member

Cc: @carnaval

@ViralBShah ViralBShah added this to the 0.4-projects milestone Feb 25, 2015
@carnaval
Contributor

Yep, seems OK. I don't think there will be much trouble porting the write barrier; it should already be robust to duplication (except some heuristic counting code, which is not very important).
gc_wb_buf is used in the case where a Julia object (say an Array) owns a pointer to a non-first-class buffer (no identity, no type, like the Array's data). Here we only copy the gc state of the parent onto the children, so I don't think a harmful race can happen either.
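A minimal sketch of what that copy might look like (names and layout are hypothetical, not the actual gc_wb_buf):

    #include <stdatomic.h>
    #include <stdint.h>

    /* The buffer has no identity of its own, so the barrier just stamps the
     * parent's current gc bits onto it.  Racing duplicates store the same
     * value, so the race is benign. */
    static void gc_wb_buf_sketch(_Atomic uintptr_t *parent_bits,
                                 _Atomic uintptr_t *buf_bits)
    {
        uintptr_t b = atomic_load_explicit(parent_bits, memory_order_relaxed);
        atomic_store_explicit(buf_bits, b, memory_order_relaxed);
    }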
I'm not aware of the status of the threading branch these days, but I believe the annoying part will be implementing safepoints. A naive implementation (which will of course be fine for the time being) will prohibit reordering of instructions around the safepoint, whereas we only need the safepoint not to be inserted between a store and its write barrier, or between obtaining a pointer and rooting it.
Maybe if we switch to a conservative stack scanner and emit write barriers before the stores, we could go with a zero-overhead approach to safepoints like this:

  • Have LLVM generate small padding NOPs such that every cycle in the CFG goes through at least one.
  • Convince ourselves that even after reordering and optimizations, we can still do a correct gc() at those points
  • When we want to trigger gc we stop all threads, look at their $ip, find the next few possible landing sites, replace the NOPs with a longjmp, and run the threads again until they all fall into the trap.
  • GC, repair the NOPs, jump back.

This would avoid inserting mostly-useless branches in every loop of every function.

@carnaval
Contributor

(By the way, if you run into memory corruption and you suspect the write barrier, run the program under GC_VERIFY, which should check that no trigger was missed.)

@JeffBezanson
Sponsor Member

That approach seems pretty extreme. It would be nice not to depend on modifying code.

@ArchRobison
Contributor Author

It's not modifying code. It's in-place re-JITting :-).

If a conservative stack scan is used, why is stopping at safepoints still required? Won't the conservative scan find all possible roots?

The current threading branch, if I understand it correctly, uses memory allocation calls as safepoints, which is simple, though it could make threads stall for a long time and gives only weak progress guarantees.

The Go flagship compiler (gc) currently uses function entry as its safepoint, piggybacking on a conditional branch that is already there to deal with stack overflow. Go's procedure prolog checks whether the stack is big enough, and if not, calls a routine that copies the entire stack. It looks like an attractive scheme for running many tasks without worrying about stack overflow, something we will have to worry about if we parallelize tasking. To stop all threads at a safepoint, the runtime modifies the stack limit so that all threads jump to the overflow handler on their next call. Of course it doesn't help for long loops.
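A rough C sketch of that trick (all names here are hypothetical; real implementations do the check in generated prologue code and poison the limit with an atomic store):

    #include <stdint.h>

    typedef struct {
        uintptr_t stack_limit;  /* normally the thread's real stack bound */
    } thread_state_t;

    #define STACK_POISON UINTPTR_MAX  /* larger than any real stack pointer */

    void stack_overflow_or_safepoint(thread_state_t *ts);  /* grow stack, or park for GC */

    /* Conceptually emitted at every function entry.  The stack grows
     * downward, so sp below the limit means "too small" -- which also
     * fires when the limit has been poisoned. */
    static inline void prologue_check(thread_state_t *ts)
    {
        uintptr_t sp = (uintptr_t)&sp;  /* crude approximation of the stack pointer */
        if (sp < ts->stack_limit)
            stack_overflow_or_safepoint(ts);
    }

    /* To stop the world, poison every thread's limit; each thread then
     * traps into the handler at its next function call. */
    static void request_stop(thread_state_t *threads, int n)
    {
        for (int i = 0; i < n; i++)
            threads[i].stack_limit = STACK_POISON;  /* atomic store in real code */
    }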

@tknopp
Contributor

tknopp commented Feb 25, 2015

Wouldn't manual gc-ing also be a valid first approach? I know it's not that flexible, but performance-critical code should ideally not allocate any memory. When I worked on the initial threading code this worked quite well. Combined with manual gc-ing after a threading barrier, even allocations should be possible.

@carnaval
Contributor

@ArchRobison You're right, with a conservative stack scan we don't even need safepoints. Another huge argument in favor of the conservative scanner IMO.
If doing nothing works, I can understand that live patching looks a bit overkill :-)
@tknopp I'm sure it'll work, but it's a very easy source of invisible slowdowns. If your perf-critical code starts to allocate memory some day (someone changed a constant in inference.jl, ...), you can have one thread (or even n-1 threads) start to gc and stall until every other thread finishes its piece of work, reverting back to sequential performance.

@tknopp
Contributor

tknopp commented Feb 26, 2015

@carnaval: Yes, you are right. I would just be extremely happy if we could get the threading branch opt-in for 0.4, and could live with a solution that is "experts only".

@ArchRobison
Contributor Author

@carnaval Why does a thread have multiple heaps? E.g., why isn't HEAP_COUNT==1? Is it because region_t uses fixed-sized arrays? (I'm taking a whack at making the gc thread-safe, and starting to deal with privatizing globals.)

@carnaval
Contributor

Yes, what I call a heap in the code (well, a region; I've not been very consistent there) is a fixed mmap'd zone. The pool-allocating code uses this to acquire (and release) aligned 16k chunks of memory via malloc_page() and free_page(). Those should be fairly low-contention, so locking around them should be OK.
The pool descriptors themselves (pool_t) are high-traffic memory, so they should probably be thread-local (I recall Tobias doing this some time ago).
For alloc_big() we could have thread-local big_objects and big_objects_marked linked lists, but I'd guess it is a lot less important for perf (or we could achieve this with an atomic CAS?).
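For the CAS variant, a minimal C11 sketch of a lock-free push onto a shared big-object list (the bigval_t layout here is simplified, not the real one from gc.c):

    #include <stdatomic.h>

    typedef struct bigval_t {
        struct bigval_t *next;
        /* ... header and object payload ... */
    } bigval_t;

    static _Atomic(bigval_t *) big_objects;  /* shared list head */

    /* Push-only Treiber stack: retry the compare-and-swap until our node
     * becomes the new head.  No concurrent pops race with allocation here,
     * so ABA is not an issue for this particular use. */
    static void big_objects_push(bigval_t *v)
    {
        bigval_t *head = atomic_load_explicit(&big_objects, memory_order_relaxed);
        do {
            v->next = head;  /* head is refreshed by the failed CAS */
        } while (!atomic_compare_exchange_weak_explicit(
                     &big_objects, &head, v,
                     memory_order_release, memory_order_relaxed));
    }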
A bit unrelated, but: will the threading branch use an existing work-scheduling library? I've been thinking about making some parts of the gc parallel, and managing the queues & work stealing/distribution looks like 80% of the problem. A mature runtime to handle this would be cool to have hanging around the C codebase.

@tknopp
Contributor

tknopp commented Feb 27, 2015

Not sure if one can call what I did with the memory pool thread-local. I generated N pools and ensured that several threads entering the pool_alloc function are dispatched to a "free" pool. Each pool had a mutex to ensure that no race conditions happen:

https://github.com/tknopp/julia/blob/jlthreading/src/gc.c#L974

In my tests this pool was a huge improvement. But has this been entirely removed from the threading branch? I thought the experts had replaced this with a proper TLS solution.
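A minimal sketch of the dispatch scheme described above (pthreads-based; the names and pool internals are hypothetical -- see the linked gc.c for the real code):

    #include <pthread.h>

    #define N_POOLS 8

    typedef struct {
        pthread_mutex_t lock;  /* assume pthread_mutex_init() was called at startup */
        /* ... free lists, page metadata ... */
    } pool_t;

    static pool_t pools[N_POOLS];

    /* Try each pool in turn and take the first whose lock is free; if all
     * are busy, block on a fixed "home" pool.  The caller must release the
     * returned pool's lock when done allocating. */
    static pool_t *acquire_free_pool(int tid)
    {
        for (int i = 0; i < N_POOLS; i++) {
            pool_t *p = &pools[(tid + i) % N_POOLS];
            if (pthread_mutex_trylock(&p->lock) == 0)
                return p;
        }
        pool_t *p = &pools[tid % N_POOLS];
        pthread_mutex_lock(&p->lock);
        return p;
    }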

@ArchRobison
Contributor Author

Thanks for the comments and history.

@carnaval: We're still working out what sort of scheduler the threading branch will use. Both work-stealing and more loop-centric approaches have merit. I'm still pondering hybrids that the parallel-runtime teams (OpenMP, TBB, Cilk Plus) have kicked around.

@tknopp The threading branch was using one pool per thread as far as I could tell. The "TLS" was rolled by hand via an array indexed by thread index.

@eschnett
Contributor

eschnett commented Mar 4, 2015

@ArchRobison: Re thread scheduler: I've begun to use Qthreads https://github.com/Qthreads/qthreads in another project recently. The main difference between Qthreads and, well, almost everything else (OpenMP, TBB, Cilk Plus) is that Qthreads doesn't introduce dependencies between threads. In the other threading architectures, threads can only wait for their children, or a core running a particular thread can only switch to running a child, or threads cannot easily switch between cores. This usually has to do with stack layout. Qthreads introduces a separate stack for each thread, leading to its flexibility.

Qthreads also introduces the notion of a "shepherd", probably corresponding to NUMA domains. Threads can move between shepherds, but that happens more rarely than moving between cores that are managed by the same shepherd.

Anyway, since Qthreads seems to be more powerful than most other systems, I thought I'd mention it here. Qthreads is used e.g. for Chapel.

@ViralBShah
Member

@ViralBShah ViralBShah added domain:multithreading Base.Threads and related functionality and removed domain:parallelism Parallel or distributed computation labels Mar 6, 2015
@ViralBShah
Member

See the thread safety tracker in #10421

@vtjnash vtjnash modified the milestones: 0.5, 0.4 Mar 7, 2015
@ArchRobison
Contributor Author

It's good to bring up Qthreads. The one-stack-per-OS-thread model has surely done a lot of damage to enabling widespread use of parallelism. Indeed, Cilk was an early system to break the one-linear-stack-per-OS-thread association: it used heap-allocated procedure frames to simulate a cactus stack. Regrettably, that didn't play well in the marketplace since it required a different calling convention, and so the Intel version of Cilk worked out a trick to map a cactus stack onto multiple linear stacks.

The dependencies between parent and child tasks are a two-edged sword. For certain common patterns, they enable very efficient implementation with provable space guarantees, specifically O(S*P) space on P threads if the program takes O(S) space on one thread. The bound sounds trivial, but it's easy to show space blowup in less constrained schedulers. On the other hand, the dependencies are annoying in other contexts.

I think the overall lesson is to give careful consideration to the options while the calling convention and threading model have some flexibility.

@ArchRobison
Contributor Author

@carnaval: Near here, it looks like there may be a copy-paste error:

    if (regions_lb[region_i] < i)
        regions_lb[region_i] = i;
    if (regions_ub[region_i] < i)
        regions_ub[region_i] = i;

Shouldn't the first comparison be > instead of <?

@carnaval
Contributor

I don't think so. The naming may be a bit misleading, but those are not the lower/upper bound of the same quantity: we store a lower bound on the index of the first free page and an upper bound on the index of the last non-free page.
On allocation, the first one must grow up to where we are allocating, since every page before it is in use. The second one must also grow if we are allocating outside the previous bound.
Did you encounter a related bug? If this logic is wrong we are bound to get either slower allocation or memory errors.
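Restating the snippet above with those invariants as comments (same code, just annotated; not a change to src/gc.c):

    /* regions_lb[r]: lower bound on the index of the first FREE page in
     *                region r (every page below it is known to be in use).
     * regions_ub[r]: upper bound on the index of the last USED page.
     * When page i is allocated, both bounds may only move up, hence the
     * two '<' comparisons. */
    if (regions_lb[region_i] < i)
        regions_lb[region_i] = i;   /* pages up to i are now all in use */
    if (regions_ub[region_i] < i)
        regions_ub[region_i] = i;   /* we allocated past the old last used page */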

@ArchRobison
Contributor Author

Thanks for the clarifications.

@JeffBezanson
Sponsor Member

As of 2fc8d97 on the jn/threading branch, this is now basically accomplished.

@tkelman
Contributor

tkelman commented Nov 13, 2015

can this be closed?

@ViralBShah
Member

We should have at least one Travis target that builds with multi-threading enabled and runs the known-passing tests. I don't think this should hold up closing this particular issue, but it would be nice to have it enabled.

@tkelman
Contributor

tkelman commented Nov 14, 2015

Same as with #9336: it's rather difficult to build a newer LLVM on Travis right now. We have to decide exactly which branch or release will be used, then package it.

@ViralBShah
Member

I guess we can do this once we move to our own branch of LLVM 3.7, and use the same branch for threading, Gallium, and everything else. Perhaps this is not the right place to discuss it.

@yuyichao
Contributor

We've got a Travis Linux build with threading on passing =) https://travis-ci.org/JuliaLang/julia/builds/96090255

@ViralBShah
Member

👏

@vtjnash vtjnash closed this as completed Dec 10, 2015
@vtjnash
Sponsor Member

vtjnash commented Dec 10, 2015

That's what you said above should be the closing criterion for this issue (we have newer issues open tracking this better anyway).

@yuyichao
Contributor

Err, the branch tested is based on #14190

@StefanKarpinski
Sponsor Member

Bravo, @yuyichao. Still, it seems premature to close since this hasn't been merged.
