
Making Generational GC Thread-Safe #10317

Closed
ArchRobison opened this issue Feb 24, 2015 · 29 comments
@ArchRobison
Contributor

This issue is for discussing how to make the new GC (#8699) thread-safe. A GC expert should check the arguments here. For background, see the elegant state transition diagram in src/gc.c.

I'm assuming that garbage collection operates in stop-the-world mode, and for now, that the mark/sweep phase is single-threaded. Each thread can have its own thread-local remset, with occasional duplication across the sets because of the race described below.

Key question: what state transitions can race against each other? If I understand the GC correctly, the only racy transitions arise from a write barrier for the same parent, and that transition is GC_MARKED --> GC_QUEUED, which changes the low two bits of the memory location from 01 to 10. If this is indeed the only racy transition, then a plain load, modify, plain store sequence will suffice, since all racing participants will be attempting to write the same value to the location. To be portable C11, we should use relaxed atomic loads/stores instead of the current assignments into bitfields, but that's a purist nitpick.
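For concreteness, here is a minimal C11 sketch of that barrier under the assumptions above (the names gc_wb_sketch and remset_push, and the exact bit packing, are hypothetical, not the actual src/gc.c code):

    #include <stdatomic.h>
    #include <stdint.h>

    #define GC_MARKED 0x1  /* low bits 01 */
    #define GC_QUEUED 0x2  /* low bits 10 */

    extern void remset_push(void *parent);  /* hypothetical: push onto this thread's remset */

    /* Write barrier: if the parent is already marked, move it to the queued
     * state.  Every racing thread stores the same value, so relaxed atomic
     * loads/stores suffice; duplicates across thread-local remsets are fine. */
    static void gc_wb_sketch(_Atomic uintptr_t *gc_bits, void *parent)
    {
        uintptr_t bits = atomic_load_explicit(gc_bits, memory_order_relaxed);
        if ((bits & 0x3) == GC_MARKED) {
            atomic_store_explicit(gc_bits, (bits & ~(uintptr_t)0x3) | GC_QUEUED,
                                  memory_order_relaxed);
            remset_push(parent);
        }
    }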

The possibility of a race will require relaxing assertions that check for duplicate enqueuing of roots. There will also need to be a memory_order_acq_rel fence between normal operation and garbage collection, but that is probably already implied by the synchronization between the two phases.

Possible problem: gc_wb_buf appears to engage in other transitions. If so, which of these might race against each other? Are these transitions always on different gc_bits than the ones affected by gc_wb?

@stevengj stevengj added the domain:parallelism Parallel or distributed computation label Feb 25, 2015
@ViralBShah
Member

Cc: @carnaval

@ViralBShah ViralBShah added this to the 0.4-projects milestone Feb 25, 2015
@carnaval
Contributor

Yep, seems OK. I don't think there will be much trouble porting the write barrier; it should already be robust to duplication (except some heuristic counting code, which is not very important).
gc_wb_buf is used in the case where a Julia object (say an Array) owns a pointer to a non-first-class buffer (no identity, no type, like the Array's data). Here we only copy the gc state of the parent onto the children, so I don't think a harmful race can happen either.
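A minimal sketch of what that copy might look like (names and layout are hypothetical, not the actual gc_wb_buf):

    #include <stdatomic.h>
    #include <stdint.h>

    /* The buffer has no identity of its own, so the barrier just stamps the
     * parent's current gc bits onto it.  Racing duplicates store the same
     * value, so the race is benign. */
    static void gc_wb_buf_sketch(_Atomic uintptr_t *parent_bits,
                                 _Atomic uintptr_t *buf_bits)
    {
        uintptr_t b = atomic_load_explicit(parent_bits, memory_order_relaxed);
        atomic_store_explicit(buf_bits, b, memory_order_relaxed);
    }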
I'm not aware of the status of the threading branch these days, but I believe the annoying part will be implementing safepoints. A naive implementation (which will of course be fine for the time being) will prohibit reordering of instructions around the safepoint, whereas we only need the safepoint not to be inserted between a store and its write barrier, or between obtaining a pointer and rooting it.
Maybe if we switch to a conservative stack scanner and emit write barriers before the stores, we could go with a zero-overhead approach to safepoints like this:

  • Have LLVM generate small padding NOPs such that every cycle in the CFG goes through at least one.
  • Convince ourselves that even after reordering and optimizations, we can still do a correct gc() at those points
  • When we want to trigger gc we stop all threads, look at their $ip, find the next few possible landing sites, replace the NOPs with a longjmp, and run the threads again until they all fall into the trap.
  • GC, repair the NOPs, jump back.

This would avoid inserting mostly-useless branches in every loop of every function.

@carnaval
Contributor

(By the way, if you run into memory corruption and you suspect the write barrier, run the program under GC_VERIFY, which should check that no trigger was missed.)

@JeffBezanson
Sponsor Member

That approach seems pretty extreme. It would be nice not to depend on modifying code.

@ArchRobison
Contributor Author

It's not modifying code. It's in-place re-JITting :-).

If a conservative stack scan is used, why is stopping at safepoints still required? Won't the conservative scan find all possible roots?

The current threading branch, if I understand it correctly, uses memory allocation calls as safepoints, which is simple, though it could make threads stall for a long time and gives only weak progress guarantees.

The Go flagship compiler (gc) currently uses function entry as its safepoint, piggybacking on a conditional branch that is already there to deal with stack overflow. Go's procedure prolog checks whether the stack is big enough, and if not, calls a routine that copies the entire stack. It looks like an attractive scheme for running many tasks without worrying about stack overflow, something we will have to worry about if we parallelize tasking. To stop all threads at a safepoint, the runtime modifies the stack limit so that all threads jump to the overflow handler on their next call. Of course it doesn't help for long loops.
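A rough C sketch of that trick (all names here are hypothetical; real implementations do the check in generated prologue code and poison the limit with an atomic store):

    #include <stdint.h>

    typedef struct {
        uintptr_t stack_limit;  /* normally the thread's real stack bound */
    } thread_state_t;

    #define STACK_POISON UINTPTR_MAX  /* larger than any real stack pointer */

    void stack_overflow_or_safepoint(thread_state_t *ts);  /* grow stack, or park for GC */

    /* Conceptually emitted at every function entry.  The stack grows
     * downward, so sp below the limit means "too small" -- which also
     * fires when the limit has been poisoned. */
    static inline void prologue_check(thread_state_t *ts)
    {
        uintptr_t sp = (uintptr_t)&sp;  /* crude approximation of the stack pointer */
        if (sp < ts->stack_limit)
            stack_overflow_or_safepoint(ts);
    }

    /* To stop the world, poison every thread's limit; each thread then
     * traps into the handler at its next function call. */
    static void request_stop(thread_state_t *threads, int n)
    {
        for (int i = 0; i < n; i++)
            threads[i].stack_limit = STACK_POISON;  /* atomic store in real code */
    }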

@tknopp
Contributor

tknopp commented Feb 25, 2015

Wouldn't manual gc-ing also be a valid first approach? I know it's not that flexible, but performance-critical code should ideally not allocate any memory. When I worked on the initial threading code this worked quite well. Combined with manual gc-ing after a threading barrier, even allocations should be possible.

@carnaval
Contributor

@ArchRobison You're right, with a conservative stack scan we don't even need safepoints. Another huge argument in favor of the conservative scanner IMO.
If doing nothing works, I can understand that live patching looks a bit overkill :-)
@tknopp I'm sure it'll work, but it's a very easy source of invisible slowdowns. If your perf-critical code starts to allocate memory some day (someone changed a constant in inference.jl, ...), you can have one thread (or even n-1 threads) start to gc and stall until every other thread finishes its piece of work, reverting back to sequential performance.

@tknopp
Contributor

tknopp commented Feb 26, 2015

@carnaval: Yes, you are right. I would just be extremely happy if we could get the threading branch opt-in for 0.4, and could live with a solution that is "experts only".

@ArchRobison
Contributor Author

@carnaval Why does a thread have multiple heaps? E.g., why isn't HEAP_COUNT==1? Is it because region_t uses fixed-sized arrays? (I'm taking a whack at making the gc thread-safe, and starting to deal with privatizing globals.)

@carnaval
Contributor

Yes, what I call a heap in the code (well, a region; I've not been very consistent there) is a fixed mmap'd zone. The pool-allocating code uses this to acquire (and release) aligned 16k chunks of memory via malloc_page() and free_page(). Those should be fairly low-contention, so locking around them should be OK.
The pool descriptors themselves (pool_t) are high-traffic memory, so they should probably be thread-local (I recall Tobias doing this some time ago).
For alloc_big() we could have thread-local big_objects and big_objects_marked linked lists, but I'd guess it is a lot less important for perf (or we could achieve this with an atomic CAS?).
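For the CAS variant, a minimal C11 sketch of a lock-free push onto a shared big-object list (the bigval_t layout here is simplified, not the real one from gc.c):

    #include <stdatomic.h>

    typedef struct bigval_t {
        struct bigval_t *next;
        /* ... header and object payload ... */
    } bigval_t;

    static _Atomic(bigval_t *) big_objects;  /* shared list head */

    /* Push-only Treiber stack: retry the compare-and-swap until our node
     * becomes the new head.  No concurrent pops race with allocation here,
     * so ABA is not an issue for this particular use. */
    static void big_objects_push(bigval_t *v)
    {
        bigval_t *head = atomic_load_explicit(&big_objects, memory_order_relaxed);
        do {
            v->next = head;  /* head is refreshed by the failed CAS */
        } while (!atomic_compare_exchange_weak_explicit(
                     &big_objects, &head, v,
                     memory_order_release, memory_order_relaxed));
    }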
A bit unrelated, but: will the threading branch use an existing work-scheduling library? I've been thinking about making some parts of the gc parallel, and managing the queues & work stealing/distribution looks like 80% of the problem. A mature runtime to handle this would be cool to have hanging around the C codebase.

@tknopp
Contributor

tknopp commented Feb 27, 2015

Not sure if one can call what I did with the memory pool thread-local. I generated N pools and ensured that several threads entering the pool_alloc function are dispatched to a "free" pool. Each pool had a mutex to ensure that no race conditions happen:

https://github.com/tknopp/julia/blob/jlthreading/src/gc.c#L974

In my tests this pool was a huge improvement. But has this been entirely removed from the threading branch? I thought the experts had replaced this with a proper TLS solution.
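A minimal sketch of the dispatch scheme described above (pthreads-based; the names and pool internals are hypothetical -- see the linked gc.c for the real code):

    #include <pthread.h>

    #define N_POOLS 8

    typedef struct {
        pthread_mutex_t lock;  /* assume pthread_mutex_init() was called at startup */
        /* ... free lists, page metadata ... */
    } pool_t;

    static pool_t pools[N_POOLS];

    /* Try each pool in turn and take the first whose lock is free; if all
     * are busy, block on a fixed "home" pool.  The caller must release the
     * returned pool's lock when done allocating. */
    static pool_t *acquire_free_pool(int tid)
    {
        for (int i = 0; i < N_POOLS; i++) {
            pool_t *p = &pools[(tid + i) % N_POOLS];
            if (pthread_mutex_trylock(&p->lock) == 0)
                return p;
        }
        pool_t *p = &pools[tid % N_POOLS];
        pthread_mutex_lock(&p->lock);
        return p;
    }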

@ArchRobison
Contributor Author

Thanks for the comments and history.

@carnaval: We're still working out what sort of scheduler the threading branch will use. Both work-stealing and more loop-centric approaches have merit. I'm still pondering hybrids that the parallel-runtime teams (OpenMP, TBB, Cilk Plus) have kicked around.

@tknopp The threading branch was using one pool per thread as far as I could tell. The "TLS" was rolled by hand via an array indexed by thread index.

@eschnett
Contributor

eschnett commented Mar 4, 2015

@ArchRobison: Re thread scheduler: I've begun to use Qthreads https://github.com/Qthreads/qthreads in another project recently. The main difference between Qthreads and, well, almost everything else (OpenMP, TBB, Cilk Plus) is that Qthreads doesn't introduce dependencies between threads. In the other threading architectures, threads can only wait for their children, or a core running a particular thread can only switch to running a child, or threads cannot easily switch between cores. This usually has to do with stack layout. Qthreads introduces a separate stack for each thread, leading to its flexibility.

Qthreads also introduces the notion of a "shepherd", probably corresponding to NUMA domains. Threads can move between shepherds, but that happens more rarely than moving between cores that are managed by the same shepherd.

Anyway, since Qthreads seems to be more powerful than most other systems, I thought I'd mention it here. Qthreads is used e.g. for Chapel.

@ViralBShah
Member

@ViralBShah ViralBShah added domain:multithreading Base.Threads and related functionality and removed domain:parallelism Parallel or distributed computation labels Mar 6, 2015
@ViralBShah
Member

See the thread safety tracker in #10421

@vtjnash vtjnash modified the milestones: 0.5, 0.4 Mar 7, 2015
@ArchRobison
Contributor Author

It's good to bring up Qthreads. The one-stack-per-OS-thread model has surely done a lot of damage to enabling widespread use of parallelism. Indeed, Cilk was an early system to break the one-linear-stack-per-OS-thread association: it used heap-allocated procedure frames to simulate a cactus stack. Regrettably, that didn't play well in the marketplace since it required a different calling convention, and so the Intel version of Cilk worked out a trick to map a cactus stack onto multiple linear stacks.

The dependencies between parent and child tasks are a two-edged sword. For certain common patterns, they enable very efficient implementation with provable space guarantees, specifically O(S*P) space on P threads if the program takes O(S) space on one thread. The bound sounds trivial, but it's easy to show space blowup in less constrained schedulers. On the other hand, the dependencies are annoying in other contexts.

I think the overall lesson is to give careful consideration to the options while the calling convention and threading model have some flexibility.

@ArchRobison
Contributor Author

@carnaval: Near here, it looks like there may be a copy-paste error:

    if (regions_lb[region_i] < i)
        regions_lb[region_i] = i;
    if (regions_ub[region_i] < i)
        regions_ub[region_i] = i;

Shouldn't the first comparison be > instead of <?

@carnaval
Contributor

I don't think so. The naming may be a bit misleading, but those are not the lower/upper bound of the same quantity: we store a lower bound on the index of the first free page and an upper bound on the index of the last non-free page.
On allocation, the first one must grow up to where we are allocating, since every page before it is in use. The second one must also grow if we are allocating outside the previous bound.
Did you encounter a related bug? If this logic is wrong we are bound to get either slower allocation or memory errors.
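Restating the snippet above with those invariants as comments (same code, just annotated; not a change to src/gc.c):

    /* regions_lb[r]: lower bound on the index of the first FREE page in
     *                region r (every page below it is known to be in use).
     * regions_ub[r]: upper bound on the index of the last USED page.
     * When page i is allocated, both bounds may only move up, hence the
     * two '<' comparisons. */
    if (regions_lb[region_i] < i)
        regions_lb[region_i] = i;   /* pages up to i are now all in use */
    if (regions_ub[region_i] < i)
        regions_ub[region_i] = i;   /* we allocated past the old last used page */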

@ArchRobison
Contributor Author

Thanks for the clarifications.

@JeffBezanson
Sponsor Member

As of 2fc8d97 on the jn/threading branch, this is now basically accomplished.

@tkelman
Contributor

tkelman commented Nov 13, 2015

can this be closed?

@ViralBShah
Member

We should have at least one Travis target that builds with multi-threading enabled and runs the known-passing tests. I don't think this should hold up closing this particular issue, but it would be nice to have it enabled.

@tkelman
Contributor

tkelman commented Nov 14, 2015

Same as with #9336: it's rather difficult to build a newer LLVM on Travis right now. We have to decide exactly which branch or release will be used, then package it.

@ViralBShah
Member

I guess we can do this once we move to our own branch of LLVM 3.7, and use the same branch for threading, Gallium, and everything else. Perhaps this is not the right place to discuss it.

@yuyichao
Contributor

We've got a Travis Linux build with threading on passing =) https://travis-ci.org/JuliaLang/julia/builds/96090255

@ViralBShah
Member

👏

@vtjnash vtjnash closed this as completed Dec 10, 2015
@vtjnash
Sponsor Member

vtjnash commented Dec 10, 2015

That's what you said above should be the closing criterion for this issue (we have newer issues open tracking this better anyway).

@yuyichao
Contributor

Err, the branch tested is based on #14190

@StefanKarpinski
Sponsor Member

Bravo, @yuyichao. Still, it seems premature to close since this hasn't been merged.
