
fix use of jl_n_threads in gc; scale default collect interval #32633

Merged 1 commit into master on Jul 22, 2019

Conversation

@JeffBezanson (Member) commented Jul 19, 2019

The first commit fixes a bug where `jl_gc_init` saw `jl_n_threads == 0` because it hadn't been set yet. The second commit scales the default collect interval by the number of threads. Otherwise, adding more threads translates almost directly into spending a higher % of runtime in GC.

This makes the collect interval fully thread-local, allowing it to naturally scale (if allocation happens on n threads, the effective interval is n times bigger). This also removes the dependence of gc_init on the number of threads (which wasn't working, since jl_n_threads wasn't initialized yet).

@JeffBezanson added labels on Jul 19, 2019: performance (Must go faster), domain:multithreading (Base.Threads and related functionality), GC (Garbage collector)
src/gc.c Outdated
@@ -3029,6 +3029,9 @@ void jl_gc_init(void)
arraylist_new(&finalizer_list_marked, 0);
arraylist_new(&to_finalize, 0);

assert(jl_n_threads > 0);
default_collect_interval = default_collect_interval * jl_n_threads;
Review comment (Member):
Is linear scaling advisable? Now we have a much longer interval in serial sections of the code, leading to generally higher memory usage. Maybe some form of sublinear scaling would make sense.

JeffBezanson (Author) replied:

It just depends how much time you'd like to spend in GC :) So far I've found that anything less than this means more threads => a greater % of time in GC, and it's quite depressing when you're trying to get something to scale.

@chethega (Contributor) commented:

If I understand this issue right, the collect_interval describes the amount of memory slack (how much memory are we willing to waste on maybe-dead objects). Only one thread can run the gc, and typically all other cores idle in this time (unless we can feed them with something like BLAS). If we move up from e.g. 1 to 8 threads without changing the slack, then the amount and percentage of core-cycles spent on gc will remain the same, but these will represent a larger fraction of real-time. In effect, the cost of gc has gone up 8x, because we additionally burn all the possible work that the other 7 cores could have done in this time.

So you propose to simply linearly scale up the amount of slack by the number of cores; then the effective cost of gc stays the same.

Did I understand this right?

This looks weird to me: If we can afford the larger slack, then we should use it regardless of the number of threads. If we cannot afford the larger slack, then a larger number of threads does not make it more affordable.

In other words, maybe we want a sensible way for users to tell julia how much they value memory against real time against core-time. Especially since extra memory slack is often almost free until it suddenly becomes almost unaffordable (if the system has N gig of ram and is not doing anything else, then there is not a lot of downside to using up this memory; but we cannot use more memory than we have, and at some threshold the kernel will start doing stuff like swapping or evicting stuff from very useful caches).

@vtjnash (Member) commented Jul 21, 2019

I think there are a couple of mitigating factors. When the threads are in use, that also implies that other processes on the system aren't using that time/memory. But more significantly, reclaiming space makes allocation much cheaper, while GC work scales with the amount of memory still in use. With more threads, it may take more work to reach the same GC operational ratio.

@JeffBezanson (Author) commented:

Yes good points. I'm going to redo this; I think increasing the interval itself is not quite right, since that also causes code that uses only one thread (out of many available) to use more memory, which shouldn't be strictly necessary. I think the thing to do is really simple: just make the allocation counters thread-local, and GC as soon as any one thread hits the interval. That way single-thread code behaves the same as before, but if more threads are running and allocating then the effective interval naturally scales with the number of threads.

src/gc.c Outdated
gc_num.allocd = -(int64_t)gc_num.interval;
size_t allocd = 0;
for (int i = 0; i < jl_n_threads; i++) {
jl_ptls_t ptls2 = jl_all_tls_states[i];
Review comment (Member):

You aren't synchronized with the other threads here; it's invalid to write through jl_all_tls_states.

@@ -2622,8 +2602,7 @@ JL_DLLEXPORT int64_t jl_gc_total_bytes(void)
jl_gc_num_t num = gc_num;
combine_thread_gc_counts(&num);
// Sync this logic with `base/util.jl:GC_Diff`
Review comment (Member):
todo?

Commit message:
This allows it to expand naturally with the number of threads, giving much better scalability in GC-heavy workloads. This way gc_init doesn't need to know nthreads.

fixes #32472
@JeffBezanson JeffBezanson merged commit 684973e into master Jul 22, 2019
@JeffBezanson JeffBezanson deleted the jb/gcnthreads branch July 22, 2019 19:30
4 participants