fix use of jl_n_threads in gc; scale default collect interval #32633
Conversation
src/gc.c
Outdated
@@ -3029,6 +3029,9 @@ void jl_gc_init(void)
     arraylist_new(&finalizer_list_marked, 0);
     arraylist_new(&to_finalize, 0);

+    assert(jl_n_threads > 0);
+    default_collect_interval = default_collect_interval * jl_n_threads;
Is linear scaling advisable? Now we have a much longer interval in serial sections of the code, leading to generally higher memory usage. Maybe some form of sublinear scaling would make sense.
Just depends how much time you'd like to spend in GC :) So far I've found that anything less than this means more threads => a greater % of time in GC, and it's quite depressing when you're trying to get something to scale.
If I understand this issue right: you propose to simply linearly scale up the amount of slack by the number of cores, so that the effective cost of GC stays the same. Did I understand this right?

This looks weird to me: if we can afford the larger slack, then we should use it regardless of the number of threads. If we cannot afford the larger slack, then a larger number of threads does not make it more affordable. In other words, maybe we want a sensible way for users to tell Julia how much they value memory against real time against core-time. Especially since extra memory slack is often almost free until it suddenly becomes almost unaffordable (if the system has N GiB of RAM and is not doing anything else, then there is not a lot of downside to using up this memory; but we cannot use more memory than we have, and at some threshold the kernel will start doing stuff like swapping or evicting stuff from very useful caches).
I think there are a couple of mitigating factors. When the threads are in use, that also implies that other processes on the system aren't using that time/memory. But more significantly, reclaiming space makes allocation much cheaper, while GC cost scales with the amount of memory still in use. With more threads, it may take more work to reach the same GC operational ratio.
Yes, good points. I'm going to redo this; I think increasing the interval itself is not quite right, since that also causes code that uses only one thread (out of many available) to use more memory, which shouldn't be strictly necessary. I think the thing to do is really simple: just make the allocation counters thread-local, and GC as soon as any one thread hits the interval. That way single-threaded code behaves the same as before, but if more threads are running and allocating then the effective interval naturally scales with the number of threads.
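The scheme described above can be sketched in a few lines of C. This is illustrative only, not Julia's actual GC code; `tls_allocd`, `gc_alloc`, and `maybe_collect` are hypothetical names:

```c
#include <stdint.h>
#include <stdlib.h>

#define GC_INTERVAL (1 << 22)  /* per-thread allocation budget (illustrative) */

/* Each thread counts only its own allocations. */
static _Thread_local int64_t tls_allocd = 0;

static void maybe_collect(void)
{
    /* stop the world, mark & sweep (stub) */
}

/* Hypothetical allocation hook: charge the current thread's counter
 * and collect as soon as *this* thread exceeds the interval. With n
 * allocating threads, up to n * GC_INTERVAL bytes are allocated
 * between collections, so the effective interval scales with the
 * number of threads that actually allocate, while a single-threaded
 * program behaves exactly as before. */
static void *gc_alloc(size_t sz)
{
    tls_allocd += (int64_t)sz;
    if (tls_allocd >= GC_INTERVAL) {
        maybe_collect();
        tls_allocd = 0;
    }
    return malloc(sz);
}
```

Note that no cross-thread coordination is needed on the fast path: each counter is written only by its owning thread.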
Force-pushed from dfdb627 to f4d6992
src/gc.c
Outdated
        gc_num.allocd = -(int64_t)gc_num.interval;
        size_t allocd = 0;
        for (int i = 0; i < jl_n_threads; i++) {
            jl_ptls_t ptls2 = jl_all_tls_states[i];
you aren't sync'd here with the other threads--it's invalid to write through jl_all_tls_states
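One way to avoid writing through another thread's state is to only *read* the per-thread counters, with atomic loads, and leave resets to the owning thread (or to a stop-the-world phase). A hedged sketch with hypothetical names; the structure mirrors, but is not, the actual `jl_all_tls_states`/`combine_thread_gc_counts` code:

```c
#include <stdatomic.h>
#include <stdint.h>

#define MAX_THREADS 64

/* Hypothetical per-thread GC state: each thread writes only its own
 * counter; other threads may read it, but must not write it. */
typedef struct {
    _Atomic int64_t allocd;
} tls_state_t;

static tls_state_t all_tls[MAX_THREADS];
static int n_threads = 1;

/* Read-only aggregation across threads. Relaxed atomic loads make
 * this safe without stopping the world, at the cost of a possibly
 * slightly stale sum. Resetting the counters (a write) still needs
 * every thread to be at a safepoint. */
static int64_t total_allocd(void)
{
    int64_t sum = 0;
    for (int i = 0; i < n_threads; i++)
        sum += atomic_load_explicit(&all_tls[i].allocd,
                                    memory_order_relaxed);
    return sum;
}
```

The key invariant is single-writer per counter: reads can race harmlessly, writes cannot.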
@@ -2622,8 +2602,7 @@ JL_DLLEXPORT int64_t jl_gc_total_bytes(void)
     jl_gc_num_t num = gc_num;
     combine_thread_gc_counts(&num);
     // Sync this logic with `base/util.jl:GC_Diff`
todo?
This allows it to expand naturally with the number of threads, giving much better scalability in GC-heavy workloads. This way gc_init doesn't need to know nthreads. fixes #32472
Force-pushed from e9ba28d to 83037dd
The first commit fixes a bug where `jl_gc_init` saw `jl_n_threads == 0` because it hadn't been set yet. The second commit scales the default collect interval by the number of threads; otherwise, adding more threads basically translates directly into spending a higher % of runtime in GC.

This makes the collect interval fully thread-local, allowing it to scale naturally: if allocation happens on n threads, the effective interval is n times bigger. This also removes the dependence of `gc_init` on the number of threads (which wasn't working, since `jl_n_threads` wasn't initialized yet).