Count GC's time waiting for a safepoint as GC time #53441

Open
mbauman wants to merge 3 commits into master

Conversation

@mbauman commented Feb 22, 2024

This is a very simple change that will hopefully help folks identify things like https://discourse.julialang.org/t/inconsistent-cpu-utilisation-in-threads-loops/110512 a little faster.

Before:

julia> @time @sync begin
           Threads.@spawn @ccall sleep(1::Cuint)::Cuint
           sleep(.5)
           Threads.@spawn GC.gc()
       end
  1.078393 seconds (927 allocations: 65.078 KiB, 5.85% gc time, 1.19% compilation time)

After:

julia> @time @sync begin
           Threads.@spawn @ccall sleep(1::Cuint)::Cuint
           sleep(.5)
           Threads.@spawn GC.gc()
       end
  1.040344 seconds (1.35 k allocations: 65.266 KiB, 50.85% gc time: 94% of which was waiting for safepoints, 1.09% compilation time)

base/timing.jl Outdated
@@ -57,7 +57,7 @@ function GC_Diff(new::GC_Num, old::GC_Num)
     new.poolalloc - old.poolalloc,
     new.bigalloc - old.bigalloc,
     new.freecall - old.freecall,
-    new.total_time - old.total_time,
+    (new.total_time - old.total_time) + (new.total_time_to_safepoint - old.total_time_to_safepoint),
A Sponsor Member commented:

I would prefer introducing a new field in GC_Diff that is the safepoint time, and break the top-level report into two sections. As an example when I run a scaling test, I want to see that total_time stays constant (e.g. the work of GC scales well with the number of threads and GC workers). If gc time starts creeping up I might use that as a decision point to use more GC threads.

On the other hand, if the latency to start GC is high (time to safepoint), I can use that as a decision point to look with a profiler at whether I spend too much time in certain ccalls.

Munging them together yields a simple "GC is slow", but no actionable information.
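For illustration, a minimal sketch of that suggestion, assuming a hypothetical field name time_to_safepoint and eliding most of GC_Diff's counters; GC_Num and its total_time_to_safepoint counter are the ones visible in the base/timing.jl diff above:

```julia
# Sketch: track GC work time and time-to-safepoint as separate fields
# instead of folding both into total_time. Not the PR's actual code.
struct GC_Diff
    allocd::Int64             # bytes allocated
    total_time::Int64         # ns of actual GC work (mark + sweep)
    time_to_safepoint::Int64  # ns spent waiting for all threads to stop
    # ... remaining counters elided
end

# GC_Num is the raw counter struct from base/timing.jl
function GC_Diff(new::GC_Num, old::GC_Num)
    GC_Diff(new.allocd - old.allocd,
            new.total_time - old.total_time,
            new.total_time_to_safepoint - old.total_time_to_safepoint)
end
```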

@mbauman (Author) replied:

Yeah, that's fair. In my view, "GC is slow" is very actionable information — and throwing more information at people isn't always helpful. How would you want this to appear? Perhaps we take inspiration from compilation time?

2.39% compilation time: 12% of which was recompilation
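For concreteness, a sketch of that format applied to GC, with illustrative thresholds; gc_time_string and its arguments are hypothetical, not the PR's actual implementation:

```julia
using Printf

# Report GC time like the compilation/recompilation line: one headline
# percentage, with the safepoint wait called out only when it matters.
function gc_time_string(gc_ns, safepoint_ns, elapsed_ns)
    total_gc = gc_ns + safepoint_ns
    gc_pct   = 100 * total_gc / elapsed_ns
    wait_pct = 100 * safepoint_ns / total_gc
    str = @sprintf("%.2f%% gc time", gc_pct)
    if wait_pct > 0.5  # elide the caveat when the wait is negligible
        str *= @sprintf(": %.0f%% of which was waiting for safepoints", wait_pct)
    end
    return str
end

# gc_time_string(3.0e7, 5.0e8, 1.04e9)
# "50.96% gc time: 94% of which was waiting for safepoints"
```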

@KristofferC commented:

How many people that see this message will know what "waiting for safe point" means? It's quite a lot of text to add.

@vchuravy commented:

So instead of "waiting for safe point", I would call it latency. Or we could just print it as (30%|10%) GC, e.g. the total is the sum of those two values.

Another way of thinking about TTSP is that it is like the GC lock.

@adienes commented Feb 22, 2024

the information certainly seems useful and 👍 to the functionality, but I'm just noting that #52883 also recently lengthened the message in the parenthetical

maybe at this point @time contains enough information that it could be appropriate to print over multiple lines?

@mbauman commented Feb 22, 2024

My only goal here is to match what "gc time" means to me (and I'd wager the vast majority of Julia users) — and that's the total time spent inside GC.gc(), sleeping or not.

I really don't want yet another way of interpreting a particular category's breakdown. So in my view we only have two choices, and I'm happy with either:

  • Match what we do with compilation/recompilation: X% category time: Y% of which was caveat
  • Only print the total time spent inside GC.gc(), but have the details available in @timed's output

It only prints this extra info if more than 0.5% of the GC time was spent waiting to start. I could make the caveat shorter with just "waiting", but I chose "waiting for safepoint" because I figured it'd make for good Google fodder.
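For reference on option 2, the detailed counters are already reachable via @timed, whose gcstats field is a Base.GC_Diff; the commented-out safepoint field is the hypothetical addition under discussion, not an existing API:

```julia
julia> stats = @timed [zeros(1000) for _ in 1:10_000];

julia> stats.gctime              # GC seconds, the number @time summarizes

julia> stats.gcstats.total_time  # nanoseconds of GC, from Base.GC_Diff

julia> # stats.gcstats.time_to_safepoint  # hypothetical home for the breakdown
```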

@mbauman added the GC (Garbage collector) and observability (metrics, timing, understandability, reflection, logging, ...) labels on Feb 22, 2024
@raminammour commented Feb 22, 2024

I am not an expert, but as people want more and more information going into @time (compilation, GC, ...), which has been very useful, good heuristics on when to print the fine-grained info may be worthwhile: say, in this case, only print the "waiting" time (+1 to omitting "for safepoint") if it is more than 10% (for example) of total time.
It may be arbitrary, but it preserves the concise output in most cases, and only alerts the user that something "else" is happening when it becomes important. The full info can always go in @timed for the gurus :)

Edit: ignore above, I just saw in the code that @mbauman already only shows this if it is > 50%

Thanks for doing this, I learned something new reading this!

@vchuravy commented Feb 22, 2024

One challenge with using total=TTSP+GC is that TTSP and useful time overlap.

So you could get a measurement where, with two threads, the total is equivalent to 100% (one thread immediately requests GC and the other ignores that request until the end).

Right now GC % is a measurement of something automatic, but useful. Mixing in TTSP makes the time reported less useful for me, which is why I likened it to the recently added lock measurement.

Put another way, currently GC time is a measurement of work and how efficiently that work is being processed, whereas TTSP is a measurement of work not occurring.

I should say that I am all in favor of surfacing this information, since I strongly believe this is an issue we as a community are not aware of enough.

@oscardssmith commented:

The counterpoint to this is that we really do need to make it so that TTSP is always ~0. Right now ccall is the main case where this doesn't happen, but we already basically know what we need to do to fix it.
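As an illustration of today's workaround, a sketch that splits a long foreign call into chunks with explicit safepoints in between; chunk_of_work is a hypothetical C function, not a real API:

```julia
# A thread stuck in one long ccall can't reach a safepoint, so a GC
# requested by another thread stalls (high TTSP). Chunking the foreign
# work and calling GC.safepoint() in between bounds that wait.
function long_foreign_work(n_chunks)
    for _ in 1:n_chunks
        @ccall chunk_of_work()::Cvoid  # hypothetical: does 1/n_chunks of the work
        GC.safepoint()                 # lets a pending collection run here
    end
end
```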

@mbauman commented Feb 22, 2024

> Put another way, currently GC time is a measurement of work and how efficiently that work is being processed, whereas TTSP is a measurement of work not occurring.

Ok. But it's the combined number that is the measurement of time my work is not happening. That's why it's what's useful to me — and I'd wager most folks.

@vchuravy commented:

But the action to take in either case is drastically different. In one case I might need to add some strategic GC.safepoint calls or annotate a ccall as GC-safe, with the goal of reducing latency; in the other case I might need to preallocate, reduce heap usage in general...

Maybe we report TTSP in the lock section? The more I think about it, the more I like the idea. GC is waiting on a lock to acquire exclusive access to the heap.

@d-netto commented Feb 23, 2024

Want to note that there seems to be some history behind the interpretation "GC time = mark time + sweep time".

Note that GCBenchmarks displays/interprets GC time as "mark time + sweep time" (it doesn't count TTSP), and pretty much all dashboards we have internally at RAI do the same.

Having GC time defined in two different ways in GCBenchmarks (which is arguably our main open-source tool for GC performance analysis now) and @time seems inconsistent.

@mbauman commented Feb 23, 2024

I'm just confused how to concretely — and succinctly and understandably — say that this is due to the GC without saying it's GC time. I suppose it could be in its own section:

1.040344 seconds (1.35 k allocations: 65.266 KiB, 5.85% gc time, 49.85% time waiting for gc, 1.09% compilation time)

One additional challenge is in determining the "floor" for elision — this time is always nonzero.

I did also think about using the locks section, but there are two issues there: the existing metric is different (time vs. count), and even if we started counting counts, I'd still want to separate out GC locks vs. ReentrantLocks, which means we'd still need to add a new descriptor.

base/timing.jl (outdated review thread, resolved)
@vchuravy commented:

At the core of the issue for me are units. Currently @time measures real time, the time a human would experience. In programming we often refer to CPU time, which is a useful concept in concurrent programming.

Take as an example the program below: it uses 10 units of real time and 20 units of CPU time. If we look at the time G (the GC) takes to run, it is 2 units of real time and 4 units of CPU time. In both cases GC takes 20% of the program execution.

T1: ----RxxxGG
T2: --------GG

Now what about TTSP? In the ASCII diagram above, R is the moment we requested GC and x is the time spent waiting on the other mutator. If we use the real-time measurement, it would seem like we are spending 4 units, or 40%, of our time waiting; but using CPU time we can see that this is not the case, and TTSP takes 20% of our CPU time, a much more accurate representation of the cost of getting to the safepoint.

This problem gets even worse the more threads we add:

T1: ----RxxxGG
T2: --------GG
T3: --------GG
T4: --------GG
T5: --------GG

Still 10 units of real time, but 50 units of CPU time. GC takes 20%, but TTSP takes 40% of real time and only 8% of CPU time.
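A quick arithmetic check of the five-thread diagram, using the unit counts stated above:

```julia
real_time = 10                     # columns in each row of the diagram
nthreads  = 5
cpu_time  = real_time * nthreads   # 50 units

gc_real   = 2                      # the two G columns
gc_cpu    = gc_real * nthreads     # GC runs on every thread: 10 units
ttsp_real = 4                      # R plus xxx on T1
ttsp_cpu  = 4                      # only T1 is waiting

gc_cpu / cpu_time       # 0.20 -> GC is 20% of CPU time (and 2/10 of real time)
ttsp_real / real_time   # 0.40 -> TTSP looks like 40% in real time
ttsp_cpu / cpu_time     # 0.08 -> but is only 8% of CPU time
```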

> But it's the combined number that is the measurement of time my work is not happening. That's why it's what's useful to me — and I'd wager most folks.

I don't disagree (besides being cute and stating that GC is still doing your work, but work you procrastinated on).

But I wager that you actually care about the CPU-time ratio and not the real-time ratio, and phrasing a multi-core measurement in real time is not useful. It doesn't convey how big of a problem this is. I want to make sure that the information we present is actionable.

I don't think we currently track CPU time accurately throughout the system, but that is a separate issue.

@mbauman commented Feb 23, 2024

So what's your suggestion on how to present this information? In my mind, the overarching goal of @time's parenthetical is to summarize some overheads that you might not have explicitly asked for and might not expect — and which might be highly variable. I put time-to-safepoint in that bin. And I still say it fits the category of a GC-related overhead.

I hear you on the units, but I must confess I don't have a solid understanding of what the percentages are supposed to mean in the context of multithreading. If you had pressed me (before I looked at the implementation yesterday), I might have guessed they have CPU time in the numerator and wall time in the denominator, but maybe that's just because compilation time sometimes reports >100%.

I still haven't quite grokked how the GC tracks multiple threads waiting for a safepoint — but it seems that it only counts the time from the first/worst one? That is, this diagram:

T1: --RxxxxxxxxxGG
T2: ----RxxxxxxxGG
T3: ------RxxxxxGG
T4: ---------RxxGG

The GC stats, as they currently exist, seem like they do the real-world timing; it's just the worst-case-thread one. (Edit: I'm guessing that's because the other threads start waiting on behalf of the first thread's collection before they can even try to start their own, and then we end up with 4 synchronized collections back-to-back, but the effect is that this number is more real-time-like than I expected):

julia> @time @sync begin
           Threads.@spawn (@ccall sleep(1::Cuint)::Cuint; GC.gc())
           Threads.@spawn (@ccall sleep(2::Cuint)::Cuint; GC.gc())
           Threads.@spawn (@ccall sleep(3::Cuint)::Cuint; GC.gc())
           Threads.@spawn (@ccall sleep(4::Cuint)::Cuint; GC.gc())
       end
  4.050298 seconds (3.43 k allocations: 163.219 KiB, 75.02% gc time: 99% of which was waiting for safepoints, 0.73% compilation time)

@vchuravy commented Feb 23, 2024

> The GC stats, as they currently exist, seem like they do the real-world timing; it's just the worst-case-thread one.

Yes, we currently only track the time the thread requesting GC spends waiting for all the other threads to reach a safepoint, so that it gains exclusive access to the heap and thus the object graph.

The real-time vs. CPU-time distinction is probably also the reason why compile time can exceed total time.

@mbauman commented Feb 23, 2024

> So what's your suggestion on how to present this information?

:)

Under the assumption that folks are trying to get all threads evenly chugging along, I think this PR is capturing the situation quite well. It's that worst-case thread that you end up waiting for:

julia> @time @sync begin
           Threads.@spawn (@ccall sleep(1::Cuint)::Cuint; GC.gc(); @ccall sleep(4::Cuint)::Cuint)
           Threads.@spawn (@ccall sleep(2::Cuint)::Cuint; GC.safepoint(); @ccall sleep(3::Cuint)::Cuint)
           Threads.@spawn (@ccall sleep(3::Cuint)::Cuint; GC.safepoint(); @ccall sleep(2::Cuint)::Cuint)
           Threads.@spawn (@ccall sleep(4::Cuint)::Cuint; GC.safepoint(); @ccall sleep(1::Cuint)::Cuint)
       end
  8.071955 seconds (3.96 k allocations: 190.641 KiB, 37.86% gc time: 98% of which was waiting for safepoints, 0.33% compilation time)
Task (done) @0x000000010a344650

julia> 8.071955 * (1-.3786)
5.015912837
