More GC regressions in nightly and 1.10 resulting in OOM crashes and weird stats, possibly related to gcext #50705

Closed
fingolfin opened this issue Jul 28, 2023 · 10 comments
Labels: GC (Garbage collector), kind:regression (Regression in behavior compared to a previous version), performance (Must go faster)

Comments

@fingolfin
Contributor

Our test suite has started to crash more and more frequently, and now almost constantly, with the latest Julia nightly and 1.10 updates.

It seems we get OOM crashes, but it's hard to say for sure because there are no stack traces, just messages like this (if there is a way to get a backtrace here, that would be super helpful):

[2046] signal (15): Terminated
in expression starting at /home/runner/work/_actions/julia-actions/julia-runtest/latest/test_harness.jl:7
Error: The operation was canceled.

We have collected a ton more data in oscar-system/Oscar.jl#2441, but no MWE, as it is difficult to trigger this locally -- it "helps" that the CI runners on GitHub have only a few GB of RAM.

There is also something weird going on with some of the statistics; note the crazy heap_target:

Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1286.89 MB, heap_size 1832.69 MB, heap_target 2357.69 MB, live_bytes 1761.48 MB, Fragmentation 0.961
GC: pause 898.46ms. collected 30.505936MB. incr
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1286.89 MB, heap_size 1832.31 MB, heap_target 2357.31 MB, live_bytes 1778.76 MB, Fragmentation 0.971
GC: pause 320.62ms. collected 552.570595MB. incr
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1317.08 MB, heap_size 2221.08 MB, heap_target 869387521.08 MB, live_bytes 1748.25 MB, Fragmentation 0.787
39156 ms (1847 ms GC) and 392MB allocated for alnuth/polynome.tst
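
For anyone trying to collect similar data: the "GC: pause ... collected ..." lines come from Julia's built-in GC logging (GC.enable_logging, available since 1.8); whether the extended "Heap stats: ..." lines are printed as well depends on the Julia version/build. A minimal sketch of what can be gathered on a stock Julia without rebuilding; run_workload is a made-up placeholder for the actual test run:

run_workload() = sum(rand(10^7))     # stand-in for an allocation-heavy workload

GC.enable_logging(true)              # print one line to stderr on every collection

run_workload()

GC.gc()                              # force a full collection before sampling
println("live bytes: ", Base.gc_live_bytes() / 2^20, " MB")
println("max RSS:    ", Sys.maxrss() / 2^20, " MB")
println("total RAM:  ", Sys.total_memory() / 2^30, " GB")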
@gbaraldi
Member

This looks like the same thing reported in #40644 (comment)

@brenhinkeller added the kind:regression (Regression in behavior compared to a previous version), GC (Garbage collector), and performance (Must go faster) labels on Aug 3, 2023
@lgoettgens
Contributor

#50682 unfortunately did not help.

@vchuravy added this to the 1.10 milestone on Aug 11, 2023
@JeffBezanson
Sponsor Member

We would like to fix this and will continue to work on it with you, but I believe it is not a release blocker. @gbaraldi is still looking into it.

@JeffBezanson removed this from the 1.10 milestone on Aug 15, 2023
@thofma
Contributor

thofma commented Aug 18, 2023

Note that this also happens for https://github.com/Nemocas/AbstractAlgebra.jl with 1.10-beta2. This is a pure Julia package, with no GC shenanigans or C libraries. GitHub Actions CI used to work flawlessly, but with 1.10-beta2 the job is often killed because of memory consumption. Here is an example: https://github.com/Nemocas/AbstractAlgebra.jl/actions/runs/5867762282/job/16017400384?pr=1405.

@lgoettgens
Contributor

> Note that this also happens for https://github.com/Nemocas/AbstractAlgebra.jl with 1.10-beta2. This is a pure Julia package, with no GC shenanigans or C libraries. GitHub Actions CI used to work flawlessly, but with 1.10-beta2 the job is often killed because of memory consumption. Here is an example: https://github.com/Nemocas/AbstractAlgebra.jl/actions/runs/5867762282/job/16017400384?pr=1405.

Is this only a 1.10 thing, or have you seen any nightly failures for AbstractAlgebra.jl as well?

@thofma
Contributor

thofma commented Aug 18, 2023

> Is this only a 1.10 thing, or have you seen any nightly failures for AbstractAlgebra.jl as well?

Only on 1.10 so far, but CI has not run very often in the meantime. I just noticed it recently.

@fingolfin
Contributor Author

Just to say, currently the AbstractAlgebra CI tests consistently fail with Julia nightly (but pass with 1.10 and older versions), and it really looks like GC is involved. Let me stress again that this is a pure Julia package.

I've written up more detailed observations at Nemocas/AbstractAlgebra.jl#1432, but in a nutshell it looks as if the GC grows the heap target exponentially and never shrinks it. My conjecture is that the crash happens when it tries to grow the heap from 8 to 16 GB, which is too much for those little GitHub CI runners -- but since it crashes without a stack trace, and obviously before the GC stats can be printed, I am not sure (if someone has a hint on how to figure that out, I am all ears).
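
As a possible way to probe this conjecture (just a sketch, not something I have verified): Julia 1.9+ has a --heap-size-hint flag, so one could cap the heap explicitly on the small runners and see whether the behaviour changes; the 3G value here is an arbitrary pick for a 7 GB machine:

julia --heap-size-hint=3G --project -e 'using Pkg; Pkg.test("AbstractAlgebra")'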

Of course, all of this does not change the fact that for Oscar we have similar crashes consistently with both Julia nightly and 1.10.

d-netto added a commit that referenced this issue Oct 20, 2023
The 1.10 GC heuristics introduced in #50144 have been a source of concerning issues such as #50705 and #51601. The PR also doesn't correctly implement the paper on which it's based, as discussed in #51498.

Test whether the 1.8 GC heuristics are a viable option.
@d-netto
Member

d-netto commented Feb 1, 2024

What's the status of this?

@benlorenz
Contributor

The CI for Oscar.jl on 1.10 was, AFAICT, fixed by the GC revert; I haven't noticed any OOM crashes for quite a while (I think even from before GitHub changed the Linux runners to have 14 GB of RAM).
We did have some infrequent (non-OOM) crashes in the 1.10 CI reporting GC corruption (quite frequently on 1.10, very rarely on 1.9), but these seem to be fixed on the backports-1.10 branch, possibly by (the backport of) #52569.

Regarding nightly: we couldn't test on nightly for quite a while because it requires a working CxxWrap and various related binaries. But right now nightly seems to be mostly stable; there are a few crashes from time to time that we are still investigating, but these are definitely not OOM crashes.

We also worked on improving the test suite on our side by splitting it into two CI jobs, which helps in avoiding OOM issues (a sketch of this approach is shown below).
And of course the larger GitHub runners with 14 GB of RAM instead of 7 GB also help.
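
Purely as an illustration of the splitting approach (this sketch is not the actual Oscar.jl setup; the group names, file names, and the OSCAR_TEST_GROUP variable are made up):

# test/runtests.jl -- hypothetical sketch of splitting a test suite across CI jobs
using Test

const TEST_GROUPS = Dict(
    "1" => ["rings.jl", "fields.jl"],
    "2" => ["groups.jl", "modules.jl"],
)

# Each CI job sets OSCAR_TEST_GROUP to "1" or "2"; running locally defaults to everything.
group = get(ENV, "OSCAR_TEST_GROUP", "all")
files = group == "all" ? reduce(vcat, collect(values(TEST_GROUPS))) : TEST_GROUPS[group]

@testset "test group $group" begin
    for f in files
        include(f)
    end
end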

So, in summary, this seems resolved, thanks to a combination of improved Julia code, larger runners, and the split-up test suite.

Regarding AbstractAlgebra, I think the CI also looks good on 1.10 and nightly; maybe @thofma can say more.

@thofma
Contributor

thofma commented Feb 5, 2024

I have not encountered any GC problems for a while now. Both 1.10 and nightly are looking good.

@gbaraldi closed this as completed on Feb 5, 2024