Simple heuristic to dynamically adjust number of GC threads #51061

Closed
wants to merge 1 commit into from

d-netto (Member) commented Aug 26, 2023

Basically, every GC thread first looks at the other workers' queues and counts the available work to decide whether it's worth starting to mark.
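
A minimal sketch of that decision, written in Python for brevity rather than in the C of src/gc.c. The names `count_available_work` and `should_start_marking` and the `work_per_thread` threshold are hypothetical and not taken from the PR; the actual heuristic may weigh the queues differently.

```python
from typing import Sequence


def count_available_work(queues: Sequence[Sequence[object]]) -> int:
    """Total number of pending items across all workers' mark queues."""
    return sum(len(q) for q in queues)


def should_start_marking(
    queues: Sequence[Sequence[object]],
    n_marking: int,
    work_per_thread: int = 16,  # assumed threshold, not from the PR
) -> bool:
    """Join the mark loop only if the visible work is enough to keep the
    threads that are already marking, plus this one, busy."""
    return count_available_work(queues) >= work_per_thread * (n_marking + 1)
```

Under this rule, a workload whose queues rarely hold more than a handful of items (a linked list, for instance) would keep the extra threads out of the mark loop.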

On my machine, this seems to fix the negative scaling I was seeing on a GCBenchmarks workload that exposes very little parallelism (e.g. list.jl).

  • master:
../julia-master/julia run_benchmarks.jl serial linked list -n5 --gcthreads=1         
bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       6099 │    5233 │      4921 │        291 │         1523 │                82 │     3321 │         86 │
│  median │       6149 │    5275 │      4983 │        297 │         1563 │                85 │     3321 │         86 │
│ maximum │       6740 │    5867 │      5566 │        314 │         2313 │               103 │     3321 │         87 │
│   stdev │        271 │     271 │       270 │          9 │          345 │                10 │        0 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
../julia-master/julia run_benchmarks.jl serial linked list -n5 --gcthreads=8
bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       7911 │    7022 │      6688 │        304 │         2162 │               106 │     3321 │         89 │
│  median │       8467 │    7582 │      7221 │        333 │         2738 │               174 │     3321 │         89 │
│ maximum │       8553 │    7656 │      7324 │        366 │         2906 │              1143 │     3321 │         90 │
│   stdev │        316 │     315 │       298 │         25 │          323 │               448 │        0 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
  • PR:
../julia-adjust-n-threads/julia run_benchmarks.jl serial linked list -n5 --gcthreads=1
bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       5874 │    5112 │      4786 │        296 │         1455 │                78 │     3321 │         87 │
│  median │       5900 │    5138 │      4827 │        311 │         1461 │                86 │     3321 │         87 │
│ maximum │       6585 │    5809 │      5384 │        425 │         2278 │                90 │     3321 │         88 │
│   stdev │        337 │     324 │       285 │         52 │          423 │                 4 │        0 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
../julia-adjust-n-threads/julia run_benchmarks.jl serial linked list -n5 --gcthreads=8
bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       5838 │    5035 │      4720 │        300 │         1513 │                95 │     3321 │         86 │
│  median │       5916 │    5117 │      4806 │        336 │         1679 │               118 │     3321 │         86 │
│ maximum │       6311 │    5521 │      5184 │        392 │         2169 │               489 │     3321 │         87 │
│   stdev │        186 │     193 │       188 │         35 │          257 │               206 │        0 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
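
For a quick read of the four list.jl tables, the relative change from 1 to 8 GC threads can be computed from the medians. This is only arithmetic on the numbers reported above; the helper name `percent_change` is illustrative.

```python
# Median values (ms) copied from the list.jl tables above.
medians = {
    "master": {1: {"total": 6149, "mark": 4983}, 8: {"total": 8467, "mark": 7221}},
    "PR": {1: {"total": 5900, "mark": 4827}, 8: {"total": 5916, "mark": 4806}},
}


def percent_change(build: str, metric: str) -> float:
    """Percent change going from 1 to 8 GC threads (positive means slower)."""
    one = medians[build][1][metric]
    eight = medians[build][8][metric]
    return (eight / one - 1.0) * 100.0


for build in medians:
    for metric in ("total", "mark"):
        print(f"{build:6s} {metric:5s} {percent_change(build, metric):+.1f}%")
```

On these medians, total time on master grows by roughly 38% with 8 GC threads, while on the PR branch it changes by well under 1%.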

It also doesn't seem to sacrifice scaling on a few GCBenchmarks that expose a lot of parallelism (e.g. the binary_tree ones):

  • master:
../julia-master/julia run_benchmarks.jl multithreaded binary_tree tree_mutable -n5 -t8 --gcthreads=8
bench = "tree_mutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       4200 │    1975 │       520 │       1450 │           84 │              1952 │      980 │         47 │
│  median │       4321 │    2032 │       542 │       1485 │           86 │              2225 │      988 │         47 │
│ maximum │       4413 │    2093 │       561 │       1532 │           97 │              2329 │     1025 │         47 │
│   stdev │         89 │      49 │        16 │         33 │            5 │               149 │       18 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
  • PR:
../julia-adjust-n-threads/julia run_benchmarks.jl multithreaded binary_tree tree_mutable -n5 -t8 --gcthreads=8 
bench = "tree_mutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       3967 │    1877 │       474 │       1400 │           79 │              2026 │      987 │         47 │
│  median │       4062 │    1939 │       495 │       1441 │           84 │              2174 │     1003 │         47 │
│ maximum │       4222 │    1997 │       522 │       1502 │           91 │              2488 │     1028 │         48 │
│   stdev │        101 │      54 │        19 │         41 │            5 │               171 │       16 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘

d-netto (Member, Author) commented Aug 26, 2023

(The idea implemented in this PR is inspired by https://dl.acm.org/doi/pdf/10.1145/2926697.2926706).

vchuravy (Sponsor Member) commented

If I recall correctly, we didn't do this originally since the amount of work at the beginning is not necessarily representative of the amount discovered during marking.

If I read this PR correctly, the thread sits there during GC and spins, constantly checking the amount of work present?

I think https://dl.acm.org/doi/pdf/10.1145/2926697.2926706 had one thread that was responsible for waking other threads up, instead of a bunch of threads sitting there spinning?

d-netto (Member, Author) commented Aug 26, 2023

> the amount of work at the beginning is not necessarily representative of the amount discovered during marking.

This is true. That's why the threads check whether there is enough work on every attempt to enter the mark-loop (and not only at the beginning).
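
A toy illustration of why the re-check matters, in Python rather than the C of src/gc.c. It marks a complete binary tree from a single root, so the queue starts with one item and grows as marking proceeds. The function name, the tree model, and the threshold are all hypothetical; nothing here is taken from the PR's code.

```python
from collections import deque
from typing import Optional


def helper_join_step(depth: int, threshold: int, recheck: bool) -> Optional[int]:
    """Return the marking step at which a helper thread would join, or None.

    The queue holds the depth of each pending node of a complete binary
    tree. A helper joins once the queue length reaches `threshold`.
    """
    queue = deque([0])  # only the root is known at the start
    step = 0
    if len(queue) >= threshold:
        return step  # enough work visible up front
    while queue:
        node_depth = queue.popleft()
        if node_depth < depth:
            # Marking an interior node discovers its two children.
            queue.append(node_depth + 1)
            queue.append(node_depth + 1)
        step += 1
        if recheck and len(queue) >= threshold:
            return step  # work discovered later is noticed on a re-check
    return None  # helper never saw enough work
```

With `depth=4` and `threshold=4`, a helper that only looks once at the start never joins, while one that re-checks joins after a few steps.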

> I think https://dl.acm.org/doi/pdf/10.1145/2926697.2926706 had one thread that was responsible for waking other threads up, instead of a bunch of threads sitting there spinning?

This PR intentionally deviates from the implementation in which a single thread is responsible for counting the available work and waking up other threads.

Delegating the wake-ups to a single thread could cause significant performance degradation, depending on how the OS schedules that thread. In the worst case, if the thread is not scheduled at all, we would have no parallelism in the mark-loop.

The solution in this PR is a bit more "de-centralized" in that sense.

d-netto (Member, Author) commented Aug 26, 2023

> If I read this PR correctly, the thread sits there during GC and spins, constantly checking the amount of work present?

Not really: if it finds enough work, it goes ahead and enters the mark-loop.

@d-netto d-netto force-pushed the dcn/adjust-n-gc-threads branch 2 times, most recently from b2fc6c9 to a780003 Compare August 27, 2023 16:20
gbaraldi (Member) commented

The paper specifically mentions that they saw better results with wait/notify instead of spinning/sleeping. Did you experiment a bit?

@d-netto d-netto added the performance (Must go faster) and GC (Garbage collector) labels Aug 29, 2023
d-netto (Member, Author) commented Aug 29, 2023

FWIW this also seems to address the regression from #51044:

./julia --threads=128 --gcthreads=1
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-DEV.357 (2023-08-27)
 _/ |\__'_|_|_|\__'_|  |  dcn/adjust-n-gc-threads/a780003c4b (fork: 3 commits, 2 days)
|__/                   |

julia> @time display(versioninfo())
Julia Version 1.11.0-DEV.357
Commit a780003c4b (2023-08-27 16:20 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7502 32-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 128 on 128 virtual cores
nothing
  0.512379 seconds (436.95 k allocations: 29.251 MiB, 1.71% gc time, 95.42% compilation time)
./julia --threads=128 --gcthreads=256
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-DEV.357 (2023-08-27)
 _/ |\__'_|_|_|\__'_|  |  dcn/adjust-n-gc-threads/a780003c4b (fork: 3 commits, 2 days)
|__/                   |

julia> @time display(versioninfo())
Julia Version 1.11.0-DEV.357
Commit a780003c4b (2023-08-27 16:20 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7502 32-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 383 on 128 virtual cores
nothing
  0.516743 seconds (436.95 k allocations: 29.251 MiB, 2.77% gc time, 95.97% compilation time)

KristofferC (Sponsor Member) commented

This seems good to go, or? @gbaraldi merge if ok?

gbaraldi (Member) commented Sep 1, 2023

LGTM!

@d-netto d-netto force-pushed the dcn/adjust-n-gc-threads branch 4 times, most recently from fb23477 to cf862e4 Compare September 17, 2023 18:09
d-netto (Member, Author) commented Sep 17, 2023

> The paper specifically mentions that they saw better results with wait/notify instead of spinning/sleeping. Did you experiment a bit?

The latest commit should follow the paper much more closely now.

oscardssmith (Member) commented

Can you update the table with the new results?

@d-netto d-netto force-pushed the dcn/adjust-n-gc-threads branch 3 times, most recently from ffe771f to 77b65cb Compare September 17, 2023 18:44
@d-netto d-netto force-pushed the dcn/adjust-n-gc-threads branch 2 times, most recently from b062623 to 3473191 Compare September 20, 2023 17:38
d-netto (Member, Author) commented Sep 20, 2023

Re-ran the benchmarks on a 36-core machine. Note that for up to 7 GC threads we use the exponential backoff scheduler, and beyond that we use the spin-master one.
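
As a rough sketch of that split, in Python rather than the C of src/gc.c: the 7-thread cutoff comes from the comment above, while the function names, the scheduler labels, and the backoff `base` and `cap` values are assumptions made for illustration. The thread does not state the PR's actual backoff parameters.

```python
from typing import List

# From the comment above: up to 7 GC threads use exponential backoff.
SPIN_MASTER_MIN_THREADS = 8


def pick_scheduler(n_gc_threads: int) -> str:
    """Choose the scheduling strategy from the number of GC threads."""
    if n_gc_threads >= SPIN_MASTER_MIN_THREADS:
        return "spin-master"
    return "exponential-backoff"


def backoff_delays(attempts: int, base: int = 1, cap: int = 64) -> List[int]:
    """Delay before each retry, in arbitrary pause units.

    The delay doubles on every failed attempt and is clamped at `cap`
    (both `base` and `cap` are assumed values).
    """
    return [min(base << i, cap) for i in range(attempts)]
```

For example, `pick_scheduler(36)` selects the spin-master strategy, which matches the 36-thread runs reported in this comment.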

Spin-master seems to improve mark times with a large number of threads:

  • master (36 mutators, 36 GC threads):
bench = "tree_immutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       3436 │    1383 │       230 │       1153 │           67 │              1164 │      614 │         40 │
│  median │       3582 │    1491 │       290 │       1199 │           86 │              1231 │      668 │         41 │
│ maximum │       3608 │    1518 │       308 │       1210 │          101 │              1274 │      737 │         42 │
│   stdev │         81 │      60 │        34 │         25 │           17 │                41 │       52 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
bench = "tree_mutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       6641 │    3323 │       726 │       2580 │          133 │              1550 │     1002 │         50 │
│  median │       6660 │    3341 │       742 │       2598 │          141 │              1649 │     1052 │         50 │
│ maximum │       6753 │    3411 │       800 │       2626 │          150 │              1753 │     1059 │         51 │
│   stdev │         47 │      35 │        29 │         17 │            6 │                77 │       26 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
  • PR (36 mutators, 36 GC threads):
bench = "tree_immutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       3428 │    1351 │       216 │       1135 │           65 │              1216 │      613 │         39 │
│  median │       3522 │    1413 │       248 │       1164 │           91 │              1237 │      622 │         40 │
│ maximum │       3624 │    1507 │       288 │       1218 │          101 │              1266 │      688 │         42 │
│   stdev │         80 │      65 │        30 │         35 │           17 │                20 │       38 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
bench = "tree_mutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       6571 │    3257 │       669 │       2587 │          136 │              1555 │     1057 │         50 │
│  median │       6628 │    3303 │       682 │       2613 │          145 │              1576 │     1070 │         50 │
│ maximum │       6649 │    3334 │       720 │       2639 │          177 │              1606 │     1077 │         50 │
│   stdev │         31 │      30 │        21 │         21 │           16 │                19 │        8 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘

For the linked-list benchmark, spin-master also seems to avoid the negative scaling with a large number of GC threads:

  • PR (1 mutator, 1 GC thread):
bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │      12272 │    9496 │      8745 │        748 │         2927 │               106 │     3319 │         77 │
│  median │      12275 │    9501 │      8753 │        749 │         2931 │               106 │     3320 │         77 │
│ maximum │      12283 │    9509 │      8761 │        751 │         2932 │               109 │     3320 │         77 │
│   stdev │          5 │       5 │         6 │          1 │            2 │                 2 │        0 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
  • PR (1 mutator, 36 GC threads):
bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │      11744 │    8970 │      8221 │        747 │         2750 │               135 │     3320 │         76 │
│  median │      11756 │    8981 │      8232 │        748 │         2755 │               140 │     3320 │         76 │
│ maximum │      11880 │    9094 │      8346 │        751 │         2813 │               149 │     3320 │         77 │
│   stdev │         56 │      51 │        52 │          1 │           26 │                 5 │        0 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘

d-netto (Member, Author) commented Sep 20, 2023

We backported this PR and ran a few tests internally. We're seeing a segfault that doesn't seem to be reproducible with the open-source benchmarks.

Marking this PR as draft until this is investigated further.

@d-netto d-netto marked this pull request as draft September 20, 2023 23:53
@d-netto d-netto closed this Nov 24, 2023
gbaraldi pushed a commit that referenced this pull request Dec 5, 2023
Supersedes #51061 and #51414.

Still needs more perf analysis.
KristofferC pushed a commit that referenced this pull request Dec 12, 2023
Supersedes #51061 and #51414.

Still needs more perf analysis.

(cherry picked from commit e26c257)
@giordano giordano deleted the dcn/adjust-n-gc-threads branch February 25, 2024 21:38
Labels: GC (Garbage collector), performance (Must go faster)