Simple heuristic to dynamically adjust number of GC threads #51061

d-netto · 2023-08-26T17:53:45Z

Basically, every GC thread will first look at other workers' queues and count the amount of available work in order to decide whether it's worth it to start marking.
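
As a rough illustration of the check described above, here is a minimal single-threaded sketch. It is not the code from src/gc.c: the `mark_queue_t` layout, the `min_work` threshold and the function names are all assumptions made for the example, and a real implementation would read the queue indices atomically.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-worker mark queue; the real
 * structures in src/gc.c are not shown in this thread. */
typedef struct {
    size_t bottom; /* next free slot, advanced by the owner */
    size_t top;    /* next item to steal, advanced by thieves */
} mark_queue_t;

/* Number of items currently visible in one queue. */
static size_t queue_size(const mark_queue_t *q)
{
    return q->bottom > q->top ? q->bottom - q->top : 0;
}

/* Returns 1 if worker `self` should enter the mark loop: it sums the
 * work sitting in the other workers' queues and stops counting as soon
 * as the (assumed) threshold `min_work` is reached. */
static int should_start_marking(const mark_queue_t *queues, size_t nqueues,
                                size_t self, size_t min_work)
{
    assert(self < nqueues);
    size_t work = 0;
    for (size_t i = 0; i < nqueues; i++) {
        if (i == self)
            continue; /* only other workers' queues are inspected */
        work += queue_size(&queues[i]);
        if (work >= min_work)
            return 1;
    }
    return 0;
}
```

With three queues holding 5, 1 and 0 items, the third worker sees 6 items of foreign work, while the first worker sees only 1.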

On my machine, this seems to fix the negative scaling I was seeing on a GCBenchmarks workload that exposes very little parallelism (e.g. list.jl).

master:

../julia-master/julia run_benchmarks.jl serial linked list -n5 --gcthreads=1         
bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       6099 │    5233 │      4921 │        291 │         1523 │                82 │     3321 │         86 │
│  median │       6149 │    5275 │      4983 │        297 │         1563 │                85 │     3321 │         86 │
│ maximum │       6740 │    5867 │      5566 │        314 │         2313 │               103 │     3321 │         87 │
│   stdev │        271 │     271 │       270 │          9 │          345 │                10 │        0 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
../julia-master/julia run_benchmarks.jl serial linked list -n5 --gcthreads=8
bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       7911 │    7022 │      6688 │        304 │         2162 │               106 │     3321 │         89 │
│  median │       8467 │    7582 │      7221 │        333 │         2738 │               174 │     3321 │         89 │
│ maximum │       8553 │    7656 │      7324 │        366 │         2906 │              1143 │     3321 │         90 │
│   stdev │        316 │     315 │       298 │         25 │          323 │               448 │        0 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR:

../julia-adjust-n-threads/julia run_benchmarks.jl serial linked list -n5 --gcthreads=1
bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       5874 │    5112 │      4786 │        296 │         1455 │                78 │     3321 │         87 │
│  median │       5900 │    5138 │      4827 │        311 │         1461 │                86 │     3321 │         87 │
│ maximum │       6585 │    5809 │      5384 │        425 │         2278 │                90 │     3321 │         88 │
│   stdev │        337 │     324 │       285 │         52 │          423 │                 4 │        0 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
../julia-adjust-n-threads/julia run_benchmarks.jl serial linked list -n5 --gcthreads=8
bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       5838 │    5035 │      4720 │        300 │         1513 │                95 │     3321 │         86 │
│  median │       5916 │    5117 │      4806 │        336 │         1679 │               118 │     3321 │         86 │
│ maximum │       6311 │    5521 │      5184 │        392 │         2169 │               489 │     3321 │         87 │
│   stdev │        186 │     193 │       188 │         35 │          257 │               206 │        0 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
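
To put a number on the negative scaling, the median mark times in the four tables above can be compared directly. The helper below is only a sketch for that arithmetic; `pct_change` is a hypothetical name and is not part of the benchmark harness.

```c
#include <assert.h>

/* Percent change from `before` to `after`; a positive result means
 * the second configuration is slower. */
static double pct_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Going from 1 to 8 GC threads, `pct_change(4983.0, 7221.0)` gives roughly +44.9% for master, while `pct_change(4827.0, 4806.0)` gives roughly -0.4% for the PR.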

It also doesn't seem to sacrifice scaling on the few GCBenchmarks workloads I tried that expose a lot of parallelism (e.g. the binary_tree ones):

master:

../julia-master/julia run_benchmarks.jl multithreaded binary_tree tree_mutable -n5 -t8 --gcthreads=8
bench = "tree_mutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       4200 │    1975 │       520 │       1450 │           84 │              1952 │      980 │         47 │
│  median │       4321 │    2032 │       542 │       1485 │           86 │              2225 │      988 │         47 │
│ maximum │       4413 │    2093 │       561 │       1532 │           97 │              2329 │     1025 │         47 │
│   stdev │         89 │      49 │        16 │         33 │            5 │               149 │       18 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR:

../julia-adjust-n-threads/julia run_benchmarks.jl multithreaded binary_tree tree_mutable -n5 -t8 --gcthreads=8 
bench = "tree_mutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       3967 │    1877 │       474 │       1400 │           79 │              2026 │      987 │         47 │
│  median │       4062 │    1939 │       495 │       1441 │           84 │              2174 │     1003 │         47 │
│ maximum │       4222 │    1997 │       522 │       1502 │           91 │              2488 │     1028 │         48 │
│   stdev │        101 │      54 │        19 │         41 │            5 │               171 │       16 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘

d-netto · 2023-08-26T18:56:35Z

(The idea implemented in this PR is inspired by https://dl.acm.org/doi/pdf/10.1145/2926697.2926706).

vchuravy · 2023-08-26T19:53:43Z

If I recall correctly, we didn't do this originally since the amount of work at the beginning is not necessarily representative of the amount discovered during the marking.

If I read this PR correctly the thread sits there during GC and spins constantly checking the amount of work present?

I think https://dl.acm.org/doi/pdf/10.1145/2926697.2926706 had one thread that was responsible for waking other threads up, instead of a bunch of threads sitting there spinning?

d-netto · 2023-08-26T20:05:41Z

> the amount of work at the beginning is not necessarily representative of the amount discovered during the marking.

This is true, and that's why the threads check whether there is enough work on every attempt to enter the mark-loop (not only at the beginning).
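
A small sketch of why the repeated check matters, under stated assumptions: `visible_work[t]` is a made-up snapshot of the work a thread counts on its t-th attempt, and `min_work` is a hypothetical threshold. Neither name comes from src/gc.c.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first attempt at which the worker sees at
 * least `min_work` items and would enter the mark loop, or `nattempts`
 * if it never does. */
static size_t first_entry(const size_t *visible_work, size_t nattempts,
                          size_t min_work)
{
    assert(visible_work != NULL || nattempts == 0);
    for (size_t t = 0; t < nattempts; t++) {
        if (visible_work[t] >= min_work)
            return t; /* enough work showed up on this attempt */
    }
    return nattempts;
}
```

For snapshots of 1, 2, 10 and 8 items with a threshold of 4, the worker stays out on the first two attempts and joins on the third, which is the case a one-time check at the start of marking would miss.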

> I think https://dl.acm.org/doi/pdf/10.1145/2926697.2926706 had one thread that was responsible for waking other threads up, instead of a bunch of threads sitting there spinning?

This PR intentionally deviates from the implementation in which a single thread is responsible for counting the available work and waking up other threads.

Delegating the wake-ups to a single thread could cause significant performance degradation, depending on how that thread is scheduled by the OS. In the worst case, if the thread is not scheduled at all, we would get no parallelism in the mark-loop.

The solution in this PR is a bit more "de-centralized" in that sense.

d-netto · 2023-08-26T20:06:21Z

> If I read this PR correctly the thread sits there during GC and spins constantly checking the amount of work present?

Not really, if it finds enough work, then it goes ahead and enters the mark-loop.

gbaraldi · 2023-08-28T12:52:28Z

The paper specifically mentions that they saw better results with wait/notify instead of spinning/sleeping. Did you experiment a bit?
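
For readers less familiar with the distinction, here is a generic wait/notify sketch using a POSIX condition variable. It only illustrates the pattern and is an assumption throughout: `work_signal_t` and its functions are hypothetical names, and neither the paper's mechanism nor the code in this PR is reproduced here.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state; names are illustrative only. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    size_t          available; /* work items ready to be marked */
    int             done;      /* set once marking has finished */
} work_signal_t;

static void work_signal_init(work_signal_t *s)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->available = 0;
    s->done = 0;
}

/* Notify side: publish n new items and wake all sleeping workers. */
static void work_signal_publish(work_signal_t *s, size_t n)
{
    pthread_mutex_lock(&s->lock);
    s->available += n;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

/* Mark the phase as finished and wake everyone so they can exit. */
static void work_signal_finish(work_signal_t *s)
{
    pthread_mutex_lock(&s->lock);
    s->done = 1;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

/* Wait side: sleep (no spinning) until at least `min_work` items are
 * available or marking is done. Returns 1 if enough work is available,
 * 0 if the worker was released only because marking finished. */
static int work_signal_wait(work_signal_t *s, size_t min_work)
{
    assert(s != NULL);
    pthread_mutex_lock(&s->lock);
    while (s->available < min_work && !s->done)
        pthread_cond_wait(&s->cond, &s->lock);
    int should_mark = s->available >= min_work;
    pthread_mutex_unlock(&s->lock);
    return should_mark;
}
```

The trade-off is the usual one: a sleeping worker burns no CPU while idle, but pays a wake-up latency that a spinning worker avoids.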

d-netto · 2023-08-29T16:02:32Z

FWIW this also seems to address the regression from #51044:

./julia --threads=128 --gcthreads=1
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-DEV.357 (2023-08-27)
 _/ |\__'_|_|_|\__'_|  |  dcn/adjust-n-gc-threads/a780003c4b (fork: 3 commits, 2 days)
|__/                   |

julia> @time display(versioninfo())
Julia Version 1.11.0-DEV.357
Commit a780003c4b (2023-08-27 16:20 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7502 32-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 128 on 128 virtual cores
nothing
  0.512379 seconds (436.95 k allocations: 29.251 MiB, 1.71% gc time, 95.42% compilation time)

./julia --threads=128 --gcthreads=256
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-DEV.357 (2023-08-27)
 _/ |\__'_|_|_|\__'_|  |  dcn/adjust-n-gc-threads/a780003c4b (fork: 3 commits, 2 days)
|__/                   |

julia> @time display(versioninfo())
Julia Version 1.11.0-DEV.357
Commit a780003c4b (2023-08-27 16:20 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7502 32-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 383 on 128 virtual cores
nothing
  0.516743 seconds (436.95 k allocations: 29.251 MiB, 2.77% gc time, 95.97% compilation time)

KristofferC · 2023-09-01T07:36:08Z

This seems good to go, or? @gbaraldi merge if ok?

gbaraldi · 2023-09-01T14:40:39Z

LGTM!

d-netto · 2023-09-17T18:13:58Z

> The paper specifically mentions that they saw better results with wait/notify instead of spinning/sleeping. Did you experiment a bit?

Latest commit should follow the paper a lot more closely now.

oscardssmith · 2023-09-17T18:15:33Z

Can you update the table with the new results?

d-netto · 2023-09-20T18:04:36Z

Re-ran the results on a 36-core machine. Note that for up to 7 GC threads we're using the exponential backoff scheduler, and beyond that we use the spin-master one.
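
The selection rule in the note above can be sketched as follows. Only the cutoff of 7 GC threads comes from this comment. The enum, the function names and the doubling-with-cap backoff schedule are assumptions for illustration, and the spin-master scheduler itself is not shown in this thread, so it is not sketched.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { GC_SCHED_EXP_BACKOFF, GC_SCHED_SPIN_MASTER } gc_sched_t;

/* Cutoff taken from the comment: up to 7 GC threads use backoff. */
enum { MAX_BACKOFF_THREADS = 7 };

static gc_sched_t pick_scheduler(int n_gc_threads)
{
    return n_gc_threads <= MAX_BACKOFF_THREADS ? GC_SCHED_EXP_BACKOFF
                                               : GC_SCHED_SPIN_MASTER;
}

/* Hypothetical backoff schedule: the number of pause iterations
 * doubles with each failed attempt to find work, capped at
 * 2^max_log2 iterations. */
static uint32_t backoff_iters(uint32_t failed_attempts, uint32_t max_log2)
{
    assert(max_log2 < 32);
    uint32_t shift = failed_attempts < max_log2 ? failed_attempts : max_log2;
    return (uint32_t)1 << shift;
}
```

Under these assumptions, the 36-thread runs below fall on the spin-master side of the cutoff, and the earlier 8-thread runs would as well.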

Seems like spin-master is an improvement for mark-times on a large number of threads:

master (36 mutators, 36 GC threads):

bench = "tree_immutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       3436 │    1383 │       230 │       1153 │           67 │              1164 │      614 │         40 │
│  median │       3582 │    1491 │       290 │       1199 │           86 │              1231 │      668 │         41 │
│ maximum │       3608 │    1518 │       308 │       1210 │          101 │              1274 │      737 │         42 │
│   stdev │         81 │      60 │        34 │         25 │           17 │                41 │       52 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
bench = "tree_mutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       6641 │    3323 │       726 │       2580 │          133 │              1550 │     1002 │         50 │
│  median │       6660 │    3341 │       742 │       2598 │          141 │              1649 │     1052 │         50 │
│ maximum │       6753 │    3411 │       800 │       2626 │          150 │              1753 │     1059 │         51 │
│   stdev │         47 │      35 │        29 │         17 │            6 │                77 │       26 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR (36 mutators, 36 GC threads):

bench = "tree_immutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       3428 │    1351 │       216 │       1135 │           65 │              1216 │      613 │         39 │
│  median │       3522 │    1413 │       248 │       1164 │           91 │              1237 │      622 │         40 │
│ maximum │       3624 │    1507 │       288 │       1218 │          101 │              1266 │      688 │         42 │
│   stdev │         80 │      65 │        30 │         35 │           17 │                20 │       38 │          1 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘
bench = "tree_mutable.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │       6571 │    3257 │       669 │       2587 │          136 │              1555 │     1057 │         50 │
│  median │       6628 │    3303 │       682 │       2613 │          145 │              1576 │     1070 │         50 │
│ maximum │       6649 │    3334 │       720 │       2639 │          177 │              1606 │     1077 │         50 │
│   stdev │         31 │      30 │        21 │         21 │           16 │                19 │        8 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘

For the linked list benchmark, spin-master seems to avoid the negative scaling on a large number of GC threads as well:

PR (1 mutator, 1 GC thread):

bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │      12272 │    9496 │      8745 │        748 │         2927 │               106 │     3319 │         77 │
│  median │      12275 │    9501 │      8753 │        749 │         2931 │               106 │     3320 │         77 │
│ maximum │      12283 │    9509 │      8761 │        751 │         2932 │               109 │     3320 │         77 │
│   stdev │          5 │       5 │         6 │          1 │            2 │                 2 │        0 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘

PR (1 mutator, 36 GC threads):

bench = "list.jl"
┌─────────┬────────────┬─────────┬───────────┬────────────┬──────────────┬───────────────────┬──────────┬────────────┐
│         │ total time │ gc time │ mark time │ sweep time │ max GC pause │ time to safepoint │ max heap │ percent gc │
│         │         ms │      ms │        ms │         ms │           ms │                us │       MB │          % │
├─────────┼────────────┼─────────┼───────────┼────────────┼──────────────┼───────────────────┼──────────┼────────────┤
│ minimum │      11744 │    8970 │      8221 │        747 │         2750 │               135 │     3320 │         76 │
│  median │      11756 │    8981 │      8232 │        748 │         2755 │               140 │     3320 │         76 │
│ maximum │      11880 │    9094 │      8346 │        751 │         2813 │               149 │     3320 │         77 │
│   stdev │         56 │      51 │        52 │          1 │           26 │                 5 │        0 │          0 │
└─────────┴────────────┴─────────┴───────────┴────────────┴──────────────┴───────────────────┴──────────┴────────────┘

d-netto · 2023-09-20T23:53:45Z

We've backported this PR and run a few tests internally. We're seeing a segfault that doesn't seem to be reproducible in the open-source benchmarks.

Marking this PR as draft until this is investigated further.

d-netto requested review from vchuravy, vtjnash and gbaraldi August 26, 2023 17:53

d-netto force-pushed the dcn/adjust-n-gc-threads branch from 9f61076 to 34d858c Compare August 26, 2023 18:49

vchuravy reviewed Aug 26, 2023

d-netto force-pushed the dcn/adjust-n-gc-threads branch 2 times, most recently from b2fc6c9 to a780003 Compare August 27, 2023 16:20

d-netto added the performance (Must go faster) and GC (Garbage collector) labels Aug 29, 2023

kpamnany reviewed Sep 1, 2023

d-netto force-pushed the dcn/adjust-n-gc-threads branch from a780003 to f5cf745 Compare September 4, 2023 11:28

d-netto force-pushed the dcn/adjust-n-gc-threads branch 4 times, most recently from fb23477 to cf862e4 Compare September 17, 2023 18:09

d-netto force-pushed the dcn/adjust-n-gc-threads branch 3 times, most recently from ffe771f to 77b65cb Compare September 17, 2023 18:44

gbaraldi reviewed Sep 20, 2023

d-netto force-pushed the dcn/adjust-n-gc-threads branch from 77b65cb to 2e7fe2f Compare September 20, 2023 17:14

d-netto force-pushed the dcn/adjust-n-gc-threads branch 2 times, most recently from b062623 to 3473191 Compare September 20, 2023 17:38

implement spin master

bee6621

d-netto force-pushed the dcn/adjust-n-gc-threads branch from 3473191 to bee6621 Compare September 20, 2023 23:52

d-netto marked this pull request as draft September 20, 2023 23:53

d-netto mentioned this pull request Sep 21, 2023

improvements on GC scheduler shutdown #51414

Closed

d-netto closed this Nov 24, 2023

d-netto mentioned this pull request Nov 24, 2023

GC scheduler refinements #52294

Merged

gbaraldi pushed a commit that referenced this pull request Dec 5, 2023

GC scheduler refinements (#52294)

e26c257

Supersedes #51061 and #51414. Still needs more perf analysis.

KristofferC pushed a commit that referenced this pull request Dec 12, 2023

GC scheduler refinements (#52294)

4241d4c

Supersedes #51061 and #51414. Still needs more perf analysis. (cherry picked from commit e26c257)

giordano deleted the dcn/adjust-n-gc-threads branch February 25, 2024 21:38
