[train] Remove base config deepcopy when initializing the trainer actor #44611

justinvyu · 2024-04-10T01:20:43Z

Why are these changes needed?

merge_dicts first creates a deepcopy of the base config before doing a deep update. The deepcopy is unnecessary for this usage in Ray Train, so we can skip it and just perform a deep update.
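To make the difference concrete, here is a minimal sketch of the two merge strategies. It is not Ray's actual `merge_dicts` code: `deep_update`, `merge_with_copy`, and `LargeObject` are hypothetical names used only for illustration.

```python
import copy


def deep_update(base, new):
    """Recursively update `base` in place with values from `new` (hypothetical helper)."""
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


def merge_with_copy(base, new):
    """Deep-copy `base` first, then deep-update the copy (the pattern this PR avoids)."""
    merged = copy.deepcopy(base)
    return deep_update(merged, new)


class LargeObject:
    """Stand-in for a heavy object, e.g. a dataset carrying a lot of metadata."""

    def __init__(self):
        self.payload = list(range(1000))


big = LargeObject()
base_config = {"datasets": {"train": big}, "lr": 0.1}
overrides = {"lr": 0.01}

# Copy-then-update duplicates every nested object, including the large one.
copied = merge_with_copy(base_config, overrides)
copied_is_same = copied["datasets"]["train"] is big

# In-place deep update keeps referencing the original large object.
updated = deep_update(base_config, overrides)
updated_is_same = updated["datasets"]["train"] is big
```

The trade-off is that the in-place variant mutates `base_config`, which is only safe when the caller no longer needs the unmodified base, as the description says is the case for this usage.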

The deepcopy causes problems when large objects are passed into the trainer: it increases the peak memory usage of the Ray Train coordinator actor (labeled _Inner in the Ray dashboard). For example, this problem surfaced when Ray Data datasets holding a lot of metadata were passed into the trainer. (The size of the datasets is a separate issue that will be fixed.)

The large objects are deep copied, which roughly doubles the memory usage of the Trainer actor. The original copy comes from the object store via tune.with_parameters. The duplicate should be garbage collected immediately after the Trainer Trainable's setup is called, but for some reason its memory usage persists for the rest of training.
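One rough way to observe the extra allocation outside of Ray is with the standard-library `tracemalloc` module. This is only a sketch under simplifying assumptions: the config is a plain dict holding a large nested list, the overrides are flat (so a shallow `update` stands in for the deep update), and `peak_bytes`, `with_deepcopy`, and `in_place` are hypothetical helpers. It does not reproduce the actor or object-store behavior described above.

```python
import copy
import tracemalloc


def peak_bytes(fn):
    """Return the peak traced allocation (in bytes) while running `fn`."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


# Built before tracing starts, so only the merge itself is measured.
base_config = {"datasets": {"train": [[i] for i in range(50_000)]}}
overrides = {"lr": 0.01}


def with_deepcopy():
    # Duplicates the whole nested structure before applying the overrides.
    merged = copy.deepcopy(base_config)
    merged.update(overrides)
    return merged


def in_place():
    # Applies the overrides directly, with no copy of the large object.
    base_config.update(overrides)
    return base_config


peak_copy = peak_bytes(with_deepcopy)
peak_in_place = peak_bytes(in_place)
print(f"deepcopy peak: {peak_copy} bytes, in-place peak: {peak_in_place} bytes")
```

The exact numbers depend on the Python version and platform, but the deepcopy path should show a peak on the order of the copied structure's size, while the in-place path allocates almost nothing.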

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Justin Yu <[email protected]>

matthewdeng

lgtm

remove base config deepcopy in base trainer

b2656ff

Signed-off-by: Justin Yu <[email protected]>

justinvyu requested review from matthewdeng and woshiyyya as code owners April 10, 2024 01:20

matthewdeng approved these changes Apr 10, 2024

Review comment on python/ray/train/base_trainer.py (outdated, resolved)

justinvyu added 2 commits April 10, 2024 12:07

fix flag = true

077e8ff

Signed-off-by: Justin Yu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into remove_deepcopy

f2267c3

justinvyu merged commit 11810c6 into ray-project:master Apr 10, 2024
5 checks passed

justinvyu deleted the remove_deepcopy branch April 10, 2024 20:22

ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024

[train] Remove base config deepcopy when initializing the trainer actor (ray-project#44611)

26a1e35

Signed-off-by: Justin Yu <[email protected]>
