Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[train] Remove base config deepcopy when initializing the trainer actor #44611

Merged
merged 3 commits into from
Apr 10, 2024

Conversation

justinvyu
Copy link
Contributor

Why are these changes needed?

merge_dicts first creates a deepcopy of the base config before doing a deep update. The deepcopy is unnecessary for this usage in Ray Train, so we can skip it and just perform a deep update.

This causes issues with large objects passed into the trainer, increasing the peak memory usage of the Ray Train coordinator actor (which is labeled _Inner in the Ray dashboard). For example, this problem surfaced for Ray Data datasets that held a lot of metadata being passed into the trainer. (The size of the datasets is a separate issue that will be fixed.)

The large objects are deep copied, which increases the memory usage of the Trainer actor by 2x. The original copy comes from the object store via tune.with_parameters. This copy should get garbage collected immediately after the Trainer Trainable setup is called, but for some reason the copy's memory usage sticks around for the rest of training.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@matthewdeng matthewdeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

python/ray/train/base_trainer.py Outdated Show resolved Hide resolved
@justinvyu justinvyu merged commit 11810c6 into ray-project:master Apr 10, 2024
5 checks passed
@justinvyu justinvyu deleted the remove_deepcopy branch April 10, 2024 20:22
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants