Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[train] Fix regression where large Trainer attributes get serialized along with actor class #43234

Merged

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Feb 16, 2024

Why are these changes needed?

#43146 re-introduced a problem that was originally fixed by #28826.

The problem:

  • When there are large objects captured in the scope of a Trainer's arguments, the Trainer instance on the driver needs to be sure not to package itself (self) when registering the actor class that will be run remotely.
  • Otherwise, self gets packaged along with the actor class during serialization, and so do these large objects.
  • Because Ray Train relies on creating a Ray Tune Trainable under the hood to schedule the remote Trainer instance (which serves as the dist. training coordinator), and because Ray Tune uses the Ray GCS redis key value store to save a shared copy of the serialized actor class, a huge actor class causes a gRPC error such as the one below:
E               ray.tune.error.TuneError: The Trainable/training function is too large for grpc resource limit. Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use tune.with_parameters() to put large objects in the Ray object store. 
E               Original exception: Traceback (most recent call last):
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/experiment/experiment.py", line 149, in __init__
E                   self._run_identifier = Experiment.register_if_needed(run)
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/experiment/experiment.py", line 350, in register_if_needed
E                   register_trainable(name, run_object)
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/registry.py", line 112, in register_trainable
E                   _global_registry.register(TRAINABLE_CLASS, name, trainable)
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/registry.py", line 239, in register
E                   self.flush_values()
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/registry.py", line 277, in flush_values
E                   _internal_kv_put(
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/_private/client_mode_hook.py", line 103, in wrapper
E                   return func(*args, **kwargs)
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/experimental/internal_kv.py", line 94, in _internal_kv_put
E                   return global_gcs_client.internal_kv_put(key, value, overwrite, namespace) == 0
E                 File "python/ray/_raylet.pyx", line 2614, in ray._raylet._auto_reconnect.wrapper
E                 File "python/ray/_raylet.pyx", line 2591, in ray._raylet._auto_reconnect.wrapper
E                 File "python/ray/_raylet.pyx", line 2728, in ray._raylet.GcsClient.internal_kv_put
E                 File "python/ray/_raylet.pyx", line 579, in ray._raylet.check_status
E               ray.exceptions.RpcError: Sent message larger than max (800008883 vs. 536870912)

The fix is to remove the reference to self in the train coordinator function trainable. This PR also restructures the code a bit to make it harder to make such a mistake in the future.

Future considerations

  • This redis key value store logic in Tune is probably not necessary and can be re-worked to remove this failure case.

Related issue number

Closes #43191
Closes #43192
Closes #43193
Closes #43194
Closes #43195

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Member

@woshiyyya woshiyyya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! One question: What's the change that caused the regression?


# Evaluate datasets if they are wrapped in a factory.
trainer.datasets = {
k: d() if callable(d) else d for k, d in self.datasets.items()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the self reference that caused serialization of Huge BaseTrainer object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that's it!

@justinvyu
Copy link
Contributor Author

Running release test to verify fix: https://buildkite.com/ray-project/release/builds/8691

Copy link
Contributor

@matthewdeng matthewdeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinvyu justinvyu merged commit e1f39ea into ray-project:master Feb 16, 2024
9 checks passed
@justinvyu justinvyu deleted the fix_trainer_trainable_size_regression branch February 16, 2024 23:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants