Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Refactoring Ray DAG object scanner #26917

Merged
merged 4 commits into from
Jul 25, 2022

Conversation

suquark
Copy link
Member

@suquark suquark commented Jul 23, 2022

Signed-off-by: Siyuan Zhuang [email protected]

Why are these changes needed?

This PR fixes multiple important Ray DAG issues.

  1. Ray DAG cannot work under minimal install, this is because Ray DAG need to import and register DAG types from Ray Serve.
  2. Ray DAG does not really provide a standard API - every time you you inherit a class from DAGNode, you need to change the code in py_obj_scanner to register it.
  3. Ray DAG performance is terrible. This is because of improper usage of cloudpickle, which performs complicated serialization process for each node.
  4. There is a cyclic reference between dag_node and py_obj_scanner

This PR addresses all of them.

Related issue number

Closes #25988

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Siyuan Zhuang <[email protected]>

T = TypeVar("T")

# Used in deserialization hooks to reference scanner instances.
_instances: Dict[int, "_PyObjScanner"] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we have a global variable for this. I have some concerns for a multi-threading env. It's normal that the user might have multiple thread creating the DAGs right?

Copy link
Member Author

@suquark suquark Jul 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have the same issue for previous implementation. Also because of GIL, multi-threaded accessing actually is not an issue here (also in the code, the class object would not be shared by multiple threads); but it could be an issue in other parts of Ray DAG. I do not think we need to take care of it at this stage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this. This is a dict, right? Do you mean that dict operation inside python is atomic? For example, one instruction for dict operations?
If so, I'm ok with this.

also in the code, the class object would not be shared by multiple threads

I'm not sure. _instances is per-module right? so it'll be shared across threads. If we have multiple threads creating DAG, is this still thread safe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, dict operation inside python is atomic

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know! Thanks!

@suquark suquark requested a review from fishbone July 23, 2022 02:48
Copy link
Contributor

@fishbone fishbone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @ericl do you have any comments for this?

@suquark suquark added the do-not-merge Do not merge this PR! label Jul 23, 2022
@suquark
Copy link
Member Author

suquark commented Jul 23, 2022

Some CI failure may be related.

@suquark
Copy link
Member Author

suquark commented Jul 25, 2022

Could not reproduce the CI failure locally. Retesting.

Signed-off-by: Siyuan Zhuang <[email protected]>
@suquark suquark removed the do-not-merge Do not merge this PR! label Jul 25, 2022
@suquark
Copy link
Member Author

suquark commented Jul 25, 2022

Looks like the CI failure is gone.

@suquark
Copy link
Member Author

suquark commented Jul 25, 2022

The CI failure is fixed by #26960 . It is not related to this PR. Let me merge it.

@suquark suquark merged commit 4cc1ef1 into ray-project:master Jul 25, 2022
@suquark suquark deleted the fix_ray_dag_deps branch July 25, 2022 20:48
ddelange added a commit to ddelange/ray that referenced this pull request Jul 26, 2022
…rch-extra-index-url

* 'master' of https://github.com/ray-project/ray: (26 commits)
  [runtime env] plugin refactor[6/n]: java api refactor (ray-project#26783)
  easy test? (ray-project#26905)
  [core] Introduce a flag which allows a longer timeout for raylet when GCS restarts. (ray-project#26919)
  [RLLib] Record framework and algorithm used by an RLlib run. (ray-project#26956)
  [train] set split locality_hints (ray-project#26973)
  [Serve] Make the checkpoint and recover only from GCS (ray-project#26753)
  [AIR DOC] minor tweaks to checkpoint user guide for clarity and consistency subheadings (ray-project#26937)
  [tune] Only sync down from cloud if needed (ray-project#26725)
  [Core] Refactoring Ray DAG object scanner (ray-project#26917)
  [air] Raise error on path-like access for Checkpoints (ray-project#26970)
  [AIR] Enable other notebooks previously marked with # REGRESSION (ray-project#26896)
  [RLlib] Simplify agent collector (ray-project#26803)
  [Datasets] Automatically cast tensor columns when building Pandas blocks. (ray-project#26924)
  [Workflow] Fix flaky example(ray-project#26960)
  [dashboard] Update cluster_activities endpoint to use pydantic. (ray-project#26609)
  [air] Un-revert "[air] remove unnecessary logs + improve repr for result" (ray-project#26942)
  [setup-dev] Add flag to skip symlink certain folders (ray-project#26899)
  [air/tune/docs] Cont. convert Tune examples to use Tuner.fit() (ray-project#26959)
  [air/tune/docs] Change Tuner() occurences in rest of ray/tune (ray-project#26961)
  [data] set iter_batches default batch_size (ray-project#26955)
  ...

Signed-off-by: ddelange <[email protected]>
klwuibm pushed a commit to yuanchi2807/ray that referenced this pull request Jul 27, 2022
* make sure Ray DAG can work with minimal install

Signed-off-by: Siyuan Zhuang <[email protected]>\
Signed-off-by: klwuibm <[email protected]>
Catch-Bull pushed a commit to alipay/ant-ray that referenced this pull request Jul 27, 2022
* make sure Ray DAG can work with minimal install

Signed-off-by: Siyuan Zhuang <[email protected]>\
Signed-off-by: Catch-Bull <[email protected]>
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
* make sure Ray DAG can work with minimal install

Signed-off-by: Siyuan Zhuang <[email protected]>\
Signed-off-by: Rohan138 <[email protected]>
franklsf95 pushed a commit to franklsf95/ray that referenced this pull request Aug 2, 2022
* make sure Ray DAG can work with minimal install

Signed-off-by: Siyuan Zhuang <[email protected]>\
Signed-off-by: Frank Luan <[email protected]>
gramhagen pushed a commit to gramhagen/ray that referenced this pull request Aug 15, 2022
* make sure Ray DAG can work with minimal install

Signed-off-by: Siyuan Zhuang <[email protected]>\
Signed-off-by: Scott Graham <[email protected]>
gramhagen pushed a commit to gramhagen/ray that referenced this pull request Aug 15, 2022
* make sure Ray DAG can work with minimal install

Signed-off-by: Siyuan Zhuang <[email protected]>\
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
* make sure Ray DAG can work with minimal install

Signed-off-by: Siyuan Zhuang <[email protected]>\
Signed-off-by: Stefan van der Kleij <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core][DAG] Ray DAG depends on Ray Serve
3 participants