-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add failure tests to test_reference_counting #7400
Conversation
Can one of the admins verify this patch? |
@ray.remote | ||
@pytest.mark.parametrize("failure", [False, True]) | ||
def test_basic_serialized_reference(one_worker_100MiB, failure): | ||
@ray.remote(max_retries=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it would be great to also have some tests where max_retries > 0
so we can make sure that ref counting works when there are retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah that's a really good point - I just did this to make it run more quickly. Do you think it's worth parametrizing it and testing both with and without retries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm it seems like testing with retries should cover all the cases without retries too, but testing both also seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way we can update the config to make the retries faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just test the retry case with a short timeout. Will update.
Test PASSed. |
Test PASSed. |
@stephanie-wang I left TODOs for the two bugs the tests uncovered this morning. I think we should merge this and address those separately. |
FYI - diff grew because I needed to split the tests to avoid timeouts in bazel |
Test PASSed. |
Test PASSed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
* enable * Turn on eager eviction * Shorten tests and drain ReferenceCounter * Don't force kill actor handles that have gone out of scope, lint * Fix locks * Cleanup Plasma Async Callback (#7452) * [rllib][tune] fix some nans (#7611) * Change /tmp to platform-specific temporary directory (#7529) * [Serve] UI Improvements (#7569) * bugfix about test_dynres.py (#7615) Co-authored-by: senlin.zsl <[email protected]> * Java call Python actor method use actor.call (#7614) * bug fix about useage of absl::flat_hash_map::erase and absl::flat_hash_set::erase (#7633) Co-authored-by: senlin.zsl <[email protected]> * [Java] Make both `RayActor` and `RayPyActor` inheriting from `BaseActor` (#7462) * [Java] Fix the issue that the cached value in `RayObject` is serialized (#7613) * Add failure tests to test_reference_counting (#7400) * Fix typo in asyncio documentation (#7602) * Fix segfault * debug * Force kill actor * Fix test
Why are these changes needed?
Need to test cases where workers fail. This is not comprehensive, but a good start.
Checks
scripts/format.sh
to lint the changes in this PR.