[autoscaler][kuberay] Deflake KubeRay autoscaling test #26411

Merged

DmitriGekhtman
Contributor

@DmitriGekhtman DmitriGekhtman commented Jul 9, 2022

Why are these changes needed?

Deflakes the KubeRay autoscaling e2e test, which appears to be suffering from a race condition involving CRD creation and registration.
Background: registering a CRD takes a bit of time after the request to create the CRD object returns.

Strategy:

  • Move CRD/operator setup/teardown out of the test and into the test pipeline. This avoids race conditions related to infra creation and deletion during test retries.
  • After creating the CRD and verifying the CRD exists, try a few times to create a test CR.
    • Once that's done, we're confident the CRD is registered and the test can proceed.
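
The retry step in the strategy above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `wait_until_crd_registered` and `kubectl_apply`, the attempt count, and the delay are all assumptions made for the example.

```python
import subprocess
import time
from typing import Callable


def wait_until_crd_registered(
    create_test_cr: Callable[[], None],
    attempts: int = 5,  # assumed value, not taken from the PR
    delay_s: float = 2.0,  # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call create_test_cr until it succeeds; return the attempt that worked."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            create_test_cr()
            return attempt
        except subprocess.CalledProcessError as err:
            # Expected while the API server is still registering the CRD.
            last_error = err
            if attempt < attempts:
                sleep(delay_s)
    raise RuntimeError(
        f"Test CR could not be created after {attempts} attempts"
    ) from last_error


def kubectl_apply(manifest_path: str) -> None:
    """Apply a manifest with kubectl; raises CalledProcessError on failure."""
    subprocess.check_call(["kubectl", "apply", "-f", manifest_path])
```

A caller might run `wait_until_crd_registered(lambda: kubectl_apply("test-cr.yaml"))`, where `test-cr.yaml` is a hypothetical manifest for a throwaway custom resource. Once the call returns, the CRD is accepting resources and the test can proceed.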

Related issue number

Closes #26377

Checks

I ran the test 12 times in the CI with these changes and didn't observe any failures.


# This image will be used for both the Ray nodes and the autoscaler.
# The CI should pass an image built from the test branch.
- RAY_IMAGE = os.environ.get("RAY_IMAGE", "rayproject/ray:448f52")
+ RAY_IMAGE = os.environ.get("RAY_IMAGE", "rayproject/ray:nightly-py38")
Contributor Author

@DmitriGekhtman DmitriGekhtman Jul 9, 2022

No particular reason for the py38 besides the fact that I use a py38 environment locally (Ray images are Py37 by default).
Nightly seems a reasonable enough default for a test whose primary purpose is to test PRs going into the master branch.

Contributor Author

@DmitriGekhtman DmitriGekhtman Jul 9, 2022

(The CI specifies an image built from the PR branch.)
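
As a small illustration of how the default and the CI override interact, the lookup could be sketched like this. The function name `resolve_ray_image` and the example tag `example/ray:pr-build` are hypothetical. Only the `RAY_IMAGE` variable and the `rayproject/ray:nightly-py38` default come from the diff.

```python
import os
from typing import Mapping

# Default taken from the diff; used when the environment does not set RAY_IMAGE.
DEFAULT_RAY_IMAGE = "rayproject/ray:nightly-py38"


def resolve_ray_image(env: Mapping[str, str] = os.environ) -> str:
    """Return the image named by RAY_IMAGE, falling back to the nightly default."""
    return env.get("RAY_IMAGE", DEFAULT_RAY_IMAGE)


# Local run with nothing set: falls back to the nightly image.
local_image = resolve_ray_image({})
# CI-style run: a hypothetical tag built from the PR branch wins.
ci_image = resolve_ray_image({"RAY_IMAGE": "example/ray:pr-build"})
```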

@brucez-anyscale
Contributor

This PR looks good. But why does moving the code out of Python and into the pipeline help?
Is it because the pipeline is slower and has wait time between steps?

@DmitriGekhtman
Contributor Author

DmitriGekhtman commented Jul 10, 2022

This PR looks good. But why does moving the code out of Python and into the pipeline help? Is it because the pipeline is slower and has wait time between steps?

I should clarify that -- the pipeline is set up to retry tests. I think that this external retry logic may introduce race conditions around creation and teardown of the operator and CRD.
To avoid this, I'd prefer to set up the CRD and operator as part of the test's Kubernetes infrastructure (similar to the KinD setup).

This also makes the test slightly more convenient for me to run repeatedly from my local setup.

.buildkite/pipeline.yml (outdated, resolved)
@brucez-anyscale
Contributor

Looks like multiple test failures should be fixed.

@DmitriGekhtman
Contributor Author

Not sure how these changes could have triggered the failures, but that's quite a few failures.

@DmitriGekhtman
Contributor Author

Not sure how these changes could have triggered the failures, but that's quite a few failures.

It's looking better after rebasing.

@DmitriGekhtman DmitriGekhtman merged commit 8f8f036 into ray-project:master Jul 12, 2022
@DmitriGekhtman DmitriGekhtman deleted the dmitri/kuberay-deflake branch July 12, 2022 08:12
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…6411)

Improves stability of KubeRay autoscaling test.

Signed-off-by: Stefan van der Kleij <[email protected]>
Successfully merging this pull request may close these issues.

[KubeRay][CI] KubeRay autoscaling test is flakey