Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Make bazel sharding for parallel buildkite more intelligent #29221

Merged
merged 28 commits into from
Oct 14, 2022

Conversation

Yard1
Copy link
Member

@Yard1 Yard1 commented Oct 11, 2022

Signed-off-by: Antoni Baum [email protected]

Why are these changes needed?

This PR implements two changes to our bazel-sharding.py script, used for determining which bazel tests to run on each instance when buildkite parallelism is used:

  • An ability to filter tests before they are sharded, using the same logic as bazel test. This is done by specifying the --tag_filters argument, eg. --tag_filters=air,-gpu. If we filter tests with bazel test after they are sharded, we can end up with imbalanced shards as eg. all tests we want to filter out are assigned to one shard. This feature is enabled for Serve tests and it will be required for the changes I want to make to AIR CI.
  • A new algorithm to balance the shards, finally implementing what that comment was asking for all this time. Instead of trying to assign the same number of tests (which have variable timeouts) to each shard, the new algorithm tries to make sure each shard will finish in more or less the same time. This is achieved through a simple but good enough heuristic. The old algorithm can still be accessed through the --sharding_strategy argument.

Those two changes do cause the complexity of the script to increase, necessitating proper testing. In order to facilitate that, this PR also adds a basic buildkite test harness for CI tools/scripts.

After this PR is merged, the next step will be to move most of our manually parallelized jobs to use buildkite parallelism with the new logic here.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Antoni Baum <[email protected]>
@Yard1 Yard1 changed the title [CI] Make bazel sharding more intelligent [CI] Make bazel sharding for parallel buildkite more intelligent Oct 11, 2022
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
@Yard1 Yard1 marked this pull request as ready for review October 11, 2022 22:39
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally this is great, a few highlevel comments:

  • The code requires more comments and documentation. It's hard to see what is happening, and we want to make it easy for future maintainers to understand the code quickly.
  • You don't need to fully document code that you didn't change, but if you have any pointers you can quickly add that would be helpful
  • For structure, let's create a ci/tests directory where we put the BUILD file and a subdirectory for test bazel sharding test (so it contains the mock buildfile)
  • Let's either keep the original bazel-sharding.py file or remove the symlink

I'll do a closer review of the new algorithm once there are more comments so I can dive into this a bit easier.

Thanks!

.buildkite/pipeline.build.yml Outdated Show resolved Hide resolved
ci/run/bazel_sharding/bazel_sharding.py Outdated Show resolved Hide resolved
@krfricke krfricke added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 13, 2022
@Yard1
Copy link
Member Author

Yard1 commented Oct 13, 2022

In terms of structure, I tried to make the ci/tests directory and realized it is not possible, because:

  • All the tested files (so bazel_sharding.py in this case) have to be in subdirectories from where the BUILD file is present. Therefore, the BUILD file would need to be in ci
  • When I put the BUILD file in ci, it broke multiple entries in other BUILD files which operate under the assumption that ci is not a bazel subpackage

I imagine that could all be fixed eventually but I didn't want to make multiple seemingly unrelated changes in this PR, therefore I opted for the structure shown here. We can either refactor that in a followup, or try to make it work right now but I think it would be best to leave it for later.

The file also had to be renamed and dash replaced to underscore to make it importable in Python. I left the symlink as to not break any code that may depend on it, but I see now that it is only used in 2 places right now so it's probably fine to just replace it there as well.

Why would we want to keep the old file btw? If someone wants the older one, they can just grab it off github. Having two scripts would be confusing.

Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
@Yard1
Copy link
Member Author

Yard1 commented Oct 13, 2022

I added some comments, PTAL @krfricke

@Yard1 Yard1 requested a review from krfricke October 13, 2022 15:02
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
@Yard1 Yard1 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 13, 2022
@krfricke
Copy link
Contributor

Why would we want to keep the old file btw? If someone wants the older one, they can just grab it off github. Having two scripts would be confusing.

I meant instead of the one in the sub directory. Let's just remove the symlink then - taking another look now!

Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the documentation, it helps me a lot to understand what is going on there.

The code and algorithm look good to me. I think we could unit test parts of the algorithm more (e.g. add_rule_to_best_shard(), or rule.actual_timeout_s), as we currently only have end to end tests for this.

ci/run/bazel_sharding/bazel_sharding.py Outdated Show resolved Hide resolved
ci/run/bazel_sharding/bazel_sharding.py Show resolved Hide resolved
ci/run/bazel_sharding/bazel_sharding.py Outdated Show resolved Hide resolved
ci/run/bazel_sharding/bazel_sharding.py Outdated Show resolved Hide resolved
@Yard1 Yard1 requested a review from krfricke October 13, 2022 22:12
Signed-off-by: Antoni Baum <[email protected]>
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@krfricke krfricke merged commit d1aa560 into ray-project:master Oct 14, 2022
@Yard1 Yard1 deleted the more_intelligent_bazel_sharding branch October 14, 2022 15:25
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…-project#29221)

This PR implements two changes to our `bazel-sharding.py` script, used for determining which bazel tests to run on each instance when buildkite parallelism is used:
* An ability to filter tests before they are sharded, using the same logic as `bazel test`. This is done by specifying the `--tag_filters` argument, eg. `--tag_filters=air,-gpu`. If we filter tests with `bazel test` *after* they are sharded, we can end up with imbalanced shards as eg. all tests we want to filter out are assigned to one shard. This feature is enabled for Serve tests and it will be required for the changes I want to make to AIR CI.
* A new algorithm to balance the shards, finally implementing what that comment was asking for all this time. Instead of trying to assign the same number of tests (which have variable timeouts) to each shard, the new algorithm tries to make sure each shard will finish in more or less the same time. This is achieved through a simple but good enough heuristic. The old algorithm can still be accessed through the `--sharding_strategy` argument.

Those two changes do cause the complexity of the script to increase, necessitating proper testing. In order to facilitate that, this PR also adds a basic buildkite test harness for CI tools/scripts.

After this PR is merged, the next step will be to move most of our manually parallelized jobs to use buildkite parallelism with the new logic here.

Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants