[CI] Make bazel sharding for parallel buildkite more intelligent #29221

Yard1 · 2022-10-11T18:50:09Z

Signed-off-by: Antoni Baum [email protected]

Why are these changes needed?

This PR implements two changes to our bazel-sharding.py script, used for determining which bazel tests to run on each instance when buildkite parallelism is used:

An ability to filter tests before they are sharded, using the same logic as bazel test. This is done by specifying the --tag_filters argument, eg. --tag_filters=air,-gpu. If we filter tests with bazel test after they are sharded, we can end up with imbalanced shards as eg. all tests we want to filter out are assigned to one shard. This feature is enabled for Serve tests and it will be required for the changes I want to make to AIR CI.
A new algorithm to balance the shards, finally implementing what that comment was asking for all this time. Instead of trying to assign the same number of tests (which have variable timeouts) to each shard, the new algorithm tries to make sure each shard will finish in more or less the same time. This is achieved through a simple but good enough heuristic. The old algorithm can still be accessed through the --sharding_strategy argument.

Those two changes do cause the complexity of the script to increase, necessitating proper testing. In order to facilitate that, this PR also adds a basic buildkite test harness for CI tools/scripts.

After this PR is merged, the next step will be to move most of our manually parallelized jobs to use buildkite parallelism with the new logic here.

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Antoni Baum <[email protected]>

krfricke

Generally this is great, a few highlevel comments:

The code requires more comments and documentation. It's hard to see what is happening, and we want to make it easy for future maintainers to understand the code quickly.
You don't need to fully document code that you didn't change, but if you have any pointers you can quickly add that would be helpful
For structure, let's create a ci/tests directory where we put the BUILD file and a subdirectory for test bazel sharding test (so it contains the mock buildfile)
Let's either keep the original bazel-sharding.py file or remove the symlink

I'll do a closer review of the new algorithm once there are more comments so I can dive into this a bit easier.

Thanks!

.buildkite/pipeline.build.yml

ci/run/bazel_sharding/bazel_sharding.py

ci/run/bazel_sharding/tests/test_bazel_sharding.py

Yard1 · 2022-10-13T14:26:41Z

In terms of structure, I tried to make the ci/tests directory and realized it is not possible, because:

All the tested files (so bazel_sharding.py in this case) have to be in subdirectories from where the BUILD file is present. Therefore, the BUILD file would need to be in ci
When I put the BUILD file in ci, it broke multiple entries in other BUILD files which operate under the assumption that ci is not a bazel subpackage

I imagine that could all be fixed eventually but I didn't want to make multiple seemingly unrelated changes in this PR, therefore I opted for the structure shown here. We can either refactor that in a followup, or try to make it work right now but I think it would be best to leave it for later.

The file also had to be renamed and dash replaced to underscore to make it importable in Python. I left the symlink as to not break any code that may depend on it, but I see now that it is only used in 2 places right now so it's probably fine to just replace it there as well.

Why would we want to keep the old file btw? If someone wants the older one, they can just grab it off github. Having two scripts would be confusing.

Signed-off-by: Antoni Baum <[email protected]>

Yard1 · 2022-10-13T15:01:58Z

I added some comments, PTAL @krfricke

Signed-off-by: Antoni Baum <[email protected]>

krfricke · 2022-10-13T20:43:03Z

Why would we want to keep the old file btw? If someone wants the older one, they can just grab it off github. Having two scripts would be confusing.

I meant instead of the one in the sub directory. Let's just remove the symlink then - taking another look now!

krfricke

Thank you very much for the documentation, it helps me a lot to understand what is going on there.

The code and algorithm look good to me. I think we could unit test parts of the algorithm more (e.g. add_rule_to_best_shard(), or rule.actual_timeout_s), as we currently only have end to end tests for this.

ci/run/bazel_sharding/bazel_sharding.py

Signed-off-by: Antoni Baum <[email protected]>

krfricke

Awesome!

Signed-off-by: Antoni Baum <[email protected]>

…-project#29221) This PR implements two changes to our `bazel-sharding.py` script, used for determining which bazel tests to run on each instance when buildkite parallelism is used: * An ability to filter tests before they are sharded, using the same logic as `bazel test`. This is done by specifying the `--tag_filters` argument, eg. `--tag_filters=air,-gpu`. If we filter tests with `bazel test` *after* they are sharded, we can end up with imbalanced shards as eg. all tests we want to filter out are assigned to one shard. This feature is enabled for Serve tests and it will be required for the changes I want to make to AIR CI. * A new algorithm to balance the shards, finally implementing what that comment was asking for all this time. Instead of trying to assign the same number of tests (which have variable timeouts) to each shard, the new algorithm tries to make sure each shard will finish in more or less the same time. This is achieved through a simple but good enough heuristic. The old algorithm can still be accessed through the `--sharding_strategy` argument. Those two changes do cause the complexity of the script to increase, necessitating proper testing. In order to facilitate that, this PR also adds a basic buildkite test harness for CI tools/scripts. After this PR is merged, the next step will be to move most of our manually parallelized jobs to use buildkite parallelism with the new logic here. Signed-off-by: Antoni Baum <[email protected]> Signed-off-by: Weichen Xu <[email protected]>

[CI] Make bazel sharding more intelligent

81391c6

Signed-off-by: Antoni Baum <[email protected]>

Yard1 requested review from matthewdeng, krfricke, rkooo567 and simon-mo October 11, 2022 18:50

Yard1 assigned krfricke Oct 11, 2022

Yard1 added 2 commits October 11, 2022 18:53

Typo

e781df6

Signed-off-by: Antoni Baum <[email protected]>

Make serve tests use tag_filters

a41fcd1

Signed-off-by: Antoni Baum <[email protected]>

Yard1 changed the title ~~[CI] Make bazel sharding more intelligent~~ [CI] Make bazel sharding for parallel buildkite more intelligent Oct 11, 2022

Yard1 added 3 commits October 11, 2022 19:54

Nit

506b2ae

Signed-off-by: Antoni Baum <[email protected]>

Add CI tool testing harness, change script name

d77e98f

Signed-off-by: Antoni Baum <[email protected]>

Bazel fix

bfab0a6

Signed-off-by: Antoni Baum <[email protected]>

Yard1 marked this pull request as ready for review October 11, 2022 22:39

Yard1 added 9 commits October 11, 2022 22:52

Fixes

84d8821

Signed-off-by: Antoni Baum <[email protected]>

Use pipeline.build instead

4ae8ce6

Signed-off-by: Antoni Baum <[email protected]>

Fix

41581f9

Signed-off-by: Antoni Baum <[email protected]>

Print whole exception

337f2bd

Signed-off-by: Antoni Baum <[email protected]>

Be more conservative with tests

e59bef9

Signed-off-by: Antoni Baum <[email protected]>

remove

0df1aed

Signed-off-by: Antoni Baum <[email protected]>

symlink

4b4444a

Signed-off-by: Antoni Baum <[email protected]>

Fix

ac670ef

Signed-off-by: Antoni Baum <[email protected]>

Update BUILD

aba09b8

krfricke requested changes Oct 13, 2022

View reviewed changes

.buildkite/pipeline.build.yml Outdated Show resolved Hide resolved

ci/run/bazel_sharding/bazel_sharding.py Outdated Show resolved Hide resolved

ci/run/bazel_sharding/tests/test_bazel_sharding.py Show resolved Hide resolved

krfricke added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 13, 2022

Merge branch 'master' into more_intelligent_bazel_sharding

59a8071

Yard1 added 3 commits October 13, 2022 14:27

Move to pipeline test

0ee4e6c

Signed-off-by: Antoni Baum <[email protected]>

Remove symlink

09cebec

Signed-off-by: Antoni Baum <[email protected]>

Add more comments

d9c3350

Signed-off-by: Antoni Baum <[email protected]>

Yard1 requested a review from krfricke October 13, 2022 15:02

Yard1 added 3 commits October 13, 2022 15:08

Fix

0f8d352

Signed-off-by: Antoni Baum <[email protected]>

Tweak

c60886f

Signed-off-by: Antoni Baum <[email protected]>

Nits

156b588

Signed-off-by: Antoni Baum <[email protected]>

Yard1 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 13, 2022

krfricke reviewed Oct 13, 2022

View reviewed changes

Yard1 added 2 commits October 13, 2022 21:40

Merge branch 'master' into more_intelligent_bazel_sharding

4a68f85

Apply feedback from code review

c765ed3

Signed-off-by: Antoni Baum <[email protected]>

Yard1 requested a review from krfricke October 13, 2022 22:12

Nit

3ea05ff

Signed-off-by: Antoni Baum <[email protected]>

krfricke approved these changes Oct 13, 2022

View reviewed changes

Yard1 added 3 commits October 13, 2022 23:26

Fix part 1

abfa8f5

Signed-off-by: Antoni Baum <[email protected]>

Fix part 2

feaac62

Signed-off-by: Antoni Baum <[email protected]>

Update test_bazel_sharding.py

f935b11

krfricke approved these changes Oct 14, 2022

View reviewed changes

krfricke merged commit d1aa560 into ray-project:master Oct 14, 2022

Yard1 deleted the more_intelligent_bazel_sharding branch October 14, 2022 15:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Make bazel sharding for parallel buildkite more intelligent #29221

[CI] Make bazel sharding for parallel buildkite more intelligent #29221

Yard1 commented Oct 11, 2022 •

edited

Loading

krfricke left a comment

Yard1 commented Oct 13, 2022 •

edited

Loading

Yard1 commented Oct 13, 2022

krfricke commented Oct 13, 2022

krfricke left a comment

krfricke left a comment

[CI] Make bazel sharding for parallel buildkite more intelligent #29221

[CI] Make bazel sharding for parallel buildkite more intelligent #29221

Conversation

Yard1 commented Oct 11, 2022 • edited Loading

Why are these changes needed?

Related issue number

Checks

krfricke left a comment

Choose a reason for hiding this comment

Yard1 commented Oct 13, 2022 • edited Loading

Yard1 commented Oct 13, 2022

krfricke commented Oct 13, 2022

krfricke left a comment

Choose a reason for hiding this comment

krfricke left a comment

Choose a reason for hiding this comment

Yard1 commented Oct 11, 2022 •

edited

Loading

Yard1 commented Oct 13, 2022 •

edited

Loading