-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI] Make bazel sharding for parallel buildkite more intelligent #29221
[CI] Make bazel sharding for parallel buildkite more intelligent #29221
Conversation
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally this is great, a few highlevel comments:
- The code requires more comments and documentation. It's hard to see what is happening, and we want to make it easy for future maintainers to understand the code quickly.
- You don't need to fully document code that you didn't change, but if you have any pointers you can quickly add that would be helpful
- For structure, let's create a
ci/tests
directory where we put the BUILD file and a subdirectory for test bazel sharding test (so it contains the mock buildfile) - Let's either keep the original bazel-sharding.py file or remove the symlink
I'll do a closer review of the new algorithm once there are more comments so I can dive into this a bit easier.
Thanks!
In terms of structure, I tried to make the
I imagine that could all be fixed eventually but I didn't want to make multiple seemingly unrelated changes in this PR, therefore I opted for the structure shown here. We can either refactor that in a followup, or try to make it work right now but I think it would be best to leave it for later. The file also had to be renamed and dash replaced to underscore to make it importable in Python. I left the symlink as to not break any code that may depend on it, but I see now that it is only used in 2 places right now so it's probably fine to just replace it there as well. Why would we want to keep the old file btw? If someone wants the older one, they can just grab it off github. Having two scripts would be confusing. |
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
I added some comments, PTAL @krfricke |
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
I meant instead of the one in the sub directory. Let's just remove the symlink then - taking another look now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the documentation, it helps me a lot to understand what is going on there.
The code and algorithm look good to me. I think we could unit test parts of the algorithm more (e.g. add_rule_to_best_shard()
, or rule.actual_timeout_s
), as we currently only have end to end tests for this.
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
Signed-off-by: Antoni Baum <[email protected]>
Signed-off-by: Antoni Baum <[email protected]>
…-project#29221) This PR implements two changes to our `bazel-sharding.py` script, used for determining which bazel tests to run on each instance when buildkite parallelism is used: * An ability to filter tests before they are sharded, using the same logic as `bazel test`. This is done by specifying the `--tag_filters` argument, eg. `--tag_filters=air,-gpu`. If we filter tests with `bazel test` *after* they are sharded, we can end up with imbalanced shards as eg. all tests we want to filter out are assigned to one shard. This feature is enabled for Serve tests and it will be required for the changes I want to make to AIR CI. * A new algorithm to balance the shards, finally implementing what that comment was asking for all this time. Instead of trying to assign the same number of tests (which have variable timeouts) to each shard, the new algorithm tries to make sure each shard will finish in more or less the same time. This is achieved through a simple but good enough heuristic. The old algorithm can still be accessed through the `--sharding_strategy` argument. Those two changes do cause the complexity of the script to increase, necessitating proper testing. In order to facilitate that, this PR also adds a basic buildkite test harness for CI tools/scripts. After this PR is merged, the next step will be to move most of our manually parallelized jobs to use buildkite parallelism with the new logic here. Signed-off-by: Antoni Baum <[email protected]> Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Antoni Baum [email protected]
Why are these changes needed?
This PR implements two changes to our
bazel-sharding.py
script, used for determining which bazel tests to run on each instance when buildkite parallelism is used:bazel test
. This is done by specifying the--tag_filters
argument, eg.--tag_filters=air,-gpu
. If we filter tests withbazel test
after they are sharded, we can end up with imbalanced shards as eg. all tests we want to filter out are assigned to one shard. This feature is enabled for Serve tests and it will be required for the changes I want to make to AIR CI.--sharding_strategy
argument.Those two changes do cause the complexity of the script to increase, necessitating proper testing. In order to facilitate that, this PR also adds a basic buildkite test harness for CI tools/scripts.
After this PR is merged, the next step will be to move most of our manually parallelized jobs to use buildkite parallelism with the new logic here.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.