Add bastion_tier label in bastion code #544
axlearn/cloud/gcp/jobs/tpu_runner.py
Outdated
labels = {}
if reserved is None:
    reserved_tier = gcp_settings("reserved_tpu", default=False, required=False)
else:
    reserved_tier = reserved
if reserved_tier:
    labels["vmtier"] = "reserved"
else:
    labels["vmtier"] = "spot"
Suggested change
- labels = {}
- if reserved is None:
-     reserved_tier = gcp_settings("reserved_tpu", default=False, required=False)
- else:
-     reserved_tier = reserved
- if reserved_tier:
-     labels["vmtier"] = "reserved"
- else:
-     labels["vmtier"] = "spot"
+ if reserved is None:
+     reserved = gcp_settings("reserved_tpu", default=False, required=False)
+ labels = {"bastion_tier": "reserved" if reserved else "spot"}
Not sure if the vmtier label is arbitrary?
@markblee I made the code changes as suggested. Custom label names are user-defined; "bastion_tier" makes more sense as a label name in this context.
Will defer to @markblee for review.
Thanks -- for future PRs, feel free to "re-request" review when comments are addressed.
LGTM
Do you have an example metric graph handy to show that this label information is used to distinguish between reserved and spot usage?
In Metrics Explorer, you'd see the labels under "User metadata labels", which can be used in a filter or group-by for the query to create charts. Note that the image was captured before I changed the label name from vmtier to bastion_tier.
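As a rough sketch of the filtering described above, the snippet below builds a Cloud Monitoring filter string that selects the utilization metric for one tier. The helper name `tier_filter` is hypothetical, and the `metadata.user_labels.bastion_tier` selector is an assumption about how the custom VM label is exposed in the filter language; verify it against the labels actually shown in Metrics Explorer.

```python
# Metric named in the PR description.
METRIC_TYPE = "tpu.googleapis.com/accelerator/tensorcore_utilization"


def tier_filter(tier: str) -> str:
    """Build a filter string for one tier (hypothetical helper).

    Assumes the custom VM label is queryable as
    metadata.user_labels.bastion_tier.
    """
    if tier not in ("reserved", "spot"):
        raise ValueError(f"unknown tier: {tier!r}")
    return (
        f'metric.type = "{METRIC_TYPE}" '
        f'AND metadata.user_labels.bastion_tier = "{tier}"'
    )


print(tier_filter("spot"))
```

The same label key could be used as the group-by field to plot reserved and spot usage as separate series.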
198aaa1 to 11830ff
Co-authored-by: Mark Lee <[email protected]>
Head branch was pushed to by a user without write access
Overview
When using the tpu.googleapis.com/accelerator/tensorcore_utilization metric, there is no way to break usage down by VM tier (i.e. reserved vs spot), because that information is not captured in the existing filtering labels. However, the tier information can be injected through VM custom labels, which show up as user labels in the filters and can be used to group by tier. The code change in this PR injects a custom label, bastion_tier, for TPU VMs provisioned via QRM. The custom label supports two values: reserved and spot.
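The label selection can be sketched as a small standalone function. This is only an illustration of the logic discussed in the review, not the PR's actual code: `resolve_tier_labels` is a hypothetical name, and the plain `settings` dict is a stand-in for the project's `gcp_settings` lookup.

```python
from typing import Optional


def resolve_tier_labels(reserved: Optional[bool], settings: dict) -> dict:
    """Return the custom VM labels for a TPU (hypothetical helper).

    `settings` stands in for the gcp_settings lookup used in the PR.
    """
    if reserved is None:
        # Fall back to the configured default when the caller did not specify.
        reserved = settings.get("reserved_tpu", False)
    return {"bastion_tier": "reserved" if reserved else "spot"}


print(resolve_tier_labels(None, {}))
print(resolve_tier_labels(None, {"reserved_tpu": True}))
print(resolve_tier_labels(False, {"reserved_tpu": True}))
```

An explicit `reserved` argument takes precedence over the configured default, which is only consulted when the argument is `None`.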
Testing
Unit Testing
e2e Testing (manual)
Run
BASTION_TIER=1 axlearn gcp tpu start --name=$USER-test --tpu_type=v4-8 -- python3 -c "'import jax; print(jax.devices())'"
to start an axlearn job on a TPU that is provisioned through a queued resource.
Run
gcloud alpha compute tpus tpu-vm describe $USER-test --zone us-central2-b
to check the TPU VM info, which should list the bastion_tier label. If BASTION_TIER=0, it reads bastion_tier: reserved.