Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bastion_tier label in bastion code #544

Merged
merged 5 commits into from
Jul 25, 2024

Conversation

eshen1991
Copy link
Contributor

@eshen1991 eshen1991 commented Jun 27, 2024

Overview

When using tpu.googleapis.com/accelerator/tensorcore_utilization metric, there is no way to break down by VM tiers(i.e. reserved vs spot) as the information is not captured in the existing filtering labels. However the tier information can be injected with VM custom labels, which show up as user labels in the filtering and can be used for grouping by the tiers. The code changes in this PR injects a custom label, vmtier, for TPU VM provisioned via QRM. The custom label supports two values: reserved and spot.

Testing

Unit Testing

(axlearn) ➜  axlearn git:(vm-tier-label-2) pytest axlearn/cloud/gcp/jobs/tpu_runner_test.py -v | grep test_start
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerJobTest::test_start0 PASSED
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerJobTest::test_start1 PASSED
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerJobTest::test_start2 PASSED
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerJobTest::test_start3 PASSED
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerJobTest::test_start4 PASSED
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerJobTest::test_start5 PASSED
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerJobTest::test_start6 PASSED
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerJobTest::test_start7 PASSED
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerJobTest::test_start8 PASSED
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerMainTest::test_start0 PASSED
axlearn/cloud/gcp/jobs/tpu_runner_test.py::TPURunnerMainTest::test_start1 PASSED

e2e Testing (manual)

  1. run BASTION_TIER=1 axlearn gcp tpu start --name=$USER-test --tpu_type=v4-8 -- python3 -c "'import jax; print(jax.devices())'" to start axlearn job on TPU that is provisioned through queue resource.
  2. Once the queued resource state is active and run gcloud alpha compute tpus tpu-vm describe $USER-test --zone us-central2-b to check the TPU VM info which should have lines as below:
labels:
  bastion_tier: spot 
schedulingConfig:
  spot: true

if BASTION_TIER=0, bastion_tier: reserved.

Comment on lines 308 to 316
labels = {}
if reserved is None:
reserved_tier = gcp_settings("reserved_tpu", default=False, required=False)
else:
reserved_tier = reserved
if reserved_tier:
labels["vmtier"] = "reserved"
else:
labels["vmtier"] = "spot"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
labels = {}
if reserved is None:
reserved_tier = gcp_settings("reserved_tpu", default=False, required=False)
else:
reserved_tier = reserved
if reserved_tier:
labels["vmtier"] = "reserved"
else:
labels["vmtier"] = "spot"
if reserved is None:
reserved = gcp_settings("reserved_tpu", default=False, required=False)
labels = {"bastion_tier": "reserved" if reserved else "spot"}

Not sure if vmtier label is arbitrary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markblee made the code changes as suggested. Custom label names are user defined. "bastion_tier" makes more sense as a label name in the context.

@eshen1991 eshen1991 requested a review from ruomingp as a code owner July 1, 2024 17:30
@eshen1991 eshen1991 changed the title Add vm tier label in bastion code Add bastion_tier label in bastion code Jul 1, 2024
Copy link
Contributor

@ruomingp ruomingp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will defer to @markblee for review.

Copy link
Contributor

@markblee markblee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks -- for future PRs, feel free to "re-request" review when comments are addressed.

axlearn/cloud/gcp/jobs/tpu_runner.py Outdated Show resolved Hide resolved
axlearn/cloud/gcp/jobs/tpu_runner.py Outdated Show resolved Hide resolved
@markblee markblee requested a review from Ethanlm July 17, 2024 23:47
Copy link
Contributor

@Ethanlm Ethanlm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Do you have an example metric graph handy to show that this label information is used to distinguish between reserved vs spot usages?

@eshen1991
Copy link
Contributor Author

LGTM

Do you have an example metric graph handy to show that this label information is used to distinguish between reserved vs spot usage

In the Metric Explore, you'd see the labels under "User metadata labels", which can be used in filter or groupBy for the query to create charts. Note that image was captured before I changed the label name from vmtier to bastion_tier

image

eshen1991 and others added 2 commits July 18, 2024 13:40
Co-authored-by: Mark Lee <[email protected]>
Co-authored-by: Mark Lee <[email protected]>
@markblee markblee enabled auto-merge July 18, 2024 21:49
auto-merge was automatically disabled July 19, 2024 18:26

Head branch was pushed to by a user without write access

@eshen1991 eshen1991 requested a review from markblee July 19, 2024 18:28
@eshen1991
Copy link
Contributor Author

eshen1991 commented Jul 19, 2024

@markblee I somehow had the pre-commit turned off. I ran pre-commit and fixed the lint issues that broke the circleci pre-commit phase. Please review the latest commit, 1911f1b.

@markblee markblee added this pull request to the merge queue Jul 25, 2024
Merged via the queue into apple:main with commit 9ac0178 Jul 25, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants