Eval the command string from XPK for GPU script #640
Conversation
@@ -112,7 +112,6 @@ load_parameters_path=${base_output_directory}/${decode_ckpt_run_id}/checkpoints/
attention=dot_product ici_tensor_parallelism=${ici_tensor_parallelism} steps=50 \
metrics_file=/tmp/${run_id}_metrics.txt async_checkpointing=false max_target_length=128 per_device_batch_size=1 \
quantization=${quantization} \
-${model_params} \
With this tiny param, the test fails on A3 nodes with the error message: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: the requested functionality is not supported
Defaulting to the base 1B param somehow fixes the issue, and it doesn't harm the test.
Did you cut a ticket for this with the HLO? The purpose of this exercise is to squash bugs...
Yes, b/339473974
set -uex

# This script reflects the list of GPU tests in .github/workflows/UnitTests.yml
Should we be merging this? I don't understand. Don't we want to model our tests as one test per step? (That makes them more isolated and reproducible.)
Yes, I agree. I first thought it would be too tedious and verbose to add these tests to XLML as separate tests, since I would have to write a script for each test. But I think I found a more concise way to do it. Please take a look at the new change and this PR: GoogleCloudPlatform/ml-auto-solutions#283
Force-pushed from b07fdf9 to f47fbfb
@@ -145,7 +145,7 @@ resolve_coordinator_ip
set -e

PIDS=()
${COMMAND} &
this seems right
Yes, this makes it possible to chain commands as usual when we pass a command string to XPK.
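A minimal sketch of why eval-ing the command string matters for chaining. The `COMMAND` value and the echoed steps here are illustrative placeholders, not the actual XPK variables:

```shell
#!/bin/bash
# Hypothetical illustration: plain expansion vs. eval of a command string.
COMMAND='echo step1 && echo step2'

# Plain expansion: after word splitting, "&&" is passed to echo as a literal
# argument, so the shell never treats it as a command separator.
direct_output="$(${COMMAND})"

# eval re-parses the string, so "&&" chains the two commands as intended.
eval_output="$(eval "${COMMAND}")"

echo "direct: ${direct_output}"
echo "eval: ${eval_output}"
```

With plain expansion the first line prints the literal `step1 && echo step2`; with eval, the two echoes run as separate chained commands.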
end_to_end/gpu/test_decode_gpu.sh (outdated)

#!/bin/bash

# This script provides a convenient set of default configs used for testing decode.py on GPU
How will test_decode_gpu.sh and test_train_gpu.sh be used? They seem kind of weird, but I can see why they'd be useful as well!
They make the test commands more concise by providing common default values for testing.
They will be used like in this line.
I think they can also be used in our unit tests.
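The pattern being described could look roughly like this. The variable names and default values below are illustrative, not the actual MaxText test parameters:

```shell
#!/bin/bash
# Hypothetical base test script: shared defaults that individual tests
# override via environment variables. Names and values are made up.
set -ue

STEPS="${STEPS:-50}"
PER_DEVICE_BATCH_SIZE="${PER_DEVICE_BATCH_SIZE:-1}"
ATTENTION="${ATTENTION:-dot_product}"

echo "steps=${STEPS} per_device_batch_size=${PER_DEVICE_BATCH_SIZE} attention=${ATTENTION}"
```

A specific test would then only state what differs from the defaults, e.g. `STEPS=10 bash base_test_decode.sh`.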
Will test_decode_gpu.sh and test_train_gpu.sh be specific to GPU, or generic for TPU and GPU? If it's the latter, shall we move them to the /end_to_end/ directory?
Agreed the files seem kind of weird at first, but your explanation makes sense. If you want to re-use the same set of config options, I guess the convenience is worth it. I like the name base_test_*, but this is fine too.
I also like the names you suggested. I renamed the scripts.
I actually figured out another solution, which is to keep these base scripts as a string template in the XLML repo. If that looks good, I will remove the scripts from this PR. The change in gpu_multi_process_run.sh is still required.
Yeah, I like having this template in XLML, defined as close as possible to where it is used.
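As a rough sketch, a command template kept next to the test definition might be filled in per test like this. The template text and parameter values are made up for illustration:

```shell
#!/bin/bash
# Hypothetical sketch: build each test command from a shared string template
# instead of a checked-in base script. Template contents are illustrative.
template='python3 MaxText/%s MaxText/configs/base.yml steps=%s'

# shellcheck disable=SC2059  # the format string is intentionally a variable
decode_cmd="$(printf "${template}" decode.py 50)"
echo "${decode_cmd}"
```

Each test then only supplies the entry point and the values that differ, keeping the template defined right where it is consumed.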
Force-pushed from efff9b0 to beea296
Force-pushed from beea296 to 4f526a8