Eval the command string from XPK for GPU script #640

Merged (1 commit) on May 14, 2024

Conversation

michelle-yooh (Collaborator)

No description provided.

@@ -112,7 +112,6 @@ load_parameters_path=${base_output_directory}/${decode_ckpt_run_id}/checkpoints/
attention=dot_product ici_tensor_parallelism=${ici_tensor_parallelism} steps=50 \
metrics_file=/tmp/${run_id}_metrics.txt async_checkpointing=false max_target_length=128 per_device_batch_size=1 \
quantization=${quantization} \
${model_params} \
michelle-yooh (author):
With the tiny model params, the test fails on A3 nodes with this error: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: the requested functionality is not supported

Defaulting to the base 1B params somehow avoids the error, and it doesn't weaken the test.

Reviewer (Collaborator):

Did you cut a ticket, etc for this with HLO? The purpose of this exercise is to be squashing bugs...

michelle-yooh (author):

Yes, b/339473974


set -uex

# This script reflects the list of GPU tests in .github/workflows/UnitTests.yml
Reviewer (Collaborator):

Should we be merging this? I don't understand. Don't we want to model our tests as one test per step? (That makes them more isolated and reproducible.)

michelle-yooh (author):

Yes, I agree. I first thought it'd be too tedious and verbose to add these tests to XLML as separate tests, since I would have to write a script for each test. But I think I found a more concise way to do it. Please take a look at the new change and this PR: GoogleCloudPlatform/ml-auto-solutions#283

rwitten removed their assignment on May 9, 2024
@@ -145,7 +145,7 @@ resolve_coordinator_ip
set -e

PIDS=()
${COMMAND} &
Reviewer (Collaborator):

this seems right

michelle-yooh (author):

Yes, this makes it possible to chain commands as usual when we pass a command string to XPK.
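
For context, a minimal sketch of why evaluating the command string matters; COMMAND and its value here are illustrative stand-ins, not the actual contents of gpu_multi_process_run.sh:

#!/bin/bash
# COMMAND stands in for the command string XPK passes to the script.
COMMAND='echo step-one && echo step-two'

# Without eval, word splitting makes "&&" a literal argument, so the whole
# string is handed to the first word (echo) instead of running as two commands.
${COMMAND} &
wait            # prints: step-one && echo step-two

# With eval, the shell re-parses the string, so operators like && and ; work
# the way they would on an interactive command line.
eval "${COMMAND}" &
wait            # prints: step-one, then step-two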

#!/bin/bash

# This script provides a convenient set of default configs used for testing decode.py on GPU

Reviewer (Collaborator):

How will test_decode_gpu.sh and test_train_gpu.sh be used? They seem kind of weird but I can see why they'd be useful as well!

michelle-yooh (author):

They make the test commands more concise by providing common default values for testing.

They will be used like in this line. I think they can also be used in our unit tests.
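
For illustration, a hedged sketch of what such a base script could look like; the variable names, defaults, and paths below are assumptions, not the actual contents of test_decode_gpu.sh:

#!/bin/bash
# Hypothetical base script for GPU decode tests: common defaults live here so
# each test command only has to supply what differs.
set -uex

# Illustrative defaults; the real script's names and values may differ.
ATTENTION=${ATTENTION:-dot_product}
STEPS=${STEPS:-50}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-1}

python3 MaxText/decode.py MaxText/configs/base.yml \
  attention=${ATTENTION} steps=${STEPS} \
  per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
  "$@"   # any extra key=value overrides supplied by the caller

A test command could then be as short as, say, STEPS=10 bash test_decode_gpu.sh quantization=int8, with everything else coming from the defaults.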

Reviewer (Collaborator):

Will test_decode_gpu.sh and test_train_gpu.sh be specific to GPU, or generic for both TPU and GPU? If the latter, shall we move them to the /end_to_end/ directory?

Reviewer (Collaborator):

Agreed, the files seem kind of weird at first, but your explanation makes sense. If you want to re-use the same set of config options, I guess the convenience is worth it. I like the name base_test_*, but this is fine too.

michelle-yooh (author):

I also like the names you suggested. I renamed the scripts.

michelle-yooh (author):

I actually figured out another solution, which is to keep these base scripts as a string template in the XLML repo. If that looks good, I will remove the scripts from this PR. The change in gpu_multi_process_run.sh is still required.

Reviewer (Collaborator):

Ya, I like this template in XLML, defined as close as possible to where it is used.

michelle-yooh changed the title from "Add a script for OSS GPU tests" to "Eval the command string from XPK for GPU script" on May 13, 2024
copybara-service bot merged commit 87c6430 into main on May 14, 2024
13 checks passed
copybara-service bot deleted the yooh/add_oss_tests branch on May 14, 2024 at 17:23