Eval the command string from XPK for GPU script #640
Conversation
@@ -112,7 +112,6 @@ load_parameters_path=${base_output_directory}/${decode_ckpt_run_id}/checkpoints/
attention=dot_product ici_tensor_parallelism=${ici_tensor_parallelism} steps=50 \
metrics_file=/tmp/${run_id}_metrics.txt async_checkpointing=false max_target_length=128 per_device_batch_size=1 \
quantization=${quantization} \
-${model_params} \
With this tiny param, the test fails on A3 nodes with the error message: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: the requested functionality is not supported
Defaulting to the base 1B param somehow fixes the issue, and it doesn't harm the test.
Did you cut a ticket for this with the HLO? The purpose of this exercise is to squash bugs...
Yes, b/339473974
set -uex

# This script reflects the list of GPU tests in .github/workflows/UnitTests.yml
Should we be merging this? I don't understand. Don't we want to model our tests as one test per step? (That makes them more isolated and reproducible.)
Yes, I agree. I first thought it would be too tedious and verbose to add these tests to XLML as separate tests, since I would have to write a script for each test. But I think I found a more concise way to do it. Please take a look at the new change and this PR: GoogleCloudPlatform/ml-auto-solutions#283
Force-pushed from b07fdf9 to f47fbfb
@@ -145,7 +145,7 @@ resolve_coordinator_ip
set -e

PIDS=()
${COMMAND} &
this seems right
Yes, this makes it possible to chain commands as usual when we pass a command string to XPK.
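A minimal sketch of why eval-ing the command string matters for chaining. The `COMMAND` value and the echoed steps here are illustrative placeholders, not the actual XPK variables:

```shell
#!/bin/bash
# Hypothetical illustration: plain expansion vs. eval of a command string.
COMMAND='echo step1 && echo step2'

# Plain expansion: after word splitting, "&&" is passed to echo as a literal
# argument, so the shell never treats it as a command separator.
direct_output="$(${COMMAND})"

# eval re-parses the string, so "&&" chains the two commands as intended.
eval_output="$(eval "${COMMAND}")"

echo "direct: ${direct_output}"
echo "eval: ${eval_output}"
```

With plain expansion the first line prints the literal `step1 && echo step2`; with eval, the two echoes run as separate chained commands.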
end_to_end/gpu/test_decode_gpu.sh (outdated)

#!/bin/bash

# This script provides a convenient set of default configs used for testing decode.py on GPU
How will test_decode_gpu.sh and test_train_gpu.sh be used? They seem kind of weird, but I can see why they'd be useful as well!
They make the test commands more concise by providing common default values for testing.
They will be used like in this line.
I think they can also be used in our unit tests.
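The pattern being described could look roughly like this. The variable names and default values below are illustrative, not the actual MaxText test parameters:

```shell
#!/bin/bash
# Hypothetical base test script: shared defaults that individual tests
# override via environment variables. Names and values are made up.
set -ue

STEPS="${STEPS:-50}"
PER_DEVICE_BATCH_SIZE="${PER_DEVICE_BATCH_SIZE:-1}"
ATTENTION="${ATTENTION:-dot_product}"

echo "steps=${STEPS} per_device_batch_size=${PER_DEVICE_BATCH_SIZE} attention=${ATTENTION}"
```

A specific test would then only state what differs from the defaults, e.g. `STEPS=10 bash base_test_decode.sh`.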
Will test_decode_gpu.sh and test_train_gpu.sh be specific to GPU, or generic for TPU and GPU? If it's the latter, shall we move them to the /end_to_end/ directory?
Agreed the files seem kind of weird at first, but your explanation makes sense. If you want to re-use the same set of config options, I guess the convenience is worth it. I like the name base_test_*, but this is fine too.
I also like the names you suggested. I renamed the scripts.
I actually figured out another solution, which is to keep these base scripts as a string template in the XLML repo. If that looks good, I will remove the scripts from this PR. The change in gpu_multi_process_run.sh is still required.
Yeah, I like having this template in XLML, defined as close as possible to where it is used.
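As a rough sketch, a command template kept next to the test definition might be filled in per test like this. The template text and parameter values are made up for illustration:

```shell
#!/bin/bash
# Hypothetical sketch: build each test command from a shared string template
# instead of a checked-in base script. Template contents are illustrative.
template='python3 MaxText/%s MaxText/configs/base.yml steps=%s'

# shellcheck disable=SC2059  # the format string is intentionally a variable
decode_cmd="$(printf "${template}" decode.py 50)"
echo "${decode_cmd}"
```

Each test then only supplies the entry point and the values that differ, keeping the template defined right where it is consumed.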
Force-pushed from efff9b0 to beea296
Force-pushed from beea296 to 4f526a8