
[Train] Simplify llama 2 workspace template #38444

Merged: 9 commits into ray-project:master on Aug 15, 2023

Conversation

kouroshHakha (Contributor)

Why are these changes needed?

This PR makes the scripts simpler:

  1. Removed the need for the prepare_node step by downloading the model as part of the training function.
  2. Added a script to create job submission YAMLs.
  3. Simplified Ray Dataset creation by reading the JSON file directly into a Ray Dataset.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
kouroshHakha and others added 3 commits August 14, 2023 21:16
if pargs.cluster_env_build_id:
    cluster_env_config_kwargs.update(build_id=pargs.cluster_env_build_id)

base_cmd = f"chmod +x ./run_llama_ft.sh && ./run_llama_ft.sh --size={pargs.size}"
pcmoritz (Contributor) commented Aug 15, 2023:
I wonder if you know where in the submission chain the executable permissions get dropped -- in the repo they are there so the chmod +x shouldn't be needed :)

If we know where it gets dropped, it might be worth fixing that!
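If the executable bit is being dropped at commit time rather than somewhere in the submission chain, git can record it explicitly in the index. A self-contained sketch (the temp repo and file contents are illustrative, not from this PR):

```shell
set -e
repo=$(mktemp -d)
cd "$repo" && git init -q .
printf '#!/bin/bash\necho ok\n' > run_llama_ft.sh
git add run_llama_ft.sh

# Without +x on disk, git stores mode 100644 and fresh checkouts
# lose the executable bit, forcing a chmod at run time.
git ls-files --stage run_llama_ft.sh

# Record the executable bit explicitly in the index.
git update-index --chmod=+x run_llama_ft.sh
git ls-files --stage run_llama_ft.sh   # now shows 100755
```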

kouroshHakha (Author) replied:
Yeah, I don't know. I tried it without chmod first and it errored. Any clue where this might be coming from?

pcmoritz (Contributor) left a review:
I'm probably not the right person to review but very nice change!

pcmoritz commented Aug 15, 2023

Do you know about the following pattern:

Say you have a file job.template with

name: ${JOB_NAME}
cloud_id: ${ANYSCALE_CLOUD_ID}
entrypoint: ./run_llama_ft.sh --size=${SIZE}

Then you can run SIZE=7b JOB_NAME=myjob envsubst < job.template > job.yaml. Not as nice with the defaults as your script create_job_yaml.py but a little more explicit and a very useful pattern :)

kouroshHakha commented Aug 15, 2023

> Do you know about the following pattern:
>
> Say you have a file job.template with
>
> name: ${JOB_NAME}
> cloud_id: ${ANYSCALE_CLOUD_ID}
> entrypoint: ./run_llama_ft.sh --size=${SIZE}
>
> Then you can run SIZE=7b JOB_NAME=myjob envsubst < job.template > job.yaml. Not as nice with the defaults as your script create_job_yaml.py but a little more explicit and a very useful pattern :)

I wasn't aware of this pattern, no. But like you said, the default values make the current solution nice. Could this support something like ${SIZE}|"7b"?

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
pcmoritz commented Aug 15, 2023

The way I would do it is to say in the README that the user can run

SIZE=7b JOB_NAME=llama2-$SIZE envsubst < job.template > job.yaml

check an appropriate version of job.template into the repo, and note that 7b can be replaced with 13b or 70b; that's basically a default :)

It also teaches them this neat trick, which is useful outside of Anyscale YAMLs too (envsubst is a shell command).
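For readers who want the defaults as well: Python's stdlib string.Template uses the same ${VAR} syntax, so the envsubst pattern can be emulated with explicit fallbacks (a sketch; the defaults dict and the cld_123 placeholder are illustrative, not values from this PR):

```python
import os
from string import Template

# The job.template contents from the comment above.
template = """name: ${JOB_NAME}
cloud_id: ${ANYSCALE_CLOUD_ID}
entrypoint: ./run_llama_ft.sh --size=${SIZE}
"""

# Fallbacks applied when an environment variable is unset, which
# plain envsubst cannot do on its own. Values here are placeholders.
defaults = {"SIZE": "7b", "JOB_NAME": "llama2-7b", "ANYSCALE_CLOUD_ID": "cld_123"}
values = {key: os.environ.get(key, fallback) for key, fallback in defaults.items()}

rendered = Template(template).substitute(values)
print(rendered)
```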

matthewdeng (Contributor) left a review:
Nice clean up!

Comment on lines +174 to +177
with FileLock(lock_file):
    download_model(
        model_id=model_id, bucket_uri=bucket_uri, s3_sync_args=["--no-sign-request"]
    )
Do you need some sort of de-dup logic to only download the model if it doesn't exist here?

kouroshHakha (Author) replied:
The dedup logic is handled by the AWS CLI: it's a sync operation, so if the files already exist it won't re-download them. It's smarter than a simple existence check; it also detects file changes.
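The PR itself leans on aws s3 sync for idempotence, but for downloaders without sync semantics the same lock-plus-dedup idea can be sketched with the stdlib (fcntl is Unix-only; download_once and the marker file are illustrative names, not the PR's actual code):

```python
import fcntl
import os
import tempfile

def download_once(lock_path, marker_path, download):
    """Run `download` at most once across concurrent workers on one node."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # block until the lock is free
        try:
            if not os.path.exists(marker_path):
                download()
                open(marker_path, "w").close()  # record completion
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Demo: the second call sees the marker file and skips the download.
workdir = tempfile.mkdtemp()
calls = []
for _ in range(2):
    download_once(
        os.path.join(workdir, "model.lock"),
        os.path.join(workdir, "model.done"),
        lambda: calls.append("downloaded"),
    )
print(calls)  # -> ['downloaded']
```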

@@ -559,9 +562,10 @@ def main():
}
)

- train_ds = create_ray_dataset(args.train_path)
+ # Read data
+ train_ds = ray.data.read_json(args.train_path)
nice!
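ray.data.read_json accepts newline-delimited JSON like the fine-tuning data here; a stdlib-only sketch of the file shape it expects (the input/output fields are illustrative, not the template's actual schema):

```python
import json
import os
import tempfile

# Write a tiny JSONL file shaped like fine-tuning data.
rows = [{"input": "Q1", "output": "A1"}, {"input": "Q2", "output": "A2"}]
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# ray.data.read_json(path) would build a Dataset from these rows;
# here we parse them with the stdlib to show the record shape.
records = [json.loads(line) for line in open(path)]
print(records)  # two dicts, matching `rows`
```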

Comment on lines +32 to +37
parser.add_argument("--compute-config", type=str, help="Path to the compute config")
parser.add_argument(
    "--cluster-env-build-id",
    type=str,
    help="The build-id of the cluster env to use",
)
Are there default values for these? (The README update indicates that there should be)

kouroshHakha (Author) replied:

The default is None for this one.
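To confirm the behavior discussed here: argparse falls back to None when no default= is supplied and the flag is omitted. A runnable sketch reproducing the two arguments from the diff:

```python
import argparse

parser = argparse.ArgumentParser()
# Mirrors the flags shown in the diff; with no `default=`,
# argparse uses None when the flag is not passed.
parser.add_argument("--compute-config", type=str, help="Path to the compute config")
parser.add_argument(
    "--cluster-env-build-id",
    type=str,
    help="The build-id of the cluster env to use",
)

args = parser.parse_args([])  # no flags passed on the command line
print(args.compute_config, args.cluster_env_build_id)  # None None
```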


@kouroshHakha kouroshHakha merged commit ab06452 into ray-project:master Aug 15, 2023
23 of 25 checks passed
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
* Remove the need for prepare_node stuff by enabling the downloading as part of the training function
* Added a script to create job submission yamls
* Simplified the ray dataset creation by directly reading the json file into a ray dataset.
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: harborn <[email protected]>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
kouroshHakha added a commit that referenced this pull request Aug 18, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023