
[Train] Simplify llama 2 workspace template #38444

Merged: 9 commits into ray-project:master on Aug 15, 2023

Conversation

kouroshHakha (Contributor)

Why are these changes needed?

This PR makes the scripts simpler:

  1. Removed the need for the prepare_node step by downloading the model as part of the training function.
  2. Added a script to create job submission YAMLs.
  3. Simplified Ray Dataset creation by reading the JSON file directly into a Ray Dataset.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
kouroshHakha and others added 3 commits August 14, 2023 21:16
if pargs.cluster_env_build_id:
    cluster_env_config_kwargs.update(build_id=pargs.cluster_env_build_id)

base_cmd = f"chmod +x ./run_llama_ft.sh && ./run_llama_ft.sh --size={pargs.size}"
pcmoritz (Contributor) commented Aug 15, 2023:
I wonder if you know where in the submission chain the executable permissions get dropped -- in the repo they are there so the chmod +x shouldn't be needed :)

If we know where it gets dropped, it might be worth fixing that!
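If the executable bit is being dropped at commit time rather than somewhere in the submission chain, git can record it explicitly in the index. A self-contained sketch (the temp repo and file contents are illustrative, not from this PR):

```shell
set -e
repo=$(mktemp -d)
cd "$repo" && git init -q .
printf '#!/bin/bash\necho ok\n' > run_llama_ft.sh
git add run_llama_ft.sh

# Without +x on disk, git stores mode 100644 and fresh checkouts
# lose the executable bit, forcing a chmod at run time.
git ls-files --stage run_llama_ft.sh

# Record the executable bit explicitly in the index.
git update-index --chmod=+x run_llama_ft.sh
git ls-files --stage run_llama_ft.sh   # now shows 100755
```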

kouroshHakha (Author) replied:
Yeah, I don't know. I tried it without chmod first and it errored. Any clue where this might be coming from?

pcmoritz (Contributor) left a review:
I'm probably not the right person to review but very nice change!

pcmoritz commented Aug 15, 2023

Do you know about the following pattern:

Say you have a file job.template with

name: ${JOB_NAME}
cloud_id: ${ANYSCALE_CLOUD_ID}
entrypoint: ./run_llama_ft.sh --size=${SIZE}

Then you can run SIZE=7b JOB_NAME=myjob envsubst < job.template > job.yaml. Not as nice with the defaults as your script create_job_yaml.py but a little more explicit and a very useful pattern :)

kouroshHakha commented Aug 15, 2023

> Do you know about the following pattern:
>
> Say you have a file job.template with
>
> name: ${JOB_NAME}
> cloud_id: ${ANYSCALE_CLOUD_ID}
> entrypoint: ./run_llama_ft.sh --size=${SIZE}
>
> Then you can run SIZE=7b JOB_NAME=myjob envsubst < job.template > job.yaml. Not as nice with the defaults as your script create_job_yaml.py but a little more explicit and a very useful pattern :)

I wasn't aware of this pattern, no. But like you said, the default values make the current solution nice. Could this support something like ${SIZE}|"7b"?

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
pcmoritz commented Aug 15, 2023

The way I would do it is to say in the README that the user can run

SIZE=7b JOB_NAME=llama2-$SIZE envsubst < job.template > job.yaml

check an appropriate version of job.template into the repo, and note that 7b can be replaced with 13b or 70b; that's basically a default :)

It also teaches them this neat trick, which is useful outside of Anyscale YAMLs too (envsubst is a shell command).
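For readers who want the defaults as well: Python's stdlib string.Template uses the same ${VAR} syntax, so the envsubst pattern can be emulated with explicit fallbacks (a sketch; the defaults dict and the cld_123 placeholder are illustrative, not values from this PR):

```python
import os
from string import Template

# The job.template contents from the comment above.
template = """name: ${JOB_NAME}
cloud_id: ${ANYSCALE_CLOUD_ID}
entrypoint: ./run_llama_ft.sh --size=${SIZE}
"""

# Fallbacks applied when an environment variable is unset, which
# plain envsubst cannot do on its own. Values here are placeholders.
defaults = {"SIZE": "7b", "JOB_NAME": "llama2-7b", "ANYSCALE_CLOUD_ID": "cld_123"}
values = {key: os.environ.get(key, fallback) for key, fallback in defaults.items()}

rendered = Template(template).substitute(values)
print(rendered)
```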

matthewdeng (Contributor) left a review:
Nice clean up!

Comment on lines +174 to +177
with FileLock(lock_file):
    download_model(
        model_id=model_id, bucket_uri=bucket_uri, s3_sync_args=["--no-sign-request"]
    )
Do you need some sort of de-dup logic to only download the model if it doesn't exist here?

kouroshHakha (Author) replied:
The dedup logic is handled by the AWS CLI: it's a sync operation, so if the files already exist it won't re-download them. It's smarter than a simple existence check; it also detects file changes.
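The PR itself leans on aws s3 sync for idempotence, but for downloaders without sync semantics the same lock-plus-dedup idea can be sketched with the stdlib (fcntl is Unix-only; download_once and the marker file are illustrative names, not the PR's actual code):

```python
import fcntl
import os
import tempfile

def download_once(lock_path, marker_path, download):
    """Run `download` at most once across concurrent workers on one node."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # block until the lock is free
        try:
            if not os.path.exists(marker_path):
                download()
                open(marker_path, "w").close()  # record completion
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Demo: the second call sees the marker file and skips the download.
workdir = tempfile.mkdtemp()
calls = []
for _ in range(2):
    download_once(
        os.path.join(workdir, "model.lock"),
        os.path.join(workdir, "model.done"),
        lambda: calls.append("downloaded"),
    )
print(calls)  # -> ['downloaded']
```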

@@ -559,9 +562,10 @@ def main():
}
)

- train_ds = create_ray_dataset(args.train_path)
+ # Read data
+ train_ds = ray.data.read_json(args.train_path)
nice!
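ray.data.read_json accepts newline-delimited JSON like the fine-tuning data here; a stdlib-only sketch of the file shape it expects (the input/output fields are illustrative, not the template's actual schema):

```python
import json
import os
import tempfile

# Write a tiny JSONL file shaped like fine-tuning data.
rows = [{"input": "Q1", "output": "A1"}, {"input": "Q2", "output": "A2"}]
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# ray.data.read_json(path) would build a Dataset from these rows;
# here we parse them with the stdlib to show the record shape.
records = [json.loads(line) for line in open(path)]
print(records)  # two dicts, matching `rows`
```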

Comment on lines +32 to +37
parser.add_argument("--compute-config", type=str, help="Path to the compute config")
parser.add_argument(
    "--cluster-env-build-id",
    type=str,
    help="The build-id of the cluster env to use",
)
Are there default values for these? (The README update indicates that there should be)

kouroshHakha (Author) replied:

The default is None for this one.
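To confirm the behavior discussed here: argparse falls back to None when no default= is supplied and the flag is omitted. A runnable sketch reproducing the two arguments from the diff:

```python
import argparse

parser = argparse.ArgumentParser()
# Mirrors the flags shown in the diff; with no `default=`,
# argparse uses None when the flag is not passed.
parser.add_argument("--compute-config", type=str, help="Path to the compute config")
parser.add_argument(
    "--cluster-env-build-id",
    type=str,
    help="The build-id of the cluster env to use",
)

args = parser.parse_args([])  # no flags passed on the command line
print(args.compute_config, args.cluster_env_build_id)  # None None
```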


@kouroshHakha kouroshHakha merged commit ab06452 into ray-project:master Aug 15, 2023
23 of 25 checks passed
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
* Remove the need for prepare_node stuff by enabling the downloading as part of the training function
* Added a script to create job submission yamls
* Simplified the ray dataset creation by directly reading the json file into a ray dataset.
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: harborn <[email protected]>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
kouroshHakha added a commit that referenced this pull request Aug 18, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023