Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Train] Llama 2 workspace template release tests #37871

Merged
merged 51 commits into from
Jul 28, 2023

Conversation

kouroshHakha
Copy link
Contributor

Why are these changes needed?

This enables release tests for Llama-2 workspace templates

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

kouroshHakha and others added 30 commits July 25, 2023 21:21
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
@kouroshHakha
Copy link
Contributor Author

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
@kouroshHakha
Copy link
Contributor Author

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
@kouroshHakha
Copy link
Contributor Author

@justinvyu @can-anyscale This needs your approvals.

Copy link
Collaborator

@aslonnie aslonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(leaving this to @can-anyscale )

Copy link
Collaborator

@can-anyscale can-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# See https://hub.docker.com/r/anyscale/ray for full list of
# available Ray, Python, and CUDA versions.
base_image: "anyscale/ray:2.6.1-py39-cu117"
base_image: anyscale/ray:nightly-py39-cu117
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cu118? for consistency with byod

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

- ResourceType: "instance"
Tags:
- Key: ttl-hours
Value: '24'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line

- ResourceType: "instance"
Tags:
- Key: ttl-hours
Value: '24'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line

@@ -16,7 +16,9 @@
)

aws_gpu_cpu_to_concurrency_groups = [
Condition(min_gpu=9, max_gpu=-1, min_cpu=0, max_cpu=-1, group="large-gpu", limit=4),
Condition(
min_gpu=9, max_gpu=-1, min_cpu=0, max_cpu=-1, group="large-gpu", limit=100
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

working_dir: workspace_templates/04_finetuning_llms_with_deepspeed
python: "3.9"
frequency: nightly-3x
team: train
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ml?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Comment on lines +16 to +23
bitsandbytes
wandb
pytorch-lightning
protobuf<3.21.0
torchmetrics
lm_eval
tiktoken
sentencepiece
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@can-anyscale So, all release tests are running with the same docker image now? If I want to add a package to one release test, I need to add it to the common byod requirement file? Is there a doc where I can read about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Tho on a PR it would still use your specified cluster_env but on the builds from master it is currently using the byod. @can-anyscale will deprecate the use of cluster envs soon for both PRs and master builds.
BYOD has a couple of benefits:

  1. You don't have to build the envs when launching release tests which means faster time to failure if any
  2. It will make tests more reliable due to consistency of versions. The best case scenario is if we don't pin anything in in the byod requirements files.

@kouroshHakha kouroshHakha added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jul 28, 2023
@kouroshHakha kouroshHakha merged commit 0a079dd into ray-project:master Jul 28, 2023
3 of 26 checks passed
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants