[Train] Llama 2 workspace template release tests #37871

Merged 51 commits on Jul 28, 2023
Commits (51, all by kouroshHakha)

3da40e9  [Train] LLM fine-tuning workspace template fix custom resources (#37745)  (Jul 25, 2023)
c259150  added llama-2 70b scripts  (Jul 26, 2023)
cf999c4  wip  (Jul 26, 2023)
4d39318  Merge branch 'master' of github.com:ray-project/ray into llama-70b-ft  (Jul 26, 2023)
dc25e03  added release tests for 7 and 13B  (Jul 26, 2023)
16ae886  updated README  (Jul 26, 2023)
12f4b37  updated readme  (Jul 26, 2023)
16be8c1  wip  (Jul 26, 2023)
d88f97a  updated scripts  (Jul 26, 2023)
7485d3f  wip  (Jul 26, 2023)
055b065  wip  (Jul 26, 2023)
88cac09  better lamma-7b settings  (Jul 26, 2023)
32def4f  1. Fix readme typo, 2. fixed evaluation  (Jul 26, 2023)
485168c  fixed typo in release tests  (Jul 26, 2023)
9744a13  update readme  (Jul 26, 2023)
4055100  updating cluster_end  (Jul 26, 2023)
cc709dc  fixing release test  (Jul 27, 2023)
159241b  temp changing concurrency group  (Jul 27, 2023)
db52ef6  test the shell changes  (Jul 27, 2023)
53d5514  added other shells  (Jul 27, 2023)
05b0a0f  reverting activating release tests  (Jul 27, 2023)
57f402c  reverting concurrency  (Jul 27, 2023)
4b46723  lint  (Jul 27, 2023)
c6cef39  updated docker  (Jul 27, 2023)
6929bb8  reverting the random stuff  (Jul 27, 2023)
aa673cc  lint  (Jul 27, 2023)
45b59a7  update the shell to one  (Jul 27, 2023)
f61b019  code format  (Jul 27, 2023)
b30c4e4  format  (Jul 27, 2023)
e7d6c0e  Revert "reverting activating release tests"  (Jul 27, 2023)
f8330d2  Revert "reverting the random stuff"  (Jul 27, 2023)
25c7db9  Revert "reverting concurrency"  (Jul 27, 2023)
270facd  moved the testing cluster env  (Jul 27, 2023)
04c974d  removed cloud ids from the compute configs  (Jul 27, 2023)
e0481e5  added testing compute configs that include cloud_ids  (Jul 27, 2023)
49f2477  compute configs repointed  (Jul 27, 2023)
1dacda9  Merge branch 'master' into llama-2-release-test  (Jul 27, 2023)
35f76ba  white space removal  (Jul 27, 2023)
59b927a  testing the path stuff  (Jul 27, 2023)
558213e  byod switching  (Jul 27, 2023)
ae2d09e  updated the compiled byod stuff  (Jul 27, 2023)
a7192bb  wip  (Jul 28, 2023)
54c70c9  wip  (Jul 28, 2023)
5702c96  wip  (Jul 28, 2023)
6100736  lint  (Jul 28, 2023)
6f1a5e6  wip  (Jul 28, 2023)
b61a352  wip  (Jul 28, 2023)
49f0b13  wip  (Jul 28, 2023)
2035342  reverting concurrency  (Jul 28, 2023)
123b77e  1. cu117->cu118 2. team: train->ml  (Jul 28, 2023)
fc680c9  lint  (Jul 28, 2023)
Files changed
@@ -517,7 +517,8 @@ def main():
         "env_vars": {
             "HF_HOME": "/mnt/local_storage/.cache/huggingface",
             "TUNE_RESULT_DIR": os.environ["TUNE_RESULT_DIR"],
-        }
+        },
+        "working_dir": ".",
     }
 )
@@ -63,7 +63,10 @@ def download_model_files_on_all_nodes(hf_model_id: str):
 if __name__ == "__main__":

     ray.init(
-        runtime_env={"env_vars": {"HF_HOME": "/mnt/local_storage/.cache/huggingface"}}
+        runtime_env={
+            "env_vars": {"HF_HOME": "/mnt/local_storage/.cache/huggingface"},
+            "working_dir": ".",
+        }
     )

     pargs = _parse_args()
@@ -37,7 +37,7 @@ def run(cmd: str):
     parser.add_argument("args", nargs="*", type=str, help="string args to function")
     args = parser.parse_args()

-    ray.init()
+    ray.init(runtime_env={"working_dir": "."})
     if args.function not in globals():
         raise ValueError(f"{args.function} doesn't exist")
     fn = globals()[args.function]
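
For context on what the working_dir changes above do: setting working_dir in the runtime_env tells Ray to upload the local project directory to the cluster and use it as the working directory for every task and actor, so code that reads files by relative path behaves the same on remote nodes as it does locally. A minimal sketch of the behavior (not code from this PR; the file name is hypothetical):

import ray

# Upload the current directory to the cluster; each worker runs inside the
# uploaded copy, so relative paths resolve identically on every node.
ray.init(runtime_env={"working_dir": "."})

@ray.remote
def read_local_file() -> str:
    # train_config.yaml stands in for any file shipped alongside the script.
    with open("train_config.yaml") as f:
        return f.read()

print(ray.get(read_local_file.remote()))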
@@ -1,12 +1,9 @@
 # See https://hub.docker.com/r/anyscale/ray for full list of
 # available Ray, Python, and CUDA versions.
-base_image: "anyscale/ray:2.6.1-py39-cu117"
+base_image: anyscale/ray:nightly-py39-cu118

 env_vars: {}

-debian_packages: [
-  libaio1
-]
+debian_packages:
+  - libaio1

 python:
   pip_packages: [
@@ -30,4 +27,7 @@ python:
   ]
   conda_packages: []

-post_build_cmds: []
+post_build_cmds:
+  # Install Ray
+  - pip3 uninstall -y ray || true && pip3 install -U {{ env["RAY_WHEELS"] | default("ray") }}
+  - {{ env["RAY_WHEELS_SANITY_CHECK"] | default("echo No Ray wheels sanity check") }}
@@ -0,0 +1,23 @@
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west-2

head_node_type:
name: head_node_type
instance_type: g5.48xlarge
resources:
custom_resources:
large_cpu_mem: 1

worker_node_types:
- name: gpu_worker
instance_type: g5.48xlarge
min_workers: 3
max_workers: 3
use_spot: false

aws:
TagSpecifications:
- ResourceType: "instance"
Tags:
- Key: ttl-hours
Value: '24'
Collaborator:

nit: new line
@@ -0,0 +1,29 @@
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west-2

head_node_type:
name: head_node_type
instance_type: g5.48xlarge
resources:
custom_resources:
large_cpu_mem: 1

worker_node_types:
- name: large_gpu_worker
instance_type: g5.48xlarge
min_workers: 2
max_workers: 2
use_spot: false

- name: medium_gpu_worker
instance_type: g5.24xlarge
min_workers: 2
max_workers: 2
use_spot: false

aws:
TagSpecifications:
- ResourceType: "instance"
Tags:
- Key: ttl-hours
Value: '24'
Collaborator:

nit: new line
@@ -0,0 +1,27 @@
# 1 g5.16xlarge + 15 g5.4xlarge --> 16 GPUs, 256G RAM on trainer and 64G RAM on workers
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west-2

head_node_type:
name: head_node
instance_type: g5.16xlarge
resources:
custom_resources:
large_cpu_mem: 1

worker_node_types:
- name: worker_node
instance_type: g5.4xlarge
min_workers: 15
max_workers: 15
use_spot: false
resources:
custom_resources:
medium_cpu_mem: 1

aws:
TagSpecifications:
- ResourceType: "instance"
Tags:
- Key: ttl-hours
Value: '24'
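
The large_cpu_mem and medium_cpu_mem entries in these compute configs are Ray custom resources: each node type advertises the label, and any task or actor that requests it is scheduled only onto a matching node. A minimal sketch of how a script can target them (not code from this PR; the function is hypothetical):

import ray

ray.init()

# Requesting a sliver of large_cpu_mem pins this task to the head node,
# the only node type above that advertises that custom resource.
@ray.remote(num_cpus=1, resources={"large_cpu_mem": 0.001})
def run_on_big_memory_node() -> str:
    import socket
    return socket.gethostname()

print(ray.get(run_on_big_memory_node.remote()))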
@@ -0,0 +1,21 @@
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west1
allowed_azs:
- us-west1-b

head_node_type:
name: head_node_type
instance_type: n1-highmem-64-nvidia-k80-12gb-1
resources:
custom_resources:
large_cpu_mem: 1

worker_node_types:
- name: gpu_worker
instance_type: n1-standard-16-nvidia-k80-12gb-1
min_workers: 15
max_workers: 15
use_spot: false
resources:
custom_resources:
medium_cpu_mem: 1
release/ray_release/byod/requirements_debian_byod.txt (2 additions, 1 deletion)
@@ -10,4 +10,5 @@ libjemalloc-dev
 libosmesa6-dev
 patchelf
 unzip
-zip
+zip
+libaio1
release/ray_release/byod/requirements_ml_byod_3.9.in (8 additions)
@@ -13,3 +13,11 @@ transformers
 torch
 torchtext
 torchvision
+bitsandbytes
+wandb
+pytorch-lightning
+protobuf<3.21.0
+torchmetrics
+lm_eval
+tiktoken
+sentencepiece
Contributor (on lines +16 to +23):

@can-anyscale So, are all release tests running with the same Docker image now? If I want to add a package to one release test, do I need to add it to the common BYOD requirements file? Is there a doc where I can read about this?

Contributor Author:

Yes, though on a PR it would still use your specified cluster_env; builds from master currently use the BYOD image. @can-anyscale will deprecate the use of cluster envs soon for both PRs and master builds.
BYOD has a couple of benefits:

  1. You don't have to build the envs when launching release tests, which means a faster time to failure, if any.
  2. It makes tests more reliable because versions stay consistent. The best-case scenario is if we don't pin anything in the BYOD requirements files.
