[Train] Llama 2 workspace template release tests #37871
New file (@@ -0,0 +1,23 @@):

```yaml
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west-2

head_node_type:
  name: head_node_type
  instance_type: g5.48xlarge
  resources:
    custom_resources:
      large_cpu_mem: 1

worker_node_types:
  - name: gpu_worker
    instance_type: g5.48xlarge
    min_workers: 3
    max_workers: 3
    use_spot: false

aws:
  TagSpecifications:
    - ResourceType: "instance"
      Tags:
        - Key: ttl-hours
          Value: '24'
```
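These compute configs pull the cloud ID from the environment via a `{{env["ANYSCALE_CLOUD_ID"]}}` placeholder. A minimal sketch of how such a placeholder could be rendered before the YAML is parsed (the `render_env_placeholders` helper is hypothetical, not part of the release-test tooling):

```python
import re


def render_env_placeholders(text: str, env: dict) -> str:
    """Replace {{env["VAR"]}} placeholders with values from the given mapping."""
    pattern = re.compile(r'\{\{env\["([A-Za-z0-9_]+)"\]\}\}')
    return pattern.sub(lambda m: env[m.group(1)], text)


template = 'cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}\nregion: us-west-2'
rendered = render_env_placeholders(template, {"ANYSCALE_CLOUD_ID": "cld_123"})
print(rendered.splitlines()[0])  # cloud_id: cld_123
```

In the real pipeline the values would come from `os.environ`; a plain mapping is used here so the sketch is self-contained.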
New file (@@ -0,0 +1,29 @@):

```yaml
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west-2

head_node_type:
  name: head_node_type
  instance_type: g5.48xlarge
  resources:
    custom_resources:
      large_cpu_mem: 1

worker_node_types:
  - name: large_gpu_worker
    instance_type: g5.48xlarge
    min_workers: 2
    max_workers: 2
    use_spot: false

  - name: medium_gpu_worker
    instance_type: g5.24xlarge
    min_workers: 2
    max_workers: 2
    use_spot: false

aws:
  TagSpecifications:
    - ResourceType: "instance"
      Tags:
        - Key: ttl-hours
          Value: '24'
```
Review comment: nit: new line
New file (@@ -0,0 +1,27 @@):

```yaml
# 1 g5.16xlarge + 15 g5.4xlarge --> 16 GPUs, 256G RAM on trainer and 64G RAM on workers
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west-2

head_node_type:
  name: head_node
  instance_type: g5.16xlarge
  resources:
    custom_resources:
      large_cpu_mem: 1

worker_node_types:
  - name: worker_node
    instance_type: g5.4xlarge
    min_workers: 15
    max_workers: 15
    use_spot: false
    resources:
      custom_resources:
        medium_cpu_mem: 1

aws:
  TagSpecifications:
    - ResourceType: "instance"
      Tags:
        - Key: ttl-hours
          Value: '24'
```
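The sizing arithmetic in that config's comment can be checked directly (assuming, as the comment implies, one GPU per g5.16xlarge head and per g5.4xlarge worker):

```python
# Sizing from the config comment: 1 g5.16xlarge head + 15 g5.4xlarge workers,
# each carrying a single GPU (per the comment's stated total of 16 GPUs).
head_gpus = 1
worker_count = 15       # min_workers == max_workers == 15 in the config
gpus_per_worker = 1

total_gpus = head_gpus + worker_count * gpus_per_worker
print(total_gpus)  # 16
```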
New file (@@ -0,0 +1,21 @@):

```yaml
cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
region: us-west1
allowed_azs:
  - us-west1-b

head_node_type:
  name: head_node_type
  instance_type: n1-highmem-64-nvidia-k80-12gb-1
  resources:
    custom_resources:
      large_cpu_mem: 1

worker_node_types:
  - name: gpu_worker
    instance_type: n1-standard-16-nvidia-k80-12gb-1
    min_workers: 15
    max_workers: 15
    use_spot: false
    resources:
      custom_resources:
        medium_cpu_mem: 1
```
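All of the worker pools above set `min_workers` equal to `max_workers`, which makes the clusters fixed-size (no autoscaling) so release-test timings are reproducible. A small illustration over a hand-transcribed dict mirroring the GCP pool above (the dict and variable names are illustrative, not part of the test harness):

```python
# Hand-transcribed copy of the GCP worker pool from the config above.
worker_node_types = [
    {
        "name": "gpu_worker",
        "instance_type": "n1-standard-16-nvidia-k80-12gb-1",
        "min_workers": 15,
        "max_workers": 15,
        "use_spot": False,
    },
]

# A pool is fixed-size when its min and max worker counts match.
fixed_size = all(w["min_workers"] == w["max_workers"] for w in worker_node_types)
total_workers = sum(w["min_workers"] for w in worker_node_types)
print(fixed_size, total_workers)  # True 15
```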
Modified file:

```diff
@@ -10,4 +10,5 @@ libjemalloc-dev
 libosmesa6-dev
 patchelf
 unzip
-zip
+zip
+libaio1
```
Modified file:

```diff
@@ -13,3 +13,11 @@ transformers
 torch
 torchtext
 torchvision
+bitsandbytes
+wandb
+pytorch-lightning
+protobuf<3.21.0
+torchmetrics
+lm_eval
+tiktoken
+sentencepiece
```
Review comment on lines +16 to +23:

> @can-anyscale So, all release tests are running with the same Docker image now? If I want to add a package to one release test, do I need to add it to the common BYOD requirements file? Is there a doc where I can read about this?

Reply:

> Yes. Though on a PR it would still use your specified cluster_env; builds from master currently use the BYOD image. @can-anyscale will deprecate the use of cluster envs soon for both PRs and master builds.
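One entry in the shared requirements list carries a version pin, `protobuf<3.21.0`. How such an upper bound behaves can be sketched with a stdlib-only version comparison (simplified: it handles plain dotted release versions, not pre-releases or local version segments):

```python
def version_tuple(version: str) -> tuple:
    """Parse a dotted release version like '3.20.3' into a comparable tuple.

    Simplified sketch: numeric dotted versions only, no pre-release handling.
    """
    return tuple(int(part) for part in version.split("."))


# The requirements file pins protobuf<3.21.0: strictly below the bound.
upper_bound = version_tuple("3.21.0")
print(version_tuple("3.20.3") < upper_bound)  # True  (satisfies the pin)
print(version_tuple("3.21.0") < upper_bound)  # False (excluded by '<')
```

Real resolvers follow PEP 440 semantics (e.g. via the `packaging` library); this tuple comparison only illustrates why `3.21.0` itself is excluded by a strict `<` pin.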
Review comment: nit: new line