diff --git a/docs/blogs/deeprec_autoscale_cn.md b/docs/blogs/deeprec_autoscale_cn.md index b08f40bcb..4562889dc 100644 --- a/docs/blogs/deeprec_autoscale_cn.md +++ b/docs/blogs/deeprec_autoscale_cn.md @@ -1,6 +1,7 @@ # DLRover: 云上自动扩缩容 DeepRec 分布式训练作业 ## 背景 + 如今,深度学习已广泛应用在搜索、广告、推荐等业务中,这类业务场景普遍有两个特点: 1)训练样本量大,需要分布式训练提升训练速度; 2)模型稀疏,即模型结构中离散特征计算逻辑占比较高。 @@ -73,7 +74,7 @@ Chief 资源预估:取历史作业中所有 worker 的 CPU 消耗和内存消 启动阶段的特殊情况: -1. 没有历史相关作业,比如新用户使用新数据集提交的训练作业。此时,DLRover +1. 没有历史相关作业,比如新用户使用新数据集提交的训练作业。此时,DLRover 采用默认资源配置来启动 PS 和 chief。比如 PSNUM=1,PSCPU=8,PSMEM=8G,chiefCPU=8,chiefMEM=8G。 2. 初始资源导致节点失败,最常见的是,内存不足导致 OOM。当节点发生 OOM 时, DLRover 通过容错机制会自动增加节点的内存来重启节点,直到训练正常运行。 @@ -91,7 +92,7 @@ PS 总的 CPU 使用量(total_PSUsedCPU)和内存使用量 (total_PSUsedMe 这样我们根据 Job 配置的总 CPU 量 limitCPU 可以计算 worker 的数量`workerNUM = limitCPU/trianingCPU`。 - worker CPU 和内存预估:因为 worker 的模型完全相同,所以 CPU 和 内存消耗也是相似的, -新的 `workerCPU = chiefUsedCPU * factor`,` workerMem = chiefUsedMem * factor`, +新的 `workerCPU = chiefUsedCPU * factor`,`workerMem = chiefUsedMem * factor`, factor 为冗余因子,比如1.2。 - PS 数量预估:异步训练中,PS 的 CPU 负载与 worker 数量成正比, @@ -100,7 +101,7 @@ factor 为冗余因子,比如1.2。 - PS 内存预估:PS 存储模型参数,内存使用量并不会随 worker 的增加而增加。 如果模型包含稀疏 embedding,PS 的内存会随着训练的进行而增加,为此 PS 的内存预估分为两种情况: 1. PS 的内存在训练开始后保持稳定,PSMEM= (total_PSUsedMem / PSNUM)* factor,factor 为冗余因子,一般要大于1。 - 2. PS 的内存持续增长,那么 DLRover Brain 会计算 PS 内存随迭代步数的增长率 memRate, + 2. PS 的内存持续增长,那么 DLRover Brain 会计算 PS 内存随迭代步数的增长率 memRate, 然后计算总的 totalPSMEM = memRate * totalStep,则每个 PS 的内存 PSMEM = totalPSMEM / PSNUM。 #### 动态调优阶段 @@ -172,7 +173,6 @@ master 会根据样本索引在数据集切分为很多小的 shard,放入一 向 master 汇报 shard 已完成,master会将shard 从 DOING 队列移除。如果 worker 挂了, master 会将对应的 shard 会重新放入 TODO 队列。 -
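为便于理解上面的 TODO/DOING 队列分发流程,下面给出一段极简的 Python 示意代码(仅为帮助理解的假设性草图,类名与方法名均为示例,并非 DLRover 的真实实现):

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Shard:
    start: int  # 样本起始索引
    end: int    # 样本结束索引(不含)


class ShardManagerSketch:
    """示意用的分片管理器:维护 TODO 与 DOING 两个队列。"""

    def __init__(self, dataset_size, shard_size):
        self.todo = deque(
            Shard(i, min(i + shard_size, dataset_size))
            for i in range(0, dataset_size, shard_size)
        )
        self.doing = {}  # worker_id -> Shard

    def get_shard(self, worker_id):
        # worker 从 TODO 队列领取一个 shard
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[worker_id] = shard
        return shard

    def report_done(self, worker_id):
        # worker 汇报 shard 已完成,将其从 DOING 中移除
        self.doing.pop(worker_id, None)

    def on_worker_failed(self, worker_id):
        # worker 挂掉时,把它未完成的 shard 放回 TODO 队列
        shard = self.doing.pop(worker_id, None)
        if shard is not None:
            self.todo.append(shard)
```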
@@ -191,7 +191,6 @@ master 会将对应的 shard 会重新放入 TODO 队列。 为解决上述问题,DeepRec 新设计了一套支持动态 Embedding 语义的 EmbeddingVariable, 在特征无损训练的同时以最经济的方式使用内存资源。具体可以参考[DeepRec](https://github.com/alibaba/DeepRec)。 - #### 基于 checkpoint 的 PS 弹性扩缩容 PS 架构中,模型的权重是存储在 PS 内存的。如果 PS 变化,模型训练需要将权重重新分配到 PS 上。 @@ -206,7 +205,6 @@ worker 在每一个step之后会运行相关 hook,在 hook 中会向 DLRover m worker-0 会根据新的 PS 集合来构造计算图,更新 session,重新组网, 然后通知新的 PS 加载 checkpoint。最后 worker-0 通知所有的 worker 连接新的 PS 开始训练。 - ## 阿里云 ACK 上验证 DLRover 自动扩缩容 为了验证自动扩缩容的可行性,我们在阿里云的 ACK 上创建了一个小的 Kubernetes 集群。 @@ -220,6 +218,7 @@ dlrover-auto-scale-edljob-chief-0 1/1 Running 0 32s dlrover-auto-scale-edljob-ps-0 1/1 Running 0 32s elasticjob-torch-mnist-dlrover-master 1/1 Running 0 39s ``` + 此时的训练速度约为 30 step/s。大约 3 min 后,DLRover 自动给作业新增了 3 个 worker,速度提升到 100 steps/s 如下所示: @@ -238,4 +237,3 @@ elasticjob-torch-mnist-dlrover-master 1/1 Running 0 6m24s DLRover 支持了 PS 异步训练的自动扩缩容来提升训练速度。下一步我们将针对 DeepRec 的同步训练提供自动扩缩容功能。除了搜推广场景,DLRover 也将探索 foundation model 的分布式训练的自动扩缩容,以提升预训练大模型的训练效率和降低训练成本。 - diff --git a/docs/blogs/stabilize_llm_training_cn.md b/docs/blogs/stabilize_llm_training_cn.md index 6c0513b78..f45c9225d 100644 --- a/docs/blogs/stabilize_llm_training_cn.md +++ b/docs/blogs/stabilize_llm_training_cn.md @@ -2,20 +2,34 @@ ## 背景 -如今大语言模型(LLM)的分布式训练节点规模越来越大,训练耗时长。比如 OpenAI 在 1024 个 NVIDIA A100 GPU 上训练 GPT-3 大约需要 34 天。训练节点越多,耗时越长,训练期间节点故障概率就越大,况且 A100 GPU 的故障率也相对较高。所以大规模训练作业难免会遇到节点故障。据我们在蚂蚁 GPU 训练集群上观察,一个月内,单卡的故障率约8%,那么一天单卡的故障率约为0.27%。常见的故障原因有Xid、ECC、NVLINK error 和 NCCL error 故障等。对于一个千卡训练作业来说,卡故障导致一天内训练失败的概率高达到 93%。所以训练作业几乎每天都会失败。作业失败后,用户需要手动重启作业,运维成本很高。如果用户重启不及时,中间间隔的时间就会导致 GPU 卡空闲,浪费昂贵的算力资源。 -有些故障会导致机器不可用,从而导致可用节点数量不能达到用户指定的数量。这时,训练就不能启动,用户需要手动减少节点数量后重新提交作业。待故障机修复后,用户又需要手动增加作业的节点数来重启作业。这样增大了用户的运维成本,也导致了新节点无法及时加入训练。 -为此,DLRover 在 Kubernetes 上基于 Torch Elastic 开发了弹性训练功能,实现 PyTorch 分布式训练的自动容错和弹性。具体功能如下: +如今大语言模型(LLM)的分布式训练节点规模越来越大,训练耗时长。比如 OpenAI 在 1024 个 +NVIDIA A100 GPU 上训练 GPT-3 大约需要 34 天。训练节点越多,耗时越长,训练期间节点故障概率就越大,况且 +A100 GPU 的故障率也相对较高。所以大规模训练作业难免会遇到节点故障。据我们在蚂蚁 GPU +训练集群上观察,一个月内,单卡的故障率约8%,那么一天单卡的故障率约为0.27%。常见的故障原因有 +Xid、ECC、NVLINK error 和 NCCL error 故障等。对于一个千卡训练作业来说, +卡故障导致一天内训练失败的概率高达到 93%。所以训练作业几乎每天都会失败。作业失败后, +用户需要手动重启作业,运维成本很高。如果用户重启不及时,中间间隔的时间就会导致 GPU 卡空闲,浪费昂贵的算力资源。 +有些故障会导致机器不可用,从而导致可用节点数量不能达到用户指定的数量。这时,训练就不能启动, +用户需要手动减少节点数量后重新提交作业。待故障机修复后,用户又需要手动增加作业的节点数来重启作业。 +这样增大了用户的运维成本,也导致了新节点无法及时加入训练。 +为此,DLRover 在 Kubernetes 上基于 Torch Elastic 开发了弹性训练功能,实现 +PyTorch 分布式训练的自动容错和弹性。具体功能如下: 1. 出现故障后,快速执行节点健康检测,定位故障机并将其隔离,然后重启 Pod 来替换故障节点。 2. 健康检测通过后,重启训练子进程来自动恢复模型训练,无需重启作业或者所有Pod。 3. 节点故障导致可用机器少于作业配置,自动缩容来继续训练。集群新增机器后,自动扩容来恢复节点数量。 4. 
优化FSDP并行训练的模型save/load,支持根据实际卡数reshard 模型参数,缩短checkpoint保存和加载时间。 -在 DLRover 弹性容错应用在蚂蚁大模型训练前,一周内千卡训练运行时间占 60.8%,有效训练时间约 32.9%。有效训练时间 = 模型迭代的步数 * 每步的时间,除此之外,训练运行时间还包括checkpoint 保存时间和训练回退时间等。DLRover 上线后,一周内千卡训练运行时间占比提升至 83.6%,有效训练时间提升至 58.9%。 +在 DLRover 弹性容错应用在蚂蚁大模型训练前,一周内千卡训练运行时间占 60.8%,有效训练时间约 32.9%。 +有效训练时间 = 模型迭代的步数 * 每步的时间,除此之外,训练运行时间还包括checkpoint 保存时间和训练回退时间等。 +DLRover 上线后,一周内千卡训练运行时间占比提升至 83.6%,有效训练时间提升至 58.9%。 ## PyTorch 弹性训练框架 -弹性训练是指在训练过程中可以伸缩节点数量。当前支持 PyTroch 弹性训练的框架有 Torch Elastic 和 Elastic Horovod。二者显著的区别在于节点数量变化后是否需要重启训练子进程来恢复训练。Torch Elastic 感知到新节点加入后会立刻重启所有节点的子进程,集合通信组网,然后从 checkpoint 文件里恢复训练状态来继续训练。而 Elastic Horovod 则是每个训练子进程在每个 step 后检查新节点加入,子进程不退出的情况下重新集合通信组网,然后有rank0将模型广播给所有rank。二者的优劣对比如下: +弹性训练是指在训练过程中可以伸缩节点数量。当前支持 PyTroch 弹性训练的框架有 Torch Elastic 和 Elastic Horovod。 +二者显著的区别在于节点数量变化后是否需要重启训练子进程来恢复训练。Torch Elastic 感知到新节点加入后会立刻重启所有节点的子进程, +集合通信组网,然后从 checkpoint 文件里恢复训练状态来继续训练。而 Elastic Horovod 则是每个训练子进程在每个 step 后检查新节点加入 +,子进程不退出的情况下重新集合通信组网,然后有rank0将模型广播给所有rank。二者的优劣对比如下: | | Torch Elastic | Elastic Horovod | | --- | --- | --- | @@ -25,33 +39,53 @@ | 支持的训练模式 | DDP/FSDP | DDP | | 支持的模型大小 | 大 | 小,只能是单机能存下的 | -通过上述对比可以看出,Torch Elastic 重启训练子进程的方案对用户更加友好,支持更多的分布式训练策略和模型。而FSDP和NCCL是当前大模型分布式训练使用最为广泛的技术。所以 DLRover 选择使用 Torch Elastic 重启子进程的方案来实现 Kubernetes 集群上分布式训练的弹性容错。 +通过上述对比可以看出,Torch Elastic 重启训练子进程的方案对用户更加友好,支持更多的分布式训练策略和模型。 +而FSDP和NCCL是当前大模型分布式训练使用最为广泛的技术。所以 DLRover 选择使用 Torch Elastic 重启子进程的方案来实现 Kubernetes 集群上分布式训练的弹性容错。 ## 集合通信动态组网 -动态组网是指训练进程可以自动根据动态变化的节点数量来组网集合通信,无需固定给各个节点指定集合通信的 rank 和 world size。动态组网是弹性容错训练必须的,因为弹性容错作业中,节点的失败、扩容或者缩容都会导致节点的 rank 和 world size 变化。所以我们无法在作业启动前给节点指定 rank 和 world size。 +动态组网是指训练进程可以自动根据动态变化的节点数量来组网集合通信,无需固定给各个节点指定集合通信的 rank 和 world size。 +动态组网是弹性容错训练必须的,因为弹性容错作业中,节点的失败、扩容或者缩容都会导致节点的 rank 和 world size 变化。 +所以我们无法在作业启动前给节点指定 rank 和 world size。 ### Torch Elastic 动态组网 -Torch Elastic 启动子进程后,所有子进程需要进行集合通信组网。Torch Elastic 使用 Dynamic Rendezvous 机制来协助子进程组网。每个节点上运行一个 ElasticAgent,ElasticAgent 会从一个共享存储中获取作业节点的 host group,然后将自己的 host 加入 group 并同步到共享存储里。这个共享存储当前默认使用 TCPStore。接着,ElasticAgent 不断从共享存储里获取查询 host group,直到 host group 里的节点数量达到最小节点数量 min_nodes 且一段时间内没有变化,即认为所有节点都准备好了。然后,ElasticAgent 就可以从 host group 里获取自己的节点rank (PyTorch 中称为 group rank) 和 world size。这样,ElasticAgent 就可以给拉起的子进程配置 local rank、global rank 和 world size了。有了这些信息,子进程就可以进程集合通信组网。 +Torch Elastic 启动子进程后,所有子进程需要进行集合通信组网。Torch Elastic 使用 Dynamic Rendezvous 机制来协助子进程组网。 +每个节点上运行一个 ElasticAgent,ElasticAgent 会从一个共享存储中获取作业节点的 host group,然后将自己的 host 加入 group +并同步到共享存储里。这个共享存储当前默认使用 TCPStore。接着,ElasticAgent 不断从共享存储里获取查询 host group, +直到 host group 里的节点数量达到最小节点数量 min_nodes 且一段时间内没有变化,即认为所有节点都准备好了。然后, +ElasticAgent 就可以从 host group 里获取自己的节点rank (PyTorch 中称为 group rank) 和 world size。这样, +ElasticAgent 就可以给拉起的子进程配置 local rank、global rank 和 world size了。有了这些信息,子进程就可以进程集合通信组网。 但是使用 Torch Elastic 原生方案中,我们发现一些问题: 1. 节点不能容错。TCPStore 在一个训练节点上,如果该节点挂了,重组网就没法继续了。 -2. 节点 rank 是随机的。 ElasticAgent 同步 host 到共享存储的顺序是随机的,导致节点 rank 的随机。在训练代码中,用户一般会将模型迭代信息输出在 rank-0 的日志里,比如 step、loss 和耗时等。用户只能通过进程日志寻找 rank-0 ,对于多节点的作业,这是比较麻烦的。 -3. Torch Elastic 的动态组网不能控制组网的节点数量。比如 LLM 模型训练中,用户可能会将4个节点作为一个数据预处理的组,那么弹性伸缩需要保证节点数量是4的整数倍。而 Torch Elastic 只要发现有一个新节点加入就会立刻重启训练。 +2. 节点 rank 是随机的。 ElasticAgent 同步 host 到共享存储的顺序是随机的,导致节点 rank 的随机。 +在训练代码中,用户一般会将模型迭代信息输出在 rank-0 的日志里,比如 step、loss 和耗时等。 +用户只能通过进程日志寻找 rank-0 ,对于多节点的作业,这是比较麻烦的。 +3. 
Torch Elastic 的动态组网不能控制组网的节点数量。比如 LLM 模型训练中,
+用户可能会将4个节点作为一个数据预处理的组,那么弹性伸缩需要保证节点数量是4的整数倍。
+而 Torch Elastic 只要发现有一个新节点加入就会立刻重启训练。

### DLRover 动态组网

-针对上面问题,DLRover 重新实现了 PyTorch ElasticAgent 的动态组网模块 RendezvousHandler,利用 ElasticJob 点 master 来协助 PyTorch 组网。master 是一个纯 CPU 节点,不参与训练,稳定性比 GPU 节点高很多。
+针对上面问题,DLRover 重新实现了 PyTorch ElasticAgent 的动态组网模块 RendezvousHandler,
+利用 ElasticJob 的 master 来协助 PyTorch 组网。master 是一个纯 CPU 节点,不参与训练,稳定性比 GPU 节点高很多。
- DLRover ElasticJob 动态组网 -DLRover 的 ElasticJob 在启动 Pod 时会给每个 Pod 一个唯一的编号 Pod ID 并配置到 Pod 的环境变量里。训练节点的 ElasticAgent的 RendezvousHandler 会将自己的编号 Pod ID 和GPU卡数上报给 Master 的 Rendezvous Manager。然后不断从 master 中请求通信 world,即所有节点的信息。master 的 Rendezvous Manager 会将接收到的 node 信息存储到一个列表里。当列表中的节点数量达到可组网的条件后,master 会将通信 world 发送给所有节点。通信 world 会根据 Pod ID 排序,内容如 {0:8, 1:8, 2:8, 3:8} 其中 key 表示 Pod ID,value 为 Pod 的 GPU 卡数。Pod ID 在 world 中的次序即为其 Rank。这样我们就可以固定 Pod ID 最小的为 Rank-0。 -如果用户需要训练节点数量是 N 的整数倍,那边 master 只需要将 world 根据 N 的整数倍裁剪即可。例如 ,训练作业配置了6个节点,由于机器故障导致 Pod-5 失败了,重新拉起的 Pod-6 因为没有资源而 pending。此时,master 收到的节点信息为 {0:8, 1:8, 2:8, 3:8, 4:8}。但是用户要求节点是 2 的整数倍,那么master可以将 Pod-4 从 world 中踢出,然后发送给 Pod-0 到 Pod-3。而 Pod-4 会等着 Pod-6 起来后再加入训练实现扩容。如下图所示: +DLRover 的 ElasticJob 在启动 Pod 时会给每个 Pod 一个唯一的编号 Pod ID 并配置到 Pod 的环境变量里。 +训练节点的 ElasticAgent的 RendezvousHandler 会将自己的编号 Pod ID 和GPU卡数上报给 Master 的 Rendezvous Manager。 +然后不断从 master 中请求通信 world,即所有节点的信息。master 的 Rendezvous Manager 会将接收到的 node +信息存储到一个列表里。当列表中的节点数量达到可组网的条件后,master 会将通信 world 发送给所有节点。通信 world +会根据 Pod ID 排序,内容如 {0:8, 1:8, 2:8, 3:8} 其中 key 表示 Pod ID,value 为 Pod 的 GPU 卡数。 +Pod ID 在 world 中的次序即为其 Rank。这样我们就可以固定 Pod ID 最小的为 Rank-0。 +如果用户需要训练节点数量是 N 的整数倍,那边 master 只需要将 world 根据 N 的整数倍裁剪即可。例如, +训练作业配置了6个节点,由于机器故障导致 Pod-5 失败了,重新拉起的 Pod-6 因为没有资源而 pending。此时, +master 收到的节点信息为 {0:8, 1:8, 2:8, 3:8, 4:8}。但是用户要求节点是 2 的整数倍,那么master可以将 +Pod-4 从 world 中踢出,然后发送给 Pod-0 到 Pod-3。而 Pod-4 会等着 Pod-6 起来后再加入训练实现扩容。如下图所示:
Editor @@ -61,14 +95,17 @@ DLRover 的 ElasticJob 在启动 Pod 时会给每个 Pod 一个唯一的编号 P 训练容错是指训练出现故障后能在无人工介入的情况下快速恢复训练。训练恢复需要如下步骤: - - 定位错误排原因,判断错误是否可以恢复。 - - 启动训练进程加载训练代码,训练进程能重新集合通信组网。 - - 训进程能加载模型导出的 checkpoint 来恢复训练状态。 - - 如果存在故障机,要及时将故障机排除,以便新节点继续调度在故障机。 +- 定位错误排原因,判断错误是否可以恢复。 +- 启动训练进程加载训练代码,训练进程能重新集合通信组网。 +- 训进程能加载模型导出的 checkpoint 来恢复训练状态。 +- 如果存在故障机,要及时将故障机排除,以便新节点继续调度在故障机。 ### DLRover 容错方案 -Torch Elastic 在子进程出错后,无论什么错误会直接重启所有子进程来恢复训练。但是节点故障导致的失败,重启子进程也是没法恢复的,需要在其他机器上启动一个新 Pod。为此 DLRover 提供了进程恢复、Pod 恢复和故障机自动检测机制。对于无故障机的错误,DLRover 重启进程来恢复训练。对于故障机的错误,DLRover 会通知 SRE 隔离故障机并重新拉起 Pod 来替换出错的 Pod,对于正常运行的Pod 重启其训练进程,减少 Pod 调度时间开销。 +Torch Elastic 在子进程出错后,无论什么错误会直接重启所有子进程来恢复训练。但是节点故障导致的失败, +重启子进程也是没法恢复的,需要在其他机器上启动一个新 Pod。为此 DLRover 提供了进程恢复、Pod 恢复和故障机自动检测机制。 +对于无故障机的错误,DLRover 重启进程来恢复训练。对于故障机的错误,DLRover 会通知 SRE 隔离故障机并重新拉起 +Pod 来替换出错的 Pod,对于正常运行的Pod 重启其训练进程,减少 Pod 调度时间开销。 | 恢复训练的步骤 | 没有容错 | DLRover Pod容错 | DLRover进程容错 | | --- | --- | --- | --- | @@ -82,7 +119,13 @@ Torch Elastic 在子进程出错后,无论什么错误会直接重启所有子 ### DLRover 故障机检测 -DLRover 在重启训练子进程前运行一个简单的 allgather 任务来排查故障机。job master 先将所有节点两两划分为多个 world,每个 world 内的节点上执行 allgather 任务并将成功与否上报给 job master。如果有 world 里的 allgather 任务失败,则此 world 的节点为潜在故障机,否则为正常机器。然后开始第二轮测试,master 会将潜在故障机和正常节点再次两两划分 world。每个 world 的节点继续执行 allgather,这样就找到故障节点。比如作业有6个节点,第一轮的划分结果为 [{1,2}, {3,4}, {5,6}], {5, 6}] 执行 allgather 失败了,那么节点5 和 6 就是潜在故障节点。为此第二轮的划分为[{1,2}, {3,5}, {4,6}] 。如果{4,6} 失败了,说明节点6 就是故障节点。然后,DLRover 会重新拉起一个 Pod,替换节点6。 +DLRover 在重启训练子进程前运行一个简单的 allgather 任务来排查故障机。job master 先将所有节点两两划分为多个 +world,每个 world 内的节点上执行 allgather 任务并将成功与否上报给 job master。 +如果有 world 里的allgather 任务失败,则此 world 的节点为潜在故障机,否则为正常机器。 +然后开始第二轮测试,master 会将潜在故障机和正常节点再次两两划分 world。每个 world 的节点继续执行 allgather, +这样就找到故障节点。比如作业有6个节点,第一轮的划分结果为 [{1,2}, {3,4}, {5,6}], {5, 6}] 执行 allgather 失败了, +那么节点5 和 6 就是潜在故障节点。为此第二轮的划分为[{1,2}, {3,5}, {4,6}] 。如果{4,6} 失败了, +说明节点6 就是故障节点。然后,DLRover 会重新拉起一个 Pod,替换节点6。
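上面两轮 allgather 排查的分组与定位逻辑,可以用下面的 Python 草图示意。其中 `run_allgather(group)` 是假设的回调函数,表示在该组节点内执行一次 allgather 并返回是否成功,并非 DLRover 的真实接口:

```python
def pair_up(nodes):
    """把节点两两分组;节点数为奇数时,最后一个节点并入前一组。"""
    groups = [nodes[i:i + 2] for i in range(0, len(nodes), 2)]
    if len(groups) > 1 and len(groups[-1]) == 1:
        groups[-2].extend(groups.pop())
    return groups


def detect_faulty_nodes(nodes, run_allgather):
    # 第一轮:两两分组,allgather 失败的组内节点记为潜在故障机
    suspects, healthy = [], []
    for group in pair_up(nodes):
        if run_allgather(group):
            healthy.extend(group)
        else:
            suspects.extend(group)

    # 第二轮:把每个潜在故障机与一台正常机器重新配对,仍失败者即为故障机
    faulty = []
    for i, node in enumerate(suspects):
        group = [healthy[i % len(healthy)], node] if healthy else [node]
        if not run_allgather(group):
            faulty.append(node)
    return faulty
```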
Editor @@ -90,7 +133,13 @@ DLRover 在重启训练子进程前运行一个简单的 allgather 任务来排 ### DLRover 错误日志收集 -在 PyTorch 分布式训练中,一个节点的进程出错后,Torch Elastic 会停止所有节点的进程。各个进程的日志都是单独存在各自日志文件中。为了找到训练失败是哪个进程出错导致的,我们需要搜索所有进程的日志。这个工作对于千卡作业是十分耗时且繁琐的。为此,我们在 ElasticAgent 中开发了错误日志收集供功能。当 ElasticAgent 发现子进程失败后,后将其错误日志的 message 发送给 job master。job master 会在其日志中展示具体哪个节点的那个进程失败了,以及错误日志。这样用户只需看下 job master 的节点日志就可以定位训练失败原因了。同时我们也支持将错误信息上报给钉钉。 +在 PyTorch 分布式训练中,一个节点的进程出错后,Torch Elastic 会停止所有节点的进程。 +各个进程的日志都是单独存在各自日志文件中。为了找到训练失败是哪个进程出错导致的,我们需要搜索所有进程的日志 +。这个工作对于千卡作业是十分耗时且繁琐的。为此,我们在 ElasticAgent 中开发了错误日志收集供功能。 +当 ElasticAgent 发现子进程失败后,后将其错误日志的 message 发送给 job master。 +job master 会在其日志中展示具体哪个节点的那个进程失败了,以及错误日志。 +这样用户只需看下 job master 的节点日志就可以定位训练失败原因了。同时我们也支持将错误信息上报给钉钉。 + ```json 任务 torch-train 训练进程失败 torch-train-edljob worker-116 restart 0 fails: { "784": { @@ -110,7 +159,9 @@ DLRover 在重启训练子进程前运行一个简单的 allgather 任务来排 ## FSDP 并行的 save/load 优化 -DLRover 弹性容错需要依赖 checkpoint 来恢复模型状态。当前我们的大模型训练采用 FSDP 的并行方式,FSDP 保存 checkpoint 的方案有两种:1. rank0_only :由 RANK-0 节点获取所有的模型参数和优化器状态存入磁盘,2.sharding方式:所有 RANK 各自保存其模型参数和优化器状态。但是这两个方案都没法满足弹性容错训练的需求。 +DLRover 弹性容错需要依赖 checkpoint 来恢复模型状态。当前我们的大模型训练采用 FSDP 的并行方式, +FSDP 保存 checkpoint 的方案有两种:1. rank0_only :由 RANK-0 节点获取所有的模型参数和优化器状态存入磁盘, +2.sharding方式:所有 RANK 各自保存其模型参数和优化器状态。但是这两个方案都没法满足弹性容错训练的需求。 rank0_only: - RANK-0 需要加载所有的模型参数和优化器状态,可能导致 OOM。 @@ -122,16 +173,19 @@ sharding 方式: ### 参数支持 reshard 的 save/load -原始 torch save 是将整个参数进行 pickle,load 时整体进行 unpickle,因此内存会出现峰值。为解决该问题,我们在 ATorch 中将 save 的过程拆开,先生成 safetensors 的 meta data,之后按需逐个序列化每个 tensor,再进行写入。 -在保存时,直接保存每个 rank 上的 flat param,同时保存一份对应的 meta 信息。如下图所示,每个 flat param 中保存了多个 meta 信息,每个 meta 信息代表这个 flat param 中原始参数的 shape 和在 flat param 中的 start 和 end,因此在恢复参数时,只需要按照顺序将所有的 param 找出来,拼接到一起后,再进行 reshape 即可获得原始的参数。 +原始 torch save 是将整个参数进行 pickle,load 时整体进行 unpickle,因此内存会出现峰值。 +为解决该问题,我们在 ATorch 中将 save 的过程拆开,先生成 safetensors 的 meta data,之后按需逐个序列化每个 tensor,再进行写入。 +在保存时,直接保存每个 rank 上的 flat param,同时保存一份对应的 meta 信息。如下图所示, +每个 flat param 中保存了多个 meta 信息,每个 meta 信息代表这个 flat param 中原始参数的 shape 和在 flat param 中的 start 和 end, +因此在恢复参数时,只需要按照顺序将所有的 param 找出来,拼接到一起后,再进行 reshape 即可获得原始的参数。
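这一还原过程可以用如下示意代码表达:根据每个原始参数的 shape,以及它在各个 flat param 中的 start/end 区间,把各段切片按顺序拼接后再 reshape。注意这里的数据结构是为说明而假设的,并非 ATorch 实际的存储格式:

```python
import torch


def restore_param(name, shape, flat_params, metas):
    """按 meta 信息从各个 rank 的 flat param 中还原一个原始参数(示意)。

    flat_params: {rank: 一维 flat param 张量}
    metas: {参数名: [(rank, start, end), ...]},记录该参数在各 flat param 中的切片区间
    """
    pieces = [
        flat_params[rank][start:end]          # 取出该参数落在此 rank 上的一段
        for rank, start, end in metas[name]   # 按保存顺序依次拼接
    ]
    return torch.cat(pieces).reshape(shape)   # 拼接后 reshape 回原始 shape
```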
- FSDP flat param 的逻辑格式 代码示例: + ```python from atorch.utils.fsdp_save_util import save_fsdp_flat_param model = ... # atorch 转换 FSDP 的模型 @@ -147,6 +201,7 @@ ckpt └── flat_param.00001-00002 """ ``` + ```python # init_empty_weights_with_disk_offload 时指定 ckpt 地址,会将模型全部在 meta 上 # 初始化,在 FSDP 转换时按需加载 ckpt 地址 @@ -157,15 +212,21 @@ with init_empty_weights_with_disk_offload(ckpt_path='ckpt'): ### 优化器状态支持 reshard 的save/load -FSDP 并行训练时,优化器是基于 FSDP 转化后的模型创建的,atorch 会配置 FSDP 的 use_orig_param。这时优化器状态的结构与 flat param 结构相同。如果某些参数不在 flat param 中,则优化器状态获取到的参数为空。同时还保存了优化器状态的 meta 信息,为优化器状态的 param group 信息。 +FSDP 并行训练时,优化器是基于 FSDP 转化后的模型创建的,atorch 会配置 FSDP 的 use_orig_param。这时优化器状态的结构与 +flat param 结构相同。如果某些参数不在 flat param 中,则优化器状态获取到的参数为空。同时还保存了优化器状态的 meta 信息,为优化器状态的 param group 信息。
FSDP use_orig_param 的优化器状态的逻辑格式 -因此在保存的时候,优化器状态也是 flatten 为 1D 的数据。在恢复优化器状态时,使用了 FSDP 提供的 `FSDP.shard_full_optim_state_dict`函数,该函数接收的参数为完整的优化器状态和 FSDP wrap 好的模型来重新切分优化器状态。该函数最终调用 `torch.distributed.fsdp._optim_utils._shard_orig_param_state` 函数来切分状态,并且该函数在 torch 内部只有这一处调用,因此 hook 该函数的实现。 -实际在内部实现时,reshard 根据 FSDP 包好的模型来获取优化器状态的数值区间,该区间在 FSDP 内部为intra_param_start_idx,intra_param_end_idx 参数,含义为新的参数在原始 flatten 权重的取值范围。如下图所示,如果由于修改了 rank/wrap 使得 FSDP 的模型产生了变化,则需要重新切分优化器参数。 +因此在保存的时候,优化器状态也是 flatten 为 1D 的数据。在恢复优化器状态时,使用了 FSDP 提供的 `FSDP.shard_full_optim_state_dict`函数, +该函数接收的参数为完整的优化器状态和 FSDP wrap 好的模型来重新切分优化器状态。 +该函数最终调用 `torch.distributed.fsdp._optim_utils._shard_orig_param_state` 函数来切分状态, +并且该函数在 torch 内部只有这一处调用,因此 hook 该函数的实现。 +实际在内部实现时,reshard 根据 FSDP 包好的模型来获取优化器状态的数值区间, +该区间在 FSDP 内部为intra_param_start_idx,intra_param_end_idx 参数,含义为新的参数在原始 flatten 权重的取值范围。 +如下图所示,如果由于修改了 rank/wrap 使得 FSDP 的模型产生了变化,则需要重新切分优化器参数。
Editor @@ -173,6 +234,7 @@ FSDP use_orig_param 的优化器状态的逻辑格式 FSDP 优化器状态 reshard 示意图 代码示例 + ```python from atorch.utils.fsdp_save_util import save_fsdp_optim_param # model, optimizer 均是经过 atorch FSDP 转换的对象 @@ -185,6 +247,7 @@ ckpt └── optim_param.00001-00002 """ ``` + ```python from atorch.utils.fsdp_save_util import ShardOptim sm = ShardOptim("ckpt") @@ -192,9 +255,10 @@ reshard_optim_state = sm.reshard_optim_state_dict(model) optimizer.load_state_dict(reshard_optim_state) ``` -## 弹性容错在千亿级大模型训练的应用效果 +## 弹性容错在千亿级大模型训练的应用效果 -在使用 DLRover 弹性容错之前,Torch 大模型训练只要出错就要重启训练作业。为了及时重启作业,用户写了个程序每隔10min 来检测作业状态。如果失败,就会重启作业。 +在使用 DLRover 弹性容错之前,Torch 大模型训练只要出错就要重启训练作业。为了及时重启作业, +用户写了个程序每隔10min 来检测作业状态。如果失败,就会重启作业。
Editor @@ -202,7 +266,7 @@ optimizer.load_state_dict(reshard_optim_state) 下面对比了训练失败时使用 DLRover 弹性容错前后的耗时。 -| +| | 没有弹性容错 | DLRover 弹性容错 | | | --- | --- | --- | --- | | 训练恢复步骤 | 任何故障 | 机器硬件故障 | 软件故障 | @@ -218,6 +282,7 @@ optimizer.load_state_dict(reshard_optim_state) ## Kubernetes 上提交 GPT 弹性容错作业 1. 在 Kubernetes 集群上部署 DLRover ElasticJob CRD。 + ```python git clone git@github.com:intelligent-machine-learning/dlrover.git cd dlrover/go/operator/ @@ -225,6 +290,7 @@ make deploy IMG=easydl/elasticjob-controller:master ``` 2. 在构造训练镜像的 dockerfile 中安装 dlrover[torch]。 + ```python FROM registry.cn-hangzhou.aliyuncs.com/easydl/dlrover-train:torch201-py38 as base @@ -236,7 +302,10 @@ COPY ./model_zoo ./model_zoo ``` -3. 在 ElasticJob 的container 的 command里使用 dlrover-run 在运行训练脚本。在镜像 registry.cn-hangzhou.aliyuncs.com/easydl/dlrover-train:nanogpt-test 我们已经准备好了代码和训练数据,可以直接用如下 ElasticJob 来提交示例作业。 +3. 在 ElasticJob 的container 的 command里使用 dlrover-run 在运行训练脚本。 +在镜像 registry.cn-hangzhou.aliyuncs.com/easydl/dlrover-train:nanogpt-test +我们已经准备好了代码和训练数据,可以直接用如下 ElasticJob 来提交示例作业。 + ```yaml apiVersion: elastic.iml.github.io/v1alpha1 kind: ElasticJob @@ -276,6 +345,10 @@ spec: ``` -# 总结 & 未来计划 +## 总结 & 未来计划 -DLRover 目前已经在蚂蚁千亿模型训练训练上落地,将GPU故障导致训练暂停时间由 30%降低到了约 12%。我们希望 DLRover 在大规模分布式训练上提供智能化运维功能,降低用户运维成本,提升训练的稳定性。后续我们将介绍蚂蚁在千亿模型训练上的 PyTorch 性能优化方案的扩展包 ATorch,ATorch 旨在提升大规模 GPU 训练的硬件算力效率 HFU (Hardware Flops Utilization) 和训练的稳定性,当前蚂蚁千亿大模型训练使用 Atorch 的 HFU 为 49.6%。我们欢迎不同机构的开发者也能根据自身特点,同我们一起共建 DLRover 项目,推进分布式自动化。 +DLRover 目前已经在蚂蚁千亿模型训练训练上落地,将GPU故障导致训练暂停时间由 30%降低到了约 12%。 +我们希望 DLRover 在大规模分布式训练上提供智能化运维功能,降低用户运维成本,提升训练的稳定性。 +后续我们将介绍蚂蚁在千亿模型训练上的 PyTorch 性能优化方案的扩展包 ATorch,ATorch 旨在提升大规模 +GPU 训练的硬件算力效率 HFU (Hardware Flops Utilization) 和训练的稳定性,当前蚂蚁千亿大模型训练使用 +Atorch 的 HFU 为 49.6%。我们欢迎不同机构的开发者也能根据自身特点,同我们一起共建 DLRover 项目,推进分布式自动化。 diff --git a/docs/deployment/controller.md b/docs/deployment/controller.md index 478bc0c2a..a3108eb4a 100644 --- a/docs/deployment/controller.md +++ b/docs/deployment/controller.md @@ -1,9 +1,13 @@ # Deploy DLRover ElasticJob Controller on a Kubernetes Cluster -Here, we introduce how to deploy the DLRover job controller directly on a Kubernetes cluster step by step. Minikube is optional and primarily used for testing. +Here, we introduce how to deploy the DLRover job controller directly on a +Kubernetes cluster step by step. Minikube is optional and primarily used for testing. ## 1. Preliminary -- Ensure you have [Kubernetes](https://kubernetes.io/docs/home/) installed. If you prefer to use Minikube for testing purposes, make sure to have [Minikube](https://minikube.sigs.k8s.io/docs/start/) installed and run `minikube start`. + +- Ensure you have [Kubernetes](https://kubernetes.io/docs/home/) installed. +If you prefer to use Minikube for testing purposes, make sure to have [Minikube](https://minikube.sigs.k8s.io/docs/start/) +installed and run `minikube start`. ## 3. Deploy Dlrover ElasticJob Controller With Kubectl @@ -16,13 +20,14 @@ $ deployment="git@github.com:intelligent-machine-learning/dlrover/dlrover/go/ope $ kubectl -n dlrover apply -k $deployment ``` -To verify the controller has been deployed, run the command below. The output should show the dlrover-controller-manager pod is running. +To verify the controller has been deployed, run the command below. +The output should show the dlrover-controller-manager pod is running. 
```bash kubectl -n dlrover get pods ``` -``` +```bash NAME READY STATUS RESTARTS AGE pod/dlrover-controller-manager-7dccdf6c4d-grmks 2/2 Running 0 6m46s ``` @@ -38,10 +43,11 @@ Check traning nodes. ```bash kubectl -n dlrover get pods ``` -``` + +```bash NAME READY STATUS RESTARTS AGE pod/dlrover-controller-manager-7dccdf6c4d-grmks 2/2 Running 0 4h49m pod/elasticjob-torch-mnist-dlrover-master 1/1 Running 0 4h42m pod/torch-mnist-edljob-worker-0 1/1 Running 0 4h42m pod/torch-mnist-edljob-worker-1 1/1 Running 0 4h42m -``` \ No newline at end of file +``` diff --git a/docs/deployment/k8s.md b/docs/deployment/k8s.md index a641d2223..88bb515c8 100644 --- a/docs/deployment/k8s.md +++ b/docs/deployment/k8s.md @@ -6,23 +6,23 @@ step by step. ## Create namespace ```shell -$ kubectl create namespace dlrover +kubectl create namespace dlrover ``` -## Deploy MySQL +## Deploy MySQL To create MySQL DB as the store for ELRover ```shell -$ cd dlrover/go/brain/manifests/k8s -$ kubectl apply -f mysql-pv.yaml -$ kubectl apply -f mysql.yaml +cd dlrover/go/brain/manifests/k8s +kubectl apply -f mysql-pv.yaml +kubectl apply -f mysql.yaml ``` Create tables in MySQL ```shell -$ kubectl exec -it mysql-pod-name --namespace dlrover -- bash -$ cd dlrover -$ mysql -uroot -proot < dlrover-tables.sql -``` \ No newline at end of file +kubectl exec -it mysql-pod-name --namespace dlrover -- bash +cd dlrover +mysql -uroot -proot < dlrover-tables.sql +``` diff --git a/docs/design/db-design.md b/docs/design/db-design.md index 84e844976..2b843afed 100644 --- a/docs/design/db-design.md +++ b/docs/design/db-design.md @@ -49,4 +49,4 @@ create table cluster( customized_data mediumtext, // cluster customized data PRIMARY KEY (uid) ) -``` \ No newline at end of file +``` diff --git a/docs/design/dlrover-overview.md b/docs/design/dlrover-overview.md index 4452d3608..614e25476 100644 --- a/docs/design/dlrover-overview.md +++ b/docs/design/dlrover-overview.md @@ -2,11 +2,11 @@ DLRover is an automatic distributed deep learning system. DLRover can help users train their models with minimal efforts. For example, -with DLRover, users need not provide any resource configuration for their -deep learning training jobs. Instead, DLRover can pick up the appropriate resource +with DLRover, users need not provide any resource configuration for their +deep learning training jobs. Instead, DLRover can pick up the appropriate resource configuration for each job smartly and continue to optimize those jobs during their runtime. -DLRover's short-term goal is to support automatic resource configuration for DL training jobs. +DLRover's short-term goal is to support automatic resource configuration for DL training jobs. However, the long-term goal of DLRover is to make the whole deep learning model training completely automatic. @@ -20,14 +20,14 @@ workers. Using allreduce architecture, we need to take account of the increasing communication cost with more workers. It is difficult to configure the appropriate resource with different models. -Model developers (users) have to learn more rather than model training -algorithms when they are using those jobs to train their models. To -run a training job, those users have to specify the required resources for their -this job. Then the Kubernetes cluster can allocate the required resources and +Model developers (users) have to learn more rather than model training +algorithms when they are using those jobs to train their models. 
To +run a training job, those users have to specify the required resources for their +this job. Then the Kubernetes cluster can allocate the required resources and start the job. Unfortunately, we found it is quite an ineffective way to ask the users to take care of the resource configuration. -At first, users are usually the -experts on model design but not training jobs and Kubernetes cluster. It is +At first, users are usually the +experts on model design but not training jobs and Kubernetes cluster. It is not an easy task for them to have the optimal configuration in the first place. Secondly, a training job's resources requirement may vary during its runtime. A static resource configuration usually can not be the optimal one all the time. @@ -39,7 +39,7 @@ users fail to provide the optimal resource configuration for their jobs. We hope to design and implement a system which can free users from resource configuration completely and focus on the model training itself. Without any input (on resource configuration), DLRover can still provide the optimal -resource plan for each training job. Meanwhile, DLRover can optimize the +resource plan for each training job. Meanwhile, DLRover can optimize the performance of training jobs further through resource adjustment when a job is running. @@ -48,13 +48,13 @@ supporting three different modes to satisfy users' requirements. ### Manual Mode -Sometimes users want to explore a single job's performance through manually scaling this +Sometimes users want to explore a single job's performance through manually scaling this job's resources during runtime. DLRover allows users to apply new resource configuration for a running job without restarting the job. ### Single-Job Mode -During DL model development, users usually repeatedly train and test a model before +During DL model development, users usually repeatedly train and test a model before the model reaches a stable status. In this scenario, users only need to run a single job without deploying extra components. However, single-job mode also supports resource auto-configuration for the job. In this mode, auto-scale algorithms are located in the master of the job @@ -66,11 +66,11 @@ not support the fault-tolerance of the master. ### Cluster Mode -In the cluster mode, DLRover handles all training jobs in a cluster and -executes with complete functions. +In the cluster mode, DLRover handles all training jobs in a cluster and +executes with complete functions. -Unlike single-job mode, DLRover in cluster mode has a separate service called -*Brain* which provides resource plans for each running job in the cluster. +Unlike single-job mode, DLRover in cluster mode has a separate service called +*Brain* which provides resource plans for each running job in the cluster. The brain service persists all runtime statistics of jobs into a database. The algorithm can utilize information of all finished and running jobs to optimize the resources of new jobs. After @@ -79,7 +79,6 @@ What's more, the master of a job only executes the resource plans from brain service. When the master fails, DLRover can simply restart a new one and the job can continue to run. - ## Design DLRover consists of four main components: ElasticJob, Elastic Trainer, @@ -99,13 +98,14 @@ launch required Pods and each Pod will start an Elastic Agent on it. During training, the training master of Elastic Trainer dispatches data shards to workers. 
Meanwhile, the Cluster Monitor is monitoring each job's running status (e.g., resource workload of each node) and -cluster status (e.g., idle resources). Those data will be reported to Brain periodically and -Brain persists the data into database. Then based on the job’s running status, +cluster status (e.g., idle resources). Those data will be reported to Brain periodically and +Brain persists the data into database. Then based on the job’s running status, DLRover Brain picks up appropriate algorithms to generate new resource plans and informs Elastic Trainer to start resources adjustment. ### ElasticJob to Support Elastic Scheduling + ElasticJob is a customized k8s controller to support elastic scheduling of Pods for DL training jobs. ElasticJob is responsible to launch/delete Pods on a k8s cluster according to a Scale CRD. @@ -123,7 +123,7 @@ to launch/delete paramter servers and workers. ### Elastic Trainer to Support Auto-scaling of a Single Job -For each training job, there is an Elastic Trainer to manage the job during +For each training job, there is an Elastic Trainer to manage the job during the job's whole life cycle. Elastic Trainer is to: 1. provide dynamic data sharding to support elasticity of a job. @@ -151,7 +151,7 @@ those samples. All shards are placed into a TODO queue. After a worker starts to run, the data input pipeline of a worker will query one shard from Elastic Trainer and read samples by indices in the shard. Meanwhile, Data Shard Manager marks this shard with the -id of the worker and moves the shard from the TODO to the DOING queue. +id of the worker and moves the shard from the TODO to the DOING queue. After a worker consumes samples in the shard and updates parameters in PS, it reports to the training master and queries a new shard. Then Data Shard Manager deletes the finished shard from the DOING queue. @@ -160,12 +160,12 @@ Then Data Shard Manager deletes the finished shard from the DOING queue. Editor
- #### Elasticity of PS Training 1. Worker elasticity. In asynchronous SGD, each PS updates parameters with gradients from a worker independently and does not synchronize with other workers. -Thus, Elastic Trainer can add or remove workers without influencing other workers. After a new worker starts, it connects to all PS and queries shards from Data Shard Manager +Thus, Elastic Trainer can add or remove workers without influencing other workers. +After a new worker starts, it connects to all PS and queries shards from Data Shard Manager and consume shards to compute gradients. If a worker is terminated, Data Shard Manager moves uncompleted shards of this worker back to the TODO queue from the DOING queue. Later the shard can be dispatched to another workers. @@ -178,7 +178,6 @@ parameter servers to the Elastic Agent of all Pods. Then the Elastic Agent will notify the training framework (e.g. TensorFlow) to restart training and restore model paremeters from a checkpoint. - #### Elasticity of AllReduce Training DLRover implements Fault-tolerance of allreduce @@ -196,7 +195,7 @@ Meanwhile, the master watches the event of the failed worker by K8s APIs and re-assign new ranks for alive workers. The oldest worker will get the rank 0 and broadcast its model and optimization states in the memory to other workers. Because the oldest worker certainly has the whole -model at the time of worker fail. Then, the training continues. +model at the time of worker fail. Then, the training continues. 2. Scalable. After new worker starts, it will send a start signal to the master and the master will re-assign ranks with all alive workers. The worker @@ -206,13 +205,13 @@ a new world with the new rank. 3. Fixed batch size. Not like Asynchronous training, the batch size $B$ of synchronous stochastic gradient descent (SGD) is $𝐵 = 𝑁 ∗ 𝐵_𝑚$ . 𝑁 is the number -of workers and 𝐵𝑚 is the size of mini-batch performed by each worker at each step. -However, the batch size of synchronous SGD affects the model accuracy. -So, the model accuracy may fluctuate if the number of workers changes at runtime. +of workers and 𝐵𝑚 is the size of mini-batch performed by each worker at each step. +However, the batch size of synchronous SGD affects the model accuracy. +So, the model accuracy may fluctuate if the number of workers changes at runtime. In order to overcome the challenge, DLRover supports fixed batch size at runtime if the maximum number $N$ of workers is configured. Before the phase of al-reduce, the master assigns the number of mini-batch computations to workers according to -the number $N_0$ of existing workers. The worker 𝑖 will perform $𝑚_𝑖$ mini-batch +the number $N_0$ of existing workers. The worker 𝑖 will perform $𝑚_𝑖$ mini-batch before merging gradients across workers by all-reduce. $𝑚_𝑖 =⌊𝑁/𝑁_0⌋+1$ if $𝑖<𝑁\%𝑁_0$, otherwise, $𝑚_𝑖 =⌊𝑁/𝑁_0⌋$ . @@ -224,7 +223,7 @@ otherwise, $𝑚_𝑖 =⌊𝑁/𝑁_0⌋$ . Parameter servers and workers can fail at any time. Thus the trainer will checkpoint the parameters periodically. When a parameter server fail, the trainer starts -another parameter server and resume the checkpointing. For worker failure, +another parameter server and resume the checkpointing. For worker failure, the trainer just starts a worker and let the work picks up a shard for computation. ### Brain Service to Support Auto-scaling Jobs in a Cluster @@ -243,7 +242,7 @@ includes three components. 
#### Administor -When a training job is created, the corresponding administor is also created +When a training job is created, the corresponding administor is also created in the brain. This administor will administer the job during the job's whole lifetime. When to initialize the job or observe a performance issue in the job, the administor will create an optimize event for a new resource plan. @@ -259,12 +258,12 @@ Then we can have the optimal resource plans. #### Algorithm Executor -After the optimize processor decides the algorithm for the job, the algorithm +After the optimize processor decides the algorithm for the job, the algorithm executor executes the algorithm and generates the resource plan. ### Cluster Monitor -In order to detach Brain from a particular platform, Brain only use data in the database +In order to detach Brain from a particular platform, Brain only use data in the database to generate optimized resource plans for jobs. In this way, we can easily reuse similar algorithm for different cluster platform (e.g., Kubernetes and Ray). Therefore, the Cluster Monitor is -implemented for particular platform to collect jobs and cluster statistic data. \ No newline at end of file +implemented for particular platform to collect jobs and cluster statistic data. diff --git a/docs/design/scale-node-design.md b/docs/design/scale-node-design.md index cc8cd100d..e8139ffa7 100644 --- a/docs/design/scale-node-design.md +++ b/docs/design/scale-node-design.md @@ -11,6 +11,7 @@ the training process. ## Design Auto-scaling in DLRover contains the following steps: + - The `JobResourceOptimizer` in `JobManager` queries a `ResourcePlan`. - The `TrainingNodeManager` (e.g. `PSManager` and `WorkerManger`) in `JobManager` generates the `ScalePlan`. @@ -40,7 +41,7 @@ and PS. of the job and ajust resource to mitigate the bottleneck. At each stage, `JobResourceOptimizer` queries a `ResourcePlan` by calling its -`ResourceOptimizer`. +`ResourceOptimizer`. The `ResourcePlan` contains resource configurations of training nodes. For exampel: @@ -141,15 +142,15 @@ If the number of PS is smaller than the current number of PS. `PSManager` will not delete the additional PS nodes immediately. Because model parameters are stored across PS nodes and will be lost if we delele PS nodes before workers checkpoints model parameters on PS. -`PSManager` will add those PS nodes which is to be removed -to a queuee `_pre_dropped_ps` and remove those PS hosts from +`PSManager` will add those PS nodes which is to be removed +to a queuee `_pre_dropped_ps` and remove those PS hosts from its `_next_training_ps_cluster`. After workers succeed to checkpoint model parameters and connect the next PS cluster. `PSManager` will set those those PS nodes into `remove_nodes` of a `ScalePlan`. **Migrate PS.** -If there is a updatation in a PS node's resource in `ResourcePlan.node_resources`, -`PSManager` will create a PS `Node` with the new resource. +If there is a updatation in a PS node's resource in `ResourcePlan.node_resources`, +`PSManager` will create a PS `Node` with the new resource. After the new PS node is running, `PSManager` will update its `_next_training_ps_cluster` and notify workers to connect new PS clusters. After workers succeed to connect new PS @@ -177,7 +178,8 @@ for additional workers to be removed. create/update/delete nodes to achieve the `ScalePlan`. We can implement differenct `Scaler` for different distributed cluster. 
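As a rough sketch of this abstraction, a `Scaler` base class could expose a single `scale()` entry point that concrete implementations (such as the Pod- and ElasticJob-based scalers described below) override. The `ScalePlan` attribute names used here, e.g. `launch_nodes` and `remove_nodes`, are illustrative assumptions rather than the exact DLRover API:

```python
from abc import ABCMeta, abstractmethod


class Scaler(metaclass=ABCMeta):
    """Illustrative base class: a Scaler applies a ScalePlan to a cluster."""

    def __init__(self, job_name):
        self.job_name = job_name

    @abstractmethod
    def scale(self, scale_plan):
        """Create/update/delete training nodes to reach the plan."""
        pass


class LoggingScaler(Scaler):
    """A toy implementation that only logs the plan, e.g. for local testing."""

    def scale(self, scale_plan):
        print(
            f"[{self.job_name}] launch: {scale_plan.launch_nodes}, "
            f"remove: {scale_plan.remove_nodes}"
        )
```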
-#### Pod Scaler +#### Pod Scaler + `PodScaler` is implemented by K8s Python APIs to create/update/delete Pods on a K8s cluster. @@ -192,6 +194,7 @@ will replace the `Node` into the queue to retry. by the name of the node in `remove_nodes`. #### ElasticJob Scaler + `ElasticJobScaler` is implemented to create a `ScalePlan` CRD to notify the [ElasticJob controller](docs/design/elastic-training-operator.md) to reconcile Pods by the `ScalePlan` on a K8s cluster. The example of `ScalePlan` is diff --git a/docs/design/streaming-data-splitter-and-manager.md b/docs/design/streaming-data-splitter-and-manager.md index 691a71595..209bf3699 100644 --- a/docs/design/streaming-data-splitter-and-manager.md +++ b/docs/design/streaming-data-splitter-and-manager.md @@ -1,23 +1,29 @@ # Streaming DataShardManger and Splitter + The design describes the architecture of the Streaming DataShardManger. The Streaming DataShardManger is responsible for dispatching data and keep data checkpoints. ## An Intro to Online learning + Online learning represents a family of machine learning methods, where a learner attempts -to tackle some predictive task by learning from a sequence of data instances one by one at each time. In contrast, offline/batch learner +to tackle some predictive task by learning from a sequence of data instances one by one +at each time. In contrast, offline/batch learner learns from static shuffled data samples and are not sensitive to the data sequence. Online learning has become a promising technique for learning from continuous -streams of data in many real-world applications. +streams of data in many real-world applications. Thus, the key point for online learning the data should be dispatched sequentially and consumed at least once. ## PartitionOffsets -Stream processing is the processing of data in motion, or in other words, computing on data directly as it is produced or received. -In addition, we would never know how many training samples are in advance and when they would arrive. + +Stream processing is the processing of data in motion, or in other words, +computing on data directly as it is produced or received. +In addition, we would never know how many training samples are in advance and when they would arrive. As a result, the worker and ps keeps running and waiting for the upstream sample. PartitionOffsets is responsible for holding consuming status of streaming data. + ```Python class PartitionOffsets(object): @@ -28,31 +34,26 @@ class PartitionOffsets(object): self.partition_num = 0 self.update_partitions() ``` + ## Streaming Data Splitter The streaming data splitter assumes that streaming samples are stored in different partition and every sample is marked with an offset which indicates the sample's sequence. Streaming data splitter is responsible for creating shards. The shard contains offset ranges [start, end) and partition of records. - ## Streaming DataShardManger Checkpoints -- When doing checkpoints, Streaming DataShardManger saves not only current doing tasks and undo tasks but also the splitter info. -- When restoring from checkpoints, Streaming DataShardManger loads not only current doing tasks and undo tasks but also the splitter info. +- When doing checkpoints, Streaming DataShardManger saves not only current doing tasks and +undo tasks but also the splitter info. +- When restoring from checkpoints, Streaming DataShardManger loads not only current doing +tasks and undo tasks but also the splitter info. 
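For intuition, such a checkpoint can be thought of as a small structure like the sketch below, holding the DOING shards, the not-yet-dispatched shards, and the splitter state. The JSON layout and field names are assumptions for illustration, not the actual on-disk format:

```python
import json


def save_checkpoint(path, doing, todo, splitter_info):
    """Illustrative checkpoint: persist DOING/TODO shards plus splitter state."""
    state = {
        # each shard records its partition and [start, end) offset range
        "doing": [{"partition": p, "start": s, "end": e} for p, s, e in doing],
        "todo": [{"partition": p, "start": s, "end": e} for p, s, e in todo],
        # e.g. current partitions and the latest consumed offsets
        "splitter": splitter_info,
    }
    with open(path, "w") as f:
        json.dump(state, f)


def restore_checkpoint(path):
    with open(path) as f:
        return json.load(f)
```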
## Streaming Reader As for getting training data,there are two kind of modes of online learning: - The training data is stored in the log store or kafka topic, the reader reads data from the log store or topic. -- The training data is processed by a streaming job and sink of the job sends the data to a buffer. The reader reads data from the buffer. By this means, the worker is decoupled with data source. - -In conclusion, the worker is stateless in both online learning and offline learning. - - - - - - - +- The training data is processed by a streaming job and sink of the job sends the data to a buffer. The reader +reads data from the buffer. By this means, the worker is decoupled with data source. +In conclusion, the worker is stateless in both online learning and offline learning. diff --git a/docs/design/training-master.md b/docs/design/training-master.md index cf7ce273e..144d594eb 100644 --- a/docs/design/training-master.md +++ b/docs/design/training-master.md @@ -1,4 +1,5 @@ # Training Master of DLRover + The design describes the architecture of the training master of DLRover. The master is responsible to controll the training of a single job and provide the following services: @@ -25,6 +26,7 @@ distributed systems. ## Architecture of the Training Master The master contains 5 components: + - Resource Generator: it generates resource configuration plans for the job. - Scaler: it generates Scale CRDs according to resource plans. - Stats Collector: it collects the runtime statistics for the job, including @@ -70,6 +72,7 @@ class StatsCollector(metaclass=ABCMeta): def report_resource_usage(self): pass ``` + We can implement `report_resource_usage` to report the runtime statistics (e.g. CPU/memory usage) of all parameter servers and workers to DLRover Brain to persist them in a database like MySQL. @@ -200,7 +203,7 @@ After a worker start to training, `DataShardManager` dispatch the shard to the worker. After the worker uses up samples in the shard, it will report a shard status to `DataShardManger`. The shard only contains indices of smaples not the sample -data. +data. ```Python class DataShardManger(metaclass=ABCmeta): @@ -234,4 +237,3 @@ class DataShardManger(metaclass=ABCmeta): """Restore uncompleted data shards from a checkpoint""" pass ``` - diff --git a/docs/design/virtual-env.md b/docs/design/virtual-env.md index fd5f1311d..bd6f482f1 100644 --- a/docs/design/virtual-env.md +++ b/docs/design/virtual-env.md @@ -3,25 +3,25 @@ ## Background The cluster mode of DLRover is designed to train deep learning model automatically -in production environment. The correctness and robustness of DLRover is +in production environment. The correctness and robustness of DLRover is critical to obtain the required DL models. Therefore, there is high demand for complete and reliable testing on DLRover before each update. DLRover consists of multiple components and those components coordinate with each other to train the DL models. Since DLRover can run quite different training jobs, e.g., -Tensorflow and PyTorch, for different DL models, the job relevant components +Tensorflow and PyTorch, for different DL models, the job relevant components (e.g., operators) are quite different from each other. Unit tests indeed guarantee the function correctness of the single component. We still need to test the coordination among different components. -Meanwhile, new algorithms (e.g., auto-configure and optimization) are in frequent -iteration. 
For a new algorithm, we need to guarantee the efficiency of this algorithm +Meanwhile, new algorithms (e.g., auto-configure and optimization) are in frequent +iteration. For a new algorithm, we need to guarantee the efficiency of this algorithm as well as the correctness. However, currently we can only run several sample jobs -on gray cluster and observe the algorithm's efficiency that usually shows inaccurate +on gray cluster and observe the algorithm's efficiency that usually shows inaccurate results compared to production environment. Furthermore, a new algorithm requires non-trivial time to have complete test. During each iteration, we need to compare multiple algorithms and pick up the best one. There is high demand to test multiple -algorithms simultaneously. +algorithms simultaneously. ## Design Goal @@ -39,7 +39,7 @@ Based on the background, we have listed the design goals of virtual environment: Editor
-Similar to virtual machines, each **Virtual Environment** can run a different "DLRover" system. +Similar to virtual machines, each **Virtual Environment** can run a different "DLRover" system. Among those DLRover systems, there is at most one system can be labelled as **Real** while all others can only be **Virtual**. Generally, only the real system is to run and optimize DL training jobs and virtual systems are for testing. @@ -47,37 +47,38 @@ However, if a case is observed, e.g., too many jobs fail, the real system will b switched to virtual and a pre-chosen virtual system takes the control and start to train DL models. -DLRover consists of four major components: Brain, Trainer, Store and Job Operator +DLRover consists of four major components: Brain, Trainer, Store and Job Operator (i.e., training framework related, like Tensorflow, PyTorch etc.). Each component has its own virtual environment. Note that two virtual DLRovers could share the same -virtual components. For example, DLRover #1 and DLRover #2 could have different virtual +virtual components. For example, DLRover #1 and DLRover #2 could have different virtual Trainers but use the same virtual Brain. ### Virtual Brain -The core part in the Brain is Optimizer and Optimizer is allowed to have multiple -different implementations. Therefore, we can implement different Optimizers for +The core part in the Brain is Optimizer and Optimizer is allowed to have multiple +different implementations. Therefore, we can implement different Optimizers for different virtual Brain. -Evaluator is another key part in Brain virtual environment. Evaluator is to measure -or compare the efficiency of the algorithms. Similar to Optimizer, Evaluator also +Evaluator is another key part in Brain virtual environment. Evaluator is to measure +or compare the efficiency of the algorithms. Similar to Optimizer, Evaluator also allows different implementations. ### Virtual Store -Store is used to keep necessary data for DLRover. Each virtual environment can have +Store is used to keep necessary data for DLRover. Each virtual environment can have separate data store, e.g., tables in MySQL. ### Virtual Operator -Operator is used to modify/kill/launch jobs virtually or really. Note that, if a virtual +Operator is used to modify/kill/launch jobs virtually or really. Note that, if a virtual operator update jobs virtually, it needs to obtain corresponding cluster status for those virtual job operations. ### Virtual Trainer Each virtual Trainer has three major tasks: -1. To query optimization plans from virtual Brain. + +1. To query optimization plans from virtual Brain. 2. To convert optimization plans to ScalePlan and send to virtual operator. 3. Based on ScalePlan, to simulate job's virtual status and persist relevant data to store. @@ -85,15 +86,12 @@ The simulator interface is as following: ```go type JobStatus struct { - Speed float - ... + Speed float + ... 
} type JobSimulator interface { - UpdateJob(plan *OptimizePlan) error - GetJobStatus() *JobStatus + UpdateJob(plan *OptimizePlan) error + GetJobStatus() *JobStatus } ``` - - - diff --git a/docs/developer_guide.md b/docs/developer_guide.md index c0686114d..73101b7fb 100644 --- a/docs/developer_guide.md +++ b/docs/developer_guide.md @@ -35,7 +35,7 @@ mkdir -p ${go env GOPATH}/src/github.com/intelligent-machine-learning ln -sf ${GIT_TRAINING} ${go env GOPATH}/src/github.com/intelligent-machine-learning/dlrover ``` -- GIT_TRAINING should be the location where you checked out https://github.com/intelligent-machine-learning/dlrover +- GIT_TRAINING should be the location where you checked out Install dependencies @@ -67,9 +67,11 @@ It is highly recommended to have more than one GPU resources in your workspace. However, there is still a workaround to divide your single GPU resource into multiple ones. -For this, enable [shared-access-to-gpus with CUDA Time-Slicing](https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing) to get more GPU resources. +For this, enable [shared-access-to-gpus with CUDA Time-Slicing](https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing) +to get more GPU resources. -Check the doc and modify your ``nvidia-k8s-device-plugin`` or simply update the plugin by ``helm`` with the command ([See more details about getting GPU resources](https://github.com/ChenhuiHu/DLRover-Supplementary-Description-/blob/main/Obtain%20more%20GPU%20resources%20on%20a%20single%20machine.md)) +Check the doc and modify your ``nvidia-k8s-device-plugin`` or simply update the plugin by ``helm`` with the command +([See more details about getting GPU resources](https://github.com/ChenhuiHu/DLRover-Supplementary-Description-/blob/main/Obtain%20more%20GPU%20resources%20on%20a%20single%20machine.md)) ```bash $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ @@ -149,7 +151,8 @@ minikube start --vm-driver=docker --cpus 6 --memory 6144 # If you wish to run minikube with GPUs, recommended commands are as follows.(root privilege requried) -minikube start --driver=none --container-runtime='containerd' --apiserver-ips 127.0.0.1 --apiserver-name localhost --cpus 6 --memory 6144 +minikube start --driver=none --container-runtime='containerd' --apiserver-ips 127.0.0.1 \ +--apiserver-name localhost --cpus 6 --memory 6144 ``` ### Configure KUBECONFIG and KUBEFLOW_NAMESPACE @@ -162,7 +165,8 @@ export KUBECONFIG=$(echo ~/.kube/config) export KUBEFLOW_NAMESPACE=$(your_namespace) ``` -- KUBEFLOW_NAMESPACE is used when deployed on Kubernetes, we use this variable to create other resources (e.g. the resource lock) internal in the same namespace. It is optional, use `default` namespace if not set. +- KUBEFLOW_NAMESPACE is used when deployed on Kubernetes, we use this variable to create other +resources (e.g. the resource lock) internal in the same namespace. It is optional, use `default` namespace if not set. ### 2. Run ElasticJob Controller @@ -189,7 +193,7 @@ make deploy IMG=easydl/elasticjob-controller:master kubectl apply -f dlrover/go/operator/config/manifests/bases/default-role.yaml ``` -### 4. Build the Image +### 4. Build the Image **Build the master image with codes.** @@ -203,7 +207,7 @@ docker build -t easydl/dlrover-master:test -f docker/Dockerfile . docker build -t easydl/dlrover-train:test -f docker/pytorch/mnist.dockerfile . ``` -### 5. Submit an ElasticJob to test your images. +### 5. 
Submit an ElasticJob to test your images We can set the training image of the line 18 and the master image of line 42 in the debug job `examples/pytorch/mnist/elastic_debug_job.yaml`. @@ -220,7 +224,7 @@ Check traning nodes. kubectl -n dlrover get pods ``` -``` +```text NAME READY STATUS RESTARTS AGE elasticjob-torch-mnist-master 1/1 Running 0 2m47s torch-mnist-edljob-chief-0 1/1 Running 0 2m42s @@ -234,4 +238,6 @@ Change pip version and docker image tag when creating a new release. ## Go version -On ubuntu the default go package appears to be gccgo-go which has problems see [issue](https://github.com/golang/go/issues/15429) golang-go package is also really old so install from golang tarballs instead. +On ubuntu the default go package appears to be gccgo-go which has problems see +[issue](https://github.com/golang/go/issues/15429) golang-go package is +also really old so install from golang tarballs instead. diff --git a/docs/tutorial/check_env.md b/docs/tutorial/check_env.md index 225cd35be..e2d8b1ad1 100644 --- a/docs/tutorial/check_env.md +++ b/docs/tutorial/check_env.md @@ -1,6 +1,8 @@ # Environment Test before Start -Before you start installing this project, you need to perform the following tests to ensure that your current computer environment meets the requirements, so as to avoid some possible errors. +Before you start installing this project, you need to perform the following +tests to ensure that your current computer environment meets the requirements, +so as to avoid some possible errors. `grpcio == 1.34.1`: @@ -59,4 +61,3 @@ import importlib assert importlib.util.find_spec("ray") is not None, "ray module is not installed" ``` - diff --git a/docs/tutorial/estimator.md b/docs/tutorial/estimator.md index 95ec5dc57..e0766d009 100644 --- a/docs/tutorial/estimator.md +++ b/docs/tutorial/estimator.md @@ -3,17 +3,20 @@ The document describes how to develop tensorflow estimator model with DLRover trainer. ## Develop model with tensorflow estimator + [Tensorflow Estimator](https://www.tensorflow.org/guide/estimator) encapsulate Training, Evaluation, Prediction and Export for serving actions. In DLrover, both custome estimators and pre-made estimators are supported. A DLrover program with Estimator typically consists of the following four steps: + ### Define the features and label column in the conf Each `Column` identifies a feature name, its type and whether it is label. The following snippet defines two feature columns in the -[example](../../examples/tensorflow/criteo_deeprec/train_conf.py). -``` +[example](../../examples/tensorflow/criteo_deeprec/train_conf.py). + +```python train_set = { "reader": FileReader("test.data"), "columns": ( @@ -29,21 +32,23 @@ train_set = { ), ), } -``` +``` The first feature is `x` and its type is `float32`. -The second feature is `y` and is label. Its type is `float32`. -`dlrover.trainer` helps build `input_fn` for train set and test set with those columns. - +The second feature is `y` and is label. Its type is `float32`. +`dlrover.trainer` helps build `input_fn` for train set and test set with those columns. ### Add Custom Reader for TF Estimator in DLrover + In some case, the reader provided by DLrover trainer doesn't satisfy user's need. User need to develop custom reader and set it in the conf. #### Add Custom Elastic Reader for TF Estimator in DLrover -##### Define Elastic Reader Class + One necessary arguments in the `__init__` method is path. -The key funcion is `read_data_by_index_range` and `count_data`. 
`count_data` is used for konwing how many dataset are there before training. During training, `read_data_by_index_range` will be called to get train data. +The key funcion is `read_data_by_index_range` and `count_data`. `count_data` is used for +konwing how many dataset are there before training. During training, `read_data_by_index_range` +will be called to get train data. ```python from dlrover.trainer.tensorflow.reader.base_reader import ElasticReader @@ -65,7 +70,6 @@ class FakeReader(ElasticReader): return data ``` -##### Set Reader Conf file you need to initial you reader and set it in the conf. Here is an example ```python @@ -74,8 +78,6 @@ eval_set = {"reader": FakeReader("./eval.data"), "columns": train_set["columns"] #### Add Custom Non Elastic Reader for TF Estimator in DLrover -##### Define Reader Class - The key funcion is `iterator`. During training, `iterator` will be called to get train data. ```python @@ -98,26 +100,35 @@ class Reader: yield d ``` -##### Set Reader Conf file you need to initial you reader and set it in the conf. Here is an example + ```python eval_set = {"reader": Reader("./eval.data"), "columns": train_set["columns"]} ``` -### Instantiate the Estimator. -The heart of every Estimator—whether pre-made or custom—is its model function, model_fn, which is a method that builds graphs for training, evaluation, and prediction. -In `dlrover.trainer`, we assume the Estimator is a custom estimator. And pre-made estimators should be converted to custom estimator with little overhead. +### Instantiate the Estimator + +The heart of every Estimator—whether pre-made or custom—is its model function, model_fn, +which is a method that builds graphs for training, evaluation, and prediction. +In `dlrover.trainer`, we assume the Estimator is a custom estimator. +And pre-made estimators should be converted to custom estimator with little overhead. + #### Train a model from custome estimators -When relying on a custom Estimator, you must write the model function yourself. Refer the [tutorial](https://www.tensorflow.org/guide/estimator). -#### Train a model from pre-made estimators + +When relying on a custom Estimator, you must write the model function yourself. +Refer the [tutorial](https://www.tensorflow.org/guide/estimator). + +#### Train a model from pre-made estimators + You can convert an existing pre-made estimators by writing an Adaptor to fit with `dlrover.trainer`. As we can see, the model_fn is the key part of estimator. When training and evaluating, the model_fn is called with different mode and the graph is returned. Thus, you can define a custom estimator in which model_fn function acts as a wrapper for pre-made estimator model_fn. In the example of [DeepFMAdaptor](../../dlrover/trainer/examples/deepfm/DeepFMAdaptor.py), -`DeepFMEstimator` in [`deepctr.estimator.models`](https://github.com/shenweichen/DeepCTR/tree/master/deepctr/estimator/models) is a pre-made estimator. +`DeepFMEstimator` in [`deepctr.estimator.models`](https://github.com/shenweichen/DeepCTR/tree/master/deepctr/estimator/models) +is a pre-made estimator. -``` +```python from deepctr.estimator.models.deepfm import DeepFMEstimator class DeepFMAdaptor(tf.estimator.Estimator): @@ -142,20 +153,24 @@ class DeepFMAdaptor(tf.estimator.Estimator): ) ``` + ### Saving object-based checkpoints with Estimator -Estimators by default save checkpoints with variable names rather than the object graph described in the Checkpoint guide. 
+ +Estimators by default save checkpoints with variable names rather than the +object graph described in the Checkpoint guide. The checkpoint hook is added by `dlrover.trainer.estimator_executor`. ### SavedModels from Estimators + Estimators export SavedModels through tf.Estimator.export_saved_model. The exporter hook is added by `dlrover.trainer.estimator_executor`. -When the job is launched, `dlrover.trainer.estimator_executor` parses the conf and builds input_fn, estimator and related hooks. +When the job is launched, `dlrover.trainer.estimator_executor` parses the conf and builds input_fn, +estimator and related hooks. +## Submit a Job to Train the Estimator model - ## Submit a Job to Train the Estimator model - - ### Build an Image with Models. +### Build an Image with Models You can install dlrover in your image. @@ -175,7 +190,7 @@ docker build -t ${IMAGE_NAME} -f ${DockerFile} . docker push ${IMAGE_NAME} ``` -### Set the Command to Train the Model. +### Set the Command to Train the Model We need to set the command of ps and worker to train the model like the [DeepCTR example](../../examples/tensorflow/criteo_deeprec/autoscale_job.yaml) @@ -194,4 +209,4 @@ Then, we can submit the job by `kubectl`. ```bash kubectl -n dlrover apply -f ${JOB_YAML_FILE} -``` \ No newline at end of file +``` diff --git a/docs/tutorial/fault_tolerations.md b/docs/tutorial/fault_tolerations.md index f8c9fd136..b35e0463c 100644 --- a/docs/tutorial/fault_tolerations.md +++ b/docs/tutorial/fault_tolerations.md @@ -1,12 +1,19 @@ # worker和ps容错样例 + ## worker容错示例 + 在任务运行过程中删除worker-i对应的pod,之后dlrover master会重新拉起一个pod。work-i对应的pod的名称会发生变化,新创建的pod的启动命令和被kill掉的pod的启动命令相同,启动后参与组网,并进行训练。期间其他worker不受影响。 + ### 启动作业 + 首先,启动作业。为了避免自动扩容缩容的影响,选择人工配置扩容缩容策略。 + ```shell kubectl apply -f deepctr_manual_scale_job.yaml -n dlrover ``` + 当前有1个ps和3个worker。 + ```shell NAME READY STATUS RESTARTS AGE deepctr-auto-scaling-job-edljob-chief-0 1/1 Running 0 117s @@ -14,7 +21,9 @@ deepctr-auto-scaling-job-edljob-ps-0 1/1 Running 0 deepctr-auto-scaling-job-edljob-worker-0 1/1 Running 0 65s deepctr-auto-scaling-job-edljob-worker-1 1/1 Running 0 65s ``` + 查看worker-0对应的pod的信息 + ```shell Name: deepctr-auto-scaling-job-edljob-worker-0 Namespace: dlrover @@ -97,20 +106,27 @@ Events: Normal Created 2m13s kubelet Created container main Normal Started 2m13s kubelet Started container main ``` -### 容错模拟 + +### Worker 容错模拟 + 为了模拟容错,需要主动删除worker-0对应的pod + ```shell kubectl delete pods -n dlrover deepctr-auto-scaling-job-edljob-worker-0 pod "deepctr-auto-scaling-job-edljob-worker-0" deleted ``` + worker-0对应的新pod启动,完成准备工作后开始消费数据,进行训练。 + ```shell deepctr-auto-scaling-job-edljob-chief-0 1/1 Running 0 4m24s deepctr-auto-scaling-job-edljob-ps-0 1/1 Running 0 4m24s deepctr-auto-scaling-job-edljob-worker-1 1/1 Running 0 3m32s deepctr-auto-scaling-job-edljob-worker-2 0/1 ContainerCreating 0 2s ``` + 查看worker-0对应的pod的信息 + ```shell Name: deepctr-auto-scaling-job-edljob-worker-2 Namespace: dlrover @@ -193,15 +209,21 @@ Events: Normal Created 92s kubelet Created container main Normal Started 92s kubelet Started container main ``` + worker-0 对应pod的日志 + ```shell [2023-03-20 11:51:10,774] [INFO][session_manager.py:511:_try_run_local_init_op] Running local_init_op. [2023-03-20 11:51:11,302] [INFO][session_manager.py:513:_try_run_local_init_op] Done running local_init_op. 
[2023-03-20 11:51:14,279] [INFO][global_step_hook.py:39:before_run] global_step: 10488361
```

-## ps容错示例
+
+## PS Fault Tolerance Example
+
Delete the pod of ps-i while the job is running, and the DLRover master will relaunch a new pod. The name of ps-i's pod changes, but the new pod starts with the same command as the killed one. Training on the workers is interrupted from the moment the pod is killed until the new PS pod starts its server.
+
### Launch a Job
+
After launching the job, you can check the running workers and PS pods.

```shell
@@ -211,18 +233,19 @@ deepctr-auto-scaling-job-edljob-ps-0 1/1 Running 0
deepctr-auto-scaling-job-edljob-ps-1       1/1   Running   0   106s
deepctr-auto-scaling-job-edljob-worker-0   1/1   Running   0   2m30s
deepctr-auto-scaling-job-edljob-worker-1   1/1   Running   0   2m30s
-dlrover-controller-manager-7dccdf6c4d-jp4wb 2/2 Running 0 3h26m
-elasticjob-deepctr-auto-scaling-job-dlrover-master 1/1 Running 0 4m9s
-mysql-7d757854f-8l5k4 1/1 Running 0 4d4h
```
+
-### 容错模拟
+### PS Fault Tolerance Simulation
+
To simulate a failure, we manually delete the pod of ps-0. The worker logs after the deletion:
+
```shell
[2023-03-20 15:04:34,350] [INFO][monitored_session.py:1336:run] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: From /job:ps/replica:0/task:1:
RecvTensor expects a different device incarnation: 11288349594494262162 vs. 11542130100054943552. Your worker job ("/job:localhost/replica:0/task:0") was probably restarted. Check your worker job for the reason why it was restarted.
```

When the PS pod is recreated and the PS server starts:
+
```shell
NAME                                       READY   STATUS    RESTARTS   AGE
deepctr-auto-scaling-job-edljob-chief-0    1/1     Running   0          11m
@@ -231,7 +254,9 @@ deepctr-auto-scaling-job-edljob-ps-2 1/1 Running 0
deepctr-auto-scaling-job-edljob-worker-0   1/1     Running   0          9m39s
deepctr-auto-scaling-job-edljob-worker-1   1/1     Running   0          9m39s
```
+
The worker loads the latest checkpoint and continues training:
+
```shell
[2023-03-20 15:04:34,100] [INFO][monitored_session.py:1336:run] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error:
=====================
Aborted: From /job:chief/replica:0/task:0:
RecvTensor expects a different device incarnation: 11288349594494262162 vs. 11542130100054943552. Your worker job ("/job:localhost/replica:0/task:0") was probably restarted. Check your worker job for the reason why it was restarted.
Additional GRPC error information: {"created":"@1679295874.088182934","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"RecvTensor expects a different device incarnation: 11288349594494262162 vs. 11542130100054943552. Your worker job ("/job:localhost/replica:0/task:0") was probably restarted. Check your worker job for the reason why it was restarted.","grpc_status":10}
-	[[node global_step (defined at /local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
+ [[node global_step (defined at /local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Aborted: From /job:ps/replica:0/task:0:
Session handle is not found: f8368e3b7d417955. Possibly this worker ("/job:localhost/replica:0/task:0") just restarted.
===================== @@ -319,5 +344,3 @@ Original stack trace for 'global_step': [2023-03-20 15:04:34,511] [INFO][session_manager.py:220:_restore_checkpoint] run with loading checkpoint [2023-03-20 15:04:34,724] [INFO][saver.py:1531:restore] Restoring parameters from /nas/model.ckpt-10701903 ``` - - diff --git a/docs/tutorial/gpu_user_guide.md b/docs/tutorial/gpu_user_guide.md index cd7badad6..ba7556216 100644 --- a/docs/tutorial/gpu_user_guide.md +++ b/docs/tutorial/gpu_user_guide.md @@ -1,6 +1,9 @@ ### GPU User Guide -> "The first four steps in this document need to be run on each bare-metal machine that will use a GPU. If you've already set up each node that requires GPU usage, or you're working in a well-maintained cloud-based Kubernetes environment, you can directly start from step five." +> "The first four steps in this document need to be run on each bare-metal machine that will use a GPU. +If you've already set up each node that requires GPU usage, +or you're working in a well-maintained cloud-based Kubernetes environment, +you can directly start from step five." #### Step 1: Prepare the system for NVIDIA GPU support @@ -24,7 +27,8 @@ sudo apt-get install -y nvidia-container-toolkit && \ sudo nvidia-ctk runtime configure --runtime=docker --set-as-default ``` -This will install the necessary components for NVIDIA GPU support in Docker, enabling you to utilize GPU resources within Docker containers. +This will install the necessary components for NVIDIA GPU support in Docker, +enabling you to utilize GPU resources within Docker containers. #### Step 3: Set NVIDIA as the default runtime for Docker @@ -60,7 +64,8 @@ After making the changes, restart the Docker service for the new configuration t sudo systemctl restart docker ``` -Now, NVIDIA will be set as the default runtime for Docker, allowing you to use NVIDIA GPU support seamlessly with Docker containers. +Now, NVIDIA will be set as the default runtime for Docker, +allowing you to use NVIDIA GPU support seamlessly with Docker containers. #### Step 5: Deploy the NVIDIA Device Plugin for Kubernetes @@ -70,7 +75,8 @@ Use the following command to deploy the NVIDIA Device Plugin for Kubernetes: kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml ``` -This plugin enables Kubernetes to recognize and manage NVIDIA GPUs on the worker nodes, ensuring efficient allocation and utilization of GPU resources for container workloads. +This plugin enables Kubernetes to recognize and manage NVIDIA GPUs on the worker nodes, +ensuring efficient allocation and utilization of GPU resources for container workloads. #### Step 6: Create a test Pod with GPU resources @@ -90,7 +96,8 @@ spec: nvidia.com/gpu: 1 # requesting 1 GPU ``` -The above YAML configuration requests one GPU for the Pod. Replace the image with your desired GPU-accelerated application image if needed. +The above YAML configuration requests one GPU for the Pod. +Replace the image with your desired GPU-accelerated application image if needed. #### Step 7: Deploy the test Pod @@ -100,7 +107,8 @@ Use the following command to deploy the test Pod to Kubernetes: kubectl apply -f .yaml ``` -This will create the Pod on your Kubernetes cluster, and the GPU resource will be allocated to the Pod based on the NVIDIA Device Plugin's capabilities. - -Now, you have successfully enabled GPU support in your Kubernetes cluster and deployed a test Pod with GPU resources for running GPU-accelerated workloads. 
+This will create the Pod on your Kubernetes cluster,
+and the GPU resource will be allocated to the Pod based on the NVIDIA Device Plugin's capabilities.
+
+Now, you have successfully enabled GPU support in your Kubernetes cluster
+and deployed a test Pod with GPU resources for running GPU-accelerated workloads.
diff --git a/docs/tutorial/pytorch_training.md b/docs/tutorial/pytorch_training.md
index f868f5c24..114026b18 100644
--- a/docs/tutorial/pytorch_training.md
+++ b/docs/tutorial/pytorch_training.md
@@ -7,14 +7,13 @@ We have provided the [CNN example](../../examples/pytorch/mnist_cnn.py)
to show how to train a CNN model with the MNIST dataset.

-## Develop a Torch Model with DLRover.
+## Develop a Torch Model with DLRover

With DLRover elastic training, users only need to set the `ElasticDistributedSampler`
in their training `DataLoader` and checkpoint the sampler when checkpointing the model.

-### Setup ElasticDistributedSampler into the Dataloader.
-
+### Set up the ElasticDistributedSampler in the DataLoader

```Python
from dlrover.trainer.torch.elastic_sampler import ElasticDistributedSampler
@@ -113,15 +112,16 @@ for _, (data, target) in enumerate(train_loader):
    torch.save(model_checkpoint, "model.pt")
```

-## Submit an ElasticJob on the Kubernetes to Train the model.
+## Submit an ElasticJob on Kubernetes to Train the Model

-### Build the Image with the Model.
+### Build the Image with the Model

You can install dlrover in your image like:

```bash
pip install dlrover[torch] -U
```
+
or build your image with the Dockerfile.

```dockerfile
@@ -137,7 +137,7 @@ RUN pip install dlrover -U
COPY ./model_zoo ./model_zoo
```

-### Run the Training code with dlrover-run.
+### Run the Training Code with dlrover-run

```yaml
spec:
diff --git a/docs/tutorial/tf_ps_on_cloud.md b/docs/tutorial/tf_ps_on_cloud.md
index d4802257c..7d0766a1d 100644
--- a/docs/tutorial/tf_ps_on_cloud.md
+++ b/docs/tutorial/tf_ps_on_cloud.md
@@ -5,9 +5,10 @@ with on a public cloud, namely, Alibaba Cloud Container Service for Kubernetes(A

## Preliminary

-- Create a Kubernetes cluster on [ACK](https://help.aliyun.com/document_detail/309552.htm?spm=a2c4g.11186623.0.0.168f6b7aegH7nI#task-2112671). 
+- Create a Kubernetes cluster on [ACK](https://help.aliyun.com/document_detail/309552.htm?spm=a2c4g.11186623.0.0.168f6b7aegH7nI#task-2112671).
- Configure cluster credentials on your local computer.
-- Create a [NAS](https://help.aliyun.com/document_detail/477380.html?spm=a2c4g.11186623.0.0.10635c83Xn7Tkh) storage and mount it to the cluster.
+- Create a [NAS](https://help.aliyun.com/document_detail/477380.html?spm=a2c4g.11186623.0.0.10635c83Xn7Tkh)
+storage and mount it to the cluster.

## Deploy the ElasticJob CRD on ACK

@@ -17,7 +18,7 @@ with on a public cloud, namely, Alibaba Cloud Container Service for Kubernetes(A

make deploy IMG=easydl/elasticjob-controller:v0.1.1
```

-2. Grant permission for the DLRover master to Access CRDs.
+1. Grant permission for the DLRover master to access CRDs.
```bash
kubectl -n dlrover apply -f dlrover/go/operator/config/rbac/default_role.yaml
@@ -124,6 +125,7 @@ spec:
  worker:
    replicas: 3
```
+
After scaling, there are three worker nodes:

``` bash
@@ -190,7 +192,6 @@ deepctr-auto-scaling-edljob-ps-0 1/1 Running 0 9m
elasticjob-deepctr-auto-scaling-dlrover-master 1/1 Running 0 9m47s
```

-
We can migrate a PS to a new pod with more resources as follows:

```yaml
@@ -218,10 +219,11 @@ NAME READY STATUS RESTARTS AG
deepctr-auto-scaling-edljob-chief-0 1/1 Running 0 22m
deepctr-auto-scaling-edljob-ps-0 1/1 Running 0 22m
```
+
After migrating, the new PS joins and the old PS exits:

``` bash
NAME READY STATUS RESTARTS AGE
deepctr-auto-scaling-edljob-chief-0 1/1 Running 0 22m
deepctr-auto-scaling-edljob-ps-2 1/1 Running 0 20s
-```
\ No newline at end of file
+```
diff --git a/docs/tutorial/torch_ddp_nanogpt.md b/docs/tutorial/torch_ddp_nanogpt.md
index 10c4bf808..f6eb35a97 100644
--- a/docs/tutorial/torch_ddp_nanogpt.md
+++ b/docs/tutorial/torch_ddp_nanogpt.md
@@ -1,33 +1,40 @@
# Master the Training of NanoGPT with DLRover

-Welcome to an exhaustive guide on how to train the `NanoGPT` model using DLRover. 
+Welcome to an exhaustive guide on how to train the `NanoGPT` model using DLRover.

## What's NanoGPT?

-NanoGPT is a specialized version of the famous GPT (Generative Pretrained Transformer) model. What makes it unique is its role in evaluating the scalability and elasticity of the DLRover job controller. It provides the ability to tweak hyperparameters like _n_layer_, _n_head_, and _n_embedding_, making it possible to conduct tests on GPT models of varying sizes.
+NanoGPT is a specialized version of the famous GPT (Generative Pretrained Transformer) model.
+What makes it unique is its role in evaluating the scalability and elasticity of the DLRover job controller.
+It provides the ability to tweak hyperparameters like _n_layer_, _n_head_, and _n_embedding_,
+making it possible to conduct tests on GPT models of varying sizes.

-For a more in-depth dive into the fascinating world of NanoGPT, don't hesitate to visit [NanoGPT](https://github.com/karpathy/nanoGPT) for the source code and a plethora of other valuable resources.
+For a more in-depth dive into the fascinating world of NanoGPT, don't hesitate to visit [NanoGPT](https://github.com/karpathy/nanoGPT)
+for the source code and a plethora of other valuable resources.

## Setting Up the DLRover Job Controller

-Follow the comprehensive guide in the [Controller Deployment](dlrover/docs/deployment/controller.md) document to get your DLRover job controller up and running.
+Follow the comprehensive guide in the [Controller Deployment](dlrover/docs/deployment/controller.md)
+document to get your DLRover job controller up and running.

## GPT Training - Let's Dive In

-### Getting Started with a Sample YAML 
+### Getting Started with a Sample YAML

-Starting off with your journey to evaluating the performance of DLRover, you'll be submitting multiple training jobs. This will be done using NanoGPT with a variety of parameter settings to gauge performance under different conditions.
+To start evaluating the performance of DLRover, you'll submit multiple training jobs,
+each training NanoGPT with a different parameter setting to gauge performance under different conditions.
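As a rough sanity check before picking a setting, the small helper below estimates how
_n_layer_ and _n_embd_ translate into model size. It assumes a standard GPT-2-style
architecture with illustrative `vocab_size` and `block_size` defaults; it is not part of the
NanoGPT or DLRover code.

```python
def approx_gpt_params(n_layer: int, n_embd: int, vocab_size: int = 50304, block_size: int = 1024) -> int:
    """Rough GPT-2-style parameter count; ignores biases and LayerNorm weights."""
    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position embeddings
    per_block = 12 * n_embd * n_embd                        # 4x attention + 8x MLP weight matrices
    return embeddings + n_layer * per_block                 # n_head changes the split, not the count


# Illustrative sizes, not necessarily the exact settings used in this tutorial:
print(f"{approx_gpt_params(n_layer=6, n_embd=384) / 1e6:.0f}M params")   # a small test-sized model
print(f"{approx_gpt_params(n_layer=12, n_embd=768) / 1e6:.0f}M params")  # roughly GPT-2 small
```

Larger settings increase both memory use and per-step time, which is visible in the iteration logs further below.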
Kick off the process with the following command: ```bash -$ kubectl -n dlrover apply -f examples/pytorch/nanogpt/ddp_elastic_job.yaml +kubectl -n dlrover apply -f examples/pytorch/nanogpt/ddp_elastic_job.yaml ``` -Upon successful application of the job configuration, you can monitor the status of the training nodes using the command below: +Upon successful application of the job configuration, +you can monitor the status of the training nodes using the command below: ```bash -$ kubectl -n dlrover get pods +kubectl -n dlrover get pods ``` Expect an output that resembles this: @@ -40,7 +47,7 @@ torch-nanogpt-edljob-worker-0 1/1 Running 0 1 torch-nanogpt-edljob-worker-1 1/1 Running 0 11s ``` -### Examine the results obtained from two different parameter settings: +### Examine the results obtained from two different parameter settings parameter settings 1: @@ -60,17 +67,17 @@ parameter settings 2: --n_embd 768 ``` -#### More detailed description of the pods: +#### More detailed description of the pods Worker-0 Logs ```bash -$ kubectl logs -n dlrover torch-nanogpt-edljob-worker-0 +kubectl logs -n dlrover torch-nanogpt-edljob-worker-0 ``` results with parameter settings 1: -``` +```text iter 0: loss 4.2279, time 4542.46ms, mfu -100.00%, lr 6.00e-04, total time 4.54s iter 1: loss 3.5641, time 4439.20ms, mfu -100.00%, lr 6.00e-04, total time 8.98s iter 2: loss 4.2329, time 4477.08ms, mfu -100.00%, lr 6.00e-04, total time 13.46s @@ -87,7 +94,7 @@ iter 10: loss 3.3144, time 4553.10ms, mfu 0.33%, lr 6.00e-04, total time 49.29s results with parameter settings 2: -``` +```text iter 0: loss 4.4201, time 31329.07ms, mfu -100.00%, lr 6.00e-04, total time 31.33s iter 1: loss 4.6237, time 30611.01ms, mfu -100.00%, lr 6.00e-04, total time 61.94s iter 2: loss 6.7593, time 30294.34ms, mfu -100.00%, lr 6.00e-04, total time 92.23s @@ -105,12 +112,12 @@ iter 10: loss 3.3865, time 30167.96ms, mfu 0.33%, lr 6.00e-04, total time 333.70 Worker-1 Logs ```bash -$ kubectl logs -n dlrover torch-nanogpt-edljob-worker-1 +kubectl logs -n dlrover torch-nanogpt-edljob-worker-1 ``` results with parameter settings 1: -``` +```text iter 0: loss 4.2382, time 4479.40ms, mfu -100.00%, lr 6.00e-04, total time 4.48s iter 1: loss 3.5604, time 4557.53ms, mfu -100.00%, lr 6.00e-04, total time 9.04s iter 2: loss 4.3411, time 4408.12ms, mfu -100.00%, lr 6.00e-04, total time 13.45s @@ -127,7 +134,7 @@ iter 10: loss 3.3551, time 4455.05ms, mfu 0.32%, lr 6.00e-04, total time 49.29s results with parameter settings 2: -``` +```text iter 0: loss 4.4402, time 31209.29ms, mfu -100.00%, lr 6.00e-04, total time 31.21s iter 1: loss 4.5574, time 30688.11ms, mfu -100.00%, lr 6.00e-04, total time 61.90s iter 2: loss 6.7668, time 30233.15ms, mfu -100.00%, lr 6.00e-04, total time 92.13s @@ -141,15 +148,17 @@ iter 9: loss 3.3891, time 30084.17ms, mfu 0.33%, lr 6.00e-04, total time 303.41s iter 10: loss 3.3743, time 30271.93ms, mfu 0.33%, lr 6.00e-04, total time 333.68s [2023-07-26 07:43:16,112] [INFO] [training.py:355:_invoke_run] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish. ``` + ### Building from Docker - Step by Step -**Preparing Your Data** +### Preparing Your Data -To begin, you need a text document which can be a novel, drama, or any textual content. For instance, you can name this document as data.txt. +To begin, you need a text document which can be a novel, drama, or any textual content. +For instance, you can name this document as data.txt. 
-Here's an example of a Shakespearean dialogue:p
-```
+Here's an example of a Shakespearean dialogue:
+
+```text
BUCKINGHAM:
Welcome, sweet prince, to London, to your chamber.
@@ -162,9 +171,11 @@ No, uncle; but our crosses on the way
Have made it tedious, wearisome, and heavy
I want more uncles here to welcome me.
```
-Alternatively, you can use our provided data, which is available in the [examples/pytorch/nanogpt/data.txt](examples/pytorch/nanogpt/data.txt). This data has already been prepared for use.

-**Time to Run the Preparation Script**
+Alternatively, you can use our provided data, which is available in the [examples/pytorch/nanogpt/data.txt](examples/pytorch/nanogpt/data.txt).
+This data has already been prepared for use.
+
+### Time to Run the Preparation Script

Now that you have your data, let's run the preparation script as follows:

@@ -173,7 +184,7 @@ python examples/pytorch/nanogpt/prepare.py --src_data_path data.txt
This command generates a train.bin and val.bin file in the data directory.
```

-**Building the Training Image for PyTorch Models**
+### Building the Training Image for PyTorch Models

Having prepared the data, the final step involves building the training image for the PyTorch model. Here's how you do it:

@@ -184,12 +195,16 @@ docker build -t easydl/dlrover-train-nanogpt:test -f docker/pytorch/nanogpt.dock

And voila! You're all set to run the model and dive into the world of Natural Language Processing.

-# References
+## References

-This eaxmple is built upon and significantly influenced by the [NanoGPT](https://github.com/karpathy/nanoGPT) project. Several scripts from the project, including but not limited to `prepare.py`, `train.py`, and `model.py`, have been adapted to our specific requirements.
+This example is built upon and significantly influenced by the [NanoGPT](https://github.com/karpathy/nanoGPT) project.
+Several scripts from the project, including but not limited to `prepare.py`, `train.py`, and `model.py`,
+have been adapted to our specific requirements.

The original scripts can be found in the NanoGPT repository: [NanoGPT](https://github.com/karpathy/nanoGPT)

-# Acknowledgments
+## Acknowledgments

-We would like to express our sincere gratitude to the authors and contributors of the NanoGPT project. Their work has provided us with a strong foundation for our example, and their insights have been invaluable for our development process. Thank you!
+We would like to express our sincere gratitude to the authors and contributors of the NanoGPT project.
+Their work has provided us with a strong foundation for our example,
+and their insights have been invaluable for our development process. Thank you!
diff --git a/docs/tutorial/torch_fsdp_nanogpt.md b/docs/tutorial/torch_fsdp_nanogpt.md
index 70d157bcc..5716f67c3 100644
--- a/docs/tutorial/torch_fsdp_nanogpt.md
+++ b/docs/tutorial/torch_fsdp_nanogpt.md
@@ -1,28 +1,35 @@
# Switch from DDP to FSDP with NanoGPT

-Welcome to this guide on how to transition from DDP (Distributed Data Parallel) to FSDP (Fully Sharded Data Parallelism) for training the NanoGPT model. This guide assumes familiarity with the previous DDP guide. If you're new to DDP, we recommend checking out the DDP guide first.
+Welcome to this guide on how to transition from DDP (Distributed Data Parallel) to
+FSDP (Fully Sharded Data Parallel) for training the NanoGPT model. This guide assumes
+familiarity with the previous DDP guide. If you're new to DDP, we recommend checking out the DDP guide first.

## What is FSDP?

-FSDP is an alternative approach to DDP, designed to improve the efficiency of distributed training. It achieves this by effectively partitioning data and model parameters, reducing communication overhead, and enabling more efficient training on large-scale models.
+FSDP is an alternative approach to DDP for distributed training. Instead of replicating the full
+model on every GPU, it shards model parameters, gradients, and optimizer states across the workers,
+which greatly reduces per-GPU memory usage and makes it practical to train much larger models.
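This tutorial applies FSDP by submitting a different job specification. As background, the sketch
below shows the generic PyTorch pattern for wrapping a model with FSDP instead of DDP; it is
illustrative only (the linear layer stands in for the GPT model) and is not this repository's
training code.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(768, 768).cuda()  # stand-in for the real GPT model
# With DDP this line would be: model = torch.nn.parallel.DistributedDataParallel(model)
model = FSDP(model)  # parameters, gradients, and optimizer states are sharded across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)  # create the optimizer after wrapping
```

Creating the optimizer after wrapping matters because `model.parameters()` then refers to the
sharded parameters managed by FSDP.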
## Configure FSDP for NanoGPT

-To replace DDP with FSDP in your existing NanoGPT training configuration, simply make the following changes. Use the `kubectl` command to apply the modified training configuration:
+To replace DDP with FSDP in your existing NanoGPT training configuration,
+simply apply the modified job configuration with the `kubectl` command:

```bash
-$ kubectl -n dlrover apply -f examples/pytorch/nanogpt/fsdp_elastic_job.yaml
+kubectl -n dlrover apply -f examples/pytorch/nanogpt/fsdp_elastic_job.yaml
```

-Upon successful application of the job configuration, you can monitor the status of the training nodes using the command below:
+Upon successful application of the job configuration,
+you can monitor the status of the training nodes using the command below:

```bash
-$ kubectl -n dlrover get pods
+kubectl -n dlrover get pods
```

## Comparing DDP and FSDP Results

-Let's compare the results obtained using DDP and FSDP with the same parameter settings. Here are the results for the two approaches:
+Let's compare the results obtained using DDP and FSDP with the same parameter settings.
+Here are the results for the two approaches: **DDP:** @@ -42,17 +49,17 @@ Let's compare the results obtained using DDP and FSDP with the same parameter se --n_embd 384 ``` -### More detailed description of the pods: +### More detailed description of the pods Worker-0 Logs ```bash -$ kubectl logs -n dlrover torch-nanogpt-edljob-worker-0 +kubectl logs -n dlrover torch-nanogpt-edljob-worker-0 ``` results on DDP: -``` +```text iter 0: loss 4.2519, time 1295.23ms, mfu -100.00%, cuda memory 0.499G, lr 6.00e-04, total time 1.30s iter 1: loss 3.5362, time 26.58ms, mfu -100.00%, cuda memory 0.499G, lr 6.00e-04, total time 1.32s iter 2: loss 4.0429, time 26.42ms, mfu -100.00%, cuda memory 0.499G, lr 6.00e-04, total time 1.35s @@ -68,7 +75,7 @@ iter 10: loss 3.2967, time 28.30ms, mfu 3.27%, cuda memory 0.499G, lr 6.00e-04, results on FSDP: -``` +```text iter 0: loss 4.2674, time 1967.15ms, mfu -100.00%, cuda memory 0.479G, lr 6.00e-04, total time 1.97s iter 1: loss 3.4770, time 26.56ms, mfu -100.00%, cuda memory 0.479G, lr 6.00e-04, total time 1.99s iter 2: loss 4.6944, time 27.10ms, mfu -100.00%, cuda memory 0.479G, lr 6.00e-04, total time 2.02s @@ -85,12 +92,12 @@ iter 10: loss 3.2457, time 30.30ms, mfu 1.62%, cuda memory 0.479G, lr 6.00e-04, Worker-1 Logs ```bash -$ kubectl logs -n dlrover torch-nanogpt-edljob-worker-1 +kubectl logs -n dlrover torch-nanogpt-edljob-worker-1 ``` results on DDP: -``` +```text iter 0: loss 4.2464, time 1295.62ms, mfu -100.00%, cuda memory 0.499G, lr 6.00e-04, total time 1.30s iter 1: loss 3.4549, time 26.48ms, mfu -100.00%, cuda memory 0.499G, lr 6.00e-04, total time 1.32s iter 2: loss 4.0122, time 26.27ms, mfu -100.00%, cuda memory 0.499G, lr 6.00e-04, total time 1.35s @@ -106,7 +113,7 @@ iter 10: loss 3.3080, time 28.20ms, mfu 3.28%, cuda memory 0.499G, lr 6.00e-04, results on FSDP: -``` +```text iter 0: loss 4.2821, time 1893.33ms, mfu -100.00%, cuda memory 0.479G, lr 6.00e-04, total time 1.89s iter 1: loss 3.5487, time 26.76ms, mfu -100.00%, cuda memory 0.479G, lr 6.00e-04, total time 1.92s iter 2: loss 4.7303, time 26.95ms, mfu -100.00%, cuda memory 0.479G, lr 6.00e-04, total time 1.95s @@ -120,6 +127,7 @@ iter 9: loss 3.2509, time 29.63ms, mfu 1.63%, cuda memory 0.479G, lr 6.00e-04, t iter 10: loss 3.2535, time 30.32ms, mfu 1.62%, cuda memory 0.479G, lr 6.00e-04, total time 2.25s ``` -# References +## References -This guide is a supplemental resource to [torch_ddp_nanogpt.md](./torch_ddp_nanogpt.md). For more details about the usage environment, please refer to torch_ddp_nanogpt.md. \ No newline at end of file +This guide is a supplemental resource to [torch_ddp_nanogpt.md](./torch_ddp_nanogpt.md). +For more details about the usage environment, please refer to torch_ddp_nanogpt.md. diff --git a/docs/tutorial/torch_on_cloud.md b/docs/tutorial/torch_on_cloud.md index d2ac8005e..235a9db81 100644 --- a/docs/tutorial/torch_on_cloud.md +++ b/docs/tutorial/torch_on_cloud.md @@ -5,14 +5,15 @@ on a public cloud, namely, Alibaba Cloud Container Service for Kubernetes(ACK). ## Preliminary -- Create a Kubernetes cluster on [ACK](https://help.aliyun.com/document_detail/309552.htm?spm=a2c4g.11186623.0.0.168f6b7aegH7nI#task-2112671). +- Create a Kubernetes cluster on [ACK](https://help.aliyun.com/document_detail/309552.htm?spm=a2c4g.11186623.0.0.168f6b7aegH7nI#task-2112671). - Configure cluster credentials on your local computer. -- Create a [NAS](https://help.aliyun.com/document_detail/477380.html?spm=a2c4g.11186623.0.0.10635c83Xn7Tkh) storage and mount it to the cluster. 
+- Create a [NAS](https://help.aliyun.com/document_detail/477380.html?spm=a2c4g.11186623.0.0.10635c83Xn7Tkh) +storage and mount it to the cluster. If you do not have a Kubernetes cluster on Cloud, you also can start a local kubernetes cluster by [Minikube start](https://minikube.sigs.k8s.io/docs/start/). -## Deploy the ElasticJob CRD on the Kubernetes Cluster. +## Deploy the ElasticJob CRD on the Kubernetes Cluster 1. Clone the repo to your host. @@ -27,7 +28,7 @@ cd dlrover/go/operator/ make deploy IMG=easydl/elasticjob-controller:master ``` -2. Grant permission for the DLRover master to Access CRDs. +3. Grant permission for the DLRover master to Access CRDs. ```bash kubectl -n dlrover apply -f config/manifests/bases/default-role.yaml