We developed ZeRO to conquer the limitations of data parallelism and model parallelism while achieving the merits of both. ZeRO removes the memory redundancies across data-parallel processes by partitioning the model states (parameters, gradients, and optimizer state) across data-parallel processes instead of replicating them. It uses a dynamic communication schedule during training to share the necessary state across distributed devices to retain the computational granularity and communication volume of data parallelism.
We call this ZeRO-powered data parallelism, which allows per-device memory usage to scale linearly with the degree of data parallelism and incurs similar communication volume as data parallelism. ZeRO-powered data parallelism can fit models of arbitrary size—as long as the aggregated device memory is large enough to share the model states.
Per DeepSpeed
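To make the memory claim concrete, here is a rough back-of-the-envelope sketch of per-device model-state memory with and without partitioning. The byte counts are assumptions, not something stated in the quote above: mixed-precision Adam with 2 bytes per parameter, 2 bytes per gradient, and 12 bytes per parameter of optimizer state, with all three states partitioned. The function name `per_device_model_state_bytes` is hypothetical, and activations, buffers, and fragmentation are ignored.

```python
def per_device_model_state_bytes(num_params, dp_degree, partitioned=True,
                                 param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate model-state memory per device, in bytes.

    Assumed defaults (not from the quoted text): fp16 parameters and
    gradients (2 bytes each) plus 12 bytes per parameter of Adam state.
    """
    if dp_degree < 1:
        raise ValueError("dp_degree must be at least 1")
    total = num_params * (param_bytes + grad_bytes + optim_bytes)
    # Plain data parallelism replicates every state on every device;
    # partitioning splits the states evenly across the data-parallel group.
    return total / dp_degree if partitioned else total


GB = 1e9
params = 1_000_000_000  # a hypothetical 1B-parameter model

replicated = per_device_model_state_bytes(params, 64, partitioned=False)
sharded = per_device_model_state_bytes(params, 64, partitioned=True)
print(f"replicated: {replicated / GB:.2f} GB per device")
print(f"partitioned over 64 devices: {sharded / GB:.2f} GB per device")
```

Under these assumptions the per-device footprint is the replicated footprint divided by the data-parallel degree, which is why a model fits once the aggregate device memory can hold the model states.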