Integrate ZeRO-Powered Data Parallelism #20

StellaAthena · 2021-01-01T02:20:41Z

Per DeepSpeed:

We developed ZeRO to conquer the limitations of data parallelism and model parallelism while achieving the merits of both. ZeRO removes the memory redundancies across data-parallel processes by partitioning the model states (parameters, gradients, and optimizer state) across data-parallel processes instead of replicating them. It uses a dynamic communication schedule during training to share the necessary state across distributed devices to retain the computational granularity and communication volume of data parallelism.

We call this ZeRO-powered data parallelism, which allows per-device memory usage to scale linearly with the degree of data parallelism and incurs similar communication volume as data parallelism. ZeRO-powered data parallelism can fit models of arbitrary size—as long as the aggregated device memory is large enough to share the model states.
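To make the partitioning concrete, here is a rough back-of-the-envelope sketch of per-device memory for the model states. It is not from the quoted text: the function name `zero_bytes_per_device` is hypothetical, and the byte counts assume mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state). Activations and temporary buffers are ignored.

```python
def zero_bytes_per_device(num_params, dp_degree, stage, k=12):
    """Approximate model-state bytes per device (hypothetical helper).

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, and k = 12 bytes/param for fp32
    optimizer state (master weights, momentum, variance).
    """
    params = 2 * num_params
    grads = 2 * num_params
    optim = k * num_params
    if stage >= 1:   # stage 1: partition optimizer state
        optim /= dp_degree
    if stage >= 2:   # stage 2: also partition gradients
        grads /= dp_degree
    if stage >= 3:   # stage 3: also partition parameters
        params /= dp_degree
    return params + grads + optim


# Example: a 7.5B-parameter model on 64 data-parallel devices.
for stage in range(4):
    gb = zero_bytes_per_device(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per device")
```

Under these assumptions, stage 0 (plain data parallelism) needs about 120 GB per device, while stage 3 needs about 1.9 GB, which is the baseline divided by the data-parallel degree.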

sdtblck · 2021-01-05T15:42:07Z

This is handled automatically by deepspeed.initialize.
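For anyone landing here later, a minimal sketch of what that can look like. The helper names `build_zero_config` and `init_engine` are hypothetical, the config values are placeholders, and the `config=` keyword assumes a reasonably recent DeepSpeed release (older versions took `config_params=`). As far as I know, ZeRO is switched on through the `zero_optimization` section of the config that `deepspeed.initialize` reads.

```python
def build_zero_config(stage=1, batch_size=8):
    """Hypothetical helper: a DeepSpeed config dict with ZeRO enabled."""
    return {
        # Placeholder; must be consistent with the number of GPUs
        # and the micro-batch size in a real run.
        "train_batch_size": batch_size,
        "fp16": {"enabled": True},
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "zero_optimization": {"stage": stage},
    }


def init_engine(model, stage=1):
    """Wrap a torch.nn.Module in a DeepSpeed engine (sketch only)."""
    import deepspeed  # imported lazily; requires DeepSpeed to be installed

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=build_zero_config(stage),
    )
    return engine, optimizer
```

Training then goes through the returned engine (`engine(batch)`, `engine.backward(loss)`, `engine.step()`), normally started with the `deepspeed` launcher so the distributed environment is set up.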

StellaAthena added the feature request label Jan 1, 2021

StellaAthena added this to To do in 1T or BUST via automation Jan 1, 2021

sdtblck closed this as completed Jan 5, 2021

1T or BUST automation moved this from To do to Done Jan 5, 2021
