Get Started with Hugging Face Accelerate
========================================

The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your Accelerate training across a distributed Ray cluster.

You only need to run your existing training code with a TorchTrainer. You can expect the final code to look like this:

.. code-block:: python

    from accelerate import Accelerator

    def train_func(config):
        # Instantiate the accelerator
        accelerator = Accelerator(...)

        model = ...
        optimizer = ...
        train_dataloader = ...
        eval_dataloader = ...
        lr_scheduler = ...

        # Prepare everything for distributed training
        (
            model,
            optimizer,
            train_dataloader,
            eval_dataloader,
            lr_scheduler,
        ) = accelerator.prepare(
            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
        )

        # Start training
        ...

    from ray.train.torch import TorchTrainer
    from ray.train import ScalingConfig

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(...),
        ...
    )
    trainer.fit()
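
For reference, the following is a minimal runnable sketch of the same pattern with the placeholders filled in. The toy model, data, and hyperparameters are illustrative only and aren't part of the examples referenced later in this guide:

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        accelerator = Accelerator()

        # Toy model, optimizer, and data loader for illustration only.
        model = torch.nn.Linear(8, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
        dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
        train_dataloader = DataLoader(dataset, batch_size=32)

        # Accelerate handles device placement and distributed wrapping here.
        model, optimizer, train_dataloader = accelerator.prepare(
            model, optimizer, train_dataloader
        )

        for _ in range(config.get("epochs", 2)):
            for inputs, targets in train_dataloader:
                optimizer.zero_grad()
                loss = torch.nn.functional.mse_loss(model(inputs), targets)
                accelerator.backward(loss)
                optimizer.step()

    trainer = TorchTrainer(
        train_func,
        train_loop_config={"epochs": 2},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    )
    trainer.fit()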

.. tip::

    Model and data preparation for distributed training is completely handled by the
    ``Accelerator`` object and its ``Accelerator.prepare()`` method.

    Unlike with native PyTorch, PyTorch Lightning, or Hugging Face Transformers, don't call any additional Ray Train
    utilities like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training function.

Configure Accelerate
--------------------

In Ray Train, you can set configurations through the ``accelerate.Accelerator`` object in your training function. Below are starter examples for configuring Accelerate.

.. tabs::

    .. group-tab:: DeepSpeed

        For example, to run DeepSpeed with Accelerate, create a `DeepSpeedPlugin <https://huggingface.co/docs/accelerate/main/en/package_reference/deepspeed>`_
        from a dictionary:

        .. code-block:: python

            from accelerate import Accelerator, DeepSpeedPlugin

            DEEPSPEED_CONFIG = {
                "fp16": {
                    "enabled": True
                },
                "zero_optimization": {
                    "stage": 3,
                    "offload_optimizer": {
                        "device": "cpu",
                        "pin_memory": False
                    },
                    "overlap_comm": True,
                    "contiguous_gradients": True,
                    "reduce_bucket_size": "auto",
                    "stage3_prefetch_bucket_size": "auto",
                    "stage3_param_persistence_threshold": "auto",
                    "gather_16bit_weights_on_model_save": True,
                    "round_robin_gradients": True
                },
                "gradient_accumulation_steps": "auto",
                "gradient_clipping": "auto",
                "steps_per_print": 10,
                "train_batch_size": "auto",
                "train_micro_batch_size_per_gpu": "auto",
                "wall_clock_breakdown": False
            }

            def train_func(config):
                # Create a DeepSpeedPlugin from config dict
                ds_plugin = DeepSpeedPlugin(hf_ds_config=DEEPSPEED_CONFIG)

                # Initialize Accelerator
                accelerator = Accelerator(
                    ...,
                    deepspeed_plugin=ds_plugin,
                )

                # Start training
                ...

            from ray.train.torch import TorchTrainer
            from ray.train import ScalingConfig

            trainer = TorchTrainer(
                train_func,
                scaling_config=ScalingConfig(...),
                ...
            )
            trainer.fit()

    .. group-tab:: FSDP

        For PyTorch FSDP, create a `FullyShardedDataParallelPlugin <https://huggingface.co/docs/accelerate/main/en/package_reference/fsdp>`_
        and pass it to the Accelerator.

        .. code-block:: python

            from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
            from accelerate import Accelerator, FullyShardedDataParallelPlugin

            def train_func(config):
                fsdp_plugin = FullyShardedDataParallelPlugin(
                    state_dict_config=FullStateDictConfig(
                        offload_to_cpu=False,
                        rank0_only=False
                    ),
                    optim_state_dict_config=FullOptimStateDictConfig(
                        offload_to_cpu=False,
                        rank0_only=False
                    )
                )

                # Initialize accelerator
                accelerator = Accelerator(
                    ...,
                    fsdp_plugin=fsdp_plugin,
                )

                # Start training
                ...

            from ray.train.torch import TorchTrainer
            from ray.train import ScalingConfig

            trainer = TorchTrainer(
                train_func,
                scaling_config=ScalingConfig(...),
                ...
            )
            trainer.fit()

Note that Accelerate also provides a CLI tool, ``accelerate config``, to generate a configuration and launch your training job with ``accelerate launch``. However, it's not necessary here because Ray's :class:`~ray.train.torch.TorchTrainer` already sets up the Torch distributed environment and launches the training function on all workers.
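
To make the comparison concrete, here is a hedged sketch: instead of launching workers with ``accelerate launch``, you describe the resources with a :class:`~ray.train.ScalingConfig`. The worker count and GPU flag below are illustrative:

.. code-block:: python

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Each Ray Train worker runs `train_func` with the Torch distributed
    # environment already initialized, so no `accelerate launch` is needed.
    trainer = TorchTrainer(
        train_func,  # the Accelerate training function defined earlier
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
    trainer.fit()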

See the following end-to-end examples for more details:

.. tabs::

    .. group-tab:: Example with Ray Data

        .. dropdown:: Show Code

            .. literalinclude:: /../../python/ray/train/examples/accelerate/accelerate_torch_trainer.py
                :language: python
                :start-after: __accelerate_torch_basic_example_start__
                :end-before: __accelerate_torch_basic_example_end__

    .. group-tab:: Example with PyTorch DataLoader

        .. dropdown:: Show Code

            .. literalinclude:: /../../python/ray/train/examples/accelerate/accelerate_torch_trainer_no_raydata.py
                :language: python
                :start-after: __accelerate_torch_basic_example_no_raydata_start__
                :end-before: __accelerate_torch_basic_example_no_raydata_end__

.. seealso::

    If you're looking for more advanced use cases, check out this Llama-2 fine-tuning example:

    - `Fine-tuning Llama-2 series models with Deepspeed, Accelerate, and Ray Train. <https://github.com/ray-project/ray/tree/master/doc/source/templates/04_finetuning_llms_with_deepspeed>`_

You may also find these user guides helpful:

AccelerateTrainer Migration Guide
---------------------------------

Before Ray 2.7, Ray Train's :class:`AccelerateTrainer <ray.train.huggingface.AccelerateTrainer>` API was the recommended way to run Accelerate code. As a subclass of :class:`TorchTrainer <ray.train.torch.TorchTrainer>`, the AccelerateTrainer takes in a configuration file generated by ``accelerate config`` and applies it to all workers. Aside from that, the functionality of AccelerateTrainer is identical to TorchTrainer.

However, this caused confusion around whether it was the only way to run Accelerate code. Because you can express the full Accelerate functionality with the Accelerator and TorchTrainer combination, AccelerateTrainer is planned for deprecation in Ray 2.8, and the recommendation is to run your Accelerate code directly with TorchTrainer.
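
As a rough migration sketch (the ``accelerate_config`` argument in the commented-out legacy code reflects the pre-2.7 API and may differ in detail), move the settings from your ``accelerate config`` file into the ``Accelerator`` constructed inside the training function:

.. code-block:: python

    # Before (pre-Ray 2.7, legacy API sketch):
    #
    #     from ray.train.huggingface import AccelerateTrainer
    #
    #     trainer = AccelerateTrainer(
    #         train_func,
    #         accelerate_config="accelerate_config.yaml",  # from `accelerate config`
    #         scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    #     )

    # After: configure Accelerate inside the training function and use TorchTrainer.
    from accelerate import Accelerator
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # Any settings previously captured in the `accelerate config` YAML file
        # (mixed precision, DeepSpeed/FSDP plugins, ...) are passed here instead.
        accelerator = Accelerator(mixed_precision="fp16")
        ...

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
    trainer.fit()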