
Dissecting Llama-2-7b's Behavior with Varied Pretraining and Attention Mechanisms

Objective of the Study

The key objective of this study was to analyze the effects of:

  • Different values of pretraining_tp on the model's behavior during training and inference (tested pretraining_tp=1 and pretraining_tp=2).
  • The difference in model performance/behavior between Flash Attention and the traditional attention mechanism for pretraining_tp = 1 and 2.

Dataset Used

We utilized the Alpaca dataset, a collection of 52,000 instructions and demonstrations generated with OpenAI's text-davinci-003 engine and intended to enhance instruction-following capabilities in language models. We structured each example into a task-description prompt using a formatting function (a sketch of such a function is shown below), so the model receives instructions in a consistent, recognizable format.
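
The repo's exact formatting code is not reproduced here; the sketch below is a minimal Alpaca-style prompt formatter, assuming the dataset's standard instruction/input/output fields.

```python
# Minimal sketch of an Alpaca-style prompt formatter (assumed, not the repo's exact code).
def format_alpaca(example):
    """Turn one Alpaca record into a single training prompt string."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
```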

Model Used

NousResearch/Llama-2-7b-hf

Tools & Techniques

  1. QLoRA: A technique to reduce the memory footprint of large language models during fine-tuning. It involves:
  • Quantizing the pre-trained model to fewer bits and freezing it.
  • Attaching small, trainable adapter layers (LoRA).
  • Fine-tuning only the adapter layers, using the frozen quantized model for context.
  2. Flash Attention: A method that makes the attention computation more efficient in both speed and memory.
  • Reorders the attention computation to increase speed and reduce memory usage.
  • It is claimed to accelerate training by up to 3x and to reduce memory usage from quadratic to linear in sequence length; we test the speed claim in this brief study.
  • Supported only on specific GPUs (Ampere and Hopper series). A loading sketch showing how both study variables are toggled follows this list.
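
As a rough illustration, the sketch below loads the base model with Flash Attention enabled and sets pretraining_tp. This is an assumed setup, not the repo's exact code: the use_flash_attention_2 flag reflects the Transformers API from around the time of this study (newer releases use attn_implementation="flash_attention_2"), and the QLoRA-specific pieces are sketched under Experimental Setup.

```python
# Minimal sketch (assumed, not the repo's exact code): toggling the two study
# variables, Flash Attention and pretraining_tp, when loading Llama-2-7b.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    use_flash_attention_2=True,  # set to False for the "Normal" attention runs
)
model.config.pretraining_tp = 2  # 1 or 2, the two values compared in this study

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default
```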

Experimental Setup

  1. All tests were conducted on an A100 GPU.

  2. model_id = "NousResearch/Llama-2-7b-hf"

  3. BitsAndBytesConfig Setup

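The original README shows this configuration as a screenshot. The sketch below is a typical QLoRA-style BitsAndBytesConfig; the specific values (4-bit NF4, bfloat16 compute, double quantization) are assumptions, not necessarily the exact settings used.

```python
# Minimal QLoRA-style quantization config (assumed values, shown here in place
# of the original screenshot).
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base model to 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 on the A100
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
```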

  4. LoRA Config

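Also shown as a screenshot in the original; the sketch below is a representative LoraConfig with assumed hyperparameters (rank, alpha, dropout, target modules).

```python
# Minimal LoRA adapter config (assumed hyperparameters, shown in place of the
# original screenshot).
from peft import LoraConfig

peft_config = LoraConfig(
    r=64,                                  # adapter rank
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # commonly targeted projections (assumed)
    bias="none",
    task_type="CAUSAL_LM",
)
```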

  5. Training Arguments

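Also a screenshot in the original; the sketch below shows representative TrainingArguments for this kind of QLoRA run, with assumed values throughout.

```python
# Minimal TrainingArguments sketch (assumed values, shown in place of the
# original screenshot).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-2-7b-alpaca-qlora",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",   # paged optimizer from bitsandbytes, typical for QLoRA
    learning_rate=2e-4,
    bf16=True,                   # bf16 mixed precision on the A100
    logging_steps=10,
    save_strategy="epoch",
)
```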

Comparative Analysis

| Case | Attention Type | pretraining_tp | Training Loss | Training Time (h:mm) | Inference Time (s) |
|------|----------------|----------------|---------------|----------------------|--------------------|
| 1    | Flash          | 1              | 0.8855        | 1:02                 | 3.96               |
| 2    | Flash          | 2              | 0.8854        | 1:01                 | 3.61               |
| 3    | Normal         | 1              | 0.8848        | 1:53                 | 3.61               |
| 4    | Normal         | 2              | 0.8849        | 1:54                 | 3.56               |

Observations

  1. Training Loss:
  • The training losses across all cases are very close to each other. There is a very minor decrease in training loss with pretraining_tp=1 vs. pretraining_tp=2, but the difference is negligible.
  • The attention type (Flash vs. Normal) does not have a noticeable impact on the final training loss.
  2. Training Time:
  • Flash Attention reduces training time by nearly half compared to Normal Attention.
  • The pretraining_tp value does not significantly impact training time.
  3. Inference Time:
  • Among the Flash Attention runs, pretraining_tp=2 gives the faster inference time (3.61 s vs. 3.96 s); the single fastest case overall is Normal Attention with pretraining_tp=2 (3.56 s).
  • Interestingly, Normal Attention has similar inference times for both pretraining_tp values, and both are comparable to or slightly faster than Flash Attention with pretraining_tp=1.

Future Work

  • Testing the models on a validation dataset to measure the generalization performance.
  • Experimenting with a wider range of pretraining_tp values.
  • Investigating why the Normal Attention method resulted in similar or slightly better inference times for these specific pretraining_tp values.

Conclusion

  • Flash Attention is significantly faster in training compared to Normal Attention, which is expected based on the stated advantages of Flash Attention.
  • The pretraining_tp values, either 1 or 2, do not drastically impact the model's performance or training/inference times in this experiment. However, using pretraining_tp of 2 slightly improves inference time when using Flash Attention.
  • The model’s performance, in terms of loss, is mostly consistent across all cases. Hence, other considerations like training time, inference time, and computational resources could be more important when deciding on the configurations to use.
