
Dissecting Llama-2-7b's Behavior with Varied Pretraining and Attention Mechanisms

Objective of the Study

The key objective of this study was to analyze the effects of:

  • Different values of pretraining_tp on the model's behavior during training and inference (tested pretraining_tp=1 and pretraining_tp=2).
  • The difference in model performance/behavior between Flash Attention and the traditional attention mechanism for pretraining_tp = 1 and 2.

Dataset Used

We utilized the Alpaca dataset, a collection of 52,000 instructions and demonstrations generated with OpenAI's text-davinci-003 engine and intended to enhance instruction-following capabilities in language models. We structured each example into a task-description prompt using a formatting function (a sketch of such a function is shown below), so the model receives instructions in a consistent, recognizable format.
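
The repo's exact formatting code is not reproduced here; the sketch below is a minimal Alpaca-style prompt formatter, assuming the dataset's standard instruction/input/output fields.

```python
# Minimal sketch of an Alpaca-style prompt formatter (assumed, not the repo's exact code).
def format_alpaca(example):
    """Turn one Alpaca record into a single training prompt string."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
```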

Model Used

NousResearch/Llama-2-7b-hf

Tools & Techniques

  1. QLoRA: A technique to reduce the memory footprint of large language models during fine-tuning. It involves:
  • Quantizing the pre-trained model to fewer bits and freezing it.
  • Attaching small, trainable adapter layers (LoRA).
  • Fine-tuning only the adapter layers, using the frozen quantized model for context.
  2. Flash Attention: A method that makes the attention computation more efficient in both speed and memory.
  • Reorders the attention computation to increase speed and reduce memory usage.
  • It is claimed to accelerate training by up to 3x and to reduce memory usage from quadratic to linear in sequence length; we test the speed claim in this brief study.
  • Supported only on specific GPUs (Ampere and Hopper series). A loading sketch showing how both study variables are toggled follows this list.
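
As a rough illustration, the sketch below loads the base model with Flash Attention enabled and sets pretraining_tp. This is an assumed setup, not the repo's exact code: the use_flash_attention_2 flag reflects the Transformers API from around the time of this study (newer releases use attn_implementation="flash_attention_2"), and the QLoRA-specific pieces are sketched under Experimental Setup.

```python
# Minimal sketch (assumed, not the repo's exact code): toggling the two study
# variables, Flash Attention and pretraining_tp, when loading Llama-2-7b.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    use_flash_attention_2=True,  # set to False for the "Normal" attention runs
)
model.config.pretraining_tp = 2  # 1 or 2, the two values compared in this study

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default
```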

Experimental Setup

  1. All tests were conducted on an A100 GPU.

  2. model_id = "NousResearch/Llama-2-7b-hf"

  3. BitsAndBytesConfig Setup

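The original README shows this configuration as a screenshot. The sketch below is a typical QLoRA-style BitsAndBytesConfig; the specific values (4-bit NF4, bfloat16 compute, double quantization) are assumptions, not necessarily the exact settings used.

```python
# Minimal QLoRA-style quantization config (assumed values, shown here in place
# of the original screenshot).
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base model to 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 on the A100
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
```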

  4. LoRA Config

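Also shown as a screenshot in the original; the sketch below is a representative LoraConfig with assumed hyperparameters (rank, alpha, dropout, target modules).

```python
# Minimal LoRA adapter config (assumed hyperparameters, shown in place of the
# original screenshot).
from peft import LoraConfig

peft_config = LoraConfig(
    r=64,                                  # adapter rank
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # commonly targeted projections (assumed)
    bias="none",
    task_type="CAUSAL_LM",
)
```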

  5. Training Arguments

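Also a screenshot in the original; the sketch below shows representative TrainingArguments for this kind of QLoRA run, with assumed values throughout.

```python
# Minimal TrainingArguments sketch (assumed values, shown in place of the
# original screenshot).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-2-7b-alpaca-qlora",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",   # paged optimizer from bitsandbytes, typical for QLoRA
    learning_rate=2e-4,
    bf16=True,                   # bf16 mixed precision on the A100
    logging_steps=10,
    save_strategy="epoch",
)
```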

Comparative Analysis

| Case | Attention Type | pretraining_tp | Training Loss | Training Time (h:mm) | Inference Time (s) |
|------|----------------|----------------|---------------|----------------------|--------------------|
| 1    | Flash          | 1              | 0.8855        | 1:02                 | 3.96               |
| 2    | Flash          | 2              | 0.8854        | 1:01                 | 3.61               |
| 3    | Normal         | 1              | 0.8848        | 1:53                 | 3.61               |
| 4    | Normal         | 2              | 0.8849        | 1:54                 | 3.56               |

Observations

  1. Training Loss:
  • The training losses across all cases are very close to each other. There is a very minor decrease in training loss with pretraining_tp=1 vs. pretraining_tp=2, but the difference is negligible.
  • The attention type (Flash vs. Normal) does not have a noticeable impact on the final training loss.
  2. Training Time:
  • Flash Attention reduces training time by nearly half compared to Normal Attention.
  • The pretraining_tp value does not significantly impact training time.
  3. Inference Time:
  • Among the Flash Attention runs, pretraining_tp=2 gives the faster inference time (3.61 s vs. 3.96 s); the single fastest case overall is Normal Attention with pretraining_tp=2 (3.56 s).
  • Interestingly, Normal Attention has similar inference times for both pretraining_tp values, and both are comparable to or slightly faster than Flash Attention with pretraining_tp=1.

Future Work

  • Testing the models on a validation dataset to measure the generalization performance.
  • Experimenting with a wider range of pretraining_tp values.
  • Investigating why the Normal Attention method resulted in similar or slightly better inference times for these specific pretraining_tp values.

Conclusion

  • Flash Attention is significantly faster in training compared to Normal Attention, which is expected based on the stated advantages of Flash Attention.
  • The pretraining_tp values, either 1 or 2, do not drastically impact the model's performance or training/inference times in this experiment. However, using pretraining_tp of 2 slightly improves inference time when using Flash Attention.
  • The model’s performance, in terms of loss, is mostly consistent across all cases. Hence, other considerations like training time, inference time, and computational resources could be more important when deciding on the configurations to use.
