Train with llava-llama3 #8

Closed
hellangleZ opened this issue Apr 30, 2024 · 9 comments

@hellangleZ

After starting pretraining, I hit the following error:

Traceback (most recent call last):
File "/data2/LLaVA-pp/LLaVA/llava/train/train_mem.py", line 4, in
train(attn_implementation="flash_attention_2")
File "/data2/LLaVA-main/llava/train/train.py", line 969, in train
trainer.train()
File "/data22/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1876, in train
return inner_training_loop(
File "/data22/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2187, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/data22/llava/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in iter
current_batch = next(dataloader_iter)
File "/data22/llava/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in next
data = self._next_data()
File "/data22/llava/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/data22/llava/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/data22/llava/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/data22/llava/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/data22/llava/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/data2/LLaVA-main/llava/train/train.py", line 751, in call
input_ids = torch.nn.utils.rnn.pad_sequence(
File "/data22/llava/lib/python3.10/site-packages/torch/nn/utils/rnn.py", line 400, in pad_sequence
return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
TypeError: pad_sequence(): argument 'padding_value' (position 3) must be float, not NoneType
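
For reference, this TypeError comes from the data collator passing tokenizer.pad_token_id to pad_sequence as padding_value; when the tokenizer defines no pad token (as with LLaMA-3), that value is None. A minimal illustration of the failing call pattern (not the actual LLaVA collator, just a sketch):

# Illustrative only: reproduces the TypeError when padding_value is None,
# which is what tokenizer.pad_token_id returns if no pad token is set.
import torch
from torch.nn.utils.rnn import pad_sequence

pad_token_id = None  # stand-in for tokenizer.pad_token_id without a pad token
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]

# Raises: TypeError: pad_sequence(): argument 'padding_value' (position 3)
# must be float, not NoneType
batch = pad_sequence(sequences, batch_first=True, padding_value=pad_token_id)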

mmaaz60 (Member) commented Apr 30, 2024

Hi @hellangleZ,

Thank you for your interest in our work. Please make sure you have followed the steps below correctly to run the training.

STEP 1: Install all the dependencies exactly as shown below,

git clone https://github.com/mbzuai-oryx/LLaVA-pp.git
cd LLaVA-pp
git submodule update --init --recursive

pip install --upgrade pip
pip install -e .

pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3

pip install ninja
pip install flash-attn --no-build-isolation --no-cache-dir

STEP 2: Ensure that you have the correct transformers version by installing it from the pinned commit below.

pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3
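
An optional sanity check (a suggestion, not part of the official steps) is to confirm which transformers build ended up installed; with the pinned commit above, the environment reported later in this thread shows 4.41.0.dev0:

# Optional check: print the installed transformers version.
import transformers
print(transformers.__version__)  # around 4.41.0.dev0 for the pinned commit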

STEP 3: Ensure that you have copied all the relevant files into the LLaVA directory,

For LLaMA-3, do the following,

cp LLaMA-3-V/train.py LLaVA/llava/train/train.py
cp LLaMA-3-V/conversation.py LLaVA/llava/conversation.py
cp LLaMA-3-V/builder.py LLaVA/llava/model/builder.py
cp LLaMA-3-V/llava_llama.py LLaVA/llava/model/language_model/llava_llama.py

For Phi-3, do the following,

cp Phi-3-V/train.py LLaVA/llava/train/train.py
cp Phi-3-V/llava_phi3.py LLaVA/llava/model/language_model/llava_phi3.py
cp Phi-3-V/builder.py LLaVA/llava/model/builder.py
cp Phi-3-V/model__init__.py LLaVA/llava/model/__init__.py
cp Phi-3-V/main__init__.py LLaVA/llava/__init__.py
cp Phi-3-V/conversation.py LLaVA/llava/conversation.py

STEP 4: Make sure you are using --version plain for pretraining, --version llama3 for LLaMA-3 based fine-tuning and --version phi3_instruct for Phi-3 based fine-tuning.

STEP 5: Use meta-llama/Meta-Llama-3-8B-Instruct as the base model for LLaMA-3 based training, and microsoft/Phi-3-mini-4k-instruct as the base model for Phi-3 based training.

I hope this solves the issue. If it does not, please provide step-by-step instructions to reproduce it so that we can assist you better.

Good Luck :)

@hellangleZ (Author)

(Quoting @mmaaz60's setup instructions above.)

Hi @mmaaz60,

[screenshot]

For this step:

[screenshot]

Should it be in the LLaVA folder or just the LLaVA-pp folder?

mmaaz60 (Member) commented May 1, 2024

Hi @hellangleZ

It should be in the LLaVA-pp/LLaVA folder.

@hellangleZ (Author)

Hi @mmaaz60,

I followed all the steps one by one, but there is still a bug.

(Replying to @mmaaz60: "It should be in the LLaVA-pp/LLaVA folder.")

[screenshots of the setup steps]

Steps 4 and 5:

[screenshot]

But:

[screenshot of the error]

I am sure the code is there:

[screenshot]

@hellangleZ (Author)

Hi @mmaaz60,

The same issue also occurs with LLaMA-3 pretraining.

(Replying to @mmaaz60: "It should be in the LLaVA-pp/LLaVA folder.")

[screenshot of the error]

Luo-Z13 commented May 1, 2024

(Quoting @mmaaz60's setup instructions above.)

Hello @mmaaz60,

I have been following the installation process you provided exactly, with the exception of the version of accelerate (I am using accelerate==0.29.3). Here are the specific steps and issues I encountered:

  1. After running pip install git+https://github.com/huggingface/transformers@a98c41798cf6ed99e1ff17e3792d6e06a2ff2ff3, I received the following errors:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llava 1.2.2.post1 requires tokenizers==0.15.1, but you have tokenizers 0.19.1 which is incompatible.
llava 1.2.2.post1 requires transformers==4.37.2, but you have transformers 4.41.0.dev0 which is incompatible.

  2. Then, during LoRA fine-tuning, the default version of accelerate was too old, which resulted in the following error: TypeError: Accelerator.__init__() got an unexpected keyword argument 'use_seedable_sampler'. To resolve this, I ran pip install accelerate --upgrade, which updated accelerate to version 0.29.3.

  3. Afterwards, I encountered another error: TypeError: pad_sequence(): argument 'padding_value' (position 3) must be float, not NoneType.

Could you please help me diagnose and resolve these issues? Here's my current environment setup:

accelerate                0.29.3
aiofiles                  23.2.1
altair                    5.3.0
annotated-types           0.6.0
anyio                     4.3.0
appdirs                   1.4.4
attrs                     23.2.0
bitsandbytes              0.42.0
certifi                   2024.2.2
charset-normalizer        3.3.2
click                     8.1.7
contourpy                 1.2.1
cycler                    0.12.1
deepspeed                 0.12.6
docker-pycreds            0.4.0
einops                    0.6.1
einops-exts               0.0.4
exceptiongroup            1.2.1
fastapi                   0.110.3
ffmpy                     0.3.2
filelock                  3.14.0
flash-attn                2.5.8
fonttools                 4.51.0
fsspec                    2024.3.1
gitdb                     4.0.11
GitPython                 3.1.43
gradio                    4.16.0
gradio_client             0.8.1
h11                       0.14.0
hjson                     3.1.0
httpcore                  0.17.3
httpx                     0.24.0
huggingface-hub           0.22.2
idna                      3.7
importlib_resources       6.4.0
Jinja2                    3.1.3
joblib                    1.4.0
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
kiwisolver                1.4.5
llava                     1.2.2.post1 /usr/VLM/LLaVA-pp
markdown-it-py            3.0.0
markdown2                 2.4.13
MarkupSafe                2.1.5
matplotlib                3.8.4
mdurl                     0.1.2
mpmath                    1.3.0
networkx                  3.3
ninja                     1.11.1.1
numpy                     1.26.4
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.18.1
nvidia-nvjitlink-cu12     12.4.127
nvidia-nvtx-cu12          12.1.105
orjson                    3.10.1
packaging                 24.0
pandas                    2.2.2
peft                      0.10.0
pillow                    10.3.0
pip                       24.0
protobuf                  4.25.3
psutil                    5.9.8
py-cpuinfo                9.0.0
pydantic                  2.7.1
pydantic_core             2.18.2
pydub                     0.25.1
Pygments                  2.17.2
pynvml                    11.5.0
pyparsing                 3.1.2
python-dateutil           2.9.0.post0
python-multipart          0.0.9
pytz                      2024.1
PyYAML                    6.0.1
referencing               0.35.0
regex                     2024.4.28
requests                  2.31.0
rich                      13.7.1
rpds-py                   0.18.0
ruff                      0.4.2
safetensors               0.4.3
scikit-learn              1.2.2
scipy                     1.13.0
semantic-version          2.10.0
sentencepiece             0.1.99
sentry-sdk                2.0.1
setproctitle              1.3.3
setuptools                68.2.2
shellingham               1.5.4
shortuuid                 1.0.13
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.1
starlette                 0.37.2
svgwrite                  1.4.3
sympy                     1.12
threadpoolctl             3.5.0
timm                      0.6.13
tokenizers                0.19.1
tomlkit                   0.12.0
toolz                     0.12.1
torch                     2.1.2
torchvision               0.16.2
tqdm                      4.66.2
transformers              4.41.0.dev0
triton                    2.1.0
typer                     0.12.3
typing_extensions         4.11.0
tzdata                    2024.1
urllib3                   2.2.1
uvicorn                   0.29.0
wandb                     0.16.6
wavedrom                  2.0.3.post3
websockets                11.0.3
wheel                     0.41.2

Thank you for your help!

@hellangleZ (Author)

(Replying to @mmaaz60: "It should be in the LLaVA-pp/LLaVA folder.")

[screenshots]

STEP 2:

[screenshot]

STEP 3:

[screenshot]

Great, it works now. It was a DeepSpeed issue.

mmaaz60 (Member) commented May 1, 2024

Hi @Luo-Z13,

  • The errors from pip's dependency resolver can be ignored.
  • The error TypeError: pad_sequence(): argument 'padding_value' (position 3) must be float, not NoneType occurs during LLaMA-3 based model training. LLaMA-3 does not define a pad token, but LLaVA-LLaMA-3 training needs one, so the workaround is to add a special pad token and resize the embeddings. This is done in smart_tokenizer_and_embedding_resize() in train.py; see the sketch below.
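
For reference, here is a minimal sketch of that pad-token workaround, assuming a standard Hugging Face tokenizer and causal LM; the function name and the mean initialization follow the usual smart_tokenizer_and_embedding_resize pattern but are illustrative, not the exact repository code:

def add_pad_token_and_resize(tokenizer, model, pad_token="<pad>"):
    # Register a pad token (LLaMA-3 tokenizers do not ship with one) and
    # grow the embedding matrices to cover the new vocabulary entry.
    num_new = tokenizer.add_special_tokens({"pad_token": pad_token})
    model.resize_token_embeddings(len(tokenizer))

    if num_new > 0:
        # Initialize the new row(s) with the mean of the existing rows so
        # training starts from a reasonable point.
        input_emb = model.get_input_embeddings().weight.data
        output_emb = model.get_output_embeddings().weight.data
        input_emb[-num_new:] = input_emb[:-num_new].mean(dim=0, keepdim=True)
        output_emb[-num_new:] = output_emb[:-num_new].mean(dim=0, keepdim=True)

# Illustrative usage: call add_pad_token_and_resize(tokenizer, model) right
# after loading the LLaMA-3 tokenizer and model, before building the data
# collator, so tokenizer.pad_token_id is no longer None.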

Please make sure that the baseline official LLaVA code works properly, then make sure to copy all the LLaMA-3 related files into the corresponding directories. Lastly, note that to run LLaMA-3 based training you need to pass --version llama3.

I hope this helps solve the issue. Good luck.

mmaaz60 (Member) commented May 1, 2024

Hi @hellangleZ @Luo-Z13,

I am closing this issue as @hellangleZ was able to run the training. Please feel free to create a new issue if you have any questions or encounter any other errors. I appreciate your cooperation. Thank you.

mmaaz60 closed this as completed on May 1, 2024
pythonlearner1025 pushed a commit to pythonlearner1025/LLaVA-pp that referenced this issue May 8, 2024