
Multi-GPU inference issue #224

Closed
ruifengma opened this issue Jun 18, 2024 · 11 comments

@ruifengma

When I try to run InternVL-Chat-V1.5, which is large and needs at least two GPUs, I use the following command:
CUDA_VISIBLE_DEVICES=2,3 python run.py --data MME --model InternVL-Chat-V1-5 --verbose
However, it runs on only one of the two GPUs and gives an OOM error. Do I need to add more configuration?

@kennymckormick
Member

kennymckormick commented Jun 18, 2024

Hi @ruifengma,
Sorry, we currently only support evaluation on GPUs with 80 GB of memory (we will adapt to lower-memory GPUs very soon). A quick fix is to remove L111 in vlmeval/vlm/internvl_chat.py and add device_map='auto' to the AutoModel.from_pretrained call.
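Roughly, the change amounts to something like the following (a sketch only; the exact variable names around L107-111 may differ):

# vlmeval/vlm/internvl_chat.py -- let transformers shard the model across
# all visible GPUs instead of moving the whole model to a single device.
self.model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map='auto',   # shard across the GPUs in CUDA_VISIBLE_DEVICES
).eval()
# ...and remove the later line (L111) that moves the model to a single device,
# e.g. self.model = self.model.to(device), since that would undo the sharding.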

@ruifengma
Author

Thanks @kennymckormick for the reply. The model can now be loaded onto 2 GPUs, but I ran into a new issue during inference:

Loading checkpoint shards: 100%|██████████| 11/11 [00:16<00:00,  1.52s/it]
/data/mrx/VLMEvalKit/vlmeval/vlm/internvl_chat.py:122: UserWarning: Following kwargs received: {'do_sample': False, 'max_new_tokens': 1024, 'top_p': None, 'num_beams': 1}, will use as generation config.
  warnings.warn(f'Following kwargs received: {self.kwargs}, will use as generation config. ')
  0%|          | 0/2374 [00:00<?, ?it/s]
/data/mrx/VLMEvalKit/vlmeval/vlm/base.py:140: UserWarning: Model InternVLChat does not support interleaved input. Will use the first image and aggregated texts as prompt.
  warnings.warn(
dynamic ViT batch size: 1
  0%|          | 0/2374 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data/mrx/VLMEvalKit/run.py", line 155, in <module>
    main()
  File "/data/mrx/VLMEvalKit/run.py", line 79, in main
    model = infer_data_job(
            ^^^^^^^^^^^^^^^
  File "/data/mrx/VLMEvalKit/vlmeval/inference.py", line 164, in infer_data_job
    model = infer_data(
            ^^^^^^^^^^^
  File "/data/mrx/VLMEvalKit/vlmeval/inference.py", line 130, in infer_data
    response = model.generate(message=struct, dataset=dataset_name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/mrx/VLMEvalKit/vlmeval/vlm/base.py", line 135, in generate
    return self.generate_inner(message, dataset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/mrx/VLMEvalKit/vlmeval/vlm/internvl_chat.py", line 220, in generate_inner
    return self.generate_v1_5(message, dataset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/mrx/VLMEvalKit/vlmeval/vlm/internvl_chat.py", line 201, in generate_v1_5
    response = self.model.chat(self.tokenizer, pixel_values=pixel_values,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_internvl_chat.py", line 309, in chat
    generation_output = self.generate(
                        ^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/vlmeval/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_internvl_chat.py", line 353, in generate
    input_embeds[selected] = vit_embeds.reshape(-1, C)
    ~~~~~~~~~~~~^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

@junming-yang
Collaborator

junming-yang commented Jun 18, 2024

At L197 in vlmeval/vlm/internvl_chat.py, the image data is loaded onto the default GPU. Try checking the GPU id before loading the data.

@ruifengma
Author

At L197 in vlmeval/vlm/internvl_chat.py, the image data is loaded onto the default GPU. Try checking the GPU id before loading the data.

Thanks @junming-yang. The image-loading line I found is
pixel_values = load_image(image_path, max_num=self.max_num).cuda().to(torch.bfloat16)
Do I need to check the device id dynamically, or force it to GPU 0? The model is loaded onto GPU 0 first and then GPU 1 (0 and 1 map to physical GPUs 2 and 3).

@junming-yang
Collaborator

You can try to dynamically check the model's device.

@ruifengma
Author

You can try to dynamically check the model's device.

I appended .to(torch.cuda.current_device()) at the end, but it still gives me the same error.

@junming-yang
Collaborator

Maybe you can try .to(self.model.device).
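For example, applied to the load_image line you posted (a sketch of the suggestion; with device_map='auto', self.model.device should point at the device holding the first model shard):

# move the image tensor to the model's reported device instead of the default CUDA device
pixel_values = load_image(image_path, max_num=self.max_num).to(device=self.model.device, dtype=torch.bfloat16)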

@ruifengma
Author

Maybe you can try .to(self.model.device).

Yes, I did. Still the same error.

@junming-yang
Collaborator

I have tried to reproduce your bug.
Here is the revised code (replacing the original L107-110):

self.model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16,
                                       trust_remote_code=True,
                                       load_in_8bit=load_in_8bit, device_map='auto').eval()
# if not load_in_8bit:
#     self.model = self.model.to(device)

Each GPU is allocated about 26 GiB, and no error is reported. Please check your code.

@ruifengma
Author

I have tried to reproduce your bug. Here is the revised code (replacing the original L107-110):

self.model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16,
                                       trust_remote_code=True,
                                       load_in_8bit=load_in_8bit, device_map='auto').eval()
# if not load_in_8bit:
#     self.model = self.model.to(device)

Each GPU is allocated about 26 GiB, and no error is reported. Please check your code.

I did not modify anything on my own; I followed the advice exactly. I am using two A40 GPUs for the task, and I checked that I made the same modification as you did.

@ruifengma
Author

ruifengma commented Jun 19, 2024

It was not a code issue. Updating to the latest version of the official InternVL configuration files solved it.
