
Error Encountered in Multi-Node Evaluation Using Distributed Arguments #142

Closed

jdy18 opened this issue Apr 5, 2024 · 4 comments

@jdy18 commented Apr 5, 2024

I encountered an issue while attempting to perform a multi-node evaluation using PyTorch's torchrun with specific distributed arguments. Below is the command I used, including the distributed arguments setup and the execution command:

DISTRIBUTED_ARGS=" \
    --nproc_per_node 3 \
    --nnodes 4 \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT"

torchrun $DISTRIBUTED_ARGS run.py \
    --data MME MMBench_DEV_EN MMBench_DEV_CN CCBench SEEDBench_IMG MMMU_DEV_VAL MathVista_MINI HallusionBench LLaVABench \
    MMBench_TEST_EN MMBench_TEST_CN \
    --model llava

Upon execution, I received the following error message:

RUN - ERROR - No such file or directory: './llava/312_MME.pkl'
It seems that only the .pkl files on node 0 were saved correctly: only the files prefixed 012, 112, and 212 exist.
Thank you in advance for your assistance!

@kennymckormick (Member)

Hi @jdy18,
Multi-node testing is not currently supported. I will check whether it would be easy to add this feature.

@jdy18 (Author) commented Apr 7, 2024

Hi @jdy18, Multi-node testing is not currently supported. I will check whether it would be easy to add this feature.

I have resolved the issue. The function get_rank_and_world_size() should return the global rank instead of local_rank, and the call torch.cuda.set_device(rank) should be changed to torch.cuda.set_device(local_rank). A sketch of the change is below.
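For reference, here is a minimal sketch of the corrected logic, assuming torchrun's standard RANK, WORLD_SIZE, and LOCAL_RANK environment variables; the actual helper in the repository may differ:

import os
import torch

def get_rank_and_world_size():
    # Return the global rank across all nodes, not the per-node local rank.
    rank = int(os.environ.get('RANK', 0))
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    return rank, world_size

rank, world_size = get_rank_and_world_size()
local_rank = int(os.environ.get('LOCAL_RANK', 0))
# Each node only sees its own GPUs (0 .. nproc_per_node-1), so the device
# must be selected with the local rank, while work partitioning and result
# file naming use the global rank.
torch.cuda.set_device(local_rank)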

@kennymckormick (Member)

Great, would you be willing to create a PR to support this feature? I think it is compatible with the original single-node, multi-GPU parallel inference.

@kennymckormick (Member)

The modification has been merged into the main branch.
