
Error Encountered in Multi-Node Evaluation Using Distributed Arguments #142

Closed

jdy18 opened this issue Apr 5, 2024 · 4 comments

@jdy18 commented Apr 5, 2024

I encountered an issue while attempting to perform a multi-node evaluation using PyTorch's torchrun with specific distributed arguments. Below is the command I used, including the distributed arguments setup and the execution command:

DISTRIBUTED_ARGS=" \
    --nproc_per_node 3 \
    --nnodes 4 \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT"

torchrun $DISTRIBUTED_ARGS run.py \
    --data MME MMBench_DEV_EN MMBench_DEV_CN CCBench SEEDBench_IMG MMMU_DEV_VAL MathVista_MINI HallusionBench LLaVABench \
    MMBench_TEST_EN MMBench_TEST_CN \
    --model llava

Upon execution, I received the following error message:

RUN - ERROR - No such file or directory: './llava/312_MME.pkl'
It seems that only the .pkl files on node 0 were saved correctly: only the files prefixed 012, 112, and 212 exist.
Thank you in advance for your assistance!

@kennymckormick (Member)

Hi @jdy18,
Multi-node testing is not currently supported. I will check whether it would be easy to add this feature.

@jdy18 (Author) commented Apr 7, 2024

Hi @jdy18, Multi-node testing is not currently supported. I will check whether it would be easy to add this feature.

I have resolved the issue. The function get_rank_and_world_size() should return the global rank instead of local_rank, and the call torch.cuda.set_device(rank) should be changed to torch.cuda.set_device(local_rank). A sketch of the change is below.
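For reference, here is a minimal sketch of the corrected logic, assuming torchrun's standard RANK, WORLD_SIZE, and LOCAL_RANK environment variables; the actual helper in the repository may differ:

import os
import torch

def get_rank_and_world_size():
    # Return the global rank across all nodes, not the per-node local rank.
    rank = int(os.environ.get('RANK', 0))
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    return rank, world_size

rank, world_size = get_rank_and_world_size()
local_rank = int(os.environ.get('LOCAL_RANK', 0))
# Each node only sees its own GPUs (0 .. nproc_per_node-1), so the device
# must be selected with the local rank, while work partitioning and result
# file naming use the global rank.
torch.cuda.set_device(local_rank)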

@kennymckormick (Member)

Great, would you be willing to create a PR to support this feature? I think it is compatible with the original single-node, multi-GPU parallel inference.

@kennymckormick (Member)

The modification has been merged into the main branch.
