This repository has been archived by the owner on Jun 3, 2024. It is now read-only.
Describe the bug
I ran the command python -m openfed.tools.simulator --nproc 6 examples/run.py, as given in the repository, just to check that the code runs. The first round hung for about 30 minutes and then failed with the following error.
(openfed) ozaland@prec3660c:~/OpenFed$ python -m openfed.tools.simulator --nproc 6 examples/run.py
0%| | 0/10 [00:00<?, ?it/s]/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
"torch.distributed.distributed_c10d._get_global_rank is deprecated "
10%|████ | 1/10 [30:02<4:30:18, 1802.07s/it]
Traceback (most recent call last):
File "examples/run.py", line 99, in <module>
simulate()
File "examples/run.py", line 52, in simulate
api.run()
File "/home/ozaland/OpenFed/openfed/api.py", line 71, in run
maintainer.step()
File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 306, in step
return self._aggregator_step(*args, **kwargs)
File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 378, in _aggregator_step
flag = self.upload()
File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 298, in upload
self.transfer(to=True)
File "/home/ozaland/OpenFed/openfed/core/functional.py", line 33, in _fed_context
return safe_call(self, *args, **kwargs)
File "/home/ozaland/OpenFed/openfed/core/functional.py", line 24, in safe_call
return func(*args, **kwargs)
File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 253, in transfer
self.pipe.upload(self.packaged_data)
File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 164, in upload
self.transfer(True, data)
File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 233, in transfer
self.push(data)
File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 249, in push
distributed_c10d.gather_object(data, None, dst=rank, group=self.pg)
File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1981, in gather_object
all_gather(object_size_list, local_size, group=group)
File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2282, in all_gather
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 1800000ms for send operation to complete
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7fc6515eff80>
warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f2c126eff80>
warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f60a89eff80>
warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7ff26edaff80>
warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f290cfaff80>
warnings.warn(f'Failed to call {func}')
Killing subprocess 1740079
Killing subprocess 1740080
Killing subprocess 1740081
Killing subprocess 1740082
Killing subprocess 1740083
Killing subprocess 1740084
Traceback (most recent call last):
File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 221, in <module>
main()
File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 204, in main
sigkill_handler(signal.SIGTERM, None)
File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 138, in sigkill_handler
returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ozaland/anaconda3/envs/openfed/bin/python', '-u', 'examples/run.py', '--props=/tmp/collaborator-5.json']' returned non-zero exit status 1.
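A note on reading the RuntimeError above: 1800000 ms is 30 minutes, which lines up with the 30:02 elapsed on the progress bar and with the 30-minute default timeout that torch.distributed applies to process groups. So the first round most likely sat blocked inside gather_object until that default expired. A minimal sketch of the arithmetic follows; the helper name timeout_from_gloo_error is hypothetical and not part of OpenFed or PyTorch.

```python
import re
from datetime import timedelta

# Error line copied from the traceback in this report.
ERROR_LINE = (
    "RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] "
    "Timed out waiting 1800000ms for send operation to complete"
)


def timeout_from_gloo_error(line):
    """Hypothetical helper: extract the 'Timed out waiting <N>ms' value as a timedelta."""
    match = re.search(r"Timed out waiting (\d+)ms", line)
    if match is None:
        return None
    return timedelta(milliseconds=int(match.group(1)))


timeout = timeout_from_gloo_error(ERROR_LINE)
print(timeout)  # 0:30:00
```

If that reading is right, a shorter timeout on the process group (torch.distributed.init_process_group accepts a timeout argument) would make the hang surface sooner. Whether OpenFed exposes such an option is not something this report confirms.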
Environment (please complete the following information):
OS Platform and Distribution (e.g., Linux Ubuntu 22.04):
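The environment fields above were left blank in this report. As a rough sketch, they could be collected with a few lines of Python; this uses only the standard library plus an optional import of torch, and the function name collect_environment is made up for illustration. PyTorch also ships python -m torch.utils.collect_env, which prints a fuller report.

```python
import platform
import sys


def collect_environment():
    """Gather basic facts for the Environment section of a bug report."""
    info = {
        "os": platform.platform(),           # OS name, release, and architecture
        "python": sys.version.split()[0],    # interpreter version, e.g. "3.7.x"
    }
    try:
        import torch  # optional: only reported if it can be imported
        info["torch"] = torch.__version__
    except ImportError:
        info["torch"] = "not installed"
    return info


for key, value in collect_environment().items():
    print(f"{key}: {value}")
```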