
"socket.error: [Errno 111] Connection refused" while training with multiple workers #11872

Closed
miteshyh opened this issue Jul 24, 2018 · 12 comments

Comments

@miteshyh

Hi,
I am getting the following error after a few data iterations (at 551/22210):
File "train.py", line 201, in
trainer.training(epoch)
File "train.py", line 142, in training
for i, (data, target) in enumerate(tbar):
File "/usr/local/lib/python2.7/dist-packages/tqdm/_tqdm.py", line 930, in iter
for obj in iterable:
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in next
return self.next()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in next
idx, batch = self._data_queue.get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
res = self._recv()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
return pickle.loads(buf)
File "/usr/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatchkey
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
fd = multiprocessing.reduction.rebuild_handle(fd)
File "/usr/lib/python2.7/multiprocessing/reduction.py", line 156, in rebuild_handle
conn = Client(address, authkey=current_process().authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused

I am using the latest nightly build of MXNet along with the newly added SyncBatchNorm layer; the error occurs both with and without SyncBatchNorm.

I am running the MXNet Docker image.

Any help is much appreciated.

dmlc/gluon-cv#215

@zhreshold would you be able to comment on this?

@zhreshold
Member

This is related to a recent change: we switched from raw shared memory to file descriptors for inter-process communication on Linux. We are still investigating a proper solution.
As a fallback, we can add an option to choose either mechanism.

@zhreshold
Member

Temporary solutions:

  1. Increase the shared memory if it is too small. You can use df -h /dev/shm to check the shared memory size and usage; to raise the limit, edit /etc/sysctl.conf and add (or edit) a line such as kernel.shmmax = 4294967296 to allow up to 4 GB of shared memory.
  2. Reduce num_workers. If you set num_workers = 0, no multiprocessing workers are used, but loading is slower (see the sketch after this list).
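
For anyone looking for a concrete example of option 2: below is a minimal sketch using a toy ArrayDataset (the dataset, batch size, and shapes are illustrative, not from this issue). With num_workers=0 the Gluon DataLoader reads batches in the main process, so no worker sockets or /dev/shm segments are involved; raising num_workers brings the shared-memory path back.

```python
import mxnet as mx
from mxnet import gluon

# Toy stand-in for the real dataset in this issue; shapes/sizes are made up.
data = mx.nd.random.uniform(shape=(1024, 3, 32, 32))
label = mx.nd.arange(1024)
dataset = gluon.data.ArrayDataset(data, label)

# num_workers=0 -> single-process loading: slower, but no /dev/shm or sockets.
# num_workers>0 -> worker processes hand batches back through shared memory,
#                  which is what runs out when /dev/shm is too small.
loader = gluon.data.DataLoader(dataset, batch_size=32, shuffle=True,
                               num_workers=0)

for batch_data, batch_label in loader:
    print(batch_data.shape, batch_label.shape)
    break
```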

@zhreshold
Member

I have figured out that the data loader's pre-fetch strategy is too aggressive, which can exhaust shared memory and cause this issue.
The fix is included in #11908
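
To illustrate the idea only (this is a sketch of bounded prefetching, not the actual code in #11908): instead of enqueueing the whole epoch up front, keep a fixed number of batch requests in flight and issue one new request per batch consumed, so the number of shared-memory buffers alive at any time stays proportional to num_workers. The load_batch function and the prefetch factor below are hypothetical placeholders.

```python
# Sketch of bounded prefetching with plain multiprocessing; not MXNet internals.
from multiprocessing import Pool

def load_batch(batch_idx):
    # stand-in for reading and decoding one batch of samples
    return [i * i for i in range(batch_idx * 4, (batch_idx + 1) * 4)]

def bounded_batches(num_batches, num_workers=4, prefetch=2):
    pool = Pool(num_workers)
    pending, next_idx = [], 0
    # prime the pipeline with at most prefetch * num_workers outstanding requests
    while next_idx < num_batches and len(pending) < prefetch * num_workers:
        pending.append(pool.apply_async(load_batch, (next_idx,)))
        next_idx += 1
    while pending:
        batch = pending.pop(0).get()      # consume the oldest outstanding batch
        if next_idx < num_batches:        # refill exactly one slot per batch consumed
            pending.append(pool.apply_async(load_batch, (next_idx,)))
            next_idx += 1
        yield batch
    pool.close()
    pool.join()

if __name__ == '__main__':
    for batch in bounded_batches(10, num_workers=2):
        print(batch)
```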

@miteshyh
Author

Thanks @zhreshold, I will follow this PR.

Yes, even with few (0 or 1) workers, resource usage was quite high and required more shared memory than usual.

@zhreshold
Member

With #11908 merged, I am closing this for now. Feel free to ping me if the issue still exists.

@ifeherva
Contributor

I am using the latest master and the issue still persists in Docker. Even num_workers = 1 causes a hang in the DataLoader's while True loop.

@zhreshold
Member

@ifeherva Use docker run --shm-size xxx; if it is not specified, Docker gives the container only a tiny default amount of shared memory (64 MB).

@ifeherva
Contributor

@zhreshold Good point. How much shared memory is recommended for MXNet?

@zhreshold
Member

It should scale with the input batch_size, the data shape, and the number of workers; usually several GB is recommended for multi-GPU training (a rough estimate is sketched below).
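
As a rough, hypothetical back-of-the-envelope estimate (the batch size, shape, worker count, and prefetch factor below are made-up examples, not recommendations): the shared-memory pool should comfortably hold all batches that can be in flight at once, and you can compare that against what df -h /dev/shm reports.

```python
import os

# Illustrative numbers only; plug in your own batch size, shape and workers.
batch_size = 64
data_shape = (3, 224, 224)        # C x H x W float32 images
num_workers = 8
prefetch_factor = 2               # assume a couple of batches in flight per worker

bytes_per_sample = 4              # float32
for dim in data_shape:
    bytes_per_sample *= dim

needed = batch_size * bytes_per_sample * num_workers * prefetch_factor
print('rough shared memory needed: %.2f GB' % (needed / 1024.0 ** 3))

# What is actually available as /dev/shm inside the container
# (same information as `df -h /dev/shm`).
stats = os.statvfs('/dev/shm')
print('available /dev/shm: %.2f GB' % (stats.f_frsize * stats.f_bavail / 1024.0 ** 3))
```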

@ifeherva
Contributor

@zhreshold Adding shared memory to the Docker container solved the problem. Thanks!

@djaym7

djaym7 commented Jan 4, 2020


With num_workers = 30 I am consistently getting connection timeouts; I will try reducing the number of workers. Setup: mxnet-cu101mkl 1.6.0b20191207 on a p3.16xlarge SageMaker notebook instance.

@austinmw

austinmw commented Mar 24, 2020

@djaym7 This worked for me: aws/sagemaker-python-sdk#937 (comment)
