
"socket.error: [Errno 111] Connection refused" while training with multiple workers #11872

Closed
miteshyh opened this issue Jul 24, 2018 · 12 comments

Comments

@miteshyh

Hi,
I am getting the following error after a few data iterations (at 551/22210):
File "train.py", line 201, in
trainer.training(epoch)
File "train.py", line 142, in training
for i, (data, target) in enumerate(tbar):
File "/usr/local/lib/python2.7/dist-packages/tqdm/_tqdm.py", line 930, in iter
for obj in iterable:
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in next
return self.next()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in next
idx, batch = self._data_queue.get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
res = self._recv()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
return pickle.loads(buf)
File "/usr/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatchkey
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
fd = multiprocessing.reduction.rebuild_handle(fd)
File "/usr/lib/python2.7/multiprocessing/reduction.py", line 156, in rebuild_handle
conn = Client(address, authkey=current_process().authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused

I am using the latest nightly build of MXNet along with the newly added SyncBatchNorm layer; the error occurs both with and without SyncBatchNorm.

I am running the MXNet Docker image.

Any help is much appreciated.

dmlc/gluon-cv#215

@zhreshold would you be able to comment on this?

@zhreshold
Member

This is related to a recent change: we switched from raw shared memory to file descriptors for inter-process communication on Linux. We are still investigating a proper solution.
As a fallback, we can add an option to choose either mechanism.

@zhreshold
Member

Temporary solutions:

  1. Increase the shared memory if it is too small. You can use df -h /dev/shm to check the shared memory size and usage; to raise the limit, edit /etc/sysctl.conf and add (or edit) a line such as kernel.shmmax = 4294967296 to allow up to 4 GB of shared memory.
  2. Reduce num_workers. If you set num_workers = 0, no multiprocessing workers are used, but loading is slower (see the sketch after this list).
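
For anyone looking for a concrete example of option 2: below is a minimal sketch using a toy ArrayDataset (the dataset, batch size, and shapes are illustrative, not from this issue). With num_workers=0 the Gluon DataLoader reads batches in the main process, so no worker sockets or /dev/shm segments are involved; raising num_workers brings the shared-memory path back.

```python
import mxnet as mx
from mxnet import gluon

# Toy stand-in for the real dataset in this issue; shapes/sizes are made up.
data = mx.nd.random.uniform(shape=(1024, 3, 32, 32))
label = mx.nd.arange(1024)
dataset = gluon.data.ArrayDataset(data, label)

# num_workers=0 -> single-process loading: slower, but no /dev/shm or sockets.
# num_workers>0 -> worker processes hand batches back through shared memory,
#                  which is what runs out when /dev/shm is too small.
loader = gluon.data.DataLoader(dataset, batch_size=32, shuffle=True,
                               num_workers=0)

for batch_data, batch_label in loader:
    print(batch_data.shape, batch_label.shape)
    break
```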

@zhreshold
Member

I have figured out that the data loader's pre-fetch strategy is too aggressive, which can exhaust shared memory and cause this issue.
The fix is included in #11908
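
To illustrate the idea only (this is a sketch of bounded prefetching, not the actual code in #11908): instead of enqueueing the whole epoch up front, keep a fixed number of batch requests in flight and issue one new request per batch consumed, so the number of shared-memory buffers alive at any time stays proportional to num_workers. The load_batch function and the prefetch factor below are hypothetical placeholders.

```python
# Sketch of bounded prefetching with plain multiprocessing; not MXNet internals.
from multiprocessing import Pool

def load_batch(batch_idx):
    # stand-in for reading and decoding one batch of samples
    return [i * i for i in range(batch_idx * 4, (batch_idx + 1) * 4)]

def bounded_batches(num_batches, num_workers=4, prefetch=2):
    pool = Pool(num_workers)
    pending, next_idx = [], 0
    # prime the pipeline with at most prefetch * num_workers outstanding requests
    while next_idx < num_batches and len(pending) < prefetch * num_workers:
        pending.append(pool.apply_async(load_batch, (next_idx,)))
        next_idx += 1
    while pending:
        batch = pending.pop(0).get()      # consume the oldest outstanding batch
        if next_idx < num_batches:        # refill exactly one slot per batch consumed
            pending.append(pool.apply_async(load_batch, (next_idx,)))
            next_idx += 1
        yield batch
    pool.close()
    pool.join()

if __name__ == '__main__':
    for batch in bounded_batches(10, num_workers=2):
        print(batch)
```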

@miteshyh
Author

Thanks @zhreshold, I will follow this PR.

Yes, even with few (0 or 1) workers, resource usage was quite high and required more shared memory than usual.

@zhreshold
Member

With #11908 merged, I am closing this for now. Feel free to ping me if the issue still exists.

@ifeherva
Contributor

I am using the latest master and the issue still persists in Docker. Even num_workers = 1 causes a hang in the DataLoader's while True loop.

@zhreshold
Member

@ifeherva Use docker run --shm-size xxx; if it is not specified, Docker gives the container only a tiny default amount of shared memory (64 MB).

@ifeherva
Contributor

@zhreshold Good point. How much shared memory is recommended for MXNet?

@zhreshold
Member

It should scale with the input batch_size, the data shape, and the number of workers; usually several GB is recommended for multi-GPU training (a rough estimate is sketched below).
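
As a rough, hypothetical back-of-the-envelope estimate (the batch size, shape, worker count, and prefetch factor below are made-up examples, not recommendations): the shared-memory pool should comfortably hold all batches that can be in flight at once, and you can compare that against what df -h /dev/shm reports.

```python
import os

# Illustrative numbers only; plug in your own batch size, shape and workers.
batch_size = 64
data_shape = (3, 224, 224)        # C x H x W float32 images
num_workers = 8
prefetch_factor = 2               # assume a couple of batches in flight per worker

bytes_per_sample = 4              # float32
for dim in data_shape:
    bytes_per_sample *= dim

needed = batch_size * bytes_per_sample * num_workers * prefetch_factor
print('rough shared memory needed: %.2f GB' % (needed / 1024.0 ** 3))

# What is actually available as /dev/shm inside the container
# (same information as `df -h /dev/shm`).
stats = os.statvfs('/dev/shm')
print('available /dev/shm: %.2f GB' % (stats.f_frsize * stats.f_bavail / 1024.0 ** 3))
```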

@ifeherva
Contributor

@zhreshold Adding shared memory to the Docker container solved the problem. Thanks!

@djaym7

djaym7 commented Jan 4, 2020


With num_workers = 30 I am consistently getting connection timeouts; I will try reducing the number of workers. Setup: mxnet-cu101mkl 1.6.0b20191207 on a p3.16xlarge SageMaker notebook instance.

@austinmw

austinmw commented Mar 24, 2020

@djaym7 This worked for me: aws/sagemaker-python-sdk#937 (comment)
