
Fix no SIGCHLD checking in DataLoaderIter._shutdown_workers #19421

Closed
wants to merge 9 commits

Conversation

ssnl (Collaborator) commented Apr 18, 2019

Also

  1. Bump the multiprocessing test timeout following the Python core tests.
  2. Fix one type of flakiness in `test_proper_exit`.
  3. Add trace reporting via `faulthandler` when the loader process hangs in `test_proper_exit` (a rough sketch of the idea is shown below).
  4. Give `test_proper_exit` another try.

I'll heavily retest this.
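
For readers who want a concrete picture of item 3, here is a minimal standard-library sketch of the faulthandler idea: enable `faulthandler` in the child so a signal or a timer makes it print every thread's stack. The process layout, `run_loader`, and the 10/60-second deadlines are assumptions for illustration, not the test-harness code this PR adds.

```python
# Hedged sketch: dumping a hung child process's stack with faulthandler.
# run_loader() and the timeouts are made up for illustration; this is not
# the code added by this PR.
import faulthandler
import multiprocessing
import os
import signal
import sys
import time


def run_loader():
    # In the child: faulthandler.enable() installs handlers for fatal
    # signals (including SIGABRT) that print every thread's stack.
    faulthandler.enable(file=sys.stderr, all_threads=True)
    # Also dump the stacks and exit if we are still alive after 60 seconds.
    faulthandler.dump_traceback_later(60, exit=True)
    time.sleep(3600)  # stand-in for a DataLoader loop that may hang


if __name__ == "__main__":
    p = multiprocessing.Process(target=run_loader)
    p.start()
    p.join(timeout=10)  # the test's own deadline
    if p.is_alive():
        # Child appears hung: ask it to dump its stacks, then clean up.
        os.kill(p.pid, signal.SIGABRT)
        p.join(timeout=5)
        if p.is_alive():
            p.terminate()
            p.join()
```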

ssnl (Collaborator, Author) commented Apr 18, 2019

@kostmo, @ezyang mentioned to me that you are working on some log aggregator. I wonder if it will be possible to make a test failure not fail the entire build, but rather send out an email/notification? :)

@nairbv nairbv added the module: tests, module: flaky-tests, and triaged labels Apr 18, 2019
@nairbv nairbv requested a review from ezyang April 18, 2019 21:04
ssnl (Collaborator, Author) commented Apr 19, 2019

@pytorchbot retest this please

kostmo (Member) commented Apr 19, 2019

> @kostmo, @ezyang mentioned to me that you are working on some log aggregator. I wonder if it will be possible to make a test failure not fail the entire build, but rather send out an email/notification? :)

We probably wouldn't want to add another dependence on the network during the build. I'm thinking that we should allow it to fail, but make use of the log aggregator to make visible the seriousness of each failure, so that developers can know this flaky test is not a "merge blocker". Perhaps this would involve a Chrome extension or special github integration in the long term...

@ssnl ssnl mentioned this pull request Apr 19, 2019
ssnl (Collaborator, Author) commented Apr 19, 2019

@kostmo That makes sense. It'd be also great to not have CI stop after failing a flaky test. Thanks for working on this!

ezyang (Contributor) commented Apr 19, 2019

@ssnl Let me know when you are convinced enough to want to merge this.

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

3 similar comments
ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

1 similar comment
ssnl (Collaborator, Author) commented Apr 21, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 22, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 22, 2019

@pytorchbot retest this please

@ssnl ssnl force-pushed the to60 branch 3 times, most recently from 8e67c9c to 60d8381 on April 23, 2019 04:14
ssnl (Collaborator, Author) commented Apr 23, 2019

@pytorchbot retest this please

@ssnl ssnl changed the title from "Bump multiprocessing test timeout following python core tests" to "Fix no SIGCHLD checking in DataLoaderIter._shutdown_workers" on Apr 23, 2019
@ssnl ssnl mentioned this pull request Apr 24, 2019
ssnl (Collaborator, Author) commented Apr 24, 2019

@ezyang The two yellow CircleCI builds actually passed (if you click into them), and lint isn't broken by this patch. I think this is good to go. I fixed another thing in DataLoaderIter._shutdown_workers and updated the description.

@@ -596,41 +596,50 @@ def _shutdown_workers(self):
# See (1) and the second half of the note.
if not self.shutdown:
self.shutdown = True
# Removes pids from the C side data structure first so worker
# termination afterwards won't trigger false positive error report.
ssnl (Collaborator, Author) commented on the diff:

I don't think this comment applies to the current code anymore.
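
For readers outside the diff context, here is a hedged, plain-Python sketch of the ordering the quoted comment describes: unregister worker pids from the SIGCHLD-based death check before terminating them, so that their expected exits are not reported as errors. All names here (`_watched_pids`, `shutdown_workers`) are hypothetical; this is not PyTorch's actual C-side `signal_handling` implementation.

```python
# Conceptual sketch only (Linux); not torch's internal code.
import os
import signal

_watched_pids = set()   # workers whose unexpected death should be reported
_failed_pids = []       # filled in by the SIGCHLD handler


def _sigchld_handler(signum, frame):
    # Probe each watched pid without reaping it (WNOWAIT), so that
    # multiprocessing can still join() the process normally later.
    for pid in list(_watched_pids):
        try:
            info = os.waitid(os.P_PID, pid,
                             os.WEXITED | os.WNOHANG | os.WNOWAIT)
        except ChildProcessError:
            continue  # already reaped elsewhere
        if info is not None:
            _failed_pids.append(pid)  # died while still being watched


signal.signal(signal.SIGCHLD, _sigchld_handler)


def shutdown_workers(workers):
    # 1. Stop watching the workers first ...
    for w in workers:
        _watched_pids.discard(w.pid)
    # 2. ... then terminate them; the handler above now ignores their exits.
    for w in workers:
        w.terminate()
        w.join()
```

The PR itself touches the corresponding logic inside DataLoaderIter._shutdown_workers; the sketch only illustrates why the unregister-before-terminate ordering matters.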

facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang merged this pull request in 5e62ee2.

@ssnl ssnl deleted the to60 branch April 24, 2019 16:20
nairbv (Collaborator) commented May 2, 2019

@ssnl looks like the flaky test_proper_exit test is still occasionally having issues:
https://circleci.com/gh/pytorch/pytorch/1520165?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Apr 29 19:51:58 ======================================================================
Apr 29 19:51:58 FAIL: test_proper_exit (__main__.TestDataLoader)
Apr 29 19:51:58 There might be ConnectionResetError or leaked semaphore warning (due to dirty process exit), but they are all safe to ignore
Apr 29 19:51:58 ----------------------------------------------------------------------
Apr 29 19:51:58 Traceback (most recent call last):
Apr 29 19:51:58   File "/var/lib/jenkins/workspace/test/common_utils.py", line 129, in wrapper
Apr 29 19:51:58     fn(*args, **kwargs)
Apr 29 19:51:58   File "test_dataloader.py", line 847, in test_proper_exit
Apr 29 19:51:58     self.fail(fail_msg + ', and had exception {}'.format(loader_p.exception))
Apr 29 19:51:58 AssertionError: test_proper_exit with use_workers=True, pin_memory=False, hold_iter_reference=False, exit_method=worker_kill: loader process did not terminate, and had exception Traceback (most recent call last):
Apr 29 19:51:58   File "test_dataloader.py", line 227, in run
Apr 29 19:51:58     super(ErrorTrackingProcess, self).run()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/process.py", line 114, in run
Apr 29 19:51:58     self._target(*self._args, **self._kwargs)
Apr 29 19:51:58   File "test_dataloader.py", line 424, in _test_proper_exit
Apr 29 19:51:58     for i, _ in enumerate(it):
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 545, in __next__
Apr 29 19:51:58     idx, batch = self._get_batch()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 522, in _get_batch
Apr 29 19:51:58     success, data = self._try_get_batch()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
Apr 29 19:51:58     data = self.data_queue.get(timeout=timeout)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/queues.py", line 135, in get
Apr 29 19:51:58     res = self._recv()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
Apr 29 19:51:58     return pickle.loads(buf)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1382, in loads
Apr 29 19:51:58     return Unpickler(file).load()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 858, in load
Apr 29 19:51:58     dispatch[key](self)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1133, in load_reduce
Apr 29 19:51:58     value = func(*args)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 274, in rebuild_storage_fd
Apr 29 19:51:58     fd = multiprocessing.reduction.rebuild_handle(df)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
Apr 29 19:51:58     conn = Client(address, authkey=current_process().authkey)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/connection.py", line 169, in Client
Apr 29 19:51:58     c = SocketClient(address)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/connection.py", line 304, in SocketClient
Apr 29 19:51:58     s.connect(address)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/socket.py", line 224, in meth
Apr 29 19:51:58     return getattr(self._sock,name)(*args)
Apr 29 19:51:58 error: [Errno 111] Connection refused
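
A plausible reading of this traceback: the batch sitting in the queue references shared storage whose file descriptor must be fetched back from the worker that produced it, but that worker was already killed (this is the exit_method=worker_kill case), so the rebuild connection is refused. Below is a hedged sketch of how such a socket error can be translated into a clearer message when reading from the data queue; the helper name and structure are made up for illustration and this is not torch.utils.data's actual `_try_get_batch`.

```python
# Hypothetical helper, not torch's _try_get_batch; names are made up.
import queue


def try_get_batch(data_queue, workers, timeout=5.0):
    try:
        return True, data_queue.get(timeout=timeout)
    except queue.Empty:
        return False, None
    except (ConnectionRefusedError, ConnectionResetError, EOFError) as e:
        # Deserializing the batch needed the sending worker to be alive;
        # if any worker is dead, report that instead of the raw socket error.
        dead = [w.pid for w in workers if not w.is_alive()]
        if dead:
            raise RuntimeError(
                "DataLoader worker(s) {} exited unexpectedly".format(dead)
            ) from e
        raise
```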

facebook-github-bot pushed a commit that referenced this pull request May 6, 2019

Summary:
test was disabled for being flaky, re-enabled in #19421 but still occasionally failing:

https://circleci.com/gh/pytorch/pytorch/1520165?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

(traceback identical to the one quoted in the comment above)
Pull Request resolved: #20063

Differential Revision: D15218223

Pulled By: nairbv

fbshipit-source-id: 32018c4220f7cb9372ef138631fc3a79759265e1
zhangguanheng66 pushed a commit to zhangguanheng66/pytorch that referenced this pull request May 6, 2019
…19421)

Summary:
Also

1. Bump multiprocessing test timeout following python core tests
2. Fix one type of flakiness in `test_proper_exit`.
3. Add trace reporting when loader process hangs in `test_proper_exit` using `faulthandler`.
4. Give `test_proper_exit` another try.

I'll heavily retest this.
Pull Request resolved: pytorch#19421

Differential Revision: D15063728

Pulled By: ezyang

fbshipit-source-id: 4e0d992622e11053c44a9ec237b88b9a28a4472c
Labels: module: flaky-tests, module: tests, open source, triaged

5 participants