-
Notifications
You must be signed in to change notification settings - Fork 21.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix no SIGCHLD checking in DataLoaderIter._shutdown_workers #19421
Conversation
@pytorchbot retest this please |
We probably wouldn't want to add another dependence on the network during the build. I'm thinking that we should allow it to fail, but make use of the log aggregator to make visible the seriousness of each failure, so that developers can know this flaky test is not a "merge blocker". Perhaps this would involve a Chrome extension or special github integration in the long term... |
@kostmo That makes sense. It'd be also great to not have CI stop after failing a flaky test. Thanks for working on this! |
@ssnl Let me know when you are convinced enough to want to merge this. |
@pytorchbot retest this please |
@pytorchbot retest this please |
3 similar comments
@pytorchbot retest this please |
@pytorchbot retest this please |
@pytorchbot retest this please |
@pytorchbot retest this please |
@pytorchbot retest this please |
@pytorchbot retest this please |
1 similar comment
@pytorchbot retest this please |
@pytorchbot retest this please |
@pytorchbot retest this please |
8e67c9c
to
60d8381
Compare
@pytorchbot retest this please |
@ezyang The two yellow CircleCI builds actually passed (if you clicked into them), and lint isn't broken by this patch. I think this is good to go. I fixed another thing in the DataLoader._shutdown_workers, and updated the description. |
@@ -596,41 +596,50 @@ def _shutdown_workers(self): | |||
# See (1) and the second half of the note. | |||
if not self.shutdown: | |||
self.shutdown = True | |||
# Removes pids from the C side data structure first so worker | |||
# termination afterwards won't trigger false positive error report. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don' think that this comment applies to the current code anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ssnl looks like the flaky test_proper_exit test is still occasionally having issues:
|
) Summary: test was disabled for being flaky, re-enabled in #19421 but still occasionally failing: https://circleci.com/gh/pytorch/pytorch/1520165?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link ``` Apr 29 19:51:58 ====================================================================== Apr 29 19:51:58 FAIL: test_proper_exit (__main__.TestDataLoader) Apr 29 19:51:58 There might be ConnectionResetError or leaked semaphore warning (due to dirty process exit), but they are all safe to ignore Apr 29 19:51:58 ---------------------------------------------------------------------- Apr 29 19:51:58 Traceback (most recent call last): Apr 29 19:51:58 File "/var/lib/jenkins/workspace/test/common_utils.py", line 129, in wrapper Apr 29 19:51:58 fn(*args, **kwargs) Apr 29 19:51:58 File "test_dataloader.py", line 847, in test_proper_exit Apr 29 19:51:58 self.fail(fail_msg + ', and had exception {}'.format(loader_p.exception)) Apr 29 19:51:58 AssertionError: test_proper_exit with use_workers=True, pin_memory=False, hold_iter_reference=False, exit_method=worker_kill: loader process did not terminate, and had exception Traceback (most recent call last): Apr 29 19:51:58 File "test_dataloader.py", line 227, in run Apr 29 19:51:58 super(ErrorTrackingProcess, self).run() Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/process.py", line 114, in run Apr 29 19:51:58 self._target(*self._args, **self._kwargs) Apr 29 19:51:58 File "test_dataloader.py", line 424, in _test_proper_exit Apr 29 19:51:58 for i, _ in enumerate(it): Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 545, in __next__ Apr 29 19:51:58 idx, batch = self._get_batch() Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 522, in _get_batch Apr 29 19:51:58 success, data = self._try_get_batch() Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch Apr 29 19:51:58 data = self.data_queue.get(timeout=timeout) Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/queues.py", line 135, in get Apr 29 19:51:58 res = self._recv() Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv Apr 29 19:51:58 return pickle.loads(buf) Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1382, in loads Apr 29 19:51:58 return Unpickler(file).load() Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 858, in load Apr 29 19:51:58 dispatch[key](self) Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1133, in load_reduce Apr 29 19:51:58 value = func(*args) Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 274, in rebuild_storage_fd Apr 29 19:51:58 fd = multiprocessing.reduction.rebuild_handle(df) Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle Apr 29 19:51:58 conn = Client(address, authkey=current_process().authkey) Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/connection.py", line 169, in Client Apr 29 19:51:58 c = SocketClient(address) Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/connection.py", line 304, in SocketClient Apr 29 19:51:58 s.connect(address) Apr 29 19:51:58 File "/opt/python/2.7.9/lib/python2.7/socket.py", line 224, in meth Apr 29 19:51:58 return getattr(self._sock,name)(*args) Apr 29 19:51:58 error: [Errno 111] Connection refused ``` Pull Request resolved: #20063 Differential Revision: D15218223 Pulled By: nairbv fbshipit-source-id: 32018c4220f7cb9372ef138631fc3a79759265e1
…19421) Summary: Also 1. Bump multiprocessing test timeout following python core tests 2. Fix one type of flakiness in `test_proper_exit`. 3. Add trace reporting when loader process hangs in `test_proper_exit` using `faulthandler`. 3. Give `test_proper_exit` another try. I'll heavily retest this. Pull Request resolved: pytorch#19421 Differential Revision: D15063728 Pulled By: ezyang fbshipit-source-id: 4e0d992622e11053c44a9ec237b88b9a28a4472c
Also
test_proper_exit
.test_proper_exit
usingfaulthandler
.test_proper_exit
another try.I'll heavily retest this.