
Fix no SIGCHLD checking in DataLoaderIter._shutdown_workers #19421

Closed
wants to merge 9 commits

Conversation

ssnl (Collaborator) commented Apr 18, 2019

Also

  1. Bump the multiprocessing test timeout following the Python core tests.
  2. Fix one type of flakiness in `test_proper_exit`.
  3. Add trace reporting via `faulthandler` when the loader process hangs in `test_proper_exit` (a rough sketch of the idea is shown below).
  4. Give `test_proper_exit` another try.

I'll heavily retest this.
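
For readers who want a concrete picture of item 3, here is a minimal standard-library sketch of the faulthandler idea: enable `faulthandler` in the child so a signal or a timer makes it print every thread's stack. The process layout, `run_loader`, and the 10/60-second deadlines are assumptions for illustration, not the test-harness code this PR adds.

```python
# Hedged sketch: dumping a hung child process's stack with faulthandler.
# run_loader() and the timeouts are made up for illustration; this is not
# the code added by this PR.
import faulthandler
import multiprocessing
import os
import signal
import sys
import time


def run_loader():
    # In the child: faulthandler.enable() installs handlers for fatal
    # signals (including SIGABRT) that print every thread's stack.
    faulthandler.enable(file=sys.stderr, all_threads=True)
    # Also dump the stacks and exit if we are still alive after 60 seconds.
    faulthandler.dump_traceback_later(60, exit=True)
    time.sleep(3600)  # stand-in for a DataLoader loop that may hang


if __name__ == "__main__":
    p = multiprocessing.Process(target=run_loader)
    p.start()
    p.join(timeout=10)  # the test's own deadline
    if p.is_alive():
        # Child appears hung: ask it to dump its stacks, then clean up.
        os.kill(p.pid, signal.SIGABRT)
        p.join(timeout=5)
        if p.is_alive():
            p.terminate()
            p.join()
```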

ssnl (Collaborator, Author) commented Apr 18, 2019

@kostmo, @ezyang mentioned to me that you are working on some log aggregator. I wonder if it will be possible to make a test failure not fail the entire build, but rather send out an email/notification? :)

@nairbv nairbv added the module: tests, module: flaky-tests, and triaged labels Apr 18, 2019
@nairbv nairbv requested a review from ezyang April 18, 2019 21:04
ssnl (Collaborator, Author) commented Apr 19, 2019

@pytorchbot retest this please

kostmo (Member) commented Apr 19, 2019

> @kostmo, @ezyang mentioned to me that you are working on some log aggregator. I wonder if it will be possible to make a test failure not fail the entire build, but rather send out an email/notification? :)

We probably wouldn't want to add another dependence on the network during the build. I'm thinking that we should allow it to fail, but make use of the log aggregator to make visible the seriousness of each failure, so that developers can know this flaky test is not a "merge blocker". Perhaps this would involve a Chrome extension or special github integration in the long term...

@ssnl ssnl mentioned this pull request Apr 19, 2019
ssnl (Collaborator, Author) commented Apr 19, 2019

@kostmo That makes sense. It'd be also great to not have CI stop after failing a flaky test. Thanks for working on this!

ezyang (Contributor) commented Apr 19, 2019

@ssnl Let me know when you are convinced enough to want to merge this.

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

3 similar comments
ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 20, 2019

@pytorchbot retest this please

1 similar comment
ssnl (Collaborator, Author) commented Apr 21, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 22, 2019

@pytorchbot retest this please

ssnl (Collaborator, Author) commented Apr 22, 2019

@pytorchbot retest this please

@ssnl ssnl force-pushed the to60 branch 3 times, most recently from 8e67c9c to 60d8381 on April 23, 2019 04:14
ssnl (Collaborator, Author) commented Apr 23, 2019

@pytorchbot retest this please

@ssnl ssnl changed the title from "Bump multiprocessing test timeout following python core tests" to "Fix no SIGCHLD checking in DataLoaderIter._shutdown_workers" on Apr 23, 2019
@ssnl ssnl mentioned this pull request Apr 24, 2019
ssnl (Collaborator, Author) commented Apr 24, 2019

@ezyang The two yellow CircleCI builds actually passed (if you click into them), and lint isn't broken by this patch. I think this is good to go. I fixed another thing in DataLoaderIter._shutdown_workers and updated the description.

@@ -596,41 +596,50 @@ def _shutdown_workers(self):
# See (1) and the second half of the note.
if not self.shutdown:
self.shutdown = True
# Removes pids from the C side data structure first so worker
# termination afterwards won't trigger false positive error report.
ssnl (Collaborator, Author) commented on the diff:

I don't think this comment applies to the current code anymore.
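
For readers outside the diff context, here is a hedged, plain-Python sketch of the ordering the quoted comment describes: unregister worker pids from the SIGCHLD-based death check before terminating them, so that their expected exits are not reported as errors. All names here (`_watched_pids`, `shutdown_workers`) are hypothetical; this is not PyTorch's actual C-side `signal_handling` implementation.

```python
# Conceptual sketch only (Linux); not torch's internal code.
import os
import signal

_watched_pids = set()   # workers whose unexpected death should be reported
_failed_pids = []       # filled in by the SIGCHLD handler


def _sigchld_handler(signum, frame):
    # Probe each watched pid without reaping it (WNOWAIT), so that
    # multiprocessing can still join() the process normally later.
    for pid in list(_watched_pids):
        try:
            info = os.waitid(os.P_PID, pid,
                             os.WEXITED | os.WNOHANG | os.WNOWAIT)
        except ChildProcessError:
            continue  # already reaped elsewhere
        if info is not None:
            _failed_pids.append(pid)  # died while still being watched


signal.signal(signal.SIGCHLD, _sigchld_handler)


def shutdown_workers(workers):
    # 1. Stop watching the workers first ...
    for w in workers:
        _watched_pids.discard(w.pid)
    # 2. ... then terminate them; the handler above now ignores their exits.
    for w in workers:
        w.terminate()
        w.join()
```

The PR itself touches the corresponding logic inside DataLoaderIter._shutdown_workers; the sketch only illustrates why the unregister-before-terminate ordering matters.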

facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang merged this pull request in 5e62ee2.

@ssnl ssnl deleted the to60 branch April 24, 2019 16:20
nairbv (Collaborator) commented May 2, 2019

@ssnl looks like the flaky test_proper_exit test is still occasionally having issues:
https://circleci.com/gh/pytorch/pytorch/1520165?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Apr 29 19:51:58 ======================================================================
Apr 29 19:51:58 FAIL: test_proper_exit (__main__.TestDataLoader)
Apr 29 19:51:58 There might be ConnectionResetError or leaked semaphore warning (due to dirty process exit), but they are all safe to ignore
Apr 29 19:51:58 ----------------------------------------------------------------------
Apr 29 19:51:58 Traceback (most recent call last):
Apr 29 19:51:58   File "/var/lib/jenkins/workspace/test/common_utils.py", line 129, in wrapper
Apr 29 19:51:58     fn(*args, **kwargs)
Apr 29 19:51:58   File "test_dataloader.py", line 847, in test_proper_exit
Apr 29 19:51:58     self.fail(fail_msg + ', and had exception {}'.format(loader_p.exception))
Apr 29 19:51:58 AssertionError: test_proper_exit with use_workers=True, pin_memory=False, hold_iter_reference=False, exit_method=worker_kill: loader process did not terminate, and had exception Traceback (most recent call last):
Apr 29 19:51:58   File "test_dataloader.py", line 227, in run
Apr 29 19:51:58     super(ErrorTrackingProcess, self).run()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/process.py", line 114, in run
Apr 29 19:51:58     self._target(*self._args, **self._kwargs)
Apr 29 19:51:58   File "test_dataloader.py", line 424, in _test_proper_exit
Apr 29 19:51:58     for i, _ in enumerate(it):
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 545, in __next__
Apr 29 19:51:58     idx, batch = self._get_batch()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 522, in _get_batch
Apr 29 19:51:58     success, data = self._try_get_batch()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
Apr 29 19:51:58     data = self.data_queue.get(timeout=timeout)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/queues.py", line 135, in get
Apr 29 19:51:58     res = self._recv()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
Apr 29 19:51:58     return pickle.loads(buf)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1382, in loads
Apr 29 19:51:58     return Unpickler(file).load()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 858, in load
Apr 29 19:51:58     dispatch[key](self)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1133, in load_reduce
Apr 29 19:51:58     value = func(*args)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 274, in rebuild_storage_fd
Apr 29 19:51:58     fd = multiprocessing.reduction.rebuild_handle(df)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
Apr 29 19:51:58     conn = Client(address, authkey=current_process().authkey)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/connection.py", line 169, in Client
Apr 29 19:51:58     c = SocketClient(address)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/connection.py", line 304, in SocketClient
Apr 29 19:51:58     s.connect(address)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/socket.py", line 224, in meth
Apr 29 19:51:58     return getattr(self._sock,name)(*args)
Apr 29 19:51:58 error: [Errno 111] Connection refused
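
A plausible reading of this traceback: the batch sitting in the queue references shared storage whose file descriptor must be fetched back from the worker that produced it, but that worker was already killed (this is the exit_method=worker_kill case), so the rebuild connection is refused. Below is a hedged sketch of how such a socket error can be translated into a clearer message when reading from the data queue; the helper name and structure are made up for illustration and this is not torch.utils.data's actual `_try_get_batch`.

```python
# Hypothetical helper, not torch's _try_get_batch; names are made up.
import queue


def try_get_batch(data_queue, workers, timeout=5.0):
    try:
        return True, data_queue.get(timeout=timeout)
    except queue.Empty:
        return False, None
    except (ConnectionRefusedError, ConnectionResetError, EOFError) as e:
        # Deserializing the batch needed the sending worker to be alive;
        # if any worker is dead, report that instead of the raw socket error.
        dead = [w.pid for w in workers if not w.is_alive()]
        if dead:
            raise RuntimeError(
                "DataLoader worker(s) {} exited unexpectedly".format(dead)
            ) from e
        raise
```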

facebook-github-bot pushed a commit that referenced this pull request May 6, 2019

Summary:
test was disabled for being flaky, re-enabled in #19421 but still occasionally failing:

https://circleci.com/gh/pytorch/pytorch/1520165?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

(traceback identical to the one quoted in the comment above)
Pull Request resolved: #20063

Differential Revision: D15218223

Pulled By: nairbv

fbshipit-source-id: 32018c4220f7cb9372ef138631fc3a79759265e1
zhangguanheng66 pushed a commit to zhangguanheng66/pytorch that referenced this pull request May 6, 2019
…19421)

Summary:
Also

1. Bump multiprocessing test timeout following python core tests
2. Fix one type of flakiness in `test_proper_exit`.
3. Add trace reporting when loader process hangs in `test_proper_exit` using `faulthandler`.
4. Give `test_proper_exit` another try.

I'll heavily retest this.
Pull Request resolved: pytorch#19421

Differential Revision: D15063728

Pulled By: ezyang

fbshipit-source-id: 4e0d992622e11053c44a9ec237b88b9a28a4472c
Labels: module: flaky-tests, module: tests, open source, triaged

5 participants