[utils.bottleneck] Bottleneck crashes with multi-threaded data loader #6313
Comments
Yikes, thanks for the catch. I'll look into it |
Ran into this problem as well on Unix; would be great if we could still use the bottleneck functionality with multi-worker data loaders :) |
Hi, has anyone found a workaround for this issue? |
This thread deserves more visibility. I've tried many ways to avoid this crash, but have not been able to find a good tool for identifying the bottleneck. |
@stmharry A very simple band-aid, which I'd be happy to approve, is to update the "initialization error" message with a suggested workaround of reducing the number of worker threads. Would you like to submit a PR? |
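For anyone landing here before a proper fix: the band-aid above amounts to dropping to a single-process loader while profiling. A minimal sketch of that workaround, assuming a toy `TensorDataset` and a CUDA-enabled build (names and shapes are illustrative only, not from the original report):

```python
import torch
import torch.utils.data

if __name__ == '__main__':
    dataset = torch.utils.data.TensorDataset(torch.rand(10, 1000), torch.rand(10))

    # Workaround: num_workers=0 keeps data loading in the main process, so the
    # CUDA-mode profiler never has to coexist with forked worker processes.
    data_loader = torch.utils.data.DataLoader(dataset, batch_size=2, num_workers=0)

    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for batch, target in data_loader:
            pass
    print(prof)
```

The price is that data loading runs single-process, which can itself change the timings you are trying to measure.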
This is still an issue for me as well |
Still an issue for me as well |
I also met a similar issue when using the autograd profiler on a GPU device. You can reproduce it by running the following code:

```python
import argparse

import torch
import torch.utils.data

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='mwe')
    parser.add_argument('--num-workers', default=0, type=int)
    args = parser.parse_args()

    data = torch.rand(10, 1000)
    target = torch.rand(10)
    dataset = torch.utils.data.TensorDataset(data, target)
    data_loader = torch.utils.data.DataLoader(dataset, batch_size=2, num_workers=args.num_workers)

    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for i, batch in enumerate(data_loader):
            pass
    print(prof)
```

Running the script via `python test.py --num-workers 0` works fine, while `python test.py --num-workers 1` gives the following error:

```
Traceback (most recent call last):
  File "my_test.py", line 15, in <module>
    for i, batch in enumerate(data_loader):
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 40, in __getitem__
    return tuple(tensor[index] for tensor in self.tensors)
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 40, in <genexpr>
    return tuple(tensor[index] for tensor in self.tensors)
RuntimeError: /opt/conda/conda-bld/pytorch_1549628766161/work/torch/csrc/autograd/profiler.h:81: initialization error
```

Torch version is `pytorch 1.0.1 py3.6_cuda9.0.176_cudnn7.4.2_2 pytorch`. Perhaps they are the same issue, thanks! |
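If GPU timings are not strictly needed, the script above also runs cleanly with `num_workers > 0` when the profiler stays in CPU mode, because CUDA is then never initialized in the parent process before the workers fork. A minimal sketch under that assumption (same toy data as in the comment above):

```python
import torch
import torch.utils.data

if __name__ == '__main__':
    dataset = torch.utils.data.TensorDataset(torch.rand(10, 1000), torch.rand(10))
    data_loader = torch.utils.data.DataLoader(dataset, batch_size=2, num_workers=1)

    # CPU-only profiling: without use_cuda=True the profiler does not touch
    # CUDA, so forking DataLoader workers is unaffected.
    with torch.autograd.profiler.profile() as prof:
        for i, batch in enumerate(data_loader):
            pass
    print(prof)
```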
This is still an issue; the failure mode changed after the significant rewrite of the DataLoader.

Full traceback:
|
Still an issue as of v1.2 |
So the root problem is the following: the CUDA-mode autograd profiler initializes CUDA in the main process, and DataLoader worker processes forked after CUDA has been initialized hit the "initialization error" above.

I'm not sure what the best way to resolve this is. From the user side, one can change the multiprocessing start method to 'spawn'.
|
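For reference, one way to try the 'spawn' route mentioned above is the `multiprocessing_context` argument of `DataLoader`. This is only a sketch, assuming a recent PyTorch (roughly 1.2+) and a CUDA-enabled machine; as noted further down in this thread, it only helps if everything the workers need is picklable:

```python
import torch
import torch.utils.data

if __name__ == '__main__':
    dataset = torch.utils.data.TensorDataset(torch.rand(10, 1000), torch.rand(10))

    # Start workers with 'spawn' instead of 'fork', so the child processes do
    # not inherit the CUDA state that the profiler initialized in the parent.
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=2, num_workers=1, multiprocessing_context='spawn')

    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for batch, target in loader:
            pass
    print(prof)
```

An alternative is calling `torch.multiprocessing.set_start_method('spawn')` at program start, which changes the default start method for the whole process.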
Mitigates #6313

A common use case for the autograd profiler is to use it to run over an entire model, including dataloading. The following will crash:
- run autograd profiler in CUDA mode
- Use a multi-worker DataLoader (presumably with the 'fork' spawn method)

because the autograd profiler initializes CUDA and forking after CUDA is initialized is bad. This PR puts in a nice error message when this happens so that users aren't too confused. The new error message looks like:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  ../torch/csrc/autograd/profiler_cuda.cpp:36: CUDA initialization error. This can occur if one runs the profiler in CUDA mode on code that creates a DataLoader with num_workers > 0. This operation is currently unsupported; potential workarounds are: (1) don't use the profiler in CUDA mode or (2) use num_workers=0 in the DataLoader or (3) Don't profile the data loading portion of your code. https://github.com/pytorch/pytorch/issues/6313 tracks profiler support for multi-worker DataLoader.

Traceback (most recent call last):
  File "/scratch/rzou/pt/workspace/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/scratch/rzou/pt/workspace-env/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/scratch/rzou/pt/workspace-env/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/scratch/rzou/pt/workspace-env/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/scratch/rzou/pt/workspace-env/lib/python3.7/multiprocessing/connection.py", line 920, in wait
    ready = selector.select(timeout)
  File "/scratch/rzou/pt/workspace-env/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/scratch/rzou/pt/workspace/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 18703) is killed by signal: Aborted.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "mwe.py", line 15, in <module>
    for i, batch in enumerate(data_loader):
  File "/scratch/rzou/pt/workspace/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/scratch/rzou/pt/workspace/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/scratch/rzou/pt/workspace/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/scratch/rzou/pt/workspace/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 18703) exited unexpectedly
```

Test Plan:
- Tested locally
- It's hard to add a test for this because the program is supposed to fail.

[ghstack-poisoned]
Summary: Pull Request resolved: #31473

Mitigates #6313

A common use case for the autograd profiler is to use it to run over an entire model, including dataloading. The following will crash:
- run autograd profiler in CUDA mode
- Use a multi-worker DataLoader (presumably with the 'fork' spawn method)

because the autograd profiler initializes CUDA and forking after CUDA is initialized is bad. This PR puts in a nice error message when this happens so that users aren't too confused. The new error message looks like: https://gist.github.com/zou3519/903f15c3e86bad4585b7e5ce14cc1b70

Test Plan:
- Tested locally.
- I didn't add a test case for this because it's hard to write a test case that doesn't completely stop the rest of our test suite from running.

Differential Revision: D19178080
Pulled By: zou3519
fbshipit-source-id: c632525ba1f7b168324f1aa55416e5250f56a086
FWIW, nvprof also has the restriction that it can't profile a process that forks: https://docs.nvidia.com/cuda/profiler-users-guide/index.html#multiprocess-profiling. I'm not sure what actually happens in this case. Furthermore, the design of the autograd profiler requires it to initialize CUDA; otherwise, it is difficult to accurately create a timeline. Regarding torch.utils.bottleneck, we should do at least one of the following:
1. switch bottleneck to CPU-only profiling,
2. document the limitation (CUDA-mode profiling does not work with a multi-worker DataLoader), or
3. raise a clear error message when that combination is hit.
|
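Until one of those options is in place, a workaround in the spirit of item (3) from the error message added by #31473 ("Don't profile the data loading portion of your code") is to keep the multi-worker loader entirely outside the profiled region. A small sketch with a made-up model, assuming the prefetched batches fit in memory:

```python
import torch
import torch.utils.data

if __name__ == '__main__':
    dataset = torch.utils.data.TensorDataset(torch.rand(10, 1000), torch.rand(10))
    loader = torch.utils.data.DataLoader(dataset, batch_size=2, num_workers=2)

    # Consume the multi-worker loader *before* the CUDA-mode profiler starts,
    # so no worker process ever runs while profiling is active.
    batches = [batch for batch, target in loader]

    model = torch.nn.Linear(1000, 1).cuda()
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for batch in batches:
            model(batch.cuda())
    print(prof)
```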
AFAICT zou3519 has implemented 2. and 3. from #6313 (comment) in the merged PR #31473. Should this issue be closed, or is there still something left to do? |
cc @zou3519 |
@Baranowski: Yeah, it looks like we've done 2&3. For 2, we missed adding an entry to the bottleneck docs: https://pytorch.org/docs/stable/bottleneck.html . So that remains to be done. For action item 1, I'm not sure if it is a good idea to switch to CPU-only profiling for bottleneck. That is a separate issue though; to finish resolving the issue of this crash, we should add a line to the bottleneck docs about the limitation. |
How do you change from 'fork' to 'spawn'? edit: welp, that doesn't solve the problem because I can't pickle |
closing due to age. |
`torch.utils.bottleneck` doesn't work properly when the code contains a data loader that uses more than 0 threads.

Minimum reproducible example (`mwe.py`):

Running the script via:

works fine, while

crashes with the following stack trace:

Assigning this to @zou3519, even though I'm not sure if it's a problem in the profiler or in the `bottleneck` tool.

pytorch version: `'0.4.0a0+b21e135'`

cc @ezyang @gchanan @zou3519 @ssnl
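The original `mwe.py`, commands, and stack trace were not captured in this copy of the issue. Purely as a hypothetical stand-in (dataset, shapes, and model are invented, not the reporter's code), a script of the kind described would look like the following, run via `python -m torch.utils.bottleneck repro.py`:

```python
# repro.py -- hypothetical reconstruction, not the original mwe.py.
import torch
import torch.utils.data

def main():
    dataset = torch.utils.data.TensorDataset(torch.rand(10, 1000), torch.rand(10))
    # num_workers > 0 is the condition under which torch.utils.bottleneck crashes.
    loader = torch.utils.data.DataLoader(dataset, batch_size=2, num_workers=1)
    model = torch.nn.Linear(1000, 1)
    for batch, target in loader:
        model(batch)

if __name__ == '__main__':
    main()
```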