
[utils.bottleneck] Bottleneck crashes with multi-threaded data loader #6313

Closed
fmassa opened this issue Apr 5, 2018 · 17 comments

@fmassa (Member) commented Apr 5, 2018

torch.utils.bottleneck doesn't work properly when the code contains a DataLoader that uses more than 0 workers.

Minimal reproducible example (mwe.py):

import argparse
import torch
import torch.utils.data

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='mwe')
    parser.add_argument('--num-workers', default=0, type=int)
    args = parser.parse_args()

    data = torch.rand(10, 1000)
    target = torch.rand(10)
    dataset = torch.utils.data.TensorDataset(data, target)
    data_loader = torch.utils.data.DataLoader(dataset,
        batch_size=2, num_workers=args.num_workers)
    for i, batch in enumerate(data_loader):
        pass

Running the script via:

python -m torch.utils.bottleneck -- mwe.py --num-workers 0

works fine, while

python -m torch.utils.bottleneck -- mwe2.py --num-workers 1

crashes with the following stack trace:

Traceback (most recent call last):
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/site-packages/torch/utils/bottleneck/__main__.py", line 280, in <module>
    main()
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/site-packages/torch/utils/bottleneck/__main__.py", line 261, in main
    autograd_prof_cpu, autograd_prof_cuda = run_autograd_prof(code, globs)
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/site-packages/torch/utils/bottleneck/__main__.py", line 155, in run_autograd_prof
    result.append(run_prof(use_cuda=True))
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/site-packages/torch/utils/bottleneck/__main__.py", line 149, in run_prof
    exec(code, globs, None)
  File "mwe2.py", line 15, in <module>
    for i, batch in enumerate(data_loader):
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 285, in __next__
    return self._process_next_batch(batch)
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 306, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 40, in __getitem__
    return tuple(tensor[index] for tensor in self.tensors)
  File "/private/home/fmassa/.conda/envs/detectron_v2/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 40, in <genexpr>
    return tuple(tensor[index] for tensor in self.tensors)
RuntimeError: /private/home/fmassa/github/pytorch/torch/csrc/autograd/profiler.h:52: initialization error

Assigning this to @zou3519, even though I'm not sure if it's a problem in the profiler or in the bottleneck tool.

pytorch version '0.4.0a0+b21e135'

cc @ezyang @gchanan @zou3519 @ssnl

@zou3519 (Contributor) commented Apr 5, 2018

Yikes, thanks for the catch. I'll look into it

@TiRune (Contributor) commented Aug 23, 2018

Ran into this problem as well on Unix; it would be great if we could still use the bottleneck functionality with multi-worker data loaders :)

@jhagege commented Nov 4, 2018

Hi, has anyone found a workaround for this issue?
I am trying to identify a bottleneck in PyTorch code and have a hard time doing it without proper profiling.
Can anyone recommend good resources for identifying bottlenecks (even with other techniques)?
Thanks!

@stmharry commented:

This thread deserves more visibility.

I tried many ways to avoid /pytorch/torch/csrc/autograd/profiler.h:52: initialization error and ended up here, only to realize the problem is the number of workers.

I have not been able to locate a good tool for identifying the bottleneck.

@ezyang (Contributor) commented Jan 3, 2019

@stmharry A very simple band-aid, which I'd be happy to approve, is to update the "initialization error" message with a suggested workaround of reducing the number of worker threads. Would you like to submit a PR?

@zou3519 zou3519 removed their assignment Mar 12, 2019
@dcela commented Mar 23, 2019

This is still an issue for me as well

@yaceben commented Apr 2, 2019

Still an issue for me as well

@ezyang ezyang added high priority, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module), module: dataloader (Related to torch.utils.data.DataLoader and Sampler), and module: bottleneck (Related to torch.utils.bottleneck), and removed bug and todo (Not as important as medium or high priority tasks, but we will work on these) labels Apr 2, 2019
@XiaobingSuper (Collaborator) commented:

I also hit a similar issue when using the autograd profiler on a GPU device. You can reproduce it by running the following code:

import argparse
import torch
import torch.utils.data

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='mwe')
    parser.add_argument('--num-workers', default=0, type=int)
    args = parser.parse_args()

    data = torch.rand(10, 1000)
    target = torch.rand(10)
    dataset = torch.utils.data.TensorDataset(data, target)
    data_loader = torch.utils.data.DataLoader(dataset, batch_size=2, num_workers=args.num_workers)
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for i, batch in enumerate(data_loader):
            pass
    print(prof)

Running the script via:

python test.py --num-workers 0

works fine, while

python test.py --num-workers 1

fails with the following error:

Traceback (most recent call last):
  File "my_test.py", line 15, in <module>
    for i, batch in enumerate(data_loader):
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 40, in __getitem__
    return tuple(tensor[index] for tensor in self.tensors)
  File "/home/xiaobinz/anaconda3/envs/pytorch-gpu/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 40, in <genexpr>
    return tuple(tensor[index] for tensor in self.tensors)
RuntimeError: /opt/conda/conda-bld/pytorch_1549628766161/work/torch/csrc/autograd/profiler.h:81: initialization error

Torch version is

pytorch                   1.0.1           py3.6_cuda9.0.176_cudnn7.4.2_2    pytorch

Perhaps they are the same issue, thanks!

@rgommers (Collaborator) commented:

This is still an issue; the failure mode changed after the significant rewrite of dataloader.py in #19228:

    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 15030) is killed by signal: Aborted.
...
RuntimeError: DataLoader worker (pid(s) 15030) exited unexpectedly

Full traceback:

$ python -m torch.utils.bottleneck -- torch_6313.py --num-workers 1
`bottleneck` is a tool that can be used as an initial step for debugging
bottlenecks in your program.

It summarizes runs of your script with the Python profiler and PyTorch's
autograd profiler. Because your script will be profiled, please ensure that it
exits in a finite amount of time.

For more complicated uses of the profilers, please see
https://docs.python.org/3/library/profile.html and
https://pytorch.org/docs/master/autograd.html#profiler for more information.
Running environment analysis...
Running your script with cProfile
Running your script with the autograd profiler...
terminate called after throwing an instance of 'std::runtime_error'
  what():  /opt/conda/conda-bld/pytorch-nightly_1563772086834/work/torch/csrc/autograd/profiler_cuda.cpp:22: initialization error
Traceback (most recent call last):
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 683, in _try_get_data
    data = self.data_queue.get(timeout=timeout)
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/multiprocessing/connection.py", line 920, in wait
    ready = selector.select(timeout)
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 15030) is killed by signal: Aborted. 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/torch/utils/bottleneck/__main__.py", line 231, in <module>
    main()
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/torch/utils/bottleneck/__main__.py", line 210, in main
    autograd_prof_cpu, autograd_prof_cuda = run_autograd_prof(code, globs)
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/torch/utils/bottleneck/__main__.py", line 104, in run_autograd_prof
    result.append(run_prof(use_cuda=True))
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/torch/utils/bottleneck/__main__.py", line 98, in run_prof
    exec(code, globs, None)
  File "torch_6313.py", line 15, in <module>
    for i, batch in enumerate(data_loader):
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 763, in __next__
    idx, data = self._get_data()
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 730, in _get_data
    success, data = self._try_get_data()
  File "/home/rgommers/anaconda3/envs/pytorch-nightly/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 696, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 15030) exited unexpectedly

@andfoy andfoy self-assigned this Jul 25, 2019
@hongliny commented:

Still an issue as of v1.2

@zou3519 zou3519 self-assigned this Sep 13, 2019
@zou3519 (Contributor) commented Oct 9, 2019

So the root problem is the following:

  • DataLoader uses fork to create a new process
  • the autograd profiler in CUDA mode initializes CUDA. Bottleneck uses the CUDA mode autograd profiler if it is available
  • the CUDA API does not support fork-ing after CUDA has been initialized.

I'm not sure what the best way to resolve this is. From the user side, one can change the multiprocessing start method to spawn (instead of fork), but this might have different performance characteristics; a sketch of this workaround follows the list below. Here are some potential action items:

  • Give a nice error message, something like: "autograd profiler in CUDA mode is not supported with dataloader"
  • Figure out some way to benchmark data loading.
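As a rough sketch of the spawn workaround mentioned above (user-side only, not a fix for bottleneck itself), the DataLoader can be asked to start its workers with spawn via its multiprocessing_context argument (available in recent PyTorch releases); the tensor shapes and batch size here are just placeholders:

```python
import torch
import torch.utils.data

if __name__ == '__main__':
    data = torch.rand(10, 1000)
    target = torch.rand(10)
    dataset = torch.utils.data.TensorDataset(data, target)

    # Start workers with 'spawn' instead of the default 'fork' on Linux, so a
    # CUDA context created in the parent process (e.g. by the CUDA-mode
    # autograd profiler) is not inherited by a forked child.
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=2, num_workers=1,
        multiprocessing_context='spawn')

    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for i, batch in enumerate(data_loader):
            pass
    print(prof.key_averages().table())
```

Note that spawn re-imports the main module in each worker, so the dataset and any transforms must be picklable and the script needs the usual `if __name__ == '__main__'` guard.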

zou3519 added a commit that referenced this issue Dec 18, 2019
Mitigates #6313

A common use case for the autograd profiler is to use it to run over an
entire model, including dataloading. The following will crash:
- run autograd profiler in CUDA mode
- Use a multi-worker DataLoader (presumably with the 'fork' spawn
method)
because the autograd profiler initializes CUDA and forking after CUDA is
initialized is bad.

This PR puts in a nice error message when this happens so that users
aren't too confused. The new error message looks like:
```
terminate called after throwing an instance of 'std::runtime_error'
  what():  ../torch/csrc/autograd/profiler_cuda.cpp:36: CUDA initialization error. This can occur if one runs the profiler in CUDA mode on code that creates a DataLoader with num_workers > 0. This operation is currently unsupported; potential workarounds are: (1) don't use the profiler in CUDA mode or (2) use num_workers=0 in the DataLoader or (3) Don't profile the data loading portion of your code. https://github.com/pytorch/pytorch/issues/6313 tracks profiler support for multi-worker DataLoader.
Traceback (most recent call last):
  File "/scratch/rzou/pt/workspace/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/scratch/rzou/pt/workspace-env/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/scratch/rzou/pt/workspace-env/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/scratch/rzou/pt/workspace-env/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/scratch/rzou/pt/workspace-env/lib/python3.7/multiprocessing/connection.py", line 920, in wait
    ready = selector.select(timeout)
  File "/scratch/rzou/pt/workspace-env/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/scratch/rzou/pt/workspace/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 18703) is killed by signal: Aborted.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "mwe.py", line 15, in <module>
    for i, batch in enumerate(data_loader):
  File "/scratch/rzou/pt/workspace/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/scratch/rzou/pt/workspace/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/scratch/rzou/pt/workspace/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/scratch/rzou/pt/workspace/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 18703) exited unexpectedly
```

Test Plan:
- Tested locally
- It's hard to add a test for this because the program is supposed to fail.

[ghstack-poisoned]
zou3519 added a commit that referenced this issue Dec 18, 2019
Mitigates #6313 (same commit message as above).

ghstack-source-id: a59340bd942860fe135ae418dab5d9e88726b043
Pull Request resolved: #31445
zou3519 added a commit that referenced this issue Dec 19, 2019
Mitigates #6313 (same commit message as above, with the example error output replaced by a link: https://gist.github.com/zou3519/903f15c3e86bad4585b7e5ce14cc1b70).
zou3519 added a commit that referenced this issue Dec 19, 2019
…loader crash"

Mitigates #6313 (same commit message as above).

Differential Revision: [D19178080](https://our.internmc.facebook.com/intern/diff/D19178080)
zou3519 added a commit that referenced this issue Dec 19, 2019
Mitigates #6313 (same commit message as above).

ghstack-source-id: 0d0e42674f2c7e3324abdbe4ccc797ba827e1ab6
Pull Request resolved: #31473
facebook-github-bot pushed a commit that referenced this issue Dec 19, 2019
#31473)

Summary:
Pull Request resolved: #31473

Mitigates #6313 (same commit message as above).

Differential Revision: D19178080

Pulled By: zou3519

fbshipit-source-id: c632525ba1f7b168324f1aa55416e5250f56a086
@zou3519 (Contributor) commented Dec 20, 2019

FWIW, nvprof also has the restriction that it can't profile a process that forks: https://docs.nvidia.com/cuda/profiler-users-guide/index.html#multiprocess-profiling. I'm not sure what actually happens in this case.

Furthermore, the design of the autograd profiler requires it to initialize CUDA; otherwise, it is difficult to accurately create a timeline.

Regarding torch.utils.bottleneck, we should do at least one of the following:

  1. Switch to CPU-only profiling as the default (a user-side sketch of CPU-only profiling is shown after this list).
  2. Document that we don't support multi-process profiling.
  3. Add some graceful error handling so that bottleneck doesn't crash when doing multi-process profiling.
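For reference, here is a minimal user-side sketch of option 1 (CPU-only profiling), which sidesteps the fork/CUDA conflict because the profiler never initializes CUDA; the tensors and worker count are placeholders:

```python
import torch
import torch.utils.data

if __name__ == '__main__':
    dataset = torch.utils.data.TensorDataset(torch.rand(10, 1000), torch.rand(10))
    data_loader = torch.utils.data.DataLoader(dataset, batch_size=2, num_workers=2)

    # CPU-only profiling: CUDA is never initialized in the parent process,
    # so forking DataLoader workers is safe.
    with torch.autograd.profiler.profile(use_cuda=False) as prof:
        for i, batch in enumerate(data_loader):
            pass
    print(prof.key_averages().table(sort_by='cpu_time_total'))
```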

wuhuikx pushed a commit to wuhuikx/pytorch that referenced this issue Jan 30, 2020
pytorch#31473)

Summary:
Pull Request resolved: pytorch#31473

Mitigates pytorch#6313 (same commit message as above).

Differential Revision: D19178080

Pulled By: zou3519

fbshipit-source-id: c632525ba1f7b168324f1aa55416e5250f56a086
@Baranowski (Contributor) commented:

AFAICT zou3519 has implemented 2 and 3 from #6313 (comment) in the merged PR #31473. Should this issue be closed, or is there still something left to do?

@Baranowski Baranowski added the quansight-nack (High-prio issues that have been reviewed by Quansight and are judged to be not actionable) label Apr 21, 2020
@ezyang (Contributor) commented Apr 21, 2020

cc @zou3519

@zou3519 (Contributor) commented Apr 21, 2020

@Baranowski: Yeah, it looks like we've done 2 & 3.

For 2, we missed adding an entry to the bottleneck docs: https://pytorch.org/docs/stable/bottleneck.html, so that remains to be done.

For action item 1, I'm not sure if it is a good idea to switch to CPU-only profiling for bottleneck. That is a separate issue though; to finish resolving the issue of this crash, we should add a line to the bottleneck docs about the limitation.

@makslevental (Contributor) commented Apr 21, 2020

How do you change from fork to spawn in bottleneck? I'm having this issue right now.

Edit: welp, that doesn't solve the problem because I can't pickle Lambda transforms.
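One hedged sketch of a way around the pickling problem mentioned in the edit above: under the spawn start method everything handed to the workers must be picklable, so an inline lambda inside a torchvision transforms.Lambda can be replaced by a module-level function (the function name here is made up for illustration):

```python
from torchvision import transforms

def flatten_image(x):
    # A top-level function is picklable, unlike an inline `lambda x: x.reshape(-1)`,
    # so it can be sent to 'spawn'-started DataLoader workers.
    return x.reshape(-1)

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(flatten_image),
])
```

If the transform needs parameters, functools.partial over a top-level function remains picklable as well.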

@ngimel (Collaborator) commented Feb 7, 2022

Closing due to age.
