This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Gluon DataLoader cannot release the processes in the pool #13521

Closed
YutingZhang opened this issue Dec 4, 2018 · 5 comments · Fixed by #13537

Comments

@YutingZhang
Contributor

YutingZhang commented Dec 4, 2018

https://github.com/apache/incubator-mxnet/blob/f2dcd7c7b8676b55d912997fc3f9c62c55915307/python/mxnet/gluon/data/dataloader.py#L532-L533

Logically, when a DataLoader is recycled (garbage-collected), its _worker_pool should be recycled too, and the pool's terminate() method should be called immediately. However, that does not happen.

Each time I delete a DataLoader, it leaves the worker processes dangling.
I suspect this is a bug in Python's multiprocessing.Pool. Either way, I think we can patch it by explicitly calling _worker_pool.terminate().
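
A minimal sketch of that patch idea, independent of MXNet. `PooledLoader`, `close()` and `live_children()` are hypothetical names used only for illustration, and the "fork" start method is assumed (POSIX only). It is not the actual DataLoader code; it only shows an owner object terminating its pool explicitly instead of relying on the pool's own cleanup:

```python
import multiprocessing as mp


class PooledLoader:
    """Hypothetical stand-in for a loader that owns a worker pool (not MXNet code)."""

    def __init__(self, num_workers=2):
        # Assumption: the "fork" start method (POSIX only), so no __main__ guard is needed.
        self._worker_pool = mp.get_context("fork").Pool(num_workers)

    def close(self):
        """Terminate the workers right away; calling it twice is harmless."""
        pool = getattr(self, "_worker_pool", None)
        self._worker_pool = None
        if pool is not None:
            pool.terminate()  # stop the worker processes without waiting for pending work
            pool.join()       # wait until the workers have actually exited

    def __del__(self):
        # Do not rely on the pool's own finalizer; release the workers explicitly.
        self.close()


def live_children():
    """Number of child processes of this interpreter that are still alive."""
    return len(mp.active_children())
```

With this shape, calling `loader.close()` should bring `live_children()` back to the value it had before the loader was created.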

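Minimal code to reproduce the problem:

import mxnet as mx
import numpy as np

A = np.random.rand(999, 2000)
# num_workers=2 makes the DataLoader start a pool of two worker processes
D = mx.gluon.data.DataLoader(A, batch_size=8, num_workers=2)
the_iter = iter(D)
next(the_iter)
# Dropping both references should also release the workers, but they stay alive
del the_iter
del D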

I recorded a video demo for this bug: https://drive.google.com/open?id=1q4CmU_F1vAtxoZ_KUmrIEfVRk3RsQfv8

Environment: today's mxnet from pip, python3.6 on p3

@zhreshold
Member

@YutingZhang Could this be caused by Jupyter, since it may cache sessions?
I tried it in a terminal and the processes are garbage-collected just fine.

@zhreshold
Member

@YutingZhang Okay, I found that the problem is present on Linux but not on Mac.
As discussed offline, it's safer to call terminate explicitly; I will file a PR for this.

@zhreshold zhreshold mentioned this issue Dec 4, 2018
4 tasks
@YutingZhang
Contributor Author

YutingZhang commented Dec 5, 2018

@zhreshold Great. Thanks!
FYI, I tested it on my Mac using Anaconda Python 3.6, and the problem showed up there too. Maybe it is caused by the Anaconda build of Python?

zhreshold added a commit that referenced this issue Dec 5, 2018
* fix pool release

* fix
TaoLv added a commit that referenced this issue Dec 6, 2018
This reverts commit f6b4665.
TaoLv added a commit that referenced this issue Dec 6, 2018
…icense file" (#13558)

* Revert "Chi_square_check for discrete distribution fix (#13543)"

This reverts commit cf6e8cb.

* Revert "Updated docs for randint operator (#13541)"

This reverts commit e0ff3c3.

* Revert "Simplifications and some fun stuff for the MNIST Gluon tutorial (#13094)"

This reverts commit 8bbac82.

* Revert "Fix #13521 (#13537)"

This reverts commit f6b4665.

* Revert "Add a retry to qemu_provision (#13551)"

This reverts commit f6f8401.

* Revert "[MXNET-769] Use MXNET_HOME in a tempdir in windows to prevent access denied due t… (#13531)"

This reverts commit bd8e0f8.

* Revert "[MXNET-1249] Fix Object Detector Performance with GPU (#13522)"

This reverts commit 1c8972c.

* Revert "Fixing a 404 in the ubuntu setup doc (#13542)"

This reverts commit cb0db29.

* Revert "Bumped minor version from 1.4.0 to 1.5.0 on master, updated License file (#13478)"

This reverts commit 40db619.
@YutingZhang
Contributor Author

@zhreshold Confirmed this as a Python bug: https://bugs.python.org/issue34172
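
For code that uses multiprocessing.Pool directly, one way to avoid depending on garbage collection is to scope the pool with a `with` block, whose exit calls terminate(). A small sketch under stated assumptions: `sqrt_all` is a hypothetical helper, and the "fork" start method is assumed (POSIX only). It illustrates the general pattern, not the fix that went into MXNet:

```python
import math
import multiprocessing as mp


def sqrt_all(values, num_workers=2):
    """Map math.sqrt over values in a pool that is always torn down."""
    # Assumption: the "fork" start method (POSIX only), so no __main__ guard is needed.
    ctx = mp.get_context("fork")
    with ctx.Pool(num_workers) as pool:
        results = pool.map(math.sqrt, values)
    # Leaving the with-block has already called pool.terminate();
    # join() then waits for the worker processes to exit.
    pool.join()
    return results
```

After `sqrt_all` returns, no worker processes from its pool should remain, regardless of when the pool object itself is collected.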

@zhreshold
Member

@YutingZhang Good to know, thanks

zhaoyao73 pushed a commit to zhaoyao73/incubator-mxnet that referenced this issue Dec 13, 2018
* fix pool release

* fix
zhaoyao73 pushed a commit to zhaoyao73/incubator-mxnet that referenced this issue Dec 13, 2018
…icense file" (apache#13558)

* Revert "Chi_square_check for discrete distribution fix (apache#13543)"

This reverts commit cf6e8cb.

* Revert "Updated docs for randint operator (apache#13541)"

This reverts commit e0ff3c3.

* Revert "Simplifications and some fun stuff for the MNIST Gluon tutorial (apache#13094)"

This reverts commit 8bbac82.

* Revert "Fix apache#13521 (apache#13537)"

This reverts commit f6b4665.

* Revert "Add a retry to qemu_provision (apache#13551)"

This reverts commit f6f8401.

* Revert "[MXNET-769] Use MXNET_HOME in a tempdir in windows to prevent access denied due t… (apache#13531)"

This reverts commit bd8e0f8.

* Revert "[MXNET-1249] Fix Object Detector Performance with GPU (apache#13522)"

This reverts commit 1c8972c.

* Revert "Fixing a 404 in the ubuntu setup doc (apache#13542)"

This reverts commit cb0db29.

* Revert "Bumped minor version from 1.4.0 to 1.5.0 on master, updated License file (apache#13478)"

This reverts commit 40db619.
zhaoyao73 added a commit to zhaoyao73/incubator-mxnet that referenced this issue Dec 13, 2018
* upstream/master: (54 commits)
  Add notes about debug with libstdc++ symbols (apache#13533)
  add cpp example inception to nightly test (apache#13534)
  Fix exception handling api doc (apache#13519)
  fix link for gluon model zoo (apache#13583)
  ONNX import/export: Size (apache#13112)
  Update MXNetTutorialTemplate.ipynb (apache#13568)
  fix the situation where idx didn't align with rec (apache#13550)
  Fix use-before-assignment in convert_dot (apache#13511)
  License update  (apache#13565)
  Update version to v1.5.0 including clojure package (apache#13566)
  Fix flaky test test_random:test_randint_generator (apache#13498)
  Add workspace cleaning after job finished (apache#13490)
  Adding test for softmaxoutput (apache#13116)
  apache#13441 [Clojure] Add Spec Validations for the Random namespace (apache#13523)
  Revert "Bumped minor version from 1.4.0 to 1.5.0 on master, updated License file" (apache#13558)
  Chi_square_check for discrete distribution fix (apache#13543)
  Updated docs for randint operator (apache#13541)
  Simplifications and some fun stuff for the MNIST Gluon tutorial (apache#13094)
  Fix apache#13521 (apache#13537)
  Add a retry to qemu_provision (apache#13551)
  ...