Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Scala Module API resize is leaking memory on the native size. #10867

Closed
jessebrizzi opened this issue May 9, 2018 · 18 comments · Fixed by #14372
Closed

Scala Module API resize is leaking memory on the native size. #10867

jessebrizzi opened this issue May 9, 2018 · 18 comments · Fixed by #14372

Comments

@jessebrizzi
Copy link
Contributor

jessebrizzi commented May 9, 2018

Description

Create and bind a MXNet Module with batch size N+1 and proceed to loop and pass DataBatches to it that require the Module to resize before performing the forward pass. Monitor the system resources (With htop, nvidia-smi, jvmtop) and you will notice the used system memory in htop will start to grow, but not the jvm heap size (the system memory usages grows beyond the set max JVM heap size) or GPU memory usage. This will continue until your system runs out of memory and there is a crash or the JVM is killed clearing all of the leaked used system memory with it.

Environment info (Required)

----------Python Info----------
Version      : 3.4.3
Compiler     : GCC 4.8.4
Build        : ('default', 'Nov 17 2016 01:08:31')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /usr/local/lib/python3.4/dist-packages/pip
----------MXNet Info-----------
Version      : 1.1.0
Directory    : /home/paperspace/src/mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.4.0-31-generic-x86_64-with-Ubuntu-14.04-trusty
system       : Linux
node         : psplnlbg
release      : 4.4.0-31-generic
version      : #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Stepping:              1
CPU MHz:               2600.072
BogoMIPS:              5200.14
Hypervisor vendor:     Microsoft
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              10240K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0884 sec, LOAD: 0.4182 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0131 sec, LOAD: 0.4660 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0957 sec, LOAD: 0.6263 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0099 sec, LOAD: 0.3049 sec.
Timing for Gluon Tutorial(en): https://gluon.mxnet.io, DNS: 0.0134 sec, LOAD: 0.6473 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0126 sec, LOAD: 0.0343 sec.

Package used (Python/R/Scala/Julia):
Scala. This seems to be specific to the Scala API. Can not reproduce in Python.

For Scala user, please provide:

  1. Java version: (java -version)
    java version "1.8.0_131"
    Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

  2. Maven version: (mvn -version)
    Apache Maven 3.0.5
    Maven home: /usr/share/maven
    Java version: 1.8.0_131, vendor: Oracle Corporation
    Java home: /usr/lib/jvm/java-8-oracle/jre
    Default locale: en_US, platform encoding: UTF-8
    OS name: "linux", version: "4.4.0-31-generic", arch: "amd64", family: "unix"

  3. Scala runtime if applicable: (scala -version)
    2.11.11

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): GCC 4.8.4

MXNet commit hash:
07a83a0

Build config:
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
make scalainstall -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1

Error Message:

None

Minimum reproducible example

link to simple Scala project/code to reproduce issue
https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestBug.scala

Steps to reproduce

(Paste the commands you ran that produced the error.)

  1. Pull this project and run it with SBT to test this issue.

What have you tried to solve it?

  1. Avoid passing in DataBatch that requires resize by padding smaller DataBatch with 0's.
  2. Not using the Scala API
@jessebrizzi
Copy link
Contributor Author

Also tested on master branch 10ac529 and bug is present

@lupesko
Copy link
Contributor

lupesko commented Jun 13, 2018

Hey @jessebrizzi - thanks for reporting the issue!
I noticed that your example code is using the old namespace... can you reproduce on MXNet latest (1.2.0) and with the new package that is available on Maven?

Adding @nswamy who contributes to the Scala package

@jessebrizzi
Copy link
Contributor Author

@lupesko I updated the example code to use the mxnet 1.2.0 package released on Maven and reran the test and the memory leak is still present.

I have also put together a docker container that you can use with nvidia-docker to run my example code with sbt/scala/java/cudnn/cuda all installed here https://hub.docker.com/r/jessebrizzi/dl-dev/ to control for environment differences.

@nswamy nswamy added the Memory label Jul 1, 2018
@jessebrizzi
Copy link
Contributor Author

I spent some time digging into this to see if I could find a fix.

Them memory leak seems to be exclusive to when an Executor has to reshape to a size larger than the previously binded size. When you size down there is no noticeable memory leak. I have updated my git repo that as the bug reproduction to show this.

This could be related to the new empty NDArrays being created here https://github.com/apache/incubator-mxnet/blob/master/scala-package/core/src/main/scala/org/apache/mxnet/Executor.scala#L105 and then the previously binded arrays that they are replacing are not properly being freed somewhere. I am assuming the problem is avoided in the resize to a smaller size case since the backend implementation of reshape on the NDArray has no problem reusing the old native memory space for the NDarrays since it can fit the new size in the same memory space as the old size. I tired disposing the old NDArrays for the old size of the Executor in various places but could not find a place that worked (or any proof that disposing the old NDArrays wold fix the memory leak)

I looked into the python API to see if it was doing anything different that could be adapted into the Scala interface to solve the problem. It looks like the Python API has changed from resizing the Executor in Python to using a new backend function to resize the Executor. (Link to the PR for that #10882). I figured that this new backend implementation avoids the memory leak issue I made a Jira ticket for the same change (https://issues.apache.org/jira/browse/MXNET-747) and attempted to make the same change in the Scala interface (https://github.com/jessebrizzi/incubator-mxnet/tree/MXNET-747), but I am having trouble writing the new JNI interface for the native resize call and I could use some eyes on it that are more familiar with the JNI interface for the Scala API (getting a SIGSEGV on the call to the MXExecutorReshape method here https://github.com/jessebrizzi/incubator-mxnet/blob/MXNET-747/scala-package/native/src/main/native/org_apache_mxnet_native_c_api.cc#L1764 and I'm not sure why).

Any help or eyes on this problem would be greatly appreciated, I am currently using MXNet in a production environment and have a pretty hacky work around in place currently to avoid this issue.

@nswamy
Copy link
Member

nswamy commented Jul 31, 2018

@jessebrizzi thanks for your efforts in trying to solve this memory issue. @andrewfayres started looking into this part of the code(JNI) recently, he might be able to help you.

@lanking520 as FYI.

@andrewfayres
Copy link
Contributor

@jessebrizzi I actually started looking into this a few days ago. I got a bit sidetracked with something else but I think I should have a solution for you in a day or two.

@andrewfayres
Copy link
Contributor

@jessebrizzi I've been looking at the JNI code you wrote and figured out that the sigsegv is happening here. However, I've been unable to figure out exactly what's causing the error thus far. Presumably at least one of the variables on that line is invalid but I'm not seeing why.

I'm going to change gears a bit and attempt to fix the memory leak in the existing scala code to at least resolve the issue for you.

@andrewfayres
Copy link
Contributor

@jessebrizzi I've got what I believe to be a working fix in my repo. I need to do some more thorough testing and add some automated testing to this before submitting a PR but my preliminary testing is looking good. Feel free to take a look and let me know if you find any issues.

I'll work on moving all of this over to the native code as soon as I get a few other things off my backlog.

@jessebrizzi
Copy link
Contributor Author

Thanks @andrewfayres !

Just catching up on all of this after taking some time off. I'm going to pull down your change and do some testing myself and get back to you.

@jessebrizzi
Copy link
Contributor Author

Sorry for taking so long to get back to you @andrewfayres

I pulled down your fix from your repo and ran it both on my OSX machine in CPU mode and in a nvidia-docker container (https://hub.docker.com/r/jessebrizzi/dl-dev/) in linux for GPU mode and I still think a memory leak issue is still present.

I have update my bug reproduction repository here to show the exact behavior I am testing.

The memory leak that I am still observing seems to be independent of the binded network size. I tested this by running through various max batch sizes that I would randomly sample from for my test input and regardless of what I set after 10000 iterations the native memory growth seems to be around the same.

The fix does seem to address the upsize vs downlise issue observed earlier.

I would still suggest, to maintain parity with the python interface, that the resize logic should be switched to the backend method that the python interface uses, but I know now messing with a JNI interface change is a pain.

Could you post an example of the code you where running to test the change?

@andrewfayres
Copy link
Contributor

@jessebrizzi Sorry for the slow reply, I've been out on vacation.

As soon as I get a little bit of free time I'll pull down the update you've made to your repository and see if I can reproduce. I mainly checked for the resize issue when I was looking earlier. It's possible there's also a different leak. I definitely agree with moving it to native is our best long term solution but at the moment I don't have the bandwidth to dedicate to this.

The example code that I've been running is actually the same as what you posted originally.

As a side note, there is ongoing work being done by @nswamy to provide a much better solution to scala memory management. Feel free to take a look at the design. Hopefully, once this work is finished all the leaking issues will be resolved.

@zachgk
Copy link
Contributor

zachgk commented Jan 12, 2019

@jessebrizzi The NativeResource Management was added as part of #12647. You can find some documentation about it https://github.com/apache/incubator-mxnet/blob/master/scala-package/memory-management.md. Does this look like it fixes your problem?

@lanking520 lanking520 assigned lanking520 and unassigned lanking520 Jan 28, 2019
@ddavydenko
Copy link
Contributor

@andrewfayres , could you please take a look and test again whether this memory leak issue is fixed with Phantom references implementation?

@jessebrizzi
Copy link
Contributor Author

jessebrizzi commented Jan 28, 2019

@zachgk Sorry for the late response, I had to find some time to run through my tests.

I modified my repo (https://github.com/jessebrizzi/MXNet-Bug) with the example code for the memory leak bug to use the snapshots published in the Nightly Repo https://repository.apache.org/#nexus-search;gav~org.apache.mxnet
and include a Dockerfile that sets up an image to test my code with the nightly builds (You have to change the Cuda version in the docker image to switch between 1.4.0-SNAPSHOT and 1.5.0-SNAPSHOT)

I think I am still seeing the bug (native memory climbing when JVM and GPU memory stays stable), but I am running into a crashing issue when running my "bugged" reproduction code https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestBug.scala

In CPU mode the script will freeze after a few hundred loops with all CPU's pegged and in GPU mode it freezes with the GPU/CPU idling.

In the non-bugged version (https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestNoBug.scala) of my repoduction code this does not happen, it runs fine since it avoids the resize call.

I'm currently trying to get a repo/test using the current 1.3.1 published release, but I have to work through some other blocking issues with that (Mainly the 1.3.1 published scala jar for Linux-GPU is built against opencv 3.4 instead of the opencv 2.4 provided by Ubuntu 16.04 https://packages.ubuntu.com/xenial/libopencv-dev), and I'll comment again when I have that working.

@jessebrizzi
Copy link
Contributor Author

jessebrizzi commented Jan 28, 2019

Updated https://github.com/jessebrizzi/MXNet-Bug to support testing with the released MXNet 1.3.1, 1.4.0-SNAPSHOT, and 1.5.0-SNAPSHOT

The bug exists in 1.3.1 (this was already known), running https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestBug.scala for 10000 forward passes of random input resizes leads to over of 6gb of system memory used and climbing when using the GPU.
Htop attributes the memory usage to the java/sbt process but JVMTop does not show this usage in the Heap inside the JVM (The default heapmax is not that large regardless).

Both 1.4.0-SNAPSHOT and 1.5.0-SNAPSHOT cannot finish the 10000 forward passes random resize test as they crash/freeze as described in my previous comment.

for all versions, the https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestNoBug.scala test runs fine with only 1.8ish GB of mem used on my machine (no random resizes on the input) for both GPU and CPU backed nets.

The dockerfile/README in my bug repo https://github.com/jessebrizzi/MXNet-Bug has the needed instructions if someone wants to try and reproduce my observations.

@andrewfayres
Copy link
Contributor

Thanks @jessebrizzi, I'll use your tests on my end to reproduce and try to resolve this.

@andrewfayres
Copy link
Contributor

I've finally come up with a good fix for this. I'll be submitting a PR shortly. I've run your test on my local laptop and am able to get through all passes without issue and memory looks stable. I've also fixed it so that this code can be wrapped in a ResourceScope block without it crashing.

I'm going to start up an instance to run the test over the weekend just to be thorough but everything's looking good.

@gigasquid
Copy link
Member

@andrewfayres Fantastic! Really excited about this 💯

nswamy pushed a commit that referenced this issue Mar 28, 2019
* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this issue Mar 31, 2019
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
lanking520 pushed a commit to lanking520/incubator-mxnet that referenced this issue Apr 1, 2019
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
ZhennanQin pushed a commit to ZhennanQin/incubator-mxnet that referenced this issue Apr 3, 2019
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
nswamy pushed a commit that referenced this issue Apr 5, 2019
* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
lanking520 added a commit that referenced this issue Apr 5, 2019
* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this issue Jun 23, 2019
…e#14372)

* Fixes for memory leak when reshaping executor

* Fixed Adam Optimizer memory leak

* Cleanup for PR

* Added unit test for new ResourceScope method

* Removing import that was added by overzealous ide

* Add back in an import

* Added flags for executor to know whether or not it owns NDArrays for disposal

* Moving to ResourceScope.using implementation

* Changes to make ResourceScope.using work with existing scope

* Updating ResourceScope to work with existing scopes via usingIfScopeExists method

* Fix clojure unit tests

* Fixes to be compatibile with how clojure is using ResourceScope

* Removing some unnecessary changes

* Adding scope assertion in unit test
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants