reduce memory usage of nucleation model implementation in raccoon #150

Open
BoZeng1997 opened this issue Jul 5, 2023 · 30 comments

Comments

@BoZeng1997
Contributor

Two nucleation models for phase-field fracture consume a lot of memory. The cause is either how the material object is coded, how the model is set up at the input-deck level, or both.

source code

https://github.com/BoZeng1997/raccoon/blob/c24df81ba4ef97f1b3490821daa631d961e3e68d/src/materials/KLRNucleationMicroForce.C
https://github.com/BoZeng1997/raccoon/blob/c24df81ba4ef97f1b3490821daa631d961e3e68d/include/materials/KLRNucleationMicroForce.h

how the model is implemented

https://github.com/BoZeng1997/raccoon/tree/c24df81ba4ef97f1b3490821daa631d961e3e68d/tutorials/surfing_boundary_problem
The current implementation is certainly not the best approach. It requires dispx, dispy, and dispz to be transferred to the subapp, which then computes the stress tensor invariants I1 and J2. One small improvement would be to compute I1 and J2 in the main app and transfer them to the subapp. I would like to know whether there is an even better approach.
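
For reference, the two invariants mentioned here have standard definitions: I1 = tr(sigma) and J2 = 1/2 s:s, where s = sigma - (I1/3) I is the deviatoric stress. The following is a minimal C++14 sketch of computing them once from a stress tensor, for example on the main-app side before a transfer. `Tensor3`, `invariantI1`, and `invariantJ2` are hypothetical names chosen for illustration; they are not raccoon or MOOSE API.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a 3x3 stress tensor (not a MOOSE type).
using Tensor3 = std::array<std::array<double, 3>, 3>;

// First stress invariant: I1 = tr(sigma).
constexpr double invariantI1(const Tensor3 & sigma)
{
  return sigma[0][0] + sigma[1][1] + sigma[2][2];
}

// Second deviatoric invariant: J2 = 1/2 * s_ij * s_ij,
// with s = sigma - (I1 / 3) * identity.
constexpr double invariantJ2(const Tensor3 & sigma)
{
  const double mean = invariantI1(sigma) / 3.0;
  double sum = 0.0;
  for (std::size_t i = 0; i < 3; ++i)
    for (std::size_t j = 0; j < 3; ++j)
    {
      // Subtract the mean stress on the diagonal only.
      const double s_ij = sigma[i][j] - (i == j ? mean : 0.0);
      sum += s_ij * s_ij;
    }
  return 0.5 * sum;
}
```

For a uniaxial stress of magnitude 3 this gives I1 = 3 and J2 = 3 (that is, sigma^2/3); for a pure shear of magnitude 2 it gives I1 = 0 and J2 = 4.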

@BoZeng1997
Contributor Author

Please take a look when you have time, @permcody. Thanks.

@BoZeng1997
Contributor Author

I am checking my derivative size setting. The old cases that used 70GB per CPU were run with size=900, so it is no wonder they consumed so much memory.
I just found out that the minimum size needed to run the same problem differs between machines. Does this make sense? Or is it a sign of a bug in my code or an inappropriate compilation setting?
I am running the same problem (same mesh, input deck, number of CPUs, MOOSE version, ...) on a workstation and on the Duke cluster. Both use a mamba environment. On the workstation, --with-derivative-size=150 runs the problem fine. On the cluster, size=300 reported:
We caught a MetaPhysicL error in while performing element or face loops. This is potentially due to AD not having a sufficiently large derivative container size. To increase the AD container size, you can run configure in the MOOSE root directory with the '--with-derivative-size=<n>' option and then recompile. Other causes of MetaPhysicL logic errors include evaluating functions where they are not defined or differentiable like sqrt (which gets called for vector norm functions) or log with arguments <= 0
Any comment? @permcody @recuero
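
To illustrate why a large derivative size is costly, here is a rough sketch of a toy AD number that reserves room for N derivative entries up front. The value and index types mirror the template arguments visible in the stack traces later in this thread (`double`, `unsigned long`); the layout itself is an assumption for illustration, and `ToyADReal` and `toyBytes` are hypothetical names, not MetaPhysicL or MOOSE types.

```cpp
#include <array>
#include <cstddef>

// Toy AD number with storage for up to N derivative entries reserved up front.
// Assumed layout for illustration only; the real MetaPhysicL type may differ.
template <std::size_t N>
struct ToyADReal
{
  double value;                         // function value
  std::array<double, N> derivs;         // derivative entries
  std::array<unsigned long, N> indices; // degree of freedom each entry refers to
  std::size_t used;                     // number of entries currently in use (<= N)
};

// Bytes occupied by one toy AD number for a given derivative size.
template <std::size_t N>
constexpr std::size_t toyBytes()
{
  return sizeof(ToyADReal<N>);
}
```

On a typical 64-bit Linux build this toy type takes roughly 16 bytes per derivative slot, so `toyBytes<900>()` is about six times `toyBytes<150>()` whether or not the slots are actually used. Under that assumption, every AD-valued quantity stored per quadrature point scales the same way.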

@recuero

recuero commented Aug 29, 2023

That's a bit surprising to me. For both cases (150 and 300), did you configure MOOSE and then compile your application?

@BoZeng1997
Contributor Author

Yes. I ran ./configure --with-derivative-size=<n> in moose/scripts/ and then compiled.

@recuero

recuero commented Aug 29, 2023

You have std::sqrt calls in your models for AD objects. You could protect against a derivative divide by zero by adding a positive epsilon (see https://github.com/idaholab/moose/blob/ee15815834405de6cc5ccccd988d42a38c0dac6c/modules/contact/src/constraints/ComputeFrictionalForceLMMechanicalContact.C#L223). That might help since it's mentioned in the message itself.

@BoZeng1997
Contributor Author

Thanks for the advice, but I am not sure I understand it. So the goal is to protect against a possible division by zero arising from std::sqrt terms, not in the part of the code we can see (there is no explicit division by sqrt() in the code) but somewhere during the computation, is that right? And I should add this small value to all std::sqrt(not_a_number) terms to implement the protection?
Also, in the situation I just described, I was not using my new Material code. The issue existed before I started using my new material object. I am now testing whether the same issue occurs with MOOSE-only test files. I will post the result when it is available.

@recuero

recuero commented Aug 29, 2023

The issue is in the derivative of sqrt(ADReal(0)), which is ~ 1/sqrt(0). It may be that a similar issue is found in other parts of the code, not necessarily yours. You could run the model through the debugger and find out what's triggering that MetaPhysicL error.
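
To make the 1/sqrt(0) point concrete, here is a minimal sketch using a toy dual number with a single derivative component. `Dual`, `dualSqrt`, and `safeSqrt` are hypothetical names for illustration, not MetaPhysicL or MOOSE API, and the default epsilon is an arbitrary assumption that would need tuning for a real model.

```cpp
#include <cassert>
#include <cmath>

// Toy forward-mode dual number: a value plus one derivative component.
// This is NOT MetaPhysicL's DualNumber, only an illustration.
struct Dual
{
  double value;
  double deriv;
};

// Square root with the chain rule: d/dx sqrt(u) = u' / (2 * sqrt(u)).
// At u = 0 the denominator vanishes, so the derivative is inf (or NaN if u' = 0).
inline Dual dualSqrt(const Dual & u)
{
  const double root = std::sqrt(u.value);
  return {root, u.deriv / (2.0 * root)};
}

// Regularized variant: shift the argument by a small positive epsilon so the
// derivative stays finite at u = 0.
inline Dual safeSqrt(const Dual & u, double eps = 1e-16)
{
  assert(eps > 0.0);
  return dualSqrt({u.value + eps, u.deriv});
}
```

With value 0 and derivative 1, `dualSqrt` yields an infinite derivative, whereas `safeSqrt` returns a finite (though large) one. The price is a small bias: the square root of 0 is reported as sqrt(eps) rather than 0.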

@BoZeng1997
Contributor Author

BoZeng1997 commented Aug 31, 2023

It is weird. On the Duke cluster, opt hits the above error but dbg runs fine. What can I learn from the different behaviors?
The dbg executable was compiled in the same environment as opt.

@recuero

recuero commented Aug 31, 2023

That behavior seems a bit odd to me... @lindsayad

@BoZeng1997
Contributor Author

> You have std::sqrt calls in your models for AD objects. You could protect against a derivative divide by zero by adding a positive epsilon (see https://github.com/idaholab/moose/blob/ee15815834405de6cc5ccccd988d42a38c0dac6c/modules/contact/src/constraints/ComputeFrictionalForceLMMechanicalContact.C#L223). That might help since it's mentioned in the message itself.

I have already applied a treatment to the quantity that is passed to std::sqrt:
https://github.com/BoZeng1997/raccoon/blob/f583027e2e500111e5f264841fe45353aa630ac4/src/materials/KLRNucleationMicroForce.C#L94-L99
In this case, do I still need to add a small epsilon inside sqrt()?

@lindsayad
Contributor

I would run your input with valgrind to make sure there are no uninitialized values

@BoZeng1997
Contributor Author

> I would run your input with valgrind to make sure there are no uninitialized values

I will post the input and mesh very soon. It is not the example listed at the beginning of this issue.

@lindsayad
Contributor

You should do that, not me 😄

@hugary1995
Owner

I'm optimistic about valgrind telling us something useful.

@BoZeng1997
Contributor Author

> You should do that, not me 😄

Oops, sorry, I misunderstood.

@BoZeng1997
Contributor Author

I think this is the valgrind message related to uninitialized value(s). It was printed before the MOOSE executable printed the AD derivative size error.

==2055451== Invalid read of size 8
==2055451==    at 0x421A4D7: f_ca4ea86d12991e15 (in /hpc/group/dolbowlab/bz75/annular/fracture/fullsolve/nuc/.jitcache/ca4ea86d12991e15.so)
==2055451==    by 0x8A49D3C: ADFParser::Eval(MetaPhysicL::DualNumber<double, MetaPhysicL::SemiDynamicSparseNumberArray<double, unsigned long, MetaPhysicL::NWrapper<150ul> >, true> const*) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x8AC2A46: FunctionParserUtils<true>::evaluate(std::shared_ptr<ADFParser>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x831405C: ParsedMaterialHelper<true>::computeQpProperties() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x82F9668: Material::computeProperties() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x8215A33: FEProblemBase::reinitMaterials(unsigned short, unsigned int, bool) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x7768D6D: NonlinearThread::onElement(libMesh::Elem const*) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x77842BF: ThreadedElementLoopBase<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> >::operator()(libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> const&, bool) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x7BD77BE: void libMesh::Threads::parallel_reduce<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*>, ComputeResidualThread>(libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> const&, ComputeResidualThread&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x7C7CB18: NonlinearSystemBase::computeResidualInternal(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x7C7E37A: NonlinearSystemBase::computeResidualTags(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x8287F1E: FEProblemBase::computeResidualTags(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==  Address 0x1409acf8 is 8 bytes before a block of size 32 alloc'd
==2055451==    at 0x4C38913: operator new(unsigned long) (vg_replace_malloc.c:472)
==2055451==    by 0x552563B: void std::vector<unsigned long, std::allocator<unsigned long> >::_M_realloc_insert<unsigned long>(__gnu_cxx::__normal_iterator<unsigned long*, std::vector<unsigned long, std::allocator<unsigned long> > >, unsigned long&&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/modules/phase_field/lib/libphase_field-opt.so.0.0.0)
==2055451==    by 0x859EB61: MooseMesh::nodeToActiveSemilocalElemMap() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x777DFAE: BoundaryNodeIntegrityCheckThread::BoundaryNodeIntegrityCheckThread(FEProblemBase&, TheWarehouse::QueryCache<> const&) (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x82B9B2C: FEProblemBase::initialSetup() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x887F5BA: Transient::init() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x8C20E5E: MooseApp::executeExecutioner() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x8C29041: MooseApp::run() (in /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0.0.0)
==2055451==    by 0x10B135: main (in /hpc/group/dolbowlab/bz75/projects/raccoon/raccoon-opt)
==2055451== 

How should I look for the cause of this uninitialized value? Is it an AD variable on the boundary?

@recuero

recuero commented Sep 1, 2023

Judging by the backtrace, the issue seems to come from a parsed material. Can you double-check your input? Maybe uninitialized values, as Alex pointed out, or a divide by zero?

@lindsayad
Contributor

@dschwen do you think this is a false positive in the JIT code?

@BoZeng1997
Contributor Author

> Judging by the backtrace, the issue seems to come from a parsed material. Can you double-check your input? Maybe uninitialized values, as Alex pointed out, or a divide by zero?

I am trying constant or linear material properties to see if that clears the issue. Can you explain what an uninitialized value in the input deck is? I thought that for every quantity created in the input deck, an initial value must be provided to complete its definition.

@recuero

recuero commented Sep 1, 2023

> I thought that for every quantity created in the input deck, an initial value must be provided to complete its definition.

I thought so too. Just suggested that you double check in case you see an issue.

@dschwen

dschwen commented Sep 1, 2023

Can you do that valgrind check with a dbg executable? JIT compilation keeps the function sources in that case and we could check exactly what's going on here.

@BoZeng1997
Contributor Author

BoZeng1997 commented Sep 1, 2023

> Can you do that valgrind check with a dbg executable? JIT compilation keeps the function sources in that case and we could check exactly what's going on here.

But running the dbg executable does not trigger the error. The message "Assertion `_dynamic_n <= N' failed." appears only when running in opt.

@lindsayad
Contributor

That suggests that there is some kind of non-deterministic error. Valgrind will catch this if that's the case regardless of the method you run with. Also, how do you know that is the assertion you're triggering? I thought that you were just getting a general MetaPhysicL exception, the cause of which was unknown?

@BoZeng1997
Contributor Author

> That suggests that there is some kind of non-deterministic error. Valgrind will catch this if that's the case regardless of the method you run with.

Running valgrind --leak-check=full --track-origins=yes --show-leak-kinds=all on the dbg executable did not catch any memory errors. Here is the summary:

==2575156== HEAP SUMMARY:
==2575156==     in use at exit: 0 bytes in 0 blocks
==2575156==   total heap usage: 3,767 allocs, 3,767 frees, 3,093,909 bytes allocated
==2575156== 
==2575156== All heap blocks were freed -- no leaks are possible
==2575156== 
==2575156== For lists of detected and suppressed errors, rerun with: -s
==2575156== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

> Also, how do you know that is the assertion you're triggering? I thought that you were just getting a general MetaPhysicL exception, the cause of which was unknown?

Sorry, I missed this part of the error message. I only posted the part after *** ERROR ***. The complete error message is:

Assertion `_dynamic_n <= N' failed.
/hpc/group/dolbowlab/bz75/moose-compilers/mambaforge3/envs/moose/libmesh/include/metaphysicl/dynamic_std_array_wrapper.h, line 74, compiled Jun 18 2023 at 15:36:32

*** ERROR ***
We caught a MetaPhysicL error in while performing element or face loops. This is potentially due to AD not having a sufficiently large derivative container size. To increase the AD container size, you can run configure in the MOOSE root directory with the '--with-derivative-size=<n>' option and then recompile. Other causes of MetaPhysicL logic errors include evaluating functions where they are not defined or differentiable like sqrt (which gets called for vector norm functions) or log with arguments <= 0
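
The assertion `_dynamic_n <= N` suggests a container whose capacity N is fixed at compile time while the number of entries in use varies at run time. The following is a hedged sketch of that general idea only: `BoundedIndexSet` and `fits` are hypothetical names, and this is not the MetaPhysicL implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical fixed-capacity index set: storage for N entries is reserved at
// compile time, while the number of entries in use grows at run time.
template <std::size_t N>
class BoundedIndexSet
{
public:
  // Record a degree-of-freedom index; throws once more than N distinct
  // indices are needed.
  void insert(std::size_t dof)
  {
    for (std::size_t i = 0; i < _size; ++i)
      if (_data[i] == dof)
        return; // already tracked
    if (_size == N)
      throw std::logic_error("dynamic size exceeds capacity N");
    _data[_size++] = dof;
  }

  std::size_t size() const
  {
    assert(_size <= N); // invariant corresponding to "_dynamic_n <= N"
    return _size;
  }

private:
  std::array<std::size_t, N> _data{};
  std::size_t _size = 0;
};

// Returns true if tracking the indices 0 .. count-1 fits within capacity N.
template <std::size_t N>
bool fits(std::size_t count)
{
  BoundedIndexSet<N> set;
  try
  {
    for (std::size_t dof = 0; dof < count; ++dof)
      set.insert(dof);
  }
  catch (const std::logic_error &)
  {
    return false;
  }
  return set.size() == count;
}
```

In this toy model, a quantity that depends on 151 distinct degrees of freedom fits with N = 300 but not with N = 150. A genuine overflow of this kind should therefore depend only on the problem and on N, not on the machine.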

Is it normal that libmesh was compiled on Jun 18? I updated mamba this Monday.

@dschwen

dschwen commented Sep 1, 2023

I thought the issue was an uninitialized access ...

@lindsayad
Contributor

@BoZeng1997 what method were you running with when you got the valgrind error?

@BoZeng1997
Contributor Author

> I thought the issue was an uninitialized access ...

I got an invalid read error message when running the opt executable with valgrind. I am not sure whether that means uninitialized values.

> what method were you running with when you got the valgrind error?

opt only.

@lindsayad
Contributor

Well, the next thing I would try is gdb with ‘catch throw’ to see what you can learn when the MetaPhysicL exception is thrown. It would be good to get a stack trace.

@BoZeng1997
Contributor Author

This is what I can get with gdb + opt.

Time Step 1, time = 49.5, dt = 0.5
Assertion `_dynamic_n <= N' failed.
/hpc/group/dolbowlab/bz75/moose-compilers/mambaforge3/envs/moose/libmesh/include/metaphysicl/dynamic_std_array_wrapper.h, line 74, compiled Jun 18 2023 at 15:36:32

Thread 1 "raccoon-opt" hit Catchpoint 2 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x555557b2efa0, tinfo=0x7ffff7dab038 <typeinfo for MetaPhysicL::LogicError>, 
    dest=0x7ffff7e047c0 <MetaPhysicL::LogicError::~LogicError()>)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/eh_throw.cc:80
80      /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory.
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-228.el8.x86_64
(gdb) bt
#0  __cxxabiv1::__cxa_throw (obj=0x555557b2efa0, tinfo=0x7ffff7dab038 <typeinfo for MetaPhysicL::LogicError>, dest=0x7ffff7e047c0 <MetaPhysicL::LogicError::~LogicError()>)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516830325/work/build/x86_64-conda-linux-gnu/libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x00007ffff7e03778 in f_ca4ea86d12991e15.cold () from .jitcache/ca4ea86d12991e15.so
#2  0x00007ffff5709d3d in ADFParser::Eval(MetaPhysicL::DualNumber<double, MetaPhysicL::SemiDynamicSparseNumberArray<double, unsigned long, MetaPhysicL::NWrapper<150ul> >, true> const*) ()
   from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#3  0x00007ffff5782a47 in FunctionParserUtils<true>::evaluate(std::shared_ptr<ADFParser>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#4  0x00007ffff4fd405d in ParsedMaterialHelper<true>::computeQpProperties() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#5  0x00007ffff4fb9669 in Material::computeProperties() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#6  0x00007ffff4ed5a34 in FEProblemBase::reinitMaterials(unsigned short, unsigned int, bool) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#7  0x00007ffff4428d6e in NonlinearThread::onElement(libMesh::Elem const*) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#8  0x00007ffff44442c0 in ThreadedElementLoopBase<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> >::operator()(libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> const&, bool) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#9  0x00007ffff48977bf in void libMesh::Threads::parallel_reduce<libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*>, ComputeResidualThread>(libMesh::StoredRange<libMesh::MeshBase::const_element_iterator, libMesh::Elem const*> const&, ComputeResidualThread&) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#10 0x00007ffff493cb19 in NonlinearSystemBase::computeResidualInternal(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) ()
   from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#11 0x00007ffff493e37b in NonlinearSystemBase::computeResidualTags(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) ()
   from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#12 0x00007ffff4f47f1f in FEProblemBase::computeResidualTags(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) ()
   from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#13 0x00007ffff4e8dc36 in FEProblemBase::computeResidualInternal(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&, std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#14 0x00007ffff4e8d5fe in FEProblemBase::computeResidualL2Norm() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#15 0x00007ffff5551691 in FixedPointSolve::solve() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#16 0x00007ffff4cd985e in TimeStepper::step() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#17 0x00007ffff553d6ee in Transient::takeStep(double) () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#18 0x00007ffff553a577 in Transient::execute() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#19 0x00007ffff58e0e47 in MooseApp::executeExecutioner() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#20 0x00007ffff58e9042 in MooseApp::run() () from /hpc/group/dolbowlab/bz75/projects/raccoon/moose/framework/libmoose-opt.so.0
#21 0x0000555555557136 in main ()

opt on the cluster with AD derivative size 150 runs after I cleaned the .jitcache/ folder. I think this small issue is solved. What is stored in .jitcache/? When this folder is not cleaned, some of my other simulations always report a zero residual.

@lindsayad
Contributor

lindsayad commented Sep 9, 2023

Oh I forgot about this ... if you change your derivative size configuration there are problems with the .jitcache, and the current solution is to do what you did: blow away the .jitcache directory before running. I know that @dschwen is aware of this and I could have sworn we have an issue for it, but I'm struggling to find it at the moment. Sorry for the trouble!
