
SPEC test_set_profile_* sporadically failing #357

Closed
landreman opened this issue Oct 1, 2023 · 7 comments
@landreman
Contributor

The SPEC tests test_set_profile_cumulative and test_set_profile_non_cumulative are occasionally failing in the CI. Some examples:
https://github.com/hiddenSymmetries/simsopt/actions/runs/6364144339/job/17280224739
https://github.com/hiddenSymmetries/simsopt/actions/runs/6263108068/job/17006759326
Most of the time these tests both pass; usually it is just one of the many jobs in the "extensive CI" workflow that fails. When they fail, the two tests seem to fail together, and the error message is

======================================================================
ERROR: test_set_profile_cumulative (mhd.test_spec.SpecTests)
Set a SPEC profile of a cumulative quantity (volume current in this example)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/simsopt/simsopt/tests/mhd/test_spec.py", line 178, in test_set_profile_cumulative
    s = Spec(filename)
  File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/simsopt/mhd/spec.py", line 181, in __init__
    self.initial_guess[lvol].set_rc(mm, nn, self.allglobal.allrzrz[0, lvol, imode])
IndexError: index 2 is out of bounds for axis 1 with size 2

======================================================================
ERROR: test_set_profile_non_cumulative (mhd.test_spec.SpecTests)
Set a SPEC profile of a non-cumulative quantity (surface current in this example)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/simsopt/simsopt/tests/mhd/test_spec.py", line 146, in test_set_profile_non_cumulative
    s = Spec(filename)
  File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/simsopt/mhd/spec.py", line 181, in __init__
    self.initial_guess[lvol].set_rc(mm, nn, self.allglobal.allrzrz[0, lvol, imode])
IndexError: index 2 is out of bounds for axis 1 with size 2

I'm not sure what would cause this error in a non-deterministic way.

These tests both begin with the following code:

        filename = os.path.join(TEST_DIR, 'RotatingEllipse_Nvol8.sp')

        with ScratchDir("."):
            s = Spec(filename)

So perhaps there is non-deterministic behavior coming from the filesystem when creating the scratch directory... I don't think a scratch directory is actually needed here, since no files are written, right?

Any ideas how to fix this error?

@smiet
Contributor

smiet commented Oct 2, 2023

Hi Matt,

I am looking into it, though it might take me a while to get to the bottom.

The error is thrown when translating the initial guess for the interfaces into the Spec object.
These coefficients should be contained in the spec.allglobal.allrzrz array.
The test fails when the code tries to read the Fourier coefficients for an interface lvol > 2, which are not present in the allrzrz array.

SPEC writes the interface guesses as a long block of text at the bottom of the .sp file upon successful completion of a run.

When I look at the file, or read it using spec.allglobal.read_inputlist_from_file(), I see the array has the correct dimensions.
Maybe the file has not finished copying when the scratch filesystem is set up?
Or a previous run wrote the interface guesses to the file but didn't complete?

I have seen similar errors when running an ipython kernel in which I load multiple Spec objects at the same time with different resolutions, but I am having trouble reproducing that now.

I will investigate further and discuss; the randomly failing tests are very annoying.

@andrewgiuliani
Contributor

andrewgiuliani commented Apr 23, 2024

Hi all, I've been able to SSH into the GitHub runner and reproduce the error. You can do the same by adding

    - name: Setup tmate session
      uses: mxschmitt/action-tmate@v3

to test.yml just before the unit tests are run.
When SSH'ed into the runner, I can call the unit tests individually and they all seem to work:

python -m unittest mhd.test_spec.SpecTests.test_set_profile_cumulative

python -m unittest mhd.test_spec.SpecTests.test_set_profile_non_cumulative

However, I can reproduce the bug by calling

coverage run --source=simsopt -m unittest discover -t tests -v -s tests/mhd

and stepping through the problematic unit tests, test_set_profile_cumulative and test_set_profile_non_cumulative, with an interactive debugger.

@smiet
Contributor

smiet commented Jun 14, 2024

I think I have found the solution!

The condition to read the initial guess was based on:

    if self.allglobal.nmodes > 0 and self.nvol > 1:

When that condition is met, the sporadic failure occurs while looping over lvol and reading a non-existent entry in allrzrz:

    for lvol in range(0, self.nvol - 1):
        self.initial_guess[lvol].set_rc(mm, nn, self.allglobal.allrzrz[0, lvol, imode])

The self.allglobal.nmodes attribute references an f90wrapped Fortran array.

Now, the issue:
The Fortran memory persists if the Python kernel isn't quit. This means that allglobal.nmodes can still be present, and the test will try to read the stale f90wrapped memory set up by a previous test. If the number of volumes is smaller than the one in memory, there is no problem, but if it is larger, the code reads out of bounds and fails as above.

SPEC only updates the allrzrz and nmodes arrays when Linitialize is set to zero (or a negative number).

Thus the failure only occurs when the Fortran memory is not cleared, the SPEC equilibrium being read has more interfaces than the one just preceding it, and Linitialize != 0, so the arrays are not updated.

I believe that coverage keeps the Python kernel active, or keeps the objects in the tests within scope for longer, and therefore prevents the clearing and re-setting of the Spec.spec Fortran memory; or the number of tests and their order differ because they are spread differently over processors, resulting in the sporadic failures.
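The stale-memory failure mode described above can be mimicked in plain Python with module-level state that outlives the objects reading it. This is only an illustrative analogue (none of these names are simsopt's), assuming a previous 3-volume run left a 2-interface allrzrz behind:

```python
import types

# Stand-in for the f90wrapped "allglobal" module: its state persists for
# the lifetime of the Python process, across Spec object constructions.
allglobal = types.SimpleNamespace(allrzrz=[[1.0, 2.0]])  # sized for 2 interfaces

def read_initial_guess(nvol):
    # Mirrors the loop in spec.py's __init__: one guess per internal interface.
    return [allglobal.allrzrz[0][lvol] for lvol in range(nvol - 1)]

# A 3-volume case reads fine...
assert read_initial_guess(3) == [1.0, 2.0]

# ...but an 8-volume case reads past the end of the stale array,
# failing just like the CI traceback.
try:
    read_initial_guess(8)
except IndexError as exc:
    print("stale memory:", exc)
```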

Possible solution:
Changing the condition upon which memory is read to self.inputlist.linitialize == 0 and self.nvol > 1 should solve it, as I am attempting in PR #418.
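A minimal sketch of that guard change (the attribute names follow the traceback and the snippets above; the real logic in simsopt's spec.py may differ):

```python
from types import SimpleNamespace

def should_read_initial_guess(spec):
    # Old guard: spec.allglobal.nmodes > 0 and spec.nvol > 1
    # nmodes can be stale f90wrapped memory left over from a previous run.
    # New guard: SPEC only refreshes allrzrz/nmodes when Linitialize is zero
    # (or negative), so only trust the arrays in that case.
    return spec.inputlist.linitialize == 0 and spec.nvol > 1

# Toy objects standing in for Spec instances:
fresh = SimpleNamespace(inputlist=SimpleNamespace(linitialize=0), nvol=8)
stale = SimpleNamespace(inputlist=SimpleNamespace(linitialize=1), nvol=8)
print(should_read_initial_guess(fresh))  # True
print(should_read_initial_guess(stale))  # False: arrays may be stale, skip them
```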

@smiet
Contributor

smiet commented Jun 14, 2024

It did not solve the issue, though in PR #418 a different test is now failing, again by accessing out-of-bounds memory when trying to open the 8-volume case RotatingEllipse_Nvol8.sp.

There is something very strange going on: the allrzrz array is not updated correctly in the f90-wrapped memory when the tests are run under coverage, but it is when they are run directly.
@andrewgiuliani Let's plan another interactive debug session; we made a lot of progress last time, and I feel we are getting even closer.

@smiet
Contributor

smiet commented Jun 26, 2024

I think I solved it! The deep dive into f90wrap for the numpy 2.0 migration made me give it another go (plus new tests in my PR #418 made the bug not so sporadic anymore).

The issue: f90wrap uses getters and setters to access the wrapped Fortran arrays, and the logic is as follows:

    handle = get_specific_array_handle(stuff)  # pointer to the array
    if handle in self._arrays:  # dict mapping handles to arrays
        return self._arrays[handle]
    else:
        array = get_array_in_complicated_f90wrap_way(stuff)
        self._arrays[handle] = array
        return array

Now for the race condition: if SPEC has run before, and the array has been accessed before (put in the cache), and by happenstance the handle remains the same, Python will return the cached array and never touch the Fortran memory.

The handle is basically the pointer to the array, cast as an int. It is unlikely to be exactly the same, EXCEPT when a conveniently shaped hole has just been opened in memory by a deallocation.

This also explains why the fault is mostly seen in CI: the runners have much less memory, so a new allocation is more likely to land at the same address. But this is decided by the memory allocator and is not reproducible.

Since the error is a cache mismatch on the Python side, it should be solved by clearing the cache dictionary in Spec.__init__; this change is made in PR #418.
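The race above, and the cache-clearing fix, can be reproduced with a toy model of the handle-keyed cache. All names here are illustrative, not f90wrap's actual internals:

```python
# Toy model of an f90wrap-style module that caches arrays by handle.
class FakeF90Module:
    def __init__(self):
        self._arrays = {}  # handle (int address) -> cached array

    def get_array(self, handle, read_fortran):
        # f90wrap-style logic: reuse the cached array if the handle was
        # seen before; otherwise read the "Fortran" memory and cache it.
        if handle not in self._arrays:
            self._arrays[handle] = read_fortran()
        return self._arrays[handle]

mod = FakeF90Module()

# First run: handle 1000 points at a 2-interface array.
old = mod.get_array(1000, lambda: [0.0, 1.0])

# Second run: deallocation happened to free the same address, so the new
# 8-volume array gets the same handle, and the stale cache is returned.
stale = mod.get_array(1000, lambda: [0.0] * 8)
assert stale is old and len(stale) == 2  # the bug

# The fix: clear the cache dictionary when constructing a new Spec object.
mod._arrays.clear()
fresh = mod.get_array(1000, lambda: [0.0] * 8)
assert len(fresh) == 8  # the Fortran memory is re-read correctly
```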

@andrewgiuliani
Contributor

I'd suggest proposing the fix in a PR separate from #418, so that it gets merged more quickly and we can see precisely what was changed.

@smiet
Copy link
Contributor

smiet commented Jun 27, 2024

Done in #431
