Make tests robust & independent #521

glatterf42 · 2024-03-12T12:14:59Z

Closes #512.

Analogous to iiasa/message_ix#784, this PR aims at making the ixmp tests less flaky. For now, this is largely done by giving Scenarios and Platforms test-specific names.

This PR reverts the mitigation from #510. If all goes surprisingly well, this won't be a problem. Either way, it should not be a problem in the end if the intended changes are complete.

How to review

Read the diff and note that the CI checks all pass, ideally without manual re-runs.

PR checklist

Continuous integration checks all ✅
Add or expand tests; coverage checks both ✅
~~[ ] Add, expand, or update documentation.~~ Just CI tests
~~[ ] Update release notes.~~ Just CI tests

glatterf42 · 2024-03-12T12:19:58Z

Some notes here:

I've grouped the tutorial tests so that they run on the same worker. This seems like a cheat considering we want to use xdist, but the *_scenario tutorials seem to rely on the * tutorials running first. However, this grouping might not solve the issue of these required tutorials failing in the first place.
I don't understand why test_export_timeseries_data is failing. It almost seems like occasionally, test_mp.export_timeseries_data() doesn't find the data it's supposed to export. Is this test relying on other tests filling the fixture test_mp with data?
We should still search the tests for FLAKY markers and see if we can resolve more of them.

glatterf42 · 2024-03-12T12:50:49Z

The fixes in test_integration seem to work locally, but I'm not sure they are optimal yet. They seem to hardcode how many rows in the scenario column of the expected dfs need to be changed. I'm not sure we want to do that, but maybe it's okay to expect these particular tests to yield the same results.

glatterf42 · 2024-03-18T08:43:07Z

On my system and on the windows-py3.12 platform here, there is this test failure in test_base.py:

FAILED ixmp/tests/backend/test_base.py::TestCachingBackend::test_del_ts - AssertionError: assert 0 == 1
E       AssertionError: assert 0 == 1
E        +  where 1 = len({(133743699349232, 'par', 'd'):            i         j  value unit\n0    seattle  new-york    2.5   km\n1    seattle   c...a    1.8   km\n3  san-diego  new-york    2.5   km\n4  san-diego   chicago    1.8   km\n5  san-diego    topeka    1.4   km})
E        +    where {(133743699349232, 'par', 'd'):            i         j  value unit\n0    seattle  new-york    2.5   km\n1    seattle   c...a    1.8   km\n3  san-diego  new-york    2.5   km\n4  san-diego   chicago    1.8   km\n5  san-diego    topeka    1.4   km} = <ixmp.backend.jdbc.JDBCBackend object at 0x79a3a1141520>._cache

ixmp/tests/backend/test_base.py:161: AssertionError

In other words, these lines don't seem to garbage collect the cached value correctly. This seems related to #463 and #504, and the issue is resolved by running s.__del__() for all python versions. @khaeru, if there's no specific reason to use del s, I'd just switch to s.__del__().

My system is running Python 3.12 with pandas 2.2.0 and JPype1 1.5.0.

codecov · 2024-03-18T09:15:47Z

Codecov Report

Attention: Patch coverage is 99.15254% with 1 line in your changes missing coverage. Please review.

Project coverage is 98.7%. Comparing base (81b692a) to head (fc51651).
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##            main    #521     +/-   ##
=======================================
- Coverage   98.8%   98.7%   -0.1%     
=======================================
  Files         44      44             
  Lines       4796    4809     +13     
=======================================
+ Hits        4739    4751     +12     
- Misses        57      58      +1

Files	Coverage Δ
ixmp/backend/jdbc.py	`97.4% <100.0%> (+<0.1%)`	⬆️
ixmp/core/platform.py	`98.9% <ø> (ø)`
ixmp/testing/data.py	`97.4% <100.0%> (+0.1%)`	⬆️
ixmp/tests/backend/test_base.py	`98.9% <100.0%> (ø)`
ixmp/tests/backend/test_jdbc.py	`100.0% <100.0%> (ø)`
ixmp/tests/core/test_scenario.py	`100.0% <100.0%> (ø)`
ixmp/tests/core/test_timeseries.py	`100.0% <100.0%> (ø)`
ixmp/tests/report/test_operator.py	`100.0% <100.0%> (ø)`
ixmp/tests/report/test_reporter.py	`100.0% <100.0%> (ø)`
ixmp/tests/test_access.py	`100.0% <100.0%> (ø)`
... and 5 more

khaeru · 2024-03-18T09:26:27Z

@khaeru, if there's no specific reason to use del s, I'd just switch to s.__del__().

We can answer this by first thinking about the intended behaviour of the subject/tested code:

At the end of a scope where a Scenario object (like s) is used, its reference count should decrease, maybe to zero.
This should result in Python's internal garbage collector choosing to call Scenario.__del__ (https://docs.python.org/3/reference/datamodel.html#object.__del__ —see the "Note:" in the grey box.)
That, in turn will free up memory in the Java/JDBC backend.
In other words, it's not intended that the user should call either del s or s.__del__() manually; rather, whenever they give up all references to the Scenario, the clean-up will happen.

The reason del s appears in the test function is to trigger Python's GC earlier, before the end of the test function, so that the following assertion can be made about the post clean-up state.

The comment points out that in certain cases, the intended behaviour doesn't work (we haven't diagnosed why). So s.__del__() forces the GC to occur and the test to pass. But because it is forced, the test is no longer checking that it will happen autonomously. We might expect that garbage collection does not work at all for a user with that combination of JPype and Python.

So to extend: if s.__del__() were called in the test in every case, then we no longer will test that Python's built-in reference-counting/GC can or will trigger Scenario.__del__—including with the up-to-date Python, JPype, and pandas you mention—and can't claim that this feature is working.

glatterf42 · 2024-03-18T10:01:55Z

Thanks for the explanation, so I'll revert/drop that commit and try to figure out how del s can call Scenario.__del__ again.

One other note: I've searched for other flaky markers and think I resolved another one, but there's one for test_jvm_warn where I don't know what to do. The original flaky error is described in #489 and it just seems that this test sometimes occurs simultaneously as another test reading a big file, which results in a warning. Maybe we can suppress this specific warning, but I'm not sure what else we could do.

glatterf42 · 2024-03-18T11:28:55Z

I'm not sure I completely agree with your analysis. The flaky function is called test_del_ts and its docstring is """Test CachingBackend.del_ts().""". However, we don't explicitly call CachingBackend.del_ts() in the test at the moment.
del_ts is a guaranteed part of Scenario.__del__, but as the Note box in the docs you linked above points out:

del x doesn’t directly call x.__del__() — the former decrements the reference count for x by one, and the latter is only called when x’s reference count reaches zero.

In particular, the docs also say:

CPython implementation detail: It is possible for a reference cycle to prevent the reference count of an object from going to zero. In this case, the cycle will be later detected and deleted by the cyclic garbage collector. A common cause of reference cycles is when an exception has been caught in a local variable. The frame’s locals then reference the exception, which references its own traceback, which references the locals of all frames caught in the traceback.

Due to the precarious circumstances under which del() methods are invoked, exceptions that occur during their execution are ignored, and a warning is printed to sys.stderr instead.

So the current flaky behaviour we see only reflects exactly that: del s doesn't always call Scenario.__del__. With that, I'm wondering: should it? And do we want to test it here?
Especially to the latter, the answer seems to be "No" to me. With this test, according to its name and docstring, we only want to ensure that CachingBackend.del_ts works as intended, which it does (locally at least, the flakiness is resolved by always calling backend.del_ts(s) explicitly).

I realize that this theoretically leaves part of our code untested, the part where we call Scenario.__del__. In practice, however, I expect this finalizer to be called multiple times, e.g. as part of this very function (just for a limited number of python versions).

In addition, the note that "all exceptions during the execution of del() methods are ignored" makes me wonder if we even need the try-clause in our Scenario.__del__. Apparently, these exceptions would only result in warnings anyway. Or was it a conscious choice to suppress these warnings?

glatterf42 · 2024-03-18T12:54:57Z

I'm not sure why build couldn't be installed for macos-py3.8, but I don't think that's on us. The error in windows-py3.12 is indeed still flaky and appears to be almost exactly the same as the one above: the test calls del s for a number of Scenarios, but maybe calling del_ts explicitly would be more accurate.

glatterf42 · 2024-03-22T10:46:35Z

The tests seem to be doing okay this time around, but I'm not sure we want to keep the solution. In essence, the problem was that the JDBCBackend has a .jindex attribute, which is a WeakKeyDictionary. This means that keys will be discarded automatically once no strong reference to the keys exists at runtime. But when trying to simply call backend.del_ts() in the test_jdbc::test_del_ts, this didn't always work. My guess is that just as before, a strong reference to some key could be stored as part of a reference cycle. That way, del s will not lead to s.__del__() being called (which would call backend.del_ts()), but calling backend.del_ts() is still not enough since we rely on all strong references vanishing to adapt backend.jindex and we check that len(backend.jindex) changes in our tests.
So I've now adapted backend.del_ts to remove the ts from jindex after garbage collection to make sure it vanishes, even if a strong reference still exists in some reference cycle. However, this would of course remove ts from jindex irrespective of the reason a strong reference still exists.

khaeru · 2024-03-26T11:56:58Z

However, this would of course remove ts from jindex irrespective of the reason a strong reference still exists.

Hm—in theory, backend.del_ts is not going to get called unless (a) there are no strong references anywhere, so TimeSeries.__del__ is triggered, or (b) the user calls it explicitly.

(But since .jindex is WeakKeyDictionary, the presence (or absence) of ts as a key in that should not be making any difference, right? Or maybe I have not fully thought through the implications of keys vs. values vs. both being weak in such collections. If the key has some dangling strong ref, then that prevents its removal from the dict and thus the value (the JPype/Java/ixmp_source object) also continues to sit there.)

In any case I think this is a safe change to make. Unless the user does (b), this is not going to throw away objects they are still using. I would put the added line in the reverse order, though:

self.jindex.pop(ts, None)
self.gc()

khaeru · 2024-03-28T16:01:18Z

Working on #524, I noticed that the Jupyter notebook tests of macos-latest-py3.12 failed several times before eventually passing. They also failed on the first run on main after merging the PR.

* Split up test_close to receive only clean stdout

* Adapt expected values to new scenario names

glatterf42 added bug ci Continuous integration labels Mar 12, 2024

glatterf42 added this to the 3.9 milestone Mar 12, 2024

glatterf42 self-assigned this Mar 12, 2024

glatterf42 force-pushed the enh/ci-test-stability branch from e25e6a9 to fb9ebba Compare March 18, 2024 11:30

glatterf42 force-pushed the enh/ci-test-stability branch from 2b527b4 to 494f235 Compare May 7, 2024 06:28

glatterf42 force-pushed the enh/ci-test-stability branch from 494f235 to 4a1b312 Compare May 21, 2024 07:07

khaeru modified the milestones: 3.9, 3.10 Jun 4, 2024

glatterf42 added 10 commits July 8, 2024 09:49

Use request.node.name for make_dantzig scenarios

bdf5646

Fix typo in docstring

0ab4149

Group tutorials that depend on each other

80af3fe

Give every platform a test-specific name

473a2ea

Use test-specific name for scenarios

1581944

Use test-specific name for platforms

6b9a937

* Split up test_close to receive only clean stdout

Revert temporary fix from iiasa#510

21a149e

Copy model["dantzig"] properly

0a257de

* Adapt expected values to new scenario names

Avoid searching for to-be-created test-specific platforms

2037338

Search for correct new scenario name

a3c07c1

glatterf42 added 8 commits July 8, 2024 09:49

Increase default cell runtime on GHA

64c3b5c

Use more test-specific platforms

8d49776

Make call to to-be-tested function explicit

e7f0688

Rename expected data files for individual tests

cd1f451

Call Scenario deletion function explicitly

8acd713

Call TimeSeries deletion function explicitly

a20815f

Add type hint for correct specific backend type

4816ea6

Remove timeseries from jindex explicitly

fc51651

glatterf42 force-pushed the enh/ci-test-stability branch from 4a1b312 to fc51651 Compare July 8, 2024 07:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make tests robust & independent #521

Make tests robust & independent #521

glatterf42 commented Mar 12, 2024

glatterf42 commented Mar 12, 2024 •

edited

Loading

glatterf42 commented Mar 12, 2024

glatterf42 commented Mar 18, 2024 •

edited

Loading

codecov bot commented Mar 18, 2024 •

edited

Loading

khaeru commented Mar 18, 2024

glatterf42 commented Mar 18, 2024

glatterf42 commented Mar 18, 2024 •

edited

Loading

glatterf42 commented Mar 18, 2024

glatterf42 commented Mar 22, 2024

khaeru commented Mar 26, 2024

khaeru commented Mar 28, 2024

Make tests robust & independent #521

Are you sure you want to change the base?

Make tests robust & independent #521

Conversation

glatterf42 commented Mar 12, 2024

How to review

PR checklist

glatterf42 commented Mar 12, 2024 • edited Loading

glatterf42 commented Mar 12, 2024

glatterf42 commented Mar 18, 2024 • edited Loading

codecov bot commented Mar 18, 2024 • edited Loading

Codecov Report

khaeru commented Mar 18, 2024

glatterf42 commented Mar 18, 2024

glatterf42 commented Mar 18, 2024 • edited Loading

glatterf42 commented Mar 18, 2024

glatterf42 commented Mar 22, 2024

khaeru commented Mar 26, 2024

khaeru commented Mar 28, 2024

glatterf42 commented Mar 12, 2024 •

edited

Loading

glatterf42 commented Mar 18, 2024 •

edited

Loading

codecov bot commented Mar 18, 2024 •

edited

Loading

glatterf42 commented Mar 18, 2024 •

edited

Loading