Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run web_api tests with PYTHONASYNCIODEBUG=1 #1580

Merged
merged 1 commit into from
Feb 5, 2024

Conversation

sk1p
Copy link
Member

@sk1p sk1p commented Feb 4, 2024

Fix snooze task cancellation being called from the wrong thread (which was caught by this change).

Contributor Checklist:

Reviewer Checklist:

  • /azp run libertem.libertem-data passed
  • No import of GPL code from MIT code

Fix snooze task cancellation being called from the wrong thread.
@sk1p sk1p added the bug Something isn't working label Feb 4, 2024
@sk1p
Copy link
Member Author

sk1p commented Feb 4, 2024

/azp run libertem.libertem-data

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link

codecov bot commented Feb 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (01feaae) 70.50% compared to head (a8ffbda) 70.25%.
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1580      +/-   ##
==========================================
- Coverage   70.50%   70.25%   -0.25%     
==========================================
  Files         324      324              
  Lines       26873    26874       +1     
  Branches     3072     3072              
==========================================
- Hits        18946    18881      -65     
- Misses       7374     7430      +56     
- Partials      553      563      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@matbryan52
Copy link
Member

matbryan52 commented Feb 5, 2024

Reproduced what is probably the nbval test failure outside of tox:

image

Seems to be slow coroutines being logged as warning by the new envvar, which is bizarre as the variable is only set during the web_api env ??

NOTE: there seem to be two CI failures, first the async notebook fails from cell [2], then the job got stuck and later cancelled when running binning.ipynb.

@sk1p
Copy link
Member Author

sk1p commented Feb 5, 2024

Reproduced what is probably the nbval test failure outside of tox:

[...]

Seems to be slow coroutines being logged as warning by the new envvar, which is bizarre as the variable is only set during the web_api env ??

Thanks for tracking this down. That is weird indeed - can you check os.environ['PYTHONASYNCIODEBUG'] inside the notebook? It's either this environment variable, loop.set_debug(True) or the Python development mode that should trigger these log messages...

The log messages themselves are related to long running tasks, see also #1577 - are there any other messages in there, maybe related to thread safety or other "async sins"?

NOTE: there seem to be two CI failures, first the async notebook fails from cell [2], then the job got stuck and later cancelled when running binning.ipynb.

@sk1p sk1p added this to the 0.14 milestone Feb 5, 2024
@matbryan52
Copy link
Member

Think it might just have been a red herring ? I started going down the rabbit hole of envvars bleeding through tox configs but actually it passed fine on re-run.

Maybe the worker had a hiccup which caused a log message in the notebook cell ?

I would say LGTM and monitor during other CI runs...

@sk1p
Copy link
Member Author

sk1p commented Feb 5, 2024

Think it might just have been a red herring ? I started going down the rabbit hole of envvars bleeding through tox configs but actually it passed fine on re-run.

Maybe the worker had a hiccup which caused a log message in the notebook cell ?

I would say LGTM and monitor during other CI runs...

Thanks for looking into that. Yeah, I would concur. I'm quite certain that there are issues with async, especially if distributed is not up to date, or in the shutdown code, but this change should only affect the web API, not the bare Python API etc.

I'll go ahead and merge!

@sk1p sk1p merged commit 501fe67 into LiberTEM:master Feb 5, 2024
37 checks passed
@sk1p sk1p deleted the asyncio-sanity branch February 5, 2024 13:19
@sk1p
Copy link
Member Author

sk1p commented Feb 5, 2024

I think I understand what was happening - sadly, there's some instability with the azure pipeline agents on our own infra, which means they "run out of space" (I think on /dev/shm or something like that?) after running a bunch of jobs. I didn't have time to properly diagnose that specific issue, but it may have to do with zombie Python processes that hang around longer than they should, and the pipeline agent not implementing proper isolation between runs. Right now my "solution" is to restart the containers...

If that happens during a test run, we might get these strange failures or hangs. I apologize that this sent you down a rabbit hole...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants