Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The asyncio wait_for function is broken in Python 3.8-3.11 and can be the reason of Tribler freezes and slowdowns #7570

Closed
kozlovsky opened this issue Aug 14, 2023 · 0 comments · Fixed by #7572

Comments

@kozlovsky
Copy link
Collaborator

The wait_for function in Python's asyncio library has a pretty serious bug that can affect Tribler execution. This Python asyncio bug may be responsible for some cases of Tribler freezes.

The problem is described in the following CPython issues:

Similar issues appear in other libraries, such as aiohttp and async_timeout:

The root of the problem lies in two facts:

1. CancelledError is very tricky to handle

Consider the following code block:

async def my_coroutine(some_task)
    ...
    return await some_task

If we get CancelledError at the line with return await ..., it can be caused by two completely different reasons:

  1. some_task was canceled. It could be canceled internally by the some_task itself or externally by some different coroutine.
  2. my_coroutine was canceled from the outside, for example, due to some timeout.

It may be very hard to distinguish two cases and understand the reason why CancelledError was actually raised.

The logic of asyncio expects that CancelledError should be propagated. In most cases, it is not correct to swallow CancelledError, and it is very typical to have a code that does something like:

task.cancel()
await task  # wait until task cancellation and finalization are done

When the task swallows some CancelledError exceptions, erroneously assuming they we raised by some inner sub-tasks, it is possible that we will never be able to await the task after the cancellation of the task was requested. That can lead to very strange problems with the program hanging.

2. The construction result = await some_task is not atomic in Python.

The following pattern is widespread in programming:

async def my_coroutine()
    resource = function_call()
    try:
        do_somethig_with(resource)
    finally:
        resource.release()

In this example, the execution of its first line can lead to one of the two following outcomes:

  • The function_call() completes successfully, and the resource is assigned to the variable
  • The function_call() raised an exception that can be partially handled inside the function_call() by wrapping its inner code into a try/except block.

If we reach the following line, we can be sure that the resource was allocated and the reference to it was assigned to a variable so that we can clean it. If the function call crashes with an exception, the resource is not allocated, and the function itself is responsible for cleaning the partially-constructed resource if necessary.

Not so with async code. Let's look at the next code block:

async def my_coroutine()
    resource = await function_call()
    try:
        do_somethig_with(resource)
    finally:
        resource.release()

If the function_call() retuned a task, and we await this task, then a third outcome is also possible:

  • The task finished successfully and returned a resource, but my_coroutine was canceled from the outside (for example, due to some timeout). In that case, the CancelledError exception is raised despite the fact that the function call and the task execution completes without any exception.

In that case, it may be very hard to properly call resource.release() because the resource was not assigned to the variable.

wait_for is suffered from the combination of two mentioned problems.

Roughly speaking, wait_for(task, timeout) contained an analog of the following code block:

async def wait_for(task, timeout):
    ...
    return await task()

As a result, if task allocated a resource (such as an open socket), it was possible to be in a situation where the task was finished successfully and returned a result (an opened socket), but due to the raise condition the result was ignored, and CancelledError was propagated instead.

That logic was in Python 3.7. In Python 3.8-3.11, the logic was "fixed": now, in the raise condition situation, the CancelledError was swallowed, and the task result was returned instead. But this "fix" lead to a much worse problem, as asyncio expects that CancelledError should not be swallowed. Due to this change, it is possible in Python 3.8-3.11 to have inexplicable freezes and weird asyncio behavior.

The problem was apparently fixed in Python 3.12 by rewriting the asyncio logic with the with timeout context manager, but previous Python versions (including 3.11) are still affected.

The problem is not limited to the wait_for function; it also appears in multiple different asyncio elements as well:

  • in async Semaphores
  • in async Locks
  • in async Queues
  • in with timeout context manager, actively used by aiohttp

But as Tribler itself does not use async semaphores and locks, we should be able to fix it by monkey-patching the wait_for function and the timeout context manager until the native fix is available.

drew2a added a commit to drew2a/tribler that referenced this issue Dec 11, 2023
…` exception

This commit fixes a bug in the asyncio `wait_for` function where it was swallowing the `CancelledError` exception. The issue was reported in Tribler#7570. The fix involves patching the `wait_for` function to ensure that the exception is not swallowed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant