The asyncio wait_for function is broken in Python 3.8-3.11 and can be the reason of Tribler freezes and slowdowns
#7570
The `wait_for` function in Python's `asyncio` library has a pretty serious bug that can affect Tribler execution. This Python asyncio bug may be responsible for some cases of Tribler freezes.

The problem is described in the following CPython issues:

Similar issues appear in other libraries, such as `aiohttp` and `async_timeout`:

The root of the problem lies in two facts:
1. `CancelledError` is very tricky to handle

Consider the following code block:
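A minimal sketch of such a block, made runnable so that both cancellation paths can be tried. The names `my_coroutine` and `some_task` follow the discussion here; `run_scenario`, `main` and the `seen` list are hypothetical helpers added only for illustration.

```python
import asyncio


async def my_coroutine(some_task, seen):
    try:
        return await some_task
    except asyncio.CancelledError:
        # Reached in both scenarios; nothing here says who was cancelled.
        seen.append("CancelledError")
        raise  # propagate, as asyncio expects


async def run_scenario(cancel_inner):
    seen = []
    some_task = asyncio.create_task(asyncio.sleep(10))
    outer = asyncio.create_task(my_coroutine(some_task, seen))
    await asyncio.sleep(0)  # let both tasks start; my_coroutine reaches its await
    if cancel_inner:
        some_task.cancel()  # the awaited task is cancelled
    else:
        outer.cancel()      # my_coroutine itself is cancelled from the outside
    await asyncio.wait([outer])  # wait for completion without re-raising
    return seen, outer.cancelled()


async def main():
    inner_cancelled = await run_scenario(cancel_inner=True)
    outer_cancelled = await run_scenario(cancel_inner=False)
    return inner_cancelled, outer_cancelled


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both scenarios should produce the same observation inside `my_coroutine`, which is the point: the `except` branch cannot tell them apart.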
If we get `CancelledError` at the line with `return await ...`, it can be caused by two completely different reasons:

- `some_task` was canceled. It could be canceled internally by `some_task` itself or externally by some different coroutine.
- `my_coroutine` was canceled from the outside, for example, due to some timeout.

It may be very hard to distinguish the two cases and understand why `CancelledError` was actually raised.

The logic of `asyncio` expects that `CancelledError` should be propagated. In most cases, it is not correct to swallow `CancelledError`, and yet it is very typical to have code that does something like:

When the task swallows some `CancelledError` exceptions, erroneously assuming they were raised by some inner sub-tasks, it is possible that awaiting the task never completes after its cancellation was requested. That can lead to very strange problems with the program hanging.

2. The construction `result = await some_task` is not atomic in Python.

The following pattern is widespread in programming:
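A minimal synchronous sketch of this pattern. `function_call()` and `resource.release()` are the names used in the discussion; the `Resource` class and the `use_resource` wrapper are hypothetical and exist only to make the example runnable.

```python
class Resource:
    """Hypothetical resource with an explicit release step."""

    def __init__(self):
        self.released = False

    def release(self):
        self.released = True


def function_call():
    """Hypothetical allocator: either returns a Resource or raises."""
    return Resource()


def use_resource():
    resource = function_call()  # either assigns a resource or raises
    try:
        return resource         # work with the resource here
    finally:
        resource.release()      # always runs once the assignment succeeded
```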
In this example, executing the line that calls `function_call()` can lead to one of the two following outcomes:

- `function_call()` completes successfully, and the resource is assigned to the variable.
- `function_call()` raises an exception, which can be partially handled inside `function_call()` by wrapping its inner code in a `try/except` block.

If we reach the following line, we can be sure that the resource was allocated and the reference to it was assigned to a variable so that we can clean it up. If the function call crashes with an exception, the resource is not allocated, and the function itself is responsible for cleaning up the partially-constructed resource if necessary.
Not so with async code. Let's look at the next code block:
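A runnable sketch of the async variant, written so that the problematic interleaving is forced on purpose. `function_call()`, `my_coroutine` and `resource.release()` are the names used in the discussion; `Resource`, `allocate`, the `allocated` list and `main` are hypothetical helpers, and the polling loop in `main` is just one assumed way to cancel at the critical moment.

```python
import asyncio


class Resource:
    """Hypothetical resource with an explicit release step."""

    def __init__(self):
        self.released = False

    def release(self):
        self.released = True


allocated = []  # every Resource created, so that leaks can be inspected


async def allocate():
    await asyncio.sleep(0)  # pretend to do some asynchronous work
    resource = Resource()
    allocated.append(resource)
    return resource


def function_call():
    # Hypothetical helper: returns a task that produces a Resource.
    return asyncio.create_task(allocate())


async def my_coroutine():
    resource = await function_call()  # CancelledError can surface here
    try:
        await asyncio.sleep(0)        # work with the resource
    finally:
        resource.release()


async def main():
    allocated.clear()
    outer = asyncio.create_task(my_coroutine())
    while not allocated:              # wait until the inner task has finished
        await asyncio.sleep(0)
    outer.cancel()                    # cancel before my_coroutine resumes
    await asyncio.wait([outer])       # wait for completion without re-raising
    return outer.cancelled(), allocated[0].released


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch the inner task finishes and creates the resource, yet `my_coroutine` is cancelled at the `await` before the assignment happens, so the `finally` block never gets a chance to release it.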
If `function_call()` returned a task, and we await this task, then a third outcome is also possible:

- `my_coroutine` was canceled from the outside (for example, due to some timeout). In that case, the `CancelledError` exception is raised despite the fact that the function call and the task execution complete without any exception.

In that case, it may be very hard to properly call `resource.release()` because the resource was not assigned to the variable.

`wait_for`
suffers from the combination of the two problems mentioned above.

Roughly speaking, `wait_for(task, timeout)` contained an analog of the following code block:

As a result, if `task` allocated a resource (such as an open socket), it was possible to end up in a situation where the task finished successfully and returned a result (an opened socket), but due to the race condition the result was ignored, and `CancelledError` was propagated instead.

That logic was in Python 3.7. In Python 3.8-3.11, the logic was "fixed": now, in the race condition situation, the `CancelledError` is swallowed, and the task result is returned instead. But this "fix" led to a much worse problem, as asyncio expects that `CancelledError` should not be swallowed. Due to this change, it is possible in Python 3.8-3.11 to have inexplicable freezes and weird asyncio behavior.

The problem was apparently fixed in Python 3.12 by rewriting the asyncio logic with the `with timeout` context manager, but previous Python versions (including 3.11) are still affected.

The problem is not limited to the `wait_for` function; it also appears in multiple other asyncio elements:

- `with timeout` context manager, actively used by `aiohttp`
But as Tribler itself does not use async semaphores and locks, we should be able to fix it by monkey-patching the `wait_for` function and the `timeout` context manager until the native fix is available.
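The two `wait_for` behaviours described earlier can be imitated with a small stand-in. This is only a sketch under stated assumptions, not the CPython source: `naive_wait_for` is a hypothetical function, timeout handling is omitted, and the `prefer_result` flag is an invented switch between "propagate `CancelledError` and drop the result" (the behaviour attributed above to Python 3.7) and "swallow `CancelledError` and return the result" (the behaviour attributed above to Python 3.8-3.11). A plain future stands in for the inner task to keep the interleaving deterministic.

```python
import asyncio


async def naive_wait_for(fut, prefer_result):
    """Hypothetical, simplified stand-in; not the real asyncio.wait_for."""
    loop = asyncio.get_running_loop()
    waiter = loop.create_future()

    def release_waiter(_):
        if not waiter.done():
            waiter.set_result(None)

    fut.add_done_callback(release_waiter)
    try:
        await waiter
    except asyncio.CancelledError:
        if prefer_result and fut.done():
            return fut.result()  # cancellation request is swallowed
        fut.remove_done_callback(release_waiter)
        fut.cancel()             # no effect if fut has already finished
        raise                    # a finished fut's result is dropped here
    return fut.result()


async def main(prefer_result):
    loop = asyncio.get_running_loop()
    inner = loop.create_future()     # stands in for a task opening a socket
    outer = asyncio.create_task(naive_wait_for(inner, prefer_result))
    await asyncio.sleep(0)           # let naive_wait_for reach `await waiter`
    inner.set_result("open socket")  # the inner work finishes successfully...
    outer.cancel()                   # ...and the caller is cancelled right after
    await asyncio.wait([outer])      # wait for completion without re-raising
    return outer.cancelled(), inner.result()


if __name__ == "__main__":
    print(asyncio.run(main(prefer_result=False)))
    print(asyncio.run(main(prefer_result=True)))
```

With `prefer_result=False` the outer task ends up cancelled while the finished result is never handed to anyone; with `prefer_result=True` the result is returned but the cancellation request silently disappears, which is the kind of swallowed `CancelledError` that asyncio does not expect.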