Problems with trio.Lock() across multiple tasks #3035

Closed · Zac-HD opened this issue Jul 16, 2024 · 16 comments · Fixed by #3081

Comments

@Zac-HD (Member) commented Jul 16, 2024

Some time ago - never mind how long precisely - having a little coordination between tasks and nothing particular to interest me in semaphores, I saw our RuntimeError("can't release a Lock you don't own").

I approve pretty strongly of keeping things local by default; it prevents all sorts of bugs and trio.Semaphore is right there if you need something without this limitation. But two further observations prompted this issue:

  1. trio.Semaphore() is considerably more powerful and flexible in general. By the principle of least power, it might be nice to have trio.Lock(*, allow_outside_release: bool = False) or something along those lines.
  2. If the owning task finishes (returns or raises) without releasing the lock, then you're stuck forever. The toy program below demonstrates this. Perhaps we should raise an error, or at least emit a warning, when a task which owns a lock finishes?
    import trio

    @trio.run
    async def main():
        lock = trio.Lock()
        async with trio.open_nursery() as nursery:
            nursery.start_soon(lock.acquire)
        # The lock is now held, and can only be released by the acquiring task.
        # However, that task is now finished, so we can never release the lock!
        lock.release()
@A5rocks (Contributor) commented Aug 2, 2024

I'm not sure how I feel about adding a keyword argument to trio.Lock. Does trio.Semaphore with a limit of 1 really differ in anything other than this? I like that trio.Lock is more easily debuggable and harder to shoot yourself in the foot with. (And if we make acquire error when the owner is dead, even better!)

Perhaps we should raise an error or at least emit a warning when a task which owns a lock finishes?

I think this could error when you run lock.acquire_nowait(); i.e. lock.acquire_nowait could check whether self._owner is still running. This doesn't handle the following, though:

  • task 1 calls lock.acquire
  • task 2 calls lock.acquire
  • task 1 exits

@Zac-HD (Member, Author) commented Aug 5, 2024

Let's set Semaphore aside then; I'm more concerned that you can easily get a deadlock from Lock, e.g.

    lock = trio.Lock()
    async with trio.open_nursery() as nursery:
        nursery.start_soon(lock.acquire)
        nursery.start_soon(lock.acquire)

IMO instead of hanging forever, we should raise BrokenResourceError in the task(s) waiting on a lock which can never be released, and perhaps also when exiting the task which holds that lock.

@A5rocks (Contributor) commented Aug 5, 2024

I agree; BrokenResourceError (or RuntimeError if we want) is better than a deadlock. That said, I think that if a lock gets acquired and never released, but nobody else wants it, it's OK. (Though it's definitely a sign of a bug? Should that warn?)

@Zac-HD (Member, Author) commented Aug 5, 2024

I think so long as we get a BrokenResourceError instead of hanging, and can identify the offending lock + task if that happens, the other decisions are less crucial - e.g. I'd prefer to warn-or-error on task exit but only if that's not going to cost us performance in the happy case.

@jakkdl (Member) commented Aug 27, 2024

Erroring on

    lock = trio.Lock()
    async with trio.open_nursery() as nursery:
        nursery.start_soon(lock.acquire)
    await lock.acquire()

Is easy & straightforward, adding a check in acquire_nowait for self._owner not in self._owner._parent_nursery._children.

But the other cases are definitely more complicated, since the other task(s) might start waiting before the owning task exits. I think the way to go is to:

  1. Add logic to Runner.task_exited which checks if the exiting task is holding any lock
    • which I think means we need to store held locks in the task.
  2. notify all other tasks waiting on those locks that they should error
    • Could make this a method in the public API on locks, error_other_holders
    • also gotta look at how you cancel and inject an error into those tasks

I don't think this requires a big performance hit, but we'd likely need to add some overhead whenever acquiring/releasing a lock, which could perhaps be partially traded off against more logic in task_exited.
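
A very rough sketch of that shape, with invented names (held_locks is not a real Task attribute, and error_other_holders is only the name floated in point 2):

    # Rough sketch of the idea above; "held_locks" is an invented attribute and
    # "error_other_holders" is not real trio API, just the name floated in point 2.
    class Task:
        def __init__(self):
            self.held_locks = set()  # locks this task currently owns

    class Runner:
        def task_exited(self, task, outcome):
            # New logic: if the exiting task still owns locks, wake their waiters
            # with an error instead of leaving them parked forever.
            for lock in list(task.held_locks):
                lock.error_other_holders(exiting_task=task)
            ...  # existing task_exited bookkeeping continues here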

@A5rocks (Contributor) commented Aug 27, 2024

I think better than a parent nursery check is just checking if the task is running. But yeah I'm not sure how to handle the other cases. Should there be a way to "poison" a parking lot?

@jakkdl (Member) commented Sep 2, 2024

Another way of implementing it (inspired by @mikenerone) is to have an Event that fires when a Task exits, and that event would wake up any other tasks waiting inside acquire. This initially sounded conceptually simple, but once I started thinking about how to actually implement it (acquire waiting on either that Event or its ParkingLot.park(), and the Event changing each time the lock owner changes), I'm not sure how clean it would be.
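
For illustration, the waiting side of that idea might look roughly like the sketch below, racing a normal acquire against an assumed owner_exited Event; wiring up that Event (and swapping it each time ownership changes) is exactly the part I'm unsure about:

    # Rough sketch of the Event-based idea.  `owner_exited` is an assumed
    # trio.Event that something else would have to set when the owning task
    # finishes.
    import trio

    async def acquire_or_break(lock: trio.Lock, owner_exited: trio.Event) -> None:
        async with trio.open_nursery() as nursery:

            async def watch_owner() -> None:
                await owner_exited.wait()
                raise trio.BrokenResourceError("lock owner exited without releasing")

            nursery.start_soon(watch_owner)
            await lock.acquire()
            # Acquired normally: stop watching for the owner to exit.
            nursery.cancel_scope.cancel()

(On current trio the BrokenResourceError would surface wrapped in an ExceptionGroup from the nursery.)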

@A5rocks (Contributor) commented Sep 5, 2024

Here's my latest idea: we can have a global dictionary mapping each task to a list of parking lots (a list rather than a set, because we need to allow multiples of the same element), with an interface like trio.lowlevel.add_parking_lot_breaker(task, parking_lot) and trio.lowlevel.remove_parking_lot_breaker(task, parking_lot).

Then Lock uses that to have its parking lot broken if the owning task gets killed without releasing the lock. We can document this for use by anything that expects a task to release something but might not.
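
A minimal sketch of what that registry could look like (module-level names and details here are illustrative, not a final API):

    # Minimal sketch of the proposed registry; names and details are illustrative.
    from collections import defaultdict

    # Maps each task to the parking lots it should "break" if it exits while still
    # registered.  A list rather than a set, so the same lot can appear more than once.
    GLOBAL_PARKING_LOT_BREAKERS = defaultdict(list)

    def add_parking_lot_breaker(task, lot):
        GLOBAL_PARKING_LOT_BREAKERS[task].append(lot)

    def remove_parking_lot_breaker(task, lot):
        GLOBAL_PARKING_LOT_BREAKERS[task].remove(lot)
        if not GLOBAL_PARKING_LOT_BREAKERS[task]:
            del GLOBAL_PARKING_LOT_BREAKERS[task]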

@jakkdl (Member) commented Sep 5, 2024

Then lock uses that to have its parking lot broken if the task gets killed without releasing the lock

The lock initiates it?

Trying to flesh this out:

  1. _LockImpl.acquire_nowait calls trio.lowlevel.add_parking_lot_breaker(self._owner, self._lot)
  2. _LockImpl.release calls trio.lowlevel.remove_parking_lot_breaker(self._owner, self._lot)
  3. Runner.task_exited calls trio.lowlevel.break_parking_lots(task)
    1. This calls ParkingLot.break() on all associated lots
    2. ParkingLot.break calls Runner.reschedule(..., next_send=Outcome.Error(BrokenResourceError(...))) for each parked entity (if that's how you inject errors into other Tasks).
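
As a rough sketch, steps 3.1 and 3.2 might look something like this inside ParkingLot (self._parked is a trio-internal detail, the exact error type and message are placeholders, and the method is spelled break_lot only because break isn't a valid method name):

    # Rough sketch of steps 3.1/3.2; not trio's actual implementation.
    import outcome
    import trio

    class ParkingLot:
        def __init__(self):
            self._parked = {}      # task -> None, in FIFO order
            self.broken_by = None  # task that broke this lot, if any

        def break_lot(self, task):
            self.broken_by = task
            # Wake every parked task with an error rather than leaving it waiting
            # on a lock that can never be released.
            for parked_task in list(self._parked):
                del self._parked[parked_task]
                trio.lowlevel.reschedule(
                    parked_task,
                    outcome.Error(
                        trio.BrokenResourceError(f"lot broken by exiting task {task!r}")
                    ),
                )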

@A5rocks (Contributor) commented Sep 5, 2024

Then lock uses that to have its parking lot broken if the task gets killed without releasing the lock

The lock initiates it?

Yeah, if I understand what you're asking.

Trying to flesh this out:

  1. _LockImpl.acquire_nowait calls trio.lowlevel.add_parking_lot_breaker(self._owner, self._lot)
  2. _LockImpl.release calls trio.lowlevel.remove_parking_lot_breaker(self._owner, self._lot)
  3. Runner.task_exited calls trio.lowlevel.break_parking_lots(task)

We don't have to expose trio.lowlevel.break_parking_lots; it's fine if the runner just uses the global dict.

i. This calls ParkingLot.break() on all associated lots
ii. ParkingLot.break calls Runner.reschedule(..., next_send=Outcome.Error(BrokenResourceError(...))) for each parked entity (if that's how you inject errors into other Tasks).

ParkingLot.break also:

  1. needs a different name because break is a keyword
  2. should make any future ParkingLot.park calls error.

The error should probably contain a reference to the task that caused things to break. Maybe a subclass of BrokenResourceError.
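
Something along these lines, as a rough sketch (the class name, attribute, and park() guard are invented for illustration):

    # Rough sketch; names here are invented, not a proposed final API.
    import trio

    class LotBrokenError(trio.BrokenResourceError):
        """Raised when a parking lot was broken by a task exiting uncleanly."""

        def __init__(self, breaker_task):
            super().__init__(f"parking lot broken by {breaker_task!r}")
            self.breaker_task = breaker_task

    # Inside ParkingLot, future park() calls could then fail fast:
    #
    #     async def park(self):
    #         if self.broken_by is not None:
    #             raise LotBrokenError(self.broken_by)
    #         ...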

@CoolCat467 (Member)

How would this handle future improvements where multiple guest runs can run at once?

@A5rocks (Contributor) commented Sep 5, 2024

Tasks shouldn't compare equal across runs, though if they do then ouch, that's something we should fix (given there's no way to start a task on two loops). So it's fine to use the task as a key.

@jakkdl (Member) commented Sep 6, 2024

we can have a global dictionary of task -> a list (not a set because need to allow multiple of the same element) of parking lots

why is it necessary to allow multiple of the same element? It's not required for lock, so I'm guessing you're thinking of some external scenario?

Opened #3081 with an initial implementation. It needs some polish, but the overall approach looks solid.

@A5rocks (Contributor) commented Sep 6, 2024

we can have a global dictionary of task -> a list (not a set because need to allow multiple of the same element) of parking lots

why is it necessary to allow multiple of the same element? It's not required for lock, so I'm guessing you're thinking of some external scenario?

Yeah I was thinking of a semaphore (cause it's still bad form to forget to release) and reentering within the same task.

@jakkdl (Member) commented Sep 9, 2024

we can have a global dictionary of task -> a list (not a set because need to allow multiple of the same element) of parking lots

why is it necessary to allow multiple of the same element? It's not required for lock, so I'm guessing you're thinking of some external scenario?

Yeah I was thinking of a semaphore (cause it's still bad form to forget to release) and reentering within the same task.

This made me look more into the other sync primitives, and CapacityLimiter might be a pretty good fit for an optional parameter that enables the same functionality (optional because of the _on_behalf_of methods). Semaphore would be a worse fit, but I suppose it could also have it. (At that point we'd probably move the logic from _LockImpl to AsyncContextManagerMixin, or an intermediate class between them.)

@jakkdl (Member) commented Oct 9, 2024

We can probably let this issue stay locked, and if somebody requests BrokenResourceError for CapacityLimiter or Semaphore, we'll deal with it then.
