-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bug]: Potential Deadlock in HodlInvoice logic #8803
Comments
Goroutine Dump: |
Is this present in 0.18? |
This noderunner still runs 17.3 and will update soon to 18.0, so we don't know yet but he will provide the necessary information. |
Analysing further I think we have a race condition which we need to guard for: In the Lines 4081 to 4083 in a832371
before we should probably make sure that the htlcmanger is shutdown correctly. ec55831#diff-9926ec0b86d09ad93deb44807ccee9a52ed42535007f96ae497478c07b76843aL1449-L1466 so the above code base change with Release 18, so this bug might not be part of the code base anymore. |
Though I don't see that we make sure our I think we need to make sure the Maybe we need to signal the quit channel before before stopping the hodlqueue (and wait until the Lines 1443 to 1445 in a832371
Lines 572 to 574 in a832371
otherwise there might be the race condition where we try to process an HTLC and all of a sudden the peer disconnects and we stop the hodlqueue never being able to read the what do you think @ProofOfKeags ? We potentially might reintroduce the |
Oof. Yeah if there is any necessary cleanup that the channelLink has to perform then we should be doing that in the post loop section of the htlcManager thread. I would need to take a closer look at how the hodlQueue works. I'm somewhat concerned that we do direct channel exposure via From what I can tell though, I can see a bit of an architectural crack wherein we leak out the hodlQueue go-channel all the way into the invoice registry. The problem with that is that this is essentially a dangling reference that we cannot detect has been destroyed in that context. I think the "correct" fix of this will have to involve either preventing that go-channel from leaking directly, or also attaching with it some mechanism for being able to tell when the resource dies (a copy of the link's quit channel). |
Problem confirmed with LND 18.0, working on a fix. |
Thank you for the ideas, I will try the first design with just adding the link's quit channel to the subscriber, but maybe we need to really isolate the concurrent queue with |
Big Concept ACK here. I think the point here is that we want to make sure that reading from a pipe whose writers have exited or writing to a pipe whose readers have exited should (ideally) not deadlock but cause some sort of other "exception" on the side that's still live. If you can solve this at the level of the concurrent queue itself that would be a huge win. |
Also see this issue, back when I fixed a concurrency issue here, dug deeper, and saw many more: #8023 |
In the process of investigating some of the changes I've been wanting to make it occurs to me that the existence of the hodlQueue within the link itself is another symptom of a fundamental architectural crack I noticed a few months ago. I think patching over it to hunt down this deadlock is fine but I want to take a moment to further educate on the issue. The following is my opinion but it will be presented as fact so as to make as direct of a case as possible. The main problem that sits at the bottom of many of the issues we are encountering is an improper division of responsibility between the link and the switch. Right now we do exit hop processing in the link itself as opposed to the switch. This is a mistake for a number of reasons. First and foremost it means we have to duplicate traffic controls between the links and the switch. Further it requires invoice registry references in every channel link which will cause synchronization issues to proliferate. Instead what we should be doing is as soon as we have received a revocation from our counterparty we take the batch of remote updates, package them up and hand them to the switch. From there the switch can make the determination as to what to do with it. If we are the recipient and it is not a hodl invoice we would immediately drop an htlc packet back into the link and we are on our merry way. If it is a hodl invoice, the switch can await the confirming signal from the invoice registry before doing so. There are many more problems that this improper division of responsibility creates but I will save those for a different forum. Changing this architectural issue is a considerable effort and not one we should undertake to solve immediate production issues so in my view, we should fix it in whatever way can work. I include this detail because I will be making an ongoing case that we need to change the responsibility division between the link and the switch and I keep finding odd side-effects of the current architecture and want to document them. |
Description of the old design: As described in the above chart a short description on the old hodlinvoice notification design. The drawbacks are mentioned here: #8023 with potential fixes. One of the problems is, that a channel link might already be done and hence the hodl queue will never be read although another htlc got registered in the meanwhile to this channel link. This race condition happens from time to time so in the next post a new design will be presented. |
The new design is based on the comment mentioned in #8023: Instead of Mutexes a new interface is created: "hodlHTLCNotifier" which spinns up an event loop and does the following four tasks without the need of any Mutexes:
|
FWIW these diagrams are nearly unreadable in GH dark mode. If you could make the arrows between them not so dark, that would help. Or alternatively make the background colored instead of alpha 0 |
Version: LND 17.3
Received a goroutine dump where the invoice system was in the deadlock. I could not reproduce locally and could also not come up with a reasonable explanation why the
hodlqueue.ChanIn()
is not avaialable.Let's first look at the goroutine dump:
This is the call which initiates the deadlock, the notifyHodlSubscribers cannot write into the
hodlqueue
channel of the specific channel link, locking thehodlSubscriptionsMux
forever.Moreover the
cancelInvoiceImpl
locks theInvoiceRegistry
mutex as well so both mutexs are locked.lnd/invoices/invoiceregistry.go
Lines 1713 to 1724 in 13aa7f9
So somehow the
hodlqueue
channel is never read, which only happens when we somehow stopped theconcurrent queue
otherwise the incoming channel is always read in a loop:lnd/queue/queue.go
Lines 62 to 97 in 613bfc0
So I am waiting for additional goroutine dumps from the noderunner to verify this behaviour and investigate further.
The text was updated successfully, but these errors were encountered: