
Stop processing ACME verifications when active node is stepped down #23278

Merged: 5 commits from stevendpclark/vault-20315-acme-active-node-change merged into main on Sep 26, 2023

Conversation

@stevendpclark (Contributor) commented Sep 25, 2023:

  • Do not load existing ACME challenges persisted within storage on non-active nodes. This was the main culprit of the issue: secondary nodes would load existing persisted challenges and try to resolve them, but the writes would fail, leading to the excessive logging.

    • We now handle this by not starting the ACME background thread on non-active nodes, and by checking within the scheduling loop and breaking out. Breaking out forces a re-read of the Closing channel, which should have been closed by the PKI plugin's Cleanup method.

  • If a node is stepped down from being the active node while it is actively processing a verification, we could get into an infinite loop due to an ErrReadOnly error when attempting to clean up a challenge entry.

  • Add a maximum number of retries for errors around attempting to decode or fetch challenge/authorization entries from disk. We allow double the number of "normal" max attempts for these types of errors compared to normal ACME retry attempts, to avoid collision issues. Note that these additional retry attempts are not persisted to disk and will reset on every node start.

  • Add a 1-second backoff to any disk-related error so we do not immediately spin on disk/IO errors for challenges (see the sketch after this list).

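As a rough illustration of the last two bullets, the in-memory retry cap and flat backoff could look something like the sketch below (the counter map and helper function are assumptions made for illustration; only the MaxRetryAttempts*2 comparison appears in the actual diff):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative value only; the real constant is defined in the PKI plugin.
const MaxRetryAttempts = 5

// Per-challenge attempt counts for storage/decode failures are kept in memory
// only, so they reset whenever the node restarts.
var storageRetries = map[string]int{}

// handleStorageError decides whether a challenge that failed to load or decode
// should be re-queued, and when it should next be attempted.
func handleStorageError(id string) (requeue bool, retryAfter time.Time) {
	storageRetries[id]++
	// Give storage/decode errors twice the normal budget so they don't collide
	// with the ACME-level retry counts stored on the challenge objects themselves.
	if storageRetries[id] > MaxRetryAttempts*2 {
		return false, time.Time{} // give up and drop it from the in-memory queue
	}
	// Flat 1-second backoff so we don't immediately spin on disk/IO errors.
	return true, time.Now().Add(1 * time.Second)
}

func main() {
	requeue, after := handleStorageError("challenge-123")
	fmt.Println(requeue, time.Until(after).Round(time.Second))
}
```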
@stevendpclark added the bug and secret/pki labels on Sep 25, 2023
@stevendpclark added this to the 1.14.5 milestone on Sep 25, 2023
@stevendpclark requested a review from a team as a code owner on Sep 25, 2023 17:26
@stevendpclark self-assigned this on Sep 25, 2023
@github-actions bot added the hashicorp-contributed-pr label on Sep 25, 2023
@github-actions bot

Build Results:
All builds succeeded! ✅

@github-actions bot

CI Results:
All Go tests succeeded! ✅

@hghaf099 (Contributor) left a comment:

Looks great! I asked some clarifying questions.

// data store, we would re-attempt them forever and never clean them up. We want to leverage
// the retry counts within the authorization/challenge objects first which is why we give ourselves
// twice the attempts here.
if retries > MaxRetryAttempts*2 {
Contributor:

I am not super familiar with the internals of the ACME challenge engine, so please bear with me on my questions. What is the correlation between MaxRetryAttempts and the retry count within the authorization/challenge object? Would this doubling here cause confusion in the future, in that it may create two different behaviours for a single thing?

Contributor Author:

I am not super familiar with the internals of acme challenge engine, so please bear with me with my questions.

No worries, I appreciate the feedback!

What is the correlation between MaxRetryAttempt and retry count within the authorization/challenge object?

There isn't really a correlation, but you are right that it may lead to confusion. Basically, with this new counter I'm just trying to track whether we are looping continuously on a challenge that we can't seem to decode/process outside of the normal ACME processing pipeline. It's kind of a catch-22: if we can't decode/read the event from storage, how are we supposed to know how many times we've attempted it...

I realized that if we don't do this and for some reason these entries start to accumulate, we may get stuck continuously processing them with nothing new incoming.

Would this doubling here cause confusion in the future as in it may cause two different behaviours for a single thing?

Maybe, but again this is here as a stop-gap; technically it "should" never fire.

One thing to keep in mind with ACME challenges/orders is that it isn't the end of the world to fail/lose them. Not great, mind you, but all the clients I've seen will/can just restart the process again with minimal effort.

Contributor:

This is good context! Thanks!

@@ -343,23 +372,23 @@ func (ace *ACMEChallengeEngine) VerifyChallenge(runnerSc *storageContext, id str

func (ace *ACMEChallengeEngine) _verifyChallenge(sc *storageContext, id string, config *acmeConfigEntry) (bool, time.Time, error) {
now := time.Now()
backoffTime := now.Add(5 * time.Second)
Contributor:

Should this backoff be exponentially growing instead of a flat 5-second delay? Maybe there is a glitch in the IO or in decoding the JSON, etc.; should it wait that long to retry? Could this cause ace.Validations to fill up without doing the actual work and cause issues down the road?

Contributor Author (@stevendpclark, Sep 25, 2023):

Should this backoff be exponentially growing instead of a flat 5 seconds delay?

The original design was to get through the ACME challenges as quickly as possible, to avoid long-standing entries within that validation queue in memory. I did pick the 5-second delay somewhat arbitrarily, but I wouldn't want it to grow excessively either.

maybe there is a glitch in the IO or decoding the json, etc. should it wait that long to retry?

It wouldn't be great, but then again IO glitches aren't really something I'd special-case for. Note that there is an exponential backoff for the "normal" validation failure workflow, just not here. I don't think 5 seconds is overly excessive for us to come back to the item either, but I'm happy to be proven wrong here as well.

Could this cause ace.Validations to be filled up without doing the actual work and cause issues down the road?

I don't think so, to get into this queue in the first place an ACME client needs to create a new order and then acknowledge they have solved the given challenge. All those things require I/O so I think they would fail as well.

But that is also partly why I have this new retry mechanism/timeout in place: in case there is something I'm not seeing, there is an upper bound on how many times we try to process them before we give up.

Contributor:

What do you think of using https://github.com/cenkalti/backoff, which lets you configure the backoff to start slowly and increase exponentially up to a certain cap? This package is used elsewhere in Vault as well.
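For reference, a capped exponential backoff with that package might look roughly like the sketch below (the /v4 import path and the parameter values are assumptions, not taken from this PR, which ultimately kept a flat delay instead):

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	// Start small and grow exponentially, but never wait more than 5 seconds
	// between attempts, and give up entirely after 30 seconds.
	bo := backoff.NewExponentialBackOff()
	bo.InitialInterval = 250 * time.Millisecond
	bo.MaxInterval = 5 * time.Second
	bo.MaxElapsedTime = 30 * time.Second

	attempts := 0
	op := func() error {
		attempts++
		// Stand-in for a storage read/decode that hits a transient I/O error.
		if attempts < 3 {
			return errors.New("transient i/o error")
		}
		return nil
	}

	if err := backoff.Retry(op, bo); err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("succeeded after", attempts, "attempts")
}
```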

Contributor Author:

I thought about this last night; I don't think we want exponential backoff here, as we'd like to clear that in-memory queue quickly, so even if we used it I'd cap its maximum at 5 seconds. That said, beyond the randomization for multiple items, which doesn't really matter in this use-case, I don't see the benefit.

After more thought I've lowered the backoffs to 1 second; the main purpose of bumping them was to avoid an immediate retry of the same thing.

Or is there some aspect I'm missing?


  // Remove our ChallengeValidation entry only.
  if deleteErr := sc.Storage.Delete(sc.Context, acmeValidationPrefix+id); deleteErr != nil {
-     return true, now.Add(-1 * time.Second), fmt.Errorf("error deleting challenge %v (error prior to cleanup, if any: %v): %w", id, err, deleteErr)
+     return true, backoff, fmt.Errorf("error deleting challenge %v (error prior to cleanup, "+"if any: %v): %w", id, err, deleteErr)
Contributor:

I am not sure we need to change this bit: if we are not successful we should get rid of that entry, and if we hit the readonly error we bail anyway.
Actually, it does not seem to me that we need to change anything in this function. The last return returns false for retry, so we don't retry at all. Again, please take these comments with a grain of salt, as I am not that familiar with the workflow!

Contributor Author:

I am not sure if we would need to change this bit? If we are not successful we should get rid of that entry, and if we hit the readonly error we bail anyway.

By "get rid of that entry", I assume you mean the in-memory version of it? We could do that, but then the item is still within storage and would, in theory, be reloaded on the next restart. So if the error we got back from the Delete wasn't an ErrReadOnly but, say, an I/O error, we wouldn't re-attempt the delete again until the next restart.

Actually, it does not seem to me that we would need to change anything in this function. The last return, returns false for retry, so we don't retry at all.

You are correct. I wanted the backoff behavior to apply to the I/O error use-case, and ideally this would force someone in the future to apply the backoff to other statements added later. In retrospect it would have been better to return the actual errors and keep the retry/backoff logic in a single location within VerifyChallenge, but that was a bigger refactoring I didn't want to get into for a backport.
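For illustration only, the consolidation described above might look roughly like the following: _verifyChallenge would return plain errors, and a single helper on the caller side would decide whether and when to retry (the helper name and policy here are assumed, not the PR's code):

```go
package pki

import (
	"errors"
	"time"

	"github.com/hashicorp/vault/sdk/logical"
)

// decideRetry sketches keeping the retry/backoff policy in one place instead
// of spreading it across each return statement in _verifyChallenge.
func decideRetry(err error) (retry bool, retryAfter time.Time) {
	switch {
	case err == nil:
		// Verification finished (or the entry was cleaned up); nothing to redo.
		return false, time.Time{}
	case errors.Is(err, logical.ErrReadOnly):
		// We are no longer the active node; stop retrying here entirely.
		return false, time.Time{}
	default:
		// Treat other failures as transient and come back after a flat backoff.
		return true, time.Now().Add(1 * time.Second)
	}
}
```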

Contributor:

I actually meant from storage. So the original code means: get rid of that storage entry as soon as possible, since retryAfter is set to the past! In that case, not changing anything here is fine for all intents and purposes.

Contributor Author:

Sorry, perhaps I'm not understanding, but if we don't change anything in this function, wouldn't we be looping very quickly on I/O errors from the deletion call?

Contributor:

You are right about the looping. So far, what we have been seeing is the "cannot write to readonly storage" error.

2023-09-22T18:23:34.4464531Z 2023-09-22T18:23:00.674Z [ERROR] Test_ACME.core-2.secrets.pki.pki_7114d24b: ACME validation failed for 5dcc6663-8b3c-5826-7f62-b9018987fc5a-dns-01: error deleting challenge 5dcc6663-8b3c-5826-7f62-b9018987fc5a-dns-01 (error prior to cleanup, if any: <nil>): cannot write to readonly storage

So, other than that error, do you think we should be backing off that much?
Hmm, maybe we should! What I like about the original behaviour is the priority it gives to cleaning up.

Maybe we should also consider the workload here. How much workload would you expect? Is there a use case in which there will be millions of challenge cleanups bursting all at once? If so, maybe we should reconsider the design. We have a similar issue with lease revocation right now, and that has caused several issues.

Contributor Author:

The in-memory queue contains all the in-flight ACME validation requests; if we were servicing, say, LetsEncrypt-scale workloads, this definitely would not fly. This isn't the entire ACME lifecycle, just the validation of a request, so it should be relatively quick, and we clean up the requests within the same processing.

Now, we did have that issue with possibly corrupted items that would never be removed, which should be resolved within this PR.

runnerSc.Backend.Logger().Debug("Starting verification of challenge", "id", id)

if retry, retryAfter, err := ace._verifyChallenge(sc, id, config); err != nil {
if errors.Is(err, logical.ErrReadOnly) {
Contributor:

One thing that I am wondering about: would the caller of VerifyChallenge need to check whether core is an active node or not? I don't see that anywhere there, and delaying the check until here seems unnecessary. The ACME workload might be large enough that it could take some time to get rid of every item in the list with the readonly error (at least this is what I understood from reading the caller code).

Contributor Author:

One thing that I am wondering about is that would the caller of VerifyChallenge need to check if core is an active node or not?

Yup, you are correct; I think this bug is still here if we have just enough read-only errors to keep us within the second for loop. Great catch. The backoffs might not be enough to pop us out of that loop so that we re-evaluate the ace.Closing status and shut down the thread completely.

ACME work load might be large enough that it could be taking some time to get rid of every item in the list with the readonly

It's a little weird, but yeah, we only have a maximum of 5 items in flight at any given time. The plugin backend's Cleanup method signals the Closing channel so we don't need to process all items within the queue, but I do think we should add a check for Closing and the current node state to break out of the second for loop.

Contributor Author:

Yup, you were correct. I've removed the special handling for ErrReadOnly here and instead added a check within the loop for whether we are running on the active node, as well as not starting up the ACME thread (and not loading any persisted challenges for processing) on non-active nodes.
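A minimal sketch of that shape, using placeholder types and an assumed isActiveNode check (only the Closing channel comes from the actual engine):

```go
package main

import (
	"fmt"
	"time"
)

// challengeEngine is a stand-in for the PKI plugin's ACME challenge engine;
// everything except the Closing channel is illustrative.
type challengeEngine struct {
	Closing      chan struct{}
	isActiveNode func() bool
}

func (ce *challengeEngine) run() {
	// Never start the scheduler on a non-active node, so persisted challenges
	// are not loaded where writes would fail with read-only storage errors.
	if !ce.isActiveNode() {
		return
	}
	for {
		select {
		case <-ce.Closing:
			return // the plugin's Cleanup signalled shutdown
		default:
		}
		// Re-check on every pass: if this node was stepped down mid-run, stop
		// scheduling new verifications instead of spinning on failed writes.
		if !ce.isActiveNode() {
			return
		}
		// ... schedule the next verification here ...
		time.Sleep(250 * time.Millisecond)
	}
}

func main() {
	ce := &challengeEngine{
		Closing:      make(chan struct{}),
		isActiveNode: func() bool { return true },
	}
	go ce.run()
	time.Sleep(time.Second)
	close(ce.Closing) // simulate the plugin's Cleanup method
	time.Sleep(500 * time.Millisecond)
	fmt.Println("engine stopped")
}
```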

@stevendpclark stevendpclark marked this pull request as draft September 25, 2023 21:18
@stevendpclark (Contributor Author) commented Sep 25, 2023:

Switching to draft: one of @hghaf099's comments made me realize there is another issue I had missed. Addressed with the latest commits.

 - This reverts some of the previous changes in the PR, mainly around
   trapping the ErrReadOnly
 - Do not schedule the ACME worker thread if we aren't an active node,
   which will prevent us from reading the queue from disk
 - Bail out early within the for loop prior to scheduling the
   verification if we are no longer an active node
 - Readjust the retries for deletion errors
@stevendpclark stevendpclark marked this pull request as ready for review September 26, 2023 16:58
@sgmiller (Collaborator) left a comment:

My only question is: if a secondary node becomes the primary, will the challenge engine start up?

@stevendpclark (Contributor Author):

My only question is if a secondary node becomes the primary, will the challenge engine start up?

Yes, I confirmed that when a secondary node becomes the active node, the plugin's Initialize function is called, so we will re-evaluate and start it up. I've also manually tested that it works when stepping down nodes in a main cluster, as well as nodes within a performance replica cluster.
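A rough sketch of that lifecycle, with the backend type and helper below used as illustrative stand-ins (only the Initialize signature comes from the Vault plugin SDK):

```go
package pki

import (
	"context"

	"github.com/hashicorp/vault/sdk/logical"
)

// backend and startAcmeEngineIfActive are placeholders, not the plugin's
// actual types or methods.
type backend struct {
	acmeEngineRunning bool
}

func (b *backend) startAcmeEngineIfActive(ctx context.Context) error {
	// In the real plugin this would check the node's active/standby state
	// before launching the ACME background thread.
	b.acmeEngineRunning = true
	return nil
}

// Initialize is invoked when the plugin is set up, including after a standby
// node is promoted to active, so the engine can be (re)started here.
func (b *backend) Initialize(ctx context.Context, req *logical.InitializationRequest) error {
	return b.startAcmeEngineIfActive(ctx)
}
```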

@sgmiller (Collaborator):

Backport labels?

LogConsumer: func(s string) {
ts.log.Info(s)
},
// DNS container logging was disabled to reduce content within CI logs.
Contributor:

Should this be removed?

Contributor Author:

Yes, we will try it this way for now; it was another request from Paul Banks to reduce logging in CI.

@stevendpclark stevendpclark merged commit dbfaa6f into main Sep 26, 2023
114 of 143 checks passed
@stevendpclark stevendpclark deleted the stevendpclark/vault-20315-acme-active-node-change branch September 26, 2023 17:59
Labels
bug, hashicorp-contributed-pr, secret/pki
3 participants