
Stop processing ACME verifications when active node is stepped down #23278

Merged: 5 commits from stevendpclark/vault-20315-acme-active-node-change merged into main on Sep 26, 2023

Conversation

@stevendpclark (Contributor) commented Sep 25, 2023:

  • Do not load existing ACME challenges persisted within storage on non-active nodes. This was the main culprit of the issue: secondary nodes would load existing persisted challenges and try to resolve them, but the writes would fail, leading to the excessive logging.

    • We now handle this by not starting the ACME background thread on non-active nodes, and by checking within the scheduling loop and breaking out. Breaking out forces a re-read of the Closing channel, which should have been closed by the PKI plugin's Cleanup method.

  • If a node is stepped down from being the active node while it is actively processing a verification, we could get into an infinite loop due to an ErrReadOnly error when attempting to clean up a challenge entry.

  • Add a maximum number of retries for errors around attempting to decode or fetch challenge/authorization entries from disk. We allow double the number of "normal" max attempts for these types of errors compared to normal ACME retry attempts, to avoid collision issues. Note that these additional retry attempts are not persisted to disk and will reset on every node start.

  • Add a 1-second backoff to any disk-related error so we do not immediately spin on disk/IO errors for challenges (see the sketch after this list).

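As a rough illustration of the last two bullets, the in-memory retry cap and flat backoff could look something like the sketch below (the counter map and helper function are assumptions made for illustration; only the MaxRetryAttempts*2 comparison appears in the actual diff):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative value only; the real constant is defined in the PKI plugin.
const MaxRetryAttempts = 5

// Per-challenge attempt counts for storage/decode failures are kept in memory
// only, so they reset whenever the node restarts.
var storageRetries = map[string]int{}

// handleStorageError decides whether a challenge that failed to load or decode
// should be re-queued, and when it should next be attempted.
func handleStorageError(id string) (requeue bool, retryAfter time.Time) {
	storageRetries[id]++
	// Give storage/decode errors twice the normal budget so they don't collide
	// with the ACME-level retry counts stored on the challenge objects themselves.
	if storageRetries[id] > MaxRetryAttempts*2 {
		return false, time.Time{} // give up and drop it from the in-memory queue
	}
	// Flat 1-second backoff so we don't immediately spin on disk/IO errors.
	return true, time.Now().Add(1 * time.Second)
}

func main() {
	requeue, after := handleStorageError("challenge-123")
	fmt.Println(requeue, time.Until(after).Round(time.Second))
}
```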
@stevendpclark added the bug and secret/pki labels on Sep 25, 2023
@stevendpclark added this to the 1.14.5 milestone on Sep 25, 2023
@stevendpclark requested a review from a team as a code owner on Sep 25, 2023 17:26
@stevendpclark self-assigned this on Sep 25, 2023
@github-actions bot added the hashicorp-contributed-pr label on Sep 25, 2023
@github-actions bot

Build Results:
All builds succeeded! ✅

@github-actions bot

CI Results:
All Go tests succeeded! ✅

@hghaf099 (Contributor) left a comment:

Looks great! I asked some clarifying questions.

// data store, we would re-attempt them forever and never clean them up. We want to leverage
// the retry counts within the authorization/challenge objects first which is why we give ourselves
// twice the attempts here.
if retries > MaxRetryAttempts*2 {
Contributor:

I am not super familiar with the internals of the ACME challenge engine, so please bear with me on my questions. What is the correlation between MaxRetryAttempts and the retry count within the authorization/challenge object? Would this doubling here cause confusion in the future, in that it may create two different behaviours for a single thing?

Contributor Author:

I am not super familiar with the internals of acme challenge engine, so please bear with me with my questions.

No worries, I appreciate the feedback!

What is the correlation between MaxRetryAttempt and retry count within the authorization/challenge object?

There isn't really a correlation, but you are right that it may lead to confusion. Basically, with this new counter I'm just trying to track whether we are looping continuously on a challenge that we can't seem to decode/process outside of the normal ACME processing pipeline. It's kind of a catch-22: if we can't decode/read the event from storage, how are we supposed to know how many times we've attempted it...

I realized that if we don't do this and for some reason these entries start to accumulate, we may get stuck continuously processing them with nothing new incoming.

Would this doubling here cause confusion in the future as in it may cause two different behaviours for a single thing?

Maybe, but again this is here as a stop-gap; technically it "should" never fire.

One thing to keep in mind with ACME challenges/orders is that it isn't the end of the world to fail/lose them. Not great, mind you, but all the clients I've seen will/can just restart the process again with minimal effort.

Contributor:

This is good context! Thanks!

@@ -343,23 +372,23 @@ func (ace *ACMEChallengeEngine) VerifyChallenge(runnerSc *storageContext, id str

func (ace *ACMEChallengeEngine) _verifyChallenge(sc *storageContext, id string, config *acmeConfigEntry) (bool, time.Time, error) {
now := time.Now()
backoffTime := now.Add(5 * time.Second)
Contributor:

Should this backoff be exponentially growing instead of a flat 5-second delay? Maybe there is a glitch in the IO or in decoding the JSON, etc.; should it wait that long to retry? Could this cause ace.Validations to fill up without doing the actual work and cause issues down the road?

Contributor Author (@stevendpclark, Sep 25, 2023):

Should this backoff be exponentially growing instead of a flat 5 seconds delay?

The original design was to get through the ACME challenges as quickly as possible, to avoid long-standing entries within that validation queue in memory. I did pick the 5-second delay somewhat arbitrarily, but I wouldn't want it to grow excessively either.

maybe there is a glitch in the IO or decoding the json, etc. should it wait that long to retry?

It wouldn't be great, but then again IO glitches aren't really something I'd special-case for. Note that there is an exponential backoff for the "normal" validation failure workflow, just not here. I don't think 5 seconds is overly excessive for us to come back to the item either, but I'm happy to be proven wrong here as well.

Could this cause ace.Validations to be filled up without doing the actual work and cause issues down the road?

I don't think so, to get into this queue in the first place an ACME client needs to create a new order and then acknowledge they have solved the given challenge. All those things require I/O so I think they would fail as well.

But that is also partly why I have this new retry mechanism/timeout in place: in case there is something I'm not seeing, there is an upper bound on how many times we try to process them before we give up.

Contributor:

What do you think of using https://github.com/cenkalti/backoff, which lets you configure the backoff to start slowly and increase exponentially up to a certain cap? This package is used elsewhere in Vault as well.
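For reference, a capped exponential backoff with that package might look roughly like the sketch below (the /v4 import path and the parameter values are assumptions, not taken from this PR, which ultimately kept a flat delay instead):

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	// Start small and grow exponentially, but never wait more than 5 seconds
	// between attempts, and give up entirely after 30 seconds.
	bo := backoff.NewExponentialBackOff()
	bo.InitialInterval = 250 * time.Millisecond
	bo.MaxInterval = 5 * time.Second
	bo.MaxElapsedTime = 30 * time.Second

	attempts := 0
	op := func() error {
		attempts++
		// Stand-in for a storage read/decode that hits a transient I/O error.
		if attempts < 3 {
			return errors.New("transient i/o error")
		}
		return nil
	}

	if err := backoff.Retry(op, bo); err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("succeeded after", attempts, "attempts")
}
```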

Contributor Author:

I thought about this last night; I don't think we want exponential backoff here, as we'd like to clear that in-memory queue quickly, so even if we used it I'd cap its maximum at 5 seconds. That said, beyond the randomization for multiple items, which doesn't really matter in this use-case, I don't see the benefit.

After more thought I've lowered the backoffs to 1 second; the main purpose of bumping them was to avoid an immediate retry of the same thing.

Or is there some aspect I'm missing?


  // Remove our ChallengeValidation entry only.
  if deleteErr := sc.Storage.Delete(sc.Context, acmeValidationPrefix+id); deleteErr != nil {
-     return true, now.Add(-1 * time.Second), fmt.Errorf("error deleting challenge %v (error prior to cleanup, if any: %v): %w", id, err, deleteErr)
+     return true, backoff, fmt.Errorf("error deleting challenge %v (error prior to cleanup, "+"if any: %v): %w", id, err, deleteErr)
Contributor:

I am not sure we need to change this bit: if we are not successful we should get rid of that entry, and if we hit the readonly error we bail anyway.
Actually, it does not seem to me that we need to change anything in this function. The last return returns false for retry, so we don't retry at all. Again, please take these comments with a grain of salt, as I am not that familiar with the workflow!

Contributor Author:

I am not sure if we would need to change this bit? If we are not successful we should get rid of that entry, and if we hit the readonly error we bail anyway.

By "get rid of that entry", I assume you mean the in-memory version of it? We could do that, but then the item is still within storage and would, in theory, be reloaded on the next restart. So if the error we got back from the Delete wasn't an ErrReadOnly but, say, an I/O error, we wouldn't re-attempt the delete again until the next restart.

Actually, it does not seem to me that we would need to change anything in this function. The last return, returns false for retry, so we don't retry at all.

You are correct. I wanted the backoff behavior to apply to the I/O error use-case, and ideally this would force someone in the future to apply the backoff to other statements added later. In retrospect it would have been better to return the actual errors and keep the retry/backoff logic in a single location within VerifyChallenge, but that was a bigger refactoring I didn't want to get into for a backport.
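For illustration only, the consolidation described above might look roughly like the following: _verifyChallenge would return plain errors, and a single helper on the caller side would decide whether and when to retry (the helper name and policy here are assumed, not the PR's code):

```go
package pki

import (
	"errors"
	"time"

	"github.com/hashicorp/vault/sdk/logical"
)

// decideRetry sketches keeping the retry/backoff policy in one place instead
// of spreading it across each return statement in _verifyChallenge.
func decideRetry(err error) (retry bool, retryAfter time.Time) {
	switch {
	case err == nil:
		// Verification finished (or the entry was cleaned up); nothing to redo.
		return false, time.Time{}
	case errors.Is(err, logical.ErrReadOnly):
		// We are no longer the active node; stop retrying here entirely.
		return false, time.Time{}
	default:
		// Treat other failures as transient and come back after a flat backoff.
		return true, time.Now().Add(1 * time.Second)
	}
}
```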

Contributor:

I actually meant from storage. So the original code means: get rid of that storage entry as soon as possible, since retryAfter is set to the past! In that case, not changing anything here is fine for all intents and purposes.

Contributor Author:

Sorry, perhaps I'm not understanding, but if we don't change anything in this function, wouldn't we be looping very quickly on I/O errors from the deletion call?

Contributor:

You are right about the looping. So far, what we have been seeing is the "cannot write to readonly storage" error.

2023-09-22T18:23:34.4464531Z 2023-09-22T18:23:00.674Z [ERROR] Test_ACME.core-2.secrets.pki.pki_7114d24b: ACME validation failed for 5dcc6663-8b3c-5826-7f62-b9018987fc5a-dns-01: error deleting challenge 5dcc6663-8b3c-5826-7f62-b9018987fc5a-dns-01 (error prior to cleanup, if any: <nil>): cannot write to readonly storage

So, other than that error, do you think we should be backing off that much?
Hmm, maybe we should! What I like about the original behaviour is the priority it gives to cleaning up.

Maybe we should also consider the workload here. How much workload would you expect? Is there a use case in which there will be millions of challenge cleanups bursting all at once? If so, maybe we should reconsider the design. We have a similar issue with lease revocation right now, and that has caused several issues.

Contributor Author:

The in-memory queue contains all the in-flight ACME validation requests; if we were servicing, say, LetsEncrypt-scale workloads, this definitely would not fly. This isn't the entire ACME lifecycle, just the validation of a request, so it should be relatively quick, and we clean up the requests within the same processing.

Now, we did have that issue with possibly corrupted items that would never be removed, which should be resolved within this PR.

runnerSc.Backend.Logger().Debug("Starting verification of challenge", "id", id)

if retry, retryAfter, err := ace._verifyChallenge(sc, id, config); err != nil {
if errors.Is(err, logical.ErrReadOnly) {
Contributor:

One thing that I am wondering about: would the caller of VerifyChallenge need to check whether core is an active node or not? I don't see that anywhere there, and delaying the check until here seems unnecessary. The ACME workload might be large enough that it could take some time to get rid of every item in the list with the readonly error (at least this is what I understood from reading the caller code).

Contributor Author:

One thing that I am wondering about is that would the caller of VerifyChallenge need to check if core is an active node or not?

Yup, you are correct; I think this bug is still here if we have just enough read-only errors to keep us within the second for loop. Great catch. The backoffs might not be enough to pop us out of that loop so that we re-evaluate the ace.Closing status and shut down the thread completely.

ACME work load might be large enough that it could be taking some time to get rid of every item in the list with the readonly

It's a little weird, but yeah, we only have a maximum of 5 items in flight at any given time. The plugin backend's Cleanup method signals the Closing channel so we don't need to process all items within the queue, but I do think we should add a check for Closing and the current node state to break out of the second for loop.

Contributor Author:

Yup, you were correct. I've removed the special handling for ErrReadOnly here and instead added a check within the loop for whether we are running on the active node, as well as not starting up the ACME thread (and not loading any persisted challenges for processing) on non-active nodes.
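A minimal sketch of that shape, using placeholder types and an assumed isActiveNode check (only the Closing channel comes from the actual engine):

```go
package main

import (
	"fmt"
	"time"
)

// challengeEngine is a stand-in for the PKI plugin's ACME challenge engine;
// everything except the Closing channel is illustrative.
type challengeEngine struct {
	Closing      chan struct{}
	isActiveNode func() bool
}

func (ce *challengeEngine) run() {
	// Never start the scheduler on a non-active node, so persisted challenges
	// are not loaded where writes would fail with read-only storage errors.
	if !ce.isActiveNode() {
		return
	}
	for {
		select {
		case <-ce.Closing:
			return // the plugin's Cleanup signalled shutdown
		default:
		}
		// Re-check on every pass: if this node was stepped down mid-run, stop
		// scheduling new verifications instead of spinning on failed writes.
		if !ce.isActiveNode() {
			return
		}
		// ... schedule the next verification here ...
		time.Sleep(250 * time.Millisecond)
	}
}

func main() {
	ce := &challengeEngine{
		Closing:      make(chan struct{}),
		isActiveNode: func() bool { return true },
	}
	go ce.run()
	time.Sleep(time.Second)
	close(ce.Closing) // simulate the plugin's Cleanup method
	time.Sleep(500 * time.Millisecond)
	fmt.Println("engine stopped")
}
```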

@stevendpclark stevendpclark marked this pull request as draft September 25, 2023 21:18
@stevendpclark (Contributor Author) commented Sep 25, 2023:

Switching to draft: one of @hghaf099's comments made me realize there is another issue I had missed. Addressed with the latest commits.

 - This reverts some of the previous changes in the PR, mainly around
   trapping the ErrReadOnly
 - Do not schedule the ACME worker thread if we aren't an active node,
   which will prevent us from reading the queue from disk
 - Bail out early within the for loop prior to scheduling the
   verification if we are no longer an active node
 - Readjust the retries for deletion errors
@stevendpclark stevendpclark marked this pull request as ready for review September 26, 2023 16:58
@sgmiller (Collaborator) left a comment:

My only question is: if a secondary node becomes the primary, will the challenge engine start up?

@stevendpclark (Contributor Author):

My only question is if a secondary node becomes the primary, will the challenge engine start up?

Yes, I confirmed that when a secondary node becomes the active node, the plugin's Initialize function is called, so we will re-evaluate and start it up. I've also manually tested that it works when stepping down nodes in a main cluster, as well as nodes within a performance replica cluster.
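A rough sketch of that lifecycle, with the backend type and helper below used as illustrative stand-ins (only the Initialize signature comes from the Vault plugin SDK):

```go
package pki

import (
	"context"

	"github.com/hashicorp/vault/sdk/logical"
)

// backend and startAcmeEngineIfActive are placeholders, not the plugin's
// actual types or methods.
type backend struct {
	acmeEngineRunning bool
}

func (b *backend) startAcmeEngineIfActive(ctx context.Context) error {
	// In the real plugin this would check the node's active/standby state
	// before launching the ACME background thread.
	b.acmeEngineRunning = true
	return nil
}

// Initialize is invoked when the plugin is set up, including after a standby
// node is promoted to active, so the engine can be (re)started here.
func (b *backend) Initialize(ctx context.Context, req *logical.InitializationRequest) error {
	return b.startAcmeEngineIfActive(ctx)
}
```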

@sgmiller (Collaborator):

Backport labels?

LogConsumer: func(s string) {
ts.log.Info(s)
},
// DNS container logging was disabled to reduce content within CI logs.
Contributor:

Should this be removed?

Contributor Author:

Yes, we will try it this way for now; it was another request from Paul Banks to reduce logging in CI.

@stevendpclark stevendpclark merged commit dbfaa6f into main Sep 26, 2023
114 of 143 checks passed
@stevendpclark stevendpclark deleted the stevendpclark/vault-20315-acme-active-node-change branch September 26, 2023 17:59
Labels
bug, hashicorp-contributed-pr, secret/pki
3 participants