-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent long timeout when invoking a function and the user has misconfigured the environment. #1543
Prevent long timeout when invoking a function and the user has misconfigured the environment. #1543
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think the HandleFindRunnersFailure isn't required here.
Let's ping @ostrain about the question on checking the number of runners returned.
Trying to determine how long it takes to recover once a user error has been fixed. I am having network issues here atm, but, it is still returning 502's 10 mins later... Something cached? |
The cached response behaviour is what I was concerned about. Will discuss this face-to-face. |
Thanks Jan. Just to update the thread, it only gets cached for 60 mins. So not a showstopper, but, we are going to remedy it. |
Hmm, we're supposed to clear the error as soon as we successfully provision a runner, so if you're still seeing it after 10 minutes that suggests we have a bug. Plus the fact that we clear it shouldn't generally matter since once the runner is up your invoke ought to succeed in which case you won't see an error regardless. |
tr.HandleFindRunnersFailure(err) | ||
// If there are no runners and last call to provision runners failed due | ||
// to a user error (or misconfiguration) then fail fast. | ||
if numOfRunners == 0 && err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I agree with this check on the number of runners.
Adding this means that the user will be notified of the misconfiguration only if she has not runners or, in case has some old runners, only when they get recycled.
I'd prefer to immediately notify the users about the error so they can fix it soon. All the point of this change is to fail fast and give a quick feedback to customers, if we check for empty runner list we miss the point of this change.
Let's consider this scenario where I make a change (misconfiguring my env) and deploy it, I test that it works and i managed to invoke a function because I land on an old runner, after "n" hours my functions suddenly start to fail because the old good runners get recycled and the new ones are affected by my misconfiguration. Personally I prefer to find the error soon so i have the right context at the right moment to fix it. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in any case I think that we still record the stats for numOfRunners==0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. It is a choice between two behaviors, but, my gut feeling is I prefer fast feedback when I am/have done something wrong. I can revert the check for existing runners if we all agree this is most useful/suitable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The numOfRunners == 0
check is doing more good than harm IMO, and here's why:
-
There are lots of errors where failing fast is the wrong thing to do if we have any runners. The best example is the case where the subnet is out of IPs or out of VNICs. In that scenario we have as many runners as is possible and we're under enough load that we need even more, and now we're going to just start failing all requests immediately and not using the runners we have. This would almost certainly result in a CAPA.
-
In the case where you get everything working, get a runner up, and then make a change that breaks it, there's a good chance you still won't get fast feedback if if we remove
numOfRunners == 0
, because the err won't get set until we try to specialize another runner. But since you already have a runner, we won't try to specialize another unless your load increases, or that runner reaches its end-of-life (in which case you would indeed get the error sooner, but still potentially hours after you broke it).
The other option, and probably the most correct solution, would be to inspect the error itself and only fail fast if it's a subnet NotAuthorizedOrNotFound error, rather than relying on numOfRunners
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with making the errors more explicit. PR has the title "Prevent long timeout when invoking a function and the user has misconfigured the environment." but in the code changes there's no indication of a user misconfiguration. IsFuncError() an indication of subnet error? Or does it include out of IP or VNICs as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is too difficult or taking too long to implement the inspect of errors I think we can go with this solution as it is now with the numOfRunners check and having a follow up patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skinowski - I was going to use the actual error, but, it was just a string specific to the OCI backend implementation I was using, I assumed we would want a more detailed abstraction here and not leak what is underneath (OCI). The IsFuncError() is what (to me) looked like the current way of determining a 'user blame' error? I also would prefer something better and clearer than using IsFuncError, but, that should not be unilateral. Wanna help?
I agree with @ostrain . I currently think we should only skip when there are no runners. Looks like lots of intermittent things could go wrong inside the GetRunners call e.g. periodic shonky DNS, and, I am not sure if IsFunc() is a fail-safe way to only catch user configuration errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add my $0.02
I think trying to extend this to cover cases where runner count is already greater than 0 may be a good move long term but will introduce complications.
In the short term handling the 0 runners case effectively is a good basis on which to improve.
Using the function error interface seems like the right way to determine blame, or, if we want to consider this tangential to invocation errors we could introduce a new interface representing user blame errors in the runner pool implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What actionable changes do I need to make, or, further tests to add/perform?
Can I confirm IsFunc() is the mandated way to determine this is a user blame error?
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works as written.
…obtain Runners cannot suceed due to a human Fn misconfiguration error. This prevents a user having to wait for the maximum timeout when the system is misconfigured.
49efdb2
to
3cc70a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
Would be interesting to revisit handling numOfRunners > 0
later, but for now this will help a lot.
// IsFuncError currently synonymous with tag: 'blame == user' | ||
// See: runner_fninvoke.handleFnInvokeCall2 | ||
if models.IsFuncError(err) { | ||
// We also sanity check for a 502 before returning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit: do we need this extra check here? I think IsFuncError() should be good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly not; it's hard to envisage a user error that prevents us from scaling up (from zero) where we don't want to bail immediately. On the other hand, I'd rather err on the side of being overly cautious in the first instance and then relax this condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Updates the RetryAllBackOff function to fail early when a request to obtain Runners cannot succeed due to an Fn misconfiguration error. This prevents a user having to wait for the maximum timeout (~180s) when the system is misconfigured.
Link to issue this resolves
What I did
Updated the
RetryAllBackOff
to accept the error returned byrp.Runners(ctx, call)
.Added check to not re-try when the error returned by
rp.Runners(ctx, call)
returns auser configuration error
.The error is checked to see if it a user configuration error with the
models.IsFuncError(err)
. This is the same function used to create theblame user
tag in the existing middleware (with the exception we do not need to check for nullity).Additionally check for an error status of
502
(which also denotes a user error).How I did it
Worked with Jan Grant (already a comitter).
Replicated the issue tests against a development version of the Fn platform.
Dynamically checked that
tr.requestCtx
andtr.placerCtx
did not have the required information at the point of early return.Statically examined the Fn code to see how the
user blame
mechanism worked.Added a small piece of code to replicate the
user blame
functionality inRetryAllBackOff
and return early if necessary.How to verify it
Create OCI test resources, Application, and Function.
Remove the VCN Policy permissions from the invoking Function compartment.
Attempt to invoke the Function.
Without the changes it will take around 180s to fail.
Manual Integration Test
One line description for the changelog
Prevent long timeout when invoking a function and the user has misconfigured the environment.