Prevent long timeout when invoking a function and the user has misconfigured the environment. #1543

templecloud · 2019-08-20T16:33:09Z

Updates the RetryAllBackOff function to fail early when a request to obtain Runners cannot succeed due to an Fn misconfiguration error. This prevents a user having to wait for the maximum timeout (~180s) when the system is misconfigured.

Link to issue this resolves
What I did
1. Updated the RetryAllBackOff to accept the error returned by rp.Runners(ctx, call).
2. Added check to not re-try when the error returned by rp.Runners(ctx, call) returns a user configuration error.
  1. The error is checked to see if it a user configuration error with the models.IsFuncError(err). This is the same function used to create the blame user tag in the existing middleware (with the exception we do not need to check for nullity).
  2. Additionally check for an error status of 502 (which also denotes a user error).
How I did it
1. Worked with Jan Grant (already a comitter).
2. Replicated the issue tests against a development version of the Fn platform.
3. Dynamically checked that tr.requestCtx and tr.placerCtx did not have the required information at the point of early return.
4. Statically examined the Fn code to see how the user blame mechanism worked.
5. Added a small piece of code to replicate the user blame functionality in RetryAllBackOff and return early if necessary.

How to verify it

NB: This was tested against an OCI backend.

Create OCI test resources, Application, and Function.
Remove the VCN Policy permissions from the invoking Function compartment.
Attempt to invoke the Function.

Without the changes it will take around 180s to fail.

Manual Integration Test

Show invoke fails after 180s when miss-configured.

$> time fn invoke test-app test-fn-01
Error invoking function. status: 502 message: subnet ocid1.subnet.oc1.iad.aaaaaaaa3zi47ktmvvoqjltnw2kksgbedfcdzqrocvuxeonaz4rs3vped6gq does not exist or Oracle Functions is not authorized to use it

real	3m2.990s
user	0m0.065s
sys	0m0.032s

PR - Invoke fails fast when miss-configured.

$> time fn invoke test-app test-fn-01
Error invoking function. status: 502 message: subnet ocid1.subnet.oc1.iad.aaaaaaaa3zi47ktmvvoqjltnw2kksgbedfcdzqrocvuxeonaz4rs3vped6gq does not exist or Oracle Functions is not authorized to use it

real	0m1.426s
user	0m0.065s
sys	0m0.018s

PR - Invoke still works when NOT miss-configured..

$> time fn invoke test-app test-fn-01
{"message":"Hello World"}

real	0m1.121s
user	0m0.066s
sys	0m0.017s

One line description for the changelog

Prevent long timeout when invoking a function and the user has misconfigured the environment.

jan-g

Think the HandleFindRunnersFailure isn't required here.

Let's ping @ostrain about the question on checking the number of runners returned.

api/runnerpool/placer_tracker.go

templecloud · 2019-08-21T14:12:23Z

Trying to determine how long it takes to recover once a user error has been fixed. I am having network issues here atm, but, it is still returning 502's 10 mins later...

Something cached?

jan-g · 2019-08-21T14:44:09Z

The cached response behaviour is what I was concerned about. Will discuss this face-to-face.

templecloud · 2019-08-21T17:00:50Z

Thanks Jan.

Just to update the thread, it only gets cached for 60 mins. So not a showstopper, but, we are going to remedy it.

ostrain · 2019-08-21T18:58:28Z

Hmm, we're supposed to clear the error as soon as we successfully provision a runner, so if you're still seeing it after 10 minutes that suggests we have a bug. Plus the fact that we clear it shouldn't generally matter since once the runner is up your invoke ought to succeed in which case you won't see an error regardless.

reclaro · 2019-08-22T09:51:10Z

api/runnerpool/placer_tracker.go

- tr.HandleFindRunnersFailure(err)
+ // If there are no runners and last call to provision runners failed due
+ // to a user error (or misconfiguration) then fail fast.
+ if numOfRunners == 0 && err != nil {


Not sure I agree with this check on the number of runners.
Adding this means that the user will be notified of the misconfiguration only if she has not runners or, in case has some old runners, only when they get recycled.
I'd prefer to immediately notify the users about the error so they can fix it soon. All the point of this change is to fail fast and give a quick feedback to customers, if we check for empty runner list we miss the point of this change.
Let's consider this scenario where I make a change (misconfiguring my env) and deploy it, I test that it works and i managed to invoke a function because I land on an old runner, after "n" hours my functions suddenly start to fail because the old good runners get recycled and the new ones are affected by my misconfiguration. Personally I prefer to find the error soon so i have the right context at the right moment to fix it. What do you think?

in any case I think that we still record the stats for numOfRunners==0

Yes, I agree. It is a choice between two behaviors, but, my gut feeling is I prefer fast feedback when I am/have done something wrong. I can revert the check for existing runners if we all agree this is most useful/suitable.

The numOfRunners == 0 check is doing more good than harm IMO, and here's why:

There are lots of errors where failing fast is the wrong thing to do if we have any runners. The best example is the case where the subnet is out of IPs or out of VNICs. In that scenario we have as many runners as is possible and we're under enough load that we need even more, and now we're going to just start failing all requests immediately and not using the runners we have. This would almost certainly result in a CAPA.

In the case where you get everything working, get a runner up, and then make a change that breaks it, there's a good chance you still won't get fast feedback if if we remove numOfRunners == 0, because the err won't get set until we try to specialize another runner. But since you already have a runner, we won't try to specialize another unless your load increases, or that runner reaches its end-of-life (in which case you would indeed get the error sooner, but still potentially hours after you broke it).

The other option, and probably the most correct solution, would be to inspect the error itself and only fail fast if it's a subnet NotAuthorizedOrNotFound error, rather than relying on numOfRunners.

I agree with making the errors more explicit. PR has the title "Prevent long timeout when invoking a function and the user has misconfigured the environment." but in the code changes there's no indication of a user misconfiguration. IsFuncError() an indication of subnet error? Or does it include out of IP or VNICs as well?

if it is too difficult or taking too long to implement the inspect of errors I think we can go with this solution as it is now with the numOfRunners check and having a follow up patch.

@skinowski - I was going to use the actual error, but, it was just a string specific to the OCI backend implementation I was using, I assumed we would want a more detailed abstraction here and not leak what is underneath (OCI). The IsFuncError() is what (to me) looked like the current way of determining a 'user blame' error? I also would prefer something better and clearer than using IsFuncError, but, that should not be unilateral. Wanna help?

I agree with @ostrain . I currently think we should only skip when there are no runners. Looks like lots of intermittent things could go wrong inside the GetRunners call e.g. periodic shonky DNS, and, I am not sure if IsFunc() is a fail-safe way to only catch user configuration errors.

To add my $0.02

I think trying to extend this to cover cases where runner count is already greater than 0 may be a good move long term but will introduce complications.
In the short term handling the 0 runners case effectively is a good basis on which to improve.

Using the function error interface seems like the right way to determine blame, or, if we want to consider this tangential to invocation errors we could introduce a new interface representing user blame errors in the runner pool implementation.

What actionable changes do I need to make, or, further tests to add/perform?

Can I confirm IsFunc() is the mandated way to determine this is a user blame error?

Thanks!

This works as written.

…obtain Runners cannot suceed due to a human Fn misconfiguration error. This prevents a user having to wait for the maximum timeout when the system is misconfigured.

api/runnerpool/placer_tracker.go

riconnon

Looks good to me.
Would be interesting to revisit handling numOfRunners > 0 later, but for now this will help a lot.

skinowski · 2019-10-01T16:44:40Z

api/runnerpool/placer_tracker.go

+ // IsFuncError currently synonymous with tag: 'blame == user'
+ // See: runner_fninvoke.handleFnInvokeCall2
+ if models.IsFuncError(err) {
+ // We also sanity check for a 502 before returning.


minor nit: do we need this extra check here? I think IsFuncError() should be good enough.

Possibly not; it's hard to envisage a user error that prevents us from scaling up (from zero) where we don't want to bail immediately. On the other hand, I'd rather err on the side of being overly cautious in the first instance and then relax this condition.

skinowski

lgtm

templecloud requested review from jan-g, skinowski, reclaro and ostrain and removed request for skinowski August 20, 2019 16:33

jan-g suggested changes Aug 21, 2019

View reviewed changes

api/runnerpool/placer_tracker.go Outdated Show resolved Hide resolved

api/runnerpool/placer_tracker.go Outdated Show resolved Hide resolved

api/runnerpool/placer_tracker.go Outdated Show resolved Hide resolved

reclaro reviewed Aug 22, 2019

View reviewed changes

templecloud added 2 commits September 17, 2019 16:55

Updates the RetryAllBackOff function to fail early when a request to …

d84e9d5

…obtain Runners cannot suceed due to a human Fn misconfiguration error. This prevents a user having to wait for the maximum timeout when the system is misconfigured.

Code review changes.

3cc70a9

shreyagarge force-pushed the bugfix/exit-runner-placement-early-when-configured-incorrectly branch from 49efdb2 to 3cc70a9 Compare September 17, 2019 15:58

ostrain reviewed Sep 24, 2019

View reviewed changes

api/runnerpool/placer_tracker.go Outdated Show resolved Hide resolved

Use the base interface for the type check

b3d0694

riconnon approved these changes Oct 1, 2019

View reviewed changes

skinowski reviewed Oct 1, 2019

View reviewed changes

skinowski approved these changes Oct 1, 2019

View reviewed changes

jan-g approved these changes Oct 3, 2019

View reviewed changes

jan-g merged commit 10b3265 into master Oct 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent long timeout when invoking a function and the user has misconfigured the environment. #1543

Prevent long timeout when invoking a function and the user has misconfigured the environment. #1543

templecloud commented Aug 20, 2019 •

edited

Loading

jan-g left a comment

templecloud commented Aug 21, 2019 •

edited

Loading

jan-g commented Aug 21, 2019

templecloud commented Aug 21, 2019

ostrain commented Aug 21, 2019

reclaro Aug 22, 2019

reclaro Aug 22, 2019

templecloud Aug 22, 2019

ostrain Aug 26, 2019 •

edited

Loading

skinowski Aug 26, 2019 •

edited

Loading

reclaro Aug 27, 2019

templecloud Sep 7, 2019 •

edited

Loading

riconnon Sep 8, 2019

templecloud Sep 17, 2019 •

edited

Loading

jan-g Oct 1, 2019

riconnon left a comment

skinowski Oct 1, 2019

jan-g Oct 2, 2019

skinowski left a comment

Prevent long timeout when invoking a function and the user has misconfigured the environment. #1543

Prevent long timeout when invoking a function and the user has misconfigured the environment. #1543

Conversation

templecloud commented Aug 20, 2019 • edited Loading

jan-g left a comment

Choose a reason for hiding this comment

templecloud commented Aug 21, 2019 • edited Loading

jan-g commented Aug 21, 2019

templecloud commented Aug 21, 2019

ostrain commented Aug 21, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ostrain Aug 26, 2019 • edited Loading

Choose a reason for hiding this comment

skinowski Aug 26, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

templecloud Sep 7, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

templecloud Sep 17, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

riconnon left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

skinowski left a comment

Choose a reason for hiding this comment

templecloud commented Aug 20, 2019 •

edited

Loading

templecloud commented Aug 21, 2019 •

edited

Loading

ostrain Aug 26, 2019 •

edited

Loading

skinowski Aug 26, 2019 •

edited

Loading

templecloud Sep 7, 2019 •

edited

Loading

templecloud Sep 17, 2019 •

edited

Loading