Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent long timeout when invoking a function and the user has misconfigured the environment. #1543

Merged

Conversation

templecloud
Copy link
Contributor

@templecloud templecloud commented Aug 20, 2019

Updates the RetryAllBackOff function to fail early when a request to obtain Runners cannot succeed due to an Fn misconfiguration error. This prevents a user having to wait for the maximum timeout (~180s) when the system is misconfigured.

  • Link to issue this resolves

  • What I did

    1. Updated the RetryAllBackOff to accept the error returned by rp.Runners(ctx, call).

    2. Added check to not re-try when the error returned by rp.Runners(ctx, call) returns a user configuration error.

      1. The error is checked to see if it a user configuration error with the models.IsFuncError(err). This is the same function used to create the blame user tag in the existing middleware (with the exception we do not need to check for nullity).

      2. Additionally check for an error status of 502 (which also denotes a user error).

  • How I did it

    1. Worked with Jan Grant (already a comitter).

    2. Replicated the issue tests against a development version of the Fn platform.

    3. Dynamically checked that tr.requestCtx and tr.placerCtx did not have the required information at the point of early return.

    4. Statically examined the Fn code to see how the user blame mechanism worked.

    5. Added a small piece of code to replicate the user blame functionality in RetryAllBackOff and return early if necessary.

  • How to verify it

    NB: This was tested against an OCI backend.

    1. Create OCI test resources, Application, and Function.

    2. Remove the VCN Policy permissions from the invoking Function compartment.

    3. Attempt to invoke the Function.

    Without the changes it will take around 180s to fail.

    Manual Integration Test

    1. Show invoke fails after 180s when miss-configured.
    $> time fn invoke test-app test-fn-01
    Error invoking function. status: 502 message: subnet ocid1.subnet.oc1.iad.aaaaaaaa3zi47ktmvvoqjltnw2kksgbedfcdzqrocvuxeonaz4rs3vped6gq does not exist or Oracle Functions is not authorized to use it
    
    real	3m2.990s
    user	0m0.065s
    sys	0m0.032s
    
    1. PR - Invoke fails fast when miss-configured.
    $> time fn invoke test-app test-fn-01
    Error invoking function. status: 502 message: subnet ocid1.subnet.oc1.iad.aaaaaaaa3zi47ktmvvoqjltnw2kksgbedfcdzqrocvuxeonaz4rs3vped6gq does not exist or Oracle Functions is not authorized to use it
    
    real	0m1.426s
    user	0m0.065s
    sys	0m0.018s
    
    1. PR - Invoke still works when NOT miss-configured..
    $> time fn invoke test-app test-fn-01
    {"message":"Hello World"}
    
    real	0m1.121s
    user	0m0.066s
    sys	0m0.017s
    
  • One line description for the changelog

Prevent long timeout when invoking a function and the user has misconfigured the environment.

@templecloud templecloud requested review from jan-g, skinowski, reclaro and ostrain and removed request for skinowski August 20, 2019 16:33
Copy link
Contributor

@jan-g jan-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think the HandleFindRunnersFailure isn't required here.

Let's ping @ostrain about the question on checking the number of runners returned.

api/runnerpool/placer_tracker.go Outdated Show resolved Hide resolved
api/runnerpool/placer_tracker.go Outdated Show resolved Hide resolved
api/runnerpool/placer_tracker.go Outdated Show resolved Hide resolved
@templecloud
Copy link
Contributor Author

templecloud commented Aug 21, 2019

Trying to determine how long it takes to recover once a user error has been fixed. I am having network issues here atm, but, it is still returning 502's 10 mins later...

Something cached?

@jan-g
Copy link
Contributor

jan-g commented Aug 21, 2019

The cached response behaviour is what I was concerned about. Will discuss this face-to-face.

@templecloud
Copy link
Contributor Author

Thanks Jan.

Just to update the thread, it only gets cached for 60 mins. So not a showstopper, but, we are going to remedy it.

@ostrain
Copy link
Contributor

ostrain commented Aug 21, 2019

Hmm, we're supposed to clear the error as soon as we successfully provision a runner, so if you're still seeing it after 10 minutes that suggests we have a bug. Plus the fact that we clear it shouldn't generally matter since once the runner is up your invoke ought to succeed in which case you won't see an error regardless.

tr.HandleFindRunnersFailure(err)
// If there are no runners and last call to provision runners failed due
// to a user error (or misconfiguration) then fail fast.
if numOfRunners == 0 && err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I agree with this check on the number of runners.
Adding this means that the user will be notified of the misconfiguration only if she has not runners or, in case has some old runners, only when they get recycled.
I'd prefer to immediately notify the users about the error so they can fix it soon. All the point of this change is to fail fast and give a quick feedback to customers, if we check for empty runner list we miss the point of this change.
Let's consider this scenario where I make a change (misconfiguring my env) and deploy it, I test that it works and i managed to invoke a function because I land on an old runner, after "n" hours my functions suddenly start to fail because the old good runners get recycled and the new ones are affected by my misconfiguration. Personally I prefer to find the error soon so i have the right context at the right moment to fix it. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in any case I think that we still record the stats for numOfRunners==0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. It is a choice between two behaviors, but, my gut feeling is I prefer fast feedback when I am/have done something wrong. I can revert the check for existing runners if we all agree this is most useful/suitable.

Copy link
Contributor

@ostrain ostrain Aug 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numOfRunners == 0 check is doing more good than harm IMO, and here's why:

  • There are lots of errors where failing fast is the wrong thing to do if we have any runners. The best example is the case where the subnet is out of IPs or out of VNICs. In that scenario we have as many runners as is possible and we're under enough load that we need even more, and now we're going to just start failing all requests immediately and not using the runners we have. This would almost certainly result in a CAPA.

  • In the case where you get everything working, get a runner up, and then make a change that breaks it, there's a good chance you still won't get fast feedback if if we remove numOfRunners == 0, because the err won't get set until we try to specialize another runner. But since you already have a runner, we won't try to specialize another unless your load increases, or that runner reaches its end-of-life (in which case you would indeed get the error sooner, but still potentially hours after you broke it).

The other option, and probably the most correct solution, would be to inspect the error itself and only fail fast if it's a subnet NotAuthorizedOrNotFound error, rather than relying on numOfRunners.

Copy link
Member

@skinowski skinowski Aug 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with making the errors more explicit. PR has the title "Prevent long timeout when invoking a function and the user has misconfigured the environment." but in the code changes there's no indication of a user misconfiguration. IsFuncError() an indication of subnet error? Or does it include out of IP or VNICs as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is too difficult or taking too long to implement the inspect of errors I think we can go with this solution as it is now with the numOfRunners check and having a follow up patch.

Copy link
Contributor Author

@templecloud templecloud Sep 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skinowski - I was going to use the actual error, but, it was just a string specific to the OCI backend implementation I was using, I assumed we would want a more detailed abstraction here and not leak what is underneath (OCI). The IsFuncError() is what (to me) looked like the current way of determining a 'user blame' error? I also would prefer something better and clearer than using IsFuncError, but, that should not be unilateral. Wanna help?

I agree with @ostrain . I currently think we should only skip when there are no runners. Looks like lots of intermittent things could go wrong inside the GetRunners call e.g. periodic shonky DNS, and, I am not sure if IsFunc() is a fail-safe way to only catch user configuration errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add my $0.02

I think trying to extend this to cover cases where runner count is already greater than 0 may be a good move long term but will introduce complications.
In the short term handling the 0 runners case effectively is a good basis on which to improve.

Using the function error interface seems like the right way to determine blame, or, if we want to consider this tangential to invocation errors we could introduce a new interface representing user blame errors in the runner pool implementation.

Copy link
Contributor Author

@templecloud templecloud Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What actionable changes do I need to make, or, further tests to add/perform?

Can I confirm IsFunc() is the mandated way to determine this is a user blame error?

Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works as written.

…obtain Runners cannot suceed due to a human Fn misconfiguration error. This prevents a user having to wait for the maximum timeout when the system is misconfigured.
@shreyagarge shreyagarge force-pushed the bugfix/exit-runner-placement-early-when-configured-incorrectly branch from 49efdb2 to 3cc70a9 Compare September 17, 2019 15:58
Copy link
Contributor

@riconnon riconnon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.
Would be interesting to revisit handling numOfRunners > 0 later, but for now this will help a lot.

// IsFuncError currently synonymous with tag: 'blame == user'
// See: runner_fninvoke.handleFnInvokeCall2
if models.IsFuncError(err) {
// We also sanity check for a 502 before returning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: do we need this extra check here? I think IsFuncError() should be good enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly not; it's hard to envisage a user error that prevents us from scaling up (from zero) where we don't want to bail immediately. On the other hand, I'd rather err on the side of being overly cautious in the first instance and then relax this condition.

Copy link
Member

@skinowski skinowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jan-g jan-g merged commit 10b3265 into master Oct 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants