Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RevisionSpec.Timeout has deviated from the original specification #12634

Closed
1 of 6 tasks
dprotaso opened this issue Feb 16, 2022 · 11 comments · Fixed by #12970
Closed
1 of 6 tasks

RevisionSpec.Timeout has deviated from the original specification #12634

dprotaso opened this issue Feb 16, 2022 · 11 comments · Fixed by #12970
Assignees
Labels
area/API API objects and controllers
Milestone

Comments

@dprotaso
Copy link
Member

dprotaso commented Feb 16, 2022

/area API

I dropped the ball regarding how we handled RevisionSpec.Timeout in the OSS implementation. We deviated from what was originally in the spec and opt'd to change the spec (#10806) rather than fix the behaviour.

We prioritized backwards compatibility over remaining conformant.

In hindsight this was a mistake since:

  1. This led us to introduced a new field to cover the functional gap - (Should be possible to set an actual revision timeout (max duration, not first byte) #10851)
  2. Timeout and MaxDuration being so similar is confusing and bad UX for end-users
  3. Workloads running on the OSS implementation will behave differently on other compliant Knative offerings.

New plan is here: #12634 (comment)

I think the best path forward here is:

  • Drop MaxDurationSeconds from the RevisionSpec #12635
    • Yes this is poor form but it's only been out for two weeks in the latest release at the time of writing and hasn't been documented
  • create a feature flag that enables the first byte timeout logic (on by default)
  • notify users about this default and be explicit they should set it for compatibility
  • after N releases we swap the default to be conformant to the spec's original intention
  • For those who opt into the first-byte timeout we provide a maxDuration annotation so they still have that ability
  • Allow a per-revision override via annotations so existing revisions can work independently of the operator settings
@knative-prow-robot knative-prow-robot added the area/API API objects and controllers label Feb 16, 2022
@dprotaso
Copy link
Member Author

/assign @dprotaso

@evankanderson
Copy link
Member

It feels like "time to first response byte" is a valid but niche use case. Rather than introducing a flag to change the meaning of timeoutSeconds, have we considered introducing a firstByteMaxTime or the like?

Thinking about it, it seems like the main use case for fBMT is for server or bidirectional streaming scenarios, which are definitely not the default.

@evankanderson
Copy link
Member

(I don't like firstByteMaxTime as a name, for the record.)

It still seems useful to be able to specify a first but and max stream duration at the Revision level, to enable safe lame-duck operation and bound times for proxy-level drain and restart without complicated TCP handoffs.

@dprotaso
Copy link
Member Author

Rather than introducing a flag to change the meaning of timeoutSeconds, have we considered introducing a firstByteMaxTime or the like?

I had not. Though now I wonder if adding support for idleTimeout (#10852) would cover the niche case of first byte timeout

@skonto
Copy link
Contributor

skonto commented Feb 17, 2022

Timeout and MaxDuration being so similar is confusing and bad UX for end-users

It is confusing inded. IMHO MaxDuration is a self-explanatory name and seems better that Timeout to me.

Should we capture this in a design doc and describe how previous cases can be handled with the "new" design?
It would help also documenting some of the scenarios eg. "if I use websockets then..." (diagrams would be nice too to explain options).

@nader-ziada
Copy link
Member

create a feature flag that enables the first byte timeout logic (on by default)

@dprotaso enable it to be taken into account by the timeout handler? what happens if its disabled, the behaviour uses the MaxDuration only and ignore first byte response?

@evankanderson
Copy link
Member

I'm also wondering about the plan -- do we mean a feature flag that makes timeoutSeconds mean timeoutOnResponseStart, or a feature-flag for a new parameter (idleTimeoutSeconds or startResponseTimeout)?

@nader-ziada
Copy link
Member

so here is my proposal:

Based on the user choice, the timeout handler would use the FirstByte or MaxDuration.

and I think since the MaxDuration seems to be the more common use case, we can make it the default, but at the same time allow the user to choose FirstByte if needed.

This new behaviour can start behind a feature gate for a release or two, then be on by default.

We can allow the user to choose using an annotation on the revision?

@dprotaso
Copy link
Member Author

dprotaso commented May 25, 2022

Circling back - our timeout to first byte logic is currently broken for web sockets. When using web sockets the connection will remain open for Timeout duration - which conforms to what the spec originally had.

user reported issue: https://knative.slack.com/archives/CA9RHBGJX/p1645566907713149

This simplifies the original plan since the earlier concern we had was about breaking compatibility with long running web sockets connections isn't an issue.

Thus all I think we need to do is:

but at the same time allow the user to choose FirstByte if needed.

I think this would be covered as a new property - IdleTimeout - #10852

@dprotaso
Copy link
Member Author

Nader is going to drive this for the 1.6 release

/assign @nader-ziada
/unassign @dprotaso

@dprotaso
Copy link
Member Author

With #11980 we should now be able to make the necessary changes for 1.7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/API API objects and controllers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants