Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix initializations and terminations of http servers #13723

Closed
wants to merge 16 commits into from

Conversation

davidhadas
Copy link
Contributor

Fixes #13720

Bug fixes for server initialization and termination.

  • Adds timer for waiting for services to shutdown
  • Fixes drainer for TLS
  • Remove admin service when tlsAdmin is used
  • Shutdown all servers before we flush and exit

@knative-prow knative-prow bot added area/networking size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 21, 2023
@codecov
Copy link

codecov bot commented Feb 21, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.09 ⚠️

Comparison is base (8d8c39e) 86.10% compared to head (8ef6bac) 86.02%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13723      +/-   ##
==========================================
- Coverage   86.10%   86.02%   -0.09%     
==========================================
  Files         199      199              
  Lines       14796    14815      +19     
==========================================
+ Hits        12740    12744       +4     
- Misses       1752     1768      +16     
+ Partials      304      303       -1     
Impacted Files Coverage Δ
pkg/queue/sharedmain/main.go 0.87% <0.00%> (+0.02%) ⬆️
pkg/queue/sharedmain/servers.go 0.00% <0.00%> (ø)

... and 3 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@davidhadas
Copy link
Contributor Author

Tests fail. Is this due to admin being removed when tlsAdmin was used? Do we require both?

@davidhadas
Copy link
Contributor Author

@nak3, any thoughts on why the tests are failing following g the change?

for name, server := range httpServers {
go func(name string, s *http.Server) {
for _, s := range srvs {
go func(s service) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use errGroup to fail here if any of the server fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errGroup waits for all to terminate (and present the first error). We seem to not have a reason to use it here.
We will still need the errChan to signal when any one of them had an error.

During shutdown, we use synchronous Shutdowns (one by one).
in both cases seems we do not need to wait for all to terminate asynchronously

@skonto
Copy link
Contributor

skonto commented Feb 23, 2023

stream.go:305: E 18:03:48.425 activator-544b695d96-85k9p [activator] [serving-tests/should-have-stdin-e-o-f-wrakflxj-00001] Failed to probe clusterIP 10.96.112.83:80 err=error roundtripping [https://10.96.112.83:80/healthz:](https://10.96.112.83/healthz:) dial tcp 10.96.112.83:80: connect: connection refused

@davidhadas
Copy link
Contributor Author

davidhadas commented Feb 23, 2023

What do we need adminTls for?

Looking into Drainer, my understanding is that the admin's sole purpose is to serve the "/wait-for-drain" sent as a preStop hook. This afaik will always be HTTP and never TLS. Can we drop the adminTls then and always use HTTP based admin?

Another way to put this is that admin is always an HTTP service - never a TLS one.

…ers, moved to struct service being returned from build* functions

Signed-off-by: David Hadas <[email protected]>
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 23, 2023
Signed-off-by: David Hadas <[email protected]>
@davidhadas
Copy link
Contributor Author

/assign @davidhadas

@davidhadas
Copy link
Contributor Author

Moved to a single admin service (no mainTls needed) - the admin now drain both the drainer from main and the one from mainTls (if not nil).

build* functions are now returning a service struct

We now have a simpler logic where we just add services as needed to our array depending on the different flags.

Signed-off-by: David Hadas <[email protected]>
@davidhadas
Copy link
Contributor Author

/retest

1 similar comment
@davidhadas
Copy link
Contributor Author

/retest

@davidhadas
Copy link
Contributor Author

/test istio-latest-no-mesh

@davidhadas davidhadas requested review from dprotaso and removed request for psschwei March 16, 2023 12:14
@davidhadas
Copy link
Contributor Author

/retest

@davidhadas
Copy link
Contributor Author

davidhadas commented Mar 24, 2023

/hold
@dprotaso is planning to refactor the main and mainTls handlers - moving from 2 separate handlers to one.
Once this refactor is done, we can suffice with a single drainer.
We will then continue working on this PR to solve #13720 - it will naturally simplify this PR as there will be only a single drainer.

@dprotaso
Copy link
Member

The handler refactor merged - #13815

You should be able to rebase your changes. There might be some overlap - one that's still unique to this PR is introduce a deadline for shutdown.

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 28, 2023
@davidhadas davidhadas closed this Mar 29, 2023
@knative-prow knative-prow bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 29, 2023
@davidhadas davidhadas reopened this Mar 29, 2023
@knative-prow-robot knative-prow-robot removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 29, 2023
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 29, 2023
@knative-prow
Copy link

knative-prow bot commented Mar 29, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidhadas, nak3, ReToCode

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment
@knative-prow
Copy link

knative-prow bot commented Mar 29, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidhadas, nak3, ReToCode

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 29, 2023
@davidhadas
Copy link
Contributor Author

@dprotaso rebased

Copy link
Member

@dprotaso dprotaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think that it's much easier to reason which servers support TLS by having it declarative vs. changing the builders and having an opaque list.

I'm prioritizing readibility over repitition in this instance.

Can we just scope this PR to introducing a shutdown time limit?

@davidhadas
Copy link
Contributor Author

davidhadas commented Mar 29, 2023

@dprotaso
This PR came to fix a different issue and not the shutdown time limit...
It is here to fix a bug in the termination that resulted in connections being successfully created even after flush().
See #13720 (comment),

Was this bug fixed and tested by #13815 ?

If it did, then we can do without this PR.
If it was not, then some PR is still needed to fix it.
This PR offers to solve this bug by ensuring that we have an orderly list of open services and ensuring that we carefully shutdown this list - this has been tested to fix the bug.

@dprotaso
Copy link
Member

Was this bug fixed and tested by #13815 ?

Yup - main should be shutting down properly now

The fix was here
https://github.com/knative/serving/pull/13815/files#diff-c89ab95fa1d2a52207142b5664c30bfce82bc2abea59bf620d248dd43f30bd56L309-L310

This line should have been removed in an earlier refactor.

@davidhadas
Copy link
Contributor Author

Closing this PR as #13720 (comment) was solved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/autoscale area/networking do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Http and tls services are not initialized and terminated correctly
7 participants