Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support http1 full duplex per workload #14568

Merged
merged 10 commits into from
Jan 16, 2024
Merged

Conversation

skonto
Copy link
Contributor

@skonto skonto commented Oct 26, 2023

Fixes #12387

Proposed Changes

  • Workloads annotated with features.knative.dev/http-full-duplex=Enabled will get the support for http1 full duplex end-to-end on the data path. Activator and QP h2c handlers dont seem to create an issue on the http1 case.
  • This feature requires a build with golang 1.21.x.
  • Handlers that override the writer methods need to be controlled with the new Golang fix. Check this for more.
    Note: If any of the wrappers is removed then you get ~1 request failure over tens of thousands of requests.
    Had to run the test multiple times to get one failure, each run creates ~30K requests over a period ~1m.
  • Tested with this repo, sample run. Might have to add an integration test though.
  • Tested this both in serve/proxy mode.

Release Note

Http1 full duplex is now supported. Workloads need to be annotated with `features.knative.dev/http-full-duplex`.

@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2023
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/API API objects and controllers area/autoscale area/networking labels Oct 26, 2023
@codecov
Copy link

codecov bot commented Oct 26, 2023

Codecov Report

Attention: 25 lines in your changes are missing coverage. Please review.

Comparison is base (e5602d7) 86.02% compared to head (558ae15) 85.92%.
Report is 12 commits behind head on main.

Files Patch % Lines
pkg/queue/sharedmain/handlers.go 0.00% 12 Missing ⚠️
pkg/activator/handler/context.go 60.00% 3 Missing and 1 partial ⚠️
pkg/activator/handler/handler.go 66.66% 2 Missing and 1 partial ⚠️
cmd/activator/main.go 0.00% 2 Missing ⚠️
pkg/http/handler/timeout.go 0.00% 2 Missing ⚠️
pkg/http/response_recorder.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14568      +/-   ##
==========================================
- Coverage   86.02%   85.92%   -0.10%     
==========================================
  Files         197      197              
  Lines       14950    14991      +41     
==========================================
+ Hits        12860    12881      +21     
- Misses       1777     1796      +19     
- Partials      313      314       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@skonto
Copy link
Contributor Author

skonto commented Oct 26, 2023

/test performance-tests

1 similar comment
@skonto
Copy link
Contributor Author

skonto commented Oct 26, 2023

/test performance-tests

@skonto
Copy link
Contributor Author

skonto commented Oct 26, 2023

/test istio-latest-no-mesh-tls

@skonto
Copy link
Contributor Author

skonto commented Oct 26, 2023

Sec failure:

Building with tags: e2e,hpa,upgrade
# knative.dev/serving/pkg/activator/handler
pkg/activator/handler/handler.go:139:11: rc.EnableFullDuplex undefined (type *"net/http".ResponseController has no field or method EnableFullDuplex)
# knative.dev/serving/pkg/queue/sharedmain
pkg/queue/sharedmain/handlers.go:[54](https://github.com/knative/serving/actions/runs/6653489641/job/18079562592?pr=14568#step:5:55):11: rc.EnableFullDuplex undefined (type *"net/http".ResponseController has no field or method EnableFullDuplex)
pkg/queue/sharedmain/handlers.go:146:11: rc.EnableFullDuplex undefined (type *"net/http".ResponseController has no field or method EnableFullDuplex)

It seems that we need to setup go version with a separate step for CodeQL.

@skonto skonto changed the title [WIP] Support http1 full duplex per workload Support http1 full duplex per workload Oct 27, 2023
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 27, 2023
@skonto
Copy link
Contributor Author

skonto commented Nov 1, 2023

cc @dprotaso gentle ping

@@ -236,7 +236,7 @@ func main() {

// NOTE: MetricHandler is being used as the outermost handler of the meaty bits. We're not interested in measuring
// the healthchecks or probes.
ah = activatorhandler.NewMetricHandler(env.PodName, ah)
ah = wrapActivatorHandlerWithFullDuplex(activatorhandler.NewMetricHandler(env.PodName, ah))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for me to understand better, what is the reason that we need the handler twice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Is it not possible to do the annotation lookup and setting the duplex option when we proxy the request here ?

var proxy *httputil.ReverseProxy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Secondly, I say we don't include this change on the metrics handler - I don't think users have reported an issue with that.

Copy link
Contributor Author

@skonto skonto Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics handler is on the request path and activator logs the same error we are trying to address. If I don't wrap it I get that error, not as frequently, but I do after a large number of requests.
Each handler interacts in isolation wrt other handlers in the chain, when it comes to the full duplex thingy. That is why I wrap the handlers that do create a problem in the chain individually. Bottom line is that handlers that override write logic eg:

type Writer interface {
	Write(p []byte) (n int, err error)
}

create the issue seen. Probably we need an e2e test to copy the test logic or you can test this manually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each handler interacts in isolation wrt other handlers in the chain, when it comes to the full duplex thingy.

The underlying ResponseWriter is the one from net/http though - so EnableFullDuplex needs to only be called once.

https://cs.opensource.google/go/go/+/refs/tags/go1.21.4:src/net/http/server.go;l=504-507

When we wrap this original writer in various handlers we should include an Unwrap method

https://cs.opensource.google/go/go/+/refs/tags/go1.21.4:src/net/http/responsecontroller.go;l=42-44

Copy link
Contributor Author

@skonto skonto Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check. We certainly don't pass the original writer everywhere:

// The ResponseWriter should be the original value passed to the Handler.ServeHTTP method,
// or have an Unwrap method returning the original ResponseWriter.

Copy link
Contributor Author

@skonto skonto Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems you need to only set EnableFullDuplex at the outermost handler (first called). Adding the unwrap method didn't not change the outcome locally but it is good practice so that the handlers can be used with a responsecontroller. The initial responsewriter passed by net/http (when it schedules the user handle) is the only one that implements the EnableFullDuplex thing anyway. Anything else should return not supported (our handlers passed as responseWriters).

cmd/activator/main.go Outdated Show resolved Hide resolved
cmd/activator/main.go Outdated Show resolved Hide resolved
Copy link
Member

@ReToCode ReToCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @skonto!

In general I feel like we have to wrap in a lot of places, but probably have to do it that way.

Also, can we run some of the existing e2e tests with that flag enabled?

@dprotaso
Copy link
Member

Can you also update the go.mod file to ensure folks are building with go1.21

Otherwise this feature won't work.

@skonto
Copy link
Contributor Author

skonto commented Dec 11, 2023

@dprotaso ready, gentle ping.

@skonto
Copy link
Contributor Author

skonto commented Dec 18, 2023

@dprotaso gentle ping.

1 similar comment
@skonto
Copy link
Contributor Author

skonto commented Jan 3, 2024

@dprotaso gentle ping.

@dprotaso
Copy link
Member

dprotaso commented Jan 12, 2024

Looks good - I tried to run the test locally and it seemed to hang

go test -v -run TestActivatorChain

2024/01/12 13:48:59 http: proxy error: dial tcp 127.0.0.1:58796: connect: can't assign requested address
2024/01/12 13:48:59 http: proxy error: dial tcp 127.0.0.1:58796: connect: can't assign requested address
main_test.go:253: error during request: unexpected status code: 502
main_test.go:253: error during request: failed to execute request: Post "http:https://127.0.0.1:58798": dial tcp 127.0.0.1:58798: connect: can't assign requested address
main_test.go:253: error during request: failed to execute request: Post "http:https://127.0.0.1:58798": dial tcp 127.0.0.1:58798: connect: can't assign requested address
main_test.go:253: error during request: failed to execute request: Post "http:https://127.0.0.1:58798": dial tcp 127.0.0.1:58798: connect: can't assign requested address
main_test.go:253: error during request: unexpected status code: 502
main_test.go:253: error during request: failed to execute request: Post "http:https://127.0.0.1:58798": dial tcp 127.0.0.1:58798: connect: can't assign requested address
main_test.go:253: error during request: failed to execute request: Post "http:https://127.0.0.1:58798": dial tcp 127.0.0.1:58798: connect: can't assign requested address
main_test.go:253: error during request: failed to execute request: Post "http:https://127.0.0.1:58798": dial tcp 127.0.0.1:58798: connect: can't assign requested address
main_test.go:253: error during request: failed to execute request: Post "http:https://127.0.0.1:58798": dial tcp 127.0.0.1:58798: connect: can't assign requested address
main_test.go:253: error during request: failed to execute request: Post "http:https://127.0.0.1:58798": dial tcp 127.0.0.1:58798: connect: can't assign requested address
main_test.go:253: error during request: unexpected status code: 502
main_test.go:253: error during request: failed to execute request: Post "http:https://127.0.0.1:58798": dial tcp 127.0.0.1:58798: connect: can't assign requested address
main_test.go:253: error during request: unexpected status code: 502
main_test.go:253: error during request: failed to execute request: Post "http:https://127.0.0.1:58798": dial tcp 127.0.0.1:58798: connect: can't assign requested address

I'm on Mac ARM - go version 1.21.6

@skonto
Copy link
Contributor Author

skonto commented Jan 15, 2024

@dprotaso hi,

I'm on Mac ARM - go version 1.21.6

I don't use Mac but maybe @ReToCode can help. This seems like a known MAC network setup error, check also here for a fix.

At my side tests pass (go 1.21.6):

 go test -v -run TestActivatorChainHandlerWithFullDuplex ./pkg/activator/handler/...
=== RUN   TestActivatorChainHandlerWithFullDuplex
    logger.go:130: 2024-01-15T13:48:51.464+0200	DEBUG	configmap/store.go:155	activator config "config-tracing" config was added or updated: &config.Config{Backend:"none", ZipkinEndpoint:"", Debug:false, SampleRate:0.1}
--- PASS: TestActivatorChainHandlerWithFullDuplex (96.12s)
PASS
ok  	knative.dev/serving/pkg/activator/handler	96.144s

I rebased and still see no such failure.

@ReToCode
Copy link
Member

ReToCode commented Jan 16, 2024

I tested it locally and I have the same. It seems like MacOS has more restrictive default settings for opening a lot of response-ports (see https://copyprogramming.com/howto/ab-program-freezes-after-lots-of-requests-why). I can see a lot of TIME_WAIT on netstat.

This fixes it:

sudo sysctl -w net.inet.tcp.msl=1000

But I don't think it is a good idea to force people to do this. So maybe we can just skip the test when runtime.GOOS == darwin?

@skonto
Copy link
Contributor Author

skonto commented Jan 16, 2024

{ error creating the clusters: error creating clusters: error creating cluster: exit status 1, output: "Default change: VPC-native is the default mode during cluster creation for versions greater than 1.21.0-gke.1500...

@skonto
Copy link
Contributor Author

skonto commented Jan 16, 2024

/test istio-latest-no-mesh

@dprotaso
Copy link
Member

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jan 16, 2024
Copy link

knative-prow bot commented Jan 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, skonto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2024
@knative-prow knative-prow bot merged commit 34da024 into knative:main Jan 16, 2024
49 checks passed
skonto added a commit to skonto/serving that referenced this pull request Jan 17, 2024
* support http1 full duplex per workload

* lint

* style

* single call

* updates & unit test

* fix unit test

* fix comment

* address review comments

* fix lint

* skip for Mac
skonto added a commit to skonto/serving that referenced this pull request Jan 18, 2024
* support http1 full duplex per workload

* lint

* style

* single call

* updates & unit test

* fix unit test

* fix comment

* address review comments

* fix lint

* skip for Mac
openshift-merge-bot bot pushed a commit to openshift-knative/serving that referenced this pull request Jan 23, 2024
…rkload (knative#14568) (#591)

* Support http1 full duplex per workload (knative#14568)

* support http1 full duplex per workload

* lint

* style

* single call

* updates & unit test

* fix unit test

* fix comment

* address review comments

* fix lint

* skip for Mac

* Create less load for TestActivatorChainHandlerWithFullDuplex (knative#14820)

* less load

* limit cons
skonto added a commit to skonto/serving that referenced this pull request Jan 26, 2024
…rkload (knative#14568) (knative#591)

* Support http1 full duplex per workload (knative#14568)

* support http1 full duplex per workload

* lint

* style

* single call

* updates & unit test

* fix unit test

* fix comment

* address review comments

* fix lint

* skip for Mac

* Create less load for TestActivatorChainHandlerWithFullDuplex (knative#14820)

* less load

* limit cons
openshift-merge-bot bot pushed a commit to openshift-knative/serving that referenced this pull request Jan 29, 2024
…rkload (knative#14568) (#591) (#605)

* Support http1 full duplex per workload (knative#14568)

* support http1 full duplex per workload

* lint

* style

* single call

* updates & unit test

* fix unit test

* fix comment

* address review comments

* fix lint

* skip for Mac

* Create less load for TestActivatorChainHandlerWithFullDuplex (knative#14820)

* less load

* limit cons
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/autoscale area/networking lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

500 errors with queue-proxy in knative versions higher than 0.24.0
4 participants