500 errors with queue-proxy in knative versions higher than 0.24.0 #12387

Closed
adriangudas opened this issue Dec 2, 2021 · 21 comments · Fixed by #14568
Labels: kind/bug (Categorizes issue or PR as related to a bug.) · triage/accepted (Issues which should be fixed (post-triage))

adriangudas commented Dec 2, 2021

What version of Knative?

1.0.0

Expected Behavior

No 500 errors.

Actual Behavior

Sporadic 500 errors are being presented to the client. From the queue-proxy logs on the service, we see the following error message:

httputil: ReverseProxy read error during body copy: read tcp 127.0.0.1:51376->127.0.0.1:8080: use of closed network connection

Steps to Reproduce the Problem

We were able to reproduce these errors with knative 1.0.0, as well as knative 0.26.0 and 0.25.1; we're not seeing the issue in 0.24.0. I wasn't able to reproduce it using hey and helloworld-go, so it appears to be workload-dependent to some degree. However, we're not doing anything unusual, except that our responses are understandably larger than what helloworld-go sends out, this being an actual production API workload.

The count of 500s is fairly small (5 occurrences over several thousand requests), but it's higher than the zero errors we were getting before, and we can't explain it. There doesn't appear to be any anomalous behaviour in the app itself that would be causing the issues (memory/CPU are stable, etc.).

Once we moved to knative 0.24.0, the problem went away. (The knative serving operator makes it really easy to move between versions 💯 )

Some digging in Go's repo resulted in an interesting lead, which might be related, and points to an issue in Go rather than queue-proxy specifically.

I was wondering if this is on the right track, or if anyone else sees this in their logs. This is preventing us from being able to upgrade to 1.0.0 in production.

Thanks!

adriangudas added the kind/bug label on Dec 2, 2021
skonto (Contributor) commented Jan 12, 2022

Hi @adriangudas, this could be the known Golang issue; however, let's make sure you don't see any errors on the user-container side, e.g. the application closing the connection for some reason. Could you try hitting the application endpoint, either locally in that container or via a client, to see if there are any failures?
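
For example, something like this small Go client could be used for that check. The URL and request count are illustrative assumptions; point it at the user container (e.g. port-forwarded 8080) or at the service route and adjust as needed:

package main

import (
	"fmt"
	"net/http"
)

// Hit the app endpoint repeatedly and report any non-200 responses or
// transport errors, to rule out the user container closing connections.
func main() {
	const target = "http://127.0.0.1:8080/" // assumed endpoint, adjust to your app
	failures := 0
	for i := 0; i < 1000; i++ {
		resp, err := http.Get(target)
		if err != nil {
			failures++
			fmt.Println("transport error:", err)
			continue
		}
		if resp.StatusCode != http.StatusOK {
			failures++
			fmt.Println("unexpected status:", resp.Status)
		}
		resp.Body.Close()
	}
	fmt.Println("failures:", failures)
}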

github-actions bot commented

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

github-actions bot added the lifecycle/stale label on Apr 13, 2022
dprotaso (Member) commented

/triage accepted
/milestone Icebox

knative-prow bot added the triage/accepted label on Apr 29, 2022
dprotaso removed the lifecycle/stale label on Apr 29, 2022
dprotaso added this to the Icebox milestone on Apr 29, 2022

qldong commented May 27, 2022

I'm on @adriangudas's team, and we've been testing this issue on knative version 1.4.0. The issue still persists.

httputil: ReverseProxy read error during body copy: read tcp 127.0.0.1:45456->127.0.0.1:8080: use of closed network connection
error reverse proxying request; sockstat: sockets: used 201
TCP: inuse 140 orphan 2 tw 19 alloc 632 mem 162
UDP: inuse 0 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

The application we're testing against has not changed since we were using knative 0.24, but the issue started happening when we tried to upgrade knative to newer versions.


qldong commented May 31, 2022

@dprotaso Let me know if you need help obtaining debug information.

dprotaso (Member) commented Jun 9, 2022

I prodded the upstream golang issue - golang/go#40747 (comment)

We leverage httputil.ReverseProxy, so it would be nice if they could fix this upstream.
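
For reference, a minimal sketch of the shape of that setup: httputil.ReverseProxy in front of a local backend that starts replying before it has read the whole request body. The ports and handler here are illustrative assumptions, not the actual queue-proxy code, and this alone will not reliably reproduce the race:

package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Backend that writes its response without draining the request body,
	// the kind of behaviour discussed in golang/go#40747.
	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("early response\n"))
	})
	go func() { _ = http.ListenAndServe("127.0.0.1:8080", backend) }()

	// Reverse proxy in front of it, similar in spirit to what queue-proxy does.
	target, _ := url.Parse("http://127.0.0.1:8080")
	_ = http.ListenAndServe("127.0.0.1:8012", httputil.NewSingleHostReverseProxy(target))
}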


kahirokunn commented Sep 22, 2022

I tried grpc-ping-go with knative serving 1.7.1, 1.7.7, and 1.4.0.
https://github.com/knative/docs/tree/main/code-samples/serving/grpc-ping-go

I hit the exact same problem.
#12387 (comment)

{
  "severity": "ERROR",
  "timestamp": "2022-09-22T01:48:53.163229744Z",
  "logger": "queueproxy",
  "caller": "network/error_handler.go:33",
  "message": "error reverse proxying request; sockstat: sockets: used 9\nTCP: inuse 0 orphan 0 tw 2 alloc 209 mem 47\nUDP: inuse 0 mem 6\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n",
  "commit": "9402a71-dirty",
  "knative.dev/key": "default/grpc-ping-00003",
  "knative.dev/pod": "grpc-ping-00003-deployment-6c879dc98-7b8rh",
  "error": "net/http: HTTP/1.x transport connection broken: write tcp 127.0.0.1:55694->127.0.0.1:8080: write: connection reset by peer",
  "stacktrace": "knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:301\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue.ProxyHandler.func1\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nknative.dev/serving/pkg/queue.ForwardedShimHandler.func1\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:117"
}

JayChanggithub commented

@kahirokunn
Just wanted to confirm: has this issue been resolved? We seem to be hitting the same situation. Detailed logs below.

2023-05-31 08:53:09 | {"severity":"ERROR","timestamp":"2023-05-31T00:53:09.113500293Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 21\nTCP: inuse 8 orphan 2 tw 131 alloc 664 mem 1392\nUDP: inuse 0 mem 13\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"5c49eba","knative.dev/key":"prod/rm-ml-return-forecast-00011","knative.dev/pod":"rm-ml-return-forecast-00011-deployment-75df68bc76-2kmds","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:300\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue.ProxyHandler.func1\n\tknative.dev/serving/pkg/queue/handler.go:35\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\nknative.dev/serving/pkg/queue.ForwardedShimHandler.func1\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:50\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2047\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func3\n\tknative.dev/serving/pkg/http/handler/timeout.go:100"}

Also, our relevant release is shown below. Not sure whether downgrading our Knative version would resolve this?

$ k get svc -n istio-system knative-local-gateway   -oyaml | kubectl-neat 

apiVersion: v1
kind: Service
metadata:
  labels:
    experimental.istio.io/disable-gateway-port-translation: "true"
    networking.knative.dev/ingress-provider: istio
    serving.knative.dev/release: v0.26.0
  name: knative-local-gateway
  namespace: istio-system
spec:
  clusterIP: 

dprotaso (Member) commented Jun 7, 2023

FWIW - the golang upstream fix should land in go1.21 (comes out in August).

Also - I'd retest against the latest knative to make sure this is still a problem

dprotaso modified the milestones: Icebox → v1.12.0 on Jun 7, 2023
dprotaso (Member) commented Jun 7, 2023

Adding to the v1.12.0 release since we'll probably be able to switch to go1.21 for that release.

skonto (Contributor) commented Aug 30, 2023

@dprotaso I tested the new fix and it seems to work.
Release notes for the fix:

The new ResponseController.EnableFullDuplex method allows server handlers to concurrently read from an HTTP/1 request body while writing the response. Normally, the HTTP/1 server automatically consumes any remaining request body before starting to write the response, to avoid deadlocking clients which attempt to write a complete request before reading the response. The EnableFullDuplex method disables this behavior.

You just need the following modification in the reproducer, built with Go 1.21:

	proxy := httputil.NewSingleHostReverseProxy(echoURL)
	proxyHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Opt this handler in to full duplex (go1.21+) before proxying.
		rc := http.NewResponseController(w)
		_ = rc.EnableFullDuplex()
		proxy.ServeHTTP(w, r)
	})
	proxyServer := httptest.NewServer(proxyHandler)

A few concerns:

- Having this method enabled by default is probably not what we want, due to the deadlock issue.
- Long term, the Golang folks will not make this the default behavior for HTTP/1 servers, so this fix requires that clients understand the server behavior. It seems to me more like an app-level decision whether users want to have this on. So I was thinking that a revision annotation could help enable this at the activator and QP reverse proxies for the given app (just for the specific handlers invoked), and we would then make sure the path ingress -> activator -> QP works as expected for that app; a rough sketch follows this list. Alternatively, if we enable it by default, we should provide a way to opt out of this setting per app.
- Finally, this could require a conditional build for 1.21, as in here, if we don't switch to that.
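
To make the annotation idea concrete, here is a rough sketch of what such a per-revision opt-in could look like in the activator or queue-proxy proxy handler. The annotation key, its accepted value, and the wiring are all hypothetical placeholders, not an agreed API:

package sample

import (
	"net/http"
	"net/http/httputil"
	"strings"
)

// Hypothetical annotation key; the real name and accepted values would be
// decided as part of the actual change, this is just for illustration.
const fullDuplexAnnotation = "features.knative.dev/http-full-duplex"

// proxyHandler wraps a reverse proxy and enables full duplex only for
// revisions that opted in via the (hypothetical) annotation above.
func proxyHandler(proxy *httputil.ReverseProxy, revAnnotations map[string]string) http.Handler {
	enabled := strings.EqualFold(revAnnotations[fullDuplexAnnotation], "enabled")
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if enabled {
			// Requires go1.21: lets the handler keep reading the request body
			// while writing the response instead of the server closing it.
			_ = http.NewResponseController(w).EnableFullDuplex()
		}
		proxy.ServeHTTP(w, r)
	})
}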

dprotaso (Member) commented Aug 30, 2023

Having this method enabled by default is not what we probably want due to the deadlock issue.

It'd be great to know which clients. I also wonder if this concern is mitigated because we have an envoy proxy (or something else) in front of the activator/queue proxy.

Alternatively, if we enable by default, we should provide a way to opt out this setting somehow per app.

Good point about this being workload dependent.


tikr7 commented Sep 5, 2023

Does it only happen if the user-container is written in Python?


moadi commented Sep 18, 2023

@tikr7 we've been seeing this issue with the user-container running Java 8 Spring Boot services.


moadi commented Sep 18, 2023

@skonto having the behaviour configurable via an annotation on the revision would be extremely handy. Depending on the defaults chosen, we can pick the affected services on our end and have them opt in or out.

skonto (Contributor) commented Sep 26, 2023

I will take a look. /assign @skonto

skonto self-assigned this on Sep 26, 2023
skonto (Contributor) commented Oct 19, 2023

I am back on it. I have a small project here experimenting locally with the full duplex. Envoy seems to have no issues (haven't tested older clients etc.); I will create a PR and we can discuss further.
cc @dprotaso

ReToCode (Member) commented

@skonto thanks for the tests and repo 🎉

As there is no real clear answer as to which clients could break with this (golang/go#57786 (comment)), @dprotaso what do you think about adding this behind a feature flag and then trying to transition from disabled to enabled by default, but always with a flag to disable this behaviour?

This bug causes a lot of problems for some of our users, and we could ask them to try with full-duplex enabled and see if that causes any issues in a real environment.

skonto (Contributor) commented Oct 20, 2023

@ReToCode my goal, btw, is to have some annotation to enable it per workload and turn it on/off on demand.
The activator can detect that annotation on the revision and set up the right type of connection in its handler (we create a reverse proxy per request, AFAIK).
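
A rough sketch of that per-request detection on the activator side, again using a hypothetical annotation key and assuming the handler already has the resolved Revision and target URL:

package sample

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"

	v1 "knative.dev/serving/pkg/apis/serving/v1"
)

// serveRevision sketches the activator side: a reverse proxy is created per
// request, and full duplex is enabled only when the target revision carries
// the (hypothetical) opt-in annotation.
func serveRevision(w http.ResponseWriter, r *http.Request, rev *v1.Revision, target *url.URL) {
	if strings.EqualFold(rev.Annotations["features.knative.dev/http-full-duplex"], "enabled") { // hypothetical key
		_ = http.NewResponseController(w).EnableFullDuplex()
	}
	httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
}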


moadi commented Oct 20, 2023

@skonto thanks for looking into this! In our Knative cluster we don't generally use the activator for the vast majority of the services (for context, we're running ~40 KN services in the cluster). I presume the proposed changes would still work fine even if the activator is disabled for the services?

skonto (Contributor) commented Oct 26, 2023

Hi @moadi.

I presume the proposed changes would still work fine even if the activator is disabled for the services?

Yes, it should; the idea is that even without the activator on the path, the QP should handle the full-duplex support as well.
