
enable GSO, disable if sending fails for a particular address #4005

Merged
merged 3 commits into master from gso-detection
Aug 18, 2023

Conversation

marten-seemann
Member

Fixes #3911.

@codecov

codecov bot commented Jul 30, 2023

Codecov Report

Merging #4005 (5200f27) into master (83c00a5) will increase coverage by 0.31%.
The diff coverage is 78.00%.

@@            Coverage Diff             @@
##           master    #4005      +/-   ##
==========================================
+ Coverage   82.94%   83.25%   +0.31%     
==========================================
  Files         147      147              
  Lines       14781    14800      +19     
==========================================
+ Hits        12260    12321      +61     
+ Misses       2023     1981      -42     
  Partials      498      498              
Files Changed Coverage Δ
packet_handler_map.go 84.52% <ø> (ø)
transport.go 67.30% <25.00%> (-0.25%) ⬇️
sys_conn_helper_nonlinux.go 25.00% <50.00%> (+25.00%) ⬆️
sys_conn_oob.go 70.49% <66.67%> (+7.92%) ⬆️
server.go 75.89% <75.00%> (ø)
sys_conn_helper_linux.go 72.31% <76.92%> (+6.92%) ⬆️
send_conn.go 85.45% <86.05%> (+30.45%) ⬆️
sys_conn.go 80.00% <100.00%> (+2.81%) ⬆️
sys_conn_df_linux.go 70.83% <100.00%> (+29.32%) ⬆️

@marten-seemann marten-seemann force-pushed the gso-detection branch 5 times, most recently from 5b0c2d4 to 8154148 Compare July 30, 2023 17:08
@marten-seemann marten-seemann changed the base branch from transport-disable-gso to master July 30, 2023 17:08
@marten-seemann
Member Author

@zllovesuki @otbutz @bt90 @kgersen Could you try out this PR and see if it actually resolves #3911? I'll leave this PR open for a few days, and if it works, I'll ship v0.38 very soon to get GSO support out.

@zllovesuki
Contributor

@marten-seemann Negative; testing with this PR in a container on a Raspberry Pi still encounters problems.

@marten-seemann
Member Author

@zllovesuki Thanks for testing! Can you provide some more details? What's the error you're getting?

@zllovesuki
Contributor

zllovesuki commented Jul 30, 2023

$ sudo k3s kubectl -n kube-system logs specter-tunnel-596bbf79cd-g9jxb
{"level":"info","ts":1690746649.3391173,"caller":"client/client.go:150","msg":"Connected to specter server","id":8319038888190,"addr":"[redacted]"}
{"level":"info","ts":1690746649.4465966,"caller":"client/client.go:366","msg":"Reusing existing client certificate","id":8319038888190,"identity":{"id":8319038888190,"token":"..."}}
{"level":"info","ts":1690746650.189319,"caller":"client/client.go:150","msg":"Connected to specter server","id":8319038888190,"addr":"[redacted]"}
{"level":"info","ts":1690746650.924124,"caller":"client/client.go:150","msg":"Connected to specter server","id":8319038888190,"addr":"[redacted]"}
{"level":"info","ts":1690746650.9242585,"caller":"client/client.go:105","msg":"Waiting for RTT measurement...","id":8319038888190,"max":3}
{"level":"info","ts":1690746652.60327,"caller":"client/client.go:465","msg":"Synchronizing tunnels in config file with specter","id":8319038888190,"tunnels":3}
{"level":"error","ts":1690746652.8073633,"caller":"client/client.go:470","msg":"Failed to query available hostnames","id":8319038888190,"error":"twirp error internal: failed to do request: Post \"https://[redacted]:443/twirp/protocol.TunnelService/RegisteredHostnames\": INTERNAL_ERROR (local): write udp [::]:54363->[redacted]:443: sendmsg: invalid argument","stacktrace":"kon.nect.sh/specter/tun/client.(*Client).SyncConfigTunnels\n\t/app/tun/client/client.go:470\nkon.nect.sh/specter/tun/client.(*Client).Initialize\n\t/app/tun/client/client.go:109\nkon.nect.sh/specter/cmd/client.cmdTunnel\n\t/app/cmd/client/tunnel.go:115\ngithub.com/urfave/cli/v2.(*Command).Run\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:274\ngithub.com/urfave/cli/v2.(*Command).Run\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:267\ngithub.com/urfave/cli/v2.(*Command).Run\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:267\ngithub.com/urfave/cli/v2.(*App).RunContext\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:332\nmain.main\n\t/app/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
{"level":"info","ts":1690746652.807639,"caller":"client/client.go:734","msg":"Listening for tunnel traffic","id":8319038888190}
{"level":"info","ts":1690746652.808922,"caller":"client/server.go:210","msg":"Local server started","id":8319038888190,"listen":"[::]:1180"}
{"level":"info","ts":1690746683.0883477,"caller":"overlay/transport.go:133","msg":"Potential connection reuse conflict, retrying to get previously cached connection","peer":{"address":"[redacted]:443"},"error":"invalid state: other peer has cached connection while we are establishing a new outgoing connection"}
{"level":"info","ts":1690746683.4750397,"caller":"overlay/transport.go:133","msg":"Potential connection reuse conflict, retrying to get previously cached connection","peer":{"address":"[redacted]:443"},"error":"invalid state: other peer has cached connection while we are establishing a new outgoing connection"}
{"level":"error","ts":1690746683.4752026,"caller":"overlay/transport.go:140","msg":"Failed to establish connection","peer":{"address":"[redacted]:443"},"error":"invalid state: other peer has cached connection while we are establishing a new outgoing connection","stacktrace":"kon.nect.sh/specter/overlay.(*QUIC).getCachedConnection\n\t/app/overlay/transport.go:140\nkon.nect.sh/specter/overlay.(*QUIC).DialStream\n\t/app/overlay/transport.go:195\nkon.nect.sh/specter/spec/rpc.getDynamicDialer.func1\n\t/app/spec/rpc/rpc.go:46\nnet/http.(*Transport).customDialTLS\n\t/usr/local/go/src/net/http/transport.go:1324\nnet/http.(*Transport).dialConn\n\t/usr/local/go/src/net/http/transport.go:1590\nnet/http.(*Transport).dialConnFor\n\t/usr/local/go/src/net/http/transport.go:1456"}
{"level":"info","ts":1690746684.1898744,"caller":"overlay/transport.go:133","msg":"Potential connection reuse conflict, retrying to get previously cached connection","peer":{"address":"[redacted]:443"},"error":"invalid state: other peer has cached connection while we are establishing a new outgoing connection"}
{"level":"info","ts":1690746684.7732158,"caller":"overlay/transport.go:133","msg":"Potential connection reuse conflict, retrying to get previously cached connection","peer":{"address":"[redacted]:443"},"error":"invalid state: other peer has cached connection while we are establishing a new outgoing connection"}
{"level":"error","ts":1690746684.7733483,"caller":"overlay/transport.go:140","msg":"Failed to establish connection","peer":{"address":"[redacted]:443"},"error":"invalid state: other peer has cached connection while we are establishing a new outgoing connection","stacktrace":"kon.nect.sh/specter/overlay.(*QUIC).getCachedConnection\n\t/app/overlay/transport.go:140\nkon.nect.sh/specter/overlay.(*QUIC).DialStream\n\t/app/overlay/transport.go:195\nkon.nect.sh/specter/spec/rpc.getDynamicDialer.func1\n\t/app/spec/rpc/rpc.go:46\nnet/http.(*Transport).customDialTLS\n\t/usr/local/go/src/net/http/transport.go:1324\nnet/http.(*Transport).dialConn\n\t/usr/local/go/src/net/http/transport.go:1590\nnet/http.(*Transport).dialConnFor\n\t/usr/local/go/src/net/http/transport.go:1456"}
{"level":"info","ts":1690746684.7735217,"caller":"client/client.go:296","msg":"Some connections have failed, opening more connections to specter server","id":8319038888190,"dead":2}

@zllovesuki
Contributor

The retry attempts mean that it "lost" connectivity with the remotes, and the initial request also failed with "invalid argument".

@marten-seemann
Member Author

Ok, so it looks like the GSO error detection (in isGSOError) doesn't trigger, since you're getting an "invalid argument" error. I'm wondering if this is because your kernel doesn't support GSO (maybe it's really old?), and we could fix that by checking for GSO support as suggested in #3911 (comment).

@jfgiorgi
Contributor

jfgiorgi commented Jul 30, 2023

same error as before (sendmsg: invalid argument).

(this is @kgersen on my work account)

@bt90
Contributor

bt90 commented Jul 30, 2023

(possibly?) related comment from @jwhited

WireGuard/wireguard-go#75 (comment)

@marten-seemann
Member Author

What OS / kernel version are you using?

@jfgiorgi
Contributor

same as before: see #3911 (comment)

It's an issue for IPv6 only.
These machines don't pass the Linux kernel selftests, so maybe don't bother accommodating us.
We do use Tailscale on these configs, but without IPv6 for another reason (lack of some IPv6 routing feature), so the issue doesn't arise for Tailscale.

@zllovesuki
Contributor

The Pi is running:
$ uname -r
6.1.21-v8+

@marten-seemann marten-seemann force-pushed the gso-detection branch 3 times, most recently from 68d3caa to 9532779 Compare July 30, 2023 21:02
@marten-seemann
Member Author

I added kernel GSO support detection by checking the error return value from unix.GetsockoptInt(int(fd), unix.IPPROTO_UDP, unix.UDP_SEGMENT). Can you please try if that fixes the issue, @zllovesuki and @jfgiorgi?
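A self-contained sketch of that probe, using only the stdlib syscall package (quic-go itself uses golang.org/x/sys/unix; the UDP_SEGMENT constant is defined manually below because syscall does not export it):

```go
package main

import (
	"fmt"
	"syscall"
)

// UDP_SEGMENT from <linux/udp.h>. The stdlib syscall package does not export
// it; quic-go uses unix.UDP_SEGMENT from golang.org/x/sys/unix instead.
const udpSegment = 103

// supportsGSO sketches the probe described above: on kernels without UDP GSO,
// getsockopt(IPPROTO_UDP, UDP_SEGMENT) fails (e.g. with ENOPROTOOPT), so a
// nil error means the option, and thus GSO, is available. On non-Linux
// systems the getsockopt call simply fails and the probe reports false.
func supportsGSO() bool {
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_DGRAM, 0)
	if err != nil {
		return false
	}
	defer syscall.Close(fd)
	_, err = syscall.GetsockoptInt(fd, syscall.IPPROTO_UDP, udpSegment)
	return err == nil
}

func main() {
	fmt.Println("kernel GSO support:", supportsGSO())
}
```

The probe is cheap and side-effect free, so it can run once at socket setup to decide whether to attempt segmented sends at all.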

@zllovesuki
Contributor

Negative, still getting sendmsg: invalid argument.

@marten-seemann
Member Author

marten-seemann commented Jul 30, 2023

I really don't know what else to do. There's not much more we can do than check for kernel support and for the specific error.

As far as I can tell, our detection logic is equivalent to the one in wireguard: https://github.com/WireGuard/wireguard-go/pull/75/files#diff-2db94c8b4e96b109a13cdbd5b5828184ceeb129ded12b67b7b3d3b22b281287c.

@marten-seemann
Member Author

@zllovesuki I finally got my hands on a Raspberry Pi (Model 4) myself.

I can't reproduce the issue though. I can do QUIC downloads from QUIC servers on the internet (using the wlan0 interface), with GSO enabled on the interface. I can also transfer data from the Raspberry Pi to my computer when connected via the ethernet interface. Kernel version seems to be the same: 6.1.21-v8+.

Can you please provide more details what exactly you're running on your machine that causes this error?

@zllovesuki
Contributor

@marten-seemann If you want to replicate my environment exactly, install k3s on the Raspberry Pi, then run the program in a deployment.

@zllovesuki
Contributor

Both eth0 and wlan0 are connected, but eth0 is the primary interface.

@marten-seemann
Member Author

I’d like to avoid touching Kubernetes. Could you provide a minimum reproducer for this? It’s probably sufficient to run it in Docker, maybe?

@zllovesuki
Contributor

I mean, I don't know if the k3s distribution does something specific with how it starts a container. You can try running it in Docker first and see if you can replicate it.

@marten-seemann
Member Author

@zllovesuki What exactly should I run in Docker?

@zllovesuki
Contributor

@marten-seemann try running a docker container/k8s deployment with miragespace/specter:gso_detection image, with args client tunnel -c /path/to/config.yaml, then mount the config.yaml with the initial content of:

version: 2
apex: fly.specter.im:443
tunnels:
  - target: http:https://127.0.0.1:8080

@marten-seemann
Member Author

I tried running a QUIC server inside of Docker, and doing QUIC transfers both to the Raspi and to my computer, connected via Ethernet, as well as via WiFi via Tailscale. Everything works as expected.

I'm beginning to wonder if there's anything to fix here at all. Maybe @zllovesuki's setup is just so horribly broken that we don't actually need to make any accommodations for it (other than providing a config flag to turn GSO off). Maybe we should just merge this PR as is.

@zllovesuki If you want me to continue debugging this, I will need clearer instructions from you. I have no idea how to make sense of your last message. A docker-compose file that does all the setup would be appreciated (no Kubernetes please). Please also include instructions how to exactly hit the endpoint to elicit the error message.

@zllovesuki
Contributor

@marten-seemann OK, I tried to reproduce it on a bare Raspberry Pi and it didn't error; it didn't error in Docker either, only inside k3s/containerd. Let's consider this resolved for now, and I will just turn GSO off when running in k3s.

@zllovesuki
Contributor

Can we add the environment variable back to disable GSO if needed?

@marten-seemann
Member Author

> @marten-seemann OK, I tried to reproduce it on bare raspberry pi, it didn't error, nor in docker, but only inside of k3s/containerd.

Interesting. Does it happen in a bare containerd, without k3s?

@zllovesuki
Contributor

Not sure how to run it in just containerd, although Docker and k3s both use containerd underneath. They must have different ways of setting up namespaces or networking.

@marten-seemann
Member Author

I re-added the QUIC_GO_DISABLE_GSO environment variable.

@marten-seemann marten-seemann changed the title from "disable GSO if sending fails for a particular remote address" to "enable GSO by default, disable if sending fails for a particular remote address" Aug 18, 2023
@marten-seemann marten-seemann changed the title from "enable GSO by default, disable if sending fails for a particular remote address" to "enable GSO by default, disable if sending fails for a particular address" Aug 18, 2023
@marten-seemann marten-seemann changed the title from "enable GSO by default, disable if sending fails for a particular address" to "enable GSO, disable if sending fails for a particular address" Aug 18, 2023
@marten-seemann marten-seemann merged commit f7f4872 into master Aug 18, 2023
33 checks passed
@marten-seemann marten-seemann deleted the gso-detection branch August 18, 2023 06:02
Successfully merging this pull request may close these issues.

GSO fails on some platforms / interfaces
5 participants