CI: network bind: timeout #23263

Closed
edsantiago opened this issue Jul 12, 2024 · 3 comments · Fixed by #23339

@edsantiago
Member

edsantiago commented Jul 12, 2024

New flake, started July 11, the same day as my local-registry PR (not sure if related, but it seems suspicious). Mostly in the "network bind to 127.0.0.1" test:

    # podman-remote [options] run --network slirp4netns:outbound_addr=127.0.0.1,allow_host_loopback=true -dt quay.io/libpod/alpine:latest nc -w 2 10.0.2.2 5546
           b3a860c2439c32a2ee2f4a363ae08e813f8223bac6148d8a6ad5396390ef999b
           Ncat: Version 7.95 ( https://nmap.org/ncat )
           Ncat: Listening on [::]:5546
           Ncat: Listening on 0.0.0.0:5546

           [FAILED] Timed out after 90.001s.
           command timed out after 90s: [nc -v -n -l -p 5546]
           STDOUT: 
           STDERR: Ncat: Version 7.95 ( https://nmap.org/ncat )
           Ncat: Listening on [::]:5546
           Ncat: Listening on 0.0.0.0:5546
         
           Expected process to exit.  It did not.

...but also in a kube test (weirdly, in pod rm):

  # podman-remote [options] pod rm -fa -t 0

           [FAILED] Timed out after 90.001s.
           command timed out after 90s: [/var/tmp/go/src/github.com/containers/podman/bin/podman-remote --remote --url unix:https:///run/podman/podman-6f170c3cdcc4e5cc9207eaf41298d418fd0cb7f904ae8be3f5ec412f2a5909b0.sock pod rm -fa -t 0]
           STDOUT: 
           STDERR: 
           Expected process to exit.  It did not.

Only common factor is root.

My go-to when I see socket hangs is "somehow I've mixed CNI and netavark". It's possible that my local-registry work is doing that, because it uses system-installed podman. But even on f40 and rawhide? And it seems weird, because that tends to hang EVERYTHING, rather than failing intermittently. I will look into it on Monday.

  • debian-13 : int remote debian-13 root host sqlite [remote]
    • 07-11 23:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • fedora-39 : int remote fedora-39 root host boltdb [remote]
    • 07-12 10:23 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • fedora-40 : int podman fedora-40 root host sqlite
    • 07-11 23:14 in Podman run networking podman run network bind to HostIP
    • 07-11 23:14 in Podman run networking podman run network bind to 127.0.0.1
  • fedora-40 : int remote fedora-40 root host sqlite [remote]
    • 07-11 23:14 in Podman run networking podman run network bind to 127.0.0.1
  • rawhide : int podman rawhide root host sqlite
    • 07-11 23:15 in Podman run networking podman run network bind to 127.0.0.1
  • rawhide : int remote rawhide root host sqlite [remote]
    • 07-12 10:23 in Podman run networking podman run network bind to 127.0.0.1
    • 07-11 23:15 in Podman run networking podman run network bind to 127.0.0.1
int(8) remote(5) rawhide(3) root(8) host(8) sqlite(7)
podman(3) fedora-40(3) boltdb(1)
fedora-39(1)
@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Jul 12, 2024
@Luap99
Member

Luap99 commented Jul 12, 2024

CNI conflicts should no longer be possible, as no CNI code is compiled into the binary as of 5.0.
#23234, which touches networking, also merged the same day and seems a more likely cause than your change.

However, in this case --network slirp4netns is used, which means the normal rootful netavark firewall rules are not used at all; slirp4netns is a user-mode proxy. The one issue I could see is an ordering problem: nc binds the port on the host only after the container has already sent its data via nc. The easiest thing to do is remove the "-dt" so that we get the container output.
However, it is not clear how this would be related to either of our PRs.
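
One way to rule out the ordering problem described above would be to wait until the host-side listener is actually bound before starting the container. The following is a minimal sketch, not code from podman's test suite: `port_is_listening` and `wait_for_listener` are hypothetical helper names, and the approach assumes a Linux host where `/proc/net/tcp` and `/proc/net/tcp6` are readable. It reads the kernel socket table instead of connecting to the port, because a probe connection would use up the single connection that a plain `nc -l` accepts.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (assumption: Linux, /proc/net/tcp-format tables).

# Succeed if PORT is in LISTEN state in the given tables.
# Usage: port_is_listening PORT [TABLE_FILE...]
port_is_listening() {
    local port=$1; shift
    local hexport f
    local -a tables=()
    # The kernel prints ports as 4-digit uppercase hex, e.g. 5546 -> 15AA.
    printf -v hexport '%04X' "$port"
    # Default to the live kernel tables; callers may pass sample files.
    [[ $# -gt 0 ]] || set -- /proc/net/tcp /proc/net/tcp6
    for f in "$@"; do
        [[ -r $f ]] && tables+=("$f")
    done
    [[ ${#tables[@]} -gt 0 ]] || return 1
    # Field 2 is local_address (HEXADDR:HEXPORT); field 4 is the socket
    # state, where 0A means LISTEN. FNR > 1 skips each file's header row.
    awk -v p="$hexport" '
        FNR > 1 {
            n = split($2, a, ":")
            if (toupper(a[n]) == p && $4 == "0A") found = 1
        }
        END { exit !found }
    ' "${tables[@]}"
}

# Poll until PORT is listening, up to TRIES attempts 0.1s apart.
wait_for_listener() {
    local port=$1 tries=${2:-50}
    while (( tries-- > 0 )); do
        port_is_listening "$port" && return 0
        sleep 0.1
    done
    return 1
}

# Illustrative use only (commented out because it needs nc and podman):
#   nc -v -n -l -p 5546 &
#   wait_for_listener 5546 || echo "listener never bound port 5546" >&2
#   podman run --network slirp4netns:... quay.io/libpod/alpine:latest nc -w 2 10.0.2.2 5546
```

If the listener is confirmed bound and the test still times out, the ordering explanation can be set aside and attention can move to what the listener is actually bound to.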

@edsantiago
Member Author

I think this is the same failure we're seeing in gating, except that's a hard failure, not a flake.

Reproduces easily and on the first try in 1mt:

$ # hack/bats 500:"port forward range"                                                                                               
# bats --filter port forward range test/system/500-networking.bats                                                                                              
500-networking.bats                                                                                                                                             
 ✗ [500] podman run port forward range                                                                                                                          
   port 5355 is in use; trying another.                                                                                                                         
   tags: distro-integration                                                                                                                                     
   (from function `basic_teardown' in file test/system/helpers.bash, line 232,                                                                                  
    from function `teardown' in test file test/system/helpers.bash, line 242)                                                                                   
     `basic_teardown' failed                                                                                                                                    
                                                                                                                                                                
   [15:59:51.554142118] # /root/go/podman/bin/podman info --format {{.Host.Slirp4NetNS.Executable}}                                                             
   [15:59:51.654975142] /usr/bin/slirp4netns                                                                                                                    
                                                                                                                                                                
   [15:59:51.824258497] # /root/go/podman/bin/podman run --network bridge -p 5596-5598:5596-5598 -d quay.io/libpod/testimage:20240123 sleep inf                 
   [15:59:52.145671161] c4dbf4902e6569b0b5a52b4671993b1cabccb503f847cee3ec374f647de81714                                                                        
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv                                                                                                              
   #|     FAIL: ncat unexpected exit code                                                                                                                       
   #| expected: -eq 2                                                                                                                                           
   #|   actual:     124                                                                                                                                         
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                              
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv                                                                                                              
   #|     FAIL: ncat error message                                                                                                                              
   #| expected: =~ 127.0.0.1:5596: Address already in use                                                                                                       
   #|   actual:    ''                                                                                                                                           
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                              
                                                                                                                                                                
[several more times]

   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #| FAIL: 6 test assertions failed. Search for 'FAIL:' above this line.
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Oof. I think I have it. -v to the rescue:

# nc -l -n -v -p 9012 127.0.0.1
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Listening on 127.0.0.1:31337   <----- this is not 9012

! reorder args, remove -p, put port at end
# nc -l -n -v 127.0.0.1 9012
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Listening on 127.0.0.1:9012    <------ this is

Studying some more. Will file PR if this solves it.
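
The transcript above shows why `-v` matters: the listener came up on 31337, which is Ncat's default listening port, rather than the requested 9012. A small check on the verbose output can catch this kind of mismatch early. This is only a sketch: `ncat_listen_ports` and `ncat_listens_on` are hypothetical helper names, and the parsing assumes the "Ncat: Listening on ADDR:PORT" line format seen in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: inspect ncat -v output read from stdin.

# Print each distinct port named in "Ncat: Listening on ADDR:PORT" lines.
ncat_listen_ports() {
    local line
    # Match up to the last colon before the digits, so that both
    # 0.0.0.0:5546 and [::]:5546 yield 5546.
    local re='Ncat: Listening on [^[:space:]]*:([0-9]+)'
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}"
        fi
    done | sort -u
}

# Succeed only if ncat reported exactly the expected port and no other.
ncat_listens_on() {
    local expected=$1 ports
    ports=$(ncat_listen_ports)
    [[ $ports == "$expected" ]]
}

# Demo using the output quoted in this thread (no nc required).
sample='Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Listening on 127.0.0.1:31337'
if ncat_listens_on 9012 <<<"$sample"; then
    echo "listening on 9012"
else
    echo "wrong port: $(ncat_listen_ports <<<"$sample")"   # prints "wrong port: 31337"
fi
```

Because the IPv4 and IPv6 lines are collapsed with `sort -u`, a dual-stack listener on one port still passes, while a listener that fell back to a different port fails the check.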

@Luap99
Member

Luap99 commented Jul 18, 2024

I don't think the gating test is related to this at all.

@Luap99 Luap99 self-assigned this Jul 19, 2024
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Oct 21, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Oct 21, 2024