Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESP32-WIFI packet clogging #12446

Closed
casaroli opened this issue Jun 3, 2024 · 22 comments
Closed

ESP32-WIFI packet clogging #12446

casaroli opened this issue Jun 3, 2024 · 22 comments
Assignees

Comments

@casaroli
Copy link
Contributor

casaroli commented Jun 3, 2024

Hello,

I am having a strange issue and I am not sure if this is in esp32 wifi driver or anything else in NuttX, so I describe here to see if anyone else can reproduce or point me where to look next.

I am using NuttX and apps master branches, configuration esp32-devkitc:wifi and building with xtensa-esp-elf-gcc (crosstool-NG esp-13.2.0_20240305) 13.2.0.

So after booting, I connect to a managed wifi network which I can monitor with tcpdump(8):

In nsh, we connect to the wifi network.

NuttShell (NSH) NuttX-12.5.1
nsh> 
nsh> wapi psk wlan0 "mypassword" 3
nsh> wapi essid wlan0 "myhome" 1
nsh> renew wlan0

I can see it worked because there are some DHCP and ARP in tcpdump:

15:00:29.593930 IP 10.0.0.2.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:b2:1c:b4:62:28, length 265
15:00:29.593942 IP 10.0.0.2.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:b2:1c:b4:62:28, length 265
15:00:29.594238 IP 10.42.0.1.bootps > 10.42.0.205.bootpc: BOOTP/DHCP, Reply, length 300
15:00:29.602333 IP 10.42.0.205.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:b2:1c:b4:62:28, length 272
15:00:29.602397 IP 10.42.0.205.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:b2:1c:b4:62:28, length 272
15:00:29.605776 IP 10.42.0.1.bootps > 10.42.0.205.bootpc: BOOTP/DHCP, Reply, length 300
15:00:35.101796 ARP, Request who-has 10.42.0.205 tell 10.42.0.1, length 28
15:00:35.107316 ARP, Reply 10.42.0.205 is-at b0:b2:1c:b4:62:28, length 28

So then I send a ICMP echo from NuttX:

nsh> ping -c1 8.8.8.8
PING 8.8.8.8 56 bytes of data
No response from 8.8.8.8: icmp_seq=0 time=1000 ms
1 packets transmitted, 0 received, 100% packet loss, time 1020 ms

ok so packet is lost, not a big deal, however I is not even captured by tcpdump (i.e. it was not even sent).

So I ping a different address:

nsh> ping -c1 8.8.4.4
PING 8.8.4.4 56 bytes of data
56 bytes from 8.8.4.4: icmp_seq=0 time=50.0 ms
1 packets transmitted, 1 received, 0% packet loss, time 1020 ms
rtt min/avg/max/mdev = 50.000/50.000/50.000/0.000 ms

So now it works and tcpdump shows it:

15:00:56.640000 IP 10.42.0.205 > 8.8.4.4: ICMP echo request, id 2, seq 0, length 64
15:00:56.687513 IP 8.8.4.4 > 10.42.0.205: ICMP echo reply, id 2, seq 0, length 64

So far so good, nothing weird apart from the fact that the first packet was not even sent in the air. Lets try the ping to the original address again:

nsh> ping -c1 8.8.8.8
PING 8.8.8.8 56 bytes of data
56 bytes from 8.8.8.8: icmp_seq=0 time=70.0 ms
1 packets transmitted, 1 received, 0% packet loss, time 1010 ms
rtt min/avg/max/mdev = 70.000/70.000/70.000/0.000 ms

Now it worked but tcpdump shows the following:

15:01:02.251981 IP 10.42.0.205 > 8.8.8.8: ICMP echo request, id 1, seq 0, length 64
15:01:02.251982 IP 10.42.0.205 > 8.8.8.8: ICMP echo request, id 3, seq 0, length 64
15:01:02.311513 IP 8.8.8.8 > 10.42.0.205: ICMP echo reply, id 1, seq 0, length 64
15:01:02.312913 IP 8.8.8.8 > 10.42.0.205: ICMP echo reply, id 3, seq 0, length 64

WAT???

It looks like our first ICMP echo request (id 1) was clogged somewhere, and it was unclogged when we sent another ICMP echo request (id 3) to the same address. And both packets were transmitted (and got a reply). If I keep sending ICMP echos to 8.8.4.4 (like the one with id 2), it will not unclog. I could only manage to make the packet go whenever a new packet is sent to that same IP address.

In fact, this is a extreme simplification of the problem, as the real problem actually happens with TCP connections (and SYN packets), however to simplify I could reproduce with ICMP. So I believe it might be the same problem (for both TCP and ICMP).

I could not reproduce this in sim

Well, AFAIK there is nothing keeping different queues for packets with different destination addresses, so I have no idea where the problem can be.

Anyone has a clue?

@tmedicci tmedicci self-assigned this Jun 3, 2024
@acassis
Copy link
Contributor

acassis commented Jun 3, 2024

@casaroli I remember that even for TCP/IP in the past ping was failing for the very first package or delaying 1s, but AFAIK it was fixed. Hi @wengzhe do you know more about it?

@tmedicci
Copy link
Contributor

tmedicci commented Jun 3, 2024

@casaroli , I tried to reproduce the issue here but I wasn't able to reproduce the same behavior:

Eventually I got the 1st ping lost (I can't see it on tcpdump), but It was not sent after pinging the other IP address:

15:44:49.836494 IP 192.168.1.150 > 8.8.4.4: ICMP echo request, id 2, seq 0, length 64
15:44:49.856767 IP 8.8.4.4 > 192.168.1.150: ICMP echo reply, id 2, seq 0, length 64
15:44:54.122817 IP 192.168.1.150 > 8.8.8.8: ICMP echo request, id 3, seq 0, length 64
15:44:54.143688 IP 8.8.8.8 > 192.168.1.150: ICMP echo reply, id 3, seq 0, length 64

Can you share your binaries, please? (don't forget to include the bootloader and the partition table binaries)

@wengzhe
Copy link
Contributor

wengzhe commented Jun 3, 2024

@casaroli I remember that even for TCP/IP in the past ping was failing for the very first package or delaying 1s, but AFAIK it was fixed. Hi @wengzhe do you know more about it?

@acassis AFAIK, on the current master branch, if we don't enable CONFIG_NET_ARP_SEND, the very first packet will still be changed into an ARP packet for the target (or gateway) if we don't have the needed MAC address, and enabling CONFIG_NET_ARP_SEND may prevent the first packet from being lost (another ARP will be sent before this ICMP). But I also don't know how id 1 could be sent delayed, it's strange and seems not a normal behavior our net stack can do.

@tmedicci
Copy link
Contributor

tmedicci commented Jun 3, 2024

@casaroli I remember that even for TCP/IP in the past ping was failing for the very first package or delaying 1s, but AFAIK it was fixed. Hi @wengzhe do you know more about it?

@acassis AFAIK, on the current master branch, if we don't enable CONFIG_NET_ARP_SEND, the very first packet will still be changed into an ARP packet for the target (or gateway) if we don't have the needed MAC address, and enabling CONFIG_NET_ARP_SEND may prevent the first packet from being lost (another ARP will be sent before this ICMP). But I also don't know how id 1 could be sent delayed, it's strange and seems not a normal behavior our net stack can do.

I can confirm that after enabling it CONFIG_NET_ARP_SEND=y no 1st ping requests (id 1) were lost in more than 10 attempts. @casaroli , can you try it too?

@casaroli
Copy link
Contributor Author

casaroli commented Jun 3, 2024

I will try and share the binaries, config and outcome.

@casaroli
Copy link
Contributor Author

casaroli commented Jun 3, 2024

started from a clean tree

./tools/configure.sh -l esp32-devkitc:wifi
make
make bootloader
make flash ESPTOOL_PORT=/dev/ttyUSB0 ESPTOOL_BINDIR=.

Then I reboot the device and try the following commands. I got the issue on the 3rd try:

nsh> wapi psk wlan0 "Password1" 3
nsh> wapi essid wlan0 "nolife" 1
nsh> renew wlan0
nsh> arp -a 10.42.0.1
nsh: arp: no such ARP entry: 10.42.0.1
nsh> arp -a 10.42.0.1
HWaddr: 7e:94:b3:2b:5e:80
nsh> ping -c1 8.8.8.8
PING 8.8.8.8 56 bytes of data
No response from 8.8.8.8: icmp_seq=0 time=1000 ms
1 packets transmitted, 0 received, 100% packet loss, time 1010 ms
nsh> ping -c1 8.8.4.4
PING 8.8.4.4 56 bytes of data
56 bytes from 8.8.4.4: icmp_seq=0 time=30.0 ms
1 packets transmitted, 1 received, 0% packet loss, time 1010 ms
rtt min/avg/max/mdev = 30.000/30.000/30.000/0.000 ms
nsh> ping -c1 8.8.8.8
PING 8.8.8.8 56 bytes of data
56 bytes from 8.8.8.8: icmp_seq=0 time=40.0 ms
1 packets transmitted, 1 received, 0% packet loss, time 1010 ms
rtt min/avg/max/mdev = 40.000/40.000/40.000/0.000 ms
nsh> 

see that I waited for the ARP table to fill before I sent the first echo.

On the 3rd echo, the 1st went through also:

Screenshot from 2024-06-03 19-36-49

This is with CONFIG_NET_ARP_SEND disabled. I will test with it enabled.

I shared the binaries here:
https://drive.google.com/drive/folders/1MiLvaIrX1Upqc7l1gtysVW1EDQRkDdQH?usp=sharing

@casaroli
Copy link
Contributor Author

casaroli commented Jun 3, 2024

Note: when I sent the packet before the ARP table is populated, and without CONFIG_NET_ARP_SEND enabled, it drops (as expected) and the packets are not transmitted later, I will make the images with CONFIG_NET_ARP_SEND enabled but I think the issue is not the ARP table.

@acassis
Copy link
Contributor

acassis commented Jun 3, 2024

@casaroli I think Greg @patacongo explained this issue in the pass, it was caused by some initialization that was done during the first packet processing.

@casaroli
Copy link
Contributor Author

casaroli commented Jun 3, 2024

So here is with CONFIG_NET_ARP_SEND enabled. Same issue. See that ARP_SEND was not even used because I waited until the ARP table was populated (with the gateway). This time I got the problem on the 2nd try:

NuttShell (NSH) NuttX-12.5.1
nsh> 
nsh> 
nsh> wapi psk wlan0 "fuckfuck" 3
nsh> wapi essid wlan0 "nolife" 1
nsh> renew wlan0
nsh> arp -a 10.42.0.1
nsh: arp: no such ARP entry: 10.42.0.1
nsh> arp -a 10.42.0.1
HWaddr: 7e:94:b3:2b:5e:80
nsh> ping -c1 8.8.8.8
PING 8.8.8.8 56 bytes of data
No response from 8.8.8.8: icmp_seq=0 time=1000 ms
1 packets transmitted, 0 received, 100% packet loss, time 1020 ms
nsh> ping -c1 8.8.4.4
PING 8.8.4.4 56 bytes of data
56 bytes from 8.8.4.4: icmp_seq=0 time=60.0 ms
1 packets transmitted, 1 received, 0% packet loss, time 1010 ms
rtt min/avg/max/mdev = 60.000/60.000/60.000/0.000 ms
nsh> ping -c1 8.8.8.8
PING 8.8.8.8 56 bytes of data
56 bytes from 8.8.8.8: icmp_seq=0 time=60.0 ms
1 packets transmitted, 1 received, 0% packet loss, time 1010 ms
rtt min/avg/max/mdev = 60.000/60.000/60.000/0.000 ms

image

Files are here, bootloader and partition table are the same: https://drive.google.com/drive/folders/1UElulLX0Qk_H43KSePl_OYuzLUVJBnQE?usp=sharing

@casaroli
Copy link
Contributor Author

casaroli commented Jun 3, 2024

I could trace the packet down to esp_wifi_sta_send_data or esp_wifi_internal_tx so I think the packet was successfully delivered to the wifi controller and NuttX network stack is done. But I am not 100% sure because this is hard to debug in practice.

I can also provide an image with CONFIG_DEBUG_NET_INFO and CONFIG_DEBUG_WIRELESS_INFO enabled, and the logs will show that the network stack worked correctly (in this scenario)

@casaroli
Copy link
Contributor Author

casaroli commented Jun 3, 2024

Here are logs with info enabled:

nsh> ping -c1 8.8.8.8
PING 8.8.8.8 56 bytes of data
sendto_eventhandler: flags: 4000
sendto_eventhandler: Send ICMP request
ipv4_build_header: IPv4 Packet: ipid:3, length: 84
sendto_request: Outgoing ICMP packet length: 84
sendto_eventhandler: Resuming
esp_wifi_sta_send_data: SEND DATA 98
No response from 8.8.8.8: icmp_seq=0 time=1000 ms
1 packets transmitted, 0 received, 100% packet loss, time 1020 ms
nsh> ping -c1 8.8.4.4
PING 8.8.4.4 56 bytes of data
sendto_eventhandler: flags: 4000
sendto_eventhandler: Send ICMP request
ipv4_build_header: IPv4 Packet: ipid:4, length: 84
sendto_request: Outgoing ICMP packet length: 84
sendto_eventhandler: Resuming
esp_wifi_sta_send_data: SEND DATA 98
wlan_rxpoll: IPv4 frame
icmp_poll_eventhandler: flags: 0002
icmp_datahandler: Buffered 84 bytes
icmp_readahead: Received 64 bytes (of 84)
56 bytes from 8.8.4.4: icmp_seq=0 time=40.0 ms
1 packets transmitted, 1 received, 0% packet loss, time 1010 ms
rtt min/avg/max/mdev = 40.000/40.000/40.000/0.000 ms
nsh> ping -c1 8.8.8.8
PING 8.8.8.8 56 bytes of data
sendto_eventhandler: flags: 4000
sendto_eventhandler: Send ICMP request
ipv4_build_header: IPv4 Packet: ipid:5, length: 84
sendto_request: Outgoing ICMP packet length: 84
sendto_eventhandler: Resuming
esp_wifi_sta_send_data: SEND DATA 98
wlan_rxpoll: IPv4 frame
icmp_input: WARNING: Unknown ICMP cmd: 0
wlan_rxpoll: IPv4 frame
icmp_poll_eventhandler: flags: 0002
icmp_datahandler: Buffered 84 bytes
icmp_readahead: Received 64 bytes (of 84)
56 bytes from 8.8.8.8: icmp_seq=0 time=50.0 ms
1 packets transmitted, 1 received, 0% packet loss, time 1010 ms
rtt min/avg/max/mdev = 50.000/50.000/50.000/0.000 ms
nsh> 

This line I added, it shows the esp_wifi_sta_send_data was called:

esp_wifi_sta_send_data: SEND DATA 98

It is interesting also this line:

icmp_input: WARNING: Unknown ICMP cmd: 0

which I think is related to the reply to the first echo (id=1) -- that we are not expecting.

packet capture looks like the same:

image

files: https://drive.google.com/drive/folders/1cdxsfDrmnlf4ZhX6-Uq_N_CJ-hBixfEv?usp=drive_link

@acassis
Copy link
Contributor

acassis commented Jun 3, 2024

@casaroli I remember that even for TCP/IP in the past ping was failing for the very first package or delaying 1s, but AFAIK it was fixed. Hi @wengzhe do you know more about it?

@acassis AFAIK, on the current master branch, if we don't enable CONFIG_NET_ARP_SEND, the very first packet will still be changed into an ARP packet for the target (or gateway) if we don't have the needed MAC address, and enabling CONFIG_NET_ARP_SEND may prevent the first packet from being lost (another ARP will be sent before this ICMP). But I also don't know how id 1 could be sent delayed, it's strange and seems not a normal behavior our net stack can do.

Thank you very much!

@patacongo
Copy link
Contributor

@casaroli I think Greg @patacongo explained this issue in the pass, it was caused by some initialization that was done during the first packet processing.

This seems to be a different issue. I only recall explaining packet loss when packets with CONFIG_ARP_SEND=n.

CONFIG_ARP_SEND is one of those mandatory options. Select 'n' for a broken system, select 'y' for normal behavior. I am in favor of removing mandatory options.

@acassis
Copy link
Contributor

acassis commented Jun 3, 2024

I think instead of removing it could be enabled by default, this way avoid bloating NuttX and will work by default.

@casaroli
Copy link
Contributor Author

casaroli commented Jun 3, 2024

But, please notice that the reported issue happens even with CONFIG_NET_ARP_SEND enabled

@patacongo
Copy link
Contributor

But I also don't know how id 1 could be sent delayed, it's strange and seems not a normal behavior our net stack can do.

Was it delayed? Or sent twice?

Notice that ECHO seq=3 is sent twice. That should not happen and if the sending were under software control, then the sequence number should have been incremented. This suggests that some driver related buffering issue may be at fault. Could a stale packet buffer have been sent?

@casaroli
Copy link
Contributor Author

casaroli commented Jun 3, 2024

But I also don't know how id 1 could be sent delayed, it's strange and seems not a normal behavior our net stack can do.

Was it delayed? Or sent twice?

Notice that ECHO seq=3 is sent twice. That should not happen and if the sending were under software control, then the sequence number should have been incremented. This suggests that some driver related buffering issue may be at fault. Could a stale packet buffer have been sent?

Sorry. I don't see it being sent twice. I see the packet with id=1 being sent just before the id=3 (the id=3 packet unclogged the packet with id=1).

All the sequence numbers are 0/0 because I used ping -c1

So this is not retransmission and this is not ARP table problem. I think it is something inside the wifi driver or the controller. Not inside NuttX itself.

@tmedicci
Copy link
Contributor

tmedicci commented Jun 3, 2024

Hi @casaroli,

I couldn't reproduce the issue even when using your firmware (I tested the first firmware you sent, with CONFIG_NET_ARP_SEND disabled). If I wait for the ARP table to be filled, I couldn't make the 1st ping failing (same result with CONFIG_NET_ARP_SEND=y). If I don't wait for ARP, the first packet fails, but it is not transmitted later (same result of my initial tests) .

Can you test with another network?

@patacongo
Copy link
Contributor

All the sequence numbers are 0/0 because I used ping -c1

But shouldn't the ID have changed from 3?

@casaroli
Copy link
Contributor Author

casaroli commented Jun 4, 2024

All the sequence numbers are 0/0 because I used ping -c1

But shouldn't the ID have changed from 3?

I send 3 different ping -c1 commands:

First one (id=1) to 8.8.8.8 gets stuck somewhere (and we get a timeout)

Second one (id=2) goes fine to 8.8.4.4

Then the third one goes fine to 8.8.8.8 but the first one (id=1) also gets sent at this time.

This means the first icmp echo request got stuck somewhere and then later it gets unstuck. This is the weird behavior I am reporting.

@casaroli
Copy link
Contributor Author

casaroli commented Jun 4, 2024

Can you test with another network?

Yes I will test on a different network. I think I can also have a wifi monitor capturing lower level wifi signalling frames that might be useful for debugging. Will share the results soon.

Note: if I send the packet before the ARP table gets filled, the problem does not happen (well the packet is still lost as expected, but it is not transmitted later -- as expected)

tmedicci added a commit to tmedicci/incubator-nuttx that referenced this issue Jun 5, 2024
Enable logic to send ARP requests if the target IP address mapping
does not appear in the ARP table.

Please check the comment in apache#12446 (comment)
@casaroli
Copy link
Contributor Author

casaroli commented Jun 5, 2024

After switching to a more reliable access point, I could see the packets arriving and I can confirm the problem does not happen with ICMP.

However, the original problem with TCP is still hapenning, but since it requires a lot more context, I will open a separate issue.

Thank you for the very fast response time.

Sorry for the noise.

@casaroli casaroli closed this as completed Jun 5, 2024
xiaoxiang781216 pushed a commit that referenced this issue Jun 5, 2024
Enable logic to send ARP requests if the target IP address mapping
does not appear in the ARP table.

Please check the comment in #12446 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants