Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Esp8266 IP Address not reachable after a while #2330

Closed
thehellmaker opened this issue Jul 26, 2016 · 326 comments
Closed

Esp8266 IP Address not reachable after a while #2330

thehellmaker opened this issue Jul 26, 2016 · 326 comments
Labels
component: libraries type: bug waiting for feedback Waiting on additional info. If it's not received, the issue may be closed.
Milestone

Comments

@thehellmaker
Copy link

Hi All,
ESP abecomes unavailable after sometime intermittently where it says
Connection to https://192.168.1.4:80 refused
org.apache.http.conn.HttpHostConnectException: Connection to https://192.168.1.4:80 refused

The same happens from the web browser and then it starts working randomly. Checking the logs of the ESP device itself there is no crash.

Am i missing something in the setup of the server which can keep it alive all the time.

I thought this was a webserver problem but it seems like and ESP issue.
Here is the related issue me-no-dev/ESPAsyncWebServer#54

Cheers,
Akash A

@thehellmaker
Copy link
Author

A similar issue #1137 was reported on dec 2015

@thehellmaker
Copy link
Author

A similar issue also reported in
https://internetofhomethings.com/homethings/?p=426

@thehellmaker
Copy link
Author

As mentioned in me-no-dev/ESPAsyncWebServer#54 I have already tried the approach in the link https://www.esp8266.com/viewtopic.php?p=12809 and its still not working.

Now will analyse using wireshark myself

@thehellmaker
Copy link
Author

Wireshark is recieving ARP broadcast from the module every second because of the fix.

Here is the packet content.
100 7.995514 Espressi_1a:66:47 Broadcast ARP 42 Gratuitous ARP for 192.168.1.6 (Request)
Frame 100: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Ethernet II, Src: Espressi_1a:66:47 (5c:cf:7f:1a:66:47), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Destination: Broadcast (ff:ff:ff:ff:ff:ff)
Address: Broadcast (ff:ff:ff:ff:ff:ff)
.... ..1. .... .... .... .... = LG bit: Locally administered address (this is NOT the factory default)
.... ...1 .... .... .... .... = IG bit: Group address (multicast/broadcast)
Source: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
Address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Type: ARP (0x0806)
Address Resolution Protocol (request/gratuitous ARP)
Hardware type: Ethernet (1)
Protocol type: IPv4 (0x0800)
Hardware size: 6
Protocol size: 4
Opcode: request (1)
[Is gratuitous: True]
Sender MAC address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
Sender IP address: 192.168.1.6
Target MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00)
Target IP address: 192.168.1.6

@thehellmaker
Copy link
Author

thehellmaker commented Jul 26, 2016

For a device to which ARP is responding here is the sequence

  1. Request
    1732 208.713855 IntelCor_c5:37:30 Espressi_1a:66:47 ARP 42 Who has 192.168.1.6? Tell 192.168.1.7
  2. Response
    1733 208.734013 Espressi_1a:66:47 IntelCor_c5:37:30 ARP 42 192.168.1.6 is at 5c:cf:7f:1a:66:47
  3. Request Body
Frame 1732: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
    Interface id: 0 (\Device\NPF_{641ED2C7-4125-43D0-BEF1-205ACE40B627})
    Encapsulation type: Ethernet (1)
    Arrival Time: Jul 26, 2016 21:05:15.359512000 India Standard Time
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1469547315.359512000 seconds
    [Time delta from previous captured frame: 0.374458000 seconds]
    [Time delta from previous displayed frame: 0.374458000 seconds]
    [Time since reference or first frame: 208.713855000 seconds]
    Frame Number: 1732
    Frame Length: 42 bytes (336 bits)
    Capture Length: 42 bytes (336 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ethertype:arp]
    [Coloring Rule Name: ARP]
    [Coloring Rule String: arp]
Ethernet II, Src: IntelCor_c5:37:30 (18:5e:0f:c5:37:30), Dst: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
    Destination: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
        Address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
        Address: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: ARP (0x0806)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
    Sender IP address: 192.168.1.7
    Target MAC address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
    Target IP address: 192.168.1.6
  1. Response Body
Frame 1733: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
    Interface id: 0 (\Device\NPF_{641ED2C7-4125-43D0-BEF1-205ACE40B627})
    Encapsulation type: Ethernet (1)
    Arrival Time: Jul 26, 2016 21:05:15.379670000 India Standard Time
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1469547315.379670000 seconds
    [Time delta from previous captured frame: 0.020158000 seconds]
    [Time delta from previous displayed frame: 0.020158000 seconds]
    [Time since reference or first frame: 208.734013000 seconds]
    Frame Number: 1733
    Frame Length: 42 bytes (336 bits)
    Capture Length: 42 bytes (336 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ethertype:arp]
    [Coloring Rule Name: ARP]
    [Coloring Rule String: arp]
Ethernet II, Src: Espressi_1a:66:47 (5c:cf:7f:1a:66:47), Dst: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
    Destination: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
        Address: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
        Address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: ARP (0x0806)
Address Resolution Protocol (reply)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: reply (2)
    Sender MAC address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
    Sender IP address: 192.168.1.6
    Target MAC address: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
    Target IP address: 192.168.1.7

@thehellmaker
Copy link
Author

thehellmaker commented Jul 26, 2016

Found a very interesting thing.
I am using a Windows 7 OS to debug this issue and here are the findings

  1. ESP is responding to ARP queries where destination is the ESP MAC address
    1918 151.364565 IntelCor_c5:37:30 Espressi_1a:66:47 ARP 42 Who has 192.168.1.6? Tell 192.168.1.7
    1919 151.371335 Espressi_1a:66:47 IntelCor_c5:37:30 ARP 42 192.168.1.6 is at 5c:cf:7f:1a:66:47
  2. ESP is not responding to broadcast ARP pings using nmap.
    3459 254.010073 IntelCor_c5:37:30 Broadcast ARP 42 Who has 192.168.1.6? Tell 192.168.1.7

I will look into the arp query responder code in the codebase

@thehellmaker
Copy link
Author

Looks like the arp requests are completely handled by lwIP project which is what this project is depenent on.
@me-no-dev looks like you imported the project as dependency 4 months back. And i did a diff with the latest version of the project 1.4.1 of lwIP and seems like some broaddcast functionality was added which is not there in the version imported. Did you import the latest version ?

@me-no-dev
Copy link
Collaborator

lwip comes from espressif and not me :) I just tweaked some stuff here and there (not broadcast but multicast). Latest lwip is wip :)

@thehellmaker
Copy link
Author

Upgrade to open source Lwip(1.4.1) from 1.3.2 port as suggested by @igrr The module is still responding to ARP requests.. Waiting to see if it stops.

@igrr
Copy link
Member

igrr commented Aug 1, 2016

Can you make a diff between 1.3.2 and 1.4.1 in the part which deals with ARP? Maybe we can backport the fix instead of updating all of lwip for now.

@thehellmaker
Copy link
Author

Stopped responding to ARP requests on 1.4.1 as well.
The gratuitous ARP that is being sent is not being handled by android devices. Deep diving into the code base to debug further.

51056 1693.352539 Espressi_88:7f:7e Broadcast ARP 42 Gratuitous ARP for 192.168.1.12 (Request)
Frame 51056: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Ethernet II, Src: Espressi_88:7f:7e (5c:cf:7f:88:7f:7e), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request/gratuitous ARP)

@thehellmaker
Copy link
Author

How ever restarting the module takes a new ip address and it starts responding to ARP requests.

@thehellmaker
Copy link
Author

Now I am seeing that the IP Address is also in use by another device which is obvious as ESP didn't respond to ARP request. But ESP has been sending gratuitous ARP and here is the wire shark capture

5495591 37123.051429 00:e1:40:46:09:6c Broadcast ARP 42 Gratuitous ARP for 192.168.1.5 (Request) (duplicate use of 192.168.1.5 detected!)

@thehellmaker
Copy link
Author

thehellmaker commented Aug 12, 2016

This is not an issue with ARP as most people have pointed out. This has something to do with the wireless connectivity stability.

I see debug logs right after the module stops responding to ARP saying
wifi evt: 7
add 1
aid 1
station: 40:88:05:b1:29:eb join, AID = 1
wifi evt: 5
wifi evt: 7
bcn_timout,ap_probe_send_start

This seems to be the root cause. I have attached the full log here.
https://drive.google.com/open?id=0B8DXcb9GfNuARFZGdy1USGNPbFk

@thehellmaker
Copy link
Author

thehellmaker commented Aug 12, 2016

Attaching Enums that the event numbers point to


Both map to same enum values..

@mtnbrit
Copy link

mtnbrit commented Aug 12, 2016

What make, model and firmware is your AP? Have you tried a different brand or model of wifi AP? they are not all equal by far.

On Aug 12, 2016, at 10:46 AM, Akash Ashok [email protected] wrote:

This is not an issue with ARP as most people have pointed out. This has something to do with the wireless connectivity stability.

I see debug logs right after the module stops responding to ARP saying
wifi evt: 7
add 1
aid 1
station: 40:88:05:b1:29:eb join, AID = 1
wifi evt: 5
wifi evt: 7
bcn_timout,ap_probe_send_start

This seems to be the root cause. I have attached the full log here.
https://drive.google.com/open?id=0B8DXcb9GfNuARW54YWFsVHhJbnc https://drive.google.com/open?id=0B8DXcb9GfNuARW54YWFsVHhJbnc

You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub #2330 (comment), or mute the thread https://github.com/notifications/unsubscribe-auth/AKy2zsZ_DPhgEpNQL7VhxiU-lqHxkFeCks5qfLF8gaJpZM4JVNL1.

@ClaudioHutte
Copy link

So you think the wifi connectivity instability lead to the inability of responding ARP broadcast requests?
A reception problem since transmission seems to be ok, correct?

@thehellmaker
Copy link
Author

@ClaudioHutte You are partially right. Here are my observations

  1. Reciever seems to be mainly affected because post this gratuitous ARP from other module is not being recieved as well but Gratuitous ARP is being sent to other modules though
  2. How ever if you see the log below it seems like... The module tries to rejoin it gets a wifi evt: 5 which is connected post which it recieves the Gratuitous ARp from other modules for just a few seconds post which it disconnects with
err already associed!
station: 98:0c:a5:b8:de:91 leave, AID = 1

Log

add 1
aid 1
station: 98:0c:a5:b8:de:91 join, AID = 1
wifi evt: 5
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
Got ARP Input 
nHere for arpwifi evt: 7
Got ARP Input 
nHere for arpGot ARP Input 
nHere for arpwifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
Got ARP Input 
nHere for arpGot ARP Input 
nHere for arpGot ARP Input 
nwifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
Got ARP Input 
nHere for arpGot ARP Input 
nHere for arpGot ARP Input 
nHere for arpwifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
Got ARP Input 
nHere for arpwifi evt: 7
Got ARP Input 
nHere for arpGot ARP Input 
nHere for arpwifi evt: 7
Got ARP Input 
nHere for arpwifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
err already associed!
station: 98:0c:a5:b8:de:91 leave, AID = 1
rm 1
wifi evt: 6
add 1
aid 1
station: 98:0c:a5:b8:de:91 join, AID = 1
  1. somewhere between multiple join and leave attempts you'll also see
    max connection!
  2. And ofcourse a bunch of
bcn_timout,ap_probe_send_start
bcn_timout,ap_probe_send_start

Just to explain my setup I have 2 esp8266 12 f modules
https://www.thaieasyelec.com/products/wireless-modules/wifi-modules/esp8266-12f-wifi-serial-transceiver-module-detail.html

  1. I setup gratuitous ARP to send arp broadcast pings into the network every second as @ClaudioHutte pointed out in the beginning
  2. When the modules connect for the first time every second the module prints
Got ARP Input 
nHere for arp
  1. Along with this ARP recieve there are other wifi events.
  2. At some point the logs mentioned in point 2 stop. (After close to 48 hours) and there are a bunch of other events which happen before this terminates.

I have attached the complete log to the link
https://drive.google.com/file/d/0B8DXcb9GfNuARFZGdy1USGNPbFk/view

@ClaudioHutte
Copy link

I never tested two units as you done, though I incurred into the same troubles with ESP8266-12 and a TP-link router located quite far (two stories below mine). I would like to do some tests the same way you've done, but I will be busy into other works for the next two weeks.
What happens if the "every second gratuitous ARP send" workaround is stopped/skipped?

@thehellmaker
Copy link
Author

Before you mentioned about the gratuitous ARP i hadn't added it into the code base. Even then the module stopped responding like we discussed here me-no-dev/ESPAsyncWebServer#54

But i haven't collected the logs without Gratuitous ARP but I'm sure its the same issue though.

For the module to eventually stop responding it always take 36 hours + .

@alex-yazdan
Copy link

By upgrading to SDK 2.1.0 #3215 this problem will be solved.

@pouriap
Copy link

pouriap commented Jun 20, 2017

Thanks for the efforts to create the update_sdk_2.1.0 branch.

But I'm still having the ARP issue even when using that branch:

esp8266_arp

Can anyone confirm that their ARP issue has been resolved by using that branch?

@IvanBayan
Copy link

I tried new sdk and still have arp issue.

@vks007
Copy link

vks007 commented Jun 25, 2017

I also have this same issue.
I am using a webserver on the ESP which connects to my router in the STA mode. The router assigns a fix IP to the ESP (192.168.1.54). All works good but after some time (typically a few hours to a day) the ESP webserver stops responding. I tried pinging the IP address at this point and its unreachable. To see memory footprint I added log calls within the ESP which calls a googlesheet URL and logs all relevant info. All that keeps working fine. Memory foot print is also normal. So while my ESP is able to reach the internet it's IP address is not reachable from within the local network.
If I reset my ESP or turn my modem ON/OFF (to again assign the IP address) the issue goes away for a few hours.
I have tried the simple webserver from the examples and it behaves the same way so my program is not what is causing this. I have also tried this on a SONOFF,Electrodragon ESP relay module, ESP 01 module , ESP 12E module - they all behave the same.
Can somebody guide me on what should I be looking at here,

@mtnbrit
Copy link

mtnbrit commented Jun 25, 2017 via email

@vks007
Copy link

vks007 commented Jun 26, 2017

Hi @mtnbrit , I am in the process of trying out a different router, will a few days to verify this.
Also, while the above condition happened some times, with my testing since yday most of the times, I am able to ping the ESP while the webserver does not respond. At other times, I am sometimes able to get a response from my iPhone browser while the desktop browser throws - connection reset error.
I also, started recycling the webserver every 10 min (I mean recreate the webserver object via reset method) and I was still able to get into a state where the ESP behaves normally but webserver stops responding.
Is there someway I can debug the state of the webserver object, I can log that and figure out what is not responding. Maybe I can tweak the source files and read some members and log them, if they are private, make them public just for this purpose. But I am not sure what things would tell me something about the webserver object.

@d-a-v
Copy link
Collaborator

d-a-v commented Sep 9, 2019

@cziter15 @TD-er Try adding IRAM_ATTR to the three new functions (check #6484 (comment))

@pouriap
Copy link

pouriap commented Sep 9, 2019

I am using 2.5.0 version from Boards Manager because versions above that give me a BSOD when I try to upload the sketch. (And I'm too exhausted to look into the root of the problem).
Is it okay if I just copy/paste the two changed files into my Arduino15\packages\esp8266\hardware\esp8266\2.5.0\ folder?

@devyte
Copy link
Collaborator

devyte commented Sep 9, 2019

@pouriap this is only part of the fix, the last part (I hope). The rest was merged after 2.5.2. That means that pulling just these files into your version won't make a meaningful test.

@pouriap
Copy link

pouriap commented Sep 9, 2019

@devyte If I clone ChocolateFrogsNuts repository and checkout ets_intr_lock_nest branch will it work?
Or is there something else I need to do?

@d-a-v
Copy link
Collaborator

d-a-v commented Sep 9, 2019

@pouriap Yes

Another way for command line git users is there.

@pouriap
Copy link

pouriap commented Sep 9, 2019

It keeps crashing giving me this on serial monitor:

ISR not in IRAM!

User exception (panic/abort/assert)
Abort called

>>>stack>>>

ctx: cont
sp: 3ffffec0 end: 3fffffc0 offset: 0000
3ffffec0:  feefeffe feefeffe feefeffe feefeffe  
3ffffed0:  000000fe 00000000 00000000 00000000  
3ffffee0:  00000000 00000000 00000000 00ff0000  
3ffffef0:  5ffffe00 5ffffe00 feefeffe 00000000  
3fffff00:  00000003 0000000e 3ffe84d9 4020865e  
3fffff10:  401004c6 00000000 3ffee6f8 40208674  
3fffff20:  3ffedf60 3ffee8b8 3ffe84d9 40208ba5  
3fffff30:  00000000 3ffee8b8 3ffe8504 4021788c  
3fffff40:  4020874e 00000064 3ffee6f8 3ffee858  
3fffff50:  3ffee728 3ffee540 3ffe84d9 40208c54  
3fffff60:  3ffee728 00000000 3ffee818 402023e1  
3fffff70:  feefeffe feefeffe feefeffe feefeffe  
3fffff80:  feefeffe feefeffe feefeffe feefeffe  
3fffff90:  feefeffe feefeffe feefeffe 3ffee858  
3fffffa0:  3fffdad0 00000000 3ffee818 40208268  
3fffffb0:  feefeffe feefeffe 3ffe8504 40100df9  
<<<stack<<<
$J⸮C⸮LC2�	#⸮⸮

There's a bunch of new options added which I don't understand. I'm using the defaults:
VTables -> Flash
Espressif FW -> nonos-sdk.2.2.1 + 100 (testing)
Exceptions -> Legacy (new can return nullptr)

(IRAM_ATTR is already added to the functions in core_esp8266_main.cpp)

@pouriap
Copy link

pouriap commented Sep 9, 2019

The HelloSever sketch works so the above error is probably because of Ticker library or something else I'm using in my main sketch. The versions above 2.5.0 seem to break everything for me.

Anyways in other news the ets_intr_lock_nest did not fix the ARP issue for me:

arp

@Pablo2048
Copy link

ISR not in IRAM! is not about core error, but probably error in your code. You have to move all your ISR handlers to IRAM using right decorator. It means that all attachInterrupt() must point to ICACHE_RAM_ATTR decorated methods.

@devyte
Copy link
Collaborator

devyte commented Sep 9, 2019

@pouriap ISRs being in IRAM has always been a requirement, it's just that we started enforcing it (at least the top level function) recently in 2.5.x. The fact that you hit the message with the newer core means that you aren't complying.
I don't know the state of the fork you used. The fix comprises several parts, including a rebuild of lwip and umm malloc, plus #6484. I suggest using this PR directly, and doing a full clean build before testing.

@pouriap
Copy link

pouriap commented Sep 9, 2019

Deleted previous esp8266 directories (Board Manager one and the git one).
Cloned the main repository as per instructions in the docs.
Fetched the pull request as per @d-a-v's script.
This is how my local repository looks like in GitExtentions:

git
Have I done everything correctly?

@TD-er
Copy link
Contributor

TD-er commented Sep 9, 2019

If you are not sure about the ISR functions, you may just build against the (released) core 2.5.2 and try to see if that still gives the same issues.
This way you can differentiate the issues caused by the current core branch and issues with your code.

@devyte
Copy link
Collaborator

devyte commented Sep 9, 2019

Looks and sounds correct. Also, to cover super paranoia, delete the arduino15 dir.

@pouriap
Copy link

pouriap commented Sep 9, 2019

Deleted Arduino15 dir to cover super paranoia.
Sketch uploaded with no problems and ESP working. Tho the ARP side of things doesn't look promising. 90% of ARP requests still dropped:

arp

I'll wait for a few hours/days and see if it goes unresponsive.

@TD-er Thanks. Adding ICACHE_RAM_ATTR before ISR functions fixed the error. It was my fault for not having read the docs carefully. And the BSOD reason was because of verbose mode in upload. Updating Arduino IDE and disabling verbose upload fixed that too.

@TD-er
Copy link
Contributor

TD-er commented Sep 9, 2019

I wonder, who is answering the ARP requests after the 1st one?
Are you sure the ARP request is always answered by the ESP node?

The reason I'm asking is this run I just did:

sudo nping --arp 192.168.1.152 -c 10

Starting Nping 0.7.60 ( https://nmap.org/nping ) at 2019-09-09 21:51 CEST
SENT (0.2256s) ARP who has 192.168.1.152? Tell 192.168.1.4
RCVD (0.4115s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18
RCVD (0.4115s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18
SENT (1.2259s) ARP who has 192.168.1.152? Tell 192.168.1.4
RCVD (1.4355s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18
SENT (2.2275s) ARP who has 192.168.1.152? Tell 192.168.1.4
RCVD (2.4555s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18
SENT (3.2297s) ARP who has 192.168.1.152? Tell 192.168.1.4
RCVD (3.4755s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18
SENT (4.2317s) ARP who has 192.168.1.152? Tell 192.168.1.4
RCVD (4.2915s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18
SENT (5.2339s) ARP who has 192.168.1.152? Tell 192.168.1.4
RCVD (5.3116s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18
SENT (6.2359s) ARP who has 192.168.1.152? Tell 192.168.1.4
RCVD (6.3315s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18
SENT (7.2379s) ARP who has 192.168.1.152? Tell 192.168.1.4
RCVD (7.3513s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18
RCVD (8.1675s) ARP reply 192.168.1.4 is at F4:4D:30:6A:83:6B
SENT (8.2381s) ARP who has 192.168.1.152? Tell 192.168.1.4
RCVD (8.3713s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18
SENT (9.2398s) ARP who has 192.168.1.152? Tell 192.168.1.4
RCVD (9.3913s) ARP reply 192.168.1.152 is at CC:50:E3:B6:0C:18

Max rtt: N/A | Min rtt: N/A | Avg rtt: N/A
Raw packets sent: 10 (420B) | Rcvd: 12 (578B) | Lost: 0 (0.00%)
Nping done: 1 IP address pinged in 9.43 seconds

As you can see, I got 12 replies here.
Please note that specific ESP is also sending Gratuitous ARP packets, which may explain the one extra received packet. The other is a reply about its own address (192.168.1.4).

I tested only one of my nodes running the PR #6484 to make sure I did not change behavior on the other one running this test.
That one node and all others running in my network were all replying to these ARP pings without loosing any. They are all running Gratuitous ARP (except the ones running the new PR).
So in this test setup I could not see any difference.
So I doubt whether it is a good test for these specific issues and also I am not entirely sure it does test what you think it tests.
The first 2 pings take 150 - 200 msec to get a reply and the rest is replied in 15 - 100 msec.
To me, that's an indication you are already changing the ESP's behavior. When hooked up to a power supply with an Amp meter, you can probably see the power consumption rising after the 2nd (ARP) ping.

@pouriap
Copy link

pouriap commented Sep 10, 2019

@TD-er If you use Wireshark you can make sure who is answering the ARP by looking at the MAC address of the sender.
About changing behavior, I think if ARPing it changes its behavior that means it's still not fixed.
When I ARP it, for the first few packets(5, 10, 20, it varies) I receive no answer, but after the first answer it's like the ESP "wakes up" and starts answering the rest of the ARPs. (Not always tho because its behavior is highly inconsistent). But if it wasn't dropping the ARPs in the first place this change of behavior wouldn't even happen.

@devyte Is there a specific SKD and lwip version we should use for this test?

@cziter15
Copy link
Contributor

cziter15 commented Sep 10, 2019

I think testing on 2.2.x+100 with LWIP v2 (Higher Bandwidth or Lower Memory) is expected.
Update: My two nodes are still connected to MQTT, w/o disconnection for ~12 hours. Still testing.
No ARP issues for now, tested using arping for windows.

@pouriap What DTIM and power saving mode are you using?
What DTIM is set on access point you are connected to?

@TD-er
Copy link
Contributor

TD-er commented Sep 10, 2019

@pouriap The same behavior of "waking up" the ESP was also seen before, but then tested with ping.
For ping, I sometimes saw latency numbers of up-to 800 - 900 msec, but all would get answered eventually.
This was not happening with other packet types like UDP traffic.
My test nodes are still responding quickly after the night and I'm not using Gratuitous ARP on those running this patch.

@devyte
Copy link
Collaborator

devyte commented Sep 10, 2019

@pouriap just don't use sdk3. Also, don't use sdk libs not included with our core, or use sdk calls directly.
What is your test sketch? It's possible that #6484 is still somehow incomplete, although I don't see how atm.

@ChocolateFrogsNuts
Copy link
Contributor

@pouriap The test build here is with, SDK 22y and lwip v2 low mem with a fairly recent git pull, but it shouldn't matter which lwip v2 config you use (low mem, high bandwidth or otherwise).
No gratuitous arp.

nping --arp is getting 100% response with no extras:

Raw packets sent: 100 (4.200KB) | Rcvd: 100 (4.600KB) | Lost: 0 (0.00%)
Nping done: 1 IP address pinged in 99.36 seconds

The first 1-15 responses take around 100ms although I've seen up to 250ms, until the esp "wakes up" and replies are mostly about 5ms, with some up to 15ms.
ping is getting similar results, although there was some minimal packet loss

100 packets transmitted, 99 received, 1% packet loss, time 99189ms
rtt min/avg/max/mdev = 2.402/8.386/92.298/14.849 ms

I am running 15dbm RF output on the test - shouldn't make a difference unless your chip is showing other signs of instability but might be worth a try.

Oh and those tests above were run against a WEMOS D1 mini that's been running 104 hours without a reset or loss of connectivity.

@devyte
Copy link
Collaborator

devyte commented Sep 11, 2019

Closing via #6484

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: libraries type: bug waiting for feedback Waiting on additional info. If it's not received, the issue may be closed.
Projects
No open projects
Migrate to lwip2
Issues that may be impacted
Development

No branches or pull requests