Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TX delays several seconds #1915

Open
radoslavv opened this issue Nov 27, 2020 · 14 comments
Open

TX delays several seconds #1915

radoslavv opened this issue Nov 27, 2020 · 14 comments

Comments

@radoslavv
Copy link

Hi, I am using:

  • RPi4/4GB RAM, OS raspbian (Linux raspberrypi 5.4.51-v7l+), running:
    • OpenHab2 v.2.5.10 and
    • Mosquitto v.1.6.12
  • 3 IoT devices 1x PIC18F27J53+ENC28J60, 2xPIC18F67J60 running
    • modified Microchip TCP/IP lite stack with
    • modified paho.mqtt.embedded-c
    • with several sensors attached, communication timeout (no answer from broker) is 4 seconds, few messages per second published to broker

There happens several times a day communication in mosquitto is stuck for several seconds (also cases with 10 seconds ! no communication from mosquitto).
In the system log I cannot find any issue that could cause this within time window where issue occurs. Also I am not facing HW problems on RPi4.
Actually I am running 3 (PIC based) IoT nodes communicating in QoS1. Data are published to OpenHab2. On RPi4 eth0 adapter there is continuous flow of packets from IoT devices. After about 4 seconds PIC nodes are terminating connections on PUBACK timeout and trying to disconnect and reconnect (this can be seen also in tshark.log). But mosquitto does not TX answer packets on time, Also in mosquitto.log it's visible there are significant delays form messages processing.

Attaching

There you can find 3 sections:

  1. RPi4 /var/log/mosquitto/mosquitto.log filtered messages for PIC18F27J53
  2. RPi4 /tmp/tshark.log captured packets on eth0 from/to PIC18F27J53 (IP 192.168.1.232)
  3. RPi4 /var/log/mosquitto/mosquitto.log unfiltered
    (here you can see all nodes are detaching due to mosquitto stuck communication IP 192,168.1.222)
    (in 10 seconds all 3 devices are trying to connect 3 times)

This generates huge amount of reconnects 180 in 4 days (mostly CONNACK errors 150, remaining are mainly PUBACK/SUBACK errors):
image
image

RPi4 load is small:

  1  [||||||                                          9.3%]   Tasks: 123, 321 thr; 1 running
  2  [||||||                                          9.7%]   Load average: 1.42 1.10 0.97
  3  [|||||||                                        10.3%]   Uptime: 27 days, 13:12:05
  4  [||||||||                                       11.7%]
  Mem[|||||||||||||||||||||||||||||||||||||||||2.13G/3.56G]
  Swp[||||||||||||||||||||||||||                511M/1024M]

Seems to be a BUG.

@ralight
Copy link
Contributor

ralight commented Nov 28, 2020

Well that's certainly interesting. I don't see any behaviour like that myself, but I'll see if I can reproduce your setup (without the PICs though). How did you install Mosquitto, out of interest?

@radoslavv
Copy link
Author

radoslavv commented Nov 28, 2020

RPi4 repository has some old version. So I downloaded ZIP from github -> make, make install.

You do not need PICs, find some other source generating similar traffic with Mosquitto.
Anyhow, if you have som idea how to debug deeper the reason of this behaviour, I am ready to cooperate.
For example as it is hard for me to debug uP (breakpoints are not useful for debugging realtime applications ~ till you will step by step debug something, TCP connection is lost), I am logging every critical activity

  • when entered function,
  • when leaved function,
  • also logging some critical sections values and return values.

This helped a lot to debug occassionally occurring BUGs, too.
Also this way I see connection with mosquitto is really lost because of timeout on PUBACK, afterwards there occures several CONNACK issues -> for example within 10 seconds 3 CONNACK tryes, with 2 unsuccessful, so statistically that means for 180 connections for 30 failed PUBACKs there is IoT device trying 6 CONNACKs (6 x 4 sec timeout = continuous 24 seconds where mosquitto does not react properly within 4 seconds timeframe)

@karlp
Copy link
Contributor

karlp commented Nov 28, 2020

Are you seeing these problems between your clients? or are you seeing these blockages in openhab2 not getting connections to mosquitto? My first thought would be the network stacks on the pics, not mosquitto. I've ran into problems before with naiive networking assumptions in ultra embedded.

@ralight
Copy link
Contributor

ralight commented Nov 30, 2020

If I'm reading it correctly, the 4 second delay wireshark log shows some ICMP traffic from 222 (broker) to 232 (PIC) at the time of the delay, and that ICMP request shows destination unreachable. So it's not clear that there is a network connection at that point in time.

@radoslavv
Copy link
Author

ICMP implementation seems to me beeing corrupt in the Microchip TCP IP Lite stack - first PING reply is always bad 😒, no matter if Linux or Windows device is pinging PIC.
Did another experiment - I replaced mosquitto with different broker MQTTnet.Server (consumes more CPU as it is .NET => RPi4 is more hot), but result after about 24 h:
image
= NO reconnects.

@ralight
Copy link
Contributor

ralight commented Dec 1, 2020

I have now been able to see this behaviour when connecting to a pi, but it is so uncommon to make it very difficult to debug. From what I've seen so far, there are very occasional "TCP spurious retransmission" of PUBACK packets, and these are always present when a delay happens.

I've modified the Mosquitto source to print the parameters and result of the read() and write() calls, to find out when the broker reads/writes data, in case there is a partial packet blocking for some reason. What I've seen so far, is that the broker does see delays in the logs, but the read/write calls look identical to normal read/write calls. I haven't yet captured a sufficiently long delay that is causing the client to reconnect however, so perhaps that is a different situation.

Something that may be related, when I've noticed the spurious retransmission messages in wireshark, and seen a large delay - that has coincided with my ssh connection to the pi also having problems. I know that the location my pi is situated doesn't have great wifi signal, so for me at least that may be a contributing factor.

I should also note that the spurious retransmission occurs at both the pi and my laptop.

@radoslavv
Copy link
Author

everything is connected via Ethernet cable:
[PIC IoT] ---10Mb/s,HDX--- [smart SWITCH] ---1Gb/s,FDX--- [NetGear router] ---1Gb/s,FDX--- [RPi4]

PIC IoT communicates at 10 Mb/s HDX in average with load about

  • 10 Tx packets/sec (not bigger than 50 bytes) and
  • 10 Rx packets/sec (not bigger than 50 bytes), sometimes 2 MQTT PUBLISH messages received in 1 TCP/IP packet

smart switch has

  • 0 TX and
  • 0 RX

errors for PIC IoT LAN port, so no packets lost on physical layer


should I download lates sources to make more detail logs?

I've modified the Mosquitto source to print the parameters and result of the read() and write() calls, to find out when the broker reads/writes data, in case there is a partial packet blocking for some reason.

@radoslavv
Copy link
Author

radoslavv commented Dec 29, 2020

@ralight any updates?

Tried mosquitto version 2.0.4, but behavior seems even worse - more TCP retransmissions, more often occuring timeout.
I suppose that problem is when both sides are PUBLISHing meessage at the same time, examples:

1.1) collision of PUBLISHing on both sides

  • BROKER->IoT: 192.168.1.222->192.168.1.234 Publish [sys/date_DD_MM_]
  • IoT->BROKER: 192.168.1.234->192.168.1.222 Publish [04/eth/PktsRxOK]

1.2) TCP trace: PUBACK of 04/eth/PktsRxOK after IoT DISCONNECT:

2234 2020-12-28 22:54:00.015383932 192.168.1.222 â 192.168.1.234 MQTT 77 Publish Message [sys/time_HH_MM]
2235 2020-12-28 22:54:00.099667633 192.168.1.234 â 192.168.1.222 TCP 60 2570 â 1883 [ACK] Seq=17358 Ack=2892 Win=27 Len=0
2236 2020-12-28 22:54:00.099745317 192.168.1.222 â 192.168.1.234 MQTT 79 Publish Message [sys/date_DD_MM_]

2237 2020-12-28 22:54:00.143376680 192.168.1.234 â 192.168.1.222 MQTT 80 Publish Message (id=784) [04/eth/PktsRxOK]

2238 2020-12-28 22:54:00.152934840 192.168.1.234 â 192.168.1.222 TCP 80 [TCP Retransmission] 2570 â 1883 [PSH, ACK] Seq=17358 Ack=2892 Win=27 Len=26
2239 2020-12-28 22:54:00.152991969 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 2570 [ACK] Seq=2917 Ack=17384 Win=64064 Len=0
2240 2020-12-28 22:54:00.349023715 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 2570 [PSH, ACK] Seq=2892 Ack=17384 Win=64064 Len=25
2241 2020-12-28 22:54:00.839048322 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 2570 [PSH, ACK] Seq=2892 Ack=17384 Win=64064 Len=25
2242 2020-12-28 22:54:01.829045043 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 2570 [PSH, ACK] Seq=2892 Ack=17384 Win=64064 Len=25
2243 2020-12-28 22:54:03.839099137 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 2570 [PSH, ACK] Seq=2892 Ack=17384 Win=64064 Len=25
2244 2020-12-28 22:54:04.280438212 192.168.1.234 â 192.168.1.222 MQTT 60 Disconnect Req

2245 2020-12-28 22:54:04.280504285 192.168.1.222 â 192.168.1.234 MQTT 58 Publish Ack (id=784)

2246 2020-12-28 22:54:04.280789375 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 2570 [FIN, ACK] Seq=2921 Ack=17386 Win=64064 Len=0
2247 2020-12-28 22:54:04.297473850 192.168.1.234 â 192.168.1.222 TCP 60 2570 â 1883 [ACK] Seq=17386 Ack=2921 Win=46 Len=0
2248 2020-12-28 22:54:04.306314462 192.168.1.234 â 192.168.1.222 TCP 60 2570 â 1883 [FIN, ACK] Seq=17386 Ack=2922 Win=46 Len=0
2249 2020-12-28 22:54:04.306358351 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 2570 [ACK] Seq=2922 Ack=17387 Win=64064 Len=0
2250 2020-12-28 22:54:04.477559095 192.168.1.234 â 192.168.1.222 TCP 60 2990 â 1883 [SYN] Seq=0 Win=50 Len=0
2251 2020-12-28 22:54:04.477619150 192.168.1.222 â 192.168.1.234 TCP 58 1883 â 2990 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
2252 2020-12-28 22:54:04.486429559 192.168.1.234 â 192.168.1.222 TCP 60 2990 â 1883 [ACK] Seq=1 Ack=1 Win=50 Len=0
2253 2020-12-28 22:54:04.579577704 192.168.1.234 â 192.168.1.222 MQTT 108 Connect Command
2254 2020-12-28 22:54:04.579623352 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 2990 [ACK] Seq=1 Ack=55 Win=64186 Len=0
2255 2020-12-28 22:54:04.580032218 192.168.1.222 â 192.168.1.234 MQTT 58 Connect Ack
2256 2020-12-28 22:54:04.628460642 192.168.1.234 â 192.168.1.222 TCP 60 2990 â 1883 [ACK] Seq=55 Ack=5 Win=46 Len=0

3.3) mosquitto.log:
N/A

===
2.1) collision of PUBLISHing on both sides

  • BROKER->IoT: 192.168.1.222->192.168.1.234 Publish [sys/date_DD_MM_]
  • IoT->BROKER: 192.168.1.234->192.168.1.222 Publish [04/BME680/P_pa]

2.2) TCP trace: PUBACK of 04/BME680/P_pa after IoT DISCONNECT:

31884 2020-12-29 00:13:00.014453852 192.168.1.222 â 192.168.1.234 MQTT 77 Publish Message [sys/time_HH_MM]
31885 2020-12-29 00:13:00.056804324 192.168.1.234 â 192.168.1.222 TCP 60 2990 â 1883 [ACK] Seq=234897 Ack=39834 Win=27 Len=0
31886 2020-12-29 00:13:00.056851620 192.168.1.222 â 192.168.1.234 MQTT 79 Publish Message [sys/date_DD_MM_]

31887 2020-12-29 00:13:00.102305190 192.168.1.234 â 192.168.1.222 MQTT 82 Publish Message (id=9016) [04/BME680/P_pa]

31888 2020-12-29 00:13:00.111867959 192.168.1.234 â 192.168.1.222 TCP 82 [TCP Retransmission] 2990 â 1883 [PSH, ACK] Seq=234897 Ack=39834 Win=27 Len=28
31889 2020-12-29 00:13:00.111959568 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 2990 [ACK] Seq=39859 Ack=234925 Win=64064 Len=0
31890 2020-12-29 00:13:00.299022243 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 2990 [PSH, ACK] Seq=39834 Ack=234925 Win=64064 Len=25
31891 2020-12-29 00:13:00.789025916 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 2990 [PSH, ACK] Seq=39834 Ack=234925 Win=64064 Len=25
31892 2020-12-29 00:13:01.829058159 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 2990 [PSH, ACK] Seq=39834 Ack=234925 Win=64064 Len=25
31893 2020-12-29 00:13:03.839079145 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 2990 [PSH, ACK] Seq=39834 Ack=234925 Win=64064 Len=25
31894 2020-12-29 00:13:04.231527786 192.168.1.234 â 192.168.1.222 MQTT 60 Disconnect Req

31895 2020-12-29 00:13:04.231595192 192.168.1.222 â 192.168.1.234 MQTT 58 Publish Ack (id=9016)
                    13:04 !!!

31896 2020-12-29 00:13:04.232000926 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 2990 [FIN, ACK] Seq=39863 Ack=234927 Win=64064 Len=0
31897 2020-12-29 00:13:04.249096387 192.168.1.234 â 192.168.1.222 TCP 60 2990 â 1883 [ACK] Seq=234927 Ack=39863 Win=46 Len=0
31898 2020-12-29 00:13:04.257854041 192.168.1.234 â 192.168.1.222 TCP 60 2990 â 1883 [FIN, ACK] Seq=234927 Ack=39864 Win=46 Len=0
31899 2020-12-29 00:13:04.257908206 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 2990 [ACK] Seq=39864 Ack=234928 Win=64064 Len=0
31900 2020-12-29 00:13:04.426460823 192.168.1.234 â 192.168.1.222 TCP 60 7729 â 1883 [SYN] Seq=0 Win=50 Len=0
31901 2020-12-29 00:13:04.426528470 192.168.1.222 â 192.168.1.234 TCP 58 1883 â 7729 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
31902 2020-12-29 00:13:04.435334604 192.168.1.234 â 192.168.1.222 TCP 60 7729 â 1883 [ACK] Seq=1 Ack=1 Win=50 Len=0
31903 2020-12-29 00:13:04.513447401 192.168.1.234 â 192.168.1.222 MQTT 108 Connect Command
31904 2020-12-29 00:13:04.513497845 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 7729 [ACK] Seq=1 Ack=55 Win=64186 Len=0
31905 2020-12-29 00:13:04.514002133 192.168.1.222 â 192.168.1.234 MQTT 58 Connect Ack

2.3) mosquitto.log:

2020-12-29 01:13:00: Sending PUBLISH to PIC18F67J60 Ethernet [04] (d0, q0, r0, m0, 'sys/time_HH_MM', ... (5 bytes))
2020-12-29 01:13:00: Sending PUBLISH to PIC18F67J60 Ethernet [04] (d0, q0, r0, m0, 'sys/date_DD_MM_', ... (6 bytes))
2020-12-29 01:13:00: Received PUBLISH from PIC18F67J60 Ethernet [04] (d0, q1, r0, m9016, '04/BME680/P_pa', ... (8 bytes))

2020-12-29 01:13:00: Sending PUBACK to PIC18F67J60 Ethernet [04] (m9016, rc0)
              13:00 !!!

2020-12-29 01:13:04: Received DISCONNECT from PIC18F67J60 Ethernet [04]
2020-12-29 01:13:04: Client PIC18F67J60 Ethernet [04] disconnected.
2020-12-29 01:13:04: New client connected from 192.168.1.234:7729 as PIC18F67J60 Ethernet [04] (p1, c1, k6).
2020-12-29 01:13:04: Sending CONNACK to PIC18F67J60 Ethernet [04] (0, 0)

===
3.1) collision of PUBLISHing on both sides

  • BROKER->IoT: 192.168.1.222->192.168.1.234 Publish [sys/date_DD_MM_]
  • IoT->BROKER: 192.168.1.234->192.168.1.222 Publish [04/eth/PktsTxOK]

3.2) TCP trace: PUBACK of 04/eth/PktsTxOK after IoT DISCONNECT:

55802 2020-12-29 01:19:00.014566236 192.168.1.222 â 192.168.1.234 MQTT 77 Publish Message [sys/time_HH_MM]
55803 2020-12-29 01:19:00.058590487 192.168.1.234 â 192.168.1.222 TCP 60 7729 â 1883 [ACK] Seq=193667 Ack=32190 Win=27 Len=0
55804 2020-12-29 01:19:00.058641116 192.168.1.222 â 192.168.1.234 MQTT 79 Publish Message [sys/date_DD_MM_]

55805 2020-12-29 01:19:00.103007251 192.168.1.234 â 192.168.1.222 MQTT 80 Publish Message (id=7261) [04/eth/PktsTxOK]

55806 2020-12-29 01:19:00.112559651 192.168.1.234 â 192.168.1.222 TCP 80 [TCP Retransmission] 7729 â 1883 [PSH, ACK] Seq=193667 Ack=32190 Win=27 Len=26
55807 2020-12-29 01:19:00.112604873 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 7729 [ACK] Seq=32215 Ack=193693 Win=64064 Len=0
55808 2020-12-29 01:19:00.299036042 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 7729 [PSH, ACK] Seq=32190 Ack=193693 Win=64064 Len=25
55809 2020-12-29 01:19:00.789025799 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 7729 [PSH, ACK] Seq=32190 Ack=193693 Win=64064 Len=25
55810 2020-12-29 01:19:01.829091627 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 7729 [PSH, ACK] Seq=32190 Ack=193693 Win=64064 Len=25
55811 2020-12-29 01:19:03.829085790 192.168.1.222 â 192.168.1.234 TCP 79 [TCP Retransmission] 1883 â 7729 [PSH, ACK] Seq=32190 Ack=193693 Win=64064 Len=25
55812 2020-12-29 01:19:04.215702686 192.168.1.234 â 192.168.1.222 MQTT 60 Disconnect Req

55813 2020-12-29 01:19:04.215778852 192.168.1.222 â 192.168.1.234 MQTT 58 Publish Ack (id=7261)**
                    19:04 !!!

55814 2020-12-29 01:19:04.216064978 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 7729 [FIN, ACK] Seq=32219 Ack=193695 Win=64064 Len=0
55815 2020-12-29 01:19:04.233198081 192.168.1.234 â 192.168.1.222 TCP 60 7729 â 1883 [ACK] Seq=193695 Ack=32219 Win=46 Len=0
55816 2020-12-29 01:19:04.242043157 192.168.1.234 â 192.168.1.222 TCP 60 7729 â 1883 [FIN, ACK] Seq=193695 Ack=32220 Win=46 Len=0
55817 2020-12-29 01:19:04.242086415 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 7729 [ACK] Seq=32220 Ack=193696 Win=64064 Len=0
55818 2020-12-29 01:19:04.408298505 192.168.1.234 â 192.168.1.222 TCP 60 11688 â 1883 [SYN] Seq=0 Win=50 Len=0
55819 2020-12-29 01:19:04.408356727 192.168.1.222 â 192.168.1.234 TCP 58 1883 â 11688 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
55820 2020-12-29 01:19:04.417434003 192.168.1.234 â 192.168.1.222 TCP 60 11688 â 1883 [ACK] Seq=1 Ack=1 Win=50 Len=0
55821 2020-12-29 01:19:04.511009215 192.168.1.234 â 192.168.1.222 MQTT 108 Connect Command
55822 2020-12-29 01:19:04.511061825 192.168.1.222 â 192.168.1.234 TCP 54 1883 â 11688 [ACK] Seq=1 Ack=55 Win=64186 Len=0
55823 2020-12-29 01:19:04.511481209 192.168.1.222 â 192.168.1.234 MQTT 58 Connect Ack

3.3) mosquitto.log:

2020-12-29 02:19:00: Sending PUBLISH to PIC18F67J60 Ethernet [04] (d0, q0, r0, m0, 'sys/time_HH_MM', ... (5 bytes))
2020-12-29 02:19:00: Sending PUBLISH to PIC18F67J60 Ethernet [04] (d0, q0, r0, m0, 'sys/date_DD_MM_', ... (6 bytes))
2020-12-29 02:19:00: Received PUBLISH from PIC18F67J60 Ethernet [04] (d0, q1, r0, m7261, '04/eth/PktsTxOK', ... (5 bytes))

2020-12-29 02:19:00: Sending PUBACK to PIC18F67J60 Ethernet [04] (m7261, rc0)
              19:00 !!!

2020-12-29 02:19:04: Received DISCONNECT from PIC18F67J60 Ethernet [04]
2020-12-29 02:19:04: Client PIC18F67J60 Ethernet [04] disconnected.
2020-12-29 02:19:04: New client connected from 192.168.1.234:11688 as PIC18F67J60 Ethernet [04] (p1, c1, k6).
2020-12-29 02:19:04: Sending CONNACK to PIC18F67J60 Ethernet [04] (0, 0)

What I can see in the IoT logs, that sys/date_DD_MM_ is successfully delivered to IoT, but connection is timeouting on PUBACK of messages published at the same time to mosquitto BROKER.
This is message order from the IoT device perspective:

  1. IoT TCP/IP stack publishes message 04/eth/PktsTxOK (04/BME680/P_pa, 04/eth/PktsTxOK) QoS1
  2. IoT TCP/IP stack receives sys/time_HH_MM + sys/date_DD_MM_ QoS0 at the same time
  3. IoT waits PUBACK for 04/eth/PktsTxOK (04/BME680/P_pa, 04/eth/PktsTxOK)
  4. IoT timeouts connection (no PUBACK in 4 sec received), DISCONNECT sent
  5. BROKER receives DISCONNECT message successfully!!!
  6. BROKER tries to send PUBACK for 04/eth/PktsTxOK (04/BME680/P_pa, 04/eth/PktsTxOK) after DISCONNECT ?!

while mosquitto sees lines 1 & 2 in the opposite order (and here seems to occur problem with TCP/IP sync).
IoT MQTT stack is behaving correctly, even received sys/time_HH_MM + sys/date_DD_MM_ QoS0 where PUBACK of 04/eth/PktsTxOK (04/BME680/P_pa, 04/eth/PktsTxOK) QoS1 expected, stack still waits for PUBACK.


I am considering in such a conflict cases to re-publish last message (in this case 04/eth/PktsTxOK, 04/BME680/P_pa, 04/eth/PktsTxOK) and maybe will receive correct PUBACK 😉

@ralight
Copy link
Contributor

ralight commented Dec 30, 2020

Sorry, I haven't had the chance to look at this further.

Could you try the following patch, it adds nanoseconds to the log messages when you already have a timestamp format defined (which I see you have) and prints log messages when packets have been passed to the TCP stack. I'm convinced this is happening at a level below Mosquitto, this will show whether I'm right or not.

git clone https://github.com/eclipse/mosquitto
cd mosquitto
git checkout fixes
git apply 1915.txt
make -C src
sudo make prefix=/usr -C src install 

1915.txt

@radoslavv
Copy link
Author

radoslavv commented Dec 30, 2020

Hi, seems patch file is missing:

root@raspberrypi:/home/pi/Packages/mosquitto_v2.0.4patch/mosquitto# git apply 1915.txt
error: can't open patch '1915.txt': No such file or directory

Did you commit that file in a correct branch?

Anyhow, expecting same like you wrote (Socket library level) based on this:

Mosquitto.log:   2020-12-29 02:19:00: Sending PUBACK to PIC18F67J60 Ethernet [04] (m7261, rc0)
                               19:00 !!!
TCP trace: 55813 2020-12-29 01:19:04.215778852 192.168.1.222 â 192.168.1.234 MQTT 58 Publish Ack (id=7261)**
                               19:04 !!!

btw, in the meantime identified the problem of
214032 2020-11-24 05:34:04.696569474 192.168.1.232 â 192.168.1.222 ICMP 126 Destination unreachable (Port unreachable) [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
Caused by OpenHab pings (that differs from Linux ping command), BUG reported [openhab/openhab-addons/issues/9575]

@ralight
Copy link
Contributor

ralight commented Dec 30, 2020

No, I wouldn't have committed the temporary patch. It's attached to my previous comment.

@radoslavv
Copy link
Author

Here's result:
/tmp/tshark_T04.log

76869 2021-01-13 01:23:59.667928768 192.168.10.234 → 192.168.10.222 TCP 60 7876 → 1883 [ACK] Seq=345220 Ack=57739 Win=46 Len=0
76870 2021-01-13 01:24:00.014287822 192.168.10.222 → 192.168.10.234 MQTT 77 Publish Message [sys/time_HH_MM]
76871 2021-01-13 01:24:00.055441222 192.168.10.234 → 192.168.10.222 TCP 60 7876 → 1883 [ACK] Seq=345220 Ack=57762 Win=27 Len=0
76872 2021-01-13 01:24:00.055533499 192.168.10.222 → 192.168.10.234 MQTT 79 Publish Message [sys/date_DD_MM_]
76873 2021-01-13 01:24:00.114797224 192.168.10.234 → 192.168.10.222 MQTT 80 Publish Message (id=12790) [04/eth/PktsTxOK]
76874 2021-01-13 01:24:00.124349058 192.168.10.234 → 192.168.10.222 TCP 80 [TCP Retransmission] 7876 → 1883 [PSH, ACK] Seq=345220 Ack=57762 Win=27 Len=26
76875 2021-01-13 01:24:00.124423743 192.168.10.222 → 192.168.10.234 TCP 54 1883 → 7876 [ACK] Seq=57787 Ack=345246 Win=64064 Len=0
76876 2021-01-13 01:24:00.303409204 192.168.10.222 → 192.168.10.234 TCP 79 [TCP Retransmission] 1883 → 7876 [PSH, ACK] Seq=57762 Ack=345246 Win=64064 Len=25
76877 2021-01-13 01:24:00.793408134 192.168.10.222 → 192.168.10.234 TCP 79 [TCP Retransmission] 1883 → 7876 [PSH, ACK] Seq=57762 Ack=345246 Win=64064 Len=25
76878 2021-01-13 01:24:01.793408931 192.168.10.222 → 192.168.10.234 TCP 79 [TCP Retransmission] 1883 → 7876 [PSH, ACK] Seq=57762 Ack=345246 Win=64064 Len=25
76879 2021-01-13 01:24:03.793413689 192.168.10.222 → 192.168.10.234 TCP 79 [TCP Retransmission] 1883 → 7876 [PSH, ACK] Seq=57762 Ack=345246 Win=64064 Len=25
76880 2021-01-13 01:24:04.277433602 192.168.10.234 → 192.168.10.222 MQTT 60 Disconnect Req

76881 2021-01-13 01:24:04.277540119 192.168.10.222 → 192.168.10.234 MQTT 58 Publish Ack (id=12790)

76882 2021-01-13 01:24:04.277838987 192.168.10.222 → 192.168.10.234 TCP 54 1883 → 7876 [FIN, ACK] Seq=57791 Ack=345248 Win=64064 Len=0
76883 2021-01-13 01:24:04.300823395 192.168.10.234 → 192.168.10.222 TCP 60 7876 → 1883 [ACK] Seq=345248 Ack=57791 Win=46 Len=0
76884 2021-01-13 01:24:04.309657421 192.168.10.234 → 192.168.10.222 TCP 60 7876 → 1883 [FIN, ACK] Seq=345248 Ack=57792 Win=46 Len=0
76885 2021-01-13 01:24:04.309724494 192.168.10.222 → 192.168.10.234 TCP 54 1883 → 7876 [ACK] Seq=57792 Ack=345249 Win=64064 Len=0
76886 2021-01-13 01:24:04.520587908 192.168.10.234 → 192.168.10.222 TCP 60 16153 → 1883 [SYN] Seq=0 Win=50 Len=0
76887 2021-01-13 01:24:04.520667555 192.168.10.222 → 192.168.10.234 TCP 58 1883 → 16153 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
76888 2021-01-13 01:24:04.529916726 192.168.10.234 → 192.168.10.222 TCP 60 16153 → 1883 [ACK] Seq=1 Ack=1 Win=50 Len=0
76889 2021-01-13 01:24:04.620420021 192.168.10.234 → 192.168.10.222 MQTT 108 Connect Command
76890 2021-01-13 01:24:04.620485298 192.168.10.222 → 192.168.10.234 TCP 54 1883 → 16153 [ACK] Seq=1 Ack=55 Win=64186 Len=0
76891 2021-01-13 01:24:04.620967663 192.168.10.222 → 192.168.10.234 MQTT 58 Connect Ack

/var/log/mosquitto/mosquitto.log

2021-01-13 02:24:00.013636013: Received PUBLISH from openHab_v2_4_0 (d0, q0, r0, m0, 'sys/time_HH_MM', ... (5 bytes))
2021-01-13 02:24:00.013831456: Sending PUBLISH to openHab_v2_4_0 (d0, q0, r0, m0, 'sys/time_HH_MM', ... (5 bytes))
2021-01-13 02:24:00.014083842: PUBLISH packet written to TCP stack for openHab_v2_4_0
2021-01-13 02:24:00.014181637: Sending PUBLISH to PIC18F67J60 Ethernet [04] (d0, q0, r0, m0, 'sys/time_HH_MM', ... (5 bytes))
2021-01-13 02:24:00.014330784: PUBLISH packet written to TCP stack for PIC18F67J60 Ethernet [04]
2021-01-13 02:24:00.014695707: Received PUBLISH from openHab_v2_4_0 (d0, q0, r0, m0, 'sys/date_DD_MM_', ... (6 bytes))
2021-01-13 02:24:00.014796798: Sending PUBLISH to openHab_v2_4_0 (d0, q0, r0, m0, 'sys/date_DD_MM_', ... (6 bytes))
2021-01-13 02:24:00.015003889: PUBLISH packet written to TCP stack for openHab_v2_4_0
2021-01-13 02:24:00.015092573: Sending PUBLISH to PIC18F67J60 Ethernet [04] (d0, q0, r0, m0, 'sys/date_DD_MM_', ... (6 bytes))
2021-01-13 02:24:00.015212331: PUBLISH packet written to TCP stack for PIC18F67J60 Ethernet [04]
2021-01-13 02:24:00.114998815: Received PUBLISH from PIC18F67J60 Ethernet [04] (d0, q1, r0, m12790, '04/eth/PktsTxOK', ... (5 bytes))
2021-01-13 02:24:00.115155517: Sending PUBLISH to openHab_v2_4_0 (d0, q0, r0, m0, '04/eth/PktsTxOK', ... (5 bytes))
2021-01-13 02:24:00.115373552: PUBLISH packet written to TCP stack for openHab_v2_4_0

2021-01-13 02:24:00.115456607: Sending PUBACK to PIC18F67J60 Ethernet [04] (m12790, rc0)
2021-01-13 02:24:00.115538328: PUBACK packet written to TCP stack for PIC18F67J60 Ethernet [04]

2021-01-13 02:24:04.277603082: Received DISCONNECT from PIC18F67J60 Ethernet [04]
2021-01-13 02:24:04.277728932: Client PIC18F67J60 Ethernet [04] disconnected.
2021-01-13 02:24:04.530127020: New connection from 192.168.10.234:16153 on port 1883.
2021-01-13 02:24:04.620574112: New client connected from 192.168.10.234:16153 as PIC18F67J60 Ethernet [04] (p1, c1, k6).
2021-01-13 02:24:04.620705073: Will message specified (1 bytes) (r0, q1).
2021-01-13 02:24:04.620764628:  04/state
2021-01-13 02:24:04.620818276: Sending CONNACK to PIC18F67J60 Ethernet [04] (0, 0)
2021-01-13 02:24:04.621001459: CONNACK packet written to TCP stack for PIC18F67J60 Ethernet [04]

=> Seems socket layer issue like discussed 😢.

In logs I see another issues:

  1. PUBLISH from publishing thing -> PUBLISH to receiving things -> then PUBACK to publishing thing ?

    2021-01-13 02:24:00.114998815: Received PUBLISH from PIC18F67J60 Ethernet [04] (d0, q1, r0, m12790, '04/eth/PktsTxOK', ... (5 bytes))
    2021-01-13 02:24:00.115155517: Sending PUBLISH to openHab_v2_4_0 (d0, q0, r0, m0, '04/eth/PktsTxOK', ... (5 bytes))
    2021-01-13 02:24:00.115373552: PUBLISH packet written to TCP stack for openHab_v2_4_0
    2021-01-13 02:24:00.115456607: Sending PUBACK to PIC18F67J60 Ethernet [04] (m12790, rc0)
    
    

    I think first have to be PUBACK and then spreading message to other things.

  2. PUBLISH received is returned/rePUBLISHED back to thing ?

    2021-01-13 02:24:00.013636013: Received PUBLISH from openHab_v2_4_0 (d0, q0, r0, m0, 'sys/time_HH_MM', ... (5 bytes))
    2021-01-13 02:24:00.013831456: Sending PUBLISH to openHab_v2_4_0 (d0, q0, r0, m0, 'sys/time_HH_MM', ... (5 bytes))
    2021-01-13 02:24:00.014083842: PUBLISH packet written to TCP stack for openHab_v2_4_0
    

    I do not thing this is correct.

@ralight
Copy link
Contributor

ralight commented Jan 13, 2021

Well I was expecting that, but I'm still a bit surprised by it.

Your point 1 - the PUBLISH/PUBACK ordering - is fine. This is allowed by the spec.

Point 2 - if the client openHab_v2_4_0 has subscribed to sys/time_HH_MM, then it will receive all messages for that topic even if it sent them. In MQTT v5 there is an option when making a subscription to disable this behaviour - the "no local" flag.

@radoslavv
Copy link
Author

Yesterday I applied modified (workarounded) my code to resend last PUBLISH message in case of missing PUBACK (keeping same msg ID).
At least no reconnects within 24h, now. But still many missing PUBACKs (25 in less then 24h):
image

Will try to report TCP/IP Retransmission issue to Microchip. In the past I've been successful with few BUGs reported. Seems to me Microchip TCP/IP Lite stack does not respond to TCP Retransmissions (maybe not supported in Lite stack version to keep small code footprint).

76876 2021-01-13 01:24:00.303409204 192.168.10.222 → 192.168.10.234 TCP 79 [TCP Retransmission] 1883 → 7876 [PSH, ACK] Seq=57762 Ack=345246 Win=64064 Len=25
76877 2021-01-13 01:24:00.793408134 192.168.10.222 → 192.168.10.234 TCP 79 [TCP Retransmission] 1883 → 7876 [PSH, ACK] Seq=57762 Ack=345246 Win=64064 Len=25
76878 2021-01-13 01:24:01.793408931 192.168.10.222 → 192.168.10.234 TCP 79 [TCP Retransmission] 1883 → 7876 [PSH, ACK] Seq=57762 Ack=345246 Win=64064 Len=25
76879 2021-01-13 01:24:03.793413689 192.168.10.222 → 192.168.10.234 TCP 79 [TCP Retransmission] 1883 → 7876 [PSH, ACK] Seq=57762 Ack=345246 Win=64064 Len=25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants