Socket error on client <unknown>, disconnecting. ARMv7 subscriber only #1385

Open
nickper opened this issue Aug 20, 2019 · 19 comments

Comments

@nickper

nickper commented Aug 20, 2019

I'm working on a project where I need to use MQTT on an ARMv7 device and an i686 device. Problems arise specifically on the ARMv7 device.

Both devices run Debian 7 Wheezy.

When I connect the ARMv7 device to the broker, it appears to connect but receives nothing at all. The broker logs the following:

1566312615: New connection from 192.168.1.40 on port 1883.
1566312616: Socket error on client <unknown>, disconnecting.
1566312616: New connection from 192.168.1.40 on port 1883.
1566312616: New client connected from 192.168.1.40 as mosq/qEidBqNY1Kx74MR3Ia (p2, c1, k60).
1566312616: No will message specified.
1566312616: Sending CONNACK to mosq/qEidBqNY1Kx74MR3Ia (0, 0)

The i686 device shows the following:

1566313930: New connection from 192.168.1.20 on port 1883.
1566313930: New client connected from 192.168.1.20 as mosq/Y0ecobRZo5SXfWh1J1 (p2, c1, k60).
1566313930: No will message specified.
1566313930: Sending CONNACK to mosq/Y0ecobRZo5SXfWh1J1 (0, 0)
1566313930: Received SUBSCRIBE from mosq/Y0ecobRZo5SXfWh1J1
1566313930: diagnostics (QoS 1)
1566313930: mosq/Y0ecobRZo5SXfWh1J1 1 diagnostics
1566313930: Sending SUBACK to mosq/Y0ecobRZo5SXfWh1J1

  • the i686 device works normally as publisher, subscriber, or broker
  • the ARMv7 device works normally as publisher or broker
  • the same problem happens when using my PC (Windows x64) as broker, while the i686 device works fine.

After a little research I noticed that the broker does not forward anything to the ARMv7 device. I don't know if this is a bug or something else.
Both devices use the same codebase, but are built with separate compilers.

// mosquittomqtt.h
#include <cstdint>
#include <string>
#include <mosquittopp.h>

class Handler;

struct Payload
{
    uint32_t id;
    std::string topic;
    int64_t counter;

    Payload();
    Payload(uint32_t id, std::string topic, int counter);

};

class mosquittoMQTT : public mosqpp::mosquittopp
{
public:
    mosquittoMQTT();
    virtual ~mosquittoMQTT();

    bool Initialise(std::string broker, std::string topic, int qos, Handler* handler);
    void Deinitialise();

    /// Publish data to MQTT topic
    void MQTTPublish(const std::string& topic, const Payload& payload, int qos = 0, bool retain = false);

    /// handler for incoming data
    virtual void on_message(const struct mosquitto_message* message) override;

private:
    Handler* subscriber = nullptr;
    std::string topic = "default";
    std::string broker = "127.0.0.1";

};
// mosquittomqtt.cpp
#include <iostream>

#include "mosquittomqtt.h"
#include "handler.h"

///constructor
mosquittoMQTT::mosquittoMQTT()
{}

mosquittoMQTT::~mosquittoMQTT()
{}

/// Initialise this class
bool mosquittoMQTT::Initialise(std::string broker, std::string Topic, int qos, Handler* handler)
{
    // initialise mosquitto
    mosqpp::lib_init();
    loop_start();

    this->topic = Topic;
    this->subscriber = handler;
    this->broker = broker;

    int result = connect(this->broker.c_str());
    if (result != MOSQ_ERR_SUCCESS)
    {
        std::cout << "Error connecting to MQTT Broker. Error code " << mosquitto_strerror(result) << std::endl;
        mosqpp::lib_cleanup();
        return false;
    }
    if (this->subscriber != nullptr)
    {
        if (this->subscribe(nullptr, topic.c_str(), qos))
        {
            std::cout << "Error initialising MQTTSubscriber" << std::endl;
            disconnect();
            mosqpp::lib_cleanup();
            return false;
        }
    }
    return true;
}

/// Deinitialise this class
void mosquittoMQTT::Deinitialise()
{
    disconnect();
    loop_stop();
    mosqpp::lib_cleanup();
}

/// Publish data on MQTT
void mosquittoMQTT::MQTTPublish(const std::string& topic, const Payload& payload, int qos, bool retain)
{
    // Publish to MQTT broker
    int result = publish(nullptr, topic.c_str(), sizeof(payload), &payload, qos, true);
    if (result != MOSQ_ERR_SUCCESS)
    {
        std::cout << "Error publishing on MQTT. Return code " << result << std::endl;
        if (result == MOSQ_ERR_NO_CONN)
        {
            std::cout << "Trying to reconnect" << std::endl;
            reconnect_async();
        }
    }
}

/// Handle incoming messages
void mosquittoMQTT::on_message(const struct mosquitto_message* message)
{
    //std::cout << message->payload << std::endl;
    /// make sure message topic and payload are copied!!!
    if (this->subscriber)
    {
        //std::cout << message->payloadlen << std::endl;
        Payload* payload = (struct Payload*)message->payload;
        //std::cout << payload->counter << std::endl;
        this->subscriber->ReceivedIntegerValue(payload->id, payload->counter);
    }
}

Payload::Payload(uint32_t id, std::string topic, int counter)
    : id(id)
    , counter(counter)
    , topic(topic)
{
}

Payload::Payload()
    : id(0)
    , counter(0)
    , topic("default")
{
}
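
A side note on the code above: MQTTPublish sends the raw bytes of Payload, which contains a std::string, and on_message casts them back. The string's internal pointer means nothing on another machine, and struct padding can differ between i686 and ARMv7. Below is a minimal sketch of an explicit fixed-width encoding for id and counter, assuming big-endian byte order on the wire. encode_payload, decode_payload and kWireSize are hypothetical names for illustration, not part of the original code or of mosquitto.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical wire format: 4 bytes id followed by 8 bytes counter, big-endian.
constexpr std::size_t kWireSize = 12;

std::array<std::uint8_t, kWireSize> encode_payload(std::uint32_t id, std::int64_t counter)
{
    std::array<std::uint8_t, kWireSize> out{};
    // Work on the unsigned bit pattern so the shifts are well defined.
    const std::uint64_t c = static_cast<std::uint64_t>(counter);
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<std::uint8_t>(id >> (24 - 8 * i));
    for (int i = 0; i < 8; ++i)
        out[4 + i] = static_cast<std::uint8_t>(c >> (56 - 8 * i));
    return out;
}

// Returns false if the buffer is missing or does not have exactly kWireSize bytes.
bool decode_payload(const void* data, std::size_t len,
                    std::uint32_t& id, std::int64_t& counter)
{
    if (data == nullptr || len != kWireSize)
        return false;
    const std::uint8_t* p = static_cast<const std::uint8_t*>(data);
    std::uint32_t i32 = 0;
    std::uint64_t u64 = 0;
    for (int i = 0; i < 4; ++i)
        i32 = (i32 << 8) | p[i];
    for (int i = 0; i < 8; ++i)
        u64 = (u64 << 8) | p[4 + i];
    id = i32;
    // Assumes two's complement when converting back to signed.
    counter = static_cast<std::int64_t>(u64);
    return true;
}
```

With something like this, publish would be given buf.data() and buf.size() instead of &payload and sizeof(payload), and on_message would call decode_payload(message->payload, message->payloadlen, ...) and ignore messages for which it returns false.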

Thanks in advance.
Nick

@karlp
Contributor

karlp commented Aug 20, 2019

You never send a subscribe. I'd check the way you are trying to handle the "has the subscriber been initialized yet?" check.

@nickper
Author

nickper commented Aug 21, 2019

I do; it is just hidden in an if statement. And I know that this method works because I get subscription messages when I subscribe with my i686 device.

if (this->subscribe(nullptr, topic.c_str(), qos))

@nickper
Author

nickper commented Aug 21, 2019

To give some more information: I built mosquitto from source with an ARM toolchain on an i686 VM, using the following command:

make WITH_TLS=no WITH_DOCS=no WITH_BUNDLED_DEPS=no

Furthermore, I checked the traffic with Wireshark and found something odd.
This is a working subscribe connection:
[image]
The problem is that my ARM device doesn't send these subscribe request messages, while
this->subscribe(nullptr, topic.c_str(), qos) returns no error.

@ralight
Contributor

ralight commented Aug 21, 2019

Can I check, if you're building from source I presume you're on version 1.6.4? Is that correct?

@nickper
Author

nickper commented Aug 21, 2019

that is correct

@karlp
Contributor

karlp commented Aug 21, 2019

Yes, I wouldn't expect a subscribe request in Wireshark, as it's not shown in the broker logs either. Are you sure you actually make the subscribe call? Add an else clause so you get a print regardless?

@nickper
Author

nickper commented Aug 21, 2019

As far as I can see, the subscribe call completes successfully. It returns no error, and according to the documentation it should be sufficient to call loop_start(); to ensure that it connects as it should.

The client <unknown> error appears at connect(this->broker.c_str());, which may mean that the setup in this function is not going as it should. But the function itself also returns no error.

@ralight
Contributor

ralight commented Aug 22, 2019

Does the on_log logging show anything useful on the client?
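
For reference, a small sketch of what such client-side logging might look like. In mosquittopp the hook is the virtual on_log(int level, const char* str). The helper below only formats a line, so it compiles without the library. format_log_line and log_level_name are hypothetical names, and the numeric values are assumed to match the MOSQ_LOG_* constants in mosquitto.h (16, i.e. 0x10, being MOSQ_LOG_DEBUG).

```cpp
#include <sstream>
#include <string>

// Maps a mosquitto log level to a short name.
// Values assumed to match MOSQ_LOG_* in mosquitto.h.
const char* log_level_name(int level)
{
    switch (level) {
        case 0x01: return "INFO";
        case 0x02: return "NOTICE";
        case 0x04: return "WARNING";
        case 0x08: return "ERR";
        case 0x10: return "DEBUG";
        default:   return "OTHER";
    }
}

// Builds one log line containing the numeric level, its name, and the message.
std::string format_log_line(int level, const char* str)
{
    std::ostringstream os;
    os << "on log: " << level << " (" << log_level_name(level) << "), "
       << (str ? str : "");
    return os.str();
}
```

Inside the client class this could be wired up as void on_log(int level, const char* str) override { std::cout << format_log_line(level, str) << std::endl; }.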

@nickper
Author

nickper commented Aug 23, 2019

on log: 16, Client mosq/lc7NxCxvBtlsB1bX8y sending CONNECT
on log: 16, Client mosq/lc7NxCxvBtlsB1bX8y sending SUBSCRIBE (Mid: 1, Topic: diagnostics, QoS: 0, Options: 0x00)
Waiting for samples... //is called after the initialize function in the main
on log: 16, Client mosq/lc7NxCxvBtlsB1bX8y sending CONNECT
on log: 16, Client mosq/lc7NxCxvBtlsB1bX8y received CONNACK (0)

According to on_log, it does send a subscribe.
The i686 device doesn't show the second CONNECT/CONNACK.
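
The log above shows SUBSCRIBE going out before any CONNACK has arrived, followed by a second CONNECT. A common way to make a subscription survive that kind of reconnect is to subscribe from the on_connect callback rather than right after connect(). Below is a minimal sketch, written as a template so that it compiles without the library. subscribe_on_connack and FakeClient are hypothetical names, and the only assumption about the client is a subscribe(int* mid, const char* topic, int qos) member like the one in mosqpp::mosquittopp.

```cpp
#include <string>
#include <vector>

// Subscribe only once the broker has accepted the connection (CONNACK rc == 0).
// Returns the CONNACK code if the connection was refused, otherwise the
// result of the client's subscribe() call.
template <class Client>
int subscribe_on_connack(Client& client, int connack_rc,
                         const std::string& topic, int qos)
{
    if (connack_rc != 0)
        return connack_rc;
    return client.subscribe(nullptr, topic.c_str(), qos);
}

// Stand-in for a real client, used only to exercise the helper.
struct FakeClient {
    std::vector<std::string> subscribed;

    int subscribe(int* /*mid*/, const char* topic, int /*qos*/)
    {
        subscribed.push_back(topic);
        return 0; // same value as MOSQ_ERR_SUCCESS
    }
};
```

In the class from the first post, this might be called from an overridden void on_connect(int rc) override { subscribe_on_connack(*this, rc, topic, qos); }, with the QoS stored as a member, so that every successful (re)connect renews the subscription.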

@nickper
Author

nickper commented Sep 3, 2019

I have tried building it in another build environment with another toolchain, and also updated the Linux version on the ARMv7 target. It still gives the same error.

@ralight
Contributor

ralight commented Sep 3, 2019

updated the linux version - do you mean something newer than Wheezy? I haven't yet reproduced this on any architecture, but don't have anything running Wheezy.

The example code you provided is incomplete. Is it possible to have a full working example that shows the problem?

@nickper
Author

nickper commented Sep 5, 2019

Yes, this time I tried building on a Yocto Ubuntu 18.04 environment, with updated system libraries on my device. Unfortunately the problem still persists.

I use a custom-built Linux OS which is largely based on Debian Wheezy. It uses kernel 3.10, which gave some trouble, but I created a workaround for that. I use that workaround on both devices, and it works on both.
(I mentioned my workaround here: #1403)

I will provide a working copy later today.

@nickper
Author

nickper commented Sep 9, 2019

I debugged the library and found a problem. In my case, on the ARMv7 chip the library runs into a race condition where it internally returns MOSQ_ERR_NO_CONN and tries to reconnect. The problem is that this reconnect is, for some reason, not done correctly.
By accident I found that the problem went away once there were more than 2 print statements between the initializer and the first real socket action in the loop_forever function.
Therefore I put a usleep right at the start of loop_forever, and the problem disappeared.

// loop.c
int mosquitto_loop_forever(struct mosquitto *mosq, int timeout, int max_packets)
{
	usleep(400); /* workaround: short delay before the first socket action */
	int run = 1;
	int rc;
	unsigned int reconnects = 0;
	unsigned long reconnect_delay;
...

It is an ugly solution, but for now it helps.
I don't know if I am the only one with this problem, or whether the kernel version, Linux environment, and/or hardware specs are responsible for it, but I finally got it working.

It may be good to check whether the reconnect function works when the first connection is not performed well.

EDIT
I forgot to mention that before my fix I also got the Socket error on client <unknown>, disconnecting notification when I tried to connect with my i686 device. But for some reason it didn't have any impact there.

@ralight
Contributor

ralight commented Sep 9, 2019

Good find! Are you able to check with the latest fixes branch? There are some extra locks added where they were missing. It could be related.

@nickper
Author

nickper commented Sep 11, 2019

I tried the fixes branch, but it doesn't resolve the problem.
On the broker side I still receive the message Socket error on client <unknown>, disconnecting. which is the indication of the race condition.

ralight added a commit that referenced this issue Sep 11, 2019
@ralight
Contributor

ralight commented Sep 11, 2019

I haven't been able to reproduce this, but I think I can tell where the most likely cause for this is. I've just pushed a commit which may fix it.

ralight added a commit that referenced this issue Sep 18, 2019
@karlp
Contributor

karlp commented Sep 23, 2019

This is a regression for me on both desktop linux (glibc, x86_64) and openwrt (musl-libc, mips32/ath79)

I use connect_async() followed by loop_start, and I simply never receive my on_connect callback. I'm using libevent2 for my own portion of the application, and if I send a signal that I'm handling via libevent2 (ctrl-c to cleanly exit), I finally see the connect callback fire, immediately before my clean exit handler disconnects and exits.

@karlp
Contributor

karlp commented Sep 23, 2019

test case for connect_async available at etactica@f7e04bf

@ralight
Contributor

ralight commented Sep 24, 2019

There's an updated fix in the fixes branch that helps the regression and should help this too.
