Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testing two raspberry pi #17

Closed
SimoneNascivera opened this issue Sep 16, 2021 · 17 comments
Closed

testing two raspberry pi #17

SimoneNascivera opened this issue Sep 16, 2021 · 17 comments
Assignees

Comments

@SimoneNascivera
Copy link

SimoneNascivera commented Sep 16, 2021

Hi everyone,
I'm was looking for a library to sync two raspberry pi in a local network and I have found this one which looks very promising.
I tried to check the delay between two computers using this custom script for the client:

// Only for testing.
// A self-contained program with threads for a server and some clients.
// An example that does not read any config file.

#include <thread>
#include <vector>

#include "ClockClient.h"
#include "HighResolutionClock.h"
#include "PhaseLockedClock.h"

using namespace dex;
using namespace std::chrono_literals;

int main(int argc, char* argv[])
{
    if (argc != 3) {
        std::cerr << "usage: " << argv[0] << " port duration\n";
        return 1;
    }
    const auto port = atoi(argv[1]);
    std::chrono::microseconds runtime(int64_t(1000000 * atof(argv[2])));

    auto& clockHiRes = HighResolutionClock::instance();

    ClockClient client(ost::InetHostAddress("192.168.1.1"), port);
    client.setTimeout(1000);
    client.setAcknowledge(true);
    PhaseLockedClock plc(clockHiRes, client);
    plc.setPhasePanic(5000);
    plc.setUpdatePanic(5000000);

    std::atomic_bool end_clocks(false);

    std::thread plcThread(&PhaseLockedClock::run, &plc, std::ref(end_clocks));

    while (runtime.count() > 0) {
        std::cout << "offset: " << plc.getOffset() << "\n" << timestampToString(plc.getValue()) << std::endl;
        std::cout << std::endl;
        constexpr auto wait = 600ms;
        std::this_thread::sleep_for(wait);
        runtime -= wait;
    }

    plc.die();
    plcThread.join();
    return 0;
}

And running the ClockServerMain.cpp executable on the other PC.
However, this is the output I get:

<time 1631805921 304074> 192.168.1.2    3       907
<time 1631805921 337063> 192.168.1.2    -9221740230933440132    0
<time 1631805921 340981> 192.168.1.2    -9221740230933436065    0
<time 1631805921 516315> 192.168.1.2    -9221740230933263086    0
<time 1631805921 552744> 192.168.1.2    -9221740230933224251    0
<time 1631805921 560615> 192.168.1.2    -14     897
<time 1631805921 562499> 192.168.1.2    -9221740230933214455    0
<time 1631805921 666530> 192.168.1.2    -9221740230933110745    0
<time 1631805921 693150> 192.168.1.2    -9221740230933083913    0
<time 1631805921 695964> 192.168.1.2    -9221740230933081202    0
<time 1631805921 697906> 192.168.1.2    -9221740230933079108    0
<time 1631805921 749100> 192.168.1.2    -9221740230933028002    0
<time 1631805921 751206> 192.168.1.2    -9221740230933025967    0
<time 1631805921 753392> 192.168.1.2    -73     983

When I run the same script locally, the offsets and RTT times are reasonable:

<time 1631807223 681889> localhost	-36	148
<time 1631807223 681889> offsetMax	110	---
<time 1631807223 879107> localhost	-7	159
<time 1631807224  81756> localhost	-6	157
<time 1631807224 291728> localhost	-5	152
<time 1631807224 485152> localhost	-21	129
<time 1631807224 681626> localhost	-17	167

Could you please help me to find out what am I doing wrong?

Thank you for your time,

Simone

@camilleg
Copy link
Owner

Simone, this looks like test-standalone.cpp, simplified to use only one (local) client, running on 192.168.1.2. Is that right?

A better example to work from would be make test-remote, with config-remote.example.sh copied to config-remote.sh and customized to match your network. That example tightly syncs a pair of hosts, either two plugged into the same switch, or two on different continents. Does that work on your network?

@SimoneNascivera
Copy link
Author

SimoneNascivera commented Sep 16, 2021

Yes, I simplified the test-standalone.cpp code to run using only one client.

I didn't try using test-remote because I didn't understand how to properly configure it.
Should I configure config-remote.sh only for the client side, and use ckserver for the server one? Or does it only require to have this repo cloned on the server and config-remote.sh automatically starts the server using sshd?

Thank you for your patience

@camilleg camilleg self-assigned this Sep 18, 2021
camilleg added a commit that referenced this issue Sep 18, 2021
@camilleg
Copy link
Owner

I just wrote a better example for you, at https://github.com/camilleg/clockkit#to-sync-host-b-to-host-a. Does that work on your network?
(You're right, test-remote has a lot of extra automation that you shouldn't need to understand or configure.)

@camilleg camilleg changed the title Issues while testing library testing two raspberry pi's Sep 18, 2021
@SimoneNascivera
Copy link
Author

SimoneNascivera commented Sep 20, 2021

I have just tried the new example but unfortunately it still is not working properly. The scripts communicates with each other but these are the output I get from the server:

linaro@tinkerboard:~/clockkit/ClockKit$ ./ckserver 4567
time                     host           offset  round-trip-time
<time 771 429281> 192.168.88.238        -9223372036083379040    0
<time 771 429281> offsetMax     9223372036083379040     ---
<time 771 838948> 192.168.88.238        -9223372036082968586    0
<time 772 238949> 192.168.88.238        -9223372036082569060    0
<time 773  64974> 192.168.88.238        -9223372036081756633    0

And from the client:

./ckphaselock my-clockkit.conf 
config [server:192.168.88.15]
config [port:4567]
config [timeout:2000]
config [phasePanic:10000]
config [updatePanic:10000000]
offset: invalid
<time 2077252342 775807>
offset: 771396768
<time -2077252342 -775808>
offset: invalid
<time 2077252342 775807>
offset: 771807222
<time -2077252342 -775808>

Note that I changed also PCs and now the two IPs are 192.168.88.15 and 192.168.88.238.

Could this be caused by my local timezone? Is there antrhing I can debug further more to check what is causing my issue?

Running date +%s on both PCs gives me respectively 1632127304 and 1632127305, which do not seem to be too far off.

Thank you for your time.

@camilleg
Copy link
Owner

camilleg commented Sep 20, 2021

These <time xxx xxx> values reported by server and client are puzzling. They come from gettimeofday() in ClockKit/SystemClock.cpp, as you can check there with a few couts.
That function's manpage says it should agree with date +%s. You can check that with a system("date +%s") there. Is that function broken on rasbpi?

raspberrypi/pico-sdk#27 (comment)
docker/for-linux#1145
https://lkml.kernel.org/lkml/[email protected]/#r

It's sometimes broken elsewhere, more subtly:
https://gitlab.isc.org/isc-projects/bind9/-/issues/1679

If your gettimeofday is indeed unreliable, and your build platform has https://en.cppreference.com/w/cpp/chrono/system_clock, then I could send you an experimental build using the latter.

@SimoneNascivera SimoneNascivera changed the title testing two raspberry pi's testing two raspberry pi Sep 20, 2021
@camilleg
Copy link
Owner

camilleg commented Sep 20, 2021

SystemClock.cpp:

#include <chrono>
using namespace std::chrono;
return system_clock::now().time_since_epoch() / 1us;

FYI, if here I replace 1us with 1ns to incorrectly return nanoseconds,
then I replicate your client's output <time -2077252342 -775808>!

@SimoneNascivera
Copy link
Author

SimoneNascivera commented Sep 21, 2021

I may have found out the issue. I created a test script on my raspberry that looks like this:

#include <sys/time.h>

#include <chrono>
#include <iostream>

using namespace std::chrono_literals;

int main()
{
    timeval now;
    gettimeofday(&now, 0);
    long long time = (now.tv_sec * 1000000) + now.tv_usec;
    std::cout << "Clockkit Timeofday: " << time << std::endl;

	time =  std::chrono::system_clock::now().time_since_epoch() / 1us;

	std::cout << "Chrono us: " << time << std::endl;

	struct timeval te;
    gettimeofday(&te, NULL);                                         // get current time
    time = te.tv_sec * 1000000LL + te.tv_usec; // calculate milliseconds
    // printf("milliseconds: %lld\n", milliseconds);
	std::cout << "My Timeofday: " << time << std::endl;

    return 0;
}

And there are actually some strange behaviours. On my pc, all the three values are the same:

Clockkit Timeofday: 1632239394846348
Chrono us: 1632239394846402
My Timeofday: 1632239394846407

While on the raspberry the values are quite different:

Clockkit Timeofday: 1807868033
Chrono us: 1632239704203465
My Timeofday: 1632239704203481

This evening and tomorrow morning I'll try to implement it on library.

PS: changing now.tv_sec * 1000000 + now.tv_usec to now.tv_sec * 1000000LL + now.tv_usec seems to do the job. I'm testing if that's the case

PPS: the issue seems to be with ./ckphaselock my-clockkit.conf since ./ckserver 4567 something gives my a reasonable time, such as

time                     host           offset  round-trip-time
<time 1632240247 320178> 192.168.88.244 -9221739796607504245    0
<time 1632240247 320178> offsetMax      9221739796607504245     ---
<time 1632240254 362965> 192.168.88.244 -9221739796600460676    0
<time 1632240254 362965> offsetMax      9221739796600460676     ---
<time 1632240256 198898> 192.168.88.244 -9221739796598625172    0
<time 1632240256 198898> offsetMax      9221739796598625172     ---

While the client keeps giving me this error:

<time 2077252342 775807>
offset: invalid
<time 2077252342 775807>
offset: invalid
<time 2077252342 775807>
offset: invalid
<time 2077252342 775807>

I'll test more in detail tomorrow morning.

@camilleg
Copy link
Owner

camilleg commented Sep 21, 2021

From your experiments, I'm strongly inclined to replace the brittle gettimeofday() with system_clock::now().time_since_epoch()/1us, for reliability across platforms. A stark contrast between the two in online bug reports: the latter tend to be both many years old and spurious.

@SimoneNascivera
Copy link
Author

I have more results. I changed the original SystemClock.cpp to:

#include "SystemClock.h"
#ifdef WIN32
#include <windows.h>
#else
#include <sys/time.h>
#endif
#include <chrono>
#include <limits>

using namespace std::chrono_literals;
namespace dex {

SystemClock SystemClock::instance_;

// Needs no mutex, because no state is stored,
// and because GetSystemTimeAsFileTime and gettimeofday are thread-safe.
timestamp_t SystemClock::getValue()
{
#ifdef WIN32
    FILETIME filetime;
    GetSystemTimeAsFileTime(&filetime);
    // Convert to a 64bit int, from 100 ns to usec,
    // then from windows epoch to unix epoch.
    return ((filetime.dwHighDateTime << 32) | filetime.dwLowDateTime) / 10 - 11644473600000000;
#else
    // static constexpr auto invalid = std::numeric_limits<timestamp_t>::max();
    // timeval now;
    // return gettimeofday(&now, nullptr) < 0 ? invalid : now.tv_sec * 1000000LL + now.tv_usec;
    return std::chrono::system_clock::now().time_since_epoch() / 1us;
#endif
}

}  // namespace dex

If I run the server on my RPi and the phaselock on my PC, the server prints a correct time stamp while the phase lock gives a huge value as in the previous examples.
However, if I do the opposite, the server is not printing anything even if the phase lock is giving offset: invalid as output. After ~10s , the server prints only three sequences and blocks another time.

I'm now trying to follow the printing of the phaselock script to find why its timestamp is broken.

@SimoneNascivera
Copy link
Author

SimoneNascivera commented Sep 22, 2021

I added this line of cout debug in ClockClient.cpp:

timestamp_t ClockClient::getPhase(Clock& clock, bool acknowledge)
{
    ++sequence_ %= 250;  // One byte.
    std::cout << "ClockClient sending packet timestamp: " << clock.getValue() << std::endl << std::endl;
    ...
}

And this is the output of lockphase:

offset: invalid
<time 2077252342 775807>
ClockClient sending packet timestamp: 1632298491625394

ClockClient sending packet timestamp: -9223372036854775808

I found out that the value -9223372036854775808 is a common way for the compiler to indicate an overflow value.

Do you have any idea why this is happening?

The server side confirm this theory of the phase lock timestamp being off, since the input packet has this as print:

--- PACKET ---
clientRequestTime: <time -2077252342 -775808>
serverReplyTime: <time 1632298488 748966>
clientReceiveTime: <time -2077252342 -775808>

@camilleg
Copy link
Owner

camilleg commented Sep 22, 2021

9223372036854775807 is printed by cout << invalid, from (in several places)
static constexpr auto invalid = std::numeric_limits<timestamp_t>::max();. It is 2^63-1.

But you see minus ... 808 instead of +...807 , which is -2^63.
Wraparound? timestamp_t is int64_t.
Avoid the wraparound with invalid = ... - 100; ?

Would it be useful or valid for me to try reproducing this with an RPi emulator? (I don't have hardware.)

@SimoneNascivera
Copy link
Author

SimoneNascivera commented Sep 22, 2021

I replaced static constexpr auto invalid = std::numeric_limits<timestamp_t>::max(); with static constexpr auto invalid = std::numeric_limits<timestamp_t>::max() -100 in ClockClient.h, PhaseLockMain.cpp, PhaseLockedClock.cpp, SystemClock.cpp, VariableFrequencyClock.cpp and clockkit.cpp but it does not seem to change a lot.

Do you think the problem might be related to the system being 32bit (armv7l)? If that's the case, a 32bit emulator might be the right tool.

Anyhow, is it correct that the phase lock sends two packet per time as in my previous message? What I am not understanding is why the first timestamp is always correct while the second one most of the time is wrong.

@camilleg
Copy link
Owner

That CPU's 32bitness shouldn't matter. This code ran on only 32-bit hosts for many years (in 2004, 64-bit desktops were only starting to appear).

You understand the packet handshaking ok, as in the top of ClockPacket.h: cli sends REQUEST, srv replies with REPLY, cli repllies with ACKNOWLEDGE.

I'm still puzzled. Yes, let's work backwards from the printed numbers that look wrong.

Is there something like a cloud service where I could ssh to an RPi, install clockkit, and experiment myself for a few hours? Or should I just buy a used RPi on ebay? RPi 3? Zero?

@SimoneNascivera
Copy link
Author

SimoneNascivera commented Sep 23, 2021

I tested both from my PC to another linux machine, and everything works fine. I also tested using a arm32v7 docker image, and it also works fine.

I'm doing tests with both a RPi 4 and a TinkerBoard module but they both share the same behaviour. They give the same output even when the server is not running so it makes me think it's some communication related issue (however, setting a timeout in phaselock results in the server shutting down after that time, so some kind of communication is still happening).

@SimoneNascivera
Copy link
Author

SimoneNascivera commented Sep 23, 2021

OK I might have found the issue. This is the normal PhaseLock script output with DEBUG defined in ClockClient.cpp:

ClockClient sending packet timestamp: 1632403147226429
sending REQUEST
expecting...
got REPLY

ClockClient sending packet timestamp: 1632403147227681
sending REQUEST
expecting...
got REPLY
sending ACKNOWLEDGE

offset: 455
<time 1632403147 425883>

Compared with the RPi one:

ClockClient sending packet timestamp: 1632403014951137
sending REQUEST
expecting...
timed out waiting for packet after 2 ms

ClockClient sending packet timestamp: -9223372036854775808
sending REQUEST
expecting...
timed out waiting for packet after 2 ms

ClockClient sending packet timestamp: 1632403015147996
sending REQUEST
expecting...
got REPLY
ignoring out-of-order packet 1; expected 3
got REPLY
ignoring out-of-order packet 2; expected 3
timed out waiting for packet after 2 ms

My understanding is that the RPi one is for some reason out of sync. I'm now trying to understand why

@SimoneNascivera
Copy link
Author

I solved my issue. I didn't understand that timeout in my-clockkit.conf was expressed in us and not in ms. I had a ping of 6ms with the RPi, so it kept giving me a timeout.

Thank you for your time and patience and I'm sorry for my misunderstanding.

@camilleg
Copy link
Owner

The config files can now include comments. I've added comments there, so others don't fall into the same misunderstanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants