
It takes a long time to detect connection lost #205

Closed · VemundH opened this issue Jul 26, 2024 · 9 comments · Fixed by #208

VemundH commented Jul 26, 2024

Small test program based on the examples in this repo:

#include <boost/redis/connection.hpp>
#include <boost/asio/detached.hpp>
#include <boost/asio/co_spawn.hpp>
#include <boost/asio/consign.hpp>       // for asio::consign (used below)
#include <boost/asio/redirect_error.hpp>
#include <boost/asio/steady_timer.hpp>
#include <boost/asio/use_awaitable.hpp> // for asio::use_awaitable (used below)

#include <iostream>
#include <chrono>

namespace asio = boost::asio;
using boost::redis::request;
using boost::redis::ignore;
using boost::redis::config;
using boost::redis::connection;
using boost::redis::logger;
using boost::asio::awaitable;
using boost::asio::detached;
using boost::asio::consign;

awaitable<void> run_publisher(std::shared_ptr<connection> conn)
{
  asio::steady_timer timer(co_await asio::this_coro::executor);
  while (conn->will_reconnect())
  {
    request req;
    req.push("PUBLISH", "channel", "some data");
    boost::system::error_code ec;
    co_await conn->async_exec(req, ignore, asio::redirect_error(asio::use_awaitable, ec));
    if (ec)
    {
      std::clog << std::chrono::system_clock::now()
        << " PUBLISH failed with message: " << ec.message()
        << std::endl;
    }
    timer.expires_after(std::chrono::seconds(2));
    co_await timer.async_wait(asio::redirect_error(asio::use_awaitable, ec));
  }
}

awaitable<void> co_main(config cfg)
{
  cfg.health_check_interval = std::chrono::seconds(30);
  auto conn = std::make_shared<connection>(co_await asio::this_coro::executor);
  conn->async_run(cfg, {logger::level::debug}, consign(detached, conn));

  co_await run_publisher(conn);
  conn->cancel();
}
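
For completeness, a driver for co_main in the spirit of the shared main.cpp used by the repo examples (a sketch; the actual example driver may differ):

#include <boost/asio/io_context.hpp>
#include <exception>

int main()
{
   try {
      boost::asio::io_context ioc;
      // Spawn the coroutine and rethrow any exception it escapes with.
      asio::co_spawn(ioc, co_main(config{}), [](std::exception_ptr p) {
         if (p)
            std::rethrow_exception(p);
      });
      ioc.run();
   } catch (std::exception const& e) {
      std::cerr << "Error: " << e.what() << std::endl;
   }
}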

Using a 30-second health check interval, which, as indicated in this comment, will be the default in a future version.

After restarting Redis, I typically see a delay of 3x to 4x the health check interval before the lost connection is detected and a reconnect is attempted.

When Redis is restarted, the run-op completes almost immediately. This log shows that the connection loss is only detected almost 2 minutes later:

2024-07-26 10:54:51.1038269 (Boost.Redis) run-all-op: resolve addresses 127.0.0.1:6379
2024-07-26 10:54:51.1085491 (Boost.Redis) run-all-op: connected to endpoint 127.0.0.1:6379
2024-07-26 10:54:51.1090630 (Boost.Redis) writer-op: 130 bytes written.
2024-07-26 10:54:51.1110903 (Boost.Redis) reader-op: 973 bytes read.
2024-07-26 10:54:51.1117493 (Boost.Redis) hello-op: Success
2024-07-26 10:54:53.1252447 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:54:53.1268255 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:54:55.1416191 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:54:55.1428586 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:54:57.1517312 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:54:57.1527005 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:54:59.1575930 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:54:59.1581645 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:55:01.1650597 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:55:01.1659037 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:55:03.1787981 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:55:03.1792210 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:55:05.1378655 (Boost.Redis) reader-op: End of file
2024-07-26 10:55:05.1382268 (Boost.Redis) reader-op: EOF received. Exiting ...
2024-07-26 10:55:05.1386657 (Boost.Redis) writer-op: canceled (3). Exiting ...
2024-07-26 10:55:05.1389448 (Boost.Redis) run-op: The operation completed successfully (reader), The operation completed successfully (writer)
2024-07-26 10:56:51.1219054 (Boost.Redis) check-timeout-op: Response has no value. Exiting ...
2024-07-26 10:56:51.1228789 (Boost.Redis) ping_op: error/cancelled (1).
2024-07-26 10:56:51.1236270 (Boost.Redis) check-health-op: The I/O operation has been aborted because of either a thread exit or an application request (async_ping), Pong timeout. (async_check_timeout).
2024-07-26 10:56:51.1239811 (Boost.Redis) runner-op: The operation completed successfully (async_run_all), Pong timeout. (async_health_check) The operation completed successfully (async_hello).
2024-07-26 10:56:51.1243785 (Boost.Redis) Connection lost.
2024-07-26 10:56:52.1326278 (Boost.Redis) run-all-op: resolve addresses 127.0.0.1:6379
2024-07-26 10:56:52.1333834 (Boost.Redis) run-all-op: connected to endpoint 127.0.0.1:6379
2024-07-26 10:56:52.1337355 (Boost.Redis) writer-op: 130 bytes written.
2024-07-26 10:56:52.1354276 (Boost.Redis) reader-op: 973 bytes read.
2024-07-26 10:56:52.1358929 (Boost.Redis) hello-op: Success
2024-07-26 10:56:54.1395472 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:56:54.1408228 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:56:56.1418841 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 10:56:56.1428140 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 10:56:58.1477065 (Boost.Redis) writer-op: 45 bytes written.

I added a small change to the logger to get the timestamps:

void logger::write_prefix()
{
   // Streaming std::chrono::system_clock::now() directly requires C++20.
   std::clog << std::chrono::system_clock::now() << " ";
   if (!std::empty(prefix_))
      std::clog << prefix_;
}

Setting cancel_if_not_connected = true in the request makes async_exec fail with the expected "Not connected" error (a sketch of setting this flag follows the log):

2024-07-26 11:17:43.6053426 PUBLISH failed with message: Not connected.
2024-07-26 11:17:43.6057837 (Boost.Redis) run-all-op: resolve addresses 127.0.0.1:6379
2024-07-26 11:17:43.6074385 (Boost.Redis) run-all-op: connected to endpoint 127.0.0.1:6379
2024-07-26 11:17:43.6078098 (Boost.Redis) writer-op: 85 bytes written.
2024-07-26 11:17:43.6107975 (Boost.Redis) reader-op: 969 bytes read.
2024-07-26 11:17:43.6115954 (Boost.Redis) hello-op: Success
2024-07-26 11:17:45.6173984 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:17:45.6179272 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 11:17:47.6251202 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:17:47.6257825 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 11:17:49.6389446 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:17:49.6398523 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 11:17:49.9818766 (Boost.Redis) reader-op: End of file
2024-07-26 11:17:49.9822220 (Boost.Redis) reader-op: EOF received. Exiting ...
2024-07-26 11:17:49.9825654 (Boost.Redis) writer-op: canceled (3). Exiting ...
2024-07-26 11:17:49.9827326 (Boost.Redis) run-op: The operation completed successfully (reader), The operation completed successfully (writer)
2024-07-26 11:17:51.6439209 PUBLISH failed with message: Not connected.
2024-07-26 11:17:53.6537792 PUBLISH failed with message: Not connected.
2024-07-26 11:17:55.6638643 PUBLISH failed with message: Not connected.
2024-07-26 11:17:57.6701620 PUBLISH failed with message: Not connected.
2024-07-26 11:17:59.6732612 PUBLISH failed with message: Not connected.
...
2024-07-26 11:19:42.0683707 PUBLISH failed with message: Not connected.
2024-07-26 11:19:43.6180784 (Boost.Redis) check-timeout-op: Response has no value. Exiting ...
2024-07-26 11:19:43.6187335 (Boost.Redis) ping_op: error/cancelled (1).
2024-07-26 11:19:43.6193823 (Boost.Redis) check-health-op: The I/O operation has been aborted because of either a thread exit or an application request (async_ping), Pong timeout. (async_check_timeout).
2024-07-26 11:19:43.6198053 (Boost.Redis) runner-op: The operation completed successfully (async_run_all), Pong timeout. (async_health_check) The operation completed successfully (async_hello).
2024-07-26 11:19:43.6200351 (Boost.Redis) Connection lost.
2024-07-26 11:19:44.0833046 PUBLISH failed with message: Not connected.
2024-07-26 11:19:44.6271292 (Boost.Redis) run-all-op: resolve addresses 127.0.0.1:6379
2024-07-26 11:19:44.6283678 (Boost.Redis) run-all-op: connected to endpoint 127.0.0.1:6379
2024-07-26 11:19:44.6290895 (Boost.Redis) writer-op: 85 bytes written.
2024-07-26 11:19:44.6397801 (Boost.Redis) reader-op: 969 bytes read.
2024-07-26 11:19:44.6403232 (Boost.Redis) hello-op: Success
2024-07-26 11:19:46.0852712 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:19:46.0861008 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 11:19:48.1011338 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:19:48.1018705 (Boost.Redis) reader-op: 4 bytes read.
2024-07-26 11:19:50.1027101 (Boost.Redis) writer-op: 45 bytes written.
2024-07-26 11:19:50.1036486 (Boost.Redis) reader-op: 4 bytes read.
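
For reference, a minimal sketch of setting the flag mentioned above on the request (this assumes the request::get_config() accessor; check it against your Boost.Redis version):

request req;
// Fail fast when the connection is down instead of queueing the request.
// The member name cancel_if_not_connected is taken from the discussion above.
req.get_config().cancel_if_not_connected = true;
req.push("PUBLISH", "channel", "some data");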
mzimbres (Collaborator) commented Aug 3, 2024

Thanks for the detailed analysis (and sorry for the delayed response).

Using 30 seconds health check interval as indicated in this comment
will be default in a future version.

I am not yet sure which is better.

When restarting Redis the run-op completes almost immediately. This
log shows that almost 2 min later connection lost is detected:

Yes, more precisely it takes 1 min 52 s. My analysis below shows the following:

# This is the log of the last successful health check. After it
# completes, a new ping is scheduled for 30s later, i.e. at
# 10:55:03 + 30s = 10:55:33.
# https://github.com/boostorg/redis/blob/e8dd4d69eb3524c9a0efb0c9723790b479a4ba8f/include/boost/redis/detail/health_checker.hpp#L56
10:55:03 writer-op: 45 bytes written.
10:55:03 reader-op: 4 bytes read.

# This block is the first reaction to restarting the Redis server. The
# server closed the connection cleanly by sending an EOF. Once the
# read operation detects it and exits, the write op is also canceled
# since I use wait_for_one in https://github.com/boostorg/redis/blob/e8dd4d69eb3524c9a0efb0c9723790b479a4ba8f/include/boost/redis/detail/connection_base.hpp#L196
10:55:05 reader-op: End of file
10:55:05 reader-op: EOF received. Exiting ...
10:55:05 writer-op: canceled (3). Exiting ...
10:55:05 run-op: The operation completed successfully (reader), The operation completed successfully (writer)

# 30s passes since the last successful health-check and the next one
# is scheduled. But since there is no connection it sits there waiting
# for a new connection to be established. If cancel_if_not_connected
# were set in the health check it would fail immediately here, but
# that can't be used reliably since it uses socket.is_open().
10:55:33 ... # No log entry.

# The check-health operation wakes up here to check whether the
# health-check pings are being answered, cleans up the response of the
# last successful health-check, which happened at 10:55:03, and goes
# to sleep for another 2 * 30s = 60s.
# https://github.com/boostorg/redis/blob/e8dd4d69eb3524c9a0efb0c9723790b479a4ba8f/include/boost/redis/detail/health_checker.hpp#L81
10:55:51 ... # No log entry.

# After 60s the check-health operation wakes up only to see that no
# successful response was received, then it exits to the reconnect
# loop to establish a new connection.
10:56:51 check-timeout-op: Response has no value. Exiting ...
10:56:51 ping_op: error/cancelled (1).
10:56:51 check-health-op: The I/O operation has been aborted because of either a thread exit or an application request (async_ping), Pong timeout. (async_check_timeout).
10:56:51 runner-op: The operation completed successfully (async_run_all), Pong timeout. (async_health_check) The operation completed successfully (async_hello).
10:56:51 Connection lost.
10:56:52 run-all-op: resolve addresses 127.0.0.1:6379
10:56:52 run-all-op: connected to endpoint 127.0.0.1:6379
10:56:52 writer-op: 130 bytes written.
10:56:52 reader-op: 973 bytes read.
10:56:52 hello-op: Success

I am not sure yet, but the only place where I can still adjust to
reduce the delay is how long the check-health operation waits. At the
moment it is 2 * ping_interval; in principle it could be reduced to
ping_interval + a small margin, so that instead of 2 min you would
experience 1 min 30 s.
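
For concreteness, a back-of-the-envelope comparison of the two waits (an illustration only; the 5 s margin below is an arbitrary example, not a proposed value):

#include <chrono>
using namespace std::chrono_literals;

constexpr auto ping_interval = 30s;                 // interval used in this issue
constexpr auto current_wait  = 2 * ping_interval;   // check-health waits 60s per cycle today
constexpr auto proposed_wait = ping_interval + 5s;  // "ping_interval + a small margin"
static_assert(current_wait - proposed_wait == 25s); // roughly the 25-30s saved, i.e. ~2min -> ~1min30s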

VemundH (Author) commented Aug 5, 2024

The design of the health checks looks OK.
I think the actual problem is the wait_for_all here:

asio::experimental::wait_for_all(),

When async_run_all completes the health check can no longer succeed and should be canceled.

mzimbres (Collaborator)

I think the actual problem is the wait_for_all here:
asio::experimental::wait_for_all(),
When async_run_all completes the health check can no longer succeed and should be canceled.

Good point; perhaps wait_for_one_error is the correct token here. I will check.

         BOOST_ASIO_CORO_YIELD
         asio::experimental::make_parallel_group(
            [this](auto token) { return runner_->async_run_all(*conn_, logger_, token); },
            [this](auto token) { return runner_->health_checker_.async_check_health(*conn_, logger_, token); },
            [this](auto token) { return runner_->async_hello(*conn_, logger_, token); }
         ).async_wait(
            asio::experimental::wait_for_all(),
            std::move(self));
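
A minimal sketch of the change being discussed: only the completion token in the snippet above changes (wait_for_one_error is the token mentioned above; this is an illustration, not the final patch).

         BOOST_ASIO_CORO_YIELD
         asio::experimental::make_parallel_group(
            [this](auto token) { return runner_->async_run_all(*conn_, logger_, token); },
            [this](auto token) { return runner_->health_checker_.async_check_health(*conn_, logger_, token); },
            [this](auto token) { return runner_->async_hello(*conn_, logger_, token); }
         ).async_wait(
            asio::experimental::wait_for_one_error(),   // was: wait_for_all()
            std::move(self));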

mzimbres added a commit that referenced this issue Aug 18, 2024
As the user points out in

   #205

wait_for_all might be the cause of the delay in the
reconnection. I could not observe any improvement in my
local setup, but wait_for_one_error looks like the correct token
and all tests pass, so I am willing to change it.
mzimbres linked a pull request on Aug 18, 2024 that will close this issue
mzimbres added a commit that referenced this issue Aug 18, 2024
VemundH (Author) commented Aug 19, 2024

Testing with wait_for_one_error did not make any difference. Maybe because the server disconnect is not considered an error?

2024-08-19 07:40:47.8739521 (Boost.Redis) writer-op: 45 bytes written.
2024-08-19 07:40:47.8746217 (Boost.Redis) reader-op: 4 bytes read.
2024-08-19 07:40:49.3554511 (Boost.Redis) reader-op: End of file
2024-08-19 07:40:49.3557639 (Boost.Redis) reader-op: EOF received. Exiting ...
2024-08-19 07:40:49.3560638 (Boost.Redis) writer-op: canceled (3). Exiting ...
2024-08-19 07:40:49.3562413 (Boost.Redis) run-op: The operation completed successfully (reader), The operation completed successfully (writer)
2024-08-19 07:42:31.8046087 (Boost.Redis) check-timeout-op: Response has no value. Exiting ...
2024-08-19 07:42:31.8054157 (Boost.Redis) ping_op: error/cancelled (1).
2024-08-19 07:42:31.8061572 (Boost.Redis) check-health-op: The I/O operation has been aborted because of either a thread exit or an application request (async_ping), Pong timeout. (async_check_timeout).
2024-08-19 07:42:31.8066630 (Boost.Redis) runner-op: The operation completed successfully (async_run_all), Pong timeout. (async_health_check) The operation completed successfully (async_hello).
2024-08-19 07:42:31.8073355 (Boost.Redis) Connection lost.

VemundH (Author) commented Aug 19, 2024

As a simple test I added conn_->cancel(operation::health_check); here:

self.complete(ec);

Then I get:

2024-08-19 08:19:00.5407450 (Boost.Redis) writer-op: 45 bytes written.
2024-08-19 08:19:00.5420529 (Boost.Redis) reader-op: 4 bytes read.
2024-08-19 08:19:01.5827613 (Boost.Redis) reader-op: End of file
2024-08-19 08:19:01.5830703 (Boost.Redis) reader-op: EOF received. Exiting ...
2024-08-19 08:19:01.5833432 (Boost.Redis) writer-op: canceled (3). Exiting ...
2024-08-19 08:19:01.5834968 (Boost.Redis) run-op: The operation completed successfully (reader), The operation completed successfully (writer)
2024-08-19 08:19:01.5839292 (Boost.Redis) ping_op: error/cancelled (2).
2024-08-19 08:19:01.5840782 (Boost.Redis) check-timeout-op: error/canceled. Exiting ...
2024-08-19 08:19:01.5842141 (Boost.Redis) check-health-op: The I/O operation has been aborted because of either a thread exit or an application request (async_ping), The I/O operation has been aborted because of either a thread exit or an application request (async_check_timeout).
2024-08-19 08:19:01.5844371 (Boost.Redis) runner-op: The operation completed successfully (async_run_all), The I/O operation has been aborted because of either a thread exit or an application request (async_health_check) The operation completed successfully (async_hello).
2024-08-19 08:19:01.5846814 (Boost.Redis) Connection lost.

The reasoning is that no more reads and writes will be attempted after run_all_op completes, so the health check can no longer succeed. Canceling it instead of waiting for the timeout removes the long delay before the reconnect logic is started.
I have not familiarized myself enough with the code and design to see whether this change has any negative side effects.
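
In other words, the experiment amounts to something like this at the completion point quoted above (a sketch; the surrounding code is the library's run-all completion handler):

// At the quoted completion point of the run-all op (names as in the snippet above):
conn_->cancel(operation::health_check);  // the health check cannot succeed anymore
self.complete(ec);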

anarthal (Collaborator) commented Sep 1, 2024

I've been out for a few days. Without being a Boost.Redis expert, I'd say @mzimbres' diagnosis is mainly correct. But bear this log line in mind:

2024-07-26 10:55:05.1389448 (Boost.Redis) run-op: The operation completed successfully (reader), The operation completed successfully (writer)

This op is completing successfully. I don't know under which circumstances the other ops may complete, but I'd say it makes sense to cancel the other operations once one of them exits.

Instead of conn_->cancel(operation::health_check);, I think using asio::experimental::wait_for_one() is better, as it provides exactly these semantics.

The degree of nesting of async ops makes testing quite difficult, so I'd suggest exploring the option of replacing some of them with a sans-io state machine, which usually leads to more testable code.
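
As an illustration of the sans-io suggestion (the names below are invented for the example and are not Boost.Redis API), decisions become pure functions of events and can be unit-tested without sockets or timers:

enum class event { eof, pong_received, pong_timeout };
enum class action { none, cancel_health_check, reconnect };

// Pure decision function: no I/O, so it can be covered by plain unit tests.
constexpr action on_event(event ev)
{
   switch (ev) {
      case event::eof:           return action::cancel_health_check;
      case event::pong_timeout:  return action::reconnect;
      case event::pong_received: return action::none;
   }
   return action::none;
}

static_assert(on_event(event::eof) == action::cancel_health_check);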

mzimbres added a commit that referenced this issue Sep 8, 2024
mzimbres (Collaborator) commented Sep 8, 2024

Back from summer vacation ...

@VemundH I couldn't reproduce this problem so far; I tried running Redis manually, on systemd, and on Docker, as you explained in the PR. But I think you are correct that, since EOF is not considered an error, the parallel group linked above has to wait until the health check returns an error. I have changed this in the PR here. Can you please try again?

VemundH (Author) commented Sep 9, 2024

@mzimbres I have tested with the changes in the PR and now the disconnect is detected when EOF is received. Thanks!

2024-09-09 07:39:57.9455664 (Boost.Redis) writer-op: 45 bytes written.
2024-09-09 07:39:57.9461189 (Boost.Redis) reader-op: 4 bytes read.
2024-09-09 07:39:58.8449635 (Boost.Redis) reader-op: End of file
2024-09-09 07:39:58.8454481 (Boost.Redis) reader-op: EOF received. Exiting ...
2024-09-09 07:39:58.8458182 (Boost.Redis) writer-op: canceled (3). Exiting ...
2024-09-09 07:39:58.8460415 (Boost.Redis) run-op: The operation completed successfully (reader), The operation completed successfully (writer)
2024-09-09 07:39:58.8465081 (Boost.Redis) ping_op: error/cancelled (2).
2024-09-09 07:39:58.8467220 (Boost.Redis) check-timeout-op: error/canceled. Exiting ...
2024-09-09 07:39:58.8469090 (Boost.Redis) check-health-op: The I/O operation has been aborted because of either a thread exit or an application request (async_ping), The I/O operation has been aborted because of either a thread exit or an application request (async_check_timeout).
2024-09-09 07:39:58.8472674 (Boost.Redis) runner-op: The operation completed successfully (async_run_all), The I/O operation has been aborted because of either a thread exit or an application request (async_health_check) The operation completed successfully (async_hello).
2024-09-09 07:39:58.8476614 (Boost.Redis) Connection lost.

mzimbres added a commit that referenced this issue Sep 9, 2024
mzimbres (Collaborator) commented Sep 9, 2024

This op is completing successfully. I don't know under which
circumstances the other ops may complete, but I'd say it makes sense
to cancel the other operations once one of them exits.

Correct, that is achieved with the PR by making EOF an error and using wait_for_one_error instead of wait_for_all.

Instead of conn_->cancel(operation::health_check);, I think using
asio::experimental::wait_for_one() is better, as it provides exactly
this semantics.

The problem with that set of operations is that two of them are long-running (async_run and the health check) and the other is short-lived (the RESP3 handshake), so if we used wait_for_one instead, the completion of the latter would trigger cancellation of the other two.

The degree of nesting async ops makes testing quite difficult, so
I'd suggest exploring the option of replacing some of these by a
sans-io state machine, which usually leads to more testable code.

At first I tried to launch all operations in the same group, i.e. read, write, handshake, and health_check. That reduces the nesting of ops but bloated the connection class, which is why I split it into two classes. I agree, however, that there is still room for improvement.

mzimbres added a commit that referenced this issue Sep 15, 2024