cluster couchdb unstable (fabric_worker_timeout) #4790

Open
alessio-congedo opened this issue Oct 4, 2023 · 4 comments

alessio-congedo commented Oct 4, 2023

Hi everyone.
We are randomly running into the following problem.
We run a 3-node CouchDB cluster. One node (always the same one), for no apparent reason, logs countless "fabric_worker_timeout" errors and is eventually dropped from the cluster until it is restarted.
The logs and configuration are below.
Thanks to anyone who can help.
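
One way to confirm from the outside that a node has dropped out is to compare the two lists returned by CouchDB's `/_membership` endpoint on each node. The sketch below is only an illustration: the base URL and the node names in the usage note are placeholders, and authentication (the endpoint normally requires admin credentials) is left out.

```python
import json
from urllib.request import urlopen


def fetch_membership(base_url: str) -> dict:
    """GET <base_url>/_membership and decode the JSON body.

    base_url is a placeholder such as "http://127.0.0.1:5984";
    credentials are omitted in this sketch.
    """
    with urlopen(f"{base_url}/_membership", timeout=10) as resp:
        return json.load(resp)


def disconnected_nodes(membership: dict) -> list:
    """Return cluster members that the queried node is not connected to.

    "cluster_nodes" lists the configured members; "all_nodes" lists the
    nodes the queried node currently knows about.
    """
    connected = set(membership.get("all_nodes", []))
    members = membership.get("cluster_nodes", [])
    return sorted(node for node in members if node not in connected)
```

For example, a payload with `cluster_nodes` of `couchdb@a`, `couchdb@b`, `couchdb@c` (hypothetical names) and `all_nodes` of only the first two would report `couchdb@c` as disconnected. Polling this on the two healthy nodes could show when the faulty node disappears.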


configuration:

3 node cluster:

host440 --> OS: Ubuntu 18.04.5 LTS (Bionic Beaver) (kernel 5.4.0-1040-azure)

host430 --> OS: Ubuntu 18.04.5 LTS (Bionic Beaver) (kernel 5.4.0-1064-azure)

host410 --> OS: Ubuntu 18.04.4 LTS (Bionic Beaver) (kernel 5.4.0-1095-azure) (crashed node)



[vendor]
name = The Apache Software Foundation
[couchdb]
uuid = 
database_dir = ./data
view_index_dir = ./data
max_dbs_open = 500
file_compression = snappy
attachment_stream_buffer_size = 4096
default_security = admin_only
changes_doc_ids_optimization_threshold = 100
default_engine = couch
[purge]
users_db_security_editable = false
[couchdb_engines]
couch = couch_bt_engine
[process_priority]
[cluster]
q=2
n=3
[chttpd]
port = 5984
bind_address = 127.0.0.1
backlog = 512
socket_options = [{sndbuf, 262144}, {nodelay, true}]
server_options = [{recbuf, undefined}]
require_valid_user = false
prefer_minimal = Cache-Control, Content-Length, Content-Range, Content-Type, ETag, Server, Transfer-Encoding, Vary
max_db_number_for_dbs_info_req = 100
[couch_peruser]
enable = false
delete_dbs = false
database_prefix = userdb-
[httpd]
port = 5986
bind_address = 127.0.0.1
authentication_handlers = {couch_httpd_auth, cookie_authentication_handler}, {couch_httpd_auth, default_authentication_handler}
secure_rewrites = true
allow_jsonp = false
socket_options = [{sndbuf, 262144}]
enable_cors = false
enable_xframe_options = false
[ssl]
port = 6984
[couch_httpd_auth]
authentication_db = _users
authentication_redirect = /_utils/session.html
require_valid_user = false
[csp]
enable = true
[cors]
credentials = false
[x_frame_options]
[native_query_servers]
[query_server_config]
reduce_limit = true
os_process_limit = 100
[mango]
[indexers]
couch_mrview = true
[feature_flags]
partitioned||* = true
[uuids]
algorithm = sequential
utc_id_suffix =
# Maximum number of UUIDs retrievable from /_uuids in a single request
max_count = 1000
[attachments]
compressible_types = text/*, application/javascript, application/json, application/xml
[replicator]
startup_jitter = 5000
max_jobs = 500
interval = 60000
max_churn = 20
worker_processes = 4
worker_batch_size = 500
http_connections = 20
connection_timeout = 30000
retries_per_request = 5
socket_options = [{keepalive, true}, {nodelay, false}]
verify_ssl_certificates = false
ssl_certificate_max_depth = 3
[log]
level = info
writer = stderr
[stats]
[smoosh.ratio_dbs]
min_priority = 2.0
[smoosh.ratio_views]
min_priority = 2.0
[ioq]
concurrency = 10
ratio = 0.01
[ioq.bypass]
os_process = true
read = true
write = true
view_update = true
shard_sync = false
compaction = false
[dreyfus]
[reshard]
--------------------------------------------------------------------
local.ini:

[couchdb]
[couch_peruser]
[chttpd]
[httpd]
[couch_httpd_auth]
[ssl]
[vhosts]
[admins]
[log]
file = /couchdblog/couchdb.log
level = info
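
For readers mapping the `[cluster]` settings above (`q=2`, `n=3`) to the shard names in the logs, here is a small sketch of the arithmetic. It assumes `q` divides the 32-bit hash space evenly and that shard copies are spread evenly over the nodes; the function names are made up for illustration and are not part of CouchDB.

```python
def shard_ranges(q: int) -> list:
    """Split the 32-bit hash ring into q equal, contiguous ranges.

    Assumes 2**32 is divisible by q (true for powers of two such as q=2).
    """
    size = 2**32 // q
    return [f"{i * size:08x}-{(i + 1) * size - 1:08x}" for i in range(q)]


def shard_copies_per_node(q: int, n: int, nodes: int) -> float:
    """Average number of shard copies each node stores per database."""
    return q * n / nodes


# q=2 gives the two ranges that appear in the log lines.
print(shard_ranges(2))
# With n=3 on 3 nodes, every node holds a copy of both ranges.
print(shard_copies_per_node(2, 3, 3))
```

With three replicas on three nodes, every node stores a copy of every shard, so requests for any database can end up waiting on the faulty node.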
-----------------------------------------------
logs host410:
[notice] 2023-09-22T07:27:37.992397Z [email protected] <0.28252.6837> 4d4c63b407 XXXXXXXX.xxxxx.com:5984 000.000.158.66 JJJJJJJJJJ GET /xxxxxxxxxxx_xxxxxxxxxxx-store/_design/sync/_view/pull?start_key=%222023-09-21T11%3A43%3A10.0900665Z%22&limit=2&include_docs=true 200 ok 6
[notice] 2023-09-22T07:27:38.108163Z [email protected] <0.32169.6869> 9d742bba86 XXXXXXXX.xxxxx.com:5984 000.000.158.66 JJJJJJJJJJ GET /xxxxxxxxxxx-YYYYYYYYY-data/_changes?include_docs=true&since=90688-g1AAAADteJzLYWBgYMpgTmFQSc4vTc5ISXJITqwqLUotLiwqMDEx0EvOy0jOSUnM1cvJT07MyQGpTmRIqv___39WBnMSA8OW37lAMXbLtMREQ0tTIo0hzb48FiDJ0ACk_sOtXesKtjbJIi0x1TKNSNOyAErHTyo&limit=1 200 ok 15
[info] 2023-09-22T07:27:38.132250Z [email protected] <0.23148.6894> -------- Starting index update for db: shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/keys
[info] 2023-09-22T07:27:38.132420Z [email protected] <0.10215.6866> -------- Starting index update for db: shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/crm365
[info] 2023-09-22T07:27:38.132533Z [email protected] <0.10286.6888> -------- Starting index update for db: shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/qlikViewsCommonData
[notice] 2023-09-22T07:27:38.165317Z [email protected] <0.4683.6913> 2d9d36ea72 XXXXXXXX.xxxxx.com:5984 000.000.158.66 JJJJJJJJJJ GET /xxxxxxxxxxx-YYYYYYYYY-store/_design/sync/_view/pullCount?start_key=%222023-09-19T10%3A20%3A23.6650000Z%22 200 ok 4
[notice] 2023-09-22T07:27:38.166509Z [email protected] <0.28839.6862> 30a88e797e XXXXXXXX.xxxxx.com:5984 000.000.158.66 JJJJJJJJJJ GET /xxxxxxxxxxx-YYYYYYYYY-store/_design/sync/_view/pull?start_key=%222023-09-19T10%3A20%3A23.6650000Z%22&limit=2&include_docs=true 200 ok 5
[info] 2023-09-22T07:27:38.280964Z [email protected] <0.14685.6879> -------- Starting index update for db: shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/keys
[info] 2023-09-22T07:27:38.281173Z [email protected] <0.5183.6871> -------- Starting index update for db: shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/crm365
[info] 2023-09-22T07:27:38.281236Z [email protected] <0.9142.6883> -------- Starting index update for db: shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/qlikViewsCommonData
[notice] 2023-09-22T07:27:38.420903Z [email protected] <0.2955.6877> 7135637ce5 XXXXXXXX.xxxxx.com:5984 000.000.158.66 JJJJJJJJJJ GET /xxxxxxxxxxl-YYYYYYYYY-store/_design/sync/_view/pullCount?start_key=%222023-09-22T06%3A35%3A46.5070000Z%22 200 ok 19
[info] 2023-09-22T07:27:38.672348Z [email protected] <0.10286.6888> -------- Index update finished for db: shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/qlikViewsCommonData
[info] 2023-09-22T07:27:38.730669Z [email protected] <0.23148.6894> -------- Index update finished for db: shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/keys
[notice] 2023-09-22T07:27:38.745247Z [email protected] <0.28451.6848> 0fb95ccdf2 XXXXXXXX.xxxxx.com:5984 000.000.158.4 JJJJJJJJJJ POST /xxxxxxxxxx-YYYYYYYYY/_all_docs?include_docs=true 200 ok 13
[info] 2023-09-22T07:27:38.755594Z [email protected] <0.10215.6866> -------- Index update finished for db: shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/crm365
[info] 2023-09-22T07:27:38.922129Z [email protected] <0.9142.6883> -------- Index update finished for db: shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/qlikViewsCommonData
[info] 2023-09-22T07:27:38.957962Z [email protected] <0.14685.6879> -------- Index update finished for db: shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/keys
[info] 2023-09-22T07:27:38.970011Z [email protected] <0.5183.6871> -------- Index update finished for db: shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 idx: _design/crm365
[notice] 2023-09-22T07:27:39.410205Z [email protected] <0.4969.6869> 880d8c7a31 XXXXXXXX.xxxxx.com:5984 000.000.158.66 JJJJJJJJJJ GET /XXXXXXXXX-YYYYYYYYY-data/_changes?include_docs=true&since=137250-g1AAAADteJzLYWBgYMpgTmFQSc4vTc5ISXJITqwqLUotLiwqMDE20EvOy0jOSUnM1cvJT07MyQGpTmRIqv___39WBnMSA6P0z1ygGHtyckpiqqk5kcaQZl8eC5BkaABS_2HWMvzRBFtrZmiSlGhpRKRpWQALtE5_&limit=1 200 ok 11
[notice] 2023-09-22T07:27:39.426000Z [email protected] <0.6036.6891> e5b57b4db7 XXXXXXXX.xxxxx.com:5984 000.000.158.66 JJJJJJJJJJ GET /XXXXXXXX-YYYYYYYYY-data/_changes?since=137250-g1AAAADteJzLYWBgYMpgTmFQSc4vTc5ISXJITqwqLUotLiwqMDE20EvOy0jOSUnM1cvJT07MyQGpTmRIqv___39WBnMSA6P0z1ygGHtyckpiqqk5kcaQZl8eC5BkaABS_2HWMvzRBFtrZmiSlGhpRKRpWQALtE5_ 200 ok 9
[notice] 2023-09-22T07:27:39.469679Z [email protected] <0.19949.6869> 85f9d1aa9d XXXXXXXX.xxxxx.com:5984 000.000.158.66 JJJJJJJJJJ GET /xxxxxxxx_xxxxxxxxxxx-store/_design/sync/_view/pull?start_key=%222023-09-21T17%3A42%3A46.0454235Z%22&limit=2&include_docs=true 200 ok 7
[notice] 2023-09-22T07:27:39.667451Z [email protected] <0.3686.6892> 61172e29d1 XXXXXXXX.xxxxx.com:5984 000.000.158.4 jjjjjjjjjj GET /xxxxxxxxx-YYYYYYYYY/c3f34b2d-f821-43e7-813c-5b9f7f3eab8e 200 ok 11
[error] 2023-09-22T07:28:39.683094Z [email protected] <0.23080.6890> a3de97af73 fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-ffffffff/_users.1606227611">>
[error] 2023-09-22T07:28:39.683423Z [email protected] <0.23080.6890> a3de97af73 fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-ffffffff/_users.1606227611">>
[error] 2023-09-22T07:28:40.114013Z [email protected] <0.32575.6885> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxxxx_xxxxxxxxxxx-store.1624883181">>
[error] 2023-09-22T07:28:40.114018Z [email protected] <0.22781.6879> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxxxx_xxxxxxxxxxx-store.1624883181">>
[error] 2023-09-22T07:28:40.114090Z [email protected] <0.22781.6879> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxxxx_xxxxxxxxxxx-store.1624883181">>
[error] 2023-09-22T07:28:40.114101Z [email protected] <0.32575.6885> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxxxx_xxxxxxxxxxx-store.1624883181">>
[error] 2023-09-22T07:28:41.085053Z [email protected] <0.16232.6867> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx_xxxxxxxxxxx-store.1661792183">>
[error] 2023-09-22T07:28:41.085144Z [email protected] <0.16232.6867> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx_xxxxxxxxxxx-store.1661792183">>
[error] 2023-09-22T07:28:41.085961Z [email protected] <0.9368.6911> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx_xxxxxxxxxxx-store.1661792183">>
[error] 2023-09-22T07:28:41.086001Z [email protected] <0.9368.6911> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx_xxxxxxxxxxx-store.1661792183">>
[error] 2023-09-22T07:28:41.310101Z [email protected] <0.31151.6895> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610742106">>
[error] 2023-09-22T07:28:41.310188Z [email protected] <0.31151.6895> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610742106">>
[error] 2023-09-22T07:28:41.311104Z [email protected] <0.5060.6887> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610742106">>
[error] 2023-09-22T07:28:41.311149Z [email protected] <0.5060.6887> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610742106">>
[error] 2023-09-22T07:28:41.610017Z [email protected] <0.27352.6879> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1676992264">>
[error] 2023-09-22T07:28:41.610021Z [email protected] <0.22928.6893> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1676992264">>
[error] 2023-09-22T07:28:41.610087Z [email protected] <0.22928.6893> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1676992264">>
[error] 2023-09-22T07:28:41.610086Z [email protected] <0.27352.6879> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1676992264">>
[error] 2023-09-22T07:28:42.052079Z [email protected] <0.22197.6887> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY.1610806297">>
[error] 2023-09-22T07:28:42.052164Z [email protected] <0.22197.6887> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY.1610806297">>
[warning] 2023-09-22T07:28:42.052435Z [email protected] <0.19983.6900> -------- Failed to get group_pid for "upgrade_views" <<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY.1610806297">> <<"_design/QlikViews">>: timeout
[error] 2023-09-22T07:28:42.570144Z [email protected] <0.28181.6921> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610764356">>
[error] 2023-09-22T07:28:42.570236Z [email protected] <0.28181.6921> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610764356">>
[warning] 2023-09-22T07:28:42.570499Z [email protected] <0.26664.6885> -------- Failed to get group_pid for "upgrade_views" <<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610764356">> <<"_design/rmi">>: timeout
[error] 2023-09-22T07:28:42.660038Z [email protected] <0.1969.6843> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-data.1610752889">>
[error] 2023-09-22T07:28:42.660116Z [email protected] <0.1969.6843> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-data.1610752889">>
[error] 2023-09-22T07:28:42.679132Z [email protected] <0.18090.6868> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY-store.1610764356">>
[error] 2023-09-22T07:28:42.679193Z [email protected] <0.18090.6868> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY-store.1610764356">>
[warning] 2023-09-22T07:28:42.679331Z [email protected] <0.9142.6895> -------- Failed to get group_pid for "upgrade_views" <<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610764356">> <<"_design/eventTypeIndex">>: timeout
[error] 2023-09-22T07:28:42.875018Z [email protected] <0.8427.6884> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY.1610766612">>
[error] 2023-09-22T07:28:42.875105Z [email protected] <0.8427.6884> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY :: {error,timeout}
[error] 2023-09-22T07:28:42.875222Z [email protected] <0.7152.6891> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610764356">>
[error] 2023-09-22T07:28:42.875263Z [email protected] <0.7152.6891> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY-store :: {error,timeout}
[error] 2023-09-22T07:28:42.875921Z [email protected] <0.7143.6855> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610764356">>
[error] 2023-09-22T07:28:42.875988Z [email protected] <0.7143.6855> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY-store :: {error,timeout}
[error] 2023-09-22T07:28:42.876188Z [email protected] <0.12119.6899> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY.1610766612">>
[error] 2023-09-22T07:28:42.876243Z [email protected] <0.12119.6899> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY :: {error,timeout}
[error] 2023-09-22T07:28:43.202106Z [email protected] <0.19541.6913> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-data.1663603160">>
[error] 2023-09-22T07:28:43.202176Z [email protected] <0.19541.6913> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-data.1663603160">>
[error] 2023-09-22T07:28:43.377020Z [email protected] <0.1421.6852> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[error] 2023-09-22T07:28:43.377019Z [email protected] <0.6036.6893> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[error] 2023-09-22T07:28:43.377019Z [email protected] <0.6516.6893> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[error] 2023-09-22T07:28:43.377087Z [email protected] <0.6516.6893> -------- Error checking security objects for xxxxxxxxxxx-YYYYYYYYY-data :: {error,timeout}
[error] 2023-09-22T07:28:43.377089Z [email protected] <0.1421.6852> -------- Error checking security objects for xxxxxxxxxxx-YYYYYYYYY-data :: {error,timeout}
[error] 2023-09-22T07:28:43.377088Z [email protected] <0.6036.6893> -------- Error checking security objects for xxxxxxxxxxx-YYYYYYYYY-data :: {error,timeout}
[error] 2023-09-22T07:28:43.377165Z [email protected] <0.11217.6895> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[error] 2023-09-22T07:28:43.377214Z [email protected] <0.11217.6895> -------- Error checking security objects for xxxxxxxxxxx-YYYYYYYYY-data :: {error,timeout}
[error] 2023-09-22T07:28:43.553055Z [email protected] <0.3215.6870> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-data.1611910875">>
[error] 2023-09-22T07:28:43.553136Z [email protected] <0.3215.6870> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-data.1611910875">>
[error] 2023-09-22T07:28:43.553957Z [email protected] <0.23209.6871> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-data.1611910875">>
[error] 2023-09-22T07:28:43.554019Z [email protected] <0.23209.6871> -------- fabric_worker_timeout open_revs,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-data.1611910875">>
[error] 2023-09-22T07:28:43.678035Z [email protected] <0.29740.6885> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[error] 2023-09-22T07:28:43.678123Z [email protected] <0.29740.6885> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[warning] 2023-09-22T07:28:43.678293Z [email protected] <0.9559.6904> -------- Failed to get group_pid for "upgrade_views" <<"shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">> <<"_design/qlikViewsCommonData">>: timeout
[error] 2023-09-22T07:28:43.732077Z [email protected] <0.32314.6900> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[error] 2023-09-22T07:28:43.732171Z [email protected] <0.32314.6900> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[warning] 2023-09-22T07:28:43.732337Z [email protected] <0.8810.6892> -------- Failed to get group_pid for "upgrade_views" <<"shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">> <<"_design/keys">>: timeout
[error] 2023-09-22T07:28:43.758042Z [email protected] <0.5407.6881> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[error] 2023-09-22T07:28:43.758117Z [email protected] <0.5407.6881> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[warning] 2023-09-22T07:28:43.758194Z [email protected] <0.16313.6907> -------- Failed to get group_pid for "upgrade_views" <<"shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">> <<"_design/crm365">>: timeout
[error] 2023-09-22T07:28:43.924122Z [email protected] <0.16596.6891> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[error] 2023-09-22T07:28:43.924215Z [email protected] <0.16596.6891> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
[warning] 2023-09-22T07:28:43.924314Z [email protected] <0.7358.6885> -------- Failed to get group_pid for "upgrade_views" <<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">> <<"_design/qlikViewsCommonData">>: timeout
[error] 2023-09-22T07:28:43.960068Z [email protected] <0.6482.6895> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339">>
- - - - - 
[error] 2023-09-22T07:29:12.840359Z [email protected] <0.109.0> -------- ** Node '[email protected]' not responding **
** Removing (timedout) connection **

[error] 2023-09-22T07:29:12.840960Z [email protected] <0.111.0> -------- ** Node '[email protected]' not responding **
** Removing (timedout) connection **

[notice] 2023-09-22T07:29:12.843431Z [email protected] <0.23080.6890> a3de97af73 XXXXXXXX.xxxxx.com:5984 000.000.158.4 kkkkkkkkkkk GET /xxxxxxxxx_xxxxxxxxxxx-store/- 404 ok 93161
[notice] 2023-09-22T07:29:12.844397Z [email protected] <0.275.0> -------- rexi_server_mon : cluster unstable
[notice] 2023-09-22T07:29:12.844450Z [email protected] <0.275.0> -------- rexi_server_mon : cluster unstable
[notice] 2023-09-22T07:29:12.844491Z [email protected] <0.281.0> -------- rexi_server_mon : cluster unstable
[notice] 2023-09-22T07:29:12.844635Z [email protected] <0.281.0> -------- rexi_server_mon : cluster unstable
[error] 2023-09-22T07:29:12.847877Z [email protected] emulator -------- Error in process <0.16294.6892> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noconnection}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}

[warning] 2023-09-22T07:29:12.848269Z [email protected] <0.344.0> -------- mem3_sync shards/80000000-ffffffff/xxxxxxxxxxx-YYYYYYYYY-data.1663606339 [email protected] {{rexi_DOWN,{'[email protected]',noconnection}},[{mem3_rpc,rexi_call,3,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{maps,map,2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,390}]},{mem3_rep,repl,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,292}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,212}]}]}
[error] 2023-09-22T07:29:12.848335Z [email protected] emulator -------- Error in process <0.32653.6885> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noconnection}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}

[error] 2023-09-22T07:29:12.848484Z [email protected] emulator -------- Error in process <0.7845.6894> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noconnection}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}

[error] 2023-09-22T07:29:12.848643Z [email protected] emulator -------- Error in process <0.20330.6882> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noconnection}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}

[error] 2023-09-22T07:29:12.848845Z [email protected] emulator -------- Error in process <0.26222.6868> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noconnection}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}

[error] 2023-09-22T07:29:12.849003Z [email protected] emulator -------- Error in process <0.27307.6863> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noconnection}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}

[error] 2023-09-22T07:29:12.849206Z [email protected] emulator -------- Error in process <0.8719.6878> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noconnection}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}

[error] 2023-09-22T07:29:12.849328Z [email protected] emulator -------- Error in process <0.2467.6889> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noconnection}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}

[notice] 2023-09-22T07:29:12.850012Z [email protected] <0.274.0> -------- rexi_server : cluster unstable
[notice] 2023-09-22T07:29:12.850014Z [email protected] <0.280.0> -------- rexi_buffer : cluster unstable
[warning] 2023-09-22T07:29:12.854631Z [email protected] <0.344.0> -------- mem3_sync shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1610764356 [email protected] {{rexi_DOWN,{'[email protected]',noconnection}},[{mem3_rpc,rexi_call,3,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{maps,map,2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,390}]},{mem3_rep,repl,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,292}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,212}]}]}
[notice] 2023-09-22T07:29:12.856634Z [email protected] <0.280.0> -------- rexi_buffer : cluster unstable
--------------------------------------------------------------------------------
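
The timing in the host410 log can be cross-checked with a little arithmetic. Request `a3de97af73` completes at 07:29:12.843431Z with a final field of 93161, and its first `fabric_worker_timeout` is logged at 07:28:39.683094Z. The sketch below assumes that final field is the request duration in milliseconds; under that assumption it estimates how long the request had been running when the timeout fired.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_ts(ts: str) -> datetime:
    """Parse a CouchDB log timestamp such as 2023-09-22T07:29:12.843431Z."""
    return datetime.strptime(ts, TS_FORMAT)


def seconds_until_timeout(completed_at: str, duration_ms: int, timeout_at: str) -> float:
    """Seconds between the estimated request start and the timeout log entry.

    The start is estimated as completion time minus duration, assuming the
    last field of the request log line is the duration in milliseconds.
    """
    started = parse_ts(completed_at) - timedelta(milliseconds=duration_ms)
    return (parse_ts(timeout_at) - started).total_seconds()


# Values copied from the host410 log lines for request a3de97af73.
gap = seconds_until_timeout(
    "2023-09-22T07:29:12.843431Z", 93161, "2023-09-22T07:28:39.683094Z"
)
print(f"timeout fired about {gap:.1f} s after the request started")
```

The result is very close to 60 seconds, which would be consistent with a 60-second request timeout on the coordinating node. That is an inference only, since the posted configuration contains no `[fabric]` section.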

On the other nodes, a connection timeout toward the faulty node is detected
(these logs were captured during another, identical incident):

logs host430:

[error] 2023-09-18T14:12:54.048086Z [email protected] <0.27095.6476> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx_ie-YYYYYYYYY-data.1632235447">>
[error] 2023-09-18T14:12:54.048143Z [email protected] <0.27095.6476> -------- Error checking security objects for xxxxxxxxx_ie-YYYYYYYYY-data :: {error,timeout}
[notice] 2023-09-18T14:12:54.069795Z [email protected] <0.4394.6527> 7889fcfd4f XXXXXXXX.xxxxx.com:5984 000.000.158.4 kkkkkkkkkkk GET /xxxxxxxxx-YYYYYYYYY/6a422c42-8f33-46bc-8b78-e4dbc6b7bad7 200 ok 11
[notice] 2023-09-18T14:12:54.083987Z [email protected] <0.10204.6526> 13a99de03a XXXXXXXX.xxxxx.com:5984 000.000.158.129 kkkkkkkkkkk GET /xxxxxxxxx-YYYYYYYYY-data/_design/keys/_view/key_%5Bdoctype_code%5D?include_docs=true&reduce=false&group=false&key=%5B%22Owner%22%2C%22fk981%40kkkkkkkk%22%5D 200 ok 3
[notice] 2023-09-18T14:12:54.120732Z [email protected] <0.10204.6526> fd649996ef XXXXXXXX.xxxxx.com:5984 000.000.158.129 kkkkkkkkkkk GET /xxxxxxxxx-YYYYYYYYY-data/_design/keys/_view/key_%5Bdoctype_market%5D?include_docs=true&reduce=false&group=false&key=%5B%22ApplicationSettings%22%2C%22cs_ag_fr%22%5D 200 ok 6
[notice] 2023-09-18T14:12:54.137062Z [email protected] <0.10204.6526> 50852240e0 XXXXXXXX.xxxxx.com:5984 000.000.158.129 kkkkkkkkkkk GET /xxxxxxxxx-YYYYYYYYY-data/_design/keys/_view/key_%5Bdoctype_code%5D?include_docs=true&reduce=false&group=false&key=%5B%22Owner%22%2C%22fk981%40kkkkkkkk%22%5D 200 ok 4
[notice] 2023-09-18T14:12:54.142260Z [email protected] <0.10204.6526> 45b9ffec1d XXXXXXXX.xxxxx.com:5984 000.000.158.129 kkkkkkkkkkk GET /xxxxxxxxx-YYYYYYYYY-data/_design/keys/_view/key_%5Bdoctype_market%5D?include_docs=true&reduce=false&group=false&key=%5B%22DealershipDefaults%22%2C%22cs_ag_fr%22%5D 200 ok 3
[error] 2023-09-18T14:12:54.560244Z [email protected] <0.1421.6484> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1663607213">>
[error] 2023-09-18T14:12:54.560250Z [email protected] <0.4464.6531> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY-store.1615278664">>
[error] 2023-09-18T14:12:54.560314Z [email protected] <0.4464.6531> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY-store :: {error,timeout}
[error] 2023-09-18T14:12:54.560318Z [email protected] <0.1421.6484> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY-store :: {error,timeout}
[error] 2023-09-18T14:12:55.061137Z [email protected] <0.14474.6506> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY.1663607217">>
[error] 2023-09-18T14:12:55.061197Z [email protected] <0.14474.6506> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY :: {error,timeout}
[error] 2023-09-18T14:12:55.562152Z [email protected] <0.9122.6522> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY-store.1610805549">>
[error] 2023-09-18T14:12:55.562213Z [email protected] <0.9122.6522> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY-store :: {error,timeout}
[error] 2023-09-18T14:12:55.616163Z [email protected] <0.2246.6530> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY.1610763551">>
[error] 2023-09-18T14:12:55.616222Z [email protected] <0.2246.6530> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY :: {error,timeout}
[error] 2023-09-18T14:12:56.753099Z [email protected] <0.10599.6516> 9c4d71f25f fabric_worker_timeout changes,'[email protected]',<<"shards/00000000-7fffffff/m_internal_xxxx_be-YYYYYYYYY-data.1676298639">>
[error] 2023-09-18T14:12:56.753173Z [email protected] <0.10599.6516> 9c4d71f25f fabric_worker_timeout changes,'[email protected]',<<"shards/80000000-ffffffff/m_internal_xxxx_be-YYYYYYYYY-data.1676298639">>
-------------------------------------------------------------------------------------

logs host440:

[notice] 2023-09-18T14:12:53.584893Z [email protected] <0.26553.6568> de51555331 XXXXXXXX.xxxxx.com:5984 000.000.158.129 kkkkkkkkkkk GET /xxxxxxxxx-YYYYYYYYY-data/_design/keys/_view/key_%5Bdoctype_code%5D?include_docs=true&reduce=false&group=false&key=%5B%22Owner%22%2C%22fk981%40kkkkkkkk%22%5D 200 ok 4
[error] 2023-09-18T14:12:53.989756Z [email protected] <0.12478.6534> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx_ie-YYYYYYYYY-data.1632235447">>
[error] 2023-09-18T14:12:53.989855Z [email protected] <0.12478.6534> -------- Error checking security objects for xxxxxxxxx_ie-YYYYYYYYY-data :: {error,timeout}
[notice] 2023-09-18T14:12:54.064578Z [email protected] <0.15164.6581> 9b4bf04f7b XXXXXXXX.xxxxx.com:5984 000.000.158.4 kkkkkkkkkkk GET /XXXXXXXXX-YYYYYYYYY-data/_design/keys/_view/key_%5Bdoctype_code%5D?include_docs=true&reduce=false&group=false&key=%5B%22Owner%22%2C%22xw388%40kkkkkkkk%22%5D 200 ok 5
[error] 2023-09-18T14:12:54.492923Z [email protected] <0.24825.6553> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-store.1663607213">>
[error] 2023-09-18T14:12:54.493021Z [email protected] <0.24825.6553> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY-store :: {error,timeout}
[error] 2023-09-18T14:12:54.493568Z [email protected] <0.19846.6541> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY-store.1615278664">>
[error] 2023-09-18T14:12:54.493635Z [email protected] <0.19846.6541> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY-store :: {error,timeout}
[error] 2023-09-18T14:12:54.994886Z [email protected] <0.28372.6543> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY.1663607217">>
[error] 2023-09-18T14:12:54.995001Z [email protected] <0.28372.6543> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY :: {error,timeout}
[error] 2023-09-18T14:12:55.495646Z [email protected] <0.29680.6524> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY.1610763551">>
[error] 2023-09-18T14:12:55.495662Z [email protected] <0.15438.6523> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY-store.1610805549">>
[error] 2023-09-18T14:12:55.495718Z [email protected] <0.15438.6523> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY-store :: {error,timeout}
[error] 2023-09-18T14:12:55.495718Z [email protected] <0.29680.6524> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY :: {error,timeout}
[error] 2023-09-18T14:12:56.998821Z [email protected] <0.25052.6527> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-data.1610744998">>
[error] 2023-09-18T14:12:56.999006Z [email protected] <0.25052.6527> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY-data :: {error,timeout}
[error] 2023-09-18T14:12:56.999705Z [email protected] <0.9010.6590> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY-data.1610744998">>
[error] 2023-09-18T14:12:56.999788Z [email protected] <0.9010.6590> -------- Error checking security objects for xxxxxxxxx-YYYYYYYYY-data :: {error,timeout}
[error] 2023-09-18T14:12:58.219781Z [email protected] <0.4871.6510> 6410b44cd3 fabric_worker_timeout changes,'[email protected]',<<"shards/00000000-7fffffff/xxxxxxxxx-YYYYYYYYY-data.1610752364">>
[error] 2023-09-18T14:12:58.219873Z [email protected] <0.4871.6510> 6410b44cd3 fabric_worker_timeout changes,'[email protected]',<<"shards/80000000-ffffffff/xxxxxxxxx-YYYYYYYYY-data.1610752364">>
[notice] 2023-09-18T14:12:58.220467Z [email protected] <0.4871.6510> 6410b44cd3 XXXXXXXX.xxxxx.com:5984 000.000.158.66 kkkkkkkkkkk GET /xxxxxxxxx-YYYYYYYYY-data/_changes?include_docs=true&since=586856-g1AAAADteJzLYWBgYMpgTmFQSc4vTc5ISXJITqwqLUotLiwqMDE00EvOy0jOSUnM1cvJT07MyQGpTmRI```
@nickva
Contributor

nickva commented Oct 4, 2023

It seems like nodes are having a hard time staying connected. Is networking connectivity between cluster nodes fairly stable?

Try upgrading to the latest Apache CouchDB release, which includes a periodic forced re-connection module. The setting is [cluster] reconnect_interval_sec = 37. You could lower it, for instance to 5 seconds or so.

There is also [fabric] request_timeout = 60000 (in milliseconds). You could try increasing or decreasing the value to see what effect it has.

Hint: you can enclose logs in three backticks so they are rendered as pre-formatted text.

@alessio-congedo
Author

Hi nickva,
thanks for replying.
The tip about the three backticks is excellent. 👍
How can you explain the fact that it's always the same node that crashes?

@nickva
Contributor

nickva commented Oct 23, 2023

@alessio-congedo

How can you explain the fact that it's always the same node that crashes?

Not sure I can explain that from the logs. It seems like other nodes also have connectivity issues and timeouts. I see some disconnects between 440 <-> 410:

[email protected] <0.32575.6885> -------- 
 fabric_worker_timeout open_revs,'[email protected]',
 <<"shards/80000000-ffffffff/xxxxxxxxxxx_xxxxxxxxxxx-store.1624883181">>

You could also try adding an ioq bypass for shard_sync = false and increasing the fabric request timeouts: https://github.com/apache/couchdb/blob/main/rel/overlay/etc/default.ini#L374-L376

If you're running in a Kubernetes environment, check that each container has the same, and sufficient, CPU and disk IO resources/bandwidth available.

@sergey-safarov

We have caught the same issue on v3.3.3.
Also, on one of the CouchDB nodes I can see "Node not responding":

Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: [email protected] <0.29864.2> -------- ** Node '[email protected]' not responding **
Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: ** Removing (timedout) connection **
Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: [email protected] <0.29864.2> -------- ** Node '[email protected]' not responding **
Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: ** Removing (timedout) connection **
Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: [email protected] <0.20680.5805> -------- 1 conflicted shard in cluster
Sep 28 01:47:08 db1a.wv.example.com couchdb[3128]: [email protected] <0.4340.5788> -------- 1 conflicted shard in cluster
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.19100.5806> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.19100.5806> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.19100.5806> -------- Error checking security objects for account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409 :: {error,timeout}
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.11649.5811> -------- fabric_worker_timeout update_docs,'[email protected]',<<"shards/40000000-5fffffff/_global_changes.1660293400">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.11649.5811> -------- fabric_worker_timeout update_docs,'[email protected]',<<"shards/40000000-5fffffff/_global_changes.1660293400">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.30041.5810> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.30041.5810> -------- Error checking security objects for account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409 :: {error,timeout}
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.7850.5798> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-9fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.7850.5798> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/80000000-9fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.15964.5814> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/a0000000-bfffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.15964.5814> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/a0000000-bfffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.32403.5807> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.32403.5807> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.10430.5762> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.10430.5762> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.11019.5802> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.11019.5802> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.28862.5796> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.28862.5796> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: [email protected] <0.14078.5795> -------- fabric_worker_timeout open_doc,'[email protected]',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>

However, while the issue was occurring I sent a /_membership request with curl, and the response listed all three nodes as online in the cluster. I sent the request to each CouchDB node in the cluster, and each returned the same result: "three nodes online in the cluster".
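
For anyone who wants to repeat the /_membership check across all nodes, here is a minimal sketch in Python rather than curl. It assumes the documented response shape with `all_nodes` and `cluster_nodes` arrays; the function names and the example URLs are hypothetical, and authentication is left out even though the endpoint may require admin credentials on your setup.

```python
import json
import urllib.request
from typing import Dict, List


def missing_nodes(membership: Dict[str, List[str]]) -> List[str]:
    """Return cluster members that the queried node does not list in all_nodes."""
    connected = set(membership.get("all_nodes", []))
    configured = set(membership.get("cluster_nodes", []))
    return sorted(configured - connected)


def fetch_membership(base_url: str, timeout: float = 5.0) -> Dict[str, List[str]]:
    """GET <base_url>/_membership and decode the JSON body (no auth handling here)."""
    with urllib.request.urlopen(f"{base_url}/_membership", timeout=timeout) as resp:
        return json.load(resp)


def check_cluster(base_urls: List[str]) -> Dict[str, List[str]]:
    """Query every node and map each URL to the members it does not see."""
    return {url: missing_nodes(fetch_membership(url)) for url in base_urls}
```

Usage would look like `check_cluster(["http://node1.example.com:5984", "http://node2.example.com:5984"])` with placeholder hosts replaced by the real ones. An empty list for every node matches what I saw: each node reported all three members.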

On the other two nodes in the cluster, I can see error messages like "fabric_worker_timeout open_doc" and no messages like "Node not responding".
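
To compare which peer the timeouts point at on each node, the log lines can be tallied. The following is a rough sketch, assuming single-line entries in the `fabric_worker_timeout <operation>,'<node>',<<"<shard>">>` form shown in the logs above; the function name `tally_timeouts` and the sample node names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches: fabric_worker_timeout open_doc,'couchdb@host',<<"shards/...">>
PATTERN = re.compile(r'''fabric_worker_timeout\s+(\w+),'([^']*)',<<"([^"]*)">>''')


def tally_timeouts(lines: Iterable[str]) -> Counter:
    """Count fabric_worker_timeout entries per (operation, target node)."""
    counts: Counter = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            operation, node, _shard = match.groups()
            counts[(operation, node)] += 1
    return counts
```

Feeding it the lines of a log file, for example `tally_timeouts(open("couchdb.log"))` with a placeholder path, gives a per-node count that makes it easier to see whether the timeouts concentrate on a single peer. Entries that are wrapped across several lines are not matched by this sketch.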

Also on the two nodes CPU load increased to 100%.
db0a: [image]
db0b: [image]
db1a: [image]

I am sure network connectivity is present between the CouchDB nodes. Also, the /_membership response showed all nodes online on every CouchDB instance.
In any case, we will adjust the recommended values and provide feedback if the issue reoccurs.

[cluster] reconnect_interval_sec = 37
[fabric] request_timeout = 60000
