Dead nodes won't leave cluster. #21633

Open
edward-smith opened this issue Aug 22, 2024 · 0 comments
Overview of the Issue

We have a cluster of about 40 nodes, and 2 of them refuse to leave. Running consul force-leave (with and without -prune) removes them briefly, but they are then rebroadcast back into the node list. These nodes have been dead for weeks, if not months.
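
For reference, the commands were of this form (a sketch; <node-name> stands in for the dead node's name as shown by consul members):

# CLI, run against one of the servers
consul force-leave <node-name>
consul force-leave -prune <node-name>

# equivalent HTTP API call
curl --request PUT 'http://127.0.0.1:8500/v1/agent/force-leave/<node-name>?prune'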


Reproduction Steps

Unknown how to reproduce, but we got here doing regular green/blue deployments over the past several years. Current consul members output:

Node                                          Address              Status  Type    Build       Protocol  DC         Partition  Segment
ip-172-31-104-189-blue                        172.31.104.189:8301  alive   server  1.19.1+ent  2         us-west-2  default    <all>
ip-172-31-107-104-blue                        172.31.107.104:8301  alive   server  1.19.1+ent  2         us-west-2  default    <all>
ip-172-31-107-58-blue                         172.31.107.58:8301   alive   server  1.19.1+ent  2         us-west-2  default    <all>
ip-172-31-109-169-blue                        172.31.109.169:8301  alive   server  1.19.1+ent  2         us-west-2  default    <all>
ip-172-31-109-223-blue                        172.31.109.223:8301  alive   server  1.19.1+ent  2         us-west-2  default    <all>
ip-172-31-100-123-blue                        172.31.100.123:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-105-110-green                       172.31.105.110:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-105-142-green                       172.31.105.142:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-105-187-blue                        172.31.105.187:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-105-213-blue                        172.31.105.213:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-105-227-blue                        172.31.105.227:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-105-6-green                         172.31.105.6:8301    alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-106-141-green                       172.31.106.141:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-106-204-green                       172.31.106.204:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-107-101-blue                        172.31.107.101:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-107-243-blue                        172.31.107.243:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-107-35-green                        172.31.107.35:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-107-42-blue                         172.31.107.42:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-107-68-blue                         172.31.107.68:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-108-112-blue                        172.31.108.112:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-108-163-blue                        172.31.108.163:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-108-5-green                         172.31.108.5:8301    alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-109-15-green                        172.31.109.15:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-109-249.us-west-2.compute.internal  172.31.109.249:8301  alive   client  1.18.0      2         us-west-2  default    <default>
ip-172-31-109-252-blue                        172.31.109.252:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-109-254-green                       172.31.109.254:8301  alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-109-42-blue                         172.31.109.42:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-80-251-blue                         172.31.80.251:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-81-11-blue                          172.31.81.11:8301    alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-81-241-blue                         172.31.81.241:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-82-24-blue                          172.31.82.24:8301    alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-82-36-blue                          172.31.82.36:8301    alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-88-119-blue                         172.31.88.119:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-88-233-blue                         172.31.88.233:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-90-4-blue                           172.31.90.4:8301     alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-92-245-blue                         172.31.92.245:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-96-115-blue                         172.31.96.115:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-97-233-blue                         172.31.97.233:8301   alive   client  1.19.1+ent  2         us-west-2  default    <default>
ip-172-31-97-38.us-west-2.compute.internal    172.31.97.38:8301    alive   client  1.18.0      2         us-west-2  default    <default>
ip-172-31-99-32-blue                          172.31.99.32:8301    alive   client  1.19.1+ent  2         us-west-2  default    <default>
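
To isolate the stale entries, the member list can be filtered by status (a sketch; the -status flag takes a regular expression):

consul members -status 'failed|left'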

Consul info for both Client and Server

Client info
agent:
        check_monitors = 2
        check_ttls = 0
        checks = 19
        services = 15
build:
        prerelease =
        revision = 20316a71
        version = 1.19.1
        version_metadata = ent
consul:
        acl = disabled
        known_servers = 5
        server = false
license:
        customer = 546f0f9a-a80a-9ee7-57f2-ae5e84ae7188
        expiration_time = 2025-10-30 00:00:00 +0000 UTC
        features = Automated Backups, Automated Upgrades, Enhanced Read Scalability, Network Segments, Redundancy Zone, Advanced Network Federation, Namespaces, SSO, Audit Logging, Admin Partitions
        id = 707a2fc0-a09c-e110-3202-2b583a0d9bb3
        install_id = *
        issue_time = 2023-10-17 18:57:42.193347531 +0000 UTC
        modules = Global Visibility, Routing and Scale, Governance and Policy
        product = consul
        start_time = 2023-10-17 18:57:28.239 +0000 UTC
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 139
        max_procs = 8
        os = linux
        version = go1.22.5
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 184
        failed = 1
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 3494508
        members = 40
        query_queue = 0
        query_time = 1

{
    "license_path": "/etc/consul/consul_license.hclic",
    "enable_syslog": true,
    "enable_local_script_checks": true,
    "enable_central_service_config": true,
    "leave_on_terminate": true,
    "data_dir": "/opt/consul/data",
    "client_addr": "0.0.0.0",
    "log_level": "INFO",
    "retry_join": ["consul.service"],
    "ports": {
        "grpc": 8502,
        "grpc_tls": 8503
    },
    "connect": {
        "enabled": true
    }
}

Server info
agent:
        check_monitors = 2
        check_ttls = 2
        checks = 5
        services = 6
build:
        prerelease =
        revision = 20316a71
        version = 1.19.1
        version_metadata = ent
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 172.31.107.58:8300
        server = true
license:
        customer = 546f0f9a-a80a-9ee7-57f2-ae5e84ae7188
        expiration_time = 2025-10-30 00:00:00 +0000 UTC
        features = Automated Backups, Automated Upgrades, Enhanced Read Scalability, Network Segments, Redundancy Zone, Advanced Network Federation, Namespaces, SSO, Audit Logging, Admin Partitions
        id = 707a2fc0-a09c-e110-3202-2b583a0d9bb3
        install_id = *
        issue_time = 2023-10-17 18:57:42.193347531 +0000 UTC
        modules = Global Visibility, Routing and Scale, Governance and Policy
        product = consul
        start_time = 2023-10-17 18:57:28.239 +0000 UTC
raft:
        applied_index = 499478789
        commit_index = 499478789
        fsm_pending = 0
        last_contact = 420.361µs
        last_log_index = 499478789
        last_log_term = 84193
        last_snapshot_index = 499470585
        last_snapshot_term = 84193
        latest_configuration = [{Suffrage:Voter ID:4cb581fc-508c-2b59-ae9e-c73448a70800 Address:172.31.104.189:8300} {Suffrage:Voter ID:6a8198ee-456a-7730-7cd2-0fe99c9a23b0 Address:172.31.107.104:8300} {Suffrage:Voter ID:1e049f9d-d103-b49c-f106-683e95b142a9 Address:172.31.107.58:8300} {Suffrage:Voter ID:49287bd4-6f7b-96f8-7713-36ba3f22bab0 Address:172.31.109.223:8300} {Suffrage:Voter ID:54e15bcd-d3c4-c898-7771-92e04508aeee Address:172.31.109.169:8300}]
        latest_configuration_index = 0
        num_peers = 4
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 84193
runtime:
        arch = arm64
        cpu_count = 8
        goroutines = 799
        max_procs = 8
        os = linux
        version = go1.22.5
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 184
        failed = 1
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 3494508
        members = 40
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 583298
        members = 5
        query_queue = 0
        query_time = 1


[root@ip-172-31-104-189 consul]# cat config.json
{
    "license_path": "/etc/consul/consul_license.hclic",
    "enable_syslog": true,
    "enable_local_script_checks": true,
    "enable_central_service_config": true,
    "leave_on_terminate": true,
    "data_dir": "/opt/consul/data",
    "client_addr": "0.0.0.0",
    "log_level": "INFO",
    "retry_join": ["consul.service"],
    "ports": {
        "grpc": 8502,
        "grpc_tls": 8503
    },
    "connect": {
        "enabled": true
    }
}
[root@ip-172-31-104-189 consul]# cat instance.json
{
    "license_path": "/etc/consul/consul_license.hclic",
    "server": true,
    "ui_config": {
        "enabled": true
    },
    "node_name": "ip-172-31-104-189-blue",
    "domain": "owf-dev.",
    "datacenter": "us-west-2",
    "enable_debug": true,
    "data_dir": "/data/consul",
    "advertise_addr": "172.31.104.189",
    "recursors": ["169.254.169.253","172.31.96.2"],
    "bootstrap_expect": 3,
    "raft_protocol": 3,
    "limits": {
      "http_max_conns_per_client": 0,
      "rpc_max_conns_per_client": 0,
      "request_limits": {
              "mode": "permissive",
              "read_rate": 100,
              "write_rate": 100
      }
    },
    "acl": {
      "enabled": false,
      "default_policy": "allow",
      "enable_token_persistence": true
    },
    "performance": {
      "raft_multiplier": 1
    },
"dns_config": {
      "allow_stale": true,
      "max_stale": "1m",
      "service_ttl": {
        "*": "5s"
      },
      "node_ttl": "30s",
      "soa": {
        "min_ttl": 60
      }
    },
    "telemetry": {
      "dogstatsd_addr": "127.0.0.1:8125"
    },
    "node_meta": {
        "ami_version": "3.8.0"
    },
    "autopilot": {
        "upgrade_version_tag": "ami_version"
    },
    "retry_join": ["provider=aws tag_key=Zone tag_value=owf-dev"],
    "http_config": {
        "response_headers": {
            "Access-Control-Allow-Origin": "*"
        }
    },
    "config_entries": [
        {
        "bootstrap": [
            {
            "config": [
                {
                "local_request_timeout_ms": 0,
                "envoy_extra_static_clusters_json": "{\"connect_timeout\": \"3.000s\",\"dns_lookup_family\": \"V4_ONLY\",\"lb_policy\": \"ROUND_ROBIN\",\"load_assignment\": {\"cluster_name\": \"datadog_8126\",\"endpoints\": [{\"lb_endpoints\": [{\"endpoint\": {\"address\": {\"socket_address\": {\"address\": \"datadog-apm.service.owf-dev\",\"port_value\": 8126,\"protocol\": \"TCP\"}}}}]}]},\"name\": \"datadog_8126\",\"type\": \"STRICT_DNS\"}",
                "envoy_tracing_json": "{\"http\": {\"name\": \"envoy.tracers.datadog\",\"typed_config\":{\"@type\":\"type.googleapis.com/envoy.config.trace.v3.DatadogConfig\",\"collector_cluster\":\"datadog_8126\",\"service_name\":\"envoy\"}}}",
                "protocol": "http"
                }
            ],
            "kind": "proxy-defaults",
            "name": "global"
            }
        ]
        }
    ]
}

Operating system and Environment details

Running on AWS AL2023 on EC2 instances in an on-demand ASG.

Log Fragments

I can attach the debug log payload to the ticket, but essentially: shortly after the force-leave is issued, we see an EventMemberJoin message come through and the dead node gets re-added.
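
The rejoin shows up in the agent logs as a serf membership event, roughly of this shape (an illustrative line; the logger name, node name, and address below are placeholders rather than copied from our logs):

[INFO]  agent.server.serf.lan: serf: EventMemberJoin: <dead-node-name> <dead-node-address>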
