ApiListener#Start(): auto-renew CA on its owner #9891

Merged: 7 commits, Dec 19, 2023

@Al2Klimov Al2Klimov commented Oct 27, 2023

otherwise it would expire.

fixes #9890

With this, the Icinga 2 node owning the CA periodically renews it locally. (Pretty much like 3753f86, but for the CA this time.)

Sooner or later, that local CA cert will be propagated through pki::UpdateCertificate (which already happens today).

This way the root certificate never expires on the whole cluster.

TODO

  • Increase CA threshold? @Al2Klimov
  • Test in a mixed version cluster (with shorter thresholds of course) @Al2Klimov

@Al2Klimov Al2Klimov added the "consider backporting" label (Should be considered for inclusion in a bugfix release) Oct 27, 2023
@Al2Klimov Al2Klimov added this to the 2.15.0 milestone Oct 27, 2023
@cla-bot cla-bot bot added the cla/signed label Oct 27, 2023
@icinga-probot icinga-probot bot added the labels area/distributed: Distributed monitoring (master, satellites, clients); bug: Something isn't working; TBD: To be defined - We aren't certain about this yet. Oct 27, 2023
Contributor

@julianbrost julianbrost left a comment

Is it really that easy to create a drop-in replacement for the CA certificate that also keeps all existing child certificates valid? So please provide a reference for which attributes have to be shared between both certificates for this to work.

lib/remote/apilistener.cpp (outdated, resolved)
@Al2Klimov
Member Author

Al2Klimov commented Nov 6, 2023

Test protocol

Setup

  1. 3x Debian 12
  2. 2x 1194eb5, 1x v2.13.6 (agent)
  3. icinga2 node wizard with correct CNs and IPs
  4. /etc/icinga2/zones.conf as below

Now you have a 3 lvl cluster with nothing.

/etc/icinga2/zones.conf

object Endpoint "master" {
  host = "10.27.3.102"
}

object Zone "master" {
  endpoints = [ "master" ]
}

object Endpoint "sat" {
  host = "10.27.0.135"
}

object Zone "sat" {
  endpoints = [ "sat" ]
  parent = "master"
}

object Endpoint "agent" {
  host = "10.27.3.177"
}

object Zone "agent" {
  endpoints = [ "agent" ]
  parent = "sat"
}

Preparation

  1. Stop Icinga everywhere
  2. Manipulate the CA validity period:
  3. On the master:
    * openssl req -x509 -days 1 -out ca.crt -subj '/CN=Icinga CA' -md5 -nodes -key /var/lib/icinga2/ca/ca.key
    * cat ca.crt >/var/lib/icinga2/ca/ca.crt
  4. Copy ca.crt to the others
  5. Everywhere: cat ca.crt >/var/lib/icinga2/certs/ca.crt

Congratulations! Now you have a 3 lvl cluster with nothing which is about to collapse in one day.

Examination

  1. Watch everyone's logs: tail -f /var/log/icinga2/icinga2.log &
  2. Start the master
  3. 👍 [2023-11-06 15:35:42 +0000] information/ApiListener: Our CA will expire soon, but we own it. Renewing.
  4. Start the satellite
  5. 👍 [2023-11-06 15:38:17 +0000] information/JsonRpcConnection: Updating the client certificate for CN 'sat' at runtime and reconnecting the endpoints.
  6. Start the agent
  7. 👎 Satellite says [2023-11-06 15:40:14 +0000] information/JsonRpcConnection: Certificate request for CN 'agent' is pending. Waiting for approval.
  8. 👎 Master says [2023-11-06 15:41:36 +0000] information/JsonRpcConnection: The certificates for CN 'agent' and its root CA are valid and uptodate. Skipping automated renewal.

Conclusion

To do:

  • pre-2.14.1 agents must work despite up-to-date leaf cert
  • parent shall check child chain only for actual requests of that child

@Al2Klimov Al2Klimov self-assigned this Nov 6, 2023
@Al2Klimov Al2Klimov removed the request for review from julianbrost November 6, 2023 15:51
@Al2Klimov
Member Author

Much better:

  1. 👍 [2023-11-06 16:46:16 +0000] information/JsonRpcConnection: Updating the client certificate for CN 'agent' at runtime and reconnecting the endpoints.

Prevented the cluster from collapsing! Everywhere openssl x509 -noout -text -in /var/lib/icinga2/certs/ca.crt shows fresh certs made by Icinga.

However:

👎 Satellite and agent started w/o master are now in a re-connect loop. 🙈

@Al2Klimov Al2Klimov force-pushed the renew-the-ca-9890 branch 3 times, most recently from e42f9e6 to a1e3402, November 6, 2023 17:55
@Al2Klimov
Member Author

I have to correct myself: Only the agents won't need an update.

@Al2Klimov Al2Klimov marked this pull request as ready for review November 6, 2023 18:02
@Al2Klimov Al2Klimov removed their assignment Nov 6, 2023
@Al2Klimov
Member Author

Test protocol II

Setup

  1. 3x Debian 12
  2. 2x b43f1e7, 1x v2.13.6 (agent)
  3. icinga2 node wizard with correct CNs and IPs
  4. /etc/icinga2/zones.conf as below

Now you have a 3 lvl cluster with nothing.

/etc/icinga2/zones.conf

object Endpoint "master" {
  host = "10.27.3.233"
}

object Zone "master" {
  endpoints = [ "master" ]
}

object Endpoint "sat" {
  host = "10.27.0.163"
}

object Zone "sat" {
  endpoints = [ "sat" ]
  parent = "master"
}

object Endpoint "agent" {
  host = "10.27.1.81"
}

object Zone "agent" {
  endpoints = [ "agent" ]
  parent = "sat"
}

Preparation

  1. Stop Icinga everywhere
  2. Manipulate the CA validity period:
  3. On the master:
    * openssl req -x509 -days 1 -out ca.crt -subj '/CN=Icinga CA' -md5 -nodes -key /var/lib/icinga2/ca/ca.key
    * cat ca.crt >/var/lib/icinga2/ca/ca.crt
  4. Copy ca.crt to the others
  5. Everywhere: cat ca.crt >/var/lib/icinga2/certs/ca.crt

Congratulations! Now you have a 3 lvl cluster with nothing which is about to collapse in one day.

Examination

  1. Watch everyone's logs: tail -f /var/log/icinga2/icinga2.log &
  2. Start satellite + agent
  3. 👍 They do basically nothing (especially no re-connect loop) after the usual greeting ceremony
  4. 👍 All CAs are still about to collapse according to openssl x509 -noout -text -in /var/lib/icinga2/certs/ca.crt
  5. Start the master
  6. 👍 Again no loop of any kind in the logs after a few renewals and re-connects
  7. 👍 All CAs are again valid for 15y according to openssl x509 -noout -text -in /var/lib/icinga2/certs/ca.crt

Conclusion

🎉

Contributor

@julianbrost julianbrost left a comment

The way this is currently implemented, there's no way to manually trigger the renewal early, is there? I don't mean from a "there's a CLI command for that" perspective but even if there was, the logic for the actual redeployment seems tied to the IsCertUptodate() function.

As an admin, I think I'd become nervous if a CA certificate came close to its expiry. Also, I think I would be surprised if the CA certificate changed magically as it probably needs to be replaced elsewhere too. Could the logic be adapted in a way to allow an early manual renewal? The current logic of renewing in the last 30 days could stay as a last attempt to prevent the cluster from breaking if the admin failed to do that.

lib/remote/apilistener.cpp (outdated, resolved)
@julianbrost
Contributor

Can you please share the CA certificates from one of your tests from before and after the renewal (ideally showing the diff of openssl x509 -text between both)?

@Al2Klimov
Member Author

The way this is currently implemented, there's no way to manually trigger the renewal early, is there? I don't mean from a "there's a CLI command for that" perspective but even if there was, the logic for the actual redeployment seems tied to the IsCertUptodate() function.

👍

As an admin, I think I'd become nervous if a CA certificate came close to its expiry. Also, I think I would be surprised if the CA certificate changed magically as it probably needs to be replaced elsewhere too. Could the logic be adapted in a way to allow an early manual renewal? The current logic of renewing in the last 30 days could stay as a last attempt to prevent the cluster from breaking if the admin failed to do that.

  1. Nervous sysadmins can be avoided by a larger threshold. What about 15y/4=3.75y, as per our x509 mod?
  2. Sure, the "soft rollover" here is surprising, but not too surprising. The old cert will still work, e.g. with curl. Well, until its expiry when you'll have to act anyway. Also there's a log message which may be a warning if you wish.

Can you please share the CA certificates from one of your tests from before and after the renewal (ideally showing the diff of openssl x509 -text between both)?

➜  icinga2 git:(renew-the-ca-9890) git diff -U2
diff --git a/lib/base/tlsutility.cpp b/lib/base/tlsutility.cpp
index 7917e2b38..85f614363 100644
--- a/lib/base/tlsutility.cpp
+++ b/lib/base/tlsutility.cpp
@@ -763,4 +763,5 @@ std::shared_ptr<X509> CreateCertIcingaCA(const std::shared_ptr<X509>& cert)
 bool IsCertUptodate(X509* cert)
 {
+       return false;
        time_t now;
        time(&now);
➜  icinga2 git:(renew-the-ca-9890) ✗ prefix/sbin/icinga2 api setup -x critical
Enabling feature api. Make sure to restart Icinga 2 for these changes to take effect.
Done.

Now restart your Icinga 2 daemon to finish the installation!

➜  icinga2 git:(renew-the-ca-9890) ✗ openssl x509 -noout -text -in prefix/var/lib/icinga2/certs//ca.crt >old.txt
➜  icinga2 git:(renew-the-ca-9890) ✗ prefix/sbin/icinga2 daemon -d
[2023-11-23 12:27:15 +0100] information/cli: Icinga application loader (version: v2.14.0-39-ge90d454c4; debug)
[2023-11-23 12:27:15 +0100] information/cli: Closing console log.
➜  icinga2 git:(renew-the-ca-9890) ✗ sleep 10
➜  icinga2 git:(renew-the-ca-9890) ✗ openssl x509 -noout -text -in prefix/var/lib/icinga2/certs//ca.crt >new.txt
➜  icinga2 git:(renew-the-ca-9890) ✗ diff -U 1000 old.txt new.txt
--- old.txt	2023-11-23 12:27:00
+++ new.txt	2023-11-23 12:27:37
@@ -1,84 +1,84 @@
 Certificate:
     Data:
         Version: 3 (0x2)
         Serial Number:
-            7b:62:fc:a2:e9:dc:ae:3f:78:b4:99:03:bd:fc:5f:96:85:ea:2f:66
+            30:9f:03:9a:b4:0c:bc:df:eb:73:d4:43:bb:ec:69:48:2f:16:34:54
     Signature Algorithm: sha256WithRSAEncryption
         Issuer: CN=Icinga CA
         Validity
-            Not Before: Nov 23 11:26:04 2023 GMT
-            Not After : Nov 19 11:26:04 2038 GMT
+            Not Before: Nov 23 11:27:15 2023 GMT
+            Not After : Nov 19 11:27:15 2038 GMT
         Subject: CN=Icinga CA
         Subject Public Key Info:
             Public Key Algorithm: rsaEncryption
                 RSA Public-Key: (4096 bit)
                 Modulus:
                     00:e5:17:99:96:1f:48:25:5e:cc:00:6b:39:2a:cd:
                     76:70:a6:73:aa:e8:56:76:96:78:33:c1:9f:84:7a:
                     1b:8d:86:d5:2c:5e:e7:e6:a7:eb:3b:cf:84:00:eb:
                     9b:23:45:0d:8d:89:e6:8f:84:f7:42:14:0a:ad:47:
                     97:43:7f:25:15:03:d8:d1:35:26:f0:38:43:97:08:
                     37:5f:b8:8a:b4:94:d0:92:a7:c8:2f:c6:24:cd:6c:
                     ec:22:da:ad:db:3c:36:cb:c6:cb:01:f3:d9:a7:dc:
                     d4:4b:2f:68:c9:e9:13:60:00:fb:78:97:96:29:9e:
                     38:ed:08:9a:73:93:a0:19:d8:d9:3e:94:0f:81:bd:
                     9d:1b:9e:f4:a2:d1:96:11:62:7c:4b:4b:b1:d0:21:
                     7c:34:f1:ef:5a:b7:92:b6:09:32:29:8d:4c:92:2d:
                     e7:b8:5b:a0:3a:4c:05:6d:30:61:f6:8f:f4:13:f5:
                     b9:f3:2e:6d:cc:c4:fd:c1:14:fb:1f:d5:70:18:12:
                     08:5f:e0:32:ee:5b:1a:1a:4a:57:7c:01:29:2c:7e:
                     13:9b:97:d9:4c:74:8d:77:7e:57:81:f8:8d:e6:c4:
                     e2:ae:2b:6f:a8:c3:a7:00:04:09:89:90:6e:3f:af:
                     ee:b9:86:e0:3f:f4:bd:15:20:d1:db:2b:21:cf:c0:
                     2f:f4:05:8a:9a:aa:cb:b0:00:68:fb:b8:0d:e3:48:
                     73:9e:75:01:e5:2d:7c:25:49:a1:0e:7e:6a:94:d3:
                     dc:77:9d:58:2b:df:4d:1a:be:e7:fb:6d:d6:6c:ed:
                     c6:85:46:cf:2d:18:2a:aa:53:70:a0:c0:7c:71:d9:
                     83:a5:12:5a:d6:39:e6:df:dd:44:14:00:60:62:25:
                     84:ed:f1:a6:56:25:78:ea:ef:67:e1:ec:f3:38:81:
                     c2:67:27:ad:b7:ee:a8:f5:f3:34:73:c2:ea:c5:f2:
                     d5:3d:f8:bc:46:b3:67:4a:48:19:7c:01:41:fb:45:
                     3c:bd:f5:5b:3d:6c:0b:42:fe:98:ed:1d:b2:d2:b2:
                     25:04:ca:ca:fe:fc:a9:4b:6f:35:fc:e4:f4:a0:59:
                     ae:8b:af:01:36:36:d9:3b:5a:d1:b2:ed:11:8e:f7:
                     97:bb:d3:6a:09:a6:98:d3:e7:f3:8b:38:98:8d:bb:
                     5b:b6:ad:de:ad:58:de:34:4e:44:d9:c7:f1:9c:91:
                     fd:25:48:47:bb:8c:21:c1:28:54:13:f4:43:ce:f8:
                     a0:09:91:38:c2:4a:da:c7:f6:7e:65:07:85:1d:b1:
                     d1:4b:2c:95:17:72:dc:f1:40:1c:ef:e1:4b:27:67:
                     6d:55:18:ce:87:fd:d9:cf:ed:bc:3f:79:36:86:96:
                     97:65:d7
                 Exponent: 65537 (0x10001)
         X509v3 extensions:
             X509v3 Basic Constraints: critical
                 CA:TRUE
     Signature Algorithm: sha256WithRSAEncryption
-         48:02:ad:a9:d9:0d:a6:5e:99:14:0e:4c:09:88:b7:bf:bc:1f:
-         2d:41:a7:7b:13:47:dc:40:e4:5c:55:18:ba:cf:bc:f4:1f:6c:
-         c5:58:73:81:c2:65:22:14:0e:72:84:67:c4:cd:2c:7f:64:82:
-         8e:db:eb:cf:af:03:38:56:b7:f9:0b:46:8f:0d:73:c9:68:27:
-         ae:9a:29:61:f8:f0:f3:38:ac:8a:10:d2:2a:d6:21:95:95:07:
-         e4:2e:2c:5c:26:42:b9:b5:13:d4:2c:c2:88:03:6b:a0:e7:b9:
-         32:98:72:59:51:c9:96:35:b6:5e:1c:69:6a:2f:69:35:41:03:
-         4e:58:9f:58:c5:1d:4f:ce:69:c9:66:c2:99:af:eb:14:f5:ea:
-         72:0f:67:95:ee:be:c9:51:06:50:30:17:56:4d:18:3d:14:bd:
-         29:c6:e3:d2:81:1c:db:bd:91:ba:59:37:4f:31:48:56:06:e1:
-         43:1d:f3:3e:ff:76:e3:e0:0c:93:99:48:79:49:c5:61:dd:f9:
-         68:81:fb:40:6e:ca:ee:5f:87:fe:47:48:e3:d6:c8:37:52:d2:
-         74:af:ff:e0:24:24:95:ac:db:a4:b3:d2:b6:80:69:17:4f:35:
-         7e:bf:ea:38:2a:80:3f:2a:62:5c:ad:52:f0:21:96:a3:f4:f3:
-         c3:62:31:60:b8:bf:22:d4:fe:42:32:5a:5a:a4:a0:62:87:47:
-         14:b7:6a:33:a8:3a:a6:a1:26:08:95:4a:5a:ea:43:35:22:09:
-         c7:4f:e8:92:4e:72:dc:b8:00:57:04:ab:bc:47:08:ce:e0:29:
-         93:19:34:c3:54:c9:72:7b:ca:53:17:c5:d3:14:b5:64:d5:ee:
-         0a:dd:79:c4:ab:09:80:9b:00:64:90:b9:34:cb:a3:b6:af:c8:
-         f4:bd:b0:96:f6:af:c2:6c:32:7f:70:be:d1:45:6e:4a:c6:40:
-         ed:b0:de:e4:76:fd:a1:a6:9b:cb:e2:49:25:05:9e:01:b3:d8:
-         7a:a0:70:f5:01:ec:76:e3:00:ba:af:bf:90:e1:c2:62:b4:b9:
-         12:e6:56:6f:d7:ba:e6:79:be:d9:ae:22:d8:0d:81:61:4a:8c:
-         7b:0f:46:c8:9b:de:b2:04:47:4a:dc:77:64:4e:61:a1:5f:b7:
-         38:25:bd:fe:92:a8:91:74:bb:72:a8:47:31:66:68:ab:0a:64:
-         2a:19:a4:38:10:d9:86:36:d2:89:16:22:15:69:b3:88:a5:43:
-         fc:33:4d:ca:ae:a0:c2:fd:30:9f:a3:66:40:19:3d:aa:4d:22:
-         d2:cd:19:0c:9d:0a:a2:a1:7b:02:b7:9d:01:2a:3b:e1:34:cc:
-         4f:a4:08:29:e9:3c:03:db
+         cd:fe:2d:48:9f:14:12:58:7c:2e:61:88:fc:b7:73:f1:30:f0:
+         78:4d:84:80:df:57:f9:ec:d5:31:35:9d:db:9d:7d:ef:e6:f0:
+         aa:d9:56:1f:c1:62:ed:ea:a5:8c:95:02:15:c1:57:98:02:d0:
+         a1:98:89:1e:ff:00:2d:7e:3f:80:20:ee:71:fd:8a:35:9e:7b:
+         24:06:d7:26:89:3b:62:88:5a:bd:6f:5e:d0:92:0c:93:17:b6:
+         53:f2:8b:a6:88:8d:33:71:4d:a5:ec:be:9e:68:39:d8:00:be:
+         43:08:e5:59:2b:bd:33:e8:f5:e0:c0:cc:77:8b:1f:4e:1d:58:
+         97:9c:0b:f9:32:4a:03:3c:ee:c4:da:86:e4:31:59:ef:af:66:
+         42:91:5a:dc:53:ec:06:d8:57:37:b9:85:5b:37:0d:e7:8c:a4:
+         e7:a6:f5:73:d8:71:59:82:54:3f:95:fb:a3:fa:a4:d3:2a:78:
+         eb:88:1f:e6:14:e8:3a:9d:b2:fb:d6:41:b4:03:fd:7c:45:86:
+         a6:98:01:eb:36:08:70:77:ae:af:93:a8:ae:33:d3:44:a7:eb:
+         30:a2:c5:ad:e1:f8:a7:92:68:f1:d2:a7:00:d4:a2:70:cb:ef:
+         d8:d5:35:7b:ce:b1:41:b4:8a:b0:23:be:b6:17:c6:72:05:28:
+         5b:f2:56:82:77:48:9e:39:7f:34:48:b9:dc:81:2c:ac:54:db:
+         6a:cb:ac:73:a2:ad:ad:e9:bf:19:19:cc:21:93:b2:e5:61:72:
+         00:9e:57:53:1f:88:5d:ca:85:65:06:cf:3b:db:d1:95:b6:d8:
+         65:d2:4d:6a:3e:e3:8d:84:76:ad:cd:a5:40:69:f9:2c:af:89:
+         79:1e:4b:5a:0e:27:9a:a5:0a:46:65:ce:be:27:29:4b:7a:f1:
+         e1:cd:dd:82:fa:2b:47:11:11:98:d4:45:b2:ad:95:c2:f6:47:
+         c3:d1:a0:5d:88:d7:ff:da:89:7b:06:ab:27:d0:2e:fa:a4:ef:
+         a2:64:4e:a2:9b:87:ab:d9:71:a6:3b:ab:15:d9:64:40:34:f5:
+         0b:ce:2b:53:cc:b2:0e:7c:8f:3b:8d:5e:39:98:16:18:7e:9a:
+         3d:e6:23:59:e5:7f:73:ba:03:ca:f9:51:d6:c5:cd:81:05:3f:
+         f6:6a:04:d4:c1:d1:b7:87:7c:71:88:4a:0b:58:03:38:a8:9a:
+         64:af:a1:23:60:63:4c:c4:6e:c7:c4:cc:4a:57:51:38:b2:d6:
+         1c:45:5f:87:3d:b2:3b:0f:1e:e2:e7:59:bb:75:c6:aa:79:cc:
+         64:56:fd:c2:df:91:ec:9b:58:49:49:70:5c:67:e7:d9:a6:d4:
+         44:7b:c7:c0:1f:7d:3c:e7
➜  icinga2 git:(renew-the-ca-9890) ✗

@julianbrost
Contributor

I've tested this PR in my test cluster with an additional patch that should give more frequent CA renewal, around every 5 minutes:

  1. Set renewal threshold for the CA to ROOT_VALID_FOR - 300, i.e. start renewing 300s/5m after the CA was issued (so immediately for the preexisting CA).
  2. Run the renewal timer every minute, so that there are no 24h waits during my tests.
  3. Disable a test case, as it detects that I intentionally "break" the IsCaUptodate() function.
diff --git a/lib/base/tlsutility.cpp b/lib/base/tlsutility.cpp
index 246bd5aee..cf991dbe7 100644
--- a/lib/base/tlsutility.cpp
+++ b/lib/base/tlsutility.cpp
@@ -796,7 +796,7 @@ bool IsCertUptodate(const std::shared_ptr<X509>& cert)
 
 bool IsCaUptodate(X509* cert)
 {
-       return !CertExpiresWithin(cert, LEAF_VALID_FOR);
+       return !CertExpiresWithin(cert, ROOT_VALID_FOR - 300);
 }
 
 String CertificateToString(X509* cert)
diff --git a/lib/base/tlsutility.hpp b/lib/base/tlsutility.hpp
index b06412020..74460736d 100644
--- a/lib/base/tlsutility.hpp
+++ b/lib/base/tlsutility.hpp
@@ -36,7 +36,7 @@ const unsigned int DEFAULT_CONNECT_TIMEOUT = 15;
 const auto ROOT_VALID_FOR  = 60 * 60 * 24 * 365 * 15;
 const auto LEAF_VALID_FOR  = 60 * 60 * 24 * 397;
 const auto RENEW_THRESHOLD = 60 * 60 * 24 * 30;
-const auto RENEW_INTERVAL  = 60 * 60 * 24;
+const auto RENEW_INTERVAL  = 60;
 
 void InitializeOpenSSL();
 
diff --git a/test/base-tlsutility.cpp b/test/base-tlsutility.cpp
index c20b5ed0f..15a4fa92f 100644
--- a/test/base-tlsutility.cpp
+++ b/test/base-tlsutility.cpp
@@ -87,6 +87,7 @@ BOOST_AUTO_TEST_CASE(sha1)
 
 BOOST_AUTO_TEST_CASE(iscauptodate_ok)
 {
+       return;
        auto key (GenKeypair());
 
        BOOST_CHECK(IsCaUptodate(MakeCert("Icinga CA", key, "Icinga CA", key, [](ASN1_TIME* notBefore, ASN1_TIME* notAfter) {
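
For readers following the patch above, here is a minimal sketch of why a threshold of ROOT_VALID_FOR - 300 makes a freshly issued CA count as outdated about five minutes after it was issued. It assumes that CertExpiresWithin() simply compares the remaining lifetime against the threshold. ExpiresWithin(), IsCaUptodateAsInPr(), IsCaUptodateStressTest(), kIssuedAt and kNotAfter are hypothetical stand-ins working on plain Unix timestamps, not the actual Icinga 2 functions.

```cpp
#include <cstdint>

// Values copied from the lib/base/tlsutility.hpp context shown in the diff above.
constexpr std::int64_t ROOT_VALID_FOR = 60LL * 60 * 24 * 365 * 15;
constexpr std::int64_t LEAF_VALID_FOR = 60LL * 60 * 24 * 397;

// Hypothetical stand-in for CertExpiresWithin(): true if less than
// `threshold` seconds of validity remain at time `now`. (Assumption.)
constexpr bool ExpiresWithin(std::int64_t notAfter, std::int64_t now, std::int64_t threshold)
{
    return notAfter - now < threshold;
}

// Threshold as in the PR at this point of the discussion.
constexpr bool IsCaUptodateAsInPr(std::int64_t notAfter, std::int64_t now)
{
    return !ExpiresWithin(notAfter, now, LEAF_VALID_FOR);
}

// Threshold from the stress-test patch.
constexpr bool IsCaUptodateStressTest(std::int64_t notAfter, std::int64_t now)
{
    return !ExpiresWithin(notAfter, now, ROOT_VALID_FOR - 300);
}

// A CA issued at t = 0 that is valid for ROOT_VALID_FOR seconds.
constexpr std::int64_t kIssuedAt = 0;
constexpr std::int64_t kNotAfter = kIssuedAt + ROOT_VALID_FOR;

static_assert(IsCaUptodateStressTest(kNotAfter, kIssuedAt + 60), "one minute old: still up to date");
static_assert(!IsCaUptodateStressTest(kNotAfter, kIssuedAt + 600), "ten minutes old: due for renewal");
static_assert(IsCaUptodateAsInPr(kNotAfter, kIssuedAt + 600), "unpatched threshold: far from renewal");
```

Under this model, combining the patched threshold with the one-minute RENEW_INTERVAL means a renewal becomes due on the first timer tick after the CA is 300 seconds old.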

Observations so far:

  1. If both masters have the CA key, they renew the CA certificate independently so this ends up with two new CA certificates.
  2. Should the deployment of the new CA happen quickly? Looks like nodes connected via satellites didn't get the new CA without manually reloading (or I didn't wait long enough?)
  3. After the first renewal went through with a bit of help, the timer renews the CA on the masters after 5 to 6 minutes, but it isn't deployed to other nodes, not even those connected directly and after manually restarting them.

@Al2Klimov
Member Author

  1. If both masters have the CA key, they renew the CA certificate independently so this ends up with two new CA certificates.

One more reason not to copy the CA over manually! I mean, it's still not broken, but... 🤯

Or! You can copy it over again to restore the symmetry. 🙈

  2. Should the deployment of the new CA happen quickly?

Well. The CA has LEAF_VALID_FOR time to get distributed through the whole cluster. That's 397d. That's 1.09y! Even in the extreme OP issue case there's a margin of safety of 5 months. Even that's 5x RENEW_THRESHOLD.

TL;DR: No. At least not quicker than leaf certs (already); that would be paradoxical.
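
The figures above can be checked against the constants from lib/base/tlsutility.hpp quoted earlier in this thread. This is only a small sketch: SECONDS_PER_DAY, ToDays() and ToYears() are names introduced here for illustration, and a year is taken as 365 days, as in ROOT_VALID_FOR.

```cpp
#include <cstdint>

// Helper constant introduced for this sketch; the values below match
// the ones shown in the tlsutility.hpp diff context.
constexpr std::int64_t SECONDS_PER_DAY = 60LL * 60 * 24;
constexpr std::int64_t ROOT_VALID_FOR  = SECONDS_PER_DAY * 365 * 15;
constexpr std::int64_t LEAF_VALID_FOR  = SECONDS_PER_DAY * 397;
constexpr std::int64_t RENEW_THRESHOLD = SECONDS_PER_DAY * 30;

// Whole days contained in a duration given in seconds.
constexpr std::int64_t ToDays(std::int64_t seconds)
{
    return seconds / SECONDS_PER_DAY;
}

// Duration in years, assuming 365-day years.
constexpr double ToYears(std::int64_t seconds)
{
    return static_cast<double>(seconds) / static_cast<double>(SECONDS_PER_DAY * 365);
}

static_assert(ToDays(LEAF_VALID_FOR) == 397, "LEAF_VALID_FOR is 397 days");
static_assert(ToYears(LEAF_VALID_FOR) > 1.08 && ToYears(LEAF_VALID_FOR) < 1.09, "roughly 1.09 years");
static_assert(ToDays(5 * RENEW_THRESHOLD) == 150, "5x RENEW_THRESHOLD is 150 days, about 5 months");
static_assert(ToDays(ROOT_VALID_FOR) == 5475, "the root certificate is issued for 15 years");
```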

  3. After the first renewal went through with a bit of help, the timer renews the CA on the masters after 5 to 6 minutes, but it isn't deployed to other nodes, not even those connected directly and after manually restarting them.

Tbh I prefer testing the actual code. If this PR LGTY, but you wanna be sure on X and Y, please write down X and Y. I'll manipulate a few certificates with faketime openssl and share what Icinga did. But yes, I'd expect... oh! You have to wait one day which is the minimum diff for a new CA to be considered newer!

@julianbrost
Copy link
Contributor

  1. If both masters have the CA key, they renew the CA certificate independently so this ends up with two new CA certificates.

One more reason not to copy the CA over manually! I mean, it's still not broken, but... 🤯

What are the other reasons why I wouldn't want to do this? I mean if only one master can sign certificates, I don't have redundancy for that.

  2. Should the deployment of the new CA happen quickly?

Well. The CA has LEAF_VALID_FOR time to get distributed through the whole cluster. That's 397d. That's 1.09y! Even in the extreme OP issue case there's a margin of safety of 5 months. Even that's 5x RENEW_THRESHOLD.

TL;DR: No. At least not quicker than leaf certs (already); that would be paradoxical.

But why doesn't it happen quickly? Shouldn't certificate renewal trigger reconnects which trigger certificate requests which should renew these certificates until everything is renewed?

  3. After the first renewal went through with a bit of help, the timer renews the CA on the masters after 5 to 6 minutes, but it isn't deployed to other nodes, not even those connected directly and after manually restarting them.

Tbh I prefer testing the actual code. If this PR LGTY, but you wanna be sure on X and Y, please write down X and Y. I'll manipulate a few certificates with faketime openssl and share what Icinga did. But yes, I'd expect... oh! You have to wait one day which is the minimum diff for a new CA to be considered newer!

My idea was to do this as kind of a stress test. Like change the thresholds so that it happens frequently, leave it running for a day so that you get hundreds of iterations, see if anything strange happens.

Member Author

@Al2Klimov Al2Klimov left a comment

  1. The number of signing masters is directly proportional to the number of times you have to run ssh and icinga2 ca list in the worst case to see where a particular CSR is.
  2. I agree that even the original code, not to mention mine, should re-connect on cert deployment. I can test this with my code as-is along with some hard requirements you write down. But I don't see this as a hard requirement due to the already mentioned time periods. If regular renewals work despite this "bug"(?), CA ones will do for sure. Not to mention the possible non-obvious influence of your stress test patch. Apropos...

if (requestorCA && !IsCaUptodate(requestorCA)) {
int days;

if (ASN1_TIME_diff(&days, nullptr, X509_get_notAfter(requestorCA), X509_get_notAfter(cacert.get())) && days > 0) {
Member Author

  1. You can't properly speedrun renewals without updating all conditions. This one requires the new CA to expire 1d+ after the old one for propagation.

@Al2Klimov
Member Author

Indeed:

  1. Satellite is connected to agent scenario: no-op as no newer CA is available
  2. Master is started: satellite gets newer CA
  3. Satellite is reloaded (likely sooner/later), resetting agent connection: agent gets newer CA

@Al2Klimov
Member Author

💡

Yes. The satellite got a new cert+CA. And it cut off all connections. And the agent re-connects. And there's a new CA. But! It's still the master who decides who gets which cert and when! And it may or may not be connected, yet.

[2023-12-18 13:09:07 +0000] information/JsonRpcConnection: Updating the client certificate for CN 'sat' at runtime and reconnecting the endpoints.
[2023-12-18 13:09:07 +0000] warning/JsonRpcConnection: API client disconnected for identity 'agent'
[2023-12-18 13:09:07 +0000] warning/ApiListener: Removing API client for endpoint 'agent'. 0 API clients left.
[2023-12-18 13:09:07 +0000] warning/JsonRpcConnection: API client disconnected for identity 'master'
[2023-12-18 13:09:07 +0000] warning/ApiListener: Removing API client for endpoint 'master'. 0 API clients left.
[2023-12-18 13:09:09 +0000] information/ApiListener: New client connection for identity 'agent' from [::ffff:10.27.3.142]:43910
[2023-12-18 13:09:09 +0000] information/ApiListener: Sending config updates for endpoint 'agent' in zone 'agent'.
[2023-12-18 13:09:09 +0000] information/ApiListener: Finished sending config file updates for endpoint 'agent' in zone 'agent'.
[2023-12-18 13:09:09 +0000] information/ApiListener: Syncing runtime objects to endpoint 'agent'.
[2023-12-18 13:09:09 +0000] information/ApiListener: Finished syncing runtime objects to endpoint 'agent'.
[2023-12-18 13:09:09 +0000] information/ApiListener: Finished sending runtime config updates for endpoint 'agent' in zone 'agent'.
[2023-12-18 13:09:09 +0000] information/ApiListener: Sending replay log for endpoint 'agent' in zone 'agent'.
[2023-12-18 13:09:09 +0000] information/ApiListener: Finished sending replay log for endpoint 'agent' in zone 'agent'.
[2023-12-18 13:09:09 +0000] information/ApiListener: Finished syncing endpoint 'agent' in zone 'agent'.
[2023-12-18 13:09:09 +0000] information/JsonRpcConnection: Received certificate request for CN 'agent' signed by our CA.
[2023-12-18 13:09:09 +0000] information/JsonRpcConnection: Certificate request for CN 'agent' is pending. Waiting for approval.
[2023-12-18 13:09:11 +0000] information/ApiListener: Reconnecting to endpoint 'master' via host '10.27.3.197' and port '5665'
[2023-12-18 13:09:11 +0000] information/ApiListener: New client connection for identity 'master' to [10.27.3.197]:5665

return key;
}

static std::shared_ptr<X509> MakeCert(char* issuer, EVP_PKEY* signer, char* subject, EVP_PKEY* pubkey, std::function<void(ASN1_TIME*, ASN1_TIME*)> setTimes)
Contributor

The char * parameters should be const, just noticed a new compiler warning scrolling by due to this:

.../test/base-tlsutility.cpp:92:36: warning: ISO C++ forbids converting a string constant to 'char*' [-Wwrite-strings]
   92 |  BOOST_CHECK(IsCaUptodate(MakeCert("Icinga CA", key, "Icinga CA", key, [](ASN1_TIME* notBefore, ASN1_TIME* notAfter) {
      |                                    ^~~~~~~~~~~

Member Author

But they're (unsigned char*)ed anyway.

Contributor

Gosh, I hate OpenSSL 1.0.2 (or those Linux distributions that think it's a good idea to keep that version alive), in 1.1.1 they realized that making that parameter const is probably a good idea.

Anyways, then please use const_cast<>(), that's more of a "I'm intentionally doing this" and shouldn't issue a warning.

Member Author

  1. Don't complain about your life. When I started working here, until my final exam, we had to support RHEL 5. I.e. Python 2.4, C++03 + auto and OpenSSL 0.9.8e with TLSv1.0. Not to mention the PHP version.
  2. The current cast does neither despite -Wall -Wextra (and is how our actual code does this).

Contributor

I mean we're writing software in C++, not in C, but I get it, for some reason, you and C-style casts are just inseparable.

@julianbrost
Contributor

Yes. The satellite got a new cert+CA. And it cut off all connections. And the agent re-connects. And there's a new CA. But! It's still the master who decides who gets which cert and when! And it may or may not be connected, yet.

So certificate requests are only forwarded if the parent node is currently connected? And if it's not, we're basically relying on ApiListener::m_RenewOwnCertTimer to resend the request after some time, where the parent zone hopefully is connected again?

@Al2Klimov
Member Author

Ah, yes! Almost forgot that timer. 👍

@Al2Klimov
Member Author

Al2Klimov commented Dec 18, 2023

Indeed, if I speedrun that timer (actually all) with faketime -f '+0 x700' on the agent, it gets everything without having to re-connect first. With some lag of course. 👍

@julianbrost
Contributor

I added the following to my patch from #9891 (comment):

diff --git a/lib/remote/jsonrpcconnection-pki.cpp b/lib/remote/jsonrpcconnection-pki.cpp
index 340e12b30..41fea9664 100644
--- a/lib/remote/jsonrpcconnection-pki.cpp
+++ b/lib/remote/jsonrpcconnection-pki.cpp
@@ -113,9 +113,9 @@ Value RequestCertificateHandler(const MessageOrigin::Ptr& origin, const Dictiona
                        }
 
                        if (requestorCA && !IsCaUptodate(requestorCA)) {
-                               int days;
+                               int days, secs;
 
-                               if (ASN1_TIME_diff(&days, nullptr, X509_get_notAfter(requestorCA), X509_get_notAfter(cacert.get())) && days > 0) {
+                               if (ASN1_TIME_diff(&days, &secs, X509_get_notAfter(requestorCA), X509_get_notAfter(cacert.get())) && (days > 0 || secs > 0)) {
                                        uptodate = false;
                                }
                        }

With that and the other insights, things now behave in a way where I understand what's happening: it takes about a minute per cluster level for the certificates to propagate (master -> satellite -> 2nd-level-satellite -> agent takes 3 minutes). This is probably influenced by me running this with docker compose where I start everything at the same time, so the timers are pretty much synchronized.
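
The difference between the original condition (days > 0) and the patched one (days > 0 || secs > 0) from the diff above can be sketched without OpenSSL. The model assumes, following the OpenSSL documentation for ASN1_TIME_diff(), that the difference is reported as whole days plus leftover seconds with matching signs. TimeDiff, Diff(), NewerByDays() and NewerByDaysOrSecs() are hypothetical names that operate on plain Unix timestamps instead of ASN1_TIME values.

```cpp
#include <cstdint>

constexpr std::int64_t kSecondsPerDay = 60LL * 60 * 24;

// Simplified model of the two outputs of ASN1_TIME_diff(). (Assumption.)
struct TimeDiff
{
    std::int64_t days;
    std::int64_t secs;
};

// Integer division and remainder truncate toward zero in C++11,
// so days and secs never have opposite signs.
constexpr TimeDiff Diff(std::int64_t from, std::int64_t to)
{
    return TimeDiff{(to - from) / kSecondsPerDay, (to - from) % kSecondsPerDay};
}

// Original condition: the new CA must expire at least one full day later.
constexpr bool NewerByDays(std::int64_t oldNotAfter, std::int64_t newNotAfter)
{
    return Diff(oldNotAfter, newNotAfter).days > 0;
}

// Patched condition: any later expiry counts as newer.
constexpr bool NewerByDaysOrSecs(std::int64_t oldNotAfter, std::int64_t newNotAfter)
{
    return Diff(oldNotAfter, newNotAfter).days > 0 || Diff(oldNotAfter, newNotAfter).secs > 0;
}

// A CA renewed five minutes after the previous one expires 300 seconds later.
static_assert(!NewerByDays(0, 300), "original condition: not considered newer");
static_assert(NewerByDaysOrSecs(0, 300), "patched condition: considered newer");
static_assert(NewerByDays(0, kSecondsPerDay), "a full day later satisfies both conditions");
static_assert(!NewerByDaysOrSecs(300, 0), "an older CA is never considered newer");
```

This matches the earlier remark in this thread that a sped-up renewal cycle does not propagate unless the new CA expires at least a day after the old one.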

@julianbrost
Contributor

Another test I did:

  1. Stop everything.
  2. Generate a CA that should be renewed immediately with faketime -3000days openssl req -x509 -days 3200 -out ca.dummy.crt -subj '/CN=Icinga CA' -noenc -key ca.key
  3. Start the masters with this PR, everything else with 2.14.0 -> masters and satellites+agents connected directly to a master get a renewed CA (after manual restarts, as otherwise I'd have to wait for 24h)
  4. Restart satellites with this PR -> everything in the cluster gets new certificates, including agents still on 2.14.0 (again, with some manual restarts)

@Al2Klimov Al2Klimov merged commit 8b2e28a into master Dec 19, 2023
25 checks passed
@Al2Klimov Al2Klimov deleted the renew-the-ca-9890 branch December 19, 2023 13:57
@Al2Klimov Al2Klimov added the "backported" label (Fix was included in a bugfix release) and removed the "consider backporting" label (Should be considered for inclusion in a bugfix release) May 14, 2024
Labels
area/distributed: Distributed monitoring (master, satellites, clients); backported: Fix was included in a bugfix release; bug: Something isn't working; cla/signed
Development

Successfully merging this pull request may close these issues.

Automatically renew the CA
3 participants