
Ability to seal/unseal own TLS certificate #17

Open · bnevis-i opened this issue Dec 11, 2023 · 12 comments
@bnevis-i commented Dec 11, 2023

Although there is a built-in PKI secrets engine, OpenBao can't bootstrap its own TLS certificate. A handy feature would be the ability to use the PKI secrets engine to create OpenBao's own TLS serving certificate and include it as part of the seal data (and deal with certificate rotation as well). While this would only be doable for a self-unsealing mechanism that doesn't require the API to be up to perform the unseal, it would resolve what today is a difficult chicken-and-egg problem. (Though this would be less of a problem if unsealing could be done over a named pipe or other secure non-network mechanism.)

Cloned from #16

@cipherboy (Member)

\o hey @bnevis-i --- some thoughts here.

I think this might be hard to achieve as written. Plugins cannot load until the underlying storage is decrypted and this can require manual involvement (Shamir's unseal, which is done remotely via API -- needing certificates in the HTTPS secure case), and generally requires the core + listeners to already be started. This bootstrapping leads to a chicken-and-egg problem, and breaking that cycle might be impossible without local unseal (or strictly requiring auto-unseal). However, if you have an existing certificate, this could be reused and refreshed periodically more easily... though this also requires Bao to be able to issue potentially privileged tokens (bypassing authentication plugins entirely) to itself (or, tightly scoped tokens for single API endpoints, using authentication information it can present to itself). This is all very hard, IMO, as written.

However, more broadly applicable (and doing 90% of what you want -- the steady-state renewal), I think, is using ACME similarly to how Caddy works. Let the listener point at an ACME server to fetch certificates from, allow local caching, and perhaps have a temporary self-signed certificate mode when necessary. Luckily the version Bao is based on supports ACME in the PKI plugin, so this should be much more easily achievable. Some detection (for pointing at ourselves) would be necessary -- so we opportunistically fetch via ACME but fall back to self-signing if there is no locally cached cert and no ACME responder is (yet) available -- but otherwise this should be quite tractable. This would also benefit people not using Bao's PKI capabilities.
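
Roughly, the fallback order could look like this (a minimal sketch, not OpenBao code; loadCachedCert and fetchViaACME are hypothetical stand-ins for the cache lookup and ACME client):

package tlsboot

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"math/big"
	"time"
)

// Hypothetical stand-ins for the cache lookup and opportunistic ACME fetch.
func loadCachedCert() (*tls.Certificate, error) { return nil, errors.New("no cached cert") }
func fetchViaACME() (*tls.Certificate, error)   { return nil, errors.New("no ACME responder yet") }

// selfSignedCert creates the temporary per-boot fallback certificate.
func selfSignedCert(host string) (*tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: host},
		DNSNames:     []string{host},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return &tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, nil
}

// listenerCertificate applies the opportunistic order described above:
// locally cached cert, then ACME, then a temporary self-signed cert.
func listenerCertificate(host string) (*tls.Certificate, error) {
	if cert, err := loadCachedCert(); err == nil {
		return cert, nil
	}
	if cert, err := fetchViaACME(); err == nil {
		return cert, nil
	}
	return selfSignedCert(host)
}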

Let me know your thoughts and if you'd like to formalize that into a proper RFC.

@bnevis-i (Author)

@cipherboy I can't figure out how to create a new ticket with the RFC template. I can't commit to technical exploration, prototyping, or a PR, but the least I can do is put down a use case. I like your ideas above.

Summary

This feature would allow OpenBao to self-manage its TLS serving certificate.

Problem Statement

Today, the TLS serving certificate and key must be provided via an external file and referenced by a configuration file.

OpenBao's ability to keep secrets uniquely positions it to be the root of trust (for lack of a better term) of a microservice ecosystem built on top of it. This is especially true of bare-metal and Docker-based architectures where there are no platform-provided capabilities for identity or secret storage.

Given that the main value proposition of OpenBao is its ability to keep secrets, the requirement to have a TLS private key unencrypted on disk for OpenBao to talk confidentially over the network is somewhat ironic.

Moreover, it means that OpenBao then relies on some OTHER root of trust to bootstrap it, and, practically speaking, an entirely manual process. With Google's push for 90-day TLS certificates, any certificate management process other than an automated one is practically unimplementable.

User-facing description

From the perspective of an OpenBao adopter, OpenBao would offer a mechanism to come up in a TLS-enabled mode by default.
In its simplest form, it could generate a per-boot self-signed TLS certificate.
In a more complex form, it could use ACME protocol to self-configure its TLS certificate from a source such as Let's Encrypt.
In a yet more complex form, it could issue itself a privileged token to self-request and self-approve a TLS leaf certificate from a pre-designated secrets engine, or perhaps make the request to an upstream OpenBao instance.
Certificates and keys could be provisioned using a fall-back mechanism. For example, use ACME or self-signed certificates prior to unsealing the data store.
Certificates and keys would be rotated automatically without a service restart when some (>50%?) portion of their TTL is consumed.
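
A minimal sketch of that rotation trigger, assuming the >50%-of-TTL heuristic above:

package rotation

import (
	"crypto/x509"
	"time"
)

// shouldRenew reports whether more than half of the certificate's
// validity window has already elapsed.
func shouldRenew(cert *x509.Certificate, now time.Time) bool {
	ttl := cert.NotAfter.Sub(cert.NotBefore)
	return now.Sub(cert.NotBefore) >= ttl/2
}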

Technical description

TBD

Rationale and alternatives

Downsides

Security Implications

User/Developer Experience

Unresolved Questions

Proof of Concept

@DemiMarie

The security of ACME depends on the security of DNS, which isn’t ideal. For publicly trusted certificates it is the best one can do, but OpenBao can do better.

Instead of trying to mess with self-issued tokens, I think a simpler approach would be to assume that if OpenBao can rotate its own cert, it should. Specifically, whenever OpenBao is set up as a certification authority, OpenBao can check if the issuing certificate it will be using is the one that issued its own TLS certificate. If so, then OpenBao can issue itself a new certificate, so it should. Users should be able to opt out of this behavior, but for most, I suspect it is exactly what they want.
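
A minimal sketch of that check, assuming OpenBao already holds both its serving certificate and the mount's issuing CA certificate (the function name is made up):

package selfissue

import "crypto/x509"

// issuedOwnCert reports whether the mount's issuing CA is the one that
// signed OpenBao's current serving certificate; if so, per the proposal
// above, OpenBao would rotate the certificate itself.
func issuedOwnCert(serving, issuingCA *x509.Certificate) bool {
	// CheckSignatureFrom verifies the issuing CA's key actually signed
	// the serving certificate.
	return serving.CheckSignatureFrom(issuingCA) == nil
}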

@Scorpil added the pending decision and feature labels and removed the enhancement (New feature or request) label on Dec 15, 2023
@cipherboy (Member)

\o Hey @DemiMarie, great to see you around OpenBao as well! Happy to review PRs for some of your other issues we didn't get to upstream. :-)

> The security of ACME depends on the security of DNS, which isn’t ideal. For publicly trusted certificates it is the best one can do, but OpenBao can do better.

While not ideal, as you point out, it is widely trusted by the CA/BF and Let's Encrypt has on the order of 300M active certificates. Automation protocols like ACME are widely deployed internally, and ACME is generally regarded as much better than previous iterations (SCEP, EST, or CMPv2), as those protocols lack real-time client path validation (DNS or direct connect) and rely on external authorization (PSK or established key material...) and CA policies for security. Things like multi-point validation (which admittedly isn't present in Bao, especially due to the lack of clustering from upstream's Enterprise offering) and DoH/DNSSEC (where Go seems to have taken the position that the system resolver should handle this on the application's behalf, or that the application should use a custom library) make DNS more palatable. Though, admittedly, there is still substantial concern over DNSSEC in particular.

However, this also doesn't reflect the reality of what I've heard from larger users. Often their networking (including DNS) is one of the few channels they trust, perhaps because it's internal to datacenters or otherwise tightly controlled.

Plus, this helps our interoperability with third-party private CA solutions. Outside of AD/CS, nearly every other major and minor private CA product supports ACME. Some do provide Bao plugins, though I think this is the exception, not the rule.

> Instead of trying to mess with self-issued tokens, I think a simpler approach would be to assume that if OpenBao can rotate its own cert, it should. Specifically, whenever OpenBao is set up as a certification authority, OpenBao can check if the issuing certificate it will be using is the one that issued its own TLS certificate.

I'm sorry, but I think you've lost me here. :-)

I think on the surface, this part of the comment makes sense from an ideal UX perspective as a bystander but misses the reality of the project.

Bao and its upstream aren't a tightly coupled monolith despite being shipped as such. Outside of the Core API routing layer, the "physical" storage backend, the authentication mechanism, and any provided secrets management are wholly plugin-oriented. The authorization and routing model relies on a token abstraction that is hard to disentangle. The Core (which holds the listener and thus needs the certificate) has no more knowledge of what is being run or its purpose than a simple routing table. One PKI plugin may differ substantially from the next, if only because API design was done at different companies, both wishing to provide a plugin for this platform.

IMO, this needs to be solved by making the issuance path explicit (in configuration, perhaps) and cannot be automatically inferred (which of N PKI plugins of M types is the best to use?). In a clustering product (which admittedly Bao is not -- though I do think it has inherited the HA / hot-failover "Standby" mode -- not "Performance Standby" -- of upstream), the backend plugin may not even be loaded and thus can't issue the certificate without loading and taking leadership. Furthermore, in a proper clustering environment ("Performance Standby nodes" in upstream's terminology), the plugin may be loaded, but the operator's specified role to issue against may require a storage write that a local node can't perform -- in upstream's model, this forwards the entire API request via gRPC up to the active cluster node to service from scratch. This makes restarts hard to handle and, if secondaries are brought up before primaries, may result in a delay in issuance (just like ACME pointing at itself -- though ACME at least could use multiple fallback DNS / RR directories, perhaps allowing it to bootstrap from some other already-running cluster copy).

This is where standardization has been missed by the ecosystem. While nominally some attributes and API routes may apply across providers of PKI plugins, specialized issuance parameters may not (IP SANs? SPIFFE?). And while certainly not perfect or complete, ACME helps alleviate most of this.

But perhaps there's some angle I haven't considered...

Could you perhaps expand on the exact use case and operating model for Bao you're expecting here? How would different layers of Bao know whether it is operating as a CA? How would you handle encrypted storage bootstrapping with manual unseal? With clustering/HA Standby nodes? How would you (safely) bypass authorization layers? Can we show this ultimately doesn't rely on DNS again (e.g., for forwarding between nodes, if necessary)? &c. Maybe there is an alternative here with better UX.

@bnevis-i (Author)

@cipherboy Great discussion and I learned a lot. Since you bring it up, is a SPIFFE auth plugin impossible? The Vault OIDC auth method thing makes my head spin.

@cipherboy (Member)

@bnevis-i Hmm, I'd move that particular thread to a new issue too if you don't mind. ;-)

The short of it is that I have no idea what "SPIFFE authentication" actually means (like you and OIDC, it seemed a bit complicated in my mind -- but I'll admit I didn't spend a lot of time looking at it). One of the original requesters of it seemed to point at an external (non-stdlib) TLS stack being required for it. While possible, this isn't appealing for a number of maintainability and supportability reasons. What this could mean is that the auth plugin might need to expose a non-standard port to strictly talk SPIFFE auth over, if replacing the entire TLS stack isn't ideal (similar to upstream's KMIP support, a protocol which doesn't transit on HTTP at all). But without knowing more about SPIFFE and the differences between SPIFFE and other auth methods, I'm most likely wrong. :-)

@DemiMarie

> \o Hey @DemiMarie, great to see you around OpenBao as well! Happy to review PRs for some of your other issues we didn't get to upstream. :-)

Hello! It’s rather funny, considering I have never used Vault or OpenBao myself, mostly since I have not yet been in a position to need either. But I do care about security, so I like to see security-related stuff work well.

> > The security of ACME depends on the security of DNS, which isn’t ideal. For publicly trusted certificates it is the best one can do, but OpenBao can do better.
>
> While not ideal, as you point out, it is widely trusted by the CA/BF and Let's Encrypt has on the order of 300M active certificates. Automation protocols like ACME are widely deployed internally, and ACME is generally regarded as much better than previous iterations (SCEP, EST, or CMPv2), as those protocols lack real-time client path validation (DNS or direct connect) and rely on external authorization (PSK or established key material...) and CA policies for security. Things like multi-point validation (which admittedly isn't present in Bao, especially due to the lack of clustering from upstream's Enterprise offering) and DoH/DNSSEC (where Go seems to have taken the position that the system resolver should handle this on the application's behalf, or that the application should use a custom library) make DNS more palatable. Though, admittedly, there is still substantial concern over DNSSEC in particular.
>
> However, this also doesn't reflect the reality of what I've heard from larger users. Often their networking (including DNS) is one of the few channels they trust, perhaps because it's internal to datacenters or otherwise tightly controlled.

I’m not at all surprised here. The main problem then becomes the challenge verification, if a non-DNS challenge is used.

> Plus, this helps our interoperability with third-party private CA solutions. Outside of AD/CS, nearly every other major and minor private CA product supports ACME. Some do provide Bao plugins, though I think this is the exception, not the rule.

Yup.

> > Instead of trying to mess with self-issued tokens, I think a simpler approach would be to assume that if OpenBao can rotate its own cert, it should. Specifically, whenever OpenBao is set up as a certification authority, OpenBao can check if the issuing certificate it will be using is the one that issued its own TLS certificate.
>
> I'm sorry, but I think you've lost me here. :-)
>
> I think on the surface, this part of the comment makes sense from an ideal UX perspective as a bystander but misses the reality of the project.

No surprise there.

> Bao and its upstream aren't a tightly coupled monolith despite being shipped as such. Outside of the Core API routing layer, the "physical" storage backend, the authentication mechanism, and any provided secrets management are wholly plugin-oriented. The authorization and routing model relies on a token abstraction that is hard to disentangle. The Core (which holds the listener and thus needs the certificate) has no more knowledge of what is being run or its purpose than a simple routing table. One PKI plugin may differ substantially from the next, if only because API design was done at different companies, both wishing to provide a plugin for this platform.
>
> IMO, this needs to be solved by making the issuance path explicit (in configuration, perhaps) and cannot be automatically inferred (which of N PKI plugins of M types is the best to use?). In a clustering product (which admittedly Bao is not -- though I do think it has inherited the HA / hot-failover "Standby" mode -- not "Performance Standby" -- of upstream), the backend plugin may not even be loaded and thus can't issue the certificate without loading and taking leadership. Furthermore, in a proper clustering environment ("Performance Standby nodes" in upstream's terminology), the plugin may be loaded, but the operator's specified role to issue against may require a storage write that a local node can't perform -- in upstream's model, this forwards the entire API request via gRPC up to the active cluster node to service from scratch. This makes restarts hard to handle and, if secondaries are brought up before primaries, may result in a delay in issuance (just like ACME pointing at itself -- though ACME at least could use multiple fallback DNS / RR directories, perhaps allowing it to bootstrap from some other already-running cluster copy).
>
> This is where standardization has been missed by the ecosystem. While nominally some attributes and API routes may apply across providers of PKI plugins, specialized issuance parameters may not (IP SANs? SPIFFE?). And while certainly not perfect or complete, ACME helps alleviate most of this.
>
> But perhaps there's some angle I haven't considered...
>
> Could you perhaps expand on the exact use case and operating model for Bao you're expecting here? How would different layers of Bao know whether it is operating as a CA? How would you handle encrypted storage bootstrapping with manual unseal? With clustering/HA Standby nodes? How would you (safely) bypass authorization layers? Can we show this ultimately doesn't rely on DNS again (e.g., for forwarding between nodes, if necessary)? &c. Maybe there is an alternative here with better UX.

  • Storage bootstrapping: This would need to be handled out of band somehow. If one is using OpenBao as its own CA, then presumably one has a way to do this.
  • Clustering/HA: I think this would be handled by implementing certificate renewal as an internally generated issuance request.
  • Authorization: Since the request is generated internally (rather than coming from a client), could it be treated as having been made with a root token?
  • Specialized issuance parameters: “Use the same ones as last time” would be a decent default, IMO.

Another option is for OpenBao to make an internal ACME request (to itself) that is treated as having already passed the challenge.

@cipherboy (Member)

\o hello @DemiMarie, sorry about the delay in getting back :-)

> Hello! It’s rather funny, considering I have never used Vault or OpenBao myself, mostly since I have not yet been in a position to need either. But I do care about security, so I like to see security-related stuff work well.

:D No worries! Qubes is a great project!

> Storage bootstrapping: This would need to be handled out of band somehow. If one is using OpenBao as its own CA, then presumably one has a way to do this.

Indeed. I wonder if we'd additionally need specialized recovery tools. In the simplest form, perhaps allowing fallback to self-signed certificates (and perhaps more tightly scoped listeners) with explicit pinning might work... This would allow initial setup and recovery operations, but perhaps require a restart of the daemon to fully reset into normal mode. Localhost-only might not be sufficient though, since in general this is clustering software.

Related in particular to this conversation, with the multitude of storage backends, we might also need to allow offline issuance of certificates (likely not stored, though I'm not quite sure how to force auditing) to allow revitalizing infrastructure in the event of an outage. If, say, an intermediate CA hosted in Bao expires that is used for, e.g., securing communication with a backing PostgreSQL data store, this might get complicated to revive. Self-signed wouldn't necessarily be sufficient or desired, so we'd definitely need issuance (against multiple mounts, roles, and types of certificates, potentially). Limiting to Raft could help prevent this dependency in this particular case.

I'm not quite sure how the manual tool would handle this case, though. Even if manual unseal keys were provided, it'd still need to access the backing data store, which might involve trusting expired certificates (which could be done manually). Maybe it is possible, but we'd have to test it in a lot of various scenarios.

> Authorization: Since the request is generated internally (rather than coming from a client), could it be treated as having been made with a root token?

I think this is where ACME is looking more appealing. It might be possible to avoid bypassing issuance authorization in most cases (e.g., by avoiding DNS challenge types and preferring ALPN/HTTP instead). Generating a root token would be possible in general... Though generally this requires using recovery keys to authorize manually and isn't typically done automatically. (So it is usually a privileged operation limited to recovery in the event of failure or initial setup). However you're definitely correct that a root token would let us successfully issue a cert using the non-ACME path.

> Specialized issuance parameters: “Use the same ones as last time” would be a decent default, IMO.

Agreed, though I don't know that we need this as much with ACME. But one trick is that OpenBao/our upstream don't cache and attribute requests in that way, so it's hard in general to redo an operation. It would be up to the client (i.e., the daemon trying to self-request a certificate) to re-issue the request with the same parameters, which means you still need a way of handling the first time.

> Clustering/HA: I think this would be handled by implementing certificate renewal as an internally generated issuance request.

I'll need to think more about clustering w.r.t. HA mode in OpenBao and whether this is sufficient. It might be, but we might still need to handle forwarding, as I'm not sure HA mode does request forwarding transparently to the calling client. Hmm...

@DemiMarie

> \o hello @DemiMarie, sorry about the delay in getting back :-)
>
> > Hello! It’s rather funny, considering I have never used Vault or OpenBao myself, mostly since I have not yet been in a position to need either. But I do care about security, so I like to see security-related stuff work well.
>
> :D No worries! Qubes is a great project!

Thank you! So is OpenBao.

> > Storage bootstrapping: This would need to be handled out of band somehow. If one is using OpenBao as its own CA, then presumably one has a way to do this.
>
> Indeed. I wonder if we'd additionally need specialized recovery tools. In the simplest form, perhaps allowing fallback to self-signed certificates (and perhaps more tightly scoped listeners) with explicit pinning might work... This would allow initial setup and recovery operations, but perhaps require a restart of the daemon to fully reset into normal mode. Localhost-only might not be sufficient though, since in general this is clustering software.
>
> Related in particular to this conversation, with the multitude of storage backends, we might also need to allow offline issuance of certificates (likely not stored, though I'm not quite sure how to force auditing) to allow revitalizing infrastructure in the event of an outage. If, say, an intermediate CA hosted in Bao expires that is used for, e.g., securing communication with a backing PostgreSQL data store, this might get complicated to revive. Self-signed wouldn't necessarily be sufficient or desired, so we'd definitely need issuance (against multiple mounts, roles, and types of certificates, potentially). Limiting to Raft could help prevent this dependency in this particular case.
>
> I'm not quite sure how the manual tool would handle this case, though. Even if manual unseal keys were provided, it'd still need to access the backing data store, which might involve trusting expired certificates (which could be done manually). Maybe it is possible, but we'd have to test it in a lot of various scenarios.

> > Authorization: Since the request is generated internally (rather than coming from a client), could it be treated as having been made with a root token?
>
> I think this is where ACME is looking more appealing. It might be possible to avoid bypassing issuance authorization in most cases (e.g., by avoiding DNS challenge types and preferring ALPN/HTTP instead). Generating a root token would be possible in general... Though generally this requires using recovery keys to authorize manually and isn't typically done automatically. (So it is usually a privileged operation limited to recovery in the event of failure or initial setup). However you're definitely correct that a root token would let us successfully issue a cert using the non-ACME path.

My mental model is that OpenBao does something like (in pseudo-Go):

	func handleRequest() {
		payload, err := unmarshalRequest()
		if err != nil {
			sendBadRequestResponse(err)
			return
		}
		credentials, err := unmarshalToken(payload.token)
		if err != nil {
			sendBadRequestResponse(err)
			return
		}
		dispatchRequest(payload.path, payload.body, credentials)
	}

OpenBao could instead create a synthetic credentials object that has the same privileges as a root token, even though no root token ever actually existed.
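
Continuing the pseudo-Go above (none of these types or fields correspond to real OpenBao internals):

	// syntheticRootCredentials mints a root-equivalent credentials value
	// in-process; it is never parsed from a client-supplied token.
	func syntheticRootCredentials() *Credentials {
		return &Credentials{
			Policies: []string{"root"}, // root-equivalent authorization
			Internal: true,             // rejected if ever seen on the wire
		}
	}

	// dispatchRequest(path, body, syntheticRootCredentials()) would then
	// route the internal issuance request without any real token existing.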

> > Specialized issuance parameters: “Use the same ones as last time” would be a decent default, IMO.
>
> Agreed, though I don't know that we need this as much with ACME. But one trick is that OpenBao/our upstream don't cache and attribute requests in that way, so it's hard in general to redo an operation. It would be up to the client (i.e., the daemon trying to self-request a certificate) to re-issue the request with the same parameters, which means you still need a way of handling the first time.

I was thinking of extracting them from OpenBao’s own TLS certificate.

@cipherboy (Member) commented Jan 6, 2024

Just dropping some notes here.

openbao/internalshared/configutil/listener.go describes the listener's configuration, but openbao/internalshared/listenerutil/listener.go handles taking the parsed config and building the underlying *tls.Config. This calls https://github.com/hashicorp/go-secure-stdlib/tree/main/reloadutil to handle the reloading. The nice thing is the ReloadFunc definition is simple, so we have two options:

  1. Leave the file-based loading alone, substituting in temporary paths for ACME-loaded certificates if necessary.
  2. Hijack the reload functionality entirely for ACME, not doing on-disk caching.

I think we can do 1 fairly easily, and reuse the existing paths transparently: if they exist, use them and refresh via ACME. Otherwise, point it at a temporary directory.

I think we can thus wrap the returned cg.Reload() function for ACME and handle writing fresh certs to the paths one-off at the start, if necessary.

I think this then makes the entire change relatively small and self-contained.
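
Concretely, the wrapper could be as small as the following sketch against reloadutil's ReloadFunc shape; fetchACMECertToDisk is a placeholder for the ACME fetch-and-write step:

package listenerutil

import (
	"errors"

	"github.com/hashicorp/go-secure-stdlib/reloadutil"
)

// fetchACMECertToDisk is a hypothetical helper that writes a freshly
// issued certificate and key to the configured (or temporary) paths.
func fetchACMECertToDisk(certPath, keyPath string) error {
	return errors.New("not implemented in this sketch")
}

// wrapReload refreshes the on-disk files via ACME, then lets the
// existing file-based getter re-read them.
func wrapReload(cg *reloadutil.CertificateGetter, certPath, keyPath string) reloadutil.ReloadFunc {
	return func() error {
		if err := fetchACMECertToDisk(certPath, keyPath); err != nil {
			return err
		}
		return cg.Reload()
	}
}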

Maybe:

+       TLSACMEDirectory string `hcl:"tls_acme_server"`
+       TLSACMERoot      string `hcl:"tls_acme_ca_root"`
+       TLSACMEKeyType   string `hcl:"tls_acme_key_type"`
+       TLSACMEEABId     string `hcl:"tls_acme_eab_id"`
+       TLSACMEEABKey    string `hcl:"tls_acme_eab_key"`
+       TLSACMEEmail     string `hcl:"tls_acme_email"`

for config parameters? This doesn't yet handle DNS challenges, but we can deal with those later if/when necessary.

Or should we nest it in an inner layer?

listener "tcp" {
    acme {
        server = ""
    }
}

IMO the latter is probably more extensible when it comes time for DNS challenges, so maybe that'd be my preference... Thinking about it more, if we want to identify alternative domains (e.g., bind to localhost:8200 but present as vault.myinstance.com), we'll need to add a domains field as well... so this is probably the best approach.
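
For illustration, the nested block might map onto a configutil struct like this (field and tag spellings are guesses, mirroring the flat diff above):

// ListenerACME would hang off the existing listener struct as, e.g.,
// ACME *ListenerACME `hcl:"acme,block"`.
type ListenerACME struct {
	Server  string   `hcl:"server"`
	Domains []string `hcl:"domains"`
}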

For choice of library, I'd probably use x/crypto/acme for now and implement it directly. I'm not quite sure yet how to hook reload for ALPN and HTTP challenges; when we do our initial reload, we'll be running before the listener is configured, so we can bind directly to the listen address and solve the challenges... However, we'll still need to figure out how to deal with the actual listener, once up. This might require hooking listener.AddHandler(...) for ALPN; I'm not sure about HTTP challenges, but this might require hooking the core directly in some way, as I think System is limited to the sys/ API space. I think DNS-based challenges might be easiest here, to bypass all of this, so maybe it is worth adding them in the initial pass.
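
For reference, x/crypto's autocert shows the shape of the ALPN plumbing, even if we implement against x/crypto/acme directly: the tls-alpn-01 challenge is answered inside GetCertificate when the ClientHello advertises acme.ALPNProto, so the listener's NextProtos must include it. The directory URL and domain below are hypothetical:

package main

import (
	"crypto/tls"

	"golang.org/x/crypto/acme"
	"golang.org/x/crypto/acme/autocert"
)

func alpnReadyTLSConfig() *tls.Config {
	m := &autocert.Manager{
		Prompt:     autocert.AcceptTOS,
		HostPolicy: autocert.HostWhitelist("vault.myinstance.com"),
		Client:     &acme.Client{DirectoryURL: "https://127.0.0.1:9999/v1/pki-int/acme/directory"},
	}
	return &tls.Config{
		GetCertificate: m.GetCertificate,
		// Advertising acme.ALPNProto lets the manager solve tls-alpn-01
		// handshakes on the same listener.
		NextProtos: []string{"h2", "http/1.1", acme.ALPNProto},
	}
}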

Another challenge I've not yet figured out is how to deal with multiple listeners. Ideally these would be only a single ACME call (a multi-SAN certificate), but we might have to figure out a loading procedure to make sure all listeners inform the ACME stack of the required domains and listen addresses, and then make sure only one actually issues the ACME challenges... Today, you could use multiple listeners with the same path, so adding an ACME directive to one should probably get you ACME for all (and maybe we'll want to validate that duplicate ACME directives for the same paths in different listeners don't exist).

This might warrant hoisting ACME up into a global directive then:

acme {
    name = <...>
    server = <...>
    domains = <...>
    [cert_file = <...>]
}

listener "tcp" {
    tls_cert_file = "<...>"
}

listener "tcp" {
    tls_acme = "<name>"
}

Here, multiple listeners could refer to the same ACME config either by file path (which would be unique) or by using a tls_acme directive, which would infer the file paths (which is useful for the case when a temporary file/directory is used for the cert+key). This also helps with uniqueness and listener ordering, as we could handle ACME first, prior to listener directives, which guarantees certs are available on disk.

@cipherboy (Member)

@DemiMarie said:

> > It would be up to the client (i.e., the daemon trying to self-request a certificate) to re-issue the request with the same parameters, which means you still need a way of handling the first time.
>
> I was thinking of extracting them from OpenBao’s own TLS certificate.

I had forgotten that we added this logic to the CLI:

func parseTemplateCertificate(certificate x509.Certificate, useExistingKey bool, keyRef string) (templateData map[string]interface{}, err error) {
	// Generate Certificate Signing Parameters
	templateData = map[string]interface{}{
		"common_name": certificate.Subject.CommonName,
		"alt_names":   makeAltNamesCommaSeparatedString(certificate.DNSNames, certificate.EmailAddresses),
		"ip_sans":     makeIpAddressCommaSeparatedString(certificate.IPAddresses),
		"uri_sans":    makeUriCommaSeparatedString(certificate.URIs),
		// other_sans (string: "") - Specifies custom OID/UTF8-string SANs. These must match values specified on the role in allowed_other_sans (see role creation for allowed_other_sans globbing rules). The format is the same as OpenSSL: <oid>;<type>:<value> where the only current valid type is UTF8. This can be a comma-delimited list or a JSON string slice.
		// Punting on Other_SANs, shouldn't really be on CAs
		"signature_bits":        findSignatureBits(certificate.SignatureAlgorithm),
		"exclude_cn_from_sans":  determineExcludeCnFromSans(certificate),
		"ou":                    certificate.Subject.OrganizationalUnit,
		"organization":          certificate.Subject.Organization,
		"country":               certificate.Subject.Country,
		"locality":              certificate.Subject.Locality,
		"province":              certificate.Subject.Province,
		"street_address":        certificate.Subject.StreetAddress,
		"postal_code":           certificate.Subject.PostalCode,
		"serial_number":         certificate.Subject.SerialNumber,
		"ttl":                   (certificate.NotAfter.Sub(certificate.NotBefore)).String(),
		"max_path_length":       certificate.MaxPathLen,
		"permitted_dns_domains": strings.Join(certificate.PermittedDNSDomains, ","),
		"use_pss":               isPSS(certificate.SignatureAlgorithm),
	}
	if useExistingKey {
		templateData["skid"] = hex.EncodeToString(certificate.SubjectKeyId) // TODO: Double Check this with someone
		if keyRef == "" {
			return nil, fmt.Errorf("unable to create certificate template for existing key without a key_id")
		}
		templateData["key_ref"] = keyRef
	} else {
		templateData["key_type"] = getKeyType(certificate.PublicKeyAlgorithm.String())
		templateData["key_bits"] = findBitLength(certificate.PublicKey)
	}
	return templateData, nil
}

This might need a little modification for leaf certificates, but should provide a good basis if we wanted to take this approach later.

@cipherboy (Member)

Summary

Allow OpenBao to self-manage TLS certificates for its listener via the ACME protocol, similar to Caddy's automated certificate management. This would align OpenBao with server projects like Caddy, Apache httpd, and nginx that can acquire and rotate their listener certificates automatically via ACME.

Problem Statement

Presently, OpenBao's TLS server certificate and key must be provided via external files (tls_cert_file and tls_key_file) referenced by the server's configuration file. This introduces friction into the TLS management workflow: certificates must be rotated and a SIGHUP sent to the server process to trigger a reload. This is awkward in a Kubernetes environment, where sending SIGHUP to a pod's process is difficult. Indeed, the dev mode server recently introduced flags to create a temporary CA and sign its dev server certificate with it, though it lacks rotation capabilities (as it is not meant to be used in production).

OpenBao's ability to keep secrets and keys uniquely positions it to be the X.509 root of trust of an ecosystem built on top of it. This is especially true of bare-metal and Docker-based architectures, where there are no platform-provided capabilities for identity or secret storage and every application has its own method of acquiring and using certificates.

Given that the main value proposition of OpenBao is its ability to rotate secrets, the inability to reuse an internal PKI service for its own leaf certificate, transparently handling its rotation, is a notable shortcoming.

Moreover, it means that OpenBao then relies on some other root of trust to bootstrap it, which may be a manual process if that trust does not itself support ACME (or, if no ACME client is present in the environment). Building an ACME client into OpenBao allows it to interact via open protocols to acquire certificates even when the environment lacks such tools, regardless of whether that source is internal or external.

With Google's push for 90-day public TLS certificates, any certificate management process other than an automated one is practically unimplementable for certificates from a public CA. When OpenBao is connected to the public internet, it is thus doubly important to have ACME support built-in.

User-facing Description

ACME is the most widely adopted certificate management protocol for TLS certificates. While EST, CMPv2, and SCEP exist for different niches (seemingly IoT, Telco, and device attestation respectively), none are as widely adopted as ACME, in either the public CA or private CA space. Further, ACME was the first protocol to bridge the disparate public CA APIs and domain validation processes into a single, automated protocol.

OpenBao defaults to a TLS-enabled listener because secrets management fundamentally isn't secure unless that connection is secure: even over private, internal networks, HTTPS is preferable to prevent accidental logging or packet captures from containing sensitive data. However, this listener lacks certificates by default: using ACME can fix this.

ACME aims to automate certificate issuance: when an ACME directory is known (usually defaulting to Let's Encrypt by convention otherwise), an ACME client can authenticate itself to a certificate authority and request certificates for particular domain name(s). The ACME protocol currently defines three challenge types: an HTTP challenge, hitting a /.well-known/acme-challenge URI on the server; a TLS-ALPN challenge, using TLS's ALPN protocol negotiation to authenticate itself; and a DNS challenge, requiring provisioning of DNS records visible to the CA. When run in internal networks, this last method is preferable: often OpenBao is run on non-standard ports, which the former two challenge types do not support.

By utilizing ACME in conjunction with a localhost-only HTTP listener (secure as traffic never leaves the system) and auto-unseal (for automated startup), OpenBao can thus automatically provision certificates from a local PKI instance over the ACME protocol, becoming its own root of trust. OpenBao's initialization must be done locally over the localhost listener, as no PKI hierarchy or storage subsystem is yet configured and thus auto-fetching of certificates will fail. Furthermore, subsequent manual unseals would again have to be performed over the localhost interface, as the ACME client will be unable to fetch certs while OpenBao is sealed.

By utilizing "on-demand" ACME certificates (whereby certificates for requested SNIs can be automatically requested from the CA), no additional service name configuration is required. Note that reusing the bind address may not work, as it may not refer to a particular domain but instead to a global listen address such as 0.0.0.0.

Technical Description

The Caddy server uses certmagic for its ACME management needs. This offers a few conveniences over x/crypto/acme: caching and rotation are handled automatically, and on-demand issuance allows for minimal default configuration.
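
A sketch of that on-demand mode, assuming a recent certmagic release (where DecisionFunc takes a context; the domain is hypothetical):

package main

import (
	"context"
	"crypto/tls"
	"fmt"

	"github.com/caddyserver/certmagic"
)

func onDemandTLSConfig() *tls.Config {
	magic := certmagic.NewDefault()
	magic.OnDemand = &certmagic.OnDemandConfig{
		// DecisionFunc gates handshake-time issuance so arbitrary SNIs
		// can't exhaust the ACME account.
		DecisionFunc: func(ctx context.Context, name string) error {
			if name != "vault.myinstance.com" {
				return fmt.Errorf("SNI %q not in allow-list", name)
			}
			return nil
		},
	}
	// The returned tls.Config's GetCertificate consults the cache and,
	// for permitted names, issues on demand.
	return magic.TLSConfig()
}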

Notable complexity still remains, however: in addition to more attributes to configure TLS with ACME (including choice of directory, an optional root CA for connecting to said directory, and any external account bindings), for instances wishing to use DNS providers, additional support is necessary for provisioning a libdns instance. This will likely require creating a generic, HCL- (and JSON-, for the community at large) config-driven dispatching library over those providers.

When using HTTP challenges, ACME requires the CA to validate against port 80. When this port is bound by OpenBao, we'll solve challenges through OpenBao's http.ServeMux instance. However, when it is unbound, certmagic will automatically attempt to provision a listener on port 80. This is both desirable and undesirable: a short-lived port 80 means few (if any) accidental requests will come to this port, preventing inadvertent leakage of tokens & secrets due to misconfigured clients doing automatic http->https upgrades. However, this may still fail when the OpenBao instance is not running with sufficient (root or systemd/... port mapping) permissions.
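
A sketch of the bound-port case, assuming certmagic's ACMEIssuer and its HTTPChallengeHandler wrapper (issuer construction elided):

package main

import (
	"net/http"

	"github.com/caddyserver/certmagic"
)

// serveWithHTTP01 intercepts requests under /.well-known/acme-challenge/
// via the wrapping handler; everything else falls through to our mux.
func serveWithHTTP01(mux *http.ServeMux, issuer *certmagic.ACMEIssuer) error {
	return http.ListenAndServe(":80", issuer.HTTPChallengeHandler(mux))
}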

Further, Fraser's IETF draft for ACME service discovery appears not to have been accepted by the ACME WG, and a replacement standard's reliance on CAA records makes it largely unsuitable for internal, enterprise usage. This makes it hard for us to choose a default-correct ACME provider. Thus, we'll rely on Let's Encrypt as a default, suggesting users set their own directory if running on a local network.

Certificates and keys are stored in-memory. This means that each restart of the service will request new certificates, but it also prevents them from being stored on disk unencrypted. When provisioned as the root of trust, this shouldn't cause issues, as OpenBao may not have quota information and thus won't rate-limit itself. However, when running against a public CA, too-frequent restarts may trigger global rate limiting. This may warrant an option in the future to cache certificates on disk, if only to help restarts and to avoid a CA outage affecting the ability to become operational. This would also allow unsealing OpenBao externally via the TLS listener on subsequent startups.

The following parameters will be added to the TCP listener configuration (a struct sketch follows the list):

  • tls_acme_cache_path, to control where ACME account information (and perhaps, in a future release, certificates & keys) are stored. Per certmagic, this defaults to ~/.local/share/certmagic if $XDG_DATA_HOME is unset.
  • tls_acme_ca_directory, the default ACME CA directory path; this defaults to Let's Encrypt's production ACME instance if unset.
  • tls_acme_ca_root, an optional root CA to trust to validate connections to the ACME directory; defaults to all system-trusted CAs.
  • tls_acme_eab_key_id and tls_acme_eab_mac_key, for providing optional external account bindings.
  • tls_acme_key_type, for choosing the type of leaf keys to request.
  • tls_acme_email, for setting the optional account notification email to subscribe to the CA's messages about certificate expiry.
  • tls_acme_dns_config, a structure for setting DNS provider configuration; doing so disables HTTP and ALPN challenges.
    • provider, a typed, per-provider configuration structure to control how DNS is managed.
    • ttl, for setting how long temporary challenge records will live.
    • propagation_delay, for setting how long to wait until starting propagation checks against the DNS resolver.
    • propagation_timeout, for setting how long propagation checks should run for until assuming DNS record provisioning failed, prior to completing the challenge.
    • resolvers, to use to perform DNS challenge lookups.
    • override_domain, to delegate the challenge to a different domain.
  • tls_acme_domains, an optional allow-list of domains to acquire certificates for; unrecognized SNIs will be ignored.
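
Mapped onto configutil struct fields, these might look like the following (all field and tag spellings here are hypothetical):

// ListenerACMEOptions collects the proposed tls_acme_* parameters.
type ListenerACMEOptions struct {
	TLSACMECachePath   string         `hcl:"tls_acme_cache_path"`
	TLSACMECADirectory string         `hcl:"tls_acme_ca_directory"`
	TLSACMECARoot      string         `hcl:"tls_acme_ca_root"`
	TLSACMEEABKeyID    string         `hcl:"tls_acme_eab_key_id"`
	TLSACMEEABMacKey   string         `hcl:"tls_acme_eab_mac_key"`
	TLSACMEKeyType     string         `hcl:"tls_acme_key_type"`
	TLSACMEEmail       string         `hcl:"tls_acme_email"`
	TLSACMEDNSConfig   *ACMEDNSConfig `hcl:"tls_acme_dns_config"`
	TLSACMEDomains     []string       `hcl:"tls_acme_domains"`
}

// ACMEDNSConfig mirrors the tls_acme_dns_config sub-structure.
type ACMEDNSConfig struct {
	Provider           map[string]string `hcl:"provider"`
	TTL                string            `hcl:"ttl"`
	PropagationDelay   string            `hcl:"propagation_delay"`
	PropagationTimeout string            `hcl:"propagation_timeout"`
	Resolvers          []string          `hcl:"resolvers"`
	OverrideDomain     string            `hcl:"override_domain"`
}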

Rationale and Alternatives

Making TLS infrastructure easier to manage is a core use case of OpenBao; this feature brings much-needed UX improvements to this side of instance management.

One alternative would be to implement native support for OpenBao's PKI secrets engine's APIs. This could be done by utilizing support for parsing existing certificates back into API parameters, but it would require an authentication bypass, as there's no concept of a "self" token for listener-to-plugin requests. Indeed, OpenBao may not be unsealed yet and thus may not even have access to unencrypted storage. Without solving the self-authentication approach, ACME support is preferable, as a localhost-only listener can be used to fetch certificates once the instance is unsealed.

Regardless of approach, though (ACME vs. native PKI APIs), unless some persistent caching of certificates and keys is utilized (likely unencrypted on disk), an auto-unseal mechanism must be used in order to automatically retrieve certificates on startup. Additionally, while native PKI APIs make sense for cross-instance communication in a situation where DNS could not be provisioned or necessarily trusted, DNS must still be trusted for the API request, and thus there's little benefit over ACME unless running on non-standard ports.

Downsides

The approach above for two listeners (HTTP on localhost, HTTPS bound preferably) stipulates that all requested SNIs be self-reachable unless DNS challenges are used; this may not be the case in, e.g., dual-homed virtualization environments with validated IP addresses (wherein a passthrough listener is used in conjunction with a VM-network or local listener; for DNS hostnames this can be avoided by configuring a local /etc/hosts entry to point at localhost).

Security Implications

  • This approach requires trusting DNS to validate issuance requests as documented in RFC 8555; though arguably this was already an issue with HTTPS listeners and X.509 validation today (unless Unix sockets were used via alternative routing protocols).

  • However, this approach does not impact any internal authentication or authorization code: this automated ACME client is no more or less privileged than any other ACME client, and so its security impact is well-understood.

  • Upstream CA outages may cause degraded user experience downstream: if no local cache is implemented or required, restarting OpenBao would result in clients being unable to connect until it is resolved.

  • If an allow-list of domains (for on-demand creation) is not specified, a remote attacker may be able to cause a DoS by sending requests with arbitrary SNI information (depending on how OpenBao is configured) which may negatively impact the ACME account (if the CA implements per-account ratelimiting).

User/Developer Experience

When deploying publicly routable, production-grade instances, this would make the experience much smoother, assuming Let's Encrypt is the preferred certificate provider.

Otherwise, when an organization can use ACME, this requires setting fewer parameters and less provisioning (at minimum the tls_acme_ca_directory value, rather than at minimum both tls_cert_file and tls_key_file).

Unresolved Questions

The current Proof of Concept lacks ALPN plumbing and libdns support; both should be achievable, though the latter will require building a wrapper library.

Additionally, certmagic requires Uber's zap logger, which is not compatible with HashiCorp's go-hclog. This means that configuration options set for logging will not be respected by certmagic and are presently disabled. These log lines look like e.g.:

1.7098166561676574e+09 info maintenance started background certificate maintenance {"cache": "0xc000883a00"}

as they don't use our formatter. While we could update the formatting of the log message to use the new format, it might be hard to ensure the log locations are reused and adequately locked to allow two disparate writers.

Proof of Concept

See code: https://github.com/cipherboy/openbao/pull/new/auto-tls-listener.

Presently this is buildable, but requires sudo permissions to bind to port 80 for challenge verification, which isn't ideal. DNS support should help to alleviate this in the near future, but will require building the aforementioned library first.

Assuming that's alright: use devbao to provision and then stop a new node:

$ devbao node start --name prod --force --initialize --unseal --profiles pki --listeners "tcp:0.0.0.0:9999,tcp:0.0.0.0:8200"
$ bao read -format=raw pki-root/ca/pem > root.pem
$ killall bao

This node is pre-provisioned with ACME support; we save the root CA for later.

Modify the configuration file like so:

listener "tcp" {
  address = "127.0.0.1:9999"
  tls_disable = true
}

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_acme_ca_directory = "http://127.0.0.1:9999/v1/pki-int/acme/directory"
}

disable_mlock = true
storage "raft" {
  path = "/home/cipherboy/.local/share/devbao/nodes/prod/storage/raft"
}

cluster_addr = "https://localhost:8201"
api_addr = "https://localhost:8200"
plugin_directory = "/home/cipherboy/.local/share/devbao/nodes/prod/plugins"

Namely, we change the second listener to use a custom ACME directory pointed at the first listener. I've also updated cluster_addr and api_addr to use the second listener's port, though this matters less.

Now we can restart bao manually:

$ sudo /home/cipherboy/go/bin/bao server -exit-on-core-shutdown -config=/home/cipherboy/.local/share/devbao/nodes/prod/config.hcl

and in a new terminal, unseal bao using the localhost-only listener:

$ . <(devbao node env prod) # source env
$ bao operator unseal "$(devbao node get-unseal prod | head -n 1)"
$ bao operator unseal "$(devbao node get-unseal prod | tail -n 1)"
$ bao secrets list
$ bao list pki-int/certs # (should have a few certs)

(ignoring the warning about the node not running, since it wasn't started through devbao).

Finally, the new address can be used:

$ export VAULT_ADDR=https://localhost:8200
$ export VAULT_CACERT=root.pem
$ bao list pki-int/certs # (should have more certs)

Note that changing VAULT_ADDR to equivalent addresses will also work (e.g., localhost.localdomain, 127.0.0.1, &c): the initial connection will be slower but subsequent ones will be fast.

@cipherboy added the rfc label on Jun 20, 2024