Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpCerts chains are required to be in memory and storage as a block, preventing memory optimizations #7695

Closed
tcarmelveilleux opened this issue Jun 16, 2021 · 2 comments · Fixed by #19819
Assignees

Comments

@tcarmelveilleux
Copy link
Contributor

tcarmelveilleux commented Jun 16, 2021

Problem

AdminPairingInfo has a large linear blob of storage for one set of opcerts for a given fabric:

    struct StorableAdminPairingInfo
    {
        uint16_t mAdmin;    /* This field is serialized in LittleEndian byte order */
        uint64_t mNodeId;   /* This field is serialized in LittleEndian byte order */
        uint64_t mFabricId; /* This field is serialized in LittleEndian byte order */
        uint16_t mVendorId; /* This field is serialized in LittleEndian byte order */

        uint16_t mRootCertLen; /* This field is serialized in LittleEndian byte order */
        uint16_t mOpCertLen;   /* This field is serialized in LittleEndian byte order */

        Crypto::P256SerializedKeypair mOperationalKey;
        uint8_t mRootCert[kMaxChipCertSize];
        // The operationa credentials set can have up to two certs -> ICAC and NOC
        uint8_t mOperationalCert[kMaxChipCertSize * 2];
        char mFabricLabel[kFabricLabelMaxLengthInBytes + 1] = { '\0' };
    };

This contains 3x opcerts at kMaxChipCertSize (600).

Furthermore, the OperationalCredentialSet expects to keep all of these in-memory, per fabric, in a way that they cannot be used in place.

Both of these together cause a large amount of data to be maintained in RAM on embedded platform, when they are actually very infrequently used (only during case), and only ever used one at a time.

This prevents memory optimizations on memory-starved devices and is an obstacle to fitting multi-fabric support on some low-RAM targets. Already some Silabs parts testing commissioning flow with real opcerts are running out of heap due to fragmentation because of these large blocks.

Proposed Solution

This particular problem can be addressed in several steps:

  1. Communicating to the dev team that operational certs are infrequently used, and their lookup from non-volatile storage on-demand is a very small cost compared to the time cost of signature verifications in which they take part, and memory cost in maintaining them in RAM working set permanently.
  2. Split kMaxChipCertSize from 600 (which is the max for DER certs for Device Attestation and for DER version of opcerts) to two values:
  • kMaxChipOpCertSize = 400 -> The maximum size of a TLV cert
  • kMaxCHIPDerCertSize = 600 -> The guaranteed max size of a DER cert like for DAC/PAI/PAA, and opcert at commissioner prior to TLV conversion
  • This is a savings of 33%
  1. Refactor OperationalCredentialSet and code using it, to never expect an "already in-place" version, and rather to look them up from persisted storage every time. Currently this keeps copies of all elements of the opcert chain in HEAP (see AdminPairingInfo::SetXXXCert which allocs into heap). Release is done at the end of CASE, but the whole set (3x certs) is needed in memory at a time. If N incoming CASE sessions are in progress, Nx3 certs are maintained.
  2. Introduce a TlvCertificateStore interface, which can be provided by platform to enable optimization of storage/lookup of operational certs.

All primitives in the spec (and their related implementations) should be able to work with OpCerts one by one. They do not need to be maintained in memory permanently, or contiguously.

The following interface sketch would fulfill all requirements of Matter Operational Certs:

  • Allow removal of roots by SKID
  • Allow lookup of "next in chain" by AKID/SKID
  • Allow lookup of NOC by full node reference if needed (PK + fabric id + node id)
  • Add simple implementation of AddTrustedRootCertificate/RemoveTrustedRootCertificate/AddOpCert/UpdateOpCert/RemoveFabric to spec with no dangling issues and no duplication of certificates
  • Allow optimization of large blob storage by platform, since OpCerts are the largest contributor to non-volatile stored state in quantity and size
  • Make clear that you can just look-up the certs in a temp buffer.
// Store/Retrieve TLV certificates by properties.
// Has all the primitives that may be needed to implement
// Matter operational certs
class TlvCertificateStore
{
 public:
  // Search by SKID --> Good to find known cert / next in chain
  virtual CHIP_ERROR GetBySkid(
    const uint8_t* cert_skid,
    size_t skid_len,
    uint8_t * cert_buffer_out,
    size_t cert_buffer_len) = 0;

  // Search by Public Key --> Used by GetByFullFabricRef
  virtual CHIP_ERROR GetByPublicKey(
    const uint8_t* public_key,
    size_t public_key_len,
    uint8_t * cert_buffer_out,
    size_t cert_buffer_len) = 0;

  // Good to find a NOC, so that chain can be followed
  virtual CHIP_ERROR GetByFullFabricRef(
    const uint8_t* public_key,
    size_t public_key_len,
    uint64_t fabric_id,
    uint64_t node_id,
    uint8_t * cert_buffer_out,
    size_t cert_buffer_len) = 0;

  // Insert any cert
  virtual CHIP_ERROR Insert(uint8_t * cert_buffer, size_t cert_buffer_len) = 0;

  // Remove by public key. Unambiguous.
  virtual CHIP_ERROR RemoveByPublicKey(const uint8_t* public_key, size_t public_key_len) = 0;
  // Remove by content. Unambiguous.
  virtual CHIP_ERROR RemoveByCertContent(const uint8_t* cert_buffer, size_t cert_buffer_len) = 0;
};
@tcarmelveilleux tcarmelveilleux changed the title OpCerts are required to be permanently in memory, preventing memory optimizations OpCerts chains are required to be in memory and storage as a block, preventing memory optimizations Jun 16, 2021
@tcarmelveilleux
Copy link
Contributor Author

The trust store concepts for roots mentioned in this issue would help solve #17208 and #17209

tcarmelveilleux added a commit to tcarmelveilleux/connectedhomeip that referenced this issue May 16, 2022
- For controllers that make use of their own key management,
  such as OS-aided key management, it is currently impossible
  to pass-in the Operational Key pair to use for a given controller.

This PR adds APIs to avoid FabricTable from trying to manage
the lifecycle of the operational key, and allow the caller/initializer
of CHIPDeviceControllerFactory to properly inject its own operational
key that it manages, without attempts of storage or dynamic memory
management.

Issue project-chip#18444
Issue project-chip#7695

Testing done:
- Unit tests still pass, cert tests as well
- This was tested internally by us using the APIs to
  manage a key for Android. That change will be a follow-up
tcarmelveilleux added a commit that referenced this issue May 17, 2022
)

* Allow external ownership of CHIPDeviceController operational key

- For controllers that make use of their own key management,
  such as OS-aided key management, it is currently impossible
  to pass-in the Operational Key pair to use for a given controller.

This PR adds APIs to avoid FabricTable from trying to manage
the lifecycle of the operational key, and allow the caller/initializer
of CHIPDeviceControllerFactory to properly inject its own operational
key that it manages, without attempts of storage or dynamic memory
management.

Issue #18444
Issue #7695

Testing done:
- Unit tests still pass, cert tests as well
- This was tested internally by us using the APIs to
  manage a key for Android. That change will be a follow-up

* Restyled by clang-format

* Apply suggestions from code review

Co-authored-by: Boris Zbarsky <[email protected]>

* Address review comments

* Restyled by clang-format

* Address review comments

Co-authored-by: Restyled.io <[email protected]>
Co-authored-by: Boris Zbarsky <[email protected]>
@woody-apple
Copy link
Contributor

@tcarmelveilleux is this now complete? What remains here?

tcarmelveilleux added a commit to tcarmelveilleux/connectedhomeip that referenced this issue Jun 13, 2022
To support moving to non-permanent storage, need to ensure
there is never direct access to certificates from FabricInfo classes
outside the FabricTable which owns all validations. This prevents
dangling FabricInfo instances and enables the changes needed to
make the fail-safe work to spec for AddNOC, UpdateNOC and
AddTrustedRootCertificate.

Issue project-chip#15585
Issue project-chip#7695

- Always go through the FabricTable, don't allow going directly via
  FabricInfo
- Updated CASESession to go through FabricTable also
- Getters for certs and root public key are now copying operations,
  rather than updating a ByteSpan to internally owned data (which
  may be stale!)
- First step towards moving to spec-compliant lifecycle for UpdateNOC
  with the same model as OperationalKeystore
- No functional changes, only structural changes

Testing done:
- Cert tests still pass
- Unit tests still pass
tcarmelveilleux added a commit that referenced this issue Jun 13, 2022
* Remove direct operational certs access from FabricInfo

To support moving to non-permanent storage, need to ensure
there is never direct access to certificates from FabricInfo classes
outside the FabricTable which owns all validations. This prevents
dangling FabricInfo instances and enables the changes needed to
make the fail-safe work to spec for AddNOC, UpdateNOC and
AddTrustedRootCertificate.

Issue #15585
Issue #7695

- Always go through the FabricTable, don't allow going directly via
  FabricInfo
- Updated CASESession to go through FabricTable also
- Getters for certs and root public key are now copying operations,
  rather than updating a ByteSpan to internally owned data (which
  may be stale!)
- First step towards moving to spec-compliant lifecycle for UpdateNOC
  with the same model as OperationalKeystore
- No functional changes, only structural changes

Testing done:
- Cert tests still pass
- Unit tests still pass

* Restyled by clang-format

* Rename GetXXX to FetchXXX

* Fix lint

* Fix Darwin CI

* Fix build on Darwin

* Apply review comments

Co-authored-by: Restyled.io <[email protected]>
Co-authored-by: Boris Zbarsky <[email protected]>
tcarmelveilleux added a commit to tcarmelveilleux/connectedhomeip that referenced this issue Jun 21, 2022
- FabricTable management during commissioning did not properly
  handle the fact that committing needs to only be done on
  CommissioningComplete, which prevented the AddTrustedRootCertificate,
  UpdateNOC and AddNOC command semantics to be implemented properly
  and prevented proper state observation of operational credential
  clusters server

Fixes project-chip#7695
Issue project-chip#8905
Fixes project-chip#18633
Issue project-chip#17208
Fixes project-chip#15585

This PR:

- Removes direct access to FabricInfo from everywhere, which caused
  possibly stale FabricInfo references during commissioning.
- Remove immediate committing of fabric table on UpdateNOC.
- Make Fabrics, NOCs and TrustedRootCertificates attributes reflect
  proper partial state during fail-safe, by using the shadow data
  capabilities of OperationalCertificateStore and by updates to
  FabricInfo
- Make it possible to unit test fabric table by providing the necessary
  lifecycle public APIs to test every modality of the commissioning flow
- Make Server and DeviceController use OperationalCertificateStore to
  allow proper external lifecycle management of the operational cert
  chain.
- Update all examples/controller code to new API
- Remove dangerous internal APIs from FabricTable and replace with
  direct accessors where needed
- Add more of the necessary spec validations to the UpdateNOC and AddNOC
  flows

Testing done:
- Updated all unit tests, all pass
- Cert tests still pass as before
- Working on further integration tests and unit tests as a follow-up
  noting that current state has not regressed on existing test coverage,
  and that new usage of OperationalCertificateStore class in FabricTable
  gains a large amount of additional coverage transitively via some
  of the existing tests making use of FabricTable.
tcarmelveilleux added a commit that referenced this issue Jun 24, 2022
* Implement shadow fail-safe data in FabricTable

- FabricTable management during commissioning did not properly
  handle the fact that committing needs to only be done on
  CommissioningComplete, which prevented the AddTrustedRootCertificate,
  UpdateNOC and AddNOC command semantics to be implemented properly
  and prevented proper state observation of operational credential
  clusters server

Fixes #7695
Issue #8905
Fixes #18633
Issue #17208
Fixes #15585

This PR:

- Removes direct access to FabricInfo from everywhere, which caused
  possibly stale FabricInfo references during commissioning.
- Remove immediate committing of fabric table on UpdateNOC.
- Make Fabrics, NOCs and TrustedRootCertificates attributes reflect
  proper partial state during fail-safe, by using the shadow data
  capabilities of OperationalCertificateStore and by updates to
  FabricInfo
- Make it possible to unit test fabric table by providing the necessary
  lifecycle public APIs to test every modality of the commissioning flow
- Make Server and DeviceController use OperationalCertificateStore to
  allow proper external lifecycle management of the operational cert
  chain.
- Update all examples/controller code to new API
- Remove dangerous internal APIs from FabricTable and replace with
  direct accessors where needed
- Add more of the necessary spec validations to the UpdateNOC and AddNOC
  flows

Testing done:
- Updated all unit tests, all pass
- Cert tests still pass as before
- Working on further integration tests and unit tests as a follow-up
  noting that current state has not regressed on existing test coverage,
  and that new usage of OperationalCertificateStore class in FabricTable
  gains a large amount of additional coverage transitively via some
  of the existing tests making use of FabricTable.

* Restyled by clang-format

* Fixes after merging master

* Rename callback of FabricTable::Delegate

- Remove needless ones
- Make the callbacks do nothing by default to avoid more
  "Intentionally left blank"
- Renamed to actually reflect what is happening

* Rekick restyle

* restyle

* Align last known good time commit / revert to shadow fail-safe approach

Fabric commit / revert is being refactored to only persist fabric
data when we actually commit on  receipt of CommissioningComplete.
This reworks Last known Good Time to use the same strategy.

Now, instead of persisting both last known good time and a fail-safe
backup at the time of certificate installation, a single, updated
last known good time is stored in RAM at the time of certificate
installation.  Then, on commit, this is persisted, but never before.

* Fix CI

* Fix TV app

* Reduce flash usage on K32W0 by disabling detail (verbose) logs

* Disable progress logging in TI platform lock example to save flash

* Disable progress logging in TI platform shell example to save flash

* Reduce stack usage of TestEventLogging

* Apply review comments from @msandstedt

* Save stack for Nordic build

* Save stack for Nordic build, some more

* Added unit tests for all basic operations

* Fix merge conflict

* Restyle

* Change nrf connect stack limit up by 512 bytes for fake unit test comparisons

* Revert "Save stack for Nordic build"

This reverts commit 52699c4.

* Revert "Save stack for Nordic build, some more"

This reverts commit e62391e.

* Fix UpdateLabel

* Apply review comments, add ACL fabric removal

* Apply review comments

* per bzbarsky, clear in-memory Last Known Good Time if revert fails

* Apply fixes from @bzbarsky-apple

* Apply more review comments

* Restyled

* Fix tests

* Add unit test for iterator

* Iterator now works

* Fix semantic merge conflict

* Restyled

* Fix predicates for storage presence temporarily

issue #16958

* Fix Darwin build

* Add missing docs, move methods to private that shouldn't be public

* Change stack warning temporarily to pass on nRFConnect

* Make MatterControllerFactory use const FabricInfo

* Restyle

* Fix semantic conflict on SessionManager

* Fix an init ordering issue in TestSessionManager.cpp

* Fix SessionManager shutdown

* Restyled

Co-authored-by: Restyled.io <[email protected]>
Co-authored-by: Michael Sandstedt <[email protected]>
Co-authored-by: Boris Zbarsky <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging a pull request may close this issue.

5 participants