CIP-0046? | Merkelised Plutus Scripts #385

Open
wants to merge 3 commits into
base: master
270 changes: 270 additions & 0 deletions CIP-XXXX/README.md
@@ -0,0 +1,270 @@
---
CIP: ?
Member

Suggested change
CIP: ?
CIP: 46

Title: Merkelised Plutus Scripts
Contributor

@kwxm kwxm Dec 8, 2022

To be annoyingly pedantic, I'll point out that the process is named after Ralph Merkle, so it's Merklisation (or Merklization) rather than Merkelisation (which sounds like something from German politics).

Contributor Author

@L-as L-as Dec 8, 2022

I realise this. I had thought of this, but Merklisation looks odd.

Authors: Las Safin <[email protected]>
Status: Draft
Type: Standards
Created: 2022-11-29
License: <CC-BY-4.0>
---

## Abstract

Currently, the hash of a script is simply the hash of its [serialisation](
https://github.com/input-output-hk/plutus/blob/a645d1ee0dd5efcd7a7da24678461e07396ad26e/plutus-ledger-api/src/PlutusLedgerApi/Common/SerialisedScript.hs#L88).
This CIP proposes changing this such that the hash of a script (term)
is a function of its immediate children's hashes, forming a Merkle Tree from the AST.
This allows one to shallowly verify a script's hash, and is useful on Cardano,
because it allows scripts to **check that a script hash is an instantiation of a parameterised script**.

In addition, a `blake2b_224` built-in function must be added.

This is inspired by [BIP-114](https://github.com/bitcoin/bips/blob/master/bip-0114.mediawiki),
but the motivations are very different.

## Motivation

Given some core logic expressible as a script, it is common to have parameters in
the form of constants, e.g. fees, references to other scripts, magic numbers.

These parameters can either be put in a datum somewhere, or can be put into the
script itself, either by inlining them, or applying the unapplied script to the constants.

On-chain it is currently hard to check that one script is an applied form
of another script. In cases where that is necessary, datums are instead used.
Contributor

You make it sound like this is a problem... but this is basically what the datum is for. What I want to know is why that isn't sufficient.


By Merkelising the hashing, we make this possible,
which unlocks checking that a script is an application of another script to some parameter.

Example reasons to apply the parameters to the script:
- Staking validators currently don't support datums, and all staking validators
share a single rewards account. Allowing checking applied parameters
makes staking validators much more powerful. (More about this below)
Contributor

This seems like the main case where it actually makes a substantial difference, because you don't have a datum.

- Constants can be included in the reference script, leading to less CPU and memory used,
since they don't have to be parsed from the adjacent datum (somewhat cheap)
or the script context (very expensive).
Contributor

This doesn't seem that compelling to me.

- A script address plus a datum can't fit in an address;
if you want that, you also need this (or need to change what an address is).
Contributor

Yes, but I have elsewhere suggested that we should fix this by extending CIP-13 to include datums.

Contributor Author

I don't see how that affects addresses in the ledger.


## Specification
Contributor

@michaelpj michaelpj Nov 30, 2022

This would also need changes to the ledger spec. At the moment, the ledger doesn't deserialise Plutus scripts at all, it passes them to the evaluator still serialised, and despite this it can still hash them etc, straightforwardly. This CIP would probably require changing that in the spec and the implementation, so that the ledger has deserialized scripts around (one reason for this is that deserialization can fail, whereas hashing cannot). It might be good to have at least a sketch of those changes here.

I also don't know whether it violates any principles of the ledger to not have the hash of an item be the hash of its serialised bytes. I think that's true for everything else, it's possible that there's a reason for that (e.g. making it possible to check hashes without having to know the serialization).


The hash of a script will be derived directly from the AST, rather than its serialisation.
Currently, it is formed by hashing the serialisation prefixed with a byte that represents its version, e.g. 0x02 for Plutus V2.

The hash of a script becomes the hash of the prefix version annotation prepended to the hash of the term.
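
As a minimal Python sketch of this top-level step (an illustration, not the ledger's implementation): `VER_PREFIX` is a hypothetical byte for some future Plutus version, and `term_hash` stands for the Merkle hash of the root term computed by the rules in the following sections.

```python
import hashlib


def blake2b_224(data: bytes) -> bytes:
    """BLAKE2b with a 28-byte (224-bit) digest."""
    return hashlib.blake2b(data, digest_size=28).digest()


# Hypothetical version byte; the CIP does not fix a value for the new version.
VER_PREFIX = b"\x03"


def script_hash(term_hash: bytes) -> bytes:
    """Hash the version prefix prepended to the hash of the root term."""
    return blake2b_224(VER_PREFIX + term_hash)
```

Because the version byte is hashed together with the term hash, the same term under two language versions would get two different script hashes.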

[`Term`](https://github.com/input-output-hk/plutus/blob/a645d1ee0dd5efcd7a7da24678461e07396ad26e/plutus-core/untyped-plutus-core/src/UntypedPlutusCore/Core/Type.hs#L69)
currently has 8 constructors. On-chain, annotations are always the unit type,
and are hence ignored for this algorithm. Each case/constructor is generally handled by
hashing the concatenation of a prefix (single byte corresponding to the
constructor index) along with the hashes of the arguments passed to the constructor.
Contributor

This is slightly different to what @kwxm wrote here (https://github.com/input-output-hk/plutus/blob/master/doc/notes/plutus-core/Merklisation/Merklisation-notes.md#modified-merklisation-technique), which I think also included the serialized versions of the nodes in the value that gets hashed. Not sure if that's important, Kenneth do you remember?

Contributor

which I think also included the serialized versions of the nodes in the value that gets hashed. Not sure if that's important

I'm not quite sure what you mean. It talks about "[serialising] all of the contents of the node into bytestrings", but I think by "contents" I meant all of the fields (things like variable names) except subnodes: you wouldn't serialise those and calculate hashes, but instead recursively apply the Merkle hash process. I think the overall process is basically similar to what's going on here.


Similar code can be found [in Plutarch](https://github.com/Plutonomicon/plutarch-plutus/blob/95e40b42a1190191d0a07e3e4e938b72e6f75268/Plutarch/Internal.hs#L100) (for a slightly different AST).

To avoid giving a single script two hashes,
this scheme must be used exclusively, starting from some Plutus version after V2.

The algorithm for checking a script hash against a supplied script (of a new version)
in the ledger will change slightly: rather than hashing the supplied serialised
script directly, the decoding of the serialised script must be hashed.
(NB: the hashing and decoding can be fused to avoid intermediary structures.)

To allow computing the hash in scripts, we must support `blake2b_224` in Plutus scripts
as it's what is currently used. The algorithm used might change in the future, but that is
not relevant for this CIP.

### Hashing `Error`

Since there are no children, the hash of the `Error` term is the
hash of the prefix byte for the `Error` constructor.
You could theoretically choose any random number as the hash,
but it has to be proven to be random, hence hashing the prefix byte
is the best option.

In pseudocode: `hash prefix`
Contributor

At first I found it a little confusing that everything used prefix (further complicated by the fact that earlier on it mentions that there's a version prefix attached to serialised Plutus Core scripts). It might be clearer if it said error_prefix, lamabs_prefix and so on, like it does later. You could even propose concrete values for the prefixes and use those. We might introduce more Term constructors in the future, but I don't think that's a problem.

Contributor Author

Yes.


### Hashing `Builtin`, `Var`

The hash of a `Builtin` is the hash of the prefix prepended to the base-256 encoded
(i.e. serialised to bytestring) index of the built-in function.
Because there are less than 256 built-ins, this is currently the same
Contributor

Less than 256 or less than 257? I think that if we had 256 you could still get away with one byte here.

Contributor Author

257

as hashing the prefix byte prepended to the byte containing the index of
the built-in.

`Var` is handled the exact same way (with a different prefix),
but it's in this case feasible for the index to be more than 255.

In pseudocode: `hash $ prefix <> serialiseBase256 index`
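
A rough Python sketch of the same rule, under stated assumptions: the prefix bytes are hypothetical placeholders, and `serialise_base256` assumes most-significant-byte-first order, which the text above does not specify.

```python
import hashlib


def blake2b_224(data: bytes) -> bytes:
    """BLAKE2b with a 28-byte (224-bit) digest."""
    return hashlib.blake2b(data, digest_size=28).digest()


# Hypothetical constructor prefixes; the CIP does not assign concrete values.
VAR_PREFIX = b"\x00"
BUILTIN_PREFIX = b"\x07"


def serialise_base256(index: int) -> bytes:
    """Base-256 digits of a non-negative index, most significant first (assumed)."""
    length = max(1, (index.bit_length() + 7) // 8)
    return index.to_bytes(length, "big")


def hash_builtin(index: int) -> bytes:
    return blake2b_224(BUILTIN_PREFIX + serialise_base256(index))


def hash_var(index: int) -> bytes:
    return blake2b_224(VAR_PREFIX + serialise_base256(index))
```

With this encoding any index below 256 is a single byte, while a `Var` index of 256 becomes the two bytes `01 00`.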

### Hashing `Apply`, `Force`, `Delay`

These are hashed by hashing the result of prepending the prefix
byte to the concatenation of the hashes of the children.

In pseudocode: `hash $ foldl' (<>) prefix (hash <$> children)`
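
In Python this could look roughly as follows. The prefix bytes are hypothetical, and the functions take already-computed child hashes rather than recursing over a term.

```python
import hashlib


def blake2b_224(data: bytes) -> bytes:
    """BLAKE2b with a 28-byte (224-bit) digest."""
    return hashlib.blake2b(data, digest_size=28).digest()


# Hypothetical constructor prefixes; the CIP does not assign concrete values.
DELAY_PREFIX = b"\x01"
APPLY_PREFIX = b"\x03"
FORCE_PREFIX = b"\x05"


def hash_node(prefix: bytes, *child_hashes: bytes) -> bytes:
    """Hash the prefix followed by the children's hashes, in order."""
    return blake2b_224(prefix + b"".join(child_hashes))


def hash_apply(fun_hash: bytes, arg_hash: bytes) -> bytes:
    return hash_node(APPLY_PREFIX, fun_hash, arg_hash)


def hash_force(child_hash: bytes) -> bytes:
    return hash_node(FORCE_PREFIX, child_hash)


def hash_delay(child_hash: bytes) -> bytes:
    return hash_node(DELAY_PREFIX, child_hash)
```

The order of the children matters: swapping function and argument in `hash_apply` gives a different hash.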

### Hashing `LamAbs`

This works the exact same way as above; notably, the _name_ is excluded
as it's a constant in the de-Bruijn encoding.

In pseudocode: `hash $ prefix <> hash body`
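
A small Python sketch, with hypothetical prefix bytes and an index value chosen only for illustration, of why the bound name plays no role: the hash of a lambda is determined entirely by the hash of its body.

```python
import hashlib


def blake2b_224(data: bytes) -> bytes:
    """BLAKE2b with a 28-byte (224-bit) digest."""
    return hashlib.blake2b(data, digest_size=28).digest()


# Hypothetical constructor prefixes; the CIP does not assign concrete values.
VAR_PREFIX = b"\x00"
LAM_PREFIX = b"\x02"


def hash_lam_abs(body_hash: bytes) -> bytes:
    """The binder name is not part of the input; only the body's hash is."""
    return blake2b_224(LAM_PREFIX + body_hash)


# Hash of a variable term referring to the nearest binder
# (index value 1 is an assumption made for this example).
var_hash = blake2b_224(VAR_PREFIX + bytes([1]))

# Both `\x -> x` and `\y -> y` reduce to this same computation.
identity_hash = hash_lam_abs(var_hash)
```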

### Hashing `Constant`

The universe of types used on-chain is always `DefaultUni`.
Each possible data type is handled differently, with each having
a different prefix. The total number of prefixes does not exceed
255. If it did, the prefix would have to be increased to two bytes.
Contributor

Is it 255 or 256? I think any unsigned byte is a valid prefix, but I could be wrong.

Contributor Author

You are right. I'm dumb.


In addition:
Negative integers and non-negative integers have separate prefixes.
False and True also have separate prefixes.

#### Hashing non-negative integers

The serialisation according to [CIP-58](https://github.com/cardano-foundation/CIPs/blob/a1b9ff0190ad9f3e51ae23e85c7a8f29583278f0/CIP-%3F/README.md#representation-of-builtininteger-as-builtinbytestring-and-conversions),
prefixed with the two-byte prefix, is hashed.

In pseudocode: `hash $ prefix <> prefix' <> serialiseCIP58 n`
Contributor

What's going on here? Is it that prefix tells you that you've got an integer and prefix' tells you the sign?

Contributor

Oh yes, I guess that's what the sentence on line 121 means.

Contributor Author

Actually, I think this is a mistake. This is a previous scheme I had, but there's no reason not to collapse it into one byte.


#### Hashing negative integers

The same algorithm as above is used, but the number hashed is `1 - n`.

In pseudocode: `hash $ prefix <> prefix' <> serialiseCIP58 (1 - n)`

#### Hashing bytestrings

The bytestring is hashed as-is.
We use the blake2b-256 hash here, such that we can usefully check that
the script refers to a bytestring that we know only the hash of.

In pseudocode: `hash $ prefix <> blake2b_256 bs`
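
The point of the inner `blake2b_256` can be sketched in Python as below. `CONST_BYTESTRING_PREFIX` is a hypothetical value, and the two function names are made up for this example.

```python
import hashlib


def blake2b_224(data: bytes) -> bytes:
    """BLAKE2b with a 28-byte (224-bit) digest."""
    return hashlib.blake2b(data, digest_size=28).digest()


def blake2b_256(data: bytes) -> bytes:
    """BLAKE2b with a 32-byte (256-bit) digest."""
    return hashlib.blake2b(data, digest_size=32).digest()


# Hypothetical prefix for bytestring constants.
CONST_BYTESTRING_PREFIX = b"\x10"


def hash_bytestring_from_digest(digest: bytes) -> bytes:
    """Term hash computed when only blake2b_256 of the bytestring is known."""
    return blake2b_224(CONST_BYTESTRING_PREFIX + digest)


def hash_bytestring_constant(bs: bytes) -> bytes:
    """Term hash computed from the bytestring itself."""
    return hash_bytestring_from_digest(blake2b_256(bs))
```

A verifier holding only the 32-byte digest of a bytestring can therefore still recompute the hash of the constant term that embeds it.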

#### Hashing strings

The flat-encoding is hashed.

In pseudocode: `hash $ prefix <> flat x`

#### Hashing lists, pairs

Lists and pairs are hashed like a Merkle tree,
much the same way that terms are.
The children have a known type, and are hashed according to how that
type should be hashed, i.e. with the correct algorithm and prefix.

In pseudocode: `hash $ foldl' (<>) prefix (hash <$> children)`
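
A Python sketch under the same caveats as the other examples: all prefix bytes are hypothetical, and the elements here are booleans hashed as bare prefixes, as described for `False` and `True` in the next section.

```python
import hashlib


def blake2b_224(data: bytes) -> bytes:
    """BLAKE2b with a 28-byte (224-bit) digest."""
    return hashlib.blake2b(data, digest_size=28).digest()


# Hypothetical prefixes; the CIP does not assign concrete values.
CONST_LIST_PREFIX = b"\x20"
CONST_PAIR_PREFIX = b"\x21"
FALSE_HASH = blake2b_224(b"\x31")
TRUE_HASH = blake2b_224(b"\x32")


def hash_list(element_hashes: list) -> bytes:
    """Hash the list prefix followed by each element's hash, in order."""
    return blake2b_224(CONST_LIST_PREFIX + b"".join(element_hashes))


def hash_pair(first_hash: bytes, second_hash: bytes) -> bytes:
    return blake2b_224(CONST_PAIR_PREFIX + first_hash + second_hash)
```

Under this sketch an empty list hashes to the hash of the bare list prefix, and a pair never collides with a two-element list because the prefixes differ.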

#### Hashing `()`, `False`, `True`

Each has its own separate prefix, like `Error`, hence:

In pseudocode: `hash prefix`
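
For illustration, the childless cases could be tabulated in Python like this. The byte values are hypothetical; the only requirement taken from the text is that each case has its own prefix.

```python
import hashlib


def blake2b_224(data: bytes) -> bytes:
    """BLAKE2b with a 28-byte (224-bit) digest."""
    return hashlib.blake2b(data, digest_size=28).digest()


# Hypothetical prefix bytes for the terms and constants that have no children.
LEAF_PREFIXES = {
    "Error": b"\x06",
    "()": b"\x30",
    "False": b"\x31",
    "True": b"\x32",
}


def hash_leaf(name: str) -> bytes:
    """The hash of a childless node is the hash of its prefix byte alone."""
    return blake2b_224(LEAF_PREFIXES[name])
```

These hashes are fixed constants, so a script can embed them directly instead of recomputing them.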

#### Hashing `Data`

The `CBOR` encoding is used; notably, it must be compatible with the `serialiseData`
built-in to be useful on-chain.
We use the blake2b-256 hash here, such that we can usefully check that
the script refers to a datum that we know only the hash of.
If the hashing algorithm for data changes, we must also change it here.

In pseudocode: `hash $ prefix <> blake2b_256 (serialiseData d)`

## Rationale
Contributor

We might need some discussion of the cost of this kind of hashing. Our experiments suggested it was ~10x more expensive (https://github.com/input-output-hk/plutus/blob/master/doc/notes/plutus-core/Merklisation/Merklisation-notes.md#the-cost-of-calculating-merkle-hashes), unclear if this will have a meaningful impact but it might.

Contributor

@kwxm kwxm Dec 8, 2022

I think that the potential cost of this is my main concern. Calculating the hash involves traversing the entire AST (although as the document points out it can be fused with the deserialisation process), but also calling the underlying hash function(s) at every node, which could become expensive compared with just feeding the serialised script directly to a hashing function in one go. I'd really like to see some figures for this: it's conceivable that computing the Merkle hash might be more expensive than executing the actual scripts, and that might make this proposal impractical. The estimates from our earlier experiments (which were maybe three years ago) were entirely theoretical though, and things have changed a lot since we did that: for one thing, we're using flat instead of CBOR now, which makes the serialised scripts a lot smaller. I think some experiments would really be needed to decide whether the extra cost is a real issue.


Given this minor change, we can now check that one script is the application of another script.
Contributor

Given this minor change

I think you underestimate how much work this would be 😅

Contributor Author

Quite possibly.

Concretely, given hash `script`, hash `original`, parameter `d` (as data),
intermediate hashes `h0`, `h1`, hashing prefixes `ver_prefix`, `app_prefix`, `const_data_prefix`, we check:
```
script == blake2b_224 $ ver_prefix <> h0
h0 == blake2b_224 $ app_prefix <> original <> h1
h1 == blake2b_224 $ const_data_prefix <> blake2b_256 (serialiseData d)
```

We essentially open the Merkle tree commitment partially and check that the supplied path is correct.
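
The three equations can be sketched off-chain in Python as follows. This is an illustration only: all prefix bytes are hypothetical, `d_cbor` stands for the output of `serialiseData d`, and the function names are made up for this example.

```python
import hashlib


def blake2b_224(data: bytes) -> bytes:
    """BLAKE2b with a 28-byte (224-bit) digest."""
    return hashlib.blake2b(data, digest_size=28).digest()


def blake2b_256(data: bytes) -> bytes:
    """BLAKE2b with a 32-byte (256-bit) digest."""
    return hashlib.blake2b(data, digest_size=32).digest()


# Hypothetical prefix bytes; the CIP does not assign concrete values.
VER_PREFIX = b"\x03"
APP_PREFIX = b"\x03"
CONST_DATA_PREFIX = b"\x11"


def applied_script_hash(original: bytes, d_cbor: bytes) -> bytes:
    """Script hash of `original` applied to the Data constant serialised as d_cbor."""
    h1 = blake2b_224(CONST_DATA_PREFIX + blake2b_256(d_cbor))
    h0 = blake2b_224(APP_PREFIX + original + h1)
    return blake2b_224(VER_PREFIX + h0)


def is_application_of(script: bytes, original: bytes, d_cbor: bytes) -> bool:
    """Check that `script` commits to `original` applied to the given parameter."""
    return script == applied_script_hash(original, d_cbor)
```

As the equations are written, `original` is concatenated directly as a child hash, so it appears to be the hash of the unapplied term rather than a version-prefixed script hash. Only three `blake2b_224` calls and one `blake2b_256` call are needed, regardless of the size of the unapplied script.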

### Relation with CIP-58

This CIP does not _depend_ on CIP-58, but to hash integers on-chain
the way it's done here, CIP-58's integer-to-bytestring serialisation built-in
must be available in Plutus.

### Relation with BIP-114

BIP-114 uses this trick to avoid submitting the parts of the script that aren't used.
Given that reference scripts are common in Haskell, this isn't a big win for efficiency,
Contributor

Given that reference scripts are common in Haskell

Not sure what this means, do you just mean "Given that Cardano supports reference scripts"?

Contributor Author

brainfart

but it might be worth implementing for the sake of scripts used only once.
This CIP however doesn't require that that be implemented.
Contributor

We looked into MAST during the development of Plutus Core, but we concluded that it wasn't worth it because the size of the hashes corresponding to omitted subtrees cancelled out the saving from omitting the subtree. You can read some notes on it here: https://github.com/input-output-hk/plutus/tree/master/doc/notes/plutus-core/Merklisation

Contributor

Yes, we tried a very similar Merklisation scheme, but for different reasons. We were looking at ways to reduce script sizes and the idea of using Merklisation to let us omit unused parts of the AST in fully applied validators seemed promising. It turned out that that involved replacing subtrees of the AST with hashes which were large (32 bytes) and incompressible, and that meant that we couldn't get any worthwhile size reductions, so we abandoned that idea. However that was for an entirely different purpose, so I don't think it's too relevant here.


The argument for privacy doesn't apply: private smart contracts can be achieved through
the use of non-interactive zero-knowledge probabilistic proofs.
Contributor

Not today they can't. So I think it is still quite relevant.

Contributor Author

Wdym? They definitely can once we have at least bitwise primitives.


### Reference scripts

Currently, different instances of the same script will need their own reference inputs
since their hashes are different. It seems feasible to allow sharing of a single reference script,
Contributor

... or they can put them in the datum?

given the parameters and language version as witnesses, but given the complexity
involved, it is not specified in this CIP.

### Staking

This makes staking validators much more powerful, since a single protocol can
now manage many rewards accounts (by instantiating the script with a numeric identifier).
Contributor

Please can you write out this use case in more detail. You've alluded to it a few times but I'd really like to see more detail because I'm not familiar with it and I'm trying to back-infer the actual details, probably wrongly. And it seems to be the load-bearing example here.

Contributor Author

Will do

However, it is arguably not the optimal solution due to the reference
script problem described above. Even if the reference script problem
is solved as described above, it seems logical to allow supplying a datum
to a staking validator, or somehow combining the payment address and staking address for scripts,
Contributor

The problem with supplying a datum to anything is where does the datum live? For a validator script it lives on the output. Where could it live for a staking validator? If we can come up with a sane answer to that, then in principle we could just give staking validators datums.

and using the same datum for both, while somehow solving the separate accounts problem.

Given the heavy complexity of fixing staking validators, Merkelising script hashing seems much more feasible.
Contributor

It's not really clear to me that it's complex, just that we don't have a design right now.


### Alternatives

#### Parameterised Reference Scripts

See https://github.com/cardano-foundation/CIPs/pull/354.

Seemingly, Merkelisation is a less invasive and possibly cleaner change.

#### Changing how constants are hashed

The hashing of constants might not have a clear best solution; most notably,
it is not clear how far the hashing should be Merkelised.
E.g., the hashing of data itself could be Merkelised. This is not done in this CIP.
The hashing of a `Data` constant could also prepend the prefix directly to the serialisation,
rather than to the hash of the `Data`. It is not clear what is best.
Contributor

I think stopping the merkelization at the constants is the right place.


##### Hashing strings, lists, pairs differently

Strings are not very useful in Plutus.
Hence, the hashing algorithm for them isn't optimised such that
they can be easily verified.

Strings have essentially no purpose on-chain, since they're only used
for tracing, which should not be used in production.

In the context of checking applied parameters, it is likely that only
`Data`, `Integer`, `Bool`, `ByteString`, will be used as parameters,
since they cover all useful behaviour in an efficient way.
If you want to parameterise your script by a pair of integers,
it is likely best to unwrap that into two separate integer parameters
for the sake of efficiency of _running_ the script, which is likely
to be more common than checking the parameters.

Built-in lists and pairs are not commonly used as parameters, but it's plausible
that they might still be the most efficient method in some scenarios.
Hence, they have been included.
They use Merkle-tree hashing since that's the simplest and most useful in this case.

## Path to Active
Contributor

I think this should have some Acceptance Criteria a la the new CIP-001. Perhaps:

  • The ledger specification is updated as necessary
  • The Plutus Core specification is updated with the new hashing scheme
  • Performance assessment has been performed
  • Necessary hashing builtins have been added to PLC and costed
  • Example usecases have been validated to run in an acceptable amount of budget considering the increased use of hashing builtins

Contributor Author

This seems reasonable to me, but calculating a few hashes (see example pseudocode) is well within the budget last time I checked.

Contributor

calculating a few hashes (see example pseudocode) is well within the budget last time I checked.

Is that really true? If you're referring to the pseudocode here (under Rationale), then you need the hashes original and script, and I think those have to be calculated on the chain (or at least one of them does, no?), so there's a potentially large cost that has to be paid before you even get to that pseudocode.

Contributor Author

original and script are constants here. original is a fixed script, and can thus be computed beforehand and inlined into the script. script comes from ScriptContext.


### Implementation plan

Las Safin will implement this if IOG don't have time.

## Copyright

This CIP is licensed under CC-BY-4.0.