Initial benchmarking of GC performance of Motoko, to be extended and refined...
- Analyze and understand the current GC properties in more detail.
- Provide a baseline for evaluating potentially new GC implementations.
- Running on DFX: Measurements run in canisters on a local replica using DFX, in order to simulate a relatively realistic environment with message instruction limits etc.
- GC points: Test scenarios are decomposed in multiple steps where the GC gets a chance to run in between. This is because the current GCs only run at specific times, when the call stack is (nearly) empty.
- Generated charts: Different GCs, different test scenarios, and different metrics are collected and rendered in visual charts, with a summary table showing aggregated metrics.
- Make sure you have DFX installed.
- Compile the `report` helper program that converts CSV to charts and generates a summary table:

  ```
  cd util
  cargo build --release
  cd ..
  ```

- Open `nix-shell` in the motoko project and switch to this benchmark folder.
- Configure DFX to use the latest Motoko compiler `moc` (required for the primitives `rts_mutator_instructions()` and `rts_collector_instructions()`):

  ```
  export DFX_MOC_PATH=<motoko_repo_path>/bin/moc
  ```

  You can also add this configuration as an environment variable to your shell config, e.g. `~/.zshrc`.
- Run all benchmark cases:

  ```
  ./measure-all.sh
  ```

  The tests run for several minutes.
- Results are stored in the `reports` folder: Open `summary.html` in the browser to see the aggregated measurement numbers and to navigate to the specific performance charts.
Performance test cases can also be run selectively, by choosing the GC and the scenario:

```
./performance.sh <compacting|copying|no|generational|incremental> <scenario_name>
```

For example, to run the linked list performance test with the compacting GC:

```
./performance.sh compacting linked-list
```

The scenario names are listed below.
Moreover, GC limit tests can be run selectively (see also below):

```
./limit.sh <compacting|copying|no|generational|incremental> <scenario_name>
```
The `reports` folder contains the measurement results:
- One performance CSV file per test case: Each line lists the statistics values at a measurement point, i.e. a scenario step where the GC got a chance to run.
- One limit CSV file per test case: Summarizes the memory limits that each scenario can allocate per GC.
- One chart file per test case: Different HTML5 performance charts, showing memory, allocation, and runtime behavior over the series of execution steps.
- One summary HTML page: Aggregation of the different performance and limit metrics for all test cases, with links to the corresponding chart pages.
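The aggregation of a per-step performance CSV into summary metrics can be sketched as follows. The column names and sample values are assumptions for illustration, not the benchmark's actual file layout.

```python
import csv
import io

# Hypothetical per-step performance CSV -- the real columns produced by the
# benchmark may differ; this only illustrates the aggregation idea.
SAMPLE = """step,heap,memory,mutator,collector
1,1000,2000,90000,10000
2,1800,4000,80000,20000
"""

def summarize(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    col = lambda name: [int(r[name]) for r in rows]
    mutator, collector = sum(col("mutator")), sum(col("collector"))
    return {
        "final_heap": col("heap")[-1],          # heap occupation at program end
        "memory_size": max(col("memory")),      # maximum allocated memory size
        "mutator_utilization": mutator / (mutator + collector),
        "max_gc_pause": max(col("collector")),  # longest GC pause (instructions)
    }

print(summarize(SAMPLE))
```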
Different test scenarios are run with the available GC implementations; each such combination is called a test case.
Two different kinds of measurements are performed:
- Performance measurements: Determining various runtime and memory metrics per test case
- Limit tests (for specific scenarios only): Determining the maximum amount of memory that can be used
Note: In contrast to the limit tests, the performance measurements are configured to run successfully for all test cases, without hitting any limits (instruction limit, heap size limit). This is necessary for the comparability of the performance results.
A singly linked list of Nat numbers, used in a scenario of several steps. Each line denotes a tuple of how often the step is repeated and what operation is performed in each step. E.g. the first line specifies 50 steps of `populate(100_000)`, where each `populate()` step inserts 100_000 new elements into the list.
```
( 50, func() { populate(100_000) } ),
( 10, func() { traverse() } ),
( 25, func() { discard(100_000) } ),
( 10, func() { traverse() } ),
( 25, func() { populate(100_000) } ),
( 1, func() { clear() } ),
( 45, func() { populate(100_000) } ),
( 10, func() { traverse() } )
```
Rationale: A simple but large-scaling case with many small objects and a very simple pointer structure, i.e. with as few pointers as possible for the number of objects. External fragmentation should be relatively unproblematic.
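The shape of the data structure under test can be sketched as follows. This is a Python model whose method names mirror the scenario steps; the Motoko original differs.

```python
# Minimal sketch of a singly linked list with the scenario's operations.
class Node:
    __slots__ = ("value", "next")
    def __init__(self, value, next):
        self.value, self.next = value, next

class LinkedList:
    def __init__(self):
        self.head = None
    def populate(self, n):
        # prepend n new elements, creating n small two-field objects
        for i in range(n):
            self.head = Node(i, self.head)
    def discard(self, n):
        # drop n elements from the front; they become garbage for the GC
        for _ in range(n):
            self.head = self.head.next
    def traverse(self):
        # walk the whole list, following one pointer per object
        count, node = 0, self.head
        while node is not None:
            count, node = count + 1, node.next
        return count
    def clear(self):
        # drop the entire list at once
        self.head = None
```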
`Buffer` of the Motoko base library containing Nat numbers. The buffer implementation is a simple exponentially growing array list with amortized costs (growth factor 2), containing simple Nat numbers. Same scenario as for the linked list:
```
( 50, func() { populate(100_000) } ),
( 10, func() { traverse() } ),
( 25, func() { discard(100_000) } ),
( 10, func() { traverse() } ),
( 25, func() { populate(100_000) } ),
( 1, func() { clear() } ),
( 45, func() { populate(100_000) } ),
( 10, func() { traverse() } )
```
Rationale: Measuring the heap with a memory-compact data structure that is part of the Motoko base library. The scenario creates large-growing list-internal arrays on the heap. Few pointers are used.
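The exponential growth with factor 2 can be sketched as follows. This is a simplified model of the growth policy, not the base library's actual `Buffer` code.

```python
# Sketch of an exponentially growing array list (growth factor 2).
class Buffer:
    def __init__(self, initial_capacity=8):
        self.elems = [None] * initial_capacity  # backing array on the heap
        self.count = 0
    def add(self, value):
        if self.count == len(self.elems):
            # double the backing array; the old array becomes garbage,
            # which is why the scenario stresses large heap-internal arrays
            grown = [None] * (2 * len(self.elems))
            grown[: self.count] = self.elems
            self.elems = grown
        self.elems[self.count] = value
        self.count += 1
```

Doubling keeps the amortized cost per `add` constant: each element is copied at most a constant number of times over all growth steps.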
Two-level buffer that scales to larger sizes because of the limited growth size of the internal array lists. The first level is bounded to 1024 * 1024 elements.
```
( 50, func() { populate(100_000) } ),
( 10, func() { traverse() } ),
( 25, func() { discard(100_000) } ),
( 10, func() { traverse() } ),
( 25, func() { populate(100_000) } ),
( 1, func() { clear() } ),
( 45, func() { populate(100_000) } ),
( 10, func() { traverse() } )
```
Rationale: Measuring the heap with a memory-compact but scalable array list structure. Avoids too large arrays in the heap.
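One way to realize such a two-level structure is to split storage into bounded segments, so no single backing array grows beyond a fixed size. The segment bound, class, and method names below are illustrative assumptions, not the benchmark's code.

```python
# Sketch of a two-level buffer: a top-level table of bounded leaf segments.
class ScalableBuffer:
    def __init__(self, segment_size=1024 * 1024):
        self.segment_size = segment_size  # bound per internal array
        self.segments = []                # first level: table of segments
        self.count = 0
    def add(self, value):
        if self.count % self.segment_size == 0:
            self.segments.append([])      # start a new bounded leaf segment
        self.segments[-1].append(value)
        self.count += 1
    def get(self, index):
        # two-level lookup: segment index, then offset within the segment
        return self.segments[index // self.segment_size][index % self.segment_size]
```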
A `Buffer` containing 64 KB blobs as elements. Scenario of several steps:
```
( 10, func() { allocate(1000) } ),
( 5, func() { traverse() } ),
( 1, func() { discardAll() } ),
( 24, func() { allocate(1000) } ),
( 5, func() { traverse() } )
```
Rationale: Measuring the heap with relatively large objects and relatively few pointers. High moving costs compared to the number of objects. Fragmentation should be relatively unproblematic due to the mostly same-sized objects.
A fully connected directed graph (clique), where each node holds a `Buffer` with pointers (directed edges) to every other node. Scales quadratically by design. Measurement scenario:
```
( 10, func() { populate(100) } ),
( 1, func() { clear() } ),
( 20, func() { populate(100) } ),
( 1, func() { clear() } ),
( 40, func() { populate(100) } )
```
In this scenario, traversal is implicitly contained in the population steps, since new nodes need to be connected to and from the other existing nodes.
Motivation: Analyzing a highly connected structure, with a massive amount of pointers compared to the amount of objects. This demonstrates the performance of updating pointers in moving GCs (compacting/copying), while also observing the efficiency of write/read barriers in potential future incremental GCs.
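The quadratic scaling can be checked with simple arithmetic: a clique of n nodes has n·(n−1) directed edges, and adding one node to an existing clique of size k adds 2·k new edges (k outgoing, k incoming), which is why population implicitly traverses the graph.

```python
def directed_edges(n):
    # every ordered pair of distinct nodes is connected by an edge (pointer)
    return n * (n - 1)

def edges_added_by_new_node(k):
    # connecting a new node to and from each of the k existing nodes
    return 2 * k

# growing the clique node by node reproduces the closed form
total = sum(edges_added_by_new_node(k) for k in range(1000))
assert total == directed_edges(1000)
```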
The current Motoko base library implementation of a red-black tree, storing Nat to Nat entries. Scenario:
```
( 30, func() { populate(10_000) } ),
( 10, func() { retrieve() } ),
( 20, func() { discard(10_000) } ),
( 10, func() { retrieve() } ),
( 20, func() { populate(10_000) } ),
( 1, func() { clear() } ),
( 40, func() { populate(10_000) } ),
( 10, func() { retrieve() } )
```
Motivation: Having a real and more complex data structure implementation, with more temporary object allocations.
Imperative implementation of red-black trees, contributed by community member Byron Becker. The data structure implementation has been somewhat adjusted. Scenario of storing Nat to Nat entries:
```
( 30, func() { populate(10_000) } ),
( 10, func() { retrieve() } ),
( 1, func() { clear() } ),
( 40, func() { populate(10_000) } ),
( 10, func() { retrieve() } )
```
Motivation: Real example reported as a scalability challenge by a community member.
BTree map implementation by Byron Becker, Canscale, Apache 2.0 license. Tree order of 256, storing Nat to Nat entries:
```
( 30, func() { populate(10_000) } ),
( 10, func() { retrieve() } ),
( 1, func() { clear() } ),
( 40, func() { populate(10_000) } ),
( 10, func() { retrieve() } )
```
Motivation: Real use case from the community.
The current Motoko base library implementation of a trie map, storing Nat to Nat entries. Scenario:
```
( 30, func() { populate(10_000) } ),
( 10, func() { retrieve() } ),
( 20, func() { discard(10_000) } ),
( 10, func() { retrieve() } ),
( 20, func() { populate(10_000) } ),
( 1, func() { deleteAll() } ),
( 40, func() { populate(10_000) } ),
( 10, func() { retrieve() } )
```
Motivation: Same as for red-black tree.
Random Maze sample from the Motoko playground, with the following scenario. Additional GC measurement points are taken when awaiting results of message calls to the random number canister.
```
( 10, func(): async () { await generate(10) } ),
( 10, func(): async () { await generate(100) } ),
( 5, func(): async () { await generate(200) } )
```
Note: Adjusted to use a fixed random seed, to ensure deterministic runtime behavior and allow benchmark comparability.
(Compiler warning originates from the embedded sample code.)
Motivation: Include a more representative example taken from the Motoko playground. The scenario is also relatively compute-intensive.
Game of Life sample version 2 from the Motoko playground with size 512:
```
( 10, func(): async () { await step() } )
```
Motivation: Similar to Random Maze.
Source: https://github.com/Toniq-Labs/extendable-token
Extendable Token project by Toniq Labs (MIT license), using the standard extension and measuring 200 repeated transfers. A measurement trace point is inserted at one place in the third-party `standard.mo`.
(Compiler warnings originate from the third-party code.)
Motivation: Using a real external application with higher code complexity.
Taken from the Motoko performance tests, simulating an asset storage:
70 iterations of:
- List the storage.
- Store a new 1 MB content blob (16 pages of 64 KB).
- List the storage.
- Retrieve the last stored content.
- Clear the storage.

(Note: For more iterations, the test case without GC (`no` GC) fails with the error message "IC0503: Canister ... trapped explicitly: could not perform remote call", although the heap is not full.)
The existing performance test has been slightly adjusted, in particular to insert GC trace points.
Motivation: Include an existing Motoko performance test that is memory-intensive.
Taken from the Motoko performance tests. Computes 3 test QR codes and shows them.
(Note: For repeated iterations, the test case without GC (`no` GC) fails with the error message "IC0503: Canister ... trapped explicitly: could not perform remote call", see also above.)
Minor code adjustment for benchmark integration.
Motivation: Include an existing performance test that is compute-intensive.
Taken from the Motoko performance tests, 30 iterations.
Motivation: Very simple existing performance test.
Taken from the Motoko performance tests, reduced to 64 KB hashing due to the instruction limit.

```
( 10, func() { Sha256.go() } ),
```

Motivation: Compute-intensive existing performance test.
Source: https://github.com/dfinity/cancan-archived
CanCan video-sharing service project by DFINITY (Apache 2.0 license). A measurement trace point is inserted in `CanCan.mo`.
Scenario:
- 10 users created, each uploading a video of 16 chunks of 1 MB data (a 16 MB video)

(Compiler warnings originate from the included CanCan project.)
Motivation: Using a real application that is more memory-intensive.
| Name | Description |
|---|---|
| `linked-list` | Small element linked list |
| `buffer` | Small element buffer |
| `blobs` | Large blobs buffer |
| `graph` | Fully connected graph |
| `rb-tree` | Red-black tree |
| `imperative-rb-tree` | Imperative red-black tree |
| `trie-map` | Trie map |
| `random-maze` | Random Maze (playground) |
| `game-of-life` | Game of Life (playground) |
| `extendable-token` | Extendable Token (Toniq Labs) |
| `asset-storage` | Asset storage (perf test) |
| `qr-code` | QR code (perf test) |
| `reversi` | Reversi (perf test) |
| `sha256` | SHA256 (perf test) |
| `cancan` | CanCan video sharing (DFINITY) |
| `btree-map` | BTree Map (Byron Becker) |
| `scalable-buffer` | Scalable buffer |
The list is to be extended with more cases in the future, e.g. more real and complex examples. A current difficulty is that benchmarked programs need to be split into measurement steps, to give the GC a possibility to run in between.
Note: This is NOT intended to compare data structures' efficiency, as the performance scenarios are often deliberately different in their configuration (different allocation sizes and different access patterns) in order to remain below the execution limits (instruction limit, heap limit) for all GC test cases.
All scenarios are run with the following GCs of the Motoko implementation:
| Name | Description |
|---|---|
| `compacting` | Compacting GC |
| `copying` | Copying GC |
| `no` | No GC (special build, see below) |
| `generational` | Generational GC (special build, see below) |
| `incremental` | Incremental GC (special build, see below) |
For running the `no` GC or the `generational` GC, use the compiler and runtime system from the branch https://github.com/dfinity/motoko/tree/luc%2Fgenerational_gc. Please see the instructions below.
For running the `incremental` GC, use the compiler and runtime system from the branch https://github.com/dfinity/motoko/tree/luc%2Fincremental-gc. Please see the instructions below.
- Clone the motoko repo and switch to the branch `luc/generational_gc`. The source is located at https://github.com/dfinity/motoko/tree/luc%2Fgenerational_gc
- Under `src`, build with `make`.
- Under `rts`, build with `make`.
- Set the environment variable `DFX_MOC_PATH` to refer to this special GC compiler build:

  ```
  export DFX_MOC_PATH=<PATH_TO_REPO_FOLDER>/bin/moc
  ```

  Alternatively, you can store the environment variable in the shell config, e.g. `~/.zshrc`.

Note: The special GC build also includes the classical compacting and copying GCs, such that they can also be run and measured by the benchmark.
- Clone the motoko repo and switch to the branch `luc/incremental-gc`. The source is located at https://github.com/dfinity/motoko/tree/luc%2Fincremental-gc
- Under `src`, build with `make`.
- Under `rts`, build with `make`.
- Set the environment variable `DFX_MOC_PATH` to refer to this special GC compiler build:

  ```
  export DFX_MOC_PATH=<PATH_TO_REPO_FOLDER>/bin/moc
  ```

  Alternatively, you can store the environment variable in the shell config, e.g. `~/.zshrc`.

Note: The special GC build also includes the classical compacting and copying GCs, such that they can also be run and measured by the benchmark.
The following performance metrics are computed by the benchmark:

| Name | Description | Better | Calculation |
|---|---|---|---|
| Final Heap Size | Heap occupation at program end | lower | LAST(heap) |
| Memory Size | Maximum allocated memory size | lower | MAX(memory) |
| Mutator Utilization | Fraction of mutator time of the total program time | higher | SUM(mutator) / (SUM(mutator) + SUM(collector)) |
| Max GC Pause | Longest GC pause (instructions) | lower | MAX(collector) |
| Average GC Pause | Average GC pause (instructions) | lower | AVG(collector) |
| Total Instructions | Total number of instructions executed | lower | SUM(mutator) + SUM(collector) |
| Total Mutator | Number of instructions executed by the mutator | lower | SUM(mutator) |
| Survival Rate | Fraction of retained objects per GC run | neutral | 1 - AVG(reclaimed[i] / (SUM(allocated[0..i]) - SUM(reclaimed[0..i-1]))) |
`rts_collector_instructions()` always returns some small value (below 10_000), even if the GC has not run. Therefore, for these performance metric computations, collector values of less than or equal to 10_000 instructions per measurement point (message call) are neglected (considered as 0).
The survival rate makes more sense for generational garbage collection, where the metric specifies the fraction of young live objects that get promoted to the older generation. Here, it denotes the fraction of live objects per GC run.
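The survival-rate calculation with the noise threshold can be sketched as follows; the series values in the example are invented for illustration.

```python
# Sketch of the survival-rate metric: 1 minus the average fraction of
# live objects reclaimed per GC run, ignoring measurement points where
# the collector value is within the noise threshold (GC did not run).
NOISE_LIMIT = 10_000

def survival_rate(allocated, reclaimed, collector):
    rates = []
    total_alloc = total_reclaimed = 0
    for alloc, rec, col in zip(allocated, reclaimed, collector):
        total_alloc += alloc
        # objects alive before this GC run: all allocated so far,
        # minus everything reclaimed by earlier runs
        live_before = total_alloc - total_reclaimed
        if col > NOISE_LIMIT and live_before > 0:
            rates.append(1 - rec / live_before)
        total_reclaimed += rec
    return sum(rates) / len(rates) if rates else None
```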
For data-structure-like scenarios, the maximum amount of allocations is also examined, until the program hits the message instruction limit or the heap size limit. These tests run as part of `measure-all.sh` and continuously allocate and add new elements (populate steps) until program failure.
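The allocate-until-failure idea can be sketched as follows. The failure condition and the per-element cost model here are simulated assumptions; on the IC, the failure is the canister trapping at the instruction or heap limit.

```python
# Sketch of a limit test: repeat populate steps until a limit is hit,
# then report how many elements were added before failure.
def limit_test(populate_step, elements_per_step, max_heap):
    heap, added = 0, 0
    while True:
        try:
            heap = populate_step(heap, elements_per_step, max_heap)
        except MemoryError:
            return added  # allocation limit: elements added before failure
        added += elements_per_step

def fake_populate(heap, n, max_heap):
    # hypothetical cost model: 8 bytes per element
    heap += 8 * n
    if heap > max_heap:
        raise MemoryError
    return heap

print(limit_test(fake_populate, 100_000, 32 * 1024 * 1024))
```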
The summary page `summary.html` includes the results of these measurements:

| Name | Description | Better |
|---|---|---|
| Allocation Limit | Maximum number of elements added | higher |
| Heap Maximum | Heap size before hitting the limit | higher |
Note: The graph scenario is deliberately not limit-tested, as it is not designed to scale.
Limit tests can also be run selectively (see above).
Additional metrics that would be interesting, but are currently not available:
- Throughput of instructions per time unit
- Effective costs on IC
- Reclamation latency
- Average lifetime of objects
- External fragmentation
- Heap locality
- Object structure shapes
- ...
The generated charts show the following properties over the time axis of the steps per benchmark test case (test scenario and GC version):
- Memory chart
  - Memory space, heap space, live objects
- Allocation chart
  - Allocated objects, reclaimed objects
- Runtime chart
  - Mutator instructions, garbage collector instructions