BTrees in Motoko. #396

matthewhammer · 2022-06-23T22:18:31Z

See also BTree forum discussion.

matthewhammer · 2022-06-23T22:24:43Z

Status:

- provisional data type definitions for the B-Tree
- Check module checks its invariants

Still to do:

- find
- insert
- insertSorted
- remove

And then usual utilities like iteration, mapping, etc.

ByronBecker

A few comments and suggestions - thanks for starting into this 👍

ByronBecker · 2022-06-24T05:02:46Z

src/BTree.mo

+ /// Constants we use to shape the tree.
+ /// See https://en.wikipedia.org/wiki/B-tree#Definition
+ module Constants {
+ let MAX_CHILDREN = 4;


Why is the max 4? https://panthema.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html shows that 32-128 perform considerably better at large n.

I think it therefore might make sense to allow the developer to configure larger child values (i.e. 4, 16, 32, 64, 128, 256). I'd be curious to run some performance tests inserting a running counter or batch of elements (ordered or unordered) to see what difference this might make as the tree grows in size.

I want to write simple tests before adjusting this number into something that varies, which I agree is desirable.

ByronBecker · 2022-06-24T05:05:36Z

src/BTree.mo

+
+ public type Tree<K, V> = {
+ #index : Index<K, V>;
+ #data : Data<K, V>;


Can the Tree variant's values be renamed #node and #leaf? Index makes it sound more like a hash table type of lookup.

Sure, except that I'd prefer to call it "internal" and "leaf", since both are kinds of "nodes" in my mind.

"Index" seemed more natural when I was misunderstanding the structure of a B-Tree, and thinking of it as a B+-Tree (I guess?) where the data is only at the leaves; then the internal nodes only store keys, and serve as an index.

Now, I've adopted the standard B-Tree definition where key-value pairs can be internal too, and the name "index" is more inappropriate.

BTW, as one reference for "standard", I've been looking at the Rust implementation in std, and the helpful comments in particular. Hopefully this implementation will be much (much!) shorter, at least initially.

Makes sense, I'm fine with both of those as long as there's a comment above the type definition for the reader.

Went with "internal" and "leaf".

What should the comment say?

ByronBecker · 2022-06-24T05:06:47Z

src/BTree.mo

+ #data : Data<K, V>;
+ };
+
+ func find_data<K, V>(data : Data<K, V>, find_k : K, c : Compare<K>) : ?V {


c can just be of type (K, K) -> Order.Order.

So the function signature can become

func find_data<K, V>(data : Data<K, V>, find_k : K, compareTo: (K, K) -> Order.Order) : ?V {

what's the advantage of that? Less cognitive overhead in some way?

I actually like the record with a function inside pattern, even with just one function: If I want to add debugging later, or any other operations, it helps to have a record in place already to place those operations.

I wouldn't change the API because of concern about adding debugging.

If you want backwards compatibility and a more flexible API, why not just make the function parameters a record?

    func find_data<K, V>(options : {
      data : Data<K, V>;
      find_k : K;
      compareTo : (K, K) -> Order.Order;
      // add new optional parameters here later if desired
    }) : ?V { ... }

Okay. I've removed that record-of-function from the public-facing API used in tests, etc.

FWIW, it was indeed useful to keep it in the checking API (just for tests), since then I could debug by printing out keys from within the BTree code.

Without having a show function for the key type, there's no way for the generic code to print them during debugging phases of development.

ByronBecker · 2022-06-24T05:10:19Z

src/BTree.mo

+ compare : (K, K) -> Order.Order
+ };
+
+ public type Data<K, V> = [(K, V)];


Curious, what's the benefit of the Data<K, V> type being the tuple [(K, V)] over something like

public type Data<K, V> = { key: K; value: V; };

There's a slight space savings to the tuple (1 word per instance). Otherwise, I'd prefer the record with labeled fields.

I actually had that same record definition initially, but recalled this recent review from Claudio for the HashMap improvements.

Also, why is Data an array and not just (K, V)? I thought the data of each leaf in the index was already folded into the index type here

    public type Index<K, V> = {
      data : Data<K, V>; // data value pointing to this index
      trees : [Tree<K, V>]; // each of these sub-trees in the array has their own (K, V)
    };

I'm probably missing something as it's early and I definitely didn't get enough sleep last night 😅

Is there a reference implementation you used for this that I can peek through?

The Rust implementation in std

ByronBecker · 2022-06-24T05:19:29Z

src/BTree.mo

+ for (j in I.range(0, i.data.size())) {
+ switch (c.compare(k, i.data[j].0)) {
+ case (#equal) { return ?i.data[j].1 };
+ case (#less) { return find<K, V>(i.trees[j], k, c) };
+ case _ { }


Since trees is an ordered array, this could be a great place to do a modified binary search on the i.trees array to find the correct subtree to navigate down.

Also, are there any (scalability) concerns about the call stack size since we're using recursion vs. a loop?

> this could be a great place to do a modified binary search on the i.trees array to find the correct subtree to navigate down.

Yes, true!

OTOH, that would not be a "baby step". : )

Ideally, we'd profile both variations, at various choices of m (the branching factor for B-Trees seems to be called m by everyone for some reason) and we'd have some way to know if there's a benefit, and how much. There may be little benefit when m is small, or it may even be worse (?)
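For reference, a minimal sketch of what the binary-search variant might look like, so the two variations can be compared side by side. The names `Slot` and `searchKeys` are hypothetical and not part of this PR; the sketch assumes the keys in `data` are sorted in strictly ascending order. It returns either the index of the matching key or the index of the subtree to descend into.

```motoko
import Nat "mo:base/Nat";
import Order "mo:base/Order";

// Hypothetical result type: either the key sits at data[i], or the search
// should continue in subtree number i.
type Slot = { #found : Nat; #child : Nat };

// Binary search over the sorted keys of one node (half-open range [lo, hi)).
func searchKeys<K, V>(
  data : [(K, V)],
  k : K,
  compare : (K, K) -> Order.Order
) : Slot {
  var lo = 0;
  var hi = data.size();
  while (lo < hi) {
    // lo < hi here, so the Nat subtraction cannot trap.
    let mid = lo + (hi - lo) / 2;
    switch (compare(k, data[mid].0)) {
      case (#equal) { return #found(mid) };
      case (#less) { hi := mid };
      case (#greater) { lo := mid + 1 };
    };
  };
  // lo now counts the keys smaller than k, which is also the subtree index.
  #child(lo)
};

let data : [(Nat, Text)] = [(10, "a"), (20, "b"), (30, "c")];
assert (searchKeys<Nat, Text>(data, 20, Nat.compare) == #found(1));
assert (searchKeys<Nat, Text>(data, 25, Nat.compare) == #child(2));
```

Whether this beats the linear scan for small m is exactly the open profiling question above.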

> Also, are there any (scalability) concerns about the call stack size since we're using recursion vs. a loop?

Within the compiler, tail recursion to the same function gets transformed into a loop, with no stack/recursion.

Is there any difference in performance in terms of what it gets transformed down to? I'm planning on writing an imperative RBTree later this week to test out the RBTree performance issues I was running into in #390

It might end up being the case that a BTree with index/sub-tree size 16/32/64 removes this issue, since the tree is shallower and there are fewer rebalancing steps required 🤷‍♂️

Just for context, the reason the insertion performance is important to me is that I'd prefer to find a balanced data structure solution that avoids hitting the message instruction limit and doesn't rely on deterministic time slicing.

Yep, I think B-Trees will be shallow for any realistic internal node size (16, 32, etc.).

As mentioned above, in Motoko, any self-contained tail-recursive function (not two mutually-recursive ones) will become a single while loop without a stack during compilation.

For any realistic BTree, that stack-free loop should run at most a handful of iterations (less than 10), so the cycle limit will certainly not matter for searching, even if you need to search for a bunch of records in random positions all within one message response time.
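A minimal sketch of such a search, assuming simplified type definitions that use the "internal" and "leaf" names agreed on earlier in this thread (the PR's actual definitions may differ). Each recursive call descends one level, so the number of calls is bounded by the height of the tree. The sketch uses an explicit index loop rather than `Iter.range`, whose upper bound is inclusive.

```motoko
import Order "mo:base/Order";
import Text "mo:base/Text";

// Hypothetical, simplified types for illustration only.
type Data<K, V> = [(K, V)];
type Internal<K, V> = { data : Data<K, V>; trees : [Tree<K, V>] };
type Tree<K, V> = { #internal : Internal<K, V>; #leaf : Data<K, V> };

func find<K, V>(t : Tree<K, V>, k : K, compare : (K, K) -> Order.Order) : ?V {
  switch t {
    case (#leaf(d)) {
      for ((k2, v) in d.vals()) {
        switch (compare(k, k2)) {
          case (#equal) { return ?v };
          case _ {};
        };
      };
      null
    };
    case (#internal(i)) {
      // Assumes a non-empty node with i.data.size() + 1 == i.trees.size().
      var j = 0;
      while (j < i.data.size()) {
        switch (compare(k, i.data[j].0)) {
          case (#equal) { return ?i.data[j].1 };
          case (#less) { return find<K, V>(i.trees[j], k, compare) };
          case (#greater) { j += 1 };
        };
      };
      // k is greater than every key in this node: descend into the last subtree.
      find<K, V>(i.trees[j], k, compare)
    };
  }
};

let t : Tree<Text, Nat> = #internal({
  data = [("pine", 42)];
  trees = [#leaf([("apple", 1)]), #leaf([("willow", 3)])];
});
assert (find<Text, Nat>(t, "pine", Text.compare) == ?42);
assert (find<Text, Nat>(t, "willow", Text.compare) == ?3);
```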

ByronBecker · 2022-06-24T05:23:07Z

src/BTree.mo

+ /// Check that a B-Tree instance observes invariants of B-Trees.
+ /// Invariants ensure performance is what we expect.
+ /// For testing and debugging.
+ public module Check {


Didn't know we could have more than one module in a file! I learn something new every day :)

Yep! And they can even vary in whether they are public or private, and nest to any depth.

All imports have to be at the top of the file though (very unlike Rust).

ByronBecker · 2022-06-24T05:24:43Z

src/BTree.mo

+ compare : (?K, ?K) -> Order.Order
+ };
+
+ func compareOp<K>(c : Compare<K>) : CompareOp<K> = {


similar comment on the Compare type to above


ByronBecker · 2022-06-24T05:36:30Z

src/BTree.mo

+ };
+
+ func index<K, V>(lower : ?K, c : CompareOp<K>, i : Index<K, V>, upper : ?K) {
+ assert (i.data.size() + 1 == i.trees.size());


Why does this keep the invariant? I thought i.data was just [(K, V)], whereas i.trees could be any size array specified by the BTree child size

Perhaps I don't understand the question entirely, but the arrays of (sub)trees and keys (in data) are highly related under my current understanding of B-Trees, and the way they work in the Rust implementation.

From that work, I get that this property holds (related to assert (i.data.size() + 1 == i.trees.size());)

For each pair of consecutive subtrees (t1, t2), there is a key k12 (and value v12) such that keysOf(t1) < k12 < keysOf(t2)

(abusing notation to compare sets of keys with a single key; what I mean is every key in each set observes the relevant comparison operator holding.)

I found it very helpful to look at pictures that people drew, and draw some myself. For instance, the one on the Wikipedia page shows this pattern where keys are shown between consecutive subtrees.

PS: I strongly suspect that this invariant code is wrong. It's a baby step that will need to be revised, probably a few times. Thanks for taking a look!

Based on briefly looking at that wikipedia page image, let me know if I'm understanding this correctly...

So for each new level/sub-tree the "trees" at that level get instantiated to be a specific size array with each data slot containing a default "empty" value until filled.

I understand this invariant a bit better now given my response to this comment of yours.

It has to do with the fact that you have both a data array and a sub-trees array - which I argue could be folded into one array, with just a single data tuple (and not a data tuple array).

I'm cribbing the type structure from this Rust implementation from std, also linked above in another thread.

What makes the B-Tree "tricky" as a simple type def, IMO is the combination of array structure and tree structure, and how it mixes them together in an efficient implementation that uses arrays to organize things of the same type.

In a "usual" internal node, there is a key-value between each pair of consecutive sub-trees, giving a key that separates these sub-trees' keys.

But there is no key-value pair on either end of the sub-tree list, so in total, there is one fewer key-value pair than sub-trees.

For instance, see this line of the checking algorithm.

The Rust implementation linked above does this, based on me reading the comments (not the code itself).
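The separator-key property described above can be sketched as a recursive check that threads an optional lower and upper bound through the tree. This is an illustration under assumed, simplified types, not the PR's Check module; the function name `ordered` is hypothetical, and a `null` bound is taken to mean "unbounded".

```motoko
import Nat "mo:base/Nat";
import Order "mo:base/Order";

// Hypothetical, simplified types for illustration only.
type Internal<K, V> = { data : [(K, V)]; trees : [Tree<K, V>] };
type Tree<K, V> = { #internal : Internal<K, V>; #leaf : [(K, V)] };

// True when every key in t lies strictly between lower and upper, keys are
// strictly ascending, and each internal node has one more subtree than keys.
func ordered<K, V>(
  lower : ?K,
  t : Tree<K, V>,
  upper : ?K,
  compare : (K, K) -> Order.Order
) : Bool {
  func less(a : ?K, b : ?K) : Bool {
    switch (a, b) {
      case (?x, ?y) { compare(x, y) == #less };
      case _ { true }; // a missing bound is unbounded
    }
  };
  switch t {
    case (#leaf(d)) {
      var prev = lower;
      for ((k, _) in d.vals()) {
        if (not less(prev, ?k)) { return false };
        prev := ?k;
      };
      less(prev, upper)
    };
    case (#internal(i)) {
      if (i.data.size() + 1 != i.trees.size()) { return false };
      var prev = lower;
      var j = 0;
      while (j < i.data.size()) {
        let k = i.data[j].0;
        // Subtree j must fit strictly between the previous key and key j.
        if (not ordered<K, V>(prev, i.trees[j], ?k, compare)) { return false };
        prev := ?k;
        j += 1;
      };
      ordered<K, V>(prev, i.trees[j], upper, compare)
    };
  }
};

let good : Tree<Nat, Text> = #internal({
  data = [(20, "b")];
  trees = [#leaf([(10, "a")]), #leaf([(30, "c")])];
});
assert (ordered<Nat, Text>(null, good, null, Nat.compare));
```

Note that this sketch rejects an internal node with no keys and no subtrees, which the current checking code accepts.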

Got it, I understand completely now! Thanks for explaining.

matthewhammer · 2022-07-16T01:39:11Z

Hey @ByronBecker thanks for the lively discussion about the initial code!

I've taken our discussion into account in the latest commit, which also adds unit tests for some tiny cases.

Please let me know what you think about the revisions, when you have time.

ByronBecker

A few minor comments/suggestions. Looking forward to insert 🎉

ByronBecker · 2022-07-18T06:12:18Z

test/BTreeTest.mo

+let _ = Suite.suite("find", [
+ Suite.test("pine",
+ BT.find<Text, Nat>(binary_internal, "pine", Text.compare),
+ M.equals(T.optional<Nat>(T.natTestable, ?42))
+ )
+]);


👍 To using matchers instead of debug statements for tests.

This is the pattern I've been using for my tests.

I create a "suite" for a specific function, like here where I have all of the different test cases for that function inside it.

Then I run the test suite for each function here.

> 👍 To using matchers instead of debug statements for tests.

For me, it's not one or the other. It's "which here?" -- Like all tools and techniques, there is a place for matchers, but it's not a universal solution, or even close to one.

For instance, what value would it add to the first tests, where I just want to define a bunch of trees using let expressions that scope in a nested way, and assert that they are valid? To me, plain Motoko is the right structure for that test, and I used Debug prints so I know where I am in that script.

How do matchers help there? I can't see it -- It's not really helping guide that sort of test structure at all, and that's fine. I think it's okay to use it only where it helps --- organizing tests of an API, ensuring that each operation has some coverage. (I plan to use it for the rest of the API coverage tests.)

ByronBecker · 2022-07-18T06:17:42Z

src/BTree.mo

+ };
+ };
+
+ /// Check that a B-Tree instance observes invariants of B-Trees.


If this module is just for testing/debugging, should Check be in a different file?

When I import the default BTree module like import BTree "mo:base/BTree";, does it shake out the Check, or does it include it and bloat the import size?

> does it shake out the Check, or does it include it and bloat the import size?

This module is static. The compiler will not compile static code that you import but do not use. (However, class instances will always contain all methods, regardless of usage. Not applicable here.)

> If this module is just for testing/debugging, should Check be in a different file?

Why? It's more enclosed and encapsulated here. There is no way to "hide" a top-level module in base.

ByronBecker · 2022-07-18T06:24:25Z

src/BTree.mo

+ if (i.data.size() == 0) { assert i.trees.size() == 0; return };
+ if (i.trees.size() < 2) { assert false };


Is there a benefit to having all of the invalid cases trap vs. having this be more like an isBTreeValid() function?

I guess the question from a usability standpoint is do you see developers using it like

    try {
      Check.root<Text, Text>(Text.compare, bt);
      bt;
    } catch e {
      // invalid btree logic here
    }

or

    do {
      if (isBTreeValid<Text, Text>(Text.compare, bt)) {
        bt
      } else {
        // invalid btree logic here
      }
    }

Using assert is easier for this phase of development.

Indeed, using assert does not lead to composable code (the second version of the code you gave wouldn't work), so eventually, to make these checks more composable and thus more useful, they should return either ?Error or perhaps Result<_, Error>, for some well-defined variant type Error (to be defined). I'm not sure there is value in returning anything in the #ok case, so I'd tend toward returning ?Error, with null meaning "okay, no errors". Then a wrapper can check for null and return true, or some other kind of Result, etc.
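A minimal sketch of that direction, covering only the two size checks from the snippet under review. The `ShapeError` variant and the names `checkShape` and `isShapeValid` are hypothetical placeholders, since the real Error type is still to be defined.

```motoko
// Hypothetical, simplified types for illustration only.
type Internal<K, V> = { data : [(K, V)]; trees : [Tree<K, V>] };
type Tree<K, V> = { #internal : Internal<K, V>; #leaf : [(K, V)] };

// Placeholder error type; the real variant is still to be designed.
type ShapeError = {
  #arityMismatch : { keys : Nat; trees : Nat };
};

// Returns null when the node's shape is fine, or a description of the problem.
func checkShape<K, V>(i : Internal<K, V>) : ?ShapeError {
  let keys = i.data.size();
  let trees = i.trees.size();
  if (keys == 0) {
    // Mirrors the assert version: an empty node must have no subtrees.
    return if (trees == 0) null else ?(#arityMismatch({ keys; trees }));
  };
  if (keys + 1 != trees) { return ?(#arityMismatch({ keys; trees })) };
  null
};

// Boolean wrapper: null means "okay".
func isShapeValid<K, V>(i : Internal<K, V>) : Bool {
  switch (checkShape<K, V>(i)) {
    case (null) { true };
    case (?_) { false };
  }
};

let ok : Internal<Text, Nat> = { data = [("pine", 42)]; trees = [#leaf([]), #leaf([])] };
assert (checkShape<Text, Nat>(ok) == null);

let bad : Internal<Text, Nat> = { data = [("pine", 42)]; trees = [#leaf([])] };
assert (not isShapeValid<Text, Nat>(bad));
```

Because the check returns a value instead of trapping, a caller can branch on the result, which the assert-based version does not allow.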

> Using assert is easier for this phase of development.

My intentions are time-variant:

Now, before all functions are defined, I'd like to focus on writing them, and using these assertions to check that they are correct (after each unit test, I'd use the assertion code to ensure that the unit tests' operations preserve all invariants of the BTrees.). They need not be composable to be used in unit tests.

Once the basic API works, we can extend it with some invariant checks, if that seems useful. Or, they could remain as is, just used in unit tests, depending. If we do expose them, I envision them returning a value, not trapping.


ByronBecker · 2022-07-19T22:26:50Z

src/BTree.mo

+ public func assertIsValid<K, V>(
+ t : Tree<K, V>,
+ compare : (K, K) -> Order.Order,
+ show : K -> Text)
+ {
+ Check.root<K, V>({compare; show}, t)
+ };
+
+ public func assertIsValidTextKeys<V>(t : Tree<Text, V>){
+ Check.root<Text, V>({compare=Text.compare; show=func (t:Text) : Text { t }}, t)
+ };


Would it help if Check and these test methods were just moved into test/BTreeTest.mo?

BTree.

c35b3d9

matthewhammer added 4 commits June 23, 2022 16:39

remove already-stale invariants comment. prefer Check module.

b95320f

revised definition. wrote find. checks need revision.

d81a1c3

simplify type defs.

142a0d0

adjust Check module.

ffd99ec

ByronBecker reviewed Jun 24, 2022


fix issues and unit test.

61334ec

matthewhammer added 2 commits July 15, 2022 20:04

missing file: BTreeTest.mo

0d06506

simpler public API; unit test for find.

273fb6f

ByronBecker reviewed Jul 18, 2022


matthewhammer added 2 commits July 19, 2022 14:56

internal compiler error. const ... captures ..., not found in static …

9db84ca

…environment

workaround compiler issue.

4393a81

ByronBecker reviewed Jul 19, 2022


ggreif force-pushed the master branch 2 times, most recently from d52aecd to 08507fc Compare October 21, 2022 12:22

ByronBecker mentioned this pull request Nov 5, 2022

Add module for checking the validity of the BTree canscale/StableHeapBTreeMap#4

Merged
