Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate vfs2 #661

Closed
wants to merge 20 commits into from
Closed

Conversation

cole-miller
Copy link
Contributor

No description provided.

Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Copy link

codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 71.08434% with 168 lines in your changes missing coverage. Please review.

Project coverage is 77.35%. Comparing base (6633bd8) to head (86ef5cd).
Report is 29 commits behind head on master.

Files Patch % Lines
src/fsm.c 56.25% 34 Missing and 15 partials ⚠️
src/server.c 61.34% 33 Missing and 13 partials ⚠️
src/vfs2.c 74.78% 10 Missing and 19 partials ⚠️
src/raft/replication.c 58.06% 7 Missing and 6 partials ⚠️
src/leader.c 61.90% 5 Missing and 3 partials ⚠️
src/raft/uv.c 77.27% 2 Missing and 3 partials ⚠️
test/raft/integration/test_uv_load.c 84.61% 4 Missing ⚠️
src/utils.h 57.14% 0 Missing and 3 partials ⚠️
src/raft/uv_metadata.c 75.00% 0 Missing and 2 partials ⚠️
src/raft/uv_segment.c 92.00% 0 Missing and 2 partials ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #661      +/-   ##
==========================================
- Coverage   77.41%   77.35%   -0.07%     
==========================================
  Files         196      196              
  Lines       27269    27377     +108     
  Branches     5455     5519      +64     
==========================================
+ Hits        21111    21178      +67     
+ Misses       4297     4286      -11     
- Partials     1861     1913      +52     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
@cole-miller
Copy link
Contributor Author

This PR does a few different things (sorry):

  • It removes --enable-dqlite-next from the build system and all mention of DQLITE_NEXT in the source code. All the code for the new disk-based configuration is now compiled in all build configurations, except for uses of new raft APIs that aren't available when linking against an old, external libraft.
  • It introduces new codepaths that use vfs2 instead of the old VFS, including calling functions that have no counterpart in the old setup, like vfs2_unapply. I've repurposed the existing dqlite_node_enable_disk_mode option to switch between the old in-memory VFS and vfs2; the existing on-disk configuration of the old VFS is no longer reachable after this PR.
  • It introduces a run-time configuration option for the raft_uv I/O implementation, raft_uv_set_format_version, that toggles between the backwards-compatible format for the persistent log and the new format (originally introduced and gated behind DQLITE_NEXT in Add local_data and is_local to raft_entry #639).
  • Includes various fixes for vfs2 that were added in the course of getting our test suite to pass when running in "disk mode". Note that the changes included in this initial pass are definitely not the final word---for example, as discussed elsewhere, vfs2_pseudo_read_begin needs to be rewritten and the way the WAL-index header is handled needs to be redone, also some of the fixes in this PR are more like band-aids and need to be redone more carefully.
  • Finally, reworks the initialization logic in server.c, moving things around so that we no longer leak resources if you create a dqlite_node and then destroy it without running it first (a longstanding issue, see comments in this file on master). I got sucked into doing this while working on the extra initialization required to set up vfs2, although it wasn't strictly necessary to carry it as far as I have done here.

Copy link
Contributor

@letFunny letFunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass to half of the files approximately. The PR is looking very good, thanks!

I just have one general comment and that is that we need to make sure to remove all the usages of the disk config prior to this PR. For example: https://github.com/canonical/dqlite/blob/master/src/db.c#L50

@@ -39,10 +39,6 @@ AM_CONDITIONAL(BUILD_SQLITE_ENABLED, test "x$enable_build_sqlite" = "xyes")
AC_ARG_ENABLE(build-raft, AS_HELP_STRING([--enable-build-raft[=ARG]], [use the bundled raft sources instead of linking to libraft [default=no]]))
AM_CONDITIONAL(BUILD_RAFT_ENABLED, test "x$enable_build_raft" = "xyes")

AC_ARG_ENABLE(dqlite-next, AS_HELP_STRING([--enable-dqlite-next[=ARG]], [build with the experimental dqlite backend [default=no]]))
AM_CONDITIONAL(DQLITE_NEXT_ENABLED, test "x$enable_dqlite_next" = "xyes")
AS_IF([test "x$enable_build_raft" != "xyes" -a "x$enable_dqlite_next" = "xyes"], [AC_MSG_ERROR([dqlite-next requires bundled raft])], [])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If dqlite next needs to bundle raft, shouldn't we remove the option enable_build_raft as well and make it a default?

EDIT: we discussed this in the daily but now I think it makes more sense to do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do this for the next release but need to make sure first that our known downstreams are ready to move to bundled raft (or have already done so).

@@ -22,8 +24,20 @@
#define POST(cond) assert((cond))
#define ERGO(a, b) (!(a) || (b))

#define UNHANDLED(expr) if (expr) assert(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this macro log something or is the stack trace going to be enough?

BTW thanks for this macro, I find code easier to read this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced the macro to mark places where there's a potential failure that I didn't yet know how to handle, so at least the defective lines are easily greppable. It doesn't mean I think we should abort in all these cases (I should've explained this).

/* TODO maybe vfs2 should just accept the pages and page numbers
* in the layout that we receive them over the wire? */
dqlite_vfs_frame *frames = sqlite3_malloc((int)sizeof(*frames) * (int)cf->frames.n_pages);
UNHANDLED(frames == NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have a unified approach for out of memory errors, either crash or return an error code, but right now I see code doing both things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, generally we try to return an error code for that---anything marked with UNHANDLED here is intended to be temporary until proper error handling is introduced. (Writing it this way initially helped me develop the patch faster, but I can fix the UNHANDLED call sites in this PR before it merges.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a TODO above the UNHANDLED definition macro then? Just so we remember to update it.

Comment on lines +766 to +767
/* TODO maybe vfs2 should just accept the pages and page numbers
* in the layout that we receive them over the wire? */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would still mean we would copy the data and not use the pointer from the command directly, right? So it would translate into removing the "translation loop" below from one data structure to the other, right?

Copy link
Contributor Author

@cole-miller cole-miller Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible for vfs2_apply_uncommitted to parse the frames data in exactly the format that it comes over the wire, so no copy is needed at all. Maybe I should add that to the PR while I'm at it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can also be part of the next PR, but if the changes are minimal and we are avoiding a copy it seems worth it to me.

src/fsm.c Show resolved Hide resolved
struct fsm *f = raft_malloc(sizeof *f);

(void)config;
struct fsm *f = raft_malloc(sizeof(*f));
if (f == NULL) {
return DQLITE_NOMEM;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example of what I meant in the previous comment regarding error handling on memory allocation.

Comment on lines -581 to -585
struct dqlite_node *node = g->raft->data;
pool_t *pool = !!(pool_ut_fallback()->flags & POOL_FOR_UT)
? pool_ut_fallback() : &node->pool;
pool_queue_work(pool, &req->work, g->leader->db->cookie,
WT_UNORD, qb_top, qb_bottom);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new calls into vfs2 in this PR all happen on the main thread, so in order to not have data races I had to undo the earlier change that moved sqlite3_step calls to the thread pool. The next PR will make everything async again.

Comment on lines +682 to +683
/**
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc.

rv = r->fsm->apply(r->fsm, buf, &result);

if (r->fsm->version >= 4 && r->fsm->apply2 != NULL) {
bool is_mine = is_local && term == r->current_term;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be part of the documentation of the apply2 function, stating what it means for an entry to be mine. Also, let's try to think about a better name that conveys that it has to have been created by the node while it was the leader (if we cannot, then the documentation is enough and we can keep it as is_mine). For example, what about created_as_leader_current_term? (maybe too verbose)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed about documentation. Maybe a more verbose name in the header file and the concise is_mine when it's referred to in .c?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work.

Comment on lines -111 to -116
rv = uvMetadataLoad(uv->dir, &metadata, io->errmsg);
if (rv != 0) {
return rv;
}
uv->metadata = metadata;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we moving this and the timer to _start and _load?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uvMetadataLoad checks the format_version field to detect malformed metadata files. The correct format version isn't known until load or bootstrap time, so I moved the metadata loading later on. But I'm definitely less than 100% certain that this is the right way to do it. Maybe it would be a better idea to just modify raft_uv_init to accept the desired format version as an additional argument?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check my understanding here: we are using the disk_mode as a toggle for all of these changes (vfs2 + new format representation) which is why by default uv->format_version = 1 and we change it when calling dqlite_node_enable_disk_mode. And we need to support both implementations for at least some time.

If all of that is true then I guess it makes sense to keep it as is because the sequence has to be init -> enable_disk_mode -> load. Ideally we would pass the format as a parameter from the beginning so that we don't have to think about when the format changes and when to do each operation, like reading metadata, but it seems that is not possible here just yet with the current design.

Copy link
Contributor

@letFunny letFunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed another batch of files. I think we could have split this PR into the changes for the format version and removing the DQLITE_NEXT macro and the changes to the vfs2 (maybe it is not that straightforward to do). Definitely not something to do now, but for next PRs it might make reviews faster.

@@ -47,6 +47,7 @@ typedef unsigned long long uvCounter;
/* Information persisted in a single metadata file. */
struct uvMetadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the metadata tied to the uv-based raft_io implementation below or is it agnostic? I assume it is the former because it is in the uv.h file. If that is the case, can we add a comment in the format_version like the one we have below?

/* 1 (original recipe) or 2 (with local data) */

If we need to document the format change further we could define an enum and document it there more thoroughly. The benefit of an enum would be that the checks PRE(1 <= version && version < 3) would automatically be in sync when we add new fields instead of having to change all occurrences.

Comment on lines -111 to -116
rv = uvMetadataLoad(uv->dir, &metadata, io->errmsg);
if (rv != 0) {
return rv;
}
uv->metadata = metadata;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check my understanding here: we are using the disk_mode as a toggle for all of these changes (vfs2 + new format representation) which is why by default uv->format_version = 1 and we change it when calling dqlite_node_enable_disk_mode. And we need to support both implementations for at least some time.

If all of that is true then I guess it makes sense to keep it as is because the sequence has to be init -> enable_disk_mode -> load. Ideally we would pass the format as a parameter from the beginning so that we don't have to think about when the format changes and when to do each operation, like reading metadata, but it seems that is not possible here just yet with the current design.

rv = uv->transport->init(uv->transport, id, address);
if (rv != 0) {
ErrMsgTransfer(uv->transport->errmsg, io->errmsg, "transport");
return rv;
}
uv->transport->data = uv;

rv = uv_timer_init(uv->loop, &uv->timer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it that we are moving the timer?

{
size_t res = 8 + /* Number of entries in the batch, little endian */
16 * n; /* One header per entry */;
if (with_local_data) {
#ifdef DQLITE_NEXT
if (format_version > 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be tricky to find in the future if the format changes in a way that drops local data for version 3 (for example). Do you think it is better to check for the format version explicitly?

@@ -143,7 +143,7 @@ static void encodeAppendEntries(const struct raft_append_entries *p, void *buf)
bytePut64(&cursor, p->prev_log_term); /* Previous term. */
bytePut64(&cursor, p->leader_commit); /* Commit index. */

uvEncodeBatchHeader(p->entries, p->n_entries, cursor, false /* no local data */);
uvEncodeBatchHeader(p->entries, p->n_entries, cursor, 1 /* no local data ever */);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no local data ever? I assume it is because this is only used to send AppendEntries messages and we don't transmit local data. Maybe a slightly bigger comment saying something like: encodeAppendEntries is only called when sending AppendEntries messages and local data is never transmitted, is more clear. What do you think?

{
unsigned i;
void *cursor = buf;

/* Number of entries in the batch, little endian */
bytePut64(&cursor, n);

if (with_local_data) {
#ifdef DQLITE_NEXT
if (format_version > 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@@ -391,15 +390,13 @@ int uvDecodeBatchHeader(const void *batch,
return 0;
}

if (local_data_size != NULL) {
#ifdef DQLITE_NEXT
if (format_version > 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@@ -456,7 +453,7 @@ static int decodeAppendEntries(const uv_buf_t *buf,
args->prev_log_term = byteGet64(&cursor);
args->leader_commit = byteGet64(&cursor);

rv = uvDecodeBatchHeader(cursor, &args->entries, &args->n_entries, false);
rv = uvDecodeBatchHeader(cursor, &args->entries, &args->n_entries, NULL, 1 /* no local data ever */);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same nit about slightly longer documentation.

@@ -295,7 +295,7 @@ static void uvServerReadCb(uv_stream_t *stream,
s->message.append_entries.entries,
s->message.append_entries
.n_entries,
false);
0, 1 /* no local data ever */);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same nit, although in this case it might be easier to see because the function name already states that it is "receiving" something. Might still be useful for future readers though.

@@ -405,7 +405,7 @@ int uvSegmentLoadClosed(struct uv *uv,
if (rv != 0) {
goto err;
}
if (format != UV__DISK_FORMAT) {
if (format != (uint64_t)uv->format_version) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe be super defensive with uv->format_version < 0 || format != ... (just for the extra security going from int to uint) or even a PRE. This applies to several places but it is very minor so feel free to ignore it.

@cole-miller
Copy link
Contributor Author

Closing in favor of more focused PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants