Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix backup/checkpoint stress test failure #12227

Closed

Conversation

jowlyzhang
Copy link
Contributor

This PR fixes this type of stress test failure that could happen in either checkpoint or backup. Example failure messages are like this:

Failure in a backup/restore operation with: Corruption: 0x00000000000001D5000000000000012B00000000000000FD exists in original db but not in restore

A checkpoint operation failed with: Corruption: 0x0000000000000365000000000000012B0000000000000067 exists in original db but not in checkpoint /...

The internal task has an example test command to quickly reproduce this type of error.

The common symptom of these test failures are these expected keys do not exist in the original db either. The root cause is TestCheckpoint and TestBackupRestore both use the expected state as a proxy for the state of the original db when it comes to check a key's existence.

bool exists = thread->shared->Exists(rand_column_families[i], rand_keys[0]);

This ExpectedState::Exists API returns true if a key has a pending write, such as a pending put. In usual case, this pending put should either soon materialize to an actual write when PendingExpectedValue::Commit is called to reflect a successful write to the DB, or test should be safely terminated if write to DB fails. All of which happens while a key is locked. So checkpoint and backup usually won't see the discrepancy between db and expected state caused by pending writes. However, the external file ingestion test currently has a path that will proceed the test after a failed ingestion caused by injected errors, leaving the pending put in the expected state.

if (!s.ok()) {
if (!s.IsIOError() || !std::strstr(s.getState(), "injected")) {
fprintf(stderr, "file ingestion error: %s\n", s.ToString().c_str());
thread->shared->SafeTerminate();
} else {
fprintf(stdout, "file ingestion error: %s\n", s.ToString().c_str());
}
} else {
for (size_t i = 0; i < pending_expected_values.size(); ++i) {
pending_expected_values[i].Commit();
}
}
}

I think a proper and future proof fix for this is to explicitly rollback a pending state when a db write operation failed so that expected state do not diverge from db in the first place. I added a PendingExpectedValue::Rollback API so that we don't implicitly depend on thread termination to prevent test failures. Another place that could cause same divergence as external file ingestion is PreloadDbAndReopenAsReadOnly.

pending_expected_value.Commit();
if (!s.ok()) {
break;
}

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@hx235 hx235 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I also wonder if it's possible to do some checking in ExpectedValue's destructor to see if Rollback() or Commit() is called. If the check is complicated, it could be left in a follow up PR.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jowlyzhang
Copy link
Contributor Author

LGTM - I also wonder if it's possible to do some checking in ExpectedValue's destructor to see if Rollback() or Commit() is called. If the check is complicated, it could be left in a follow up PR.

That's a good point. This can help make this more future proof. It looks straightforward from a glance, after looking more into it, I realize some intermediate methods could be creating PendingExpectedValue on other's behalf, for example ExpectedState::PrepareDeleteRange. So I need to look a bit more into other related changes for this, like enforcing a usage pattern for PendingExpectedValue besides the check in the destructor. I will do it in a follow up.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang merged this pull request in c4228ab.

@jowlyzhang jowlyzhang deleted the fix_backup_checkpoint_stress branch January 22, 2024 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants