Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix manifest_number_ point to invalid file #12882

Conversation

jowlyzhang
Copy link
Contributor

This PR fix VersionSet's manifest_number_ could be pointing to an invalid number intermediately. This happens when a new manifest roll is attempted but fast failed after loading table handlers and before the new manifest file creation/writing is actually attempted.

In theory, a later manifest roll effort will overthrow this intermediate invalid in memory state. There is on harm when the DB crashes in this invalid state either. But efforts that takes a file snapshot of the DB like backup will incorrectly try to copy a non existing manifest file.

@@ -5768,7 +5770,7 @@ Status VersionSet::ProcessManifestWrites(
for (auto v : versions) {
delete v;
}
if (manifest_io_status.ok()) {
if (manifest_io_status.ok() && !current_file_io_status.ok()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if s is set to non-ok after successfully creating the CURRENT file? Would it make more sense to just record if a new MANIFEST is created and check that flag here?

Copy link
Contributor Author

@jowlyzhang jowlyzhang Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, this keeps the manifest_number_ unchanged, to point to the existing manifest and not updating its size. If we crashed right after this and CURRENT points to the new manifest file, the new manifest file should be around to be used since below logic only remove the new manifest file if manifest write IO itself encountered issues. If we didn't crash, this divergence will be fixed by the next manifest write attempt when it sees descriptor_log_ is nullptr.

Anyways, in these error scenario, manifest_number_ no longer points to what descriptor_log_ points to, I wonder should we err towards keeping it point to the old manifest number (which is removing this code block altogether), or switch to the new manifest number, and the criteria for doing that, the criteria is the diff between this approach and the flag approach I think:

  1. this approach : new manifest is in a good state, because manifest_io_status is ok, but current file creation may have some issues
  2. flag approach: a new manifest file is successfully created and written to.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either is fine. I found checking current_file_io_status harder to reason about since we are updating the manifest_file_number when CURRENT creation failed, but not updating it when CURRENT Is created successfully.

wonder should we err towards keeping it point to the old manifest number (which is removing this code block altogether)

Yeah, something probably went wrong with the new MANIFEST. Maybe backup should read the old MANIFEST that's valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point that in this approach, manifest number is updated when current file could encounter issues, looks very unintuitive. I have removed this block altogether.

@jowlyzhang jowlyzhang force-pushed the fix_intermediate_invalid_manifest_number branch from 7113a67 to 5c146af Compare July 24, 2024 21:35
Copy link
Member

@cbi42 cbi42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang merged this pull request in 9883b5f.

@jowlyzhang jowlyzhang deleted the fix_intermediate_invalid_manifest_number branch July 25, 2024 20:45
@hx235
Copy link
Contributor

hx235 commented Jul 27, 2024

But efforts that takes a file snapshot of the DB like backup will incorrectly try to copy a non existing manifest file.

Worth a history update since users could have observed "Failure in BackupEngine::CreateNewBackup..."? @jowlyzhang

jowlyzhang added a commit to jowlyzhang/rocksdb that referenced this pull request Jul 29, 2024
facebook-github-bot pushed a commit that referenced this pull request Jul 29, 2024
Summary:
As titled.

Pull Request resolved: #12892

Reviewed By: hx235

Differential Revision: D60400651

Pulled By: jowlyzhang

fbshipit-source-id: 2dd60c2287143f464ecab0de859715af6ab3a825
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants