Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/cleanup SeqnoToTimeMapping #12253

Closed

Conversation

pdillinger
Copy link
Contributor

@pdillinger pdillinger commented Jan 19, 2024

Summary: The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios.

  • Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. DecodeFrom and CopyFromSeqnoRange), object states, function preconditions, etc.
    • Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.)
  • Previously operations would coalesce data in a way that was better for GetProximalTimeBeforeSeqno than for GetProximalSeqnoBeforeTime which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor GetProximalSeqnoBeforeTime, with that logic concentrated in one place: SeqnoToTimeMapping::SeqnoTimePair::Merge(). Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior.
  • Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important GetProximalSeqnoBeforeTime queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (GetProximalSeqnoBeforeTime() returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.)
    • Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to GetProximalSeqnoBeforeTime queries.
    • Fixes old // FIXME: be smarter about how we erase to avoid data falling off the front prematurely.
  • Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior.
  • Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file.

Intended follow-up (me or others):

  • Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense.
  • There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL.
  • The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring.

Test Plan: unit tests updated

Summary: ...

Test Plan: unit tests updated
@pdillinger pdillinger marked this pull request as ready for review January 19, 2024 17:50
@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@jowlyzhang jowlyzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you for this substantial improvement on the class and its usage.

db/seqno_to_time_mapping.h Show resolved Hide resolved
db/seqno_to_time_mapping.cc Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@pdillinger merged this pull request in cb08a68.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Jan 25, 2024
Summary: After facebook#12253 this function has crashed in the crash test, in
its call to `std::copy`. I haven't reproduced the crash directly, but
`std::copy` probably has undefined behavior if the starting iterator is
after the ending iterator, which was possible. I've fixed the logic to
deal with that case and to add an assertion to check that precondition
of `std::copy` (which appears can be unchecked by `std::copy` itself
even with UBSAN+ASAN).

Also added some unit tests etc. that were unfinished for facebook#12253, and
slightly tweak SeqnoToTimeMapping::EnforceMaxTimeSpan handling of zero
time span case.

This is intended for patching 8.11.

Test Plan: tests added. Will trigger ~20 runs of the crash test job that
saw the crash.
facebook-github-bot pushed a commit that referenced this pull request Jan 25, 2024
Summary:
After #12253 this function has crashed in the crash test, in its call to `std::copy`. I haven't reproduced the crash directly, but `std::copy` probably has undefined behavior if the starting iterator is after the ending iterator, which was possible. I've fixed the logic to deal with that case and to add an assertion to check that precondition of `std::copy` (which appears can be unchecked by `std::copy` itself even with UBSAN+ASAN).

Also added some unit tests etc. that were unfinished for #12253, and slightly tweak SeqnoToTimeMapping::EnforceMaxTimeSpan handling of zero time span case.

This is intended for patching 8.11.

Pull Request resolved: #12293

Test Plan: tests added. Will trigger ~20 runs of the crash test job that saw the crash. https://fburl.com/ci/5iiizvfa

Reviewed By: jowlyzhang

Differential Revision: D53090422

Pulled By: pdillinger

fbshipit-source-id: 69d60b1847d9c7e4ae62b153011c2040405db461
pdillinger added a commit that referenced this pull request Jan 25, 2024
Summary:
After #12253 this function has crashed in the crash test, in its call to `std::copy`. I haven't reproduced the crash directly, but `std::copy` probably has undefined behavior if the starting iterator is after the ending iterator, which was possible. I've fixed the logic to deal with that case and to add an assertion to check that precondition of `std::copy` (which appears can be unchecked by `std::copy` itself even with UBSAN+ASAN).

Also added some unit tests etc. that were unfinished for #12253, and slightly tweak SeqnoToTimeMapping::EnforceMaxTimeSpan handling of zero time span case.

This is intended for patching 8.11.

Pull Request resolved: #12293

Test Plan: tests added. Will trigger ~20 runs of the crash test job that saw the crash. https://fburl.com/ci/5iiizvfa

Reviewed By: jowlyzhang

Differential Revision: D53090422

Pulled By: pdillinger

fbshipit-source-id: 69d60b1847d9c7e4ae62b153011c2040405db461
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants