Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-33067 Support creating snapshot based on SCN #2955

Open
wants to merge 1 commit into
base: 11.4
Choose a base branch
from

Conversation

poempeng
Copy link

@poempeng poempeng commented Dec 20, 2023

What's SCN

SCN is a sequence number which was generated while committing transaction, and it should always be increased. With SCN, the snapshot for MVCC can only use a version number to compare with SCN.
To make code simple, we use trx->no as SCN of transaction. During startup, the current trx id and no is initialized by max id stored int trx_sys page, while max_trx_id is multiple of 2 and max_scn is not, so we can distinguish transaction id and SCN stored in record field. For example, if id on startup is 101, then max_scn is 101 and max_trx_id is 102. the increasing step of trx id/scn is 2.

Get SCN

While committing transaction, trx->no (aka SCN) is stored in undo header of transaction.
Before undo log for insert is immediately purged after committing transaction. In this commit, it'll keep insert undo not cleaned, so that user thread can get chance to read scn from undo header. The purge thread is resposible for removing all undo log. In each clust record, it stores roll_ptr which points to undo record in undo space. In order to get undo header address from undo log, we changed the format of undo record by adding extra 9 bytes in header of undo record:
-- 2 bytes: point to next record
-- 1 bytes: 0xFF, indicate it's new version of undo record (added) -- 2 bytes: low 2 bytes of trx id, for verification (added) -- 4 bytes: page number of undo header (added)
-- 2 bytes: offset of undo header inside undo hdr page (added) With this change we can find the undo hdr page, and get SCN number from it. To avoid too frequent reading of undo log, it implementted a scn map which stores trx_id->SCN, SCN_Map is a lock free structure. and it has two map:

  • normal map: add id->scn to it while commtting transaction
  • random map: add id->scn to it after extracting a scn from undo log while reading

WriteBack SCN

While reader got record and check visibility, it'll read in scn of the record, the cursor will be stored in buf_block_t.
A group of background threads take cursor from buf_block_t and process them in background.

Snapshot

Snapshot now has three fields to implement MVCC:

  • For fast checking, up_limit_id and low_limit_id is still preserved, but it's not always precise and just for fast checking.
  • Max SCN value as its version number, all scn bigger than/equal to it is not visible.

Undo Purge

We don't prevent purging undo log, even when related SCN is not written back to record.

New Variable

  1. innodb_use_scn: use scn or not, default value is OFF, this variable is read only, all undo logs should be purged before changing it to different value.
  2. innodb_cleanout_threads: the number of threads to write back scn.

Limitations:

  1. Check table ... extended is not supported.
  2. Undo log optimization of inserting into empty table is not supported.
  • The Jira issue number for this PR is: MDEV-______

Description

TODO: fill description here

How can this PR be tested?

TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch.
  • This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@CLAassistant
Copy link

CLAassistant commented Dec 20, 2023

CLA assistant check
All committers have signed the CLA.

@poempeng poempeng changed the title Support creating snapshot based on SCN MDEV-33067 Support creating snapshot based on SCN Dec 20, 2023
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, this change introduces a Boolean file format option innodb_use_scn. If we implemented this, we should also test how older versions would react to undo logs that exist in this format. The latest change to any InnoDB file format was in the MariaDB Server 10.8 short term release. You’d have to test how MariaDB Server 10.11 (the latest long-term support release) would react.

Changing the option on server restart is only expected to work if the undo logs are logically empty. I see that this rule is being enforced in trx_lists_init_at_db_start().

I think that the parameter needs to be covered in --suite=mariabackup as well. Did all test suites pass when running ./mtr --mysqld=--loose-innodb-use-scn=on? I would like to see a successful CI run that enables this option by default. All builders (including ASAN, MSAN, UBSAN) need to pass all tests in both modes. What is the test suite execution time with both values of the parameter?

Has this been tested with Galera or with replication? Do all nodes have to use the same value of the parameter? How does the SCN affect checks for implicit locks?

I would like to know how the performance impact of this change was tested, so that those tests can be repeated independently and the results analyzed. Do you have any build scripts, benchmark scripts, information on the test environment, and results, comparing to the parent commit as well as comparing the two values of the new configuration parameter innodb_use_scn?

I am worried that the added conditions in low-level code could incur some performance penalty. Could we have a cmake option for enabling this feature, similar to PLUGIN_PERFSCHEMA, WITH_WSREP, or WITH_INNODB_AHI?

@@ -2298,6 +2298,9 @@ page_cur_delete_rec(
consistency checks would fail for the dummy block that is being
used during IMPORT TABLESPACE. */
block->modify_clock++;
if (mtr != nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error: comparison of nonnull parameter 'mtr' not equal to a null pointer is 'true' on first encounter [-Werror,-Wtautological-pointer-compare]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the unnecessary condition.

Comment on lines +1030 to +996
/** Clear the set */
void clear_cursor() const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is this supposed to be called? Could it be called as part of mtr_t::commit()?

ALTER TABLE t1 ADD COLUMN c INT GENERATED ALWAYS AS(a+b), ADD INDEX idx (c), ALGORITHM=INPLACE, LOCK=NONE;
sleep 2;
send ALTER TABLE t1 ADD COLUMN c INT GENERATED ALWAYS AS(a+b), ADD INDEX idx (c), ALGORITHM=INPLACE, LOCK=SHARED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this sleep and how is it expected to interact with the preceding DEBUG_SYNC point inplace_after_index_build?

If we want everything to be purged at this point, SET GLOBAL innodb_max_purge_lag_wait=0; should do the trick more efficiently and reliably.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alter statement will create an readview, and the value of ReadViewBase::m_low_limit_no is obtained from SCN_Mgr::m_safe_limit_no, which is refreshed every second. Here sleep is used to wait SCN_Mgr::m_safe_limit_no to be refreshed, otherwise the subsequent wait_all_purged.inc will be blocked by this alter statement.

Comment on lines 753 to 754
ib::error()
<< "Can't start with innodb_use_scn=OFF while undo is not empty.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should prefer sql_print_error() for simple error messages.

@@ -725,9 +747,48 @@ dberr_t trx_lists_init_at_db_start()
return err;
}

if (srv_operation == SRV_OPERATION_NORMAL) {
if (trx_sys.is_start_from_scn() && !srv_use_scn) {
if (!trx_sys.is_undo_empty() || !purge_sys.purge_queue.empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the check purge_sys.purge_queue.empty() really needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the check purge_sys.purge_queue.empty() is not needed here.

m_startup_id= 0;
m_startup_scn= 0;
m_abort= false;
m_max_cleanout_threads= srv_cleanout_threads;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do without m_max_cleanout_threads? One variable name innodb_cleanout_threads would be more grep friendly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Comment on lines 486 to 493
if (m_scn_map)
{
delete m_scn_map;
}
if (m_random_map)
{
delete m_random_map;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete of a null pointer is allowed already in C++98.

Comment on lines 777 to 733
block= buf_page_get_gen(page_id, zip_size, RW_X_LATCH, nullptr,
BUF_GET_IF_IN_POOL, &mtr, nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUF_GET_IF_IN_POOL does not require zip_size.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

Comment on lines +827 to +788
m_cleanout_task_timer.reset(
srv_thread_pool->create_timer(::cleanout_task_monitor, nullptr));
m_cleanout_task_timer->set_time(0, 1000);

m_view_task_timer.reset(
srv_thread_pool->create_timer(run_view_task, nullptr));
m_view_task_timer->set_time(0, 1000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these for? The commit message does not mention any periodic task or timer.

Could any of this be executed by the purge_coordinator_task?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To optimize the performance of MVCC, we need to write back SCN to TRX_ID fields of rows that touched by reading. These write back operations are done in background tasks. The timer m_view_task_timer is used to traverse trx_sys_t::rw_trx_hash periodically and refresh SCN_Mgr::m_safe_limit_no and SCN_Mgr::m_min_active_id.
The work of these two timers can be done in purge_coordinator_task, but it may make the code become complex.

Comment on lines 1814 to 1815
if (memcmp(mrec_trx_id, rec_trx_id,
DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)) {
DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation was correct before this change.

@poempeng poempeng force-pushed the feature_innodb_scn branch 3 times, most recently from 637704a to 35214ee Compare January 8, 2024 12:03
What's SCN
==========
SCN is a sequence number which was generated while committing
transaction, and it should always be increased. With SCN, the
snapshot for MVCC can only use a version number to compare with
SCN.
To make code simple, we use trx->no as SCN of transaction. During
startup, the current trx id and no is initialized by max id stored
int trx_sys page, while max_trx_id is multiple of 2 and max_scn is not,
so we can distinguish transaction id and SCN stored in record field.
For example, if id on startup is 101, then max_scn is 101 and max_trx_id
is 102. the increasing step of trx id/scn is 2.

Get SCN
=======
While committing transaction, trx->no (aka SCN) is stored in undo header
of transaction.
Before undo log for insert is immediately purged after committing transaction.
In this commit, it'll keep insert undo not cleaned, so that user thread can get
chance to read scn from undo header. The purge thread is resposible for removing all undo log.
In each clust record, it stores roll_ptr which points to undo record in undo space.
In order to get undo header address from undo log, we changed the format of undo record
by adding extra 9 bytes in header of undo record:
-- 2 bytes: point to next record
-- 1 bytes: 0xFF, indicate it's new version of undo record (added)
-- 2 bytes: low 2 bytes of trx id, for verification (added)
-- 4 bytes: page number of undo header (added)
-- 2 bytes: offset of undo header inside undo hdr page (added)
With this change we can find the undo hdr page, and get SCN number from it.
To avoid too frequent reading of undo log, it implementted a scn map which
stores trx_id->SCN, SCN_Map is a lock free structure. and it has two map:
- normal map: add id->scn to it while commtting transaction
- random map: add id->scn to it after extracting a scn from undo log while reading

WriteBack SCN
=============
While reader got record and check visibility, it'll read in scn of the record, the cursor
will be stored in buf_block_t.
A group of background tasks take cursor from buf_block_t and process them in background.

Snapshot
========
Snapshot now has three fields to implement MVCC:
* For fast checking, up_limit_id and low_limit_id is still preserved, but it's not
  always precise and just for fast checking.
* Max SCN value as its version number, all scn bigger than/equal to it is not visible.

Undo Purge
==========
We don't prevent purging undo log, even when related SCN is not written back to record.

New Variable
============
1. innodb_use_scn: use scn or not, default value is OFF, this variable is read only,
all undo logs should be purged before changing it to different value.
2. innodb_cleanout_threads: the number of threads to write back scn.

Limitations:
============
1. Check table ... extended is not supported.
2. Undo log optimization of inserting into empty table is not supported.

CMake option:
=============
WITH_INNODB_SCN, default: ON
Comment on lines +4847 to +4853
dict_index_t *dict_index_get_by_id(table_id_t table_id, index_id_t index_id,
const bool dict_sys_locked) {
/** Get table object by id */
if (!dict_sys_locked) {
dict_sys.freeze(SRW_LOCK_CALL);
}
dict_table_t *table= dict_sys.find_table(table_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only invoked from trx_t::commit_in_memory(), which is defined in another compilation unit. I think that it would be better to define the function close to its use site, for better code cache performance especially when link-time optimization (-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON) is not used.

I don’t think we need any table lookup or the parameter dict_sys_locked at all. Any tables that were modified by the transaction will be available in mod_tables as well as in lock.trx_locks during trx_t::commit_in_memory().

}
}

index->trx_scn= scn;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If multiple transactions that modified records in the same indexes are committing concurrently, it looks like dict_index_t::trx_scn could be overwritten with an older value. What is the impact of that?

Comment on lines +1510 to +1513
for (auto &item : this->scn_indexes)
{
dict_index_t *index= dict_index_get_by_id(
item.first, item.second, this->dict_operation_lock_mode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could trx_t::scn_indexes be merged into trx_t::mod_tables somehow?

MariaDB allows at most MAX_INDEXES to be created, and sql/sql_bitmap.h defines the following:

typedef Bitmap<MAX_INDEXES> key_map;

This could be a more efficient way of identifying the indexes that were modified by a transaction. However, this might be inaccurate in case there are ROLLBACK TO SAVEPOINT or a rollback to the start of the latest statement. Example:

CREATE TABLE t(a INT PRIMARY KEY, b INT UNIQUE,c INT) ENGINE=InnoDB;
INSERT INTO t VALUES(1,1,1); COMMIT;
START TRANSACTION;
UPDATE t SET c=2;
SAVEPOINT a;
UPDATE t SET b=2;
ROLLBACK TO SAVEPOINT a;
COMMIT;

This transaction is only modifying the PRIMARY KEY index, not the unique secondary index. How would this problem be solved?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I later realized that the trx_t::scn_indexes is only used during DDL operations, via row_create_index_for_mysql(). I wonder if we can simplify that and change the way how dict_index_t::trx_id is assigned earlier during DDL.

Comment on lines +1008 to +1010
/** Add record to set */
bool add_lazy_cursor(rec_t *rec, ulint trx_id_offset, trx_id_t id,
trx_id_t scn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be very helpful to document the parameters and their meaning in the function comment. I had to read the code of the added function ReadViewBase::changes_visible() to understand that rec must be a record in the clustered index and this is scheduling an update of the field DB_TRX_ID in that record.

Comment on lines +1017 to +1018
buf_lazy_recs->lazy_recs.insert(
std::make_pair(rec, std::make_tuple(trx_id_offset, id, scn)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not store a single pointer to the DB_TRX_ID field (rec + trx_id_offset)?

I see that this data structure is apparently emptied in operations that would restructure the page, such as btr_page_reorganize_low(), btr_compress(), btr_cur_pessimistic_insert(). That code would best be consolidated into a new function, instead of duplicating the source code lines.

Comment on lines +1434 to +1435
/** SCN of the index */
trx_id_t trx_scn;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment should explain the exact purpose of the field. As far as I can tell, this field is not being restored from any persistent data structures. Modifications appear to be scheduled via a call to trx_t::add_scn_index() in row_create_index_for_mysql(), that is, during DDL operations only.

What is the relationship of dict_index_t::trx_scn and dict_index_t::trx_id? The latter field is not being restored on database startup, because all transactions will be committed or rolled back at server startup and therefore all indexes after crash recovery will be accessible (unless they are marked as corrupted). As far as I could tell (by adding __attribute((deprecated)) to the declarations), the only place where these fields matter is row_merge_is_index_usable(), whose original purpose was to prevent old read views from using indexes that had been created (or tables rebuilt) after the read view creation. Can we do with just dict_index_t::trx_id?

Comment on lines -859 to +861
alignas(CPU_LEVEL1_DCACHE_LINESIZE) Atomic_counter<trx_id_t> m_max_trx_id;
alignas(CPU_LEVEL1_DCACHE_LINESIZE) std::atomic<trx_id_t> m_max_trx_id;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that Atomic_relaxed would be a better choice here. By default, std::atomic would use std::memory_order_seq_cst, which would makes stores more expensive even on processors that implement Total Store Ordering. On processors that employ a weak memory model there will be an increased cost for loads as well. See https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.

Comment on lines +4434 to +4445
#ifdef WITH_INNODB_SCN
if (innodb_use_scn && index->is_clust()) {
if (!index->online_log)
{
scn_mgr.batch_write(block, mtr);
}
else
{
block->clear_cursor();
}
}
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is being duplicated in many places. I think that it would be better to write a separate function for it.

Comment on lines +15215 to +15222
#ifdef WITH_INNODB_SCN
if (m_prebuilt->need_to_access_clustered && innodb_use_scn) {
/* trx id on secondary index page can't be used to accurately determine
visibility while scn is used, so disable it and print a warning */
m_prebuilt->need_to_access_clustered = FALSE;
sql_print_warning("InnoDB: check table ... extended is not supported by scn");
}
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the exact problem, and would it also affect lock_sec_rec_some_has_impl() or row_vers_impl_x_locked_low()?

If I understood it correctly, innodb_use_scn=true causes the DB_TRX_ID in clustered index records to be something else than the last transaction identifier that modified the record. This could potentially break some logic in trx_undo_prev_version_build().

I did not find any change to the way how PAGE_MAX_TRX_ID is updated on secondary index pages. The CHECK TABLE…EXTENDED code is simply traversing the entire available history of clustered index record versions to find out whether a secondary index record is orphan or belongs to some available record version. Why would that not work when innodb_use_scn=true?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants