MDEV-33067 Support creating snapshot based on SCN #2955

poempeng · 2023-12-20T06:05:13Z

What's SCN

SCN is a sequence number which was generated while committing transaction, and it should always be increased. With SCN, the snapshot for MVCC can only use a version number to compare with SCN.
To make code simple, we use trx->no as SCN of transaction. During startup, the current trx id and no is initialized by max id stored int trx_sys page, while max_trx_id is multiple of 2 and max_scn is not, so we can distinguish transaction id and SCN stored in record field. For example, if id on startup is 101, then max_scn is 101 and max_trx_id is 102. the increasing step of trx id/scn is 2.

Get SCN

While committing transaction, trx->no (aka SCN) is stored in undo header of transaction.
Before undo log for insert is immediately purged after committing transaction. In this commit, it'll keep insert undo not cleaned, so that user thread can get chance to read scn from undo header. The purge thread is resposible for removing all undo log. In each clust record, it stores roll_ptr which points to undo record in undo space. In order to get undo header address from undo log, we changed the format of undo record by adding extra 9 bytes in header of undo record:
-- 2 bytes: point to next record
-- 1 bytes: 0xFF, indicate it's new version of undo record (added) -- 2 bytes: low 2 bytes of trx id, for verification (added) -- 4 bytes: page number of undo header (added)
-- 2 bytes: offset of undo header inside undo hdr page (added) With this change we can find the undo hdr page, and get SCN number from it. To avoid too frequent reading of undo log, it implementted a scn map which stores trx_id->SCN, SCN_Map is a lock free structure. and it has two map:

normal map: add id->scn to it while commtting transaction
random map: add id->scn to it after extracting a scn from undo log while reading

WriteBack SCN

While reader got record and check visibility, it'll read in scn of the record, the cursor will be stored in buf_block_t.
A group of background threads take cursor from buf_block_t and process them in background.

Snapshot

Snapshot now has three fields to implement MVCC:

For fast checking, up_limit_id and low_limit_id is still preserved, but it's not always precise and just for fast checking.
Max SCN value as its version number, all scn bigger than/equal to it is not visible.

Undo Purge

We don't prevent purging undo log, even when related SCN is not written back to record.

New Variable

innodb_use_scn: use scn or not, default value is OFF, this variable is read only, all undo logs should be purged before changing it to different value.
innodb_cleanout_threads: the number of threads to write back scn.

Limitations:

Check table ... extended is not supported.
Undo log optimization of inserting into empty table is not supported.

The Jira issue number for this PR is: MDEV-______

Description

TODO: fill description here

How can this PR be tested?

TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

This is a new feature and the PR is based against the latest MariaDB development branch.
This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

CLAassistant · 2023-12-20T06:05:19Z

All committers have signed the CLA.

dr-m

As far as I understand, this change introduces a Boolean file format option innodb_use_scn. If we implemented this, we should also test how older versions would react to undo logs that exist in this format. The latest change to any InnoDB file format was in the MariaDB Server 10.8 short term release. You’d have to test how MariaDB Server 10.11 (the latest long-term support release) would react.

Changing the option on server restart is only expected to work if the undo logs are logically empty. I see that this rule is being enforced in trx_lists_init_at_db_start().

I think that the parameter needs to be covered in --suite=mariabackup as well. Did all test suites pass when running ./mtr --mysqld=--loose-innodb-use-scn=on? I would like to see a successful CI run that enables this option by default. All builders (including ASAN, MSAN, UBSAN) need to pass all tests in both modes. What is the test suite execution time with both values of the parameter?

Has this been tested with Galera or with replication? Do all nodes have to use the same value of the parameter? How does the SCN affect checks for implicit locks?

I would like to know how the performance impact of this change was tested, so that those tests can be repeated independently and the results analyzed. Do you have any build scripts, benchmark scripts, information on the test environment, and results, comparing to the parent commit as well as comparing the two values of the new configuration parameter innodb_use_scn?

I am worried that the added conditions in low-level code could incur some performance penalty. Could we have a cmake option for enabling this feature, similar to PLUGIN_PERFSCHEMA, WITH_WSREP, or WITH_INNODB_AHI?

dr-m · 2024-01-02T13:46:05Z

storage/innobase/page/page0cur.cc

@@ -2298,6 +2298,9 @@ page_cur_delete_rec(
 consistency checks would fail for the dummy block that is being
 used during IMPORT TABLESPACE. */
 block->modify_clock++;
+ if (mtr != nullptr) {


error: comparison of nonnull parameter 'mtr' not equal to a null pointer is 'true' on first encounter [-Werror,-Wtautological-pointer-compare]

I will remove the unnecessary condition.

dr-m · 2024-01-02T13:47:37Z

storage/innobase/include/buf0buf.h

+ /** Clear the set */
+ void clear_cursor() const


When is this supposed to be called? Could it be called as part of mtr_t::commit()?

dr-m · 2024-01-02T13:50:53Z

mysql-test/suite/gcol/t/innodb_virtual_debug_purge.test

 ALTER TABLE t1 ADD COLUMN c INT GENERATED ALWAYS AS(a+b), ADD INDEX idx (c), ALGORITHM=INPLACE, LOCK=NONE;
+sleep 2;
 send ALTER TABLE t1 ADD COLUMN c INT GENERATED ALWAYS AS(a+b), ADD INDEX idx (c), ALGORITHM=INPLACE, LOCK=SHARED;


What is the purpose of this sleep and how is it expected to interact with the preceding DEBUG_SYNC point inplace_after_index_build?

If we want everything to be purged at this point, SET GLOBAL innodb_max_purge_lag_wait=0; should do the trick more efficiently and reliably.

The alter statement will create an readview, and the value of ReadViewBase::m_low_limit_no is obtained from SCN_Mgr::m_safe_limit_no, which is refreshed every second. Here sleep is used to wait SCN_Mgr::m_safe_limit_no to be refreshed, otherwise the subsequent wait_all_purged.inc will be blocked by this alter statement.

dr-m · 2024-01-02T14:03:36Z

storage/innobase/trx/trx0trx.cc

+ ib::error()
+ << "Can't start with innodb_use_scn=OFF while undo is not empty.";


We should prefer sql_print_error() for simple error messages.

dr-m · 2024-01-02T14:04:21Z

storage/innobase/trx/trx0trx.cc

@@ -725,9 +747,48 @@ dberr_t trx_lists_init_at_db_start()
 return err;
 }

+ if (srv_operation == SRV_OPERATION_NORMAL) {
+ if (trx_sys.is_start_from_scn() && !srv_use_scn) {
+ if (!trx_sys.is_undo_empty() || !purge_sys.purge_queue.empty()) {


Is the check purge_sys.purge_queue.empty() really needed?

Well, the check purge_sys.purge_queue.empty() is not needed here.

dr-m · 2024-01-02T14:29:17Z

storage/innobase/read/read0read.cc

+ m_startup_id= 0;
+ m_startup_scn= 0;
+ m_abort= false;
+ m_max_cleanout_threads= srv_cleanout_threads;


Can we do without m_max_cleanout_threads? One variable name innodb_cleanout_threads would be more grep friendly.

dr-m · 2024-01-02T14:29:55Z

storage/innobase/read/read0read.cc

+ if (m_scn_map)
+ {
+ delete m_scn_map;
+ }
+ if (m_random_map)
+ {
+ delete m_random_map;
+ }


delete of a null pointer is allowed already in C++98.

dr-m · 2024-01-02T14:30:57Z

storage/innobase/read/read0read.cc

+ block= buf_page_get_gen(page_id, zip_size, RW_X_LATCH, nullptr,
+ BUF_GET_IF_IN_POOL, &mtr, nullptr);


BUF_GET_IF_IN_POOL does not require zip_size.

dr-m · 2024-01-02T14:33:03Z

storage/innobase/read/read0read.cc

+ m_cleanout_task_timer.reset(
+ srv_thread_pool->create_timer(::cleanout_task_monitor, nullptr));
+ m_cleanout_task_timer->set_time(0, 1000);
+
+ m_view_task_timer.reset(
+ srv_thread_pool->create_timer(run_view_task, nullptr));
+ m_view_task_timer->set_time(0, 1000);


What are these for? The commit message does not mention any periodic task or timer.

Could any of this be executed by the purge_coordinator_task?

To optimize the performance of MVCC, we need to write back SCN to TRX_ID fields of rows that touched by reading. These write back operations are done in background tasks. The timer m_view_task_timer is used to traverse trx_sys_t::rw_trx_hash periodically and refresh SCN_Mgr::m_safe_limit_no and SCN_Mgr::m_min_active_id.
The work of these two timers can be done in purge_coordinator_task, but it may make the code become complex.

dr-m · 2024-01-02T14:33:37Z

storage/innobase/row/row0log.cc

 if (memcmp(mrec_trx_id, rec_trx_id,
-   DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)) {
+   DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)) {


The indentation was correct before this change.

What's SCN ========== SCN is a sequence number which was generated while committing transaction, and it should always be increased. With SCN, the snapshot for MVCC can only use a version number to compare with SCN. To make code simple, we use trx->no as SCN of transaction. During startup, the current trx id and no is initialized by max id stored int trx_sys page, while max_trx_id is multiple of 2 and max_scn is not, so we can distinguish transaction id and SCN stored in record field. For example, if id on startup is 101, then max_scn is 101 and max_trx_id is 102. the increasing step of trx id/scn is 2. Get SCN ======= While committing transaction, trx->no (aka SCN) is stored in undo header of transaction. Before undo log for insert is immediately purged after committing transaction. In this commit, it'll keep insert undo not cleaned, so that user thread can get chance to read scn from undo header. The purge thread is resposible for removing all undo log. In each clust record, it stores roll_ptr which points to undo record in undo space. In order to get undo header address from undo log, we changed the format of undo record by adding extra 9 bytes in header of undo record: -- 2 bytes: point to next record -- 1 bytes: 0xFF, indicate it's new version of undo record (added) -- 2 bytes: low 2 bytes of trx id, for verification (added) -- 4 bytes: page number of undo header (added) -- 2 bytes: offset of undo header inside undo hdr page (added) With this change we can find the undo hdr page, and get SCN number from it. To avoid too frequent reading of undo log, it implementted a scn map which stores trx_id->SCN, SCN_Map is a lock free structure. and it has two map: - normal map: add id->scn to it while commtting transaction - random map: add id->scn to it after extracting a scn from undo log while reading WriteBack SCN ============= While reader got record and check visibility, it'll read in scn of the record, the cursor will be stored in buf_block_t. A group of background tasks take cursor from buf_block_t and process them in background. Snapshot ======== Snapshot now has three fields to implement MVCC: * For fast checking, up_limit_id and low_limit_id is still preserved, but it's not always precise and just for fast checking. * Max SCN value as its version number, all scn bigger than/equal to it is not visible. Undo Purge ========== We don't prevent purging undo log, even when related SCN is not written back to record. New Variable ============ 1. innodb_use_scn: use scn or not, default value is OFF, this variable is read only, all undo logs should be purged before changing it to different value. 2. innodb_cleanout_threads: the number of threads to write back scn. Limitations: ============ 1. Check table ... extended is not supported. 2. Undo log optimization of inserting into empty table is not supported. CMake option: ============= WITH_INNODB_SCN, default: ON

dr-m · 2024-02-13T08:24:33Z

storage/innobase/dict/dict0dict.cc

+dict_index_t *dict_index_get_by_id(table_id_t table_id, index_id_t index_id,
+ const bool dict_sys_locked) {
+ /** Get table object by id */
+ if (!dict_sys_locked) {
+ dict_sys.freeze(SRW_LOCK_CALL);
+ }
+ dict_table_t *table= dict_sys.find_table(table_id);


This is only invoked from trx_t::commit_in_memory(), which is defined in another compilation unit. I think that it would be better to define the function close to its use site, for better code cache performance especially when link-time optimization (-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON) is not used.

I don’t think we need any table lookup or the parameter dict_sys_locked at all. Any tables that were modified by the transaction will be available in mod_tables as well as in lock.trx_locks during trx_t::commit_in_memory().

dr-m · 2024-02-13T08:26:48Z

storage/innobase/trx/trx0trx.cc

+ }
+ }
+
+ index->trx_scn= scn;


If multiple transactions that modified records in the same indexes are committing concurrently, it looks like dict_index_t::trx_scn could be overwritten with an older value. What is the impact of that?

dr-m · 2024-02-13T08:33:27Z

storage/innobase/trx/trx0trx.cc

+ for (auto &item : this->scn_indexes)
+ {
+ dict_index_t *index= dict_index_get_by_id(
+ item.first, item.second, this->dict_operation_lock_mode);


Could trx_t::scn_indexes be merged into trx_t::mod_tables somehow?

MariaDB allows at most MAX_INDEXES to be created, and sql/sql_bitmap.h defines the following:

typedef Bitmap<MAX_INDEXES> key_map;

This could be a more efficient way of identifying the indexes that were modified by a transaction. However, this might be inaccurate in case there are ROLLBACK TO SAVEPOINT or a rollback to the start of the latest statement. Example:

CREATE TABLE t(a INT PRIMARY KEY, b INT UNIQUE,c INT) ENGINE=InnoDB; INSERT INTO t VALUES(1,1,1); COMMIT; START TRANSACTION; UPDATE t SET c=2; SAVEPOINT a; UPDATE t SET b=2; ROLLBACK TO SAVEPOINT a; COMMIT;

This transaction is only modifying the PRIMARY KEY index, not the unique secondary index. How would this problem be solved?

Sorry, I later realized that the trx_t::scn_indexes is only used during DDL operations, via row_create_index_for_mysql(). I wonder if we can simplify that and change the way how dict_index_t::trx_id is assigned earlier during DDL.

dr-m · 2024-02-13T08:41:58Z

storage/innobase/include/buf0buf.h

+ /** Add record to set */
+ bool add_lazy_cursor(rec_t *rec, ulint trx_id_offset, trx_id_t id,
+ trx_id_t scn)


It would be very helpful to document the parameters and their meaning in the function comment. I had to read the code of the added function ReadViewBase::changes_visible() to understand that rec must be a record in the clustered index and this is scheduling an update of the field DB_TRX_ID in that record.

dr-m · 2024-02-13T08:44:40Z

storage/innobase/include/buf0buf.h

+ buf_lazy_recs->lazy_recs.insert(
+ std::make_pair(rec, std::make_tuple(trx_id_offset, id, scn)));


Why do we not store a single pointer to the DB_TRX_ID field (rec + trx_id_offset)?

I see that this data structure is apparently emptied in operations that would restructure the page, such as btr_page_reorganize_low(), btr_compress(), btr_cur_pessimistic_insert(). That code would best be consolidated into a new function, instead of duplicating the source code lines.

dr-m · 2024-02-13T09:08:57Z

storage/innobase/include/dict0mem.h

+ /** SCN of the index */
+ trx_id_t trx_scn;


The comment should explain the exact purpose of the field. As far as I can tell, this field is not being restored from any persistent data structures. Modifications appear to be scheduled via a call to trx_t::add_scn_index() in row_create_index_for_mysql(), that is, during DDL operations only.

What is the relationship of dict_index_t::trx_scn and dict_index_t::trx_id? The latter field is not being restored on database startup, because all transactions will be committed or rolled back at server startup and therefore all indexes after crash recovery will be accessible (unless they are marked as corrupted). As far as I could tell (by adding __attribute((deprecated)) to the declarations), the only place where these fields matter is row_merge_is_index_usable(), whose original purpose was to prevent old read views from using indexes that had been created (or tables rebuilt) after the read view creation. Can we do with just dict_index_t::trx_id?

dr-m · 2024-02-13T09:32:21Z

storage/innobase/include/trx0sys.h

- alignas(CPU_LEVEL1_DCACHE_LINESIZE) Atomic_counter<trx_id_t> m_max_trx_id;
+ alignas(CPU_LEVEL1_DCACHE_LINESIZE) std::atomic<trx_id_t> m_max_trx_id;


I think that Atomic_relaxed would be a better choice here. By default, std::atomic would use std::memory_order_seq_cst, which would makes stores more expensive even on processors that implement Total Store Ordering. On processors that employ a weak memory model there will be an increased cost for loads as well. See https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.

dr-m · 2024-02-13T09:33:53Z

storage/innobase/btr/btr0cur.cc

+#ifdef WITH_INNODB_SCN
+ if (innodb_use_scn && index->is_clust()) {
+ if (!index->online_log)
+ {
+ scn_mgr.batch_write(block, mtr);
+ }
+ else
+ {
+ block->clear_cursor();
+ }
+ }
+#endif


This code is being duplicated in many places. I think that it would be better to write a separate function for it.

dr-m · 2024-02-13T09:46:34Z

storage/innobase/handler/ha_innodb.cc

+#ifdef WITH_INNODB_SCN
+ if (m_prebuilt->need_to_access_clustered && innodb_use_scn) {
+ /* trx id on secondary index page can't be used to accurately determine
+ visibility while scn is used, so disable it and print a warning */
+ m_prebuilt->need_to_access_clustered = FALSE;
+ sql_print_warning("InnoDB: check table ... extended is not supported by scn");
+ }
+#endif


What is the exact problem, and would it also affect lock_sec_rec_some_has_impl() or row_vers_impl_x_locked_low()?

If I understood it correctly, innodb_use_scn=true causes the DB_TRX_ID in clustered index records to be something else than the last transaction identifier that modified the record. This could potentially break some logic in trx_undo_prev_version_build().

I did not find any change to the way how PAGE_MAX_TRX_ID is updated on secondary index pages. The CHECK TABLE…EXTENDED code is simply traversing the entire available history of clustered index record versions to find out whether a secondary index record is orphan or belongs to some available record version. Why would that not work when innodb_use_scn=true?

poempeng changed the title ~~Support creating snapshot based on SCN~~ MDEV-33067 Support creating snapshot based on SCN Dec 20, 2023

poempeng force-pushed the feature_innodb_scn branch from 858e1c2 to 0c34b93 Compare December 20, 2023 06:31

poempeng force-pushed the feature_innodb_scn branch from 0c34b93 to 2c3b22c Compare December 29, 2023 03:09

dr-m requested changes Jan 2, 2024

View reviewed changes

poempeng force-pushed the feature_innodb_scn branch 3 times, most recently from 637704a to 35214ee Compare January 8, 2024 12:03

poempeng force-pushed the feature_innodb_scn branch from 35214ee to af7b084 Compare January 9, 2024 06:01

dr-m reviewed Feb 13, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MDEV-33067 Support creating snapshot based on SCN #2955

MDEV-33067 Support creating snapshot based on SCN #2955

poempeng commented Dec 20, 2023 •

edited

CLAassistant commented Dec 20, 2023 •

edited

dr-m left a comment

dr-m Jan 2, 2024

poempeng Jan 3, 2024

dr-m Jan 2, 2024

dr-m Jan 2, 2024

poempeng Jan 3, 2024

dr-m Jan 2, 2024

dr-m Jan 2, 2024

poempeng Jan 3, 2024

dr-m Jan 2, 2024

poempeng Jan 3, 2024

dr-m Jan 2, 2024

dr-m Jan 2, 2024

poempeng Jan 3, 2024

dr-m Jan 2, 2024

poempeng Jan 3, 2024

dr-m Jan 2, 2024

dr-m Feb 13, 2024

dr-m Feb 13, 2024

dr-m Feb 13, 2024

dr-m Feb 13, 2024

dr-m Feb 13, 2024

dr-m Feb 13, 2024

dr-m Feb 13, 2024

dr-m Feb 13, 2024

dr-m Feb 13, 2024

dr-m Feb 13, 2024

		ib::error()
		<< "Can't start with innodb_use_scn=OFF while undo is not empty.";

		block= buf_page_get_gen(page_id, zip_size, RW_X_LATCH, nullptr,
		BUF_GET_IF_IN_POOL, &mtr, nullptr);

		buf_lazy_recs->lazy_recs.insert(
		std::make_pair(rec, std::make_tuple(trx_id_offset, id, scn)));

		alignas(CPU_LEVEL1_DCACHE_LINESIZE) Atomic_counter<trx_id_t> m_max_trx_id;
		alignas(CPU_LEVEL1_DCACHE_LINESIZE) std::atomic<trx_id_t> m_max_trx_id;

MDEV-33067 Support creating snapshot based on SCN #2955

Are you sure you want to change the base?

MDEV-33067 Support creating snapshot based on SCN #2955

Conversation

poempeng commented Dec 20, 2023 • edited

What's SCN

Get SCN

WriteBack SCN

Snapshot

Undo Purge

New Variable

Limitations:

Description

How can this PR be tested?

Basing the PR against the correct MariaDB version

PR quality check

CLAassistant commented Dec 20, 2023 • edited

dr-m left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

poempeng commented Dec 20, 2023 •

edited

CLAassistant commented Dec 20, 2023 •

edited