-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MDEV-33067 Support creating snapshot based on SCN #2955
base: 11.4
Are you sure you want to change the base?
Conversation
858e1c2
to
0c34b93
Compare
0c34b93
to
2c3b22c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, this change introduces a Boolean file format option innodb_use_scn
. If we implemented this, we should also test how older versions would react to undo logs that exist in this format. The latest change to any InnoDB file format was in the MariaDB Server 10.8 short term release. You’d have to test how MariaDB Server 10.11 (the latest long-term support release) would react.
Changing the option on server restart is only expected to work if the undo logs are logically empty. I see that this rule is being enforced in trx_lists_init_at_db_start()
.
I think that the parameter needs to be covered in --suite=mariabackup
as well. Did all test suites pass when running ./mtr --mysqld=--loose-innodb-use-scn=on
? I would like to see a successful CI run that enables this option by default. All builders (including ASAN, MSAN, UBSAN) need to pass all tests in both modes. What is the test suite execution time with both values of the parameter?
Has this been tested with Galera or with replication? Do all nodes have to use the same value of the parameter? How does the SCN affect checks for implicit locks?
I would like to know how the performance impact of this change was tested, so that those tests can be repeated independently and the results analyzed. Do you have any build scripts, benchmark scripts, information on the test environment, and results, comparing to the parent commit as well as comparing the two values of the new configuration parameter innodb_use_scn
?
I am worried that the added conditions in low-level code could incur some performance penalty. Could we have a cmake
option for enabling this feature, similar to PLUGIN_PERFSCHEMA
, WITH_WSREP
, or WITH_INNODB_AHI
?
storage/innobase/page/page0cur.cc
Outdated
@@ -2298,6 +2298,9 @@ page_cur_delete_rec( | |||
consistency checks would fail for the dummy block that is being | |||
used during IMPORT TABLESPACE. */ | |||
block->modify_clock++; | |||
if (mtr != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error: comparison of nonnull parameter 'mtr' not equal to a null pointer is 'true' on first encounter [-Werror,-Wtautological-pointer-compare]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove the unnecessary condition.
/** Clear the set */ | ||
void clear_cursor() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is this supposed to be called? Could it be called as part of mtr_t::commit()
?
ALTER TABLE t1 ADD COLUMN c INT GENERATED ALWAYS AS(a+b), ADD INDEX idx (c), ALGORITHM=INPLACE, LOCK=NONE; | ||
sleep 2; | ||
send ALTER TABLE t1 ADD COLUMN c INT GENERATED ALWAYS AS(a+b), ADD INDEX idx (c), ALGORITHM=INPLACE, LOCK=SHARED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this sleep
and how is it expected to interact with the preceding DEBUG_SYNC
point inplace_after_index_build
?
If we want everything to be purged at this point, SET GLOBAL innodb_max_purge_lag_wait=0;
should do the trick more efficiently and reliably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alter statement will create an readview, and the value of ReadViewBase::m_low_limit_no
is obtained from SCN_Mgr::m_safe_limit_no
, which is refreshed every second. Here sleep
is used to wait SCN_Mgr::m_safe_limit_no
to be refreshed, otherwise the subsequent wait_all_purged.inc
will be blocked by this alter statement.
storage/innobase/trx/trx0trx.cc
Outdated
ib::error() | ||
<< "Can't start with innodb_use_scn=OFF while undo is not empty."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should prefer sql_print_error()
for simple error messages.
storage/innobase/trx/trx0trx.cc
Outdated
@@ -725,9 +747,48 @@ dberr_t trx_lists_init_at_db_start() | |||
return err; | |||
} | |||
|
|||
if (srv_operation == SRV_OPERATION_NORMAL) { | |||
if (trx_sys.is_start_from_scn() && !srv_use_scn) { | |||
if (!trx_sys.is_undo_empty() || !purge_sys.purge_queue.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the check purge_sys.purge_queue.empty()
really needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the check purge_sys.purge_queue.empty()
is not needed here.
storage/innobase/read/read0read.cc
Outdated
m_startup_id= 0; | ||
m_startup_scn= 0; | ||
m_abort= false; | ||
m_max_cleanout_threads= srv_cleanout_threads; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do without m_max_cleanout_threads
? One variable name innodb_cleanout_threads
would be more grep
friendly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
storage/innobase/read/read0read.cc
Outdated
if (m_scn_map) | ||
{ | ||
delete m_scn_map; | ||
} | ||
if (m_random_map) | ||
{ | ||
delete m_random_map; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete
of a null pointer is allowed already in C++98.
storage/innobase/read/read0read.cc
Outdated
block= buf_page_get_gen(page_id, zip_size, RW_X_LATCH, nullptr, | ||
BUF_GET_IF_IN_POOL, &mtr, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BUF_GET_IF_IN_POOL
does not require zip_size
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
m_cleanout_task_timer.reset( | ||
srv_thread_pool->create_timer(::cleanout_task_monitor, nullptr)); | ||
m_cleanout_task_timer->set_time(0, 1000); | ||
|
||
m_view_task_timer.reset( | ||
srv_thread_pool->create_timer(run_view_task, nullptr)); | ||
m_view_task_timer->set_time(0, 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these for? The commit message does not mention any periodic task or timer.
Could any of this be executed by the purge_coordinator_task
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To optimize the performance of MVCC, we need to write back SCN to TRX_ID fields of rows that touched by reading. These write back operations are done in background tasks. The timer m_view_task_timer
is used to traverse trx_sys_t::rw_trx_hash
periodically and refresh SCN_Mgr::m_safe_limit_no
and SCN_Mgr::m_min_active_id
.
The work of these two timers can be done in purge_coordinator_task
, but it may make the code become complex.
storage/innobase/row/row0log.cc
Outdated
if (memcmp(mrec_trx_id, rec_trx_id, | ||
DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)) { | ||
DATA_TRX_ID_LEN + DATA_ROLL_PTR_LEN)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indentation was correct before this change.
637704a
to
35214ee
Compare
What's SCN ========== SCN is a sequence number which was generated while committing transaction, and it should always be increased. With SCN, the snapshot for MVCC can only use a version number to compare with SCN. To make code simple, we use trx->no as SCN of transaction. During startup, the current trx id and no is initialized by max id stored int trx_sys page, while max_trx_id is multiple of 2 and max_scn is not, so we can distinguish transaction id and SCN stored in record field. For example, if id on startup is 101, then max_scn is 101 and max_trx_id is 102. the increasing step of trx id/scn is 2. Get SCN ======= While committing transaction, trx->no (aka SCN) is stored in undo header of transaction. Before undo log for insert is immediately purged after committing transaction. In this commit, it'll keep insert undo not cleaned, so that user thread can get chance to read scn from undo header. The purge thread is resposible for removing all undo log. In each clust record, it stores roll_ptr which points to undo record in undo space. In order to get undo header address from undo log, we changed the format of undo record by adding extra 9 bytes in header of undo record: -- 2 bytes: point to next record -- 1 bytes: 0xFF, indicate it's new version of undo record (added) -- 2 bytes: low 2 bytes of trx id, for verification (added) -- 4 bytes: page number of undo header (added) -- 2 bytes: offset of undo header inside undo hdr page (added) With this change we can find the undo hdr page, and get SCN number from it. To avoid too frequent reading of undo log, it implementted a scn map which stores trx_id->SCN, SCN_Map is a lock free structure. and it has two map: - normal map: add id->scn to it while commtting transaction - random map: add id->scn to it after extracting a scn from undo log while reading WriteBack SCN ============= While reader got record and check visibility, it'll read in scn of the record, the cursor will be stored in buf_block_t. A group of background tasks take cursor from buf_block_t and process them in background. Snapshot ======== Snapshot now has three fields to implement MVCC: * For fast checking, up_limit_id and low_limit_id is still preserved, but it's not always precise and just for fast checking. * Max SCN value as its version number, all scn bigger than/equal to it is not visible. Undo Purge ========== We don't prevent purging undo log, even when related SCN is not written back to record. New Variable ============ 1. innodb_use_scn: use scn or not, default value is OFF, this variable is read only, all undo logs should be purged before changing it to different value. 2. innodb_cleanout_threads: the number of threads to write back scn. Limitations: ============ 1. Check table ... extended is not supported. 2. Undo log optimization of inserting into empty table is not supported. CMake option: ============= WITH_INNODB_SCN, default: ON
35214ee
to
af7b084
Compare
dict_index_t *dict_index_get_by_id(table_id_t table_id, index_id_t index_id, | ||
const bool dict_sys_locked) { | ||
/** Get table object by id */ | ||
if (!dict_sys_locked) { | ||
dict_sys.freeze(SRW_LOCK_CALL); | ||
} | ||
dict_table_t *table= dict_sys.find_table(table_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only invoked from trx_t::commit_in_memory()
, which is defined in another compilation unit. I think that it would be better to define the function close to its use site, for better code cache performance especially when link-time optimization (-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON
) is not used.
I don’t think we need any table lookup or the parameter dict_sys_locked
at all. Any tables that were modified by the transaction will be available in mod_tables
as well as in lock.trx_locks
during trx_t::commit_in_memory()
.
} | ||
} | ||
|
||
index->trx_scn= scn; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If multiple transactions that modified records in the same indexes are committing concurrently, it looks like dict_index_t::trx_scn
could be overwritten with an older value. What is the impact of that?
for (auto &item : this->scn_indexes) | ||
{ | ||
dict_index_t *index= dict_index_get_by_id( | ||
item.first, item.second, this->dict_operation_lock_mode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could trx_t::scn_indexes
be merged into trx_t::mod_tables
somehow?
MariaDB allows at most MAX_INDEXES
to be created, and sql/sql_bitmap.h
defines the following:
typedef Bitmap<MAX_INDEXES> key_map;
This could be a more efficient way of identifying the indexes that were modified by a transaction. However, this might be inaccurate in case there are ROLLBACK TO SAVEPOINT
or a rollback to the start of the latest statement. Example:
CREATE TABLE t(a INT PRIMARY KEY, b INT UNIQUE,c INT) ENGINE=InnoDB;
INSERT INTO t VALUES(1,1,1); COMMIT;
START TRANSACTION;
UPDATE t SET c=2;
SAVEPOINT a;
UPDATE t SET b=2;
ROLLBACK TO SAVEPOINT a;
COMMIT;
This transaction is only modifying the PRIMARY KEY
index, not the unique secondary index. How would this problem be solved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I later realized that the trx_t::scn_indexes
is only used during DDL operations, via row_create_index_for_mysql()
. I wonder if we can simplify that and change the way how dict_index_t::trx_id
is assigned earlier during DDL.
/** Add record to set */ | ||
bool add_lazy_cursor(rec_t *rec, ulint trx_id_offset, trx_id_t id, | ||
trx_id_t scn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be very helpful to document the parameters and their meaning in the function comment. I had to read the code of the added function ReadViewBase::changes_visible()
to understand that rec
must be a record in the clustered index and this is scheduling an update of the field DB_TRX_ID
in that record.
buf_lazy_recs->lazy_recs.insert( | ||
std::make_pair(rec, std::make_tuple(trx_id_offset, id, scn))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we not store a single pointer to the DB_TRX_ID
field (rec + trx_id_offset
)?
I see that this data structure is apparently emptied in operations that would restructure the page, such as btr_page_reorganize_low()
, btr_compress()
, btr_cur_pessimistic_insert()
. That code would best be consolidated into a new function, instead of duplicating the source code lines.
/** SCN of the index */ | ||
trx_id_t trx_scn; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment should explain the exact purpose of the field. As far as I can tell, this field is not being restored from any persistent data structures. Modifications appear to be scheduled via a call to trx_t::add_scn_index()
in row_create_index_for_mysql()
, that is, during DDL operations only.
What is the relationship of dict_index_t::trx_scn
and dict_index_t::trx_id
? The latter field is not being restored on database startup, because all transactions will be committed or rolled back at server startup and therefore all indexes after crash recovery will be accessible (unless they are marked as corrupted). As far as I could tell (by adding __attribute((deprecated))
to the declarations), the only place where these fields matter is row_merge_is_index_usable()
, whose original purpose was to prevent old read views from using indexes that had been created (or tables rebuilt) after the read view creation. Can we do with just dict_index_t::trx_id
?
alignas(CPU_LEVEL1_DCACHE_LINESIZE) Atomic_counter<trx_id_t> m_max_trx_id; | ||
alignas(CPU_LEVEL1_DCACHE_LINESIZE) std::atomic<trx_id_t> m_max_trx_id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that Atomic_relaxed
would be a better choice here. By default, std::atomic
would use std::memory_order_seq_cst
, which would makes stores more expensive even on processors that implement Total Store Ordering. On processors that employ a weak memory model there will be an increased cost for loads as well. See https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.
#ifdef WITH_INNODB_SCN | ||
if (innodb_use_scn && index->is_clust()) { | ||
if (!index->online_log) | ||
{ | ||
scn_mgr.batch_write(block, mtr); | ||
} | ||
else | ||
{ | ||
block->clear_cursor(); | ||
} | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is being duplicated in many places. I think that it would be better to write a separate function for it.
#ifdef WITH_INNODB_SCN | ||
if (m_prebuilt->need_to_access_clustered && innodb_use_scn) { | ||
/* trx id on secondary index page can't be used to accurately determine | ||
visibility while scn is used, so disable it and print a warning */ | ||
m_prebuilt->need_to_access_clustered = FALSE; | ||
sql_print_warning("InnoDB: check table ... extended is not supported by scn"); | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the exact problem, and would it also affect lock_sec_rec_some_has_impl()
or row_vers_impl_x_locked_low()
?
If I understood it correctly, innodb_use_scn=true
causes the DB_TRX_ID
in clustered index records to be something else than the last transaction identifier that modified the record. This could potentially break some logic in trx_undo_prev_version_build()
.
I did not find any change to the way how PAGE_MAX_TRX_ID
is updated on secondary index pages. The CHECK TABLE…EXTENDED
code is simply traversing the entire available history of clustered index record versions to find out whether a secondary index record is orphan or belongs to some available record version. Why would that not work when innodb_use_scn=true
?
What's SCN
SCN is a sequence number which was generated while committing transaction, and it should always be increased. With SCN, the snapshot for MVCC can only use a version number to compare with SCN.
To make code simple, we use trx->no as SCN of transaction. During startup, the current trx id and no is initialized by max id stored int trx_sys page, while max_trx_id is multiple of 2 and max_scn is not, so we can distinguish transaction id and SCN stored in record field. For example, if id on startup is 101, then max_scn is 101 and max_trx_id is 102. the increasing step of trx id/scn is 2.
Get SCN
While committing transaction, trx->no (aka SCN) is stored in undo header of transaction.
Before undo log for insert is immediately purged after committing transaction. In this commit, it'll keep insert undo not cleaned, so that user thread can get chance to read scn from undo header. The purge thread is resposible for removing all undo log. In each clust record, it stores roll_ptr which points to undo record in undo space. In order to get undo header address from undo log, we changed the format of undo record by adding extra 9 bytes in header of undo record:
-- 2 bytes: point to next record
-- 1 bytes: 0xFF, indicate it's new version of undo record (added) -- 2 bytes: low 2 bytes of trx id, for verification (added) -- 4 bytes: page number of undo header (added)
-- 2 bytes: offset of undo header inside undo hdr page (added) With this change we can find the undo hdr page, and get SCN number from it. To avoid too frequent reading of undo log, it implementted a scn map which stores trx_id->SCN, SCN_Map is a lock free structure. and it has two map:
WriteBack SCN
While reader got record and check visibility, it'll read in scn of the record, the cursor will be stored in buf_block_t.
A group of background threads take cursor from buf_block_t and process them in background.
Snapshot
Snapshot now has three fields to implement MVCC:
Undo Purge
We don't prevent purging undo log, even when related SCN is not written back to record.
New Variable
Limitations:
Description
TODO: fill description here
How can this PR be tested?
TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".
If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.
Basing the PR against the correct MariaDB version
PR quality check