Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async rollback prepared transactions during binlog crash recovery #3030

Open
wants to merge 1 commit into
base: 10.6
Choose a base branch
from

Conversation

SongLibing
Copy link
Contributor

@SongLibing SongLibing commented Jan 30, 2024

MDEV-33853 Async rollback prepared transactions during binlog
crash recovery

Summary

When doing server recovery, the active transactions will be rolled
back by InnoDB background rollback thread automatically. The
prepared transactions will be committed or rolled back accordingly
by binlog recovery. Binlog recovery is done in main thread before
the server can provide service to users. If there is a big
transaction to rollback, the server will not available for a long
time.

This patch provides a way to rollback the prepared transactions
asynchronously. Thus the rollback will not block server startup.

Design

  • Handler::recover_rollback_by_xid()
    This patch provides a new handler interface to rollback transactions
    in recover phase. InnoDB just set the transaction's state to active.
    Then the transaction will be rolled back by the background rollback
    thread.

  • Handler::signal_tc_log_recover_done()
    This function is called after tc log is opened(typically binlog opened)
    has done. When this function is called, all transactions will be rolled
    back have been reverted to ACTIVE state. Thus it starts rollback thread
    to rollback the transactions.

  • Background rollback thread
    With this patch, background rollback thread is defered to run until binlog
    recovery is finished. It is started by innobase_tc_log_recovery_done().

  • The Jira issue number for this PR is: MDEV-33853

Description

TODO: fill description here

How can this PR be tested?

TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch.
  • This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

storage/innobase/trx/trx0roll.cc Outdated Show resolved Hide resolved
storage/innobase/trx/trx0roll.cc Outdated Show resolved Hide resolved
storage/innobase/trx/trx0roll.cc Outdated Show resolved Hide resolved
storage/innobase/handler/ha_innodb.cc Outdated Show resolved Hide resolved
storage/innobase/handler/ha_innodb.cc Outdated Show resolved Hide resolved
storage/innobase/handler/ha_innodb.cc Show resolved Hide resolved
storage/innobase/handler/ha_innodb.cc Outdated Show resolved Hide resolved
storage/innobase/handler/ha_innodb.cc Outdated Show resolved Hide resolved
storage/innobase/handler/ha_innodb.cc Outdated Show resolved Hide resolved
storage/innobase/handler/ha_innodb.cc Outdated Show resolved Hide resolved
sql/handler.cc Outdated Show resolved Hide resolved
@grooverdan
Copy link
Member

@SongLibing are you ok to finish this?

@SongLibing
Copy link
Contributor Author

@SongLibing are you ok to finish this?

Sorry for the delay. I have been very busy these days. I can probably finish the update in next week.

@SongLibing SongLibing force-pushed the binlog_recover_async_rollback branch 2 times, most recently from f7758d1 to 6c1a4e6 Compare April 4, 2024 19:43
@SongLibing
Copy link
Contributor Author

Hi @dr-m , The patch is updated, please have a look. @knielsen Marko suggest you to review the patch, please have a look.

Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is a step forward, but I think that it could be simplified further.

sql/handler.cc Outdated Show resolved Hide resolved
storage/innobase/trx/trx0roll.cc Outdated Show resolved Hide resolved
@SongLibing SongLibing force-pushed the binlog_recover_async_rollback branch from 771d19e to 5105774 Compare April 5, 2024 17:49
@SongLibing SongLibing requested a review from dr-m April 7, 2024 11:36
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is definitely moving to the right direction, but some details still need to be sorted out. Did you file an MDEV ticket for this?

storage/innobase/log/log0log.cc Outdated Show resolved Hide resolved
storage/innobase/srv/srv0start.cc Outdated Show resolved Hide resolved
@SongLibing SongLibing force-pushed the binlog_recover_async_rollback branch from 5105774 to 038784a Compare April 8, 2024 12:56
Copy link
Member

@knielsen knielsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at the patch 5105774, I only looked at the non-InnoDB parts in detail.

I think the patch looks solid, it does all the things already I would expect it to do. I had one suggestion to clarify the comments on the new handlerton calls introduced.

This part of the commit message I did not understand:

"With this patch, background rollback thread will not exit unless ddl log
  recovery is finished. It guarantees that all recovered prepared transactions
  which reverted to active state will be rolled back by the background thread."

Maybe you can elaborate/explain how this patch relates to the ddl log recovery?

sql/handler.h Outdated Show resolved Hide resolved
@SongLibing
Copy link
Contributor Author

I took a look at the patch 5105774, I only looked at the non-InnoDB parts in detail.

I think the patch looks solid, it does all the things already I would expect it to do. I had one suggestion to clarify the comments on the new handlerton calls introduced.

This part of the commit message I did not understand:

"With this patch, background rollback thread will not exit unless ddl log
  recovery is finished. It guarantees that all recovered prepared transactions
  which reverted to active state will be rolled back by the background thread."

Maybe you can elaborate/explain how this patch relates to the ddl log recovery?

It is not related to ddl log in the latest patch, the comment message has been updated.

Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. The InnoDB part looks almost OK to me. I reviewed the generated code when this was cherry-picked to 10.6 bb2e125.

I hope that @knielsen can soon comment on the code outside InnoDB.

storage/innobase/trx/trx0sys.cc Outdated Show resolved Hide resolved
storage/innobase/trx/trx0roll.cc Outdated Show resolved Hide resolved
storage/innobase/srv/srv0start.cc Outdated Show resolved Hide resolved
storage/innobase/include/trx0trx.h Outdated Show resolved Hide resolved
storage/innobase/include/trx0trx.h Outdated Show resolved Hide resolved
storage/innobase/log/log0log.cc Outdated Show resolved Hide resolved
storage/innobase/include/trx0sys.h Outdated Show resolved Hide resolved
storage/innobase/include/trx0sys.h Show resolved Hide resolved
@SongLibing SongLibing force-pushed the binlog_recover_async_rollback branch 2 times, most recently from 606b059 to 07e6743 Compare April 18, 2024 15:14
@SongLibing SongLibing requested a review from dr-m April 18, 2024 22:49
@SongLibing SongLibing force-pushed the binlog_recover_async_rollback branch from 07e6743 to 0783240 Compare April 19, 2024 13:38
Copy link
Contributor

@andrelkin andrelkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good work LiBing!

I have two points to raise.

sql/handler.cc Outdated Show resolved Hide resolved
sql/handler.cc Outdated Show resolved Hide resolved
sql/handler.h Outdated Show resolved Hide resolved
@SongLibing SongLibing force-pushed the binlog_recover_async_rollback branch 2 times, most recently from 133465d to c590cc0 Compare April 23, 2024 06:50
@andrelkin
Copy link
Contributor

Looks good. Thanks LiBing!

@andrelkin andrelkin closed this Apr 23, 2024
@andrelkin andrelkin reopened this Apr 23, 2024
@andrelkin
Copy link
Contributor

Sorry, in the last comment I meant a mere approval.

@SongLibing SongLibing changed the base branch from 11.4 to 10.6 May 6, 2024 15:20
           crash recovery

Summary
=======
When doing server recovery, the active transactions will be rolled
back by InnoDB background rollback thread automatically. The
prepared transactions will be committed or rolled back accordingly
by binlog recovery. Binlog recovery is done in main thread before
the server can provide service to users. If there is a big
transaction to rollback, the server will not available for a long
time.

This patch provides a way to rollback the prepared transactions
asynchronously. Thus the rollback will not block server startup.

Design
======
- Handler::recover_rollback_by_xid()
  This patch provides a new handler interface to rollback transactions
  in recover phase. InnoDB just set the transaction's state to active.
  Then the transaction will be rolled back by the background rollback
  thread.

- Handler::signal_tc_log_recover_done()
  This function is called after tc log is opened(typically binlog opened)
  has done. When this function is called, all transactions will be rolled
  back have been reverted to ACTIVE state. Thus it starts rollback thread
  to rollback the transactions.

- Background rollback thread
  With this patch, background rollback thread is defered to run until binlog
  recovery is finished. It is started by innobase_tc_log_recovery_done().
@SongLibing SongLibing force-pushed the binlog_recover_async_rollback branch from c590cc0 to d31208d Compare May 6, 2024 23:43
Copy link
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, the InnoDB part of this looks mostly OK now. I think that we should try to clean up the shutdown logic a bit more, to avoid any potential race condition.

I did not check the changes outside InnoDB, and I did not compile the code or run the test case.

@@ -418,6 +420,7 @@ class rw_trx_hash_t
(trx_state_eq(trx, TRX_STATE_ACTIVE) &&
(!srv_was_started ||
srv_read_only_mode ||
!trx_rollback_is_active ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we fail to invoke rollback_all_recovered_task.wait() in logs_empty_and_mark_files_at_shutdown() or srv_shutdown_threads(). It seems that this added condition might introduce a resource leak if the rollback is still running during shutdown. I would suggest to replace the following code in logs_empty_and_mark_files_at_shutdown():

	/* We need these threads to stop early in shutdown. */
	const char* thread_name = srv_fast_shutdown != 2
		&& trx_rollback_is_active
		? "rollback of recovered transactions" : nullptr;

with something like this (note: I hope that it will not hang if there was no prior call to srv_thread_pool->submit_task(&rollback_all_recovered_task) in case startup was aborted due to some error):

	if (srv_fast_shutdown != 2) {
		rollback_all_recovered_task.wait();
		ut_ad(!trx_rollback_is_active);
	}

This logic will have to be tested with various values of innodb_force_recovery and innodb_fast_shutdown.

I would also suggest to minimize the references to trx_rollback_is_active in other code. Ideally we would only have a few ut_ad(!trx_rollback_is_active) where appropriate.

OPTION(WITH_ASYNC_BINLOG_ROLLBACK
"Rollback prepared transactions asynchronously at binlog recovery" OFF)
IF(WITH_ASYNC_BINLOG_ROLLBACK)
SET(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DASYNC_BINLOG_ROLLBACK")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the change to CMAKE_C_FLAGS? All affected files seem to be C++.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
5 participants