Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-21322: report slave progress to the master #2374

Open
wants to merge 5 commits into
base: 10.11
Choose a base branch
from

Conversation

an3l
Copy link
Collaborator

@an3l an3l commented Dec 14, 2022

  • We are trying to extend the command SHOW REPLICA HOSTS that is
    executed on master, with columns GTID_state_sent and GTID_state_ack.
    In order to achieve that we have to extend thd->slave_info struct with
    2 new entries. This struct needs to be populated and must be accessable to primary.

  • The first column GTID_state_sent:

    • It is populated by the last event that primary has sent to replica
      (only for semi-sync slave connection, but without need to know
      that replica actually obtained request,
      like it is case in asynchronous replication).
    • Primary executes mysql_binlog that first starts the binlog_dump thread.
      That thread starts ack_receiver thread, that returns second column
      of interest (in first iteration we don't care about).
      After binlog_dump thread creating, file is sent with
      send_one_binlog_file() function.
      This function is sending event with send_events() function,
      during which send_event_to_slave() function is executed.
      On that place we are creating the thd->slave_info->gtid_state_sent struct.
  • The second column GTID_state_ack:

    • It is populated by the reply_packet_binlog (a.k.a. ack) that is called,
      only for semi-sync replication. It is called on 2 places:
      a) during binlog_dump creation, after thread is added to
      ack_reciver and after thread is added as semi-sync slave. This
      should be first ack received from replica to primary.
      b) constantly in running phase of ack_receiver thread run().
    • A the end of function reply_packet_binlog we are creating the
      struct GTID_state_ack
  • Closes PR#1427

@an3l an3l added the MariaDB Foundation Pull requests created by MariaDB Foundation label Dec 14, 2022
@an3l an3l added this to the 10.9 milestone Dec 14, 2022
@an3l an3l requested a review from bnestere December 14, 2022 09:42
@an3l an3l self-assigned this Dec 14, 2022
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch 4 times, most recently from 9f4e1ab to d23fbd6 Compare December 20, 2022 12:46
SHOW REPLICA HOSTS;
Server_id Host Port Master_id Semisync File Position Gtid
3 slave2 SLAVE_PORT2 1 OFF 0
3 slave2 SLAVE_PORT2 1 OFF master-bin.000001 1018 0-1-4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. Perhaps a bug (although not related to the task at-hand). @andrelkin if a slave re-connects while a master is awaiting a semi-sync ACK, show slave hosts will report the same slave multiple times, until the timeout finishes and the thread finishes up. Should we add logic (in a different ticket) on slave connect, to check for any existing connections with that server, and clean them up if they exist?


connection master;
set global rpl_semi_sync_master_enabled = 0;
set global rpl_semi_sync_master_timeout= default;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to restore the server to its state at the start of the test, rather than hard-coded 0 and default (similar to $save_wait_point).

@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from d23fbd6 to ca263db Compare January 20, 2023 10:45
@an3l an3l requested a review from ottok as a code owner January 20, 2023 10:45
@CLAassistant
Copy link

CLAassistant commented Jan 20, 2023

CLA assistant check
All committers have signed the CLA.

@an3l an3l changed the base branch from 10.9 to 10.11 January 20, 2023 10:46
@an3l an3l removed the request for review from ottok January 20, 2023 10:47
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch 3 times, most recently from c762bbc to 0619127 Compare January 20, 2023 15:51
Copy link
Contributor

@bnestere bnestere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Anel! Thanks for the patch. I left a few notes, overall it is coming along nicely :)

sql/sql_repl.cc Outdated Show resolved Hide resolved
sql/semisync_master.cc Outdated Show resolved Hide resolved
@bnestere bnestere self-assigned this Jan 26, 2023
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from 0619127 to e01713c Compare March 14, 2023 16:09
@an3l
Copy link
Collaborator Author

an3l commented Mar 14, 2023

Hi @bnestere I have created first part of your review (wip patch)- busy replica and there are some questions/problems with tests. Tests are simulating busy replica and I have 2 problems (as stated in commit message, please see the results there).

  1. randomnes of events for test 7 (problem of corrupt_queue_event)
  2. cannot save the result (problem of corrupt_queue_event)
    Please let me know your opinion about :).

@an3l an3l removed their assignment Mar 28, 2023
stop slave;
reset slave;
set global rpl_semi_sync_slave_enabled = 1;
start slave;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have more consistent macros for the stop, reset, and start commands which are safer for testing as

--source include/stop_slave.inc
--source include/reset_slave.inc
--source include/start_slave.inc

where the stop and start scrips wait for the slave threads to actually start or stop; and the reset one will also reset the GTID state (otherwise transactions with lesser GTIDs won't actually be re-run)

--echo # Should be reset somehow for this particular slave (server_id=3) and wait that yes_tx be 2 ? Or other apporach?
--echo # Note that primary doesn't report not acked transaction by replicas.
--echo # Update: Looking into documentation - "semisync replication waits only on 1 replica to respond" ->
--echo # So according to documentation it is ok, but in later example you will see that both replica have sent ACK.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually because the master is re-sending old transactions. Only new transactions on the master side will be ACKed by the replica. See logic pertaining to k_packet_flag_sync for more details. This is a good test to keep to prove that is the case, can you re-word the test description to reflect that behavior? Then the next test case does ACK a new transaction, which is good.

INSERT INTO t VALUES (5);
--echo # Question 2: Is this the problem ? We have sent the event to replica that
--echo # has corrupted connection. I understand that we expect to have result SENT, and not ACK,
--echo # but since we are iterating through slave_info, this slave is not found,why?In previous test slave_info is found.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question.. I think I led you slightly astray with corrupt_queue_event, as it forces the IO thread to error, eventually disconnecting from the master. Then once a dump thread disconnects on the master, it will call into unregister_slave(), thereby removing the slave_info.

I think instead it would make more sense to use dbug_sync points, as you can manually wait/progress the IO threads queueing to simulate its slowness (and report show slave hosts specifically while the IO thread is in a WAIT_FOR).

@@ -679,6 +672,11 @@ int Repl_semi_sync_master::report_reply_binlog(uint32 server_id,
m_wait_file_name_inited = false;
}
}
if (rpl_semi_sync_master_enabled)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this check is necessary, as the function preamble already checks get_master_enabled() and exits if not.

@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from e01713c to 538f2f6 Compare May 25, 2023 14:07
# --echo Write new event IO thread is running
INSERT INTO t VALUES (505);

# Should `yes_tx` be incremented, I guess not, since it is async
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it should not when timeout=0

show variables like 'rpl_semi_sync_slave_enabled';
Variable_name Value
rpl_semi_sync_slave_enabled OFF
# Q2: Here we don't get any row for server_id=2, but we should? we get only `Gtid_State_Sent= Gtid_State_Ack=0-1-5` for server_id=3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, as when the IO thread stopped on the replica, it shut down its connection to the primary. When the primary sees the connection is taken down, it deletes the thd->slave_info variable.

START SLAVE IO_THREAD;
include/wait_for_slave_io_to_start.inc
# Q*: Stoping and starting the new IO thread will restart slave sent ack count, see below.
# So we basically want this to be recorded in edge case, right ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, as these are slave-side variables, and the slave has no consideration of the master's semisync timeout value

# Q*: Stoping and starting the new IO thread will restart slave sent ack count, see below.
# So we basically want this to be recorded in edge case, right ?
# And that is rpl_semi_sync_master_timeout=0 and rpl_semi_sync_master_enabled=TRUE
# Can we use this as patch instead of injecting the DBUG_SYNC?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No; the purpose of the DEBUG_SYNC point is to delay the slave's ACK reply after receiving the events. So the IO thread needs to be active. The process of the DEBUG_SYNC check would then resemble the following;

  1. Execute a query on the primary (there should be no wait on the connection)
  2. The primary sends the corresponding event group to the replica
  3. IO thread receives the corresponding event
  4. IO thread stops on a DEBUG_SYNC now WAIT_FOR...
  5. We check SHOW REPLICA HOSTS on the primary and ensure that Gtid_state_sent > Gtid_state_ack
  6. We signal the IO thread to continue
  7. Wait for the Rpl_semi_sync_master_get_ack status variable to increment
  8. Check SHOW REPLICA HOSTS to ensure Gtid_state_sent now matches Gtid_state_ack

# So we basically want this to be recorded in edge case, right ?
# And that is rpl_semi_sync_master_timeout=0 and rpl_semi_sync_master_enabled=TRUE
# Can we use this as patch instead of injecting the DBUG_SYNC?
# We are expecting here `Rpl_semi_sync_slave_send_ack == 4`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite, as the replica will only send ACK replies for the active transaction on the primary. Because we have 2 replicas here, and one is running, it is able to ACK the transaction, so when the other starts up, the flag requiring an ACK is never set, and so the replica never replies (hence why Rpl_semi_sync_slave_send_ack is always 0).

I imagine that if this was the only replica, when the primary sends 0-1-6, it would set the flag that the event needs to be ACKed. Then the replica will ACK only on the last transaction, and Rpl_semi_sync_slave_send_ack should eventually be 1.

Variable_name Value
Rpl_semi_sync_master_clients 2
# Q3: here we get `Gtid_State_Sent= 0-1-5 != Gtid_State_Ack=''` (but Gtid_State_Ack should be 0-1-4 right?)
# So should we get for server_id=2 `Gtid_State_Sent= 0-1-6 != Gtid_State_Ack=0-1-4` ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your question ties into the DEBUG_SYNC answer. By keeping the IO thread on, and using DEBUG_SYNC to only pausing it (thereby maintaining the binlog_dump_thread connection), this would be the case when the slave just had not yet replied with the ACK (but again, only on the active transaction).

2 localhost SLAVE_PORT 1 0-1-6
3 slave2 SLAVE_PORT 1 0-1-6 0-1-6
connection slave;
# Q4: here we get sync good with server_id=3, `Gtid_State_Sent= Gtid_State_Ack=0-1-6` for server_id=2?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, as this now deals with the active transaction logic.. Because there were no new transactions, this would present as the last transaction that the replica had sent an ACK for. I see that being potentially misleading to users when the replica does catch up, yet presents an old Gtid_state_ack. I see two options, either document the behavior (my preference), or we'd add additional if (!timeout) behavior to sometimes allow for "special" ACKs to non-active transactions.. @andrelkin what do you think?

show status like 'Rpl_semi_sync_master_yes_tx';
Variable_name Value
Rpl_semi_sync_master_yes_tx 4
# Q1: Here we need to get the `Gtid_State_Sent=0-1-5 != Gtid_State_Ack=0-1-4` for server_id=2 OR MAYBE THIS IS GOOD,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the IO thread ON (and no reason to pause/error) then the replica will ACK for the latest transaction, so both .._sent and .._ack would be 0-1-5. So the current result seems right to me.

@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch 2 times, most recently from dac9166 to 5f0eab5 Compare June 1, 2023 03:56
AFTER_COMMIT
reset master;
# Test 1: Primary has not enabled semisync
# Note: even if not enabled semisync 2 new columns should be visible
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than describing as "2 new columns" and "should return 6 columns", I'd be better to be specific as to exactly which columns you are expecting. The numbers don't mean much after the patch has been pushed, and 6 can change later

2 localhost SLAVE_PORT 1 0-1-2 0-1-2
# Test 5: Enable semi-sync for slave2
# Note 1: Replica (server_id=3) should be started and semi-sync enabled
# Note 2: In semy-sync replication ACK thread (on master),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo of semy-sync

Rpl_semi_sync_master_clients 1
SHOW REPLICA HOSTS;
Server_id Host Port Master_id Gtid_State_Sent Gtid_State_Ack
2 localhost SLAVE_PORT 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems time-dependent, where sometimes Gtid_State_Sent could have 0-1-5, and sometimes not, so I would suggest only keeping the SHOW REPLICA HOSTS after syncing the replica with the master

connection slave2;
--let $save_server_3_dbug= `SELECT @@GLOBAL.DEBUG_DBUG`
SET @@GLOBAL.DEBUG_DBUG="d,synchronize_semisync_slave_reply";
--source include/start_slave.inc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd probably here want

SET debug_sync='now WAIT_FOR at_slave_reply';

like you have for the slave to prove the following claim of both replicas being stopped

@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from 5b8d725 to e963de5 Compare June 25, 2023 12:11
an3l added a commit to an3l/server that referenced this pull request Jun 25, 2023
+ We are trying to extend the command `SHOW REPLICA HOSTS` that is
executed on master, with columns `GTID_state_sent` and `GTID_state_ack`.
In order to achieve that we have to extend `thd->slave_info` struct with
2 new entries. This struct needs to be populated and  must be accessable to primary.

+ The first column `GTID_state_sent`:
  - It is populated by the last event that primary has sent to replica
    (only for semi-sync slave connection, but without need to know
    that replica actually obtained request,
    like it is case in asynchronous replication).
  - Primary executes `mysql_binlog` that first starts the `binlog_dump` thread.
    That thread starts `ack_receiver` thread, that returns second column
    of interest (in first iteration we don't care about).
    After `binlog_dump` thread creating, file is sent with
    `send_one_binlog_file()` function.
    This function is sending event with `send_events()` function,
    during which `send_event_to_slave()` function is executed.
    On that place we are creating the `thd->slave_info->gtid_state_sent` struct.

+ The second column `GTID_state_ack`:
  - It is populated by the `reply_packet_binlog` (a.k.a. `ack`) that is called,
    only for semi-sync replication. It is called on 2 places:
    a) during `binlog_dump` creation, after thread is added to
    `ack_reciver` and after thread is added as semi-sync slave. This
    should be first ack received from replica to primary.
    b) constantly in running phase of `ack_receiver` thread `run()`.
  - A the end of function `reply_packet_binlog` we are creating the
  struct `GTID_state_ack`

+ The third column `Sync_Status`
  - It should be filled with values `asynchronous`,`semi-sync active`,
    `semi-sync stale` in order to express the state of replica.
  - `semi-sync [active/stale]` related states depend on the value of `need_sync`
     that is updated on master and will be `semi-sync active` iff:
    - primary and replicas are configured to be in semi-sync mode and
    - if the event is a transaction's ending event.
    Otherwise state is `semi-sync stale`.
  - state is `asynchronous` if semi-sync is disabled on master as fast
  check and if there thd is not semi-sync thread.

+ Handling of `rpl_semi_sync_master_timeout= 0` means that primary status
  `rpl_semi_sync_master_status` will remain ON instead of switching off to async
  and incrementing the counter updates.

+ PR closes MariaDB#2374, MariaDB#1427

Reviewer:<[email protected]>
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from e963de5 to 0ee671a Compare June 25, 2023 12:59
an3l added a commit to an3l/server that referenced this pull request Jun 25, 2023
+ We are trying to extend the command `SHOW REPLICA HOSTS` that is
executed on master, with columns `GTID_state_sent` and `GTID_state_ack`.
In order to achieve that we have to extend `thd->slave_info` struct with
2 new entries. This struct needs to be populated and  must be accessable to primary.

+ The first column `GTID_state_sent`:
  - It is populated by the last event that primary has sent to replica
    (only for semi-sync slave connection, but without need to know
    that replica actually obtained request,
    like it is case in asynchronous replication).
  - Primary executes `mysql_binlog` that first starts the `binlog_dump` thread.
    That thread starts `ack_receiver` thread, that returns second column
    of interest (in first iteration we don't care about).
    After `binlog_dump` thread creating, file is sent with
    `send_one_binlog_file()` function.
    This function is sending event with `send_events()` function,
    during which `send_event_to_slave()` function is executed.
    On that place we are creating the `thd->slave_info->gtid_state_sent` struct.

+ The second column `GTID_state_ack`:
  - It is populated by the `reply_packet_binlog` (a.k.a. `ack`) that is called,
    only for semi-sync replication. It is called on 2 places:
    a) during `binlog_dump` creation, after thread is added to
    `ack_reciver` and after thread is added as semi-sync slave. This
    should be first ack received from replica to primary.
    b) constantly in running phase of `ack_receiver` thread `run()`.
  - A the end of function `reply_packet_binlog` we are creating the
  struct `GTID_state_ack`

+ The third column `Sync_Status`
  - It should be filled with values `asynchronous`,`semi-sync active`,
    `semi-sync stale` in order to express the state of replica.
  - `semi-sync [active/stale]` related states depend on the value of `need_sync`
     that is updated on master and will be `semi-sync active` iff:
    - primary and replicas are configured to be in semi-sync mode and
    - if the event is a transaction's ending event.
    Otherwise state is `semi-sync stale`.
  - state is `asynchronous` if semi-sync is disabled on master as fast
  check and if there thd is not semi-sync thread.

+ Handling of `rpl_semi_sync_master_timeout= 0` means that primary status
  `rpl_semi_sync_master_status` will remain ON instead of switching off to async
  and incrementing the counter updates.

+ PR closes MariaDB#2374, MariaDB#1427

Reviewer:<[email protected]>
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from 0ee671a to cf65491 Compare June 25, 2023 14:04
an3l added a commit to an3l/server that referenced this pull request Jun 25, 2023
+ We are trying to extend the command `SHOW REPLICA HOSTS` that is
executed on master, with columns `GTID_state_sent` and `GTID_state_ack`.
In order to achieve that we have to extend `thd->slave_info` struct with
2 new entries. This struct needs to be populated and  must be accessable to primary.

+ The first column `GTID_state_sent`:
  - It is populated by the last event that primary has sent to replica
    (only for semi-sync slave connection, but without need to know
    that replica actually obtained request,
    like it is case in asynchronous replication).
  - Primary executes `mysql_binlog` that first starts the `binlog_dump` thread.
    That thread starts `ack_receiver` thread, that returns second column
    of interest (in first iteration we don't care about).
    After `binlog_dump` thread creating, file is sent with
    `send_one_binlog_file()` function.
    This function is sending event with `send_events()` function,
    during which `send_event_to_slave()` function is executed.
    On that place we are creating the `thd->slave_info->gtid_state_sent` struct.

+ The second column `GTID_state_ack`:
  - It is populated by the `reply_packet_binlog` (a.k.a. `ack`) that is called,
    only for semi-sync replication. It is called on 2 places:
    a) during `binlog_dump` creation, after thread is added to
    `ack_reciver` and after thread is added as semi-sync slave. This
    should be first ack received from replica to primary.
    b) constantly in running phase of `ack_receiver` thread `run()`.
  - A the end of function `reply_packet_binlog` we are creating the
  struct `GTID_state_ack`

+ The third column `Sync_Status`
  - It should be filled with values `asynchronous`,`semi-sync active`,
    `semi-sync stale` in order to express the state of replica.
  - `semi-sync [active/stale]` related states depend on the value of `need_sync`
     that is updated on master and will be `semi-sync active` iff:
    - primary and replicas are configured to be in semi-sync mode and
    - if the event is a transaction's ending event.
    Otherwise state is `semi-sync stale`.
  - state is `asynchronous` if semi-sync is disabled on master as fast
  check and if there thd is not semi-sync thread.

+ Handling of `rpl_semi_sync_master_timeout= 0` means that primary status
  `rpl_semi_sync_master_status` will remain ON instead of switching off to async
  and incrementing the counter updates.

+ PR closes MariaDB#2374, MariaDB#1427

Reviewer:<[email protected]>
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from cf65491 to 7cdbbbb Compare June 25, 2023 15:46
an3l added a commit to an3l/server that referenced this pull request Jul 1, 2023
+ We are trying to extend the command `SHOW REPLICA HOSTS` that is
executed on master, with columns `GTID_state_sent` and `GTID_state_ack`.
In order to achieve that we have to extend `thd->slave_info` struct with
2 new entries. This struct needs to be populated and  must be accessable to primary.

+ The first column `GTID_state_sent`:
  - It is populated by the last event that primary has sent to replica
    (only for semi-sync slave connection, but without need to know
    that replica actually obtained request,
    like it is case in asynchronous replication).
  - Primary executes `mysql_binlog` that first starts the `binlog_dump` thread.
    That thread starts `ack_receiver` thread, that returns second column
    of interest (in first iteration we don't care about).
    After `binlog_dump` thread creating, file is sent with
    `send_one_binlog_file()` function.
    This function is sending event with `send_events()` function,
    during which `send_event_to_slave()` function is executed.
    On that place we are creating the `thd->slave_info->gtid_state_sent` struct.

+ The second column `GTID_state_ack`:
  - It is populated by the `reply_packet_binlog` (a.k.a. `ack`) that is called,
    only for semi-sync replication. It is called on 2 places:
    a) during `binlog_dump` creation, after thread is added to
    `ack_reciver` and after thread is added as semi-sync slave. This
    should be first ack received from replica to primary.
    b) constantly in running phase of `ack_receiver` thread `run()`.
  - A the end of function `reply_packet_binlog` we are creating the
  struct `GTID_state_ack`

+ The third column `Sync_Status`
  - It should be filled with values `asynchronous`,`semi-sync active`,
    `semi-sync stale` in order to express the state of replica.
  - `semi-sync [active/stale]` related states depend on the value of `need_sync`
     that is updated on master and will be `semi-sync active` iff:
    - primary and replicas are configured to be in semi-sync mode and
    - if the event is a transaction's ending event.
    Otherwise state is `semi-sync stale`.
  - state is `asynchronous` if semi-sync is disabled on master as fast
  check and if there thd is not semi-sync thread.

+ Handling of `rpl_semi_sync_master_timeout= 0` means that primary status
  `rpl_semi_sync_master_status` will remain ON instead of switching off to async
  and incrementing the counter updates.

+ PR closes MariaDB#2374, MariaDB#1427

Reviewer:<[email protected]>
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from 7cdbbbb to 50c8d40 Compare July 1, 2023 13:55
an3l added a commit to an3l/server that referenced this pull request Jul 1, 2023
+ We are trying to extend the command `SHOW REPLICA HOSTS` that is
executed on master, with columns `GTID_state_sent` and `GTID_state_ack`.
In order to achieve that we have to extend `thd->slave_info` struct with
2 new entries. This struct needs to be populated and  must be accessable to primary.

+ The first column `GTID_state_sent`:
  - It is populated by the last event that primary has sent to replica
    (only for semi-sync slave connection, but without need to know
    that replica actually obtained request,
    like it is case in asynchronous replication).
  - Primary executes `mysql_binlog` that first starts the `binlog_dump` thread.
    That thread starts `ack_receiver` thread, that returns second column
    of interest (in first iteration we don't care about).
    After `binlog_dump` thread creating, file is sent with
    `send_one_binlog_file()` function.
    This function is sending event with `send_events()` function,
    during which `send_event_to_slave()` function is executed.
    On that place we are creating the `thd->slave_info->gtid_state_sent` struct.

+ The second column `GTID_state_ack`:
  - It is populated by the `reply_packet_binlog` (a.k.a. `ack`) that is called,
    only for semi-sync replication. It is called on 2 places:
    a) during `binlog_dump` creation, after thread is added to
    `ack_reciver` and after thread is added as semi-sync slave. This
    should be first ack received from replica to primary.
    b) constantly in running phase of `ack_receiver` thread `run()`.
  - A the end of function `reply_packet_binlog` we are creating the
  struct `GTID_state_ack`

+ The third column `Sync_Status`
  - It should be filled with values `asynchronous`,`semi-sync active`,
    `semi-sync stale` in order to express the state of replica.
  - `semi-sync [active/stale]` related states depend on the value of `need_sync`
     that is updated on master and will be `semi-sync active` iff:
    - primary and replicas are configured to be in semi-sync mode and
    - if the event is a transaction's ending event.
    Otherwise state is `semi-sync stale`.
  - state is `asynchronous` if semi-sync is disabled on master as fast
  check and if there thd is not semi-sync thread.

+ Handling of `rpl_semi_sync_master_timeout= 0` means that primary status
  `rpl_semi_sync_master_status` will remain ON instead of switching off to async
  and incrementing the counter updates.

+ PR closes MariaDB#2374, MariaDB#1427

Reviewer:<[email protected]>
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from 50c8d40 to 6b8b018 Compare July 1, 2023 19:52
an3l added a commit to an3l/server that referenced this pull request Aug 20, 2023
+ We are trying to extend the command `SHOW REPLICA HOSTS` that is
executed on master, with columns `GTID_state_sent` and `GTID_state_ack`.
In order to achieve that we have to extend `thd->slave_info` struct with
2 new entries. This struct needs to be populated and  must be accessable to primary.

+ The first column `GTID_state_sent`:
  - It is populated by the last event that primary has sent to replica
    (only for semi-sync slave connection, but without need to know
    that replica actually obtained request,
    like it is case in asynchronous replication).
  - Primary executes `mysql_binlog` that first starts the `binlog_dump` thread.
    That thread starts `ack_receiver` thread, that returns second column
    of interest (in first iteration we don't care about).
    After `binlog_dump` thread creating, file is sent with
    `send_one_binlog_file()` function.
    This function is sending event with `send_events()` function,
    during which `send_event_to_slave()` function is executed.
    On that place we are creating the `thd->slave_info->gtid_state_sent` struct.

+ The second column `GTID_state_ack`:
  - It is populated by the `reply_packet_binlog` (a.k.a. `ack`) that is called,
    only for semi-sync replication. It is called on 2 places:
    a) during `binlog_dump` creation, after thread is added to
    `ack_reciver` and after thread is added as semi-sync slave. This
    should be first ack received from replica to primary.
    b) constantly in running phase of `ack_receiver` thread `run()`.
  - A the end of function `reply_packet_binlog` we are creating the
  struct `GTID_state_ack`

+ The third column `Sync_Status`
  - It should be filled with values `asynchronous`,`semi-sync active`,
    `semi-sync stale` in order to express the state of replica.
  - `semi-sync [active/stale]` related states depend on the value of `need_sync`
     that is updated on master and will be `semi-sync active` iff:
    - primary and replicas are configured to be in semi-sync mode and
    - if the event is a transaction's ending event.
    Otherwise state is `semi-sync stale`.
  - state is `asynchronous` if semi-sync is disabled on master as fast
  check and if there thd is not semi-sync thread.

+ Handling of `rpl_semi_sync_master_timeout= 0` means that primary status
  `rpl_semi_sync_master_status` will remain ON instead of switching off to async
  and incrementing the counter updates.

+ PR closes MariaDB#2374, MariaDB#1427

Reviewer:<[email protected]>
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from 6b8b018 to e74d798 Compare August 20, 2023 17:41
an3l added a commit that referenced this pull request Aug 21, 2023
+ We are trying to extend the command `SHOW REPLICA HOSTS` that is
executed on master, with columns `GTID_state_sent` and `GTID_state_ack`.
In order to achieve that we have to extend `thd->slave_info` struct with
2 new entries. This struct needs to be populated and  must be accessable to primary.

+ The first column `GTID_state_sent`:
  - It is populated by the last event that primary has sent to replica
    (only for semi-sync slave connection, but without need to know
    that replica actually obtained request,
    like it is case in asynchronous replication).
  - Primary executes `mysql_binlog` that first starts the `binlog_dump` thread.
    That thread starts `ack_receiver` thread, that returns second column
    of interest (in first iteration we don't care about).
    After `binlog_dump` thread creating, file is sent with
    `send_one_binlog_file()` function.
    This function is sending event with `send_events()` function,
    during which `send_event_to_slave()` function is executed.
    On that place we are creating the `thd->slave_info->gtid_state_sent` struct.

+ The second column `GTID_state_ack`:
  - It is populated by the `reply_packet_binlog` (a.k.a. `ack`) that is called,
    only for semi-sync replication. It is called on 2 places:
    a) during `binlog_dump` creation, after thread is added to
    `ack_reciver` and after thread is added as semi-sync slave. This
    should be first ack received from replica to primary.
    b) constantly in running phase of `ack_receiver` thread `run()`.
  - A the end of function `reply_packet_binlog` we are creating the
  struct `GTID_state_ack`

+ The third column `Sync_Status`
  - It should be filled with values `asynchronous`,`semi-sync active`,
    `semi-sync stale` in order to express the state of replica.
  - `semi-sync [active/stale]` related states depend on the value of `need_sync`
     that is updated on master and will be `semi-sync active` iff:
    - primary and replicas are configured to be in semi-sync mode and
    - if the event is a transaction's ending event.
    Otherwise state is `semi-sync stale`.
  - state is `asynchronous` if semi-sync is disabled on master as fast
  check and if there thd is not semi-sync thread.

+ Handling of `rpl_semi_sync_master_timeout= 0` means that primary status
  `rpl_semi_sync_master_status` will remain ON instead of switching off to async
  and incrementing the counter updates.

+ PR closes #2374, #1427

Reviewer:<[email protected]>
Thanks <[email protected]> for reviewing the failure and
suppresions.
an3l added a commit to an3l/server that referenced this pull request Feb 26, 2024
+ We are trying to extend the command `SHOW REPLICA HOSTS` that is
executed on master, with columns `GTID_state_sent` and `GTID_state_ack`.
In order to achieve that we have to extend `thd->slave_info` struct with
2 new entries. This struct needs to be populated and  must be accessable to primary.

+ The first column `GTID_state_sent`:
  - It is populated by the last event that primary has sent to replica
    (only for semi-sync slave connection, but without need to know
    that replica actually obtained request,
    like it is case in asynchronous replication).
  - Primary executes `mysql_binlog` that first starts the `binlog_dump` thread.
    That thread starts `ack_receiver` thread, that returns second column
    of interest (in first iteration we don't care about).
    After `binlog_dump` thread creating, file is sent with
    `send_one_binlog_file()` function.
    This function is sending event with `send_events()` function,
    during which `send_event_to_slave()` function is executed.
    On that place we are creating the `thd->slave_info->gtid_state_sent` struct.

+ The second column `GTID_state_ack`:
  - It is populated by the `reply_packet_binlog` (a.k.a. `ack`) that is called,
    only for semi-sync replication. It is called on 2 places:
    a) during `binlog_dump` creation, after thread is added to
    `ack_reciver` and after thread is added as semi-sync slave. This
    should be first ack received from replica to primary.
    b) constantly in running phase of `ack_receiver` thread `run()`.
  - A the end of function `reply_packet_binlog` we are creating the
  struct `GTID_state_ack`

+ The third column `Sync_Status`
  - It should be filled with values `asynchronous`,`semi-sync active`,
    `semi-sync stale` in order to express the state of replica.
  - `semi-sync [active/stale]` related states depend on the value of `need_sync`
     that is updated on master and will be `semi-sync active` iff:
    - primary and replicas are configured to be in semi-sync mode and
    - if the event is a transaction's ending event.
    Otherwise state is `semi-sync stale`.
  - state is `asynchronous` if semi-sync is disabled on master as fast
  check and if there thd is not semi-sync thread.

+ Handling of `rpl_semi_sync_master_timeout= 0` means that primary status
  `rpl_semi_sync_master_status` will remain ON instead of switching off to async
  and incrementing the counter updates.

+ PR closes MariaDB#2374, MariaDB#1427

Reviewer:<[email protected]>
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from e74d798 to 0a8ca59 Compare February 26, 2024 14:58
Copy link
Contributor

@bnestere bnestere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Anel,

Thanks for rebasing the patch. I gave it another pass-through with a few comments. There are also a few un-resolved old comments that are still relevant.

show variables like 'rpl_semi_sync_slave_status';
--echo # slave_send_ack should be reseted when replica is stopped, but it is not (TODO?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not on stop (so statistics are saved if the slave is stopped), it is reset when the slave io thread starts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks have updated in new commit.


# Write the new event
INSERT INTO t VALUES (10);

let $status_var= Rpl_semi_sync_master_status;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't delete the expected behavior out. I suppose we can adapt the test to allow it so it passes (and the actual gtid_state_sent/gtid_state_ack is consistent); but we should file a new MDEV about the broken behavior, comment out (but leave) code that supports the expected behavior, and add a comment referencing the new MDEV to un-comment the "good" path with the fix. We should follow this pattern for all the current broken semi-sync behaviors too, e.g. that described in test case 6:

# It may happen that rpl_semi_sync_master_no_transactions increments for above
# query.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree have added MDEV-33615

@@ -531,7 +563,7 @@ class Repl_semi_sync_master
* Return:
* 0: success; non-zero: error
*/
int report_reply_binlog(uint32 server_id,
int report_reply_binlog(THD *replica_thd, uint32 server_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple minor comments here:

  1. We don't need the whole THD; but we can pass the thd->slave_info because that is all that is needed.
  2. We can remove the server_id parameter; because it can be taken from thd->slave_info.server_id.

an3l added a commit to an3l/server that referenced this pull request Mar 6, 2024
+ We are trying to extend the command `SHOW REPLICA HOSTS` that is
executed on master, with columns `GTID_state_sent` and `GTID_state_ack`.
In order to achieve that we have to extend `thd->slave_info` struct with
2 new entries. This struct needs to be populated and  must be accessable to primary.

+ The first column `GTID_state_sent`:
  - It is populated by the last event that primary has sent to replica
    (only for semi-sync slave connection, but without need to know
    that replica actually obtained request,
    like it is case in asynchronous replication).
  - Primary executes `mysql_binlog` that first starts the `binlog_dump` thread.
    That thread starts `ack_receiver` thread, that returns second column
    of interest (in first iteration we don't care about).
    After `binlog_dump` thread creating, file is sent with
    `send_one_binlog_file()` function.
    This function is sending event with `send_events()` function,
    during which `send_event_to_slave()` function is executed.
    On that place we are creating the `thd->slave_info->gtid_state_sent` struct.

+ The second column `GTID_state_ack`:
  - It is populated by the `reply_packet_binlog` (a.k.a. `ack`) that is called,
    only for semi-sync replication. It is called on 2 places:
    a) during `binlog_dump` creation, after thread is added to
    `ack_reciver` and after thread is added as semi-sync slave. This
    should be first ack received from replica to primary.
    b) constantly in running phase of `ack_receiver` thread `run()`.
  - A the end of function `reply_packet_binlog` we are creating the
  struct `GTID_state_ack`

+ The third column `Sync_Status`
  - It should be filled with values `asynchronous`,`semi-sync active`,
    `semi-sync stale` in order to express the state of replica.
  - `semi-sync [active/stale]` related states depend on the value of `need_sync`
     that is updated on master and will be `semi-sync active` iff:
    - primary and replicas are configured to be in semi-sync mode and
    - if the event is a transaction's ending event.
    Otherwise state is `semi-sync stale`.
  - state is `asynchronous` if semi-sync is disabled on master as fast
  check and if there thd is not semi-sync thread.

+ Handling of `rpl_semi_sync_master_timeout= 0` means that primary status
  `rpl_semi_sync_master_status` will remain ON instead of switching off to async
  and incrementing the counter updates.

+ PR closes MariaDB#2374, MariaDB#1427

Reviewer:<[email protected]>
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch 3 times, most recently from e13ace8 to d7afc1e Compare March 11, 2024 10:36
Copy link
Contributor

@bnestere bnestere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+ We are trying to extend the command `SHOW REPLICA HOSTS` that is
executed on master, with columns `GTID_state_sent` and `GTID_state_ack`.
In order to achieve that we have to extend `thd->slave_info` struct with
2 new entries. This struct needs to be populated and  must be accessable to primary.

+ The first column `GTID_state_sent`:
  - It is populated by the last event that primary has sent to replica
    (only for semi-sync slave connection, but without need to know
    that replica actually obtained request,
    like it is case in asynchronous replication).
  - Primary executes `mysql_binlog` that first starts the `binlog_dump` thread.
    That thread starts `ack_receiver` thread, that returns second column
    of interest (in first iteration we don't care about).
    After `binlog_dump` thread creating, file is sent with
    `send_one_binlog_file()` function.
    This function is sending event with `send_events()` function,
    during which `send_event_to_slave()` function is executed.
    On that place we are creating the `thd->slave_info->gtid_state_sent` struct.

+ The second column `GTID_state_ack`:
  - It is populated by the `reply_packet_binlog` (a.k.a. `ack`) that is called,
    only for semi-sync replication. It is called on 2 places:
    a) during `binlog_dump` creation, after thread is added to
    `ack_reciver` and after thread is added as semi-sync slave. This
    should be first ack received from replica to primary.
    b) constantly in running phase of `ack_receiver` thread `run()`.
  - A the end of function `reply_packet_binlog` we are creating the
  struct `GTID_state_ack`

+ The third column `Sync_Status`
  - It should be filled with values `asynchronous`,`semi-sync active`,
    `semi-sync stale` in order to express the state of replica.
  - `semi-sync [active/stale]` related states depend on the value of `need_sync`
     that is updated on master and will be `semi-sync active` iff:
    - primary and replicas are configured to be in semi-sync mode and
    - if the event is a transaction's ending event.
    Otherwise state is `semi-sync stale`.
  - state is `asynchronous` if semi-sync is disabled on master as fast
  check and if there thd is not semi-sync thread.

+ Handling of `rpl_semi_sync_master_timeout= 0` means that primary status
  `rpl_semi_sync_master_status` will remain ON instead of switching off to async
  and incrementing the counter updates.

+ PR closes MariaDB#2374, MariaDB#1427

Reviewer:<[email protected]>
1. Delete redudant check of `rpl_semi_sync_master_enabled`
2. Move check of `m_wait_timeout` at the beginning
3. Pass by address `thd->slave_info` struct  instead of `thd` to `report_reply_binlog()` and remove `server_id` argument that will be used from `thd->slave_info.server_id`
@an3l an3l force-pushed the bb-10.9-anel-repl-slave_report-v2 branch from d7afc1e to e53ff4d Compare March 31, 2024 11:44
@an3l
Copy link
Collaborator Author

an3l commented Mar 31, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MariaDB Foundation Pull requests created by MariaDB Foundation
3 participants