# MDEV-33817: AVX-512BW and VPCLMULQDQ based CRC-32 for x86 and x86-64 (#3195)
## Description
Intel AVX512 instructions allow a 512-bit carry-less product to be computed, as well as 512-bit unaligned loads of data. This allows any CRC polynomial up to degree 32 to be used, simply by passing different constant tables to the same implementation.

This is based on the assembler code in https://github.com/intel/intel-ipsec-mb/ with some minor optimization and cleanup. https://github.com/dr-m/crc32_simd is a stand-alone version of this, covering both normal and reversed polynomials.
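To illustrate the approach, here is a minimal sketch of one 512-bit folding step using compiler intrinsics. This is not the code of this PR; `fold_k` is a hypothetical placeholder for the precomputed folding constants, which depend only on the CRC polynomial. That is what allows the same kernel to serve both CRC-32 and CRC-32C by swapping tables.

```c
#include <immintrin.h>

/* One 512-bit fold step of a carry-less CRC: fold the 64-byte
   accumulator over the next 64 bytes of input.  Compile with
   -mavx512bw -mvpclmulqdq (or equivalent target attributes). */
static inline __m512i fold_512(__m512i crc, const void *buf, __m512i fold_k)
{
  __m512i data = _mm512_loadu_si512(buf); /* 64-byte unaligned load */
  /* Carry-less multiply the low and high 64-bit halves of each
     128-bit lane by the per-polynomial constants. */
  __m512i lo = _mm512_clmulepi64_epi64(crc, fold_k, 0x00);
  __m512i hi = _mm512_clmulepi64_epi64(crc, fold_k, 0x11);
  /* crc' = lo ^ hi ^ data; immediate 0x96 is a three-way XOR. */
  return _mm512_ternarylogic_epi64(lo, hi, data, 0x96);
}
```

The main loop of such a kernel would keep several of these 64-byte accumulators in flight; with four of them, each iteration would consume the 256 bytes mentioned in the release notes below.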
## Release Notes
If the required x86 instruction set extensions are available, the server will report a message on startup to indicate that CRC-32 and CRC-32C will be computed up to 256 bytes per loop iteration.
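The exact message text is not reproduced here. The gating condition amounts to a runtime check along the following lines; this is a hedged sketch using the GCC/clang builtin, not the server's actual detection code, and the printed strings are illustrative only.

```c
#include <stdbool.h>
#include <stdio.h>

/* Both extensions must be available at runtime for the new code path. */
static bool have_avx512_crc(void)
{
  return __builtin_cpu_supports("avx512bw")
      && __builtin_cpu_supports("vpclmulqdq");
}

int main(void)
{
  if (have_avx512_crc())
    puts("CRC-32 and CRC-32C: 256 bytes per loop iteration"); /* illustrative */
  else
    puts("falling back to a narrower CRC-32 implementation"); /* illustrative */
  return 0;
}
```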
## How can this PR be tested?
I tested f19b0a6 with GCC 11 and GCC 13, as well as clang 8, 11, 12, and 18, on two systems that Intel contributed.
On the smaller system, with Ubuntu 22.04 and GCC 11.4.0, I ran a simple 30-second Sysbench `oltp_update_non_index` test with 72 threads, 32 tables, and 10,000 rows per table. This ran on `/dev/shm`, with 1 GiB of log file and buffer pool. The system tablespace (containing undo logs) would grow to 2.4 GiB during this workload, so the server must have been computing a lot of page checksums and log block checksums. The summary of the results was:

I also tested the time to prepare a backup that was made some 90 seconds into a Sysbench `oltp_update_index` test run. The time to apply the log to the backup should be proportional to the size of the log:

I timed several runs of preparing this backup with various `--use-memory` parameters. We can compare the single-threaded performance by specifying a large enough `--use-memory` that the log can be parsed, buffered, and applied in a single batch. I noticed bottlenecks outside the checksum calculation. Perhaps it is simplest here to concentrate on the time consumed during the very first steps, because that happens in a single thread with `mariadb-backup --prepare --use-memory=1g`:

For the AVX512BW and VPCLMULQDQ based implementation, we got the following:
Within 15 seconds from the start, we process 486320128−43020 bytes of log with the new implementation, versus 469608448−43020 bytes with the baseline. That is an improvement of 3.56% (486277108/469565428 ≈ 1.0356) in a single-threaded benchmark.
It should be noted that there is quite a bit of variance in these numbers. Here are some statistics of the "Read redo log up to" LSN in the first 15 seconds of `mariadb-backup --use-memory=1g`:

The maximum LSN=498444288 was actually reached by the baseline. The averages are 470076562.3 for the baseline and 472229888 for the AVX512BW and VPCLMULQDQ implementation, that is, about 0.46% in favor of the new code.
I was also surprised that specifying a large enough `--use-memory` for the entire log to be processed in a single batch results in a significantly slower overall time: in this case slightly over 5 minutes, compared to the about 3 minutes of `--use-memory=1g`, which forces frequent writes and reads of data pages, partly forcing the same pages to be accessed via the file system multiple times. A follow-up to the improvements made in MDEV-29911 will be necessary.

When using the maximum `--use-memory`, the AVX512 version of the CRC-32C algorithm surprisingly was 2 seconds slower (5 minutes and 3 seconds for the baseline vs. 5 minutes and 5 seconds).

I tested this scenario with this patch applied on 10.11 ae03374. Because the log file format is different, I created a new data set. I started the backup some 90 seconds into the Sysbench test. In 10.11, because the throughput is about twice that of 10.6, I thought that I would have to revise my procedure and start the backup earlier, to avoid a scenario like this:
Unfortunately, there is more to this. It looks like parsing the log (`recv_sys_t::parse<recv_buf, false>`) is a huge bottleneck in the 10.11 `mariadb-backup --backup`; it consumes much more CPU time than the CRC-32C calculations. In the MDEV-14425 log file format, each individual (possibly tiny) mini-transaction is a logical log block of its own; there are no 512-byte log blocks. The `mariadb-backup --backup` in 10.11 would be very slow, copying only about 0.2 GiB of log per minute.

To be able to test this at all on 10.11, I thought that I would simply kill the server during the workload and then measure the time it takes to recover when using `innodb_buffer_pool_size=1g`. There is quite a bit of noise in the numbers, so it is hard to say how much improvement there is. Baseline: 9 seconds to reach the end of the log, and 128 seconds total recovery time:

With the AVX512BW and VPCLMULQDQ based CRC-32C: 9 seconds and 124 seconds:
It should be noted that on reruns, each case took longer to complete, so we cannot draw any firm conclusions from the above tests.
More testing will be needed to establish the impact in a broader set of scenarios. This should also be tested on the 10.11 branch, which features a more scalable redo log design.
I will write a stand-alone single-threaded performance test program in https://github.com/dr-m/crc32_simd in order to measure just the CRC-32 performance on its own.
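As a rough outline, such a test program could look like the following. This is a sketch under my own assumptions: `my_crc32c()` is a hypothetical stand-in for whichever implementation is under test, not an actual function of crc32_simd, and `clock_gettime()` assumes a POSIX system.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical: the CRC-32C implementation being measured. */
extern uint32_t my_crc32c(uint32_t crc, const void *buf, size_t len);

int main(void)
{
  enum { LEN = 1 << 20, ITER = 10000 }; /* 1 MiB buffer, 10000 passes */
  char *buf = malloc(LEN);
  if (!buf)
    return 1;
  memset(buf, 0x5a, LEN); /* fixed pattern; the contents do not matter */
  uint32_t crc = 0;
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < ITER; i++)
    crc = my_crc32c(crc, buf, LEN); /* chain results to defeat dead-code elimination */
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf("crc=%08x, %.2f GiB/s\n", (unsigned) crc,
         (double) LEN * ITER / s / (1 << 30));
  free(buf);
  return 0;
}
```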
On our normal development machines (including the CI farm), this new code will not be exercised.
## Basing the PR against the correct MariaDB version
This is a performance fix, applicable to 10.5 too, but I chose the 10.6 branch for now.
## PR quality check