
MDEV-33817: AVX-512BW and VPCLMULQDQ based CRC-32 for x86 and x86-64 #3195

Merged
merged 2 commits into from May 3, 2024

Conversation

dr-m
Contributor

@dr-m dr-m commented Apr 12, 2024

  • The Jira issue number for this PR is: MDEV-33817

Description

The Intel AVX-512 instruction set extensions allow a 512-bit carry-less product to be computed, as well as 512-bit unaligned loads of data. This allows any CRC polynomial of up to 32 bits to be used, simply by passing different tables to a single implementation.

This is based on the assembler code in https://github.com/intel/intel-ipsec-mb/ with some minor optimization and cleanup. https://github.com/dr-m/crc32_simd is a stand-alone version of this, with both normal and reversed polynomials.

Release Notes

If the required x86 instruction set extensions are available, the server startup will report

InnoDB: Using AVX512 instructions

to indicate that CRC-32 and CRC-32C will be computed on up to 256 bytes of input per loop iteration.

How can this PR be tested?

I tested f19b0a6 on GCC 11 and GCC 13, as well as clang 8, 11, 12, 18 on two systems that Intel contributed.

On the smaller system, with Ubuntu 22.04 and GCC 11.4.0, I ran a simple 30-second Sysbench oltp_update_non_index test with 72 threads, 32 tables, and 10,000 rows each. This ran on /dev/shm, with a 1 GiB log file and buffer pool. The system tablespace (containing the undo logs) would grow to 2.4 GiB during this workload, so the server must have been computing a lot of page checksums and log block checksums. The summary of the results:

Variant   Transactions
pclmul    217426.03/s
AVX512    225342.07/s

I also tested the time to prepare a backup that was made some 90 seconds into a Sysbench oltp_update_index test run. The time to apply the log to the backup should be proportional to the size of the log:

[00] 2024-05-01 21:20:01 Redo log (from LSN 43020 to 3959269412) was copied.

I timed several runs of preparing this backup with various --use-memory parameters. We can compare single-threaded performance by specifying a large enough --use-memory that the log can be parsed, buffered and applied in a single batch. I noticed bottlenecks outside the checksum calculation. Perhaps it is simplest here to concentrate on the time consumed during the very first steps, because those run in a single thread with mariadb-backup --prepare --use-memory=1g:

2024-05-01 21:52:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=43020,43020
2024-05-01 21:52:41 0 [Note] InnoDB: Multi-batch recovery needed at LSN 357084496
2024-05-01 21:52:53 0 [Note] InnoDB: Read redo log up to LSN=469608448

For the AVX512BW and VPCLMULQDQ based implementation, we got the following:

2024-05-01 21:42:39 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=43020,43020
2024-05-01 21:42:42 0 [Note] InnoDB: Multi-batch recovery needed at LSN 357089342
2024-05-01 21:42:54 0 [Note] InnoDB: Read redo log up to LSN=486320128

Within 15 seconds from the start, we process 486320128-43020 bytes (AVX512) or 469608448-43020 bytes (pclmul) of log. That is an improvement of 3.56% in a single-threaded benchmark.

It should be noted that there is quite a bit of variance in these numbers. Here are some statistics of the "Read redo log up to" LSN in the first 15 seconds of mariadb-backup --use-memory=1g:

pclmul      AVX512
469608448   486320128
498444288   462399488
463513600   493398016
471115776   467380224
463513600   448112640
481011712   478390272
443328512   469608448

The maximum LSN=498444288 was actually reached by the baseline. The averages are 470076562.3 (pclmul) and 472229888 (AVX512), that is, in favor of the AVX512BW and VPCLMULQDQ based implementation.

I was also surprised that specifying a large enough --use-memory for the entire log to be processed in a single batch results in a significantly slower overall time: slightly over 5 minutes, compared to the roughly 3 minutes with --use-memory=1g. The smaller setting forces frequent writes and reads of data pages, partly causing the same pages to be accessed via the file system multiple times, yet it still wins. A follow-up to the improvements made in MDEV-29911 will be necessary.

When using the maximum --use-memory, the AVX512 version of the CRC-32C algorithm was surprisingly 2 seconds slower (5 minutes and 3 seconds for pclmul vs. 5 minutes and 5 seconds for AVX512).

I tested this scenario with this patch applied on 10.11 ae03374. Because the log file format is different, I created a new data set. I started the backup some 90 seconds into the Sysbench test. In 10.11, because the throughput is about twice that of 10.6, I thought that I would have to revise my procedure and start the backup earlier, to avoid a scenario like this:

[00] FATAL ERROR: 2024-05-01 22:52:47 Was only able to copy log from 46862 to 138400442, not 6924111181; try increasing innodb_log_file_size

Unfortunately, there is more to this. It looks like parsing the log (recv_sys_t::parse<recv_buf, false>) is a huge bottleneck in the 10.11 mariadb-backup --backup. It consumes way more CPU time than the CRC-32C calculations. In the MDEV-14425 log file format, each individual (possibly tiny) mini-transaction is a logical log block on its own; there are no 512-byte log blocks. The mariadb-backup --backup in 10.11 would be very slow, copying only about 0.2 GiB of log per minute.

To be able to test this at all on 10.11, I thought that I would simply kill the server during the workload and then measure the time it takes to recover when using innodb_buffer_pool_size=1g. There is quite a bit of noise in the numbers, so it is hard to say how much improvement there is. Baseline: 9 seconds to reach the end of the log, and 128 seconds total recovery time:

2024-05-02  0:18:33 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=46846
2024-05-02  0:18:36 0 [Note] InnoDB: Multi-batch recovery needed at LSN 375670587
2024-05-02  0:18:42 0 [Note] InnoDB: End of log at LSN=4092777671
…
2024-05-02  0:20:41 0 [Note] InnoDB: log sequence number 4092777671 (memory-mapped); transaction id 35130137

With the AVX512BW and VPCLMULQDQ based CRC-32C: 9 seconds and 124 seconds:

2024-05-02  0:26:35 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=46846
…
2024-05-02  0:30:34 0 [Note] InnoDB: Parsed redo log up to LSN=3920491375; to recover: 332507 pages
2024-05-02  0:30:46 0 [Note] InnoDB: End of log at LSN=4092777671
2024-05-02  0:30:53 0 [Note] InnoDB: 37 transaction(s) which must be rolled back or cleaned up in total 37 row operations to undo
2024-05-02  0:30:53 0 [Note] InnoDB: Trx id counter is 35130099
2024-05-02  0:31:01 0 [Note] InnoDB: To recover: 346656 pages
2024-05-02  0:31:26 0 [Note] InnoDB: 128 rollback segments are active.
2024-05-02  0:31:26 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2024-05-02  0:31:26 0 [Note] InnoDB: To roll back: 37 transactions, 37 rows
…
2024-05-02  0:31:26 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2024-05-02  0:31:26 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2024-05-02  0:31:26 0 [Note] InnoDB: log sequence number 4092777671 (memory-mapped); transaction id 35130137

It should be noted that on rerun, each case took a longer time to complete. So, we can’t conclude anything firm from the above tests.

More testing will be needed to establish the impact in a broader set of scenarios. This should be tested on the 10.11 branch as well, which features a more scalable redo log design.

I will write a stand-alone single-threaded performance test program in https://github.com/dr-m/crc32_simd in order to measure just the CRC-32 performance on its own.

On our normal development machines (including the CI farm), this new code will not be exercised.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch.
  • This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

This is a performance fix, applicable to 10.5 too, but I chose the 10.6 branch for now.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@CLAassistant

CLAassistant commented Apr 12, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@dr-m dr-m marked this pull request as draft April 12, 2024 15:30
@dr-m dr-m self-assigned this Apr 12, 2024
@dr-m dr-m changed the title MDEV-33817 WIP: AVX-512 based CRC-32 algorithms MDEV-33817: AVX-512BW and VPCLMULQDQ based CRC-32 for x86 and x86-64 Apr 30, 2024
@dr-m dr-m requested a review from vaintroub April 30, 2024 12:31
@dr-m dr-m marked this pull request as ready for review April 30, 2024 12:33
In our unit test, let us rely on our own reference
implementation using the reflected
CRC-32 ISO 3309 and CRC-32C polynomials. Let us also
test with various lengths.

Let us refactor the CRC-32 and CRC-32C implementations
so that no special compilation flags are needed and
some function-call indirection is avoided.

pmull_supported: Remove. We will have pointers to two separate
functions crc32c_aarch64_pmull() and crc32c_aarch64().
@dr-m dr-m changed the base branch from 10.6 to 10.5 May 3, 2024 10:19
Member

@vaintroub vaintroub left a comment


Looks good. Newly added file needs copyright header.

In general, we also need some benchmark to prove VPCLMULQDQ superiority to the best we had before (which was crc32c_3way) for CRC32C calculation.

Maybe taking a mariabackup from a large idle server with a small redo log, something like that, could be interesting.

mysys/crc32/crc32c_x86.cc (review comment, resolved)
@dr-m
Contributor Author

dr-m commented May 3, 2024

Looks good. Newly added file needs copyright header.

In general, we also need some benchmark to prove VPCLMULQDQ superiority to the best we had before (which was crc32c_3way) for CRC32C calculation.

Maybe taking a mariabackup from a large idle server with a small redo log, something like that, could be interesting.

I posted some more results to MDEV-33817. It currently is a mixed bag, and the performance of mariadb-backup --backup as well as mariadb-backup --prepare needs to be improved. While I observed more than tripled performance in a single-threaded test (computing a CRC on a 1 GiB buffer) compared to the crc32c_3way implementation, in real-world scenarios we have context switches and mutex contention that affect things. Context switches that occur when vzeroupper is not in effect will have to save and restore a lot more registers, and this can be costly (causing a regression). As we saw in MDEV-33515, the CPU microarchitecture can play a huge role here.

This is based on https://github.com/intel/intel-ipsec-mb/
and has been tested both on x86 and x86-64, with code that
was generated by several versions of GCC and clang.
GCC 11 or clang 8 or later should be able to compile this,
and so should recent versions of MSVC.

Thanks to Intel Corporation for providing access to hardware,
for answering my questions regarding the code, and for
providing the coefficients for the CRC-32C computation.

crc32_avx512(): Compute a reverse polynomial CRC-32 using
precomputed tables and carry-less product, for up to 256 bytes
of unaligned input per loop iteration.

Reviewed by: Vladislav Vaintroub
@dr-m dr-m merged commit 9ec7819 into 10.5 May 3, 2024
13 of 18 checks passed
@dr-m dr-m deleted the 10.6-MDEV-33817 branch May 3, 2024 13:28