
MDEV-33817: AVX-512BW and VPCLMULQDQ based CRC-32 for x86 and x86-64 #3195

Merged
merged 2 commits into from May 3, 2024

Conversation

dr-m
Contributor

@dr-m dr-m commented Apr 12, 2024

  • The Jira issue number for this PR is: MDEV-33817

Description

The Intel AVX-512 instruction set extensions allow a 512-bit carry-less product to be computed, as well as 512-bit unaligned loads of data. This allows any CRC polynomial of up to 32 bits to be used, simply by passing different tables to a single implementation.

This is based on the assembler code in https://github.com/intel/intel-ipsec-mb/ with some minor optimization and cleanup. https://github.com/dr-m/crc32_simd is a stand-alone version of this, with both normal and reversed polynomials.

Release Notes

If the required x86 instruction set extensions are available, the server startup will report

InnoDB: Using AVX512 instructions

to indicate that CRC-32 and CRC-32C will be computed on up to 256 bytes of input per loop iteration.

How can this PR be tested?

I tested f19b0a6 on GCC 11 and GCC 13, as well as clang 8, 11, 12, 18 on two systems that Intel contributed.

On the smaller system, with Ubuntu 22.04 and GCC 11.4.0, I ran a simple 30-second Sysbench oltp_update_non_index test with 72 threads, 32 tables, and 10,000 rows each. This ran on /dev/shm, with a 1 GiB log file and buffer pool. The system tablespace (containing the undo logs) would grow to 2.4 GiB during this workload, so the server must have been computing a lot of page checksums and log block checksums. The summary of the results:

Variant   Transactions
pclmul    217426.03/s
AVX512    225342.07/s

I also tested the time to prepare a backup that was made some 90 seconds into a Sysbench oltp_update_index test run. The time to apply the log to the backup should be proportional to the size of the log:

[00] 2024-05-01 21:20:01 Redo log (from LSN 43020 to 3959269412) was copied.

I timed several runs of preparing this backup with various --use-memory parameters. We can compare single-threaded performance by specifying a large enough --use-memory that the log can be parsed, buffered and applied in a single batch. I noticed bottlenecks outside the checksum calculation. Perhaps it is simplest here to concentrate on the time consumed during the very first steps, because those run in a single thread with mariadb-backup --prepare --use-memory=1g:

2024-05-01 21:52:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=43020,43020
2024-05-01 21:52:41 0 [Note] InnoDB: Multi-batch recovery needed at LSN 357084496
2024-05-01 21:52:53 0 [Note] InnoDB: Read redo log up to LSN=469608448

For the AVX512BW and VPCLMULQDQ based implementation, we got the following:

2024-05-01 21:42:39 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=43020,43020
2024-05-01 21:42:42 0 [Note] InnoDB: Multi-batch recovery needed at LSN 357089342
2024-05-01 21:42:54 0 [Note] InnoDB: Read redo log up to LSN=486320128

Within 15 seconds from the start, we process 486320128-43020 bytes (AVX512) or 469608448-43020 bytes (pclmul) of log. That is an improvement of 3.56% in a single-threaded benchmark.

It should be noted that there is quite a bit of variance in these numbers. Here are some statistics of the "Read redo log up to" LSN in the first 15 seconds of mariadb-backup --use-memory=1g:

pclmul      AVX512
469608448   486320128
498444288   462399488
463513600   493398016
471115776   467380224
463513600   448112640
481011712   478390272
443328512   469608448

The maximum LSN=498444288 was actually reached by the baseline. The averages are 470076562.3 (pclmul) and 472229888 (AVX512), that is, in favor of the AVX512BW and VPCLMULQDQ based implementation.

I was also surprised that specifying a large enough --use-memory for the entire log to be processed in a single batch results in a significantly slower overall time: slightly over 5 minutes, compared to the roughly 3 minutes with --use-memory=1g. The smaller setting forces frequent writes and reads of data pages, partly causing the same pages to be accessed via the file system multiple times, yet it still wins. A follow-up to the improvements made in MDEV-29911 will be necessary.

When using the maximum --use-memory, the AVX512 version of the CRC-32C algorithm was surprisingly 2 seconds slower (5 minutes and 3 seconds for pclmul vs. 5 minutes and 5 seconds for AVX512).

I tested this scenario with this patch applied on 10.11 ae03374. Because the log file format is different, I created a new data set. I started the backup some 90 seconds into the Sysbench test. In 10.11, because the throughput is about twice that of 10.6, I thought that I would have to revise my procedure and start the backup earlier, to avoid a scenario like this:

[00] FATAL ERROR: 2024-05-01 22:52:47 Was only able to copy log from 46862 to 138400442, not 6924111181; try increasing innodb_log_file_size

Unfortunately, there is more to this. It looks like parsing the log (recv_sys_t::parse<recv_buf, false>) is a huge bottleneck in the 10.11 mariadb-backup --backup. It consumes way more CPU time than the CRC-32C calculations. In the MDEV-14425 log file format, each individual (possibly tiny) mini-transaction is a logical log block on its own; there are no 512-byte log blocks. The mariadb-backup --backup in 10.11 would be very slow, copying only about 0.2 GiB of log per minute.

To be able to test this at all on 10.11, I thought that I would simply kill the server during the workload and then measure the time it takes to recover when using innodb_buffer_pool_size=1g. There is quite a bit of noise in the numbers, so it is hard to say how much improvement there is. Baseline: 9 seconds to reach the end of the log, and 128 seconds total recovery time:

2024-05-02  0:18:33 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=46846
2024-05-02  0:18:36 0 [Note] InnoDB: Multi-batch recovery needed at LSN 375670587
2024-05-02  0:18:42 0 [Note] InnoDB: End of log at LSN=4092777671
…
2024-05-02  0:20:41 0 [Note] InnoDB: log sequence number 4092777671 (memory-mapped); transaction id 35130137

With the AVX512BW and VPCLMULQDQ based CRC-32C: 9 seconds and 124 seconds:

2024-05-02  0:26:35 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=46846
…
2024-05-02  0:30:34 0 [Note] InnoDB: Parsed redo log up to LSN=3920491375; to recover: 332507 pages
2024-05-02  0:30:46 0 [Note] InnoDB: End of log at LSN=4092777671
2024-05-02  0:30:53 0 [Note] InnoDB: 37 transaction(s) which must be rolled back or cleaned up in total 37 row operations to undo
2024-05-02  0:30:53 0 [Note] InnoDB: Trx id counter is 35130099
2024-05-02  0:31:01 0 [Note] InnoDB: To recover: 346656 pages
2024-05-02  0:31:26 0 [Note] InnoDB: 128 rollback segments are active.
2024-05-02  0:31:26 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2024-05-02  0:31:26 0 [Note] InnoDB: To roll back: 37 transactions, 37 rows
…
2024-05-02  0:31:26 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2024-05-02  0:31:26 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2024-05-02  0:31:26 0 [Note] InnoDB: log sequence number 4092777671 (memory-mapped); transaction id 35130137

It should be noted that on rerun, each case took a longer time to complete. So, we can’t conclude anything firm from the above tests.

More testing will be needed to establish the impact in a broader set of scenarios. This should be tested on the 10.11 branch as well, which features a more scalable redo log design.

I will write a stand-alone single-threaded performance test program in https://github.com/dr-m/crc32_simd in order to measure just the CRC-32 performance on its own.

On our normal development machines (including the CI farm), this new code will not be exercised.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch.
  • This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

This is a performance fix, applicable to 10.5 too, but I chose the 10.6 branch for now.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@CLAassistant

CLAassistant commented Apr 12, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@dr-m dr-m marked this pull request as draft April 12, 2024 15:30
@dr-m dr-m self-assigned this Apr 12, 2024
@dr-m dr-m changed the title MDEV-33817 WIP: AVX-512 based CRC-32 algorithms MDEV-33817: AVX-512BW and VPCLMULQDQ based CRC-32 for x86 and x86-64 Apr 30, 2024
@dr-m dr-m requested a review from vaintroub April 30, 2024 12:31
@dr-m dr-m marked this pull request as ready for review April 30, 2024 12:33
In our unit test, let us rely on our own reference
implementation using the reflected
CRC-32 ISO 3309 and CRC-32C polynomials. Let us also
test with various lengths.

Let us refactor the CRC-32 and CRC-32C implementations
so that no special compilation flags are needed and
some function-call indirection is avoided.

pmull_supported: Remove. We will have pointers to two separate
functions crc32c_aarch64_pmull() and crc32c_aarch64().
@dr-m dr-m changed the base branch from 10.6 to 10.5 May 3, 2024 10:19
Member

@vaintroub vaintroub left a comment


Looks good. Newly added file needs copyright header.

In general, we also need some benchmark to prove VPCLMULQDQ superiority to the best we had before (which was crc32c_3way) for CRC32C calculation.

Maybe taking a mariabackup from a large idle server with a small redo log, something like that, could be interesting.

mysys/crc32/crc32c_x86.cc (review comment, resolved)
@dr-m
Contributor Author

dr-m commented May 3, 2024

Looks good. Newly added file needs copyright header.

In general, we also need some benchmark to prove VPCLMULQDQ superiority to the best we had before (which was crc32c_3way) for CRC32C calculation.

Maybe taking a mariabackup from a large idle server with a small redo log, something like that, could be interesting.

I posted some more results to MDEV-33817. It currently is a mixed bag, and the performance of mariadb-backup --backup as well as mariadb-backup --prepare needs to be improved. While I observed more than tripled performance in a single-threaded test (computing a CRC on a 1 GiB buffer) compared to the crc32c_3way implementation, in real-world scenarios we have context switches and mutex contention that affect things. Context switches that occur when vzeroupper is not in effect will have to save and restore a lot more registers, and this can be costly (causing a regression). As we saw in MDEV-33515, the CPU microarchitecture can play a huge role here.

This is based on https://github.com/intel/intel-ipsec-mb/
and has been tested both on x86 and x86-64, with code that
was generated by several versions of GCC and clang.
GCC 11 or clang 8 or later should be able to compile this,
and so should recent versions of MSVC.

Thanks to Intel Corporation for providing access to hardware,
for answering my questions regarding the code, and for
providing the coefficients for the CRC-32C computation.

crc32_avx512(): Compute a reverse polynomial CRC-32 using
precomputed tables and carry-less product, for up to 256 bytes
of unaligned input per loop iteration.

Reviewed by: Vladislav Vaintroub
@dr-m dr-m merged commit 9ec7819 into 10.5 May 3, 2024
13 of 18 checks passed
@dr-m dr-m deleted the 10.6-MDEV-33817 branch May 3, 2024 13:28