Expand A64 SVE scatter/gather memory access instructions #5036
The reason why I didn't mention "scatter/gather" specifically in the title is that this issue also includes supporting the contiguous access variants (which are not termed "scatter/gather" in the ARM manual).
The existing drx_expand_scatter_gather() function is > 2000 lines of code, including helper functions, and most of it is very x86 specific. In preparation for writing the AArch64 version of this function, this commit moves the x86 implementation of drx_expand_scatter_gather() out of drx.c into its own architecture-specific files, and creates dummy implementations for other architectures. The behaviour of the function should be unchanged on all platforms. As we develop the AArch64 version of drx_expand_scatter_gather(), any parts of the original x86 implementation which are useful to both can be pulled out into helper functions shared between both architectures. Issues: #5844, #5036, #3837
Creates an AArch64 version of drx_expand_scatter_gather() and tests for it. So far only SVE scalar+vector loads are supported. Support and tests for more instructions will follow in future commits. State restore is also not yet supported and will be implemented when i#5365 is complete. Issue: #5036
Adds support to drx_expand_scatter_gather() for SVE scalar+vector scatter store instructions, along with tests. Issue: #5036
While developing the AArch64 version of drx_expand_scatter_gather I have seen the code_api|sample.memval_simple_scattergather test fail intermittently with the "not yet implemented" assert from drx_scatter_gather_restore_state(), and tracked the problem down to the trace buffer management mechanism. When the trace buffer is full, the next store will be past the end of the buffer and trigger a fault. The fault is meant to be intercepted by the client and used to flush the buffer. If the buffer happens to be full when tracing an expanded scatter/gather instruction, the scatter/gather state restore function will be called, so previously it would just fail with the assert because state restoration is not implemented yet. Until scatter/gather state restore on AArch64 is implemented properly, we can work around the problem by marking the expansion loads and stores we emit with a note that we can check for in the state restore function. If the faulting access does not originate from an expansion instruction, we pass the fault on to the client to handle instead. Issue: #5036
Adds support to drx_expand_scatter_gather() for SVE vector+immediate gather load and scatter store instructions, along with tests. Issue: #5036
In a couple of places in the drcachesim tracer there was code handling scatter/gather instructions that was guarded by #ifdef X86, which meant offline traces that include scatter/gather instructions were producing errors on AArch64. For example, running the AArch64 scattergather test app:

```
$ bin64/drrun -t drcachesim -offline -- suite/tests/bin/client.drx-scattergather
$ bin64/drrun -t drcachesim -indir drmemtrace.client.drx-scattergather.662655.1769.dir
```

produced the error:

```
ERROR: failed to initialize analyzer: raw2trace failed: Failed to process file for thread 662655: memref entry found outside of bb
```

This commit enables the code on AArch64 as well and fixes the error. Issue: #5036
Adds support to drx_expand_scatter_gather() for SVE scalar+scalar contiguous load and store instructions, along with tests. Issue: #5036
To get the individual memory sizes right during post-processing in raw2trace, where we look at the original scatter/gather instruction, we've been relying on the IR storing the per-element operand size instead of the total (which is different from other SIMD operations; xref #6561 on what to do for predicated SIMD). An alternative solution is to store the encoding during tracing like we do for vdso or JIT code (see offline_instru_t::record_instr_encodings()) — but not the encoding of the emulated instr info's pointer to the original scatter/gather: rather, the unrolled single load/store. This would greatly simplify raw2trace and the questions about what the IR should store. On the other hand, it adds some tracing overhead, which probably outweighs the gains in reduced code complexity.
When debugging i#6499 we noticed that drcachesim was producing 0-byte read/write records for some SVE load/store instructions:

```
ifetch 4 byte(s) @ 0x0000000000405b3c a54a4681 ld1w (%x20,%x10,lsl #2) %p1/z -> %z1.s
read 0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
ifetch 4 byte(s) @ 0x0000000000405b4
```

This turned out to be due to drdecode being linked into drcachesim twice: once into the drcachesim executable and once into libdynamorio. drdecode uses a global variable to store the SVE vector length to use when decoding, so we end up with two copies of that variable, and only one was being initialized. To fix this properly we would need to refactor the libraries so that there is only one copy of the sve_veclen global variable, or change the way the decoder gets the vector length so it's no longer stored in a global variable. In the meantime we have a workaround which makes sure both copies of the variable get initialized, and drcachesim produces correct results.

With that workaround in place, however, the results were still wrong. For expanded scatter/gather instructions in an offline trace, raw2trace doesn't have access to the load/store instructions from the expansion, only the original app scatter/gather instruction. It has to create the read/write records using only information from the original scatter/gather instruction, and it uses the size of the memory operand to determine the size of each read/write. This works for x86 because the x86 IR uses the per-element data size for the memory operand of scatter/gather instructions. It doesn't work for AArch64 because the AArch64 codec uses the maximum data transferred (per-element data size * number of elements), like other SIMD load/store instructions. We plan to make the AArch64 IR consistent with x86 by changing it to use the same convention for scatter/gather instructions, but in the meantime we can work around the inconsistency by fixing the size in raw2trace based on the instruction's opcode. Issues: #6499, #5365, #5036
(#6574) Make the AArch64 IR consistent with x86, which already uses the per-element transfer size for the scatter/gather memory operand size. This changes the AArch64 codec for the scatter/gather and predicated contiguous load/store instructions to use the per-element access size for the memory operand instead of the maximum total transfer size that it used previously, and updates the tests accordingly. Issues: #5365, #5036, #6561
This patch adds SVE support for signals in the core. It is the follow on patch from the SVE core work part 1, in PR #5835 (f646a63) and includes vector address computation for SVE scatter/gather, enabling first-fault load handling. Issue: #5365, #5036 Co-authored-by: Jack Gallagher <[email protected]>
Adds support for restoring register app state when a fault is triggered by one of the instructions in a scatter/gather expansion sequence emitted by drx_expand_scatter_gather(). Also extends the existing scatter/gather expansion tests to check the value of the FFR register, and run the test instructions with different elements rigged to fault so we can test the state restoration behaviour. Support for the first-faulting and non-faulting variants (ldff1*, ldnf1*) will be added in future commits. Issue: #5036
Adds support for non-fault loads to drx_expand_scatter_gather(). Non-fault loads (ldnf1*) work similarly to scalar+immediate predicated contiguous ld1* loads, but with different behaviour if an element access faults. This commit implements this behaviour and extends the scatter/gather tests to include ldnf1* instructions. Issue: #5036
Fixes the slot used to save and restore FP regs at fcache enter and return events. PR #6725 adjusted the slots used during signal handling in core/unix/signal_linux_aarch64.c but did not adjust the same in fcache enter/return and attach events. Prior to that PR, the FP regs were simply stored in a contiguous manner in signal handling code and fcache enter/return routines (instead of in their respective dr_simd_t struct member), which was a bit confusing. The mismatch between slot usage in signal handling and fcache enter/return code caused failures in the signalNNN1 tests on some systems. Verified that those tests pass with this fix. Also fixes the same issue in save_priv_mcontext_helper which is used in the dr_app_start API. Unit tests for this scenario will be added as part of #6759. Issue: #5036, #6755, #5365 Fixes #6758
Adds support for first-fault loads to drx_expand_scatter_gather(). First-fault loads (ldff1*) work similarly to scalar+vector/vector+immediate gather instructions or scalar+scalar predicated contiguous ld1* loads, but with different behaviour if an element access faults. This commit implements this behaviour and extends the scatter/gather tests to include ldff1* instructions. Issue: #5036
Adds details of the AArch64 scatter/gather expansion to the scatter/gather expansion developer documentation. Issue: #5036
I have done some testing with 512-bit vector registers and found a couple of bugs in the scatter/gather expansion code:

- An assert had incorrect logic (< should have been <=).
- Added OPSZ_192, needed to represent 3 512-bit Z registers.

Issue: #5036
Fixes the compiler error seen on GCC 13.2:

```
ext/drx/scatter_gather_aarch64.c:1238:37: error: array subscript 2 is above array bounds of 'sg_slot_t[2]' {aka 'struct _sg_slot_t[2]'} [-Werror=array-bounds=]
 1238 |     DR_ASSERT(slot_state->pred_slots[slot].kind == SLOT_KIND_UNUSED);
```

In practice the code will never make an out-of-bounds access because the for loop above will always terminate by breaking, so `slot` is always `< NUM_PRED_SLOTS`. Issue: #5036
ARM has a rich set of SVE instructions. For clients that need to instrument memrefs (e.g. drcachesim), we need to expand them to scalar loads and stores, like what we did for x86 scatter/gather in #2985. These instructions have many variants; I'm summarising my understanding below. The Scalable Vector Extension chapter of the Arm manual has a more detailed discussion.
There are multiple ways in which the memory address can be specified. These are all predicated loads/stores, meaning that an element may be active or inactive, based on special predicate registers. They have either the `LD1*` or `ST1*` prefix. There are variants with different element sizes (unsigned doubleword; signed and unsigned byte, halfword, word).
Faults for non-active elements are always suppressed. There are different load instruction variants based on how faults for active elements are treated: besides the usual behaviour, each of the above has a "first-fault" variant (faults only for the first active element) and a "non-fault" variant.
For "scalar plus scalar" and "scalar plus immediate" load instructions, there are variants that read 2/3/4 contiguous elements, each into the same element number in 2/3/4 vector registers. These have an `LDN*` or `STN*` prefix, where N = 2/3/4.
There are also some unpredicated instructions (LDR and STR) that use the "scalar + immediate" scheme to load/store vector or predicate registers.
The x86 scatter/gather that we handled in #2985 is the "scalar + vector" variant with regular faulting behaviour. More work will be required to adapt `drx_expand_scatter_gather` to these other variants.

For the contiguous access variants, we could model them as a single memory address with a larger size. But this is not a correct model, because each element can be active/inactive based on the predicate register, so the memory addresses that end up being accessed can be non-contiguous. The correct model is to treat them like scatter/gather, using multiple element-sized accesses.