Expand A64 SVE scatter/gather memory access instructions #5036
The reason why I didn't mention "scatter/gather" specifically in the title is that this issue also includes supporting the contiguous access variants (which are not termed "scatter/gather" in the ARM manual).
The existing drx_expand_scatter_gather() function is > 2000 lines of code, including helper functions, and most of it is very x86 specific. In preparation for writing the AArch64 version of this function, this commit moves the x86 implementation of drx_expand_scatter_gather() out of drx.c into its own architecture-specific files, and creates dummy implementations for other architectures. The behaviour of the function should be unchanged on all platforms. As we develop the AArch64 version of drx_expand_scatter_gather(), any parts of the original x86 implementation which are useful to both can be pulled out into helper functions shared between both architectures. Issues: #5844, #5036, #3837
Creates an AArch64 version of drx_expand_scatter_gather() and tests for it. So far only SVE scalar+vector loads are supported. Support and tests for more instructions will follow in future commits. State restore is also not yet supported and will be implemented when i#5365 is complete. Issue: #5036
Adds support to drx_expand_scatter_gather() for SVE scalar+vector scatter store instructions, along with tests. Issue: #5036
While developing the AArch64 version of drx_expand_scatter_gather I have seen the code_api|sample.memval_simple_scattergather test fail intermittently with the "not yet implemented" assert from drx_scatter_gather_restore_state(), and tracked the problem down to the trace buffer management mechanism. When the trace buffer is full, the next store will be past the end of the buffer and trigger a fault. The fault is meant to be intercepted by the client and used to flush the buffer. If the buffer happens to be full when tracing an expanded scatter/gather instruction, the scatter/gather state restore function will be called, so previously it would just fail with the assert because state restoration is not implemented yet. Until scatter/gather state restore on AArch64 is implemented properly, we can work around the problem by marking the expansion loads and stores we emit with a note that we can check for in the state restore function. If the faulting access does not originate from an expansion instruction, we pass the fault on to the client to handle instead. Issue: #5036
Adds support to drx_expand_scatter_gather() for SVE vector+immediate gather load and scatter store instructions, along with tests. Issue: #5036
In a couple of places in the drcachesim tracer there was code handling scatter/gather instructions that was guarded by #ifdef X86, which meant offline traces that include scatter/gather instructions were producing errors on AArch64. For example, running the AArch64 scattergather test app:

```
$ bin64/drrun -t drcachesim -offline -- suite/tests/bin/client.drx-scattergather
$ bin64/drrun -t drcachesim -indir drmemtrace.client.drx-scattergather.662655.1769.dir
```

produced the error:

```
ERROR: failed to initialize analyzer: raw2trace failed: Failed to process file for thread 662655: memref entry found outside of bb
```

This commit enables the code on AArch64 as well and fixes the error. Issue: #5036
Adds support to drx_expand_scatter_gather() for SVE scalar+scalar contiguous load and store instructions, along with tests. Issue: #5036
To get the individual memory sizes right during post-processing in raw2trace, where we look at the original scatter/gather instruction, we've been relying on the IR storing the per-element operand size instead of the total (which is different from other SIMD operations; xref #6561 on what to do for predicated SIMD). An alternative solution is to store the encoding during tracing like we do for vdso or JIT code (see offline_instru_t::record_instr_encodings()) — but not the encoding of the emulated instr info's pointer to the original scatter/gather: rather, the unrolled single load/store. This would greatly simplify raw2trace and the questions about what the IR should store. On the other hand, it adds some tracing overhead, which probably outweighs the gains in reduced code complexity.
When debugging i#6499 we noticed that drcachesim was producing 0-byte read/write records for some SVE load/store instructions:

```
ifetch 4 byte(s) @ 0x0000000000405b3c a54a4681 ld1w (%x20,%x10,lsl #2) %p1/z -> %z1.s
read 0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
read 0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
ifetch 4 byte(s) @ 0x0000000000405b4
```

This turned out to be due to drdecode being linked into drcachesim twice: once into the drcachesim executable and once into libdynamorio. drdecode uses a global variable to store the SVE vector length to use when decoding, so we end up with two copies of that variable, and only one was being initialized. To fix this properly we would need to refactor the libraries so that there is only one copy of the sve_veclen global variable, or change the way the decoder gets the vector length so it's no longer stored in a global variable. In the meantime we have a workaround which makes sure both copies of the variable get initialized, and drcachesim produces correct results.

With that workaround in place, however, the results were still wrong. For expanded scatter/gather instructions in an offline trace, raw2trace doesn't have access to the load/store instructions from the expansion, only the original app scatter/gather instruction. It has to create the read/write records using only information from the original scatter/gather instruction, and it uses the size of the memory operand to determine the size of each read/write. This works for x86 because the x86 IR uses the per-element data size for the memory operand of scatter/gather instructions. It doesn't work for AArch64 because the AArch64 codec uses the maximum data transferred (per-element data size * number of elements), like other SIMD load/store instructions. We plan to make the AArch64 IR consistent with x86 by changing it to use the same convention for scatter/gather instructions, but in the meantime we can work around the inconsistency by fixing the size in raw2trace based on the instruction's opcode. Issues: #6499, #5365, #5036
(#6574) Make the AArch64 IR consistent with x86, which already uses the per-element transfer size for the scatter/gather memory operand size. This changes the AArch64 codec for the scatter/gather and predicated contiguous load/store instructions to use the per-element access size for the memory operand instead of the maximum total transfer size that it used previously, and updates the tests accordingly. Issues: #5365, #5036, #6561
This patch adds SVE support for signals in the core. It is the follow on patch from the SVE core work part 1, in PR #5835 (f646a63) and includes vector address computation for SVE scatter/gather, enabling first-fault load handling. Issue: #5365, #5036 Co-authored-by: Jack Gallagher <[email protected]>
Adds support for restoring register app state when a fault is triggered by one of the instructions in a scatter/gather expansion sequence emitted by drx_expand_scatter_gather(). Also extends the existing scatter/gather expansion tests to check the value of the FFR register, and run the test instructions with different elements rigged to fault so we can test the state restoration behaviour. Support for the first-faulting and non-faulting variants (ldff1*, ldnf1*) will be added in future commits. Issue: #5036
Adds support for non-fault loads to drx_expand_scatter_gather(). Non-fault loads (ldnf1*) work similarly to scalar+immediate predicated contiguous ld1* loads, but with different behaviour if an element access faults. This commit implements this behaviour and extends the scatter/gather tests to include ldnf1* instructions. Issue: #5036
Fixes the slot used to save and restore FP regs at fcache enter and return events. PR #6725 adjusted the slots used during signal handling in core/unix/signal_linux_aarch64.c but did not adjust the same in fcache enter/return and attach events. Prior to that PR, the FP regs were simply stored in a contiguous manner in signal handling code and fcache enter/return routines (instead of in their respective dr_simd_t struct member), which was a bit confusing. The mismatch between slot usage in signal handling and fcache enter/return code caused failures in the signalNNN1 tests on some systems. Verified that those tests pass with this fix. Also fixes the same issue in save_priv_mcontext_helper which is used in the dr_app_start API. Unit tests for this scenario will be added as part of #6759. Issue: #5036, #6755, #5365 Fixes #6758
Adds support for first-fault loads to drx_expand_scatter_gather(). First-fault loads (ldff1*) work similarly to scalar+vector/vector+immediate gather instructions or scalar+scalar predicated contiguous ld1* loads, but with different behaviour if an element access faults. This commit implements this behaviour and extends the scatter/gather tests to include ldff1* instructions. Issue: #5036
Adds details of the AArch64 scatter/gather expansion to the scatter/gather expansion developer documentation. Issue: #5036
I have done some testing with 512-bit vector registers and found a couple of bugs in the scatter/gather expansion code:

- An assert had incorrect logic (< should have been <=).
- Added OPSZ_192, needed to represent 3 512-bit Z registers.

Issue: #5036
Fixes the compiler error seen on GCC 13.2:

```
ext/drx/scatter_gather_aarch64.c:1238:37: error: array subscript 2 is above array bounds of 'sg_slot_t[2]' {aka 'struct _sg_slot_t[2]'} [-Werror=array-bounds=]
 1238 |     DR_ASSERT(slot_state->pred_slots[slot].kind == SLOT_KIND_UNUSED);
```

In practice the code will never make an out-of-bounds access because the for loop above will always terminate by breaking, so `slot` is always `< NUM_PRED_SLOTS`. Issue: #5036
ARM has a rich set of SVE instructions. For clients that need to instrument memrefs (e.g. drcachesim), we need to expand them to scalar loads and stores, like what we did for x86 scatter/gather in #2985. These instructions have many variants; I'm summarising my understanding below. The Scalable Vector Extension chapter of the Arm manual has a more detailed discussion.
There are multiple ways in which the memory address can be specified. These are all predicated loads/stores, meaning that an element may be active or inactive, based on special predicate registers. They have either the `LD1*` or `ST1*` prefix. There are variants with different element sizes (unsigned doubleword; signed and unsigned byte, halfword, word).
Faults for non-active elements are always suppressed. There are different load instruction variants based on how faults for active elements are treated: besides the usual behaviour, each of the above has a "first-fault" variant (faults only for the first active element) and a "non-fault" variant.
For "scalar plus scalar" and "scalar plus immediate" load instructions, there are variants that read 2/3/4 contiguous elements, each into the same element number in 2/3/4 vector registers. These have an `LDN*` or `STN*` prefix, where N = 2/3/4.
There are also some unpredicated instructions (LDR and STR) that use the "scalar + immediate" scheme to load/store vector or predicate registers.
The x86 scatter/gather that we handled in #2985 is the "scalar + vector" variant with regular faulting behaviour. More work will be required to adapt `drx_expand_scatter_gather` to these other variants.

For the contiguous access variants, we could model them as a single memory address with a larger size. But this is not a correct model, because each element can be active/inactive based on the predicate register, so the memory addresses that end up being accessed can be non-contiguous. The correct model is to treat them like scatter/gather, using multiple element-sized accesses.