
Expand A64 SVE scatter/gather memory access instructions #5036

Open
abhinav92003 opened this issue Aug 4, 2021 · 3 comments


@abhinav92003
Contributor

ARM has a rich set of SVE instructions [1]. For clients that need to instrument memrefs (e.g. drcachesim), we need to expand them into scalar loads and stores, as we did for x86 scatter/gather in #2985. These instructions have many variants; I'm summarising my understanding below. [1] and the Scalable Vector Extension section of the Arm manual [2] have a more detailed discussion.

There are multiple ways in which the memory address can be specified. These are all predicated loads/stores, meaning that an element may be active or inactive based on special predicate registers. They have either an LD1* or ST1* prefix:

  • Scalar + immediate: For contiguous access. Memory address is generated by a 64-bit scalar base and immediate index.
  • Scalar + scalar: For contiguous access. Memory address is generated by a 64-bit scalar base and scalar index which is added to the base address.
  • Scalar + vector: For possible non-contiguous access, also known as gather load/scatter store. Memory addresses are generated by a 64-bit scalar base plus vector index.
  • Vector + immediate: For possible non-contiguous access, also known as gather load/scatter store. Memory addresses are generated by a vector base plus immediate index.
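As a rough illustration of the four schemes above, here is a minimal C sketch of how the effective address of element `elem` is formed in each case, assuming a 256-bit vector length and 4-byte word elements. The function names and the `uxtw #2`-style index scaling are illustrative, not DynamoRIO code:

```c
#include <stdint.h>
#include <stddef.h>

#define VL_BYTES 32 /* example: 256-bit vector length */

/* Scalar + immediate: contiguous; imm is a multiple of the vector length. */
static uint64_t
addr_scalar_imm(uint64_t base, int imm_vl, size_t elem)
{
    return base + (uint64_t)imm_vl * VL_BYTES + elem * 4;
}

/* Scalar + scalar: contiguous; index register scaled by element size. */
static uint64_t
addr_scalar_scalar(uint64_t base, uint64_t index, size_t elem)
{
    return base + (index << 2) + elem * 4;
}

/* Scalar + vector: gather/scatter; per-element offsets from a Z register,
 * here scaled as if by uxtw #2. */
static uint64_t
addr_scalar_vector(uint64_t base, const uint32_t *zoffsets, size_t elem)
{
    return base + ((uint64_t)zoffsets[elem] << 2);
}

/* Vector + immediate: gather/scatter; per-element bases plus immediate. */
static uint64_t
addr_vector_imm(const uint64_t *zbases, uint64_t imm, size_t elem)
{
    return zbases[elem] + imm;
}
```

Note how only the first two schemes yield contiguous addresses; the last two can touch arbitrary, possibly non-contiguous locations per element.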

There are variants with different element sizes (unsigned double-word; signed and unsigned byte, halfword, word).

Faults for non-active elements are always suppressed. Load instruction variants differ in how faults for active elements are treated: besides the usual faulting behaviour, each of the above has a “first fault” variant (faults only for the first active element) and a “non fault” variant.
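The three faulting behaviours can be sketched as a small C model. This is a simplification under assumed semantics (the `ok[]` array stands in for a page-permission check, and suppressing a fault clears the FFR from that element onward); all names are made up for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { FAULT_NORMAL, FAULT_FIRST, FAULT_NONE } fault_mode_t;

/* Returns true if the load "faults"; otherwise fills dst and ffr.
 * ffr[i] and all later entries are cleared when element i's fault was
 * suppressed, telling software where the valid data stops. */
static bool
model_load(fault_mode_t mode, const bool *pred, const bool *ok,
           const uint32_t *mem, uint32_t *dst, bool *ffr, size_t n)
{
    bool first_active_seen = false;
    for (size_t i = 0; i < n; i++) {
        ffr[i] = true;
        if (!pred[i]) { /* inactive element: fault always suppressed */
            dst[i] = 0;
            continue;
        }
        if (!ok[i]) {
            /* Normal: any active element may fault.
             * First-fault: only the first active element may fault.
             * Non-fault: active-element faults are always suppressed. */
            if (mode == FAULT_NORMAL ||
                (mode == FAULT_FIRST && !first_active_seen))
                return true;
            for (size_t j = i; j < n; j++)
                ffr[j] = false; /* suppressed: clear FFR from here on */
            return false;
        }
        dst[i] = mem[i];
        first_active_seen = true;
    }
    return false;
}
```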

For “scalar plus scalar” and “scalar plus immediate” load/store instructions, there are variants that access 2/3/4 contiguous elements at a time, each going to the same element number in 2/3/4 vector registers. These have an LDN* or STN* prefix, where N=2/3/4.
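A minimal sketch of the de-interleaving an LD2 structure load performs, assuming word elements (illustrative only, not the actual expansion code):

```c
#include <stdint.h>
#include <stddef.h>

/* Model of ld2w: reads 2*n contiguous words and sends alternating
 * elements to the two destination vector registers, so element i of
 * each structure lands at element i of register i. */
static void
model_ld2w(const uint32_t *mem, uint32_t *z0, uint32_t *z1, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        z0[i] = mem[2 * i];     /* first member of structure i */
        z1[i] = mem[2 * i + 1]; /* second member of structure i */
    }
}
```

LD3/LD4 generalise this with a stride of 3 or 4 across 3 or 4 registers.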

There are also some un-predicated instructions (LDR and STR) that use the "scalar + immediate" scheme to load/store vectors or predicate registers.

The x86 scatter/gather that we handled in #2985 is the "scalar + vector" variant with regular faulting behaviour. More work will be required to adapt drx_expand_scatter_gather to these other variants.

For the contiguous access variants, we could model them as a single memory access with a larger size. But that model is not correct: each element can be active or inactive based on the predicate register, so the addresses that end up being accessed can be non-contiguous. The correct model is scatter/gather, using multiple element-sized accesses.
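The per-element model described above can be sketched as an illustrative unrolling of a predicated contiguous ld1w into one scalar load per active element. Names are hypothetical, not the drx expansion itself:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Expands a predicated contiguous ld1w into element-sized loads:
 * inactive elements generate no memory access at all, which is why a
 * single large access is the wrong model. Records each accessed
 * address and returns how many accesses occurred. */
static size_t
expand_contiguous_ld1w(const bool *pred, const uint32_t *base,
                       uint32_t *dst, uint64_t *accessed, size_t n)
{
    size_t naccess = 0;
    for (size_t i = 0; i < n; i++) {
        if (!pred[i]) {
            dst[i] = 0; /* SVE zeroing predication */
            continue;
        }
        dst[i] = base[i]; /* one scalar load per active element */
        accessed[naccess++] = (uint64_t)(uintptr_t)&base[i];
    }
    return naccess;
}
```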

@derekbruening derekbruening changed the title Expand A64 SVE memory access instructions Expand A64 SVE scatter/gather memory access instructions Aug 4, 2021
@abhinav92003
Contributor Author

The reason I didn't mention "scatter/gather" in the title was that this issue also includes supporting the contiguous access variants (which are not termed "scatter/gather" in the ARM manual).

jackgallagher-arm added a commit that referenced this issue Jun 26, 2023
The existing drx_expand_scatter_gather() function is > 2000 lines of
code, including helper functions, and most of it is very x86 specific.

In preparation for writing the AArch64 version of this function, this
commit moves the x86 implementation of drx_expand_scatter_gather() out
of drx.c into its own architecture specific files, and creates dummy
implementations for other architectures.

The behaviour of the function should be unchanged on all platforms.

As we develop the AArch64 version of drx_expand_scatter_gather(), any
parts of the original x86 implementation which are useful to both can
be pulled out into helper functions which are shared between both
architectures.

Issues: #5844, #5036, #3837
github-merge-queue bot pushed a commit that referenced this issue Jun 27, 2023
The existing drx_expand_scatter_gather() function is > 2000 lines of
code, including helper functions, and most of it is very x86 specific.

In preparation for writing the AArch64 version of this function, this
commit moves the x86 implementation of drx_expand_scatter_gather() out
of drx.c into its own architecture specific files, and creates dummy
implementations for other architectures.

The behaviour of the function should be unchanged on all platforms.

As we develop the AArch64 version of drx_expand_scatter_gather(), any
parts of the original x86 implementation which are useful to both can
be pulled out into helper functions which are shared between both
architectures.

Issues: #5844, #5036, #3837
jackgallagher-arm added a commit that referenced this issue Aug 16, 2023
Creates an AArch64 version of drx_expand_scatter_gather() and unit tests
for it. So far only SVE scalar+vector loads are supported.
Support and tests for more instructions will follow in future commits.

Issues: #5844, #5036, #3837
@derekbruening
Contributor

I would close this one too as a duplicate of the original #3837 but let's instead have #3837 cover only AArch32 and this one cover AArch64.

github-merge-queue bot pushed a commit that referenced this issue Aug 24, 2023
Creates an AArch64 version of drx_expand_scatter_gather() and tests for
it. So far only SVE scalar+vector loads are supported. Support and tests
for more instructions will follow in future commits.

State restore is also not yet supported and will be implemented when
i#5365 is complete.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Sep 6, 2023
Adds support to drx_expand_scatter_gather() for SVE scalar+vector
scatter store instructions, along with tests.

Issue: #5036
AssadHashmi pushed a commit that referenced this issue Sep 7, 2023
Adds support to drx_expand_scatter_gather() for SVE scalar+vector
scatter store instructions, along with tests.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Sep 8, 2023
This should reduce the churn in the test output template files compared
to basic_counts which it used before.

Issues: #6298, #5036
derekbruening pushed a commit that referenced this issue Sep 11, 2023
This should reduce the churn in the test output template files compared
to basic_counts which it used before.

Issues: #6298, #5036
jackgallagher-arm added a commit that referenced this issue Sep 22, 2023
While developing the AArch64 version of drx_expand_scatter_gather
I have seen the code_api|sample.memval_simple_scattergather fail
intermittently with the not yet implemented assert from
drx_scatter_gather_restore_state() and tracked down the problem to
the trace buffer management mechanism. When the trace buffer is full
the next store will be past the end of the buffer and trigger a fault.
The fault is meant to be intercepted by the client and used to flush
the buffer.

If the buffer happens to be full when tracing an expanded scatter/gather
instruction then the scatter/gather state restore function will be
called, which previously just failed with the assert because state
restoration is not implemented yet.

Until scatter/gather state restore on AArch64 is implemented properly we
can work around the problem by marking the expansion loads and stores we
emit with a note that we can check for in the state restore function.
If the faulting access does not originate from an expansion instruction
then we pass the fault on to the client to handle instead.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Oct 4, 2023
While developing the AArch64 version of drx_expand_scatter_gather I
have seen the code_api|sample.memval_simple_scattergather fail
intermittently with the not yet implemented assert from
drx_scatter_gather_restore_state() and tracked down the problem to the
trace buffer management mechanism. When the trace buffer is full the
next store will be past the end of the buffer and trigger a fault. The
fault is meant to be intercepted by the client and used to flush the
buffer.

If the buffer happens to be full when tracing an expanded scatter/gather
instruction then the scatter/gather state restore function will be
called, which previously just failed with the assert because state
restoration is not implemented yet.

Until scatter/gather state restore on AArch64 is implemented properly we
can work around the problem by marking the expansion loads and stores we
emit with a note that we can check for in the state restore function. If
the faulting access does not originate from an expansion instruction
then we pass the fault on to the client to handle instead.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Oct 5, 2023
Adds support to drx_expand_scatter_gather() for SVE vector+immediate
gather load and scatter store instructions, along with tests.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Oct 9, 2023
Adds support to drx_expand_scatter_gather() for SVE vector+immediate
gather load and scatter store instructions, along with tests.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Oct 16, 2023
In a couple of places in the drcachesim tracer, code that handles
scatter/gather instructions was guarded by #ifdef X86, which meant
offline traces that include scatter/gather instructions produced
errors on AArch64.

For example running the aarch64 scattergather test app:
    $bin64/drrun -t drcachesim -offline -- suite/tests/bin/client.drx-scattergather
    $bin64/drrun -t drcachesim -indir drmemtrace.client.drx-scattergather.662655.1769.dir

produced the error:
    ERROR: failed to initialize analyzer: raw2trace failed: Failed to
    process file for thread 662655: memref entry found outside of bb

This commit enables the code on AArch64 as well and fixes the error.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Oct 16, 2023
In a couple of places in the drcachesim tracer, code that handles
scatter/gather instructions was guarded by #ifdef X86, which meant
offline traces that include scatter/gather instructions produced
errors on AArch64.

For example running the aarch64 scattergather test app:
    $bin64/drrun -t drcachesim -offline -- suite/tests/bin/client.drx-scattergather
    $bin64/drrun -t drcachesim -indir drmemtrace.client.drx-scattergather.662655.1769.dir

produced the error:
    ERROR: failed to initialize analyzer: raw2trace failed: Failed to process file for thread 662655: memref entry found outside of bb

This commit enables the code on AArch64 as well and fixes the error.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Oct 16, 2023
Adds support to drx_expand_scatter_gather() for SVE scalar+scalar
contiguous load and store instructions, along with tests.

Issue: #5036
@derekbruening
Contributor

To get the individual memory sizes right during post-processing in raw2trace, where we look at the original scatter/gather instruction, we've been relying on the IR storing the per-element operand size instead of the total (which is different from other SIMD operations; xref #6561 on what to do for predicated SIMD). An alternative solution is to store the encoding during tracing like we do for vdso or JIT code (see offline_instru_t::record_instr_encodings()): not the encoding of the emulated instr info's pointer to the original scatter/gather, but rather the unrolled single loads/stores. This would greatly simplify raw2trace and the questions about what the IR should store. OTOH it adds some tracing overhead, which probably outweighs the gains in reduced code complexity.
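To illustrate the size question for, say, a 256-bit ld1w with eight 4-byte elements: does the IR memory operand report 4 bytes (per-element) or 32 bytes (total)? A hedged sketch of the two conventions, with the struct and names invented purely for illustration:

```c
#include <stddef.h>

typedef struct {
    size_t memop_size;   /* size reported by the IR memory operand */
    size_t num_elements; /* element count of the vector access */
} sg_instr_info_t;

/* Per-element convention: the operand size is already the size of
 * each expanded access record. */
static size_t
record_size_per_element_convention(const sg_instr_info_t *info)
{
    return info->memop_size;
}

/* Total-size convention: raw2trace would have to divide by the
 * element count to recover the per-access size. */
static size_t
record_size_total_convention(const sg_instr_info_t *info)
{
    return info->memop_size / info->num_elements;
}
```

Both conventions can yield the same record size, but only the per-element one lets raw2trace read it directly off the operand.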

jackgallagher-arm added a commit that referenced this issue Jan 18, 2024
When debugging i#6499 we noticed that drcachesim was producing 0 byte
read/write records for some SVE load/store instructions:

```
 ifetch       4 byte(s) @ 0x0000000000405b3c a54a4681   ld1w   (%x20,%x10,lsl #2) %p1/z -> %z1.s
 read         0 byte(s) @ 0x0000000000954e80 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e84 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e88 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e8c by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e90 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e94 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e98 by PC 0x0000000000405b3c
 read         0 byte(s) @ 0x0000000000954e9c by PC 0x0000000000405b3c
 ifetch       4 byte(s) @ 0x0000000000405b4
```

This turned out to be due to drdecode being linked into drcachesim
twice: once into the drcachesim executable, once into libdynamorio.
drdecode uses a global variable to store the SVE vector length to use
when decoding so we end up with two copies of that variable and only one
was being initialized.
To fix this properly we would need to refactor the libraries so that
there is only one copy of the sve_veclen global variable, or change the
way that the decoder gets the vector length so it's no longer stored in a
global variable. In the meantime we have a workaround which makes sure
both copies of the variable get initialized and drcachesim produces
correct results.

With that workaround in place however, the results were still wrong. For
expanded scatter/gather instructions when you are using an offline
trace, raw2trace doesn't have access to the load/store instructions from
the expansion, only the original app scatter/gather instruction. It has
to create the read/write records using only information from the
original scatter/gather instruction and it uses the size of the memory
operand to determine the size of each read/write. This works for x86
because the x86 IR uses the per-element data size for the memory
operand of scatter/gather instructions. This doesn't work for AArch64
because the AArch64 codec uses the maximum data transferred (per-element
data size * number of elements) like other SIMD load/store instructions.

We plan to make the AArch64 IR consistent with x86 by changing it to use
the same convention as x86 for scatter/gather instructions but in the
mean time we can work around the inconsistency by fixing the size in
raw2trace based on the instruction's opcode.

Issues: #6499, #5365, #5036
jackgallagher-arm added a commit that referenced this issue Jan 23, 2024
This makes the IR consistent with x86 which already uses the per-element
transfer size for the scatter/gather memory operand size.

Issues: #5365, #5036, #6561
jackgallagher-arm added a commit that referenced this issue Jan 24, 2024
(#6574)

Make the AArch64 IR consistent with x86 which already uses the
per-element transfer size for the scatter/gather memory operand
size.

This changes the AArch64 codec for the scatter/gather and predicated
contiguous load/store instructions to use the per-element access size
for the memory operand instead of the maximum total transfer size
that it used previously, and updates the tests accordingly.
    
Issues: #5365, #5036, #6561
AssadHashmi added a commit that referenced this issue Mar 26, 2024
This patch adds SVE support for signals in the core. It is the
follow-on patch from the SVE core work part 1, in PR #5835 (f646a63) and
includes vector address computation for SVE scatter/gather, enabling
first-fault load handling.

Issue: #5365, #5036

Co-authored-by: Jack Gallagher <[email protected]>
AssadHashmi added a commit that referenced this issue Apr 3, 2024
This patch adds SVE support for signals in the core. It is the follow-on
patch from the SVE core work part 1, in PR #5835 (f646a63) and
includes vector address computation for SVE scatter/gather, enabling
first-fault load handling.

Issue: #5365, #5036

Co-authored-by: Jack Gallagher <[email protected]>
jackgallagher-arm added a commit that referenced this issue Apr 3, 2024
Adds support for restoring register app state when a fault is triggered
by one of the instructions in a scatter/gather expansion sequence
emitted by drx_expand_scatter_gather().

Also extends the existing scatter/gather expansion tests to check the value
of the FFR register, and run the test instructions with different elements
rigged to fault so we can test the state restoration behaviour.

Support for the first-faulting and non-faulting variants (ldff1*, ldnf1*)
will be added in future commits.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Apr 4, 2024
Adds support for restoring register app state when a fault is triggered
by one of the instructions in a scatter/gather expansion sequence
emitted by drx_expand_scatter_gather().

Also extends the existing scatter/gather expansion tests to check the
value of the FFR register, and run the test instructions with different
elements rigged to fault so we can test the state restoration behaviour.

Support for the first-faulting and non-faulting variants (ldff1*,
ldnf1*) will be added in future commits.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Apr 4, 2024
Adds support for non-fault loads to drx_expand_scatter_gather().
Non-fault loads (ldnf1*) work similarly to scalar+immediate predicated
contiguous ld1* loads, but with different behaviour if an element
access faults. This commit implements this behaviour and extends the
scatter/gather tests to include ldnf1* instructions.

Issue: #5036
abhinav92003 added a commit that referenced this issue Apr 5, 2024
Fixes the slot used to save and restore FP regs at fcache enter and
return events. PR #6725 adjusted the slots used during signal handling
in core/unix/signal_linux_aarch64.c but did not adjust the same in
fcache enter/return and attach events. Prior to that PR, the FP regs
were simply stored in a contiguous manner in signal handling code and
fcache enter/return routines (instead of in their respective dr_simd_t
struct member), which was a bit confusing.

The mismatch between slot usage in signal handling and fcache
enter/return code caused failures in the signalNNN1 tests on some
systems. Verified that those tests pass with this fix.

Also fixes the same issue in save_priv_mcontext_helper which is used in
the dr_app_start API. Unit tests for this scenario will be added as part
of #6759.

Issue: #5036, #6755, #5365
Fixes #6758
AssadHashmi pushed a commit that referenced this issue Apr 8, 2024
Adds support for non-fault loads to drx_expand_scatter_gather().
Non-fault loads (ldnf1*) work similarly to scalar+immediate predicated
contiguous ld1* loads, but with different behaviour if an element access
faults. This commit implements this behaviour and extends the
scatter/gather tests to include ldnf1* instructions.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Apr 16, 2024
Adds support for first-fault loads to drx_expand_scatter_gather().
First-fault loads (ldff1*) work similarly to
scalar+vector/vector+immediate gather instructions or scalar+scalar
predicated contiguous ld1* loads, but with different behaviour if an
element access faults. This commit implements this behaviour and extends
the scatter/gather tests to include ldff1* instructions.

Issue: #5036
AssadHashmi pushed a commit that referenced this issue Apr 17, 2024
Adds support for first-fault loads to drx_expand_scatter_gather().
First-fault loads (ldff1*) work similarly to
scalar+vector/vector+immediate gather instructions or scalar+scalar
predicated contiguous ld1* loads, but with different behaviour if an
element access faults. This commit implements this behaviour and extends
the scatter/gather tests to include ldff1* instructions.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue May 1, 2024
Adds details of the AArch64 scatter/gather expansion to the
scatter/gather expansion developer documentation.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue May 2, 2024
Adds details of the AArch64 scatter/gather expansion to the
scatter/gather expansion developer documentation.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Jul 4, 2024
I have done some testing with 512-bit vector registers and found a
couple of bugs in the scatter/gather expansion code.

 - An assert had incorrect logic (< should have been <=)
 - Added OPSZ_192, needed to represent 3 512-bit Z registers.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Jul 4, 2024
I have done some testing with 512-bit vector registers and found a
couple of bugs in the scatter/gather expansion code.

 - An assert had incorrect logic (< should have been <=)
 - Added OPSZ_192, needed to represent 3 512-bit Z registers.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Oct 10, 2024
Fixes the compiler error seen on GCC 13.2:

ext/drx/scatter_gather_aarch64.c:1238:37: error:
    array subscript 2 is above array bounds of 'sg_slot_t[2]' {aka 'struct _sg_slot_t[2]'}
    [-Werror=array-bounds=]
 1238 |     DR_ASSERT(slot_state->pred_slots[slot].kind == SLOT_KIND_UNUSED);

In practice the code will never make an out of bounds access because the
for loop above will always terminate by breaking so `slot` is always
`< NUM_PRED_SLOTS`.

Issue: #5036
jackgallagher-arm added a commit that referenced this issue Oct 10, 2024
Fixes the compiler error seen on GCC 13.2:

ext/drx/scatter_gather_aarch64.c:1238:37: error:
array subscript 2 is above array bounds of 'sg_slot_t[2]' {aka 'struct
_sg_slot_t[2]'}
    [-Werror=array-bounds=]
1238 | DR_ASSERT(slot_state->pred_slots[slot].kind == SLOT_KIND_UNUSED);

In practice the code will never make an out of bounds access because the
for loop above will always terminate by breaking so `slot` is always `<
NUM_PRED_SLOTS`.

Issue: #5036