Improvements & a bug #77

Open
dzaima opened this issue Jan 29, 2024 · 3 comments

dzaima commented Jan 29, 2024

Some notes on possible better implementations, from scrolling through this for a bit:

blendv uses e.g. __riscv_vmsne_vx_i64m1_b64(__riscv_vsra_vx_i64m1(x, 63, 2), 0, 2), where __riscv_vmslt_vx_i64m1_b64(x, 0, 2) would do.
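
For illustration, a minimal before/after sketch (x is a placeholder vint64m1_t):

// current: arithmetic-shift the sign bit across the element, then compare with 0
vbool64_t sign_shift = __riscv_vmsne_vx_i64m1_b64(__riscv_vsra_vx_i64m1(x, 63, 2), 0, 2);
// equivalent, single instruction: an i64 element is negative exactly when its sign bit is set
vbool64_t sign_cmp = __riscv_vmslt_vx_i64m1_b64(x, 0, 2);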

Expanding vbool to a mask: in e.g. __riscv_vmerge_vvm_i64m1(__riscv_vmv_v_x_i64m1(0, 2), __riscv_vmv_v_x_i64m1(UINT64_MAX, 2), mask, 2), a _vxm_ version can be used, giving __riscv_vmerge_vxm_i64m1(__riscv_vmv_v_x_i64m1(0, 2), -1, mask, 2). This compiles to a vmerge.vim, with the -1 as an immediate.
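
Side by side, that is (a sketch; mask is a placeholder vbool64_t):

// current: materialize an all-ones vector, then select per element
vint64m1_t via_vvm = __riscv_vmerge_vvm_i64m1(
    __riscv_vmv_v_x_i64m1(0, 2), __riscv_vmv_v_x_i64m1(UINT64_MAX, 2), mask, 2);
// simpler: pass -1 as a scalar, which the compiler folds into vmerge.vim
vint64m1_t via_vxm = __riscv_vmerge_vxm_i64m1(
    __riscv_vmv_v_x_i64m1(0, 2), -1, mask, 2);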

_sd/_ss functions: a tail-undisturbed op can be used to preserve the top element(s), e.g.

// I/O types changed so these are easier to get into Compiler Explorer; it's just reinterprets though
vfloat32m1_t _mm_add_ss(vfloat32m1_t _a, vfloat32m1_t _b) {
  // vl=1 adds only element 0; _tu keeps the remaining elements of _a
  return __riscv_vfadd_vv_f32m1_tu(_a, _a, _b, 1);
}
vint64m1_t _mm_cmplt_sd(vfloat64m1_t _a, vfloat64m1_t _b) {
  vbool64_t cmp_res = __riscv_vmflt_vv_f64m1_b64(_a, _b, 1);
  // element 0 becomes -1 or 0 per the compare; _tu keeps _a's upper element
  return __riscv_vmerge_vxm_i64m1_tu(__riscv_vreinterpret_v_f64m1_i64m1(_a),
      __riscv_vmv_v_x_i64m1(0, 1), -1,
      cmp_res, 1);
}
vfloat64m1_t _mm_load_sd(double const *mem_addr) {
  vfloat64m1_t zeros = __riscv_vfmv_v_f_f64m1(0, 2);
  // loads element 0 from memory; _tu keeps the zeros in the tail
  return __riscv_vle64_v_f64m1_tu(zeros, mem_addr, 1);
}

Your current definitions for them don't always behave correctly, as your __riscv_vslideup_vx calls need a _tu (see the sketch below); you can observe tests failing with rvv_ta_all_1s=on,rvv_ma_all_1s=on added to QEMU's -cpu flags.
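
As an illustrative sketch only (I haven't checked the exact definitions in the repo; move_sd_sketch is a hypothetical stand-in showing the _tu form):

vfloat64m1_t move_sd_sketch(vfloat64m1_t a, vfloat64m1_t b) {
  // writes b[0] into element 0; with _tu, elements at index 1 and up keep a's
  // values instead of being tail-agnostic (i.e. potentially garbage)
  return __riscv_vslideup_vx_f64m1_tu(a, b, 0, 1);
}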

Widening ops should LMUL-truncate the input, not the output, to avoid overly large temporary registers. And the _vf4/_vf8 variants can be used too:

vint64m1_t _mm_cvtepi8_epi64(vint8m1_t _a) {
  // truncate the input to mf8 so the 8x-widened result still fits in a single m1 register
  vint8mf8_t a_trunc = __riscv_vlmul_trunc_v_i8m1_i8mf8(_a);
  return __riscv_vsext_vf8_i64m1(a_trunc, 2);
}

_mm_mulhi_epu16 & _mm_mulhi_epi16 have exact RVV equivalents without any temporary widening: __riscv_vmulhu_vv_u16m1 & __riscv_vmulh_vv_i16m1.
_mm_mullo_epi16 & _mm_mullo_epi32 are just __riscv_vmul_vv_i16m1 & __riscv_vmul_vv_i32m1.
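
In the same simplified-type style as the snippets above, these would be just (assuming 128-bit vectors, hence vl=8 for 16-bit elements):

vint16m1_t mulhi_epi16(vint16m1_t a, vint16m1_t b) {
  return __riscv_vmulh_vv_i16m1(a, b, 8);  // signed high half of the 32-bit product
}
vuint16m1_t mulhi_epu16(vuint16m1_t a, vuint16m1_t b) {
  return __riscv_vmulhu_vv_u16m1(a, b, 8); // unsigned high half
}
vint16m1_t mullo_epi16(vint16m1_t a, vint16m1_t b) {
  return __riscv_vmul_vv_i16m1(a, b, 8);   // low half; same op for epi32 with vl=4
}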

int _mm_test_all_ones(vint32m1_t a) {
  vint32m1_t redand = __riscv_vredand_vs_i32m1_i32m1(a, a, 4); // bit of a hack to use `a` as the scalar too, but it works
  return __riscv_vmv_x_s_i32m1_i32(redand) == -1;
}

(Additionally, clang gives (harmless) warnings on __riscv_vmv_v_x_i16m1(UINT16_MAX, 8) (and likewise for UINT8_MAX), as that passes an unsigned value to a signed parameter. __riscv_vmv_v_x_i16m1(-1, 8) is both shorter and avoids the warning, though most places with these should be using vmerge_vxm or similar anyway. To cross-compile with clang, all you need to do is add --target=riscv64-linux-gnu to the clang++ invocations.)
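
For example (the -march value and file name are placeholders of mine; use whatever the project's build already passes):

clang++ --target=riscv64-linux-gnu -march=rv64gcv -O2 -c tests.cpp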

howjmay (Member) commented Jan 30, 2024

Thank you @dzaima for your advice!
This helps a lot. I'm not able to use a computer on weekdays, so I will fix these over the weekend.
Btw, I don't know the behavior of _tu. Is there any keyword that can help me find some information?

dzaima (Author) commented Jan 31, 2024

_tu stands for tail-undisturbed, as opposed to the default of tail-agnostic; see the RVV specification section on it.

The default for an intrinsic is tail-agnostic: e.g. __riscv_vfadd_vv_f32m1(a, b, 2) is allowed to have arbitrary results for elements at index 2 and above. But with _tu you request specific tail elements; namely, __riscv_vfadd_vv_f32m1_tu(base, a, b, 2) requests that those elements be taken from base.
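
A compact illustration (base, a, b being arbitrary vfloat32m1_t values):

vfloat32m1_t ta = __riscv_vfadd_vv_f32m1(a, b, 2);          // elements at index >= 2: unspecified
vfloat32m1_t tu = __riscv_vfadd_vv_f32m1_tu(base, a, b, 2); // elements at index >= 2: taken from base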

In my intrinsics viewer you can click on the options under "Variations:" on an instruction; here's __riscv_vfadd_vv_f32m1 and here's __riscv_vfadd_vv_f32m1_tu; between the two, the tail loop for (size_t i = vl; i < vlmax; i++) changes its behavior.

The QEMU options I mentioned change its behavior on those elements, as the RVV spec allows implementation behavior to vary: by default QEMU makes tail-agnostic and tail-undisturbed behave the same, but those options make agnostic elements be set to all-1s instead.

Your __riscv_vslideup_vx_f64m1(a, b, c, 1), therefore, results in unspecified elements at any index at or above 1, but QEMU's default behavior hides that.

howjmay (Member) commented Jan 31, 2024

Thank you so much for the explanation. And your intrinsics viewer is an awesome piece of work!!
I will change them as soon as I have access to my computer.
