Improvements & a bug #77

Open
dzaima opened this issue Jan 29, 2024 · 3 comments

dzaima commented Jan 29, 2024

Some notes on possible better implementations, from scrolling through this for a bit:

blendv uses e.g. __riscv_vmsne_vx_i64m1_b64(__riscv_vsra_vx_i64m1(x, 63, 2), 0, 2), where __riscv_vmslt_vx_i64m1_b64(x, 0, 2) would do.
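
For illustration, a minimal before/after sketch (x is a placeholder vint64m1_t):

// current: arithmetic-shift the sign bit across the element, then compare with 0
vbool64_t sign_shift = __riscv_vmsne_vx_i64m1_b64(__riscv_vsra_vx_i64m1(x, 63, 2), 0, 2);
// equivalent, single instruction: an i64 element is negative exactly when its sign bit is set
vbool64_t sign_cmp = __riscv_vmslt_vx_i64m1_b64(x, 0, 2);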

Expanding vbool to a mask: in e.g. __riscv_vmerge_vvm_i64m1(__riscv_vmv_v_x_i64m1(0, 2), __riscv_vmv_v_x_i64m1(UINT64_MAX, 2), mask, 2), a _vxm_ version can be used, giving __riscv_vmerge_vxm_i64m1(__riscv_vmv_v_x_i64m1(0, 2), -1, mask, 2). This compiles to a vmerge.vim, with the -1 as an immediate.
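
Side by side, that is (a sketch; mask is a placeholder vbool64_t):

// current: materialize an all-ones vector, then select per element
vint64m1_t via_vvm = __riscv_vmerge_vvm_i64m1(
    __riscv_vmv_v_x_i64m1(0, 2), __riscv_vmv_v_x_i64m1(UINT64_MAX, 2), mask, 2);
// simpler: pass -1 as a scalar, which the compiler folds into vmerge.vim
vint64m1_t via_vxm = __riscv_vmerge_vxm_i64m1(
    __riscv_vmv_v_x_i64m1(0, 2), -1, mask, 2);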

_sd/_ss functions: a tail-undisturbed op can be used to preserve the top element(s), e.g.

// I/O types changed so these are easier to get into Compiler Explorer; it's just reinterprets though
vfloat32m1_t _mm_add_ss(vfloat32m1_t _a, vfloat32m1_t _b) {
  // vl=1 adds only element 0; _tu keeps the remaining elements of _a
  return __riscv_vfadd_vv_f32m1_tu(_a, _a, _b, 1);
}
vint64m1_t _mm_cmplt_sd(vfloat64m1_t _a, vfloat64m1_t _b) {
  vbool64_t cmp_res = __riscv_vmflt_vv_f64m1_b64(_a, _b, 1);
  // element 0 becomes -1 or 0 per the compare; _tu keeps _a's upper element
  return __riscv_vmerge_vxm_i64m1_tu(__riscv_vreinterpret_v_f64m1_i64m1(_a),
      __riscv_vmv_v_x_i64m1(0, 1), -1,
      cmp_res, 1);
}
vfloat64m1_t _mm_load_sd(double const *mem_addr) {
  vfloat64m1_t zeros = __riscv_vfmv_v_f_f64m1(0, 2);
  // loads element 0 from memory; _tu keeps the zeros in the tail
  return __riscv_vle64_v_f64m1_tu(zeros, mem_addr, 1);
}

Your current definitions for them don't always behave correctly, as your __riscv_vslideup_vx calls need a _tu (see the sketch below); you can observe tests failing with rvv_ta_all_1s=on,rvv_ma_all_1s=on added to QEMU's -cpu flags.
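
As an illustrative sketch only (I haven't checked the exact definitions in the repo; move_sd_sketch is a hypothetical stand-in showing the _tu form):

vfloat64m1_t move_sd_sketch(vfloat64m1_t a, vfloat64m1_t b) {
  // writes b[0] into element 0; with _tu, elements at index 1 and up keep a's
  // values instead of being tail-agnostic (i.e. potentially garbage)
  return __riscv_vslideup_vx_f64m1_tu(a, b, 0, 1);
}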

Widening ops should LMUL-truncate the input, not the output, to avoid overly large temporary registers. And the _vf4/_vf8 variants can be used too:

vint64m1_t _mm_cvtepi8_epi64(vint8m1_t _a) {
  // truncate the input to mf8 so the 8x-widened result still fits in a single m1 register
  vint8mf8_t a_trunc = __riscv_vlmul_trunc_v_i8m1_i8mf8(_a);
  return __riscv_vsext_vf8_i64m1(a_trunc, 2);
}

_mm_mulhi_epu16 & _mm_mulhi_epi16 have exact RVV equivalents without any temporary widening: __riscv_vmulhu_vv_u16m1 & __riscv_vmulh_vv_i16m1.
_mm_mullo_epi16 & _mm_mullo_epi32 are just __riscv_vmul_vv_i16m1 & __riscv_vmul_vv_i32m1.
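
In the same simplified-type style as the snippets above, these would be just (assuming 128-bit vectors, hence vl=8 for 16-bit elements):

vint16m1_t mulhi_epi16(vint16m1_t a, vint16m1_t b) {
  return __riscv_vmulh_vv_i16m1(a, b, 8);  // signed high half of the 32-bit product
}
vuint16m1_t mulhi_epu16(vuint16m1_t a, vuint16m1_t b) {
  return __riscv_vmulhu_vv_u16m1(a, b, 8); // unsigned high half
}
vint16m1_t mullo_epi16(vint16m1_t a, vint16m1_t b) {
  return __riscv_vmul_vv_i16m1(a, b, 8);   // low half; same op for epi32 with vl=4
}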

int _mm_test_all_ones(vint32m1_t a) {
  vint32m1_t redand = __riscv_vredand_vs_i32m1_i32m1(a, a, 4); // bit of a hack to use `a` as the scalar too, but it works
  return __riscv_vmv_x_s_i32m1_i32(redand) == -1;
}

(Additionally, clang gives (harmless) warnings on __riscv_vmv_v_x_i16m1(UINT16_MAX, 8) (and likewise for UINT8_MAX), as that passes an unsigned value to a signed parameter. __riscv_vmv_v_x_i16m1(-1, 8) is both shorter and avoids the warning, though most places with these should be using vmerge_vxm or similar anyway. To cross-compile with clang, all you need to do is add --target=riscv64-linux-gnu to the clang++ invocations.)
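
For example (the -march value and file name are placeholders of mine; use whatever the project's build already passes):

clang++ --target=riscv64-linux-gnu -march=rv64gcv -O2 -c tests.cpp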

howjmay (Member) commented Jan 30, 2024

Thank you @dzaima for your advice!
This helps a lot. I'm not able to use a computer on weekdays, so I will fix these over the weekend.
Btw, I don't know the behavior of _tu. Is there any keyword that can help me find some information?

dzaima (Author) commented Jan 31, 2024

_tu stands for tail-undisturbed, as opposed to the default of tail-agnostic; see the RVV specification section on it.

The default for an intrinsic is tail-agnostic: e.g. __riscv_vfadd_vv_f32m1(a, b, 2) is allowed to have arbitrary results for elements at index 2 and above. But with _tu you request specific tail elements; namely, __riscv_vfadd_vv_f32m1_tu(base, a, b, 2) requests that those elements be taken from base.
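
A compact illustration (base, a, b being arbitrary vfloat32m1_t values):

vfloat32m1_t ta = __riscv_vfadd_vv_f32m1(a, b, 2);          // elements at index >= 2: unspecified
vfloat32m1_t tu = __riscv_vfadd_vv_f32m1_tu(base, a, b, 2); // elements at index >= 2: taken from base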

In my intrinsics viewer you can click on the options under "Variations:" on an instruction; here's __riscv_vfadd_vv_f32m1 and here's __riscv_vfadd_vv_f32m1_tu; between the two, the tail loop for (size_t i = vl; i < vlmax; i++) changes its behavior.

The QEMU options I mentioned change its behavior on those elements, as the RVV spec allows implementation behavior to vary: by default QEMU makes tail-agnostic and tail-undisturbed behave the same, but those options make agnostic elements be set to all-1s instead.

Your __riscv_vslideup_vx_f64m1(a, b, c, 1), therefore, results in unspecified elements at any index at or above 1, but QEMU's default behavior hides that.

howjmay (Member) commented Jan 31, 2024

Thank you so much for the explanation. And your intrinsics viewer is an awesome piece of work!!
I will change them as soon as I have access to my computer.
