array assignment slower than numpy #2095

Open
hmaarrfk opened this issue Apr 9, 2023 · 15 comments
Comments

@hmaarrfk
Contributor

hmaarrfk commented Apr 9, 2023

Hey, thanks for making this cool library. I really do believe that the advantages you outline in terms of ahead-of-time compilation are valuable to those building powerful scientific computation libraries.

I was trying my hand at doing a "simple" image analysis task: rgb -> nv12 conversion.

nv12 seems to be similar to yuv420 (I420) but with the U and V channels interleaved instead of on distinct planes.

This should be a simple transpose-like operation, but as is typical, it is easy to do it very slowly depending on how you go about it.

mark@nano 
--------- 
OS: Ubuntu 22.10 x86_64 
Host: 20UN005JUS ThinkPad X1 Nano Gen 1 
Kernel: 5.19.0-38-generic 
Uptime: 7 days, 1 hour, 43 mins 
Packages: 4010 (dpkg), 23 (flatpak), 12 (snap) 
Shell: bash 5.2.2 
Resolution: 2160x1350 
DE: GNOME 43.1 
WM: Mutter 
WM Theme: Adwaita 
Theme: Yaru-viridian [GTK2/3] 
Icons: Yaru-viridian [GTK2/3] 
Terminal: kitty 
CPU: 11th Gen Intel i7-1180G7 (8) @ 4.600GHz 
GPU: Intel Tiger Lake-UP4 GT2 [Iris Xe Graphics] 
Memory: 6991MiB / 15704MiB

The operation amounts to something like transposing a 2D array: from [u_0, u_1, ..., u_n, v_0, v_1, ..., v_n] to [u_0, v_0, u_1, v_1, ..., u_n, v_n].

To set up the problem, let's start with:

import numpy as np
N_iter = 200
image_shape = (3072, 3072)

uv = np.full(
    (image_shape[0] // 4 * 2, image_shape[1]),
    fill_value=0,
    dtype='uint8'
)
uv.shape = (-1,)

For some reason, I need to do the operation in place. In numpy, this would amount to:

import numpy as np
def pack_np(uv):
    uv_copy = uv.copy()
    u_size = uv.shape[0] // 2
    uv[0::2] = uv_copy[:u_size]
    uv[1::2] = uv_copy[u_size:]
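
For a concrete picture, here is what pack_np does on a tiny 8-element toy array (illustrative only, not part of the benchmark):

import numpy as np
# four u samples followed by four v samples
toy = np.array([10, 11, 12, 13, 20, 21, 22, 23], dtype='uint8')
pack_np(toy)
print(toy)  # [10 20 11 21 12 22 13 23] -- u and v interleaved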

On my computer, I get about 700-750 iterations per second with this loop:

from tqdm import tqdm
N_iter = 200
for i in tqdm(range(N_iter)):
    pack_np(uv)

As a bound, I estimated that the "upper bound of performance" would be achieved with:

def double_copy(uv):
    uv_copy = uv.copy()
    u_size = uv.shape[0] // 2
    # two full contiguous writes stand in for the two strided assignments in pack_np
    uv[...] = uv_copy
    uv[...] = uv_copy
    
for i in tqdm(range(N_iter)):
    double_copy(uv)

This achieves 1184 iterations / sec! Pretty good. I was hoping that pythran could help get to that level.

I tried two different ways of doing this:

import pythran
%load_ext pythran.magic
%%pythran
import numpy as np

# pythran export pack_my_pythran_loop(uint8[:])
def pack_my_pythran_loop(uv):
    u_size = uv.shape[0] // 2
    
    uv_copy = uv.copy()
    for i in range(u_size):
        uv[i * 2 + 0] = uv_copy[i]
        uv[i * 2 + 1] = uv_copy[u_size + i]

# pythran export pack_my_pythran_assign(uint8[:])
def pack_my_pythran_assign(uv):
    u_size = uv.shape[0] // 2
    uv_copy =  uv.copy()
    uv[0::2] = uv_copy[:u_size]
    uv[1::2] = uv_copy[u_size:]
  • pack_my_pythran_assign achieves 315 its / second
  • pack_my_pythran_loop achieves 460 its / second
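
These numbers come from the same kind of benchmark as the numpy versions, i.e. (assuming the same uv array and tqdm loop as above):

for i in tqdm(range(N_iter)):
    pack_my_pythran_assign(uv)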

I tried with numba for completeness:

from numba import njit as jit
@jit
def pack_numba(uv):
    u_size = uv.shape[0] // 2
    
    uv_copy = uv.copy()
    for i in range(u_size):
        uv[i * 2 + 0] = uv_copy[i]
        uv[i * 2 + 1] = uv_copy[u_size + i]
        
pack_numba(uv)  # warm-up call so JIT compilation isn't included in the timing
for i in tqdm(range(N_iter)):
    pack_numba(uv)

Achieves pretty close to 600 its/second.

I'm not really expecting too much "speedup" here, but I figured this was strange. I'm hoping that this little example can help improve this library.

Best.

Mark

library              iterations per second
numpy                700
double copy          1200
pythran assignment   315
pythran loop         450
numba loop           600
@serge-sans-paille
Owner

Thanks for the detailed report! I'm a bit puzzled by the result. I set up the following test, inspired by your test case:

python -m timeit -s 'from pack import pack_my_pythran_loop; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_loop(uv)'

with numpy (loop): 436 msec per loop
with numpy (assign): 2.12 msec per loop
with pythran (loop): 755 usec per loop
with pythran (assign): 2.14 msec per loop
with numba (loop): 1.05 msec
with numba (assign): 1.16 msec

So I see an issue with the assign part (and I can work on it!) but none with the loop :-/
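
For reference, a minimal sketch of the pack.py module imported above, assuming it simply contains the two functions from the original report and is compiled ahead of time with `pythran pack.py`:

# pack.py -- assumed contents, compiled with `pythran pack.py`

# pythran export pack_my_pythran_loop(uint8[:])
def pack_my_pythran_loop(uv):
    u_size = uv.shape[0] // 2
    uv_copy = uv.copy()
    for i in range(u_size):
        uv[i * 2 + 0] = uv_copy[i]
        uv[i * 2 + 1] = uv_copy[u_size + i]

# pythran export pack_my_pythran_assign(uint8[:])
def pack_my_pythran_assign(uv):
    u_size = uv.shape[0] // 2
    uv_copy = uv.copy()
    uv[0::2] = uv_copy[:u_size]
    uv[1::2] = uv_copy[u_size:]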

@hmaarrfk
Contributor Author

hmaarrfk commented Apr 9, 2023

For full reproducibility, I probably owe you information about my conda environment, and compiler versions.

I was travelling when I wrote the report, so it wasn't easy for me to spin up new, clean environments.

I'll try your little test when I get back on a few machines.

@hmaarrfk
Contributor Author

hmaarrfk commented Apr 10, 2023

$ python -m timeit -s 'from pack import pack_my_pythran_loop; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_loop(uv)'
200 loops, best of 5: 1.88 msec per loop
(pythran) ✔ ~/git/pythran 
$ python -m timeit -s 'from pack import pack_my_pythran_assign; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_assign(uv)'
100 loops, best of 5: 3.01 msec per loop
  • numpy assign: 200 loops, best of 5: 1.24 msec per loop
  • pythran loop: 200 loops, best of 5: 1.88 msec per loop
  • pythran assign: 100 loops, best of 5: 3.01 msec per loop
Environment before installing numba
mamba create --name pythran pythran blas numpy --channel conda-forge --override-channels
# packages in environment at /home/mark/mambaforge/envs/pythran:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
beniget                   0.4.1              pyhd8ed1ab_0    conda-forge
binutils_impl_linux-64    2.39                 he00db2b_1    conda-forge
binutils_linux-64         2.39                h5fc0e48_12    conda-forge
blas                      2.116                  openblas    conda-forge
blas-devel                3.9.0           16_linux64_openblas    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2022.12.7            ha878542_0    conda-forge
decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
gast                      0.5.3              pyhd8ed1ab_0    conda-forge
gcc_impl_linux-64         11.3.0              hab1b70f_19    conda-forge
gcc_linux-64              11.3.0              he6f903b_12    conda-forge
gxx_impl_linux-64         11.3.0              hab1b70f_19    conda-forge
gxx_linux-64              11.3.0              hc203a17_12    conda-forge
kernel-headers_linux-64   2.6.32              he073ed8_15    conda-forge
ld_impl_linux-64          2.39                 hcc3a1bd_1    conda-forge
libblas                   3.9.0           16_linux64_openblas    conda-forge
libcblas                  3.9.0           16_linux64_openblas    conda-forge
libexpat                  2.5.0                hcb278e6_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-devel_linux-64     11.3.0              h210ce93_19    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
liblapack                 3.9.0           16_linux64_openblas    conda-forge
liblapacke                3.9.0           16_linux64_openblas    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
libsanitizer              11.3.0              h239ccf8_19    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libstdcxx-devel_linux-64  11.3.0              h210ce93_19    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
llvm-openmp               16.0.1               h417c0b6_0    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
numpy                     1.24.2          py311h8e6699e_0    conda-forge
openblas                  0.3.21          pthreads_h320a7e8_3    conda-forge
openssl                   3.1.0                h0b41bf4_0    conda-forge
pip                       23.0.1             pyhd8ed1ab_0    conda-forge
ply                       3.11                       py_1    conda-forge
python                    3.11.3          h2755cc3_0_cpython    conda-forge
python_abi                3.11                    3_cp311    conda-forge
pythran                   0.12.1          py311hcb41070_0    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                67.6.1             pyhd8ed1ab_0    conda-forge
sysroot_linux-64          2.12                he073ed8_15    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
wheel                     0.40.0             pyhd8ed1ab_0    conda-forge
xsimd                     8.0.5                h4bd325d_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zstd                      1.5.2                h3eb15da_6    conda-forge


     active environment : pythran
    active env location : /home/mark/mambaforge/envs/pythran
            shell level : 1
       user config file : /home/mark/.condarc
 populated config files : /home/mark/mambaforge/.condarc
                          /home/mark/.condarc
          conda version : 23.3.1
    conda-build version : 3.24.0
         python version : 3.9.16.final.0
       virtual packages : __archspec=1=x86_64
                          __glibc=2.36=0
                          __linux=5.19.0=0
                          __unix=0=0
       base environment : /home/mark/mambaforge  (writable)
      conda av data dir : /home/mark/mambaforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/mark/mambaforge/pkgs
                          /home/mark/.conda/pkgs
       envs directories : /home/mark/mambaforge/envs
                          /home/mark/.conda/envs
               platform : linux-64
             user-agent : conda/23.3.1 requests/2.28.2 CPython/3.9.16 Linux/5.19.0-38-generic ubuntu/22.10 glibc/2.36
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

  • pack numba (python 3.10...): 200 loops, best of 5: 1.53 msec per loop

I included the output of pythran -E as well.

pythran_pack.zip

@hmaarrfk changed the title from "Slower in Pythran than numpy?" to "array assignment slower than numpy" on Apr 10, 2023
@hmaarrfk
Contributor Author

I should also note that the "size" of the array here is quite large compared to what I see many benchmarks use.

Often I see that people benchmark with 256 x 256 images or 512 x 512 images. These images are less than 1MB in size (with uint8 precision).

However, this image, encoded as yuv420p, is more than 13 MB.

It doesn't fit in my cache all at once (but the uv part might).
https://ark.intel.com/content/www/us/en/ark/products/208663/intel-core-i71180g7-processor-12m-cache-up-to-4-60-ghz-with-ipu.html

It might fit in yours if you have a large machine which may explain some differences in our results.
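
As a rough back-of-the-envelope check (assuming the 12 MB L3 figure from the product page above):

image_shape = (3072, 3072)
y_bytes = image_shape[0] * image_shape[1]               # 9,437,184 bytes, 9 MiB
uv_bytes = (image_shape[0] // 4 * 2) * image_shape[1]   # 4,718,592 bytes, 4.5 MiB
print(y_bytes + uv_bytes)                               # 14,155,776 bytes, ~13.5 MiB for the NV12 frame
# The uv plane plus its copy (~9 MiB) might still fit in a 12 MB L3,
# but the whole frame does not.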

serge-sans-paille added a commit that referenced this issue Apr 12, 2023
It's very rare to have non-constant stride, and using a constant unlocks
various optimization, including potential improvement for #2095
@serge-sans-paille
Owner

serge-sans-paille commented Apr 12, 2023

Thanks for the extra piece of information. Does #2097 improve the situation? It does on my setup, making pack_my_pythran_assign as fast as the numpy version.

EDIT: this PR doesn't pass full validation yet, but it should be OK for our concern.
EDIT²: #2097 is now fully functional; waiting for your feedback before merging it.

serge-sans-paille added a commit that referenced this issue Apr 12, 2023
It's very rare to have non-constant stride, and using a constant unlocks
various optimization, including potential improvement for #2095
@hmaarrfk
Contributor Author

Unfortunately, it does not resolve things.

Is there any intermediate output I can give you to debug things? compiler info and whatnot?
pack_pythran_dev_2097.zip

@serge-sans-paille
Owner

Hell! A funny side effect of this track is that I'm fixing a lot of performance issues (some very big ones) that show up in various numpy benchmarks :-)

@jeanlaroche
Contributor

jeanlaroche commented Apr 14, 2023 via email

@serge-sans-paille
Owner

@jeanlaroche : you can check PR #2096, #2097 and #2096, they are likely to be merged soon.

@hmaarrfk : thanks for the archive - I can reproduce. An analysis of the generated assembly shows twice as many mov instructions in the pythran-generated code compared to numpy's; I'll investigate.

@jeanlaroche
Contributor

jeanlaroche commented Apr 14, 2023 via email

serge-sans-paille added a commit that referenced this issue Apr 15, 2023
It's very rare to have non-constant stride, and using a constant unlocks
various optimization, including potential improvement for #2095
@serge-sans-paille
Owner

serge-sans-paille commented Apr 15, 2023

Short notice: the following numpy version achieves the same result with half the memory:

def pack_my_pythran_assign(uv):
    u_size = uv.shape[0] // 2
    uv_lo =  uv[:u_size].copy()
    uv[1::2] = uv[u_size:]
    uv[0::2] = uv_lo

No significant impact on runtime from the pythran perspective on my setup, though.

EDIT: it actually brings an interesting speedup to the pythran-compiled version. Could you give it a try? (from the feature/faster-gexpr branch)
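
For anyone following along, a quick sanity check that the half-copy variant matches the original two-copy version (a minimal sketch in plain numpy; the function names here are just for the comparison):

import numpy as np

def pack_two_copies(uv):        # original version from the first comment
    uv_copy = uv.copy()
    u_size = uv.shape[0] // 2
    uv[0::2] = uv_copy[:u_size]
    uv[1::2] = uv_copy[u_size:]

def pack_half_copy(uv):         # the variant proposed above
    u_size = uv.shape[0] // 2
    uv_lo = uv[:u_size].copy()
    uv[1::2] = uv[u_size:]
    uv[0::2] = uv_lo

a = np.arange(16, dtype='uint8')
b = a.copy()
pack_two_copies(a)
pack_half_copy(b)
assert np.array_equal(a, b)     # both interleave the two halves identically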

@hmaarrfk
Contributor Author


Away from my computer at the moment, but at some point I found mixed results with this approach. My hunch was that it was due to memory aliasing.

Looking into things in the past, I found that x86 has some interesting instructions that skip the cache. I didn't know if pythran could be optimized for that, and I wanted to avoid speedups due to that optimization.

@hmaarrfk
Contributor Author

Looking at this now that I have a second, it seems that there is no action item on my part. Keep me posted.

@serge-sans-paille
Owner

Pythran is not doing anything specific related to cache. I can get a good speedup if I remove a runtime check for aliasing between the lhs and rhs when assigning between slices, but I currently fail to do so in an elegant way.
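
For context, a small numpy-level illustration (not pythran internals) of the kind of aliasing such a check has to consider: in the half-copy variant, the destination and source of `uv[1::2] = uv[u_size:]` are views of the same buffer.

import numpy as np

uv = np.zeros(16, dtype='uint8')
u_size = uv.shape[0] // 2
print(np.shares_memory(uv[1::2], uv[u_size:]))  # True: the slices overlap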

@hmaarrfk
Contributor Author

As for the memory copy, I'm mostly talking about my results from:

https://github.com/awreece/memory-bandwidth-demo

./memory_profiler 
           read_memory_rep_lodsq: 12.19 GiB/s
                read_memory_loop: 19.69 GiB/s
                 read_memory_sse: 18.63 GiB/s
                 read_memory_avx: 19.02 GiB/s
        read_memory_prefetch_avx: 16.90 GiB/s
               write_memory_loop: 13.23 GiB/s
          write_memory_rep_stosq: 39.86 GiB/s
                write_memory_sse: 13.97 GiB/s
    write_memory_nontemporal_sse: 43.03 GiB/s
                write_memory_avx: 13.26 GiB/s
    write_memory_nontemporal_avx: 47.87 GiB/s
             write_memory_memset: 39.80 GiB/s
       read_memory_rep_lodsq_omp: 49.37 GiB/s
            read_memory_loop_omp: 52.74 GiB/s
             read_memory_sse_omp: 54.41 GiB/s
             read_memory_avx_omp: 52.21 GiB/s
    read_memory_prefetch_avx_omp: 47.91 GiB/s
           write_memory_loop_omp: 31.18 GiB/s
      write_memory_rep_stosq_omp: 55.99 GiB/s
            write_memory_sse_omp: 22.40 GiB/s
write_memory_nontemporal_sse_omp: 56.80 GiB/s
            write_memory_avx_omp: 31.22 GiB/s
write_memory_nontemporal_avx_omp: 54.89 GiB/s
         write_memory_memset_omp: 54.15 GiB/s

I feel like this is a "future improvement", maybe.

I'm not a fan of using "threading" for speedups, but write_memory_nontemporal_avx shows an impressive speedup.

serge-sans-paille added a commit that referenced this issue Apr 23, 2023
It's very rare to have non-constant stride, and using a constant unlocks
various optimization, including potential improvement for #2095