array assignment slower than numpy #2095

Open
hmaarrfk opened this issue Apr 9, 2023 · 15 comments
Comments

@hmaarrfk
Contributor

hmaarrfk commented Apr 9, 2023

Hey, thanks for making this cool library. I really do believe that the advantages you outline in terms of ahead-of-time compilation are valuable to those building powerful scientific computation libraries.

I was trying my hand at doing a "simple" image analysis task: rgb -> nv12 conversion.

nv12 seems to be similar to yuv420 (I420) but with the U and V channels interleaved instead of on distinct planes.

This should be a simple transpose-like operation, but as is typical, it is easy to do it very slowly depending on how you go about it.

mark@nano 
--------- 
OS: Ubuntu 22.10 x86_64 
Host: 20UN005JUS ThinkPad X1 Nano Gen 1 
Kernel: 5.19.0-38-generic 
Uptime: 7 days, 1 hour, 43 mins 
Packages: 4010 (dpkg), 23 (flatpak), 12 (snap) 
Shell: bash 5.2.2 
Resolution: 2160x1350 
DE: GNOME 43.1 
WM: Mutter 
WM Theme: Adwaita 
Theme: Yaru-viridian [GTK2/3] 
Icons: Yaru-viridian [GTK2/3] 
Terminal: kitty 
CPU: 11th Gen Intel i7-1180G7 (8) @ 4.600GHz 
GPU: Intel Tiger Lake-UP4 GT2 [Iris Xe Graphics] 
Memory: 6991MiB / 15704MiB

The operation amounts to something like transposing a 2D array: from [u_0, u_1, ..., u_n, v_0, v_1, ..., v_n] to [u_0, v_0, u_1, v_1, ..., u_n, v_n].

To set up the problem, let's start with:

import numpy as np
N_iter = 200
image_shape = (3072, 3072)

uv = np.full(
    (image_shape[0] // 4 * 2, image_shape[1]),
    fill_value=0,
    dtype='uint8'
)
uv.shape = (-1,)

For some reason, I need to do the operation in place. In numpy, this would amount to:

import numpy as np
def pack_np(uv):
    uv_copy = uv.copy()
    u_size = uv.shape[0] // 2
    uv[0::2] = uv_copy[:u_size]
    uv[1::2] = uv_copy[u_size:]
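
For a concrete picture, here is what pack_np does on a tiny 8-element toy array (illustrative only, not part of the benchmark):

import numpy as np
# four u samples followed by four v samples
toy = np.array([10, 11, 12, 13, 20, 21, 22, 23], dtype='uint8')
pack_np(toy)
print(toy)  # [10 20 11 21 12 22 13 23] -- u and v interleaved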

On my computer, I get about 700-750 iterations per second with this loop:

from tqdm import tqdm
N_iter = 200
for i in tqdm(range(N_iter)):
    pack_np(uv)

As a bound, I estimated that the "upper bound of performance" would be achieved with:

def double_copy(uv):
    uv_copy = uv.copy()
    u_size = uv.shape[0] // 2
    # two full contiguous writes stand in for the two strided assignments in pack_np
    uv[...] = uv_copy
    uv[...] = uv_copy
    
for i in tqdm(range(N_iter)):
    double_copy(uv)

This achieves 1184 iterations / sec! Pretty good. I was hoping that pythran could help get to that level.

I tried two different ways of doing this:

import pythran
%load_ext pythran.magic
%%pythran
import numpy as np

# pythran export pack_my_pythran_loop(uint8[:])
def pack_my_pythran_loop(uv):
    u_size = uv.shape[0] // 2
    
    uv_copy = uv.copy()
    for i in range(u_size):
        uv[i * 2 + 0] = uv_copy[i]
        uv[i * 2 + 1] = uv_copy[u_size + i]

# pythran export pack_my_pythran_assign(uint8[:])
def pack_my_pythran_assign(uv):
    u_size = uv.shape[0] // 2
    uv_copy =  uv.copy()
    uv[0::2] = uv_copy[:u_size]
    uv[1::2] = uv_copy[u_size:]
  • pack_my_pythran_assign achieves 315 its / second
  • pack_my_pythran_loop achieves 460 its / second
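
These numbers come from the same kind of benchmark as the numpy versions, i.e. (assuming the same uv array and tqdm loop as above):

for i in tqdm(range(N_iter)):
    pack_my_pythran_assign(uv)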

I tried with numba for completeness:

from numba import njit as jit
@jit
def pack_numba(uv):
    u_size = uv.shape[0] // 2
    
    uv_copy = uv.copy()
    for i in range(u_size):
        uv[i * 2 + 0] = uv_copy[i]
        uv[i * 2 + 1] = uv_copy[u_size + i]
        
pack_numba(uv)  # warm-up call so JIT compilation isn't included in the timing
for i in tqdm(range(N_iter)):
    pack_numba(uv)

Achieves pretty close to 600 its/second.

I'm not really expecting too much "speedup" here, but I figured this was strange. I'm hoping that this little example can help improve this library.

Best.

Mark

library              iterations per second
numpy                700
double copy          1200
pythran assignment   315
pythran loop         450
numba loop           600
@serge-sans-paille
Owner

Thanks for the detailed report! I'm a bit puzzled by the result. I set up the following test, inspired by your test case:

python -m timeit -s 'from pack import pack_my_pythran_loop; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_loop(uv)'

with numpy (loop): 436 msec per loop
with numpy (assign): 2.12 msec per loop
with pythran (loop): 755 usec per loop
with pythran (assign): 2.14 msec per loop
with numba (loop): 1.05 msec
with numba (assign): 1.16 msec

So I see an issue with the assign part (and I can work on it!) but none with the loop :-/
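
For reference, a minimal sketch of the pack.py module imported above, assuming it simply contains the two functions from the original report and is compiled ahead of time with `pythran pack.py`:

# pack.py -- assumed contents, compiled with `pythran pack.py`

# pythran export pack_my_pythran_loop(uint8[:])
def pack_my_pythran_loop(uv):
    u_size = uv.shape[0] // 2
    uv_copy = uv.copy()
    for i in range(u_size):
        uv[i * 2 + 0] = uv_copy[i]
        uv[i * 2 + 1] = uv_copy[u_size + i]

# pythran export pack_my_pythran_assign(uint8[:])
def pack_my_pythran_assign(uv):
    u_size = uv.shape[0] // 2
    uv_copy = uv.copy()
    uv[0::2] = uv_copy[:u_size]
    uv[1::2] = uv_copy[u_size:]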

@hmaarrfk
Contributor Author

hmaarrfk commented Apr 9, 2023

For full reproducibility, I probably owe you information about my conda environment, and compiler versions.

I was travelling when I wrote the report, so it wasn't easy for me to spin up new, clean environments.

I'll try your little test when I get back on a few machines.

@hmaarrfk
Contributor Author

hmaarrfk commented Apr 10, 2023

$ python -m timeit -s 'from pack import pack_my_pythran_loop; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_loop(uv)'
200 loops, best of 5: 1.88 msec per loop
(pythran) ✔ ~/git/pythran 
$ python -m timeit -s 'from pack import pack_my_pythran_assign; import numpy as np; N_iter = 200; image_shape = (3072, 3072); uv = np.full((image_shape[0] // 4 * 2, image_shape[1]),fill_value=0,dtype=np.uint8); uv.shape = (-1,)' 'pack_my_pythran_assign(uv)'
100 loops, best of 5: 3.01 msec per loop
  • numpy assign: 200 loops, best of 5: 1.24 msec per loop
  • pythran loop: 200 loops, best of 5: 1.88 msec per loop
  • pythran assign: 100 loops, best of 5: 3.01 msec per loop
Environment before installing numba
mamba create --name pythran pythran blas numpy --channel conda-forge --override-channels
# packages in environment at /home/mark/mambaforge/envs/pythran:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
beniget                   0.4.1              pyhd8ed1ab_0    conda-forge
binutils_impl_linux-64    2.39                 he00db2b_1    conda-forge
binutils_linux-64         2.39                h5fc0e48_12    conda-forge
blas                      2.116                  openblas    conda-forge
blas-devel                3.9.0           16_linux64_openblas    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2022.12.7            ha878542_0    conda-forge
decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
gast                      0.5.3              pyhd8ed1ab_0    conda-forge
gcc_impl_linux-64         11.3.0              hab1b70f_19    conda-forge
gcc_linux-64              11.3.0              he6f903b_12    conda-forge
gxx_impl_linux-64         11.3.0              hab1b70f_19    conda-forge
gxx_linux-64              11.3.0              hc203a17_12    conda-forge
kernel-headers_linux-64   2.6.32              he073ed8_15    conda-forge
ld_impl_linux-64          2.39                 hcc3a1bd_1    conda-forge
libblas                   3.9.0           16_linux64_openblas    conda-forge
libcblas                  3.9.0           16_linux64_openblas    conda-forge
libexpat                  2.5.0                hcb278e6_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-devel_linux-64     11.3.0              h210ce93_19    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
liblapack                 3.9.0           16_linux64_openblas    conda-forge
liblapacke                3.9.0           16_linux64_openblas    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
libsanitizer              11.3.0              h239ccf8_19    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libstdcxx-devel_linux-64  11.3.0              h210ce93_19    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
llvm-openmp               16.0.1               h417c0b6_0    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
numpy                     1.24.2          py311h8e6699e_0    conda-forge
openblas                  0.3.21          pthreads_h320a7e8_3    conda-forge
openssl                   3.1.0                h0b41bf4_0    conda-forge
pip                       23.0.1             pyhd8ed1ab_0    conda-forge
ply                       3.11                       py_1    conda-forge
python                    3.11.3          h2755cc3_0_cpython    conda-forge
python_abi                3.11                    3_cp311    conda-forge
pythran                   0.12.1          py311hcb41070_0    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                67.6.1             pyhd8ed1ab_0    conda-forge
sysroot_linux-64          2.12                he073ed8_15    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
wheel                     0.40.0             pyhd8ed1ab_0    conda-forge
xsimd                     8.0.5                h4bd325d_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zstd                      1.5.2                h3eb15da_6    conda-forge


     active environment : pythran
    active env location : /home/mark/mambaforge/envs/pythran
            shell level : 1
       user config file : /home/mark/.condarc
 populated config files : /home/mark/mambaforge/.condarc
                          /home/mark/.condarc
          conda version : 23.3.1
    conda-build version : 3.24.0
         python version : 3.9.16.final.0
       virtual packages : __archspec=1=x86_64
                          __glibc=2.36=0
                          __linux=5.19.0=0
                          __unix=0=0
       base environment : /home/mark/mambaforge  (writable)
      conda av data dir : /home/mark/mambaforge/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /home/mark/mambaforge/pkgs
                          /home/mark/.conda/pkgs
       envs directories : /home/mark/mambaforge/envs
                          /home/mark/.conda/envs
               platform : linux-64
             user-agent : conda/23.3.1 requests/2.28.2 CPython/3.9.16 Linux/5.19.0-38-generic ubuntu/22.10 glibc/2.36
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

  • pack numba (python 3.10...): 200 loops, best of 5: 1.53 msec per loop

I included the output of pythran -E as well.

pythran_pack.zip

@hmaarrfk changed the title from "Slower in Pythran than numpy?" to "array assignment slower than numpy" on Apr 10, 2023
@hmaarrfk
Contributor Author

I should also note that the "size" of the array here is quite large compared to what I see many benchmarks use.

Often I see that people benchmark with 256 x 256 images or 512 x 512 images. These images are less than 1MB in size (with uint8 precision).

However, this image, encoded as yuv420p, is more than 13 MB.

It doesn't fit in my cache all at once (but the uv part might).
https://ark.intel.com/content/www/us/en/ark/products/208663/intel-core-i71180g7-processor-12m-cache-up-to-4-60-ghz-with-ipu.html

It might fit in yours if you have a large machine which may explain some differences in our results.
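
As a rough back-of-the-envelope check (assuming the 12 MB L3 figure from the product page above):

image_shape = (3072, 3072)
y_bytes = image_shape[0] * image_shape[1]               # 9,437,184 bytes, 9 MiB
uv_bytes = (image_shape[0] // 4 * 2) * image_shape[1]   # 4,718,592 bytes, 4.5 MiB
print(y_bytes + uv_bytes)                               # 14,155,776 bytes, ~13.5 MiB for the NV12 frame
# The uv plane plus its copy (~9 MiB) might still fit in a 12 MB L3,
# but the whole frame does not.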

serge-sans-paille added a commit that referenced this issue Apr 12, 2023
It's very rare to have non-constant stride, and using a constant unlocks
various optimization, including potential improvement for #2095
@serge-sans-paille
Owner

serge-sans-paille commented Apr 12, 2023

Thanks for the extra piece of information. Does #2097 improve the situation? It does on my setup, making pack_my_pythran_assign as fast as the numpy version.

EDIT: this PR doesn't pass full validation yet, but it should be OK for our concern.
EDIT²: #2097 is now fully functional; waiting for your feedback before merging it.

serge-sans-paille added a commit that referenced this issue Apr 12, 2023
It's very rare to have non-constant stride, and using a constant unlocks
various optimization, including potential improvement for #2095
@hmaarrfk
Contributor Author

Unfortunately, it does not resolve things.

Is there any intermediate output I can give you to debug things? compiler info and whatnot?
pack_pythran_dev_2097.zip

@serge-sans-paille
Owner

Hell! A funny side effect of this track is that I'm fixing a lot of performance issues (some very big ones) that show up in various numpy benchmarks :-)

@jeanlaroche
Contributor

jeanlaroche commented Apr 14, 2023 via email

@serge-sans-paille
Owner

@jeanlaroche : you can check PR #2096, #2097 and #2096, they are likely to be merged soon.

@hmaarrfk : thanks for the archive - I can reproduce. An analysis of the generated assembly shows twice as many mov instructions in the pythran-generated code compared to numpy's; I'll investigate.

@jeanlaroche
Contributor

jeanlaroche commented Apr 14, 2023 via email

serge-sans-paille added a commit that referenced this issue Apr 15, 2023
It's very rare to have non-constant stride, and using a constant unlocks
various optimization, including potential improvement for #2095
@serge-sans-paille
Owner

serge-sans-paille commented Apr 15, 2023

Short notice: the following numpy version achieves the same result with half the memory:

def pack_my_pythran_assign(uv):
    u_size = uv.shape[0] // 2
    uv_lo =  uv[:u_size].copy()
    uv[1::2] = uv[u_size:]
    uv[0::2] = uv_lo

No significant impact on runtime from the pythran perspective on my setup, though.

EDIT: it actually brings an interesting speedup to the pythran-compiled version. Could you give it a try? (from the feature/faster-gexpr branch)
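
For anyone following along, a quick sanity check that the half-copy variant matches the original two-copy version (a minimal sketch in plain numpy; the function names here are just for the comparison):

import numpy as np

def pack_two_copies(uv):        # original version from the first comment
    uv_copy = uv.copy()
    u_size = uv.shape[0] // 2
    uv[0::2] = uv_copy[:u_size]
    uv[1::2] = uv_copy[u_size:]

def pack_half_copy(uv):         # the variant proposed above
    u_size = uv.shape[0] // 2
    uv_lo = uv[:u_size].copy()
    uv[1::2] = uv[u_size:]
    uv[0::2] = uv_lo

a = np.arange(16, dtype='uint8')
b = a.copy()
pack_two_copies(a)
pack_half_copy(b)
assert np.array_equal(a, b)     # both interleave the two halves identically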

@hmaarrfk
Contributor Author


Away from my computer at the moment, but at some point I found mixed results with this approach. My hunch was that it was due to memory aliasing.

Looking into things in the past, I found that x86 has some interesting instructions that skip the cache. I didn't know if pythran could be optimized for that, and I wanted to avoid speedups due to that optimization.

@hmaarrfk
Contributor Author

Looking at this now that I have a second, it seems that there is no action item on my part. Keep me posted.

@serge-sans-paille
Owner

Pythran is not doing anything specific related to cache. I can get a good speedup if I remove a runtime check for aliasing between the lhs and rhs when assigning between slices, but I currently fail to do so in an elegant way.
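
For context, a small numpy-level illustration (not pythran internals) of the kind of aliasing such a check has to consider: in the half-copy variant, the destination and source of `uv[1::2] = uv[u_size:]` are views of the same buffer.

import numpy as np

uv = np.zeros(16, dtype='uint8')
u_size = uv.shape[0] // 2
print(np.shares_memory(uv[1::2], uv[u_size:]))  # True: the slices overlap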

@hmaarrfk
Contributor Author

As for the memory copy, I'm mostly talking about my results from:

https://github.com/awreece/memory-bandwidth-demo

./memory_profiler 
           read_memory_rep_lodsq: 12.19 GiB/s
                read_memory_loop: 19.69 GiB/s
                 read_memory_sse: 18.63 GiB/s
                 read_memory_avx: 19.02 GiB/s
        read_memory_prefetch_avx: 16.90 GiB/s
               write_memory_loop: 13.23 GiB/s
          write_memory_rep_stosq: 39.86 GiB/s
                write_memory_sse: 13.97 GiB/s
    write_memory_nontemporal_sse: 43.03 GiB/s
                write_memory_avx: 13.26 GiB/s
    write_memory_nontemporal_avx: 47.87 GiB/s
             write_memory_memset: 39.80 GiB/s
       read_memory_rep_lodsq_omp: 49.37 GiB/s
            read_memory_loop_omp: 52.74 GiB/s
             read_memory_sse_omp: 54.41 GiB/s
             read_memory_avx_omp: 52.21 GiB/s
    read_memory_prefetch_avx_omp: 47.91 GiB/s
           write_memory_loop_omp: 31.18 GiB/s
      write_memory_rep_stosq_omp: 55.99 GiB/s
            write_memory_sse_omp: 22.40 GiB/s
write_memory_nontemporal_sse_omp: 56.80 GiB/s
            write_memory_avx_omp: 31.22 GiB/s
write_memory_nontemporal_avx_omp: 54.89 GiB/s
         write_memory_memset_omp: 54.15 GiB/s

I feel like this is a "future improvement", maybe.

I'm not a fan of using "threading" for speedups, but write_memory_nontemporal_avx shows an impressive speedup.

serge-sans-paille added a commit that referenced this issue Apr 23, 2023
It's very rare to have non-constant stride, and using a constant unlocks
various optimization, including potential improvement for #2095