GitHub - fm16191/fpga-streams-bandwith: Load vs Store comparison on Intel FPGA using oneAPI

Load vs Store streams Bandwith Comparison

We expected the store streams to be "faster to be processed" than load ones. We conducted the following experiment :

One code with multiple load streams from FPGA DRAM and 1 store stream

One code with 1 load stream from FPGA DRAM and multiple store streams

Code structure is as follows :

Example code structure for 4 loads, 1 store streams

void launcher_loads(T *d1, T *d2, T *d3, T *d4, T *d_res, size_t N, sycl::queue queue)
{
    queue.submit([&](sycl::handler &h) {
        h.single_task([=]() [[intel::kernel_args_restrict]] {

            for (size_t i = 0; i < N; ++i) {
                T x1 = d1[i];
                T x2 = d2[i];
                T x3 = d3[i];
                T x4 = d4[i];
                d_res[i] = x1 + x2 + x3 + x4;
            }

        });
    });
}

Example code structure for 4 stores.

void launcher_stores4(T *d_input, T *d1, T *d2, T *d3, T *d4, size_t N, sycl::queue queue)
{
    queue.submit([&](sycl::handler &h) {
        h.single_task([=]() [[intel::kernel_args_restrict]] {

            for (size_t i = 0; i < N; ++i) {
                T input = d_input[i];
                T x1 = input + 1;
                T x2 = input + 2;
                T x3 = input + 3;
                T x4 = input + 4;

                d1[i] = x1;
                d2[i] = x2;
                d3[i] = x3;
                d4[i] = x4;
            }

        });
    });
}

We performed a few combinations of these, also varying the count of element passed and the unroll factor to ensure peak bandwidth.

On the left, comparison between 8 load + 1 store streams vs. 1 load + 1 store streams for counts of 1e6, 1e7, and 2e7 elements.

In the middle, using 1E7 elements, we compare the bandwidth for 4 load + 1 store, 5 + 1, 8 + 1, and vice versa.

On the right, using 1E7 elements, we compare a combination of 8+1 / 4+1 and vice versa for different unrolled sizes. This is to ensure full throughput.

How to reproduce :

Makefile targets : cpu, fpga_emu, report, fpga

Makefile's BOARD_NAME to be edited if target is FPGA

Compilation example :

# Compile for 4 loads, 1 store on CPU
make KERNEL_SRC=kernel_4loads.cxx cpu

# Compile for 1 lod, 8 stores on FPGA
make KERNEL_SRC=kernel_8stores.cxx fpga

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
pics		pics
.clang-format		.clang-format
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
define.hpp		define.hpp
kernel.hpp		kernel.hpp
kernel_4loads.cxx		kernel_4loads.cxx
kernel_4loads32.cxx		kernel_4loads32.cxx
kernel_4loads_struct.cxx		kernel_4loads_struct.cxx
kernel_4stores.cxx		kernel_4stores.cxx
kernel_4stores32.cxx		kernel_4stores32.cxx
kernel_4stores_struct.cxx		kernel_4stores_struct.cxx
kernel_5loads.cxx		kernel_5loads.cxx
kernel_5stores.cxx		kernel_5stores.cxx
kernel_8loads.cxx		kernel_8loads.cxx
kernel_8loads16.cxx		kernel_8loads16.cxx
kernel_8loads32.cxx		kernel_8loads32.cxx
kernel_8loads_profiling32.cxx		kernel_8loads_profiling32.cxx
kernel_8stores.cxx		kernel_8stores.cxx
kernel_8stores16.cxx		kernel_8stores16.cxx
kernel_8stores32.cxx		kernel_8stores32.cxx
kernel_8stores_profiling32.cxx		kernel_8stores_profiling32.cxx
main.cxx		main.cxx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Load vs Store streams Bandwith Comparison

How to reproduce :

About

Releases

Packages

Contributors 2

Languages

fm16191/fpga-streams-bandwith

Folders and files

Latest commit

History

Repository files navigation

Load vs Store streams Bandwith Comparison

How to reproduce :

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages