feat: use direct type casting instead of memcpy #687

miaoerduo · 2024-01-09T04:36:07Z

Hi, I've been reading the llama.cpp code; it's a very interesting and fantastic project.

Here is a small optimization: for small object copies (int64_t, float, int16_t, etc.), we can assign through a cast pointer instead of making a memcpy function call.

Here is a simple example.

#include <string.h>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <iostream>
#include <chrono>

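// Copy a single value of `type` from src to dst by assigning through cast pointers.
// Both pointers must be suitably aligned for `type`.
#define ggml_memcpy_opt(dst, src, type) do { *(type *)(dst) = *(type *)(src); } while (0)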

int main() {
    // accumulated time in ns for each of the three copy strategies
    std::vector<size_t> time_cost(3, 0);

    const size_t length = 10000;
    const size_t times = 10000;
    std::vector<float> src(length, 100.0);
    std::vector<int32_t> dst1(length, 100);
    std::vector<int32_t> dst2(length, 100);
    std::vector<int32_t> dst3(length, 100);

    for (size_t iter = 0; iter < times; ++iter) {
        auto t1 = std::chrono::high_resolution_clock::now();
        memcpy(dst1.data(), src.data(), src.size() * sizeof(float));
        auto t2 = std::chrono::high_resolution_clock::now();
        for (size_t idx = 0; idx < src.size(); ++idx) {
            memcpy(&dst2[idx], &src[idx], sizeof(int32_t));
        }
        auto t3 = std::chrono::high_resolution_clock::now();
        for (size_t idx = 0; idx < src.size(); ++idx) {
            ggml_memcpy_opt(&dst3[idx], &src[idx], int32_t);
        }
        auto t4 = std::chrono::high_resolution_clock::now();

        time_cost[0] += std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
        time_cost[1] += std::chrono::duration_cast<std::chrono::nanoseconds>(t3 - t2).count();
        time_cost[2] += std::chrono::duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
    }

    std::cout << "memcpy: " << time_cost[0] / times << " ns " << std::endl;
    std::cout << "for_loop memcpy: " << time_cost[1] / times << " ns " << std::endl;
    std::cout << "for_loop cast: " << time_cost[2] / times << " ns " << std::endl;
}

When compiled with -O2, the time cost on my PC is:

memcpy: 1169 ns 
for_loop memcpy: 6541 ns 
for_loop cast: 3785 ns

When compiled with -O3 (I think the compiler applies some extra optimizations here):

memcpy: 1158 ns 
for_loop memcpy: 1224 ns 
for_loop cast: 1194 ns

In theory, for small object copies, the direct cast should be faster than memcpy because it avoids a function call.

I hope I can contribute to this great project!
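For comparison, here is a minimal sketch of a third option: a memcpy whose size is a compile-time constant. Optimizing compilers commonly lower such a call to a single load and store, which may be why the -O3 timings above are nearly identical. Unlike the pointer cast, it stays well-defined when the source and destination types differ (as with the float to int32_t copy in the benchmark) or when a pointer is not aligned for the type. The names `copy_small` and `float_bits` are hypothetical helpers written for this illustration, not part of ggml or llama.cpp, and the sketch assumes a 32-bit IEEE 754 float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not a ggml API): copies exactly one T from src to dst.
// sizeof(T) is a compile-time constant, so the compiler is free to replace
// the memcpy call with a plain load/store when optimizing.
template <typename T>
inline void copy_small(void * dst, const void * src) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "copy_small only supports trivially copyable types");
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, sizeof(T));
}

// Hypothetical helper: returns the raw bit pattern of a float as an int32_t
// without reading the float through an int32_t pointer.
inline int32_t float_bits(float value) {
    static_assert(sizeof(float) == sizeof(int32_t),
                  "this sketch assumes a 32-bit float");
    int32_t bits = 0;
    copy_small<int32_t>(&bits, &value);
    return bits;
}
```

In the benchmark loop, `copy_small<int32_t>(&dst3[idx], &src[idx])` could stand in for `ggml_memcpy_opt(&dst3[idx], &src[idx], int32_t)`; whether it matches the cast at -O2 would need to be measured on the target compiler.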

miaoerduo marked this pull request as draft January 9, 2024 04:49

miaoerduo closed this Jan 9, 2024

miaoerduo force-pushed the master branch from a6b0c8f to 5a3154b Compare January 9, 2024 04:52
