Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use direct type casting instead of memcpy #687

Closed
wants to merge 0 commits into from

Conversation

miaoerduo
Copy link

Hi, I'm trying to read the llama.cpp code, it's a very interesting and fantasy project.

Here is a small optimization: for the small object copy (int64_t, float, int16_t, etc), we just need to cast it, no need to make a func call memcpy.

Here is a simple example.

#include <string.h>
#include <vector>
#include <iostream>
#include <chrono>

#define ggml_memcpy_opt(dst, src, type) do { *(type *)(dst) = *(type *)(src); } while (0)

int main() {
    // cur time in ns

    std::vector<size_t> time_cost(3, 0);

    const size_t length = 10000;
    const size_t times = 10000;
    std::vector<float> src(length, 100.0);
    std::vector<int32_t> dst1(length, 100);
    std::vector<int32_t> dst2(length, 100);
    std::vector<int32_t> dst3(length, 100);

    for (int idx = 0; idx < times; ++ idx) {
        auto t1 = std::chrono::high_resolution_clock::now();
        memcpy(dst1.data(), src.data(), src.size() * sizeof(float));
        auto t2 = std::chrono::high_resolution_clock::now();
        for (size_t idx = 0; idx < src.size(); ++idx) {
            memcpy(&dst2[idx], &src[idx], sizeof(int32_t));
        }
        auto t3 = std::chrono::high_resolution_clock::now();
        for (size_t idx = 0; idx < src.size(); ++idx) {
            ggml_memcpy_opt(&dst3[idx], &src[idx], int32_t);
        }
        auto t4 = std::chrono::high_resolution_clock::now();

        time_cost[0] += std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
        time_cost[1] += std::chrono::duration_cast<std::chrono::nanoseconds>(t3 - t2).count();
        time_cost[2] += std::chrono::duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
    }

    std::cout << "memcpy: " << time_cost[0] / times << " ns " << std::endl;
    std::cout << "for_loop memcpy: " << time_cost[1] / times << " ns " << std::endl;
    std::cout << "for_loop cast: " << time_cost[2] / times << " ns " << std::endl;
}

when compilee with -O2

the time cost should be on my PC:

memcpy: 1169 ns 
for_loop memcpy: 6541 ns 
for_loop cast: 3785 ns

when compiled with -O3 (I think the compiler has done some tricks):

memcpy: 1158 ns 
for_loop memcpy: 1224 ns 
for_loop cast: 1194 ns 

theoretically for small object copy the direct casting should be faster than memcpy.

Hope I could contribute to this great project!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant