Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use direct type casting instead of memcpy #688

Closed
wants to merge 1 commit into from

Conversation

miaoerduo
Copy link

Hi, I'm trying to read the llama.cpp code, it's a very interesting and fantasy project.

Here is a small optimization: for the small object copy (int64_t, float, int16_t, etc), we just need to cast it, no need to make a func call memcpy.

Here is a simple example.

#include <string.h>
#include <vector>
#include <iostream>
#include <chrono>

#define ggml_memcpy_opt(dst, src, type) do { *(type *)(dst) = *(type *)(src); } while (0)

int main() {
    // cur time in ns

    std::vector<size_t> time_cost(3, 0);

    const size_t length = 10000;
    const size_t times = 10000;
    std::vector<float> src(length, 100.0);
    std::vector<int32_t> dst1(length, 100);
    std::vector<int32_t> dst2(length, 100);
    std::vector<int32_t> dst3(length, 100);

    for (int idx = 0; idx < times; ++ idx) {
        auto t1 = std::chrono::high_resolution_clock::now();
        memcpy(dst1.data(), src.data(), src.size() * sizeof(float));
        auto t2 = std::chrono::high_resolution_clock::now();
        for (size_t idx = 0; idx < src.size(); ++idx) {
            memcpy(&dst2[idx], &src[idx], sizeof(int32_t));
        }
        auto t3 = std::chrono::high_resolution_clock::now();
        for (size_t idx = 0; idx < src.size(); ++idx) {
            ggml_memcpy_opt(&dst3[idx], &src[idx], int32_t);
        }
        auto t4 = std::chrono::high_resolution_clock::now();

        time_cost[0] += std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
        time_cost[1] += std::chrono::duration_cast<std::chrono::nanoseconds>(t3 - t2).count();
        time_cost[2] += std::chrono::duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
    }

    std::cout << "memcpy: " << time_cost[0] / times << " ns " << std::endl;
    std::cout << "for_loop memcpy: " << time_cost[1] / times << " ns " << std::endl;
    std::cout << "for_loop cast: " << time_cost[2] / times << " ns " << std::endl;
}

when compilee with -O2

the time cost should be on my PC:

memcpy: 1169 ns 
for_loop memcpy: 6541 ns 
for_loop cast: 3785 ns

when compiled with -O3 (I think the compiler has done some tricks):

memcpy: 1158 ns 
for_loop memcpy: 1224 ns 
for_loop cast: 1194 ns 

theoretically for small object copy the direct casting should be faster than memcpy.

Hope I could contribute to this great project!

Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - nice contribution. The reason for these memcpy is that I never have to worry about pointer alignment and type punning errors, so I've adopted the practice to always memcpy and don't worry about any potential problems (although, these are C++ related concerns and don't apply to C anyway).

Looking at the PR, I suppose these are all safe changes, so it seems like a good change. Even more so that it removes a header.

@ggerganov ggerganov requested a review from slaren January 9, 2024 09:01
@slaren
Copy link
Collaborator

slaren commented Jan 9, 2024

The compiler is smart enough to replace the memcpys if it can prove that it is safe, but even if it wasn't, the memcpy are there because they are necessary to respect the strict aliasing rules.

The benchmark is not correct. Testing very small code is tricky, but we can always check the assembly:
https://godbolt.org/z/vaK8549G1

@miaoerduo
Copy link
Author

The compiler is smart enough to replace the memcpys if it can prove that it is safe, but even if it wasn't, the memcpy are there because they are necessary to respect the strict aliasing rules.

The benchmark is not correct. Testing very small code is tricky, but we can always check the assembly: https://godbolt.org/z/vaK8549G1

oh yeah, it's right! Thanks for the review. I haven't used this tool to see the asm before. Let me close this PR.

@miaoerduo miaoerduo marked this pull request as draft January 9, 2024 10:07
@miaoerduo miaoerduo closed this Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants