Add GPU support to ggml #915

ggerganov · 2023-04-12T11:11:42Z

ggerganov
Apr 12, 2023
Maintainer

Intro

This issue is more suitable for the https://github.com/ggerganov/ggml repo, but adding it here for more visibility.

First, I don't see adding a GPU framework that is tightly integrated with ggml anytime soon because it usually comes with a lot of maintenance drawbacks, architecture changes and issues. However, there is an alternative approach that might be relatively easy to implement and I think would be a very cool way for new developers to join in and help.

Description

ggml produces computation graphs which are basically directed acyclic graphs (DAGs) that can be easily exported, iterated, etc. A graph contains the information about all necessary tensor operations and buffers needed to evaluate the model. The idea is to first add basic ggml functionality for exporting the graphs in some trivial text format that can be parsed as a second step by a separate ggml tool. Having the exported graphs, one can process them and construct hardware-specific code for evaluating them. This way, we keep implementing existing and new transformer models as we currently do - with a focus for CPU execution, but we gain the benefit of being able to export the computation graphs and translate them for GPU execution.

For example, a ggml-cuda tool can parse the exported graph and construct the necessary CUDA kernels and GPU buffers to evaluate it on a NVIDIA GPU. Another tool, for example ggml-mps, can do similar stuff but for Metal Performance Shaders. Or maybe even a ggml-webgpu tool.

This approach preserves the cross-platform nature of ggml and allows custom hardware support, via compiler-like translation of the exported computation graphs.

Still, the most difficult part of implementing the respective kernels for the targeted backend remains the biggest obstacle.

However, I think this decoupled approach of the implementation would make the development process much easier and can potentially allow for some interesting optimizations. My biggest fear of adding a tightly integrated GPU backend to ggml is that I don't know the important details for supporting the respective backend, which could lead to bad software design decisions that in turn can have negative side-effects even on the core CPU implementation.

With the proposed approach in this issue, we eliminate this risk and allow multiple independent implementations to be provided without any negative side effects on the core ggml implementation.

Another cool thing about this idea is that there could be separate leading developers for each backend.
So if you have a good knowledge and understanding about a certain hardware architecture, you are one step away from initiating the kernel "translation" process and making a very significant contribution to the project.

Guiding principles

I don't know all the specifics of a good GPU implementation, but I believe one could try to adopt the fundamental principles of ggml.

For example, there could be a single memory buffer allocated and all the tensors can be distributed within that memory buffer at certain offsets. Each graph operation will correspond to a kernel with source tensors as input and a destination tensor for output which will be all part of that single memory buffer allocated at the start of the execution.

Additionally, I think we don't need to explicitly add 3rd party dependencies (e.g. CUDA SDK, OpenCL, etc.) to ggml to achieve that. The new ggml translation tools will simply read a computation graph and generate code for a certain GPU backend, which will be up to the user to compile and run.

The existing CPU code for each tensor operation is your reference implementation. Ideally, you would always want to implement the same computation in the corresponding new kernel and after that, you can try to optimize it for the specifics of the hardware. This is especially true for the 4-bit kernels.

All computations and buffers remain on the GPU. Avoid back-and-forth copies of data to the CPU RAM at all cost.

Taking shortcuts and making custom hacks in favor of better performance is very welcome. "General-purpose" is "bad". For example, we can have a tool like ggml-cuda-llama which is a very custom ggml translator to CUDA backend which works only with LLaMA graphs and nothing else, but does some very LLaMA-specific optimizations. This is fine.

Keep things minimalistic and don't over-engineer. For example, a CUDA translation tool will output a single C++ (or some other language) file with all the kernels and backend initialization code embedded in it. A simple C-style function for evaluation can be exported so that we can call this from other code bases. The actual translation tool should also be implemented as a single source file in a preferred language. (this guiding principle has to be defined a bit better, but we will figure it out as we go)

The GPU "translators" will likely remain second-class citizens from ggml point of view and they will need to adapt to the core CPU implementation - not the other way around.

Why?

Currently, ggml is one of the few ML frameworks that provides efficient 4-bit quantization and demonstrates effective application for quantized transformer evaluation. The code is compact, easily comprehensible with very little bloat. I think ggml has a slight leading edge in this regard compared to other general purpose frameworks and if we utilize it now, it has the potential of becoming a very respectable machine learning framework in the future with a focus for on-device inference.

Note that there is a very large dose of "reinventing the wheel" in the outlined strategy. Therefore, if you want to get involved, it's very important to have the right mindset. Definitely do not approach this with: "this has already been done in another project" , "we should do all those things that project X does" or "this is not going to scale well for all those reasons", etc.

I think the right mindset to approach this is: "let's try to hack something fast, small and cool and see where it goes"

Links

Thoughts about Inference at the edge
Starting point for exporting ggml graphs: .dot file of ggml_graph can not be generated to .png file #589 (comment)
Sample computation graph for single-layer LLaMA 7B:

Update 28 May 2023:

MNIST prototype of the idea above: ggml : cgraph export/import/eval example + GPU support ggml#108
This is the pattern that we should follow and try to apply to LLM inference
First attempt at full Metal-based LLaMA inference: llama : Metal inference #1642
First attempt at full Vulkan-based LLaMA inference: Vulkan implementation (via Kompute) #2039

clxyder · 2023-04-12T11:46:24Z

clxyder
Apr 12, 2023

Would it be possible to use https://github.com/openai/triton/ to generate the specific backend GPU code? From what I can tell it generates CUDA code for you.

The only drawback right now is that it does not support the pre-Volta series, but there is a PR working on that triton-lang/triton#1505.

2 replies

btilmon Apr 12, 2023

To my knowledge triton can't convert C/C++ to PTX (CUDA), the kernels have to be defined in a triton-specific python style and then triton can be used to convert to PTX. So, ggml.c would have to first be converted to python. Not sure if this is worth it.

btilmon Apr 12, 2023

Also, PyTorch 2.0 uses triton under the hood, so if ggml.c computation graph was translated to PyTorch and compiled with PyTorch 2.0 it might be just as good as using a more custom implementation of each kernel with triton.

henk717 · 2023-04-12T13:58:16Z

henk717
Apr 12, 2023

Over at the python side SHARK has universal support using vulkan and they recently implemented llama here : https://github.com/nod-ai/llama

Perhaps a similar approach could be adopted leveraging vulkan as a universal backend? Vulkan should be sufficiently universal across all platforms you are trying to target.

7 replies

ggerganov Apr 17, 2023
Maintainer Author

...then why would running on high end nvidia GPUs be a priority?

It won't. For me running on Metal / MPS is a priority. But supporting NVIDIA would be also nice to have.

janekb04 Apr 18, 2023

I think that using Vulcan is the way to go. Shaders could be either cross-platform GLSL or could also have hand-optimized Metal Shading Language shaders for Apple. I think that, to keep with the current "from scratch" philosophy of GGML, this could be implemented from scratch without using a framework or engine over Vulkan.

StuartIanNaylor Apr 25, 2023

Metal 3 in gaming terms has had some updates where it now matches many methods to allow porting more PC games to the Mac.

Metal 3 has added more simularity to Vulkan & DirectX and the benefit of this is its also true for ML.
Maybe now there is enough commonality to create a middle API that can bridge to either Metal or Vulkan on compile?

It can not be exclusively Vulkan as that excludes 'Apple silicon first-class citizen` but with Metal 3 the backend of Metal vs Vulkan could be a compile choice.

If you can do https://github.com/DTolm/VkFft then I guess you can also do GGML ...

sandorkonya Jun 7, 2023

@janekb04 something like this?

janekb04 Jun 8, 2023

@sandorkonya Hi, the project you shared seems to be a Java library that presents a relatively simple interface to run GLSL compute shaders on Android devices on top of Vulkan. To me it doesn't really seem that relevant to GGML. I would expect GGML to continue to be a native library, including on Android. Still, it could provide an interface in a higher level language like C#, Python, or Java. However, this seems to be more in the domain of projects like llama.cpp. Additionally, I would expect GGML to provide a set of predefined tensor operators, implemented over various APIs on multiple platforms.

nerzhul · 2023-04-12T14:48:50Z

nerzhul
Apr 12, 2023

What about openCL ?

5 replies

r618 Apr 12, 2023

OpenCL was deprecated in macOS 10.14. To create high-performance code on GPUs, they recommend to use Metal framework
e.g. https://developer.apple.com/library/archive/documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html
it will still take probably few OS iterations now until it is completely removed, but basing any new code on it would be not entirely feasible

ivanstepanovftw Apr 15, 2023
Collaborator

Most of my previous systems does not have nor CUDA nor Metal.

xloem Apr 16, 2023

opencl was the standard before vulkan, it might be most compatible even if apple says they might remove it. note the users of ggml are more likely to have old or low-end systems.

rbrisita Apr 25, 2023

This is why I would lean towards OpenCL; maturity and available tools to convert OpenCL to other newer standards if need be.

StuartIanNaylor Apr 25, 2023

ArmNN is using OpenCL but like OpenGL if the API is any good is debatable to say, bespoke optimised code or a more granular lower levl Apis such as Vulkan.
Vulkan exists because OpenGL needed a different mantra.

If you can have a Vulkan Api over Metal then likely its also possible to run Metal over Vulkan or at least create an Api that will run on either.
https://github.com/KhronosGroup/MoltenVK

Metal 3 (The latest version) supposedly has added much compatibity aimed more at gaming but the advantage is to get games to port Metal 3 has similar methods as Vulkan & DirectX.

Likely Metal 3 & Vulkan could have very similar Api's but some bridging logic and a different compile would have to be implemented.

I am thinking OpenCL is heading for a cul-de-sac and Vulkan & Metal are the way to go.

LiliumSancta · 2023-04-12T15:43:17Z

LiliumSancta
Apr 12, 2023

I'm very interested in this (amd gpu owner...) I'm mainly using llama.cpp because it runs on wsl and despite having the python implementation running with rocm on linux I work mainly on windows with wsl. I have no experience with ml and have been away from c/c++ for many years, but I'm mostly exploring model inference and webgpu documentation, although I don't expect to make any contributions in the short term (I lack technical knowledge at the moment), but I was already planning to test some things on my own. For users with nvidia hardware I don't think there is much to gain except perhaps greater speed due to c and a much "cleaner" environment than python, but for amd and apus users including intel integrated graphics and even integrated devices with gpus there can be a lot of gain.

9 replies

SlyEcho Apr 13, 2023
Collaborator Sponsor

It could also be written in Cuda and then hipify-perl or something can translate to HIP and compile for ROCm.

funnbot Apr 13, 2023

That still has the problem of actually compiling the HIP for windows.

https://github.com/CHIP-SPV/chip-spv maybe this?

DGdev91 Apr 21, 2023

Last week in the ROCm docs was added an hint to a future windows release, even Toms' Hardware talked about that
https://www.tomshardware.com/news/amd-rocm-comes-to-windows-on-consumer-gpus

Recently it was added the support for cuda, but linking up HIP could still be a good idea for AMD users. it wold only work on linux at first, but then it should be possible to use it on window too when rocm for windows finally is released.

andrewjc Apr 26, 2023

I've started a HIP/ROCM port on my branch: https://github.com/andrewjc/llama.cpp

Not working yet - but close!

SlyEcho Apr 26, 2023
Collaborator Sponsor

@andrewjc Seems like a lot of duplication, in #1087 I made it much more easily but it may not be "correct" in that sense.

Dampfinchen · 2023-04-12T15:44:37Z

Dampfinchen
Apr 12, 2023

In my experience, the prompt processing appears to be the main bottleneck for speed. Accelerating prompt processing with cublas on tensor cores could speed up the matrix multiplication considerably.

However, transfering the matrices to the GPU appears to be the main bottleneck in the case of using GPU accelerated prompt processing.

10 replies

anzz1 Apr 14, 2023

@jon-chuang It's not as simple as function calls, the overhead of code interpretation and memory management, shifting data around etc. is the main culprit why it's slow.

You could refer to the 2-part video series posted above for a better general overview. That actually shows precisely the problem pretty well, it's not that the calculations themselves are 10-thousand-fold slower in python, it's that the memory management and interpretation is. It's a simple problem which can demonstrate the performance differences in various coding languages pretty well. However going into specifics why interpreted languages are slow can't be adequately explained without making it a full comp sci class.

Sure, you can "glue" native code to python or whatever, make the code a function call while everything actually happens in precompiled native code. That is not the point here. The point is that everything that can be done with interpreted languages can be done in native, but faster. Interpreted languages sure have their place and are good for prototyping, but when it comes to performance there is really no argument there.

It's also pretty stupid to be blind to shortcomings of something just because it's your favorite language. I don't like Rust but I don't claim it's slow, because it isn't. I like Lua but I don't claim it "can be as fast as native" because it can't. Moving the actual code to a native module which is just called from interpreted language and claiming because of that it "can be as fast as native", while technically true, is simply misleading.

It's the same as claiming that a remote-controlled toy car is as fast as a Formula 1 car, when your definition of "designed right" is that the RC car should be strapped to the F1 car.

jon-chuang Apr 14, 2023

Sorry, context is everything. Don't drag your misconceptions into a separate issue. In the first place, we've been discussing

transfering the matrices to the GPU appears to be the main bottleneck in the case of using GPU accelerated prompt processing.

Your claim is that python is the overhead. Unfortunately, you're dead wrong, but spent the last 100 sentences babbling about something that is known to be completely false by anyone who's ever done systems programming, ignoring that:

bottleneck is everything
everything can be optimized as long as the framework is flexible enough

Even GPT4, heck llama could do better than that.
I'm peacing out here.

MillionthOdin16 Apr 14, 2023

Sorry, context is everything. Don't drag your misconceptions into a separate issue. In the first place, we've been discussing

transfering the matrices to the GPU appears to be the main bottleneck in the case of using GPU accelerated prompt processing.

Your claim is that python is the overhead. Unfortunately, you're dead wrong, but spent the last 100 sentences babbling about something that is known to be completely false by anyone who's ever done systems programming, ignoring that:

bottleneck is everything

everything can be optimized as long as the framework is flexible enough

Even GPT4, heck llama could do better than that. I'm peacing out here.

Dude you need to chill. You're confrontational, and he has much more contribution history than you. Be respectful.

His claim is that python has inherent performance losses, and it's true that if you implement something in python vs native, the native should have better performance. You don't choose python for its performance.

umbra-scientia Jun 24, 2023

Lets think step by step.

Dude you need to chill. You're confrontational, and he has much more contribution history than you. Be respectful.

Please realize that, "Dude you need to chill. You're confrontational," is highly confrontational.

Please realize that, "he has much more contribution history than you." is a logical fallacy.

Please realize that his comment is correct and productive in clearing misconceptions, and "Be respectful." is great advice for yourself.

...is something that is known to be completely false by anyone who's ever done systems programming

Is a harsh truth, sure, but it is truth nonetheless. I hope you will refrain from prioritizing your emotional perceptions over the health of this project. Using Python as an intermediate language can in-fact be faster than native code if done correctly. It's quite arrogant to assume that your own methods and undersranding are the optimal pinnacle of design. Python-as-IL fits the mantra of GGML ("let's try to hack something fast, small and cool and see where it goes") and is a real-world battle tested approach, while "Python=Slower Always!!1 Something something memory management!!" truly is a counterproductive armchair misconception being used to justify a baseless emotional attack against a solid and proven scalable path forward which neatly re-uses the user friendly and highly performant machinery of industry.

Codes4Fun Jun 24, 2023

c/c++ will always be able to achieve higher performance than python and javascript like languages, but there is more work in achieving that optimization.

for example, in javascript with deterministic functions it can cache results, so if a function is called multiple times with the same inputs it can use the cached results instead of re-running the same function. now a good c/c++ engineer would realize they are calling the same function with the same inputs and cache the results also, but the engineer would need to write that themselves instead of having it be done automatically for them, and while you can use a library to do that, it changes the way you program, requires integration.

so in the end it is a matter of development time to achieve some level of performance. in fact the main benefit of languages like python is rapid application development, but not necessarily optimal applications. overall you have to be more thoughtful in c/c++ than you do in python.

personally I don't like the trend of using python so much because python libraries are generally not available to other languages, while lots of c/c++ libraries are made available to python developers through wrappers.

funnbot · 2023-04-12T15:50:52Z

funnbot
Apr 12, 2023

Another amd gpu here, wanting windows support for gpu inference, I’ve been looking at https://github.com/KomputeProject/kompute
Which would mean writing one or more compute shaders (glsl, …) -> spirv, while kompute has some wrappers for handling device memory as tensors. This would technically allow for a zero library approach if kompute is replaced with the usual vulkan boiler plate.

The most interesting area of research to me is running larger models that can’t fit into vram, something like DeepSpeed ZeRO-3

1 reply

LiliumSancta Apr 12, 2023

I took a quick look at the examples and it looks a lot like what I was seeing about the webgpu here https://eliemichel.github.io/LearnWebGPU/basic-compute/compute-pipeline.html I'll take a better look at kompute, thanks.

antimora · 2023-04-12T17:05:01Z

antimora
Apr 12, 2023

WONNX has WebGPU kernels for operation if one wants to support WebGPU.

0 replies

dggonz · 2023-04-12T19:18:15Z

dggonz
Apr 12, 2023

What about supporting TPU or NPU like in the rk chips used in the khadas boards or rock pi?

0 replies

nietras · 2023-04-12T20:54:49Z

nietras
Apr 12, 2023

Right have you considered simply exporting to ONNX and using ONNX Runtime? Perhaps you have, what were reasons why this would not make sense?

5 replies

jon-chuang Apr 14, 2023

How well are 4 bit quantized ops running on onnxrt?

This is the main question.

We are only able to run 13B param LLMs on our laptops because of ggml's first class support for 4 bit params. It's the only reason why llama.cpp is popular in the first place.

fxmarty Apr 14, 2023

Simple: they don't run (not even with tensorrt, and not even a native int4 ONNX dtype). So you'd need to code your custom kernels and e.g. extend ORT.

xloem Apr 16, 2023

At microsoft/onnxruntime#14997 (comment) onnxruntime rejects 4bit quantization. At onnx/onnx#4192 (comment) onnx recommends making local modifications to the spec to support 4bit, comparable to those made to support bfloat16.

xloem Apr 21, 2023

I was thinking about this and realized that you could put 4bit quantization in standard onnx by simply including the 4bit code when converting it.

xloem Apr 21, 2023

it looks like a casual export to onnx presently has slower runtime huggingface/optimum#922 (comment) i imagine it’s something troubleshootable with investigation

charles-liang · 2023-04-12T22:43:24Z

charles-liang
Apr 12, 2023

one idea is using like bgfx framework, even though it's only designed for games, you can fork it and use the layer of generic shaders it provides to compile it into a platform-specific shader, like mac using metal, win using d3d, linux using vulkan, or even compile it into webgpu.

https://bkaradzic.github.io/bgfx/examples.html#nbody

1 reply

xloem Apr 16, 2023

writing these shaders sounds worthwhile as bgfx would likely be conducive to adding obscure compiler targets

https://bkaradzic.github.io/bgfx/tools.html#shader-compiler-shaderc

Visne · 2023-04-12T22:55:04Z

Visne
Apr 12, 2023

If this happens, please use HIP instead of CUDA. The code should be almost the same, but it will make it much easier for AMD users to run.

0 replies

fotcorn · 2023-04-13T00:32:48Z

fotcorn
Apr 13, 2023

I'd like to propose to use MLIR as the text format. It's a flexible intermediate representation to which most of the machine learning ecosystem seems to be gravitating to right now.

The approach could be that we define a ggml specific dialect that maps to the internal graph representation as closely as possible.

People could then build transforms from the ggml MLIR dialect to other MLIR dialects, for example:

One or more builtin MLIR dialects like affine, tensor, linalg, tosa etc.
Triton/TritonGPU to get fast NVIDIA GPU support, and soon also AMD and CPU
IREE: Alternative way to get NVIDIA/AMD GPU support, also LLVM CPU, Vulkan, WebGPU
torch-mlir

Obviously, MLIR is a huge dependency, but implementing a minimal MLIR text generator should be possible with just the C++ standard library.

3 replies

jon-chuang Apr 14, 2023

100% with this. I vote for XLA as the first supported compiler backend. With some kernel/custom op plugins, it could end up doing better than our hand-coded kernels, even for our 4-bit quantized ops.

jon-chuang Apr 14, 2023

Note that exporting MLIR is more non trivial than it may seem. Every listed consumer of MLIR above has their own incompatible dialect of MLIR. So it requires focusing on a portable backend.

jpienaar Aug 30, 2023

Couple of these backends share dialects and can consume same inputs, considering portability and targets of interest is important. So it definitely depends on what one wants to model and use it for. Having it be the graph compute level representation and then ability to add different paths for backends (where the existing executor could be one) gives the most flexibility. Opens up using a toolbox of optimizations. And indeed just doing a printer would be relatively easy, but it's not a long term stable format (contrasted to the bytecode with upgrade/downgrade hooks), but may be sufficient or good starting point (work can also be done to create a lower dependency bytecode writer but no one has asked for one yet).

nathanodle · 2023-04-13T00:43:49Z

nathanodle
Apr 13, 2023

Hear me out on something that could make a lot of sense:

Intel Arc A770 GPUs are somewhat under appreciated when it comes to performance per $. They're also almost completely ignored by the trendy side of AI research.

Doing some tests with some multimodal VQA models, I get about 1/3 the performance of a 4090 but the card only costs $379. I will admit to not having performance/watt benchmarks but for most consumer/edge situations this isn't a huge deal, it's more about the cost of the card for most folks. With Intel trying to break into the GPU market, the A770 is priced to sell, and did I mention it's 16GB?

I would love to see this card be the first platform targeted by a ggml gpu stack. With all the optimizations it would be fun to see how much that "1/3 of a 4090" number could be improved upon and at $379 this could be a huge unlock for the community.

5 replies

gjmulder Apr 13, 2023
Collaborator

Doing some tests with some multimodal VQA models

@nathanodle Would this also open up the possibility of offloading some work to integrated Intel GPUs such as Iris?

nathanodle Apr 14, 2023

It's the same OneDNN stuff, could be interesting

fc59283 Apr 14, 2023

Intel Arc supports OpenCL - ArrayFire has OpenCL, CUDA and CPU backends.
[ArrayFire is a free and open source software (FOSS) library]
https://arrayfire.com/
https://github.com/arrayfire/arrayfire

jon-chuang Apr 14, 2023

I would love to see this card be the first platform targeted by a ggml gpu stack.

This would never work for a simple reason. The fewer users, the smaller the reason to support. The smaller the support, the less compelling to users. If there is a breakthrough, it should be from the mainstream AI folks, I.e. Intel folks pushing their product.

As the small guys, we need to support the largest base, I.e. CPUs and commonly available GPUs I.e. AMD and NVIDIA.

FNsi Apr 23, 2023

I would love to see this card be the first platform targeted by a ggml gpu stack.

This would never work for a simple reason. The fewer users, the smaller the reason to support. The smaller the support, the less compelling to users. If there is a breakthrough, it should be from the mainstream AI folks, I.e. Intel folks pushing their product.

As the small guys, we need to support the largest base, I.e. CPUs and commonly available GPUs I.e. AMD and NVIDIA.

You both make the point, support is needed, and the cost should be lower. Start with more general toolkit to slightly decrease the advantage of Nv is a good choice.

nuance1979 · 2023-04-13T01:50:07Z

nuance1979
Apr 13, 2023

I feel what is great about ggml is the fact that it can run on any CPU with a C++ compiler, e.g. a Raspberry Pi. However, on GPUs, I don't know how likely a ggml-based inference code can beat PyTorch or Triton in terms of performance.

2 replies

jon-chuang Apr 14, 2023

That's why we should just export the quantized operators/weights to a production-grade compiler like XLA.

xloem Apr 21, 2023

XLA seems to be a cloud thing. Hard to find this running locally?

wailovet · 2023-04-13T01:50:18Z

wailovet
Apr 13, 2023

Recently, I tried to use amp with Windows to do some matrix calculations, but it turned out to be harder than I thought.

0 replies

HanClinto · 2023-04-27T23:21:54Z

HanClinto
Apr 27, 2023
Collaborator

A coworker and I have been playing around with the cuBLAS version of llama.cpp on a Jetson Xavier NX development kit.

We're getting ~600ms/token on a Xavier NX, but we aren't seeing a significant performance improvement vs. compiling for cuBLAS and without. Thank you for your excellent implementation, @slaren -- I have learned a lot from you!

One thought that stuck out to us was this comment from @Dampfinchen.

However, transfering the matrices to the GPU appears to be the main bottleneck in the case of using GPU accelerated prompt processing.

NVIDIA Tegra devices have shared DRAM between the CPU and GPU (so as long as we allocate the memory in the correct mode). So I think this means that on these devices, we don't need to spend time copying memory from the CPU to the GPU, and all of those calls to cudaMemcpyAsync() in ggml.c can be #ifdef'd out.

Is that worthwhile? I'm in new territory (and not entirely sure of the best way to tell if I'm in a shared-memory context or not), so any feedback will be helpful. Either way, I'll probably start hacking on this tomorrow and see if I can't get something working...

3 replies

HanClinto Apr 27, 2023
Collaborator

And while I'm here, these are some lessons that we've learned so far about compiling and running llama.cpp with cuBLAS on embedded NVIDIA Jetson platforms:

To compile with the version of cuBLAS that nvidia ships with their Jetpack toolkit, you may need to adjust your NVCCFLAGS. For our particular target, we used NVCCFLAGS = --forward-unknown-to-host-compiler -arch=sm_72, but ymmv. You can look up your appropriate flags here.
We also needed to change our call to cublasGemmEx:

    +++ b/ggml.c
&beta,  d_D, CUDA_R_32F, ne01,
-                            CUBLAS_COMPUTE_32F,
+                                 CUDA_R_32F, //CUBLAS_COMPUTE_32F,
                             CUBLAS_GEMM_DEFAULT));

But the above two changes may not be necessary, depending on which version of the CUDA toolkit you're using.
Definitely make sure to set your thread parameter correctly. The default in alpaca.sh was set to -t 7 and we only had 6 cores. Once we set it to -t 6, performance increased dramatically. Having a couple of threads too few is far better than setting one thread too many.
We got best performance by using the --mlock option -- probably because our device is a tad memory-constrained. We were getting ~750ms/token without it, and ~590ms/token with it. If you're getting the error:

warning: failed to mlock 81920000-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).

Then just as the message indicates, increasing the ulimit solved it. ulimit -l unlimited

mmap is better than no-mmap for us.
On our Orin devkit, we were getting around 160ms - 180ms/token -- pretty respectable.

slaren Apr 28, 2023
Collaborator

#1207 may be useful for you. From what I understand from your link to the documentation, this device can only access pinned memory, so with that PR you should be able to remove all the device allocations and memcpys and just use the host pointers. Remember to disable mmap to ensure that all memory is pinned.

HanClinto Apr 28, 2023
Collaborator

#1207 may be useful for you.
Yes, thank you so much!!

From what I understand from your link to the documentation, this device can only access pinned memory, so with that PR you should be able to remove all the device allocations and memcpys and just use the host pointers.

In my case, I have an iGPU with shared DRAM, so I can use Pinned or Unified memory. Unified memory has caching support on the GPU, but it comes with additional requirements:

using unified memory in applications requires additional coherency and cache maintenance operations during the kernel launch, synchronization and prefetching hint calls.

I might just start with pinned memory to begin with, but I'm intrigued by what it would take to make it unified.

I really appreciate it, thank you!

twobombs · 2023-04-28T07:28:24Z

twobombs
Apr 28, 2023

Would like to note that with #801 merged in main an NVME RAID0 set should be enough to feed a PCIe bus for a future GPU addon: offloading the GPU's FB would make llama.cpp run ahead of so many others in the field. Yes, the GPU would not be at 100% most of the time especially on older systems but such is the case with HPC scaling.

0 replies

StuartIanNaylor · 2023-04-29T12:02:26Z

StuartIanNaylor
Apr 29, 2023

Has anyone seen https://github.com/Noeda/rllama

Load model only partially to GPU with --percentage-to-gpu command line switch to run hybrid-GPU-CPU inference.

Is that possible with GGML CPP as does hybrid-GPU-CPU inference increase speed or is it butchered by the memory transfer?

1 reply

slaren Apr 29, 2023
Collaborator

I think that this would be the more interesting approach to add GPU support to ggml. We could upload as many weights as possible to VRAM, compute as much as possible on the GPU, and then fall back to the CPU for everything else. We could also try other things, such as uploading the weights that will be needed in the future while the current computation completes.

While this is nothing new, we would benefit from having an optimized CPU implementation of inference with quantized models to fall back to, which is what makes ggml unique.

swittk · 2023-04-29T15:51:03Z

swittk
Apr 29, 2023

This project called MLC-LLM just released (https://github.com/mlc-ai/mlc-llm), and is able to target compilation of LLM models to GPU shaders for CUDA, Vulkan and Metal.
Just tried their demo repo (which uses int3 quantized Vicuna v1 model) on an M1 Max MBP and the response times are surprisingly fast (at least 3 times faster than LLaMA.cpp using Vicuna v1 q4_2 quantized GGML on the same machine), though it's rather inaccurate at times likely due to the int3 quantization.
I'm just posting in case their implementation can help regarding targeting different GPU backends.

3 replies

StuartIanNaylor Apr 29, 2023

I guess its how it compares to native Metal, Vulkan, Cuda but if somewhere near it looks like that 'universal solution', suppose it can not partition models as above and splt between cpu & gpu :)

funnbot May 1, 2023

I've built a version of Relax/TVM for Windows that has Vulkan enabled, which you can find at https://github.com/funnbot/relax/actions. I had to make several tweaks to mlc-ai/mlc-llm aswell to get build.py and the CMake stuff to work properly.

Eventually I was able to build vicuna-7b int4, and it's incredibly fast. This is the moment I've been waiting for: Windows + AMD GPU-accelerated LLM.

I might contribute back to the mlc-llm repository to make things easier for everyone. The weight conversion process alone took nearly 40GB of RAM. I suspect that's due to inefficient use of AutoModelForCasualLM just for loading the parameters and configuration.

As for allowing for models larger than VRAM, I wonder if it's possible to switch to the CPU after N layers, and let TVM figure out the rest. I'm not entirely sure, though.

Update: I hacked out the AutoModelForCasualLM, so memory usage for conversion is now roughly the size of the parameters. I int3 quantized Llama-30b but I couldn't get the ~13gb of params on my 16gb vram gpu.

For larger than VRAM models, TVM has https://tvm.apache.org/docs/how_to/work_with_relay/using_pipeline_executor.html which should mean splitting the model up is possible? The graph_split from unit test that isnt shown, https://github.com/mlc-ai/relax/blob/60d68d9cf71293e553c2cac7658eb69a2d477155/tests/python/relay/test_pipeline_executor.py#L32

swittk May 4, 2023

I tried building TVM on my machine and it took crazy amount of RAM to do so, so my computer started swapping and... I couldn't do any work. So I gave up for now.
How's the performance for Vicuna-13B int4? Still fast compared to the 7B int3?

soufianekhiat · 2023-05-18T21:02:23Z

soufianekhiat
May 18, 2023

Hello,
To comment on that. Did you consider Halide:
https://github.com/halide/Halide/tree/master/apps/onnx
https://github.com/halide/Halide/tree/master/apps/resnet_50

That allow us to split algorithm & implementation:
https://halide-lang.org/

And battle-tested: Adobe: Photoshop, Google: Pixel Phones, Youtube, Samsung, ...

Pure C++, and can target virtually any LLVM-backend and supports:
"
CPU architectures: X86, ARM, Hexagon, PowerPC, RISC-V
Operating systems: Linux, Windows, macOS, Android, iOS, Qualcomm QuRT
GPU Compute APIs: CUDA, OpenCL, OpenGL Compute Shaders, Apple Metal, Microsoft Direct X 12, Vulkan
"
Adding: WebGPU, WebAssembly

Write the front-end once. and specialize scheduler per platform or/and use auto-scheduler to warm up an optimization.

my 2 cents

0 replies

Renegadesoffun · 2023-05-29T21:35:45Z

Renegadesoffun
May 29, 2023

howdy yall! i made most of a GUI for the LLAMA with ability to preload onto GPU! .. seems like the input / output parameters need some tweaking to really function fully but if anyone can put the finnishing touches (maybe multi threading to let it run fast, (edit the folder to be set by user, etc.) i think it can be sweet! honestly whats keeping me from using Llama is not having a gui.. i dont like talking in the cmd. so if someone can add finishing touches to this Id be stoked!~

https://github.com/Renegadesoffun/llamagpu

0 replies

siddhsql · 2023-06-07T01:35:48Z

siddhsql
Jun 7, 2023

Hello, I am little confused by all this. Pardon my ignorance but

is GGML another ML library like PyTorch/TensorFlow?
what ML library does llama.cpp use?
It looks like its not GGML because llama.cpp is able to take advantage of GPU. see: GPU question zylon-ai/private-gpt#217 for example
I see most of the time the models input to llama.cpp are ggml. Does llama.cpp use ggml as a format but use something else like PyTorch or TensorFlow for the computations?

can someone explain to me? I apologize these questions may seem very basic to experts but I wasn't able to find answers.

3 replies

HanClinto Jun 7, 2023
Collaborator

@siddhsql No worries about the questions -- I'll do my best to answer them for you.

Sort-of. It's more highly-specialized than those libraries, but it's on that same level of operations.
It uses GGML, which by default is optimized for CPU, but it also can be compiled with options that will leverage GPU libraries such as Metal, CUDA, and (as your link notes) CUBLAS.
In your linked issue, it notes that you should compile GGML with the option LLAMA_CUBLAS=on to get GPU support. This is documented here, and you can see other GPU options as well.
It uses GGML as a format, and as a computation library. It has no dependency on Pytorch / Tensorflow. If you compile against CUDA / CUBLAS / OpenBLAS / some other GPU acceleration framework, then you will need those as external dependencies, but otherwise the default "base" version of GGML / llama.cpp does not require any external dependencies.

siddhsql Jun 7, 2023

@HanClinto thanks a lot for your response. One more question. w.r.t. https://github.com/ggerganov/llama.cpp/blob/master/llama.cpp#L965:

static void llama_model_load_internal(
        const std::string & fname,
        llama_context & lctx,
        int n_ctx,
        int n_batch,
        int n_gpu_layers,
        int main_gpu,
        const float * tensor_split,
        ggml_type memory_type,
        bool use_mmap,
        bool use_mlock,
        bool vocab_only,
        llama_progress_callback progress_callback,
        void * progress_callback_user_data) {

can yo explain what n_gpu_layers does? is it the # of layers that will be offloaded to the GPU (Not necessarily the total # of layers). what if someone sets n_gpu_layers greater than total # of layers in the model? thanks.

PS: is n_layer the total # of layers in the model?

HanClinto Jun 8, 2023
Collaborator

@siddhsql

can yo explain what n_gpu_layers does? is it the # of layers that will be offloaded to the GPU (Not necessarily the total # of layers)

Yes. This was introduced in #1483, I believe -- if you look at the diff on that PR, it may explain things more clearly -- but basically, for each layer, it can have a specific "backend" associated with it -- whether it's stored in GPU, or in CPU. If it's stored in GPU, then it doesn't need to be transferred there for processing every time -- it's already there, and so we don't need to malloc out of the GPU's memory pool and memcpy it over there every time we want to process that layer -- it's cached, and we can perform our mulmat (or whatever other operation) against that layer directly.

what if someone sets n_gpu_layers greater than total # of layers in the model? thanks.

No issue at all. In fact, this is one of the easiest ways to ensure that 100% of the model is loaded into VRAM -- just set n_gpu_layers to 999999 or something, and everything will get transferred over, and when testing this I've regularly set the number higher than it's expected to be.

PS: is n_layer the total # of layers in the model?

In the context of lllama_model_load_internal, I believe that n_layer is the index of the current layer that's being loaded into memory. It splits them up, loops through, and loads all the layers individually. And as it's loading the layers, it simply checks to see how many layers it is supposed to copy to the GPU, and it begins copying layers over to the GPU so long as it hasn't gone over the limit. There's no way to copy the later layers to the GPU and keep the earlier layers on the CPU -- it always prioritizes the early layers to GPU and anything else gets put on the CPU (I.E., loaded normally).

Codes4Fun · 2023-06-07T17:32:21Z

Codes4Fun
Jun 7, 2023

I've noticed that clblast version of ggml performs about the same as the cpu version with q4_0 formatted files (haven't test other file formats), and debugging it, I had noticed that it doesn't use the gpu a lot, and trying to force it to use it more, by modifying code, seemed to make performance worse.

so I've implemented my own opencl kernel that focuses only on mul-mat... and I managed to get a much larger performance increase, on my fastest system an rtx 4070ti has an 80% increase in performance over it's 8 core ryzen 9 5900hx cpu. on older systems gtx 2070 is about 11% increase over it's i7-9750h, and a gtx 1070 has an 22% increase over it's i7-6700T.

and that does not include optimizations to use local memory in the opencl kernel, which will likely increase it further.

to get that performance increase though, I had to reformat the data. I originally tried it using the ggml structures block_q8_0 and block_q4_0, but breaking the structure out into two separate arrays, one for qs and another for d, improved performance, and in the kernel using those arrays as 8-component vectors made a large improvement.

this is what the vector dot product looks like in opencl:

float vec32_dot_q4_q8_raw8(
   __global uchar4 *xqs,
   __global half *xd,
   __global char8 *yqs,
   __global half *yd,
   const unsigned int nb)
{
    float8 fsum = 0;
    for (unsigned int j = 0; j < nb; j++) {
        float8 sum = 0;
        for (int i = 0; i < 4; i++) {
            uchar4 q4 = *xqs;
            float8 _xqs;
            _xqs.even = convert_float4(q4 & (uchar4)(0xf));
            _xqs.odd = convert_float4(q4 >> (uchar4)(4));
            _xqs -= (float8)(8);
            _xqs *= convert_float8(*yqs);
            sum += _xqs;
            xqs++; yqs++;
        }
        fsum += sum * vload_half(0,xd) * vload_half(0,yd);
        xd++; yd++;
    }
    fsum.s0123 += fsum.s4567;
    fsum.s01 += fsum.s23;
    return fsum.s0 + fsum.s1;
}

I could share the code if anyone is interested, but it is really hacky atm, and I don't know if the performance I am experiencing with clblast is normal or if I did something wrong, and I am also thinking of trying to optimize it further using opencl local memory.

15 replies

Codes4Fun Jun 9, 2023

I've updated with changes to make it more comparable to cpu and clblast versions.
it pre-caches now, so lazy caching doesn't affect token eval timing.
and multithreading works, so you can specify as many threads as you want.

the threading fix increases performance a bit

Codes4Fun Jun 19, 2023

I updated my fork with a new branch:
https://github.com/Codes4Fun/llama.cpp/tree/opencl_dev2

this new branch is integrated with clblast code.

I found two things

I was not using clblast correctly before and now that I am, I see it is a lot faster than the cpu.
I've come to realize a flaw in the way I am approaching it. while on an older gpu, a gtx 1070, my kernel is faster than the current one, but on a newer gpu such as 4070ti it has more compute than I am structured to use. as a result processing one token at a time is slower. prompt eval is still faster since it uses more compute but run evals are slower.

I might have to do custom data interleaving and local memory to get optimal compute.

I was looking at another interesting solution to this... I noticed two matrices in llama that could be combined, wq and wk, they both multiply against the same vector/matrix, so it seems possible to combine them and grow the compute, and use ggml_view to separate out the results. while it is better to have a general purpose solution, this may have other benefits !

teleprint-me Jun 19, 2023

Once I get more comfortable with Cpp, I'll test run it on my AMD RX 580. It has the same amount of VRAM as the GTX 1070. I'll have to check out the specs for differences when I have time. It should work though since its OpenCL.

Codes4Fun Jun 21, 2023

I updated the new branch, with another optimization that double buffers local memory, and tried to improve performance on my rtx 4070ti by quadrupling compute, 4 cores per vector.
https://github.com/Codes4Fun/llama.cpp/tree/opencl_dev2

these increased performance on my gtx 1070, which clblast default is about 106 ms per token, and is now down to 68 ms per token.

these increased performance on my rtx 4070 ti, but clblast default is still faster at about 26 ms per token, while it sits currently at 31 ms per token, down from 47. trying to divide compute more didn't make a difference.

uchuusen Jun 22, 2023

My Vega 10 iGPU gets a good boost from this latest branch as well, going down from 267ms/T with 7b q4_0, to 200ms/T. With CPU only, the fastest it gets is 240ms. This now beats my CPU in all cases. Really great work!

Dampfinchen · 2023-06-08T10:56:22Z

Dampfinchen
Jun 8, 2023

@JohannesGaessler I did notice that with the latest version of llama.cpp at the time of this writing, the VRAM usage while offloading layers to the GPU using CUDA increased. I had 15 layers (13B model, RTX 2060, ggml 5_1) before with a VRAM usage of around 3400 MB. Now at the same amount of layers it needs 3900 MB. This slows down generation quite a lot as I have to use less layers now.

2 replies

JohannesGaessler Jun 8, 2023
Collaborator

Which GPU and operating system are you using? Generally you should be getting a higher VRAM usage compared to before but not a significant performance regression because more of the model can now be accelerated. These are the numbers that I got on Linux: #1703 (comment)

In any case, llama.cpp currently needs 1 MB of VRAM memory per batch size so you can reduce your VRAM usage by reducing that parameter.

Dampfinchen Jun 8, 2023

Which GPU and operating system are you using? Generally you should be getting a higher VRAM usage compared to before but not a significant performance regression because more of the model can now be accelerated. These are the numbers that I got on Linux: #1703 (comment)

In any case, llama.cpp currently needs 1 MB of VRAM memory per batch size so you can reduce your VRAM usage by reducing that parameter.

Now that I did a proper comparison between the two, you are right that the speeds are roughly comparable. I guess my initial feeling of generation being slower was just confirmation bias. Here's the performance figures (old 15 layers, around 3400 mb, new 13 layers, around 3400 mb vram usage)

Llama.cpp old:

llama_print_timings: load time = 44221.63 ms
llama_print_timings: sample time = 144.82 ms / 180 runs ( 0.80 ms per token)
llama_print_timings: prompt eval time = 59045.43 ms / 1849 tokens ( 31.93 ms per token)
llama_print_timings: eval time = 77643.51 ms / 179 runs ( 433.76 ms per token)
llama_print_timings: total time = 142706.14 ms

Llama.cpp new:

llama_print_timings: load time = 45933.30 ms
llama_print_timings: sample time = 136.82 ms / 180 runs ( 0.76 ms per token)
llama_print_timings: prompt eval time = 60995.50 ms / 1849 tokens ( 32.99 ms per token)
llama_print_timings: eval time = 78081.99 ms / 179 runs ( 436.21 ms per token)
llama_print_timings: total time = 145089.77 ms

I cannot verify if the prompt processing is truly faster. The first batch of prompt processing always takes very long for some reason, so that leads to unreliable results. Once loaded in a WebUI like Ooba and chatting with the model, the prompt processing is much faster even when processing 1800 tokens (then its around 15 ms per token). But I will be sure to test this once the new llama.cpp updates are in the python wrapper and Ooba.

I am using Windows 11, and my CPU is a Core i7 9750H btw.

kelteseth · 2023-08-03T11:48:41Z

kelteseth
Aug 3, 2023

With HIP Windows SDK now being public 🎉, would this mean it is now possible to add AMD Windows HIP support to llama?

See https://www.amd.com/en/developer/rocm-hub/hip-sdk.html

https://rocm.docs.amd.com/en/latest/release/windows_support.html

1 reply

sorasoras Dec 7, 2023

It's support and working with my 7900xtx

crystalthoughts · 2023-08-10T13:46:13Z

crystalthoughts
Aug 10, 2023

Has anyone considered transpiling the ggml format to Futhark?
It has CUDA and OpenCL exports and might save a lot of work.

1 reply

ianscrivener Aug 10, 2023

According to the Futhark bio it is an "ongoing research project" from the Uninversity of Copenhagen - "a small programming language designed to be compiled to efficient parallel code"... "use the compute power of the GPU to accelerate data-parallel array computations".. "not intended to replace existing general-purpose languages"

crystalthoughts · 2023-08-11T06:30:46Z

crystalthoughts
Aug 11, 2023

Might be a source of inspiration / I’m sure they accept pull requests. I was under the impression ml weights were basically DSP graphs.

…

On Thu, 10 Aug 2023 at 23:34, Ian Scrivener ***@***.***> wrote: According to the *Futhark* bio it is an "*ongoing research project*" from the UNinversity of Copenhagen - "a small programming language designed to be compiled to efficient parallel code"... "use the compute power of the GPU to accelerate data-parallel array computations".. "*not intended to replace existing general-purpose languages*" — Reply to this email directly, view it on GitHub <#915 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABSNQQBX4FRASL7OTH3LR4TXUVHWXANCNFSM6AAAAAAW3XHQYE> . You are receiving this because you commented.Message ID: <ggerganov/llama. ***@***.***>

0 replies

StuartIanNaylor · 2023-08-11T13:09:13Z

StuartIanNaylor
Aug 11, 2023

Likely GGML Vulkan based routines would of been best as it would be the only single source multiplatform compatible solution as just look at the plethora of suggestions that most have specific platform support problems.

1 reply

PoignardAzur Dec 18, 2023

For what it's worth, WebGPU is more multi-platform than Vulkan (eg no conversion layer on MacOS).

It's a WebStandard that all major browser have agreed to implement (though only Chrome ships it for now). So eventually anything that runs on WebGPU will run on NVidia, AMD, destkops, iphones, windows, linux, firefox, chrome, etc.

PoignardAzur · 2023-12-18T19:33:11Z

PoignardAzur
Dec 18, 2023

I'm extremely in favor of a WebGPU backend.

Being able to run ggml graphs in WebGPU would mean being able to run them with GPU performance in the browser. Combined with Chrome's ttsEngine API, that means being able to install an extension that runs state-of-the-art Text to Speech models locally as the backend for your screen-reader. Compared to the current crappy robot-voices you get from your OS and from popular screen reader extensions, I can't overstate how huge an improvement that would be.

7 replies

sroussey Dec 18, 2023

And if you doing just embeddings: some are quite small (relatively):

https://huggingface.co/spaces/mteb/leaderboard

sroussey Dec 18, 2023

Oh, and it I remember right, they uses webworkers for extra threads and the browser cache API to keep the models around.

PoignardAzur Dec 18, 2023

WebGPU backend would be great, though it's application for LLMs is still questionable. I don't see a good reason to load multiple GB of data in the web-browser each time I want to generate a few tokens.

To be specific, I was thinking of the ttsEngine API which is a non-standard part of the WebExtension API that Chrome implements. The idea is that you download an extension once (say, because you're someone with poor eyesight) and then you get state of the art TTS for every page you visit.

But more generally speaking, anything you can use in a WebExtension would be great.

FrankenApps Jan 27, 2024

Using Dawn natively could be interesting, too. It has wide support for operating systems and devices. Moreover utilizing the WebGPU spec could mean that other WebGPU backends could also easily be supported in the future (e.g. wgpu or a future implementation of WebGPU for Safari if it would be made publicly available).

sorasoras Jan 27, 2024

https://webllm.mlc.ai/
There are something like this that run llm on webgpu

zhouwg · 2024-03-25T14:58:27Z

zhouwg
Mar 25, 2024

let's try to hack something fast, small and cool and see where it goes 👍

0 replies

Add GPU support to ggml #915

ggerganov Apr 12, 2023 Maintainer

Intro

Description

Guiding principles

Why?

Links

Update 28 May 2023:

Replies: 48 comments · 161 replies

ggerganov Apr 17, 2023 Maintainer Author

ivanstepanovftw Apr 15, 2023 Collaborator

SlyEcho Apr 13, 2023 Collaborator Sponsor

SlyEcho Apr 26, 2023 Collaborator Sponsor

gjmulder Apr 13, 2023 Collaborator

ggerganov
Apr 12, 2023
Maintainer

Replies: 48 comments 161 replies

ggerganov Apr 17, 2023
Maintainer Author

ivanstepanovftw Apr 15, 2023
Collaborator

SlyEcho Apr 13, 2023
Collaborator Sponsor

SlyEcho Apr 26, 2023
Collaborator Sponsor

gjmulder Apr 13, 2023
Collaborator