
move BLAS to a separate backend #6210

Merged · 15 commits · Jun 13, 2024
Conversation

@slaren (Collaborator) commented Mar 21, 2024

Moves BLAS support from ggml.c to a separate backend, and adds the necessary changes to ggml-backend to support backends that only implement matrix multiplication.

  • Changes to ggml-backend
    • Support for fallback to CPU with ggml_backend_sched
      • Operations not implemented by a backend will be automatically run on the CPU, as long as the operation is reported as not supported by the backend's supports_op function (see the sketch after this list)
    • Moved buffer type function supports_backend to backend function supports_buft
    • Backends that want to declare compatibility with any kind of host buffer can return ggml_backend_buft_is_host from supports_buft
    • ggml_backend_sched will avoid copies between backends when the backend supports the buffer type
      • Eg. when switching from Metal to CPU, no tensors will be copied since the Metal buffers are compatible with the CPU backend (but not the other way around)
  • The GGML_SCHED_DEBUG environment variable can be used to view the graph splits. This is useful to see what operations are being run on each backend
  • Adds the BLAS backend
    • Supports matrix multiplication using a BLAS library. Previously, this was supported as part of the CPU backend
    • Threads no longer spin while BLAS is running, potentially improving performance, and batch processing is no longer limited to 4 threads when using BLAS
    • The number of threads of the BLAS library is configured automatically for OpenBLAS and BLIS (with -t or -tb)
    • For better performance, it is recommended to use OpenMP versions of the BLAS libraries, if available (except macOS)
    • Like before, to enable the BLAS backend, build with the flag LLAMA_BLAS when using cmake, or when using make, LLAMA_OPENBLAS, LLAMA_OPENBLAS64 or LLAMA_BLIS
    • On macOS, this is enabled by default through Accelerate
    • BLAS support has been removed from the CPU backend in ggml.c. Applications that want to support BLAS will need to use the BLAS backend
    • Since this backend only implements matrix multiplication, it should be used with ggml_backend_sched alongside the CPU or other backends to provide support for other operations
    • Note: the BLAS backend should not be used alongside GPU backends, as it will prevent offloading of large batches with partial offloading
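As a rough illustration of the fallback mechanism described in the list above, the two support callbacks of a matrix-multiplication-only backend could look roughly like this (a minimal sketch, not the actual ggml-blas.cpp implementation; the function names are placeholders):

// Minimal sketch of a mat-mul-only backend's support callbacks (illustrative only;
// the real ggml-blas.cpp also checks tensor types and sizes).
#include "ggml.h"
#include "ggml-backend.h"

// Only GGML_OP_MUL_MAT is claimed; every other op is reported as unsupported, so
// ggml_backend_sched automatically falls back to the CPU backend for it.
static bool example_blas_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    (void) backend;
    return op->op == GGML_OP_MUL_MAT;
}

// Declaring compatibility with any host buffer lets the scheduler skip copies when
// mixing this backend with the CPU backend.
static bool example_blas_supports_buft(ggml_backend_t backend, ggml_backend_buffer_type_t buft) {
    (void) backend;
    return ggml_backend_buft_is_host(buft);
}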

@ggerganov (Owner)

this will also have the effect that using BLAS will require using ggml-backend and ggml_backend_sched, is that a problem?

Will just need to adapt whisper.cpp when it's ready

@mofosyne added the Review Complexity : High and refactoring labels May 10, 2024
github-actions bot (Contributor) commented May 11, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 556 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8392.96ms p(95)=20049.24ms fails=, finish reason: stop=505 truncated=51
  • Prompt processing (pp): avg=95.91tk/s p(95)=451.37tk/s
  • Token generation (tg): avg=32.54tk/s p(95)=46.27tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sl/blas-backend commit=ecb75b5f54cab6ca7f77ec51eb5f7d87c87be6cd

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 556 iterations)]

@zhouwg (comment marked as off-topic)

@slaren (Collaborator, Author) commented May 30, 2024

@mofosyne I appreciate that you are trying to help, but please don't do that on my PRs. I very often have not pushed local changes and I prefer to deal with the merge conflicts myself.

github-actions bot added the ggml label May 30, 2024
github-actions bot added the build label Jun 4, 2024
@slaren force-pushed the sl/blas-backend branch 3 times, most recently from 2b5c73d to ca91205 (June 4, 2024 23:16)
github-actions bot added the Vulkan, SYCL and Kompute labels Jun 4, 2024
@slaren (Collaborator, Author) commented Jun 6, 2024

@ggerganov I am thinking about how Accelerate should interact with the BLAS backend. I think this would make sense:

  • GGML_USE_ACCELERATE defined: Accelerate is used in ggml.c
  • GGML_USE_ACCELERATE and GGML_USE_BLAS defined: the BLAS backend is used, with Accelerate as the BLAS library

Conversely:

  • If GGML_USE_BLAS is not defined, Accelerate will not be used for GEMMs

Currently llama.cpp has to check for defined(GGML_USE_BLAS) || defined(GGML_USE_ACCELERATE) to decide when to use the BLAS backend, which doesn't seem very good. In the BLAS backend, Accelerate is treated the same way as any other BLAS library.
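A minimal sketch of what the selection logic in llama.cpp could reduce to under this scheme (illustrative only; ggml_backend_blas_init is assumed to be the constructor exposed by the new backend, and the helper name here is made up):

// Hedged sketch: backend selection under the proposed scheme. Only GGML_USE_BLAS
// matters; on macOS the backend is simply built against Accelerate as its BLAS library.
#include "ggml-backend.h"
#ifdef GGML_USE_BLAS
#include "ggml-blas.h"
#endif

static ggml_backend_t make_blas_backend_if_enabled(void) {
#ifdef GGML_USE_BLAS
    // no defined(GGML_USE_ACCELERATE) special case needed anymore
    return ggml_backend_blas_init();
#else
    return NULL;
#endif
}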

@ggerganov (Owner)

Yes, that makes sense. With only GGML_USE_ACCELERATE we will still be able to use some non-GEMM functionality from the Accelerate framework such as vDSP - it's just very convenient for Apple Silicon devices to have this framework available in the core ggml.c. For GEMMs we would explicitly need to have GGML_USE_BLAS for all kinds of BLAS implementations, including Accelerate's BLAS

@slaren marked this pull request as ready for review (June 6, 2024 23:58)
@slaren (Collaborator, Author) commented Jun 7, 2024

Some notes on performance.

BLIS responds very well to these changes and improves performance in most cases. In master, the number of threads is limited to 4 when using BLAS; this is not the case with the BLAS backend, and using a higher number of threads should improve the performance of dequantization and other operations.

LLAMA_BLIS=1 BLIS_NUM_THREADS=24:

| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/blas-backend | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp32 | 12.75 | 19.76 | 1.55 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp64 | 21.37 | 31.86 | 1.49 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp128 | 33.38 | 44.09 | 1.32 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp256 | 49.46 | 57.94 | 1.17 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp512 | 59.21 | 63.64 | 1.07 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 16 | pp32 | 13.32 | 18.52 | 1.39 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 16 | pp64 | 23.34 | 27.53 | 1.18 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 16 | pp128 | 33.67 | 38.56 | 1.15 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 16 | pp256 | 49.20 | 47.68 | 0.97 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 16 | pp512 | 59.41 | 57.81 | 0.97 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp32 | 24.42 | 36.85 | 1.51 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp64 | 35.50 | 47.30 | 1.33 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp128 | 45.71 | 54.15 | 1.18 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp256 | 62.75 | 69.03 | 1.10 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp512 | 66.93 | 70.51 | 1.05 |
| i9-13900K | llama 7B all F32 | 25.10 | 16 | pp32 | 25.00 | 26.01 | 1.04 |
| i9-13900K | llama 7B all F32 | 25.10 | 16 | pp64 | 35.31 | 35.59 | 1.01 |
| i9-13900K | llama 7B all F32 | 25.10 | 16 | pp128 | 45.62 | 45.87 | 1.01 |
| i9-13900K | llama 7B all F32 | 25.10 | 16 | pp256 | 61.40 | 55.45 | 0.90 |
| i9-13900K | llama 7B all F32 | 25.10 | 16 | pp512 | 66.05 | 63.46 | 0.96 |

OpenBLAS is a headache. Increasing the number of threads above 8 has catastrophic effects on performance in ways that I cannot explain, because these threads are not running while the OpenBLAS GEMM is running. Since there are better alternatives easily available now, I would suggest ignoring this library.

LLAMA_OPENBLAS=1 OPENBLAS_NUM_THREADS=24:

| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/blas-backend | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| i9-13900K | llama 7B Q4_0 | 3.56 | 4 | pp32 | 14.79 | 14.63 | 0.99 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 4 | pp64 | 22.30 | 21.85 | 0.98 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 4 | pp128 | 29.36 | 28.44 | 0.97 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp32 | 15.01 | 15.83 | 1.05 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp64 | 22.22 | 23.77 | 1.07 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp128 | 29.38 | 29.97 | 1.02 |
| i9-13900K | llama 7B all F32 | 25.10 | 4 | pp32 | 23.55 | 23.55 | 1.00 |
| i9-13900K | llama 7B all F32 | 25.10 | 4 | pp64 | 30.88 | 30.46 | 0.99 |
| i9-13900K | llama 7B all F32 | 25.10 | 4 | pp128 | 36.20 | 34.92 | 0.96 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp32 | 23.60 | 22.81 | 0.97 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp64 | 30.84 | 31.13 | 1.01 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp128 | 34.42 | 35.92 | 1.04 |

The OpenMP thread pool does not work on macOS, so the overhead of starting threads continuously is expected to be higher. However, with the default number of threads on my system (12), it still seems to result in a speedup. There is a drop in performance specifically with 4 threads and the larger batch sizes that I don't understand; I would expect the overhead to be proportionally smaller as the batch size increases.

| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/blas-backend | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp32 | 33.75 | 35.89 | 1.06 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp64 | 57.97 | 61.30 | 1.06 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp128 | 89.04 | 91.20 | 1.02 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp256 | 121.10 | 111.30 | 0.92 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp512 | 128.46 | 117.87 | 0.92 |
| M3 Max | llama 7B Q4_0 | 3.56 | 12 | pp32 | 33.56 | 37.09 | 1.11 |
| M3 Max | llama 7B Q4_0 | 3.56 | 12 | pp64 | 57.27 | 63.48 | 1.11 |
| M3 Max | llama 7B Q4_0 | 3.56 | 12 | pp128 | 88.55 | 97.80 | 1.10 |
| M3 Max | llama 7B Q4_0 | 3.56 | 12 | pp256 | 121.35 | 127.76 | 1.05 |
| M3 Max | llama 7B Q4_0 | 3.56 | 12 | pp512 | 128.42 | 136.79 | 1.07 |
| M3 Max | llama 7B all F32 | 25.10 | 4 | pp32 | 51.04 | 49.88 | 0.98 |
| M3 Max | llama 7B all F32 | 25.10 | 4 | pp64 | 79.66 | 77.13 | 0.97 |
| M3 Max | llama 7B all F32 | 25.10 | 4 | pp128 | 103.58 | 107.29 | 1.04 |
| M3 Max | llama 7B all F32 | 25.10 | 4 | pp256 | 141.23 | 123.31 | 0.87 |
| M3 Max | llama 7B all F32 | 25.10 | 4 | pp512 | 138.78 | 122.95 | 0.89 |
| M3 Max | llama 7B all F32 | 25.10 | 12 | pp32 | 50.99 | 48.56 | 0.95 |
| M3 Max | llama 7B all F32 | 25.10 | 12 | pp64 | 79.38 | 77.12 | 0.97 |
| M3 Max | llama 7B all F32 | 25.10 | 12 | pp128 | 112.81 | 112.74 | 1.00 |
| M3 Max | llama 7B all F32 | 25.10 | 12 | pp256 | 141.26 | 140.00 | 0.99 |
| M3 Max | llama 7B all F32 | 25.10 | 12 | pp512 | 138.73 | 140.82 | 1.02 |

@ggerganov (Owner)

There is a drop in performance specifically with 4 threads and the larger batch sizes that I don't understand

On M2 Ultra there is a similar effect with the Q4_0 and F32 models, while for Q8_0 and F16 it behaves as you expect:

LLAMA_NO_LLAMAFILE=1 LLAMA_NO_METAL=1 ./scripts/compare-commits.sh master sl/blas-backend -m models/tinyllama-1b/ggml-model-q4_0.gguf -m models/tinyllama-1b/ggml-model-q8_0.gguf -m models/tinyllama-1b/ggml-model-f16.gguf -m models/tinyllama-1b/ggml-model-f32.gguf -p 32,64,128,256,512 -n 0 -t 4,8,16
| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/blas-backend | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | 1B F16 | 2.05 | 4 | pp32 | 265.26 | 214.52 | 0.81 |
| M2 Ultra | 1B F16 | 2.05 | 4 | pp64 | 452.51 | 361.71 | 0.80 |
| M2 Ultra | 1B F16 | 2.05 | 4 | pp128 | 640.73 | 537.69 | 0.84 |
| M2 Ultra | 1B F16 | 2.05 | 4 | pp256 | 664.02 | 588.17 | 0.89 |
| M2 Ultra | 1B F16 | 2.05 | 4 | pp512 | 624.32 | 570.26 | 0.91 |
| M2 Ultra | 1B F16 | 2.05 | 8 | pp32 | 266.72 | 263.57 | 0.99 |
| M2 Ultra | 1B F16 | 2.05 | 8 | pp64 | 464.59 | 441.93 | 0.95 |
| M2 Ultra | 1B F16 | 2.05 | 8 | pp128 | 648.08 | 660.74 | 1.02 |
| M2 Ultra | 1B F16 | 2.05 | 8 | pp256 | 659.51 | 705.26 | 1.07 |
| M2 Ultra | 1B F16 | 2.05 | 8 | pp512 | 622.22 | 745.74 | 1.20 |
| M2 Ultra | 1B F16 | 2.05 | 16 | pp32 | 271.83 | 224.23 | 0.82 |
| M2 Ultra | 1B F16 | 2.05 | 16 | pp64 | 456.11 | 394.74 | 0.87 |
| M2 Ultra | 1B F16 | 2.05 | 16 | pp128 | 633.06 | 610.69 | 0.96 |
| M2 Ultra | 1B F16 | 2.05 | 16 | pp256 | 656.49 | 699.64 | 1.07 |
| M2 Ultra | 1B F16 | 2.05 | 16 | pp512 | 625.12 | 816.75 | 1.31 |
| M2 Ultra | 1B Q4_0 | 0.59 | 4 | pp32 | 227.12 | 224.63 | 0.99 |
| M2 Ultra | 1B Q4_0 | 0.59 | 4 | pp64 | 373.63 | 371.20 | 0.99 |
| M2 Ultra | 1B Q4_0 | 0.59 | 4 | pp128 | 622.08 | 571.99 | 0.92 |
| M2 Ultra | 1B Q4_0 | 0.59 | 4 | pp256 | 658.47 | 612.67 | 0.93 |
| M2 Ultra | 1B Q4_0 | 0.59 | 4 | pp512 | 634.58 | 587.23 | 0.93 |
| M2 Ultra | 1B Q4_0 | 0.59 | 8 | pp32 | 253.38 | 269.27 | 1.06 |
| M2 Ultra | 1B Q4_0 | 0.59 | 8 | pp64 | 410.88 | 436.95 | 1.06 |
| M2 Ultra | 1B Q4_0 | 0.59 | 8 | pp128 | 636.57 | 673.07 | 1.06 |
| M2 Ultra | 1B Q4_0 | 0.59 | 8 | pp256 | 661.17 | 712.17 | 1.08 |
| M2 Ultra | 1B Q4_0 | 0.59 | 8 | pp512 | 631.02 | 745.85 | 1.18 |
| M2 Ultra | 1B Q4_0 | 0.59 | 16 | pp32 | 246.48 | 213.06 | 0.86 |
| M2 Ultra | 1B Q4_0 | 0.59 | 16 | pp64 | 396.73 | 375.92 | 0.95 |
| M2 Ultra | 1B Q4_0 | 0.59 | 16 | pp128 | 632.46 | 600.46 | 0.95 |
| M2 Ultra | 1B Q4_0 | 0.59 | 16 | pp256 | 658.53 | 703.00 | 1.07 |
| M2 Ultra | 1B Q4_0 | 0.59 | 16 | pp512 | 631.18 | 816.15 | 1.29 |
| M2 Ultra | 1B Q8_0 | 1.09 | 4 | pp32 | 248.64 | 221.43 | 0.89 |
| M2 Ultra | 1B Q8_0 | 1.09 | 4 | pp64 | 430.98 | 378.20 | 0.88 |
| M2 Ultra | 1B Q8_0 | 1.09 | 4 | pp128 | 628.65 | 565.11 | 0.90 |
| M2 Ultra | 1B Q8_0 | 1.09 | 4 | pp256 | 650.70 | 606.51 | 0.93 |
| M2 Ultra | 1B Q8_0 | 1.09 | 4 | pp512 | 629.00 | 584.55 | 0.93 |
| M2 Ultra | 1B Q8_0 | 1.09 | 8 | pp32 | 253.84 | 253.78 | 1.00 |
| M2 Ultra | 1B Q8_0 | 1.09 | 8 | pp64 | 434.08 | 431.23 | 0.99 |
| M2 Ultra | 1B Q8_0 | 1.09 | 8 | pp128 | 628.08 | 663.30 | 1.06 |
| M2 Ultra | 1B Q8_0 | 1.09 | 8 | pp256 | 645.79 | 710.54 | 1.10 |
| M2 Ultra | 1B Q8_0 | 1.09 | 8 | pp512 | 629.27 | 753.74 | 1.20 |
| M2 Ultra | 1B Q8_0 | 1.09 | 16 | pp32 | 252.36 | 214.02 | 0.85 |
| M2 Ultra | 1B Q8_0 | 1.09 | 16 | pp64 | 431.50 | 372.52 | 0.86 |
| M2 Ultra | 1B Q8_0 | 1.09 | 16 | pp128 | 635.57 | 589.64 | 0.93 |
| M2 Ultra | 1B Q8_0 | 1.09 | 16 | pp256 | 644.00 | 707.12 | 1.10 |
| M2 Ultra | 1B Q8_0 | 1.09 | 16 | pp512 | 634.01 | 813.76 | 1.28 |
| M2 Ultra | 1B all F32 | 4.10 | 4 | pp32 | 324.34 | 318.84 | 0.98 |
| M2 Ultra | 1B all F32 | 4.10 | 4 | pp64 | 497.14 | 501.49 | 1.01 |
| M2 Ultra | 1B all F32 | 4.10 | 4 | pp128 | 780.47 | 687.67 | 0.88 |
| M2 Ultra | 1B all F32 | 4.10 | 4 | pp256 | 719.40 | 676.08 | 0.94 |
| M2 Ultra | 1B all F32 | 4.10 | 4 | pp512 | 668.30 | 614.47 | 0.92 |
| M2 Ultra | 1B all F32 | 4.10 | 8 | pp32 | 394.99 | 325.57 | 0.82 |
| M2 Ultra | 1B all F32 | 4.10 | 8 | pp64 | 588.41 | 522.05 | 0.89 |
| M2 Ultra | 1B all F32 | 4.10 | 8 | pp128 | 769.56 | 727.87 | 0.95 |
| M2 Ultra | 1B all F32 | 4.10 | 8 | pp256 | 715.89 | 742.46 | 1.04 |
| M2 Ultra | 1B all F32 | 4.10 | 8 | pp512 | 672.34 | 772.63 | 1.15 |
| M2 Ultra | 1B all F32 | 4.10 | 16 | pp32 | 397.21 | 246.50 | 0.62 |
| M2 Ultra | 1B all F32 | 4.10 | 16 | pp64 | 589.79 | 421.42 | 0.71 |
| M2 Ultra | 1B all F32 | 4.10 | 16 | pp128 | 770.60 | 655.37 | 0.85 |
| M2 Ultra | 1B all F32 | 4.10 | 16 | pp256 | 712.96 | 724.44 | 1.02 |
| M2 Ultra | 1B all F32 | 4.10 | 16 | pp512 | 669.12 | 837.94 | 1.25 |

@slaren (Collaborator, Author) commented Jun 8, 2024

I realized that there is an issue that causes the kqv matrix multiplication to not be offloaded to the BLAS backend, and this is what causes the performance difference with a low number of threads. If this matrix multiplication is removed by enabling flash attention, then the performance difference disappears.

| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/blas-backend | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M3 Max | 7B Q4_0 | 3.56 | 4 | pp32 | 32.77 | 36.23 | 1.11 |
| M3 Max | 7B Q4_0 | 3.56 | 4 | pp64 | 58.37 | 62.11 | 1.06 |
| M3 Max | 7B Q4_0 | 3.56 | 4 | pp128 | 92.00 | 95.05 | 1.03 |
| M3 Max | 7B Q4_0 | 3.56 | 4 | pp256 | 119.94 | 121.50 | 1.01 |
| M3 Max | 7B Q4_0 | 3.56 | 4 | pp512 | 127.86 | 126.98 | 0.99 |
| M3 Max | 7B Q4_0 | 3.56 | 12 | pp32 | 33.67 | 38.43 | 1.14 |
| M3 Max | 7B Q4_0 | 3.56 | 12 | pp64 | 57.92 | 64.87 | 1.12 |
| M3 Max | 7B Q4_0 | 3.56 | 12 | pp128 | 90.62 | 99.64 | 1.10 |
| M3 Max | 7B Q4_0 | 3.56 | 12 | pp256 | 120.33 | 129.79 | 1.08 |
| M3 Max | 7B Q4_0 | 3.56 | 12 | pp512 | 126.31 | 141.03 | 1.12 |
| M3 Max | 7B F32 | 25.10 | 4 | pp32 | 51.18 | 50.64 | 0.99 |
| M3 Max | 7B F32 | 25.10 | 4 | pp64 | 79.80 | 79.86 | 1.00 |
| M3 Max | 7B F32 | 25.10 | 4 | pp128 | 112.13 | 111.49 | 0.99 |
| M3 Max | 7B F32 | 25.10 | 4 | pp256 | 138.19 | 133.77 | 0.97 |
| M3 Max | 7B F32 | 25.10 | 4 | pp512 | 133.84 | 131.51 | 0.98 |
| M3 Max | 7B F32 | 25.10 | 12 | pp32 | 50.47 | 48.97 | 0.97 |
| M3 Max | 7B F32 | 25.10 | 12 | pp64 | 79.65 | 78.19 | 0.98 |
| M3 Max | 7B F32 | 25.10 | 12 | pp128 | 112.97 | 115.07 | 1.02 |
| M3 Max | 7B F32 | 25.10 | 12 | pp256 | 137.53 | 143.82 | 1.05 |
| M3 Max | 7B F32 | 25.10 | 12 | pp512 | 134.25 | 149.98 | 1.12 |

@slaren (Collaborator, Author) commented Jun 11, 2024

This should be good now. I have updated the PR description with more details about the changes included here.

@ggerganov self-requested a review (June 12, 2024 07:17)
@ggerganov (Owner) left a comment

Note: the BLAS backend should not be used alongside GPU backends, as it will prevent offloading of large batches with partial offloading

On macOS with Metal enabled, when I build with LLAMA_BLAS=OFF and run with partial offloading (-ngl 28), the non-offloaded layers are running on the CPU backend:

...
node # 32 (       ADD):              l_out-0 (   8M) [  CPU         ]:            ffn_out-0 (   8M) [  CPU         ]            ffn_inp-0 (   8M) [  CPU         ]
node # 33 (  RMS_NORM):               norm-1 (   8M) [  CPU         ]:              l_out-0 (   8M) [  CPU         ]
node # 34 (       MUL):          attn_norm-1 (   8M) [  CPU         ]:               norm-1 (   8M) [  CPU         ] blk.1.attn_norm.weig (  16K) [  CPU         ]
node # 35 (   MUL_MAT):               Qcur-1 (   8M) [  CPU         ]:  blk.1.attn_q.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]
node # 37 (      ROPE):               Qcur-1 (   8M) [  CPU         ]:    Qcur-1 (reshaped) (   8M) [  CPU         ]              inp_pos (   2K) [  CPU         ]
node # 38 (   MUL_MAT):               Kcur-1 (   8M) [  CPU         ]:  blk.1.attn_k.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]
node # 40 (      ROPE):               Kcur-1 (   8M) [  CPU         ]:    Kcur-1 (reshaped) (   8M) [  CPU         ]              inp_pos (   2K) [  CPU         ]
node # 41 (   MUL_MAT):               Vcur-1 (   8M) [  CPU         ]:  blk.1.attn_v.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]
node # 43 (       CPY): k_cache_view-1 (copy (   4M) [  CPU         ]:               Kcur-1 (   8M) [  CPU         ]       k_cache_view-1 (   4M) [  CPU         ]
node # 46 (       CPY): v_cache_view-1 (copy (   4M) [  CPU         ]:  Vcur-1 (transposed) (   8M) [  CPU         ]       v_cache_view-1 (   4M) [  CPU         ]
node # 50 (   MUL_MAT):                 kq-1 (  32M) [  CPU         ]:                  k-1 (   4M) [  CPU         ]                  q-1 (   8M) [  CPU         ]
node # 51 (  SOFT_MAX):    kq_soft_max_ext-1 (  32M) [  CPU         ]:                 kq-1 (  32M) [  CPU         ]              KQ_mask (   1M) [  CPU         ]
node # 52 (   MUL_MAT):                kqv-1 (   8M) [  CPU         ]:                  v-1 (   4M) [  CPU         ]    kq_soft_max_ext-1 (  32M) [  CPU         ]
node # 54 (      CONT):    kqv_merged_cont-1 (   8M) [  CPU         ]:         kqv_merged-1 (   8M) [  CPU         ]
node # 55 (   MUL_MAT):            kqv_out-1 (   8M) [  CPU         ]: blk.1.attn_output.we (  17M) [  CPU         ]    kqv_merged_cont-1 (   8M) [  CPU         ]
node # 56 (       ADD):            ffn_inp-1 (   8M) [  CPU         ]:            kqv_out-1 (   8M) [  CPU         ]              l_out-0 (   8M) [  CPU         ]
node # 57 (  RMS_NORM):               norm-1 (   8M) [  CPU         ]:            ffn_inp-1 (   8M) [  CPU         ]
node # 58 (       MUL):           ffn_norm-1 (   8M) [  CPU         ]:               norm-1 (   8M) [  CPU         ] blk.1.ffn_norm.weigh (  16K) [  CPU         ]
node # 59 (   MUL_MAT):           ffn_gate-1 (  21M) [  CPU         ]: blk.1.ffn_gate.weigh (  45M) [  CPU         ]           ffn_norm-1 (   8M) [  CPU         ]
node # 60 (     UNARY):           ffn_silu-1 (  21M) [  CPU         ]:           ffn_gate-1 (  21M) [  CPU         ]
node # 61 (   MUL_MAT):             ffn_up-1 (  21M) [  CPU         ]:  blk.1.ffn_up.weight (  45M) [  CPU         ]           ffn_norm-1 (   8M) [  CPU         ]
node # 62 (       MUL):       ffn_gate_par-1 (  21M) [  CPU         ]:           ffn_silu-1 (  21M) [  CPU         ]             ffn_up-1 (  21M) [  CPU         ]
node # 63 (   MUL_MAT):            ffn_out-1 (   8M) [  CPU         ]: blk.1.ffn_down.weigh (  45M) [  CPU         ]       ffn_gate_par-1 (  21M) [  CPU         ]
node # 64 (       ADD):              l_out-1 (   8M) [  CPU         ]:            ffn_out-1 (   8M) [  CPU         ]            ffn_inp-1 (   8M) [  CPU         ]
node # 65 (  RMS_NORM):               norm-2 (   8M) [  CPU         ]:              l_out-1 (   8M) [  CPU         ]
...

With LLAMA_BLAS=ON it uses the BLAS backend for the matrix multiplications:

...
## SPLIT #16: Metal # 1 inputs: [ffn_out-0 (   8M)] 
node # 32 (       ADD):              l_out-0 (   8M) [Metal         ]:    Metal#ffn_out-0#0 (   8M) [ NULL         ]            ffn_inp-0 (   8M) [Metal         ]
node # 33 (  RMS_NORM):               norm-1 (   8M) [Metal         ]:              l_out-0 (   8M) [Metal         ]

## SPLIT #17: CPU # 0 inputs: 
node # 34 (       MUL):          attn_norm-1 (   8M) [  CPU         ]:               norm-1 (   8M) [Metal         ] blk.1.attn_norm.weig (  16K) [  CPU         ]

## SPLIT #18: BLAS # 0 inputs: 
node # 35 (   MUL_MAT):               Qcur-1 (   8M) [ BLAS         ]:  blk.1.attn_q.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]

## SPLIT #19: Metal # 1 inputs: [Qcur-1 (reshaped) (   8M)] 
node # 37 (      ROPE):               Qcur-1 (   8M) [Metal         ]: Metal#Qcur-1 (reshap (   8M) [ NULL         ]      Metal#inp_pos#0 (   2K) [ NULL         ]

## SPLIT #20: BLAS # 0 inputs: 
node # 38 (   MUL_MAT):               Kcur-1 (   8M) [ BLAS         ]:  blk.1.attn_k.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]

## SPLIT #21: Metal # 1 inputs: [Kcur-1 (reshaped) (   8M)] 
node # 40 (      ROPE):               Kcur-1 (   8M) [Metal         ]: Metal#Kcur-1 (reshap (   8M) [ NULL         ]      Metal#inp_pos#0 (   2K) [ NULL         ]

## SPLIT #22: BLAS # 0 inputs: 
node # 41 (   MUL_MAT):               Vcur-1 (   8M) [ BLAS         ]:  blk.1.attn_v.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]

## SPLIT #23: CPU # 0 inputs: 
node # 43 (       CPY): k_cache_view-1 (copy (   4M) [  CPU         ]:               Kcur-1 (   8M) [Metal         ]       k_cache_view-1 (   4M) [  CPU         ]
node # 46 (       CPY): v_cache_view-1 (copy (   4M) [  CPU         ]:  Vcur-1 (transposed) (   8M) [ BLAS         ]       v_cache_view-1 (   4M) [  CPU         ]

## SPLIT #24: Metal # 2 inputs: [k-1 (   4M)] [v-1 (   4M)] 
node # 50 (   MUL_MAT):                 kq-1 (  32M) [Metal         ]:          Metal#k-1#0 (   4M) [ NULL         ]                  q-1 (   8M) [Metal         ]
node # 51 (  SOFT_MAX):    kq_soft_max_ext-1 (  32M) [Metal         ]:                 kq-1 (  32M) [Metal         ]      Metal#KQ_mask#0 (   1M) [ NULL         ]
node # 52 (   MUL_MAT):                kqv-1 (   8M) [Metal         ]:          Metal#v-1#0 (   4M) [ NULL         ]    kq_soft_max_ext-1 (  32M) [Metal         ]
node # 54 (      CONT):    kqv_merged_cont-1 (   8M) [Metal         ]:         kqv_merged-1 (   8M) [Metal         ]

## SPLIT #25: BLAS # 0 inputs: 
node # 55 (   MUL_MAT):            kqv_out-1 (   8M) [ BLAS         ]: blk.1.attn_output.we (  17M) [  CPU         ]    kqv_merged_cont-1 (   8M) [Metal         ]

## SPLIT #26: Metal # 1 inputs: [kqv_out-1 (   8M)] 
node # 56 (       ADD):            ffn_inp-1 (   8M) [Metal         ]:    Metal#kqv_out-1#0 (   8M) [ NULL         ]              l_out-0 (   8M) [Metal         ]
node # 57 (  RMS_NORM):               norm-1 (   8M) [Metal         ]:            ffn_inp-1 (   8M) [Metal         ]

## SPLIT #27: CPU # 0 inputs: 
node # 58 (       MUL):           ffn_norm-1 (   8M) [  CPU         ]:               norm-1 (   8M) [Metal         ] blk.1.ffn_norm.weigh (  16K) [  CPU         ]

## SPLIT #28: BLAS # 0 inputs: 
node # 59 (   MUL_MAT):           ffn_gate-1 (  21M) [ BLAS         ]: blk.1.ffn_gate.weigh (  45M) [  CPU         ]           ffn_norm-1 (   8M) [  CPU         ]

## SPLIT #29: Metal # 1 inputs: [ffn_gate-1 (  21M)] 
node # 60 (     UNARY):           ffn_silu-1 (  21M) [Metal         ]:   Metal#ffn_gate-1#0 (  21M) [ NULL         ]

## SPLIT #30: BLAS # 0 inputs: 
node # 61 (   MUL_MAT):             ffn_up-1 (  21M) [ BLAS         ]:  blk.1.ffn_up.weight (  45M) [  CPU         ]           ffn_norm-1 (   8M) [  CPU         ]

## SPLIT #31: Metal # 1 inputs: [ffn_up-1 (  21M)] 
node # 62 (       MUL):       ffn_gate_par-1 (  21M) [Metal         ]:           ffn_silu-1 (  21M) [Metal         ]     Metal#ffn_up-1#0 (  21M) [ NULL         ]

## SPLIT #32: BLAS # 0 inputs: 
node # 63 (   MUL_MAT):            ffn_out-1 (   8M) [ BLAS         ]: blk.1.ffn_down.weigh (  45M) [  CPU         ]       ffn_gate_par-1 (  21M) [Metal         ]

## SPLIT #33: Metal # 1 inputs: [ffn_out-1 (   8M)] 
node # 64 (       ADD):              l_out-1 (   8M) [Metal         ]:    Metal#ffn_out-1#0 (   8M) [ NULL         ]            ffn_inp-1 (   8M) [Metal         ]
node # 65 (  RMS_NORM):               norm-2 (   8M) [Metal         ]:              l_out-1 (   8M) [Metal         ]
...

Is this the expectation? It seems like using BLAS together with GPU offloading leads to an improvement in this case, or did I misunderstand this comment?

ggml-blas.cpp: 2 review threads (outdated, resolved)
@slaren (Collaborator, Author) commented Jun 12, 2024

Specifically, this applies to backends that implement the offload_op function to offload large batches even when the model is not completely offloaded, which is currently CUDA, Vulkan and SYCL. For these backends, enabling the BLAS backend will cause it to be used instead of offloading large batches to the GPU by copying the weights to VRAM as needed. Since the Metal backend does not implement this function, it is not affected, and the BLAS backend can be used to enable Accelerate for the layers that are not offloaded.
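For context, this is roughly the kind of check an offload_op callback performs in those backends (a hedged sketch; the actual CUDA/Vulkan/SYCL implementations use their own thresholds and additional conditions):

// Illustrative offload_op: offload an op when the batch dimension is large enough
// that copying the weights to VRAM pays off (the threshold value is an assumption).
#include "ggml.h"
#include "ggml-backend.h"

static bool example_offload_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    (void) backend;
    const int64_t min_batch_size = 32;
    // When the BLAS backend claims these large-batch mat muls first, this path is
    // never taken, which is why mixing BLAS with these GPU backends is discouraged.
    return op->ne[1] >= min_batch_size;
}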

@slaren (Collaborator, Author) commented Jun 12, 2024

Metal should not be used for the operations in between the BLAS operations in layers that are not offloaded, though; I will try to fix that.

@zhouwg (Contributor) commented Jun 12, 2024

Metal should not be used for the operations in between the BLAS operations in layers that are not offloaded, though; I will try to fix that.

Please consider my standalone PR for mixed inference between CPU & GPU / CPU & NPU when a backend's ggml_backend_xx_buffer_is_host returns true.

@slaren (Collaborator, Author) commented Jun 12, 2024

@zhouwg I already considered it and rejected it. Spamming more about it is not going to help your cause.

@zhouwg (comment marked as off-topic)

@ggerganov (Owner)

@zhouwg Please focus on your PR and respect the comments and suggestions that have already been provided. Consider this a final warning before we have to block you.

@zhouwg (Contributor) commented Jun 12, 2024

@zhouwg Please focus on your PR and respect the comments and suggestions that have already been provided. Consider this a final warning before we have to block you.

Thanks for the reminder, I see.

@ggerganov (Owner) commented Jun 12, 2024

In that same example, if I allow the GGML_OP_MUL operation to be offloaded in the Metal backend:

diff --git a/ggml-metal.m b/ggml-metal.m
index 7786acd6..665eae15 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -3178,6 +3178,12 @@ GGML_CALL static bool ggml_backend_metal_supports_buft(ggml_backend_t backend, g
     UNUSED(backend);
 }
 
+GGML_CALL static bool ggml_backend_metal_offload_op(ggml_backend_t backend, const struct ggml_tensor * op) {
+    return (op->op == GGML_OP_MUL);
+
+    GGML_UNUSED(backend);
+}
+
 static struct ggml_backend_i ggml_backend_metal_i = {
     /* .get_name                = */ ggml_backend_metal_name,
     /* .free                    = */ ggml_backend_metal_free,
@@ -3193,7 +3199,7 @@ static struct ggml_backend_i ggml_backend_metal_i = {
     /* .graph_compute           = */ ggml_backend_metal_graph_compute,
     /* .supports_op             = */ ggml_backend_metal_supports_op,
     /* .supports_buft           = */ ggml_backend_metal_supports_buft,
-    /* .offload_op              = */ NULL,
+    /* .offload_op              = */ ggml_backend_metal_offload_op,
     /* .event_new               = */ NULL,
     /* .event_free              = */ NULL,
     /* .event_record            = */ NULL,

I get the following schedule:

## SPLIT #7: Metal # 1 inputs: [ffn_out-0 (   8M)] 
node # 32 (       ADD):              l_out-0 (   8M) [Metal         ]:    Metal#ffn_out-0#0 (   8M) [ NULL         ]            ffn_inp-0 (   8M) [Metal         ]
node # 33 (  RMS_NORM):               norm-1 (   8M) [Metal         ]:              l_out-0 (   8M) [Metal         ]

## SPLIT #8: Metal # 1 inputs: [blk.1.attn_norm.weight (  16K)] 
node # 34 (       MUL):          attn_norm-1 (   8M) [Metal         ]:               norm-1 (   8M) [Metal         ] Metal#blk.1.attn_nor (  16K) [ NULL         ]

## SPLIT #9: CPU # 0 inputs: 
node # 35 (   MUL_MAT):               Qcur-1 (   8M) [  CPU         ]:  blk.1.attn_q.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [Metal         ]
node # 37 (      ROPE):               Qcur-1 (   8M) [  CPU         ]:    Qcur-1 (reshaped) (   8M) [  CPU         ]              inp_pos (   2K) [  CPU         ]
node # 38 (   MUL_MAT):               Kcur-1 (   8M) [  CPU         ]:  blk.1.attn_k.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [Metal         ]
node # 40 (      ROPE):               Kcur-1 (   8M) [  CPU         ]:    Kcur-1 (reshaped) (   8M) [  CPU         ]              inp_pos (   2K) [  CPU         ]
node # 41 (   MUL_MAT):               Vcur-1 (   8M) [  CPU         ]:  blk.1.attn_v.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [Metal         ]
node # 43 (       CPY): k_cache_view-1 (copy (   4M) [  CPU         ]:               Kcur-1 (   8M) [  CPU         ]       k_cache_view-1 (   4M) [  CPU         ]
node # 46 (       CPY): v_cache_view-1 (copy (   4M) [  CPU         ]:  Vcur-1 (transposed) (   8M) [  CPU         ]       v_cache_view-1 (   4M) [  CPU         ]
node # 50 (   MUL_MAT):                 kq-1 (  32M) [  CPU         ]:                  k-1 (   4M) [  CPU         ]                  q-1 (   8M) [  CPU         ]
node # 51 (  SOFT_MAX):    kq_soft_max_ext-1 (  32M) [  CPU         ]:                 kq-1 (  32M) [  CPU         ]              KQ_mask (   1M) [  CPU         ]
node # 52 (   MUL_MAT):                kqv-1 (   8M) [  CPU         ]:                  v-1 (   4M) [  CPU         ]    kq_soft_max_ext-1 (  32M) [  CPU         ]
node # 54 (      CONT):    kqv_merged_cont-1 (   8M) [  CPU         ]:         kqv_merged-1 (   8M) [  CPU         ]
node # 55 (   MUL_MAT):            kqv_out-1 (   8M) [  CPU         ]: blk.1.attn_output.we (  17M) [  CPU         ]    kqv_merged_cont-1 (   8M) [  CPU         ]

## SPLIT #10: Metal # 1 inputs: [kqv_out-1 (   8M)] 
node # 56 (       ADD):            ffn_inp-1 (   8M) [Metal         ]:    Metal#kqv_out-1#0 (   8M) [ NULL         ]              l_out-0 (   8M) [Metal         ]
node # 57 (  RMS_NORM):               norm-1 (   8M) [Metal         ]:            ffn_inp-1 (   8M) [Metal         ]

## SPLIT #11: Metal # 1 inputs: [blk.1.ffn_norm.weight (  16K)] 
node # 58 (       MUL):           ffn_norm-1 (   8M) [Metal         ]:               norm-1 (   8M) [Metal         ] Metal#blk.1.ffn_norm (  16K) [ NULL         ]

## SPLIT #12: CPU # 0 inputs: 
node # 59 (   MUL_MAT):           ffn_gate-1 (  21M) [  CPU         ]: blk.1.ffn_gate.weigh (  45M) [  CPU         ]           ffn_norm-1 (   8M) [Metal         ]

How does the logic decide to also offload nodes #56 (ADD) and #57 (RMS_NORM) in addition to #58 (MUL)?

@slaren (Collaborator, Author) commented Jun 12, 2024

In the first pass, ops with weights are assigned the backend of the weight. offload_op is used at this point to allow overriding this assignment when the batch size is large enough that it may be worth copying the weight to VRAM. Then these initial assignments are expanded to the rest of the ops. In this case, what is likely happening is that #58 is assigned to Metal due to offload_op, and this assignment is then expanded to the adjacent ops. You can enable the GET_CAUSE/SET_CAUSE macros to find out exactly at which step the assignment was made.
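To make the two passes more concrete, here is a simplified, self-contained sketch of the assignment logic described above (illustrative only; the real logic in ggml_backend_sched also tracks buffer types, splits and many other cases):

// Simplified sketch of the two scheduling passes: weights pin ops to a backend
// (unless offload_op overrides), then assignments are expanded to neighbouring ops.
#include <stddef.h>

enum backend_id { BACKEND_NONE = -1, BACKEND_CPU = 0, BACKEND_GPU = 1 };

struct sched_node {
    int has_weight;              // op reads a model weight
    enum backend_id weight_home; // backend whose buffer holds that weight
    int offload;                 // offload_op() said the batch is large enough
    enum backend_id assigned;    // result of the passes
};

static void assign_backends(struct sched_node * nodes, size_t n) {
    if (n == 0) {
        return;
    }
    // Pass 1: ops with weights get the weight's backend, unless offload_op overrides it.
    for (size_t i = 0; i < n; i++) {
        nodes[i].assigned = nodes[i].has_weight
            ? (nodes[i].offload ? BACKEND_GPU : nodes[i].weight_home)
            : BACKEND_NONE;
    }
    // Pass 2: expand the initial assignments to the still-unassigned adjacent ops
    // (this is how the ADD and RMS_NORM next to the offloaded MUL ended up on Metal).
    for (size_t i = 1; i < n; i++) {
        if (nodes[i].assigned == BACKEND_NONE) nodes[i].assigned = nodes[i - 1].assigned;
    }
    for (size_t i = n - 1; i > 0; i--) {
        if (nodes[i - 1].assigned == BACKEND_NONE) nodes[i - 1].assigned = nodes[i].assigned;
    }
}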

This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment.

ggml-ci
Labels: build, examples, ggml, Kompute, refactoring, Review Complexity : High, SYCL, Vulkan

4 participants