
move BLAS to a separate backend #6210

Merged · 15 commits · Jun 13, 2024
Conversation

@slaren (Collaborator) commented Mar 21, 2024

Moves BLAS support from ggml.c to a separate backend, and adds the necessary changes to ggml-backend to support backends that only implement matrix multiplication.

  • Changes to ggml-backend
    • Support for fallback to CPU with ggml_backend_sched
      • Operations not implemented by a backend will be automatically run on the CPU, as long as the operation is reported as not supported by the backend's supports_op function (see the sketch after this list)
    • Moved buffer type function supports_backend to backend function supports_buft
    • Backends that want to declare compatibility with any kind of host buffer can return ggml_backend_buft_is_host from supports_buft
    • ggml_backend_sched will avoid copies between backends when the backend supports the buffer type
      • Eg. when switching from Metal to CPU, no tensors will be copied since the Metal buffers are compatible with the CPU backend (but not the other way around)
  • The GGML_SCHED_DEBUG environment variable can be used to view the graph splits. This is useful to see what operations are being run on each backend
  • Adds the BLAS backend
    • Supports matrix multiplication using a BLAS library. Previously, this was supported as part of the CPU backend
    • Threads no longer spin while BLAS is running, potentially improving performance, and batch processing is no longer limited to 4 threads when using BLAS
    • The number of threads of the BLAS library is configured automatically for OpenBLAS and BLIS (with -t or -tb)
    • For better performance, it is recommended to use OpenMP versions of the BLAS libraries, if available (except macOS)
    • Like before, to enable the BLAS backend, build with the flag LLAMA_BLAS when using cmake, or when using make, LLAMA_OPENBLAS, LLAMA_OPENBLAS64 or LLAMA_BLIS
    • On macOS, this is enabled by default through Accelerate
    • BLAS support has been removed from the CPU backend in ggml.c. Applications that want to support BLAS will need to use the BLAS backend
    • Since this backend only implements matrix multiplication, it should be used with ggml_backend_sched alongside the CPU or other backends to provide support for other operations
    • Note: the BLAS backend should not be used alongside GPU backends, as it will prevent offloading of large batches with partial offloading
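As a rough illustration of the fallback mechanism described in the list above, the two support callbacks of a matrix-multiplication-only backend could look roughly like this (a minimal sketch, not the actual ggml-blas.cpp implementation; the function names are placeholders):

// Minimal sketch of a mat-mul-only backend's support callbacks (illustrative only;
// the real ggml-blas.cpp also checks tensor types and sizes).
#include "ggml.h"
#include "ggml-backend.h"

// Only GGML_OP_MUL_MAT is claimed; every other op is reported as unsupported, so
// ggml_backend_sched automatically falls back to the CPU backend for it.
static bool example_blas_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    (void) backend;
    return op->op == GGML_OP_MUL_MAT;
}

// Declaring compatibility with any host buffer lets the scheduler skip copies when
// mixing this backend with the CPU backend.
static bool example_blas_supports_buft(ggml_backend_t backend, ggml_backend_buffer_type_t buft) {
    (void) backend;
    return ggml_backend_buft_is_host(buft);
}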

@ggerganov (Owner)

this will also have the effect that using BLAS will require using ggml-backend and ggml_backend_sched, is that a problem?

Will just need to adapt whisper.cpp when it's ready

@mofosyne added the Review Complexity : High and refactoring labels May 10, 2024
github-actions bot (Contributor) commented May 11, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 556 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8392.96ms p(95)=20049.24ms fails=, finish reason: stop=505 truncated=51
  • Prompt processing (pp): avg=95.91tk/s p(95)=451.37tk/s
  • Token generation (tg): avg=32.54tk/s p(95)=46.27tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sl/blas-backend commit=ecb75b5f54cab6ca7f77ec51eb5f7d87c87be6cd

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 556 iterations)]

@zhouwg (comment marked as off-topic)

@slaren (Collaborator, Author) commented May 30, 2024

@mofosyne I appreciate that you are trying to help, but please don't do that on my PRs. I very often have not pushed local changes and I prefer to deal with the merge conflicts myself.

github-actions bot added the ggml label May 30, 2024
github-actions bot added the build label Jun 4, 2024
@slaren force-pushed the sl/blas-backend branch 3 times, most recently from 2b5c73d to ca91205 (June 4, 2024 23:16)
github-actions bot added the Vulkan, SYCL and Kompute labels Jun 4, 2024
@slaren (Collaborator, Author) commented Jun 6, 2024

@ggerganov I am thinking about how Accelerate should interact with the BLAS backend. I think this would make sense:

  • GGML_USE_ACCELERATE defined: Accelerate is used in ggml.c
  • GGML_USE_ACCELERATE and GGML_USE_BLAS defined: the BLAS backend is used, with Accelerate as the BLAS library

Conversely:

  • If GGML_USE_BLAS is not defined, Accelerate will not be used for GEMMs

Currently llama.cpp has to check for defined(GGML_USE_BLAS) || defined(GGML_USE_ACCELERATE) to decide when to use the BLAS backend, which doesn't seem very good. In the BLAS backend, Accelerate is treated the same way as any other BLAS library.
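A minimal sketch of what the selection logic in llama.cpp could reduce to under this scheme (illustrative only; ggml_backend_blas_init is assumed to be the constructor exposed by the new backend, and the helper name here is made up):

// Hedged sketch: backend selection under the proposed scheme. Only GGML_USE_BLAS
// matters; on macOS the backend is simply built against Accelerate as its BLAS library.
#include "ggml-backend.h"
#ifdef GGML_USE_BLAS
#include "ggml-blas.h"
#endif

static ggml_backend_t make_blas_backend_if_enabled(void) {
#ifdef GGML_USE_BLAS
    // no defined(GGML_USE_ACCELERATE) special case needed anymore
    return ggml_backend_blas_init();
#else
    return NULL;
#endif
}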

@ggerganov (Owner)

Yes, that makes sense. With only GGML_USE_ACCELERATE we will still be able to use some non-GEMM functionality from the Accelerate framework such as vDSP - it's just very convenient for Apple Silicon devices to have this framework available in the core ggml.c. For GEMMs we would explicitly need to have GGML_USE_BLAS for all kinds of BLAS implementations, including Accelerate's BLAS

@slaren marked this pull request as ready for review (June 6, 2024 23:58)
@slaren (Collaborator, Author) commented Jun 7, 2024

Some notes on performance.

BLIS responds very well to these changes and improves performance in most cases. In master, the number of threads is limited to 4 when using BLAS; this is not the case with the BLAS backend, and using a higher number of threads should improve the performance of dequantization and other operations.

LLAMA_BLIS=1 BLIS_NUM_THREADS=24:

| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/blas-backend | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp32 | 12.75 | 19.76 | 1.55 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp64 | 21.37 | 31.86 | 1.49 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp128 | 33.38 | 44.09 | 1.32 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp256 | 49.46 | 57.94 | 1.17 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp512 | 59.21 | 63.64 | 1.07 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 16 | pp32 | 13.32 | 18.52 | 1.39 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 16 | pp64 | 23.34 | 27.53 | 1.18 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 16 | pp128 | 33.67 | 38.56 | 1.15 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 16 | pp256 | 49.20 | 47.68 | 0.97 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 16 | pp512 | 59.41 | 57.81 | 0.97 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp32 | 24.42 | 36.85 | 1.51 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp64 | 35.50 | 47.30 | 1.33 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp128 | 45.71 | 54.15 | 1.18 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp256 | 62.75 | 69.03 | 1.10 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp512 | 66.93 | 70.51 | 1.05 |
| i9-13900K | llama 7B all F32 | 25.10 | 16 | pp32 | 25.00 | 26.01 | 1.04 |
| i9-13900K | llama 7B all F32 | 25.10 | 16 | pp64 | 35.31 | 35.59 | 1.01 |
| i9-13900K | llama 7B all F32 | 25.10 | 16 | pp128 | 45.62 | 45.87 | 1.01 |
| i9-13900K | llama 7B all F32 | 25.10 | 16 | pp256 | 61.40 | 55.45 | 0.90 |
| i9-13900K | llama 7B all F32 | 25.10 | 16 | pp512 | 66.05 | 63.46 | 0.96 |

OpenBLAS is a headache. Increasing the number of threads above 8 has catastrophic effects on performance in ways that I cannot explain, because these threads are not running while the OpenBLAS GEMM is running. Since there are better alternatives easily available now, I would suggest ignoring this library.

LLAMA_OPENBLAS=1 OPENBLAS_NUM_THREADS=24:

| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/blas-backend | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| i9-13900K | llama 7B Q4_0 | 3.56 | 4 | pp32 | 14.79 | 14.63 | 0.99 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 4 | pp64 | 22.30 | 21.85 | 0.98 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 4 | pp128 | 29.36 | 28.44 | 0.97 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp32 | 15.01 | 15.83 | 1.05 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp64 | 22.22 | 23.77 | 1.07 |
| i9-13900K | llama 7B Q4_0 | 3.56 | 8 | pp128 | 29.38 | 29.97 | 1.02 |
| i9-13900K | llama 7B all F32 | 25.10 | 4 | pp32 | 23.55 | 23.55 | 1.00 |
| i9-13900K | llama 7B all F32 | 25.10 | 4 | pp64 | 30.88 | 30.46 | 0.99 |
| i9-13900K | llama 7B all F32 | 25.10 | 4 | pp128 | 36.20 | 34.92 | 0.96 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp32 | 23.60 | 22.81 | 0.97 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp64 | 30.84 | 31.13 | 1.01 |
| i9-13900K | llama 7B all F32 | 25.10 | 8 | pp128 | 34.42 | 35.92 | 1.04 |

The OpenMP thread pool does not work on macOS, so the overhead of starting threads continuously is expected to be higher. However, with the default number of threads on my system (12), it still seems to result in a speedup. There is a drop in performance specifically with 4 threads and the larger batch sizes that I don't understand; I would expect the overhead to be proportionally smaller as the batch size increases.

| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/blas-backend | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp32 | 33.75 | 35.89 | 1.06 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp64 | 57.97 | 61.30 | 1.06 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp128 | 89.04 | 91.20 | 1.02 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp256 | 121.10 | 111.30 | 0.92 |
| M3 Max | llama 7B Q4_0 | 3.56 | 4 | pp512 | 128.46 | 117.87 | 0.92 |
| M3 Max | llama 7B Q4_0 | 3.56 | 12 | pp32 | 33.56 | 37.09 | 1.11 |
| M3 Max | llama 7B Q4_0 | 3.56 | 12 | pp64 | 57.27 | 63.48 | 1.11 |
| M3 Max | llama 7B Q4_0 | 3.56 | 12 | pp128 | 88.55 | 97.80 | 1.10 |
| M3 Max | llama 7B Q4_0 | 3.56 | 12 | pp256 | 121.35 | 127.76 | 1.05 |
| M3 Max | llama 7B Q4_0 | 3.56 | 12 | pp512 | 128.42 | 136.79 | 1.07 |
| M3 Max | llama 7B all F32 | 25.10 | 4 | pp32 | 51.04 | 49.88 | 0.98 |
| M3 Max | llama 7B all F32 | 25.10 | 4 | pp64 | 79.66 | 77.13 | 0.97 |
| M3 Max | llama 7B all F32 | 25.10 | 4 | pp128 | 103.58 | 107.29 | 1.04 |
| M3 Max | llama 7B all F32 | 25.10 | 4 | pp256 | 141.23 | 123.31 | 0.87 |
| M3 Max | llama 7B all F32 | 25.10 | 4 | pp512 | 138.78 | 122.95 | 0.89 |
| M3 Max | llama 7B all F32 | 25.10 | 12 | pp32 | 50.99 | 48.56 | 0.95 |
| M3 Max | llama 7B all F32 | 25.10 | 12 | pp64 | 79.38 | 77.12 | 0.97 |
| M3 Max | llama 7B all F32 | 25.10 | 12 | pp128 | 112.81 | 112.74 | 1.00 |
| M3 Max | llama 7B all F32 | 25.10 | 12 | pp256 | 141.26 | 140.00 | 0.99 |
| M3 Max | llama 7B all F32 | 25.10 | 12 | pp512 | 138.73 | 140.82 | 1.02 |

@ggerganov (Owner)

There is a drop in performance specifically with 4 threads and the larger batch sizes that I don't understand

On M2 Ultra there is a similar effect with the Q4_0 and F32 models, while for Q8_0 and F16 it behaves as you expect:

LLAMA_NO_LLAMAFILE=1 LLAMA_NO_METAL=1 ./scripts/compare-commits.sh master sl/blas-backend -m models/tinyllama-1b/ggml-model-q4_0.gguf -m models/tinyllama-1b/ggml-model-q8_0.gguf -m models/tinyllama-1b/ggml-model-f16.gguf -m models/tinyllama-1b/ggml-model-f32.gguf -p 32,64,128,256,512 -n 0 -t 4,8,16
| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/blas-backend | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | 1B F16 | 2.05 | 4 | pp32 | 265.26 | 214.52 | 0.81 |
| M2 Ultra | 1B F16 | 2.05 | 4 | pp64 | 452.51 | 361.71 | 0.80 |
| M2 Ultra | 1B F16 | 2.05 | 4 | pp128 | 640.73 | 537.69 | 0.84 |
| M2 Ultra | 1B F16 | 2.05 | 4 | pp256 | 664.02 | 588.17 | 0.89 |
| M2 Ultra | 1B F16 | 2.05 | 4 | pp512 | 624.32 | 570.26 | 0.91 |
| M2 Ultra | 1B F16 | 2.05 | 8 | pp32 | 266.72 | 263.57 | 0.99 |
| M2 Ultra | 1B F16 | 2.05 | 8 | pp64 | 464.59 | 441.93 | 0.95 |
| M2 Ultra | 1B F16 | 2.05 | 8 | pp128 | 648.08 | 660.74 | 1.02 |
| M2 Ultra | 1B F16 | 2.05 | 8 | pp256 | 659.51 | 705.26 | 1.07 |
| M2 Ultra | 1B F16 | 2.05 | 8 | pp512 | 622.22 | 745.74 | 1.20 |
| M2 Ultra | 1B F16 | 2.05 | 16 | pp32 | 271.83 | 224.23 | 0.82 |
| M2 Ultra | 1B F16 | 2.05 | 16 | pp64 | 456.11 | 394.74 | 0.87 |
| M2 Ultra | 1B F16 | 2.05 | 16 | pp128 | 633.06 | 610.69 | 0.96 |
| M2 Ultra | 1B F16 | 2.05 | 16 | pp256 | 656.49 | 699.64 | 1.07 |
| M2 Ultra | 1B F16 | 2.05 | 16 | pp512 | 625.12 | 816.75 | 1.31 |
| M2 Ultra | 1B Q4_0 | 0.59 | 4 | pp32 | 227.12 | 224.63 | 0.99 |
| M2 Ultra | 1B Q4_0 | 0.59 | 4 | pp64 | 373.63 | 371.20 | 0.99 |
| M2 Ultra | 1B Q4_0 | 0.59 | 4 | pp128 | 622.08 | 571.99 | 0.92 |
| M2 Ultra | 1B Q4_0 | 0.59 | 4 | pp256 | 658.47 | 612.67 | 0.93 |
| M2 Ultra | 1B Q4_0 | 0.59 | 4 | pp512 | 634.58 | 587.23 | 0.93 |
| M2 Ultra | 1B Q4_0 | 0.59 | 8 | pp32 | 253.38 | 269.27 | 1.06 |
| M2 Ultra | 1B Q4_0 | 0.59 | 8 | pp64 | 410.88 | 436.95 | 1.06 |
| M2 Ultra | 1B Q4_0 | 0.59 | 8 | pp128 | 636.57 | 673.07 | 1.06 |
| M2 Ultra | 1B Q4_0 | 0.59 | 8 | pp256 | 661.17 | 712.17 | 1.08 |
| M2 Ultra | 1B Q4_0 | 0.59 | 8 | pp512 | 631.02 | 745.85 | 1.18 |
| M2 Ultra | 1B Q4_0 | 0.59 | 16 | pp32 | 246.48 | 213.06 | 0.86 |
| M2 Ultra | 1B Q4_0 | 0.59 | 16 | pp64 | 396.73 | 375.92 | 0.95 |
| M2 Ultra | 1B Q4_0 | 0.59 | 16 | pp128 | 632.46 | 600.46 | 0.95 |
| M2 Ultra | 1B Q4_0 | 0.59 | 16 | pp256 | 658.53 | 703.00 | 1.07 |
| M2 Ultra | 1B Q4_0 | 0.59 | 16 | pp512 | 631.18 | 816.15 | 1.29 |
| M2 Ultra | 1B Q8_0 | 1.09 | 4 | pp32 | 248.64 | 221.43 | 0.89 |
| M2 Ultra | 1B Q8_0 | 1.09 | 4 | pp64 | 430.98 | 378.20 | 0.88 |
| M2 Ultra | 1B Q8_0 | 1.09 | 4 | pp128 | 628.65 | 565.11 | 0.90 |
| M2 Ultra | 1B Q8_0 | 1.09 | 4 | pp256 | 650.70 | 606.51 | 0.93 |
| M2 Ultra | 1B Q8_0 | 1.09 | 4 | pp512 | 629.00 | 584.55 | 0.93 |
| M2 Ultra | 1B Q8_0 | 1.09 | 8 | pp32 | 253.84 | 253.78 | 1.00 |
| M2 Ultra | 1B Q8_0 | 1.09 | 8 | pp64 | 434.08 | 431.23 | 0.99 |
| M2 Ultra | 1B Q8_0 | 1.09 | 8 | pp128 | 628.08 | 663.30 | 1.06 |
| M2 Ultra | 1B Q8_0 | 1.09 | 8 | pp256 | 645.79 | 710.54 | 1.10 |
| M2 Ultra | 1B Q8_0 | 1.09 | 8 | pp512 | 629.27 | 753.74 | 1.20 |
| M2 Ultra | 1B Q8_0 | 1.09 | 16 | pp32 | 252.36 | 214.02 | 0.85 |
| M2 Ultra | 1B Q8_0 | 1.09 | 16 | pp64 | 431.50 | 372.52 | 0.86 |
| M2 Ultra | 1B Q8_0 | 1.09 | 16 | pp128 | 635.57 | 589.64 | 0.93 |
| M2 Ultra | 1B Q8_0 | 1.09 | 16 | pp256 | 644.00 | 707.12 | 1.10 |
| M2 Ultra | 1B Q8_0 | 1.09 | 16 | pp512 | 634.01 | 813.76 | 1.28 |
| M2 Ultra | 1B all F32 | 4.10 | 4 | pp32 | 324.34 | 318.84 | 0.98 |
| M2 Ultra | 1B all F32 | 4.10 | 4 | pp64 | 497.14 | 501.49 | 1.01 |
| M2 Ultra | 1B all F32 | 4.10 | 4 | pp128 | 780.47 | 687.67 | 0.88 |
| M2 Ultra | 1B all F32 | 4.10 | 4 | pp256 | 719.40 | 676.08 | 0.94 |
| M2 Ultra | 1B all F32 | 4.10 | 4 | pp512 | 668.30 | 614.47 | 0.92 |
| M2 Ultra | 1B all F32 | 4.10 | 8 | pp32 | 394.99 | 325.57 | 0.82 |
| M2 Ultra | 1B all F32 | 4.10 | 8 | pp64 | 588.41 | 522.05 | 0.89 |
| M2 Ultra | 1B all F32 | 4.10 | 8 | pp128 | 769.56 | 727.87 | 0.95 |
| M2 Ultra | 1B all F32 | 4.10 | 8 | pp256 | 715.89 | 742.46 | 1.04 |
| M2 Ultra | 1B all F32 | 4.10 | 8 | pp512 | 672.34 | 772.63 | 1.15 |
| M2 Ultra | 1B all F32 | 4.10 | 16 | pp32 | 397.21 | 246.50 | 0.62 |
| M2 Ultra | 1B all F32 | 4.10 | 16 | pp64 | 589.79 | 421.42 | 0.71 |
| M2 Ultra | 1B all F32 | 4.10 | 16 | pp128 | 770.60 | 655.37 | 0.85 |
| M2 Ultra | 1B all F32 | 4.10 | 16 | pp256 | 712.96 | 724.44 | 1.02 |
| M2 Ultra | 1B all F32 | 4.10 | 16 | pp512 | 669.12 | 837.94 | 1.25 |

@slaren (Collaborator, Author) commented Jun 8, 2024

I realized that there is an issue that causes the kqv matrix multiplication to not be offloaded to the BLAS backend, and this is what causes the performance difference with a low number of threads. If this matrix multiplication is removed by enabling flash attention, then the performance difference disappears.

| CPU | Model | Model Size [GiB] | Threads | Test | t/s master | t/s sl/blas-backend | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M3 Max | 7B Q4_0 | 3.56 | 4 | pp32 | 32.77 | 36.23 | 1.11 |
| M3 Max | 7B Q4_0 | 3.56 | 4 | pp64 | 58.37 | 62.11 | 1.06 |
| M3 Max | 7B Q4_0 | 3.56 | 4 | pp128 | 92.00 | 95.05 | 1.03 |
| M3 Max | 7B Q4_0 | 3.56 | 4 | pp256 | 119.94 | 121.50 | 1.01 |
| M3 Max | 7B Q4_0 | 3.56 | 4 | pp512 | 127.86 | 126.98 | 0.99 |
| M3 Max | 7B Q4_0 | 3.56 | 12 | pp32 | 33.67 | 38.43 | 1.14 |
| M3 Max | 7B Q4_0 | 3.56 | 12 | pp64 | 57.92 | 64.87 | 1.12 |
| M3 Max | 7B Q4_0 | 3.56 | 12 | pp128 | 90.62 | 99.64 | 1.10 |
| M3 Max | 7B Q4_0 | 3.56 | 12 | pp256 | 120.33 | 129.79 | 1.08 |
| M3 Max | 7B Q4_0 | 3.56 | 12 | pp512 | 126.31 | 141.03 | 1.12 |
| M3 Max | 7B F32 | 25.10 | 4 | pp32 | 51.18 | 50.64 | 0.99 |
| M3 Max | 7B F32 | 25.10 | 4 | pp64 | 79.80 | 79.86 | 1.00 |
| M3 Max | 7B F32 | 25.10 | 4 | pp128 | 112.13 | 111.49 | 0.99 |
| M3 Max | 7B F32 | 25.10 | 4 | pp256 | 138.19 | 133.77 | 0.97 |
| M3 Max | 7B F32 | 25.10 | 4 | pp512 | 133.84 | 131.51 | 0.98 |
| M3 Max | 7B F32 | 25.10 | 12 | pp32 | 50.47 | 48.97 | 0.97 |
| M3 Max | 7B F32 | 25.10 | 12 | pp64 | 79.65 | 78.19 | 0.98 |
| M3 Max | 7B F32 | 25.10 | 12 | pp128 | 112.97 | 115.07 | 1.02 |
| M3 Max | 7B F32 | 25.10 | 12 | pp256 | 137.53 | 143.82 | 1.05 |
| M3 Max | 7B F32 | 25.10 | 12 | pp512 | 134.25 | 149.98 | 1.12 |

@slaren (Collaborator, Author) commented Jun 11, 2024

This should be good now. I have updated the PR description with more details about the changes included here.

@ggerganov self-requested a review (June 12, 2024 07:17)
@ggerganov (Owner) left a comment

Note: the BLAS backend should not be used alongside GPU backends, as it will prevent offloading of large batches with partial offloading

On macOS with Metal enabled, when I build with LLAMA_BLAS=OFF and run with partial offloading (-ngl 28), the non-offloaded layers are running on the CPU backend:

...
node # 32 (       ADD):              l_out-0 (   8M) [  CPU         ]:            ffn_out-0 (   8M) [  CPU         ]            ffn_inp-0 (   8M) [  CPU         ]
node # 33 (  RMS_NORM):               norm-1 (   8M) [  CPU         ]:              l_out-0 (   8M) [  CPU         ]
node # 34 (       MUL):          attn_norm-1 (   8M) [  CPU         ]:               norm-1 (   8M) [  CPU         ] blk.1.attn_norm.weig (  16K) [  CPU         ]
node # 35 (   MUL_MAT):               Qcur-1 (   8M) [  CPU         ]:  blk.1.attn_q.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]
node # 37 (      ROPE):               Qcur-1 (   8M) [  CPU         ]:    Qcur-1 (reshaped) (   8M) [  CPU         ]              inp_pos (   2K) [  CPU         ]
node # 38 (   MUL_MAT):               Kcur-1 (   8M) [  CPU         ]:  blk.1.attn_k.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]
node # 40 (      ROPE):               Kcur-1 (   8M) [  CPU         ]:    Kcur-1 (reshaped) (   8M) [  CPU         ]              inp_pos (   2K) [  CPU         ]
node # 41 (   MUL_MAT):               Vcur-1 (   8M) [  CPU         ]:  blk.1.attn_v.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]
node # 43 (       CPY): k_cache_view-1 (copy (   4M) [  CPU         ]:               Kcur-1 (   8M) [  CPU         ]       k_cache_view-1 (   4M) [  CPU         ]
node # 46 (       CPY): v_cache_view-1 (copy (   4M) [  CPU         ]:  Vcur-1 (transposed) (   8M) [  CPU         ]       v_cache_view-1 (   4M) [  CPU         ]
node # 50 (   MUL_MAT):                 kq-1 (  32M) [  CPU         ]:                  k-1 (   4M) [  CPU         ]                  q-1 (   8M) [  CPU         ]
node # 51 (  SOFT_MAX):    kq_soft_max_ext-1 (  32M) [  CPU         ]:                 kq-1 (  32M) [  CPU         ]              KQ_mask (   1M) [  CPU         ]
node # 52 (   MUL_MAT):                kqv-1 (   8M) [  CPU         ]:                  v-1 (   4M) [  CPU         ]    kq_soft_max_ext-1 (  32M) [  CPU         ]
node # 54 (      CONT):    kqv_merged_cont-1 (   8M) [  CPU         ]:         kqv_merged-1 (   8M) [  CPU         ]
node # 55 (   MUL_MAT):            kqv_out-1 (   8M) [  CPU         ]: blk.1.attn_output.we (  17M) [  CPU         ]    kqv_merged_cont-1 (   8M) [  CPU         ]
node # 56 (       ADD):            ffn_inp-1 (   8M) [  CPU         ]:            kqv_out-1 (   8M) [  CPU         ]              l_out-0 (   8M) [  CPU         ]
node # 57 (  RMS_NORM):               norm-1 (   8M) [  CPU         ]:            ffn_inp-1 (   8M) [  CPU         ]
node # 58 (       MUL):           ffn_norm-1 (   8M) [  CPU         ]:               norm-1 (   8M) [  CPU         ] blk.1.ffn_norm.weigh (  16K) [  CPU         ]
node # 59 (   MUL_MAT):           ffn_gate-1 (  21M) [  CPU         ]: blk.1.ffn_gate.weigh (  45M) [  CPU         ]           ffn_norm-1 (   8M) [  CPU         ]
node # 60 (     UNARY):           ffn_silu-1 (  21M) [  CPU         ]:           ffn_gate-1 (  21M) [  CPU         ]
node # 61 (   MUL_MAT):             ffn_up-1 (  21M) [  CPU         ]:  blk.1.ffn_up.weight (  45M) [  CPU         ]           ffn_norm-1 (   8M) [  CPU         ]
node # 62 (       MUL):       ffn_gate_par-1 (  21M) [  CPU         ]:           ffn_silu-1 (  21M) [  CPU         ]             ffn_up-1 (  21M) [  CPU         ]
node # 63 (   MUL_MAT):            ffn_out-1 (   8M) [  CPU         ]: blk.1.ffn_down.weigh (  45M) [  CPU         ]       ffn_gate_par-1 (  21M) [  CPU         ]
node # 64 (       ADD):              l_out-1 (   8M) [  CPU         ]:            ffn_out-1 (   8M) [  CPU         ]            ffn_inp-1 (   8M) [  CPU         ]
node # 65 (  RMS_NORM):               norm-2 (   8M) [  CPU         ]:              l_out-1 (   8M) [  CPU         ]
...

With LLAMA_BLAS=ON it uses the BLAS backend for the matrix multiplications:

...
## SPLIT #16: Metal # 1 inputs: [ffn_out-0 (   8M)] 
node # 32 (       ADD):              l_out-0 (   8M) [Metal         ]:    Metal#ffn_out-0#0 (   8M) [ NULL         ]            ffn_inp-0 (   8M) [Metal         ]
node # 33 (  RMS_NORM):               norm-1 (   8M) [Metal         ]:              l_out-0 (   8M) [Metal         ]

## SPLIT #17: CPU # 0 inputs: 
node # 34 (       MUL):          attn_norm-1 (   8M) [  CPU         ]:               norm-1 (   8M) [Metal         ] blk.1.attn_norm.weig (  16K) [  CPU         ]

## SPLIT #18: BLAS # 0 inputs: 
node # 35 (   MUL_MAT):               Qcur-1 (   8M) [ BLAS         ]:  blk.1.attn_q.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]

## SPLIT #19: Metal # 1 inputs: [Qcur-1 (reshaped) (   8M)] 
node # 37 (      ROPE):               Qcur-1 (   8M) [Metal         ]: Metal#Qcur-1 (reshap (   8M) [ NULL         ]      Metal#inp_pos#0 (   2K) [ NULL         ]

## SPLIT #20: BLAS # 0 inputs: 
node # 38 (   MUL_MAT):               Kcur-1 (   8M) [ BLAS         ]:  blk.1.attn_k.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]

## SPLIT #21: Metal # 1 inputs: [Kcur-1 (reshaped) (   8M)] 
node # 40 (      ROPE):               Kcur-1 (   8M) [Metal         ]: Metal#Kcur-1 (reshap (   8M) [ NULL         ]      Metal#inp_pos#0 (   2K) [ NULL         ]

## SPLIT #22: BLAS # 0 inputs: 
node # 41 (   MUL_MAT):               Vcur-1 (   8M) [ BLAS         ]:  blk.1.attn_v.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [  CPU         ]

## SPLIT #23: CPU # 0 inputs: 
node # 43 (       CPY): k_cache_view-1 (copy (   4M) [  CPU         ]:               Kcur-1 (   8M) [Metal         ]       k_cache_view-1 (   4M) [  CPU         ]
node # 46 (       CPY): v_cache_view-1 (copy (   4M) [  CPU         ]:  Vcur-1 (transposed) (   8M) [ BLAS         ]       v_cache_view-1 (   4M) [  CPU         ]

## SPLIT #24: Metal # 2 inputs: [k-1 (   4M)] [v-1 (   4M)] 
node # 50 (   MUL_MAT):                 kq-1 (  32M) [Metal         ]:          Metal#k-1#0 (   4M) [ NULL         ]                  q-1 (   8M) [Metal         ]
node # 51 (  SOFT_MAX):    kq_soft_max_ext-1 (  32M) [Metal         ]:                 kq-1 (  32M) [Metal         ]      Metal#KQ_mask#0 (   1M) [ NULL         ]
node # 52 (   MUL_MAT):                kqv-1 (   8M) [Metal         ]:          Metal#v-1#0 (   4M) [ NULL         ]    kq_soft_max_ext-1 (  32M) [Metal         ]
node # 54 (      CONT):    kqv_merged_cont-1 (   8M) [Metal         ]:         kqv_merged-1 (   8M) [Metal         ]

## SPLIT #25: BLAS # 0 inputs: 
node # 55 (   MUL_MAT):            kqv_out-1 (   8M) [ BLAS         ]: blk.1.attn_output.we (  17M) [  CPU         ]    kqv_merged_cont-1 (   8M) [Metal         ]

## SPLIT #26: Metal # 1 inputs: [kqv_out-1 (   8M)] 
node # 56 (       ADD):            ffn_inp-1 (   8M) [Metal         ]:    Metal#kqv_out-1#0 (   8M) [ NULL         ]              l_out-0 (   8M) [Metal         ]
node # 57 (  RMS_NORM):               norm-1 (   8M) [Metal         ]:            ffn_inp-1 (   8M) [Metal         ]

## SPLIT #27: CPU # 0 inputs: 
node # 58 (       MUL):           ffn_norm-1 (   8M) [  CPU         ]:               norm-1 (   8M) [Metal         ] blk.1.ffn_norm.weigh (  16K) [  CPU         ]

## SPLIT #28: BLAS # 0 inputs: 
node # 59 (   MUL_MAT):           ffn_gate-1 (  21M) [ BLAS         ]: blk.1.ffn_gate.weigh (  45M) [  CPU         ]           ffn_norm-1 (   8M) [  CPU         ]

## SPLIT #29: Metal # 1 inputs: [ffn_gate-1 (  21M)] 
node # 60 (     UNARY):           ffn_silu-1 (  21M) [Metal         ]:   Metal#ffn_gate-1#0 (  21M) [ NULL         ]

## SPLIT #30: BLAS # 0 inputs: 
node # 61 (   MUL_MAT):             ffn_up-1 (  21M) [ BLAS         ]:  blk.1.ffn_up.weight (  45M) [  CPU         ]           ffn_norm-1 (   8M) [  CPU         ]

## SPLIT #31: Metal # 1 inputs: [ffn_up-1 (  21M)] 
node # 62 (       MUL):       ffn_gate_par-1 (  21M) [Metal         ]:           ffn_silu-1 (  21M) [Metal         ]     Metal#ffn_up-1#0 (  21M) [ NULL         ]

## SPLIT #32: BLAS # 0 inputs: 
node # 63 (   MUL_MAT):            ffn_out-1 (   8M) [ BLAS         ]: blk.1.ffn_down.weigh (  45M) [  CPU         ]       ffn_gate_par-1 (  21M) [Metal         ]

## SPLIT #33: Metal # 1 inputs: [ffn_out-1 (   8M)] 
node # 64 (       ADD):              l_out-1 (   8M) [Metal         ]:    Metal#ffn_out-1#0 (   8M) [ NULL         ]            ffn_inp-1 (   8M) [Metal         ]
node # 65 (  RMS_NORM):               norm-2 (   8M) [Metal         ]:              l_out-1 (   8M) [Metal         ]
...

Is this the expectation? It seems like using BLAS together with GPU offloading leads to an improvement in this case, or did I misunderstand this comment?

ggml-blas.cpp: 2 review threads (outdated, resolved)
@slaren (Collaborator, Author) commented Jun 12, 2024

Specifically, this applies to backends that implement the offload_op function to offload large batches even when the model is not completely offloaded, which is currently CUDA, Vulkan and SYCL. For these backends, enabling the BLAS backend will cause it to be used instead of offloading large batches to the GPU by copying the weights to VRAM as needed. Since the Metal backend does not implement this function, it is not affected, and the BLAS backend can be used to enable Accelerate for the layers that are not offloaded.
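For context, this is roughly the kind of check an offload_op callback performs in those backends (a hedged sketch; the actual CUDA/Vulkan/SYCL implementations use their own thresholds and additional conditions):

// Illustrative offload_op: offload an op when the batch dimension is large enough
// that copying the weights to VRAM pays off (the threshold value is an assumption).
#include "ggml.h"
#include "ggml-backend.h"

static bool example_offload_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    (void) backend;
    const int64_t min_batch_size = 32;
    // When the BLAS backend claims these large-batch mat muls first, this path is
    // never taken, which is why mixing BLAS with these GPU backends is discouraged.
    return op->ne[1] >= min_batch_size;
}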

@slaren (Collaborator, Author) commented Jun 12, 2024

Metal should not be used for the operations in between the BLAS operations in layers that are not offloaded, though; I will try to fix that.

@zhouwg (Contributor) commented Jun 12, 2024

Metal should not be used for the operations in between the BLAS operations in layers that are not offloaded, though; I will try to fix that.

Please consider my standalone PR for mixed inference between CPU & GPU / CPU & NPU when a backend's ggml_backend_xx_buffer_is_host returns true.

@slaren (Collaborator, Author) commented Jun 12, 2024

@zhouwg I already considered it and rejected it. Spamming more about it is not going to help your cause.

@zhouwg (comment marked as off-topic)

@ggerganov (Owner)

@zhouwg Please focus on your PR and respect the comments and suggestions that have already been provided. Consider this a final warning before we have to block you.

@zhouwg (Contributor) commented Jun 12, 2024

@zhouwg Please focus on your PR and respect the comments and suggestions that have already been provided. Consider this a final warning before we have to block you.

Thanks for the reminder, I see.

@ggerganov (Owner) commented Jun 12, 2024

In that same example, if I allow the GGML_OP_MUL operation to be offloaded in the Metal backend:

diff --git a/ggml-metal.m b/ggml-metal.m
index 7786acd6..665eae15 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -3178,6 +3178,12 @@ GGML_CALL static bool ggml_backend_metal_supports_buft(ggml_backend_t backend, g
     UNUSED(backend);
 }
 
+GGML_CALL static bool ggml_backend_metal_offload_op(ggml_backend_t backend, const struct ggml_tensor * op) {
+    return (op->op == GGML_OP_MUL);
+
+    GGML_UNUSED(backend);
+}
+
 static struct ggml_backend_i ggml_backend_metal_i = {
     /* .get_name                = */ ggml_backend_metal_name,
     /* .free                    = */ ggml_backend_metal_free,
@@ -3193,7 +3199,7 @@ static struct ggml_backend_i ggml_backend_metal_i = {
     /* .graph_compute           = */ ggml_backend_metal_graph_compute,
     /* .supports_op             = */ ggml_backend_metal_supports_op,
     /* .supports_buft           = */ ggml_backend_metal_supports_buft,
-    /* .offload_op              = */ NULL,
+    /* .offload_op              = */ ggml_backend_metal_offload_op,
     /* .event_new               = */ NULL,
     /* .event_free              = */ NULL,
     /* .event_record            = */ NULL,

I get the following schedule:

## SPLIT #7: Metal # 1 inputs: [ffn_out-0 (   8M)] 
node # 32 (       ADD):              l_out-0 (   8M) [Metal         ]:    Metal#ffn_out-0#0 (   8M) [ NULL         ]            ffn_inp-0 (   8M) [Metal         ]
node # 33 (  RMS_NORM):               norm-1 (   8M) [Metal         ]:              l_out-0 (   8M) [Metal         ]

## SPLIT #8: Metal # 1 inputs: [blk.1.attn_norm.weight (  16K)] 
node # 34 (       MUL):          attn_norm-1 (   8M) [Metal         ]:               norm-1 (   8M) [Metal         ] Metal#blk.1.attn_nor (  16K) [ NULL         ]

## SPLIT #9: CPU # 0 inputs: 
node # 35 (   MUL_MAT):               Qcur-1 (   8M) [  CPU         ]:  blk.1.attn_q.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [Metal         ]
node # 37 (      ROPE):               Qcur-1 (   8M) [  CPU         ]:    Qcur-1 (reshaped) (   8M) [  CPU         ]              inp_pos (   2K) [  CPU         ]
node # 38 (   MUL_MAT):               Kcur-1 (   8M) [  CPU         ]:  blk.1.attn_k.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [Metal         ]
node # 40 (      ROPE):               Kcur-1 (   8M) [  CPU         ]:    Kcur-1 (reshaped) (   8M) [  CPU         ]              inp_pos (   2K) [  CPU         ]
node # 41 (   MUL_MAT):               Vcur-1 (   8M) [  CPU         ]:  blk.1.attn_v.weight (  17M) [  CPU         ]          attn_norm-1 (   8M) [Metal         ]
node # 43 (       CPY): k_cache_view-1 (copy (   4M) [  CPU         ]:               Kcur-1 (   8M) [  CPU         ]       k_cache_view-1 (   4M) [  CPU         ]
node # 46 (       CPY): v_cache_view-1 (copy (   4M) [  CPU         ]:  Vcur-1 (transposed) (   8M) [  CPU         ]       v_cache_view-1 (   4M) [  CPU         ]
node # 50 (   MUL_MAT):                 kq-1 (  32M) [  CPU         ]:                  k-1 (   4M) [  CPU         ]                  q-1 (   8M) [  CPU         ]
node # 51 (  SOFT_MAX):    kq_soft_max_ext-1 (  32M) [  CPU         ]:                 kq-1 (  32M) [  CPU         ]              KQ_mask (   1M) [  CPU         ]
node # 52 (   MUL_MAT):                kqv-1 (   8M) [  CPU         ]:                  v-1 (   4M) [  CPU         ]    kq_soft_max_ext-1 (  32M) [  CPU         ]
node # 54 (      CONT):    kqv_merged_cont-1 (   8M) [  CPU         ]:         kqv_merged-1 (   8M) [  CPU         ]
node # 55 (   MUL_MAT):            kqv_out-1 (   8M) [  CPU         ]: blk.1.attn_output.we (  17M) [  CPU         ]    kqv_merged_cont-1 (   8M) [  CPU         ]

## SPLIT #10: Metal # 1 inputs: [kqv_out-1 (   8M)] 
node # 56 (       ADD):            ffn_inp-1 (   8M) [Metal         ]:    Metal#kqv_out-1#0 (   8M) [ NULL         ]              l_out-0 (   8M) [Metal         ]
node # 57 (  RMS_NORM):               norm-1 (   8M) [Metal         ]:            ffn_inp-1 (   8M) [Metal         ]

## SPLIT #11: Metal # 1 inputs: [blk.1.ffn_norm.weight (  16K)] 
node # 58 (       MUL):           ffn_norm-1 (   8M) [Metal         ]:               norm-1 (   8M) [Metal         ] Metal#blk.1.ffn_norm (  16K) [ NULL         ]

## SPLIT #12: CPU # 0 inputs: 
node # 59 (   MUL_MAT):           ffn_gate-1 (  21M) [  CPU         ]: blk.1.ffn_gate.weigh (  45M) [  CPU         ]           ffn_norm-1 (   8M) [Metal         ]

How does the logic decide to also offload nodes #56 (ADD) and #57 (RMS_NORM) in addition to #58 (MUL)?

@slaren (Collaborator, Author) commented Jun 12, 2024

In the first pass, ops with weights are assigned the backend of the weight. offload_op is used at this point to allow overriding this assignment when the batch size is large enough that it may be worth copying the weight to VRAM. Then these initial assignments are expanded to the rest of the ops. In this case, what is likely happening is that #58 is assigned to Metal due to offload_op, and this assignment is then expanded to the adjacent ops. You can enable the GET_CAUSE/SET_CAUSE macros to find out exactly at which step the assignment was made.
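To make the two passes more concrete, here is a simplified, self-contained sketch of the assignment logic described above (illustrative only; the real logic in ggml_backend_sched also tracks buffer types, splits and many other cases):

// Simplified sketch of the two scheduling passes: weights pin ops to a backend
// (unless offload_op overrides), then assignments are expanded to neighbouring ops.
#include <stddef.h>

enum backend_id { BACKEND_NONE = -1, BACKEND_CPU = 0, BACKEND_GPU = 1 };

struct sched_node {
    int has_weight;              // op reads a model weight
    enum backend_id weight_home; // backend whose buffer holds that weight
    int offload;                 // offload_op() said the batch is large enough
    enum backend_id assigned;    // result of the passes
};

static void assign_backends(struct sched_node * nodes, size_t n) {
    if (n == 0) {
        return;
    }
    // Pass 1: ops with weights get the weight's backend, unless offload_op overrides it.
    for (size_t i = 0; i < n; i++) {
        nodes[i].assigned = nodes[i].has_weight
            ? (nodes[i].offload ? BACKEND_GPU : nodes[i].weight_home)
            : BACKEND_NONE;
    }
    // Pass 2: expand the initial assignments to the still-unassigned adjacent ops
    // (this is how the ADD and RMS_NORM next to the offloaded MUL ended up on Metal).
    for (size_t i = 1; i < n; i++) {
        if (nodes[i].assigned == BACKEND_NONE) nodes[i].assigned = nodes[i - 1].assigned;
    }
    for (size_t i = n - 1; i > 0; i--) {
        if (nodes[i - 1].assigned == BACKEND_NONE) nodes[i - 1].assigned = nodes[i].assigned;
    }
}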

This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment.

ggml-ci
Labels: build, examples, ggml, Kompute, refactoring, Review Complexity : High, SYCL, Vulkan

4 participants