sync : llama.cpp #856

Merged
merged 43 commits into from
Jun 15, 2024
Merged
Changes from 1 commit
Commits
Show all changes
43 commits
Select commit Hold shift + click to select a range
1230387 ggml : use atomic_flag for critical section (llama/7598) (slaren, May 29, 2024; a sketch of this pattern follows the list)
8fcca7c llama-bench : add support for the RPC backend (llama/7435) (rgerganov, May 29, 2024)
1b7ff70 cuda : non-cont concat support (llama/7610) (ggerganov, May 29, 2024)
f2703f7 ggml : fix YARN + add tests + add asserts (llama/7617) (ggerganov, May 29, 2024)
79751ef metal : add missing asserts (llama/7617) (ggerganov, May 29, 2024)
0269773 metal : remove invalid asserts (llama/7617) (ggerganov, May 29, 2024)
ef948cb ggml : fix loongarch build (O2 issue) (llama/7636) (junchao-loongson, May 30, 2024)
a539f93 faster avx512 exp implementation (llama/7551) (chriselrod, May 30, 2024)
80d21d4 ggml : fix loongson compile warnings (llama/7537) (ggerganov, May 31, 2024)
5e6eeed CUDA: quantized KV support for FA vec (llama/7527) (JohannesGaessler, Jun 1, 2024)
2a2e184 CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681) (JohannesGaessler, Jun 1, 2024)
030c282 Fix FlashAttention debug test, FP32 assert (llama/7684) (JohannesGaessler, Jun 1, 2024)
6ec5cfc fix bug introduced in using calloc (llama/7701) (airlied, Jun 2, 2024)
9f8d074 kompute : implement op_getrows_f32 (llama/6403) (woachk, Jun 3, 2024)
225883c Vulkan Mixture of Experts (MoE) support (llama/7628) (0cc4m, Jun 3, 2024)
597c758 ggml : use OpenMP as a thread pool (llama/7606) (msy-kato, Jun 3, 2024)
cac02b4 llama : offload to RPC in addition to other backends (llama/7640) (rgerganov, Jun 3, 2024)
a96df45 ggml : prevent builds with -ffinite-math-only (llama/7726) (ggerganov, Jun 4, 2024)
ad2ed7f ggml : remove OpenCL (llama/7735) (ggerganov, Jun 4, 2024)
6eb6783 Allow number of nodes in CUDA graph to change (llama/7738) (agray3, Jun 4, 2024)
c943c8e ggml : refactor rope norm/neox (llama/7634) (ggerganov, Jun 5, 2024)
024c5bc CUDA: refactor mmq, dmmv, mmvq (llama/7716) (JohannesGaessler, Jun 5, 2024)
b89a6ff fix softmax r2r result wrong issue (llama/7811) (pengxin99, Jun 7, 2024)
c7b818b vulkan : reuse parent extra for views (llama/7806) (slaren, Jun 7, 2024)
ea4c21b CUDA: revise q8_1 data layout for mul_mat_q (llama/7824) (JohannesGaessler, Jun 9, 2024)
2552787 use the correct SYCL context for host USM allocations (llama/7777) (bashbaug, Jun 10, 2024)
52d4a6d CUDA: use tensor cores for MMQ (llama/7676) (JohannesGaessler, Jun 10, 2024)
2238cd2 CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860) (JohannesGaessler, Jun 11, 2024)
9279216 Update Vulkan RoPE implementation (llama/7818) (0cc4m, Jun 11, 2024)
5de7ab4 vulkan: select only one device for single gpu with multiple drivers (llama/7582) (Adriankhl, Jun 11, 2024)
47968ff ggml : improve ggml_is_contiguous logic (llama/7856) (ggerganov, Jun 12, 2024)
c29e392 tests : add non-cont unary tests (llama/7857) (ggerganov, Jun 12, 2024)
228a35f CUDA: fix broken oob check for FA vec f32 kernel (llama/7904) (JohannesGaessler, Jun 12, 2024)
5a8910e move BLAS to a separate backend (llama/6210) (slaren, Jun 13, 2024)
d13c89f rpc : fix ggml_backend_rpc_supports_buft() (llama/7918) (rgerganov, Jun 13, 2024)
ca9e524 metal : utilize max shared memory for mul_mat_id (llama/7935) (ggerganov, Jun 14, 2024)
65d8379 CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921) (JohannesGaessler, Jun 14, 2024)
f00648a remove global variables (llama/7710) (airMeng, Jun 15, 2024)
77ea030 tests : adapt to changes (#0) (ggerganov, Jun 15, 2024)
872e074 sync : llama.cpp (ggerganov, Jun 15, 2024)
1a9eb9c cuda : update build (#0) (ggerganov, Jun 15, 2024)
8714ee5 ggml : remove opencl (#0) (ggerganov, Jun 15, 2024)
e2b8b50 ci : add GG_BUILD_NO_DOWNLOAD (ggerganov, Jun 15, 2024)
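
A note on the first commit above (1230387): the minimal sketch below illustrates the atomic_flag spin-lock pattern the commit title refers to. It is illustrative only, written in C++ with hypothetical names; ggml's actual implementation is C11 and may differ in detail.

    #include <atomic>

    // Hypothetical names; a busy-wait critical section built on std::atomic_flag.
    static std::atomic_flag g_critical = ATOMIC_FLAG_INIT;

    static void critical_section_start() {
        // test_and_set() returns the previous value, so the loop exits only for
        // the one thread that changed the flag from clear to set.
        while (g_critical.test_and_set(std::memory_order_acquire)) {
            // spin
        }
    }

    static void critical_section_end() {
        g_critical.clear(std::memory_order_release); // release the lock
    }
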
vulkan: select only one device for single gpu with multiple drivers (llama/7582)
Adriankhl authored and ggerganov committed Jun 15, 2024
commit 5de7ab454cf2d7d913fcefcc16d3549a8bebe99b
82 changes: 78 additions & 4 deletions src/ggml-vulkan.cpp
@@ -1,5 +1,5 @@
 #include "ggml-vulkan.h"
-
+#include <vulkan/vulkan_core.h>
 #ifdef GGML_VULKAN_RUN_TESTS
 #include <chrono>
 #endif
@@ -9,12 +9,13 @@
 #include <algorithm>
 #include <cmath>
 #include <iostream>
-#include <limits>
 #include <tuple>
 #include <vector>
 #include <sstream>
 #include <utility>
 #include <memory>
+#include <limits>
+#include <map>
 
 #include "ggml.h"
 #include "ggml-backend-impl.h"
@@ -1555,8 +1556,10 @@ static void ggml_vk_print_gpu_info(size_t idx) {
     vk::PhysicalDeviceProperties2 props2;
     vk::PhysicalDeviceMaintenance3Properties props3;
     vk::PhysicalDeviceSubgroupProperties subgroup_props;
+    vk::PhysicalDeviceDriverProperties driver_props;
     props2.pNext = &props3;
     props3.pNext = &subgroup_props;
+    subgroup_props.pNext = &driver_props;
     physical_device.getProperties2(&props2);
 
     const size_t subgroup_size = subgroup_props.subgroupSize;
@@ -1600,7 +1603,7 @@ static void ggml_vk_print_gpu_info(size_t idx) {
     fp16 = fp16 && vk12_features.shaderFloat16;
 
     std::string device_name = props2.properties.deviceName.data();
-    std::cerr << GGML_VK_NAME << idx << ": " << device_name << " | uma: " << uma << " | fp16: " << fp16 << " | warp size: " << subgroup_size << std::endl;
+    std::cerr << GGML_VK_NAME << idx << ": " << device_name << " (" << driver_props.driverName << ") | uma: " << uma << " | fp16: " << fp16 << " | warp size: " << subgroup_size << std::endl;
 
     if (props2.properties.deviceType == vk::PhysicalDeviceType::eCpu) {
         std::cerr << "ggml_vulkan: Warning: Device type is CPU. This is probably not the device you want." << std::endl;
@@ -1696,7 +1699,78 @@ void ggml_vk_instance_init() {
         vk::PhysicalDeviceProperties props = devices[i].getProperties();
 
         if (props.deviceType == vk::PhysicalDeviceType::eDiscreteGpu) {
-            vk_instance.device_indices.push_back(i);
+            // Check if there are two physical devices corresponding to the same GPU
+            auto old_device = std::find_if(
+                vk_instance.device_indices.begin(),
+                vk_instance.device_indices.end(),
+                [&devices, &props](const size_t k){ return devices[k].getProperties().deviceID == props.deviceID; }
+            );
+            if (old_device == vk_instance.device_indices.end()) {
+                vk_instance.device_indices.push_back(i);
+            } else {
+                // There can be two physical devices corresponding to the same GPU if there are 2 different drivers
+                // This can cause an error when splitting layers across the devices, so keep only 1
+#ifdef GGML_VULKAN_DEBUG
+                std::cerr << "Device " << i << " and device " << *old_device << " have the same device id" << std::endl;
+#endif
+
+                vk::PhysicalDeviceProperties2 old_prop;
+                vk::PhysicalDeviceDriverProperties old_driver;
+                old_prop.pNext = &old_driver;
+                devices[*old_device].getProperties2(&old_prop);
+
+                vk::PhysicalDeviceProperties2 new_prop;
+                vk::PhysicalDeviceDriverProperties new_driver;
+                new_prop.pNext = &new_driver;
+                devices[i].getProperties2(&new_prop);
+
+                std::map<vk::DriverId, int> driver_priorities {};
+                int old_priority = std::numeric_limits<int>::max();
+                int new_priority = std::numeric_limits<int>::max();
+
+                // Check https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkDriverId.html for the list of driver ids
+                // Smaller number -> higher priority
+                switch (old_prop.properties.vendorID) {
+                    case VK_VENDOR_ID_AMD:
+                        driver_priorities[vk::DriverId::eMesaRadv] = 1;
+                        driver_priorities[vk::DriverId::eAmdOpenSource] = 2;
+                        driver_priorities[vk::DriverId::eAmdProprietary] = 3;
+                        break;
+                    case VK_VENDOR_ID_INTEL:
+                        driver_priorities[vk::DriverId::eIntelOpenSourceMESA] = 1;
+                        driver_priorities[vk::DriverId::eIntelProprietaryWindows] = 2;
+                        break;
+                    case VK_VENDOR_ID_NVIDIA:
+                        driver_priorities[vk::DriverId::eNvidiaProprietary] = 1;
+#if defined(VK_API_VERSION_1_3) && VK_HEADER_VERSION >= 235
+                        driver_priorities[vk::DriverId::eMesaNvk] = 2;
+#endif
+                        break;
+                }
+
+                if (driver_priorities.count(old_driver.driverID)) {
+                    old_priority = driver_priorities[old_driver.driverID];
+                }
+                if (driver_priorities.count(new_driver.driverID)) {
+                    new_priority = driver_priorities[new_driver.driverID];
+                }
+
+                if (new_priority < old_priority) {
+                    auto r = std::remove(vk_instance.device_indices.begin(), vk_instance.device_indices.end(), *old_device);
+                    vk_instance.device_indices.erase(r, vk_instance.device_indices.end());
+                    vk_instance.device_indices.push_back(i);
+
+#ifdef GGML_VULKAN_DEBUG
+                    std::cerr << "Prioritize device " << i << " driver " << new_driver.driverName << " over device " << *old_device << " driver " << old_driver.driverName << std::endl;
+#endif
+                }
+#ifdef GGML_VULKAN_DEBUG
+                else {
+                    std::cerr << "Prioritize device " << *old_device << " driver " << old_driver.driverName << " over device " << i << " driver " << new_driver.driverName << std::endl;
+
+                }
+#endif
+            }
         }
     }
 
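The change above keys deduplication on deviceID and ranks drivers per vendor. To observe the duplicate-device case on a given machine, a small standalone program can print the identifiers the patch compares. This is a minimal sketch, not part of the PR: it assumes the Vulkan C++ header (vulkan.hpp) and a loader exposing Vulkan 1.2 (where VkPhysicalDeviceDriverProperties is core), and it omits error handling.

    #include <vulkan/vulkan.hpp>
    #include <iostream>

    int main() {
        // Request Vulkan 1.2 so the driver-properties struct is recognized.
        vk::ApplicationInfo app_info("list-devices", 1, nullptr, 0, VK_API_VERSION_1_2);
        vk::Instance instance = vk::createInstance(vk::InstanceCreateInfo({}, &app_info));

        for (const vk::PhysicalDevice & device : instance.enumeratePhysicalDevices()) {
            vk::PhysicalDeviceProperties2 props2;
            vk::PhysicalDeviceDriverProperties driver_props;
            props2.pNext = &driver_props; // chain the extra struct, as the patch does
            device.getProperties2(&props2);

            // Two lines with the same deviceID but different drivers are the
            // "single gpu with multiple drivers" case this commit filters out.
            std::cout << props2.properties.deviceName
                      << " | deviceID: " << props2.properties.deviceID
                      << " | driver: " << driver_props.driverName << std::endl;
        }

        instance.destroy();
        return 0;
    }

With both RADV and AMDVLK installed, for example, the same GPU would appear twice in this listing; after this patch, ggml keeps only the higher-priority entry (RADV first for AMD, per the table in the diff).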