have an NVIDIA GPU, but can not use. #4726

Closed
pengyuxiang1 opened this issue May 30, 2024 · 16 comments
Assignees: dhiltgen
Labels: bug (Something isn't working), nvidia (Issues relating to Nvidia GPUs and CUDA)

@pengyuxiang1

What is the issue?

The script is functioning normally (screenshot attached), but the current program does not utilize the GPU (screenshot attached). Two days ago I found that there was a problem with install.sh (#4679), but now it seems that the problem is not only with the script.

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.1.39

@pengyuxiang1 pengyuxiang1 added the bug Something isn't working label May 30, 2024
@jmorganca
Member

Hi @pengyuxiang1, sorry you hit this. Do you see anything in the logs regarding GPU detection?

journalctl -fu ollama
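
If the output is long, filtering just the GPU-related lines may help (the grep pattern here is only a suggestion, not required):

journalctl -u ollama --no-pager | grep -iE 'gpu|cuda|nvidia'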

@jmorganca jmorganca added the nvidia Issues relating to Nvidia GPUs and CUDA label May 30, 2024
@pengyuxiang1
Author

As follows (screenshot of the journalctl output):

@AzizEmir

"I have the same problem. After I typed the journalctl -fu ollama command and entered a prompt, the logs appeared as follows:"

May 31 00:48:17 DEBIAN12 ollama[2469]: [GIN] 2024/05/31 - 00:48:17 | 200 |      40.284µs |       127.0.0.1 | GET      "/api/version"
May 31 01:03:30 DEBIAN12 ollama[2469]: time=2024-05-31T01:03:30.753+03:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=0 memory.available="639.8 MiB" memory.required.full="12.5 GiB" memory.required.partial="465.5 MiB" memory.required.kv="448.0 MiB" memory.weights.total="11.6 GiB" memory.weights.repeating="11.4 GiB" memory.weights.nonrepeating="157.5 MiB" memory.graph.full="244.0 MiB" memory.graph.partial="256.3 MiB"
May 31 01:03:30 DEBIAN12 ollama[2469]: time=2024-05-31T01:03:30.753+03:00 level=INFO source=server.go:338 msg="starting llama server" cmd="/tmp/ollama952982598/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 40797"
May 31 01:03:30 DEBIAN12 ollama[2469]: time=2024-05-31T01:03:30.753+03:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
May 31 01:03:30 DEBIAN12 ollama[2469]: time=2024-05-31T01:03:30.754+03:00 level=INFO source=server.go:526 msg="waiting for llama runner to start responding"
May 31 01:03:30 DEBIAN12 ollama[2469]: time=2024-05-31T01:03:30.754+03:00 level=INFO source=server.go:564 msg="waiting for server to become available" status="llm server error"
May 31 01:03:30 DEBIAN12 ollama[32103]: INFO [main] build info | build=1 commit="74f33ad" tid="139939799009152" timestamp=1717106610
May 31 01:03:30 DEBIAN12 ollama[32103]: INFO [main] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139939799009152" timestamp=1717106610 total_threads=16
May 31 01:03:30 DEBIAN12 ollama[32103]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="40797" tid="139939799009152" timestamp=1717106610
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: loaded meta data with 25 key-value pairs and 507 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e (version GGUF V3 (latest))
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv   0:                       general.architecture str              = llama
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv   1:                               general.name str              = Codestral-22B-v0.1
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv   2:                          llama.block_count u32              = 56
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv   4:                     llama.embedding_length u32              = 6144
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 16384
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 48
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 1000000.000000
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32768
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  13:            tokenizer.ggml.add_space_prefix bool             = true
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32768]   = ["<unk>", "<s>", "</s>", "[INST]", "[...
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32768]   = [0.000000, 0.000000, 0.000000, 0.0000...
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32768]   = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - kv  24:               general.quantization_version u32              = 2
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - type  f32:  113 tensors
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - type q4_0:  393 tensors
May 31 01:03:30 DEBIAN12 ollama[2469]: llama_model_loader: - type q6_K:    1 tensors
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_vocab: special tokens definition check successful ( 1027/32768 ).
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: format           = GGUF V3 (latest)
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: arch             = llama
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: vocab type       = SPM
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_vocab          = 32768
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_merges         = 0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_ctx_train      = 32768
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_embd           = 6144
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_head           = 48
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_head_kv        = 8
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_layer          = 56
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_rot            = 128
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_embd_head_k    = 128
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_embd_head_v    = 128
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_gqa            = 6
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_embd_k_gqa     = 1024
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_embd_v_gqa     = 1024
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: f_norm_eps       = 0.0e+00
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: f_logit_scale    = 0.0e+00
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_ff             = 16384
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_expert         = 0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_expert_used    = 0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: causal attn      = 1
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: pooling type     = 0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: rope type        = 0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: rope scaling     = linear
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: freq_base_train  = 1000000.0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: freq_scale_train = 1
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: rope_finetuned   = unknown
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: ssm_d_conv       = 0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: ssm_d_inner      = 0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: ssm_d_state      = 0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: ssm_dt_rank      = 0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: model type       = ?B
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: model ftype      = Q4_0
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: model params     = 22.25 B
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: model size       = 11.71 GiB (4.52 BPW)
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: general.name     = Codestral-22B-v0.1
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: BOS token        = 1 '<s>'
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: EOS token        = 2 '</s>'
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: UNK token        = 0 '<unk>'
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: LF token         = 781 '<0x0A>'
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: PRE token        = 32007 '材'
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: SUF token        = 32008 'ホ'
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: MID token        = 32009 '張'
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_print_meta: EOT token        = 32010 '洞'
May 31 01:03:30 DEBIAN12 ollama[2469]: llm_load_tensors: ggml ctx size =    0.26 MiB
May 31 01:03:31 DEBIAN12 ollama[2469]: time=2024-05-31T01:03:31.005+03:00 level=INFO source=server.go:564 msg="waiting for server to become available" status="llm server loading model"
May 31 01:03:34 DEBIAN12 ollama[2469]: llm_load_tensors:        CPU buffer size = 11986.15 MiB
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model: n_ctx      = 2048
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model: n_batch    = 512
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model: n_ubatch   = 512
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model: flash_attn = 0
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model: freq_base  = 1000000.0
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model: freq_scale = 1
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_kv_cache_init:        CPU KV buffer size =   448.00 MiB
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model:        CPU  output buffer size =     0.15 MiB
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model:        CPU compute buffer size =   244.01 MiB
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model: graph nodes  = 1798
May 31 01:03:34 DEBIAN12 ollama[2469]: llama_new_context_with_model: graph splits = 1
May 31 01:03:39 DEBIAN12 ollama[32103]: INFO [main] model loaded | tid="139939799009152" timestamp=1717106619
May 31 01:03:39 DEBIAN12 ollama[2469]: time=2024-05-31T01:03:39.311+03:00 level=INFO source=server.go:569 msg="llama runner started in 8.55 seconds"
May 31 01:13:24 DEBIAN12 ollama[2469]: [GIN] 2024/05/31 - 01:13:24 | 200 |         9m55s |       127.0.0.1 | POST     "/api/chat"

I have an NVIDIA 3050 6 GB mobile GPU.

> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

@dhiltgen
Collaborator

@pengyuxiang1 @AzizEmir unfortunately your logs only show recent output and omit earlier log messages where we're trying to discover the GPUs. Can you try the following instead so we can try to isolate the failure to discover your GPUs?

sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

Then in another terminal, try to run one model, and share the results of the server log.
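
For example, in the second terminal (any model you already have pulled works; llama3 here is only an illustration):

ollama run llama3 "hello"
cat server.log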

@dhiltgen dhiltgen self-assigned this May 31, 2024
@AzizEmir

AzizEmir commented Jun 1, 2024

sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

I ran the commands, then started the ollama service. When I typed 'ollama list', no models were listed. Is this normal? I reinstalled it and ran the tests.

(screenshot)

> OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log
2024/06/01 13:48:18 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-01T13:48:18.305+03:00 level=INFO source=images.go:729 msg="total blobs: 0"
time=2024-06-01T13:48:18.305+03:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
time=2024-06-01T13:48:18.305+03:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.1.39)"
time=2024-06-01T13:48:18.306+03:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama4185150566/runners
time=2024-06-01T13:48:18.306+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-06-01T13:48:18.306+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-06-01T13:48:18.306+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-06-01T13:48:18.306+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-06-01T13:48:18.306+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-06-01T13:48:18.306+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-06-01T13:48:18.306+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-06-01T13:48:18.306+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/deps.txt.gz
time=2024-06-01T13:48:18.306+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/ollama_llama_server.gz
time=2024-06-01T13:48:20.150+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cpu
time=2024-06-01T13:48:20.150+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cpu_avx
time=2024-06-01T13:48:20.150+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cpu_avx2
time=2024-06-01T13:48:20.150+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cuda_v11
time=2024-06-01T13:48:20.150+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/rocm_v60002
time=2024-06-01T13:48:20.150+03:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60002 cpu cpu_avx]"
time=2024-06-01T13:48:20.150+03:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-06-01T13:48:20.150+03:00 level=DEBUG source=sched.go:90 msg="starting llm scheduler"
time=2024-06-01T13:48:20.150+03:00 level=DEBUG source=gpu.go:122 msg="Detecting GPUs"
time=2024-06-01T13:48:20.150+03:00 level=DEBUG source=gpu.go:261 msg="Searching for GPU library" name=libcuda.so*
time=2024-06-01T13:48:20.150+03:00 level=DEBUG source=gpu.go:280 msg="gpu library search" globs="[/home/aziz/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-06-01T13:48:20.154+03:00 level=DEBUG source=gpu.go:313 msg="discovered GPU libraries" paths="[/usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02 /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.555.42.02]"
library /usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02 load err: /usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02: wrong ELF class: ELFCLASS32
time=2024-06-01T13:48:20.154+03:00 level=DEBUG source=gpu.go:342 msg="Unable to load nvcuda" library=/usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02 error="Unable to load /usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02 library to query for Nvidia GPUs: /usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02: wrong ELF class: ELFCLASS32"
cuInit err: 999
time=2024-06-01T13:48:20.482+03:00 level=DEBUG source=gpu.go:342 msg="Unable to load nvcuda" library=/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.555.42.02 error="nvcuda init failure: 999"
time=2024-06-01T13:48:20.482+03:00 level=DEBUG source=gpu.go:261 msg="Searching for GPU library" name=libcudart.so*
time=2024-06-01T13:48:20.482+03:00 level=DEBUG source=gpu.go:280 msg="gpu library search" globs="[/home/aziz/libcudart.so** /tmp/ollama4185150566/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2024-06-01T13:48:20.483+03:00 level=DEBUG source=gpu.go:313 msg="discovered GPU libraries" paths="[/tmp/ollama4185150566/runners/cuda_v11/libcudart.so.11.0 /usr/local/cuda/lib64/libcudart.so.12.5.39]"
cudaSetDevice err: 999
time=2024-06-01T13:48:20.810+03:00 level=DEBUG source=gpu.go:325 msg="Unable to load cudart" library=/tmp/ollama4185150566/runners/cuda_v11/libcudart.so.11.0 error="cudart init failure: 999"
cudaSetDevice err: 999
time=2024-06-01T13:48:21.138+03:00 level=DEBUG source=gpu.go:325 msg="Unable to load cudart" library=/usr/local/cuda/lib64/libcudart.so.12.5.39 error="cudart init failure: 999"
time=2024-06-01T13:48:21.138+03:00 level=DEBUG source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-01T13:48:21.138+03:00 level=DEBUG source=amd_linux.go:322 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-06-01T13:48:21.139+03:00 level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="15.3 GiB" available="4.1 GiB"
[GIN] 2024/06/01 - 13:48:27 | 200 |      18.984µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/01 - 13:48:27 | 200 |     118.507µs |       127.0.0.1 | GET      "/api/tags"
^F[GIN] 2024/06/01 - 13:49:48 | 200 |      27.632µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/01 - 13:49:48 | 404 |     182.482µs |       127.0.0.1 | POST     "/api/show"
time=2024-06-01T13:49:50.271+03:00 level=INFO source=download.go:136 msg="downloading 6a0746a1ec1a in 47 100 MB part(s)"
time=2024-06-01T13:50:16.908+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 28 attempt 0 failed: read tcp 192.168.1.104:48024->104.18.9.90:443: read: connection reset by peer, retrying in 1s"
time=2024-06-01T13:50:17.050+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 36 attempt 0 failed: read tcp 192.168.1.104:54016->104.18.8.90:443: read: connection reset by peer, retrying in 1s"
time=2024-06-01T13:50:17.214+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 6 attempt 0 failed: read tcp 192.168.1.104:54216->104.18.8.90:443: read: connection reset by peer, retrying in 1s"
time=2024-06-01T13:50:22.215+03:00 level=INFO source=download.go:251 msg="6a0746a1ec1a part 6 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-06-01T13:52:21.271+03:00 level=INFO source=download.go:251 msg="6a0746a1ec1a part 30 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-06-01T13:52:28.271+03:00 level=INFO source=download.go:251 msg="6a0746a1ec1a part 42 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-06-01T13:55:20.227+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 20 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-06-01T13:55:22.430+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 17 attempt 0 failed: read tcp 192.168.1.104:47992->104.18.9.90:443: read: connection reset by peer, retrying in 1s"
time=2024-06-01T13:56:39.531+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 13 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-06-01T13:56:54.283+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 8 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-06-01T13:57:46.939+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 37 attempt 0 failed: read tcp 192.168.1.104:48036->104.18.9.90:443: read: connection reset by peer, retrying in 1s"
time=2024-06-01T13:57:57.271+03:00 level=INFO source=download.go:251 msg="6a0746a1ec1a part 19 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-06-01T13:58:12.190+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 14 attempt 0 failed: read tcp 192.168.1.104:54116->104.18.8.90:443: read: connection reset by peer, retrying in 1s"
time=2024-06-01T14:00:03.510+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 25 attempt 0 failed: read tcp 192.168.1.104:47924->104.18.9.90:443: read: connection reset by peer, retrying in 1s"
time=2024-06-01T14:00:14.271+03:00 level=INFO source=download.go:251 msg="6a0746a1ec1a part 24 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-06-01T14:00:19.070+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 28 attempt 1 failed: unexpected EOF, retrying in 2s"
time=2024-06-01T14:02:24.035+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 0 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-06-01T14:02:25.271+03:00 level=INFO source=download.go:251 msg="6a0746a1ec1a part 27 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-06-01T14:02:45.932+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 2 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-06-01T14:03:33.271+03:00 level=INFO source=download.go:251 msg="6a0746a1ec1a part 31 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-06-01T14:03:48.766+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 16 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-06-01T14:04:19.804+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 5 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-06-01T14:06:42.615+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 26 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-06-01T14:08:57.371+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 8 attempt 1 failed: unexpected EOF, retrying in 2s"
time=2024-06-01T14:09:11.241+03:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 1 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-06-01T14:10:02.178+03:00 level=INFO source=download.go:136 msg="downloading 4fa551d4f938 in 1 12 KB part(s)"
time=2024-06-01T14:10:04.045+03:00 level=INFO source=download.go:136 msg="downloading 8ab4849b038c in 1 254 B part(s)"
time=2024-06-01T14:10:05.911+03:00 level=INFO source=download.go:136 msg="downloading 577073ffcc6c in 1 110 B part(s)"
time=2024-06-01T14:10:07.765+03:00 level=INFO source=download.go:136 msg="downloading 3f8eb4da87fa in 1 485 B part(s)"
[GIN] 2024/06/01 - 14:10:11 | 200 |        20m23s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/06/01 - 14:10:11 | 200 |     490.523µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/01 - 14:10:11 | 200 |     409.118µs |       127.0.0.1 | POST     "/api/show"
time=2024-06-01T14:10:11.789+03:00 level=DEBUG source=gpu.go:122 msg="Detecting GPUs"
time=2024-06-01T14:10:11.789+03:00 level=DEBUG source=gpu.go:261 msg="Searching for GPU library" name=libcuda.so*
time=2024-06-01T14:10:11.789+03:00 level=DEBUG source=gpu.go:280 msg="gpu library search" globs="[/home/aziz/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-06-01T14:10:11.793+03:00 level=DEBUG source=gpu.go:313 msg="discovered GPU libraries" paths="[/usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02 /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.555.42.02]"
library /usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02 load err: /usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02: wrong ELF class: ELFCLASS32
time=2024-06-01T14:10:11.793+03:00 level=DEBUG source=gpu.go:342 msg="Unable to load nvcuda" library=/usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02 error="Unable to load /usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02 library to query for Nvidia GPUs: /usr/lib/i386-linux-gnu/nvidia/current/libcuda.so.555.42.02: wrong ELF class: ELFCLASS32"
cuInit err: 999
time=2024-06-01T14:10:12.129+03:00 level=DEBUG source=gpu.go:342 msg="Unable to load nvcuda" library=/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.555.42.02 error="nvcuda init failure: 999"
time=2024-06-01T14:10:12.129+03:00 level=DEBUG source=gpu.go:261 msg="Searching for GPU library" name=libcudart.so*
time=2024-06-01T14:10:12.129+03:00 level=DEBUG source=gpu.go:280 msg="gpu library search" globs="[/home/aziz/libcudart.so** /tmp/ollama4185150566/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2024-06-01T14:10:12.132+03:00 level=DEBUG source=gpu.go:313 msg="discovered GPU libraries" paths="[/tmp/ollama4185150566/runners/cuda_v11/libcudart.so.11.0 /usr/local/cuda/lib64/libcudart.so.12.5.39]"
cudaSetDevice err: 999
time=2024-06-01T14:10:12.459+03:00 level=DEBUG source=gpu.go:325 msg="Unable to load cudart" library=/tmp/ollama4185150566/runners/cuda_v11/libcudart.so.11.0 error="cudart init failure: 999"
cudaSetDevice err: 999
time=2024-06-01T14:10:12.790+03:00 level=DEBUG source=gpu.go:325 msg="Unable to load cudart" library=/usr/local/cuda/lib64/libcudart.so.12.5.39 error="cudart init failure: 999"
time=2024-06-01T14:10:12.790+03:00 level=DEBUG source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-01T14:10:12.790+03:00 level=DEBUG source=amd_linux.go:322 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-06-01T14:10:12.790+03:00 level=DEBUG source=gguf.go:57 msg="model = &llm.gguf{containerGGUF:(*llm.containerGGUF)(0xc000ade500), kv:llm.KV{}, tensors:[]*llm.Tensor(nil), parameters:0x0}"
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=sched.go:146 msg="cpu mode with existing models, loading"
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cpu
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cpu_avx
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cpu_avx2
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cuda_v11
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/rocm_v60002
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=memory.go:44 msg=evaluating library=cpu gpu_count=1 available="220.7 MiB"
time=2024-06-01T14:10:13.332+03:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=0 memory.available="220.7 MiB" memory.required.full="4.6 GiB" memory.required.partial="794.5 MiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cpu
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cpu_avx
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cpu_avx2
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/cuda_v11
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4185150566/runners/rocm_v60002
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=gpu.go:372 msg="no filter required for library cpu"
time=2024-06-01T14:10:13.332+03:00 level=INFO source=server.go:338 msg="starting llama server" cmd="/tmp/ollama4185150566/runners/cpu_avx2/ollama_llama_server --model /home/aziz/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --verbose --parallel 1 --port 41601"
time=2024-06-01T14:10:13.332+03:00 level=DEBUG source=server.go:353 msg=subprocess environment="[PATH=/usr/local/cuda/bin:/home/aziz/.cargo/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games LD_LIBRARY_PATH=/tmp/ollama4185150566/runners/cpu_avx2]"
time=2024-06-01T14:10:13.342+03:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-01T14:10:13.342+03:00 level=INFO source=server.go:526 msg="waiting for llama runner to start responding"
time=2024-06-01T14:10:13.342+03:00 level=INFO source=server.go:564 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="74f33ad" tid="140558753294208" timestamp=1717240213
INFO [main] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140558753294208" timestamp=1717240213 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="41601" tid="140558753294208" timestamp=1717240213
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/aziz/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-06-01T14:10:13.593+03:00 level=INFO source=server.go:564 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  4437.80 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
llama_new_context_with_model:        CPU compute buffer size =   258.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
time=2024-06-01T14:10:14.095+03:00 level=DEBUG source=server.go:575 msg="model load progress 1.00"
DEBUG [initialize] initializing slots | n_slots=1 tid="140558753294208" timestamp=1717240214
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="140558753294208" timestamp=1717240214
INFO [main] model loaded | tid="140558753294208" timestamp=1717240214
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="140558753294208" timestamp=1717240214
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=0 tid="140558753294208" timestamp=1717240214
time=2024-06-01T14:10:14.346+03:00 level=INFO source=server.go:569 msg="llama runner started in 1.00 seconds"
time=2024-06-01T14:10:14.346+03:00 level=DEBUG source=sched.go:351 msg="finished setting up runner" model=/home/aziz/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-06-01T14:10:14.346+03:00 level=DEBUG source=prompt.go:172 msg="prompt now fits in context window" required=1 window=2048
[GIN] 2024/06/01 - 14:10:14 | 200 |  2.558010867s |       127.0.0.1 | POST     "/api/chat"
time=2024-06-01T14:10:14.346+03:00 level=DEBUG source=sched.go:355 msg="context for request finished"
time=2024-06-01T14:10:14.346+03:00 level=DEBUG source=sched.go:237 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/aziz/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-06-01T14:10:14.347+03:00 level=DEBUG source=sched.go:255 msg="after processing request finished event" modelPath=/home/aziz/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2024-06-01T14:10:25.686+03:00 level=DEBUG source=sched.go:447 msg="evaluating already loaded" model=/home/aziz/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=1 tid="140558753294208" timestamp=1717240225
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2 tid="140558753294208" timestamp=1717240225
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=49130 status=200 tid="140558731429568" timestamp=1717240225
time=2024-06-01T14:10:25.774+03:00 level=DEBUG source=prompt.go:172 msg="prompt now fits in context window" required=20 window=2048
time=2024-06-01T14:10:25.774+03:00 level=DEBUG source=routes.go:1322 msg="chat handler" prompt="<|start_header_id|>user<|end_header_id|>\n\nwrite me a oop program in rust lang<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" images=0
time=2024-06-01T14:10:25.774+03:00 level=DEBUG source=server.go:665 msg="setting token limit to 10x num_ctx" num_ctx=2048 num_predict=20480
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=3 tid="140558753294208" timestamp=1717240225
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=4 tid="140558753294208" timestamp=1717240225
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=18 slot_id=0 task_id=4 tid="140558753294208" timestamp=1717240225
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=4 tid="140558753294208" timestamp=1717240225
DEBUG [print_timings] prompt eval time     =    1105.95 ms /    18 tokens (   61.44 ms per token,    16.28 tokens per second) | n_prompt_tokens_processed=18 n_tokens_second=16.275659028560167 slot_id=0 t_prompt_processing=1105.946 t_token=61.44144444444444 task_id=4 tid="140558753294208" timestamp=1717240293
DEBUG [print_timings] generation eval time =   66106.35 ms /   379 runs   (  174.42 ms per token,     5.73 tokens per second) | n_decoded=379 n_tokens_second=5.7331858215379965 slot_id=0 t_token=174.42309234828497 t_token_generation=66106.352 task_id=4 tid="140558753294208" timestamp=1717240293
DEBUG [print_timings]           total time =   67212.30 ms | slot_id=0 t_prompt_processing=1105.946 t_token_generation=66106.352 t_total=67212.298 task_id=4 tid="140558753294208" timestamp=1717240293
DEBUG [update_slots] slot released | n_cache_tokens=397 n_ctx=2048 n_past=396 n_system_tokens=0 slot_id=0 task_id=4 tid="140558753294208" timestamp=1717240293 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=49130 status=200 tid="140558731429568" timestamp=1717240293
[GIN] 2024/06/01 - 14:11:33 | 200 |          1m7s |       127.0.0.1 | POST     "/api/chat"
time=2024-06-01T14:11:33.035+03:00 level=DEBUG source=sched.go:304 msg="context for request finished"
time=2024-06-01T14:11:33.036+03:00 level=DEBUG source=sched.go:237 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/aziz/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2024-06-01T14:11:33.036+03:00 level=DEBUG source=sched.go:255 msg="after processing request finished event" modelPath=/home/aziz/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0

@dhiltgen
Collaborator

dhiltgen commented Jun 2, 2024

@AzizEmir the 999 cuda errors in your logs are "unknown" low-level driver errors from the nvidia stack. You can most likely resolve this by following the guide here https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#container-fails-to-run-on-nvidia-gpu

If you recently upgraded to the 555 driver, you may want to re-run our install script as there have been some changes in the way nvidia sets up the drivers which required changes to our install flow to make sure the uvm driver is properly loaded.
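
As a rough sketch of those steps (the exact commands are in the linked guide; the module name here is the commonly documented one, so double-check against the guide):

# reload the NVIDIA unified-memory kernel module
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
# re-run the official install script, then restart the service
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl restart ollama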

@pengyuxiang1
Author

pengyuxiang1 commented Jun 3, 2024

@pengyuxiang1 @AzizEmir unfortunately your logs only show recent output and omit earlier log messages where we're trying to discover the GPUs. Can you try the following instead so we can try to isolate the failure to discover your GPUs?

sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

Then in another terminal, try to run one model, and share the results of the server log.

I ran the command and the following log appeared:

2024/06/03 10:25:01 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-03T10:25:01.929+08:00 level=INFO source=images.go:729 msg="total blobs: 0"
time=2024-06-03T10:25:01.929+08:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
time=2024-06-03T10:25:01.930+08:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.1.39)"
time=2024-06-03T10:25:01.930+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama4014666769/runners
time=2024-06-03T10:25:01.930+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-06-03T10:25:01.930+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-06-03T10:25:01.930+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-06-03T10:25:01.930+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-06-03T10:25:01.930+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-06-03T10:25:01.930+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-06-03T10:25:01.930+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-06-03T10:25:01.930+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/deps.txt.gz
time=2024-06-03T10:25:01.930+08:00 level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/ollama_llama_server.gz
time=2024-06-03T10:25:05.102+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4014666769/runners/cpu
time=2024-06-03T10:25:05.102+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4014666769/runners/cpu_avx
time=2024-06-03T10:25:05.102+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4014666769/runners/cpu_avx2
time=2024-06-03T10:25:05.102+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4014666769/runners/cuda_v11
time=2024-06-03T10:25:05.102+08:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4014666769/runners/rocm_v60002
time=2024-06-03T10:25:05.102+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 rocm_v60002 cpu]"
time=2024-06-03T10:25:05.102+08:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-06-03T10:25:05.102+08:00 level=DEBUG source=sched.go:90 msg="starting llm scheduler"
time=2024-06-03T10:25:05.102+08:00 level=DEBUG source=gpu.go:122 msg="Detecting GPUs"
time=2024-06-03T10:25:05.102+08:00 level=DEBUG source=gpu.go:261 msg="Searching for GPU library" name=libcuda.so*
time=2024-06-03T10:25:05.102+08:00 level=DEBUG source=gpu.go:280 msg="gpu library search" globs="[/usr/local/cuda-11.0/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-06-03T10:25:05.110+08:00 level=DEBUG source=gpu.go:313 msg="discovered GPU libraries" paths="[/usr/lib/libcuda.so.515.65.01 /usr/lib64/libcuda.so.515.65.01]"
library /usr/lib/libcuda.so.515.65.01 load err: /usr/lib/libcuda.so.515.65.01: wrong ELF class: ELFCLASS32
time=2024-06-03T10:25:05.111+08:00 level=DEBUG source=gpu.go:342 msg="Unable to load nvcuda" library=/usr/lib/libcuda.so.515.65.01 error="Unable to load /usr/lib/libcuda.so.515.65.01 library to query for Nvidia GPUs: /usr/lib/libcuda.so.515.65.01: wrong ELF class: ELFCLASS32"
CUDA driver version: 11.7
time=2024-06-03T10:25:05.175+08:00 level=DEBUG source=gpu.go:127 msg="detected GPUs" count=1 library=/usr/lib64/libcuda.so.515.65.01
time=2024-06-03T10:25:05.175+08:00 level=DEBUG source=cpu_common.go:11 msg="CPU has AVX2"
[GPU-08bf9adf-c6f1-1ed9-641d-573b522a9d68] CUDA totalMem 7680 mb
[GPU-08bf9adf-c6f1-1ed9-641d-573b522a9d68] CUDA freeMem 7583 mb
[GPU-08bf9adf-c6f1-1ed9-641d-573b522a9d68] Compute Capability 7.5
time=2024-06-03T10:25:05.390+08:00 level=DEBUG source=amd_linux.go:322 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing nvcuda library
time=2024-06-03T10:25:05.390+08:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-08bf9adf-c6f1-1ed9-641d-573b522a9d68 library=cuda compute=7.5 driver=11.7 name="Tesla T4" total="7.5 GiB" available="7.4 GiB"
[GIN] 2024/06/03 - 10:25:25 | 200 |      33.667µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/03 - 10:25:25 | 200 |     174.645µs |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/06/03 - 10:25:36 | 200 |      24.344µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/03 - 10:25:36 | 200 |      45.405µs |       127.0.0.1 | GET      "/api/tags"

There seems to be an error in the log:

time=2024-06-03T10:25:05.111+08:00 level=DEBUG source=gpu.go:342 msg="Unable to load nvcuda" library=/usr/lib/libcuda.so.515.65.01 error="Unable to load /usr/lib/libcuda.so.515.65.01 library to query for Nvidia GPUs: /usr/lib/libcuda.so.515.65.01: wrong ELF class: ELFCLASS32"

Then, in another terminal, I ran 'ollama list' and there was no model list output. (screenshot)
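
A quick way to confirm which of the discovered libraries is actually 64-bit (paths taken from the log above); the ELFCLASS32 warning for the /usr/lib copy looks harmless here, since the 64-bit /usr/lib64 copy does load and the Tesla T4 is detected later in the same log:

file /usr/lib/libcuda.so.515.65.01
file /usr/lib64/libcuda.so.515.65.01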

@matteopic

I have the same issue. I regularly used Ollama with Docker, but for the past few days, it has stopped utilizing the GPU. I tried the images ollama/ollama:0.1.39 and ollama/ollama:0.1.41, but the problem persists.

time=2024-06-03T06:07:00.414Z level=DEBUG source=gpu.go:132 msg="Detecting GPUs"
time=2024-06-03T06:07:00.415Z level=DEBUG source=gpu.go:274 msg="Searching for GPU library" name=libcuda.so*
time=2024-06-03T06:07:00.415Z level=DEBUG source=gpu.go:293 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-06-03T06:07:00.415Z level=DEBUG source=gpu.go:326 msg="discovered GPU libraries" paths="[/usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/wsl/drivers/nvltsi.inf_amd64_0f2f0ed95e1f8d10/libcuda.so.1.1]"
cuInit err: 500
time=2024-06-03T06:07:00.476Z level=DEBUG source=gpu.go:355 msg="Unable to load nvcuda" library=/usr/lib/x86_64-linux-gnu/libcuda.so.1 error="nvcuda init failure: 500"
cuInit err: 500
time=2024-06-03T06:07:00.477Z level=DEBUG source=gpu.go:355 msg="Unable to load nvcuda" library=/usr/lib/wsl/drivers/nvltsi.inf_amd64_0f2f0ed95e1f8d10/libcuda.so.1.1 error="nvcuda init failure: 500"   
time=2024-06-03T06:07:00.477Z level=DEBUG source=gpu.go:274 msg="Searching for GPU library" name=libcudart.so*
time=2024-06-03T06:07:00.477Z level=DEBUG source=gpu.go:293 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so** /tmp/ollama2564465701/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2024-06-03T06:07:00.478Z level=DEBUG source=gpu.go:326 msg="discovered GPU libraries" paths=[/tmp/ollama2564465701/runners/cuda_v11/libcudart.so.11.0]
cudaSetDevice err: 500
time=2024-06-03T06:07:00.479Z level=DEBUG source=gpu.go:338 msg="Unable to load cudart" library=/tmp/ollama2564465701/runners/cuda_v11/libcudart.so.11.0 error="cudart init failure: 500"
time=2024-06-03T06:07:00.479Z level=DEBUG source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-03T06:07:00.479Z level=DEBUG source=amd_linux.go:322 msg="amdgpu driver not detected /sys/module/amdgpu"
time=2024-06-03T06:07:00.479Z level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="15.3 GiB" available="11.4 GiB"
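
For reference, the container is typically started with explicit GPU access like this (assuming the NVIDIA Container Toolkit is configured on the host; the tag is just the one mentioned above):

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:0.1.41
# sanity check that the GPU is visible inside the container at all
docker exec -it ollama nvidia-smi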

@AzizEmir

AzizEmir commented Jun 3, 2024

> @AzizEmir the 999 cuda errors in your logs are "unknown" low-level driver errors from the nvidia stack. You can most likely resolve this by following the guide here https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#container-fails-to-run-on-nvidia-gpu
>
> If you recently upgraded to the 555 driver, you may want to re-run our install script as there have been some changes in the way nvidia sets up the drivers which required changes to our install flow to make sure the uvm driver is properly loaded.

Ollama is not working in the Docker container.

I think something is missing in Debian 12.
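
Quick checks on the Debian host (a sketch; any CUDA base image should work for the second command):

nvidia-smi
sudo docker run --rm --gpus all nvidia/cuda:12.5.0-runtime-ubuntu22.04 nvidia-smi
# if the first command works but the second fails, the NVIDIA Container Toolkit setup is the problem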


For testing, I set up GPU virtualization and installed Fedora on the virtual machine.
Fedora-Workstation-Live-x86_64-40-1.14.iso

I performed a network installation of CUDA Toolkit 12.5 and then installed the nvidia-driver (555.42.02).
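
For reference, the network install boils down to something like this (a sketch; the repo URL and package versions depend on your Fedora release, so follow NVIDIA's instructions for the exact values):

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/cuda-fedora39.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-12-5
sudo dnf -y module install nvidia-driver:latest-dkms   # pulls the 555.x driver from the same repo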

There are no issues on the virtual machine :D

Ollama (screenshot)

LM Studio (screenshot)

@zzzhouuu

zzzhouuu commented Jun 3, 2024

I'm running Ollama from the nvidia/cuda:12.5.0-runtime-ubuntu22.04 image on WSL2:

FROM nvidia/cuda:12.5.0-runtime-ubuntu22.04
COPY ollama-linux-amd64 /bin/ollama

EXPOSE 11434
ENV OLLAMA_HOST 0.0.0.0
ENV OLLAMA_ORIGINS *
ENV OLLAMA_NUM_PARALLEL 4
ENV OLLAMA_MAX_LOADED_MODELS 4

ENTRYPOINT ["/bin/ollama"]
CMD ["serve"]

Running the nvidia-smi command inside the container:

# nvidia-smi
Mon Jun  3 13:54:32 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.03              Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0  On |                  Off |
|  0%   42C    P8             13W /  450W |    2853MiB /  24564MiB |     43%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
2024-06-03 21:43:35 2024/06/03 13:43:35 routes.go:1007: INFO server config env="map[OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[* http:https://localhost https://localhost http:https://localhost:* https://localhost:* http:https://127.0.0.1 https://127.0.0.1 http:https://127.0.0.1:* https://127.0.0.1:* http:https://0.0.0.0 https://0.0.0.0 http:https://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
2024-06-03 21:43:35 time=2024-06-03T13:43:35.341Z level=INFO source=images.go:729 msg="total blobs: 97"
2024-06-03 21:43:35 time=2024-06-03T13:43:35.345Z level=INFO source=images.go:736 msg="total unused blobs removed: 0"
2024-06-03 21:43:35 time=2024-06-03T13:43:35.345Z level=INFO source=routes.go:1053 msg="Listening on [::]:11434 (version 0.1.41)"
2024-06-03 21:43:35 time=2024-06-03T13:43:35.346Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama219412353/runners
2024-06-03 21:43:35 time=2024-06-03T13:43:35.346Z level=DEBUG source=payload.go:180 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
2024-06-03 21:43:35 time=2024-06-03T13:43:35.346Z level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
2024-06-03 21:43:35 time=2024-06-03T13:43:35.346Z level=DEBUG source=payload.go:180 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
2024-06-03 21:43:35 time=2024-06-03T13:43:35.346Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
2024-06-03 21:43:35 time=2024-06-03T13:43:35.346Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
2024-06-03 21:43:35 time=2024-06-03T13:43:35.346Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
2024-06-03 21:43:35 time=2024-06-03T13:43:35.346Z level=DEBUG source=payload.go:180 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
2024-06-03 21:43:35 time=2024-06-03T13:43:35.346Z level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/deps.txt.gz
2024-06-03 21:43:35 time=2024-06-03T13:43:35.346Z level=DEBUG source=payload.go:180 msg=extracting variant=rocm_v60002 file=build/linux/x86_64/rocm_v60002/bin/ollama_llama_server.gz
2024-06-03 21:43:37 time=2024-06-03T13:43:37.201Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama219412353/runners/cpu
2024-06-03 21:43:37 time=2024-06-03T13:43:37.202Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama219412353/runners/cpu_avx
2024-06-03 21:43:37 time=2024-06-03T13:43:37.202Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama219412353/runners/cpu_avx2
2024-06-03 21:43:37 time=2024-06-03T13:43:37.202Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama219412353/runners/cuda_v11
2024-06-03 21:43:37 time=2024-06-03T13:43:37.202Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama219412353/runners/rocm_v60002
2024-06-03 21:43:37 time=2024-06-03T13:43:37.202Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60002]"
2024-06-03 21:43:37 time=2024-06-03T13:43:37.202Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
2024-06-03 21:43:37 time=2024-06-03T13:43:37.202Z level=DEBUG source=sched.go:90 msg="starting llm scheduler"
2024-06-03 21:43:37 time=2024-06-03T13:43:37.202Z level=DEBUG source=gpu.go:132 msg="Detecting GPUs"
2024-06-03 21:43:37 time=2024-06-03T13:43:37.202Z level=DEBUG source=gpu.go:274 msg="Searching for GPU library" name=libcuda.so*
2024-06-03 21:43:37 time=2024-06-03T13:43:37.202Z level=DEBUG source=gpu.go:293 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
2024-06-03 21:43:37 time=2024-06-03T13:43:37.213Z level=DEBUG source=gpu.go:326 msg="discovered GPU libraries" paths="[/usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/wsl/drivers/nv_dispi.inf_amd64_5cf411dadcb5710d/libcuda.so.1.1]"
2024-06-03 21:43:37 cuInit err: 500
2024-06-03 21:43:37 time=2024-06-03T13:43:37.245Z level=DEBUG source=gpu.go:355 msg="Unable to load nvcuda" library=/usr/lib/x86_64-linux-gnu/libcuda.so.1 error="nvcuda init failure: 500"
2024-06-03 21:43:37 cuInit err: 500
2024-06-03 21:43:37 time=2024-06-03T13:43:37.246Z level=DEBUG source=gpu.go:355 msg="Unable to load nvcuda" library=/usr/lib/wsl/drivers/nv_dispi.inf_amd64_5cf411dadcb5710d/libcuda.so.1.1 error="nvcuda init failure: 500"
2024-06-03 21:43:37 time=2024-06-03T13:43:37.246Z level=DEBUG source=gpu.go:274 msg="Searching for GPU library" name=libcudart.so*
2024-06-03 21:43:37 time=2024-06-03T13:43:37.246Z level=DEBUG source=gpu.go:293 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so** /tmp/ollama219412353/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
2024-06-03 21:43:37 time=2024-06-03T13:43:37.246Z level=DEBUG source=gpu.go:326 msg="discovered GPU libraries" paths="[/tmp/ollama219412353/runners/cuda_v11/libcudart.so.11.0 /usr/local/cuda/lib64/libcudart.so.12.5.39]"
2024-06-03 21:43:37 cudaSetDevice err: 500
2024-06-03 21:43:37 time=2024-06-03T13:43:37.246Z level=DEBUG source=gpu.go:338 msg="Unable to load cudart" library=/tmp/ollama219412353/runners/cuda_v11/libcudart.so.11.0 error="cudart init failure: 500"
2024-06-03 21:43:37 cudaSetDevice err: 500
2024-06-03 21:43:37 time=2024-06-03T13:43:37.251Z level=DEBUG source=gpu.go:338 msg="Unable to load cudart" library=/usr/local/cuda/lib64/libcudart.so.12.5.39 error="cudart init failure: 500"
2024-06-03 21:43:37 time=2024-06-03T13:43:37.251Z level=DEBUG source=cpu_common.go:11 msg="CPU has AVX2"
2024-06-03 21:43:37 time=2024-06-03T13:43:37.251Z level=DEBUG source=amd_linux.go:322 msg="amdgpu driver not detected /sys/module/amdgpu"
2024-06-03 21:43:37 time=2024-06-03T13:43:37.251Z level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="46.8 GiB" available="43.7 GiB"

@dhiltgen
Collaborator

dhiltgen commented Jun 4, 2024

@pengyuxiang1 your log shows it did discover your GPU

time=2024-06-03T10:25:05.390+08:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-08bf9adf-c6f1-1ed9-641d-573b522a9d68 library=cuda compute=7.5 driver=11.7 name="Tesla T4" total="7.5 GiB" available="7.4 GiB"

It looks like you haven't pulled or run any models. What happens if you run

ollama run llama3 hello
ollama ps
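
When the model is offloaded, ollama ps should report the GPU in the PROCESSOR column; the output below is illustrative, not taken from your machine:

NAME            ID              SIZE    PROCESSOR    UNTIL
llama3:latest   365c0bd3c000    6.7 GB  100% GPU     4 minutes from now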

If that doesn't load onto the GPU, please share the updated logs or any errors you see.

@dhiltgen
Collaborator

dhiltgen commented Jun 4, 2024

@AzizEmir and @zzzhouuu it looks like you're likely hitting compatibility problems between the 555 NVIDIA driver and the Docker container runtime. Others have reported that downgrading their drivers works for now until we (or NVIDIA) have a more solid solution.
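
(To check which driver a host is actually running: nvidia-smi --query-gpu=driver_version --format=csv,noheader.)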

@whargrove

I'm facing a (possibly) related issue.

Try reloading the nvidia_uvm driver: sudo rmmod nvidia_uvm, then sudo modprobe nvidia_uvm.

Seemed to fix the issue for me.
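
For reference, the full sequence is roughly (a sketch; stopping the service first avoids "module is in use" errors):

sudo systemctl stop ollama
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
sudo systemctl start ollama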

Logs for posterity.

time=2024-06-04T21:36:06.363-06:00 level=DEBUG source=gpu.go:132 msg="Detecting GPUs"
time=2024-06-04T21:36:06.363-06:00 level=DEBUG source=gpu.go:274 msg="Searching for GPU library" name=libcuda.so*
time=2024-06-04T21:36:06.363-06:00 level=DEBUG source=gpu.go:293 msg="gpu library search" globs="[/home/wes/workspace/chat-with-paper/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-06-04T21:36:06.371-06:00 level=DEBUG source=gpu.go:326 msg="discovered GPU libraries" paths="[/usr/lib/libcuda.so.550.78 /usr/lib64/libcuda.so.550.78]"
cuInit err: 999
time=2024-06-04T21:36:06.385-06:00 level=DEBUG source=gpu.go:355 msg="Unable to load nvcuda" library=/usr/lib/libcuda.so.550.78 error="nvcuda init failure: 999"
cuInit err: 999
time=2024-06-04T21:36:06.387-06:00 level=DEBUG source=gpu.go:355 msg="Unable to load nvcuda" library=/usr/lib64/libcuda.so.550.78 error="nvcuda init failure: 999"

I am not running Ollama from a container:

nvidia-smi
Tue Jun  4 21:40:04 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   47C    P8             26W /  350W |    1816MiB /  12288MiB |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     48119      G   /usr/lib/Xorg                                 522MiB |
|    0   N/A  N/A     48225      G   /usr/bin/gnome-shell                          190MiB |
|    0   N/A  N/A     48989      G   /usr/bin/alacritty                             11MiB |
|    0   N/A  N/A     49953      G   ...erProcess --variations-seed-version        113MiB |
|    0   N/A  N/A     51510      G   /usr/lib/firefox/firefox                      957MiB |
+-----------------------------------------------------------------------------------------+

@bigsk1

bigsk1 commented Jun 6, 2024

FYI, I had the same issue running a Proxmox VM with Ubuntu 24.04; I had to make sure to use the host CPU type and not the default x86-64-v2-AES. The CPU needs the AVX and AVX2 vector extensions:

lscpu | grep -i avx
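
If the grep prints nothing inside the VM, the guest CPU model is masking the extensions; switching the VM's CPU type to host (for example with qm set <vmid> --cpu host on Proxmox, as an illustration) and rebooting the guest should expose them.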

@zzzhouuu

Docker Desktop 4.31 was released on 2024-06-06 and includes NVIDIA Container Toolkit 1.15.0, which resolves my issue.

@dhiltgen
Collaborator

dhiltgen commented Jul 3, 2024

I believe all the issues have been resolved now with the troubleshooting steps.

If anyone is still having problems, please make sure to upgrade to the latest version, and if that doesn't clear it, share your latest server log and I'll reopen the issue.

@dhiltgen dhiltgen closed this as completed Jul 3, 2024