
Baby-llama.cpp report bus error #4830

Closed
YangZyyyy opened this issue Jan 9, 2024 · 6 comments
YangZyyyy commented Jan 9, 2024

System: macOS Ventura 13.2.1
CPU: M2 Pro

Reproduction steps:

  1. git clone https://github.com/ggerganov/llama.cpp
  2. cd llama.cpp
  3. make
  4. ./baby-llama

The terminal then prints:

init model
init_kv_cache
zsh: bus error  ./baby-llama

A similar error occurs on my Linux server:

System: Ubuntu 22.04.2 LTS
Architecture: x86_64
CPU(s): 128
Model name: Intel(R) Xeon(R) CPU @ 2.90GHz

Following the same steps on this machine, the terminal prints:

init model
init_kv_cache
Floating point exception (core dumped)
NawafAlansari (Contributor) commented:

I have encountered the same issue while trying to run this example. I am trying to implement something similar to what is described in this issue. If no one else is on it, I'd like to try finding and implementing a fix.

NawafAlansari (Contributor) commented:

After further investigation, I've pinpointed the issue within the forward_batch function, specifically at line 988, which calls ggml_build_forward_expand(gf, inpL). It seems the root of the problem lies in the size of gf->visited_hash_table being zero at the time of this call.

Here's a breakdown of the call sequence leading to the exception:

  1. ggml_build_forward_expand(gf, inpL) is invoked, passing the computation graph gf as an argument.
  2. Inside ggml_build_forward_expand, the computation graph gf is passed to ggml_visit_parents.
  3. ggml_visit_parents then passes the graph's hash table to ggml_hash_insert.
  4. Finally, ggml_hash_insert calls ggml_hash_find, where the issue manifests. The problematic line in ggml_hash_find is the first one:
   size_t h = ggml_hash(key) % hash_set.size;

It seems that hash_set.size is zero, so the modulo operation divides by zero and raises SIGFPE (arithmetic exception); this division by zero in ggml_hash_find is the immediate cause of the crash.

I'm currently exploring why gf->visited_hash_table's size is zero at this point in the execution and how we can ensure it's properly initialized before reaching this critical operation. Any further suggestions would be greatly appreciated.
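To make the failure mode concrete, here is a minimal, self-contained C sketch of the pattern described above. It is not ggml's actual code: toy_hash_set and toy_hash_find are illustrative stand-ins for the real structures.

#include <stddef.h>
#include <stdio.h>

/* Illustrative stand-in for the graph's visited hash table. */
struct toy_hash_set {
    size_t size; /* number of slots; zero if the graph was never allocated */
};

/* Mirrors the shape of the failing lookup: the first statement takes the
   key's hash modulo the table size. */
static size_t toy_hash_find(struct toy_hash_set set, size_t key_hash) {
    /* With set.size == 0 this is an integer division by zero: undefined
       behavior in C, raised as SIGFPE on x86_64. */
    size_t h = key_hash % set.size;
    return h;
}

int main(void) {
    struct toy_hash_set empty = { 0 }; /* zero-initialized, like gf = {} */
    printf("%zu\n", toy_hash_find(empty, 42));
    return 0;
}

Note that on x86_64 the zero modulus reliably traps as SIGFPE, while AArch64 integer division by zero returns zero without trapping, so on the M2 machine the bad index would instead lead to an invalid memory access later on, which may explain why the same bug surfaces there as a bus error.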

NawafAlansari (Contributor) commented:

Okay, I was able to fix the issue by changing the lines in the main function that contain

gf = {}

to

// allocate the graph (and its visited hash table) inside the ggml context
struct ggml_cgraph * gf = ggml_new_graph_custom(ctx0, LLAMA_TRAIN_MAX_NODES, true);

I am not sure this is the best way to go about it, but it works for now; any suggestions would be appreciated. I can also try to push the fix.

slaren (Collaborator) commented Feb 16, 2024

@NawafAlansari That looks correct. When the baby-llama example was created, graphs had a fixed size and could be allocated on the stack, but that changed a while ago: graphs now need to be allocated in a ggml_context (see ggerganov/ggml#567 for more details). A PR to fix the example would be very welcome.
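For reference, here is a minimal sketch of the allocation pattern described above, assuming a build against ggml's public header. The mem_size value and the placeholder input tensor are illustrative; ggml_init, ggml_new_graph_custom, and ggml_build_forward_expand are the actual entry points.

#include "ggml.h"

/* Defined in llama.cpp's common/train.h; repeated here (with its value at
   the time of this issue) so the sketch is self-contained. */
#define LLAMA_TRAIN_MAX_NODES 16384

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 128u*1024*1024, /* illustrative scratch size */
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx0 = ggml_init(params);

    /* The graph, including its visited hash table, is allocated inside the
       context, so the hash table has a non-zero size before any expand. */
    struct ggml_cgraph * gf = ggml_new_graph_custom(ctx0, LLAMA_TRAIN_MAX_NODES, true);

    /* Placeholder standing in for the example's inpL tensor. */
    struct ggml_tensor * inpL = ggml_new_tensor_1d(ctx0, GGML_TYPE_F32, 10);
    ggml_build_forward_expand(gf, inpL);

    ggml_free(ctx0);
    return 0;
}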

NawafAlansari (Contributor) commented:

@slaren I just made a PR with a fix for the example.

ggerganov added a commit that referenced this issue Feb 19, 2024
* Fixed the baby-llama issue (see issue #4830)

* minor : fix whitespaces

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Nexesenex pushed a commit to Nexesenex/kobold.cpp that referenced this issue Feb 19, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this issue Mar 13, 2024
github-actions bot added the stale label Mar 20, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 1, 2024

github-actions bot commented Apr 4, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 4, 2024