Running a Vicuna-13B 4-bit model? #771

Closed
manageseverin opened this issue Apr 5, 2023 · 25 comments
Labels: generation quality (Quality of model output), model (Model specific)

Comments

@manageseverin

I found this model: [ggml-vicuna-13b-4bit](https://huggingface.co/eachadea/ggml-vicuna-13b-4bit/tree/main), and judging by their online demo it's very impressive.
I tried to run it with the latest version of llama.cpp - the model loads fine, but as soon as it loads it starts hallucinating and quits by itself.
Do I need to have it converted or something like that?

@KASR
Contributor

KASR commented Apr 5, 2023

Have a look here --> #643

@bhubbb
Contributor

bhubbb commented Apr 5, 2023

I've had the most success with this model with the following patch to the instruct mode.

diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index 453450a..70b4f45 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -152,13 +152,13 @@ int main(int argc, char ** argv) {
     }
 
     // prefix & suffix for instruct mode
-    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
-    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
+    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Human:\n\n", true);
+    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Assistant:\n\n", false);
 
     // in instruct mode, we inject a prefix and a suffix to each input by the user
     if (params.instruct) {
         params.interactive_start = true;
-        params.antiprompt.push_back("### Instruction:\n\n");
+        params.antiprompt.push_back("### Human:\n\n");
     }
 
     // enable interactive mode if reverse prompt or interactive start is specified

And then running the model with the following options. If there are better options, please let me know.

./main \
  --model  ./models/ggml-vicuna-13b-4bit/ggml-vicuna-13b-4bit.bin \
  --color \
  --threads 7 \
  --batch_size 256 \
  --n_predict -1 \
  --top_k 12 \
  --top_p 1 \
  --temp 0.36 \
  --repeat_penalty 1.05 \
  --ctx_size 2048 \
  --instruct \
  --reverse-prompt '### Human:' \
  --file prompts/vicuna.txt

And my prompt file

A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.

Example output

 A chat between a curious human and an artificial intelligence assistant.          
The assistant gives helpful, detailed, and polite answers to the human's questions.
> What NFL team won the Super Bowl in the year Justin Bieber was born?
The NFL team that won the Super Bowl in the year Justin Bieber was born, which is 1994, was the Dallas Cowboys. They defeated the Buffalo Bills in Super Bowl XXVII, which was held on January 31, 1994.
### Human:                                             
> Who won the year after?
The NFL team that won the Super Bowl the year after Justin Bieber was born, which is 1995, was the Dallas Cowboys again. They defeated the Buffalo Bills in Super Bowl XXVIII, which was held on January 30, 1995. The Cowboys became the first team to win back-to-back Super Bowls since the Pittsburgh Steelers did so in the 1970s.
### Human: 

I've made this change to align with FastChat and the roles it uses.

Could someone who knows C better than I do add a prompt suffix flag?
A prompt suffix flag would make it easier to stay compatible with other models in the future.
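For illustration, something like this is what I have in mind (just a rough sketch - the instruct_prefix/instruct_suffix fields and any matching command-line flag are made up here and would still need to be added to gpt_params and the argument parsing):

    // hypothetical sketch only: tokenize configurable strings instead of the
    // hard-coded ones; params.instruct_prefix / params.instruct_suffix do not
    // exist yet and would need to be added in common.h/common.cpp
    const auto inp_pfx = ::llama_tokenize(ctx, params.instruct_prefix, true);
    const auto inp_sfx = ::llama_tokenize(ctx, params.instruct_suffix, false);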

@rabidcopy
Contributor

rabidcopy commented Apr 5, 2023

Vicuna is a pretty strict model in terms of following that ### Human/### Assistant format compared to Alpaca and GPT4All. Less flexible, but fairly impressive in how it mimics ChatGPT responses.

gjmulder added the model and generation quality labels Apr 6, 2023
@jmtatsch

jmtatsch commented Apr 6, 2023

It's extremely slow on my M1 MacBook (unusable), but quite usable on my 4-year-old i7 workstation. And it doesn't work at all on the same workstation inside Docker.

Found #767; adding --mlock solved the slowness issue on the MacBook.
The Docker issue is #537. I simply built my own image tailored to my machine. Works like a charm.

@idontneedonetho

idontneedonetho commented Apr 7, 2023

I've had the most success with this model with the following patch to the instruct mode. […]

Could someone who knows C better than I do add a prompt suffix flag? A prompt suffix flag would make it easier to stay compatible with other models in the future.

Although I'm not proficient in C, I was able to make some modifications to llama.cpp by recompiling main.cpp with the changes, renaming the resulting main.exe to vicuna.exe, and moving it into my main llama.cpp folder. To choose a model, I created a bat file that prompts me to select a model, and if I choose the vicuna model, the bat file runs vicuna.exe instead of main.exe.

I've included the bat file below for reference:

@echo off
setlocal EnableDelayedExpansion

rem list every .bin model in the models folder (skipping the vocab file) and number them
set /a count=0
for %%f in (S:\llama.cpp\models\*.bin) do (
    if /i not "%%~nxf"=="ggml-vocab.bin" (
        set /a count+=1
        set file[!count!]=%%f
        echo !count!. %%f
    )
)

set /p selection=Enter the number of the model you want to use:

set model=!file[%selection%]!

rem if the selected path contains "vicuna", run the patched vicuna.exe with the Vicuna prompt settings
if /I "!model:vicuna=!" neq "!model!" (
    echo Running vicuna.exe with model !model! ...
    title !model!
    S:\llama.cpp\vicuna.exe --model "!model!" --color --threads 8 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --file S:\llama.cpp\prompts\vicuna.txt
) else (
    echo Running main.exe with model !model! ...
    title !model!
    S:\llama.cpp\main.exe --model "!model!" --color --threads 8 -n 1024 --top_k 0 --top_p 0.73 --temp 0.72 --repeat_penalty 1.1 --instruct --file S:\llama.cpp\prompts\alpaca.txt
)

pause

I hope this helps anyone who may be interested in trying this out!

@chakflying

sampling: temp = 0.360000, top_k = 12, top_p = 1.000000, repeat_last_n = 64, repeat_penalty = 1.050000
generate: n_ctx = 2048, n_batch = 256, n_predict = -1, n_keep = 56

Tried these settings and it's really nice! It really has learned the ChatGPT style well, and the 13b model seems to have good underlying knowledge.

 A chat between a curious human and an artificial intelligence assistant.
Assistant is helpful, kind, honest, good at writing, give answers to the human's questions in step by step explanations when it's confident, otherwise it will say it is not sure.
> What are the main themes of the game Fallout 4?
Fallout 4 is a post-apocalyptic role-playing game set in the aftermath of a nuclear war that took place on October 23, 2077. The game takes place in the ruins of Boston and its surrounding areas, known as the Commonwealth. The main theme of Fallout 4 is survival, as players must navigate the dangerous world of the Commonwealth while also dealing with the challenges of building a new community in the midst of destruction.

The game also explores themes of power and control, as players must navigate the political landscape of the Commonwealth and make alliances with various factions in order to gain power and control over the region. Additionally, the game deals with issues of morality, as players are forced to make difficult choices that can have a significant impact on the world around them.

Overall, Fallout 4 is a game that explores the challenges of survival, power, and morality in a post-apocalyptic world. The game's immersive setting and complex storylines make it a favorite among fans of the series, and its themes of survival and morality are sure to keep players engaged for hours on end.

But there is a problem: it doesn't seem to stop by itself; it will keep generating the next ### Human line and continue with another response.

@ai2p

ai2p commented Apr 7, 2023

I've been able to compile the latest standard llama.cpp with CMake under Windows 10, then run ggml-vicuna-7b-4bit-rev1.bin, and even ggml-vicuna-13b-4bit-rev1.bin, with this command line (assuming your .bin is in the same folder as main.exe):

main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vicuna-7b-4bit-rev1.bin

And this one works even faster on my 8-core CPU:

main --color --threads 7 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --model ggml-vicuna-13b-4bit-rev1.bin

It works on a Windows laptop with 16 GB RAM and looks almost like ChatGPT! (Slower, of course, but roughly at the speed a human types!) I agree that it may be the best LLM to run locally!

And it seems that it can write much longer and more correct program code than gpt4all! It's just amazing!

But sometimes, after a few answers, it just freezes forever while continuing to load the CPU. Has anyone else noticed this? Why might that be?

@SlyEcho
Sponsor Collaborator

SlyEcho commented Apr 7, 2023

But sometimes, after a few answers, it just freezes forever while continuing to load the CPU. Has anyone else noticed this? Why might that be?

Context swap. The context fills up, and then the first half of it is deleted to make more room, but that means the whole context has to be re-evaluated to catch up. The OpenBLAS option is supposed to accelerate this, but I don't know how easy it is to get working on Windows; vcpkg seems to have some BLAS packages.
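Roughly, the swap works like this (a simplified sketch, not the exact llama.cpp code); re-evaluating the kept tail is what causes the long pause:

#include <vector>

// simplified illustration of context swapping: keep the first n_keep tokens,
// drop the older half of the rest, and return what has to be re-evaluated
// together with the kept prefix
std::vector<int> swap_context(const std::vector<int> & tokens, int n_ctx, int n_keep) {
    if ((int) tokens.size() <= n_ctx) {
        return tokens; // still fits in the context window, nothing to do
    }
    const int n_left = (int) tokens.size() - n_keep;
    std::vector<int> out(tokens.begin(), tokens.begin() + n_keep);   // kept prefix
    out.insert(out.end(), tokens.end() - n_left / 2, tokens.end());  // most recent half
    // the tokens after the prefix now sit at new positions, so their KV cache
    // entries are invalid and the model has to evaluate them again
    return out;
}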

@ai2p

ai2p commented Apr 7, 2023

Context swap. The context fills up, and then the first half of it is deleted to make more room, but that means the whole context has to be re-evaluated to catch up.

So it's just trying to compress the overfilled context so that the conversation can continue without losing any important details? And that's normal, and I should just have a cup of tea in the meantime instead of restarting it as I did? :-)

@SlyEcho
Sponsor Collaborator

SlyEcho commented Apr 7, 2023

You can use --keep to keep some part of the initial prompt (-1 for all) or use a smaller context. You can also try different --batch_size values, because this determines the sizes of the matrices used in this operation.

@Crimsonfart

Can someone explain to me what the difference between these two options is? (Both options work fine.)

  • Option 1: main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --color --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vicuna-7b-4bit-rev1.bin
  • Option 2: main --color --threads 7 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --model ggml-vicuna-13b-4bit-rev1.bin

And how many threads should one use? I have an i5-11400F, which has 6 cores and supports 12 threads.

@ai2p

ai2p commented Apr 8, 2023

Can someone explain to me what the difference between these two options is?

They're mostly the same except for the temperature, the other sampling parameters, and the number of threads.

About temperature, read here. As I understand it, the higher the temperature, the more stochastic/chaotic the choice of words; the lower the temperature, the more deterministic the result, and at temperature = 0 the result will always be the same. So you can tune that parameter for your application: if you're writing code, temp = 0 may be better; if you're writing a poem, temp = 1 or even higher may be better. (If I'm wrong, correct me!)
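As a rough illustration of why temp = 0 is fully deterministic (a toy sampler I sketched, not llama.cpp's actual sampling code):

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// toy temperature sampler: logits are divided by temp before the softmax, so a low
// temp sharpens the distribution and temp = 0 falls back to a plain argmax
int sample_token(const std::vector<float> & logits, float temp, std::mt19937 & rng) {
    if (temp <= 0.0f) {
        // deterministic: always pick the highest-scoring token
        return (int) (std::max_element(logits.begin(), logits.end()) - logits.begin());
    }
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> weights(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        weights[i] = std::exp((logits[i] - max_logit) / temp); // softmax numerator
    }
    std::discrete_distribution<int> dist(weights.begin(), weights.end()); // normalizes internally
    return dist(rng);
}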

As for threads, my intuition is that you can use as many threads as your CPU supports, minus 1 or 2 (so that other apps and the system don't hang). I think the bottleneck is not the CPU but RAM throughput.

If anyone has another opinion, please correct me!

@Sergio438

How do you download the model - like a regular app?

@multimediaconverter

multimediaconverter commented Apr 9, 2023

But there is a problem: it doesn't seem to stop by itself; it will keep generating the next ### Human line and continue with another response.

I use the following trick to partly overcome this problem:

### Human: Write a very long answer (about 1000 words). YOUR QUESTION WITH YOUR TEXT HERE
### Assistant:

@jmtatsch

jmtatsch commented Apr 9, 2023

There is a rev1 of the Vicuna model with some kind of stop fix on 🤗. Maybe that solves your issue?

@multimediaconverter

multimediaconverter commented Apr 9, 2023

Yes, I'm talking about rev1, so do we need to change llama_token_eos() or what?
Alternative solution: 9fd062f [UNVERIFIED]
I've recompiled main.cpp with this patch, and it works well enough for me (using the parameter: --stop "### Human:").

@SlyEcho
Sponsor Collaborator

SlyEcho commented Apr 9, 2023

I've recompiled main.cpp with this patch, and it works well enough for me (using the parameter: --stop "### Human:").

This is just the same as the --antiprompt option. The patch above has \n\n at the end, but I think it would be better without them. You can also have multiple antiprompts.
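For example (a fragment for the same place in main.cpp as the patches above; adjust the strings to whatever the model actually emits):

    // several reverse prompts can be registered; generation pauses when any of them appears
    params.antiprompt.push_back("### Human:");
    params.antiprompt.push_back("USER:"); // e.g. a second marker for another prompt format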

@multimediaconverter

multimediaconverter commented Apr 10, 2023

This is just the same as the --antiprompt option. The patch above has \n\n at the end, but I think it would be better without them. You can also have multiple antiprompts.

Yes, that's right. Thank you. You mean the --reverse-prompt (-r) option.

I use a prompt file to start generation in this way:
main.exe -r "### Human:" -c 2048 --temp 0.36 -n -1 --ignore-eos --repeat_penalty 1.3 -m ggml-vicuna-7b-4bit-rev1.bin -f input.txt > output.txt

Content of input.txt file:

hello
### Assistant:

The -r option switches the program into interactive mode, so it does not exit at the end and keeps waiting.

Therefore I made the following quick fix for vicuna:

    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Human:\n", true);
    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Assistant:\n", false);

    // in instruct mode, we inject a prefix and a suffix to each input by the user
    if (params.instruct) {
        params.interactive_start = true;
        params.antiprompt.push_back("### Human:");
    }

and

                is_antiprompt = false;
                // Check if each of the reverse prompts appears at the end of the output.
                for (std::string & antiprompt : params.antiprompt) {
                    if (last_output.find(antiprompt.c_str(), last_output.length() - antiprompt.length(), antiprompt.length()) != std::string::npos) {
                        is_interacting = true;
                        is_antiprompt = true;
                        set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
                        fflush(stdout);
                        if (!params.instruct) exit(0);
                        break;
                    }
                }
            }

@ai2p

ai2p commented Apr 11, 2023

Therefore I made the following quick fix for vicuna:

I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:'.)

@multimediaconverter

multimediaconverter commented Apr 12, 2023

I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:'.)

You can try my additional quick hack (it removes "### Human:" from the end of each response):

        // display text (buffer the tokens and suppress a trailing "\n### Human:" marker)
        static std::string tmp;
        static std::string ap = "\n### Human:";
        if (!input_noecho) {
            for (auto id : embd) {
                tmp += llama_token_to_str(ctx, id);
                size_t tmplen = tmp.length() > ap.length() ? ap.length() : tmp.length();
                if (strncmp(tmp.c_str(), ap.c_str(), tmplen)) {
                    // buffered text is no longer a prefix of the marker: flush it
                    printf("%s", tmp.c_str());
                    tmp = "";
                } else if (tmplen == ap.length()) {
                    // the full marker matched: drop it instead of printing
                    tmp = "";
                }
            }
            fflush(stdout);
        }

@jxy
Contributor

jxy commented Apr 13, 2023

The vicuna v1.1 model used a different setup. See https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/fastchat/conversation.py#L115-L124 and https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/fastchat/conversation.py#L37-L44

IIUC, the prompt, as a Bourne shell string, is "$system USER: $instruction ASSISTANT:".

Their docs say this: https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/docs/weights_version.md#example-prompt-weight-v11

I think the </s> is actually the EOS token, not a verbatim string. Though I'm not sure whether we need to manually append it to the end of the assistant's response or not.
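If it does turn out to be needed, a minimal sketch (untested; assuming embd_inp is the input-token vector in main.cpp) would be to push the token id rather than the literal string:

    // append the real EOS token id after the assistant's response instead of the
    // verbatim "</s>" text; llama_token_eos() is the helper mentioned earlier in this thread
    embd_inp.push_back(llama_token_eos());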

@ai2p

ai2p commented Apr 14, 2023

The vicuna v1.1 model used a different setup.

Uhhh... Such a mess... We definitely need some standardization for people training LLMs! At least for tokens such as assistant/human/EOS it should be possible, because these are just technicalities not directly connected with LLM functionality...

Or, on the software side, there should be an easy way to adapt any such token without editing C++ code...

@chakflying

Since #863 may not happen soon, I tested this and it works on 1.1:
main.cpp#160

    // prefix & suffix for instruct mode
    const auto inp_pfx = ::llama_tokenize(ctx, "\nUSER:", true);
    const auto inp_sfx = ::llama_tokenize(ctx, "\nASSISTANT:", false);

    // in instruct mode, we inject a prefix and a suffix to each input by the user
    if (params.instruct) {
        params.interactive_start = true;
        params.antiprompt.push_back("USER:");
    }

@jaeqy

jaeqy commented Apr 19, 2023

Therefore I made the following quick fix for vicuna:

I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:'.)

These are my settings, and they run well on my Mac (only '>' instead of '### Human:'):
./main \
  --model ./models/13B/ggml-vicuna-13b-4bit-rev1.bin \
  --color -i -r "User:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 \
  --file prompts/chat-with-vicuna.txt

@immorBen

Tried these settings and it's really nice! […] But there is a problem: it doesn't seem to stop by itself; it will keep generating the next ### Human line and continue with another response.

@chakflying I have the same issue when using GPT4All with this model; after starting my first prompt, I lost control over it.
