Running a Vicuna-13B 4-bit model? #771

Closed
manageseverin opened this issue Apr 5, 2023 · 25 comments
Labels: generation quality (Quality of model output), model (Model specific)

Comments

@manageseverin

I found this model: [ggml-vicuna-13b-4bit](https://huggingface.co/eachadea/ggml-vicuna-13b-4bit/tree/main), and judging by their online demo it's very impressive.
I tried to run it with the latest version of llama.cpp - the model loads fine, but as soon as it loads it starts hallucinating and quits by itself.
Do I need to have it converted or something like that?

@KASR
Contributor

KASR commented Apr 5, 2023

Have a look here --> #643

@bhubbb
Contributor

bhubbb commented Apr 5, 2023

I've had the most success with this model with the following patch to the instruct mode.

diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index 453450a..70b4f45 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -152,13 +152,13 @@ int main(int argc, char ** argv) {
     }
 
     // prefix & suffix for instruct mode
-    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
-    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
+    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Human:\n\n", true);
+    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Assistant:\n\n", false);
 
     // in instruct mode, we inject a prefix and a suffix to each input by the user
     if (params.instruct) {
         params.interactive_start = true;
-        params.antiprompt.push_back("### Instruction:\n\n");
+        params.antiprompt.push_back("### Human:\n\n");
     }
 
     // enable interactive mode if reverse prompt or interactive start is specified

And then running the model with the following options. If there are better options, please let me know.

./main \
  --model  ./models/ggml-vicuna-13b-4bit/ggml-vicuna-13b-4bit.bin \
  --color \
  --threads 7 \
  --batch_size 256 \
  --n_predict -1 \
  --top_k 12 \
  --top_p 1 \
  --temp 0.36 \
  --repeat_penalty 1.05 \
  --ctx_size 2048 \
  --instruct \
  --reverse-prompt '### Human:' \
  --file prompts/vicuna.txt

And my prompt file

A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.

Example output

 A chat between a curious human and an artificial intelligence assistant.          
The assistant gives helpful, detailed, and polite answers to the human's questions.
> What NFL team won the Super Bowl in the year Justin Bieber was born?
The NFL team that won the Super Bowl in the year Justin Bieber was born, which is 1994, was the Dallas Cowboys. They defeated the Buffalo Bills in Super Bowl XXVII, which was held on January 31, 1994.
### Human:                                             
> Who won the year after?
The NFL team that won the Super Bowl the year after Justin Bieber was born, which is 1995, was the Dallas Cowboys again. They defeated the Buffalo Bills in Super Bowl XXVIII, which was held on January 30, 1995. The Cowboys became the first team to win back-to-back Super Bowls since the Pittsburgh Steelers did so in the 1970s.
### Human: 

I've made this change to align with FastChat and the roles it uses.

Could someone who knows C better than I do add a prompt suffix flag?
A prompt suffix flag would make it easier to stay compatible with other models in the future.
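For illustration, something like this is what I have in mind (just a rough sketch - the instruct_prefix/instruct_suffix fields and any matching command-line flag are made up here and would still need to be added to gpt_params and the argument parsing):

    // hypothetical sketch only: tokenize configurable strings instead of the
    // hard-coded ones; params.instruct_prefix / params.instruct_suffix do not
    // exist yet and would need to be added in common.h/common.cpp
    const auto inp_pfx = ::llama_tokenize(ctx, params.instruct_prefix, true);
    const auto inp_sfx = ::llama_tokenize(ctx, params.instruct_suffix, false);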

@rabidcopy
Contributor

rabidcopy commented Apr 5, 2023

Vicuna is a pretty strict model in terms of following that ### Human/### Assistant format compared to Alpaca and GPT4All. Less flexible, but fairly impressive in how it mimics ChatGPT responses.

gjmulder added the model and generation quality labels Apr 6, 2023
@jmtatsch

jmtatsch commented Apr 6, 2023

It's extremely slow on my M1 MacBook (unusable), but quite usable on my 4-year-old i7 workstation. And it doesn't work at all on the same workstation inside Docker.

Found #767; adding --mlock solved the slowness issue on the MacBook.
The Docker issue is #537. I simply built my own image tailored to my machine. Works like a charm.

@idontneedonetho

idontneedonetho commented Apr 7, 2023

I've had the most success with this model with the following patch to the instruct mode. […]

Could someone who knows C better than I do add a prompt suffix flag? A prompt suffix flag would make it easier to stay compatible with other models in the future.

Although I'm not proficient in C, I was able to make some modifications to llama.cpp by recompiling main.cpp with the changes, renaming the resulting main.exe to vicuna.exe, and moving it into my main llama.cpp folder. To choose a model, I created a bat file that prompts me to select a model, and if I choose the vicuna model, the bat file runs vicuna.exe instead of main.exe.

I've included the bat file below for reference:

@echo off
setlocal EnableDelayedExpansion

rem list every .bin model in the models folder (skipping the vocab file) and number them
set /a count=0
for %%f in (S:\llama.cpp\models\*.bin) do (
    if /i not "%%~nxf"=="ggml-vocab.bin" (
        set /a count+=1
        set file[!count!]=%%f
        echo !count!. %%f
    )
)

set /p selection=Enter the number of the model you want to use:

set model=!file[%selection%]!

rem if the selected path contains "vicuna", run the patched vicuna.exe with the Vicuna prompt settings
if /I "!model:vicuna=!" neq "!model!" (
    echo Running vicuna.exe with model !model! ...
    title !model!
    S:\llama.cpp\vicuna.exe --model "!model!" --color --threads 8 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --file S:\llama.cpp\prompts\vicuna.txt
) else (
    echo Running main.exe with model !model! ...
    title !model!
    S:\llama.cpp\main.exe --model "!model!" --color --threads 8 -n 1024 --top_k 0 --top_p 0.73 --temp 0.72 --repeat_penalty 1.1 --instruct --file S:\llama.cpp\prompts\alpaca.txt
)

pause

I hope this helps anyone who may be interested in trying this out!

@chakflying

sampling: temp = 0.360000, top_k = 12, top_p = 1.000000, repeat_last_n = 64, repeat_penalty = 1.050000
generate: n_ctx = 2048, n_batch = 256, n_predict = -1, n_keep = 56

Tried these settings and it's really nice! It really has learned the ChatGPT style well, and the 13b model seems to have good underlying knowledge.

 A chat between a curious human and an artificial intelligence assistant.
Assistant is helpful, kind, honest, good at writing, give answers to the human's questions in step by step explanations when it's confident, otherwise it will say it is not sure.
> What are the main themes of the game Fallout 4?
Fallout 4 is a post-apocalyptic role-playing game set in the aftermath of a nuclear war that took place on October 23, 2077. The game takes place in the ruins of Boston and its surrounding areas, known as the Commonwealth. The main theme of Fallout 4 is survival, as players must navigate the dangerous world of the Commonwealth while also dealing with the challenges of building a new community in the midst of destruction.

The game also explores themes of power and control, as players must navigate the political landscape of the Commonwealth and make alliances with various factions in order to gain power and control over the region. Additionally, the game deals with issues of morality, as players are forced to make difficult choices that can have a significant impact on the world around them.

Overall, Fallout 4 is a game that explores the challenges of survival, power, and morality in a post-apocalyptic world. The game's immersive setting and complex storylines make it a favorite among fans of the series, and its themes of survival and morality are sure to keep players engaged for hours on end.

But there is a problem: it doesn't seem to stop by itself; it will keep generating the next ### Human line and continue with another response.

@ai2p

ai2p commented Apr 7, 2023

I've been able to compile the latest standard llama.cpp with CMake under Windows 10, then run ggml-vicuna-7b-4bit-rev1.bin, and even ggml-vicuna-13b-4bit-rev1.bin, with this command line (assuming your .bin is in the same folder as main.exe):

main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vicuna-7b-4bit-rev1.bin

And this one works even faster on my 8-core CPU:

main --color --threads 7 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --model ggml-vicuna-13b-4bit-rev1.bin

It works on a Windows laptop with 16 GB RAM and looks almost like ChatGPT! (Slower, of course, but roughly at the speed a human types!) I agree that it may be the best LLM to run locally!

And it seems that it can write much longer and more correct program code than gpt4all! It's just amazing!

But sometimes, after a few answers, it just freezes forever while continuing to load the CPU. Has anyone else noticed this? Why might that be?

@SlyEcho
Sponsor Collaborator

SlyEcho commented Apr 7, 2023

But sometimes, after a few answers, it just freezes forever while continuing to load the CPU. Has anyone else noticed this? Why might that be?

Context swap. The context fills up, and then the first half of it is deleted to make more room, but that means the whole context has to be re-evaluated to catch up. The OpenBLAS option is supposed to accelerate this, but I don't know how easy it is to get working on Windows; vcpkg seems to have some BLAS packages.
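Roughly, the swap works like this (a simplified sketch, not the exact llama.cpp code); re-evaluating the kept tail is what causes the long pause:

#include <vector>

// simplified illustration of context swapping: keep the first n_keep tokens,
// drop the older half of the rest, and return what has to be re-evaluated
// together with the kept prefix
std::vector<int> swap_context(const std::vector<int> & tokens, int n_ctx, int n_keep) {
    if ((int) tokens.size() <= n_ctx) {
        return tokens; // still fits in the context window, nothing to do
    }
    const int n_left = (int) tokens.size() - n_keep;
    std::vector<int> out(tokens.begin(), tokens.begin() + n_keep);   // kept prefix
    out.insert(out.end(), tokens.end() - n_left / 2, tokens.end());  // most recent half
    // the tokens after the prefix now sit at new positions, so their KV cache
    // entries are invalid and the model has to evaluate them again
    return out;
}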

@ai2p

ai2p commented Apr 7, 2023

Context swap. The context fills up, and then the first half of it is deleted to make more room, but that means the whole context has to be re-evaluated to catch up.

So it's just trying to compress the overfilled context so that the conversation can continue without losing any important details? And that's normal, and I should just have a cup of tea in the meantime instead of restarting it as I did? :-)

@SlyEcho
Sponsor Collaborator

SlyEcho commented Apr 7, 2023

You can use --keep to keep some part of the initial prompt (-1 for all) or use a smaller context. You can also try different --batch_size values, because this determines the sizes of the matrices used in this operation.

@Crimsonfart

Can someone explain to me what the difference between these two options is? (Both options work fine.)

  • Option 1: main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --color --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vicuna-7b-4bit-rev1.bin
  • Option 2: main --color --threads 7 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --model ggml-vicuna-13b-4bit-rev1.bin

And how many threads should one use? I have an i5-11400F, which has 6 cores and supports 12 threads.

@ai2p

ai2p commented Apr 8, 2023

Can someone explain to me what the difference between these two options is?

They're mostly the same except for the temperature, the other sampling parameters, and the number of threads.

About temperature, read here. As I understand it, the higher the temperature, the more stochastic/chaotic the choice of words; the lower the temperature, the more deterministic the result, and at temperature = 0 the result will always be the same. So you can tune that parameter for your application: if you're writing code, temp = 0 may be better; if you're writing a poem, temp = 1 or even higher may be better. (If I'm wrong, correct me!)
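As a rough illustration of why temp = 0 is fully deterministic (a toy sampler I sketched, not llama.cpp's actual sampling code):

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// toy temperature sampler: logits are divided by temp before the softmax, so a low
// temp sharpens the distribution and temp = 0 falls back to a plain argmax
int sample_token(const std::vector<float> & logits, float temp, std::mt19937 & rng) {
    if (temp <= 0.0f) {
        // deterministic: always pick the highest-scoring token
        return (int) (std::max_element(logits.begin(), logits.end()) - logits.begin());
    }
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> weights(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        weights[i] = std::exp((logits[i] - max_logit) / temp); // softmax numerator
    }
    std::discrete_distribution<int> dist(weights.begin(), weights.end()); // normalizes internally
    return dist(rng);
}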

As for threads, my intuition is that you can use as many threads as your CPU supports, minus 1 or 2 (so that other apps and the system don't hang). I think the bottleneck is not the CPU but RAM throughput.

If anyone has another opinion, please correct me!

@Sergio438

How do you download the model - like a regular app?

@multimediaconverter

multimediaconverter commented Apr 9, 2023

But there is a problem: it doesn't seem to stop by itself; it will keep generating the next ### Human line and continue with another response.

I use the following trick to partly overcome this problem:

### Human: Write a very long answer (about 1000 words). YOUR QUESTION WITH YOUR TEXT HERE
### Assistant:

@jmtatsch

jmtatsch commented Apr 9, 2023

There is a rev1 of the Vicuna model with some kind of stop fix on 🤗. Maybe that solves your issue?

@multimediaconverter

multimediaconverter commented Apr 9, 2023

Yes, I'm talking about rev1, so do we need to change llama_token_eos() or what?
Alternative solution: 9fd062f [UNVERIFIED]
I've recompiled main.cpp with this patch, and it works well enough for me (using the parameter: --stop "### Human:").

@SlyEcho
Sponsor Collaborator

SlyEcho commented Apr 9, 2023

I've recompiled main.cpp with this patch, and it works well enough for me (using the parameter: --stop "### Human:").

This is just the same as the --antiprompt option. The patch above has \n\n at the end, but I think it would be better without them. You can also have multiple antiprompts.
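For example (a fragment for the same place in main.cpp as the patches above; adjust the strings to whatever the model actually emits):

    // several reverse prompts can be registered; generation pauses when any of them appears
    params.antiprompt.push_back("### Human:");
    params.antiprompt.push_back("USER:"); // e.g. a second marker for another prompt format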

@multimediaconverter

multimediaconverter commented Apr 10, 2023

This is just the same as the --antiprompt option. The patch above has \n\n at the end, but I think it would be better without them. You can also have multiple antiprompts.

Yes, that's right. Thank you. You mean the --reverse-prompt (-r) option.

I use a prompt file to start generation in this way:
main.exe -r "### Human:" -c 2048 --temp 0.36 -n -1 --ignore-eos --repeat_penalty 1.3 -m ggml-vicuna-7b-4bit-rev1.bin -f input.txt > output.txt

Content of input.txt file:

hello
### Assistant:

The -r option switches the program into interactive mode, so it does not exit at the end and keeps waiting.

Therefore I made the following quick fix for vicuna:

    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Human:\n", true);
    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Assistant:\n", false);

    // in instruct mode, we inject a prefix and a suffix to each input by the user
    if (params.instruct) {
        params.interactive_start = true;
        params.antiprompt.push_back("### Human:");
    }

and

                is_antiprompt = false;
                // Check if each of the reverse prompts appears at the end of the output.
                for (std::string & antiprompt : params.antiprompt) {
                    if (last_output.find(antiprompt.c_str(), last_output.length() - antiprompt.length(), antiprompt.length()) != std::string::npos) {
                        is_interacting = true;
                        is_antiprompt = true;
                        set_console_color(con_st, CONSOLE_COLOR_USER_INPUT);
                        fflush(stdout);
                        if (!params.instruct) exit(0);
                        break;
                    }
                }
            }

@ai2p

ai2p commented Apr 11, 2023

Therefore I made the following quick fix for vicuna:

I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:'.)

@multimediaconverter

multimediaconverter commented Apr 12, 2023

I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:'.)

You can try my additional quick hack (it removes "### Human:" from the end of each response):

        // display text (buffer the tokens and suppress a trailing "\n### Human:" marker)
        static std::string tmp;
        static std::string ap = "\n### Human:";
        if (!input_noecho) {
            for (auto id : embd) {
                tmp += llama_token_to_str(ctx, id);
                size_t tmplen = tmp.length() > ap.length() ? ap.length() : tmp.length();
                if (strncmp(tmp.c_str(), ap.c_str(), tmplen)) {
                    // buffered text is no longer a prefix of the marker: flush it
                    printf("%s", tmp.c_str());
                    tmp = "";
                } else if (tmplen == ap.length()) {
                    // the full marker matched: drop it instead of printing
                    tmp = "";
                }
            }
            fflush(stdout);
        }

@jxy
Contributor

jxy commented Apr 13, 2023

The vicuna v1.1 model used a different setup. See https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/fastchat/conversation.py#L115-L124 and https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/fastchat/conversation.py#L37-L44

IIUC, the prompt, as a Bourne shell string, is "$system USER: $instruction ASSISTANT:".

Their docs say this: https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/docs/weights_version.md#example-prompt-weight-v11

I think the </s> is actually the EOS token, not a verbatim string. Though I'm not sure whether we need to manually append it to the end of the assistant's response or not.
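If it does turn out to be needed, a minimal sketch (untested; assuming embd_inp is the input-token vector in main.cpp) would be to push the token id rather than the literal string:

    // append the real EOS token id after the assistant's response instead of the
    // verbatim "</s>" text; llama_token_eos() is the helper mentioned earlier in this thread
    embd_inp.push_back(llama_token_eos());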

@ai2p

ai2p commented Apr 14, 2023

The vicuna v1.1 model used a different setup.

Uhhh... Such a mess... We definitely need some standardization for people training LLMs! At least for tokens such as assistant/human/EOS it should be possible, because these are just technicalities not directly connected with LLM functionality...

Or, on the software side, there should be an easy way to adapt any such token without editing C++ code...

@chakflying

Since #863 may not happen soon, I tested this and it works on 1.1:
main.cpp#160

    // prefix & suffix for instruct mode
    const auto inp_pfx = ::llama_tokenize(ctx, "\nUSER:", true);
    const auto inp_sfx = ::llama_tokenize(ctx, "\nASSISTANT:", false);

    // in instruct mode, we inject a prefix and a suffix to each input by the user
    if (params.instruct) {
        params.interactive_start = true;
        params.antiprompt.push_back("USER:");
    }

@jaeqy

jaeqy commented Apr 19, 2023

Therefore I made the following quick fix for vicuna:

I've repeated your modifications, but nothing changes - it still shows "### Human:" each time... Does anyone know how to get clean answers? (So that only '>' is shown instead of '### Human:'.)

These are my settings, and they run well on my Mac (only '>' instead of '### Human:'):
./main \
  --model ./models/13B/ggml-vicuna-13b-4bit-rev1.bin \
  --color -i -r "User:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 \
  --file prompts/chat-with-vicuna.txt

@immorBen

Tried these settings and it's really nice! […] But there is a problem: it doesn't seem to stop by itself; it will keep generating the next ### Human line and continue with another response.

@chakflying I have the same issue when using GPT4All with this model; after starting my first prompt, I lost control over it.
