ctransformers: another attempt #3313

Merged 6 commits into oobabooga:main on Aug 11, 2023

Conversation

@cal066 (Contributor) commented Jul 26, 2023

Generalized ctransformers support, based on #2892. Credits to @randoentity.

Marked RFC since it moves some control structures around to allow extensions to override them more easily.
Select Model Type "None" to let ctransformers guess the model type from the config.json file in the model directory.

You may need to install ctransformers manually at this time. Note that the latest prebuilt binaries use CUDA 12.x instead of 11.x; to build from source against your local CUDA with cuBLAS support:

```
env CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers
```

Edit: Now using the ctransformers build from @jllllll.
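
For anyone trying this out, here is a minimal standalone sketch of loading and running a GGML model through the ctransformers Python API (the model path mirrors the falcon-7b example used later in this thread; the prompt and gpu_layers value are arbitrary):

```python
from ctransformers import AutoModelForCausalLM

# model_type can be omitted when a config.json with a "model_type" key
# sits next to the weights; gpu_layers controls CUDA offloading.
llm = AutoModelForCausalLM.from_pretrained(
    "models/falcon-7b-instruct.ggccv1.q4_0.bin",
    model_type="falcon",
    gpu_layers=50,
)

print(llm("Write a haiku about GPUs:", max_new_tokens=64))
```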

@jllllll (Contributor) commented Jul 31, 2023

I have built ctransformers wheels for various CUDA versions and linked the CUDA 11.7 one here:
#3357

@cal066 (Contributor, Author) commented Jul 31, 2023

@jllllll Thanks, updated to use your build.

@cal066 changed the title from "WIP: ctransformers: another attempt" to "ctransformers: another attempt" on Jul 31, 2023
@lppllppl920 commented

It looks like unloading a CtransformersModel does not release the GPU memory it occupies.

@oobabooga (Owner) commented

I have some basic questions about ctransformers:

  1. How are the ggmls generated? For instance, if I want to do a q4_0 or q4_K_M for GPT-J or Falcon starting from the model in Hugging Face format, how do I do it?
  2. What is the main use case? What models are worth using on ctransformers the most?

@jllllll (Contributor) commented Aug 2, 2023

> I have some basic questions about ctransformers:
>
> 1. How are the ggmls generated? For instance, if I want to do a q4_0 or q4_K_M for GPT-J or Falcon starting from the model in Hugging Face format, how do I do it?
> 2. What is the main use case? What models are worth using on ctransformers the most?

Falcon models can be converted and quantized using the conversion script and binaries provided by ggllm.cpp.
All other models use the scripts and binaries provided by the ggml library:
https://github.com/ggerganov/ggml/tree/master/examples
LLaMa models still use llama.cpp.

The primary use case that ctransformers provides is access to non-LLaMa models. Its llama.cpp bindings don't provide anything that llama-cpp-python doesn't already, as far as I can tell.

Starcoder, Falcon, and MPT ggml models are what most people want out of ctransformers. There are also Replit, Dolly, and the GPT models, but I haven't seen much interest in them lately.

It is also worth mentioning that work is being done to unify all of the various ggml implementations into a single file format:
ggerganov/ggml#220
ggerganov/ggml#302
Eventually, there will be no need for multiple ggml backends once that work is finished.

@cal066 (Contributor, Author) commented Aug 2, 2023

> It looks like unloading a CtransformersModel does not release the GPU memory it occupies.

@lppllppl920
In my experience this seems to be the case for exllama/llama.cpp as well; I have to restart the webui to clean it up. I'll need to look into how unloading works.

@cal066 (Contributor, Author) commented Aug 2, 2023

It seems a destructor already exists. I'm not familiar enough with the ctransformers C/C++ ggml backend to tell if there's a memory leak somewhere.
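
For anyone digging into this, a rough sketch of the explicit cleanup an unload path can attempt (the shared-state attribute names are assumptions for illustration, not the PR's actual code):

```python
import gc

import torch


def unload_model(shared):
    """Drop the last references to the model and force collection."""
    # The ggml backend frees its buffers in the model's destructor, so any
    # lingering reference (UI state, caches) keeps the GPU memory alive.
    shared.model = None
    shared.tokenizer = None
    gc.collect()
    # empty_cache() only returns memory that torch itself cached; memory
    # allocated directly by the ggml CUDA backend is only freed once the
    # destructor has actually run.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```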

@cal066 force-pushed the ctransformers branch 2 times, most recently from 5ceac52 to e954b84 on August 7, 2023
@ye7iaserag (Contributor) commented
I checked the PR and it doesn't mention WizardLM in the models list; I thought ctransformers supported WizardLM...

@jllllll (Contributor) commented Aug 10, 2023

WizardLM is llama/llama2.

@oobabooga (Owner) left a comment

2 more questions:

  1. I assume this PR does the same thing as #2364 (Add starcoder and starchat ggml support) and #2892 (Add support for starcoder ggml and similar), so if it gets merged, both can be closed. Is that correct?

  2. The README needs to be updated to briefly mention ctransformers support, and it would be nice to have a short entry under docs/ mentioning what it is and giving a couple of examples of models that it can load.

modules/ui_model_menu.py (review thread, resolved)
```diff
@@ -89,8 +97,12 @@ def _generate_reply(question, state, stopping_strings=None, is_chat=False):
     yield reply


+encode_llama_prompts = ['LlamaCppModel', 'RWKVModel', 'CtransformersModel']
```
@oobabooga (Owner) commented

I don't like these variables floating here and think that they should be removed.

@cal066 (Contributor, Author) replied

I hate probing by class names, but I have put them back for now. I'll try to refactor them next time.
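
For context, the pattern being debated looks roughly like this (a simplified sketch, not the PR's exact code):

```python
# Models whose prompts are passed through as plain strings rather than
# being tokenized by the Hugging Face tokenizer.
encode_llama_prompts = ['LlamaCppModel', 'RWKVModel', 'CtransformersModel']


def needs_string_prompt(model) -> bool:
    # Probing by class name keeps the modules decoupled, but it breaks
    # silently if a class is renamed; a shared base class plus an
    # isinstance() check would be the more robust refactor.
    return model.__class__.__name__ in encode_llama_prompts
```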

cal066 added a commit: "Generalized ctransformers based on oobabooga#2892. Credits to randoentity."
@cal066 (Contributor, Author) commented Aug 11, 2023

@oobabooga Yes, this was based on #2892 from @randoentity.

@oobabooga merged commit 7a4fcee into oobabooga:main on Aug 11, 2023
@oobabooga (Owner) commented

Looks good now. It's very barebones but it works with falcon-7b:

```
python server.py --model models/falcon-7b-instruct.ggccv1.q4_0.bin --loader ctransformers --model_type "falcon" --n-gpu-layers 10000
```

I have added @randoentity as a coauthor. Before the merge I also made several small adjustments.

Thank you for your work on this PR.

@cal066 (Contributor, Author) commented Aug 11, 2023

Awesome, thanks.

@randoentity (Contributor) commented

@oobabooga thanks for mentioning me, but most of the work was by @s-kostyaev :) I just did a rebase and a minor refactor.

@randoentity (Contributor) commented

Thanks for the work here, @cal066! I missed your PR, but I did see that GPU offloading was added upstream last week and wanted to bring it here, but ${life}. I'm awestruck by how much momentum this project has. 🚀

@oobabooga (Owner) commented

Oops, it looks like I messed up then; I should have credited @s-kostyaev as well. Sorry!

I have added one more parameter for ctransformers in 28c8df3. I find it weird that seed is both a loading parameter and a generation parameter, and the same goes for threads. Also, truncation seems not to be implemented, at least not in a customizable way. If anyone can think of ways to improve the integration with ctransformers, feel free to submit new PRs.
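
To illustrate the duplication (a sketch based on the ctransformers Python API; the path and values are arbitrary):

```python
from ctransformers import AutoModelForCausalLM

# seed and threads can be set at load time, where they land in the
# model's config...
llm = AutoModelForCausalLM.from_pretrained(
    "models/falcon-7b-instruct.ggccv1.q4_0.bin",
    model_type="falcon",
    seed=42,
    threads=8,
)

# ...and can be passed again per generation call, overriding the config
# values for that call.
text = llm("Once upon a time", max_new_tokens=32, seed=42, threads=8)
```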

@cal066 (Contributor, Author) commented Aug 12, 2023

@oobabooga I added those according to the default parameters I found in the ctransformers Python code. The duplicated seed probably isn't needed: it tries to take it from the generate call if possible and then falls back to the config option, and likewise for threads. I have created #3543 to resolve this.

@lppllppl920 commented
@cal066 Any thoughts on changing infer_loader so that the ctransformers loader can be pre-selected when a certain model name is chosen?

@cal066 (Contributor, Author) commented Aug 13, 2023

@lppllppl920 I just tried TheBloke/airoboros-mpt-30b-gpt4-1p4-GGML, and it's not quite reliable. I can make it select ctransformers, but the 'None' model_type still needs to be manually set to 'mpt'. 'None' is equivalent to auto-detect, but that requires a config.json in the model directory, something like this.

infer_loader doesn't allow setting model_type yet, but it's still possible to do some magic in CtransformersModel if model_type is 'None' and 'config.json' is not present.
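
A hypothetical sketch of that detection logic (detect_model_type is an invented helper name, not code from the PR; a config.json with a "model_type" key is what ctransformers' auto-detection expects):

```python
import json
from pathlib import Path


def detect_model_type(model_dir, selected):
    """Resolve model_type when the UI passes 'None'."""
    if selected != "None":
        return selected
    config_path = Path(model_dir) / "config.json"
    if config_path.exists():
        # e.g. a minimal config.json: {"model_type": "mpt"}
        return json.loads(config_path.read_text()).get("model_type")
    # No config.json: nothing to auto-detect, so the user must pick one
    # (or the loader could guess from the model's file name).
    return None
```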
