ctransformers: another attempt #3313

Merged 6 commits into oobabooga:main on Aug 11, 2023

Conversation

@cal066 (Contributor) commented Jul 26, 2023

Generalized ctransformers support, based on #2892. Credits to @randoentity.

Marked RFC since it moves some control structures around to allow extensions to override them more easily.
Select Model Type "None" to let ctransformers guess the model type from the config.json file in the model directory.

You may need to install ctransformers manually at this time. Note that the latest prebuilt binaries use CUDA 12.x instead of 11.x; to build from source against your local CUDA with cuBLAS support:

```
env CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers
```

Edit: Now using the ctransformers build from @jllllll.
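
For anyone trying this out, here is a minimal standalone sketch of loading and running a GGML model through the ctransformers Python API (the model path mirrors the falcon-7b example used later in this thread; the prompt and gpu_layers value are arbitrary):

```python
from ctransformers import AutoModelForCausalLM

# model_type can be omitted when a config.json with a "model_type" key
# sits next to the weights; gpu_layers controls CUDA offloading.
llm = AutoModelForCausalLM.from_pretrained(
    "models/falcon-7b-instruct.ggccv1.q4_0.bin",
    model_type="falcon",
    gpu_layers=50,
)

print(llm("Write a haiku about GPUs:", max_new_tokens=64))
```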

@jllllll (Contributor) commented Jul 31, 2023

I have built ctransformers wheels for various CUDA versions and linked the CUDA 11.7 one here:
#3357

@cal066 (Contributor, Author) commented Jul 31, 2023

@jllllll Thanks, updated to use your build.

@cal066 changed the title from "WIP: ctransformers: another attempt" to "ctransformers: another attempt" on Jul 31, 2023
@lppllppl920 commented

It looks like unloading a CtransformersModel does not release the GPU memory it occupies.

@oobabooga (Owner) commented

I have some basic questions about ctransformers:

  1. How are the ggmls generated? For instance, if I want to do a q4_0 or q4_K_M for GPT-J or Falcon starting from the model in Hugging Face format, how do I do it?
  2. What is the main use case? What models are worth using on ctransformers the most?

@jllllll (Contributor) commented Aug 2, 2023

> I have some basic questions about ctransformers:
>
> 1. How are the ggmls generated? For instance, if I want to do a q4_0 or q4_K_M for GPT-J or Falcon starting from the model in Hugging Face format, how do I do it?
> 2. What is the main use case? What models are worth using on ctransformers the most?

Falcon models can be converted and quantized using the conversion script and binaries provided by ggllm.cpp.
All other models use the scripts and binaries provided by the ggml library:
https://github.com/ggerganov/ggml/tree/master/examples
LLaMa models still use llama.cpp.

The primary use case that ctransformers provides is access to non-LLaMa models. Its llama.cpp bindings don't provide anything that llama-cpp-python doesn't already, as far as I can tell.

Starcoder, Falcon, and MPT ggml models are what most people want out of ctransformers. There are also Replit, Dolly, and the GPT models, but I haven't seen much interest in them lately.

It is also worth mentioning that work is being done to unify all of the various ggml implementations into a single file format:
ggerganov/ggml#220
ggerganov/ggml#302
Eventually, there will be no need for multiple ggml backends once that work is finished.

@cal066 (Contributor, Author) commented Aug 2, 2023

> It looks like unloading a CtransformersModel does not release the GPU memory it occupies.

@lppllppl920
In my experience this seems to be the case for exllama/llama.cpp as well; I have to restart the webui to clean it up. I'll need to look into how unloading works.

@cal066 (Contributor, Author) commented Aug 2, 2023

It seems a destructor already exists. I'm not familiar enough with the ctransformers C/C++ ggml backend to tell if there's a memory leak somewhere.
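
For anyone digging into this, a rough sketch of the explicit cleanup an unload path can attempt (the shared-state attribute names are assumptions for illustration, not the PR's actual code):

```python
import gc

import torch


def unload_model(shared):
    """Drop the last references to the model and force collection."""
    # The ggml backend frees its buffers in the model's destructor, so any
    # lingering reference (UI state, caches) keeps the GPU memory alive.
    shared.model = None
    shared.tokenizer = None
    gc.collect()
    # empty_cache() only returns memory that torch itself cached; memory
    # allocated directly by the ggml CUDA backend is only freed once the
    # destructor has actually run.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```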

@cal066 force-pushed the ctransformers branch 2 times, most recently from 5ceac52 to e954b84 on August 7, 2023
@ye7iaserag (Contributor) commented
I checked the PR and it doesn't mention WizardLM in the models list; I thought ctransformers supported WizardLM...

@jllllll (Contributor) commented Aug 10, 2023

WizardLM is llama/llama2.

@oobabooga (Owner) left a comment

2 more questions:

  1. I assume this PR does the same thing as #2364 (Add starcoder and starchat ggml support) and #2892 (Add support for starcoder ggml and similar), so if it gets merged, both can be closed. Is that correct?

  2. The README needs to be updated to briefly mention ctransformers support, and it would be nice to have a short entry under docs/ mentioning what it is and giving a couple of examples of models that it can load.

modules/ui_model_menu.py (review thread, resolved)
```diff
@@ -89,8 +97,12 @@ def _generate_reply(question, state, stopping_strings=None, is_chat=False):
     yield reply


+encode_llama_prompts = ['LlamaCppModel', 'RWKVModel', 'CtransformersModel']
```
@oobabooga (Owner) commented

I don't like these variables floating here and think that they should be removed.

@cal066 (Contributor, Author) replied

I hate probing by class names, but I have put them back for now. I'll try to refactor them next time.
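
For context, the pattern being debated looks roughly like this (a simplified sketch, not the PR's exact code):

```python
# Models whose prompts are passed through as plain strings rather than
# being tokenized by the Hugging Face tokenizer.
encode_llama_prompts = ['LlamaCppModel', 'RWKVModel', 'CtransformersModel']


def needs_string_prompt(model) -> bool:
    # Probing by class name keeps the modules decoupled, but it breaks
    # silently if a class is renamed; a shared base class plus an
    # isinstance() check would be the more robust refactor.
    return model.__class__.__name__ in encode_llama_prompts
```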

cal066 added a commit: "Generalized ctransformers based on oobabooga#2892. Credits to randoentity."
@cal066 (Contributor, Author) commented Aug 11, 2023

@oobabooga Yes, this was based on #2892 from @randoentity.

@oobabooga merged commit 7a4fcee into oobabooga:main on Aug 11, 2023
@oobabooga (Owner) commented

Looks good now. It's very barebones but it works with falcon-7b:

```
python server.py --model models/falcon-7b-instruct.ggccv1.q4_0.bin --loader ctransformers --model_type "falcon" --n-gpu-layers 10000
```

I have added @randoentity as a coauthor. Before the merge I also made several small adjustments.

Thank you for your work on this PR.

@cal066 (Contributor, Author) commented Aug 11, 2023

Awesome, thanks.

@randoentity (Contributor) commented

@oobabooga thanks for mentioning me, but most of the work was by @s-kostyaev :) I just did a rebase and a minor refactor.

@randoentity (Contributor) commented

Thanks for the work here, @cal066! I missed your PR, but I did see that GPU offloading was added upstream last week and wanted to bring it here, but ${life}. I'm awestruck by how much momentum this project has. 🚀

@oobabooga (Owner) commented

Oops, it looks like I messed up then; I should have credited @s-kostyaev as well. Sorry!

I have added one more parameter for ctransformers in 28c8df3. I find it weird that seed is both a loading parameter and a generation parameter, and the same goes for threads. Also, truncation seems not to be implemented, at least not in a customizable way. If anyone can think of ways to improve the integration with ctransformers, feel free to submit new PRs.
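
To illustrate the duplication (a sketch based on the ctransformers Python API; the path and values are arbitrary):

```python
from ctransformers import AutoModelForCausalLM

# seed and threads can be set at load time, where they land in the
# model's config...
llm = AutoModelForCausalLM.from_pretrained(
    "models/falcon-7b-instruct.ggccv1.q4_0.bin",
    model_type="falcon",
    seed=42,
    threads=8,
)

# ...and can be passed again per generation call, overriding the config
# values for that call.
text = llm("Once upon a time", max_new_tokens=32, seed=42, threads=8)
```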

@cal066 (Contributor, Author) commented Aug 12, 2023

@oobabooga I added those according to the default parameters I found in the ctransformers Python code. The duplicated seed probably isn't needed: it tries to take it from the generate call if possible and then falls back to the config option, and likewise for threads. I have created #3543 to resolve this.

@lppllppl920 commented
@cal066 Any thoughts on changing infer_loader so that the ctransformers loader can be pre-selected when a certain model name is chosen?

@cal066 (Contributor, Author) commented Aug 13, 2023

@lppllppl920 I just tried TheBloke/airoboros-mpt-30b-gpt4-1p4-GGML, and it's not quite reliable. I can make it select ctransformers, but the 'None' model_type still needs to be manually set to 'mpt'. 'None' is equivalent to auto-detect, but that requires a config.json in the model directory, something like this.

infer_loader doesn't allow setting model_type yet, but it's still possible to do some magic in CtransformersModel if model_type is 'None' and 'config.json' is not present.
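
A hypothetical sketch of that detection logic (detect_model_type is an invented helper name, not code from the PR; a config.json with a "model_type" key is what ctransformers' auto-detection expects):

```python
import json
from pathlib import Path


def detect_model_type(model_dir, selected):
    """Resolve model_type when the UI passes 'None'."""
    if selected != "None":
        return selected
    config_path = Path(model_dir) / "config.json"
    if config_path.exists():
        # e.g. a minimal config.json: {"model_type": "mpt"}
        return json.loads(config_path.read_text()).get("model_type")
    # No config.json: nothing to auto-detect, so the user must pick one
    # (or the loader could guess from the model's file name).
    return None
```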
