basic implementation of llama.cpp chat generation #723
Conversation
- Allows for constraining output to JSON
- Allows for function calling (not tested)
- Streaming still needs to be implemented when `stream` is set to true in `generation_kwargs`
Streaming would require more work and is outside of what I can implement, but here is a proof of concept for streaming to stdout. I'm simply returning the concatenated output for now:
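A minimal sketch of that proof of concept, calling llama-cpp-python directly; the model file and prompt are placeholder assumptions:

```python
from llama_cpp import Llama

llm = Llama(model_path="openchat-3.5-1210.Q3_K_S.gguf")  # placeholder GGUF model file

chunks = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Who is the best American actor?"}],
    stream=True,  # yields partial completions instead of a single response
)

reply = ""
for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    token = delta.get("content", "")
    print(token, end="", flush=True)  # stream tokens to stdout as they arrive
    reply += token  # concatenate so the full text can still be returned
print()
```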
Thanks for the PR! Could you please add tests like we currently do for the non-chat generator?
I converted most of the regular non-chat tests to work with the chat version and tested manually. I will try to finalize the tests and add relevant chat ones after class.
I took the tests from the regular generator and modified them to work for the chat generator. The data types are a bit different, so the biggest difference was just in the way we extract the assistant replies and metadata. I was not able to get the RAG test working... I personally have not used a chat generator in a RAG pipeline, so I'm unsure of how to properly implement it. I can't seem to pass the chat generator reply to the answer builder.
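One possible workaround, sketched under the assumption that `AnswerBuilder` expects plain string replies: unwrap the `ChatMessage` content before handing it over. The reply here is a stand-in for what the chat generator would produce.

```python
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.dataclasses import ChatMessage

# Stand-in reply; in the real pipeline this comes from the chat generator
replies = [ChatMessage.from_assistant("Jean lives in Paris.")]

# Unwrap ChatMessage objects into strings so AnswerBuilder can consume them
answer_builder = AnswerBuilder()
result = answer_builder.run(
    query="Who lives in Paris?",
    replies=[message.content for message in replies],
)
print(result["answers"][0].data)
```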
also add a basic rag test
Still WIP, but I got a basic RAG test completed, plus testing for function calling with functionary. Function calling support has been added, but you cannot call functions directly through llama.cpp at the moment. Additionally, you have to make sure a model trained for function calling is used, otherwise it can fail.
A lot of tests have been added and they all seem to pass. I'm hoping it stays that way, but you never know with generative AI. JSON constraining and function calling are a bit tough: usually they are wrapped in such a way that a failure can be retried. That might have to be a different component, or maybe a recursive call in .run() with a json.loads check (see the sketch below). The tests can also serve as examples of how it works; I built them around what I was able to find in llama-cpp-python's and Haystack's documentation and just adapted the examples to work with chat completions. Streaming is another thing I don't think I will be able to implement, because I'm unsure of how you would want to handle that.
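A rough sketch of the retry idea mentioned above: call the generator, try to parse the reply as JSON, and retry a few times on failure. The generator variable and its keyword arguments are illustrative assumptions, not the PR's API.

```python
import json

def generate_json(generator, messages, max_retries=3):
    """Call a chat generator and retry until the reply parses as JSON."""
    for attempt in range(max_retries):
        result = generator.run(messages=messages)
        reply = result["replies"][0].content
        try:
            return json.loads(reply)  # success: the model produced valid JSON
        except json.JSONDecodeError:
            continue  # invalid JSON, try again
    raise ValueError(f"No valid JSON after {max_retries} attempts")
```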
Very good work, @lbux!
I only removed transformers from the dependencies, consistent with the llama.cpp Python bindings (abetlen/llama-cpp-python#1294), and added it to the test dependencies.
For streaming support, I'll open another issue.
Related to #722
Lots of additional changes will need to be made (testing, updating the example docs).
Example usage:
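A hedged reconstruction of the example that produced the output below; the import path, component parameters, model file, and prompt are assumptions, not the PR's exact code.

```python
from haystack import Pipeline
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="openchat-3.5-1210.Q3_K_S.gguf",  # placeholder GGUF model file
    generation_kwargs={
        # constrain the reply to JSON via llama.cpp's response_format support
        "response_format": {"type": "json_object"},
        "max_tokens": 128,
    },
)

pipeline = Pipeline()
pipeline.add_component("generator", generator)

messages = [ChatMessage.from_user("Are you a bot? Reply with a JSON object containing a list of verdicts.")]
result = pipeline.run({"generator": {"messages": messages}})
print(result)
```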
Output:
{'generator': {'replies': [ChatMessage(content='{ "verdicts": [ { "verdict": "yes", "reason": "I\'m an AI, so I am indeed a type of bot designed to assist and communicate with humans." } ] }', role=<ChatRole.ASSISTANT: 'assistant'>, name=None, meta={})]}}