
basic implementation of llama.cpp chat generation #723

Merged: 13 commits merged into deepset-ai:main on May 13, 2024
Conversation

@lbux (Contributor) commented May 8, 2024

  • allows for constraining output to JSON
  • adds a test for JSON constraining
  • allows for function calling
  • adds a test for function calling with functionary
  • adds a test for regular function calling
  • streaming still needs to be implemented when stream is set to true in generation_kwargs

Related to #722

Lots of additional changes will still need to be made (testing, updating the example docs).

Example usage:

from haystack import Pipeline
from haystack.components.builders import DynamicChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator  # import path assumed from the llama_cpp integration package

test = Pipeline()
prompt_builder = DynamicChatPromptBuilder()
json_schema = {
    "type": "object",
    "properties": {
        "verdicts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "verdict": {"type": "string", "enum": ["yes", "no"]},
                    "reason": {"type": "string"},
                },
                "required": ["verdict", "reason"],
            },
        }
    },
    "required": ["verdicts"],
}

generator = LlamaCppChatGenerator(
    model="/path/to/gguf",
    n_ctx=2048,
    n_batch=512,
    generation_kwargs={
        "temperature": 0.2,
        "top_p": 0.95,
        "max_tokens": 100,
        "response_format": {
            "type": "json_object",
            "schema": json_schema,
        },
    },
)
system = """
IMPORTANT: Please make sure to only return in JSON format, with the 'verdicts' key as a list of JSON. These JSON only contain the `verdict` key that outputs only 'yes' or 'no', and a `reason` key to justify the verdict. In your reason, make something up.
Example Retrieval Context: ["Einstein won the Nobel Prize for his discovery of the photoelectric effect", "He won the Nobel Prize in 1968.", "There was a cat."]
Example Input: "Who won the Nobel Prize in 1968 and for what?"
Example Expected Output: "Einstein won the Nobel Prize in 1968 for his discovery of the photoelectric effect."

Example:
{{
    "verdicts": [
        {{
            "verdict": "yes",
            "reason": "It clearly addresses the question by stating that 'Einstein won the Nobel Prize for his discovery of the photoelectric effect.'"
        }},
        {{
            "verdict": "yes",
            "reason": "The text verifies that the prize was indeed won in 1968."
        }},
        {{
            "verdict": "no",
            "reason": "'There was a cat' is not at all relevant to the topic of winning a Nobel Prize."
        }}
    ]  
}}
"""
generator.warm_up()
test.add_component("generator", generator)
test.add_component("prompt_builder", prompt_builder)
test.connect("prompt_builder.prompt", "generator.messages")
messages = [
    ChatMessage.from_system("{{system}}"),
    ChatMessage.from_user("Are you a bot?"),
]

output = test.run(data={"prompt_builder": {"template_variables": {"system": system}, "prompt_source": messages}})

print(output)

Output:
{'generator': {'replies': [ChatMessage(content='{ "verdicts": [ { "verdict": "yes", "reason": "I\'m an AI, so I am indeed a type of bot designed to assist and communicate with humans." } ] }', role=<ChatRole.ASSISTANT: 'assistant'>, name=None, meta={})]}}
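
Because the reply is constrained to the schema, its content can be parsed straight into a dict. A small follow-up sketch based on the output above:

import json

reply = output["generator"]["replies"][0]
parsed = json.loads(reply.content)
for verdict in parsed["verdicts"]:
    print(verdict["verdict"], "-", verdict["reason"])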

@lbux lbux requested a review from a team as a code owner May 8, 2024 01:39
@lbux lbux requested review from shadeMe and removed request for a team May 8, 2024 01:39
@github-actions bot added the integration:llama_cpp and type:documentation labels May 8, 2024
@lbux (Contributor, Author) commented May 8, 2024

Streaming would require more work, and it is beyond what I can implement here, but here is a proof of concept for streaming to stdout. I'm simply returning the concatenated output for now:

def stream_to_stdout(self, chunk):
    print(chunk.content, end="", flush=True)

def run(self, messages: List[ChatMessage], generation_kwargs: Optional[Dict[str, Any]] = None):
    if self.model is None:
        error_msg = "The model has not been loaded. Please call warm_up() before running."
        raise RuntimeError(error_msg)

    updated_generation_kwargs = {**self.generation_kwargs, **(generation_kwargs or {})}
    formatted_messages = [msg.to_openai_format() for msg in messages]

    stream = updated_generation_kwargs.get("stream", False)

    output = self.model.create_chat_completion(messages=formatted_messages, **updated_generation_kwargs)

    if stream:
        # With stream=True, llama.cpp returns an iterator of partial completions; collect the deltas as they arrive.
        full_response = []
        for response in output:
            for choice in response["choices"]:
                delta = choice.get("delta", {})
                if "content" in delta and delta["content"].strip():
                    chunk = StreamingChunk(
                        content=delta["content"], meta={"id": response["id"], "created": response["created"]}
                    )
                    self.stream_to_stdout(chunk)
                    full_response.append(chunk.content)
        # Return the concatenated stream as a single assistant message.
        full_response = ChatMessage.from_assistant(content="".join(full_response))
        return {"streamed_replies": full_response}
    else:
        replies = [ChatMessage.from_assistant(content=output["choices"][0]["message"]["content"])]
        return {"replies": replies}

@shadeMe (Contributor) previously requested changes and left a comment on May 8, 2024

Thanks for the PR! Could you please add tests like we currently do for the non-chat generator?

@lbux (Contributor, Author) commented May 8, 2024

I converted most of the regular non-chat tests to work with the chat version and tested them manually. I will try to finalize the tests and add the relevant chat-specific ones after class.

@lbux (Contributor, Author) commented May 9, 2024

I took the tests from the regular generator and modified them to work with the chat generator. The data types are a bit different, so the biggest change was in how we extract the assistant replies and metadata. I was not able to get the RAG test working... I personally have not used a chat generator in a RAG pipeline, so I'm unsure how to implement it properly; I can't seem to pass the chat generator reply to the answer builder.

@lbux (Contributor, Author) commented May 11, 2024

Still WIP, but I got a basic RAG test completed plus testing for function calling with functionary.

Function calling support has been added, but you cannot call functions directly through llama.cpp at the moment. Additionally, you have to make sure a model trained for function calling is used, otherwise it can fail.
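
A rough sketch of how OpenAI-style tool definitions could be passed through generation_kwargs (llama-cpp-python's create_chat_completion accepts a tools argument); the model path and tool name are illustrative, and any functionary-specific model setup is omitted:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",  # illustrative tool name
            "description": "Get the current weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

generator = LlamaCppChatGenerator(
    model="/path/to/functionary.gguf",  # must be a model trained for function calling
    n_ctx=2048,
    generation_kwargs={"tools": tools},
)
generator.warm_up()
result = generator.run(messages=[ChatMessage.from_user("What's the weather in Berlin?")])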

@lbux lbux requested a review from shadeMe May 12, 2024 05:56
@lbux (Contributor, Author) commented May 12, 2024

A lot of tests have been added and they all seem to pass. I'm hoping it stays that way, but you never know with generative AI.

JSON constraining and function calling are a bit tough. Usually they are wrapped in such a way that a failure can be retried; that might have to be a separate component, or maybe a recursive call in .run() with a json.loads check, as sketched below. The tests can also serve as examples of how it works, but I built them around what I was able to find in llama-cpp-python's and Haystack's documentation and just adapted those examples to work with chat completions.
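
For illustration, such a retry wrapper could look roughly like this (a hypothetical helper, not part of this PR):

import json

def run_with_json_retry(generator, messages, retries=3):
    # Re-run the generator until the reply parses as JSON, or give up.
    for _ in range(retries):
        reply = generator.run(messages=messages)["replies"][0]
        try:
            return json.loads(reply.content)
        except json.JSONDecodeError:
            continue
    raise ValueError("Model did not return valid JSON after retries")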

Streaming is another thing that I don't think I will be able to implement because I'm unsure of how you guys would want to handle that.

@anakin87 anakin87 self-requested a review May 13, 2024 12:41
@anakin87 (Member) left a comment

Very good work, @lbux!

I only removed transformers from the dependencies, consistent with the llama.cpp Python bindings (abetlen/llama-cpp-python#1294), and added it to the test dependencies.

For streaming support, I'll open another issue.

@anakin87 anakin87 merged commit 0e02fd6 into deepset-ai:main May 13, 2024
10 checks passed