
basic implementation of llama.cpp chat generation #723

Merged: 13 commits merged into deepset-ai:main on May 13, 2024
Conversation

@lbux (Contributor) commented May 8, 2024

  • allows for constraining output to JSON
  • adds a test for JSON constraining
  • allows for function calling
  • adds a test for function calling with functionary
  • adds a test for regular function calling
  • streaming still needs to be implemented when stream is set to true in generation_kwargs

Related to #722

Lots of additional changes will still need to be made (testing, updating the example docs).

Example usage:

from haystack import Pipeline
from haystack.components.builders import DynamicChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator  # import path assumed from the llama_cpp integration package

test = Pipeline()
prompt_builder = DynamicChatPromptBuilder()
json_schema = {
    "type": "object",
    "properties": {
        "verdicts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "verdict": {"type": "string", "enum": ["yes", "no"]},
                    "reason": {"type": "string"},
                },
                "required": ["verdict", "reason"],
            },
        }
    },
    "required": ["verdicts"],
}

generator = LlamaCppChatGenerator(
    model="/path/to/gguf",
    n_ctx=2048,
    n_batch=512,
    generation_kwargs={
        "temperature": 0.2,
        "top_p": 0.95,
        "max_tokens": 100,
        "response_format": {
            "type": "json_object",
            "schema": json_schema,
        },
    },
)
system = """
IMPORTANT: Please make sure to only return in JSON format, with the 'verdicts' key as a list of JSON. These JSON only contain the `verdict` key that outputs only 'yes' or 'no', and a `reason` key to justify the verdict. In your reason, make something up.
Example Retrieval Context: ["Einstein won the Nobel Prize for his discovery of the photoelectric effect", "He won the Nobel Prize in 1968.", "There was a cat."]
Example Input: "Who won the Nobel Prize in 1968 and for what?"
Example Expected Output: "Einstein won the Nobel Prize in 1968 for his discovery of the photoelectric effect."

Example:
{{
    "verdicts": [
        {{
            "verdict": "yes",
            "reason": "It clearly addresses the question by stating that 'Einstein won the Nobel Prize for his discovery of the photoelectric effect.'"
        }},
        {{
            "verdict": "yes",
            "reason": "The text verifies that the prize was indeed won in 1968."
        }},
        {{
            "verdict": "no",
            "reason": "'There was a cat' is not at all relevant to the topic of winning a Nobel Prize."
        }}
    ]  
}}
"""
generator.warm_up()
test.add_component("generator", generator)
test.add_component("prompt_builder", prompt_builder)
test.connect("prompt_builder.prompt", "generator.messages")
messages = [
    ChatMessage.from_system("{{system}}"),
    ChatMessage.from_user("Are you a bot?"),
]

output = test.run(data={"prompt_builder": {"template_variables": {"system": system}, "prompt_source": messages}})

print(output)

Output:
{'generator': {'replies': [ChatMessage(content='{ "verdicts": [ { "verdict": "yes", "reason": "I\'m an AI, so I am indeed a type of bot designed to assist and communicate with humans." } ] }', role=<ChatRole.ASSISTANT: 'assistant'>, name=None, meta={})]}}
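
Because the reply is constrained to the schema, its content can be parsed straight into a dict. A small follow-up sketch based on the output above:

import json

reply = output["generator"]["replies"][0]
parsed = json.loads(reply.content)
for verdict in parsed["verdicts"]:
    print(verdict["verdict"], "-", verdict["reason"])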

@lbux lbux requested a review from a team as a code owner May 8, 2024 01:39
@lbux lbux requested review from shadeMe and removed request for a team May 8, 2024 01:39
@github-actions bot added the integration:llama_cpp and type:documentation labels May 8, 2024
@lbux (Contributor, Author) commented May 8, 2024

Streaming would require more work, and it is beyond what I can implement here, but here is a proof of concept for streaming to stdout. I'm simply returning the concatenated output for now:

def stream_to_stdout(self, chunk):
    print(chunk.content, end="", flush=True)

def run(self, messages: List[ChatMessage], generation_kwargs: Optional[Dict[str, Any]] = None):
    if self.model is None:
        error_msg = "The model has not been loaded. Please call warm_up() before running."
        raise RuntimeError(error_msg)

    updated_generation_kwargs = {**self.generation_kwargs, **(generation_kwargs or {})}
    formatted_messages = [msg.to_openai_format() for msg in messages]

    stream = updated_generation_kwargs.get("stream", False)

    output = self.model.create_chat_completion(messages=formatted_messages, **updated_generation_kwargs)

    if stream:
        # With stream=True, llama.cpp returns an iterator of partial completions; collect the deltas as they arrive.
        full_response = []
        for response in output:
            for choice in response["choices"]:
                delta = choice.get("delta", {})
                if "content" in delta and delta["content"].strip():
                    chunk = StreamingChunk(
                        content=delta["content"], meta={"id": response["id"], "created": response["created"]}
                    )
                    self.stream_to_stdout(chunk)
                    full_response.append(chunk.content)
        # Return the concatenated stream as a single assistant message.
        full_response = ChatMessage.from_assistant(content="".join(full_response))
        return {"streamed_replies": full_response}
    else:
        replies = [ChatMessage.from_assistant(content=output["choices"][0]["message"]["content"])]
        return {"replies": replies}

@shadeMe (Contributor) previously requested changes and left a comment on May 8, 2024

Thanks for the PR! Could you please add tests like we currently do for the non-chat generator?

@lbux (Contributor, Author) commented May 8, 2024

I converted most of the regular non-chat tests to work with the chat version and tested them manually. I will try to finalize the tests and add the relevant chat-specific ones after class.

@lbux (Contributor, Author) commented May 9, 2024

I took the tests from the regular generator and modified them to work with the chat generator. The data types are a bit different, so the biggest change was in how we extract the assistant replies and metadata. I was not able to get the RAG test working... I personally have not used a chat generator in a RAG pipeline, so I'm unsure how to implement it properly; I can't seem to pass the chat generator reply to the answer builder.

@lbux (Contributor, Author) commented May 11, 2024

Still WIP, but I got a basic RAG test completed plus testing for function calling with functionary.

Function calling support has been added, but you cannot call functions directly through llama.cpp at the moment. Additionally, you have to make sure a model trained for function calling is used, otherwise it can fail.
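
A rough sketch of how OpenAI-style tool definitions could be passed through generation_kwargs (llama-cpp-python's create_chat_completion accepts a tools argument); the model path and tool name are illustrative, and any functionary-specific model setup is omitted:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",  # illustrative tool name
            "description": "Get the current weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

generator = LlamaCppChatGenerator(
    model="/path/to/functionary.gguf",  # must be a model trained for function calling
    n_ctx=2048,
    generation_kwargs={"tools": tools},
)
generator.warm_up()
result = generator.run(messages=[ChatMessage.from_user("What's the weather in Berlin?")])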

@lbux lbux requested a review from shadeMe May 12, 2024 05:56
@lbux (Contributor, Author) commented May 12, 2024

A lot of tests have been added and they all seem to pass. I'm hoping it stays that way, but you never know with generative AI.

JSON constraining and function calling are a bit tough. Usually they are wrapped in such a way that a failure can be retried; that might have to be a separate component, or maybe a recursive call in .run() with a json.loads check, as sketched below. The tests can also serve as examples of how it works, but I built them around what I was able to find in llama-cpp-python's and Haystack's documentation and just adapted those examples to work with chat completions.
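
For illustration, such a retry wrapper could look roughly like this (a hypothetical helper, not part of this PR):

import json

def run_with_json_retry(generator, messages, retries=3):
    # Re-run the generator until the reply parses as JSON, or give up.
    for _ in range(retries):
        reply = generator.run(messages=messages)["replies"][0]
        try:
            return json.loads(reply.content)
        except json.JSONDecodeError:
            continue
    raise ValueError("Model did not return valid JSON after retries")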

Streaming is another thing that I don't think I will be able to implement because I'm unsure of how you guys would want to handle that.

@anakin87 anakin87 self-requested a review May 13, 2024 12:41
@anakin87 (Member) left a comment

Very good work, @lbux!

I only removed transformers from the dependencies, consistent with the llama.cpp Python bindings (abetlen/llama-cpp-python#1294), and added it to the test dependencies.

For streaming support, I'll open another issue.

@anakin87 anakin87 merged commit 0e02fd6 into deepset-ai:main May 13, 2024
10 checks passed