
[Bug]: Unexpected Number of Questions Generated When Requesting FAQ Generation #10694

Closed
stephanedebove opened this issue Feb 14, 2024 · 3 comments · Fixed by #13596
Labels: bug (Something isn't working), P1

Comments


stephanedebove commented Feb 14, 2024

Bug Description

When attempting to generate a set of FAQ questions and answers from a document, specifying num_questions_per_chunk=1 unexpectedly results in multiple questions being generated, exceeding the specified limit.

Expected Behavior
With num_questions_per_chunk=1, I expect to generate exactly one question per document or document chunk processed.

Actual Behavior
Despite setting num_questions_per_chunk=1, multiple questions (7 in my case) are generated for a single document or document chunk, indicating that the limit is not being respected or that the document is being split in an unexpected manner.

Edit: from my LLM log (Mixtral 8x7b running on Replicate) I can see that the LLM is called 8 times even with num_questions_per_chunk=1, so the problem is probably not with the num_questions_per_chunk parameter itself but with the question/answer generation request being sent multiple times. Could this be due to how the async functions work?
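
For reference, here is a rough, unverified sketch of how the call count could also be checked locally with LlamaIndex's debug callback (0.9.x-style imports; the default service context below is only a placeholder for the actual Mixtral/Replicate setup):

# Rough sketch: count LLM calls with LlamaIndex's debug callback (0.9.x API).
# The default service context here stands in for the real Mixtral/Replicate setup.
from llama_index import ServiceContext, SimpleDirectoryReader
from llama_index.callbacks import CallbackManager, CBEventType, LlamaDebugHandler
from llama_index.llama_dataset.generator import RagDatasetGenerator

llama_debug = LlamaDebugHandler()
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([llama_debug])
)

documents = SimpleDirectoryReader(input_files=["./document.txt"]).load_data()
generator = RagDatasetGenerator.from_documents(
    documents=documents,
    service_context=service_context,
    num_questions_per_chunk=1,
)
generator.generate_dataset_from_nodes()

# Each event pair recorded by the handler is one LLM request/response.
print("LLM calls:", len(llama_debug.get_event_pairs(CBEventType.LLM)))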

Version

0.9.45

Steps to Reproduce

Prepare a document or a list of documents to be processed.
Use the provided code snippet to generate FAQ questions and answers, ensuring num_questions_per_chunk is set to 1.
Observe that the output contains more questions than expected.

import asyncio

from llama_index import ServiceContext, SimpleDirectoryReader
from llama_index.llama_dataset.generator import RagDatasetGenerator

# Placeholder service context: the actual setup used Mixtral 8x7b running on Replicate.
service_context = ServiceContext.from_defaults()

async def main():
    documents = SimpleDirectoryReader(input_files=["./document.txt"]).load_data()
    QUESTION_GEN_PROMPT = "Here is a text. Your job is to imagine {num_questions_per_chunk} questions that could be asked about this text."

    dataset_generator = RagDatasetGenerator.from_documents(
        documents=documents,
        service_context=service_context,
        question_gen_query=QUESTION_GEN_PROMPT,
        show_progress=True,
        num_questions_per_chunk=1,  # Expected to generate one question per chunk
    )

    labelled_rag_dataset = await dataset_generator.agenerate_dataset_from_nodes()
    for example in labelled_rag_dataset.examples:
        print("-----")
        print(f"Question: {example.query}")
        print(f"Answer: {example.reference_answer}")

if __name__ == "__main__":
    asyncio.run(main())

Relevant Logs/Tracebacks

No response

@stephanedebove stephanedebove added the bug (Something isn't working) and triage (Issue needs to be triaged/prioritized) labels Feb 14, 2024

dosubot bot commented Feb 14, 2024

🤖

Hey @stephanedebove, great to cross paths again on here! Hope your code's been more friend than foe since our last chat. 🚀

Thank you for providing the detailed steps to reproduce the issue. Based on the context provided, it seems that the num_questions_per_chunk parameter in the RagDatasetGenerator class is used to instruct the AI to generate that many questions, but it doesn't enforce a strict limit. The AI tries to generate diverse questions based on the document content, and it might generate more than the specified limit if it finds more potential questions in the document.

Here is the relevant code snippet from the RagDatasetGenerator class:

self.question_gen_query = (
    question_gen_query
    or f"You are a Teacher/Professor. Your task is to setup \
                {num_questions_per_chunk} questions for an upcoming \
                quiz/examination. The questions should be diverse in nature \
                    across the document. Restrict the questions to the \
                        context information provided."
)

To strictly limit the number of questions generated, you would need to modify the _agenerate_dataset method where the questions are generated and stored. Specifically, you could truncate the cleaned_questions list to the num_questions_per_chunk value before storing them in the queries dictionary. However, please note that this might result in less diverse questions being generated if the limit is set to a low value.
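
As a rough, self-contained illustration of that truncation idea (the variable names mirror the snippet above, but raw_output here is only a stand-in for the LLM response that _agenerate_dataset actually parses, and the surrounding code differs by version):

# Illustrative sketch of the workaround: cap the parsed questions per chunk.
num_questions_per_chunk = 1
raw_output = "1. What is the text about?\n2. Who wrote it?\n3. When was it written?"

# Parse the raw LLM output into a list of non-empty question strings.
cleaned_questions = [
    line.strip() for line in raw_output.split("\n") if line.strip()
]
# Truncate to the requested number before storing the queries for this node.
cleaned_questions = cleaned_questions[:num_questions_per_chunk]

print(cleaned_questions)  # -> ['1. What is the text about?']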

You can find the relevant code in the llama_index/evaluation/dataset_generation.py file.

Please note that this is a workaround and might not be the best solution. It's possible that there's a better way to handle this that I wasn't able to find. If you have any further questions or need more clarification, feel free to ask.


@logan-markewich logan-markewich added the P1 label and removed the triage (Issue needs to be triaged/prioritized) label Feb 14, 2024
@dosubot dosubot bot added the stale (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) label May 15, 2024

elsatch commented May 16, 2024

Any progress on this issue? Yesterday I ran into this bug: it generated 8,932 Q&A pairs from 97 chunks.

@dosubot dosubot bot removed the stale (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) label May 16, 2024
@dosubot dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 16, 2024

dosubot bot commented May 16, 2024

Thank you for the update, @elsatch! We appreciate your response. Based on the information provided, we'll go ahead and close this issue now. If you have any more questions or encounter further issues, feel free to open a new one.

