
[Bug]: Unexpected Number of Questions Generated When Requesting FAQ Generation #10694

Closed
stephanedebove opened this issue Feb 14, 2024 · 3 comments · Fixed by #13596
Labels: bug (Something isn't working), P1

Comments


stephanedebove commented Feb 14, 2024

Bug Description

When attempting to generate a set of FAQ questions and answers from a document, specifying num_questions_per_chunk=1 unexpectedly results in multiple questions being generated, exceeding the specified limit.

Expected Behavior
With num_questions_per_chunk=1, I expect to generate exactly one question per document or document chunk processed.

Actual Behavior
Despite setting num_questions_per_chunk=1, multiple questions (7 in my case) are generated for a single document or document chunk, indicating that the limit is not being respected or that the document is being split in an unexpected manner.

Edit: from my LLM log (Mixtral 8x7b running on Replicate) I can see that the LLM is called 8 times even with num_questions_per_chunk=1, so the problem is probably not with the num_questions_per_chunk parameter itself but with the question/answer generation request being sent multiple times. Could this be due to how the async functions work?
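
For reference, here is a rough, unverified sketch of how the call count could also be checked locally with LlamaIndex's debug callback (0.9.x-style imports; the default service context below is only a placeholder for the actual Mixtral/Replicate setup):

# Rough sketch: count LLM calls with LlamaIndex's debug callback (0.9.x API).
# The default service context here stands in for the real Mixtral/Replicate setup.
from llama_index import ServiceContext, SimpleDirectoryReader
from llama_index.callbacks import CallbackManager, CBEventType, LlamaDebugHandler
from llama_index.llama_dataset.generator import RagDatasetGenerator

llama_debug = LlamaDebugHandler()
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([llama_debug])
)

documents = SimpleDirectoryReader(input_files=["./document.txt"]).load_data()
generator = RagDatasetGenerator.from_documents(
    documents=documents,
    service_context=service_context,
    num_questions_per_chunk=1,
)
generator.generate_dataset_from_nodes()

# Each event pair recorded by the handler is one LLM request/response.
print("LLM calls:", len(llama_debug.get_event_pairs(CBEventType.LLM)))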

Version

0.9.45

Steps to Reproduce

Prepare a document or a list of documents to be processed.
Use the provided code snippet to generate FAQ questions and answers, ensuring num_questions_per_chunk is set to 1.
Observe that the output contains more questions than expected.

import asyncio

from llama_index import ServiceContext, SimpleDirectoryReader
from llama_index.llama_dataset.generator import RagDatasetGenerator

# Placeholder service context: the actual setup used Mixtral 8x7b running on Replicate.
service_context = ServiceContext.from_defaults()

async def main():
    documents = SimpleDirectoryReader(input_files=["./document.txt"]).load_data()
    QUESTION_GEN_PROMPT = "Here is a text. Your job is to imagine {num_questions_per_chunk} questions that could be asked about this text."

    dataset_generator = RagDatasetGenerator.from_documents(
        documents=documents,
        service_context=service_context,
        question_gen_query=QUESTION_GEN_PROMPT,
        show_progress=True,
        num_questions_per_chunk=1,  # Expected to generate one question per chunk
    )

    labelled_rag_dataset = await dataset_generator.agenerate_dataset_from_nodes()
    for example in labelled_rag_dataset.examples:
        print("-----")
        print(f"Question: {example.query}")
        print(f"Answer: {example.reference_answer}")

if __name__ == "__main__":
    asyncio.run(main())

Relevant Logs/Tracebacks

No response

@stephanedebove stephanedebove added the bug (Something isn't working) and triage (Issue needs to be triaged/prioritized) labels Feb 14, 2024

dosubot bot commented Feb 14, 2024

🤖

Hey @stephanedebove, great to cross paths again on here! Hope your code's been more friend than foe since our last chat. 🚀

Thank you for providing the detailed steps to reproduce the issue. Based on the context provided, it seems that the num_questions_per_chunk parameter in the RagDatasetGenerator class is used to instruct the AI to generate that many questions, but it doesn't enforce a strict limit. The AI tries to generate diverse questions based on the document content, and it might generate more than the specified limit if it finds more potential questions in the document.

Here is the relevant code snippet from the RagDatasetGenerator class:

self.question_gen_query = (
    question_gen_query
    or f"You are a Teacher/Professor. Your task is to setup \
                {num_questions_per_chunk} questions for an upcoming \
                quiz/examination. The questions should be diverse in nature \
                    across the document. Restrict the questions to the \
                        context information provided."
)

To strictly limit the number of questions generated, you would need to modify the _agenerate_dataset method where the questions are generated and stored. Specifically, you could truncate the cleaned_questions list to the num_questions_per_chunk value before storing them in the queries dictionary. However, please note that this might result in less diverse questions being generated if the limit is set to a low value.
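
As a rough, self-contained illustration of that truncation idea (the variable names mirror the snippet above, but raw_output here is only a stand-in for the LLM response that _agenerate_dataset actually parses, and the surrounding code differs by version):

# Illustrative sketch of the workaround: cap the parsed questions per chunk.
num_questions_per_chunk = 1
raw_output = "1. What is the text about?\n2. Who wrote it?\n3. When was it written?"

# Parse the raw LLM output into a list of non-empty question strings.
cleaned_questions = [
    line.strip() for line in raw_output.split("\n") if line.strip()
]
# Truncate to the requested number before storing the queries for this node.
cleaned_questions = cleaned_questions[:num_questions_per_chunk]

print(cleaned_questions)  # -> ['1. What is the text about?']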

You can find the relevant code in the llama_index/evaluation/dataset_generation.py file.

Please note that this is a workaround and might not be the best solution. It's possible that there's a better way to handle this that I wasn't able to find. If you have any further questions or need more clarification, feel free to ask.


@logan-markewich logan-markewich added the P1 label and removed the triage (Issue needs to be triaged/prioritized) label Feb 14, 2024
@dosubot dosubot bot added the stale (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) label May 15, 2024

elsatch commented May 16, 2024

Any progress on this issue? Yesterday I ran into this bug: it generated 8,932 Q&A pairs from 97 chunks.

@dosubot dosubot bot removed the stale (Issue has not had recent activity or appears to be solved; stale issues will be automatically closed) label May 16, 2024
@dosubot dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 16, 2024

dosubot bot commented May 16, 2024

Thank you for the update, @elsatch! We appreciate your response. Based on the information provided, we'll go ahead and close this issue now. If you have any more questions or encounter further issues, feel free to open a new one.

