Preprocessing should allow keeping Document ids unchanged #7557

julian-risch · 2024-04-18T10:27:25Z

Currently, preprocessing components such as the DocumentCleaner get documents as input and return preprocessed documents but with different ids.
We should enable users to specify custom ids and make sure these ids are used when writing the documents to a database.
We could achieve that in different ways:

Ensure that preprocessing components such as the DocumentCleaner keep the document ids unchanged. However that means the id is not based on the preprocessed content anymore but on the original content.
Add a new Component that calculates ids of documents right before the DocumentWriter. That means users with custom ids could provide a list of ids as inputs

A current workaround is to define a custom component that includes a method returning custom ids:

from copy import deepcopy
from typing import List
from haystack import Document, component

@component
class DocumentIDCreator:
    # This is a pass-through component that returns the same Documents but with different ids
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]):
        updated_documents = []
        for doc in documents:
            id = calculate_id(doc, id_hash_keys) # your code
            updated_documents.append(Document(id=id, content=doc.content, meta=deepcopy(doc.meta), embedding=doc.embedding))

        return {"documents": updated_documents}

Phlasse · 2024-04-18T11:30:50Z

Thanks for creating an issue on the topic.

I can imagine, a keep_id flag when calling the run function could be enough. It could be set to False by default.
The id could be then be reused in the end, when the new document is being created.

CarlosFerLo · 2024-04-29T18:10:02Z

I will take this issue. I believe @Phlasse approach seems to be the way to go, as it keeps things sompler.

vblagoje · 2024-05-08T13:00:12Z

Hey @CarlosFerLo @julian-risch @Phlasse

Just saw your PR contribution @CarlosFerLo - thank you.

I was wondering why not, if we are already adding another init parameter, make it more powerful and future proof yet as simple as the flag keep_id

We could add id_generator init parameter:

id_generator: Optional[Callable[[Document, str], str]] = None

where the first parameter of the callable is cleaned document, the second parameter is doc id of the old document and callable returns str.

assigned in init like this:

self.id_generator = id_generator or (lambda doc, id: doc.id) # or whatever default is

and the end of the run method is:

clean_doc = Document(content=text, meta=deepcopy(doc.meta))
clean_doc.id = self.id_generator(clean_doc, doc.id)
cleaned_docs.append(clean_doc)

Let me know your thoughts about this approach.

CarlosFerLo · 2024-05-08T15:38:46Z

@vblagoje then what are you suggesting?

We keep the keep_id flag, and we add a id_generator where the id generator can overwrite the resulting id of the cleaned document receiving the original document and the new id.

Or we delete the keep_flag and add all the functionality in the id_generator parameter and we can add a keep_id id generator that has this behaviour.

And the important part, what is the default id generator?

vblagoje · 2024-05-09T14:37:55Z

@CarlosFerLo I'm suggesting we don't use keep_id flag because id_generator callable can replace it and future use cases where people want to customize doc id generation. We can then use this familiar pattern not only in this component but in others as well - whenever some derived documents are created by a component and we need to generate derived docs ids.

CarlosFerLo · 2024-05-09T19:08:01Z

@vblagoje Okey. I will create a new pull request, to clean the commit tree. Should I create a Class for this kind of function, to make the code more readable and to add more functionality if needed in the future. Should I name the class DocumentIdGenerator or just IdGenerator. And to specify the input parameters: it receives the new document and the old id? or better if it gets the new document and the old one, it would give more flexibility, but also, haveing the new document alrady formatted seems a little redundant.

vblagoje · 2024-05-10T14:35:33Z

@CarlosFerLo a class in our design principles would be an overkill. A simple Callable suffices here. See in these docs how you can specify the input parameters and return values.

Even better insight, let this Callable have old and new docs as inputs and return id. You can call this parameter id_generator.
Thanks for working on this

CarlosFerLo · 2024-05-12T16:56:16Z

@vblagoje Opened a new pull request that contains the new implementation.

julian-risch added 2.x Related to Haystack v2.0 type:feature New feature or request labels Apr 18, 2024

CarlosFerLo mentioned this issue Apr 29, 2024

feat: add keep_id parameter to DocumentCleaner #7618

Closed

masci assigned vblagoje May 6, 2024

CarlosFerLo mentioned this issue May 12, 2024

feat: Add id_generator functionality to the document cleaner. #7683

Closed

CarlosFerLo mentioned this issue May 16, 2024

feat: add keep-id to DocumentCleaner #7703

Merged

silvanocerza closed this as completed in #7703 May 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preprocessing should allow keeping Document ids unchanged #7557

Preprocessing should allow keeping Document ids unchanged #7557

julian-risch commented Apr 18, 2024

Phlasse commented Apr 18, 2024

CarlosFerLo commented Apr 29, 2024

vblagoje commented May 8, 2024

CarlosFerLo commented May 8, 2024

vblagoje commented May 9, 2024

CarlosFerLo commented May 9, 2024

vblagoje commented May 10, 2024

CarlosFerLo commented May 12, 2024

Preprocessing should allow keeping Document ids unchanged #7557

Preprocessing should allow keeping Document ids unchanged #7557

Comments

julian-risch commented Apr 18, 2024

Phlasse commented Apr 18, 2024

CarlosFerLo commented Apr 29, 2024

vblagoje commented May 8, 2024

CarlosFerLo commented May 8, 2024

vblagoje commented May 9, 2024

CarlosFerLo commented May 9, 2024

vblagoje commented May 10, 2024

CarlosFerLo commented May 12, 2024