Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preprocessing should allow keeping Document ids unchanged #7557

Closed
julian-risch opened this issue Apr 18, 2024 · 8 comments · Fixed by #7703
Closed

Preprocessing should allow keeping Document ids unchanged #7557

julian-risch opened this issue Apr 18, 2024 · 8 comments · Fixed by #7703
Assignees
Labels
2.x Related to Haystack v2.0 type:feature New feature or request

Comments

@julian-risch
Copy link
Member

Currently, preprocessing components such as the DocumentCleaner get documents as input and return preprocessed documents but with different ids.
We should enable users to specify custom ids and make sure these ids are used when writing the documents to a database.
We could achieve that in different ways:

  1. Ensure that preprocessing components such as the DocumentCleaner keep the document ids unchanged. However that means the id is not based on the preprocessed content anymore but on the original content.
  2. Add a new Component that calculates ids of documents right before the DocumentWriter. That means users with custom ids could provide a list of ids as inputs

A current workaround is to define a custom component that includes a method returning custom ids:

from copy import deepcopy
from typing import List
from haystack import Document, component

@component
class DocumentIDCreator:
    # This is a pass-through component that returns the same Documents but with different ids
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]):
        updated_documents = []
        for doc in documents:
            id = calculate_id(doc, id_hash_keys) # your code
            updated_documents.append(Document(id=id, content=doc.content, meta=deepcopy(doc.meta), embedding=doc.embedding))

        return {"documents": updated_documents}
@julian-risch julian-risch added 2.x Related to Haystack v2.0 type:feature New feature or request labels Apr 18, 2024
@Phlasse
Copy link

Phlasse commented Apr 18, 2024

Thanks for creating an issue on the topic.

I can imagine, a keep_id flag when calling the run function could be enough. It could be set to False by default.
The id could be then be reused in the end, when the new document is being created.

@CarlosFerLo
Copy link
Contributor

I will take this issue. I believe @Phlasse approach seems to be the way to go, as it keeps things sompler.

@vblagoje
Copy link
Member

vblagoje commented May 8, 2024

Hey @CarlosFerLo @julian-risch @Phlasse

Just saw your PR contribution @CarlosFerLo - thank you.

I was wondering why not, if we are already adding another init parameter, make it more powerful and future proof yet as simple as the flag keep_id

We could add id_generator init parameter:

id_generator: Optional[Callable[[Document, str], str]] = None

where the first parameter of the callable is cleaned document, the second parameter is doc id of the old document and callable returns str.

assigned in init like this:

self.id_generator = id_generator or (lambda doc, id: doc.id) # or whatever default is

and the end of the run method is:

clean_doc = Document(content=text, meta=deepcopy(doc.meta))
clean_doc.id = self.id_generator(clean_doc, doc.id)
cleaned_docs.append(clean_doc)

Let me know your thoughts about this approach.

@CarlosFerLo
Copy link
Contributor

@vblagoje then what are you suggesting?

We keep the keep_id flag, and we add a id_generator where the id generator can overwrite the resulting id of the cleaned document receiving the original document and the new id.

Or we delete the keep_flag and add all the functionality in the id_generator parameter and we can add a keep_id id generator that has this behaviour.

And the important part, what is the default id generator?

@vblagoje
Copy link
Member

vblagoje commented May 9, 2024

@CarlosFerLo I'm suggesting we don't use keep_id flag because id_generator callable can replace it and future use cases where people want to customize doc id generation. We can then use this familiar pattern not only in this component but in others as well - whenever some derived documents are created by a component and we need to generate derived docs ids.

@CarlosFerLo
Copy link
Contributor

@vblagoje Okey. I will create a new pull request, to clean the commit tree. Should I create a Class for this kind of function, to make the code more readable and to add more functionality if needed in the future. Should I name the class DocumentIdGenerator or just IdGenerator. And to specify the input parameters: it receives the new document and the old id? or better if it gets the new document and the old one, it would give more flexibility, but also, haveing the new document alrady formatted seems a little redundant.

@vblagoje
Copy link
Member

@CarlosFerLo a class in our design principles would be an overkill. A simple Callable suffices here. See in these docs how you can specify the input parameters and return values.

Even better insight, let this Callable have old and new docs as inputs and return id. You can call this parameter id_generator.
Thanks for working on this

@CarlosFerLo
Copy link
Contributor

@vblagoje Opened a new pull request that contains the new implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.x Related to Haystack v2.0 type:feature New feature or request
Projects
None yet
4 participants