Disable DocumentCleaner new id generation #7496

Closed
DemirTonchev opened this issue Apr 7, 2024 · 2 comments
Labels
2.x Related to Haystack v2.0 P2 Medium priority, add to the next sprint if no P1 available topic:preprocessing

Comments

@DemirTonchev

Is your feature request related to a problem? Please describe.
DocumentCleaner replaces the IDs of documents, even when the documents were created with user-defined IDs.
I expect that if the documents already have an ID, this processing step will not change it.
I understand this is not so straightforward because of the different use cases and the way the ID is generated from the content, but maybe a simple explicit flag could be added. Unlike a splitter, this step does not generate new documents, so this could make sense to a bigger audience.

Describe the solution you'd like
Running this code should not change the IDs.

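from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# Documents created with user-defined IDs
d1 = Document(id="ASD1G4", content="Hello there")
d2 = Document(id="A2515S", content="General Kenobi")

DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=False,
    remove_repeated_substrings=False,
    generate_new_id=False,  # example flag (proposed here, not an existing parameter)
).run([d1, d2])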

Describe alternatives you've considered
A simple subclass of DocumentCleaner could do the job for me, or I could add the source ID to the metadata.
System:

  • Haystack version : 2.0.0
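
The wrapper alternative mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the cleaner returns `{"documents": [...]}` with exactly one output per input in the same order, and the document objects allow assigning `id`. `run_keeping_ids` and `FakeCleaner` are hypothetical names used for illustration; `FakeCleaner` merely mimics the ID-replacing behaviour so the example runs without Haystack installed.

```python
from types import SimpleNamespace


def run_keeping_ids(cleaner, documents):
    """Run `cleaner` and copy the original IDs back onto its output.

    Assumes one output document per input document, in the same order.
    """
    original_ids = [doc.id for doc in documents]
    cleaned = cleaner.run(documents=documents)["documents"]
    if len(cleaned) != len(original_ids):
        # Positional matching would be wrong if documents were dropped or added.
        raise ValueError("cleaner changed the number of documents")
    for doc, original_id in zip(cleaned, original_ids):
        doc.id = original_id
    return {"documents": cleaned}


class FakeCleaner:
    """Hypothetical stand-in: strips whitespace and assigns fresh IDs."""

    def run(self, documents):
        return {
            "documents": [
                SimpleNamespace(id=f"new-{i}", content=d.content.strip())
                for i, d in enumerate(documents)
            ]
        }


docs = [
    SimpleNamespace(id="ASD1G4", content="Hello there "),
    SimpleNamespace(id="A2515S", content=" General Kenobi"),
]
result = run_keeping_ids(FakeCleaner(), docs)
print([d.id for d in result["documents"]])
```

Matching by position keeps the helper independent of how the IDs are generated, at the cost of failing loudly when the cleaner does not return one document per input.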
@shadeMe shadeMe added topic:preprocessing 2.x Related to Haystack v2.0 P2 Medium priority, add to the next sprint if no P1 available labels Apr 8, 2024
@nickprock
Contributor

nickprock commented Apr 8, 2024

Hi @DemirTonchev,

The ID is a hash of the content (or other explicit metadata). You can use code like the following to keep your IDs as metadata when you build the documents.

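from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# Keep the user-defined IDs in the metadata instead of the id field
d1 = Document(content="Hello there", meta={"id": "ASD1G4"})
d2 = Document(content="General Kenobi", meta={"id": "A2515S"})

test = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=False,
    remove_repeated_substrings=False,
).run([d1, d2])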

# {'documents': 
# [Document(id=5dd415571caa2046d3d8528cf778e81c88ead438115d3bcbc367390dabcbc883, content: 'Hello there', meta: {'id': 'ASD1G4'}),
# Document(id=c0c2f8c86f060007142ad8ee660a72428876d4a7dcc54c196f16d0ba68a9757e, content: 'General Kenobi', meta: {'id': 'A2515S'})]}
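
To illustrate why a user-defined ID gets lost, here is a minimal sketch, assuming the ID is a SHA-256 hex digest derived from the content and metadata. The 64-character IDs in the output above are consistent with that, but the exact fields Haystack hashes are an assumption here. `Doc` and `make_id` are hypothetical stand-ins, not Haystack classes.

```python
import hashlib
from dataclasses import dataclass, field


def make_id(content: str, meta: dict) -> str:
    """Hypothetical stand-in: derive an ID by hashing content and meta."""
    return hashlib.sha256(f"{content}{meta}".encode("utf-8")).hexdigest()


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not the Haystack Document class."""

    content: str
    meta: dict = field(default_factory=dict)
    id: str = ""

    def __post_init__(self):
        # Only derive an ID when the caller did not supply one.
        if not self.id:
            self.id = make_id(self.content, self.meta)


original = Doc(content="Hello there", id="ASD1G4")
# A processing step that builds a fresh document without passing the old ID
# ends up with a content-derived hash instead of "ASD1G4".
rebuilt = Doc(content=original.content, meta=dict(original.meta))
print(original.id, "->", rebuilt.id)
```

Under that assumption, any step that constructs a new document from the cleaned text gets a new hash unless it explicitly carries the old ID over.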

@DemirTonchev
Author

Yes, this is my default solution as well. But I was wondering about a different solution for the case where I don't care about the hash of the content.

Describe alternatives you've considered
A simple subclass of DocumentCleaner could do the job for me, or I could add the source ID to the metadata.

You gave the reason for not adding this as a functionality in your answer. Thanks.
