Disable DocumentCleaner new id generation #7496

DemirTonchev · 2024-04-07T13:38:33Z

Is your feature request related to a problem? Please describe.
DocumentCleaner replaces the ids of documents even when they were created with user-defined ids.
I expect that if the documents already have an id, this processing step won't change it.
I understand this is not straightforward because of the different use cases and because the id is generated from the content, but maybe a simple explicit flag could be added. Unlike a splitter, this step does not generate new documents, so this could make sense to a bigger audience.

Describe the solution you'd like
Running this code should not change the IDs.

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

d1 = Document(id="ASD1G4", content="Hello there")
d2 = Document(id="A2515S", content="General Kenobi")

DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=False,
    remove_repeated_substrings=False,
    generate_new_id=False,  # proposed example flag, not an existing parameter
).run([d1, d2])

Describe alternatives you've considered
A simple subclass of DocumentCleaner could do the job for me, or I could add the source id to the metadata.

System:

Haystack version : 2.0.0
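
The subclassing alternative mentioned above could look roughly like the sketch below. To keep it runnable without Haystack installed, `Doc` and `Cleaner` are hypothetical stand-ins rather than the real `Document` and `DocumentCleaner` classes. The sketch assumes the cleaner returns exactly one output document per input, in the same order, and that the id is only generated when none is supplied.

```python
from dataclasses import dataclass, field, replace
from hashlib import sha256
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document whose id defaults to a content hash."""
    content: str
    id: str = ""
    meta: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Only generate an id when the caller did not supply one.
        if not self.id:
            self.id = sha256(self.content.encode("utf-8")).hexdigest()


class Cleaner:
    """Hypothetical stand-in cleaner: builds new documents, so incoming ids are lost."""

    def run(self, documents: List[Doc]) -> Dict[str, List[Doc]]:
        cleaned = []
        for doc in documents:
            # Drop empty lines, then create a fresh document without passing the id.
            lines = [line for line in doc.content.splitlines() if line.strip()]
            cleaned.append(Doc(content="\n".join(lines), meta=dict(doc.meta)))
        return {"documents": cleaned}


class IdPreservingCleaner(Cleaner):
    """Subclass that copies the original ids back onto the cleaned documents."""

    def run(self, documents: List[Doc]) -> Dict[str, List[Doc]]:
        result = super().run(documents)
        # Assumes one output per input, in the same order as the inputs.
        restored = [
            replace(cleaned, id=original.id)
            for original, cleaned in zip(documents, result["documents"])
        ]
        return {"documents": restored}


docs = [
    Doc(id="ASD1G4", content="Hello there"),
    Doc(id="A2515S", content="General Kenobi"),
]
kept_ids = [d.id for d in IdPreservingCleaner().run(docs)["documents"]]
print(kept_ids)
```

The trade-off is that a restored id no longer reflects the cleaned content, so two documents with identical content can carry different ids.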

nickprock · 2024-04-08T08:29:53Z

Hi @DemirTonchev,

The id is a hash of the content (or other explicit metadata). You can use code like the following to keep your ids as metadata when you build the documents.

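from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# Keep the user-defined ids in meta; the document id itself is generated.
d1 = Document(content="Hello there", meta={"id": "ASD1G4"})
d2 = Document(content="General Kenobi", meta={"id": "A2515S"})

test = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=False,
    remove_repeated_substrings=False,
).run([d1, d2])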

# {'documents': 
# [Document(id=5dd415571caa2046d3d8528cf778e81c88ead438115d3bcbc367390dabcbc883, content: 'Hello there', meta: {'id': 'ASD1G4'}),
# Document(id=c0c2f8c86f060007142ad8ee660a72428876d4a7dcc54c196f16d0ba68a9757e, content: 'General Kenobi', meta: {'id': 'A2515S'})]}
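
To illustrate why a content-derived id follows the content, here is a minimal sketch. `content_id` is a hypothetical helper that hashes only the text with SHA-256; the exact fields Haystack feeds into its hash are not shown in this thread, so the generated values are not expected to match the ids above. The sketch also shows how the ids kept in meta can be mapped back to the generated ones.

```python
from hashlib import sha256
from typing import Any, Dict, List


def content_id(content: str) -> str:
    """Hypothetical content-derived id: SHA-256 hex digest of the text only."""
    return sha256(content.encode("utf-8")).hexdigest()


# Plain dicts standing in for documents that carry a user-defined id in meta.
records: List[Dict[str, Any]] = [
    {"content": "Hello there", "meta": {"id": "ASD1G4"}},
    {"content": "General Kenobi", "meta": {"id": "A2515S"}},
]

# Map each user-defined id to the id derived from the content.
user_to_generated: Dict[str, str] = {
    record["meta"]["id"]: content_id(record["content"]) for record in records
}

# The same content always yields the same id ...
same = content_id("Hello there") == content_id("Hello there")
# ... while any change to the content, even whitespace, yields a different one.
changed = content_id("Hello there") != content_id("Hello  there")
print(same, changed)
```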

DemirTonchev · 2024-04-08T08:39:14Z

Yes, this is my default solution as well. But I was wondering whether there is a different solution if I don't care about the hash of the content.

Describe alternatives you've considered
A simple inheritance of DocumentCleaner could do the job for me or add source id to meta data.

You gave the reason for not adding this as a functionality in your answer. Thanks.

shadeMe added the topic:preprocessing, 2.x (Related to Haystack v2.0), and P2 (Medium priority, add to the next sprint if no P1 available) labels Apr 8, 2024

DemirTonchev closed this as completed Apr 8, 2024
