Disable DocumentCleaner new id generation #7496
Comments
Hi @DemirTonchev, the id is a hash of the content (or other explicit metadata). You can use code like the following to keep your ids as metadata when you build documents.
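The snippet from this comment is not preserved, so what follows is only a minimal sketch of the idea, not the commenter's code. It assumes a simplified stand-in class, `Doc` (hypothetical, not Haystack's `Document`), whose id defaults to a hash of its content and metadata, and a hypothetical `clean` function that builds a new document and therefore re-derives the id. The point is that a user-supplied id copied into `meta` survives the cleaning step even though the `id` field does not.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document whose id defaults to a content hash."""
    content: str
    meta: dict = field(default_factory=dict)
    id: str = ""

    def __post_init__(self):
        # Derive the id from content and metadata only when none was given.
        if not self.id:
            payload = self.content + repr(sorted(self.meta.items()))
            self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


def clean(doc: Doc) -> Doc:
    """Hypothetical cleaner: normalizes whitespace and returns a NEW document,
    so the id is re-derived and the user-defined id is lost."""
    return Doc(content=" ".join(doc.content.split()), meta=dict(doc.meta))


def with_source_id(content: str, source_id: str) -> Doc:
    """Build a document that carries the user-defined id in its metadata too."""
    return Doc(content=content, meta={"source_id": source_id}, id=source_id)


original = with_source_id("Hello   world ", "user-42")
cleaned = clean(original)

# The id field is regenerated, but the original id is still available in meta.
recovered_id = cleaned.meta["source_id"]
```

With the real library the same pattern would presumably be to pass the id inside the `meta` dictionary when constructing each document, then read it back after cleaning.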
Yes, this is my default solution as well. But I was wondering whether there is a different solution for when I don't care about the hash of the content.
You gave the reason for not adding this as a functionality in your answer. Thanks.
Is your feature request related to a problem? Please describe.
DocumentCleaner replaces the ids of documents, even when the documents were given user-defined ids.
I expect that when I use DocumentCleaner, if the docs already have an ID, this processing step won't change it.
I understand this is not so straightforward due to the different use cases and how the id gets generated from the content, but maybe a simple explicit flag could be added. This step does not generate new documents in the way a splitter does, so this could make sense to a bigger audience.
Describe the solution you'd like
Running this code should not change the IDs.
Describe alternatives you've considered
Subclassing DocumentCleaner could do the job for me, or adding a source id to the metadata.
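The subclassing alternative could look roughly like the sketch below. It does not use the real library: `Doc` and `BaseCleaner` are hypothetical stand-ins, and the sketch assumes the cleaner's `run` method returns a dictionary with a `"documents"` list holding one cleaned document per input, in the same order. Under those assumptions, the subclass simply copies each original id back onto the matching cleaned document.

```python
import hashlib
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document whose id defaults to a content hash."""
    content: str
    meta: dict = field(default_factory=dict)
    id: str = ""

    def __post_init__(self):
        # Derive the id from the content only when none was given.
        if not self.id:
            self.id = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


class BaseCleaner:
    """Hypothetical stand-in for a cleaner that regenerates ids."""

    def run(self, documents: List[Doc]) -> Dict[str, List[Doc]]:
        cleaned = [
            Doc(content=" ".join(d.content.split()), meta=dict(d.meta))
            for d in documents
        ]
        return {"documents": cleaned}


class IdPreservingCleaner(BaseCleaner):
    """Runs the parent cleaner, then restores each input document's id.

    Assumes the parent returns exactly one output per input, in order.
    """

    def run(self, documents: List[Doc]) -> Dict[str, List[Doc]]:
        result = super().run(documents)
        restored = [
            replace(new, id=old.id)
            for old, new in zip(documents, result["documents"])
        ]
        return {"documents": restored}


docs = [Doc(content="a   b ", id="doc-1"), Doc(content=" c  d", id="doc-2")]
plain = BaseCleaner().run(docs)["documents"]
kept = IdPreservingCleaner().run(docs)["documents"]
```

Here `plain` carries freshly hashed ids while `kept` carries `doc-1` and `doc-2`. If a cleaner ever drops or reorders documents, positional matching like this would no longer be safe, which is one reason the metadata approach is the more robust of the two.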
System: