Preprocessing should allow keeping Document ids unchanged #7557
Comments
Thanks for creating an issue on the topic. I can imagine that a keep_id flag passed when calling the run function could be enough. It could be set to False by default.
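For illustration only, the flag idea could look roughly like the sketch below. `Doc` and `clean` are hypothetical stand-ins, not the library's `Document` or `DocumentCleaner`, and the content-hash id is an assumption made just to show an id that changes when the text changes.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in: a document whose id defaults to a hash of its content."""
    content: str
    id: str = ""

    def __post_init__(self):
        if not self.id:
            self.id = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


def clean(documents: List[Doc], keep_id: bool = False) -> List[Doc]:
    """Collapse whitespace; reuse the original id only when keep_id is True."""
    cleaned = []
    for doc in documents:
        text = " ".join(doc.content.split())
        # An empty id makes Doc derive a fresh id from the cleaned text.
        cleaned.append(Doc(content=text, id=doc.id if keep_id else ""))
    return cleaned
```

With `keep_id=False` the behaviour described in the issue is unchanged (new content, new id); with `keep_id=True` the cleaned document carries the id of its source.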
I will take this issue. I believe @Phlasse's approach is the way to go, as it keeps things simpler.
Hey @CarlosFerLo @julian-risch @Phlasse, I just saw your PR contribution, @CarlosFerLo. Thank you. Since we are already adding another init parameter, why not make it more powerful and future-proof while keeping it as simple as the keep_id flag? We could add an id_generator init parameter:
where the first parameter of the callable is the cleaned document, the second parameter is the id of the old document, and the callable returns a str. It is assigned in init like this:
and the end of the run method is:
Let me know your thoughts about this approach.
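The snippets from this comment did not survive, so the following is not the original code. It is a minimal sketch of the described shape: a callable that receives the cleaned document and the old id and returns a `str`. `Doc`, `Cleaner`, `content_hash_id`, and `keep_old_id` are hypothetical names, and the default generator is an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document."""
    content: str
    id: str = ""


def content_hash_id(cleaned: Doc, old_id: str) -> str:
    """Assumed default: derive a fresh id from the cleaned content."""
    return hashlib.sha256(cleaned.content.encode("utf-8")).hexdigest()


def keep_old_id(cleaned: Doc, old_id: str) -> str:
    """Reuse the id of the source document."""
    return old_id


class Cleaner:
    def __init__(self, id_generator: Optional[Callable[[Doc, str], str]] = None):
        # Fall back to the assumed default when no generator is given.
        self.id_generator = id_generator or content_hash_id

    def run(self, documents: List[Doc]) -> List[Doc]:
        cleaned_docs = []
        for doc in documents:
            cleaned = Doc(content=" ".join(doc.content.split()))
            # The generator sees the cleaned document and the old id.
            cleaned.id = self.id_generator(cleaned, doc.id)
            cleaned_docs.append(cleaned)
        return cleaned_docs
```

Passing `keep_old_id` reproduces the keep_id behaviour, while any other callable (for example one that appends a suffix to the old id) covers custom schemes without further parameters.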
@vblagoje then what are you suggesting? We keep the Or we delete the And the important part, what is the default id generator?
@CarlosFerLo I'm suggesting we don't use
@vblagoje Okay. I will create a new pull request to clean up the commit tree. Should I create a class for this kind of function, to make the code more readable and to allow adding more functionality in the future? Should I name the class
@CarlosFerLo a class would be overkill under our design principles. A simple Callable suffices here. See these docs for how to specify the input parameters and return values. Even better, let this Callable take the old and new docs as inputs and return the id. You can call this parameter id_generator.
@vblagoje Opened a new pull request that contains the new implementation.
Currently, preprocessing components such as the DocumentCleaner get documents as input and return preprocessed documents but with different ids.
We should enable users to specify custom ids and make sure these ids are used when writing the documents to a database.
We could achieve that in different ways:
A current workaround is to define a custom component that includes a method returning custom ids:
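The original workaround code is missing here. As a rough sketch of the idea, and not the author's code, a wrapper can run an existing preprocessing step and then reassign ids through an overridable method. `Doc`, `IdPreservingStep`, and `custom_id` are hypothetical names, and the sketch assumes the wrapped step returns exactly one document per input, in the same order (which would not hold for a splitter).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document."""
    content: str
    id: str


class IdPreservingStep:
    """Wraps a preprocessing function and restores ids afterwards."""

    def __init__(self, preprocess: Callable[[List[Doc]], List[Doc]]):
        self.preprocess = preprocess

    def custom_id(self, original: Doc, processed: Doc) -> str:
        # Override this method to implement a different id scheme.
        return original.id

    def run(self, documents: List[Doc]) -> List[Doc]:
        processed = self.preprocess(documents)
        if len(processed) != len(documents):
            raise ValueError("wrapped step must return one document per input")
        return [
            Doc(content=new.content, id=self.custom_id(old, new))
            for old, new in zip(documents, processed)
        ]
```

Because the ids are reassigned before the documents leave the step, whatever writes them to a database afterwards sees the custom ids.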