Adding splitting information in the metadata of DocumentSplitter output #7389

Closed
tradicio opened this issue Mar 20, 2024 · 6 comments · Fixed by #7933
Labels: 2.x (Related to Haystack v2.0), topic:preprocessing, type:feature (New feature or request)

Comments

@tradicio

Is your feature request related to a problem? Please describe.
When splitting a document in Haystack v1, the method _create_docs_from_splits() of the PreProcessor class saves relevant metadata such as _split_id and _split_overlap. So far, the DocumentSplitter class in Haystack v2 does not do the same. I think it would be useful to include this metadata in the final output of DocumentSplitter.

Describe the solution you'd like
In the run() method, one solution could be to reproduce an adapted version of _create_docs_from_splits() from the PreProcessor class in Haystack v1.
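To make the idea above concrete, here is a minimal sketch of what attaching split metadata could look like. It does not use Haystack at all: `Chunk` is a hypothetical stand-in for a document object, `split_with_metadata` is an illustrative helper, and the `source_id` key is an assumption. The overlap is recorded as a plain word count, which is a simplification and may not match the exact format that v1 stores in _split_overlap.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a document: text content plus a meta dict."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def split_with_metadata(
    text: str, source_id: str, split_length: int, split_overlap: int
) -> List[Chunk]:
    """Split text into word windows and record each window's position.

    Illustrative only: _split_overlap is stored as a word count here,
    which is an assumption and not necessarily the Haystack v1 format.
    """
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length > 0 and 0 <= split_overlap < split_length")

    words = text.split()
    step = split_length - split_overlap
    chunks: List[Chunk] = []
    for split_id, start in enumerate(range(0, len(words), step)):
        window = words[start:start + split_length]
        chunks.append(
            Chunk(
                content=" ".join(window),
                meta={
                    "source_id": source_id,  # assumed key linking back to the parent
                    "_split_id": split_id,   # position of the chunk in the parent text
                    # the first chunk has no predecessor to overlap with
                    "_split_overlap": split_overlap if split_id > 0 else 0,
                },
            )
        )
        # stop once this window already reaches the end of the text
        if start + split_length >= len(words):
            break
    return chunks
```

With this shape, any consumer of the chunks can recover their original order from the meta alone, without relying on insertion order in the store.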

Additional context
To check the differences between the outputs of the two Haystack versions, build an indexing pipeline in Haystack v2 and compare its output with that of the PreProcessor in Haystack v1.

anakin87 added the topic:preprocessing and type:feature labels on Mar 20, 2024
@anakin87
Member

@tradicio are you using this information in your application?
Please explain your use case so we can understand how to support it, possibly by reintroducing this information.

@tradicio
Author

Hi @anakin87, thanks for your message!

You're correct, I am using _split_id and _split_overlap in my application to keep track of the order of the chunks produced by PreProcessor(). This allows the text chunks to be displayed as they are stored in the DocumentStore() and is very useful for checking how a long text has been divided and ordered.

With the introduction of Haystack v2 in the application, I would like to keep this functionality so that I can continue to display the sorted output of the DocumentSplitter() class.
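As a rough illustration of this use case, the sketch below restores the reading order of chunks that come back from a store in arbitrary order. The chunks are plain dicts rather than Haystack objects, `order_chunks` is a hypothetical helper, and the `source_id` key is an assumption about how each chunk points back to its parent document.

```python
from typing import Any, Dict, List


def order_chunks(chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group chunks by parent document, then sort by position in that document.

    Assumes every chunk has meta["source_id"] and meta["_split_id"];
    both keys are illustrative here.
    """
    return sorted(
        chunks,
        key=lambda c: (c["meta"]["source_id"], c["meta"]["_split_id"]),
    )


# Chunks as they might come back from a store, in no particular order.
retrieved = [
    {"content": "third part", "meta": {"source_id": "doc1", "_split_id": 2}},
    {"content": "first part", "meta": {"source_id": "doc1", "_split_id": 0}},
    {"content": "second part", "meta": {"source_id": "doc1", "_split_id": 1}},
]

for chunk in order_chunks(retrieved):
    print(chunk["meta"]["_split_id"], chunk["content"])
```

Without _split_id in the meta, this ordering would have to be inferred from the text itself, which is the gap the feature request describes.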

Let me know if you need further clarification, I am really glad to contribute!

@anakin87
Member

Ok, I understand!

Let's involve @julian-risch for an opinion since he worked on these components.

@julian-risch
Member

@tradicio Thanks for the feedback. I agree we should add the _split_id and _split_overlap metadata in a future iteration of DocumentSplitter. 👍

@anakin87
Member

@tradicio If you feel like it, go ahead and try to create a PR.

anakin87 added the 2.x label on Mar 27, 2024
@tradicio
Author

> @tradicio If you feel like it, go ahead and try to create a PR.

Sure, I'll try to create a PR in the next few days! Thanks for all your support, I am really glad to contribute.
