feat: Find alternative for NLTK usage in our DocumentSplitter #5922

Open
sjrl opened this issue Sep 29, 2023 · 1 comment
Labels
2.x Related to Haystack v2.0

Comments

@sjrl
Contributor

sjrl commented Sep 29, 2023

Original issue: #5675

While we have merged a basic version of TextDocumentSplitter, it doesn't support whitespace cleaning or tokenization, so let's keep this issue open.
@sjrl You wanted to share your opinion on NLTK usage in the preprocessor?

Originally posted by @julian-risch in #5675 (comment)

I just wanted to say that I think it is still worth supporting NLTK, but I think we could also benefit from looking at other options. The Sol team has often found that sentence detection does not work well on documents that contain bullet points and other Markdown-like elements (e.g. code, headers). These elements aren't detected as separate "sentences". With bullet points, this sometimes leads to extremely long documents because all the bullet points are grouped into one document.

So I was wondering whether we could look into other processing libraries that already natively support this better than NLTK. I realize this might be out of scope for now, but I wanted to bring it up.

Originally posted by @sjrl in #5675 (comment)
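
To make the bullet-point problem above concrete, here is a minimal sketch of one possible workaround: splitting Markdown-like text into structural units (headers, bullets, fenced code blocks, paragraphs) before any sentence tokenizer sees it. This is only an illustration under stated assumptions. `split_structural_units` is a hypothetical helper, not part of Haystack or NLTK, and the regular expressions cover only simple cases (for example, wrapped continuation lines of a bullet are treated as a new paragraph).

```python
import re

# Hypothetical patterns for illustration only; real Markdown has more cases.
BULLET = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")  # "- item", "* item", "1. item"
HEADER = re.compile(r"^\s{0,3}#{1,6}\s+")         # "# Title" up to "###### Title"
FENCE = re.compile(r"^\s*`{3}")                   # opening or closing code fence


def split_structural_units(text: str) -> list[str]:
    """Return headers, bullets, code blocks, and paragraphs as separate units."""
    units: list[str] = []
    paragraph: list[str] = []
    code: list[str] = []
    in_code = False

    def flush_paragraph() -> None:
        # Emit the paragraph collected so far as a single unit.
        if paragraph:
            units.append(" ".join(paragraph))
            paragraph.clear()

    for line in text.splitlines():
        if FENCE.match(line):
            if in_code:
                # Closing fence: emit the whole code block as one unit.
                code.append(line)
                units.append("\n".join(code))
                code.clear()
            else:
                flush_paragraph()
                code.append(line)
            in_code = not in_code
        elif in_code:
            code.append(line)
        elif not line.strip():
            flush_paragraph()
        elif BULLET.match(line) or HEADER.match(line):
            # Each bullet or header becomes its own unit.
            flush_paragraph()
            units.append(line.strip())
        else:
            paragraph.append(line.strip())

    flush_paragraph()
    if code:  # unterminated fence: keep what was collected
        units.append("\n".join(code))
    return units
```

Each returned unit could then be passed to a sentence tokenizer (NLTK or an alternative) on its own, so a list of bullets is never merged into one long "sentence". Whether this pre-splitting step belongs in the splitter itself or in a separate cleaning component is an open design question.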

@sjrl sjrl added the 2.x Related to Haystack v2.0 label Sep 29, 2023
@Timoeller Timoeller changed the title feat: TextDocumentSplitter text cleaning, tokenization, sentence splitting feat: Find alternative for NLTK usage in our TextDocumentSplitter Oct 9, 2023
@Timoeller
Contributor

I changed the name of the issue to make it clearer that we should look into alternatives to the existing NLTK-based implementation.

@Timoeller Timoeller added the P3 Low priority, leave it in the backlog label Oct 9, 2023
@masci masci removed the P3 Low priority, leave it in the backlog label May 10, 2024
@sjrl sjrl changed the title feat: Find alternative for NLTK usage in our TextDocumentSplitter feat: Find alternative for NLTK usage in our DocumentSplitter May 13, 2024

3 participants