Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix split_start_idx and _split_overlap information in DocumentSplitter #8046

Merged
merged 8 commits into from
Jul 24, 2024

Conversation

sjrl
Copy link
Contributor

@sjrl sjrl commented Jul 19, 2024

Related Issues

  • fixes #issue-number

Proposed Changes:

The DocumentSplitter was incorrectly calculating the split_start_idx and _split_overlap information due to slight miscalculations of appropriate indices.

This was missed because the values of split_start_idx and _split_overlap weren't actually tested for in the unit tests.

How did you test it?

Expanded the unit tests to test the values of the incorrect keys.

Notes for the reviewer

Checklist

@sjrl sjrl requested a review from a team as a code owner July 19, 2024 07:45
@sjrl sjrl requested review from vblagoje and removed request for a team July 19, 2024 07:45
@sjrl sjrl added 2.x Related to Haystack v2.0 type:bug Something isn't working and removed topic:tests labels Jul 19, 2024
@sjrl sjrl added this to the 2.3.1 milestone Jul 19, 2024
@sjrl sjrl requested a review from a team as a code owner July 19, 2024 07:47
@sjrl sjrl requested review from dfokina and removed request for a team July 19, 2024 07:47
@coveralls
Copy link
Collaborator

coveralls commented Jul 19, 2024

Pull Request Test Coverage Report for Build 10075792294

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.003%) to 90.139%

Files with Coverage Reduction New Missed Lines %
components/retrievers/sentence_window_retrieval.py 3 83.33%
Totals Coverage Status
Change from base Build 10075624435: -0.003%
Covered Lines: 6856
Relevant Lines: 7606

💛 - Coveralls

@vblagoje
Copy link
Member

There seem to be some minor test failures @sjrl

@sjrl
Copy link
Contributor Author

sjrl commented Jul 19, 2024

@vblagoje it should be fixed now. Issue was SentenceWindowRetriever had the same issue as the DocumentSplitter. Should be fixed now.

@davidsbatista
Copy link
Contributor

@sjrl thanks for detecting and fixing this

Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjrl not directly related to your change but why do we set instance variable split_at from run invocation? A big smell IYAM

Can we avoid this bad practice in any way in this class?

@davidsbatista
Copy link
Contributor

I have no clue, when I updated to contain the split_id and split_overlap it was already like that - what's the issue exactly?

@vblagoje
Copy link
Member

I have no clue, when I updated to contain the split_id and split_overlap it was already like that - what's the issue exactly?

It's a bad practice basically. And we should see how we can avoid it. If not here than in separate PR

@davidsbatista
Copy link
Contributor

And we should see how we can avoid it

I think it's enough to move the first block of _split_into_units into the _init_() and update the rest of the code accordingly, but I think that's out-of-scope for this PR.

@sjrl
Copy link
Contributor Author

sjrl commented Jul 24, 2024

Hey @vblagoje I agree with @davidsbatista let's make that fix in a separate PR

@vblagoje vblagoje self-requested a review July 24, 2024 13:13
Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with a note we all agreed on.

@sjrl sjrl merged commit baed478 into main Jul 24, 2024
17 checks passed
@sjrl sjrl deleted the fix-split-overlap-information branch July 24, 2024 13:15
@shadeMe shadeMe modified the milestones: 2.3.2, 2.4.0 Aug 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.x Related to Haystack v2.0 topic:tests type:bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants