Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve relevance scoring for titles and object-name matches in search results #12441

Conversation

jayaddison
Copy link
Contributor

@jayaddison jayaddison commented Jun 19, 2024

Feature or Bugfix

  • Bugfix

Purpose

Detail

  • Add an example project to the JavaScript tests that exhibits the sub-optimal search result behaviour.
  • Add JavaScript test coverage to assert on the expected, improved relative ordering of query results.
  • Implement changes to the indexing/query algorithms to improve the query results without regressing other JavaScript search tests.

Relates

@jayaddison jayaddison added html search javascript Pull requests that update Javascript code labels Jun 19, 2024
@jayaddison
Copy link
Contributor Author

Commit cb0f6e7 -- regenerating a search index file from scratch -- seems to be have been necessary because I had a stale _build directory on my local machine for the relevant input project for the fixture.

That seems like a bug; re-using an existing _build directory should be valid behaviour and should produce the same search index output as a fresh build. I'll file that as a separate issue within the next few days.

jayaddison and others added 3 commits June 23, 2024 17:47
…earch-scoring-adjustment

Conflicts:
	tests/js/searchtools.js (no edit conflict; unit test failure)
  * Minimize diff relative to mainline codebase to ease review/historic viewing.
  * Move updated main-title score variable into a ``Scorer`` constant.
@jayaddison jayaddison marked this pull request as ready for review June 24, 2024 10:32
@jayaddison
Copy link
Contributor Author

I've a slight preference for #12047 to be merged before this, to make the code and diff history easier to follow, if+when either of them are considered ready.

@jayaddison jayaddison requested a review from wlach June 24, 2024 17:44
Copy link
Contributor

@wlach wlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Some minor optional comments.

sphinx/themes/basic/static/searchtools.js Outdated Show resolved Hide resolved
sphinx/themes/basic/static/searchtools.js Outdated Show resolved Hide resolved
tests/js/searchtools.js Outdated Show resolved Hide resolved
@jayaddison
Copy link
Contributor Author

This should be ready for further review / merge; I've no changes planned on this branch.

Copy link
Contributor

@wlach wlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of extra comments but in general this LGTM. I think this will be a nice incremental improvement. @picnixz may you could take a look too (and merge when you think it ready)?

Comment on lines 331 to 332
let score = Math.round(Scorer.title * queryLower.length / title.length);
let boost = titles[file] === title ? 1 : 0; // add a boost for document titles
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second look, could these be declared as const?

Suggested change
let score = Math.round(Scorer.title * queryLower.length / title.length);
let boost = titles[file] === title ? 1 : 0; // add a boost for document titles
const score = Math.round(Scorer.title * queryLower.length / title.length);
const boost = titles[file] === title ? 1 : 0; // add a small boost for document titles

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thinkg it's better using a const as well. But on a second thought, I'm wondering whether a +1 is sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously a title and subsection title with the same text would have equal scores, leaving their relative ranking undefined.

Any positive value here should have the effect of elevating the main-document titles above same-named subsection titles in the search results.

A single-integer increment is used because ideally we don't want the main document titles to move up in the rankings 'too much' and overtake other matches. That is possible, though, especially given that some scores are fractional. So I have the opposite worry: that +1 might be too much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(a good way to figure these out could be to develop counterexamples and add test cases for them)

tests/js/searchtools.js Outdated Show resolved Hide resolved
tests/js/searchtools.js Show resolved Hide resolved
Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good if we have a more complete example where you have a lot of multiple matches of the same kind. Does it cover the issue with the asyncio module that we described?

CHANGES.rst Outdated Show resolved Hide resolved
Comment on lines 331 to 332
let score = Math.round(Scorer.title * queryLower.length / title.length);
let boost = titles[file] === title ? 1 : 0; // add a boost for document titles
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thinkg it's better using a const as well. But on a second thought, I'm wondering whether a +1 is sufficient.

tests/js/roots/titles/relevance.py Show resolved Hide resolved
tests/js/searchtools.js Show resolved Hide resolved
@jayaddison
Copy link
Contributor Author

It'd be good if we have a more complete example where you have a lot of multiple matches of the same kind. Does it cover the issue with the asyncio module that we described?

Yep, the thinking here was to replicate the asyncio relevance ordering problem using a minimal test case, and then to adjust the code to fix it; attempting to apply (and demonstrate) a Test-Driven-Development approach to search ranking fixups. I'll investigate expanding the test fixture data to add more results.

@wlach
Copy link
Contributor

wlach commented Jul 9, 2024

It'd be good if we have a more complete example where you have a lot of multiple matches of the same kind. Does it cover the issue with the asyncio module that we described?

This PR uses a similar approach to what was described/shown in #12393 (comment) (edit: original link was wrong) so it should.

However, it would be good to another test before landing to be sure. I tested there by checking out the cpython repository and regenerating the Doc/ directory using a virtualenv with my development version of Sphinx.

@AA-Turner
Copy link
Member

@jayaddison are you happy with this // ready to review & merge?

A

@jayaddison
Copy link
Contributor Author

@AA-Turner yep, I think this is ready.

@AA-Turner AA-Turner changed the title [HTML search] Improve relevance scoring for titles and object-name matches. Improve relevance scoring for titles and object-name matches in search results Jul 11, 2024
@AA-Turner AA-Turner merged commit 91c5cd3 into sphinx-doc:master Jul 11, 2024
23 checks passed
@AA-Turner
Copy link
Member

Thanks all!

A

@jayaddison jayaddison deleted the issue-12391/subtitle-search-scoring-adjustment branch July 11, 2024 11:10
@jayaddison
Copy link
Contributor Author

Thank you @AA-Turner!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
html search javascript Pull requests that update Javascript code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[search] issues with the new HTML search algorithm
4 participants