
HTML Search: omit anchor reference from document titles in the search index. #12047

Conversation

@jayaddison (Contributor)

Feature or Bugfix

  • Bugfix

Purpose

  • Adjusts the contents of the search-index provided to the client so that the browser correctly de-duplicates search results that link to the top-level title in documents.

Detail

Relates

@jayaddison jayaddison changed the title Draft: [HTML search] omit anchor reference from document titles in the search index. [HTML search] omit anchor reference from document titles in the search index. Mar 4, 2024
@jayaddison jayaddison marked this pull request as ready for review March 4, 2024 14:03
@wlach (Contributor)

wlach commented Mar 9, 2024

This seems preferable to my solution in #11942; I just tested it, and it has the same results while making the search index smaller. 👍 from me.

@jayaddison jayaddison changed the title [HTML search] omit anchor reference from document titles in the search index. HTML Search: omit anchor reference from document titles in the search index. Mar 14, 2024
@jayaddison added the type:bug and awaiting:review (PR waiting for a review by a maintainer) labels, and removed the awaiting:response (Waiting for a response from the author of this issue) label Mar 14, 2024
@picnixz (Member)

picnixz commented Mar 15, 2024

Fine for me (I'm not an expert in that search-related aspect, so I'll leave it to you guys). Should I understand that this PR would replace #11942? @jayaddison, ping me when you want it to be merged.

@jayaddison (Contributor Author)

Fine for me (I'm not an expert in that search-related aspect, so I'll leave it to you guys). Should I understand that this PR would replace #11942?

That's correct. I think there are three things that I don't like about this pull request, in order of priority:

  1. It doesn't offer JavaScript test coverage to demonstrate the fix -- only Python test coverage that confirms the expected index format.
  2. I haven't been able to think of a way that this would happen yet, but in theory I expect some downstream tools that use the current index format could be broken by this - in particular the possibility of null values where previously everything was a string.
  3. Related to (2), the index format isn't as small as it could be. In raw text, the JavaScript empty string can be represented by either '' or "" (two characters), whereas a null value is represented by the token null (four characters). It may not matter a lot, but over a long enough duration of time the cumulative cost could be significant (or not! maybe it's a waste of time considering it).
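As a rough illustration of point (3), the serialized-size difference between `null` and an empty string can be checked with Python's `json` module (a hypothetical sketch, not Sphinx code):

```python
import json

# "null" serializes to four characters; an empty string to two (the quotes).
assert json.dumps(None) == "null"
assert json.dumps("") == '""'

# Over many index entries the difference accumulates, e.g. 1000 entries:
entries_null = json.dumps([None] * 1000)
entries_empty = json.dumps([""] * 1000)
print(len(entries_null) - len(entries_empty))  # 2000 bytes larger
```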

@jayaddison (Contributor Author)

It doesn't offer JavaScript test coverage to demonstrate the fix -- only Python test coverage that confirms the expected index format.

I think I'll begin work on a small JavaScript refactor PR to make test coverage easier to add.

@picnixz (Member)

picnixz commented Mar 15, 2024

particular the possibility of null values where previously everything was a string.

Is it possible to keep a string or is the null type needed?

@jayaddison (Contributor Author)

particular the possibility of null values where previously everything was a string.

Is it possible to keep a string or is the null type needed?

If I remember correctly, this conditional needs to be adjusted if we're using empty strings, but otherwise it should be fine.

What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds searchindex.js when projects are built. It feels fragile that the test indexes are written by hand.

@jayaddison (Contributor Author)

What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds searchindex.js when projects are built. It feels fragile that the test indexes are written by hand.

Moved into #12099.

@jayaddison (Contributor Author)

I'd like to check whether we can get #12102 in place before progressing this pull request further. If that can be added, then I think adding test coverage here will be much easier and more reliable (I'll be able to create a sample Sphinx project that returns duplicate search results, and add test coverage against that).

@jayaddison jayaddison marked this pull request as draft March 16, 2024 11:08
@jayaddison (Contributor Author)

(and maybe do the null to '' refactoring at the same time and more safely, given the test coverage)

tests/test_search.py (outdated)
@@ -176,7 +179,10 @@ def test_IndexBuilder():
'index': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'},
'test': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'},
}
assert index._title_mapping == {'section_titl': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'}}
assert index._title_mapping == {
Member

Here, I see that the values are sets and thus can be unordered. Is it possible to have a stronger ordering guarantee on the title_mapping?

Contributor Author

To check that I understand correctly: use an assertion that checks not only the contents of the title mapping, but also the ordering of the elements inside it?

(makes sense to me, order can be significant for index data)

Contributor Author

If my understanding there was correct: I'd prefer to handle that in a separate issue thread and PR, if that's OK. Or, if I misunderstood: let's make sure to resolve that first.

Member

If there is more than one occurrence of that, yes we could discuss it in another thread.

Contributor Author

Yep; I think it would be applicable for both _title_mapping and _mapping; because both of those currently use a pattern of <mapping>.setdefault(..., set()).add(...) (implying that insertion-order may not match subsequent access iteration order):

for word in word_store.title_words:
# add stemmed and unstemmed as the stemmer must not remove words
# from search index.
stemmed_word = stem(word)
if _filter(stemmed_word):
self._title_mapping.setdefault(stemmed_word, set()).add(docname)
elif _filter(word):
self._title_mapping.setdefault(word, set()).add(docname)
for word in word_store.words:
# add stemmed and unstemmed as the stemmer must not remove words
# from search index.
stemmed_word = stem(word)
if not _filter(stemmed_word) and _filter(word):
stemmed_word = word
already_indexed = docname in self._title_mapping.get(stemmed_word, ())
if _filter(stemmed_word) and not already_indexed:
self._mapping.setdefault(stemmed_word, set()).add(docname)

Contributor Author

Correction: any runtime (in-memory) differences in the ordering of objects within the _title_mapping and _mapping set entries are later resolved into deterministic order by an explicit sort that occurs during freezing (serialization) of the search index (thanks to d24bd73).

I'm not aware of any other current bugs causing nondeterminism in our searchindex.js output since v7.3.0 (95fb0e5 included in that release handled a similar situation).

Member

So _title_mapping is unsorted in the tests, but sorted in production? If so, a comment is needed (at least to record the reason).

Contributor Author

Yep, pretty much. I'll be a bit verbose:

  • In-memory (and therefore in the tests), each value in _title_mapping may store elements in arbitrary ordering.
  • The tests use Python set equality comparison, and that is order-agnostic (as it should be) -- so the tests aren't really checking the order, even though it might appear from the code that they would.
  • A production Sphinx build sorts these entries when writing searchindex.js.

What you suggest about a comment makes sense...

However, perhaps I could attempt a separate PR to refactor the code so that sorting occurs in-place and in-memory, and to adjust the unit tests to assert on that. That would provide stronger guarantees than a comment in the code.

Member

No need for that. It's better that at runtime, the objects are represented in sets since they are more efficient. It's only for serialization that you need to sort to ensure reproducibility.
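The pattern described here, keeping unordered sets in memory and sorting only at serialization time, can be sketched as follows (hypothetical names; the actual Sphinx implementation differs):

```python
import json

# In-memory: unordered sets give O(1) membership checks during indexing.
title_mapping = {}  # dict[str, set[str]]
for word, docname in [("titl", "doc2"), ("titl", "doc1"), ("index", "doc1")]:
    title_mapping.setdefault(word, set()).add(docname)

# Only at serialization ("freeze") time: sort for reproducible output.
frozen = {word: sorted(docs) for word, docs in sorted(title_mapping.items())}
print(json.dumps(frozen))  # identical bytes on every build
```

This keeps the runtime efficiency of sets while still guaranteeing byte-for-byte reproducible serialized output.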

Contributor Author

Ok, agreed (storing the lists in-place for ordered iteration would probably add various types of overhead).

I've pushed a commit to add some commentary as a docstring on the relevant method and also in the unit tests.

tests/test_search.py
@@ -391,7 +391,7 @@ def freeze(self) -> dict[str, Any]:
objtypes = {v: k[0] + ':' + k[1] for (k, v) in self._objtypes.items()}
objnames = self._objnames

alltitles: dict[str, list[tuple[int, str]]] = {}
alltitles: dict[str, list[tuple[int, str | None]]] = {}
@picnixz (Member) commented Jun 18, 2024

Could you explain the shape of this value? I.e., what does list[tuple[int, str | None]] correspond to (what is the integer, and what is the str | None), what is the dict's str key, and why does it change from all_titles? (The list now holds pairs of int and string rather than pairs of strings.)

Contributor Author

Mm, this is not very clear at the moment, but I believe that the integer values here are what I'd call documentId values - sequential IDs assigned to each of the source .rst files at build-time. The second value in each tuple -- the str or empty value -- is a target reference (aka anchor when talking about HTML) -- it's an indicator about what destination on the page/document the user should be taken to were they to navigate using this entry.

Perhaps this is a situation where we should refactor and extract out meaningful type aliases (a bit like the recent Intersphinx refactorings).

@picnixz (Member) commented Jun 18, 2024

I would not use aliases, but the int has always seemed to me to correspond to the index in the list, so the same element could appear multiple times with different indices. However, I'd be happy to know whether that is really the reason we duplicate it, or whether we could just live with a list of strings.

Contributor Author

I guess an example to explain that could be a title like Recent Changes that appears in multiple documents and might have different headings (or perhaps even be the document title in some cases).

alltitles: {"Recent Changes": [[0, None], [1, "coolmodule-recentchanges"], [5, "anothermodule-recentchanges"], ...]}

(a bit hypothetical, but it's to handle situations like that I believe. again, a good case for checking and documenting)
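Building such a mapping could be sketched like this (hypothetical data and variable names, not the actual `freeze` code):

```python
# Each document is assigned a sequential integer ID at build time. Titles map
# to (doc_id, anchor) pairs, where anchor is None when the title is the
# document's own top-level title (so no fragment is needed in the link).
docs = [
    ("index", "Recent Changes", None),
    ("coolmodule", "Recent Changes", "coolmodule-recentchanges"),
]

alltitles = {}  # dict[str, list[tuple[int, str | None]]]
for doc_id, (docname, title, anchor) in enumerate(docs):
    alltitles.setdefault(title, []).append((doc_id, anchor))

print(alltitles)
```

The same title string can therefore fan out to several documents, each with its own ID and (possibly absent) anchor.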

@wlach (Contributor)

wlach commented Jun 18, 2024

Mm, yep - the type hints themselves shouldn't affect anything... but they are an indicator that something about the possible range of values for their attributes has changed (None now permitted) - and since the env is serialized and potentially re-loaded during non-fresh builds (to speed up project rebuilds), I'm worrying about possible regressions due to that.

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:

https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)

IMO it's fine. We're just changing what gets included in the searchindex.

I think I agree here - it's simply that document main titles are now linked directly instead of with a hyperlink anchor/target.

However, similar to the env pickling/reloading -- there is code that I remember seeing that can reload the search state from a previously-frozen serialized format.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case:

frozen.get('envversion') != self.env.version:

@jayaddison (Contributor Author)

Mm, yep - the type hints themselves shouldn't affect anything... but they are an indicator that something about the possible range of values for their attributes has changed (None now permitted) - and since the env is serialized and potentially re-loaded during non-fresh builds (to speed up project rebuilds), I'm worrying about possible regressions due to that.

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:

https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content.

Ref: b1271fa

However, it's not clear to me whether those loaded entries are retained before the updated search index is written, because of the code linked below; is env.all_docs the entire project (in which case it's definitely not an incremental rebuild), or are they only the modified documents?

self.indexer.prune(self.env.all_docs)

(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)

IMO it's fine. We're just changing what gets included in the searchindex.

I think I agree here - it's simply that document main titles are now linked directly instead of with a hyperlink anchor/target.
However, similar to the env pickling/reloading -- there is code that I remember seeing that can reload the search state from a previously-frozen serialized format.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case:

frozen.get('envversion') != self.env.version:

Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run:

# This is increased every time an environment attribute is added
# or changed to properly invalidate pickle files.
ENV_VERSION = 61
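The version guard described above, rejecting any previously-frozen index whose environment version does not match, can be sketched as follows (hypothetical function names; the actual check lives in Sphinx's search indexer):

```python
import json

ENV_VERSION = 61  # bumped whenever the serialized environment format changes


def load_frozen_index(raw, current_version=ENV_VERSION):
    """Return the frozen index if its envversion matches, else None (forcing a rebuild)."""
    frozen = json.loads(raw)
    if frozen.get("envversion") != current_version:
        return None  # stale format: discard and rebuild from scratch
    return frozen


assert load_frozen_index('{"envversion": 60}') is None
```

Incrementing ENV_VERSION alongside a format change is therefore enough to invalidate all previously-serialized indexes.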

@wlach (Contributor)

wlach commented Jun 18, 2024

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:
https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content.

Ref: b1271fa

Ok, but I don't see anything in the codebase accessing that variable (self._titles) aside from when we write per that search query above. Let me know if I'm missing something.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case:

frozen.get('envversion') != self.env.version:

Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run:

# This is increased every time an environment attribute is added
# or changed to properly invalidate pickle files.
ENV_VERSION = 61

You could always increment that number if you wanted to be on the safe side. Not really a lot of downside outside of maybe a very minor build-time penalty in a few obscure cases.

@jayaddison (Contributor Author)

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:
https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content.
Ref: b1271fa

Ok, but I don't see anything in the codebase accessing that variable (self._titles) aside from when we write per that search query above. Let me know if I'm missing something.

Ok - that matches my understanding too; the only time we write to the search-related attributes on BuildEnvironment is during loading of an existing searchindex.js from the output build directory. Subsequently some/all of that in-memory content is invalidated before we write the updated searchindex.js to that same output build directory.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case:

frozen.get('envversion') != self.env.version:

Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run:

# This is increased every time an environment attribute is added
# or changed to properly invalidate pickle files.
ENV_VERSION = 61

You could always increment that number if you wanted to be on the safe side. Not really a lot of downside outside of maybe a very minor build-time penalty in a few obscure cases.

Given that we believe that the Python-side changes should be backwards-compatible here (no structural changes to the pickled env data stored-and-read by Sphinx), I think I'd prefer to leave the env-version value as-is.

@jayaddison (Contributor Author)

NodeJS tests on this branch began failing from 9fef43b onwards; that isn't easily visible from the CI results in the GitHub web UI because we weren't running those on each commit until #12445 was merged (the latest merge-from-master includes that, explaining the visibility of the test failure).

The change at 9fef43b regenerated the search index file for the repro test case. Next I'm going to try to figure out why the tests began failing after that, and what that means.

@jayaddison (Contributor Author)

jayaddison commented Jun 24, 2024

Good news, I think.

NodeJS tests on this branch began failing from 9fef43b onwards; that isn't easily visible from the CI results in the GitHub web UI because we weren't running those on each commit until #12445 was merged (the latest merge-from-master includes that, explaining the visibility of the test failure).

The change at 9fef43b regenerated the search index file for the repro test case. Next I'm going to try to figure out why the tests began failing after that, and what that means.

This index-regeneration occurred directly after e947f6b - a merge from the mainline branch. That merge pulled in support for JavaScript testing of client-side HTML. And that happened to include a test case -- with a relevant TODO note -- indicating that some of the data in it is a workaround and should be removed when #11961 is fixed. This PR fixes #11961 - so it is in fact expected that we should remove that data, and it's good that the test is failing while it is still in place.

Edit: remove duplication of quoted comment

@wlach (Contributor)

wlach commented Jul 5, 2024

@jayaddison Is there anything left to be done here or is this ready for a review by a committer?

@jayaddison (Contributor Author)

This is ready, yep; the mypy lint failure on the latest commit (64387c4) is unrelated (#12510), and my last two messages where I'm talking to myself (not uncommon, sadly) are because I hadn't updated the test coverage after merging from mainline (to remove a duplicate result expectation added for a separate, already-merged PR).

@picnixz (Member)

picnixz commented Jul 5, 2024

I'll review it tomorrow as well! (now that I've successfully defended, I have a bit more time!)

@jayaddison (Contributor Author)

Well done completing the defence @picnixz :) Thank you!

@picnixz (Member) left a comment

I think my last question was resolved. I'm also happy that we don't have those duplicated entries anymore! I'll wait for Will to also review it, and then we can merge it (just ping me when it's done).


@wlach (Contributor) left a comment

All looks good from my end! @picnixz please merge when ready

@picnixz (Member)

picnixz commented Jul 7, 2024

I'll do it tomorrow (I was working on something else and missed the notification, but I'm going offline now, and I prefer not to merge something whose merge result I won't be able to see).

@picnixz picnixz merged commit 7eb77f2 into sphinx-doc:master Jul 8, 2024
24 checks passed
@picnixz (Member)

picnixz commented Jul 8, 2024

Thank you Jay (and Will for the review!)

@jayaddison jayaddison deleted the issue-11961/searchindex-omit-doctitle-anchors branch July 8, 2024 12:15
@jayaddison (Contributor Author)

Thank you @picnixz!

@AA-Turner AA-Turner added this to the 7.4.0 milestone Jul 13, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 13, 2024

Successfully merging this pull request may close these issues.

HTML Search: Contains duplicates based on title and content search
4 participants