
HTML Search: omit anchor reference from document titles in the search index. #12047

Conversation

@jayaddison (Contributor)

Feature or Bugfix

  • Bugfix

Purpose

  • Adjusts the contents of the search-index provided to the client so that the browser correctly de-duplicates search results that link to the top-level title in documents.

Detail

Relates

@jayaddison jayaddison changed the title Draft: [HTML search] omit anchor reference from document titles in the search index. [HTML search] omit anchor reference from document titles in the search index. Mar 4, 2024
@jayaddison jayaddison marked this pull request as ready for review March 4, 2024 14:03
@wlach (Contributor)

wlach commented Mar 9, 2024

This seems preferable to my solution in #11942; I just tested it, and it has the same results while making the search index smaller. 👍 from me.

@jayaddison jayaddison changed the title [HTML search] omit anchor reference from document titles in the search index. HTML Search: omit anchor reference from document titles in the search index. Mar 14, 2024
@jayaddison added the type:bug and awaiting:review (PR waiting for a review by a maintainer) labels, and removed the awaiting:response (Waiting for a response from the author of this issue) label Mar 14, 2024
@picnixz (Member)

picnixz commented Mar 15, 2024

Fine for me (I'm not an expert in that search-related aspect, so I'll leave it to you guys). Should I understand that this PR would replace #11942? @jayaddison, ping me when you want it to be merged.

@jayaddison (Contributor Author)

Fine for me (I'm not an expert in that search-related aspect, so I'll leave it to you guys). Should I understand that this PR would replace #11942?

That's correct. I think there are three things that I don't like about this pull request, in order of priority:

  1. It doesn't offer JavaScript test coverage to demonstrate the fix -- only Python test coverage that confirms the expected index format.
  2. I haven't been able to think of a way that this would happen yet, but in theory I expect some downstream tools that use the current index format could be broken by this - in particular the possibility of null values where previously everything was a string.
  3. Related to (2), the index format isn't as small as it could be. In raw text, the JavaScript empty string can be represented by either '' or "" (two characters), whereas a null value is represented by the token null (four characters). It may not matter a lot, but over a long enough duration of time the cumulative cost could be significant (or not! maybe it's a waste of time considering it).
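As a rough illustration of point (3), the serialized-size difference between `null` and an empty string can be checked with Python's `json` module (a hypothetical sketch, not Sphinx code):

```python
import json

# "null" serializes to four characters; an empty string to two (the quotes).
assert json.dumps(None) == "null"
assert json.dumps("") == '""'

# Over many index entries the difference accumulates, e.g. 1000 entries:
entries_null = json.dumps([None] * 1000)
entries_empty = json.dumps([""] * 1000)
print(len(entries_null) - len(entries_empty))  # 2000 bytes larger
```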

@jayaddison (Contributor Author)

It doesn't offer JavaScript test coverage to demonstrate the fix -- only Python test coverage that confirms the expected index format.

I think I'll begin work on a small JavaScript refactor PR to make test coverage easier to add.

@picnixz (Member)

picnixz commented Mar 15, 2024

particular the possibility of null values where previously everything was a string.

Is it possible to keep a string or is the null type needed?

@jayaddison (Contributor Author)

particular the possibility of null values where previously everything was a string.

Is it possible to keep a string or is the null type needed?

If I remember correctly, this conditional needs to be adjusted if we're using empty strings, but otherwise it should be fine.

What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds searchindex.js when projects are built. It feels fragile that the test indexes are written by hand.

@jayaddison (Contributor Author)

What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds searchindex.js when projects are built. It feels fragile that the test indexes are written by hand.

Moved into #12099.

@jayaddison (Contributor Author)

I'd like to check whether we can get #12102 in place before progressing this pull request further. If that can be added, then I think adding test coverage here will be much easier and more reliable (I'll be able to create a sample Sphinx project that returns duplicate search results, and add test coverage against that).

@jayaddison jayaddison marked this pull request as draft March 16, 2024 11:08
@jayaddison (Contributor Author)

(and maybe do the null to '' refactoring at the same time and more safely, given the test coverage)

tests/test_search.py (outdated)
@@ -176,7 +179,10 @@ def test_IndexBuilder():
'index': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'},
'test': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'},
}
assert index._title_mapping == {'section_titl': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'}}
assert index._title_mapping == {
Member

Here, I see that the values are sets and thus can be unordered. Is it possible to have a stronger ordering guarantee on the title_mapping?

Contributor Author

To check that I understand correctly: use an assertion that checks not only the contents of the title mapping, but also the ordering of the elements inside it?

(makes sense to me, order can be significant for index data)

Contributor Author

If my understanding there was correct: I'd prefer to handle that in a separate issue thread and PR, if that's OK. Or, if I misunderstood: let's make sure to resolve that first.

Member

If there is more than one occurrence of that, yes we could discuss it in another thread.

Contributor Author

Yep; I think it would be applicable for both _title_mapping and _mapping; because both of those currently use a pattern of <mapping>.setdefault(..., set()).add(...) (implying that insertion-order may not match subsequent access iteration order):

for word in word_store.title_words:
# add stemmed and unstemmed as the stemmer must not remove words
# from search index.
stemmed_word = stem(word)
if _filter(stemmed_word):
self._title_mapping.setdefault(stemmed_word, set()).add(docname)
elif _filter(word):
self._title_mapping.setdefault(word, set()).add(docname)
for word in word_store.words:
# add stemmed and unstemmed as the stemmer must not remove words
# from search index.
stemmed_word = stem(word)
if not _filter(stemmed_word) and _filter(word):
stemmed_word = word
already_indexed = docname in self._title_mapping.get(stemmed_word, ())
if _filter(stemmed_word) and not already_indexed:
self._mapping.setdefault(stemmed_word, set()).add(docname)

Contributor Author

Correction: any runtime (in-memory) differences in the ordering of objects within the _title_mapping and _mapping set entries are later resolved into deterministic order by an explicit sort that occurs during freezing (serialization) of the search index (thanks to d24bd73).

I'm not aware of any other current bugs causing nondeterminism in our searchindex.js output since v7.3.0 (95fb0e5 included in that release handled a similar situation).

Member

So _title_mapping is unsorted in the tests, but sorted in production? If so, a comment is needed (at least to record the reason).

Contributor Author

Yep, pretty much. I'll be a bit verbose:

  • In-memory (and therefore in the tests), each value in _title_mapping may store elements in arbitrary ordering.
  • The tests use Python set equality comparison, and that is order-agnostic (as it should be) -- so the tests aren't really checking the order, even though it might appear from the code that they would.
  • A production Sphinx build sorts these entries when writing searchindex.js.

What you suggest about a comment makes sense...

However, perhaps I could attempt a separate PR to refactor the code so that sorting occurs in-place and in-memory, and to adjust the unit tests to assert on that. That would provide stronger guarantees than a comment in the code.

Member

No need for that. It's better that at runtime, the objects are represented in sets since they are more efficient. It's only for serialization that you need to sort to ensure reproducibility.
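The pattern described here, keeping unordered sets in memory and sorting only at serialization time, can be sketched as follows (hypothetical names; the actual Sphinx implementation differs):

```python
import json

# In-memory: unordered sets give O(1) membership checks during indexing.
title_mapping = {}  # dict[str, set[str]]
for word, docname in [("titl", "doc2"), ("titl", "doc1"), ("index", "doc1")]:
    title_mapping.setdefault(word, set()).add(docname)

# Only at serialization ("freeze") time: sort for reproducible output.
frozen = {word: sorted(docs) for word, docs in sorted(title_mapping.items())}
print(json.dumps(frozen))  # identical bytes on every build
```

This keeps the runtime efficiency of sets while still guaranteeing byte-for-byte reproducible serialized output.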

Contributor Author

Ok, agreed (storing the lists in-place for ordered iteration would probably add various types of overhead).

I've pushed a commit to add some commentary as a docstring on the relevant method and also in the unit tests.

tests/test_search.py
@@ -391,7 +391,7 @@ def freeze(self) -> dict[str, Any]:
objtypes = {v: k[0] + ':' + k[1] for (k, v) in self._objtypes.items()}
objnames = self._objnames

alltitles: dict[str, list[tuple[int, str]]] = {}
alltitles: dict[str, list[tuple[int, str | None]]] = {}
@picnixz (Member) commented Jun 18, 2024

Could you explain the shape of this value? I.e., what does list[tuple[int, str | None]] correspond to (what is the integer, and what is the str | None), what is the dict's str key, and why does it change from all_titles? (The list now holds pairs of int and string rather than pairs of strings.)

Contributor Author

Mm, this is not very clear at the moment, but I believe that the integer values here are what I'd call documentId values - sequential IDs assigned to each of the source .rst files at build-time. The second value in each tuple -- the str or empty value -- is a target reference (aka anchor when talking about HTML) -- it's an indicator about what destination on the page/document the user should be taken to were they to navigate using this entry.

Perhaps this is a situation where we should refactor and extract out meaningful type aliases (a bit like the recent Intersphinx refactorings).

@picnixz (Member) commented Jun 18, 2024

I would not use aliases, but the int has always seemed to me to correspond to the index in the list, so the same element could appear multiple times with different indices. However, I'd be happy to know whether that is really the reason we duplicate it, or whether we could just live with a list of strings.

Contributor Author

I guess an example to explain that could be a title like Recent Changes that appears in multiple documents and might have different headings (or perhaps even be the document title in some cases).

alltitles: {"Recent Changes": [[0, None], [1, "coolmodule-recentchanges"], [5, "anothermodule-recentchanges"], ...]}

(a bit hypothetical, but it's to handle situations like that I believe. again, a good case for checking and documenting)
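Building such a mapping could be sketched like this (hypothetical data and variable names, not the actual `freeze` code):

```python
# Each document is assigned a sequential integer ID at build time. Titles map
# to (doc_id, anchor) pairs, where anchor is None when the title is the
# document's own top-level title (so no fragment is needed in the link).
docs = [
    ("index", "Recent Changes", None),
    ("coolmodule", "Recent Changes", "coolmodule-recentchanges"),
]

alltitles = {}  # dict[str, list[tuple[int, str | None]]]
for doc_id, (docname, title, anchor) in enumerate(docs):
    alltitles.setdefault(title, []).append((doc_id, anchor))

print(alltitles)
```

The same title string can therefore fan out to several documents, each with its own ID and (possibly absent) anchor.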

@wlach (Contributor)

wlach commented Jun 18, 2024

Mm, yep - the type hints themselves shouldn't affect anything... but they are an indicator that something about the possible range of values for their attributes has changed (None now permitted) - and since the env is serialized and potentially re-loaded during non-fresh builds (to speed up project rebuilds), I'm worrying about possible regressions due to that.

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:

https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)

IMO it's fine. We're just changing what gets included in the searchindex.

I think I agree here - it's simply that document main titles are now linked directly instead of with a hyperlink anchor/target.

However, similar to the env pickling/reloading -- there is code that I remember seeing that can reload the search state from a previously-frozen serialized format.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case:

frozen.get('envversion') != self.env.version:

@jayaddison (Contributor Author)

Mm, yep - the type hints themselves shouldn't affect anything... but they are an indicator that something about the possible range of values for their attributes has changed (None now permitted) - and since the env is serialized and potentially re-loaded during non-fresh builds (to speed up project rebuilds), I'm worrying about possible regressions due to that.

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:

https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content.

Ref: b1271fa

However, it's not clear to me whether those loaded entries are retained before the updated search index is written, because of the code linked below; is env.all_docs the entire project (in which case it's definitely not an incremental rebuild), or are they only the modified documents?

self.indexer.prune(self.env.all_docs)

(similarly it could be worth pausing to consider searchindex.js forward/backward compatibility. I think this change is safe because there are no associated JavaScript changes required for this fix, but that may be naive)

IMO it's fine. We're just changing what gets included in the searchindex.

I think I agree here - it's simply that document main titles are now linked directly instead of with a hyperlink anchor/target.
However, similar to the env pickling/reloading -- there is code that I remember seeing that can reload the search state from a previously-frozen serialized format.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case:

frozen.get('envversion') != self.env.version:

Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run:

# This is increased every time an environment attribute is added
# or changed to properly invalidate pickle files.
ENV_VERSION = 61
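The version guard described above, rejecting any previously-frozen index whose environment version does not match, can be sketched as follows (hypothetical function names; the actual check lives in Sphinx's search indexer):

```python
import json

ENV_VERSION = 61  # bumped whenever the serialized environment format changes


def load_frozen_index(raw, current_version=ENV_VERSION):
    """Return the frozen index if its envversion matches, else None (forcing a rebuild)."""
    frozen = json.loads(raw)
    if frozen.get("envversion") != current_version:
        return None  # stale format: discard and rebuild from scratch
    return frozen


assert load_frozen_index('{"envversion": 60}') is None
```

Incrementing ENV_VERSION alongside a format change is therefore enough to invalidate all previously-serialized indexes.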

@wlach (Contributor)

wlach commented Jun 18, 2024

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:
https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content.

Ref: b1271fa

Ok, but I don't see anything in the codebase accessing that variable (self._titles) aside from when we write per that search query above. Let me know if I'm missing something.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case:

frozen.get('envversion') != self.env.version:

Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run:

# This is increased every time an environment attribute is added
# or changed to properly invalidate pickle files.
ENV_VERSION = 61

You could always increment that number if you wanted to be on the safe side. Not really a lot of downside outside of maybe a very minor build-time penalty in a few obscure cases.

@jayaddison (Contributor Author)

I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem:
https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code

I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content.
Ref: b1271fa

Ok, but I don't see anything in the codebase accessing that variable (self._titles) aside from when we write per that search query above. Let me know if I'm missing something.

Ok - that matches my understanding too; the only time we write to the search-related attributes on BuildEnvironment is during loading of an existing searchindex.js from the output build directory. Subsequently some/all of that in-memory content is invalidated before we write the updated searchindex.js to that same output build directory.

It looks like we explicitly reject any index created with a previous version of Sphinx in this case:

frozen.get('envversion') != self.env.version:

Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run:

# This is increased every time an environment attribute is added
# or changed to properly invalidate pickle files.
ENV_VERSION = 61

You could always increment that number if you wanted to be on the safe side. Not really a lot of downside outside of maybe a very minor build-time penalty in a few obscure cases.

Given that we believe that the Python-side changes should be backwards-compatible here (no structural changes to the pickled env data stored-and-read by Sphinx), I think I'd prefer to leave the env-version value as-is.

@jayaddison (Contributor Author)

NodeJS tests on this branch began failing from 9fef43b onwards; that isn't easily visible from the CI results in the GitHub web UI because we weren't running those on each commit until #12445 was merged (the latest merge-from-master includes that, explaining the visibility of the test failure).

The change at 9fef43b regenerated the search index file for the repro test case. Next I'm going to try to figure out why the tests began failing after that, and what that means.

@jayaddison (Contributor Author)

jayaddison commented Jun 24, 2024

Good news, I think.

NodeJS tests on this branch began failing from 9fef43b onwards; that isn't easily visible from the CI results in the GitHub web UI because we weren't running those on each commit until #12445 was merged (the latest merge-from-master includes that, explaining the visibility of the test failure).

The change at 9fef43b regenerated the search index file for the repro test case. Next I'm going to try to figure out why the tests began failing after that, and what that means.

This index-regeneration occurred directly after e947f6b - a merge from the mainline branch. That merge pulled in support for JavaScript testing of client-side HTML. And that happened to include a test case -- with a relevant TODO note -- indicating that some of the data in it is a workaround and should be removed when #11961 is fixed. This PR fixes #11961 - so it is in fact expected that we should remove that data, and it's good that the test is failing while it is still in place.

Edit: remove duplication of quoted comment

@wlach (Contributor)

wlach commented Jul 5, 2024

@jayaddison Is there anything left to be done here or is this ready for a review by a committer?

@jayaddison (Contributor Author)

This is ready, yep; the mypy lint failure on the latest commit (64387c4) is unrelated (#12510), and my last two messages where I'm talking to myself (not uncommon, sadly) are because I hadn't updated the test coverage after merging from mainline (to remove a duplicate result expectation added for a separate, already-merged PR).

@picnixz (Member)

picnixz commented Jul 5, 2024

I'll review it tomorrow as well! (now that I've successfully defended, I have a bit more time!)

@jayaddison (Contributor Author)

Well done completing the defence @picnixz :) Thank you!

@picnixz (Member) left a comment

I think my last question was resolved. I'm also happy that we don't have those duplicated entries anymore! I'll wait for Will to also review it, and then we can merge it (just ping me when it's done).


@wlach (Contributor) left a comment

All looks good from my end! @picnixz please merge when ready

@picnixz (Member)

picnixz commented Jul 7, 2024

I'll do it tomorrow (I was working on something else and missed the notification, but I'm going offline now, and I prefer not to merge something whose merge result I won't be able to see).

@picnixz picnixz merged commit 7eb77f2 into sphinx-doc:master Jul 8, 2024
24 checks passed
@picnixz (Member)

picnixz commented Jul 8, 2024

Thank you Jay (and Will for the review!)

@jayaddison jayaddison deleted the issue-11961/searchindex-omit-doctitle-anchors branch July 8, 2024 12:15
@jayaddison (Contributor Author)

Thank you @picnixz!

@AA-Turner AA-Turner added this to the 7.4.0 milestone Jul 13, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 13, 2024

Successfully merging this pull request may close these issues.

HTML Search: Contains duplicates based on title and content search
4 participants