
[search] issues with the new HTML search algorithm #12391

Closed
picnixz opened this issue May 23, 2024 · 16 comments · Fixed by #12441

Comments

@picnixz
Member

picnixz commented May 23, 2024

I don't really have a good title for this issue, but the CPython issue speaks for itself. It appears we are still misranking titles in some cases (the 3.11 docs are generated by Sphinx 7.2.6, while the 3.12 docs are generated by 7.3.7).

cc @wlach @jayaddison

From python/cpython#119423 (comment)

(I initially posted this in the pythondotorg repo and was directed here. Copying the description from there verbatim.)

Describe the bug: Searching for library modules on 3.12 returns first a list of "What's New In Python <version>" pages instead of the searched-for module.

To Reproduce: Steps to reproduce the behavior:

1. Go to [docs.python.org/3.12/index.html](https://docs.python.org/3.12/index.html)

2. Click on the search box in the top-right corner, enter e.g. "asyncio", and hit enter

3. [Scroll through results](https://docs.python.org/3.12/search.html?q=asyncio)

4. See that the first result is "History and License" followed by 10 (!) "What’s New In Python <version>" results. The 11th result is the actual asyncio module.

Expected behavior: I would expect the asyncio module page to be the first result, or at the very least to appear very close to the top. This has been the case in previous versions of the website search.

Screenshots (the list goes on a few more entries below the screenshot): [screenshot]

Compare with search from 3.11: [screenshot]

Desktop (please complete the following information):

* OS: macOS

* Browser: Chrome

* Version: 123

Additional context: This happens for other topics too, not only modules.

@wlach
Contributor

wlach commented May 23, 2024

Ironically I think fixing the search algorithm so it worked as originally intended made this worse. In previous versions, there were various bugs that made title matches rank lower than they should have.

In this case, asyncio is a title, and it's a full match in the case of the release notes. In the case of the module, the title is asyncio — Asynchronous I/O, so it's only a partial match (I'm not sure what's going on with the module match -- would need to investigate).

That said, there are a few things we could investigate that might improve things:

@Saba-Sabato

If I may add my two cents here, I think a key part is actually the webpage name: https://docs.python.org/3.12/library/asyncio.html, https://docs.python.org/3.12/library/itertools.html
If you boosted the webpage's actual name, it would at the very least solve the issue for all modules (which IMO is one of the most common use cases for search, though I don't really have data here).
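To illustrate, a boost like that might look roughly like this (purely a sketch; the function name and the boost value are made up):

```js
// Illustrative only: boost a document whose file name (the last path
// component of its docname) exactly matches the query, so that
// library/asyncio wins for a query of "asyncio".
function filenameBoost(docname, query) {
  const basename = docname.split("/").pop().replace(/\.html$/, "");
  return basename.toLowerCase() === query.toLowerCase().trim() ? 10 : 0;
}

// filenameBoost("library/asyncio.html", "asyncio") === 10
// filenameBoost("whatsnew/3.12.html", "asyncio") === 0
```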

@picnixz
Member Author

picnixz commented May 23, 2024

There are other possibilities, namely ranking the match higher:

  • if it's part of the name of the document being generated (in general, you document a module in a file that has the same... name)
  • if it's a standalone name + lowercased name (usually important in this case)

EDIT: We had the same idea about the name of the file!

@wlach
Contributor

wlach commented May 23, 2024

Definitely lots of things we could experiment with! I think now that we have a foundation with better testing (esp. after #12102 merges) it will be easier to iterate on things and make sure that we maintain desired behaviour.

@picnixz
Member Author

picnixz commented May 23, 2024

I'll be out for the next few days due to Eurocrypt, but I'll be back next week and could perhaps make that happen. In the meantime, @hugovk, I suggest you generate the docs with a patched version of searchtools.js so that Python's search is not broken for too long (or maybe people can live with that, though it's a bit annoying...).

@hugovk
Contributor

hugovk commented May 23, 2024

Thanks, I think we can wait, let's see. Enjoy Eurocrypt!

wlach added a commit to wlach/sphinx that referenced this issue May 25, 2024
Related to sphinx-doc#12391. Intended as a proof of concept, not a full solution.
@wlach
Contributor

wlach commented May 25, 2024

Spent a bit of time looking at this today. Some thoughts:

  • I'm not an expert in information retrieval, but my sense from taking a couple of undergraduate courses in the distant past is that it's very hard to come up with heuristics that don't occasionally give bad results like what we're seeing here. It's probably better to just rely on things like tf-idf with a few lightweight things on top to weight the results a bit.
  • In this vein, the existing js code seems to heavily boost results where there is a full title match (100 points, vs. like 5 for a term search). This might make sense when it's the title of the whole document but when it's a subtitle this probably isn't nearly as much of a hint that the information is relevant. I suspect this feature actually never worked exactly as originally intended, but various bugs in the JavaScript code obscured that.
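As a rough sketch of that disparity (not the actual searchtools.js implementation; the names and point values here are illustrative only):

```js
// Any full (sub)title match earns a large boost, while ordinary body-term
// matches only contribute a few points each -- which is why a release-notes
// page with an "asyncio" subsection heading can outrank the asyncio module.
const SketchScorer = { title: 100, term: 5 };

function sketchScore(titles, bodyTermMatches, queryLower) {
  let score = 0;
  if (titles.some((t) => t.toLowerCase() === queryLower)) {
    score += SketchScorer.title; // includes subsection headings
  }
  return score + bodyTermMatches * SketchScorer.term;
}
```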

I attempted a quick fix that adjusted some things here:

const queryLower = query.toLowerCase().trim();

It seems to give somewhat better results (see PR), but to be honest I'm not super happy with it; it feels very much like a set of hacks (on top of the existing hacks) that might just lead to other problems.

Longer term, I think we might want to focus on improving the scoring algorithm for generic term search. I may be misunderstanding what's going on (wouldn't be the first time 😄), but based on my initial spelunking of https://github.com/sphinx-doc/sphinx/blob/48cbb43e28efe82b198137042811e0ade3599ae7/sphinx/search/__init__.py it doesn't look like we're using tf-idf -- it seems like we just check whether the term is in the document or not. I suspect we could get much better results by using tf-idf by default and then implementing title search on top of that (to further boost scores), instead of making title search its own thing.
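To sketch what a tf-idf-based approach might look like (purely illustrative; the per-document term counts assumed here are not something the current index stores):

```js
// Hypothetical index shape: docs = [{ docname, title, length, termCounts }]
// where termCounts is a Map from a stemmed term to its occurrence count.
function tfIdf(term, doc, docs) {
  const tf = (doc.termCounts.get(term) || 0) / doc.length;
  const docsWithTerm = docs.filter((d) => d.termCounts.has(term)).length;
  const idf = Math.log(docs.length / (1 + docsWithTerm));
  return tf * idf;
}

function scoreDocument(queryTerms, doc, docs) {
  // Base score from term statistics; a title match applies a bounded
  // multiplier instead of dominating the ranking outright.
  let score = queryTerms.reduce((sum, t) => sum + tfIdf(t, doc, docs), 0);
  if (queryTerms.every((t) => doc.title.toLowerCase().includes(t))) {
    score *= 1.5;
  }
  return score;
}
```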

@picnixz
Member Author

picnixz commented May 28, 2024

So, I'm back and here's what I suggest:

  1. merge [tests] JavaScript: extract searchindex.js-format test fixtures. #12102 (the comments that I made are only cosmetic, so we can merge it right away and revisit the PR later)
  2. should we sit down to check how to improve the whole search algorithm? IMO we should probably revisit more or less everything from scratch (the premises might be wrong)

I had some lectures on learning theory, tensors, and similarities, but those were mostly theoretical and not really implementation-oriented. So on that one I cannot really help, although I can give ideas. Nonetheless, I can help with the maths behind it if needed (and with graph-based searches as well).

PS: If anyone has experience with search engines (e.g., someone who is well-versed in Google's techniques :D), they are welcome to help us!

@wlach
Contributor

wlach commented May 29, 2024

So, I'm back and here's what I suggest:

1. merge [[tests] JavaScript: extract searchindex.js-format test fixtures. #12102](https://github.com/sphinx-doc/sphinx/pull/12102) (the comments that I made are only cosmetic, so we can merge it right away and revisit the PR later)

Do you have rights to edit the pull request? I believe maintainers usually do with the committers' consent. But yes, that makes sense to me.

2. should we sit down to check how to improve the whole search algorithm? IMO we should probably revisit more or less everything from scratch (the premises might be wrong)

Yes, I kind of think so though it feels like a large project -- at least 40 to 80 hours, maybe more (especially if you consider product research and testing). One thing to note is that any site hosted by readthedocs (which many sphinx projects use, though not cpython) uses a different server-side implementation of search which doesn't appear to suffer from these problems.

Part of me wonders if we should implement some variation of #12393 as an interim solution to at least fix these pathologically bad cases where a subtitle gets ranked higher than a title.

I had some lectures on learning theory, tensors, and similarities, but those were mostly theoretical and not really implementation-oriented. So on that one I cannot really help, although I can give ideas. Nonetheless, I can help with the maths behind it if needed (and with graph-based searches as well).

PS: If anyone has experience with search engines (e.g., someone who is well-versed in Google's techniques :D), they are welcome to help us!

Well, there's definitely prior art out there to look at -- e.g. I think mkdocs and mkdocs-material have pretty good client-side search. I believe they use a combination of lunr.js (using https://github.com/yeraydiazdiaz/lunr.py to generate the index) with some other magic.

@picnixz
Member Author

picnixz commented May 29, 2024

Do you have rights to edit the pull request? I believe maintainers usually do with the committers' consent. But yes, that makes sense to me.

I think Jay allowed it.

Yes, I kind of think so though it feels like a large project -- at least 40 to 80 hours, maybe more (especially if you consider product research and testing)

I'm defending my thesis in one month, so I cannot really afford to spend 40-80h on that task, but at least it's good that we agree that we need to do something. If, however, you know of anything that is open source and can be used as a drop-in replacement, or that we could draw inspiration from, then it would definitely be a starting point. The lunr.py framework seems suitable for our needs since we do have something that could serve as primary/secondary keys.

For instance, with the 'asyncio' use case, we would have multiple matches but only one of them would be a "module"-like match and in this case, it is pretty easy to select which would be the expected one.

Now, before doing anything, perhaps we could decide the priority of the fields, like, when you search something, do you first want a module, a class, a property, or what? and in which order?

@wlach
Contributor

wlach commented Jun 8, 2024

I'm defending my thesis in one month, so I cannot really afford to spend 40-80h on that task, but at least it's good that we agree that we need to do something. If, however, you know of anything that is open source and can be used as a drop-in replacement, or that we could draw inspiration from, then it would definitely be a starting point. The lunr.py framework seems suitable for our needs since we do have something that could serve as primary/secondary keys.

Fair enough, I've also been busy, and my heart sinks a bit thinking about the size of this project. That said, I feel like this would benefit from a broader discussion (we've gone far beyond the scope of this bug), so I started one here with some ideas about how we might move forward:

https://github.com/orgs/sphinx-doc/discussions/12419

For instance, with the 'asyncio' use case, we would have multiple matches but only one of them would be a "module"-like match and in this case, it is pretty easy to select which would be the expected one.

Now, before doing anything, perhaps we could decide the priority of the fields, like, when you search something, do you first want a module, a class, a property, or what? and in which order?

I think my initial hypothesis is that we might not want to distinguish between these things explicitly, and that we'd be better off with a good search implementation that just works against document titles/content in a more general way. I think one of the nice things about building this as an extension (as I suggest in the discussion) is that we can more easily experiment with various approaches.

@jayaddison
Contributor

Re: @wlach's bug analysis:

In this vein, the existing js code seems to heavily boost results where there is a full title match (100 points, vs. like 5 for a term search). This might make sense when it's the title of the whole document but when it's a subtitle this probably isn't nearly as much of a hint that the information is relevant.

I agree that this seems to be the cause of the problem - the JavaScript variable named title that we use during the percentage-match logic (100 * ...) is not always the document main title (What's new in ... at the top of the document), but can also be a subsection title (asyncio as it appears as a subsection heading somewhere within the release notes).

So if I understand correctly, we're producing a 100 score for documents when a query matches a subsection title in them; I don't think that's the original intent of the code.
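In rough form, the scoring path in question looks something like this (an illustrative reconstruction, not the exact code):

```js
// `title` may be either the document's main title or a subsection heading --
// which is exactly the ambiguity described above.
function titleMatchScore(title, queryLower) {
  const titleLower = title.toLowerCase();
  if (!titleLower.includes(queryLower)) return 0;
  // A query covering the whole title yields 100; partial coverage yields less.
  return Math.round((100 * queryLower.length) / titleLower.length);
}

// titleMatchScore("asyncio", "asyncio") === 100   (subsection in release notes)
// titleMatchScore("asyncio — Asynchronous I/O", "asyncio") === 27 (module page)
```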

Re: @picnixz's suggested courses of action:

merge #12102 (the comments that I made are only cosmetic, so we can merge it right away and revisit the PR later)

Yep - I would like to treat this bug report as something we can write a test case for and then have some confidence that it is resolved.

I think a search engine / information retrieval expert would say that it is impractical to rely on unit tests to guarantee the continued quality of a search engine where indexed content evolves over time -- and I more-or-less agree with that, but I also think that a few well-chosen test cases can be useful to maintain (and document) desired behaviours of the search engine.

should we sit down to check how to improve the whole search algorithm? IMO we should probably revisit more or less everything from scratch (the premises might be wrong)

I would really like it if we could document what we already have, and I think that would help with this.

Re: interim solutions (quoting @wlach)

Part of me wonders if we should implement some variation of #12393 as an interim solution to at least fix these pathologically bad cases where a subtitle gets ranked higher than a title.

I think I'm more-or-less OK with that idea; I don't love the additional code-path complexity, but I think the idea -- creating a distinction between main-title and section-title match scoring -- is good. It might be nice to clarify those values as constants on the Scorer object.
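For example, a hypothetical set of constants along those lines (names and values are illustrative, not the ones currently shipped in searchtools.js):

```js
const Scorer = {
  objNameMatch: 15, // exact match on an indexed object name (e.g. a module)
  mainTitle: 10,    // the query matches the document's top-level title
  sectionTitle: 6,  // the query matches a subsection heading only
  term: 5,          // a plain term occurrence in the document body
};

function titleScore(matchedHeading, docMainTitle) {
  return matchedHeading === docMainTitle ? Scorer.mainTitle : Scorer.sectionTitle;
}
```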

...and it's also worth keeping @Saba-Sabato's note about filenames in mind. In particular: is asyncio considered an 'object name' because it's a Python module in our searchindex.js? Exact matches on object names seem like they should be a strong relevance indicator. Perhaps that scoring logic will naturally be used here if we down-rank the subsection-title logic, or perhaps not.

Re: #12102 edit permissions

In general, yep feel free to push changes to my branches - I've added a few replies on there and will get around to adding the relevant commits soon.

@jayaddison
Contributor

I've developed a minimal-ish repro case that I believe exhibits the same problem. It contains:

  • A document where the word Relevance appears as a sub-section heading.
  • A document where Relevance is the main title.
  • A document with the filename relevance, based on a Python module of the same name (relevance.py).

The search results when querying for the term relevance are currently:

[screenshot of the search results]

  1. The subsection heading match.
  2. The main-title match.
  3. The object-name (module) match.

Those don't seem ideal to me; I'd suggest we'd want the results to appear instead as: object-name match, main-title match, subsection match.

That suggested ordering is based on a sense I have that many of the projects built with Sphinx are technical/reference projects, and so a match on an object name in the project's domain is usually highly relevant to the user's query. And also a sense that if a title is the main title of a document, then that document should score more highly than a subsection heading with the same level of matching.

There are likely multiple ways to implement scoring that would achieve the suggested re-ordering; some of those might be good, some of them might be less good. Trying to find an implementation that is straightforward to read and understand (both code and behaviour) and also produces good results would be the next step -- and then we could (re)evaluate that after a subsequent release, based on feedback.
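One possible (again, purely illustrative) way to get that ordering would be to bucket matches into tiers and only fall back to the raw score within a tier:

```js
// Lower tier sorts first; the tier names and the result shape are made up.
const MATCH_TIER = { objectName: 0, mainTitle: 1, sectionTitle: 2, term: 3 };

function compareResults(a, b) {
  if (MATCH_TIER[a.tier] !== MATCH_TIER[b.tier]) {
    return MATCH_TIER[a.tier] - MATCH_TIER[b.tier];
  }
  return b.score - a.score; // higher raw score first within the same tier
}

const results = [
  { docname: "index", tier: "sectionTitle", score: 100 },    // "Relevance" subsection
  { docname: "relevance-doc", tier: "mainTitle", score: 15 },
  { docname: "relevance", tier: "objectName", score: 11 },   // relevance.py module
];
results.sort(compareResults);
// => relevance, relevance-doc, index
```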

@wlach
Contributor

wlach commented Jun 19, 2024

This is great, thank you @jayaddison! As I said above, I think there's only so far we can go with the current approach, but as an incremental step adding a real test case and then adjusting the scoring weights seems like the way to go. Happy to help with review / adjustments as you go along.

As an aside, it's hilarious that we get 7 results for a corpus of three documents. 😭 Really looking forward to #12047 landing.

Those don't seem ideal to me; I'd suggest we'd want the results to appear instead as: object-name match, main-title match, subsection match.

Yup that makes sense to me too.

@jayaddison
Contributor

As an aside, it's hilarious that we get 7 results for a corpus of three documents. 😭 Really looking forward to #12047 landing.

Well spotted :) Yep, I will confirm that that resolves the result duplication in this case too.

@jayaddison
Contributor

As an aside, it's hilarious that we get 7 results for a corpus of three documents. 😭 Really looking forward to #12047 landing.

Well spotted :) Yep, I will confirm that that resolves the result duplication in this case too.

Done - merging #12047 (at c7245d2) into #12441 (at 5eaea64) does include de-duplication as expected. Six (6) results are returned. They mostly appear distinct (unique); the third result is arguably dubious, though -- it looks like a term match, but the term also happens to be a subsection title. Even so, I think that is a separate issue from both this one and #11961.

[screenshot of the search results]
