Improved search and deep linking #129

Open
ajparsons opened this issue Sep 20, 2021 · 1 comment

Comments

@ajparsons
Contributor

Copying from a notes document:

One of the user interviewees was disappointed that search results couldn't link to individual pages, which feels like a fixable issue given a little R&D time. Similarly, existing tools for table-of-contents extraction, paired with deep linking, would make PDFs much more useful without needing to fully understand their content. A structured understanding of tables of contents could improve search and provide a light form of structured content for comparing documents.

Chrome supports both linking to a specific page of a PDF URL and linking to specific text fragments on an HTML page. We could use this to validate whether deep linking is useful.
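
As a rough sketch of what those two link types look like: Chrome's PDF viewer honours the `#page=N` open parameter, and text fragments use the `#:~:text=` syntax. The function names and the `example.org` URLs below are hypothetical, and this assumes the target URL's existing fragment (if any) can be discarded.

```python
from urllib.parse import quote, urldefrag


def pdf_page_link(url: str, page: int) -> str:
    """Link to a 1-based page of a PDF using the #page=N open parameter."""
    base, _old_fragment = urldefrag(url)  # drop any fragment already on the URL
    return f"{base}#page={page}"


def text_fragment_link(url: str, text: str) -> str:
    """Link to a passage on an HTML page using a #:~:text= text fragment."""
    base, _old_fragment = urldefrag(url)
    # Percent-encode everything, including ',' and '&'. A literal '-' must
    # also be encoded because the syntax uses it to mark prefixes and suffixes.
    encoded = quote(text, safe="").replace("-", "%2D")
    return f"{base}#:~:text={encoded}"


if __name__ == "__main__":
    print(pdf_page_link("https://example.org/plan.pdf", 12))
    print(text_fragment_link("https://example.org/plan", "net-zero by 2030"))
```

Browsers without support for these fragments should simply open the document at the top, so trialling such links in search results is low risk.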

@zarino zarino added this to Improvements to existing features in CAPE continuous improvement Apr 28, 2022
@struan
Member

struan commented Apr 28, 2022

For context, the search currently works by extracting the text from each PDF, putting it into the CSV as plain text, and then indexing that text, so we're not indexing the PDF directly.
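
One possible way to make that pipeline page-aware, sketched under assumptions: if the extraction step can hand back text one page at a time, writing one CSV row per page would let a search hit be mapped back to a page number. The column names, function names, and the substring match standing in for the real index are all hypothetical and not taken from the existing code.

```python
import csv
import io


def pages_to_csv(doc_url, page_texts):
    """Write one CSV row per page: document URL, 1-based page number, text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["url", "page", "text"])
    for number, text in enumerate(page_texts, start=1):
        writer.writerow([doc_url, number, text])
    return buf.getvalue()


def find_pages(csv_text, term):
    """Return (url, page) pairs whose text contains the term, ignoring case.

    A plain substring match stands in for the real search index here.
    """
    needle = term.lower()
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["url"], int(row["page"]))
        for row in reader
        if needle in row["text"].lower()
    ]


if __name__ == "__main__":
    # Page texts are assumed to have been extracted already.
    rows = pages_to_csv(
        "https://example.org/plan.pdf",
        ["Introduction", "Our net zero target", "Appendix"],
    )
    for url, page in find_pages(rows, "net zero"):
        print(f"{url}#page={page}")
```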

struan added a commit that referenced this issue May 4, 2022
This is a Solr-specific feature that relies on Solr's
ExtractingRequestHandler to parse structured files - see
https://django-haystack.readthedocs.io/en/master/rich_content_extraction.html

This means we should be indexing Word documents etc. as well.

For #129
struan added a commit that referenced this issue Jun 8, 2022
struan added a commit that referenced this issue Aug 2, 2022