Improved search and deep linking #129

ajparsons · 2021-09-20T08:26:59Z

Copying from a notes document:

One of the users we interviewed was disappointed that search results couldn't link to individual pages, and this feels like a fixable issue with a little R&D time. Similarly, there are existing tools for table-of-contents extraction that, paired with deep linking, would make PDFs much more useful without needing to fully understand the content. A structured understanding of tables of contents could improve search and provide a light form of structured content for comparing documents.

Chrome supports both linking to a specific page of a PDF URL and linking to specific text fragments on an HTML page. We could use this to validate whether deep linking is useful.
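
As a rough sketch of what those two link styles look like, here is a small Python helper. The function names and example URLs are made up for illustration; the `#page=N` and `#:~:text=` fragment formats are the ones Chrome understands, though other browsers and PDF viewers may handle them differently.

```python
from urllib.parse import quote


def pdf_page_link(pdf_url: str, page: int) -> str:
    """Build a link that opens a PDF at a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"{pdf_url}#page={page}"


def text_fragment_link(page_url: str, text: str) -> str:
    """Build a link that scrolls to and highlights `text` on an HTML page."""
    # quote() escapes spaces, ',' and '&'. It leaves '-' alone, but '-' has a
    # special meaning inside a text directive, so escape it by hand.
    encoded = quote(text, safe="").replace("-", "%2D")
    return f"{page_url}#:~:text={encoded}"


# Hypothetical URLs, for illustration only.
print(pdf_page_link("https://example.org/plan.pdf", 12))
print(text_fragment_link("https://example.org/plan.html", "net-zero by 2030"))
```

A search result that knows which page a match came from could use the first form directly; the second form only applies if we serve an HTML version of the document.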

struan · 2022-04-28T11:55:39Z

For context, at the moment the search works by extracting the text from the PDF, putting that into the CSV as plain text and then indexing that text, so we're not directly indexing the PDF.
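
To link a hit to a page, the extracted text would need to keep its page boundaries. A minimal sketch, assuming the extraction step separates pages with form feed characters (as `pdftotext` does); I haven't checked whether our current extraction keeps them, and `pages_matching` is a made-up name:

```python
def pages_matching(extracted_text: str, query: str) -> list[int]:
    """Return the 1-based page numbers whose text contains `query`."""
    # Assumption: pages are separated by form feeds ("\f") in the plain text.
    pages = extracted_text.split("\f")
    needle = query.casefold()
    return [
        number
        for number, page_text in enumerate(pages, start=1)
        if needle in page_text.casefold()
    ]


# Toy three-page document, for illustration only.
sample = "Introduction\fOur net zero target\fAppendix: Net Zero costs"
print(pages_matching(sample, "net zero"))
```

This is plain substring matching rather than what the search index does, but it shows the piece of information (the page number) that a page-level link would need.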

This is a Solr-specific thing that relies on Solr's ExtractingRequestHandler to parse structured files - see https://django-haystack.readthedocs.io/en/master/rich_content_extraction.html This means we should be indexing Word documents etc. as well. For #129

zarino added the project:cape label Apr 28, 2022

zarino added this to Improvements to existing features in CAPE continuous improvement Apr 28, 2022

zarino added this to the Making it easier to search and read plans milestone Apr 28, 2022

struan mentioned this issue May 4, 2022

use extract file contents to index documents #388

Closed
