Evaluate using trigram matching for string search parameters #4452

mattwiller · 2024-04-25T22:05:05Z

Currently, we use the built-in Postgres tsvector data type to index string fields for search. The computed search columns in our database tables have type TEXT, and we create an index for each such column e.g. in the lookup table for Address.address:

CREATE INDEX CONCURRENTLY IF NOT EXISTS "Address_address_idx_tsv" ON "Address"
USING gin (to_tsvector('simple', address))

This allow us to query against the column using tsquery and hit the index to achieve acceptable performance:

SELECT * FROM "Address"
WHERE address @@ to_tsquery('simple', 'oakl:*');

The text vector data type is optimized for full text search, where we want to find similar words or lexemes in a given language within a large document written in that language. In contrast, nearly all of the string values we encounter in FHIR search parameters are relatively short (i.e. a couple tens to hundreds of characters) and are not really written in any specific language: they tend to be proper nouns or niche technical concepts. Consequently, we have already observed some interesting issues where the tsvector approach does not give the results we would naively expect: for example, stemming and stop word rules have caused searches to return incorrect results or not find matching resources that do contain the search string.

Instead of vectors, we could use a trigram index: this is optimized for fast text similarity testing on arbitrary strings. We should be able to create another index over the TEXT column:

CREATE INDEX CONCURRENTLY IF NOT EXISTS "Address_address_idx_trgm" ON "Address"
USING gin (address gin_trgm_ops)
-- OR
CREATE INDEX CONCURRENTLY IF NOT EXISTS "Address_address_idx_trgm" ON "Address"
USING gist (address gist_trgm_ops)

This would allow us to similarly query the index using trigram functions and operators:

SELECT *, 1 - strict_word_similarity('oakl', address) AS dist FROM "Address"
WHERE 'oakl' <<% address
ORDER BY dist;

In theory, this solution should give us more consistent string matching with good performance. It should be evaluated for representative string searches, noting query performance and index size.

The text was updated successfully, but these errors were encountered:

mattwiller added the search Features and fixes related to search label Apr 25, 2024

mattwiller added this to the Milestone Quality milestone Apr 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluate using trigram matching for string search parameters #4452

Evaluate using trigram matching for string search parameters #4452

mattwiller commented Apr 25, 2024

Evaluate using trigram matching for string search parameters #4452

Evaluate using trigram matching for string search parameters #4452

Comments

mattwiller commented Apr 25, 2024