Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluate using trigram matching for string search parameters #4452

Open
mattwiller opened this issue Apr 25, 2024 · 0 comments
Open

Evaluate using trigram matching for string search parameters #4452

mattwiller opened this issue Apr 25, 2024 · 0 comments
Labels
search Features and fixes related to search

Comments

@mattwiller
Copy link
Member

Currently, we use the built-in Postgres tsvector data type to index string fields for search. The computed search columns in our database tables have type TEXT, and we create an index for each such column e.g. in the lookup table for Address.address:

CREATE INDEX CONCURRENTLY IF NOT EXISTS "Address_address_idx_tsv" ON "Address"
USING gin (to_tsvector('simple', address))

This allow us to query against the column using tsquery and hit the index to achieve acceptable performance:

SELECT * FROM "Address"
WHERE address @@ to_tsquery('simple', 'oakl:*');

The text vector data type is optimized for full text search, where we want to find similar words or lexemes in a given language within a large document written in that language. In contrast, nearly all of the string values we encounter in FHIR search parameters are relatively short (i.e. a couple tens to hundreds of characters) and are not really written in any specific language: they tend to be proper nouns or niche technical concepts. Consequently, we have already observed some interesting issues where the tsvector approach does not give the results we would naively expect: for example, stemming and stop word rules have caused searches to return incorrect results or not find matching resources that do contain the search string.

Instead of vectors, we could use a trigram index: this is optimized for fast text similarity testing on arbitrary strings. We should be able to create another index over the TEXT column:

CREATE INDEX CONCURRENTLY IF NOT EXISTS "Address_address_idx_trgm" ON "Address"
USING gin (address gin_trgm_ops)
-- OR
CREATE INDEX CONCURRENTLY IF NOT EXISTS "Address_address_idx_trgm" ON "Address"
USING gist (address gist_trgm_ops)

This would allow us to similarly query the index using trigram functions and operators:

SELECT *, 1 - strict_word_similarity('oakl', address) AS dist FROM "Address"
WHERE 'oakl' <<% address
ORDER BY dist;

In theory, this solution should give us more consistent string matching with good performance. It should be evaluated for representative string searches, noting query performance and index size.

@mattwiller mattwiller added the search Features and fixes related to search label Apr 25, 2024
@mattwiller mattwiller added this to the Milestone Quality milestone Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
search Features and fixes related to search
Projects
Status: No status
Development

No branches or pull requests

1 participant