You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, we use the built-in Postgres tsvector data type to index string fields for search. The computed search columns in our database tables have type TEXT, and we create an index for each such column e.g. in the lookup table for Address.address:
CREATEINDEXCONCURRENTLY IF NOT EXISTS "Address_address_idx_tsv"ON"Address"
USING gin (to_tsvector('simple', address))
This allow us to query against the column using tsquery and hit the index to achieve acceptable performance:
The text vector data type is optimized for full text search, where we want to find similar words or lexemes in a given language within a large document written in that language. In contrast, nearly all of the string values we encounter in FHIR search parameters are relatively short (i.e. a couple tens to hundreds of characters) and are not really written in any specific language: they tend to be proper nouns or niche technical concepts. Consequently, we have already observed some interesting issues where the tsvector approach does not give the results we would naively expect: for example, stemming and stop word rules have caused searches to return incorrect results or not find matching resources that do contain the search string.
Instead of vectors, we could use a trigram index: this is optimized for fast text similarity testing on arbitrary strings. We should be able to create another index over the TEXT column:
CREATEINDEXCONCURRENTLY IF NOT EXISTS "Address_address_idx_trgm"ON"Address"
USING gin (address gin_trgm_ops)
-- ORCREATEINDEXCONCURRENTLY IF NOT EXISTS "Address_address_idx_trgm"ON"Address"
USING gist (address gist_trgm_ops)
This would allow us to similarly query the index using trigram functions and operators:
SELECT*, 1- strict_word_similarity('oakl', address) AS dist FROM"Address"WHERE'oakl'<<% address
ORDER BY dist;
In theory, this solution should give us more consistent string matching with good performance. It should be evaluated for representative string searches, noting query performance and index size.
The text was updated successfully, but these errors were encountered:
Currently, we use the built-in Postgres
tsvector
data type to index string fields for search. The computed search columns in our database tables have typeTEXT
, and we create an index for each such column e.g. in the lookup table forAddress.address
:This allow us to query against the column using
tsquery
and hit the index to achieve acceptable performance:The text vector data type is optimized for full text search, where we want to find similar words or lexemes in a given language within a large document written in that language. In contrast, nearly all of the
string
values we encounter in FHIR search parameters are relatively short (i.e. a couple tens to hundreds of characters) and are not really written in any specific language: they tend to be proper nouns or niche technical concepts. Consequently, we have already observed some interesting issues where thetsvector
approach does not give the results we would naively expect: for example, stemming and stop word rules have caused searches to return incorrect results or not find matching resources that do contain the search string.Instead of vectors, we could use a trigram index: this is optimized for fast text similarity testing on arbitrary strings. We should be able to create another index over the
TEXT
column:This would allow us to similarly query the index using trigram functions and operators:
In theory, this solution should give us more consistent string matching with good performance. It should be evaluated for representative
string
searches, noting query performance and index size.The text was updated successfully, but these errors were encountered: