Relationship indices are not generated correctly when there are multiple relationships relating to same source/target/type #42

wardle · 2022-11-07T13:34:39Z

SNOMED distributions can, even in a snapshot, have more than one relationship relating to the same combination of source, target, type and modifier identifiers.

For example, this is from the September 2022 UK edition:

➜  Terminology git:(main) head -n 1 sct2_Relationship_UKCLSnapshot_GB1000000_20220928.txt 
id	effectiveTime	active	moduleId	sourceId	destinationId	relationshipGroup	typeId	characteristicTypeId	modifierId
➜  Terminology git:(main) cat sct2_Relationship_UKCLSnapshot_GB1000000_20220928.txt | grep 1089261000000101
832591000000123	20210512	0	999000011000000103	1089261000000101	609336008	0	116680003	900000000000011006	900000000000451002
2191421000000129	20210512	0	999000011000000103	1089261000000101	301857004	0	116680003	900000000000011006	900000000000451002
3219831000000124	20210512	1	999000011000000103	1089261000000101	773760007	2	42752001	900000000000011006	900000000000451002
3228451000000128	20210512	1	999000011000000103	1089261000000101	51576004	1	363698007	900000000000011006	900000000000451002
3229451000000120	20210512	1	999000011000000103	1089261000000101	12835000	1	116676008	900000000000011006	900000000000451002
3229461000000123	20210512	1	999000011000000103	1089261000000101	213345000	0	116680003	900000000000011006	900000000000451002
5687171000000128	20210512	0	999000011000000103	1089261000000101	213345000	0	116680003	900000000000011006	900000000000451002
5687191000000129	20210512	0	999000011000000103	1089261000000101	36818005	1	116676008	900000000000011006	900000000000451002
5687201000000127	20210512	0	999000011000000103	1089261000000101	52530000	1	363698007	900000000000011006	900000000000451002

In this output, relationships 3229461000000123 and 5687171000000128 both relate to the same source, target and type:

3229461000000123	20210512	1	999000011000000103	1089261000000101	213345000	0	116680003	900000000000011006	900000000000451002
5687171000000128	20210512	0	999000011000000103	1089261000000101	213345000	0	116680003	900000000000011006	900000000000451002
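One way to check a release file for such cases is to group relationship ids by their source, target and type. The following is a minimal sketch, not code from this project: the function name `find_shared_tuples` is hypothetical, and it assumes the tab-delimited RF2 column names shown in the header above.

```python
import csv
import io
from collections import defaultdict

# Column names as shown by `head -n 1` on the relationship snapshot file.
HEADER = "\t".join([
    "id", "effectiveTime", "active", "moduleId", "sourceId",
    "destinationId", "relationshipGroup", "typeId",
    "characteristicTypeId", "modifierId",
])


def find_shared_tuples(rf2_text):
    """Return {(sourceId, destinationId, typeId): [(id, active), ...]}
    for every tuple referenced by more than one relationship id.
    Hypothetical helper, for illustration only."""
    reader = csv.DictReader(
        io.StringIO(rf2_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    groups = defaultdict(list)
    for row in reader:
        key = (row["sourceId"], row["destinationId"], row["typeId"])
        groups[key].append((row["id"], row["active"]))
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Three of the rows from the grep output above.
SAMPLE = "\n".join([
    HEADER,
    "\t".join(["3229451000000120", "20210512", "1", "999000011000000103",
               "1089261000000101", "12835000", "1", "116676008",
               "900000000000011006", "900000000000451002"]),
    "\t".join(["3229461000000123", "20210512", "1", "999000011000000103",
               "1089261000000101", "213345000", "0", "116680003",
               "900000000000011006", "900000000000451002"]),
    "\t".join(["5687171000000128", "20210512", "0", "999000011000000103",
               "1089261000000101", "213345000", "0", "116680003",
               "900000000000011006", "900000000000451002"]),
])

shared = find_shared_tuples(SAMPLE)
```

On this sample, `shared` holds a single key, the tuple for source 1089261000000101, target 213345000 and type 116680003, mapped to the two relationship ids with their active flags.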

The current relationship indexing is done in a single pass during relationship import. This would work if no two relationships carried essentially the same data. Here, however, the second row (5687171000000128) marks the relationship as inactive, while the first row (3229461000000123) marks it as active, and both have the same effectiveTime.

The current import looks at the effective date and, if it is the same or later, deletes the relationship because it is inactive in the later row. This is incorrect behaviour when multiple relationships can reference the same source-target-type tuple.

As a result, around 70 concepts do not have correct relationships stored, which affects search and inference. The fix is to adopt a two-pass approach: import the relationships first, then rebuild the indices after import.
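The difference between the two approaches can be sketched as follows. This is a simplified model under stated assumptions, not the project's actual implementation: the names `Rel`, `index_single_pass` and `index_two_pass` are hypothetical, and the index is modelled as a plain set of (source, type, target) tuples.

```python
from collections import namedtuple

# Hypothetical, simplified relationship record (only the fields needed here).
Rel = namedtuple("Rel", "id effective_time active source target type")


def index_single_pass(rows):
    """Model of the problem: the index is updated as each row is read,
    keyed only by (source, type, target), so an inactive row with the same
    or a later effective time removes an entry another relationship needs."""
    index = set()
    latest = {}  # tuple -> effective time of the last row applied to it
    for r in rows:
        key = (r.source, r.type, r.target)
        if r.effective_time >= latest.get(key, ""):
            latest[key] = r.effective_time
            if r.active:
                index.add(key)
            else:
                index.discard(key)
    return index


def index_two_pass(rows):
    """Pass 1 stores every relationship under its own id; pass 2 rebuilds
    the index from whichever stored relationships are active."""
    store = {}
    for r in rows:
        previous = store.get(r.id)
        if previous is None or r.effective_time >= previous.effective_time:
            store[r.id] = r
    return {(r.source, r.type, r.target) for r in store.values() if r.active}


# The two rows from the example above, in file order.
rows = [
    Rel("3229461000000123", "20210512", True,
        "1089261000000101", "213345000", "116680003"),
    Rel("5687171000000128", "20210512", False,
        "1089261000000101", "213345000", "116680003"),
]

single = index_single_pass(rows)
two_pass = index_two_pass(rows)
```

In this model the single-pass index ends up empty for these two rows, while the two-pass index keeps the tuple because relationship 3229461000000123 is still active.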


wardle added the bug label Nov 7, 2022

wardle closed this as completed in fd90ded Nov 7, 2022
