Relationship indices are not generated correctly when there are multiple relationships relating to same source/target/type #42

Closed
wardle opened this issue Nov 7, 2022 · 0 comments
Labels
bug Something isn't working

Owner

wardle commented Nov 7, 2022

SNOMED distributions can, even in a snapshot, have more than one relationship relating to the same combination of source, target, type and modifier identifiers.

For example, this is from the September 2022 UK edition:

➜  Terminology git:(main) head -n 1 sct2_Relationship_UKCLSnapshot_GB1000000_20220928.txt 
id	effectiveTime	active	moduleId	sourceId	destinationId	relationshipGroup	typeId	characteristicTypeId	modifierId
➜  Terminology git:(main) cat sct2_Relationship_UKCLSnapshot_GB1000000_20220928.txt | grep 1089261000000101
832591000000123	20210512	0	999000011000000103	1089261000000101	609336008	0	116680003	900000000000011006	900000000000451002
2191421000000129	20210512	0	999000011000000103	1089261000000101	301857004	0	116680003	900000000000011006	900000000000451002
3219831000000124	20210512	1	999000011000000103	1089261000000101	773760007	2	42752001	900000000000011006	900000000000451002
3228451000000128	20210512	1	999000011000000103	1089261000000101	51576004	1	363698007	900000000000011006	900000000000451002
3229451000000120	20210512	1	999000011000000103	1089261000000101	12835000	1	116676008	900000000000011006	900000000000451002
3229461000000123	20210512	1	999000011000000103	1089261000000101	213345000	0	116680003	900000000000011006	900000000000451002
5687171000000128	20210512	0	999000011000000103	1089261000000101	213345000	0	116680003	900000000000011006	900000000000451002
5687191000000129	20210512	0	999000011000000103	1089261000000101	36818005	1	116676008	900000000000011006	900000000000451002
5687201000000127	20210512	0	999000011000000103	1089261000000101	52530000	1	363698007	900000000000011006	900000000000451002

Here, 3229461000000123 and 5687171000000128 both relate to the same source, target and type:

3229461000000123	20210512	1	999000011000000103	1089261000000101	213345000	0	116680003	900000000000011006	900000000000451002
5687171000000128	20210512	0	999000011000000103	1089261000000101	213345000	0	116680003	900000000000011006	900000000000451002	
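
A minimal sketch of how such duplicates could be found in a relationship file, assuming the tab-separated column layout shown in the header above. The function name `duplicate_tuples` is hypothetical and is not part of this project; the sample holds only the two rows quoted above.

```python
import csv
import io
from collections import defaultdict

# Header plus the two rows quoted above (tab-separated, as in the RF2 file).
SAMPLE = (
    "id\teffectiveTime\tactive\tmoduleId\tsourceId\tdestinationId\t"
    "relationshipGroup\ttypeId\tcharacteristicTypeId\tmodifierId\n"
    "3229461000000123\t20210512\t1\t999000011000000103\t1089261000000101\t"
    "213345000\t0\t116680003\t900000000000011006\t900000000000451002\n"
    "5687171000000128\t20210512\t0\t999000011000000103\t1089261000000101\t"
    "213345000\t0\t116680003\t900000000000011006\t900000000000451002\n"
)


def duplicate_tuples(lines):
    """Group rows by (sourceId, typeId, destinationId) and keep only the
    groups that contain more than one relationship row.
    Hypothetical helper, for illustration only."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    by_key = defaultdict(list)
    for row in reader:
        key = (row["sourceId"], row["typeId"], row["destinationId"])
        by_key[key].append((row["id"], row["active"]))
    return {key: rels for key, rels in by_key.items() if len(rels) > 1}


dupes = duplicate_tuples(io.StringIO(SAMPLE))
for key, rels in dupes.items():
    print(key, rels)
```

To scan a whole distribution file, an open file handle can be passed in place of the `io.StringIO` sample.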

Relationship indexing is currently done in a single pass during relationship import. That would work if no two relationships carried essentially the same data. In this case, the row that appears later in the file (5687171000000128) marks the relationship as inactive, while the earlier row (3229461000000123) marks it as active.

The current import compares effective dates and, if the incoming row's date is the same or later, deletes the relationship from the index because that row is inactive. This is incorrect behaviour when multiple relationships can reference the same source-target-type tuple.
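
The failure can be reproduced with a simplified model of single-pass indexing. This is only a sketch under stated assumptions: the index is modelled as a set keyed by (source, type, target), and `single_pass_index` is a hypothetical name, not the project's actual import code.

```python
# (id, effectiveTime, active, sourceId, destinationId, typeId),
# taken from the two rows quoted above.
ROWS = [
    ("3229461000000123", "20210512", True,
     "1089261000000101", "213345000", "116680003"),
    ("5687171000000128", "20210512", False,
     "1089261000000101", "213345000", "116680003"),
]


def single_pass_index(rows):
    """Simplified single-pass model (an assumption, not the real code):
    each row updates the index entry for its source-type-target tuple
    whenever its effective date is the same as or later than the last
    one applied to that tuple."""
    index = set()
    latest = {}  # tuple -> effectiveTime of the last row applied
    for rel_id, effective, active, source, dest, type_id in rows:
        key = (source, type_id, dest)
        # YYYYMMDD strings compare correctly as plain strings.
        if effective >= latest.get(key, ""):
            latest[key] = effective
            if active:
                index.add(key)
            else:
                # The inactive row removes the entry that a different,
                # still active relationship had just added.
                index.discard(key)
    return index


result = single_pass_index(ROWS)
print(result)  # the tuple is missing although 3229461000000123 is active
```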

As a result, about 70 concepts do not have the correct relationships stored, which affects search and inference. The fix is to adopt a two-pass approach: import the relationships first, then rebuild the indices after import.
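
A rough sketch of the two-pass idea, assuming relationships are first stored by their own identifier and the index is then derived from whatever is active in that store. The names `import_relationships` and `rebuild_index` are hypothetical; the actual change is in commit fd90ded.

```python
# The two conflicting rows from the example above, as dictionaries.
ROWS = [
    {"id": "3229461000000123", "effectiveTime": "20210512", "active": True,
     "sourceId": "1089261000000101", "destinationId": "213345000",
     "typeId": "116680003"},
    {"id": "5687171000000128", "effectiveTime": "20210512", "active": False,
     "sourceId": "1089261000000101", "destinationId": "213345000",
     "typeId": "116680003"},
]


def import_relationships(rows):
    """Pass 1: store each relationship under its own id. For a repeated
    id, the row with the same or later effective date replaces the
    stored one."""
    store = {}
    for rel in rows:
        current = store.get(rel["id"])
        if current is None or rel["effectiveTime"] >= current["effectiveTime"]:
            store[rel["id"]] = rel
    return store


def rebuild_index(store):
    """Pass 2: derive the index from scratch. A tuple is indexed if at
    least one stored relationship for it is active, so an inactive
    sibling cannot remove it."""
    return {
        (rel["sourceId"], rel["typeId"], rel["destinationId"])
        for rel in store.values()
        if rel["active"]
    }


store = import_relationships(ROWS)
index = rebuild_index(store)
print(index)
```

Because the index is computed only after every row has been stored, the result does not depend on the order in which the rows appear in the file.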

@wardle wardle added the bug Something isn't working label Nov 7, 2022
@wardle wardle closed this as completed in fd90ded Nov 7, 2022