Examples of non-ideal merging of records #162

Open
andrewsu opened this issue Jun 26, 2023 · 1 comment

Comments

@andrewsu
Member

The merging of multiple records in source databases into a single record in mychem.info is a challenging process, and one where I doubt we'll ever get it perfectly "right". Having said that, I noticed an example where the current merging is not ideal, and so I'm creating this issue to document this example and others like it.

This is the API call that illustrates this example: https://mychem.info/v1/chem/GVJHHUAWPYXKBD-IEOSBIPESA-N?fields=chembl.molecule_chembl_id,chembl.max_phase,chembl.pref_name,drugcentral.xrefs.chembl_id

{
  "_id": "GVJHHUAWPYXKBD-IEOSBIPESA-N",
  "_version": 1,
  "chembl": {
    "_license": "https://bit.ly/2KAUCAm",
    "max_phase": 0,
    "molecule_chembl_id": "CHEMBL47",
    "pref_name": "VITAMIN E"
  },
  "drugcentral": [
    {
      "_license": "https://bit.ly/2SeEhUy",
      "xrefs": {
        "chembl_id": [
          "CHEMBL3989727",
          "CHEMBL2108106"
        ]
      }
    },
    {
      "_license": "https://bit.ly/2SeEhUy",
      "xrefs": {
        "chembl_id": [
          "CHEMBL3989727",
          "CHEMBL47"
        ]
      }
    }
  ]
}

mychem maps this record to a single ChEMBL ID, CHEMBL47, but DrugCentral also maps it to two additional IDs: CHEMBL3989727 and CHEMBL2108106. All of these IDs are some variant of Vitamin E. One reason this is confusing is that CHEMBL47 reports "max_phase": 0, whereas the other two are "max_phase": 4 (what one would expect for Vitamin E).
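To look for other records like this one, a check along the following lines might help. This is only a rough sketch: the response above is pasted inline (minus the `_license` and `_version` fields) so it runs offline, and `as_list` and `unmatched_chembl_xrefs` are hypothetical helper names, not part of mychem.info or any client library.

```python
# Response from the API call above, pasted inline so the sketch runs offline.
record = {
    "_id": "GVJHHUAWPYXKBD-IEOSBIPESA-N",
    "chembl": {
        "max_phase": 0,
        "molecule_chembl_id": "CHEMBL47",
        "pref_name": "VITAMIN E",
    },
    "drugcentral": [
        {"xrefs": {"chembl_id": ["CHEMBL3989727", "CHEMBL2108106"]}},
        {"xrefs": {"chembl_id": ["CHEMBL3989727", "CHEMBL47"]}},
    ],
}


def as_list(value):
    """Normalize a field that may be absent, a single value, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def unmatched_chembl_xrefs(doc):
    """Hypothetical helper: ChEMBL IDs listed in drugcentral.xrefs that
    have no matching chembl record merged into this document."""
    merged = {c["molecule_chembl_id"] for c in as_list(doc.get("chembl"))}
    referenced = set()
    for dc in as_list(doc.get("drugcentral")):
        referenced.update(as_list(dc.get("xrefs", {}).get("chembl_id")))
    return sorted(referenced - merged)


print(unmatched_chembl_xrefs(record))
# ['CHEMBL2108106', 'CHEMBL3989727']
```

The `as_list` normalization is there because a field in the response can come back as either a single object or a list (here `chembl` is an object while `drugcentral` is a list).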

@newgene
Member

newgene commented Jun 28, 2023

I had a quick look at this particular case. Most of these drugcentral documents (4097 out of 5399) include an inchikey field, as in this query:

https://mychem.info/v1/chem/GVJHHUAWPYXKBD-IEOSBIPESA-N?fields=drugcentral.xrefs,drugcentral.structures.inchikey

{
    "_id": "GVJHHUAWPYXKBD-IEOSBIPESA-N",
    "_version": 1,
    "drugcentral": [
        {
            "_license": "https://bit.ly/2SeEhUy",
            "structures": {
                "inchikey": "GVJHHUAWPYXKBD-IEOSBIPESA-N"
            },
            "xrefs": {
                "chebi": "CHEBI:18145",
                "chembl_id": [
                    "CHEMBL3989727",
                    "CHEMBL2108106"
                ],

In this case, the merging step is based on the inchikey and skips the rest of the xrefs IDs. Whether we should change this behavior (set a priority list of ID types, stop and merge once we find one) probably depends on how much we trust the drugcentral.xrefs.
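For discussion, the "priority list of ID types, stop at the first one found" idea could be sketched roughly as below. This is not the actual mychem.info merge code. `get_path`, `pick_merge_key`, and `ID_PRIORITY` are hypothetical names, and apart from inchikey coming first, the order is an assumption for illustration only.

```python
def get_path(doc, dotted_path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical priority order; the right order depends on how much
# each ID type is trusted.
ID_PRIORITY = ["structures.inchikey", "xrefs.chembl_id", "xrefs.chebi"]


def pick_merge_key(dc_doc, priority=ID_PRIORITY):
    """Return (field, value) for the first populated ID type, or None."""
    for field in priority:
        value = get_path(dc_doc, field)
        if isinstance(value, list):
            # Taking the first element is arbitrary; a real implementation
            # would need a rule for multi-valued xrefs.
            value = value[0] if value else None
        if value:
            return field, value
    return None


# Fields taken from the drugcentral document shown above.
doc = {
    "structures": {"inchikey": "GVJHHUAWPYXKBD-IEOSBIPESA-N"},
    "xrefs": {
        "chebi": "CHEBI:18145",
        "chembl_id": ["CHEMBL3989727", "CHEMBL2108106"],
    },
}

print(pick_merge_key(doc))
# ('structures.inchikey', 'GVJHHUAWPYXKBD-IEOSBIPESA-N')
print(pick_merge_key(doc, ["xrefs.chembl_id", "structures.inchikey"]))
# ('xrefs.chembl_id', 'CHEMBL3989727')
```

The second call shows how sensitive the result is to the order: putting `xrefs.chembl_id` first would merge this document under a ChEMBL ID, and a multi-valued xref forces a further choice about which ID wins.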
