Implement GPAD/GPI cross check so that bad identifiers are eliminated early and do not make it to output #2066

Open
kltm opened this issue Sep 12, 2023 · 12 comments

Comments

@kltm
Member

kltm commented Sep 12, 2023

When a GPAD and its associated GPI are parsed, a check should occur (GORULE:0000001), and if no matching identifier is found in the GPI, the annotation should be eliminated.
Currently, when a GAF is emitted and a GPAD annotation has no matching identifier in the GPI file, the annotation is passed through, but with taxon:0.

Under no circumstances should taxon:0 be emitted by the ontobio code.
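
For illustration only, here is a minimal sketch of the kind of cross check described above. It is not the ontobio implementation. It assumes plain tab-separated GPAD and GPI lines in which columns 1 and 2 hold the database and the object identifier, and the function names `entity_ids` and `cross_check` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def entity_ids(gpi_lines: Iterable[str]) -> Set[str]:
    """Collect 'DB:ID' identifiers from GPI lines.

    Assumption: columns 1 and 2 are the database and the object ID.
    """
    ids: Set[str] = set()
    for line in gpi_lines:
        # Skip blank lines and '!' header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            ids.add(f"{cols[0]}:{cols[1]}")
    return ids


def cross_check(
    gpad_lines: Iterable[str], known_ids: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split GPAD lines into (kept, dropped) by whether their ID is in the GPI."""
    kept: List[str] = []
    dropped: List[str] = []
    for line in gpad_lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        curie = f"{cols[0]}:{cols[1]}" if len(cols) >= 2 else ""
        # Unknown identifiers are collected so they can be reported,
        # rather than being emitted downstream with a placeholder taxon.
        (kept if curie in known_ids else dropped).append(line.rstrip("\n"))
    return kept, dropped
```

Returning the dropped lines alongside the kept ones leaves room for the caller to report them against the rule instead of discarding them silently.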

@kltm
Member Author

kltm commented Sep 12, 2023

This is a preventative measure; cleaning bad incoming data is here: #2061

@kltm
Member Author

kltm commented Sep 12, 2023

Note that we will initially try to land this after the 3.10 requirement in ontobio, which we are aiming to have in place by the end of the week. If we cannot make that, we'll try some kind of "backport" or other hack to get this done sooner rather than later.

@dustine32
Contributor

@mugitty Here are two example taxon:0 lines for ZFIN that we observed being emitted in the 2023-07-27 GO release zfin.gaf:

ZFIN    ZDB-GENE-070117-1552            acts_upstream_of_or_within      GO:0045601      PMID:17531218   IMP        ZFIN:ZDB-GENO-080318-17 P                       gene_product    taxon:0 20080326        ZFIN
ZFIN    ZDB-GENE-070117-2459            acts_upstream_of_or_within      GO:0033334      PMID:24173565   IMP        ZFIN:ZDB-GENO-100420-6  P                       gene_product    taxon:0 20141230        ZFIN

We can work backwards to find the Noctua GPAD lines for these and debug how the annotations are coming through the ontobio validation code, paying special attention to where any GPI is accessed to pull in taxon info.
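
As a starting point for working backwards, a small sketch like the following could list the identifiers behind every taxon:0 row in a GAF. It assumes a tab-separated GAF in which column 13 is the taxon column; the function name `taxon_zero_rows` is hypothetical and is not part of ontobio.

```python
from typing import Iterable, Iterator, Tuple

TAXON_COLUMN = 12  # zero-based index of GAF column 13 (Taxon)


def taxon_zero_rows(gaf_lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (object CURIE, GO ID, reference) for each row carrying taxon:0."""
    for line in gaf_lines:
        # Skip '!' header/comment lines.
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= TAXON_COLUMN:
            continue
        # The taxon column may hold several pipe-separated values.
        if "taxon:0" in cols[TAXON_COLUMN].split("|"):
            yield f"{cols[0]}:{cols[1]}", cols[4], cols[5]
```

The function takes any iterable of lines, so a compressed file can be passed in with `gzip.open("zfin.gaf.gz", "rt")`. The yielded identifier, GO term, and reference can then be searched for in the Noctua GPAD source.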

mugitty added a commit to geneontology/pipeline that referenced this issue Sep 14, 2023
@kltm
Member Author

kltm commented Sep 14, 2023

Now testing on master.

pgaudet moved this from TODO to Clearing in GORULES (low-hanging fruit) Sep 21, 2023
pgaudet removed this from Clearing in GORULES (low-hanging fruit) Sep 21, 2023
kltm moved this from Working to Clearing in Ongoing data QC and pipeline maintenance Sep 22, 2023
@kltm
Member Author

kltm commented Sep 22, 2023

@mugitty / @dustine32 This looks to be complete now?

sjcarbon@moiraine:/tmp$:( zgrep -c taxon:0 zfin.gaf.gz 

Can you confirm that this is reporting as desired? If so, let's go ahead and close this.

@dustine32
Contributor

@kltm Correct, annotations to those GPs not in the GPI are now being dropped, thus preventing taxon:0 from being written out to the GAF. Thanks @mugitty!

Closing as completed.

@kltm
Member Author

kltm commented Oct 4, 2023

@dustine32 @mugitty Sorry to bring this up again, but are these being "silently" dropped when emitted, or are they being reported as a GORULE violation somewhere? (I was unable to see anything in noctua_zfin.gaf.gz.) I tried to find the ontobio code, but couldn't track it back quickly.

kltm reopened this Oct 4, 2023
@dustine32
Contributor

@kltm You know, looking back through the commits in ontobio since we made this ticket, I don't think we ever implemented an explicit "if not in GPI BioEntities then drop annotation". Instead, we just implemented the specific fix for taxon:0 which effectively solved our issue.

However, I don't see those Report.INVALID_TAXON GO rule 1 errors in the noctua_zfin-report.html. We'll have to debug why.

@pgaudet
Contributor

pgaudet commented Oct 5, 2023

Should we move this to the go-rules project ?

@kltm
Member Author

kltm commented Oct 5, 2023

@pgaudet If it is a current, ongoing data issue that needs to be solved before release, it would seem to fit here better than the rolling GORULES, from my POV. If it's a non-blocking issue, this can switch over to GORULES.

@dustine32
Contributor

Sigh. I think I now know why we don't see those INVALID_TAXON errors in the noctua_zfin.report.md file.

  1. In mega-make, the noctua_zfin-src.gpad file is parsed and mixed in with the validated, upstream zfin.gaf via validate.py produce. The GPI files are used here to fetch taxon info so lines for our example genes (e.g., ZDB-GENE-070117-1552) are caught and reported because no taxon can be fetched from the GPI. I can see these errors in the noctua_zfin.report.md produced by validate.py.
  2. In the "temporary post filter" step, the same noctua_zfin-src.gpad file is parsed via ontobio-parse-assocs.py and the resulting noctua_zfin.gpad and reports are copied back to skyhook, overwriting the reports from the previous mega-make step. The key thing here is that the ontobio-parse-assocs.py command is not given a GPI argument, so no entity info is checked by the GpadParser and our example gene annotations are allowed to go through to the product noctua_zfin.gpad (and hence there are no errors in the new report).

Some suggestions to fix:

  1. Supply the --gpi arg to the ontobio-parse-assocs.py command.
  2. Stop overwriting the reports on skyhook by adding some unique suffix to the filenames.
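
To illustrate the second suggestion, here is a rough sketch of deriving a per-stage report filename so that a later step cannot overwrite an earlier report. The helper name `staged_report_name` and the stage labels are hypothetical; the actual pipeline may name and copy its files differently.

```python
from pathlib import Path


def staged_report_name(report: str, stage: str) -> str:
    """Insert a stage label before the final extension of a report filename."""
    path = Path(report)
    # 'noctua_zfin.report.md' has stem 'noctua_zfin.report' and suffix '.md'.
    return str(path.with_name(f"{path.stem}.{stage}{path.suffix}"))


# Hypothetical stage labels for the two pipeline steps that write reports.
mega_make_report = staged_report_name("noctua_zfin.report.md", "mega-make")
post_filter_report = staged_report_name("noctua_zfin.report.md", "post-filter")
```

With distinct names, both reports survive the copy back to skyhook and can be compared directly.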

@kltm More details can be figured out later but for now my brain is broke.

@kltm
Member Author

kltm commented Oct 6, 2023

@dustine32 Let's make sure your brain is cared for and regroup after to figure out the best course.
