GAF/Source validation processing should occur with only one pass #1384

Open
dougli1sqrd opened this issue Feb 19, 2020 · 6 comments

@dougli1sqrd
Contributor

Currently the Makefile and ontobio validate are structured in a very group-centric way. So when processing, say, fb, we do all the processing that requires to completion. This includes merging any "mix-in" datasets, for example PAINT. So in the course of validating fb we also download and validate paint_fb.

In normal pipeline mode, though, we also process paint separately, including paint_fb. Because we already validated paint_fb in the course of validating fb, paint_fb (and potentially any other mix-in source) gets processed twice.
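
To make the double processing concrete, here is a minimal sketch that counts validations under a group-centric layout. The `GROUPS` dictionary and the function name are hypothetical and only mirror the fb/paint example above; they are not the actual pipeline metadata or ontobio API.

```python
from collections import Counter

# Hypothetical layout: each group lists its own datasets and the mix-ins it merges.
GROUPS = {
    "fb": {"datasets": ["fb"], "mixins": ["paint_fb"]},
    "paint": {"datasets": ["paint_fb"], "mixins": []},
}


def group_centric_validations(groups):
    """Count validations per dataset when every group validates
    its own datasets plus any mix-ins it merges."""
    counts = Counter()
    for group in groups.values():
        for name in group["datasets"] + group["mixins"]:
            counts[name] += 1
    return counts


counts = group_centric_validations(GROUPS)
duplicates = sorted(name for name, n in counts.items() if n > 1)
print(duplicates)  # ['paint_fb']
```

Any dataset that shows up in `duplicates` is one that a single-pass design would validate only once.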

On its own, processing twice is a little lame, but it has been okay for quite some time. But as we have expanded the features of ontobio validate and the pipeline, there have been difficulties. In particular, #1253 is ultimately caused by the "double processing" issue outlined above.

The pipeline, Makefile, and ontobio should be structured so that main validation of sources only occurs once per dataset. Merging of mix-ins into main sources as output products can come as a separate step. @kltm and I will expand on solutions here to do this.

@dougli1sqrd
Contributor Author

It also occurs to me that doing only one pass would completely eliminate any mid-validate GAF downloads. Recall that we have had an issue where mix-in GAFs are still downloaded within ontobio.

@kltm
Member

kltm commented Feb 19, 2020

@dougli1sqrd Good point--we've been burned by that before as well.

@kltm
Member

kltm commented Feb 19, 2020

This will also add clarity as we add more internal and upstream sources.

@kltm
Member

kltm commented Feb 20, 2020

Talking to @dougli1sqrd earlier, one idea would be to process all incoming files without any thought of mixins, get the files, then perform the reassembly as a discrete step afterwards. This would make it easier to trace issues, view intermediate products, and add new sources/products/mixins in the future.
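
A rough sketch of that idea, assuming hypothetical helper names (`validate`, `validate_all`, `assemble`) and a stand-in validator that only drops blank and `!` comment lines. None of this is the ontobio API; it just shows validation and reassembly as two discrete steps, with each dataset validated once.

```python
def validate(lines):
    """Stand-in validator: drop blank lines and '!' comment lines."""
    return [ln for ln in lines if ln.strip() and not ln.startswith("!")]


def validate_all(sources):
    """Pass 1: validate every incoming dataset exactly once, ignoring mix-ins."""
    return {name: validate(lines) for name, lines in sources.items()}


def assemble(validated, mixins):
    """Pass 2: build output products by merging already-validated mix-ins."""
    products = {}
    for target, mixin_names in mixins.items():
        merged = list(validated[target])
        for mixin in mixin_names:
            merged.extend(validated[mixin])
        products[target] = merged
    return products


# Placeholder rows, not real annotations.
sources = {
    "fb": ["!gaf-version: 2.1", "fb-row-1", "fb-row-2"],
    "paint_fb": ["!gaf-version: 2.1", "paint-row-1"],
}
validated = validate_all(sources)
products = assemble(validated, {"fb": ["paint_fb"]})
print(products["fb"])  # ['fb-row-1', 'fb-row-2', 'paint-row-1']
```

The `validated` dictionary is the intermediate product: it can be inspected on its own, and adding a new mix-in only changes the mapping passed to `assemble`.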

kltm added a commit to geneontology/pipeline that referenced this issue Feb 20, 2020
Add second pass at copying files to ensure that the "good" PAINT report files are the last ones over.
For issue geneontology/go-site#1253 .
Can go away with flow change in geneontology/go-site#1384 .
@kltm
Member

kltm commented Mar 17, 2020

Note that we need this to get to geneontology/pipeline#27

@dougli1sqrd
Contributor Author

The order of operations in ontobio will make this difficult. Currently it works in this order:

  1. Produce the pristine GAF.
  2. Make the GPI.
  3. Merge in mix-in datasets (for example, paint_fb.gaf merges into fb.gaf).
  4. Make the rest of our products (GPAD, TTL).

Step 3 is what this issue addresses. But if step 4 is dependent on step 3, we will need to resolve this difficulty in order to complete this issue.
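
One way to reason about this is to write the steps down as an explicit dependency graph and let a topological sort pick a valid order. The sketch below uses `graphlib` from the Python standard library (3.9+). The step names and edges are assumptions for illustration, in particular the guess that GPAD and TTL need the merged GAF; they have not been checked against ontobio.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps it needs first. The edges are assumptions.
DEPENDS_ON = {
    "pristine_gaf": set(),
    "gpi": {"pristine_gaf"},
    "merged_gaf": {"pristine_gaf"},  # the mix-in merge, as its own step
    "gpad": {"merged_gaf"},
    "ttl": {"merged_gaf"},
}

order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

If step 4 really does depend on step 3, the merge stays upstream of GPAD and TTL in any valid order, so pulling the merge out of validation means those products move into the later assembly stage with it.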
