Assembly Step for new pipeline kernel: Questions and Strategies #1676

Open
dougli1sqrd opened this issue Apr 16, 2021 · 2 comments

@dougli1sqrd
Contributor

In geneontology/pipeline#206 we're taking steps to reform the pipeline kernel. Currently, @dustine32 and I are working on the Assembly step ("shovel2pile"), which should take "pristine", validated annotations in gpad+gpi format and merge any mixin gpads into the final product.

For example, we have mgi and paint_mgi. At the end of the run, a validated paint_mgi will be merged into a validated mgi, and their corresponding headers will also be joined, to produce the final mgi dataset product.
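As a rough illustration of that merge, here is a minimal Python sketch. It assumes header lines are the ones starting with "!" and that joining means simple concatenation, dataset first and mixins after. The names split_header and assemble are made up for this example and are not existing goat functions.

```python
from typing import Iterable, List, Tuple


def split_header(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate header lines (assumed to start with '!') from annotation lines."""
    header: List[str] = []
    body: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        (header if line.startswith("!") else body).append(line)
    return header, body


def assemble(dataset: Iterable[str], *mixins: Iterable[str]) -> List[str]:
    """Return all headers (dataset first, then mixins) followed by all annotations."""
    headers: List[str] = []
    bodies: List[str] = []
    for source in (dataset, *mixins):
        header, body = split_header(source)
        headers.extend(header)
        bodies.extend(body)
    return headers + bodies
```

Plain concatenation would repeat any header line that appears in both files (a format version line, for instance), so a real assemble step would probably need to deduplicate or rewrite those.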

Here we discuss various strategies for this:

  • Final <dataset> = Sum[<dataset>.header, <mixin0>.header, <mixin1>.header, ...] + Sum[<dataset>, <mixin0>, <mixin1>, ...]
    • Example: paint_mgi will have merges_into: mgi in its metadata.
      1. download-annotation-sources.py annotations -g mgi -g paint -x [the rest of paint]
        • sources: mgi.gpad, paint_mgi.gpad
      2. goat pristine sources/
        • pristine: mgi_valid.gpad, paint_mgi_valid.gpad
      3. goat assemble
        • assemble: mgi.gpad (contains mgi_valid and paint_mgi_valid), paint_mgi.gpad
    • So how does assemble know that paint_mgi_valid should be mixed into mgi_valid?
      • mgi_valid -> mgi; paint_mgi_valid -> paint_mgi; <mixin>_<dataset>
        • paint_mgi is a mixin because, when we match it against <mixin>_<dataset>, the <dataset> part matches an existing source, namely "mgi".
        • We find potential mixins by the filename, and separate on the first underscore. If we get a mixin pattern, we can check if the <dataset> part of the name corresponds to an existing file in "pristine". If it does, then we have a <dataset>, and a <mixin>_<dataset> match.
        • We can then look at the datasets yaml. For a mixin <group>_<dataset>, look in <group>.yaml for a <dataset> entry and check whether it has merges_into: <dataset>. If so, we can confirm that this mixin should merge into the given dataset name.
        • A drawback with this is that we're very tied to the filenames and dataset names.
    • Alternatively: instead of the mixin metadata yamls saying what they merge into, we change the metadata so that primary datasets state what mixins they desire. Example: mgi would have: "has_mixin": ["paint_mgi"]
      • For every file in "pristine", we look up the metadata entry for that file, and look for any mixins. If we also have a file with the mixin name, we perform the mixin logic above.
      • Drawback: This requires changing the metadata yamls formally.
      • This seems ultimately easier though, and less brittle to filename/dataset name changes.
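
To make the two lookup strategies above concrete, here is a hedged Python sketch of both. It assumes pristine files are named <name>_valid.gpad, as in the example, and that the metadata has already been loaded into a plain dict keyed by dataset name. The function names and that dict shape are hypothetical, not the current goat or datasets yaml layout.

```python
from typing import Dict, Iterable, List, Set

SUFFIX = "_valid.gpad"  # assumed naming of files in "pristine"


def dataset_name(filename: str) -> str:
    """mgi_valid.gpad -> mgi; paint_mgi_valid.gpad -> paint_mgi."""
    return filename[: -len(SUFFIX)] if filename.endswith(SUFFIX) else filename


def mixins_by_filename(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Strategy 1: split on the first underscore and check that the
    <dataset> part is itself a file in pristine."""
    pristine: Set[str] = {dataset_name(f) for f in filenames}
    found: Dict[str, List[str]] = {}
    for name in sorted(pristine):
        _, sep, dataset = name.partition("_")
        if sep and dataset in pristine:
            found.setdefault(dataset, []).append(name)
    return found


def mixins_by_metadata(
    filenames: Iterable[str], metadata: Dict[str, dict]
) -> Dict[str, List[str]]:
    """Strategy 2: each primary dataset lists its mixins under "has_mixin";
    keep only the mixins that are actually present in pristine."""
    pristine: Set[str] = {dataset_name(f) for f in filenames}
    found: Dict[str, List[str]] = {}
    for name in sorted(pristine):
        wanted = metadata.get(name, {}).get("has_mixin", [])
        present = [m for m in wanted if m in pristine]
        if present:
            found[name] = present
    return found
```

For the files mgi_valid.gpad and paint_mgi_valid.gpad, both functions map mgi to [paint_mgi]. The difference is where the knowledge lives: the first relies entirely on the naming convention, the second on the "has_mixin" entry.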
@kltm
Member

kltm commented Apr 17, 2021

Basically, I think the thing to argue against is as follows:
Let's say we have a directory with all valid products in it. Without the metadata file, would there ever be a situation where I couldn't eyeball it and assemble things correctly? (I think that not bothering with the metadata might also put us in a better situation to pivot to species-orientation.)

@dougli1sqrd
Contributor Author

We definitely could eyeball it and figure it out. But that's only because of how the names historically happen to line up: paint_mgi goes into mgi. If we did it this way, then the name would convey real semantic meaning. Which is fine if we want to do that, but I think I feel mild discomfort about it? Maybe it just feels brittle. But I'm definitely not opposed. We'd have to document this fact somewhere.

Although, now that I'm saying this, we are the ones that control all the "mixins", so the naming convention is mostly on us anyway. I'm less discomforted by that since realistically we will mostly control the mix-in sources.
