Workflow output definition #1227
base: dev
Conversation
Signed-off-by: Ben Sherman <[email protected]>
Liking this. If we really need the output block (rather than doing something with `emit`), this is a nice readable way of doing it.
This is beginning to look great. All the publishing logic is in one location, easy to review and understand where it's coming from. There are two downsides to this approach:
Agreeing with Adam, it's a bit too implicit, especially what is a path and what is a topic.
In the Nextflow PR there are some docs which explain the feature in more detail. Unfortunately the deploy preview isn't working so you'll have to look at the diff
Indeed, this is the downside of selecting channels instead of processes: more flexible, but more layers of indirection. We should be able to alleviate this with IDE tooling, i.e. hover over a selected channel to see its definition.
Thanks @pinin4fjords, I never responded to your idea about putting everything in the

The main question now is, how to bring the outputs for PREPARE_GENOME and RNASEQ up to the top-level workflow? I was thinking some kind of
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
The current prototype simply maps the output channels to the publish directory structure, but we still need to get these outputs to the top level, whereas currently they are nested under

Before I go off and add a gajillion channels to the

@adamrtalbot @pinin4fjords @maxulysse @ewels Since you guys understand this pipeline better than me, I'm wondering: how would you group all of these outputs if you could group them any way you want? You are no longer restricted to process selectors or directory names, but you could use those if you wanted. For example, I see the modules config for RNASEQ is grouped with these comments:
Would those be good top-level groupings for outputs? Then you might have topics called
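A minimal sketch of how such a topic-based grouping might be configured in the prototype's output DSL (topic names, directory names, and the `path` setting are all illustrative assumptions, not taken from the PR):

```nextflow
// Hypothetical sketch: configure each topic as a publish target
// in the output block. All names here are made up for illustration.
output {
    directory params.outdir

    'align-deseq2' {
        path 'star_salmon/deseq2/'
    }
    'quantification' {
        path 'star_salmon/'
    }
}
```

The point is that the grouping lives in one declarative block, independent of which processes happen to emit into each topic.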
I managed to move everything to the top-level workflow, so it should be executable now (though there are likely some bugs, will test tomorrow). I ended up using topics for everything, using the various publish directories to guide the topic names. Hope this gives you a more concrete sense of how topics are useful. The topics don't really reduce the amount of code, they just split it between the output DSL and the workflow
As Evan mentioned on Slack, this does seem very verbose:
But I understand why, since if even one of the outputs from a process needs to go to a different topic then you can't use the multi-channel object

Rather than doing this from the calling workflow, could e.g. QUANTIFY_STAR_SALMON use a topic as part of its emit, to 'suggest' a classification for that channel?
Then, if we all used good standards (e.g. an ontology for topics for outputs), calling workflows could have very minimal logic for this, relying on what the components said about their outputs. The calling workflow would only need to decide what to do with the topics in its outputs.
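A sketch of what 'suggesting' a classification in the emit could look like, using a `topic:` option on a process output (process name, output, and topic name are all hypothetical):

```nextflow
// Sketch: the component itself suggests a topic for its output,
// so calling workflows can rely on the component's classification.
// Process and topic names are illustrative only.
process QUANTIFY {
    output:
    path 'quant.sf', emit: results, topic: 'quantification'

    script:
    """
    touch quant.sf
    """
}
```

A calling workflow would then only decide how the `quantification` topic is published, rather than re-mapping each channel.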
We can definitely move some of these topic mappings into the modules and subworkflows, that was going to be my next step. I also suspect that nf-core will be able to converge on a shared ontology for these things. I'd still rather keep the topic mapping separate from the emits though, as we will need the
I moved most of the topic mappings into their respective subworkflows. It gets tricky when a workflow is used multiple times under a different name and with different publish behavior. For example, I can't set a "sensible default" in the subworkflow because I can't override the default later, I can only specify additional topics. Or I could specify a default and not use it in the output definition for rnaseq, instead re-mapping each alias to different topics as I am currently doing. However, keeping the topic mappings in the RNASEQ workflow is also tricky because the process/workflow might not be executed, in which case the topic mapping will fail. We might need to replicate the control flow in the topic:
```nextflow
if (!params.skip_alignment && params.aligner == 'star_rsem') {
    DESEQ2_QC_RSEM.out.rdata        >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.pca_txt      >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.pdf          >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.dists_txt    >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.size_factors >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.log          >> 'align-deseq2'
}
```

Totally doable, but unfortunate if we have to resort to it. @adamrtalbot noted in Slack that most Nextflow pipelines don't come close to this level of complexity, so I wouldn't be opposed to moving forward with what we have and letting the rnaseq maintainers sort out the details. Though we do need to address the last point about conditional topic mappings.
I'm loving the principle:
The multi import thing didn't occur to me. Could we use a variable sent in via the
In this case I would manipulate the channel to what I wanted. If I had to use a topic I would use them at the last second. So again, as long as topics are optional I think everything can be handled reasonably well.
Presumably, if a
Even better, just tidy up the channels before making the topic:
I think my overall impression is topics are a nice sugar on top of existing channels, in which case most of the key logic should be in the channel manipulations. Topics are a way of turning a local channel into a global one and should do very little else.
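A sketch of that "channels do the work, topics come last" pattern, assuming the prototype's workflow-level `publish:` section (the workflow, channel names, and tidy-up step are all illustrative):

```nextflow
// Sketch: do the real logic with ordinary channel operators, then
// promote the tidied channel to a topic at the last second.
// Names and the filter step are hypothetical.
workflow {
    DESEQ2_QC()

    // hypothetical tidy-up: drop empty results before publishing
    ch_pdf = DESEQ2_QC.out.pdf.filter { pdf -> pdf.size() > 0 }

    publish:
    ch_pdf >> 'deseq2'
}
```

The topic mapping itself stays trivial; anything interesting happens in the channel manipulation above it.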
That sounds like a bug 😆
Notes on the latest update:
Overall, everything is much more concise and more in line with what many people have suggested: simply annotate the workflows with the publish paths. The output definition is no longer a comprehensive view of all outputs, but there is a degree of modularity, and you can be verbose in the output definition if you want to. @adamrtalbot thanks for your comments, it makes me feel more confident about the prototype. I think all of the remaining TODOs can be addressed by refactoring some dataflow logic; it can be handled by the rnaseq devs.
Really liking the way this is going now, it's going to be very tidy. Would it be feasible at some point to use some optional dynamism in the modules, to facilitate repeated usage?
Maybe in a future iteration. But related to this, we are interested in building on the concept of the samplesheet as a way to hold metadata for file collections in general, and it might be a better practice than trying to encode metadata in the filenames. For example, Paolo has proposed that we have a method in the output DSL to automatically publish an index file for a given "rule":

```nextflow
output {
    directory 'results'

    'star_salmon/intermeds/' {
        index 'index.csv'
    }
}
```
Of course you could also do this manually like in fetchngs, and I would like to add a stdlib function like
The redirect to

It seems like the best delineation for what goes in the top-level publish block vs the workflow publish sections is: the workflows define *what* is published (including conditional logic), while the top-level publish definition defines *how* things are published (mode, whether to overwrite, content type, tags, etc). This is also good for modularity.

Note that some subworkflows are now using params, which is an anti-pattern. For this I recommend passing those params as workflow inputs to keep things modular.
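Under that delineation, a minimal sketch might look like this (the specific workflow, channel, and settings are illustrative assumptions based on the options discussed in this PR):

```nextflow
// The workflow decides WHAT is published, including any conditional logic...
workflow RNASEQ {
    // ...
    publish:
    MULTIQC.out.report >> 'multiqc/'
}

// ...while the top-level output block decides HOW it is published.
output {
    directory params.outdir
    mode      params.publish_dir_mode   // e.g. copy vs symlink
}
```

Each subworkflow then stays reviewable on its own, and the publishing mechanics are configured once at the top.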
We have been trying to eliminate that when we see it.
```nextflow
output {
    directory params.outdir
    mode params.publish_dir_mode
}
```
Can this be set in `nextflow.config` @bentsherman?
Not currently, but it should be considered as a future improvement. If we don't need the target-specific config in the output block, it would be better IMO to make these config settings instead of pipeline code + params.
I would prefer this as a scope in the config rather than in the `main.nf`. I was also thinking to allow different modes for files matching different patterns or properties, e.g. large files are symlinked rather than copied, but this is an extra.
I agree that they should be config. It's much less flexible for the user if you have to define a param just to make it configurable. For now you can customize things like the mode for specific targets, though that is not as granular as what you describe.
I like how it simplifies things quite a bit. My main issue with this is that the publish path is spread across so many files, which probably means we shouldn't be using this in nf-core in any of the reusable/modular parts, especially if we can't turn off anything without having params for everything. Paths that need metadata like sample name are not here, but I guess they're directly accessible in the process?
```nextflow
publish:
pdf >> 'deseq2'
```
It was surprising to still see the publish directive in the process block. Why this way then, instead of:

```nextflow
output:
path "*.pdf", optional: true, emit: pdf, publish: 'deseq2'
```

i.e. making `publish` an extra option for the `path` type?
I guess I wanted the publish definition to be in one place so that it's easier to review. But also, I think the separate `publish:` section will work better with statically typed outputs.
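Put together, a self-contained sketch of a process with its publish targets collected in one section might look like this (based on the prototype's syntax; the second output and the script body are illustrative additions):

```nextflow
// Sketch: publish targets gathered in a single publish: section,
// separate from the output declarations, so they are easy to review.
// The versions output and script body are illustrative.
process DESEQ2_QC {
    output:
    path '*.pdf'       , optional: true, emit: pdf
    path 'versions.yml',                 emit: versions

    publish:
    pdf >> 'deseq2'

    script:
    """
    touch plot.pdf versions.yml
    """
}
```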
```diff
@@ -14,6 +14,9 @@ process CAT_FASTQ {
     tuple val(meta), path("*.merged.fastq.gz"), emit: reads
     path "versions.yml"                       , emit: versions

+    publish:
+    reads >> 'cat/fastq'
```
How would a pipeline developer turn this off? If I include this module from nf-core/modules in my own pipeline, but don't want to publish the output from this, what can I do?
You can redirect the output to null:

```nextflow
workflow {
    CAT_FASTQ()

    publish:
    CAT_FASTQ.out >> null
}
```

And Paolo just added an `enabled` option so that you can control it from the output block:

```nextflow
output {
    'cat/fastq' {
        enabled false
    }
}
```
```diff
@@ -62,6 +62,16 @@ workflow ALIGN_STAR {
     BAM_SORT_STATS_SAMTOOLS ( ch_orig_bam, fasta )
     ch_versions = ch_versions.mix(BAM_SORT_STATS_SAMTOOLS.out.versions)

+    publish:
+    ch_orig_bam >> (params.save_align_intermeds || params.save_umi_intermeds ? 'star_salmon/' : null)
```
This is fine for specific pipelines, but it makes module reusability across pipelines more cumbersome. If we need to have ternary operators in every nf-core module to control whether an output is published, this would make the pipeline schema potentially huge.
I agree, I cut some corners here by not passing the params as workflow inputs. In the final implementation I would do that, so that you can choose whether to expose it as a param in your own pipeline.
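A sketch of that refactor, passing the flag in via `take:` instead of reading params inside the subworkflow (the workflow name follows the ALIGN_STAR snippet above; the exact body and signature are assumptions):

```nextflow
// Sketch: the subworkflow takes the flag as an input, so each pipeline
// decides whether to back it with a param. Body is hypothetical.
workflow ALIGN_STAR {
    take:
    reads
    save_align_intermeds   // boolean supplied by the caller

    main:
    STAR_ALIGN ( reads )

    publish:
    STAR_ALIGN.out.bam >> (save_align_intermeds ? 'star_salmon/' : null)
}

// a caller might then write:
// ALIGN_STAR ( ch_reads, params.save_align_intermeds )
```

This keeps the conditional publish logic in the subworkflow while leaving the param surface up to the pipeline that imports it.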
```diff
@@ -26,6 +26,10 @@ workflow FASTQ_ALIGN_HISAT2 {
     BAM_SORT_STATS_SAMTOOLS ( HISAT2_ALIGN.out.bam, ch_fasta )
     ch_versions = ch_versions.mix(BAM_SORT_STATS_SAMTOOLS.out.versions)

+    publish:
+    HISAT2_ALIGN.out.bam >> (params.save_align_intermeds ? 'hisat2/' : null)
```
Is this going to cause problems/user confusion, having the possibility to publish in two places (workflow and module, e.g. an nf-core module supplies a path, and then the workflow supplies another path)?
The idea is that the workflow can override the publish targets defined by processes/subworkflows. And if you can find a convention to agree on, you might not need to override anything in the first place.

But I admit I'm not 100% sold on the `publish:` section for processes. It's more "modular", but if it ends up being overridden most of the time then it's not very useful.
```diff
@@ -782,6 +782,127 @@ workflow RNASEQ {
     ch_multiqc_report = MULTIQC.out.report
 }

+    publish:
+    QUANTIFY_STAR_SALMON.out.results >> 'star_salmon/'
```
I think this is going to be annoying for users/developers who don't know where a publish path is defined when trying to debug why their file is being written to another location.
The main principle is that callers overwrite callees. But also, I think we will extend the `inspect` command to show the resolved publish targets for the entire pipeline, so that it is more clear.
This PR adds a workflow output definition based on nextflow-io/nextflow#4784. I'm still working through the pipeline, but once I'm done, I will have completely replaced publishDir using the output DSL.
See also nf-core/fetchngs#275 for ongoing discussion