processing large numbers of files with Snakemake and gridss? #354

Closed · DC23 opened this issue May 20, 2020 · 6 comments

DC23 commented May 20, 2020

I need to process approximately 850 large files with gridss. I have a Snakemake workflow that groups the files into batches, runs pre-processing on each file in parallel on a cluster, and then runs assembly on each batch of files. Multiple assemble processes run in parallel as batch jobs on SLURM. I then intend to attempt a call operation on the results of all parallel assembly runs.

My question is: do the parallel gridss assemble processes need separate working directories to avoid conflicts, or can they all run in the same working directory? There will be many assemble processes running concurrently on the cluster as the workflow moves through the entire set of input files.

Currently, each batch of input files has its own working directory, just in case. But if this is required, how can I run the call step on everything? And if separate working directories for assemble are required, can I just move everything into a common directory prior to call?


DC23 commented May 21, 2020

--jobindex and --jobnodes look close to what I want, but they seem to be for splitting assembly of the same set of input files across multiple compute nodes. What I want to do is assemble different sets of files with separate assembly processes (each running on a single compute node) and then gather the results together with a final call process. I shouldn't have filename conflicts when using the same working directory, as the assembly output files have the batch number encoded in their names.

Any advice is appreciated.

d-cameron commented

> What I want to do is assemble different sets of files with separate assembly processes (each running on a single compute node) and then gather the results together with a final call process.

This won't work. The sample tracking and breakdown fields in the assembly use the ordinals of the corresponding inputs/labels, and these ordinals match the output VCF ordinals.

Assuming your overall workflow looks something like the following:

assemble input1.bam input2.bam input3.bam > ass123.bam
assemble input4.bam input5.bam input6.bam > ass456.bam
samtools merge ass.bam ass123.bam ass456.bam
call_variants ass.bam input1.bam input2.bam input3.bam input4.bam input5.bam input6.bam > calls.vcf

Then the VCF will have no assembly support for inputs 4/5/6, since support is allocated by ordinal: input4's assembly support will be allocated to input1, input5's to input2, and input6's to input3.

The workaround for this is to introduce empty placeholder bams when you do each assembly step. Your pipeline needs to look like:

assemble input1.bam input2.bam input3.bam empty4.bam empty5.bam empty6.bam > ass123.bam
assemble empty1.bam empty2.bam empty3.bam input4.bam input5.bam input6.bam > ass456.bam
samtools merge ass.bam ass123.bam ass456.bam
call_variants ass.bam input1.bam input2.bam input3.bam input4.bam input5.bam input6.bam > calls.vcf

This will allow you to batch your assemblies (related samples in the same batch if possible) without breaking the assembly-to-input mapping.
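
A placeholder can be a header-only bam with zero reads. A minimal sketch of creating one (this assumes samtools is on your PATH; the gridss.sh preprocess invocation and the -r/-w flags are my reading of the driver script, so check gridss.sh --help, and ref.fa is a placeholder for your reference):

# same header as a real input, but zero reads
samtools view -Hb input1.bam > empty.bam
# pre-process it once so its .gridss.working files exist
gridss.sh --steps preprocess -r ref.fa -w . empty.bam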

Note that joint assembly using --jobindex and --jobnodes is the preferred approach, but the assembly does not scale to very high sequencing depths (I've only tested GRIDSS up to ~1000x aggregate coverage), so at some point assembly will indeed need to be batched as you are proposing.

d-cameron commented May 23, 2020

> But if this is required, how can I run the call step on everything?

The key here is that the call step requires a single assembly bam since GRIDSS technically only supports joint assembly. As I posted above, you can simulate/hack this with empty bam files.

The safest approach is to perform each assembly in its own working directory and symlink either to the real input or to the empty input (depending on which files you're processing). This way every assembly takes input1.bam input2.bam input3.bam input4.bam input5.bam input6.bam as parameters; it's just that some of those inputs are empty. For the real files, the symlink can point directly at the .gridss.working directory, but for the empty placeholders you'll need to symlink each file inside the working directory individually, since the filenames differ.

For example, something like the following would set up the first batch:

cd $WORKING_DIR
mkdir batch1_working
cd batch1_working
# real files: symlink the whole .gridss.working directory into the batch directory
for f in input1.bam input2.bam input3.bam; do
  ln -s ../$f.gridss.working .
done
# empty files: symlink each working file individually under the real input's name
for f in input4.bam input5.bam input6.bam; do
  # assumes empty.bam exists; make it with: samtools view -Hb input1.bam > empty.bam
  # also need to run pre-process on empty.bam so the working files exist
  mkdir $f.gridss.working
  for suffix in .cigar_metrics .coverage.blacklist.bed .idsv_metrics .insert_size_metrics .mapq_metrics .sv.bam .sv.bam.bai .sv_metrics .tag_metrics; do
    # two levels up: the link target is resolved relative to $f.gridss.working/
    ln -s ../../empty.bam.gridss.working/empty.bam${suffix} $f.gridss.working/$f${suffix}
  done
done
# and run the assembly with batch1_working as your working directory
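
A sketch of that final assembly invocation, run from $WORKING_DIR (the -r/-w/-a flag names match recent gridss.sh releases but should be checked against gridss.sh --help; ref.fa is a placeholder for your reference genome):

cd $WORKING_DIR
# inputs 4-6 resolve to the empty placeholder working files set up above
gridss.sh --steps assemble -r ref.fa -w batch1_working -a ass123.bam \
    input1.bam input2.bam input3.bam input4.bam input5.bam input6.bam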


DC23 commented May 23, 2020

Thanks for the response. And please forgive my ignorance: I am a software engineer assisting with the workflow, but I have no prior experience at all with this field.

For joint assembly using --jobindex and --jobnodes, does that run as one large distributed job (e.g. with MPI), or does each node run as a separate process (using the index and node count to adjust the ordinal numbering)? Either way, I have access to a 230 node cluster.

I have around 850 input BAM files, but how does that number of files relate to the "~1000x aggregate coverage" you mention? Do you have any tips on how to estimate or measure aggregate coverage based on my input files so I can guide the batch size? Based on your comments, I think my best approach is to use multinode assembly on batches that are as large as possible, and combine that with the empty BAM trick to split into multiple batches. As for pre-processing, hopefully I can preprocess the empty bam just once and then symlink to the output files as required.

Also, does gridss do the samtools merge step when passing multiple assembly files to gridss.sh --steps call ... or should I do that merge manually before passing the single merged file to the call step?

Finally, I have done some scaling analysis of pre-processing and assembly, but only on 2 input files and only testing variable thread count on a single node. This was to help guide the resource requests on the cluster. I can share the results with you if they are of any interest, but I am sure you have examined this already.

d-cameron commented

The assembly algorithm is single-threaded. By default, the assembly process splits the genome into 10 megabase chunks, and multiple assemblies are done in parallel on different regions of the genome. The recommended settings are 8 threads and a 25-31Gb heap (i.e. just under 32Gb, so Java compressed OOPs can be used), so up to 8 assembly chunks will be processed in parallel. Each chunk gets written to its own output file (in the assembly working directory), and these are then concatenated together at the end to generate the assembly.bam.

> For joint assembly using --jobindex and --jobnodes, does that run as one large distributed job (e.g. with MPI), or does each node run as a separate process (using the index and node count to adjust the ordinal numbering)?

These are all independent jobs. Each job processes the chunks for which chunknumber % jobnodes == jobindex (i.e. a simple round-robin allocation of units of work). Once they're all complete, you actually need to run one more assembly job without specifying --jobindex or --jobnodes to do the gather step. This last assembly job just concatenates the chunk bams into the actual output assembly bam.

All of this is handled internally by GRIDSS (either in the driver script, or the java code).
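
Since you're on SLURM, a sketch of how the scatter could be wired up as an array job (my own mapping, not an official GRIDSS recipe; the --jobnodes/--jobindex usage follows the description above, and the remaining flags should be verified against gridss.sh --help):

#!/bin/bash
#SBATCH --array=0-9          # 10 scatter jobs
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
# each array task assembles the chunks where chunknumber % 10 == task index
gridss.sh --steps assemble -t 8 --jvmheap 31g \
    --jobnodes 10 --jobindex $SLURM_ARRAY_TASK_ID \
    -r ref.fa -w batch1_working -a ass123.bam \
    input1.bam input2.bam input3.bam input4.bam input5.bam input6.bam

The gather step would then be a separate dependent job (e.g. submitted with --dependency=afterok) running the same command without --jobindex/--jobnodes.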

> Also, does gridss do the samtools merge step when passing multiple assembly files to gridss.sh --steps call ... or should I do that merge manually before passing the single merged file to the call step?

Technically speaking, GRIDSS does not support multiple assembly files at all. The batching with empty files is a work-around. In this case, yes, you'll need to do the samtools merge yourself since GRIDSS will only recognise a single joint all-sample assembly bam.
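
In other words, something like the following (names from the earlier example; indexing the merged bam is my assumption of what downstream steps expect):

samtools merge ass.bam ass123.bam ass456.bam
samtools index ass.bam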

> As for pre-processing, hopefully I can preprocess the empty bam just once and then symlink to the output files as required.

Correct, just symlink the files with the appropriate names as I outlined in my previous example.

> I have around 850 input BAM files, but how does that number of files relate to the "~1000x aggregate coverage" you mention?

That depends on your sequencing depth (your biologist should be able to tell you the depth they sequenced to). If you have done GRIDSS preprocessing for a file, then you can find the read length and read count in the .idsv_metrics file in the GRIDSS working directory for that file. sequencing depth = READS * READ_LENGTH / genome_length. Genome size depends entirely on what species you sequenced: human is ~3Gbase, mouse ~2.5Gbase, but there's a huge range in sizes; viruses are ~10Kbase, E. coli ~3Mbase, lungfish ~130Gbase.

Note that this assumes whole genome sequencing. If you did targeted or exome sequencing, your denominator is the size of the capture region, not the size of the entire genome.
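
As a purely illustrative example with made-up numbers: a human whole-genome sample at 30x has roughly 600 million 150bp reads (600e6 * 150 / 3e9 = 30x), so 850 such samples would total ~25,500x aggregate coverage, and batches of roughly 30 samples would keep each joint assembly near the ~1000x that has been tested.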


DC23 commented May 24, 2020

Finally someone is talking my language :)

Thanks, I have my work cut out tomorrow adjusting the workflow and running tests!

For assembly, my black-box scaling tests showed low efficiency, but still usable reductions in processing time, for up to 20 threads, though my heap size was larger. I will test again with your suggested settings.
