Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WGBS sample: alignment - sorting - de-duplication - extraction = Fatal Error Read1 != Read2 #360

Closed
naumenko-sa opened this issue Jul 15, 2020 · 8 comments

Comments

@naumenko-sa
Copy link

Hello @FelixKrueger !

Thanks for maintaining the Bismark pipeline!

We are running WGBS analysis using Bismark wrapped in bcbio
using the following steps:

1. bismark --bowtie2 ...
2. samtools sort ..
3. deduplicate_bismark ...
4. bismark_methylation_extractor ...

de-duplication is not recommended for RRBS samples, but our samples are WGBS, so we need de-duplication, right?

For some samples step 4 fails with:

FATAL ERROR:] The IDs of Read 1 and Read 2 are not the same.
This might be the result of sorting the BAM files by chromosomal position or 
merging several files with Samtools sort, 
and this is not compatible with correct methylation extraction. 
Please use an unsorted file instead or sort the file by name 
using the command 'samtools sort -n'. Paired-end files may 
be merged properly (without risking this error) 
using either 'samtools merge -n' or 'samtools cat'.

Step 4 requires an unsorted bam file, but step 3 requires a sorted one.

Could you please comment on how should we organize the workflow?

  1. Is it appropriate to finish the failed step 4 in a single end mode (-s)
  2. should we skip de-duplication for WGBS?
  3. should we re-sort the de-deduplicated bam file with samtools sort -n before step4?

Thanks!
Sergey

@FelixKrueger
Copy link
Owner

Hi Sergey,

The entire Bismark pipeline does not require any sorting by chromosomal position, in fact deduplicate_bismark and bismark_methylation_extractor do specifically NOT work if you sort the files in between steps. Both Bismark itself, and deduplicate_bismark write out read pairs so that Read1 and Read2 follow each other on consecutive lines, so you really shouldn't perform any sorting if you can at all avoid it (or just do the sorting the BAM file once deduplication and methylation extraction have completed).

But yes, for WGBS we would recommend definitely recommend de-duplication, and you should proceed in paired-end mode as per usual.

Do let me know if something was unclear.
best wishes, Felix

@naumenko-sa
Copy link
Author

Hi Felix!

Thanks for such a prompt response!

Sorry, I was not precise enough - after double-checking I see that we are not sorting the bam by chromosomal position,
the issue is more subtle. We are running these steps:

  1. bismark --bowtie2 => sample_bismark_bt2_pe.bam
  2. picard AddOrReplaceReadGroups => sample_fixrgs.bam
  3. picard ReorderSam => sample_fixrgs_reoder.bam - it just reoders by chromosome name, see picard doc
  4. samtools sort -n => sample.bam
  5. deduplicate_bismark sample.bam => sample.deduplicated.bam
  6. bismark_methylation_extractor sample.deduplicated.bam

For some samples, all 6 steps run just fine.
For some other samples, step6 fails with aforementioned error.

I ran the extractor step for each bam file from steps 1-4:
1 - works
2 - fails
3 - fails
4 - fails

Then I ran the deduplication step for each bam file:
1 - works
2 - fails
3 - fails
4 - works

Thus,

  • steps 2-3 make the bam file corrupt for deduplication and extractor (at least in some samples)
  • step 4 fixes the bam file for deduplication
  • but not for the extractor!

I'm just removing steps2-3 from the workflow and leaving this description here just in case somebody hits the same issue.
Thanks again for the clarification - it helped a lot!

Sergey

@naumenko-sa
Copy link
Author

naumenko-sa commented Jul 19, 2020

Hi Felix!

Removing bam processing steps helped to finish many samples.
However, some samples are still failing the extractor step with

Processed lines: 217500000
Processed lines: 218000000
[FATAL ERROR:] The IDs of Read 1
[READ_ID_1] and Read 2 [READ_ID_2] are not the same.
This might be the result of sorting the BAM files by chromosomal position or merging several files with Samtools sort, and this is not compatible with correct methylation extraction. Please use an unsorted file instead or sort the file by name using the command 'samtools sort -n'. Paired-end files may be merged properly (without risking this error) using either 'samtools merge -n' or 'samtools cat'.
' returned non-zero exit status 29.

I've extracted SAM records with READ_ID_1 and READ_ID_2 from the bismark output bam (sample_bismark_bt2_PE.bam):

READ_ID_1 83      chr6    143963338       40      121M    =       143963205       -254    ACAATACCCCTTCAAAATCACATAATAAAACTTCAAACATTAACCCACAAACTTACTCCTCCTCCAACACTACCTATTCTAACTAATAACACCACTATACAATAACCATCAAAACTCCCGA    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,:,FFFFFFFF:F:FFFFFFFFFFFFF,F       NM:i:26 MD:Z:0G2G1G9G0G6G0G1G0G0G0G8G10G4G11G4G3G5G5G0G4G4G5G6G1G6G0    XM:Z:h..x.h.........hh......hh.hhhh........z..........x....h...........x....x...x.....h.....hh....z....h.....h......x.h.....Zx       XR:Z:CT XG:Z:GA
READ_ID_2  83      chr17   79248631        42      113M    =       79248493        -251    TATACTTAATATCCCCCACAATAACAACCAATATACTTTACACATCTCCTCCAAAAAATTCTCACCCCACAACCCTATCAAAAAATACTTATATCCCTAAACACAAATAAAAA    FFFFFFFFF:::FFFFFFFF::FFFFFFF,:,FFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFF:F,FFFFFFF,FFFFFFF:FFFFFFFFF       NM:i:32 MD:Z:1G1G3G0G1G9G1G2G0G2G0G0A0G1G4G13G1G0G22G0G0G0G0G0G1G3G8G0G4G0G3G0G1        XM:Z:.x.h...hh.h.........x.h..zx..zx.h.h....h.............x.hh......................zxhhhh.h...h........hh....xh...hh.       XR:Z:CT XG:Z:GA
READ_ID_2   163     chr17   79248493        42      121M    =       79248631        251     AAAAAAAAACTCAAACCTAAATAAATACAACATCATTTCCACATATTAAACTACACTCCCCAACCTAAACTAATTTCCCTAAAAAATAACAATTTTTAAAACAAAACCAAAAATAACTCCC    FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFFFFFFFFF:FFFFFFFFF,FFFFFFFFF:FFFF:FFFFF       NM:i:26 MD:Z:0G0G0G1G0G1G0G9G3G3G4G10G1G7G15G3G7G0G0G1G0G4G6G12G1G2G5   XM:Z:zxh.hh.hh.........x...h...h....z..........z.h.......x...............h...h.......xhh.hh....z......h............h.h..h.....       XR:Z:GA XG:Z:GA

So READ_ID_1 is unpaired but has = in the RNEXT field, so the read gets paired with the next one, which has a different ID.
Probably, it is an error from bowtie2 rather than from bismark.
How would you suggest us to process these files? fixing read mates with picard?

Another issues - some samples fail deduplication step with:

skipping header line:   @PG     ID:Bismark      VN:v0.22.1      CL:"bismarUse of uninitialized value $line1 in concatenation (.) or string at /bcbio/tools/bin/deduplicate_bismark line 291, <IN> line 196061329.
Failed to determine read and genome conversion from line:
k --bowtie2 \
--temp_dir /scratch/bcbio/work/bcbiotx/tmp7_jd3l2_ --gzip --parallel 16 -o /bcbio/work/bcbiotx/tmp7_jd3l2_ \
--unmapped /bcbio/genomes/Hsapiens/hg38/bismark/ \
-1 /scratch/bcbio/work/trimmed/NA12878_6/val_1.fq.gz \
-2 /scratch/bcbio/work/trimmed/NA12878_6/val_2.fq.gz -p 2"
' returned non-zero exit status 29.

How could we debug this issue?

Thanks!
Sergey

@naumenko-sa naumenko-sa reopened this Jul 19, 2020
@FelixKrueger
Copy link
Owner

Hi Sergey,

This doesn't sound right. Without having seen it with my own eyes, I would say Bismark would never ever produce a paired-end output file where Read 1 and Read 2 do not follow each other, like here:

READ_ID_1 83      chr6    143963338       40      121M    =       143963205       -254    ACAATACCCCTTCAAAATCACATAATAAAACTTCAAACATTAACCCACAAACTTACTCCTCCTCCAACACTACCTATTCTAACTAATAACACCACTATACAATAACCATCAAAACTCCCGA    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,:,FFFFFFFF:F:FFFFFFFFFFFFF,F       NM:i:26 MD:Z:0G2G1G9G0G6G0G1G0G0G0G8G10G4G11G4G3G5G5G0G4G4G5G6G1G6G0    XM:Z:h..x.h.........hh......hh.hhhh........z..........x....h...........x....x...x.....h.....hh....z....h.....h......x.h.....Zx       XR:Z:CT XG:Z:GA
READ_ID_2  83      chr17   79248631        42      113M    =       79248493        -251    TATACTTAATATCCCCCACAATAACAACCAATATACTTTACACATCTCCTCCAAAAAATTCTCACCCCACAACCCTATCAAAAAATACTTATATCCCTAAACACAAATAAAAA    FFFFFFFFF:::FFFFFFFF::FFFFFFF,:,FFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFF:F,FFFFFFF,FFFFFFF:FFFFFFFFF       NM:i:32 MD:Z:1G1G3G0G1G9G1G2G0G2G0G0A0G1G4G13G1G0G22G0G0G0G0G0G1G3G8G0G4G0G3G0G1        XM:Z:.x.h...hh.h.........x.h..zx..zx.h.h....h.............x.hh......................zxhhhh.h...h........hh....xh...hh.       XR:Z:CT XG:Z:GA

Thus I also don't really have a 'best solution' for this, as it should never come up in the first place....

I am afraid I also don't have an out of the box answer to the last issue without taking a look at the actual file in question. There seem to be some empty lines in the file, and some programs seems to be reporting a non-zero exit status (which do not come from deduplicate_bismark itself...

A few considerations:

  • Is there a chance that there were still some other programs or processes that had been (maybe inadvertently or accidentally) at some point, that could the reason for these issues? Again, I have never ever seen error message like that, and we run a fair few thousand samples on a regular basis...
  • I saw from the error message that you seem to be requesting a LOT of resources for your run:
--parallel 16 -p 2"

Let's we assume that a single thread of Bismark uses say 12GB of RAM (not exactly sure which human genome you are using but 9-12GB is the minimum for directional alignments), and 3-5 cores of CPU. With -p 2 you are increasing the cores to 5-7 per thread, and then parallelise this 16 times, therefore requesting at least 192GB of RAM and some 80 cores of CPU. Are you sure you have these resources available on a single node on. your system? I am just not sure if such massive processes, with lots of temp directories (probably over the network?) could cause I/O issues coming from your OS...

  • I think if the temporary directory is the same as the output directory, you don't need to specify it explicitly
  • You don't seem to be using the latest version of Bismark, although I don't think that there should be anything in the latest versions that should affect the issues you re reporting (https://github.com/FelixKrueger/Bismark/releases)

I think my suggestion would probably be to upgrade Bismark to the latest version, and be a bit more humble with the command you are using, e.g.:

bismark --parallel 2 -o /bcbio/work/bcbiotx/tmp7_jd3l2 --genome /bcbio/genomes/Hsapiens/hg38/bismark/ -1 /scratch/bcbio/work/trimmed/NA12878_6/val_1.fq.gz -2 /scratch/bcbio/work/trimmed/NA12878_6/val_2.fq.gz

If you are still getting such weird looking BAM files we could arrange a file transfer and I could take a look myself...?

Cheers, Felix

@naumenko-sa
Copy link
Author

Thanks, Felix!

I've updated trim-galore and bismark to the latest versions in bcbio and re-running the failed samples with fewer threads.
I will share a bam file with you if they don't pass.

Bcbio users reported that our bismark wrapper is very slow,
and I tried running a black-box benchmark varying bismark_threads and bowtie_threads.

The results are here: https://bcbio-nextgen.readthedocs.io/en/latest/contents/methylation.html#benchmarking
I submitted 50/100/200G RAM jobs with 16/32/64 cores to the cluster.
I assumed that I needed total_cores = bismark_threads (--parallel) x bowtie_threads (-p).

Of course, such a test is not a perfect measurement, it depends on the current cluster load/architecture.
But still, 16/2 or 8/2 combinations were 5X-10X faster than others.

And, with that 16/2/100G combination, I was able to process >50% of the samples in 2-3 days.
Even this is a bit slow compared to the hardware-accelerated Illumina Dragen methylation pipeline.

Buy the way, Dragen's page references Bismark:
https://www.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-methylation-pipeline.html

Are you aware of any publicly available benchmarks comparing Dragen's and Bismark's output? Is it a hardware-accelerated port of a specific version of Bismark code, or a pipeline developed along the lines established in Bismark? Sorry - those are not exactly the questions to address in the bismark GitHub, just wondering if the answer is known to the methylation research community - it will save us the effort to find the answer the hard way.

Thanks!
Sergey

@FelixKrueger
Copy link
Owner

FelixKrueger commented Jul 22, 2020

Excellent, hope it'll work this time. Regarding Dragen's output: I am afraid I am not exactly aware of what they have done, the last meeting we had with Illumina was many years ago...

@naumenko-sa
Copy link
Author

naumenko-sa commented Jul 25, 2020

I see, thanks, Felix!

Updating bismark and trim-galore and running with bismark_threads=4 and bowtie_threads=2 helped to finish all failed samples.
So as I rule of thumb we use 4/2/100G RAM, which gives 4 bismark instances x 5-7 cores = 20-32 cores.

Typically, we run on 48Cores/ 192G RAM instances, so even 8/4/192G makes sens.

@newwbeee
Copy link

I had the same problem, and none of the recommendations worked.
Finally I realized deduplicate_bismark treated some samples, even some parts of the same sample as single-end instead of paired-end. samtools sort -n also didn't help.
Eventually, I realized that all I had to do is using deduplicate_bismark -p /my/file.. It fixed all the problems. I could use 8 core or 10 core during processing with no problem.
Now, by default I started adding -s for single end -p for paired end at deduplication step

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants