Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generating cytosine report #655

Closed
desmodus1984 opened this issue Feb 19, 2024 · 15 comments
Closed

Generating cytosine report #655

desmodus1984 opened this issue Feb 19, 2024 · 15 comments

Comments

@desmodus1984
Copy link

Hi,

I am studying differential methylation in bumblebees, and after using methylKit to find differentially methylated sites it seems better to find differentially methylated regions. I wanted to find DMR using the software swDMR (unless you suggest a different/better one) and it needs as input the cytosine report with strand and counts.

I am contacting you because for methylKit I have the bam files for methylation calling but I did pre-process them in a different way, not with Bismark. I trimmed the reads based on the suggested EM-seq parameters, and I checked for methylation-bias with Methyldackel with regular trimming parameters. After confirming best trimming parameters, I aligned the trimmed reads with bwa-meth, sam-to-bam conversion, and deduplication/elimination with Picard.

I wanted to ask if I can use the bam files that I have already generated to make the cytosine report or if I should re-processed my files all over again with Bismark.

Thank you again.

@FelixKrueger
Copy link
Owner

I am afraid I don't think the BAM files are compatible :(, and you would indeed have to re-start from the alignment step...

@desmodus1984
Copy link
Author

desmodus1984 commented Feb 21, 2024 via email

@FelixKrueger
Copy link
Owner

Hmm, this does indeed look like a Bowtie2 installation issue. If you don't specify the path yourself it will look in the PATH environment variable. It seems to use it from /home/applications/bowtie2/2.5.3/bowtie2-build.

You can find out the path with the command which bowtie2.

When you run /home/applications/bowtie2/2.5.3/bowtie2-build --help you should be able to see the help text. Otherwise, I would recommend you reintall Bowtie2 (or install Bismark using conda install bismark, which will also satisfy all other requirements)

@desmodus1984
Copy link
Author

desmodus1984 commented Feb 21, 2024 via email

@FelixKrueger
Copy link
Owner

The latest version on bioconda is 0.24.2, the instructions are here: https://bioconda.github.io/recipes/bismark/README.html

@desmodus1984
Copy link
Author

Hi,

I Installed Bismark in conda, and I am trying to run the workflow and I would appreciate if you could check my code, and tell me what I am writing wrong or missing.

I loaded the environment, and I am running the code in the folder, as an .sh file, and it is not creating any bam output.
Here is my code which worked with bs_seeker2.

I start with a loop to grab the names of the files, but then alignment doesn't happen.

`for i in *_R1_val_1.fq.gz
do
base=$(basename $i "_R1_val_1.fq.gz")

    bismark -q --parallel 8  -B ${base} \
    --genome /DataDrive/juaguila/Bvos-trimmed/BM-lot/ \
     -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz 

    deduplicate_bismark --bam ${base}.bam

    bismark_methylation_extractor -p --gzip --genome_folder /DataDrive/juaguila/Bvos-trimmed/BM-lot/ \
    --cytosine_report --multicore 8  ${base}.deduplicated.bam

    done

`

Which I run this:
nohup sh BS2.sh > test.bismark.log

The genome was created in the genome directory.

The log is as follows:

Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.4.1])
Output format is BAM (default)
Alignments will be written out in BAM format. Samtools found here: '/home/juaguila/miniconda3/envs/bismark/bin/samtools'
Reference genome folder provided is /DataDrive/juaguila/Bvos-trimmed/BM-lot/ (absolute path is '/DataDrive/juaguila/Bvos-trimmed/BM-lot/)'
FastQ format specified

Input files to be analysed (in current folder '/DataDrive/juaguila/Bvos-trimmed/BM-lot'):
V00001_R1_val_1.fq.gz
V00001_R2_val_2.fq.gz
Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)
Specifying --basename in conjuction with --multicore is currently not supported (but we are aiming to fix this soon). Please lose either --basename or --multicore to proceed

Neither -s (single-end) nor -p (paired-end) selected for deduplication. Trying to extract this information for each file separately from the @pg line of the SAM/BAM file
Processing single-end Bismark output file(s) (SAM format):
V00001.bam

If there are several alignments to a single position in the genome the first alignment will be chosen. Since the input files are not in any way sorted this is a near-enough random selection of reads.

Checking file >>V00001.bam<< for signs of file truncation...
Captured error message: '[E::hts_open_format] Failed to open file V00001.bam'

[ERROR] The file appears to be truncated, please ensure that there were no errors while copying the file!!! Exiting...

I would appreciate if you could tell me what parameter I am missing

@FelixKrueger
Copy link
Owner

The error here is this:

Specifying --basename in conjuction with --multicore is currently not supported (but we are aiming to fix this soon). Please lose either --basename or --multicore to proceed

So the Bismark could simply be:

bismark --parallel 8 --genome /DataDrive/juaguila/Bvos-trimmed/BM-lot/ -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz 

This is fine:

deduplicate_bismark ${base}.bam

The following command will need --bedGraph I believe:

bismark_methylation_extractor --gzip --bedGraph --genome_folder /DataDrive/juaguila/Bvos-trimmed/BM-lot/ --cytosine_report --multicore 8  ${base}.deduplicated.bam

@desmodus1984
Copy link
Author

Thanks,
I ran the code and I made one change, in addition to eliminating the base (-B) parameter, I changed multicore to parallel.
Could you explain what is the difference? I thought they are the same, but maybe I am wrong.

Here is the code

for i in *_R1_val_1.fq.gz
        do
        base=$(basename $i "_R1_val_1.fq.gz")
        bismark -q --parallel 8 \
        --genome /home/juaguila/Bismark/ \
         -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz
        deduplicate_bismark --bam ${base}.bam
        bismark_methylation_extractor -p --gzip --genome_folder /DataDrive/juaguila/Bvos-trimmed/BM-lot/ \
        --cytosine_report --multicore 8  ${base}.deduplicated.bam
        done

I have a question too.
I thought that the basename (${base}) would work well, and I would have a file called whatever is before *_R1_val_1.fq.gz like V00001.bam, as it did when I ran bs-seeker and cgmaptools. Contrarily, iInstead it created files with the entire XXXX_R1_val_1.fq.gz suffix

[juaguila@v001 BSMK-1]$ cd ../BSMK-0/
[juaguila@v001 BSMK-0]$ ls -ltr
total 103335076
-rw-r--r-- 1 juaguila hpc_jfierst 3999335475 Jun 8 2023 V00001_R2_val_2.fq.gz
-rw-r--r-- 1 juaguila hpc_jfierst 3675649845 Jun 8 2023 V00001_R1_val_1.fq.gz
-rw-r--r-- 1 juaguila hpc_jfierst 3436679091 Jun 8 2023 V00021_R2_val_2.fq.gz
-rw-r--r-- 1 juaguila hpc_jfierst 3104862452 Jun 8 2023 V00021_R1_val_1.fq.gz
-rw-r--r-- 1 juaguila hpc_jfierst 3079178155 Jun 8 2023 V00294_R2_val_2.fq.gz
-rw-r--r-- 1 juaguila hpc_jfierst 2823807574 Jun 8 2023 V00294_R1_val_1.fq.gz
-rw-r--r-- 1 juaguila hpc_jfierst 457 Mar 3 21:49 BSMK-test.sh
-rw-r--r-- 1 juaguila hpc_jfierst 7654589585 Mar 4 06:55 V00001_R1_val_1_bismark_bt2_pe.bam
-rw-r--r-- 1 juaguila hpc_jfierst 1867 Mar 4 06:55 V00001_R1_val_1_bismark_bt2_PE_report.txt
-rw-r--r-- 1 juaguila hpc_jfierst 6195867738 Mar 4 11:14 V00021_R1_val_1_bismark_bt2_pe.bam
-rw-r--r-- 1 juaguila hpc_jfierst 1868 Mar 4 11:14 V00021_R1_val_1_bismark_bt2_PE_report.txt

Did I write something wrong?
I thought that the code was specifying just the XXX part; that is why I used the -B base parameter to specify the file name, and then create the loop/sequence for the other codes. I thought that the alignment would write a bam file called XXX.bam and not a XXX_R1_val_1.fq.gz file.

@FelixKrueger
Copy link
Owner

The options --parallel/--multicore are identical.

I believe for the output files, only the ending is stripped off (e.g. .fq.gz). You can always rename files afterwards so they are to your liking, e.g.

rename s/_R1_val_1_bismark_/_/ *

The file XXX_R1_val_1.fq.gz is the input file (and is not created by Bismark...).

@desmodus1984
Copy link
Author

Good evening,

I followed your code for the methylation extractor, and I didn't get the bedgraph ouput nor the genome-wide methylation report files, which I require to do the detection of differentially methylated regions.

The code was:

for i in *_R1_val_1.fq.gz
        do
        base=$(basename $i "_R1_val_1.fq.gz")

        bismark -q --parallel 10 \
        --genome /home/juaguila/Bismark/ \
         -1 ${base}_R1_val_1.fq.gz -2 ${base}_R2_val_2.fq.gz

        deduplicate_bismark -p --bam ${base}_R1_val_1_bismark_bt2_pe.bam -o ${base}

        bismark_methylation_extractor -p --gzip --bedGraph --buffer_size 20G --genome_folder /home/juaguila/Bismark/ \
        --cytosine_report --parallel 10  ${base}.deduplicated.bam

done

The log tells this error:

Methylation information will now be written into a bedGraph and coverage file

Using the following files as Input:
/home/juaguila/BombusMethylSeq/Rec-5/BSMK-2/CpG_OT_V00747.deduplicated.txt.gz /home/juaguila/BombusMethylSeq/Rec-5/BSMK-2/CpG_OB_V00747.deduplicated.txt.gz

Writing bedGraph to file: V00747.deduplicated.bedGraph.gz
Also writing out a coverage file including counts methylated and unmethylated residues to file: V00747.deduplicated.bismark.cov.gz

Now writing methylation information for file >>CpG_OT_V00747.deduplicated.txt.gz<< to individual files for each chromosome
Failed to open filehandle: Too many open files at /home/juaguila/.conda/envs/bismark/bin/bismark2bedGraph line 279, line 276304.
Finished BedGraph conversion ...


And for the cytosine report.

Methylation information will now be written into a genome-wide cytosine report

Adding context-specific methylation summaries

Writing genome-wide cytosine report to: V00314.deduplicated.CpG_report.txt.gz <<<

Writing all cytosine context summary file to: V00314.deduplicated.cytosine_context_summary.txt <<<

No last chromosome was defined, something must have gone wrong while reading the data in (e.g. specified wrong file path for a gzipped coverage file?). Please check your command!


I also wanted to inquire about this error. While I looked at the log file, I saw this very often:
Processed 3000000 sequence pairs so far
Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2313:3495:29340_1:N:0:GTCGTTAC+GCTTAGCT NW_022883904.1 2
Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2265:16071:7435_1:N:0:GTCGTTAC+GCTTAGCT NW_022883908.1 29954
Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2214:25635:36370_1:N:0:GTCGTTAC+GCTTAGCT NW_022883462.1 19111
Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2233:28782:9596_1:N:0:GTCGTTAC+GCTTAGCT NW_022884241.1 19018
Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2238:29441:32095_1:N:0:GTCGTTAC+GCTTAGCT NW_022883649.1 13623
Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2305:16034:34053_1:N:0:GTCGTTAC+GCTTAGCT NW_022883202.1 20994
Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2234:16080:20760_1:N:0:GTCGTTAC+GCTTAGCT NW_022883657.1 29689
Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2234:16098:20823_1:N:0:GTCGTTAC+GCTTAGCT NW_022883657.1 29689
Chromosomal sequence could not be extracted for A00744:203:HVYMWDSXY:3:2307:14525:10019_1:N:0:GTCGTTAC+GCTTAGCT NW_022883927.1 29210

Is this something that I have to worry about? I indexed the reference genome the standard way, my methylation reads are PE-150.

Could you please tell what is wrong with my code that it did not generate the output files?

Lastly, while reading the log I found this line about directional library in the cytosine report:


Finished generating genome-wide cytosine report

Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.5.3])
Output format is BAM (default)
Alignments will be written out in BAM format. Samtools found here: '/home/juaguila/.conda/envs/bismark/bin/samtools'
Reference genome folder provided is /home/juaguila/Bismark/ (absolute path is '/home/juaguila/Bismark/)'
FastQ format specified

Input files to be analysed (in current folder '/home/juaguila/BombusMethylSeq/Rec-5/BSMK-2'):
V00750_R1_val_1.fq.gz
V00750_R2_val_2.fq.gz
Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)

The library is PE-150. At first I was confused because it was done RNA-seq, which the library itself was directional, but then I realized that in this case, it is not. Is there an argument to state the library as non-directional, or am I misunderstanding the log?

Thank you.

@FelixKrueger
Copy link
Owner

You need to use the option --scaffolds form the methylation extractor as you seem to have a genom with too many contigsscaffolds.

@desmodus1984
Copy link
Author

Can I rerun the methylation extractor code or should I eliminate any of the previous files (X.methXtractor.temp), or better move the deduplicated files to a new directory?

Thanks

@FelixKrueger
Copy link
Owner

I would probably remove the .temp files, but it might work even leaving them there...

@desmodus1984
Copy link
Author

I ran the methylator and it worked well now.
By curiosity, I checked the MBias files, which I expected to be "good" or "fine" as the ones from methyldackel and I got surprised that they weren't as flat as those from methyldackel.
The libraries were made with the enzymatic-methyl-kit from NEB. The R1 look okay, but there is a huge drop at the end, while the R2 have the drop in methylation level, until, similarly, reaching the end of the read, and the huge drop out.
V00001 deduplicated M-bias_R2
V00001 deduplicated M-bias_R1
V00315 deduplicated M-bias_R2
V00315 deduplicated M-bias_R1

Do you think that I should add any extra parameters to the methylation extractor?
Do you recommend trimming bases from the 3'-end?
R1 seems to need more trimming than the R2.

Thanks and sorry for the many questions.

@FelixKrueger
Copy link
Owner

A couple of comments:

  • it might be easier to run bismark2report once the run is complete, as this give you an interactive HTML version of all reports, including the M-bias plots
  • as far as I can see, almost all reads have <1% of total methylation in any context, is that correct? This is super constant across all reads, and there are neither spikes nor drop-offs
  • what you are probably referring to is the number of calls per position for each context. For R1, this is fairly constant throughout. For R2 you see a steady decline for calls in all contexts, the reason for which is the overlap detection and removal (as soon as R2 starts overlapping R1, the extraction stops and proceeds to the next read
  • steep 'drops' in the number of calls at the very end are typically caused by adapter clipping and/or quality trimming. Both phenomena are expected, and there is nothing you need to do about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants