Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

help combining bismark pipeline with methylated UMIs #680

Open
shawpa opened this issue Jun 25, 2024 · 8 comments
Open

help combining bismark pipeline with methylated UMIs #680

shawpa opened this issue Jun 25, 2024 · 8 comments

Comments

@shawpa
Copy link

shawpa commented Jun 25, 2024

I have a whole genome em-seq methylation library with dual indexes as well as methylated UMI's at the beginning of each R1 and R2. These are from Twist and were used to make EM-seq libraries. I am not sure how I need to modify my current bismark pipeline to use the --barcode option when I do the deduplication. I saw in a previous thread that you have a program called UMI-Grinder but I'm not sure where it fits (if at all) in the pipeline. Normally I do the following:

  1. Trimming with trim galore
  2. PE Alignment with bismark bowtie 2 option
  3. deduplication
  4. methylation calling

The UMIs are going to be the first 5 bases of each read1 and read2. The next 2 bases are supposed to be skipped and then my genomic sequencing should begin. I looked at the UMI-Grinder documention but the input seems to be a bam. Not sure how I get to a bam from a fastq without aligning and I can't align because the UMI's at the beginning of the read. Thank you for your help.

@FelixKrueger
Copy link
Owner

Hi there, apologies for the slow response.

UMIbam is indeed a tool for UMI-aware deduplication, kind of a slightly more capable version of deduplicate_bismark.

What you need here is a way of transferring the UMI-sequence from the beginning of both reads to the readIDs, and then also remove the 2 additional bp, Trim Galore already has dedicated methods of doing this for the the --clock or --implicon modes:

--implicon              This is a special mode of operation for paired-end data, such as required for the IMPLICON method, where a UMI sequence
                        is getting transferred from the start of Read 2 to the readID of both reads. Following this, Trim Galore will exit.

                        In it's current implementation, the UMI carrying reads come in the following format:

                        Read 1  5' FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF 3'
                        Read 2  3' UUUUUUUUFFFFFFFFFFFFFFFFFFFFFFFFFFFF 5'

                        Where UUUUUUUU is a random 8-mer unique molecular identifier (UMI) and FFFFFFF... is the actual fragment to be
                        sequenced. The UMI of Read 2 (R2) is written into the read ID of both reads and removed from the actual sequence.
                        Here is an example:

                        R1: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT
                            ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
                        R2: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT
                            CAATTTTGCAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA
                        
                        After --implicon trimming:
                        R1: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT:CAATTTTG
                            ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
                        R2: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT:CAATTTTG
                                    CAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA

It should be easy enough to implement this additional trimming, you could for example use the code for --implicon and modify it slightly to meet your needs. If this is a method is getting used more widely it might make sense to add it as an additional mode of Specialised Trimming.

@shawpa
Copy link
Author

shawpa commented Jun 28, 2024

It seems like the --clock is most similar to what I am doing because I have UMIs on both ends. I have no "fixed" bases so that part of the code could be removed. Hopefully I can find someone to implement this type of code change. Do the following steps sound right:

  1. Remove adapter sequences in case any are in my reads.
  2. Run modified --clock script to append UMI's to R1 and R2.
  3. Hard clip at least 2 bases on the 3' and 5' ends.
  4. Run trimgalore using default settings for length and quality trimming. Not sure if this step would be necessary.

Am I missing anything important on the trimming step?

For deduplication, I would use UMIbam instead of deduplicate_bismark in my pipeline.

@FelixKrueger
Copy link
Owner

I think you can start with Steps 2+3 (this can be done in one go), and then proceed with Step 4 (which will remove adapter and quality). If you add both barcodes from R1 and R2 as a string after a colon together, deduplicate_bismark should work. Something like this:

@somereadID:CGATG_CAATT

If you can't find someone to apply a local fix I can probably take a look for you.

@shawpa
Copy link
Author

shawpa commented Jul 18, 2024

I just got my data and I want to confirm how the RID would need to be formatted so that the current deduplicate_bismark would work properly. An example of the R1 and R2 are below. I cut off some of the read for clarity.

@LH00512:33:22N7MVLT3:4:1101:1564:1080 1:N:0:GATAGCCA+CTGTGTTG
GNATAGTAATTTGGTATTTATATATTGTGTTTGTTAAATTTAGTGATGTATATAGGTTTAAATATTGTTTTTTATGTGTTTTAATTTTTTATAGTTTTTTAAT

@LH00512:33:22N7MVLT3:4:1101:1564:1080 2:N:0:GATAGCCA+CTGTGTTG
CNCTGTACTCCACATTTTAAAATATATAAAACCACAAAAACATCATCTTCTATATCTTATTCTTACTAACTATATAACAATAACATTCTTAAATAAACT

So the UMI for R1 in this case is GNATA and the UMI for R2 is CNCTG. Not a great example because of the N's. You had previously said it needed to be formatted as

@somereadID:CGATG_CAATT

So does this mean my RIDs should look like the following:

@LH00512:33:22N7MVLT3:4:1101:1564:1080 1:N:0:GATAGCCA+CTGTGTTG:GNATA_CNCTG
     GTAATTTGGTATTTATATATTGTGTTTGTTAAATTTAGTGATGTATATAGGTTTAAATATTGTTTTTTATGTGTTTTAATTTTTTATAGTTTTTTAAT

@LH00512:33:22N7MVLT3:4:1101:1564:1080 2:N:0:GATAGCCA+CTGTGTTG:GNATA_CNCTG
     TACTCCACATTTTAAAATATATAAAACCACAAAAACATCATCTTCTATATCTTATTCTTACTAACTATATAACAATAACATTCTTAAATAAACT

Or do I also need to remove the P7 and P5 barcodes from the RID?

@FelixKrueger
Copy link
Owner

Yes that's correct, it will take take the last element after the last :, so here GNATA_CNCTG and use this in addition to the chromosome, start, end and orientation of a read for the deduplication process.

@zimzims
Copy link

zimzims commented Jul 19, 2024

Hi Felix! I will be the one helping @shawpa with this task. Right now, the functionality we are looking to implement is to remove the first N bases of R1 and R2 and then appending those to each of the read ids as Annie showed above. We are also looking to be able to receive either a list or file of "valid" UMIs and if either R1 or R2 has a UMI that is not valid, throw away the pair. I've used cutadapt/trim_galore a bunch, but never added functionality. I would love to add the functionality and make it open source, what is the best way of going about this?

@FelixKrueger
Copy link
Owner

Hi Zach,

This is a very interesting idea, incidentally we have talked about Twist libraries and also libraries from Watchmaker Genomics that will apparently have a similar in-line barcoding strategy for both R1 and R2. As it turns out, it seem I may have a real interest in making this become reality myself :) My first instinct would probably be to have such a 5' inline barcode transfer function (for a variable number of nucleotides) set up as a stand-alone function (for both SE and PE), and potentially one that can be called as part of the trimming process. The latter would require a little more thinking though....

Adding functionality to validate UMIs against a UMI-allowed or whitelist seems like a rather particular, application-specific step and I wonder if this isn't best handled afterwards using a post-processing script of some sort?

I could probably have a go at UMI/inline-barcode transfer function as a stand-alone process (similar to e.g. --clock or --implicon). Should I have a go?

@zimzims
Copy link

zimzims commented Jul 22, 2024

Hey Felix, sorry for the late response. If you are up for it, then by all means give it a go! I agree about validating UMI as a post-processing script. We will handle that separately. If there is anything I can do to help, please let me know. Thank you for help

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants