Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loci with multiple motifs at same position #38

Open
christopher-schroeder opened this issue May 28, 2024 · 4 comments
Open

Loci with multiple motifs at same position #38

christopher-schroeder opened this issue May 28, 2024 · 4 comments

Comments

@christopher-schroeder
Copy link

christopher-schroeder commented May 28, 2024

I have called a bunch of samples with straglr and merged the (chrom, start, end, motif) beds into one unified bed. I would like to use this merged bed to genotype each sample again. Sometimes different motifs occur at the same position, but straglr only calles one of them multiple Times. Example:

Here is a position of the merged bed file with multiple motifs at the same position.

$ cat results/str_candidates/merged.bed | grep 56828456
Y	56828456	56836900	CATTC
Y	56828456	56836900	TCCAT

I use this bed for genotyping.
What I see:

$ cat results/str/818-13.bed | grep 56828456
Y	56828456	56836900	TCCAT	9286.5	1857.3	55	2907.0	581.4	13
Y	56828456	56836900	TCCAT	9286.5	1857.3	55	2907.0	581.4	13

What I am expecting:

$ cat results/str/818-13.bed | grep 56828456
Y	56828456	56836900	CATTC	...	...	...	...	...	...
Y	56828456	56836900	TCCAT	9286.5	1857.3	55	2907.0	581.4	13
@christopher-schroeder
Copy link
Author

Now I recognized that the motifs are basically the same but only shifted.

I provide a more complicated example:

$ cat results/str_candidates/merged.bed | grep 8500613
1	8500613	8502586	GCCGCCCCGTCCGGGAGGGAGGGGCCCCTCTGCCCG
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCGTCTGGGAAGTGAGGAGC
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT
1	8500613	8502586	CCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTCAG
1	8500613	8502586	CCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTCAG
1	8500613	8502586	CCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTCAG
1	8500613	8502586	GCCGCCCCGTCCGGGAAGTGAGGGGCCCCTCTGCCCG
cat results/str/818-14.bed | grep 8500613
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC	2448.0	61.2	4	932.0	23.3	5
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1420.0	28.4	8	305.0	6.1	1
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1295.0	25.9	9	-	-	-
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1295.0	25.9	9	-	-	-
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1295.0	25.9	9	-	-	-
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC	2448.0	61.2	4	976.0	24.4	5
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC	2448.0	61.2	4	952.0	23.8	5
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC	2448.0	61.2	4	952.0	23.8	5

Especially line 2 and 3 of the output confuse me.

1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1420.0	28.4	8	305.0	6.1	1
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1295.0	25.9	9	-	-	-

The same motif on the same locus but with different genotypes for one sample.

@readmanchiu
Copy link
Collaborator

when you specify the the loci in the bed file, there should one only be one locus (position) per line.
If you expect multiple motifs, you should use some regex for STRs for the last (4th) column in the bed file. For example, if you expect CGG and CAG to show up in the same locus, you can use "C*G". If you have no clue what to expect, you can use some very generic pattern, like "GC", which may work for your VNTR example.

@christopher-schroeder
Copy link
Author

So you are only allowed to put in a a position once and if there might be different motifs at a single position I have to use regexp?
Why would you implement it in such way?
We are also searching for pathogenic motif changes in STRs so especially loci where samples show different motifs are of interest. So I know exactly what motif I am searching for and I also know the position. But sometimes I have several motifs and would like to know if this occurs in a sample and if so how much copies.

The usecase is that if you detect an STR, you would like to know if the detected copy number is very much outside the norm. For that you have to know the norm and genotype the other (control) samples to get this information.

@readmanchiu
Copy link
Collaborator

Thanks, @christopher-schroeder for the info.
The idea of the input bed file is that each line corresponds to a unique locus; however, flexibility is allowed at the motif (4th column). So you you basically put "a" if you only are sure there is an A in the repeat.
This becomes silly so I would probably make a new release such that the motif will be optional - empty column allowed. There will be no screening against a pre-defined motif.
I think by removing this screening against pre-defined motifs will allow everything (regardless how they different from the reference) that is detected from the sequences to be reported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants