Loci with multiple motifs at same position #38

christopher-schroeder · 2024-05-28T11:55:16Z

I have called a bunch of samples with straglr and merged the (chrom, start, end, motif) beds into one unified bed. I would like to use this merged bed to genotype each sample again. Sometimes different motifs occur at the same position, but straglr only calles one of them multiple Times. Example:

Here is a position of the merged bed file with multiple motifs at the same position.

$ cat results/str_candidates/merged.bed | grep 56828456
Y	56828456	56836900	CATTC
Y	56828456	56836900	TCCAT

I use this bed for genotyping.
What I see:

$ cat results/str/818-13.bed | grep 56828456
Y	56828456	56836900	TCCAT	9286.5	1857.3	55	2907.0	581.4	13
Y	56828456	56836900	TCCAT	9286.5	1857.3	55	2907.0	581.4	13

What I am expecting:

$ cat results/str/818-13.bed | grep 56828456
Y	56828456	56836900	CATTC	...	...	...	...	...	...
Y	56828456	56836900	TCCAT	9286.5	1857.3	55	2907.0	581.4	13

The text was updated successfully, but these errors were encountered:

christopher-schroeder · 2024-05-28T12:27:42Z

Now I recognized that the motifs are basically the same but only shifted.

I provide a more complicated example:

$ cat results/str_candidates/merged.bed | grep 8500613
1	8500613	8502586	GCCGCCCCGTCCGGGAGGGAGGGGCCCCTCTGCCCG
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCGTCTGGGAAGTGAGGAGC
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT
1	8500613	8502586	CCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTCAG
1	8500613	8502586	CCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTCAG
1	8500613	8502586	CCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTCAG
1	8500613	8502586	GCCGCCCCGTCCGGGAAGTGAGGGGCCCCTCTGCCCG

cat results/str/818-14.bed | grep 8500613
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC	2448.0	61.2	4	932.0	23.3	5
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1420.0	28.4	8	305.0	6.1	1
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1295.0	25.9	9	-	-	-
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1295.0	25.9	9	-	-	-
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1295.0	25.9	9	-	-	-
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC	2448.0	61.2	4	976.0	24.4	5
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC	2448.0	61.2	4	952.0	23.8	5
1	8500613	8502586	GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC	2448.0	61.2	4	952.0	23.8	5

Especially line 2 and 3 of the output confuse me.

1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1420.0	28.4	8	305.0	6.1	1
1	8500613	8502586	GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT	1295.0	25.9	9	-	-	-

The same motif on the same locus but with different genotypes for one sample.

readmanchiu · 2024-05-28T15:04:47Z

when you specify the the loci in the bed file, there should one only be one locus (position) per line.
If you expect multiple motifs, you should use some regex for STRs for the last (4th) column in the bed file. For example, if you expect CGG and CAG to show up in the same locus, you can use "C*G". If you have no clue what to expect, you can use some very generic pattern, like "GC", which may work for your VNTR example.

christopher-schroeder · 2024-05-28T19:55:59Z

So you are only allowed to put in a a position once and if there might be different motifs at a single position I have to use regexp?
Why would you implement it in such way?
We are also searching for pathogenic motif changes in STRs so especially loci where samples show different motifs are of interest. So I know exactly what motif I am searching for and I also know the position. But sometimes I have several motifs and would like to know if this occurs in a sample and if so how much copies.

The usecase is that if you detect an STR, you would like to know if the detected copy number is very much outside the norm. For that you have to know the norm and genotype the other (control) samples to get this information.

readmanchiu · 2024-06-03T04:42:47Z

Thanks, @christopher-schroeder for the info.
The idea of the input bed file is that each line corresponds to a unique locus; however, flexibility is allowed at the motif (4th column). So you you basically put "a" if you only are sure there is an A in the repeat.
This becomes silly so I would probably make a new release such that the motif will be optional - empty column allowed. There will be no screening against a pre-defined motif.
I think by removing this screening against pre-defined motifs will allow everything (regardless how they different from the reference) that is detected from the sequences to be reported.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Loci with multiple motifs at same position #38

Loci with multiple motifs at same position #38

christopher-schroeder commented May 28, 2024 •

edited

Loading

christopher-schroeder commented May 28, 2024

readmanchiu commented May 28, 2024

christopher-schroeder commented May 28, 2024

readmanchiu commented Jun 3, 2024

Loci with multiple motifs at same position #38

Loci with multiple motifs at same position #38

Comments

christopher-schroeder commented May 28, 2024 • edited Loading

christopher-schroeder commented May 28, 2024

readmanchiu commented May 28, 2024

christopher-schroeder commented May 28, 2024

readmanchiu commented Jun 3, 2024

christopher-schroeder commented May 28, 2024 •

edited

Loading