-
Notifications
You must be signed in to change notification settings - Fork 0
/
HLA
3098 lines (2473 loc) · 105 KB
/
HLA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
Warning: this file is a long brain dump documenting ideas, problems,
solutions, experimental command lines, and so on.
It's mainly for me to be able to refer back to and to document
progress rather than something for others to read.
=============================================================================
HLA genes and stats
https://www.ebi.ac.uk/ipd/imgt/hla/about/statistics/
ftp:https://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/
Genbank query
https://www.ncbi.nlm.nih.gov/nuccore/?term=DQA1+human+complete+cds
in seq4d:/local/scratch01/jkb/quantum/
Pangenome paper
https://www.nature.com/articles/s41586-023-05896-x
"We selected five clinically relevant complex SV loci for detailed
structural analysis: RHD-RHCE, HLA-A, CYP2D6-CYP2D7, C4 and LPA".
(Other useful targets for pangenome construction and tangle resolution?)
# mamba install vcfbub vcflib wfmash
See also pggb worked example:
https://pggb.readthedocs.io/en/latest/rst/quick_start.html
pggb -i data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -n 10 -t 16 -V 'gi|568815561' -o out
odgi layout -i out2/*smooth.final.gfa -o _.layout
odgi draw -i out2/*smooth.final.gfa -c _.layout -p _.png
odgi draw -i out2/*smooth.final.gfa -c _.layout -s _.svg
Also "bandage" for viewing the gfa
pggb HLA defaults 4737 S nodes
-c 2 -G 10,100 3423
-G 10 1737
-G 10 -s 1000 1740
-G 10 -s 1000 -p 90 1685
-G 10 -s 5000 -p 90 1732
-G 100 -s 5000 -p 90 3380
-G 100 -s 1000 -p 90 -k30 1852
" -j 1 1845
" -e 1 1149
" -e 10 1137
-G10 -s1000 -k30 -p90 -e10 1145
-G10 -s5000 -k30 -p90 -e10 1165
-G10 -s5000 -k30 -p92 -e10 1111
-G10 -s5000 -k30 -p93 -e10 951 => 1 distinct graph
-G10 -s5000 -k30 -p94 -e10 699 => 2 distinct graphs
-G10 -s5000 -k30 -p95 -e10 61 => 4 distinct graphs
-G10 -s5000 -k30 -p70 -e10 1173
-G10 -s5000 -k35 -p93 -e10 726 (longer exact match needed)
-G10 -s5000 -k35 -p93 -e10 -n5 727
-G10 -s5000 -k35 -p93 -e10 -n1 711
-G10 -s5000 -k35 -p93 -e10 -n20 740
-G100 -s5000 -k35 -p93 -e10 725
G* -s5000 -k35 -p93 -e10 730 (-G 100,500,1000)
G* -s5000 -k35 -p93 -e0 4397
-p70 -n10 -s500 -k35 -e1 893 (p70 n10 s500 is from HLA example
-p70 -n10 -s500 4687
-p70 -n10 -s500 -k50 -e1 440
-p70 -n10 -s500 -k75 -e1 195
-p70 -n10 -s500 -k100 -e1 90 => 2 graphs
Aligning the 12 input reads with GraphAligner against the 440 node
graph gives 81,1,0,0,71,0,0,1,81,0,81,0 mismatches.
Against the 4687 node graph it is 1,1,0,0,1,0,0,1,1,0,1,0 mismatches.
An example of the alignment 'cg' tag for the first seq (81 vs 1
mismatch) is:
10938=3D1=30D5=1X4=2D1=4D1=1X3D6=1I2=3X3=2D2=2D2=3D1=1D2=2X1=1X2=1D5=1D1=1X2=1D1=3X3D3=1D1=4X2=7D2=13402=1D1=
vs
11067=1D1=
So it's mostly deletions, with the biggest being 30 bp. 3 sequences
have a 30bp deletion, and one has a 63bp deletion.
Ideally we need a way to get those nodes added to this graph, but not
to care about all the X cases and maybe the 1D/1I cases as they're
largely irrelevant as far as alignment goes.
With -k75 graph NMs are 63 max and we have 3 seqs with 58D.
So that's actually a better graph?
However there is more collapsing of nodes in the big graph:
awk '/^S/ {a+=length($3)} END {print a}' /tmp/out.4687/*.gfa
nodes seq len
4687 23373
440 51545
195 53569
/nfs/sam_scratch/jkb/conda22/bin/GraphAligner -g _.gfa -f ../pggb/data/HLA/DRB1-3123.fa.gz -x vg -a /tmp/out.gaf
sed -n 's/.*NM:i:https://p' /tmp/out.gaf |sed 's/[^0-9].*//'
0
783
1099
239
71
1105
240
71
774
1308
783
783
0
1313
5
534
493
240
70
So many differences, but small:
sed -n 's/.*cg:Z:https://p' /tmp/out.gaf | sed 's/[X=ID]/&#/g'|tr '#' '\012'|sort -n|grep -v =
8D
8I
8I
9I
9I
9I
10I
10I
10I
11D
11D
11D
11D
12D
12D
16D
16D
19D
19D
With ./minigraph -cxggs /tmp/_[0-9]*.fa -l4000 -d4000 -L10 > _.gfa
we see 132 nodes and:
...
5I
5I
6I
6I
6I
6I
8D
9I
9I
9I
So longest edit vs -L param:
L nodes longest NM N.align
2 511 8D 5126 39
3 473 12D 5812 28
4 336 8D 5802 39
5 308 8D 6632 20
6 247 9I 7313 21
7 229 9I 7310 21
8 167 9I 7882 21
9 138 9I 8183 21
10 132 9I 8230 19
11 117 9I 8413 19
12 111 9I 8492 19
13 92 9I 8701 19
14 89 9I 8725 19
15 77 11D 8905 19
16 68 11D 9059 19
17 62 11D 9186 19
18 59 11D 9234 19
19 53 16D 9332 19
20 53 16D 9332 19
21 50 16D
22 48 16D
23 48 16D
24 45 16D
25-31 42 19D
32-33 39 19D
34-46 36 19D
47-50 33 19D 9911 19
200 24 19D
1000 1 19D 9520 61 (of 12 seq)
So 1 node seq is enough to maximise out at a 19bp deletion.
-L 200 gives 24 nodes, biggest change is 19D, total change is 10445,
max change per seq is 1377. 21 alignments
-L 20 is 53 node, 16D, 9331 1252, 19
-L 18 is 59 node, 11D, 9234 1204, 19 <<<
-L 14 is 89 node, 9I, 8725 1151, 19
Maybe -L 18?
-L 18 -k 19 and lower is good (55-59 nodes)
-L 18 -k 20 is 28 nodes, but 30D and now 29 alignments
-j 0.3 (was .1) reduces number of alingments down to ~13. Also helps
to reduce node count too.
-L 25 -k 15 -j 0.3 => 46 node, 11I, 9853 1272, 13 seq
=============================================================================
Manual.
0,0 - 630,630 match 0.6kb
630,630 - 1500,1030 inversion plus indel
1500,1030 - 6233,5758 match ~4.7kb
6233,5758 - 6214,8095 deletion 2.3kb
6214,8095 - 8589,10451 match 2.4kb
8589,10451 - 8570,13130 deletion 2.7kb
8570,13130 - 11068,15590 match 2.5kb
Plus some sundry repeats
TODO: CNV at ~12kb in. 3 vs 2 vs 1 copies in some seqs
seq3: two copies with 1 SNP in 13KB. Correct, or assembly error?
--------------------
6/10 vs 2/5/7/8
1 - 600: 12% mismatch.
601/597 - 937/1093 diff seq (bubble)
938/1094 - 4043/4181 match
4043/4181 - 5244/4181 gap (del in s6)
5244/4181 - ? match
small ins in s2 (58bp)
5290/4240 - 6325/5245 match
s6: GGGAAGACACTATCTGGCCCCACCCGCAACCCCGACAACAC
s2: TGAGAAGACACTGACAGTGACGCCGCCATCCGGGGCTCCCTGGGTGGGGTGCGGGCACTGGGAACCTTAACCGGCCCC*CGCCGCAACGCCCACCACCA
6370/5345 - 7880/6822 match
s6 ins vs s2 "AA" (AA could be attached to another node?)
8195/6827 - 9360/7985 match
s6 ins vs s2 small ins
s6: ATGCCATGACAGACATATAGAACTTTTTTTTTTTTTTGAGACGGAGTTTCGCTCTTTTGCCCAGGGTGGAGCTGGAGTGCAGTGGCACGATCTCAGCTCACTGCAAGCTCCGCCTCCCAGGTTCATGCCATTCTCCTGCCTCAGCCTCCCAAGTAGCTGGGACTACAGGTGCACACCACCACGCCCAGCTAATTTTTTGTATTTTTAGTAGAGACGAGGTTTCACCATGTTAGCCAGGATGGTCTTCATCTCCTGACCTCGTGATCTGCCCGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGCATGAGCCACCGCACCCAGCCT
s2: CATTATAGACATAC
9685/8007 - 12244/6125 match. NB: last 600-900bp or so is 2 copies
rep on s6 and 3 copies on s2. Make a loop instead? Maybe not
accurate enough. TODO
3rd rep copy below in s2
s6: AAATCTCCGTCTC
s2: GAAACTCCGTCTCAAAAAAAAAAAAAATAGGCCAGGCGTGGTGGCTCACACCTGTAGTCTCAGCACTTTGGGAGGCTGAGGCAGGTGGATCACGAGGTCAGGAGATGGAGACCACCCTGGTTAACATGACAAAATGCCCTCACTACTAAAAATACTAAAAATTAGCCAGGCGTGGTGGCAGGTGCCTGTAGTCCCAGCTACTAGGGAGGCTGAGGCAGGAGAATGGCATGAACCCAGGAGGTGGAGCTTGCAGTGAGCTGAGATCACGCCACTGCACTCCAGCCTGGGCAACAAAGCGAGACTCCATCTTAAAAAAATAAATAAATAAATAAATAAAATAAAT
12260/5770 - 13805/4215 match
Small ins
s6: GAGGAGTATCCAAGTATTCAGGGGTCTGAGAGACCGATCAAGAGGA
s2: TA
13853/4206 - end match
Overall 24% mismatch!
--------------------
2/5/7(+6/10) vs 1/9/11
Ie 2 vs 1 in dotter.
~1kb match (some small indels)
~400bp ins in s1
s1: TGTAGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAACCTCTTGTAAGA
s2: CCTAAGACAGG
865/1380 - 4850/6200 match (noting existing indel from s10)
s2: 6850-7880 ins in s1 (noting additional data in s10)
9680/6380 - 11890/8610 match
11890/8610
Two ins, mainly s2, some variance in seqs
s2: CAAAAAAATAAAATAAAATAAAAATAAATTGAATATAAATTGACTCTCCTTGTAACCATACAGTAGCTTCTCGTATCTATTTAAATAAAATTCAGTCCGGCCGCGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGCAGACAGATTATCTGAGGTCAGGAGTTCGAGACCAGCATGGTCAACATGGTGATACCCGATGTCTACTAAAAATACAAAAAAATAAAAATTAGCCAGGTATGGTGGTGGGTGCCTGTAATCTCAGATACTTGGGAAGCTGAGGCAGGAGAATCACTTGAACCAGGGAGGCGAAGGTTGCAGTGAGCCGAGATTGCACCATTGCACTCCAGCCTGGGCAACAAGAACGAAACTCCGTCTCAAAAAAAAAAAAAATAGGCCAGGCGTGGTGGCTCACACCTGTAGTCTCAGCACTTTGGGAGGCTGAGGCAGGTGGATCACGAGGTCAGGAGATGGAGACCACCCTGGTTAACATGACAAAATGCCCTCACTACTAAAAATACTAAAAATTAGCCAGGCGTGGTGGCAGGTGCCTGTAGTCCCAGCTACTAGGGAGGCTGAGGCAGGAGAATGGCATGAACCCAGGAGGTGGAGCTTGCAGTGAGCTGAGATCACGCCACTGCACTCCAGCCTGGGCAACAAAGCGAGACTCCATCTTAAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAAATAAAATTCAAA
s1: AAAATAAAATAAAATAAACAAACAAATAAATAAATAATAAAATTTAATTT
8660 to end: match
--------------------
s3 vs s2 et al
Trim s3 at 15600 (2 copies)
1/1 - ~600 match
600 - 1030/930 bubble
1030/930 - 6740 match
s3 insertion; but approx matches another insertion in s6/s10 too.
Realign to that
s3: CCATAAAGAACTGAATTGGGCAGTGACTGTATAAAAGATGTGAGTCTAGGCCAGGCGCGGTGGCTCAAGCCTGTAATCCTAGCACTTTGGGAGACTGAGGCAGGTGGATTGCCTGAGCTCAGGAGATCAGCCTGGGCAACATGGTAAAACCCCGTCTCTACTAAAATACAAAAAATTAGCTGGGCATGGCGGGGTGCACCTGTAATCCCAGCTACTCGGGAGGCTGAGACGGGAGAATCACTTGAACCTCGGAGGCAGAGGTTGCAGTGAGCCGAGAGGGAGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCCATTTCAAAAAAACAAACAAACAAAAAAGTGAGTCTAG
6800-10800 match
~2.3kb ins in s3
s3 ~13k onwards match (from AAAATAAAATTCAAATTTTTTACCGTGGACATCAGAG)
(12950/12275 onwards)
12950 vs 12275
18449 at 18410 49<
TODO: compare s6/s10 ins vs others. (b3.0 vs b3.1 vs b3.i1 etc)
s6 vs s2,5,8?
It's about 600 in.
s6 600-935 has no match vs 2,5,8
s6 600-941 has inverse match vs 1,9,11 <<<<<
(last 40-10% or so of the 1,9,11 insertion)
Can replicate this way, or just make a new node.
-----------------------------------------------------------------------------
Cons for CNV:
prefix:
TAAATTGAATATAAATTGACTCTCCTTGTAACCATACAGTAGCTTCTCGTATCTATTTAAATAAAATT
CNV:
CAGGCCGGGCGTGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGCAGGTGGATCACGAGGTCAGGAGATCGAGACCAGCCTGGCTAACATGGTGAAACCCCATCTCTACTAAAAATACAAAAAATTAGCCAGGTGTGGTGGCAGGTACCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCAGGAGGCAGAGGTTGCAGTGAGCCGAGATTGCGCCACTGCACTCCAGCCTGGGCAACAGAGCAAGACTCCGTCTCAAAAAAATAAAAAAAAATAAAAA
suffix:
TTGACTGTCCTTGTAACCATACAGTAGCTTCTCATATCTATTCAAATAAAATT
Good match to s2 from CNV bp 22:
GCCTGTAATCCCAGCACTTTGGGAGGCCAAGGCAG
============t===================tg=
a======g==t================tg======
^^^^^^^^^^^^^^
TAAATTGAATATAAATTGACTCTCCTTGTAACCATACAGTAGCTTCTCGTATCTATTTAAATAAAATT prefix
TTGACTGTCCTTGTAACCATACAGTAGCTTCTCATATCTATTCAAATAAAATT suffix
So (CNV SUFFIX?)* (TAAA)*
(CNV TTGACTGTCCTTGTAACCATACAGTAGCTTCTCATATCTATTC? AAAT*)*
/nfs/sam_scratch/jkb/conda22/bin/GraphAligner -b 50 -C -1 -g mafft.gfa -f pggb/data/HLA/DRB1-3123.fa.gz -x vg -a mafft.gaf;awk '{print $1}' mafft.gaf|uniq -c
See
https://github.com/lh3/gfatools/blob/master/doc/rGFA.md#the-graph-alignment-format-gaf
for output format.
s3 vs mafft aligns with 0-15600
s3 vs mafft+loop aligns with 0-10858, 11964-13164 and 13142-15600
Similarly for s4.
=> graph is wrong somewhere :(
s3 start (0-11k) is:
>m1>m2>b3.0>cons2>cons3>i3>cons4>i4.1>i4.1.1>i4.2>cons5>cons6>cnv>cnvx>cnvstr>cnvstr>cnv
Then s3 ~12k to 13k is: >i2>cons3
Finally s3 13k-16k is: >cnv>cnvstr>cons7>cons8
=============================================================================
1. First 11 seqs from PGGB. Rename s1 to s11 and complement s7.
2. Align with mafft:
mafft --thread 8 --maxiterate 1000 --globalpair train.fa > train.mafft
3. Build a gap5 database, one read per contig.
tr '-' '*' < train.mafft > train.mafft2
tg_index -Q 30 train.mafft2
4. Manually merge the contigs together to form one contig. (Need to
automate this). Generally up to 20% divergent.
5. Annotate via tags.
Cons for consensus regions blue
A1 to Ax for differing alleles yellow/green
I1 to Ix for insertions
CNV for copies
STR for short reps red
See /nfs/users/nfs_j/jkb/work/quantum/train.2
Model as (CNV STR+ link*)+
link*. Right now it's just "link"
s3/s4 still fail. Why?
----------------------------------------------------------------------
Comparison
/nfs/sam_scratch/jkb/conda22/bin/GraphAligner -C -1 -b 100 -g pggb-2667.gfa -f DRB-train.fa -x vg -a out.gaf;awk '{print $1,$3,$4}' out.gaf
s1 0 11068
s2 0 13403
s3 0 15600
s4 0 15590
s5 0 13413
s6 0 14739
s7 0 13403
s8 0 13403
s9 0 11068
s10 0 14733
s11 0 11065
NMs are 0.
/nfs/sam_scratch/jkb/conda22/bin/GraphAligner -C -1 -b 100 -g mafft+cons.gfa -f DRB-train.fa -x vg -a out.gaf;awk '{print $1,$3,$4}' out.gaf
s1 1 11068
s2 1 13403
s3 0 15600
s4 0 15590
s5 1 13413
s6 0 14739
s7 0 13401
s8 1 13403
s9 1 11068
s10 0 14733
s11 1 11065
NMs are:
s1 479
s2 521
s3 844
s4 853
s5 520
s6 894
s7 521
s8 521
s9 479
s10 893
s11 480
total 7005
Biggest indel is 11I, 11D, 8I, 8D, etc.
49 hla-drb1.fasta.gz
mafft+cons.gfa: NM total 25664 (biggest 606, avg 524)
pggb-2667.gfa: NM total 2465 (biggest 364, avg 50)
--------------------
/nfs/sam_scratch/jkb/conda22/bin/GraphAligner -g mafft+cons.gfa -f hla-drb1.fasta.gz -x vg -a out.gaf
sed -n 's/.*cg:Z:https://p' out.gaf|awk '{print $1}'|sed 's/[=XID]/&#/g'|tr '#' '\012'|sort -n|grep -v =|uniq -c
...
1499 2X
90 3D
8 3I
197 3X
40 4D
41 4I
37 5D
15 5I
6 6D
21 7I
5 8D
1 8I
2 9D
3 10I
total NM 25044 (between 475 and 591 per alignment)
TODO: plot diff per site per node position in graph? Would allow us
to optimise the graph further
Common mis-hits:
# c1 69 C 5 15 75 8 3 0 0 in train
# c1 131 C 7 13 65 6 5 0 0
# c1 138 G 7 13 65 8 3 0 1
# c1 151 A 7 13 65 7 4 0 0
# c1 152 A 7 13 65 8 3 0 0
# c1 254 C 5 15 75 8 3 0 0
# c1 256 A 6 14 70 8 3 0 0
# c1 300 T 6 14 70 8 3 0 0
# c1 372 C 6 14 70 8 3 0 0
# c1 395 T 6 14 70 8 3 0 0
# c1 404 T 6 14 70 8 3 0 0
# c1 458 C 5 15 75 8 3 0 3
# c1 473 C 6 14 70 8 3 0 0
# c10 36 C 1 14 93 4 4 0 0 fixed
a1c0 TGTGTCCAGAGTTTGTTCCTTCCGGTGGGTTTGTGGTCTCGCTGACTTC
TGTGTCCAGAGTTTGTTCCTTCCGGTGGGTTCGTGGTTTCGCTGACTTC
^ ^
a1i2 GGCGCGGTGGCTCACGCCTGTAAT...TCTCAAAAAAAAAAAAAAAAAAAAAAACAA...
GGCGCGGTGGCTCACACCTGTAAT...TCTCAAAAAAAAAAAAAAAAAGAAAC...
^
c10 ATTTTCAGAAAGAGCACAGTGTGAACCAGTGCTCTTAG...
ATTTTCAGAAAGAGCACAGCGTGAACCAGTGCTCTCAG...
^
./gaf2gfa.pl mafft.gaf $f mafft+cons.gfa | perl -lane 'next unless /^#/;chomp($_);print "$_\t",int(100*($F[5]+$F[6]+$F[7])/($F[4]+$F[5]+$F[6]+$F[7]))' | less
--------------------
cnvstr differences.
Maybe this matters if first, second or third occurence of this node
due to 1, 2 or 3 copies numbers of cnv?
# cnvstr 0 A 459 0 0 15 15
# cnvstr 1 A 457 2 0 0 2
# cnvstr 2 A 454 5 0 17 22
# cnvstr 3 T 373 65 21 120 206 <<<
AAAT vs AAAAT?
# cnvstr 0 A 346 30 0 0 30
# cnvstr 1 A 376 0 0 61 61
# cnvstr 2 A 376 0 0 3 3
# cnvstr 3 A 329 18 29 65 112
# cnvstr 4 T 267 88 21 21 130
AAAT is fewer errs than AAAAT
AAAAAAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAAAAAAAAAAAT
AAAAAAAAAAAAAT
AAAAAAAAAAAAAT
AAAAAAAAAAAAAT
AAAAAAAAAAAAAT
AAAAAAAAAAAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
AAAAAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
xx
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
xx xx
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAATAAAATAAAATAAACAAACAAATAAATAAATAATAAAAT
AAAATAAAATAAAATAAACAAATAAATAAATAATAAAAAT
AAAATAAAATAAAATAAACAAATAAATAAATAATAAAAAT
AAAATAAAATAAAATAAATAAACAAATAAATAGGTAAATAAATAATAAAAT
AAAATAAAATAAAATAAATAAACAAATAAATAGGTAAATAAATAATAAAAT
Mix of AAAT and AAAAT?
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
x xx xx xx xx xx xx xx xx x x x x
AAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
x xx x x x xx xx xx x
AAATAAATAAATAAATAAATAAATAAATAAATAAATAAAT
It's a mix of repeats A*T for differing lengths of A.
Ideal model is ((AAA|AAAA|AAAAA)T)*
Eg:
AAAAAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
x
AAATAAATAAATAAATAAATAAATAAAATAAATAAATAAATAAATAAATAAATAAATAAAATAAAAT
AAAAAAATAAAATAAAATAAAAATAAATTGAATATAAAT
x xx x
AAATAAATAAAATAAAATAAAAATAAATAAAATAAAAAT
But matching kmers to it is complex.
=============================================================================
a1i2: one sequence with big difference
also 12 poly-A diff at end with some seqs
c1: several large differences. Needs a separate seq entry? Dotplot this
c10: mostly good, but an 8bp common deletion
TTACCAACAGTCACAATCATATAATTTATCTTTCACATCATCTTCTTT
TTACCAA--------ATCATATAATTTATCTCTCACATCATCTTCTTT
Long segment so wouldn't affect path.
c11: long and mostly fine, but some variations and indels
GTTGCTGTCACTGTGGCT------TGCATGGTTAGCACTGTAATCCATGTCCATGTGTC
GTCGCTGTCACTGTGGCTGTGGCTTCCATGGGTAACACCGTAATCCATGCTCATGTGTC
c15: a small fluctuation in indels
c17: as above; +- 7bp
c19: a small fluctuation in indels
c2: a collection of indels that add up, 7bp biggest?
AGGAGCAGAAACAGACCATGT-GACCCATGAA--GCGTGAAG-------TGTCTGTCACAGGATCCA
AGGGGCAGAAAGGAGCAGAGACAGACCATGTGACCCATGAAGCCTGAAGTGTCTGTCACAGGATCCA
c3: many indels that add up over time
TGTCCATTACCTACCAAATACAACAGGTAATTCGTAGAGCAGAGCAGTAATTACAACTGGCCAAT
TGTCCATTAC---------ACAACAGGTAATTGGTTGAGCAG-----TAATTATAACTGGCCAAT
c8: 12bp del
AGGGTGCCAGAGAAGGACCCTCTTAATAGTGACGGTT-CAGATGTGACTT
AGGGTGCCAGAGAAG------------AGTGACAGTTACAGATGTGACTT
c9: graph seems the rarer variant (9 of 40)
TAAAACACTATGTACATGTTTTTGATGTTTTTGTCTTTATGTTTGATTGAAGTGTGAAAAT
TAAAACACTAT------ATTTTTGATGTTTTTGTCTTTATGTTTGA----AGTGTGAAAAT
...
CAAGCAAAAATAAACCA-TAAAGAACTGAATTGGGCAGTGACTGTATAAAAGATGTGAGTCTAG
CAAGCAAAAATAAAAGACTAAAC--------TGTGTGGTGAGTGTATAACAGATGTGAGTCTAG
cnv: several small common indels between copies of rep.
cnvstr: highly variable in length!
Plus mix of AAAT, AAAAT, AAAAAT
Maybe better modelled as a complex of repeated choices?
Main changes:
# a1c0 7 A 28 21 0 0 21
# a1c0 31 C 30 19 0 0 19
# a1c0 37 T 28 21 0 0 21
# a1c1 0 C 44 5 0 0 5
# a1c1 1 C 31 18 0 0 18
# a1c1 2 G 46 3 0 0 3
# a1c1 3 C 49 0 0 0 0
# a1c1 4 G 31 18 0 2 20
# a1c1 5 A 30 19 0 0 19
# a1c1 6 A 47 2 0 0 2
# a1c1 7 T 28 21 0 0 21
Maybe just two alternative sequences here?
# a1c1 38 T 31 18 0 0 18
# a1c1 39 G 49 0 0 0 0
# a1c1 40 G 49 0 0 0 0
# a1c1 41 C 49 0 0 0 0
# a1c1 42 A 49 0 0 0 0
# a1c1 43 C 31 18 0 0 18
# a1c1 44 A 28 21 0 0 21
# a1c1 45 G 49 0 0 0 0
# a1c1 46 A 49 0 0 0 0
# a1c1 47 C 39 10 0 0 10
# a1c1 48 C 49 0 0 0 0
# a1c1 49 C 49 0 0 0 0
# a1c1 50 G 28 21 0 0 21
# a1c1 51 A 49 0 0 0 0
# a1c1 52 A 49 0 0 0 0
# a1c1 53 G 49 0 0 0 0
# a1c1 54 A 49 0 0 0 0
# a1c1 55 G 49 0 0 0 0
# a1c1 56 T 49 0 0 0 0
# a1c1 57 G 44 5 0 0 5
# a1c1 58 A 31 18 0 0 18
# a1c1 59 G 49 0 0 0 0
# a1c1 60 C 49 0 0 0 0
# a1c1 61 A 49 0 0 0 0
# a1c1 62 A 28 19 2 0 21
Similarly 2 populations
# a1c2 0 G 49 0 0 0 0
# a1c2 1 C 28 21 0 0 21
# a1c2 2 A 28 21 0 0 21
# a1c2 3 T 28 21 0 0 21
Rare version in graph
# a1c2 20 C 31 18 0 0 18
# a1c2 21 A 49 0 0 0 0
# a1c2 22 G 48 1 0 0 1
# a1c2 23 A 49 0 0 0 0
# a1c2 24 G 44 5 0 0 5
# a1c2 25 T 49 0 0 0 0
# a1c2 26 G 49 0 0 0 0
# a1c2 27 T 30 19 0 0 19
# a1c2 28 G 30 19 0 0 19
Diff segment?
# a1c2 52 T 31 18 0 0 18
# a1c2 53 G 31 18 0 0 18
# a1c2 54 A 49 0 0 0 0
# a1c2 55 T 49 0 0 0 0
# a1c2 56 T 49 0 0 0 0
# a1c2 57 A 28 21 0 0 21
# a1c2 143 C 28 21 0 0 21
# a1c2 144 C 28 21 0 0 21
# a1c2 145 T 49 0 0 0 0
# a1c2 146 A 49 0 0 0 0
# a1c2 147 A 36 13 0 0 13
# a1c2 148 G 49 0 0 0 0
# a1c2 149 A 36 13 0 0 13
# a1c2 150 C 43 1 5 0 6
end also different
# a1c3 0 A 50 0 0 0 0
# a1c3 1 G 45 0 5 0 5
# a1c3 2 G 24 21 5 0 26
# a1c3 3 A 50 0 0 0 0
# a1c3 4 A 50 0 0 0 0
# a1c3 5 A 45 0 5 0 5
# a1c3 6 G 50 0 0 0 0
# a1c3 7 T 50 0 0 0 0
# a1c3 8 T 50 0 0 0 0
# a1c3 9 C 50 0 0 0 0
# a1c3 10 T 50 0 0 0 0
# a1c3 11 G 28 22 0 0 22
# a1c3 12 C 50 0 0 0 0
# a1c3 13 A 50 0 0 0 0
# a1c3 14 A 50 0 0 0 0
# a1c3 15 G 50 0 0 0 0
# a1c3 16 T 50 0 0 0 0
# a1c3 17 C 49 1 0 0 1
# a1c3 18 C 50 0 0 0 0
# a1c3 19 C 50 0 0 0 0
# a1c3 20 T 28 22 0 0 22
# a1c3 21 A 50 0 0 0 0
# a1c3 22 C 50 0 0 0 0
# a1c3 23 C 50 0 0 0 0
# a1c3 24 C 50 0 0 0 0
# a1c3 25 G 28 22 0 0 22
start is rare var
# c1 69 C 28 28 0 0 28
# c1 70 A 56 0 0 0 0
# c1 71 C 51 5 0 0 5
# c1 72 G 56 0 0 0 0
# c1 73 G 56 0 0 0 0
# c1 74 G 56 0 0 0 0
# c1 75 A 56 0 0 0 0
# c1 76 G 50 6 0 0 6
# c1 77 G 14 42 0 0 42
# c1 78 C 51 5 0 0 5
# c1 79 C 51 5 0 0 5
# c1 80 G 15 41 0 0 41
rare version in graph
# c1 131 C 31 25 0 0 25
# c1 132 A 56 0 0 0 0
# c1 133 A 56 0 0 0 0
# c1 134 T 56 0 0 0 0
# c1 135 G 55 1 0 1 2
# c1 136 C 56 0 0 23 23
Common ins
# c1 249 A 38 18 0 0 18
# c1 250 A 56 0 0 0 0
# c1 251 C 52 4 0 0 4
# c1 252 A 56 0 0 0 0
# c1 253 G 55 1 0 0 1
# c1 254 C 28 28 0 0 28
# c1 255 C 54 2 0 0 2
# c1 256 A 30 26 0 0 26
# c1 257 A 40 16 0 0 16
# c1 299 C 53 3 0 0 3
# c1 300 T 30 26 0 0 26
# c1 301 T 56 0 0 0 0
# c1 302 C 56 0 0 0 0
# c1 303 C 55 1 0 0 1
# c1 304 T 55 1 0 0 1
# c1 305 T 52 4 0 0 4
# c1 306 G 56 0 0 0 0
# c1 307 A 55 0 1 0 1
# c1 308 A 56 0 0 0 0
# c1 309 T 54 1 1 0 2
# c1 310 G 29 0 27 0 27
# c1 311 T 56 0 0 0 0
# c1 312 G 56 0 0 0 0
# c1 313 G 56 0 0 0 0
# c1 314 T 56 0 0 0 0
# c1 315 C 56 0 0 0 0
# c1 316 A 56 0 0 0 0
# c1 317 T 53 3 0 0 3
# c1 318 C 56 0 0 0 0
# c1 319 T 56 0 0 0 0
# c1 320 G 38 18 0 0 18
# c1 321 C 55 1 0 0 1
# c1 371 C 55 1 0 0 1
# c1 372 C 30 26 0 0 26
# c1 373 T 55 1 0 0 1
# c1 394 T 51 5 0 0 5
# c1 395 T 30 26 0 0 26
# c1 396 C 52 4 0 0 4
# c1 397 C 55 0 0 0 0
# c1 398 C 55 0 0 0 0
# c1 399 C 55 0 0 0 0
# c1 400 T 55 0 0 0 0
# c1 401 C 55 0 0 0 0
# c1 402 T 54 1 0 0 1
# c1 403 T 54 1 0 0 1
# c1 404 T 29 26 0 0 26
# c1 405 C 53 2 0 0 2
# c1 406 A 50 5 0 0 5
# c1 457 T 51 4 0 0 4
# c1 458 C 27 28 0 14 42
# c1 459 C 55 0 0 0 0
# c1 460 C 51 4 0 0 4
# c1 461 T 55 0 0 0 0
# c1 462 G 52 3 0 0 3
# c1 463 A 54 1 0 0 1
# c1 464 G 55 0 0 0 0
# c1 465 C 51 4 0 0 4
# c1 466 C 55 0 0 0 0
# c1 467 A 55 0 0 0 0
# c1 468 A 52 3 0 0 3
# c1 469 T 55 0 0 0 0
# c1 470 A 53 2 0 0 2
# c1 471 G 51 4 0 0 4
# c1 472 T 50 5 0 0 5
# c1 473 C 28 27 0 0 27
# c1 474 C 55 0 0 0 0
# c10 73 C 39 0 0 0 0
# c10 74 T 8 31 0 0 31
# c10 75 T 39 0 0 0 0
# c10 76 T 36 3 0 0 3
# c10 77 G 39 0 0 0 0
# c10 78 T 39 0 0 0 0
# c10 79 T 39 0 0 0 0
# c10 80 G 39 0 0 0 0
# c10 81 G 8 31 0 0 31
# c10 82 T 39 0 0 0 0
# c10 83 G 34 5 0 0 5
# c10 84 A 8 31 0 0 31
# c10 85 T 8 31 0 0 31
# c10 86 G 8 31 0 0 31
# c10 87 G 39 0 0 0 0
# c10 88 A 39 0 0 0 0
# c10 89 G 39 0 0 0 0
# c10 90 T 8 31 0 0 31
# c10 91 C 39 0 0 0 0
rare var in graph
# c10 141 T 39 0 0 0 0
# c10 142 A 8 31 0 0 31
# c10 143 A 39 0 0 0 0
# c10 144 A 39 0 0 0 0
# c10 145 T 39 0 0 0 0
# c10 146 A 39 0 0 0 0
# c10 147 T 39 0 0 0 0
# c10 148 C 39 0 0 0 0
# c10 149 A 39 0 0 0 0
# c10 150 G 39 0 0 0 0
# c10 151 A 39 0 0 0 0
# c10 152 A 39 0 0 0 0
# c10 153 G 39 0 0 0 0
# c10 154 A 39 0 0 0 0
# c10 155 A 39 0 0 0 0
# c10 156 A 39 0 0 0 0
# c10 157 C 38 1 0 0 1
# c10 158 C 39 0 0 0 0
# c10 159 C 39 0 0 0 0
# c10 160 C 39 0 0 0 0
# c10 161 T 39 0 0 0 0
# c10 162 G 39 0 0 0 0
# c10 163 T 39 0 0 0 0
# c10 164 T 39 0 0 0 0
# c10 165 G 39 0 0 30 30
# c10 166 T 39 0 0 0 0
# c10 187 A 39 0 0 0 0
# c10 188 C 39 0 0 31 31
# c10 189 A 39 0 0 0 0
# c10 190 T 39 0 0 0 0
# c10 191 T 39 0 0 0 0
# c10 192 G 39 0 0 0 0
# c10 193 C 8 31 0 0 31
# c10 194 C 39 0 0 0 0
# c10 195 A 39 0 0 0 0
# c10 196 C 39 0 0 0 0
# c10 197 A 8 0 31 0 31
# c10 198 C 36 3 0 0 3
# c10 199 G 34 5 0 0 5
# c10 200 T 39 0 0 0 0
# c10 201 T 39 0 0 0 0
# c10 202 C 39 0 0 0 0
# c10 203 C 39 0 0 0 0
# c10 204 C 36 3 0 0 3
# c10 205 C 34 5 0 0 5
# c10 206 A 39 0 0 0 0
# c10 207 A 39 0 0 0 0
# c10 208 G 8 31 0 0 31
# c10 209 G 36 3 0 0 3
# c10 210 G 8 31 0 0 31
# c10 211 T 39 0 0 0 0
# c10 239 C 39 0 0 0 0
# c10 240 T 10 29 0 0 29
# c10 241 A 36 3 0 0 3
# c10 242 T 39 0 0 0 0
# c10 243 C 39 0 0 0 0
# c10 244 C 39 0 0 0 0
# c10 245 C 39 0 0 0 0
# c10 246 T 39 0 0 0 0
# c10 247 G 39 0 0 0 0
# c10 248 G 39 0 0 0 0
# c10 249 G 39 0 0 0 0
# c10 250 T 39 0 0 0 0
# c10 251 A 39 0 0 0 0
# c10 252 A 39 0 0 0 0
# c10 253 G 39 0 0 0 0
# c10 254 A 10 29 0 0 29
# c10 255 A 39 0 0 0 0
# c10 256 C 36 3 0 0 3
# c10 257 C 37 2 0 0 2
# c10 258 A 39 0 0 0 0
# c10 259 C 36 3 0 0 3
# c10 260 G 8 31 0 0 31
# c10 261 G 39 0 0 0 0
# c10 262 G 39 0 0 0 0
# c10 263 T 39 0 0 0 0
# c10 264 G 39 0 0 0 0
# c10 265 T 39 0 0 0 0
# c10 266 G 39 0 0 0 0
# c10 267 A 39 0 0 0 0
# c10 268 A 39 0 0 0 0
# c10 269 C 10 29 0 0 29
# c10 270 C 39 0 0 0 0
# c10 291 A 39 0 0 0 0
# c10 292 A 39 0 0 0 0
# c10 293 T 8 31 0 0 31
# c10 294 A 39 0 0 0 0
# c10 295 A 39 0 0 0 0
# c10 296 G 39 0 0 0 0
# c10 297 C 36 3 0 0 3