How to improve the haphic result for polyploid? #30
Hi Xiaoyu,
According to the contact map you provided, your assembly contains numerous collapsed contigs, so simply tuning the inflation value may not significantly improve it. These collapsed contigs caused contigs derived from different haplotypes to be clustered together. This is a common challenge when constructing haplotype-resolved genomes of complex polyploid organisms, such as the cultivated hybrid sugarcane genomes we are working on. Manual adjustment in Juicebox is currently the main solution for this issue. However, there are still several modifications that could potentially improve your results to some extent: (1) Add
Best wishes,
The "utg" version of the assembly should be fully phased and should not exhibit excessive collapse. However, there still appears to be some collapse in the Hi-C heatmap. Could this be due to high similarity between haplotypes, where the HiFi reads are not long enough to span these regions, so that sequences from different haplotypes collapse? I will try disabling the purge function in hifiasm and then attempt the assembly again. How does HapHiC handle this situation? If manual correction is required, should we identify these excessively collapsed regions, manually duplicate them, and then scaffold?
Thank you for your detailed response; it has been very helpful for my understanding. Phasing such a complex genome may still be challenging with the current sequencing data and assembly tools (even though some tools perform well in certain polyploid species). In this case, would assembling a haploid genome (n=4×) instead of a haplotype-resolved genome (2n=8×) be a more feasible option? P.S.: I first tried scaffolding the haploid genome (the "p_ctg" version), but because of the obvious collapse in the heatmap, I then attempted scaffolding with the utg version in the hope of reducing the collapse. However, the results are not ideal so far.
In my view, this idea is biologically plausible only for an allo-octoploid species with a karyotype of AABBCCDD.
Primary contigs assembled without trio or Hi-C data are often inadequately phased. Furthermore, in regions of high heterozygosity, these contigs may even contain sequences from duplicate haplotypes.
I tried the ALLHiC software, and perhaps due to the reference genome annotation, it struggles to accurately phase homologous chromosomes, even for larger chromosomes. However, HapHiC performs quite well. Although the results still need to be compared with other species and manually adjusted based on Hi-C signals in Juicebox, it can generally reconstruct the different haplotypes. But I found the scaffolding rate is still relatively low. My species has many micro-chromosomes, and I found some fragments in debris that should be scaffolded to larger chromosomes (possibly due to the collapse of polyploidy). Therefore, I have two questions. First, besides the parameters you mentioned earlier for polyploid species, are there any other parameters that can be adjusted (e.g., "min_group" during the reassign process)? There are too many parameters in several steps. Second, I noticed that the cluster step includes some options for UL data. Does HapHiC support the input of additional UL alignment data to better address scaffolding and collapsing issues?
Sorry for the delay. I'm quite busy these days.
Yes. In our study, HapHiC outperformed ALLHiC in almost all tests. This may not be just an issue of the choice of reference genomes.
The problematic results are primarily attributed to the presence of numerous large collapsed regions in your assembly. Please refer to the points we discussed in our paper: The formation of collapsed contigs primarily results from extremely low sequence divergence. To mitigate the adverse effects of collapsed contigs, HapHiC has implemented the rank-sum algorithm. However, large-scale collapsed regions still significantly impede subsequent allele-aware scaffolding, as demonstrated in the cultivated potato C88 genome. Furthermore, unlike chimeric contigs, scaffolding tools typically do not correct collapsed contigs. Therefore, achieving a higher quality assembly remains a fundamental prerequisite for haplotype resolution. Otherwise, the resulting scaffolds will still suffer from the “garbage in, garbage out” phenomenon, which means that flawed input data will produce low-quality output. This holds true even when using a scaffolding tool with a high tolerance for assembly errors.
Please refer to the FAQs section: How can I do when the anchoring rate is too low? There are three parameters controlling the anchoring rate through the reassignment step:
Microchromosomes are exceptionally small, independent chromosomes commonly found in birds and reptiles. HapHiC does encounter challenges with microchromosomes because of its assumption about the length distribution of chromosomes. Specifically, it assumes that when chromosomes are sorted by length, adjacent chromosomes should not exhibit significant length disparities. However, you said "some fragments in debris that should be scaffolded to larger chromosomes". I do not understand this. Microchromosomes should not be scaffolded onto other chromosomes, since they are independent entities.
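To make the length-distribution assumption concrete, here is a minimal Python sketch that sorts a set of group lengths and flags adjacent pairs whose length ratio exceeds a cutoff. This is only an illustration of the idea described above, not HapHiC's actual implementation; the function name `flag_length_gaps` and the `max_ratio` cutoff are hypothetical.

```python
def flag_length_gaps(lengths, max_ratio=3.0):
    """Return adjacent (longer, shorter) pairs whose length ratio exceeds max_ratio.

    `lengths` are positive group/chromosome lengths in bp.
    `max_ratio` is a hypothetical cutoff chosen for illustration only.
    """
    ordered = sorted(lengths, reverse=True)
    flagged = []
    # Compare each length with the next shorter one in the sorted list.
    for longer, shorter in zip(ordered, ordered[1:]):
        if longer / shorter > max_ratio:
            flagged.append((longer, shorter))
    return flagged


if __name__ == "__main__":
    # Three macrochromosome-sized groups followed by two much smaller ones.
    groups = [100_000_000, 90_000_000, 80_000_000, 10_000_000, 9_000_000]
    print(flag_length_gaps(groups))
```

A karyotype with both macro- and microchromosomes will typically show at least one such large jump, which is where an assumption of smoothly decreasing lengths breaks down.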
Using the parameters
This is a function we specifically developed in response to reviewer comments. We have verified it on the potato C88.v1 assembly. UL data showed only marginal improvement for scaffolding in this case, so it may not markedly enhance your results. It is important to note that this function does not modify the contigs and thus cannot resolve collapsed regions. If UL data is available to you, I recommend using it in the genome assembly process. Here is the conclusion we made in response to reviewers: UL reads may have greater power at the assembly graph level (the approach taken by hifiasm), due to the assembly graph containing extra information from HiFi read alignments. In such scenarios, UL reads can effectively span and connect many unitigs with a high degree of accuracy.
Hello Xiaofei,
Thank you for developing the HapHiC software.
I am currently working on the assembly of a complex polyploid genome and I am trying to scaffold the “p_utg” sequences obtained from hifiasm to the haplotype-resolved level. The first issue I encountered is determining the inflation value. I have tried many different inflation values, but the clustering results never match the chromosome number of my species (which has more than 260 chromosomes).
When running HapHiC with the default parameters (inflation=2.6, clusters=222), I observed that some scaffolds still show signs of heterozygosity and collapse when checked in Juicebox.
Using minimap2, I aligned these scaffolds to the genome of a closely related species and generated dot plots, which reveal obvious 4:1 and 8:1 alignment results, but also some scaffolding errors.
Could you please provide some suggestions for improving this genome assembly? Thank you.
Best regards,
Xiaoyu
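The 4:1 and 8:1 patterns mentioned in the question above can also be tallied directly from the minimap2 output instead of being read off dot plots. The sketch below is a rough Python illustration that assumes the alignments were written in PAF format (minimap2's default output) and counts how many distinct scaffolds have at least one sufficiently long, confidently mapped alignment to each reference sequence. The function name `queries_per_target` and the thresholds `min_aln_len` and `min_mapq` are hypothetical and would need tuning for a real genome.

```python
from collections import defaultdict


def queries_per_target(paf_lines, min_aln_len=10_000, min_mapq=1):
    """Count distinct query scaffolds aligned to each target sequence.

    Standard PAF columns (0-based): 0 = query name, 5 = target name,
    10 = alignment block length, 11 = mapping quality.
    Both thresholds are illustrative assumptions, not recommended values.
    """
    hits = defaultdict(set)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated records
        qname, tname = fields[0], fields[5]
        aln_len, mapq = int(fields[10]), int(fields[11])
        if aln_len >= min_aln_len and mapq >= min_mapq:
            hits[tname].add(qname)
    return {tname: len(qnames) for tname, qnames in hits.items()}


if __name__ == "__main__":
    import sys

    # Usage: python count_ratio.py scaffolds_vs_ref.paf
    with open(sys.argv[1]) as handle:
        for target, n in sorted(queries_per_target(handle).items()):
            print(f"{target}\t{n}")
```

This count is only a crude proxy: one scaffold may align to several reference chromosomes, and fragmented scaffolds inflate the numbers, so the dot plots remain necessary for spotting scaffolding errors.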