How to improve the haphic result for polyploid? #30

Open
vergilback opened this issue Jun 6, 2024 · 7 comments

@vergilback

vergilback commented Jun 6, 2024

Hello Xiaofei,

Thank you for developing the Haphic software.

I am currently working on the assembly of a complex polyploid genome and I am trying to scaffold the “p_utg” sequences obtained from hifiasm to the haplotype-resolved level. The first issue I encountered is determining the inflation value. I have tried many different inflation values, but the clustering results never match the chromosome number of my species (which has more than 260 chromosomes).
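For background, HapHiC's clustering step is based on Markov clustering (MCL), where the inflation value controls cluster granularity: higher inflation generally yields more, smaller clusters. Below is a toy, self-contained MCL sketch on an invented six-node graph (not HapHiC's implementation; the graph and tolerance are made up for illustration):

```python
# Toy Markov clustering (MCL) sketch -- NOT HapHiC's implementation.
# Higher inflation splits the graph into more, finer-grained clusters.
import numpy as np

def mcl(adj, inflation, n_iter=50):
    """Cluster an adjacency matrix with expansion + inflation rounds."""
    m = adj + np.eye(len(adj))       # add self-loops for stability
    m = m / m.sum(axis=0)            # make columns stochastic
    for _ in range(n_iter):
        m = m @ m                    # expansion: flow spreads along edges
        m = m ** inflation           # inflation: strengthen strong flow
        m = m / m.sum(axis=0)        # renormalise columns
    # attractor rows of the converged matrix define the clusters
    clusters = set()
    for row in m:
        members = tuple(int(j) for j in np.flatnonzero(row > 1e-6))
        if members:
            clusters.add(members)
    return clusters

# two triangles joined by a single edge: a clear two-cluster graph
adj = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1.0
print(mcl(adj, 2.0))
```

With real polyploid contact graphs the picture is far messier, which is why no single inflation value may reproduce the expected chromosome number.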

When running HapHiC with default parameters (inflation=2.6, yielding 222 clusters), I observed that some scaffolds still show signs of heterozygosity and collapse when checked in Juicebox.
[Hi-C contact map screenshots: hic1, hic2]

Using minimap2, I aligned these scaffolds to the genome of a closely related species and generated dot plots, which reveal clear 4:1 and 8:1 alignment patterns, but also some scaffolding errors.

Could you please provide some suggestions for improving this genome assembly? Thank you.

Best regards,

Xiaoyu

@zengxiaofei
Owner

Hi Xiaoyu,

As numerous collapsed contigs are present in your assembly according to the contact map you provided, simply tuning the inflation value may not significantly improve your assembly. Contigs derived from different haplotypes were clustered together due to the presence of these collapsed contigs. This issue is a common challenge encountered when constructing haplotype-resolved genomes of complex polyploid organisms, such as the cultivated hybrid sugarcane genomes we are working on. Manual adjustment in Juicebox is currently the main solution for this issue.

However, there are still several modifications that could potentially improve your results to some extent:

(1) Add --gfa *.p_utg.noseq.gfa to filter out potential collapsed contigs before clustering by utilizing the depth information from HiFi reads in the GFA file.
(2) Use lower values for --density_upper and --rank_sum_upper to filter out additional potential collapsed contigs. For example, we used --density_upper 0.9 --rank_sum_upper 0.8 for scaffolding the wild sugarcane AP85-411 genome, and --density_upper 0.8 --rank_sum_upper 0.8 for scaffolding the Medicago sativa Zhongmu-4 genome.
(3) Use --Nx 70 to remove more short contigs to accelerate the Markov clustering step.
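To illustrate the idea behind (2), here is a simplified sketch (not HapHiC's actual algorithm; the function name, contig names, and densities are invented): collapsed contigs attract roughly twice the Hi-C signal, so an upper-quantile cutoff on per-contig link density flags them, and lowering the cutoff (e.g. from 0.9 to 0.8) flags more contigs:

```python
def flag_collapsed(densities, density_upper):
    """Flag contigs whose Hi-C link density lies above the
    `density_upper` quantile (nearest-rank) as potentially collapsed.
    Hypothetical helper, illustrating the filtering direction only."""
    values = sorted(densities.values())
    cut = values[int(density_upper * (len(values) - 1))]
    return {name for name, d in densities.items() if d > cut}

# eight "normal" contigs plus two with roughly doubled link density
densities = {f"utg{i:02d}": 0.5 + 0.05 * i for i in range(8)}
densities.update({"utg08": 2.0, "utg09": 2.4})

print(sorted(flag_collapsed(densities, 0.9)))  # default-ish cutoff
print(sorted(flag_collapsed(densities, 0.8)))  # stricter cutoff flags more
```

The same logic applies to the rank-sum threshold: lower upper bounds are more aggressive at removing suspect contigs before clustering.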

Best wishes,
Xiaofei

@vergilback
Author

The "utg" version of the assembly should be fully phased and should not exhibit excessive collapse. However, there still appears to be some collapse in the Hi-C heatmap. Could this be due to the high similarity between different haplotypes, where the HiFi read lengths are not sufficient to span these regions, leading to the collapse of sequences from different haplotypes? I will try disabling the purge function in hifiasm and then attempt the assembly again.

How does Haphic handle this situation? If manual correction is required, should we identify these excessively collapsed regions, manually duplicate them, and then scaffold?

@zengxiaofei
Owner

zengxiaofei commented Jun 7, 2024

Actually, unitigs can be collapsed due to nearly identical sequences between haplotypes. Please refer to the scheme diagram below, in which green circles represent collapsed unitigs (adapted from the latest hifiasm paper, published in Nature Methods in 2024):
[screenshot: hifiasm scheme diagram]

Could this be due to the high similarity between different haplotypes, where the HiFi read lengths are not sufficient to span these regions, leading to the collapse of sequences from different haplotypes?

Yes, you are correct.

I will try disabling the purge function in hifiasm and then attempt the assembly again.

Purging duplicate haplotypes may not significantly benefit your case. The same holds true for some other strategies, such as local assembly. Even ONT ultra-long reads cannot resolve these long-range collapses.

How does Haphic handle this situation?

Similar to all other scaffolders, HapHiC does not alter input contigs (except for misjoin correction). If correctly identified as collapsed, these contigs will be temporarily removed before Markov clustering and then rescued in the subsequent reassignment step. Consequently, each collapsed contig will be assigned to one of the homologous groups (e.g., the collapsed contigs in group6 and group7). In cases of very long collapsed regions, these contigs will lead to incorrect clustering of the contigs from homologous chromosomes (e.g., group1).
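The rescue idea can be sketched minimally as follows (hypothetical function and data; HapHiC's actual reassignment step considers more signals than this): a temporarily removed contig is assigned to whichever existing group it shares the most Hi-C links with, even if a collapsed contig links strongly to several homologous groups:

```python
def rescue(contig_to_group_links):
    """Assign each rescued contig to the group with the most Hi-C links.
    Hypothetical sketch of the reassignment direction, not HapHiC code."""
    return {contig: max(links, key=links.get)
            for contig, links in contig_to_group_links.items()}

# a collapsed contig links almost equally to two homologous groups,
# but it can only end up in one of them
links = {
    "collapsed_utg": {"group6": 480, "group7": 455, "group1": 12},
    "normal_utg":    {"group6": 35,  "group7": 2,   "group1": 1},
}
print(rescue(links))
```

This is why a long collapsed contig ends up in a single homologous group rather than in every haplotype it actually represents.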

If manual correction is required, should we identify these excessively collapsed regions, manually duplicate them, and then scaffold?

Creating one or more copies of these collapsed regions can increase the completeness of haplotypes. The collapsed unitigs phased by hifiasm + UL reads also have more than one copy in the final graph:

[screenshot: assembly graph showing duplicated collapsed unitigs]

There are also many identical regions in the haplotype-resolved assembly of cultivated potato C88 (although they resolved these collapsed regions using information from offspring, not manual copying). However, this strategy can also cause trouble for downstream analysis: reads aligned to these regions will be assigned a MAPQ of zero by aligners due to multiple mapping.
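To make the MAPQ caveat concrete: multi-mapping reads get MAPQ 0, so any downstream step that filters on mapping quality (e.g. `samtools view -q 1`) will silently drop everything aligned to the duplicated copies. A minimal sketch with invented read records:

```python
# (read_name, mapq) pairs; reads in duplicated regions are multi-mapping
# and therefore carry MAPQ 0 (invented example records)
alignments = [("readA", 60), ("readB", 0), ("readC", 0), ("readD", 30)]

# the usual MAPQ >= 1 filter keeps only reads from unique regions,
# so duplicated copies lose all their coverage downstream
kept = [name for name, mapq in alignments if mapq >= 1]
print(kept)
```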

@vergilback
Author

Thank you for your detailed response, it has been very helpful for my understanding.

Phasing such a complex genome may still be a challenging task with the current sequencing data and assembly tools (even though some tools perform well in certain polyploid species). In this case, would assembling a haploid genome (n=4×) instead of a haplotype-resolved genome (2n=8×) be a more feasible option?

P.S.: I have tried scaffolding the haploid genome (the "p_ctg" version), but due to the obvious collapse in the heatmap, I subsequently attempted scaffolding with the utg version of the data in hopes of alleviating the collapse issue. However, the results have not been ideal so far.

@zengxiaofei
Owner

In this case, would assembling a haploid genome (n=4×) instead of a haplotype-resolved genome (2n=8×) be a more feasible option?

In my view, this idea is biologically plausible only for an allo-octoploid species with a karyotype of AABBCCDD.

I have tried scaffolding the haploid genome (the "p_ctg" version), but due to the obvious collapse in the heatmap, I subsequently attempted scaffolding using the utg version of the data in hopes of improving the collapse issue.

Primary contigs assembled without trio or Hi-C data are often inadequately phased. Furthermore, in regions of high heterozygosity, these contigs may even contain sequences from duplicate haplotypes.

@vergilback
Author

I tried the ALLHiC software, and perhaps due to the reference genome annotation, it is challenging to accurately phase homologous chromosomes, even for larger chromosomes. However, HapHiC performs quite well. Although the results still need to be compared with other species and manually adjusted based on Hi-C signals in Juicebox, it can generally reconstruct the information of different haplotypes. But I found the scaffolding rate is still relatively low. My species has many microchromosomes, and I found some fragments in the debris that should be scaffolded into larger chromosomes (possibly due to the collapse of polyploidy). Therefore, I have two questions:

Besides the parameters you mentioned earlier for polyploid species, are there any other parameters that can be adjusted (e.g., "min_group" during the reassign process)? There are too many parameters across the several steps.

Additionally, I noticed that the cluster step includes some options for UL data. Does HapHiC support the input of additional UL alignment data to better address scaffolding and collapsing issues?

@zengxiaofei
Owner

Sorry for the delay. I'm quite busy these days.

I tried the ALLHiC software, and perhaps due to the reference genome annotation, it is challenging to accurately phase homologous chromosomes, even for larger chromosomes. However, HapHiC performs quite well.

Yes. In our study, HapHiC outperformed ALLHiC in almost all tests. This may not be just an issue of the choice of reference genomes.

Although the results still need to be compared with other species and manually adjusted based on Hi-C signals in juicebox, it generally can reconstruct the information of different haplotypes.

The problematic results are primarily attributed to the presence of numerous large collapsed regions in your assembly. Please refer to the points we discussed in our paper:

The formation of collapsed contigs primarily results from extremely low sequence divergence. To mitigate the adverse effects of collapsed contigs, HapHiC has implemented the rank-sum algorithm. However, large-scale collapsed regions still significantly impede subsequent allele-aware scaffolding, as demonstrated in the cultivated potato C88 genome. Furthermore, unlike chimeric contigs, scaffolding tools typically do not correct collapsed contigs. Therefore, achieving a higher quality assembly remains a fundamental prerequisite for haplotype resolution. Otherwise, the resulting scaffolds will still suffer from the “garbage in, garbage out” phenomenon, which means that flawed input data will produce low-quality output. This holds true even when using a scaffolding tool with a high tolerance for assembly errors.

But I found the scaffolding rate is still relatively low.

Please refer to the FAQs section:

What can I do when the anchoring rate is too low?

There are three parameters controlling the anchoring rate through the reassignment step: --min_RE_sites, --min_links, and --min_link_density. By default, these parameters are set to 25, 25, and 0.0001, respectively. However, both the contig contiguity and Hi-C sequencing depth vary across different projects. By checking the *statistics.txt files in 01.cluster/inflation_*, you can find better values for these parameters to get a scaffolding result with a higher anchoring rate.
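The effect of those three thresholds can be sketched as follows (the function name, record layout, and contig values are invented for illustration; only the parameter names and defaults come from the FAQ). Lowering any of the minimums lets more short or shallowly covered contigs pass, raising the anchoring rate:

```python
def anchorable(contigs, min_RE_sites=25, min_links=25,
               min_link_density=0.0001):
    """Keep contigs that pass all three reassignment thresholds.
    Hypothetical sketch; values mirror HapHiC's documented defaults."""
    return [name for name, (re_sites, links, density) in contigs.items()
            if re_sites >= min_RE_sites
            and links >= min_links
            and density >= min_link_density]

# invented per-contig stats: (RE sites, Hi-C links, link density)
contigs = {
    "ctg1": (120, 300, 0.002),    # passes everything
    "ctg2": (10, 300, 0.002),     # too few restriction sites
    "ctg3": (120, 8, 0.00005),    # too few links, density too low
}
print(anchorable(contigs))                  # default thresholds
print(anchorable(contigs, min_RE_sites=5))  # relaxed: rescues ctg2
```

The *statistics.txt files show where your real contigs fall relative to these cutoffs, which is how you pick better values.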

My species has many microchromosomes, and I found some fragments in the debris that should be scaffolded into larger chromosomes (possibly due to the collapse of polyploidy).

Microchromosomes are exceptionally small, independent chromosomes commonly found in birds and reptiles. HapHiC does encounter challenges when dealing with microchromosomes due to its assumptions about the length distribution of chromosomes. Specifically, it is hypothesized that when sorting chromosomes by length, adjacent chromosomes should not exhibit significant length disparities. However, you said that "some fragments in the debris should be scaffolded into larger chromosomes", which I don't understand: microchromosomes should not be scaffolded into other chromosomes, since they are independent entities.

Besides the parameters you mentioned earlier for polyploid species, are there any other parameters that can be adjusted (e.g., "min_group" during the reassign process)? There are too many parameters in several steps.

Using the parameters --remove_allelic_links and --normalize_by_nlinks may yield slight improvements. Other parameters may not have a predictable effect on your results; they could either improve or worsen the outcome.
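One plausible reading of `--normalize_by_nlinks` (a simplified sketch with an invented helper and invented counts; check HapHiC's documentation for the exact formula) is that raw inter-contig link counts are scaled down by each contig's total link count, so contigs with inflated overall signal, such as collapsed or repetitive ones, do not dominate the clustering:

```python
def normalize_links(raw, totals):
    """Scale each contig-pair link count by the geometric mean of the
    two contigs' total link counts (hypothetical normalisation sketch)."""
    return {(a, b): n / (totals[a] * totals[b]) ** 0.5
            for (a, b), n in raw.items()}

raw = {("utgA", "utgB"): 90, ("utgA", "utgC"): 90}
totals = {"utgA": 200, "utgB": 100, "utgC": 900}  # utgC looks collapsed

norm = normalize_links(raw, totals)
# after normalisation, the A-B contact outweighs the A-C contact even
# though the raw counts were identical
print(norm)
```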

Additionally, I noticed that the cluster step includes some options for UL data. Does HapHiC support the input of additional UL alignment data to better address scaffolding and collapsing issues?

This is a function we specifically developed in response to reviewer comments. We have verified it on the potato C88.v1 assembly. UL data showed only marginal improvement for scaffolding in that case, so it may not markedly enhance your results. It is important to note that this function does not modify the contigs and thus cannot resolve collapsed regions.

If UL data is available to you, I recommend using it in the genome assembly process. Here is the conclusion we made in response to reviewers:

UL reads may have greater power at the assembly graph level (the approach taken by hifiasm), due to the assembly graph containing extra information from HiFi read alignments. In such scenarios, UL reads can effectively span and connect many unitigs with a high degree of accuracy.
