Output file interpretation #162

Open
molinfzlvvv opened this issue May 21, 2024 · 1 comment

Comments

@molinfzlvvv

Hello! I've successfully executed the TOGA pipeline, but I'm still unclear on how to interpret the output file.

My objective was to identify missing or inactivated genes. I've observed that there's a file named loss_summ_data.tsv, which is divided into eight categories. Should I primarily focus on the UL (uncertain loss) and L (clearly lost) categories and disregard the others? Additionally, there's an inact_mut_data.txt file for visualization. What exactly do these two files contain, and how should I use them to correctly identify gene loss events and inactivated genes?

In fact, I aligned the pika genome to hg38 using lastal and converted the MAF output to a chain file (this was much faster). After running TOGA, the loss_summ_data.tsv file had only 19 non-redundant results, and the inact_mut_data.txt file had 319 results. However, when I generated the assembly quality statistics, I got the results shown in the attached figure. Is this reasonable, and what might be the cause?

toga_statsplot.pdf

I'm sorry for disturbing you so many times. I really look forward to your reply, which is very important to me.

Best regards!

@MichaelHiller
Collaborator

Hi,

  1. Your stats plot shows that almost all genes are classified as missing. I assume this is because the chains you use are very incomplete.

  2. Yes, extracting L (lost) and UL genes is what I would do. For UL, you may also want to run RELAX to check whether the gene evolves under relaxed selection, which would be stronger evidence that the whole gene (and not only one exon) is lost.

  3. If you have a highly complete genome, you can also extract M genes. E.g. in Rhie et al., Nature (the VGP paper), there are genes lost between rearranged genomic regions in bats (Fig 5). TOGA would classify them as missing, which is the correct classification if the assembly is not very complete. But for several Bat1K-quality genomes, M then likely indicates a true loss.

  4. Regarding what's in the files, please look into https://genome.senckenberg.de/download/TOGA/README.txt
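
The extraction described in points 2 and 3 could be sketched roughly as follows. This is only an illustration: it assumes that loss_summ_data.tsv is tab-separated with three columns (a level such as GENE, TRANSCRIPT or PROJECTION, then an identifier, then the class), which should be checked against the README linked above. The function names `read_loss_summary` and `genes_by_class` are made up for this example and are not part of TOGA.

```python
import csv
from collections import defaultdict


def read_loss_summary(path):
    """Yield rows from a tab-separated file (assumed layout: level, id, class)."""
    with open(path, newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")


def genes_by_class(rows, level="GENE", wanted=("L", "UL")):
    """Group identifiers by class, keeping only the requested level and classes.

    rows   -- iterable of [level, identifier, class] lists (assumed layout)
    level  -- which rows to keep, e.g. "GENE" rather than "PROJECTION"
    wanted -- classes to extract; add "M" only for highly complete assemblies
    """
    selected = defaultdict(list)
    for row in rows:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        row_level, identifier, status = row[0], row[1], row[2]
        if row_level == level and status in wanted:
            selected[status].append(identifier)
    return dict(selected)


# Tiny hand-made example; the identifiers are placeholders, not real results.
example_rows = [
    ["GENE", "geneA", "L"],
    ["GENE", "geneB", "I"],
    ["GENE", "geneC", "UL"],
    ["TRANSCRIPT", "txA1", "L"],
]
print(genes_by_class(example_rows))
```

On a real run, something like `genes_by_class(read_loss_summary("loss_summ_data.tsv"))` would give the L and UL candidates at gene level, and passing `wanted=("L", "UL", "M")` would include missing genes for a highly complete assembly. UL candidates would still need a follow-up check such as RELAX, as noted in point 2.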
