Ambiguous Novel Transcript Locations and Merging Transcript Models #233

Open
BenneyMRArgue opened this issue Sep 4, 2024 · 2 comments

Comments

@BenneyMRArgue

Hi,

Thanks so much for developing this tool, I'm excited to be able to use this in my long-read isoform analysis!

I have had a couple questions come up since looking through the output from some IsoQuant runs (v3.4.1). When looking through the transcript model output (transcript_model_grouped_counts.tsv) I noticed that novel transcripts appear to be labeled "transcript####.chr##". I need to merge the outputs from the individual IsoQuant runs (I have a large sc-RNAseq dataset which I have had to split by sample because of resource limitations for each job), so realizing that these transcripts appear to be listed in the order that IsoQuant processed them during the individual experiment raised concern. Is it possible to merge novel transcripts which have the same genomic coordinates but have been assigned different numbers in their respective runs?

I also noticed that some novel transcripts are listed multiple times in the tsv, connected to different chromosomes. For instance, I looked at one in IGV which was placed both in chromosome 7 and 10:
[Image: IGV_novel_transcript_ambiguity]
Do you have any insight on why this occurs and how to identify which location is correct?

Thanks,
-Benney

@andrewprzh
Collaborator

Dear @BenneyMRArgue

Thanks for the feedback!

I would recommend trying the latest IsoQuant version (3.5.2). It uses far less RAM than 3.4.1: a major problem was fixed in version 3.4.2, resulting in a ~10-30x RAM decrease on the tested datasets. You would probably be able to process your dataset in a single run.

Regarding duplicated transcripts: IsoQuant assigns transcript ids sequentially, so independent runs will not have identical ids for the same novel transcripts. Unfortunately, this makes it impossible to track novel transcripts between different runs. Moreover, the chromosome name is part of the transcript id, so it's OK to have transcript58.chr7.nic and transcript58.chr10.nic; these are two completely different transcript ids.
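
To illustrate the point about ids, here is a minimal Python sketch that splits an id into its parts. The pattern "transcript<number>.<chromosome>.<suffix>" is an assumption inferred from the two example ids above, not a documented IsoQuant format, and `split_transcript_id` is a hypothetical helper name.

```python
import re

# Assumed id layout (inferred from the examples in this thread, not from
# IsoQuant documentation): transcript<number>.<chromosome>.<suffix>
ID_PATTERN = re.compile(r"^transcript(\d+)\.(.+)\.([^.]+)$")


def split_transcript_id(transcript_id):
    """Return (number, chromosome, suffix) for an id such as transcript58.chr7.nic."""
    match = ID_PATTERN.match(transcript_id)
    if match is None:
        raise ValueError(f"unexpected transcript id: {transcript_id!r}")
    number, chromosome, suffix = match.groups()
    return int(number), chromosome, suffix


a = split_transcript_id("transcript58.chr7.nic")
b = split_transcript_id("transcript58.chr10.nic")

# Same running number, different chromosome: the full ids differ, so these
# are separate transcripts rather than one transcript placed twice.
print(a, b, a == b)
```

Comparing the full id string (or the whole tuple) rather than the number alone avoids mistaking two rows for the same transcript.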

If you would still like to merge different GTFs, I'd suggest using the gffcompare tool.
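
For intuition only, matching by structure rather than by id can be sketched in Python as below. This is not how gffcompare works internally: the sketch requires exact agreement of chromosome, strand and every exon coordinate, which is stricter than a dedicated tool is likely to be. The helper names, the toy GTF lines and all coordinates are made up for illustration.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def structure_keys(gtf_text):
    """Map each transcript id to (chromosome, strand, sorted exon intervals)."""
    exons = defaultdict(list)
    location = {}
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        tid = match.group(1)
        exons[tid].append((int(fields[3]), int(fields[4])))
        location[tid] = (fields[0], fields[6])
    return {tid: location[tid] + (tuple(sorted(exons[tid])),) for tid in exons}


def match_across_runs(gtf_a, gtf_b):
    """Pair ids from run A with ids from run B that share an identical structure."""
    by_structure = {key: tid for tid, key in structure_keys(gtf_b).items()}
    return {tid: by_structure[key]
            for tid, key in structure_keys(gtf_a).items()
            if key in by_structure}


def exon_line(chrom, start, end, strand, tid):
    """Build one toy GTF exon line (hypothetical data, not real IsoQuant output)."""
    attributes = f'gene_id "toy_gene"; transcript_id "{tid}";'
    return "\t".join([chrom, "toy", "exon", str(start), str(end), ".", strand, ".", attributes])


run_a = "\n".join([
    exon_line("chr7", 100, 200, "+", "transcript12.chr7.nic"),
    exon_line("chr7", 300, 400, "+", "transcript12.chr7.nic"),
])
run_b = "\n".join([
    exon_line("chr7", 100, 200, "+", "transcript58.chr7.nic"),
    exon_line("chr7", 300, 400, "+", "transcript58.chr7.nic"),
    exon_line("chr10", 100, 200, "+", "transcript59.chr10.nic"),
])

matches = match_across_runs(run_a, run_b)
print(matches)
```

In this toy input the two chr7 transcripts carry different ids in the two runs but have the same exon structure, so they are paired; the chr10 transcript has no counterpart and is left out.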

Best
Andrey

@andrewprzh andrewprzh added the question Further information is requested label Sep 12, 2024
@BenneyMRArgue
Author

Hi Andrey,

Thanks for the input! It's good to know that the novel transcripts can't be compared between runs. I will first try processing everything together with the newest version.

Also thanks so much for clarifying that point about the transcript ids and chromosome assignments, it's a relief to find they are not supposed to be the same!

Best,
-Benney
