RNAC RNA types are getting mangled by the pipeline (tested by gorule-0000001) #2246

Open
cmungall opened this issue Jan 29, 2024 · 31 comments

@cmungall
Member

Source:

✗ curl -L -s https://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human_rna.gaf.gz | gzip -dc | grep URS000075D95B_9606 | cut -f2,3,9-12
URS00004176D4_9606	URS00004176D4_9606	F	Homo sapiens (human) hsa-miR-185-5p		miRNA
URS000075D95B_9606	URS000075D95B_9606	F	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA
URS000075D95B_9606	URS000075D95B_9606	F	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA
URS000075D95B_9606	URS000075D95B_9606	F	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA
URS000075D95B_9606	URS000075D95B_9606	P	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA
URS000075D95B_9606	URS000075D95B_9606	P	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA
URS000075D95B_9606	URS000075D95B_9606	P	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA
URS000075D95B_9606	URS000075D95B_9606	P	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA
URS000075D95B_9606	URS000075D95B_9606	P	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA
URS000075D95B_9606	URS000075D95B_9606	C	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA
URS000075D95B_9606	URS000075D95B_9606	C	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA
URS000075D95B_9606	URS000075D95B_9606	C	Homo sapiens (human) X inactive specific transcript (XIST)		lncRNA

What we end up publishing:

✗ curl -L -s http://current.geneontology.org/annotations/goa_human_rna.gaf.gz | gzip -dc | grep URS000075D95B_9606 | cut -f2,3,9-12
URS00004176D4_9606	URS00004176D4_9606	F	Homo sapiens (human) hsa-miR-185-5p		miRNA
URS000075D95B_9606	URS000075D95B_9606	F	Homo sapiens (human) X inactive specific transcript (XIST)		gene_product
URS000075D95B_9606	URS000075D95B_9606	F	Homo sapiens (human) X inactive specific transcript (XIST)		gene_product
URS000075D95B_9606	URS000075D95B_9606	P	Homo sapiens (human) X inactive specific transcript (XIST)		gene_product
URS000075D95B_9606	URS000075D95B_9606	P	Homo sapiens (human) X inactive specific transcript (XIST)		gene_product
URS000075D95B_9606	URS000075D95B_9606	P	Homo sapiens (human) X inactive specific transcript (XIST)		gene_product
URS000075D95B_9606	URS000075D95B_9606	P	Homo sapiens (human) X inactive specific transcript (XIST)		gene_product
URS000075D95B_9606	URS000075D95B_9606	P	Homo sapiens (human) X inactive specific transcript (XIST)		gene_product
URS000075D95B_9606	URS000075D95B_9606	C	Homo sapiens (human) X inactive specific transcript (XIST)		gene_product
URS000075D95B_9606	URS000075D95B_9606	C	Homo sapiens (human) X inactive specific transcript (XIST)		gene_product
URS000075D95B_9606	URS000075D95B_9606	C	Homo sapiens (human) X inactive specific transcript (XIST)		gene_product
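
The comparison above can also be scripted. The following is a minimal sketch, not the pipeline's own code: the function names `types_by_id` and `changed_types` are made up for illustration, and it assumes plain tab-separated GAF lines with the DB Object ID in column 2 and the DB Object Type in column 12.

```python
from collections import defaultdict

ID_COL = 1     # column 2 (DB Object ID), zero-based
TYPE_COL = 11  # column 12 (DB Object Type), zero-based


def types_by_id(gaf_lines):
    """Map each DB Object ID to the set of column-12 values seen for it."""
    seen = defaultdict(set)
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip GAF header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) > TYPE_COL:
            seen[cols[ID_COL]].add(cols[TYPE_COL])
    return seen


def changed_types(source_lines, published_lines):
    """Return {id: (source_types, published_types)} for IDs whose types differ."""
    src = types_by_id(source_lines)
    pub = types_by_id(published_lines)
    return {
        obj_id: (src[obj_id], pub[obj_id])
        for obj_id in src.keys() & pub.keys()  # only IDs present in both files
        if src[obj_id] != pub[obj_id]
    }
```

For locally downloaded copies of the two files, each can be opened with `gzip.open(path, "rt")` and passed in as the line iterables.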
  1. The RNA type should be preserved
  2. We should have a specific QC check on RNAC: anything with an RNAC ID must be an RNA subtype
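
A rough sketch of what the QC check in point 2 might look like. This is illustrative only: `RNA_TYPES` is a hypothetical allow-list (the thread has not settled the final set), `rnacentral_type_violations` is a made-up name, and the check assumes column 1 of the GAF holds the literal prefix `RNAcentral`.

```python
# Hypothetical allow-list of RNA types; the real list is still under discussion.
RNA_TYPES = {"ncRNA", "lncRNA", "miRNA", "rRNA", "tRNA", "snRNA", "snoRNA"}


def rnacentral_type_violations(gaf_lines, rna_types=RNA_TYPES):
    """Yield (line_number, object_id, object_type) for RNAcentral rows
    whose column-12 value is not one of the accepted RNA types."""
    for number, line in enumerate(gaf_lines, start=1):
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blanks
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # malformed row; a real check would report this separately
        db, obj_id, obj_type = cols[0], cols[1], cols[11]
        if db == "RNAcentral" and obj_type not in rna_types:
            yield number, obj_id, obj_type
```

Run against the published file, a check like this would flag every XIST row shown above, since `gene_product` is not an RNA type.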

Aside for @alexsign (this should probably be its own ticket):

Why don't we get gene symbols for RNA types? This one (Xist) clearly has one https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:12810 - why don't we just propagate across from HGNC?

And not to overstuff this issue, but there are issues with general RNAC/HGNC propagation on AGR. Recall AGR uses HGNCs:
https://www.alliancegenome.org/gene/HGNC:12810
no GO annotation

Even though this gene obviously has a known function:
https://amigo.geneontology.org/amigo/gene_product/RNAcentral:URS000075D95B_9606

@kltm
Member

kltm commented Jan 29, 2024

@pgaudet Could we add this to the GORULEs project, as it may be due to a filter?

@cmungall cmungall changed the title RNCA RNA types are getting mangled by the pipeline RNAC RNA types are getting mangled by the pipeline Jan 30, 2024
@pgaudet
Contributor

pgaudet commented Jan 30, 2024

You mean there is a GORULE that changes the entity type?
Could you please point to which rule that is?

Thanks, Pascale

@pgaudet
Contributor

pgaudet commented Jan 30, 2024

(I don't have permissions to add this to the GO-rules project; @kltm would you please do it?)

@kltm
Member

kltm commented Jan 30, 2024

@pgaudet To clarify, I suspect the issue is that there is a "silent rule" that is converting (or dropping and re-adding information) such that field value lncRNA is getting outputted as gene_product. Technically, this may be an incorrect implementation of GORULE:0000001; let's keep in mind the GAF 2.2 doc statement:

DB Object Type will be one of the following: protein_complex; protein; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; any subtype of ncRNA in the [Sequence Ontology](http://www.sequenceontology.org/browser/obob.cgi). If the precise product type is unknown, gene_product should be used. (https://geneontology.org/docs/go-annotation-file-gaf-format-2.2/#db-object-type-column-12).

With that definition, this would then include the subtypes of http://www.sequenceontology.org/browser/current_release/term/SO:0000655 (ncRNA), one of which is lncRNA. Looking at annotations in AmiGO, I'd note that we have lncRNA_gene, antisense_lncRNA, and lnc_RNA. The last one there would be a variant of lncRNA, and I believe incorrect: the spec does not specify synonyms, but at the very least we should normalize to the proper term name.

My guess for whatever is going on is that the parser for col 12 is mistakenly bumping lncRNA and mistakenly allowing lnc_RNA in (or not normalizing). Ideally, we normalize to lncRNA; if not, I would at least expect lncRNA to pass in and lnc_RNA to be "fixed" to generic gene_product.

@cmungall
Member Author

Can we come up with a fixed static list of types? Saying "any subtype of ncRNA" is not good: there are 20 subtypes of tRNA, and no one should be using these. There is also the issue of labels potentially changing. The number of annotatable distinct meaningful ncRNA types should be small.

@mugitty
Contributor

mugitty commented Feb 22, 2024

I used @cmungall's example and was able to reproduce. The parser is doing a lookup and defaulting to gene_product (as given in the specs). Currently, there is an entry for 'lnc_RNA' mapped to SO:0001877, but not for 'lncRNA'. I can add an entry for 'lncRNA' and map it to SO:0001877. @pgaudet, please create a lookup for the supported types; I want to ensure all allowed types are mapped.

@kltm
Member

kltm commented Feb 22, 2024

@mugitty @pgaudet According to spec, it's a limited list plus a set of entries from the SO. As a compromise (#2246), as we're not actively using SO and likely have never done so, let's pull the "used" subset from the current SO and make our used list static for the moment to prevent drift and issues like we're currently having.

@pgaudet
Contributor

pgaudet commented Feb 23, 2024

@mugitty Can you give me all the types you find, and which ones are not mapped? It seems lncRNA should simply be a synonym of lnc_RNA.

I can see if I find matches that are more informative than 'gene product'.

Thanks, Pascale

@kltm
Member

kltm commented Feb 23, 2024

@pgaudet Noting from here (#2246), I think it's technically the opposite?

@pgaudet pgaudet changed the title RNAC RNA types are getting mangled by the pipeline RNAC RNA types are getting mangled by the pipeline (tested by gorule-0000001) Feb 26, 2024
@pgaudet pgaudet transferred this issue from geneontology/pipeline Feb 26, 2024
@pgaudet
Contributor

pgaudet commented Feb 26, 2024

After discussion with @mugitty , I am attaching the allowed entity types and the suggestions for replacement for others. We will first check errors with this list, and we can change the list if needed.

2024-02-26-entities.xlsx

@mugitty
Contributor

mugitty commented Feb 27, 2024

Thanks @pgaudet , I will update to use this list and output a warning, if defaulting to gene_product.
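
For illustration, the "look up, else default to gene_product with a warning" behaviour described here could be sketched as follows. This is not the actual parser code: `ALLOWED_TYPES` holds only a few label/CURIE pairs quoted elsewhere in this thread, and `resolve_type` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("entity_types")  # hypothetical logger name

# A small excerpt of label -> CURIE pairs quoted in this thread, not the full list.
ALLOWED_TYPES = {
    "protein": "PR:000000001",
    "ncRNA": "SO:0000655",
    "miRNA": "SO:0000276",
    "gene_product": "CHEBI:33695",
}
DEFAULT_TYPE = "gene_product"


def resolve_type(label, allowed=ALLOWED_TYPES):
    """Return (label, curie); fall back to gene_product with a warning if unknown."""
    if label in allowed:
        return label, allowed[label]
    logger.warning(
        "gorule-0000001: unknown DB Object Type %r, defaulting to %s",
        label,
        DEFAULT_TYPE,
    )
    return DEFAULT_TYPE, allowed[DEFAULT_TYPE]
```

Logging the original label keeps the fallback visible, so the warning count can be used to decide whether the list needs to grow.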

@kltm
Member

kltm commented Feb 27, 2024

As part of the "gaf tests", it would be good to add something to make sure that the synonyms are mapping back to the proper ID (i.e. lnc_RNA -> lncRNA).

@mugitty
Contributor

mugitty commented Feb 27, 2024

@kltm,
@pgaudet wants to use only the terms in the attachment. All others will default to gene_product with a gorule-0000001 warning. Based on the number of warnings, the list may be updated.

@kltm
Member

kltm commented Feb 27, 2024

@pgaudet Clarifying that you're removing lncRNA (SO:0001877), only 13 of those, so mapping to...gene_product as mentioned in the spec? Currently, in AmiGO filters, we also have:

lncRNA_gene	(6848)

What are these expected to map to? Without digging in, I think with the list you have
lncRNA_gene -> gene_product? Or is the intention to use the ontology to map to biological_region? Perhaps we should add what is currently used?

@mugitty
Contributor

mugitty commented Feb 28, 2024

@pgaudet , I noticed a test for MGI that was failing with the proposed code update. For example, if there is a GAF line as follows:
gaf = ["MGI", "MGI:1923503", "0610006L08Rik", "enables", "GO:0003674", "MGI:MGI:2156816|GO_REF:0000015", "ND", "",
"F", "RIKEN cDNA 0610006L08 gene", "", "gene", "taxon:10090", "20120430", "MGI", "", ""]

"gene" will be converted to "gene_product". Is this expected?

@pgaudet
Contributor

pgaudet commented Mar 21, 2024

Hi @mugitty
Can you check in this directory in all the files *-src.gaf.gz:
http://snapshot.geneontology.org/products/upstream_and_raw_data/index.html

whether entity types OTHER than the following are present:

protein_coding_gene SO:0001217
protein PR:000000001
gene_product CHEBI:33695
snRNA SO:0000274
ncRNA SO:0000655
rRNA SO:0000252
mRNA SO:0000234
lincRNA SO:0001463
tRNA SO:0000253
snoRNA SO:0000275
miRNA SO:0000276
scRNA SO:0000013
piRNA SO:0001035
tmRNA SO:0000584
SRP_RNA SO:0000590
ribozyme SO:0000374
telomerase_RNA SO:0000390
RNase_P_RNA SO:0000386
antisense_RNA SO:0000644
RNase_MRP_RNA SO:0000385
guide_RNA SO:0000602
hammerhead_ribozyme SO:0000380
pseudogene SO:0000336
protein_complex GO:0032991
antisense_lncRNA SO:0001904
gene_segment SO:3000000
genetic_marker SO:0001645
biological_region SO:0001411
transposable_element_gene SO:0000111

and spit out any entity type that doesn't match these, on a file-by-file basis.
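
A sketch of how such a file-by-file report might be produced, assuming the `*-src.gaf.gz` files have been downloaded into a local directory. The names `unexpected_types` and `report` are made up for illustration; the `ALLOWED` set copies the labels listed above (with `biological_region` spelled as in SO).

```python
import gzip
from collections import Counter
from pathlib import Path

# Labels from the list above; anything else in column 12 gets reported.
ALLOWED = {
    "protein_coding_gene", "protein", "gene_product", "snRNA", "ncRNA", "rRNA",
    "mRNA", "lincRNA", "tRNA", "snoRNA", "miRNA", "scRNA", "piRNA", "tmRNA",
    "SRP_RNA", "ribozyme", "telomerase_RNA", "RNase_P_RNA", "antisense_RNA",
    "RNase_MRP_RNA", "guide_RNA", "hammerhead_ribozyme", "pseudogene",
    "protein_complex", "antisense_lncRNA", "gene_segment", "genetic_marker",
    "biological_region", "transposable_element_gene",
}


def unexpected_types(lines, allowed=ALLOWED):
    """Count column-12 values that are not in the allowed set."""
    counts = Counter()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blanks
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 12 and cols[11] not in allowed:
            counts[cols[11]] += 1
    return counts


def report(directory):
    """Print one 'file<TAB>type<TAB>count' line per unexpected type, per file."""
    for path in sorted(Path(directory).glob("*-src.gaf.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            found = unexpected_types(handle)
        for obj_type, n in sorted(found.items()):
            print(f"{path.name}\t{obj_type}\t{n}")
```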

@pgaudet
Contributor

pgaudet commented Mar 21, 2024

Alternatively - or in addition, could you give me a count of these different types:

protein_coding_gene SO:0001217
protein PR:000000001
gene_product CHEBI:33695
snRNA SO:0000274
ncRNA SO:0000655
rRNA SO:0000252
mRNA SO:0000234
lincRNA SO:0001463
tRNA SO:0000253
snoRNA SO:0000275
miRNA SO:0000276
scRNA SO:0000013
piRNA SO:0001035
tmRNA SO:0000584
SRP_RNA SO:0000590
ribozyme SO:0000374
telomerase_RNA SO:0000390
RNase_P_RNA SO:0000386
antisense_RNA SO:0000644
RNase_MRP_RNA SO:0000385
guide_RNA SO:0000602
hammerhead_ribozyme SO:0000380
pseudogene SO:0000336
protein_complex GO:0032991
antisense_lncRNA SO:0001904
gene_segment SO:3000000
genetic_marker SO:0001645
biological_region SO:0001411
transposable_element_gene SO:0000111
gene SO:0000704
lincRNA_gene SO:0001641
lncRNA_gene SO:0002127
miRNA_gene SO:0001265
mRNA SO:0000234
ncRNA_gene SO:0001263
primary_transcript SO:0000185
RNA SO:0000356
RNase_MRP_RNA_gene SO:0001640
RNase_P_RNA_gene SO:0001639
rRNA_gene SO:0001637
scRNA_gene SO:0001266
sense_intronic_ncRNA_gene SO:0002184
sense_overlap_ncRNA_gene SO:0002183
snoRNA_gene SO:0001267
snRNA_gene SO:0001268
SRP_RNA_gene SO:0001269
telomerase_RNA_gene SO:0001643
transcript SO:0000673
tRNA_gene SO:0001272

  • as well as any type not matching the above

Thanks, Pascale

@kltm
Member

kltm commented Mar 21, 2024

@mugitty @pgaudet I'm running a job to get numbers on col12.

@kltm
Member

kltm commented Mar 21, 2024

    570 antisense_lncRNA
      1 antisense_lncRNA_gene
   6262 antisense_RNA
    188 autocatalytically_spliced_intron
 543618 gene
 362496 gene_product
    170 gene_segment
      4 guide_RNA
   2562 hammerhead_ribozyme
      2 lincRNA
 132472 lncRNA
     23 lnc_RNA
     18 lncRNA_gene
  32965 miRNA
      2 miRNA_gene
1719451 misc_RNA
  47234 mRNA
 201054 ncRNA
   8172 other
    464 piRNA
   2923 precursor_RNA
 147790 pre_miRNA
1357853612 protein
 406877 protein_coding_gene
  22567 protein_complex
   1005 pseudogene
     72 pseudogenic_transcript
   3606 ribozyme
    434 RNA
   5547 RNase_MRP_RNA
 306437 RNase_P_RNA
12871329 rRNA
    192 scaRNA
    268 scRNA
      1 scRNA_gene
     15 siRNA
      1 sncRNA
 362852 snoRNA
 871408 snRNA
 170884 sRNA
 195433 SRP_RNA
   1053 telomerase_RNA
 103559 tmRNA
      2 transposable_element_gene
10232300 tRNA
      9 tRNA_gene
      2 uORF
      8 vault_RNA
      6 Y_RNA

@kltm
Member

kltm commented Mar 21, 2024

(Noting that reactome and zfin need to [obviously] fix their GAF.)

@mugitty
Contributor

mugitty commented Mar 22, 2024

@pgaudet, do you still want me to output the types for each file or is @kltm 's output good enough for now?

@pgaudet
Contributor

pgaudet commented Mar 26, 2024

> (Noting that reactome and zfin need to [obviously] fix their GAF.)

Isn't this gorule-0000001?
It seems it should be a hard error.

@pgaudet pgaudet self-assigned this Mar 26, 2024
@pgaudet
Contributor

pgaudet commented Mar 26, 2024

@mugitty

Yes @kltm 's output is fine to get started.

@mugitty
Contributor

mugitty commented Mar 26, 2024

@pgaudet, just to confirm: the types you added to this ticket on February 26, 2024 are valid? I have already updated the code.

Can I open a pull request?

@pgaudet
Contributor

pgaudet commented Apr 25, 2024

For reference, these are the types that GOA loads from RNAcentral:

rRNA 12894606
tRNA 10330538
misc_RNA 1720365
snRNA 833596
snoRNA 350816
RNase_P_RNA 307952
ncRNA 199135
SRP_RNA 194578
sRNA 167541
pre_miRNA 138520
lncRNA 114530
tmRNA 104453
miRNA 24008
other 7295
antisense_RNA 6196
RNase_MRP_RNA 5464
ribozyme 3603
precursor_RNA 2893
hammerhead_ribozyme 2556
telomerase_RNA 885
piRNA 445
scRNA 262
scaRNA 195
autocatalytically_spliced_intron 189
siRNA 15
vault_RNA 8
Y_RNA 6
guide_RNA 2

@pgaudet
Contributor

pgaudet commented May 8, 2024

@kltm should I make a new GO rule for entity types?

@kltm
Member

kltm commented May 8, 2024

@pgaudet That would be great.

@pgaudet pgaudet added this to In progress in GORULES (low-hanging fruit) May 13, 2024
@pgaudet
Contributor

pgaudet commented May 29, 2024

Hi @mugitty

Here are repairs we should implement:

  • lnc_RNA should be repaired to lncRNA (it should also be replaced in the code, in the list of CURIEs)
  • lncRNA is SO:0001877

These have to be added to the GO list of CURIEs:

  • pre_miRNA is SO:0001244
  • antisense_lncRNA_gene is SO:0002182
  • scaRNA is SO:0002095
  • pseudogenic_transcript is SO:0000516
  • siRNA is SO:0000646
  • autocatalytically_spliced_intron is SO:0000588
  • vault_RNA is SO:0000404
  • Y_RNA is SO:0000405

That should take care of many issues.

However, these types are not in SO:

  • misc_RNA
  • other
  • precursor_RNA
  • sncRNA
  • sRNA
  • uORF

@mugitty and I propose to continue to change them to gene_product and output a warning.

Thanks, Pascale
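
The repairs proposed in this comment can be summarized as data plus one small function. This is only a sketch of the proposal, not the code that was merged: the names `SYNONYMS`, `NEW_CURIES`, `NOT_IN_SO` and `repair_type` are hypothetical, while the labels and CURIEs are the ones listed above (the `gene_product` CURIE is the one quoted earlier in the thread).

```python
# Label repairs: non-canonical spelling -> canonical label.
SYNONYMS = {"lnc_RNA": "lncRNA"}

# Label -> CURIE pairs from the comment above.
NEW_CURIES = {
    "lncRNA": "SO:0001877",
    "pre_miRNA": "SO:0001244",
    "antisense_lncRNA_gene": "SO:0002182",
    "scaRNA": "SO:0002095",
    "pseudogenic_transcript": "SO:0000516",
    "siRNA": "SO:0000646",
    "autocatalytically_spliced_intron": "SO:0000588",
    "vault_RNA": "SO:0000404",
    "Y_RNA": "SO:0000405",
}

# Types reported as absent from SO; these become gene_product with a warning.
NOT_IN_SO = {"misc_RNA", "other", "precursor_RNA", "sncRNA", "sRNA", "uORF"}


def repair_type(label):
    """Return (label, curie_or_None, warning_or_None) after applying the repairs."""
    label = SYNONYMS.get(label, label)  # e.g. lnc_RNA -> lncRNA
    if label in NEW_CURIES:
        return label, NEW_CURIES[label], None
    if label in NOT_IN_SO:
        warning = f"{label} is not in SO; changed to gene_product"
        return "gene_product", "CHEBI:33695", warning
    return label, None, None  # left for the existing lookup to handle
```

Keeping the synonym step separate from the CURIE table means `lnc_RNA` and `lncRNA` always resolve to the same ID, which is the property the suggested GAF tests would check.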

@pgaudet
Contributor

pgaudet commented Jul 10, 2024

I need to add tests for the entity types; can we first check snapshot to see if disallowed types are being reported?

@pgaudet
Contributor

pgaudet commented Jul 18, 2024

This problem is fixed:

Image

@pgaudet
Contributor

pgaudet commented Jul 18, 2024

lncRNA in AmiGO:

Image

lncRNA in staging:

Image

Projects
Status: 2024-07-14 snapshot
GORULES (low-hanging fruit)
  
Clearing - needs testing