Skip to content

ucscXena/Xena-GDC-auto-test

Repository files navigation

Project Overview

UCSC Xena is an online tool that allows users to analyze and visualize genomics data[1]. The website supports various data types from multiple platforms such as the Genomics Data Commons (GDC)[2]. Data types supported by the GDC Xena hub include RNA-seq gene expression, DNA methylation, clinical data, Gene-level mutation, survival, and more. The data from the GDC was last imported into UCSC Xena 4 years ago using the Xena-GDC-ETL code[7]. This means that Xena users lack access to the newer and updated GDC data. The quantity of data that needs to be updated is significant, which makes it impossible to completely test the data manually. This project aims to create a set of automatic testing scripts to efficiently validate the accuracy of each data type imported from the GDC through the Xena-GDC-ETL code[7], both for data from existing projects on the Xena browser such as The Cancer Genome Atlas (TCGA)[5], as well as new projects like CTSP-DLBCL1[6] to the Xena platform. These scripts will complement some manual spot-checking as well as some manual tests which can not be automated.

Background: GDC

The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified repository and cancer knowledge base that enables data sharing across cancer genomic studies in support of precision medicine. The GDC provides public access to data from important national and international cancer research projects such as TCGA, Therapeutically Applicable Research to Generate Effective Treatments (TARGET) [3], and Genomics Evidence Neoplasia Information Exchange (GENIE)[4]. This data can be retrieved and downloaded using the GDC API’s various endpoints, such as “/cases”, “/files”, “/projects”, etc. These endpoints are used by the Xena-GDC-ETL to import data, transform the data into files that are compatible with Xena, and transfer it to the Xena hub.

Objective

A multitude of researchers use data from the GDC database on the Xena browser, making the accuracy of this data a major priority. The data is too large to be tested entirely by hand, hence it is crucial to automatically test the data from the GDC. This project consists of:

  • Understanding the GDC API’s search, retrieval, and download.
  • Develop multiple automatic tests for GDC’s RNA-seq data
  • Develop an automatic test for GDC survival data
  • Develop an automatic test for GDC clinical data

Methods

Testing the RNA-seq gene expression data on CTSP-DLBCL1[6]

The first project chosen to import from the GDC using Xena-GDC-ETL was CTSP-DLBCL1. This is due to the project’s small case size of 45 and the fact that it was limited in its types of genomic data to only RNA-seq data.

I wrote a script that compared the four types of gene expression data (counts, FPKM, FPKM-UQ, and TPM) in the GDC with the .tsv file generated by the Xena-GDC-ETL code. First, my script independently downloads the RNA-seq data using the GDC API. Then it reformats both files into a pandas data frame[8] to compare every value on a cell-by-cell basis. Any inconsistencies in the data are reported to the command line once the script is finished. Each cell’s value, gene ID, and sample ID are saved into a list that is printed out at the end of the program.

In addition to comparing the values cell-by-cell, I also created a test to find the correlation between the different data types of RNA-seq data and ensure that the values are close to one. I wrote a script to run a Pearson correlation[8] between all four RNA-seq data types, resulting in six comparisons. This is done on a sample-by-sample basis, meaning the correlation test only checks one sample, but every gene. However, counts are the number of sequence reads of a gene. Because gene sizes may vary significantly, this means that larger genes will generally have more counts. This is different from FPKM, FPKM-uq, and TPM, which are normalized to one million read counts. This means that these three data types measure the density of read counts, while counts are the total number of read counts. Due to this difference, any correlation test with counts is done on a gene-by-gene basis, meaning that only one gene is measured for every sample. This is done to prevent any variation between genes. All of the values are plotted onto a scatter plot with the axes being data types.

Testing the clinical data

For the clinical data I was not able to simply download the clinical data from the GDC API to compare to the .tsv file generated by the Xena-GDC-ETL code because the Xena-GDC-ETL code deletes fields that have no data. This is common for the clinical data from the GDC as each project reports different clinical data depending on the cancer type, project stage, and which fields the project was able to collect as part of the project. In light of this, the script I wrote first reads the fields from the .tsv file generated by the Xena-GDC-ETL. These fields are then used to limit the output from the calls to the GDC API clinical endpoint to only these fields. The results from the GDC API are loaded into a pandas data frame. Then, similar to the RNA-seq automated testing, the .tsv file is converted to a pandas data frame to compare each value. If there are multiple samples per case, the sample IDs are saved and run through the program a second time.

Testing the survival data

The GDC API has a specific endpoint for survival data called: “/analysis/survival”, which the Xena-GDC-ETL code uses this endpoint to retrieve survival data. This is a separate endpoint from the "cases" clinical data endpoint. I created a script to retrieve survival data of a specified project using the “/analysis/survival” endpoint, convert it into a data frame, and compare it with the Xena-GDC-ETL generated data frame, similar to the testing script for the RNA-seq data.

Further, I created an automated test to verify the internal consistency within the GDC between the survival and "cases" clinical endpoints. It first retrieves the relevant survival fields for a specific project using the GDC API’s “/cases” endpoint. This endpoint returns data on a case-by-case basis, which is then formatted into a data frame. From here it takes the 'vital status' as whether the individual is alive or dead. It then takes the greatest value of the following fields: “days to death”, “days to last follow up”, “days to last known disease status”, “days to recurrence”, and more, as these are all related to the calculation of time to event, whether that is being alive or dead. Finally, the .tsv file output by the Xena-GDC-ETL code is converted to a data frame as well, so that every cell can be compared. While the first test was conducted on the CTSP-DLBCL1 project, we also conducted multiple tests on TCGA-BRCA and TCGA-UCEC.

Results & Outcomes

RNAseq data automated testing results:

Each automated testing script was run on the CTSP-DLBCL1 project. When testing RNA-seq data of CTSP-DLBCL1 imported by the Xena-GDC-ETL, all data matched, meaning that the import done by the Xena-GDC-ETL code was accurate.

Clinical and survival data automated testing results:

There were several inconsistencies between the survival and clinical data. When using the “/cases” endpoint, several cases that were alive but had follow-up data, were missing from the GDC survival endpoint. These inconsistencies were in the CTSP-DLBCL1 project (an example is CTSP-AD1S), but not in the TCGA-BRCA or TCGA-UCEC projects. Initial analysis of the clinical and survival endpoints leads me to believe that this is due to the fact that the CTSP-DLBCL1 projects submitted follow-up data to the field: ‘days to follow up’, whereas the TCGA project submitted to the field ‘days to last follow up’. Further analysis and follow up are needed to verify this. Despite this internal consistency within the GDC, the Xena-GDC-ETL import worked as expected where, for the project CTSP-DLBCL1, the “/analysis/survival” endpoint leads to the same results as the Xena-GDC-ETL import.

Another example of different endpoints producing different results can be found in the results of the clinical data automated testing. The automated testing script uses the “/cases” endpoint, while the Xena-GDC-ETL code uses the “/projects” endpoint. Due to this difference, only one diagnosis of a case is retrieved by Xena-GDC-ETL, compared to all diagnoses by the automated testing script. An example of this is the case, CTSP-AD1S, which has 2 diagnoses, but only one is found in the Xena-GDC-ETL.[jing: detail of the CTSP case that has the inconsistency.]

Correlation analysis

alt_text

Correlation: 0.9999999999999867

alt_text

Correlation: 0.999999999999983

alt_text

Correlation: 0.9880884397613635

alt_text

Correlation: 0.9999999999999917

alt_text

Correlation: 0.9545050830065851

alt_text

Correlation: 0.9587627363186685

Figure 1: Correlation values between various data types

Highly significant positive correlation between various RNA-seq data types was generally expected. Correlation values between TPM, FPKM, and FPKM-UQ were all nearly 1, which is expected as they’re normalized per million reads. When comparing counts, which are not normalized, to other data types, the correlation values were lower, between 0.94 and 0.98. This is still close to 1.

Usage Instructions

RNA-seq gene expression automated testing

Command line:

[This file name] [Xena-GDC-ETL imported .tsv file name] [data type] [debug mode]

Valid data types: [“fpkm”, “fpkm-uq”, “tpm”, “star_counts”]

Valid debug modes: [“True”, “False”]

Example:

python3 XenaGeneExpressionMatrixValidation.py /Users/Downloads/CTSP-DLBCL1.star_tpm.tsv tpm False

Example results:

alt_text

RNA-seq gene expression correlation testing

Command line: [This file name] [Xena-GDC-ETL imported .tsv file name(data type 1)] [file name(data type 2)] [file name(data type 3)] [file name(data type 4)]

Example:

python3 RNAseqPCC.py

/Users/Downloads//CTSP-DLBCL1.star_tpm.tsv /Users/Downloads/CTSP-DLBCL1.star_fpkm-uq.tsv /Users/Downloads/CTSP-DLBCL1.star_fpkm.tsv /Users/Downloads/CTSP-DLBCL1.star_counts.tsv

Example results:

alt_text

alt_text

Survival Automated Testing (/cases endpoint)

Command line:

[This file name] [Xena-GDC-ETL imported .tsv file] [project id]

Example:

python3 XenaSurvivalMatrixValidation.py /Users/Downloads/CTSP-DLBCL1.survival.tsv CTSP-DLBCL1

Example results:

alt_text

Survival Automated Testing(/analysis/survival endpoint)

Command line:

[This file name] [Xena-GDC-ETL imported .tsv file] [project id]

Example:

python3 XenaSurvivalAnalysisEndptValidation.py /Users/Downloads/CTSP-DLBCL1.survival.tsv CTSP-DLBCL1

Example results:

alt_text

Clinical Automated Testing

Command line:

[This file name] [Xena-GDC-ETL imported .tsv file]

Example:

python3 XenaClinicalDataValidation.py /Users/Downloads/CTSP-DLBCL1.clinical.tsv

Example results:

alt_text

Reference

[1]Goldman, M.J., Craft, B., Hastie, M. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0546-8

[2]Heath AP, Ferretti V, Agrawal S, An M, Angelakos JC, Arya R, Bajari R, Baqar B, Barnowski JHB, Burt J, Catton A, Chan BF, Chu F, Cullion K, Davidsen T, Do PM, Dompierre C, Ferguson ML, Fitzsimons MS, Ford M, Fukuma M, Gaheen S, Ganji GL, Garcia TI, George SS, Gerhard DS, Gerthoffert F, Gomez F, Han K, Hernandez KM, Issac B, Jackson R, Jensen MA, Joshi S, Kadam A, Khurana A, Kim KMJ, Kraft VE, Li S, Lichtenberg TM, Lodato J, Lolla L, Martinov P, Mazzone JA, Miller DP, Miller I, Miller JS, Miyauchi K, Murphy MW, Nullet T, Ogwara RO, Ortuño FM, Pedrosa J, Pham PL, Popov MY, Porter JJ, Powell R, Rademacher K, Reid CP, Rich S, Rogel B, Sahni H, Savage JH, Schmitt KA, Simmons TJ, Sislow J, Spring J, Stein L, Sullivan S, Tang Y, Thiagarajan M, Troyer HD, Wang C, Wang Z, West BL, Wilmer A, Wilson S, Wu K, Wysocki WP, Xiang L, Yamada JT, Yang L, Yu C, Yung CK, Zenklusen JC, Zhang J, Zhang Z, Zhao Y, Zubair A, Staudt LM, Grossman RL. The NCI Genomic Data Commons. Nat Genet. 2021 Mar;53(3):257-262. doi: 10.1038/s41588-021-00791-5. Erratum in: Nat Genet. 2021 Jun;53(6):935. PMID: 33619384.

[3]Ma X, Edmonson M, Yergeau D, Muzny DM, Hampton OA, Rusch M, Song G, Easton J, Harvey RC, Wheeler DA, Ma J, Doddapaneni H, Vadodaria B, Wu G, Nagahawatte P, Carroll WL, Chen IM, Gastier-Foster JM, Relling MV, Smith MA, Devidas M, Guidry Auvil JM, Downing JR, Loh ML, Willman CL, Gerhard DS, Mullighan CG, Hunger SP, Zhang J. Rise and fall of subclones from diagnosis to relapse in pediatric B-acute lymphoblastic leukaemia. Nat Commun. 2015 Mar 19;6:6604. doi: 10.1038/ncomms7604. PMID: 25790293; PMCID: PMC4377644.

[4]Micheel CM, Sweeney SM, LeNoue-Newton ML, André F, Bedard PL, Guinney J, Meijer GA, Rollins BJ, Sawyers CL, Schultz N, Shaw KRM, Velculescu VE, Levy MA; AACR Project GENIE Consortium. American Association for Cancer Research Project Genomics Evidence Neoplasia Information Exchange: From Inception to First Data Release and Beyond-Lessons Learned and Member Institutions' Perspectives. JCO Clin Cancer Inform. 2018 Dec;2:1-14. doi: 10.1200/CCI.17.00083. PMID: 30652542; PMCID: PMC6873906.

[5]National Cancer Institute, 2023 The Cancer Genome Atlas Program (TCGA). www.cancer.gov/ccg/research/genome-sequencing/tcga Accessed 19 June 2023.

[6]Schmitz R, Wright GW, Huang DW, Johnson CA, Phelan JD, Wang JQ, Roulland S, Kasbekar M, Young RM, Shaffer AL, Hodson DJ, Xiao W, Yu X, Yang Y, Zhao H, Xu W, Liu X, Zhou B, Du W, Chan WC, Jaffe ES, Gascoyne RD, Connors JM, Campo E, Lopez-Guillermo A, Rosenwald A, Ott G, Delabie J, Rimsza LM, Tay Kuang Wei K, Zelenetz AD, Leonard JP, Bartlett NL, Tran B, Shetty J, Zhao Y, Soppet DR, Pittaluga S, Wilson WH, Staudt LM. Genetics and Pathogenesis of Diffuse Large B-Cell Lymphoma. N Engl J Med. 2018 Apr 12;378(15):1396-1407. doi: 10.1056/NEJMoa1801445. PMID: 29641966; PMCID: PMC6010183.

[7] Xena-GDC-ETL, https://github.com/ucscXena/xena-GDC-ETL Accessed 11 August 2023.

[8] pandas - Python Data Analysis Library. (n.d.), https://pandas.pydata.org/ Accessed 14 August 2023.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages