WO2019064063A1 - Biomarkers for colorectal cancer detection - Google Patents

Biomarkers for colorectal cancer detection

Publication number
WO2019064063A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
cancer
crc
samples
subject
Prior art date
Application number
PCT/IB2018/001169
Other languages
French (fr)
Inventor
Nicolas James WALKER
Vitali Proutski
Kate Joanne HOWELL
Sandro MORGANELLA
Original Assignee
Cambridge Epigenetix Limited
Priority date
Filing date
Publication date
Application filed by Cambridge Epigenetix Limited
Priority to EP18807398.5A (patent EP3688195A1)
Publication of WO2019064063A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/154: Methylation markers

Definitions

  • the methods and kits as described herein may provide identification of samples from a subject as benign or malignant for a cancer. This method may be an improvement in the field of analyzing samples from a subject.
  • FIG. 1 shows 110 total colorectal cancer (CRC) and healthy volunteer (HV) plasma samples processed through the HMCP v2 protocol.
  • FIG. 2A - FIG. 2C shows the HMCP-110 study design and initial sample set.
  • FIG. 3A - FIG. 3E shows the HMCP-110 study design and sample set breakdown.
  • FIG. 4A - FIG. 4H shows the HMCP-110 data quality control outlining that technical parameters did not affect quality or bias results.
  • FIG. 5A - FIG. 5D shows no operator-related batch effect in the HMCP-110 dataset.
  • FIG. 6A - FIG. 6B shows HMCP-110 data/feature exploratory analysis.
  • FIG. 7A - FIG. 7B shows the HMCP-110 differential feature analysis of gene bodies identified a high number of discriminating genes.
  • FIG. 8A - FIG. 8E shows the top 20 differential genes in the HMCP-110 differential feature analysis are a mixture of hypo- and hyper-hydroxymethylated loci.
  • FIG. 9A - FIG. 9B shows an example gene, ZIC4, showing concordance between cell free DNA (cfDNA) and genomic DNA (gDNA) 5-hydroxymethylated cytosine (5-hmC) profiles.
  • FIG. 10 shows a comparison of differential genes in CRC vs. HV having functional significance based on most variable features.
  • FIG. 11 shows a comparison of differential genes in CRC vs. HV having functional significance based on most variable features.
  • FIG. 12A - FIG. 12B shows a high number of discriminating features identified in the HMCP-110 differential feature analysis of enhancers.
  • FIG. 13A - FIG. 13E shows a 6-fold cross-validation using top varying genes with read counts over 30 in HMCP-110 classification.
  • FIG. 14A - FIG. 14E shows a 6-fold cross-validation using top varying genehancers with read counts over 30 in HMCP-110 classification.
  • FIG. 15 shows HMCP-110 classification using a Lasso regression model to develop classifiers based on training sets to be assessed using test sets.
  • FIG. 16A - FIG. 16B shows the performance of two Lasso-based signatures (gene and genehancer) for CRC vs. HV assessed using test sets. Lasso signatures predict CRC vs. HV disease status in test set with > 91% sensitivity and 80% specificity.
  • FIG. 17A - FIG. 17F shows CRC vs. HV class separation based on Lasso signature features.
  • FIG. 18A - FIG. 18B shows the performance of two Lasso-based signatures (gene and genehancer) for early CRC vs. HV assessed using test sets. Lasso signatures predict early CRC vs. HV disease status in test set with > 93% sensitivity and 80% specificity.
  • FIG. 19A - FIG. 19C shows feature overlap between CRC vs. HV and early CRC vs. HV gene Lasso signatures.
  • FIG. 20A - FIG. 20B shows histogram data from the HMCP-110 method.
  • FIG. 21A - FIG. 21B shows differential feature analysis of genes and genehancer filtered for read count only (>30).
  • FIG. 22A - FIG. 22B shows pie charts for top 50 genes.
  • FIG. 23A - FIG. 23D shows peak analysis.
  • FIG. 24A - FIG. 24B shows HMCP-110 profile of ZIC4 and ZIC1 genes.
  • FIG. 25A - FIG. 25C shows boxplots of key genes (FIGN, SIX1, ZIC4) with gDNA from tumours.
  • FIG. 26A - FIG. 26E shows 6-fold cross-validation using top varying genes for HMCP-110 classification.
  • FIG. 27A - FIG. 27E shows 6-fold cross-validation using top varying genehancers for HMCP-110 classification.
  • FIG. 28A - FIG. 28B shows permutation tests (AUC) for SVM models for genes.
  • FIG. 29A - FIG. 29B shows permutation tests (AUC) for SVM models for genehancers.
  • FIG. 30A - FIG. 30D shows HMCP-110 data/feature exploratory analysis.
  • FIG. 31A - FIG. 31B shows a histogram of genehancer signature and label permutation test.
  • FIG. 32A - FIG. 32B shows HMCP-110 study sample composition and parameters imbalance.
  • FIG. 33 shows the HMCP-110 protocol overview.
  • FIG. 34A shows a gene list of biomarkers for CRC-HV (single application), an application of the LASSO model.
  • FIG. 34B shows a gene list of biomarkers for earlyCRC-HV (single application), an application of the LASSO model.
  • FIG. 35 shows a gene list of biomarkers for 5% CRC-HV - Z-Normalization - a result of analysis to find robust gene signatures.
  • FIG. 36 shows the sample cohort numbers used in the HMCP003 secondary analysis.
  • FIG. 37A-C shows a distribution of the cohort based on three key variables - age, gender and cancer stage.
  • An age bias is visible in (FIG. 37A) with HV younger than CRC patients.
  • The age and gender distribution is less biased (FIG. 37B), but there is a bias by gender and cancer stage (FIG. 37C).
  • FIG. 38A-D shows results of the OSAT sample balancing analysis based on key variables across the 14 strip tubes needed for the HMCP v2 workflow.
  • Each bar of the histogram represents one strip tube processed in the workflow.
  • Each plot shows, for strip tubes 1-14, how well balanced each tube is for cancer stage, gender, extraction operator and day of extraction. No strip is found to be unbalanced based on chi-square tests.
  • FIG. 39A-E shows assessment of the quantity of DNA (concentration and yield) achieved by DNA extraction based on both Qubit and the Bioanalyser (BA) by key cohort metadata and extraction operator.
  • FIG. 40A-B shows an association of total mass (ng) of cfDNA that went into the library preparation stage (denoted conv ng) with sex and cancer stage.
  • the NetFlex adapters contain the library indexes needed for sequencing, which are well balanced across the operators.
  • FIG. 43A-D shows an association identified between the quantity of input cfDNA and the sequencing metrics, including the diversity, uniformity, and total de-duplicated reads.
  • FIG. 44A-D shows histograms and boxplots of the de-duplicated sequencing reads.
  • FIG. 45A-F shows an assessment of spike ins by clinical diagnosis and HMCP operator.
  • FIG. 46A-D shows an assessment of the diversity, uniformity and mitochondrial reads based on the run, operator and clinical diagnosis. Some variation is identified in the mitochondrial RPKMs for both input and pulldown (pBGT).
  • FIG. 50 shows the number of discriminatory features identified at several FDR thresholds. Many discriminating features are found for CRC vs. HV and early CRC vs. HV comparisons at an FDR threshold of 0.01.
  • FIG. 51 shows the top 20 discriminatory genes ranked by adjusted p-value for the CRC vs HV comparison (Mann-Whitney U test). For each gene, its specific prediction power in terms of AUC is computed.
  • FIG. 52A-F shows boxplots of the 6 top ranked genes by p-value from CRC vs HV comparison (top varying genes), all of which show an increased level of 5hmC enrichment in CRC over HV.
  • FIG. 53A-B shows 5hmC Enrichment Profile of ZIC4 and ZIC1 genes showing increased levels of 5hmC in CRC.
  • FIG. 54A-B shows 5hmC Enrichment Profile of SIX1 gene showing increased levels of 5hmC in CRC.
  • FIG. 55 shows a disease association table from the GeneAnalytics component of the GeneCards suite, for genes with an adjusted p-value < 0.05 in CRC vs HV comparison.
  • CRC is the top hit for the gene list.
  • FIG. 56 shows genes in the CRC vs HV set that are identified as differentially expressed in tissue samples in CRC.
  • FIG. 57 shows the top 20 genes directly associated with CRC using the VarElect component of the GeneCards database. CRC-related terms are top hits in this analysis.
  • FIG. 58 shows a disease association table from the GeneAnalytics component of the GeneCards suite, for genes with an adjusted p-value < 0.05 in CRC vs HV comparison using the All-genes list, which does not apply a filter based on coefficient of variation. CRC and other cancers are the top hits for the gene lists.
  • FIG. 59A-B shows histograms of the top 20 Biological pathways from GO analysis of genes with an FDR < 0.05 from early CRC vs. HV MWU results (both genders, with read count filtering). Under-enriched pathways are predominantly immune related (FIG. 59A) and over-enriched pathways are predominantly metabolism related (FIG. 59B).
  • FIG. 60A-B shows histograms of the top 20 Biological pathways from GO analysis of genes with an FDR < 0.05 from late CRC vs. HV MWU results (females only). Under-enriched pathways are immune related (FIG. 60A) and over-enriched pathways are related to adhesion, morphogenesis and development (FIG. 60B).
  • FIG. 61A-E shows ROC curves for SVM classifiers built on gene data for disease comparisons (varied by stage and gender). All classifiers may be built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve and permutation test (PT) p-values. The ROC curve achieved during each cross-validation (CV) is shown in light grey. All classifiers show high performance levels with AUCs>0.8.
  • FIG. 62A-E shows ROC curves for classifiers built on genehancer data for disease comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve and permutation test (PT) p-values.
  • FIG. 63A-E shows ROC curves for LR RFE classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (where appropriate). ROC curves are annotated with the mean area under the curve. All ROC curves show models built with 20 features. All classifiers show high performance levels with AUCs>0.8.
  • FIG. 64A-E shows ROC curves for LR RFE classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (where appropriate). ROC curves are annotated with the mean area under the curve. All ROC curves show models built with 20 features. All classifiers show high performance levels with AUCs>0.7.
  • FIG. 65 shows an overview of test and training sets.
  • FIG. 66A-B shows performance of LASSO regression model on Genes (AUC 0.883) and Genehancers (AUC 0.937). Final model results in 56 features using genes and 59 features using genehancers (3-fold cross-validation is used in the training process). All classifiers show high performance levels with AUCs>0.85.
  • FIG. 67 shows a summary of cross validation results using a LASSO regression model on gene features.
  • FIG. 68 shows a summary of independent test set performance using a LASSO regression model on gene features.
  • FIG. 69A-B shows PCA based on the list containing the 56 genes having non-zero weight in the computed LASSO model, which also supports the evidence that these features can be effectively used to distinguish between CRC and HV samples (FIG. 64A-E). PCs 1-3 shown.
  • FIG. 70A-B shows PCA based on the list containing the 59 genehancers having non-zero weight in the computed LASSO model, which also supports the evidence that these features can be effectively used to distinguish between CRC and HV samples (FIG. 64A-E). PCs 1-3 shown.
  • Final model results in 40 features using genes and 25 features using genehancers. 3-fold cross-validation is used. All classifiers show high performance levels with AUCs>0.85.
  • FIG. 72 shows a summary of cross validation results using a LASSO regression model on gene features for early CRC vs HV.
  • FIG. 73 shows a summary of cross validation results using a LASSO regression model on genehancer features for early CRC vs HV.
  • FIG. 74 shows the non-zero genes shared between early CRC vs HV and CRC vs HV classifiers.
  • The table reports the rank of each gene; genes with an outlined box have a negative weight (e.g., MRPS31P2 is the gene with the most negative weight in both classifiers).
  • FIG. 75A-75E MULTIQC PLOTS - Insert size as calculated by the Picard software suite. Run461 to Run465 represent the different sequencing batches. No untoward insert size anomalies were found.
  • FIG. 76A-76L Additional QC plots.
  • FIG. 76A - FIG. 76F Uniformity
  • FIG. 76G - FIG. 76H Results of iCNA show a mismatch in predicted gender and % tumour fraction predictions.
  • FIG. 76I - FIG. 76L are metrics from the deeptools plotFingerprint utility that summarise a diagnostic plot that gives an overview of aspects of genomic coverage. Both pBGT and input samples behave as expected, with pBGTs expected to have higher elbow/inflection points, lower AUC and higher x-intercept. No difference is observed by operator.
  • FIG. 77A-77N PCA of samples using features (Genes (FIG. 77A-C), Genehancers (FIG. 77D-F)) that have passed the read count thresholds (>30 reads in input and pBGT) and filtered by the coefficient of variation (>0.2 & <2).
  • The variance explained by each principal component for the gene and genehancer set is given in FIG. 77G-H, demonstrating that the majority of the variance is accounted for in the first three to four principal components.
  • FIG. 77I-N gives plots for genes and genehancers with only the read count thresholds (>30 reads in the input and pBGT).
  • FIG. 78A-78D PCAs of the top 20 discriminating/ranked genes for each of the patient subgroups as determined by the MWU test. Results are based on the MWU tests of the read count threshold feature lists (>30 reads in both pBGT and input). Clustering by diagnosis is evident based on the top 20 features alone.
  • FIG. 79A - 79D PCAs of the top 20 discriminating/ranked genehancers for each of the patient subgroup sourced from. Results are based on the MWU tests of the read count threshold feature lists (>30 reads in both pBGT and input). Clustering by diagnosis is evident based on the top 20 features alone.
  • FIG. 80A - 80D PCAs of the top 20 discriminating/ranked genes for each of the patient subgroups. Results are based on the MWU tests of the top varying features (>30 reads in both pBGT and input plus coefficient of variation >0.2 & <2). Clear separation between CRC and HV samples is demonstrated.
  • FIG. 81A - 81D PCAs of the top 20 discriminating/ranked genehancers for each of the patient subgroups. Results are based on the MWU tests of the top varying features (>30 reads in both pBGT and input plus coefficient of variation >0.2 & <2). Clustering by diagnosis is evident based on the top 20 features alone.
  • FIG. 82A - 82F Boxplots of the top 6 discriminating genehancers demonstrating separation between CRC and HVs (top varying list). Increased levels of 5hmC are found for CRC over HV for these top 6 genehancers.
  • FIG. 83A - 83F Boxplots of the top 6 discriminating genes demonstrating separation between early CRC and HVs (top varying list). Increased levels of 5hmC are found for early CRC over HV for these top 6 genes.
  • FIG. 84A - 84F Boxplots of the top 6 discriminating genes demonstrating separation between late CRC and HVs (top varying list). The majority of the top 6 genes show increased levels of 5hmC for late CRC over HV.
  • FIG. 85A - 85F Boxplots of the top 6 discriminating genehancers demonstrating separation between early CRC and HVs (top varying list). Increased levels of 5hmC are found for early CRC over HV for these top 6 genehancers.
  • FIG. 86A - 86F Boxplots of the top 6 discriminating genehancers demonstrating separation between late CRC and HVs (top varying list). Increased levels of 5hmC are found for late CRC over HV for these top 6 genehancers.
  • FIG. 87 Prediction score (in terms of AUC) of the top 20 most discriminating genes (top-varying comparison) between CRC and HV based on age groups. Those with a score > 0.7 are highlighted in red. The top 20 genes do not show any clear prediction power for these three age groups.
  • FIG. 88A - 88F Boxplots of the top 6 discriminating genes demonstrating separation between CRC and HVs by age group and diagnosis (top varying list). Enrichment levels are fairly similar across the age groups for HVs, while increased 5hmC levels are evident in the CRC patients.
  • FIG. 89A - 89F Boxplots of the top 6 discriminating genehancers demonstrating separation between CRC and HVs by age group and diagnosis (top varying list). Enrichment levels are fairly similar across the age groups for HVs, while increased 5hmC levels are evident in the CRC patients.
  • FIG. 91 Summary of DESeq2 results with covariates. A high number of features are identified as significantly discriminatory based on the default DESeq2 threshold of adjusted p-value < 0.1.
  • FIG. 92 DESeq vs. MWU rank comparison tests - Genes. Gender and age have a stronger effect in the early CRC comparisons. P-values from the rank comparison test < 0.05 are highlighted in red. The addition of the covariates makes the most difference for the early CRC vs. HV comparison.
  • FIG. 93 DESeq vs. MWU rank comparison tests - Genehancers. Gender and age have little effect on the rank comparisons. The addition of any covariates does not significantly affect the rank of the discriminating genehancer lists, with approximately ¾ of genehancers identified by both methods (DESeq2 and MWU tests).
  • FIG. 94A - 94F Top 6 genes ranked by DESeq2 test between CRC and HV including age and gender as covariates. Many of these genes (4/6) were also in the top 6 for the MWU test.
  • FIG. 95A - 95E Receiver operator characteristic (ROC) curves for SVM classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including variance filtering (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve, and a permutation test p-value is reported for each model. All classifiers achieved a mean AUC of 0.8 or above.
  • FIG. 96A - 96E Receiver operator characteristic (ROC) curves for SVM classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including variance filtering (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve, and a permutation test p-value is reported for each model. All classifiers achieved a mean AUC of 0.76 or above.
  • ROC Receiver operator characteristic
  • FIG. 97A - 97E ROC curves for logistic regression classifiers built on gene data for disease comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including coefficient of variation filtering (>0.2) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the ROC curves from each cross-validation round (light grey) and the mean area under the curve (green). All mean ROC curves show good performance with a minimum mean AUC of 0.83.
  • FIG. 98A - 98E Receiver operator characteristic (ROC) curves for logistic regression classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including coefficient of variation filtering (>0.2) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the ROC curves from each cross-validation round (light grey) and the mean area under the curve (green). All mean ROC curves show good performance with a minimum mean AUC of 0.79.
  • FIG. 99A - 99E Receiver operator characteristic (ROC) curves for logistic regression classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including a variance filter (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
  • FIG. 100A - 100E Receiver operator characteristic (ROC) curves for logistic regression classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including a variance filter (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
  • FIG. 101A - 101E ROC curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test (p-value>0.25) for age and/or gender (where appropriate) followed by RFE (20 features retained). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
  • LR Logistic Regression
  • RFE Recursive Feature Elimination
  • FIG. 102A - 102E ROC curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test (p-value>0.25) for age and/or gender (where appropriate) followed by RFE (20 features retained). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
  • FIG. 103A - 103B Receiver operator characteristic (ROC) curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on gene data for age group (≤61 and >61) comparisons. All classifiers were built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test for gender where appropriate (p-value>0.25), plus recursive feature elimination to 20 features. ROC curves are annotated with the mean area under the curve. Age classifiers perform worse than disease classifiers.
  • FIG. 104A - 104B Receiver operator characteristic (ROC) curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on genehancer data for age group (≤61 and >61) comparisons. All classifiers were built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test for gender where appropriate (p-value>0.25), plus recursive feature elimination to 20 features. ROC curves are annotated with the mean area under the curve. Age classifiers perform worse than disease classifiers.
  • FIG. 105A - 105B LASSO weights for gene and genehancer datasets. Only non-zero elements are reported.
  • FIG. 106 LASSO weights for genes in the early CRC vs HV classifier. Only nonzero elements are reported.
  • FIG. 107A - 107B PCA performed on the 40 non-zero genes in the early CRC vs
  • FIG. 108A - 108B PCA performed on the 13 non-zero genes shared between early
  • FIG. 109A - 109B PCA performed on the 13 non-zero genes shared between early CRC vs HV and CRC vs HV classifiers. The plots highlight CRC and HV samples. Although the results show a clear split between early CRC and HV samples, this separation is weaker than that observed before for early CRC and HV.
  • FIG. 110 Performance of the LASSO model trained on 1,000 independent permutations of the labels of the original dataset. As expected, the average AUC for the permutation test is 0.5 (random classification).
  • Reference Split indicates the train/test datasets used in the main analysis described in the document. Results on the reference split are very similar to the median obtained on the 1,000 splits, suggesting that this split does not over- or under-train our model.
  • FIG. 113 PCA based on the 56 non-zero genes. The first 5 components are shown, and samples are shaded according to the volunteer's age. PCA does not reveal any clear separation between volunteers of different ages.
  • FIG. 114 PCA based on the 56 non-zero genehancers. The first 5 components are shown, and samples are shaded according to the volunteer's age. PCA does not reveal any clear separation between volunteers of different ages.
  • FIG. 115A - 115C From left to right and top to bottom: AUCs on 100 different splits of the original dataset, where the model is trained and tested on the volunteer's age. The table reports the median AUC for Age and CRC classifiers and the p-value resulting from the Mann-Whitney test. List of non-zero genes/genehancers in the LASSO model trained on the volunteer's age; in red, the genes shared between this model and the model trained on CRC-HV (no shared genehancers were found). Results refute the hypothesis that age can be a confounding factor in the training of the model.
  • FIG. 116A-B Distribution of the number of non-zero genes/genehancers found in the 200 simulations. Variability in the number of discriminating features is observed.
  • FIG. 119 List of the 56 non-zero genes in the Lasso classifier
  • FIG. 120 List of the 59 non-zero genehancers in the Lasso classifier
  • FIG. 121 List of the non-zero genes in the 200 simulations of the Lasso classifier. Only genes occurring in more than 10% of the simulations are reported. In red, the genes shared with the list containing the 56 non-zero genes in the CRC-HV classifier used in the main analyses (FIG. 92). Mean/Min/Max weight columns contain the Mean/Min/Max weight found in the 200 simulations (note that only non-zero weights were used in this calculation). The Training weight column contains the weight of the feature in the Lasso model used in the main analyses.
  • FIG. 122 List of the non-zero genehancers in the 200 simulations of the Lasso classifier. Only genehancers occurring in more than 10% of the simulations are reported. In red, the genehancers shared with the list containing the 59 non-zero genehancers in the CRC-HV classifier used in the main analyses (FIG. 93). Mean/Min/Max weight columns contain the Mean/Min/Max weight found in the 200 simulations (note that only non-zero weights were used in this calculation). The Training weight column contains the weight of the feature in the Lasso model used in the main analyses.
  • FIG. 123A-C The LASSO scores computed for the 21 samples. Red and blue bars highlight CRC and HV samples, respectively. The horizontal red dotted line shows the optimal classification threshold inferred from the HMCP-110 dataset (0.091).
  • FIG. 124 shows a table of selected gene subsets having above 5% frequency, 10% frequency or 21% frequency.
  • FIG. 125 shows one example of the 5-hydroxymethylcytosine (5-hmC) Pulldown Label Copy Enrich (HMCP LCE) method detailed herein.
  • FIG. 126 shows one example of the 5-hmC Pulldown Copy Label Enrich
  • FIG. 127 shows one example of the 5-hmC Pulldown Label Random prime Enrich (HMCP LRE) method detailed herein.
  • FIG. 128 shows one example of the 5-hmC Pulldown Random primer Label Enrich (HMCP RLE) method detailed herein.
  • FIG. 129 shows one example of the 5-hmC Pulldown Label Loci Specific Enrich (HMCP LLSE) method detailed herein.
  • FIG. 130 shows one example of the 5-hmC Pulldown Loci Specific Label Enrich (HMCP LSLE) method detailed herein.
  • FIG. 131 shows a gene list of biomarkers for 5% CRC-HV - No Z-Normalization - an analysis to find robust gene signatures.
  • FIG. 132 shows a genehancer list of biomarkers for CRC-HV (single application) an application of the LASSO model.
  • FIG. 133 shows a genehancer list of biomarkers for earlyCRC-HV (single application) an application of the LASSO model.
  • FIG. 134 shows a list of biomarkers for CRC HV genes TV - a result of statistical tests between a sample and a control using a Top Varying feature set (features that have a coefficient of variation > 0.2).
  • FIG. 135 shows a list of biomarkers for CRC HV genehancers TV - a result of statistical tests between a sample and a control using a Top Varying feature set (features that have a coefficient of variation > 0.2).
  • FIG. 136 shows a list of biomarkers for earlyCRC HV genes TV - a result of statistical tests between a sample and a control using a Top Varying feature set (features that have a coefficient of variation > 0.2).
  • FIG. 137 shows a gene list of biomarkers for earlyCRC HV genehancers TV - a result of statistical tests between a sample and a control using a Top Varying feature set (features that have a coefficient of variation > 0.2).
  • FIG. 138 shows a summary of the genes chosen during the RFE across the 6 fold cross-validation alongside their corresponding p-values from the MWU tests.
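The MWU (Mann-Whitney U) p-values mentioned for FIG. 138 can be approximated as follows. This is a sketch only: it uses the normal approximation without tie or continuity correction, which may differ from whatever implementation produced the figure, and the example values are hypothetical.

```python
import math


def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided p) using the normal approximation.

    No tie or continuity correction is applied, so p is approximate.
    """
    n1, n2 = len(group_a), len(group_b)
    # U counts pairs where the group_a value exceeds the group_b value;
    # ties contribute one half.
    u = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Hypothetical signal for one gene in cancer samples and healthy volunteers.
crc = [1.0, 2.0, 3.0]
hv = [4.0, 5.0, 6.0]
u_stat, p_value = mann_whitney_u(crc, hv)
```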
  • FIG. 140 shows a table of genes distinguishing earlyCRC from HV. FDR < 0.05, N
  • a method may comprise assaying a sample for a nucleotide sequence having at least: 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99% sequence homology to a biomarker or active fragment thereof to produce a result.
  • the biomarker or active fragment thereof may comprise a gene or portion thereof.
  • the biomarker may comprise a genehancer or a portion thereof.
  • the biomarker may comprise a transcription factor or a portion thereof.
  • the biomarker may not be previously associated with a cancer.
  • An epigenetic modification of the biomarker may not be previously associated with a cancer.
  • the assaying may identify a presence of an epigenetic modification.
  • the assaying may identify a presence of one or more of methylcytosine (mC), a hydroxymethylated cytosine (hmC), a carboxycytosine (caC), a formylcytosine (fC), or any combination thereof at one or more positions in the biomarker.
  • the assaying may identify an epigenetic signature.
  • the sample may be obtained from a subject having been previously diagnosed to have cancer.
  • the sample may be obtained from a subject having cancer.
  • the sample may be obtained from a subject suspected of having cancer.
  • the sample may be obtained from a subject asymptomatic of cancer.
  • the sample may be obtained from a subject not previously diagnosed with cancer.
  • the sample may be obtained from a subject during an early screening procedure.
  • the sample may be obtained from a subject having a risk of cancer - such as a presence of a biomarker or familial genetic history.
  • the sample obtained from the subject may be a blood sample, a fine needle aspirate (FNA) sample, a tissue sample, a fecal sample or any combination thereof.
  • the sample may comprise cell-free DNA.
  • the sample may comprise a small sample volume, for example, from about 1 nanogram to about 15 ng.
  • the sample may comprise a small sample volume, for example from about 1 cell to about 1000 cells; from about 1 cell to about 500 cells; from about 1 cell to about 100 cells.
  • a sample may comprise a first portion comprising a blood sample and a second portion comprising a tissue sample or a fecal sample.
  • a result of assaying may be compared to a result obtained from a control sample.
  • the control sample may comprise a database of control samples.
  • the control sample may comprise at least: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200 independent samples.
  • the control sample may comprise at least 5 independent samples.
  • the control sample may comprise at least 10 independent samples.
  • the control sample may comprise at least 20 independent samples.
  • the control sample may comprise at least 50 independent samples.
  • the control sample may comprise at least 100 independent samples.
  • the control sample may comprise a blood sample, an FNA sample, a tissue sample, or any combination thereof.
  • the control sample may be obtained from a healthy volunteer.
  • the control sample may be obtained from a subject having received a positive diagnosis of cancer.
  • the control sample may be obtained from a subject having a specific cancer type, such as a colorectal cancer, a colon cancer, etc.
  • the control sample may include a sample previously obtained from the same subject, such as a sample obtained at an earlier point in time.
  • the control sample may include a sample obtained from a different subject.
  • Comparing a result from a sample to a result obtained from a control sample may identify the sample as benign or malignant for a cancer.
  • a comparison of a result may include a differential gene expression, a presence or absence of an epigenetic modification at a position in a gene or genehancer, a difference in an epigenetic signature, a presence or absence of a sequence variant, a difference in a copy number of a gene, or any combination thereof.
  • a comparison to a result from a control sample may identify the sample as being indicative of a particular stage of a cancer, a particular type of cancer, a risk of developing a cancer, a risk of a cancer recurring, a risk of metastasis, or any combination thereof.
  • the assaying may include sequencing a nucleotide sequence present in the sample.
  • the nucleotide sequence may have at least 85% sequence homology to a biomarker or active fragment thereof.
  • the assaying may include selecting for or sorting for nucleotide sequences having at least 85% sequence homology to at least a portion of a biomarker.
  • the assaying may employ one or more probes specific for one or more biomarkers or portions thereof as described herein.
  • One or more biomarkers may be assayed. At least 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 50, 100, 150, 200 biomarkers may be assayed. At least 5 biomarkers may be assayed. At least 10 biomarkers may be assayed.
  • At least 15 biomarkers may be assayed. At least 20 biomarkers may be assayed. At least 50 biomarkers may be assayed. At least 100 biomarkers may be assayed. At least 200 biomarkers may be assayed.
  • the assaying may include detecting an epigenetic modification in a nucleotide sequence present in the sample.
  • the detecting may include detecting a methylcytosine (mC), a hydroxymethylated cytosine (hmC), a carboxycytosine (caC), a formylcytosine (fC), or any combination thereof.
  • the detecting may include distinguishing between two or more types of epigenetic modifications, such as distinguishing mC from hmC.
  • the epigenetic modification may be detected in any number of ways including, but not limited to, sequencing (such as nanopore sequencing or high throughput sequencing), bisulfite sequencing, antibody-specific labeling (such as use of radio-labeling, click chemistry, or fluorescent moieties), sugar moiety addition (including glucose or gentiobiose or a combination, wherein the addition may be by an enzyme such as bGT), thin-layer chromatography, TET enzymatic modification, methyltransferase activity (such as DNMT1), blotting assays, ELISA assays, the HMCP v2 method, or any combination thereof.
  • the detecting may comprise sequencing.
  • the detecting may comprise nanopore sequencing.
  • the detecting may comprise high throughput sequencing.
  • the detecting may comprise associating a label with an epigenetically modified base of a nucleic acid sequence to form a labeled nucleic acid sequence; hybridizing a substantially complementary strand to the labeled nucleic acid sequence; and amplifying the substantially complementary strand in a reaction in which the labeled nucleic acid sequence is substantially not present.
  • the detecting may comprise contacting the sample with an enzyme or a catalytically active fragment thereof that converts a methylated residue in the sample to a modified base.
  • the detecting may comprise labeling covalently, a hydroxyl group on a hydroxymethylated residue in the sample to generate labeled
  • the detecting may comprise contacting at least a portion of the sample with an enzyme that utilizes a labeled glucose or a labeled glucose-derivative donor substrate to add a labeled glucose molecule or a labeled glucose-derivative to a 5-hydroxymethylcytosine in the sample to generate a labeled
  • the detecting may comprise adding a detectable label to the epigenetic modification.
  • the detectable label may comprise an antibody.
  • the detecting may comprise a FRET assay.
  • the detecting may comprise an ELISA assay.
  • the detecting may comprise an LCMS assay.
  • the identifying may comprise adaptor ligation.
  • the detecting may comprise detecting caC or fC.
  • the detecting may comprise detecting a kinetic change during sequencing wherein the kinetic change is relative to the control or derivative thereof and comprises a change in interpulse duration, pulse width, or a combination thereof, wherein the presence of the kinetic change indicates the presence of the epigenetic modification in the sample.
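The kinetic comparison described above can be sketched as a ratio of sample to control measurements. This is an illustrative sketch only: the function names, the use of a mean ratio, and the `min_ratio` cutoff of 1.5 are assumptions and are not specified in the source.

```python
from statistics import mean


def kinetic_ratio(sample_values, control_values):
    """Ratio of the mean sample measurement to the mean control measurement."""
    return mean(sample_values) / mean(control_values)


def kinetic_change_detected(sample_ipd, control_ipd,
                            sample_pw, control_pw, min_ratio=1.5):
    """Flag a change in interpulse duration (IPD) or pulse width (PW).

    A change in either direction counts; min_ratio is a hypothetical cutoff.
    """
    for ratio in (kinetic_ratio(sample_ipd, control_ipd),
                  kinetic_ratio(sample_pw, control_pw)):
        if ratio >= min_ratio or ratio <= 1 / min_ratio:
            return True
    return False


# Hypothetical per-read measurements at one position (arbitrary units).
changed = kinetic_change_detected(
    sample_ipd=[3.0, 3.2, 2.8], control_ipd=[1.0, 1.1, 0.9],
    sample_pw=[1.0, 1.0], control_pw=[1.0, 1.0],
)
```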
  • the detecting may comprise: associating a label with the epigenetic modification in a nucleic acid sequence to form a labeled nucleic acid sequence; hybridizing a substantially complementary strand to the labeled nucleic acid sequence; and amplifying the substantially complementary strand in a reaction in which the labeled nucleic acid sequence is substantially not present.
  • a presence or an absence of an epigenetic modification may comprise a level of an epigenetic modification.
  • a presence or an absence of an epigenetic modification may comprise a presence or an absence at one or more specific positions in a biomarker.
  • a presence or an absence may comprise a pattern or signature of epigenetic modifications.
  • An epigenetic modification may comprise a 5mC, a 5hmC, a 5caC, a 5fC, or any combination thereof.
  • a presence or an absence of an epigenetic modification may comprise a number of methylated sites in the biomarker, in the transcription factor (TF) associated with the biomarker, in a region of the genome associated with the biomarker or TF, or any combination thereof.
  • a presence or an absence of an epigenetic modification may comprise a number of hypo-hydroxymethylated loci, a number of hyper-hydroxymethylated loci, or a combination thereof in the biomarker, in the TF associated with the biomarker, in a region of the genome associated with the biomarker or TF, or any combination thereof.
  • a loss of an epigenetic modification may be indicative of a presence of cancer in the sample, such as a loss of 5-hmC.
  • a gain of an epigenetic modification may be indicative of a presence of cancer in the sample.
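Counting hypo- and hyper-hydroxymethylated loci relative to a control can be sketched as below. The two-fold cutoff (`min_fold`), the locus names, and the signal values are hypothetical; the source does not state how a gain or loss is called.

```python
def classify_loci(sample, control, min_fold=2.0):
    """Label each locus 'hyper', 'hypo', or 'unchanged' relative to control.

    min_fold is a hypothetical cutoff, not a value taken from the source.
    """
    calls = {}
    for locus, s in sample.items():
        c = control[locus]
        if s > c and s >= c * min_fold:
            calls[locus] = "hyper"      # gain of the modification
        elif s < c and s * min_fold <= c:
            calls[locus] = "hypo"       # loss of the modification
        else:
            calls[locus] = "unchanged"
    return calls


def count_calls(calls):
    """Tally the number of loci carrying each label."""
    counts = {"hyper": 0, "hypo": 0, "unchanged": 0}
    for label in calls.values():
        counts[label] += 1
    return counts


# Hypothetical 5-hmC signal per locus.
sample = {"locus_1": 40.0, "locus_2": 5.0, "locus_3": 21.0}
control = {"locus_1": 10.0, "locus_2": 20.0, "locus_3": 20.0}
calls = classify_loci(sample, control)
counts = count_calls(calls)
```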
  • a method may comprise assaying a sample for a metabolic-related biomarker, an immune-related biomarker, cell growth related biomarker, apoptosis related biomarker, protein degradation related biomarker, endocrine related biomarker, cell movement or morphology related biomarker, or any combination thereof to obtain a result.
  • a biomarker may be associated with an Ingenuity Pathway.
  • a biomarker may be a metabolic-related biomarker, an immune-related biomarker, or any combination thereof.
  • a comparison of the result to a result from a control sample may identify the sample as benign or malignant for a cancer.
  • a result may include assaying a sample for a population of immune cells, including a number of immune cells or immune cell subtypes.
  • Immune cell subtypes may include T cells, B cells, neutrophils, basophils, eosinophils, or any combination thereof.
  • a result may include assaying a sample for a population of immune cells and quantifying one or more markers expressed by the population of immune cells.
  • a method may comprise identifying a presence or an absence of an early stage cancer or a late stage cancer in a sample.
  • the cancer may be colorectal cancer, a colon cancer, or others.
  • the method may identify the sample as having a particular stage of cancer, such as stage I, II, III, or IV.
  • the method may identify the sample as having an aggressive type of cancer. The identification may be based on a comparison to a control sample.
  • the sample may be assayed for a result and the result may be compared to a result obtained from a control sample.
  • the control sample may comprise samples obtained from early-stage cancer and late stage cancer, aggressive types of cancer, stage I cancers, stage II cancers, stage III cancers, stage IV cancers, metastatic cancers, or any combination thereof.
  • the assaying may include assaying for at least a portion of a biomarker.
  • the comparison may include comparing a presence or an absence of an epigenetic modification between the control sample and the sample.
  • the comparison may include comparing a differential gene expression, a presence or an absence of a sequence variant, a copy number, a presence or an absence of an epigenetic modification, a patient's genetic history, a patient's environmental history, or any combination thereof.
  • a method may identify the sample as representative of a subtype of the cancer, such as an aggressive type of cancer.
  • a method may identify the sample as representative of a subtype of the cancer, such as a tissue type (e.g., colorectal cancer).
  • a method may identify the sample as representative of a subtype of the cancer, such as a stage I, stage II, stage III, or stage IV cancer.
  • a method may identify the sample as representative of a subtype of the cancer, such as a colon cancer that may be a serrated adenoma or a tubular adenoma.
  • a method may identify the sample as representative of a subtype of the cancer, such as a colon cancer that may be CMS1, CMS2, CMS3, or CMS4.
  • a result obtained from assaying may be input into a computer processor.
  • a result obtained from assaying may be input into a trained algorithm.
  • a result including the presence or absence of an epigenetic modification may be input into the trained algorithm.
  • a result including a number of immune cells, types of immune cells, or combinations thereof may be input into the trained algorithm.
  • a trained algorithm may be a classifier, a supervised machine learning algorithm, or a molecular classifier.
  • Epigenetic data (or additionally gene expression data, sequence variant data, copy number data, immune population data, or others) may in some cases be improved through the application of algorithms designed to normalize and/or improve the reliability of the data.
  • Data analysis may employ a computer or other device, machine or apparatus for application of the various algorithms described herein due to the large number of individual data points that may be processed.
  • a “machine learning algorithm” may refer to a computational-based prediction methodology, also known to persons skilled in the art as a "classifier", employed for characterizing epigenetic data, gene expression data, sequence variant data, copy number data, any combination thereof or others.
  • the data obtained from a sample may be input to the algorithm in order to classify the sample, such as benign or malignant for a cancer.
  • Supervised learning generally involves "training" a classifier with a training set to recognize the distinctions among classes or disease states and then “testing" the accuracy of the classifier on an
  • the classifier can be used to predict the class in which the samples belong, such as benign or malignant for a cancer.
  • a trained algorithm may identify significant differences in epigenetic data, such as a significant difference in a presence or an absence of an epigenetic modification, as determined by feature selection using LIMMA (linear models for microarray data) and SVM (support vector machine) for classification of malignant vs. benign samples. Rank or weight denotes the marker significance (lower rank, higher significance) after Benjamini and Hochberg correction for False Discovery Rate (FDR).
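The Benjamini and Hochberg correction named above can be written in a few lines. This sketch covers only the FDR adjustment and the resulting ranking; it does not reproduce LIMMA or an SVM, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def rank_markers(names, adjusted):
    """Rank 1 is the most significant marker (lowest adjusted p-value)."""
    ordered = sorted(zip(adjusted, names))
    return {name: rank for rank, (_, name) in enumerate(ordered, start=1)}


# Hypothetical raw p-values for four markers.
names = ["marker_a", "marker_b", "marker_c", "marker_d"]
raw = [0.01, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(raw)
ranks = rank_markers(names, fdr)
```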
  • a trained algorithm may include a support vector machine (SVM) algorithm, a random forest algorithm, or a combination thereof.
  • LIMMA may be used for feature selection. Classification may be performed with a random forest algorithm or SVM methods. Markers that repeatedly appear in multiple iterative rounds of training, classification, and cross validation may be identified and ranked. A joint set of core features may be created using the top ranked features. Biomarkers with a non-zero repeatability score may be selected as significant.
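The repeatability idea can be sketched as a tally over training rounds. The source does not define the repeatability score, so this sketch assumes it is the fraction of rounds in which a marker was selected; the marker names and round contents are hypothetical.

```python
from collections import Counter


def repeatability_scores(all_markers, selected_per_round):
    """Fraction of rounds in which each marker was selected (an assumption)."""
    rounds = len(selected_per_round)
    counts = Counter(m for sel in selected_per_round for m in set(sel))
    return {m: counts[m] / rounds for m in all_markers}


def core_features(scores, top_n):
    """Top-ranked markers by score, with ties broken alphabetically."""
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return ranked[:top_n]


def significant_markers(scores):
    """Markers with a non-zero repeatability score."""
    return sorted(m for m, s in scores.items() if s > 0)


# Hypothetical markers selected in three cross-validation rounds.
markers = ["A", "B", "C", "D"]
rounds = [["A", "B"], ["A", "C"], ["A", "B"]]
scores = repeatability_scores(markers, rounds)
core = core_features(scores, top_n=2)
significant = significant_markers(scores)
```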
  • a result of a trained algorithm may be output in a report.
  • Results may be presented as a report on a computer screen or as a paper record.
  • the report may include, but is not limited to, such information as one or more of the following: the number of biomarkers comprising an epigenetic modification, a classification of a sample as benign or malignant for a cancer, the suitability of the original sample, a diagnosis, a statistical confidence for the diagnosis, the likelihood of cancer or malignancy, a recommendation for further treatment, or any combination thereof.
  • the comparison to a control sample may be performed by a trained algorithm.
  • a trained algorithm may be trained to identify feature selections within a data set.
  • a trained algorithm may classify a sample as benign or malignant for a cancer.
  • a cancer may include a colorectal cancer, or a colon cancer.
  • the methods may include identifying a sample as benign or malignant for cancer. In some cases, the method may include identifying a sample as premalignant or precancerous. In some cases, the methods may include identifying a presence of or likelihood of developing a tumor, neoplasm, or cancer. A cancer may include a colon cancer, a rectal cancer, or any combination thereof. In some cases, the methods may include identifying a presence of a premalignant condition or a precancerous lesion or growth.
  • a premalignant condition or precancerous lesion or growth may comprise a polyp (such as an adenomatous polyp), a nonpolyp, an adenoma, a dysplasia (such as high grade or low grade), or any combination thereof.
  • the methods may include distinguishing a premalignant condition from a benign condition (such as a benign polyp, benign lesion, benign hyperplastic tissue, benign hyperplasia, or the like).
  • the methods may include comparing a result obtained from assaying a sample to a result obtained from a control or derivative thereof.
  • the comparing may identify the sample as a precancerous lesion or precancerous growth.
  • the comparing may distinguish a precancerous lesion or growth from a benign condition.
  • the comparing may be performed by a trained algorithm.
  • a precancerous lesion or growth may be identified by performing the methods as described herein on a blood sample.
  • the sample may comprise cell-free DNA.
  • Assaying a sample may be performed in the absence of a screening procedure.
  • the methods herein may provide a replacement or alternative to a screening procedure.
  • a screening procedure may include a colonoscopy, an assay performed on a stool sample provided by the subject, a sigmoidoscopy, or any combination thereof.
  • a benefit of the method may include an alternative pre-screening tool that does not require a colonoscopy or providing a stool sample.
  • the method may provide a result having greater than 90% sensitivity and greater than 80% specificity to distinguish a precancerous lesion or growth from a benign condition. When a subject receives a result identifying the sample as benign, the method may permit a subject to opt out or not receive a screening procedure.
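Sensitivity and specificity for a two-class result can be computed as below. The class labels and example calls are hypothetical; only the 90% and 80% targets come from the text above.

```python
def sensitivity_specificity(truth, predicted, positive="precancerous"):
    """Return (sensitivity, specificity) for a two-class result."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


def meets_target(sensitivity, specificity):
    """Check the greater-than-90% / greater-than-80% targets stated above."""
    return sensitivity > 0.90 and specificity > 0.80


# Hypothetical ground truth and classifier calls for 20 samples.
truth = ["precancerous"] * 10 + ["benign"] * 10
predicted = ["precancerous"] * 10 + ["benign"] * 9 + ["precancerous"]
sens, spec = sensitivity_specificity(truth, predicted)
```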
  • a method may comprise assaying a sample for a nucleotide having at least 70% sequence homology to a biomarker listed in FIG. 19B-C, FIG. 34B, FIG. 74, FIG. 136, FIG. 137, FIG. 139, FIG. 140, FIG. 141, FIG. 142, any combination thereof, or any other figure described herein labeled as "earlyCRC from HV".
  • a table described as “earlyCRC from HV” may distinguish an early stage cancer, such as stage I or II from a healthy volunteer.
  • a table described as “earlyCRC from HV” may distinguish a premalignant lesion or growth from a healthy volunteer.
  • the assaying may include assaying the sample for a nucleotide having at least 70% sequence homology to a biomarker from Table 1, Table 2, Table 3 or any combination thereof.
  • the assaying may produce a result that may be compared to a result from a control or derivative thereof.
  • the sample may be obtained from a subject asymptomatic for cancer, at risk for developing cancer, not previously diagnosed with cancer, or as part of a routine screening.
  • the comparing may identify the sample as a precancerous lesion or precancerous growth.
  • the assaying may include detecting a presence or an absence of an epigenetic modification.
  • the detecting may comprise detecting by sequence, such as by nanopore sequencing or high throughput sequencing.
  • the control or derivative thereof may comprise samples obtained from a precancerous lesion or growth.
  • a method may provide a result in the absence of a further medical procedure, such as a result that may include an identification of the sample as malignant or benign for a cancer.
  • a further medical procedure may include: obtaining a second sample from the subject, such as an invasive sample (such as a biopsy) or a blood sample; performing an imaging scan on a portion of the subject; performing surgery on the subject; or a combination thereof.
  • a method may include repeating the assaying.
  • a method may include repeating the comparing to a control sample, such as comparing to a different control sample.
  • a method may provide a result that includes a recommendation for monitoring a change over time in the result.
  • a method may include assaying a second sample from the subject. The second sample may be obtained from the subject at a different period of time, such as an earlier period of time or a later period of time.
  • a method may provide a result that includes a recommendation for the subject to receive a surgery.
  • a trained algorithm may be trained with a training set of samples.
  • a trained algorithm may be validated with a validation set of samples.
  • the validation set of samples may be independent of the training set.
  • An independent sample may be input into the trained algorithm that may be independent of both the training set and the validation set.
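Keeping the training set, the validation set, and later independent samples separate can be sketched as a disjoint split. The function name, the set sizes, and the sample identifiers are hypothetical, and the sketch assumes each identifier is unique.

```python
import random


def split_samples(sample_ids, n_train, n_validation, seed=0):
    """Split unique sample IDs into disjoint training, validation,
    and independent sets."""
    ids = list(sample_ids)
    if n_train + n_validation >= len(ids):
        raise ValueError("no samples would remain for the independent set")
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    validation = ids[n_train:n_train + n_validation]
    independent = ids[n_train + n_validation:]
    return train, validation, independent


# Hypothetical identifiers for 100 cell-free DNA samples.
all_ids = [f"sample_{i:03d}" for i in range(100)]
train, validation, independent = split_samples(all_ids, 50, 30)
```

Fixing the seed makes the split reproducible, so the same samples are never used to both train and evaluate the algorithm across reruns.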
  • a training set of samples may include at least: 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500 samples.
  • a training set of samples may include about 5 samples.
  • a training set of samples may include about 20 samples.
  • a training set of samples may include about 50 samples.
  • a training set of samples may include about 100 samples.
  • a training set of samples may include about 200 samples.
  • a training set of samples may include about 300 samples.
  • a training set of samples may include about 500 samples.
  • a training set of samples may include at least: 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500 cell free DNA samples.
  • a training set of samples may include about 5 cell free DNA samples.
  • a training set of samples may include about 20 cell free DNA samples.
  • a training set of samples may include about 50 cell free DNA samples.
  • a training set of samples may include about 100 cell free DNA samples.
  • a training set of samples may include about 200 cell free DNA samples.
  • a training set of samples may include about 300 cell free DNA samples.
  • a training set of samples may include about 500 cell free DNA samples.
  • a training set of samples may include samples having a malignant diagnosis, a benign diagnosis, or a combination thereof.
  • a training set of samples may include samples obtained from healthy volunteers, subjects diagnosed with cancer, or a combination thereof.
  • a training set of samples may include cell free DNA samples, genomic DNA samples, biopsy samples, FNA samples, tissue samples, or any combination thereof.
  • a training set of samples may include more than one subtype of cancer.
  • a training set of samples may include genomic DNA samples and cell free DNA samples.
  • a training set of samples may include genomic DNA samples.
  • a training set of samples may include cell free DNA samples.
  • a training set of samples may include one or
  • a presence or an absence of an epigenetic modification may identify a sample as comprising a benign or malignant tissue.
  • a read count threshold between the sample and control or derivative thereof may be at least: 10, 20, 30, 40, or 50.
  • a read count threshold may be greater than about 10.
  • a read count threshold may be greater than about 20.
  • a read count threshold may be greater than about 30.
  • a read count threshold may be greater than about 40.
  • an FDR threshold may be less than about: 0.5, 0.1, 0.05, or 0.01. In some cases, an FDR threshold may be less than about 0.01. In some cases, an FDR threshold may be less than about 0.05. In some cases, an FDR threshold may be less than about 0.1.
  • an FDR threshold may be less than about 0.5.
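Applying a read count threshold and an FDR threshold together can be sketched as a simple filter. This assumes the read count threshold applies to the difference in read counts between the sample and the control, which is one possible reading of the text; the field names and values are hypothetical.

```python
def passes_thresholds(sample_reads, control_reads, fdr,
                      min_read_diff=10, max_fdr=0.05):
    """True when the read count difference exceeds min_read_diff
    and the FDR is below max_fdr."""
    return abs(sample_reads - control_reads) > min_read_diff and fdr < max_fdr


def filter_biomarkers(records, min_read_diff=10, max_fdr=0.05):
    """Return the names of records that pass both thresholds."""
    return [
        r["name"]
        for r in records
        if passes_thresholds(r["sample_reads"], r["control_reads"],
                             r["fdr"], min_read_diff, max_fdr)
    ]


# Hypothetical per-biomarker summary values.
records = [
    {"name": "gene_1", "sample_reads": 60, "control_reads": 20, "fdr": 0.01},
    {"name": "gene_2", "sample_reads": 25, "control_reads": 20, "fdr": 0.01},
    {"name": "gene_3", "sample_reads": 80, "control_reads": 20, "fdr": 0.20},
]
kept = filter_biomarkers(records)
```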
  • a biomarker may be weighted or ranked.
  • a weighting or ranking may be indicative of a discriminatory power of a biomarker to identify a sample as benign or malignant for a cancer.
  • a kit may include one or more materials for performing the methods as described herein.
  • a kit may include reagents for the assaying.
  • a kit may include reagents to identify epigenetic modifications in a sample according to any method as described herein.
  • a kit may include reagents for sequencing.
  • a kit may include TET enzymes or fragments thereof.
  • a kit may include a DNA methyltransferase.
  • a kit may include a glucosyltransferase.
  • a kit may include an excipient, such as glycerol, water, saline, dextrose, ethanol, or any combination thereof.
  • a kit may include probes to one or more biomarkers as described herein.
  • a kit may include a pre-programmed trained algorithm.
  • a kit may include controls or derivative thereof, a database comprising controls or derivative thereof, or access to an online database comprising controls or derivative thereof.
  • a kit may include reagents for obtaining a sample, storing the sample, assaying the sample, or any combination thereof.
  • the kit may further comprise software or a license to obtain and use software for analysis of the data provided using the methods described herein.
Definitions
  • the term “about” or “approximately” can mean within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. For example, “about” can mean plus or minus 10%, per the practice in the art. Alternatively, “about” can mean a range of plus or minus 20%, plus or minus 10%, plus or minus 5%, or plus or minus 1% of a given value. Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude, within 5-fold, or within 2-fold, of a value.
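The plus or minus percentage reading of “about” can be expressed as a tolerance check. The function name is hypothetical, and the default of 10% is just one of the ranges listed above.

```python
def is_about(value, target, tolerance=0.10):
    """True when value lies within plus or minus tolerance of target."""
    return abs(value - target) <= tolerance * abs(target)


# With the default 10% tolerance, 16 ng counts as "about 15 ng" but 17 ng
# does not; widening the tolerance to 20% admits 17 ng.
within_10 = is_about(16, 15)
outside_10 = is_about(17, 15)
within_20 = is_about(17, 15, tolerance=0.20)
```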
  • the term "substantially” as used herein can refer to a value approaching 100% of a given value. In some cases, the term can refer to an amount that can be at least about 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.9%, or 99.99% of a total amount. In some cases, the term can refer to an amount that can be about 100% of a total amount.
  • the term "homology" can refer to a % identity of a sequence to a reference sequence. As a practical matter, whether any particular sequence is at least 50%, 60%, 70%, 80%, 85%, 90%, 92%, 95%, 96%, 97%, 98% or 99% identical to any sequence described herein (which may correspond with a particular nucleic acid sequence described herein) can be determined conventionally using known computer programs such as the Bestfit program (Wisconsin Sequence Analysis Package, Version 8 for Unix, Genetics Computer Group, University Research Park, 575 Science Drive, Madison, Wis. 53711).
  • the parameters can be set such that the percentage of identity is calculated over the full length of the reference sequence and that gaps in homology of up to 5% of the total reference sequence are allowed.
  • identity between a reference sequence (query sequence, i.e., a sequence of the present invention) and a subject sequence, also referred to as a global sequence alignment, can be determined using the FASTDB program.
  • the percent identity can be corrected by calculating the number of residues of the query sequence that are lateral to the N- and C-terminal of the subject sequence, which are not matched/aligned with a corresponding subject residue, as a percent of the total bases of the query sequence.
  • determination of whether a residue is matched/aligned can be determined by results of the FASTDB sequence alignment. This percentage can be then subtracted from the percent identity, calculated by the FASTDB program using the specified parameters, to arrive at a final percent identity score. This final percent identity score can be used for the purposes of this embodiment.
  • only residues to the N- and C-termini of the subject sequence, which are not matched/aligned with the query sequence are considered for the purposes of manually adjusting the percent identity score. That is, only query residue positions outside the farthest N- and C-terminal residues of the subject sequence are considered for this manual correction. For example, a 90 residue subject sequence can be aligned with a 100 residue query sequence to determine percent identity.
  • the deletion occurs at the N-terminus of the subject sequence and therefore, the FASTDB alignment does not show a matching/alignment of the first 10 residues at the N-terminus.
  • the 10 unpaired residues represent 10% of the sequence (number of residues at the N- and C-termini not matched/total number of residues in the query sequence) so 10% is subtracted from the percent identity score calculated by the FASTDB program. If the remaining 90 residues were perfectly matched the final percent identity would be 90%.
  • a 90 residue subject sequence is compared with a 100 residue query sequence. This time the deletions are internal deletions so there are no residues at the N- or C-termini of the subject sequence which are not matched/aligned with the query.
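The manual correction described above is simple arithmetic and can be sketched as follows. The function does not run FASTDB; it takes the alignment's percent identity as an input, and the function and parameter names are hypothetical.

```python
def corrected_percent_identity(alignment_identity,
                               unmatched_terminal_residues,
                               query_length):
    """Subtract the terminal-overhang penalty from an alignment identity.

    alignment_identity: percent identity reported by the alignment program.
    unmatched_terminal_residues: query residues lateral to the N- and
        C-termini of the subject that are not matched/aligned.
    query_length: total number of residues in the query sequence.
    """
    penalty = 100.0 * unmatched_terminal_residues / query_length
    return alignment_identity - penalty


# First example: 10 unmatched N-terminal residues out of a 100 residue
# query, with the remaining 90 residues perfectly matched.
terminal_case = corrected_percent_identity(100.0, 10, 100)

# Second example: the deletions are internal, so no residues lateral to
# the termini are unmatched and no manual correction is applied.
internal_case = corrected_percent_identity(95.0, 0, 100)
```

The 95.0 in the second call is a placeholder for whatever identity the alignment program reports; the point is that internal deletions leave it unchanged.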
  • the term "fragment" may refer to a portion of a sequence, a subset that may be shorter than a full length sequence.
  • a fragment may be a portion of a gene.
  • a fragment may be a portion of a peptide or protein.
  • a fragment may be a portion of an amino acid sequence.
  • a fragment may be a portion of an oligonucleotide sequence.
  • a fragment may be less than about: 20, 30, 40, 50 amino acids in length.
  • a fragment may be less than about: 20, 30, 40, 50 nucleotides in length.
  • the term "epigenetic modification" may refer to any covalent modification of a nucleic acid base.
  • a covalent modification may comprise (i) adding a methyl group, a hydroxymethyl group, a carbon atom, an oxygen atom, or any combination thereof to one or more bases of a nucleic acid sequence, (ii) changing an oxidation state of a molecule associated with a nucleic acid sequence, such as an oxygen atom, or (iii) a combination thereof.
  • a covalent modification may occur at any base, such as a cytosine, a thymine, a uracil, an adenine, a guanine, or any combination thereof.
  • an epigenetic modification may comprise an oxidation or a reduction.
  • a nucleic acid sequence may comprise one or more epigenetically modified bases.
  • An epigenetically modified base may comprise any base, such as a cytosine, a uracil, a thymine, adenine, or a guanine.
  • An epigenetically modified base may comprise a methylated base, a hydroxymethylated base, a formylated base, or a carboxylic acid containing base or a salt thereof.
  • An epigenetically modified base may comprise a 5-methylated base, such as a 5-methylated cytosine (5-mC).
  • An epigenetically modified base may comprise a 5-hydroxymethylated base, such as a 5-hydroxymethylated cytosine (5-hmC).
  • An epigenetically modified base may comprise a 5-formylated base, such as a 5-formylated cytosine (5-fC).
  • An epigenetically modified base may comprise a 5-carboxylated base or a salt thereof, such as a 5-carboxylated cytosine (5-caC).
  • an epigenetically modified base may comprise a methyltransferase-directed transfer of an activated group (mTAG).
  • An epigenetically modified base may comprise one or more bases of a purine (such as Structure 1) or one or more bases of a pyrimidine (such as Structure 2).
  • an epigenetic modification may occur at any one or more positions.
  • an epigenetic modification may occur at one or more positions of a purine, including positions 1, 2, 3, 4, 5, 6, 7, 8, 9, as shown in Structure 1.
  • an epigenetic modification may occur at one or more positions of a pyrimidine, including positions 1, 2, 3, 4, 5, 6, as shown in Structure 2.
  • a nucleic acid sequence may comprise an epigenetically modified base.
  • a nucleic acid sequence may comprise a plurality of epigenetically modified bases.
  • a nucleic acid sequence may comprise an epigenetically modified base positioned within a CG site, a CpG island, or a combination thereof.
  • a nucleic acid sequence may comprise different epigenetically modified bases, such as a methylated base, a hydroxymethylated base, a formylated base, a carboxylic acid containing base or a salt thereof, a plurality of any of these, or any combination thereof.
  • nucleic acid sequence may comprise DNA or RNA.
  • a nucleic acid sequence may comprise a plurality of nucleotides.
  • a nucleic acid sequence may comprise an artificial nucleic acid analogue.
  • a nucleic acid sequence comprising DNA may comprise cell-free DNA, cDNA, fetal DNA, or maternal DNA.
  • a nucleic acid sequence may comprise miRNA, shRNA, or siRNA.
  • substantially complementary strand may comprise from about 70% - 100% bases that base pair with bases of a nucleic acid sequence. This percentage of base pairing may be measured by UV absorption of the nucleic acid sequence.
  • a substantially complementary strand may be hybridized to at least a portion of a nucleic acid sequence under stringent hybridization conditions.
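The "about 70% - 100%" base-pairing range above can be illustrated computationally. The following is a minimal sketch, assuming plain Watson-Crick DNA pairing, equal-length strands, and no gaps; the function names and the pairing table are hypothetical, and the sketch does not model the UV absorption measurement mentioned in the text.

```python
# Watson-Crick pairing table for DNA; an assumption for this sketch
# (RNA or epigenetically modified bases would need additional entries).
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def percent_base_paired(sequence: str, strand: str) -> float:
    """Percentage of positions at which `strand` pairs with `sequence`.

    Both inputs are written 5'->3', so the strand is reversed before the
    position-by-position comparison. Inputs must have equal length.
    """
    sequence, strand = sequence.upper(), strand.upper()
    if len(sequence) != len(strand) or not sequence:
        raise ValueError("inputs must be non-empty and of equal length")
    paired = sum(
        1
        for base, partner in zip(sequence, reversed(strand))
        if PAIRS.get(base) == partner
    )
    return 100.0 * paired / len(sequence)


def is_substantially_complementary(sequence: str, strand: str,
                                   minimum: float = 70.0) -> bool:
    """True when at least `minimum` percent of bases pair (default 70%)."""
    return percent_base_paired(sequence, strand) >= minimum
```

For example, `percent_base_paired("ATGC", "GCAA")` compares `ATGC` against the reversed strand `AACG`, in which three of the four positions pair.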
  • substantially free of an epigenetically modified base may comprise a complementary strand having no epigenetically modified base, or a complementary strand having from about 0.000001% to about 5% of a plurality of epigenetically modified bases of a nucleic acid sequence.
  • click-chemistry may comprise a reaction having at least one of the following: (a) high yielding, (b) wide in scope, (c) create only byproducts that may be removed in the absence of chromatography, (d) stereospecific, (e) simple to perform, (f) conducted in easily removable or benign solvents.
  • click-chemistry comprises tagging, such as tagging a nucleic acid sequence or a complementary strand.
  • click-chemistry may associate a nucleic acid sequence with a label.
  • Click-chemistry may comprise a reaction having a [3+2] cycloaddition; a thiol-ene reaction; a Diels-Alder reaction, an inverse electron demand Diels-Alder reaction; a [4+1] cycloaddition; a nucleophilic substitution; a carbonyl-chemistry-like formation of urea; an addition to a carbon-carbon double bond; or any combination thereof.
  • a [3+2] cycloaddition may comprise a Huisgen 1,3-dipolar cycloaddition.
  • a [4+1] cycloaddition may comprise a cycloaddition between an isonitrile and a tetrazine.
  • Click-chemistry may comprise a copper(I)-catalyzed azide-alkyne cycloaddition (CuAAC); a strain-promoted azide-alkyne cycloaddition (SPAAC); a strain-promoted alkyne-nitrone cycloaddition (SPANC); or any combination thereof.
  • CuAAC copper(I)-catalyzed azide-alkyne cycloaddition
  • SPAAC strain-promoted azide-alkyne cycloaddition
  • SPANC strain-promoted alkyne-nitrone cycloaddition
  • sequencing may comprise bisulfite-free sequencing, bisulfite sequencing, TET-assisted bisulfite (TAB) sequencing, ACE-sequencing, high-throughput sequencing, Maxam-Gilbert sequencing, massively parallel signature sequencing, Polony sequencing, 454 pyrosequencing, Sanger sequencing, Illumina sequencing, SOLiD sequencing, Ion Torrent semiconductor sequencing, DNA nanoball sequencing, Heliscope single molecule sequencing, single molecule real time (SMRT) sequencing, nanopore DNA sequencing, shotgun sequencing, RNA sequencing, Enigma sequencing, or any combination thereof.
  • TAB TET-assisted bisulfite
  • SMRT single molecule real time sequencing
  • a method may comprise sequencing.
  • the sequencing may include bisulfite sequencing or bisulfite-free sequencing.
  • a method may comprise oxidizing one or more bases of a nucleic acid sequence or complementary strand or combination thereof.
  • a method may comprise selectively enriching for a nucleic acid sequence that contains at least one epigenetic modification.
  • tissue may be any tissue sample.
  • a tissue may be a tissue suspected or confirmed of having a disease or condition.
  • a tissue may be a sample that may be substantially healthy, substantially benign, or otherwise substantially free of a disease or a condition.
  • a tissue may be a tissue removed from a subject, such as a tissue biopsy, a tissue resection, an aspirate (such as a fine needle aspirate), a tissue washing, a cytology specimen, a bodily fluid, or any combination thereof.
  • a tissue may comprise cancerous cells, tumor cells, non-cancerous cells, or a combination thereof.
  • a tissue may comprise colon tissue, colorectal tissue, rectal tissue, a polyp, a blood sample (such as a cell-free DNA sample), or any combination thereof.
  • a tissue may be a sample that may be genetically modified.
  • cell-free refers to the condition of the nucleic acid sequence as it appeared in the body before the sample is obtained from the body.
  • circulating cell-free nucleic acid sequences in a sample may have originated as cell-free nucleic acid sequences circulating in the bloodstream of the human body.
  • nucleic acid sequences that are extracted from a solid tissue, such as a biopsy, are generally not considered to be "cell-free."
  • cell-free DNA may comprise fetal DNA, maternal DNA, or a combination thereof.
  • cell-free DNA may comprise DNA fragments released into a blood plasma.
  • the cell-free DNA may comprise circulating tumor DNA.
  • cell-free DNA may comprise circulating DNA indicative of a tissue origin, a disease or a condition.
  • a cell-free nucleic acid sequence may be isolated from a blood sample.
  • a cell-free nucleic acid sequence may be isolated from a plasma sample.
  • a cell-free nucleic acid sequence may comprise a complementary DNA (cDNA).
  • cDNA complementary DNA
  • one or more cDNAs may form a cDNA library.
  • the term "subject," as used herein, may be any animal or living organism.
  • Animals can be mammals, such as humans, non-human primates, rodents such as mice and rats, dogs, cats, pigs, sheep, rabbits, and others.
  • Animals can be fish, reptiles, or others.
  • Animals can be neonatal, infant, adolescent or adult animals. Humans can be more than about: 1, 2, 5, 10, 20, 30, 40, 50, 60, 65, 70, 75, or about 80 years of age.
  • the subject may have or be suspected of having a condition or a disease, such as cancer.
  • the subject may be a patient, such as a patient being treated for a condition or a disease, such as a cancer patient.
  • the subject may be predisposed to a risk of developing a condition or a disease such as cancer.
  • the subject may be in remission from a condition or a disease, such as a cancer patient in remission.
  • the subject may be healthy.
  • a nucleic acid sequence may be from a sample.
  • a sample may be isolated from a subject.
  • a subject may be a human subject.
  • a sample may comprise a buccal sample, a saliva sample, a blood sample, a plasma sample, a reproductive sample (such as an egg or a sperm), a mucus sample, a cerebral spinal fluid sample, a tissue sample, a tissue biopsy, a surgical resection, a fine needle aspirate sample, or any combination thereof.
  • a sample may comprise a blood sample.
  • a sample may comprise a buccal sample.
  • a subject may have previously received a diagnosis of a disease or condition prior to performing a method as described herein.
  • a subject may have previously received a positive diagnosis of a disease, such as a cancer.
  • a subject may have previously received an indeterminate or inconclusive diagnosis of a disease, such as a cancer.
  • a subject may be a subject in need thereof, such as a need for a definitive diagnosis or a need for a selection of a therapeutic treatment regime.
  • a result of the method or a result output from the trained algorithm may include a recommendation for a treatment.
  • a treatment may include further monitoring of the subject, such as obtaining a second sample from the subject and repeating a method as described herein.
  • a treatment may include performing surgery on or removing a tissue from the subject, performing an imaging scan on the subject, performing a diagnostic test on a sample from the subject, or performing radiation, chemotherapy, or another cancer treatment procedure.
  • a subject may not have previously received a diagnosis of a disease or condition prior to performing a method as described herein.
  • a subject may be suspected of having a disease or condition, such as having one or more symptoms of a disease or condition.
  • a subject may be at risk of developing a disease or condition, such as a subject having a biomarker or genetic indication that may be indicative of a risk of developing a disease or condition.
  • a disease or a condition may comprise a cancer.
  • a nucleic acid sequence may comprise a cytosine guanine (CG) site, a cytosine phosphate guanine (CpG) island, a portion of any of these, or a combination thereof.
  • a CpG island may comprise one or more CG sites.
  • a nucleic acid sequence may comprise one or more CG sites or portions thereof.
  • a nucleic acid sequence may comprise dense CG sites, dense CpG islands or a combination thereof.
  • a nucleic acid sequence may comprise a plurality of CG sites or portions thereof.
  • a nucleic acid sequence may comprise one or more CpG islands or portions thereof.
  • a nucleic acid sequence may comprise a plurality of CpG islands or portions thereof.
  • One or more bases of a nucleic acid sequence comprising a CG site, a CpG island, a portion thereof, or any of these may comprise an epigenetically modified base, such as a methylated base or a hydroxymethylated base.
  • One or more cytosines of a nucleic acid sequence comprising a CG site, a CpG island, a portion thereof, or any of these may comprise an epigenetically modified cytosine, such as a methylated cytosine or a hydroxymethylated cytosine.
  • a CpG island (or a CG island) may be a region with a high frequency of CG sites.
  • a CpG island may be a region of a nucleic acid sequence with at least about 200 basepairs (bp) and a GC percentage that may be greater than about 50% and with an observed-to-expected CpG ratio that may be greater than about 60%.
  • An "observed-to-expected CpG ratio" may be derived where the observed may be calculated as:
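The calculation itself is not reproduced in this text. As a minimal sketch, assuming the widely used Gardiner-Garden and Frommer form of the ratio, (number of CpG × sequence length) / (number of C × number of G), the CpG island criteria stated above (at least about 200 bp, GC percentage above about 50%, ratio above about 60%, i.e. 0.6) might be checked as follows. The formula choice, function names, and dictionary keys are assumptions for illustration only.

```python
def cpg_island_metrics(seq: str) -> dict:
    """Length, GC percentage, and observed-to-expected CpG ratio of `seq`."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_percent = 100.0 * (c_count + g_count) / n if n else 0.0
    # Expected CpG count if C and G were distributed independently.
    expected = (c_count * g_count) / n if n else 0.0
    ratio = cpg_count / expected if expected else 0.0
    return {"length": n, "gc_percent": gc_percent, "obs_exp_ratio": ratio}


def looks_like_cpg_island(seq: str,
                          min_length: int = 200,
                          min_gc_percent: float = 50.0,
                          min_ratio: float = 0.6) -> bool:
    """Apply the three thresholds stated in the text to one sequence."""
    m = cpg_island_metrics(seq)
    return (m["length"] >= min_length
            and m["gc_percent"] > min_gc_percent
            and m["obs_exp_ratio"] > min_ratio)
```

A 200 bp run of alternating C and G passes all three thresholds, while a 200 bp run of alternating A and T fails on GC percentage.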
  • the methods of the present invention provide for storing the sample for a time such as seconds, minutes, hours, days, weeks, months, years or longer after the sample is obtained and before the sample is analyzed by one or more methods of the invention.
  • the sample obtained from a subject can be subdivided prior to the step of storage or further analysis such that different portions of the sample may be subject to different downstream methods or processes including but not limited to storage, cytological analysis, adequacy tests, nucleic acid extraction, molecular profiling or a combination thereof.
  • a portion of the sample may be stored while another portion of said sample is further manipulated.
  • manipulations may include but are not limited to molecular profiling (epigenetics, gene expression levels, sequence variant, copy number); sequencing, labeling, cytological or histological staining; flow cytometry analysis; nucleic acid (RNA or DNA) extraction, detection, or quantification; gene expression product (RNA or Protein) extraction, detection, or quantification; fixation; and examination.
  • the sample may be fixed prior to or during storage by any method known to the art such as using glutaraldehyde, formaldehyde, or methanol.
  • the sample is obtained and stored and subdivided after the step of storage for further analysis such that different portions of the sample are subject to different downstream methods or processes including but not limited to storage, cytological analysis, adequacy tests, nucleic acid extraction, molecular profiling or a combination thereof.
  • samples are obtained and analyzed by for example cytological analysis, and the resulting sample material is further analyzed by one or more molecular profiling methods of the present invention.
  • the samples may be stored between the steps of cytological analysis and the steps of molecular profiling. Samples may be stored upon acquisition to facilitate transport, or to wait for the results of other analyses. In another embodiment, samples may be stored while awaiting instructions from a physician or other medical professional.
  • the results obtained from the assaying can be analyzed using feature selection techniques including filter techniques which assess the relevance of features by looking at the intrinsic properties of the data, wrapper methods which embed the model hypothesis within a feature subset search, and embedded techniques in which the search for an optimal set of features is built into a classifier algorithm.
  • Filter techniques useful in the methods of the present invention include (1) parametric methods such as the use of two sample t-tests, ANOVA analyses, Bayesian frameworks, and Gamma distribution models; (2) model-free methods such as the use of Wilcoxon rank sum tests, between-within class sum of squares tests, rank products methods, random permutation methods, or TNoM, which involves setting a threshold point for fold-change differences in expression between two datasets and then detecting a threshold point in each gene that minimizes the number of misclassifications; and (3) multivariate methods such as bivariate methods, correlation-based feature selection (CFS) methods, minimum redundancy maximum relevance (MRMR) methods, and Markov blanket filter methods.
  • Wrapper methods useful in the methods of the present invention include sequential search methods, genetic algorithms, and estimation of distribution algorithms.
  • Embedded methods useful in the methods of the present invention include random forest algorithms, weight vector of support vector machine algorithms, and weights of logistic regression algorithms.
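As an illustration of a filter technique, the following minimal sketch ranks features by the absolute value of a two-sample t-statistic. It assumes the Welch (unequal variance) form of the statistic, which is one of several possible choices and is not specified in the text; the function names and the marker names in the usage note are hypothetical.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(a: list, b: list) -> float:
    """Welch two-sample t-statistic; each group needs at least two values."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    difference = mean(a) - mean(b)
    if standard_error == 0:
        # No within-group spread: identical means score 0, otherwise infinite.
        return 0.0 if difference == 0 else copysign(inf, difference)
    return difference / standard_error


def rank_features(cases: dict, controls: dict, top_k: int) -> list:
    """Return the `top_k` feature names with the largest |t| between groups.

    `cases` and `controls` map each feature name to its per-sample values.
    """
    scores = {
        name: abs(welch_t(cases[name], controls[name]))
        for name in cases
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For instance, with `cases = {"marker_a": [10, 11, 12], "marker_b": [5, 6, 5]}` and `controls = {"marker_a": [1, 2, 3], "marker_b": [5, 5, 6]}`, `rank_features(cases, controls, 1)` selects `marker_a`, whose group means differ, over `marker_b`, whose group means are equal. Because a filter scores each feature independently of any classifier, it is cheap to compute but ignores interactions between features.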
  • Selected features may then be classified using a classifier algorithm.
  • Illustrative algorithms include but are not limited to methods that reduce the number of variables such as principal component analysis algorithms, partial least squares methods, and independent component analysis algorithms.
  • Illustrative algorithms further include but are not limited to methods that handle large numbers of variables directly such as statistical methods and methods based on machine learning techniques.
  • Statistical methods include penalized logistic regression, prediction analysis of microarrays (PAM), methods based on shrunken centroids, support vector machine analysis, and regularized linear discriminant analysis.
  • Machine learning techniques include bagging procedures, boosting procedures, random forest algorithms, and combinations thereof.
  • Classifiers may be developed using top varying genes, enhancers, or a combination thereof to demonstrate the predictive power of 5-hmC in diagnosing cancer, early detection of cancer, recurrence of cancer, metastasis of cancer, presence of a malignant tissue, or any combination thereof.
  • a trained model may successfully predict a disease status, a risk of occurrence or recurrence of a disease, or any combination thereof in a test set with greater than about 90% sensitivity and greater than about 80% specificity.
  • the trained model provides a result having greater than about 90% sensitivity and greater than about 80% specificity. In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 85% specificity. In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 90% specificity. In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 95% specificity.
  • the trained model provides a result having greater than about 95% sensitivity and greater than about 80% specificity. In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 85% specificity. In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 90% specificity. In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 95% specificity.
  • the trained model provides a result having greater than about 98% sensitivity and greater than about 80% specificity. In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 85% specificity. In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 90% specificity. In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 95% specificity.
  • the trained algorithm provides a result having greater than about 80% sensitivity. In some cases, the trained algorithm provides a result having greater than about 85% sensitivity. In some cases, the trained algorithm provides a result having greater than about 90% sensitivity. In some cases, the trained algorithm provides a result having greater than about 95% sensitivity. In some cases, the trained algorithm provides a result having greater than about 96% sensitivity. In some cases, the trained algorithm provides a result having greater than about 97% sensitivity. In some cases, the trained algorithm provides a result having greater than about 98% sensitivity.
  • the trained algorithm provides a result having greater than about 70% specificity. In some cases, the trained algorithm provides a result having greater than about 75% specificity. In some cases, the trained algorithm provides a result having greater than about 80% specificity. In some cases, the trained algorithm provides a result having greater than about 85% specificity. In some cases, the trained algorithm provides a result having greater than about 90% specificity. In some cases, the trained algorithm provides a result having greater than about 95% specificity. In some cases, the trained algorithm provides a result having greater than about 96% specificity.
  • the trained algorithm provides a result having greater than about 80% clinical diagnostic accuracy. In some cases, the trained algorithm provides a result having greater than about 85% clinical diagnostic accuracy. In some cases, the trained algorithm provides a result having greater than about 90% clinical diagnostic accuracy. In some cases, the trained algorithm provides a result having greater than about 95% clinical diagnostic accuracy.
  • Sensitivity typically refers to TP/(TP+FN), where TP is true positive and FN is false negative. Number of Continued Indeterminate results divided by the total number of malignant results based on adjudicated histopathology diagnosis.
  • Specificity typically refers to TN/(TN+FP), where TN is true negative and FP is false positive.
  • Positive Predictive Value (PPV) typically refers to TP/(TP+FP).
  • Negative Predictive Value (NPV) typically refers to TN/(TN+FN).
  • the clinical accuracy as used herein includes specificity, sensitivity, positive predictive value, negative predictive value, or any combination thereof.
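The four ratios defined above can be computed directly from confusion-matrix counts. The following is a minimal sketch; the function names, the dictionary keys, and the default thresholds (taken from the "greater than about 90% sensitivity and greater than about 80% specificity" statement) are choices made for this illustration.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    def ratio(numerator: int, denominator: int) -> float:
        # An empty denominator means the metric is undefined for this data.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # TP / (TP + FN)
        "specificity": ratio(tn, tn + fp),  # TN / (TN + FP)
        "ppv": ratio(tp, tp + fp),          # TP / (TP + FP)
        "npv": ratio(tn, tn + fn),          # TN / (TN + FN)
    }


def meets_targets(metrics: dict,
                  min_sensitivity: float = 0.90,
                  min_specificity: float = 0.80) -> bool:
    """True when both metrics strictly exceed their thresholds."""
    return (metrics["sensitivity"] > min_sensitivity
            and metrics["specificity"] > min_specificity)
```

With 95 true positives, 5 false negatives, 90 true negatives, and 10 false positives, sensitivity is 0.95 and specificity is 0.90, so the default targets are met.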
  • Methods as described herein may assay for at least one biomarker or an active fragment thereof.
  • about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 or more biomarkers may be assayed.
  • about 2 biomarkers may be assayed.
  • about 5 biomarkers may be assayed.
  • about 10 biomarkers may be assayed.
  • about 15 biomarkers may be assayed.
  • at least 20 biomarkers may be assayed.
  • Methods as described herein may utilize at least one biomarker or an active fragment thereof to classify a sample.
  • about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 biomarkers may be utilized to classify a sample.
  • about 2 biomarkers may be utilized to classify a sample.
  • about 5 biomarkers may be utilized to classify a sample.
  • about 10 biomarkers may be utilized to classify a sample.
  • about 15 biomarkers may be utilized to classify a sample.
  • about 20 biomarkers may be utilized to classify a sample.
  • Methods as described herein may select at least one biomarker or an active fragment thereof.
  • about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 biomarkers may be selected.
  • about 2 biomarkers may be selected.
  • at least 5 biomarkers may be selected.
  • about 10 biomarkers may be selected.
  • about 15 biomarkers may be selected.
  • about 20 biomarkers may be selected.
  • Methods as described herein may compare a result to at least one biomarker or an active fragment thereof of a control or derivative thereof, such as a reference sample.
  • a result may be compared to about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 biomarkers.
  • a result may be compared to about 2 biomarkers.
  • a result may be compared to about 5 biomarkers.
  • a result may be compared to about 10 biomarkers.
  • a result may be compared to about 15 biomarkers.
  • a result may be compared to about 20 biomarkers.
  • a biomarker or active fragment thereof may be a gene, a portion of a gene, a genehancer, a transcription factor, or any combination thereof.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least: 70%, 75%, 80%, 85%, 90%, 95%, 99% sequence homology to the biomarker.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least 70% sequence homology to the biomarker.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least 75% sequence homology to the biomarker.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least 80% sequence homology to the biomarker.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least 85% sequence homology to the biomarker.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least 90% sequence homology to the biomarker.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least 95% sequence homology to the biomarker.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least 96% sequence homology to the biomarker.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least 97% sequence homology to the biomarker.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least 98% sequence homology to the biomarker.
  • a biomarker may be a genehancer.
  • a nucleotide sequence from a sample may comprise a nucleotide sequence having at least 99% sequence homology to the biomarker.
  • a biomarker may be a transcription factor.
  • a biomarker may be a site that is proximal to a gene.
  • a biomarker may be a site associated with a gene but more than 10 basepairs away from the gene.
  • a biomarker may not have been previously associated with a cancer.
  • An expression of a biomarker may be associated with cancer but a change in an epigenetic modification in the biomarker may not have been previously associated with a cancer.
  • a presence or absence of an epigenetic modification may be indicative of a cancer.
  • a presence of an epigenetic modification may comprise a level of methylation or a level of hydroxymethylation.
  • a presence of an epigenetic modification may comprise a number of methylated sites, hydroxymethylated sites, hypo-hydroxymethylated sites, hyper-hydroxymethylated sites, or any combination thereof.
  • biomarkers or active fragments thereof may be selected for use in the methods described herein.
  • About: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100 biomarkers or more may be selected. From 1 to 5 biomarkers may be selected. From 1 to 10 biomarkers may be selected. From 1 to 20 biomarkers may be selected. From 1 to 40 biomarkers may be selected. From 1 to 50 biomarkers may be selected. From 1 to 60 biomarkers may be selected. From 1 to 100 biomarkers may be selected. From 2 to 5 biomarkers may be selected. From 2 to 10 biomarkers may be selected. From 2 to 20 biomarkers may be selected. From 2 to 50 biomarkers may be selected. From 2 to 100 biomarkers may be selected. From 5 to 10 biomarkers may be selected. From 5 to 20 biomarkers may be selected. From 5 to 30 biomarkers may be selected. From 5 to 40 biomarkers may be selected.
  • One or more biomarkers may be assayed according to the methods described herein. At least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100 biomarkers or more may be assayed. From 1 to 5 biomarkers may be assayed. From 1 to 10 biomarkers may be assayed. From 1 to 20 biomarkers may be assayed. From 1 to 40 biomarkers may be assayed. From 1 to 50 biomarkers may be assayed. From 1 to 60 biomarkers may be assayed. From 1 to 100 biomarkers may be assayed. From 2 to 5 biomarkers may be assayed. From 2 to 10 biomarkers may be assayed. From 2 to 20 biomarkers may be assayed. From 2 to 50 biomarkers may be assayed. From 2 to 100 biomarkers may be assayed. From 5 to 10 biomarkers may be assayed. From 5 to 20 biomarkers may be assayed. From 5 to 30 biomarkers may be assayed. From 5 to 40 biomarkers may be assayed.
  • a result from one or more biomarkers may be compared to a result from a control sample.
  • a result from at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100 biomarkers or more may be compared.
  • From 1 to 5 biomarkers may be compared.
  • From 1 to 10 biomarkers may be compared.
  • From 1 to 20 biomarkers may be compared.
  • From 1 to 40 biomarkers may be compared.
  • From 1 to 50 biomarkers may be compared.
  • From 1 to 60 biomarkers may be compared.
  • From 1 to 100 biomarkers may be compared.
  • From 2 to 5 biomarkers may be compared.
  • From 2 to 10 biomarkers may be compared.
  • From 2 to 20 biomarkers may be compared.
  • From 2 to 50 biomarkers may be compared. From 2 to 100 biomarkers may be compared. From 5 to 10 biomarkers may be compared. From 5 to 20 biomarkers may be compared. From 5 to 30 biomarkers may be compared. From 5 to 40 biomarkers may be compared.
  • one or more biomarkers not previously associated with a cancer may be selected to use in the methods as described herein to identify a sample as benign or malignant for the cancer.
  • one or more biomarkers having an epigenetic marker or epigenetic change not previously associated with a cancer may be selected for use in the methods as described herein to identify a sample as benign or malignant for the cancer.
  • a panel of biomarkers may comprise one or more biomarkers from
  • One or more biomarkers may be selected based on a ranking or a weighting value assigned to the biomarker.
  • One or more biomarkers may comprise a gene or portion thereof, a genehancer, or a combination thereof.
  • One or more biomarkers may be selected based on a cancer type or stage of disease.
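Selecting biomarkers by a ranking or weighting value, as described above, can be sketched as a simple top-k selection. This is a minimal illustration only: how weights are assigned is not specified here, and the function name, the tie-breaking rule, and the marker names and weights in the usage note are hypothetical.

```python
def select_top_biomarkers(weights: dict, k: int) -> list:
    """Return the names of the `k` biomarkers with the highest weights.

    Ties are broken alphabetically by name so the result is deterministic.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]
```

For example, with made-up weights `{"marker_1": 0.9, "marker_2": 0.7, "marker_3": 0.7, "marker_4": 0.2}`, a call with `k=3` returns `marker_1`, `marker_2`, and `marker_3` in that order.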
  • One or more biomarkers may include 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, or more biomarkers selected from any one of FIG.
  • a biomarker may distinguish a premalignant condition from a benign condition. In some cases, a biomarker may identify a sample as having a premalignant condition.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 19B-C, FIG. 34B, FIG. 74, FIG. 136, FIG. 137, FIG. 139, FIG. 140, FIG. 141, FIG. 142, or any combination thereof.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 19B.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 19C.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 34B.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 74.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 136.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 137.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 139, FIG. 141, or a combination thereof.
  • a biomarker panel may comprise FIGN.
  • a biomarker panel may comprise MRPS31P2.
  • a biomarker panel may comprise RP11-797H7.1.
  • a biomarker panel may comprise GCOM2.
  • a biomarker panel may comprise RP11-95F22.1.
  • a biomarker panel may comprise USP32P2.
  • a biomarker panel may comprise RP1-155D22.1.
  • a biomarker may be a top ranked biomarker, such as a top ranked gene.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 8A.
  • a panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, SIX1, FRMD1, OTX1, CYP26C1, TMEM200B, NOL2, CXCL12, RP11-522B15.3, TBX2, TJP1, IHH, MAGI1-AS1, ZIC1, CNPY2, LRIG3, PINK1-AS, or any combination thereof.
  • a panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, SIX1, FRMD1, OTX1,
  • a panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, SIX1, FRMD1, OTX1, CYP26C1, TMEM200B, NOL2, or any combination thereof.
  • a panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, SIX1, or any combination thereof.
  • a panel of biomarkers may comprise one or more of
  • a panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, or any combination thereof.
  • a panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, or any combination thereof.
  • a panel of biomarkers may comprise one or more of C2CD4C, ZIC4, or a combination thereof.
  • a panel of biomarkers may comprise C2CD4C.
  • a biomarker may be a top ranked biomarker.
  • a biomarker may be top ranked for distinguishing a malignant sample from a normal sample.
  • a biomarker may be top ranked for distinguishing an early stage cancer from a normal sample.
  • a biomarker may be top ranked for an early screening molecular classifier or for a subject not suspected of having a cancer.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 19A-C.
  • a panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, TXLNA, RP11-95F22.1, CTC-273B12.10, RNA5SP129, RASSF10, or any combination thereof.
  • a panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, or any combination thereof.
  • a panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, or any combination thereof.
  • a panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, or any combination thereof.
  • a panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, or any combination thereof.
  • a panel of biomarkers may comprise one or more of FIGN, MRPS31P2, or any combination thereof.
  • a panel of biomarkers may comprise one or more of MRPS21P2, USP32P2, RP1-155D22.1, or any combination thereof.
  • a biomarker may not previously be associated with a cancer.
  • a biomarker may be a gene or genehancer.
  • a panel of biomarkers may comprise one or more of RP11-522B15.3, CYP26C1, TMEM200B, NOL6, FRMD1, FIGN, C2CD4C, MAGI1-AS1, PINK1-AS, or any combination thereof.
  • a panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, TXLNA, RP11-95F22.1, CTC-273B12.10, RNA5SP129, or any combination thereof.
  • a panel of biomarkers may comprise one or more of MCRIP2P1, RNU6-1265P, TRAPPC3, TXLNA, AC073257.2, FIGN, IGKV1-33, KLF2P3, RP11-523H20.3, DHX30, RNA5SP129, SNORA6, FKBP4P1, GCOM2, MTND3P24, RNA5SP152, RNU1-36P, RP11-96A1.5,
  • a panel of biomarkers may comprise one or more of RP1-155D22.1, FIGN, RP11-95F22.1, AHRR, NAA20, RP11-797H7.1, RPS2P46, NDUFA8, MRPS31P2, AC009120.6, C2CD4C, RN7SL635P, PCBD2, SLC24A1, KARS, CH17-11806.3, BEND7, RN7SKP69, PNMA1, RP11-21C4.1, LINC01607, AC005253.4, CTC-301O7.4, RP11-137H2.4, IRX3, RELL2, RP11-26J3.1, AE000662.93, FAM118A,
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 120.
  • the one or more biomarkers may be selected from FIG. 120 in the absence of GH15F067182.
  • RN7SL635P, PCBD2, SLC24A1, KARS, CH17-11806.3, BEND7, RN7SKP69, PNMA1, RP11-21C4.1, LINC01607, AC005253.4, CTC-301O7.4, RP11-137H2.4, IRX3, RELL2, RP11-26J3.1, AE000662.93, FAM118A, AC006028.11, GAPDHP65, one or more biomarkers from FIG. 120 or any combination thereof.
  • an epigenetic modification in a biomarker may not previously be associated with a cancer.
  • a biomarker may be a gene or genehancer.
  • a panel of biomarkers may comprise one or more of INHBB, SIX1, TJP1, IHH, CNPY2, or any
  • a panel of biomarkers may comprise one or more of MIR101-1, RBP7, CSNK1A1, CYP26C1, NDUFAB1, PES1, or any combination thereof.
  • a panel of biomarkers may comprise one or more of DSTN, BCAP29, NDUFAB1, STMN4, or any combination thereof.
  • a biomarker may distinguish samples having an early stage cancer from samples having a late stage cancer.
  • a panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, TXLNA, RP11-95F22.1, CTC-273B12.10, RNA5SP129, RASSF10, or any combination thereof.
  • a panel of biomarkers may comprise one or more of FIGN, SIX1, ZIC4, or any combination thereof.
  • a panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, SIX1, FRMD1, OTX1, CYP26C1, TMEM200B, NOL6, CXCL12, RP11-522B15.3, TBX2, TJP1, IHH, MAGI1-AS1, ZIC1, CNPY2, LRIG3, PINK1-AS, or any combination thereof.
  • a panel of biomarkers may comprise one or more of SIX1, ZIC4, INHBB, C2CD4C, or any combination thereof.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 51.
  • a panel of biomarkers may comprise one or more of RP11-522B15.3, CYP26C1, TMEM200B, NOL6, FRMD1, FIGN, C2CD4C, MAGI1-AS1, PINK1-AS, or any combination thereof.
  • a panel of biomarkers may comprise one or more of INHBB, SIX1, TJP1, IHH, CNPY2, or any
  • a panel of biomarkers may comprise one or more of RP11-522B15.3, CYP26C1, TMEM200B, NOL6, FRMD1, FIGN, C2CD4C, MAGI1-AS1, PINK1-AS, INHBB, SIX1, TJP1, IHH, CNPY2, or any combination thereof.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 74.
  • a panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, TXLNA, RP11-95F22.1, CTC-273B12.10, RNA5SP129, or any combination thereof.
  • a panel of biomarkers may comprise one or more biomarkers from FIG. 119.
  • a panel of biomarkers may comprise one or more of MCRIP2P1, RNU6-1265P, TRAPPC3, TXLNA, AC073257.2, FIGN, IGKV1-33, KLF2P3, RP11-523H20.3, DHX30, RNA5SP129, SNORA6, FKBP4P1, GCOM2, MTND3P24, RNA5SP152, RNU1-36P, RP11-96A1.5, ADAMTS19-AS1, PCBD2, SIL1, DUTP5, RP1-155D22.1, RP11-279022.1, RP11-797H7.1, AC083843.4,
  • PRR13P7, RNU4-50P, FAM210CP, RP11-481H12.1, RP1-65P5.3, RP11-22B23.2, RP11-121C6.4, RP11-128P10.1, MRPS31P2, RP11-95F22.1, AE000662.93, CTD-2302E22.5,
  • a panel of biomarkers may comprise one or more of MIR101-1, RBP7, CSNK1A1, CYP26C1, NDUFAB1, PES1, or any combination thereof.
  • a panel of biomarkers may comprise one or more of MCRIP2P1, RNU6-1265P,
  • RNA5SP129 SNORA6, FKBP4P1, GCOM2, MTND3P24, RNA5SP152, RNU1-36P, RP11-
  • a panel of biomarkers may comprise one or more biomarkers of FIG. 121.
  • a panel of biomarkers may comprise one or more of RP1-155D22.1, FIGN, RP11-95F22.1, AHRR, NAA20, RP11-797H7.1, RPS2P46, NDUFA8, MRPS31P2, AC009120.6, C2CD4C,
  • a panel of biomarkers may comprise one or more of DSTN, BCAP29, NDUFABl, STMN4, or any combination thereof.
  • a panel of biomarkers may comprise one or more of RP1-155D22.1, FIGN, RP11-95F22.1, AHRR, NAA20, RP11-797H7.1, RPS2P46, NDUFA8, MRPS31P2,
  • the HMCP-110 workflow may reduce sample attrition from 30% to 5% and eliminate the strong operator biases seen in the HMCP-150 study.
  • the analysis may identify many significantly differential hydroxymethylated features (both gene bodies and enhancers) that have been previously associated with cancer (such as CRC) or not previously associated with cancer.
  • FIG. 1 shows key improvements of the HMCP-110 protocol as compared to the HMCP-150 protocol.
  • a total of 110 colorectal cancer (CRC) and healthy volunteer (HV) plasma samples are processed through the HMCP v2 protocol with significant improvements to project management, data analysis and overall execution. Improvements may include a reduction in operator bias, a reduction in attrition rate, or a combination thereof.
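To make the stated attrition figures concrete, the following minimal Python sketch computes the expected number of samples retained at a given attrition rate. Applying both the 30% and the 5% rate to a batch of 110 inputs is an assumption made purely for illustration; the text does not state the cohort size to which each rate applies, and `expected_retained` is a hypothetical helper.

```python
def expected_retained(n_samples, attrition_rate):
    """Expected number of samples that survive processing.

    attrition_rate is the fraction of samples lost (0.0 to 1.0).
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if not 0.0 <= attrition_rate <= 1.0:
        raise ValueError("attrition_rate must be between 0 and 1")
    return n_samples * (1.0 - attrition_rate)


# Illustrative comparison for a hypothetical batch of 110 input samples.
for label, rate in (("30% attrition", 0.30), ("5% attrition", 0.05)):
    print(label, round(expected_retained(110, rate), 1))
```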
  • HMCP-110 protocol is shown in FIG. 33.
  • Day 1. Summary. cfDNA samples will undergo end repair, addition of an A-base overhang, adaptor ligation, and post-ligation purification. ~3.8% of the ligation product will be amplified, purified, and QC'ed by Qubit and BioAnalyzer, while the remainder is reserved for processing on day 2.
  • Day 2. Summary. The remaining purified ligation product from day 1 is denatured into single strands, which are copied to produce double-stranded material. 5-Hydroxymethylated cytosines are chemically labeled and then bound to a biotin conjugate, followed by a clean-up of this reaction.
  • Biotin-conjugated 5hmC-containing DNA fragments are bound to streptavidin beads. Using a magnet, the unbound material (non-5hmC-containing fragments) is washed away. Following this, the bound DNA fragments are denatured into single-stranded DNA, leaving the copy strand in solution while the biotin-conjugated original strand remains bound to the streptavidin beads. The single-stranded copy strand is amplified. The library size and molarity are determined for the amplified enriched (5hmC-containing) libraries by the BioAnalyzer.
  • the HMCP-110 protocol may be a modified version of the HMCP v2 protocol as described above and in FIG. 33 and FIG. 1.
  • a method as described herein may comprise associating a label with an epigenetically modified base of a nucleic acid sequence to form a labeled nucleic acid sequence; hybridizing a substantially complementary strand to the labeled nucleic acid sequence; and amplifying the substantially complementary strand in a reaction in which the labeled nucleic acid sequence is substantially not present.
  • One or more individual elements of the method need not be performed in a particular order. For example, associating a label may occur after the hybridizing. One or more individual elements of a given method may be performed in a different order than described herein.
  • FIG. 125 shows one example of the 5-hmC Pulldown Label Copy Enrich (HMCP LCE) method.
  • the HMCP LCE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (c) any combination thereof.
  • a first element 201 may be to prepare a plurality of double-stranded fragments 202, such as a library of oligonucleotide fragments.
  • the plurality of double-stranded fragments may comprise cell-free DNA.
  • the plurality of double-stranded fragments may comprise one or more epigenetic modifications on one or both strands.
  • a second element 203 may be to associate a label (such as an azido-glucose label) with at least one of the oligonucleotide fragments from the plurality of double-stranded fragments to form a modified oligonucleotide fragment 204.
  • the label may associate with an epigenetic modification present at one or more bases of the modified oligonucleotide fragment.
  • a third element 205 may be to separate the modified oligonucleotide fragment to form one or more single-stranded modified oligonucleotide fragments 206.
  • a fourth element 207 may be to hybridize a complementary strand, such as a substantially complementary strand, to a single-stranded modified
  • a fifth element 210 may be to associate a label 209 with the modified
  • the label 209 may also associate with a substrate.
  • the label 209 may bind to an epigenetic modification or to a label previously associated with an epigenetic modification.
  • the label 209 may not bind directly to the complementary strand.
  • the complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment.
  • a sixth element 211 may be to enrich a sample for one or more complementary strands 212 by removing or separating or washing away from the substrate one or more complementary strands (such as by disrupting the bond between the complementary strand and the opposing strand) and then separating the complementary strand from the modified oligonucleotide fragment that remains associated with the substrate.
  • a seventh element 213 may be to amplify the enriched complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 214 of the complementary strand.
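The elements above (label, copy, bind to a substrate, wash, elute the copy, amplify) can be sketched as a toy model. The Python below is a minimal illustration only, not the described chemistry: the `Fragment` type, the function names, and the idealized doubling per PCR cycle are all assumptions introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    name: str
    sequence: str                # parent strand, written 5' to 3'
    hmc_positions: tuple = ()    # indices of 5-hmC bases on the parent strand


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def copy_strand(sequence):
    """Reverse complement of the parent; the copy carries no epigenetic marks."""
    return sequence.translate(_COMPLEMENT)[::-1]


def label_copy_enrich(fragments, pcr_cycles=2):
    """Toy model: return {name: (copy_strand, copy_number)} for enriched copies."""
    # Label: only fragments with at least one 5-hmC receive a label.
    labeled = {f.name: bool(f.hmc_positions) for f in fragments}
    # Copy: every parent strand is paired with an unmodified complementary strand.
    duplexes = [(f, copy_strand(f.sequence)) for f in fragments]
    # Bind and wash: duplexes whose parent strand is unlabeled are washed away.
    bound = [(f, c) for f, c in duplexes if labeled[f.name]]
    # Elute: only the copy strands are kept; parents stay on the substrate.
    eluted = [(f.name, c) for f, c in bound]
    # Amplify: idealized doubling per cycle, independent of 5-hmC density.
    return {name: (copy, 2 ** pcr_cycles) for name, copy in eluted}
```

In this model the copy number after amplification depends only on the cycle count, not on how many 5-hmC bases the parent carried, which is the property the method attributes to amplifying the unmodified copy rather than the labeled parent.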
  • the library may comprise double-stranded oligonucleotide fragments or single-stranded oligonucleotide fragments.
  • the oligonucleotide fragments may be DNA or RNA.
  • the library may be a next-generation sequencing (NGS) library.
  • the library may comprise an oligonucleotide fragment having an adaptor (such as an NGS adaptor) at (a) one or both ends of the fragment, (b) at one or both strands of the double-stranded oligonucleotide fragment, or (c) a combination thereof.
  • the adaptor may uniquely identify the oligonucleotide fragment from other oligonucleotide fragments in a sample or in a library.
  • the adaptor may be specific to or selective for double-stranded DNA.
  • a label may associate with an epigenetic modification (such as 5-hmC) or a type of epigenetic modification present at a base of the oligonucleotide fragment.
  • a label may associate with a plurality of epigenetic modifications present on one or both strands of a double-stranded oligonucleotide fragment.
  • a label may associate with a type of epigenetic modification (such as 5-hmC).
  • a label may be selective for a type of epigenetic modification (such as a 5-hmC). The label may be selective for double-stranded oligonucleotide fragments and may not label single-stranded fragments.
  • the label may be selective for single-stranded oligonucleotide fragments.
  • the label may associate with (such as bind to) the epigenetic modification with an aid, such as an enzyme.
  • the enzyme may be selective for double-stranded oligonucleotide fragments, such as beta-glucosyltransferase (bGT).
  • the label may associate with the epigenetic modification by click chemistry.
  • the label may be an azido-sugar, such as an azido-glucose.
  • a double-stranded oligonucleotide fragment may be separated to form single stranded fragments, such as separating by denaturation.
  • a complementary strand may be hybridized to at least a portion of a single stranded oligonucleotide.
  • a complementary strand may be a primer, such as a primer that may be complementary to the adaptor (such as an NGS adaptor).
  • a complementary strand may be a substantially complementary strand, such as substantially complementary along an entire length of the oligonucleotide fragment.
  • the substantially complementary strand may be absent (a) the label that may be present in the parent oligonucleotide fragment, (b) the epigenetic modification that may be present in the parent oligonucleotide fragment, or (c) a combination thereof.
  • the substantially complementary strand may be hybridized to the parent oligonucleotide fragment by DNA extension or cDNA extension.
  • the complementary strand may be indirectly associated with a substrate.
  • the association to the substrate may occur via the label associated with the epigenetic modification on the parent oligonucleotide fragment.
  • the substantially complementary strand may be free of any label and/or free of any epigenetic modification.
  • the association between the label and the substrate may be disrupted.
  • oligonucleotide fragments comprising an epigenetic modification may be separated from oligonucleotide fragments absent any epigenetic modifications or absent a type of epigenetic modification. Separation may occur by associating the label with a substrate, such that any fragment absent the epigenetic modification or the type of epigenetic modification may be removed. Removal may occur by washing, such as stringent washing of the substrate. Following removal of oligonucleotide fragments lacking an epigenetic modification or a type of epigenetic modification, the substantially complementary strand may be separated from the parent oligonucleotide fragment strand. The parent oligonucleotide fragment strand may remain associated with the substrate. The parent oligonucleotide fragment strand and the substrate may be discarded. The substantially complementary strand may be amplified in a reaction vessel that may be free of the parent oligonucleotide fragment strand.
  • FIG. 126 shows one example of the 5-hmC Pulldown Copy Label Enrich (HMCP CLE) method.
  • the HMCP CLE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (c) any combination thereof.
  • a first element 301 may be to prepare a plurality of double stranded oligonucleotide fragments 302, such as a library.
  • the double stranded oligonucleotide fragments may comprise cell-free DNA.
  • the double stranded oligonucleotide fragments may have epigenetic modifications on one or more bases of one or both strands.
  • a second element 303 may be to separate the strands of a double-stranded oligonucleotide fragment of the plurality to form one or more single-stranded oligonucleotide fragments 304.
  • the one or more single-stranded oligonucleotide fragments may comprise one or more bases having an epigenetic modification.
  • a third element 305 may be to hybridize a complementary strand, such as a substantially complementary strand, to at least one single-stranded oligonucleotide fragment to form a modified oligonucleotide fragment 306.
  • the complementary strand may be
  • a fourth element 307 may be to associate a label (such as an azido-glucose label) with the modified oligonucleotide fragment to form a labeled modified
  • the label may associate with an epigenetic modification present in the modified oligonucleotide fragment.
  • the label may not be associated with the substantially complementary strand that may lack an epigenetic modification.
  • a fifth element 310 may be to associate a label 309 with the modified oligonucleotide fragment wherein the label 309 may also associate with a substrate.
  • the label 309 may not bind directly to the complementary strand.
  • the complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment.
  • the association between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond.
  • a sixth element 311 may be to enrich a sample for one or more complementary strands 312 by removing or separating or washing away from the substrate one or more complementary strands (such as by disrupting the bond between the complementary strand and the opposing strand). Upon separation, the modified oligonucleotide fragment may remain associated with the substrate.
  • enriching a sample for one or more complementary strands may comprise washing a substrate, such as stringent washing of a substrate. Washing may remove one or more non-covalently bound fragments, one or more non-specifically physisorbed fragments, or a combination thereof. Washing may not disrupt or alter an association between a modified oligonucleotide fragment and a substrate, such that a sample may be enriched for the complementary strand.
  • a seventh element 313 may be to amplify the
  • the library may comprise double-stranded oligonucleotide fragments or single-stranded oligonucleotide fragments.
  • the oligonucleotide fragments may be DNA or RNA.
  • the library may be a next-generation sequencing (NGS) library.
  • the library may comprise an oligonucleotide fragment having an adaptor (such as an NGS adaptor) at (a) one or both ends of the fragment, (b) at one or both strands of the double-stranded oligonucleotide fragment, or (c) a combination thereof.
  • the adaptor may uniquely identify the oligonucleotide fragment from other oligonucleotide fragments in a sample or in a library.
  • the adaptor may be specific to or selective for double-stranded DNA.
  • a double-stranded oligonucleotide fragment may be separated to form single stranded fragments, such as separating by denaturation.
  • a complementary strand may be hybridized to at least a portion of a single stranded oligonucleotide.
  • a complementary strand may be a primer, such as a primer that may be complementary to the adaptor (such as an NGS adaptor).
  • a complementary strand may be a substantially complementary strand, such as substantially complementary along an entire length of the oligonucleotide fragment. The substantially complementary strand may be absent the epigenetic modification that may be present in the parent oligonucleotide fragment.
  • a label may associate with an epigenetic modification (such as 5-hmC) or a type of epigenetic modification present at a base of the parent oligonucleotide fragment.
  • a label may associate with a plurality of epigenetic modifications present on the parent oligonucleotide fragment.
  • a label may associate with a type of epigenetic modification (such as 5-hmC).
  • a label may be selective for a type of epigenetic modification (such as a 5-hmC). The label may be selective for double-stranded fragments and may not label single-stranded fragments.
  • the label may be selective for single-stranded fragments.
  • the label may associate with (such as bind to) the epigenetic modification of the parent strand with an aid, such as an enzyme.
  • the enzyme may be selective for double-stranded oligonucleotide fragments, such as beta-glucosyltransferase (bGT).
  • the label may associate with the epigenetic modification by click chemistry.
  • the label may be an azido-sugar, such as an azido-glucose.
  • the complementary strand may be indirectly associated with a substrate.
  • the association to the substrate may occur via the label associated with the epigenetic modification on the parent oligonucleotide fragment.
  • the substantially complementary strand may be free of any label and/or free of any epigenetic modification.
  • the association between the label and the substrate may be disrupted.
  • oligonucleotide fragments comprising an epigenetic modification may be separated from oligonucleotide fragments absent any epigenetic modifications or absent a type of epigenetic modification. Separation may occur by associating the label with a substrate, such that any fragment absent the epigenetic modification or the type of epigenetic modification may be removed. Removal may occur by washing, such as stringent washing of the substrate. Following removal of oligonucleotide fragments lacking an epigenetic modification or a type of epigenetic modification, the substantially complementary strand may be separated from the parent oligonucleotide fragment strand. The parent oligonucleotide fragment strand may remain associated with the substrate. The parent oligonucleotide fragment strand and the substrate may be discarded. The substantially complementary strand may be amplified in a reaction vessel that may be free of the parent oligonucleotide fragment strand.
  • FIG. 127 shows one example of the 5-hmC Pulldown Label Random prime Enrich (HMCP LRE) method.
  • the HMCP LRE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or
  • a first element 401 may be to associate a label (such as an azido-glucose label) with a double stranded oligonucleotide fragment to yield a modified oligonucleotide fragment 402.
  • the double stranded oligonucleotide may comprise cell-free DNA.
  • the label may associate with an epigenetic modification or a type of epigenetic modification present at a base of one or both strands of the double stranded oligonucleotide fragment to form the modified oligonucleotide fragment 402.
  • a second element 403 may be to separate the strands of the modified oligonucleotide fragment to form one or more single-stranded modified oligonucleotide fragments and then to hybridize a complementary strand, such as a substantially complementary strand to at least one of the single-stranded modified oligonucleotide fragments to form a double stranded modified oligonucleotide fragment 404 having a complementary strand and a modified oligonucleotide fragment having the label.
  • the complementary strand may be absent the label and absent the epigenetic modification.
  • a third element 405 may associate an adaptor to the double stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double stranded modified oligonucleotide fragment) to form a double stranded modified oligonucleotide fragment having one or more adaptors 406, such as a labeled chimeric library.
  • a fourth element 408 may be to associate a label 407 with the modified oligonucleotide fragment wherein the label 407 may also associate with a substrate.
  • the label 407 may bind to an epigenetic modification or to the label previously associated with an epigenetic modification.
  • the label 407 may not bind directly to the complementary strand.
  • the complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The interaction between the
  • a fifth element 409 may be to enrich a sample for one or more complementary strands 410 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the bond between the complementary strand and the opposing strand) and then separating the complementary strand from the modified oligonucleotide fragment that remains associated with the substrate.
  • a sixth element 411 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 412 of the complementary strand.
  • a label may associate with an epigenetic modification (such as 5-hmC) present at a base of the parent oligonucleotide fragment.
  • a label may associate with a plurality of epigenetic modifications present on the parent oligonucleotide fragment.
  • a label may associate with a type of epigenetic modification (such as 5-hmC).
  • a label may be selective for a type of epigenetic modification (such as a 5-hmC).
  • the label may be selective for double-stranded fragments and may not label single-stranded fragments.
  • the label may be selective for single-stranded fragments.
  • the label may associate with (such as bind to) the epigenetic modification of the parent strand with an aid, such as an enzyme.
  • the enzyme may be selective for double-stranded oligonucleotide fragments, such as beta-glucosyltransferase (bGT).
  • the label may associate with the epigenetic modification by click chemistry.
  • the label may be an azido-sugar, such as an azido-glucose.
  • a position of a label may be determined by the presence/absence of 5-hmC in a dsDNA parent fragment.
  • a label may be an azido-glucose, transferred to a 5-hmC from UDP-6-azide-glucose (UDP-N3-glc) by beta-glucosyltransferase (βGT).
  • Labeling may be performed directly on a purified circulating tumor DNA (ctDNA) extract.
  • hybridizing may comprise (i) priming (such as random priming),
  • random priming may be performed by incubating an azido-labeled double-stranded DNA (dsDNA) duplex in the presence of an oligomer pool (where each oligo in the pool may comprise a degenerate N6, N7, N8, N9, N10 or beyond "head" attached to an "NGS-adapter" tail), a DNA polymerase (e.g. Klenow) and a native nucleoside triphosphate comprising deoxyribose (dNTP) mix in a given buffer, and performing a single extension reaction at 37 °C for a defined time (e.g. 10 mins).
  • a degenerate primer "head" may randomly prime a template DNA and may make multiple copies for each of the parent strands.
  • Random priming may achieve two elements in one by: 1) introducing an NGS-specific adapter sequence and 2) generating a modification-free copy (daughter strand) of the modified parent strand.
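The oligomer pool described above pairs a degenerate N-mer "head" with an adapter "tail". The following minimal Python sketch shows how the number of distinct heads grows with head length and how one such primer could be drawn at random. The tail sequence is a made-up placeholder rather than a real platform adapter, and placing the tail 5' of the head (so that the degenerate 3' end anneals and is extended) is an assumption; the function names are hypothetical.

```python
import random

BASES = "ACGT"

# Placeholder tail for illustration only; not a real NGS adapter sequence.
PLACEHOLDER_ADAPTER_TAIL = "ACGTACGTAC"


def pool_complexity(head_length):
    """Number of distinct degenerate heads of the given length (4 ** n)."""
    if head_length < 1:
        raise ValueError("head_length must be at least 1")
    return 4 ** head_length


def random_primer(head_length, adapter_tail, rng):
    """Draw one primer: adapter tail (5') followed by a random N-mer head (3')."""
    head = "".join(rng.choice(BASES) for _ in range(head_length))
    return adapter_tail + head


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the illustration is repeatable
    for n in (6, 7, 8, 9, 10):
        print(f"N{n}: {pool_complexity(n)} possible heads")
    print(random_primer(6, PLACEHOLDER_ADAPTER_TAIL, rng))
```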
  • adapter ligation may occur by incubating a mono-adapted chimeric labelled duplex template with an NGS-platform specific adapter (a forked adapter, a linear duplex adapter, a hairpin adapter, or a combination thereof) with 3' T overhang and 5' PO4 end, a dsDNA ligase (e.g. T4 ligase) and necessary cofactors (e.g. Mg2+, adenosine triphosphate (ATP), polyethylene glycol (PEG)) in a given buffer, at 20 °C for a defined period of time (e.g. 15 minutes).
  • the A overhang of the monoadapted chimeric labelled duplex may match with the T overhang of the adapter and may promote ligation efficiency. Only one end of each duplex (that being formed by the 3' end of the daughter strand) may be adapted.
  • a successful ligation product may have a singly adapted azido-labeled parent strand (5' adapted) and a doubly adapted non-modified daughter strand (both 3' and 5' ends). In some cases, upon amplification of such a "library", only the bottom strand may be amplifiable with an adapter-specific polymerase chain reaction (PCR) primer.
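The adapter logic above can be checked with a small model. This minimal Python sketch assumes, as a simplification, that a strand is amplifiable with adapter-specific primers only when it carries adapter sequence at both its 5' and 3' ends; the `Strand` type and `is_amplifiable` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strand:
    name: str
    adapter_5p: bool      # adapter sequence present at the 5' end
    adapter_3p: bool      # adapter sequence present at the 3' end
    azido_labeled: bool   # carries the azido-glucose label


def is_amplifiable(strand):
    """Simplified rule: adapter-primed PCR needs adapters at both strand ends."""
    return strand.adapter_5p and strand.adapter_3p


# Strands of a successful ligation product as described in the text.
parent = Strand("parent", adapter_5p=True, adapter_3p=False, azido_labeled=True)
daughter = Strand("daughter", adapter_5p=True, adapter_3p=True, azido_labeled=False)

for strand in (parent, daughter):
    print(strand.name, "amplifiable:", is_amplifiable(strand))
```

Under this rule only the unlabeled daughter strand qualifies as a PCR template, which matches the statement that only one strand of the library duplex may be amplifiable.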
  • magnetic bead binding may enable selective enrichment of labeled chimeric next generation sequencing (NGS) library fragments. This may be achieved directly (i.e. by Sharpless azide-alkyne cycloaddition reaction (CLICK) chemistry between the azido-glucose label and dibenzocyclooctyne (DBCO)-magbead) or indirectly (i.e. by Sharpless azide-alkyne cycloaddition reaction (CLICK) of a dibenzocyclooctyne (DBCO)-biotin linker and then conjugation of the product to streptavidin-magbeads). In some cases, only azido-labeled fragments (i.e. 5-hmC-containing fragments) may bind to the magbead.
  • Azido-labeled fragments may be immobilized to a bead, such as a magnetic bead. In some cases, this interaction may only occur via a labeled parent strand of the chimeric NGS library duplex.
  • a copied complement may not be azido-labeled and thus may be immobilized to a bead by virtue of the hydrogen-bonding interaction between the complementary duplex strands. As this H-bonding interaction may be non-covalent, it may be disrupted and exploited in downstream steps.
  • enrichment by stringent washing may be essential to maximize a signal-to-noise ratio of an enrichment process.
  • Chimeric NGS library immobilized beads may be washed stringently (e.g. specific buffers; mild heat; mild denaturants etc.) to selectively remove non-covalently bound NGS library fragments that are non-specifically physisorbed to their surface. In some cases, such types of fragments may cause noise in a final sequencing result.
  • Chimeric NGS library fragments covalently bound to the bead surface may be selected for in the enrichment (i.e. signal, those whose insert may originally have contained 5-hmC). After stringent washing, a daughter strand may be eluted from the bead (e.g.
  • these daughter strands may be exact complements of the labeled strands immobilized to a bead. However, they may not contain any epigenetic modifications and hence may be free from "5-hmC-density" amplification bias. Amplification of these eluted daughter strands may give a superior result over existing methodologies for two reasons: 1) an improved resolution (higher signal-to-noise) and 2) an improved representation (decreased selection bias).
  • the methods and systems as described herein may provide a result that may be far more representative of an extent to which a nucleic acid may be marked epigenetically.
  • the methods and systems may be superior to other methods of identification of epigenetic modifications.
  • Other methods of identification may include the HMCP method or a method that comprises associating a sugar, a protein, an antibody, or a fragment of any of these with an epigenetic modification and detecting a presence of the sugar, the protein, the antibody, or fragment thereof.
  • nucleic acid sequences, such as fragments containing a high density of epigenetic modifications, may not be detected using other methods of identification of epigenetic modifications.
  • the unbiased approach of the present methods and systems provides for detection of high density epigenetic modifications of nucleic acid sequences, such as short fragments, yielding an unbiased detection.
  • a daughter strand PCR amplification may occur.
  • PCR may be employed using only an eluted daughter strand as amplification template using standard protocols and procedures.
  • minimizing a number of PCR cycles may minimize duplicates.
  • using UMI-codes within an adapter sequence may help quantitation during downstream analysis.
  • a genome wide library of enriched fragments may be ready for sequencing.
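As a minimal sketch of how UMI codes could support quantitation during downstream analysis, the following assumes each sequenced read has already been reduced to a (chromosome, start position, UMI) tuple. The function name, the tuple format, and the exact-match collapsing rule are illustrative assumptions and are not taken from the described method.

```python
from collections import Counter


def count_unique_molecules(reads):
    """Collapse PCR duplicates: reads sharing a locus and a UMI count once.

    `reads` is an iterable of (chrom, start, umi) tuples (hypothetical format).
    Returns a Counter mapping (chrom, start) -> number of distinct UMIs.
    """
    distinct = set(reads)  # exact duplicates (same locus and UMI) collapse here
    return Counter((chrom, start) for chrom, start, _umi in distinct)


# Hypothetical reads, for illustration only.
reads = [
    ("chr1", 100, "ACGT"),
    ("chr1", 100, "ACGT"),  # PCR duplicate of the read above
    ("chr1", 100, "TTAG"),  # same locus, different original molecule
    ("chr2", 500, "ACGT"),
]
molecule_counts = count_unique_molecules(reads)
```

Counting distinct UMIs per locus, rather than raw reads, is one way the duplicate reads introduced by additional PCR cycles could be kept from inflating the signal.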
  • FIG. 128 shows one example of the 5-hmC Pulldown Random prime Label
  • the HMCP RLE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or
  • FIG. 128 priming and ligation may occur after labeling as shown in
  • a first element 501 may (i) separate strands of a double stranded oligonucleotide fragment, such as a cell-free DNA fragment (having one or more epigenetic modifications at one or more bases on one or both strands) and (ii) initiate random priming to form a complementary strand, such as a substantially complementary strand, to at least one of the single stranded oligonucleotide fragments.
  • Random priming may form a double stranded modified oligonucleotide fragment 502.
  • the complementary strand formed by random priming may not have epigenetic modifications or may be substantially free of epigenetic modifications.
  • a second element 503 may associate an adaptor to the double stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double stranded modified oligonucleotide fragment) to form a double stranded modified oligonucleotide fragment having one or more adaptors 504.
  • a third element 505 may associate a label (such as an azido-glucose label) with the double stranded modified oligonucleotide fragment to yield a labeled fragment 506, such as a labeled chimeric library.
  • the label may associate with an epigenetic modification or a type of epigenetic modification present at a base of the double stranded oligonucleotide fragment to form the labeled fragment 506.
  • a fourth element 508 may be to associate a label 507 with the double stranded modified oligonucleotide fragment wherein the label 507 may also associate with a substrate.
  • the label 507 may not bind directly to the complementary strand.
  • the complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment.
  • the interaction between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond.
  • a fifth element 509 may be to enrich a sample for one or more complementary strands 510 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the interaction between the complementary strand and the opposing strand). Upon separation, the modified oligonucleotide fragment may remain associated with the substrate.
  • a sixth element 511 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 512 of the complementary strand.
  • FIG. 129 shows one example of the 5-hmC Pulldown Label Loci Specific Enrich
  • the HMCP LLSE method may provide (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in a 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (a) an improved resolution as compared to other methods, such as
  • a first element 601 may associate a label (such as an azido-glucose label) with the double stranded oligonucleotide fragment, such as a cell-free DNA fragment to yield a labeled fragment 602.
  • the label may associate with an epigenetic
  • a second element 603 may (i) separate strands of a labeled fragment and (ii) initiate loci specific priming to form a complementary strand, such as a substantially complementary strand, to at least one of the single stranded oligonucleotide fragments.
  • Loci specific priming may form a double stranded modified oligonucleotide fragment 604 having a label associated with an epigenetic modification of the parent strand.
  • the complementary strand may be absent both epigenetic modifications and the associated label.
  • a third element 605 may associate an adaptor to the double stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double stranded modified oligonucleotide fragment) to form a double stranded modified oligonucleotide fragment having one or more adaptors 606, such as a labeled and loci-enriched chimeric library.
  • a fourth element 608 may be to associate a label 607 with the double stranded modified oligonucleotide fragment wherein the label 607 may also associate with a substrate. The label 607 may not bind directly to the complementary strand.
  • the complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment.
  • the interaction between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond.
  • a fifth element 609 may be to enrich a sample for one or more complementary strands 610 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the bond between the complementary strand and the opposing strand). Upon separation, the opposing strand may remain associated with the substrate.
  • a sixth element 611 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 612 of the complementary strand.
  • both strands of double stranded DNA (dsDNA) fragments containing 5-hmC may be labeled using beta-glucosyltransferase (βGT) and UDP-6-azide-glucose (UDP-N3-glc).
  • Position of label may be determined by the presence/absence of 5-hmC in the dsDNA parent fragment.
  • a label may be azido-glucose, transferred to the 5-hmC from UDP-N3-glc by βGT.
  • the labeling may be performed directly on the purified circulating tumor DNA (ctDNA) extract.
  • the ctDNA may not have been through a series of library prep steps ahead of labeling. There may therefore be more material available at the labeling step (improved efficiency), and the sample presented for labeling may be more representative than may be the case after NGS library prep.
  • hybridizing may comprise (i) priming (such as loci specific priming), (ii) ligation (such as adapter ligation), or (iii) a combination thereof.
  • for example, in FIG. 129, loci specific priming may be performed by incubating azido-labeled dsDNA duplexes in the presence of an oligomer pool (where each oligo in the pool may comprise a loci specific "head" attached to an "NGS-adapter" tail), a DNA polymerase (e.g. Klenow) and a native dNTP mix in a given buffer, and performing a single extension reaction at 37 °C for a defined time (e.g. 10 mins).
  • a loci specific head may be designed to be complementary to specific, defined regions of interest (ROI). Extension from an annealed loci specific primer may result in an A-overhang at an end of a daughter copy.
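The "head plus tail" oligo design can be illustrated with a short sketch. It assumes the region of interest is supplied as the 5'-to-3' sequence of the template (parent) strand, so the head that anneals to it is its reverse complement, and the NGS-adapter tail is placed at the 5' end of the oligo so that extension proceeds from the 3' end of the head. The function names and both sequences are hypothetical placeholders, not real adapter or locus sequences.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def build_loci_specific_oligo(adapter_tail, roi_template):
    """Return a 5'->3' oligo: adapter tail followed by a loci specific head.

    `roi_template` is the 5'->3' sequence of the template region the head
    should anneal to; the head is therefore its reverse complement.
    """
    return adapter_tail.upper() + reverse_complement(roi_template)


# Hypothetical toy sequences, far shorter than a practical primer.
oligo = build_loci_specific_oligo("AATT", "GGGCA")
```

A pool of such oligos, one per region of interest, would correspond to the oligomer pool described above; real designs would also need to account for melting temperature and specificity, which this sketch ignores.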
  • the loci specific priming may achieve two elements in one: 1) it may introduce an NGS-specific adapter sequence in a loci-specific manner and 2) it may generate a modification-free copy (daughter strand) of the modified parent strand.
  • a labelled loci-monoadapted chimeric duplex template may be incubated with a NGS-platform specific adapter (illustration shows forked adapter, but linear duplex adapter or hairpin adapter may be substituted) with 3' T overhang and 5' PO4 end, a dsDNA ligase (e.g. T4 ligase) and necessary cofactors (e.g. Mg2+, adenosine triphosphate (ATP), polyethylene glycol (PEG)) in a given buffer, at 20 °C for a defined period of time (e.g. 15 minutes).
  • the A overhang of the monoadapted chimeric labelled duplex may match with the T overhang of the adapter and may promote ligation efficiency. In some cases, only one end of each duplex (that being formed by the 3' end of the daughter strand) may be adapted.
  • a successful ligation product may have a singly adapted azido-labeled parent strand (5' adapted) and a doubly adapted non-modified daughter strand (both 3' and 5' ends). Were one to amplify this "library", it may be that only a bottom strand may be amplifiable with adapter-specific PCR primers.
  • an enrichment of the daughter strand by a substrate may be employed followed by PCR amplification of the daughter strand that may be substantially free of epigenetic modifications.
  • FIG. 130 shows one example of the 5-hmC Pulldown Loci Specific Label Enrich (HMCP LSLE) method detailed herein.
  • the HMCP LSLE method may provide (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in a 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of
  • FIG. 130 is similar to the method of FIG. 129 except that in some cases, priming
  • FIG. 130 priming and ligation may occur after labeling as shown in FIG. 129
  • a first element 701 may (i) separate strands of a double stranded oligonucleotide fragment, such as a cell-free DNA fragment and (ii) initiate loci specific priming to form a complementary strand, such as a substantially complementary strand, to at least one of the single stranded parent strands.
  • Loci specific priming may form a double stranded modified oligonucleotide fragment 702.
  • the double stranded oligonucleotide fragment may have one or more epigenetic modifications at one or more bases on one or both strands.
  • the complementary strand, such as a substantially complementary strand, formed by loci specific priming may not have epigenetic modifications.
  • a second element 703 may associate an adaptor to the double stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double stranded modified oligonucleotide fragment) to form a double stranded modified oligonucleotide fragment having one or more adaptors 704.
  • a third element 705 may associate a label (such as an azido-glucose label) with the double stranded modified
  • a fourth element 708 may be to associate a label 707 with the double stranded modified oligonucleotide fragment wherein the label 707 may also associate with a substrate.
  • the label 707 may not bind directly to the complementary strand.
  • the complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment.
  • a fifth element 709 may be to enrich a sample for one or more complementary strands 710 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the bond between the complementary strand and the opposing strand). Upon separation, the opposing strand may remain associated with the substrate.
  • a sixth element 711 may be to amplify the complementary strand in the absence of the parent strand to form one or more daughter strands 712 of the complementary strand.
  • the HMCP method may be referred to herein as the 'standard' method.
  • the HMCP method may be referred to herein as HMCP, HMCP-v1, HMCPv1, v1HMCP, v1 HMCP, or V1.
  • the CLE method may be referred to herein as HMCP CLE, HMCP-v2, HMCPv2, CLE-HMCP, v2HMCP, or v2 HMCP.
  • one or more individual elements of a given method may be performed in the order as described herein. In some cases, one or more individual elements of a given method need not be performed in a particular order described herein. In some cases, one or more individual elements of a given method may be performed in a different order than described herein.
  • the complementary strand may be a substantially complementary strand or may comprise a portion that may be substantially complementary to a portion of a nucleic acid sequence.
  • Hybridizing may comprise hybridizing at least two complementary strands to at least two portions of a nucleic acid sequence.
  • Hybridizing may comprise hybridizing at least a portion of a complementary strand to an adapter sequence of the nucleic acid sequence.
  • Hybridizing may comprise extension, such as cDNA extension.
  • Hybridizing may comprise priming, such as loci specific priming or random priming.
  • Hybridizing may comprise ligation, such as adapter ligation.
  • Hybridizing may comprise hybridizing a primer to a nucleic acid sequence and elongating from the primer to form a complementary strand.
  • Hybridizing may comprise obtaining a complementary strand and hybridizing the complementary strand to the nucleic acid sequence.
  • a label may be associated with an epigenetically modified base of a nucleic acid sequence.
  • a label may be associated with an epigenetically modified base before hybridizing.
  • a label may be associated with an epigenetically modified base after hybridizing.
  • the method may comprise amplifying the complementary strand in a reaction in which the nucleic acid sequence may be substantially not present.
  • the amplifying may comprise associating the nucleic acid sequence and complementary strand with a substrate, such as by a label.
  • the amplifying may comprise washing a substrate that may be associated with the nucleic acid sequence and complementary strand, such as stringent washing.
  • the amplifying may comprise eluting a complementary strand from the substrate on which the nucleic acid sequence remains.
  • the amplifying may comprise amplifying the complementary strand.
  • An epigenetic modification may comprise a DNA methylation.
  • a DNA methylation may comprise a hyper-methylation or a hypo-methylation.
  • a DNA methylation may comprise a modification of a DNA base, such as a 5-methylcytosine (5-mC), a 4-methylcytosine, a 6-methyladenine, or a combination thereof.
  • Embodiment 1 A method comprising: (a) assaying a sample for a nucleotide sequence having at least 70% sequence homology to a biomarker listed in Table 1 to produce a result, wherein the sample is from a subject having cancer or suspected of having cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
  • Embodiment 2 The method of claim 1, wherein based on the comparing of (b) the sample is identified as benign or malignant for the cancer.
  • Embodiment 3 The method of any one of claims 1-2, wherein the nucleotide sequence has at least 85% sequence homology to the biomarker listed in Table 1.
  • Embodiment 4 The method of any one of claims 1-3, wherein at least five biomarkers listed in Table 1 or Table 2 are assayed in (a).
  • Embodiment 5 The method of any one of claims 1-4, wherein the assaying of (a) comprises assaying for a presence or an absence of an epigenetic modification.
  • Embodiment 6 The method of any one of claims 1-5, wherein the biomarker is a transcription factor.
  • Embodiment 7 A method comprising: (a) assaying a sample for a presence or an absence of an epigenetic modification in a nucleotide sequence having at least 70% sequence homology to a biomarker listed in Table 2 to produce a result, wherein the sample is from a subject having cancer or suspected of having cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
  • Embodiment 8 The method of claim 7, wherein based on the comparing of (b) the sample is identified as benign or malignant for the cancer.
  • Embodiment 9 The method of any one of claims 7-8, wherein at least five biomarkers listed in Table 1 or Table 2 are assayed in (a).
  • Embodiment 10 The method of any one of claims 7-9, wherein the biomarker comprises a transcription factor.
  • Embodiment 11 A method comprising: (a) assaying a cell-free DNA sample for a metabolic-related biomarker or an immune-related biomarker to produce a result, wherein the cell-free DNA sample is from a subject having cancer or suspected of having cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
  • Embodiment 12 The method of claim 11, wherein based on the comparing of (b) the cell-free DNA sample is identified as benign or malignant for the cancer.
  • Embodiment 13 The method of any one of claims 11-12, wherein the assaying of (a) comprises assaying for a presence or an absence of an epigenetic modification.
  • Embodiment 14 The method of any one of claims 11-13, wherein at least five biomarkers are assayed in (a).
  • Embodiment 15 The method of any one of claims 11-14, wherein the biomarker is a transcription factor.
  • Embodiment 16 A method comprising: identifying a presence or an absence of (i) an early stage colorectal cancer, (ii) a late stage colorectal cancer in a sample, wherein the identifying comprises assaying for a presence or an absence of an epigenetic modification in a nucleotide sequence of the sample to produce a result, wherein the sample is from a subject having cancer or suspected of having cancer.
  • Embodiment 17 The method of claim 16, wherein the nucleotide sequence has at least 70% sequence homology to a biomarker of Table 3.
  • Embodiment 18 The method of any one of claims 1, 7 or 11, wherein the result from (a) is input into a trained algorithm and the comparing of (b) is performed by the trained algorithm to classify the sample as benign or malignant for the cancer.
  • Embodiment 19 The method of any one of claims 5, 7, 13 or 16, wherein the presence or the absence of the epigenetic modification comprises a number of methylated sites in the biomarker.
  • Embodiment 20 The method of any one of claims 5, 7, 13 or 16, wherein the presence or the absence of the epigenetic modification comprises a number of hypo-hydroxymethylated loci, a number of hyper-hydroxymethylated loci, or a combination thereof in the biomarker.
  • Embodiment 21 The method of claim 18, further comprising (c) assaying the sample for a population of immune cells.
  • Embodiment 22 The method of claim 21, further comprising inputting the population of immune cells from (c) into the trained algorithm.
  • Embodiment 23 The method of claim 21 or claim 22, wherein the population of immune cells comprises more than one type of immune cell.
  • Embodiment 24 The method of claim 21 or claim 22, wherein the population of immune cells comprises a single type of immune cell.
  • Embodiment 25 The method of claim 18, wherein the trained algorithm classifies the sample as benign or malignant for the cancer at greater than about 90% sensitivity, greater than about 80% specificity, or a combination thereof.
  • Embodiment 26 The method of claim 25, wherein the trained algorithm classifies the sample as benign or malignant for the cancer at greater than about 90% sensitivity.
  • Embodiment 27 The method of claim 25, wherein the trained algorithm classifies the sample as benign or malignant for the cancer at greater than about 80% specificity.
  • Embodiment 28 The method of any one of claims 5, 7, 13 or 16, wherein the epigenetic modification comprises a 5-methylcytosine (5-mC), a 5-hydroxymethylcytosine (5-hmC), a 5-formylcytosine (5-fC), a 5-carboxylcytosine (5-caC), or any combination thereof.
  • Embodiment 29 The method of claim 28, wherein the epigenetic modification comprises the 5-hmC.
  • Embodiment 30 The method of any one of claims 5, 7 or 13, wherein a loss in the epigenetic modification as compared to the control or the derivative thereof is indicative of the cancer.
  • Embodiment 31 The method of claim 30, wherein the epigenetic modification is the 5-hmC.
  • Embodiment 32 The method of any one of claims 1-31, wherein the subject is suspected of having the cancer.
  • Embodiment 33 The method of any one of claims 1-32, wherein said subject is asymptomatic for the cancer.
  • Embodiment 34 The method of any one of claims 1-33, wherein the subject has not previously been diagnosed with the cancer.
  • Embodiment 35 The method of any one of claims 1-34, wherein the cancer is colorectal cancer (CRC).
  • Embodiment 36 The method of any one of claims 2, 8, 12, or 18, wherein when the method identifies the sample as malignant for the cancer, the method further classifies the sample as representative of a stage of cancer.
  • Embodiment 37 The method of claim 36, wherein the stage of the cancer is stage I.
  • Embodiment 38 The method of any one of claims 2, 8, 12 or 18, wherein when the method identifies the sample as malignant for the cancer, the method further classifies the sample as representative of a subtype of cancer.
  • Embodiment 39 The method of claim 38, wherein the cancer is colon cancer and the subtype of the colon cancer is serrated adenoma or a tubular adenoma.
  • Embodiment 40 The method of claim 38, wherein the cancer is colon cancer and the subtype of the colon cancer is CMS1, CMS2, CMS3, or CMS4.
  • Embodiment 41 The method of any one of claims 1-10 or 16-40, wherein the sample comprises cell-free DNA.
  • Embodiment 42 The method of claim 41, wherein an amount of the cell-free DNA is from about 5 nanogram (ng) to about 15 ng.
  • Embodiment 43 The method of claim 41 or 42, wherein the sample further comprises a blood sample, a tissue sample, a fine needle aspirate sample, a fecal sample, or any
  • Embodiment 44 The method of any one of claims 1-43, wherein the sample is identified as benign for the cancer in an absence of the subject having a further medical procedure.
  • Embodiment 45 The method of claim 44, wherein the further medical procedure comprises: obtaining a biopsy from the subject, performing an imaging scan of the subject, or a combination thereof.
  • Embodiment 46 The method of any one of claims 18-45, wherein when the trained algorithm identifies the sample as benign, assaying a second sample from the subject to monitor a change over time in the result from (a).
  • Embodiment 47 The method of any one of claims 18-46, wherein the trained algorithm is trained using a training set of samples.
  • Embodiment 48 The method of any one of claims 18-47, wherein the training set of samples comprises cell-free DNA samples.
  • Embodiment 49 The method of any one of claims 18-48, wherein the training set of samples comprises cell-free DNA samples and genomic DNA samples.
  • Embodiment 50 The method of any one of claims 18-49, wherein the training set of samples comprises a sample having a sequence comprising a CpG island.
  • Embodiment 51 The method of any one of claims 18-50, wherein the training set of samples comprises a combination of malignant samples and benign samples.
  • Embodiment 52 The method of claim 5, 7 or 13, wherein the assaying of (a) comprises detecting the epigenetic modification.
  • Embodiment 53 The method of claim 52, wherein the detecting is by nanopore sequencing.
  • Embodiment 54 The method of claim 52, wherein the detecting is by high throughput sequencing.
  • Embodiment 55 The method of claim 52, wherein the detecting comprises associating a label with an epigenetic modification in a sequence of the sample to form a labeled sequence; hybridizing a substantially complementary strand to the labeled sequence; and amplifying the substantially complementary strand in a reaction in which the labeled sequence is substantially not present.
  • Embodiment 56 The method of claim 52, wherein the detecting comprises contacting the sample with an enzyme or a catalytically active fragment thereof that converts a methylated residue in the sample to a modified base.
  • Embodiment 57 The method of claim 52, wherein the detecting comprises covalently labeling a hydroxyl group on a hydroxymethylated residue in the sample to generate a labeled hydroxymethylated residue; and sequencing the sample comprising the labeled
  • Embodiment 58 The method of claim 52, wherein the detecting comprises contacting at least a portion of the sample with an enzyme that utilizes a labeled glucose or a labeled glucose-derivative donor substrate to add a labeled glucose molecule or a labeled glucose-derivative to an epigenetic modification in the sample to generate a labeled glucosylated-epigenetic modification.
  • Embodiment 59 The method of claim 52, wherein the detecting comprises adding a detectable label to the epigenetic modification.
  • Embodiment 60 The method of claim 59, wherein the detectable label comprises an antibody.
  • Embodiment 61 The method of any one of claims 52-60, wherein the detecting is by a method comprising a fluorescence resonance energy transfer (FRET) assay, an enzyme-linked immunosorbent assay (ELISA), a liquid chromatography-mass spectrometry (LCMS) assay, or any combination thereof.
  • Embodiment 62 The method of any one of claims 52-61, wherein the detecting comprises adaptor ligation.
  • Embodiment 63 The method of claim 1, wherein the control or derivative thereof is from a subject having cancer, a subject not having cancer, a subject having a stage I cancer, a subject having a stage II cancer, a subject having a stage III cancer, a subject having a stage IV cancer, or any combination thereof.
  • Embodiment 64 The method of claim 52, wherein the detecting comprises detecting 5-caC or 5-fC.
  • Embodiment 65 The method of claim 17, wherein the nucleotide sequence has at least 70% sequence homology to a biomarker of Table 1, Table 2, or a combination thereof.
  • Embodiment 66 The method of any one of claims 1, 7, or 11, wherein based on the comparing of (b) the sample is identified as a precancerous lesion or a precancerous growth.
  • Embodiment 67 The method of claim 66, wherein the precancerous lesion or the precancerous growth comprises a polyp, a nonpolyp, an advanced adenoma, or any combination thereof.
  • Embodiment 68 The method of claim 66, wherein the assaying of (a) is performed in the absence of a screening procedure.
  • Embodiment 69 The method of claim 68, wherein the screening procedure comprises a colonoscopy, an assay performed on a stool sample provided by the subject, a sigmoidoscopy, or any combination thereof.
  • Embodiment 70 The method of claim 68, wherein the sample is a blood sample.
  • Embodiment 71 The method of claim 68, wherein the sample comprises cell-free DNA.
  • Embodiment 72 A method comprising: (a) assaying a sample for a nucleotide sequence having at least 70% sequence homology to a biomarker listed in FIG. 34B, FIG. 139, FIG. 140, FIG. 141, FIG. 142, or any combination thereof to produce a result, wherein the sample is from a subject asymptomatic for a cancer or not previously diagnosed with a cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
  • Embodiment 73 The method of claim 72, wherein based on the comparing of (b) the sample is identified as a precancerous lesion or a precancerous growth.
  • Embodiment 74 The method of claim 73, wherein the precancerous lesion or precancerous growth comprises a polyp, nonpolyp, an advanced adenoma, or any combination thereof.
  • Embodiment 75 The method of claim 72, wherein the assaying of (a) is performed in the absence of a screening procedure.
  • Embodiment 76 The method of claim 75, wherein the screening procedure comprises a colonoscopy, an assay performed on a stool sample provided by the subject, a sigmoidoscopy, or any combination thereof.
  • Embodiment 77 The method of claim 72, wherein the sample is a blood sample.
  • Embodiment 78 The method of claim 72, wherein the sample comprises cell-free DNA.
  • Embodiment 79 The method of claim 72, wherein the nucleotide sequence has at least 70% sequence homology to a biomarker of Table 1, Table 2, Table 3, or any combination thereof.
  • Embodiment 80 The method of claim 72, wherein the assaying of (a) comprises assaying for a presence or an absence of an epigenetic modification.
  • Embodiment 81 The method of claim 80, wherein the assaying of (a) comprises detecting the epigenetic modification.
  • Embodiment 82 The method of claim 81, wherein the detecting is by nanopore sequencing.
  • Embodiment 83 The method of claim 81, wherein the detecting is by high throughput sequencing.
  • Embodiment 84 The method of claim 72, wherein the control or derivative thereof comprises samples obtained from a precancerous lesion or a precancerous growth.
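Embodiments 25-27 recite sensitivity and specificity thresholds for the trained algorithm. As a minimal sketch of how those two figures could be computed from benign/malignant calls, the following assumes known true labels are available for a set of samples. The function name, the label strings, and the example data are illustrative assumptions and do not represent results from the described study.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive="malignant"):
    """Return (sensitivity, specificity) for two equal-length label lists."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1  # malignant sample correctly called malignant
            else:
                fn += 1  # malignant sample missed
        else:
            if predicted == positive:
                fp += 1  # benign sample wrongly called malignant
            else:
                tn += 1  # benign sample correctly called benign
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical calls on 10 malignant and 10 benign samples.
truth = ["malignant"] * 10 + ["benign"] * 10
calls = ["malignant"] * 9 + ["benign"] * 1 + ["benign"] * 8 + ["malignant"] * 2
sens, spec = sensitivity_specificity(truth, calls)
```

In this toy example 9 of 10 malignant samples and 8 of 10 benign samples are called correctly, giving a sensitivity of 0.9 and a specificity of 0.8.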
  • the top 20 genes directly associated with CRC based on the VarElect component of the GeneCards database are present among the differential genes in the CRC vs. HV comparison (MCC, IGF2, FGFR2).
  • a total of 56 genes from CRC list of VarElect are present in the differential list of genes with FDR < 0.05 CRC vs. HV.
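The source does not state which false discovery rate procedure was used to produce the differential gene list. As a minimal sketch, the following assumes the widely used Benjamini-Hochberg adjustment and shows how raw per-gene p-values could be converted to adjusted values and filtered at a 0.05 cut-off. The function name, gene labels, and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five placeholder genes.
raw = {"gene_a": 0.001, "gene_b": 0.008, "gene_c": 0.039,
       "gene_d": 0.041, "gene_e": 0.6}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
significant = [gene for gene, q in adjusted.items() if q < 0.05]
```

With these placeholder values only `gene_a` and `gene_b` pass the 0.05 threshold, since the adjusted values for `gene_c` and `gene_d` are both 0.05125.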
  • the data quality control (QC) parameters used are (i) de-duplicated read count for input and pull-down and (ii) uniformity score for input and pull-down. A review of the parameters, their relationships with each other, and sample characteristics is performed. No HMCP profiles are excluded on the basis of data QC.
  • ZIC1 also shows enrichment of 5-hmC in cancer samples. Tracks from top to bottom show: 10 profiles of HV cfDNA; 10 profiles of CRC cfDNA; average track over all HV cfDNA; average track over all CRC cfDNA patients; average track over all CRC stage 1-2 cfDNA patients; four tumor profiles; two technical replicates from gDNA of normal colon.
  • the final cohort for analysis is composed of 105 samples, distributed over ages 55 - 70, and with ⁇ 60% of the cohort females samples (demonstrated in FIG. 37A-C). Late stage cancers are all female samples (FIG. 36).
  • FIG. 37A-C shows distribution of the cohort based on three key variables - age, gender and cancer stage. CRC patients are significantly older than healthy volunteers (FIG. 37A) with HV younger than CRC patients. Age and gender is less biased (FIG. 37B) but there is a bias by gender and cancer stage (FIG. 37C).
  • Sample Balancing is performed using the R package OSAT and demonstrated no bias in the allocation of DNA samples across the strip tubes going into the HMCP v2 workflow. Chi- square p-values for all desired variables were p>0.5 including clinical diagnosis & stage, sex, extraction operator, day of extraction and age. The distribution of samples across the 14 strip tubes is shown (FIG. 38A-D). Alterations in the desired balancing are only to move empty wells to the end of the strip tube.
  • FIG. 38A-D shows results of the OSAT sample balancing analysis based on key variables across the 14 strip tubes needed for the HMCP v2 workflow.
  • Each bar of the histogram represents one strip tube processed in the workflow.
  • Each of the plots shows, for strip tubes 1-14, how well balanced each tube is for cancer stage, gender, extraction operator and day of extraction. No strip is found to be unbalanced based on chi-square tests.
  • FIG. 39A-E shows assessment of the quantity of DNA (concentration and yield) achieved by DNA extraction based on both Qubit and the Bioanalyser (BA) by key cohort metadata and extraction operator.
  • FIG. 40A-B shows association of total mass (ng) of cell free DNA (cfDNA) that went into the library preparation stage (denoted conv ng) with Sex, and cancer stage.
  • the NetFlex adapters contain the library indexes needed for sequencing, which are well balanced across the operators.
  • HiSeq4000 60M fragments per sample.
  • Input and pBGT libraries show similar distributions across the cohort meta-data (FIG. 44A-D); no effect of the biological or technical variables tested is identified (chi-square tests, all p-values >0.4). Little variation is identified based on run or operator.
  • the quantification of the spike-in sequences showed no difference based on the operator or run (FIG. 45A-F).
  • a small difference in the diversity of the inputs is identified based on the run (FIG. 46A-D).
  • FIG. 43A-D shows the association between the quantity of input cfDNA and the sequencing metrics, including diversity, uniformity and total de-duplicated reads.
  • the conv ng is the total mass (ng) of cfDNA that went into the library prep step. Data are shown for both the input and pulldown (pBGT). Pearson correlation is performed between these metrics.
  • FIG. 44A-D shows histograms and boxplots of the de-duplicated sequencing reads.
  • the de-duplicated read count is based on bamstats mapped reads and is a paired-end read count; halving it gives the number of fragments. A greater read count is achieved for the input samples than for the pBGTs, and both reach or exceed the expected sequencing depth. Inputs and pBGTs show similar distributions across the cohort meta-data. Chi-square tests are performed for each (as well as Age Group), and the minimum p-value identified across all tests is 0.436. Run464 shows a slight variation in the number of de-duplicated reads, but this is not significantly different from the other runs.
  • FIG. 45A-F shows assessment of spike-ins by clinical diagnosis and HMCP operator. Spike-in levels are shown as log2 (ratios) for both the input and the pBGT.
  • Ratio 2hmC vs. mCpC is the ratio of hmC control reads divided by the sum of mC and C control reads.
  • Ratio 2mC vs. hmCpC is the ratio of mC control reads divided by the sum of hmC and C control reads.
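  • The two spike-in ratios defined above can be written out as a short calculation. The following is a minimal sketch only: the function name and the pseudocount of 1 (added to avoid taking the logarithm of zero) are assumptions made for illustration and are not taken from the described workflow.

```python
import math


def spike_in_log2_ratios(hmc_reads, mc_reads, c_reads, pseudocount=1.0):
    """Return (log2 hmC ratio, log2 mC ratio) for one library.

    hmC ratio = hmC control reads / (mC + C control reads)
    mC ratio  = mC control reads / (hmC + C control reads)
    The pseudocount is an illustrative assumption, not part of the source.
    """
    hmc_ratio = (hmc_reads + pseudocount) / (mc_reads + c_reads + pseudocount)
    mc_ratio = (mc_reads + pseudocount) / (hmc_reads + c_reads + pseudocount)
    return math.log2(hmc_ratio), math.log2(mc_ratio)
```

  • Under this sketch, a pull-down library that has specifically enriched hmC control reads gives a positive first value and a negative second value.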
  • the figures here show that the pull-down has specifically enriched the hmC reads in the pBGT and not the input, and that the mC reads are rare in the pBGT. No significant difference is identified in the 2hmC:mCpC ratio or the 2mC:hmCpC ratio based on clinical diagnosis or the HMCP operator. Chi-square tests: Input - 2hmC ratio vs.
  • ANOVA tests are performed for input and pBGT libraries (separately) and key technical (shown - FIG. 46A-D) and biological variables (age, clinical diagnosis and gender - FIG. 76A-76L).
  • the ANOVA tests revealed no significant differences between diversity, uniformity, run and HMCP operator for the pBGT library.
  • Mitochondrial RPKM is associated with sequencing run and HMCP operator for the pBGT library (p-value 5.11e-06 for sequencing run, 0.053 for HMCP operator).
  • Among the biological variables for the pBGT library, only uniformity and stage are significantly associated (p-value 0.0138).
  • For the input library, mitochondrial RPKM is significantly associated with age group (p-value 0.045), while diversity score and sex are the only other associated pair (p-value 0.08).
  • FIG. 46A-D shows assessment of the diversity, uniformity and mitochondrial reads based on the run, operator and clinical diagnosis. Some variation is identified in the mitochondrial RPKMs for both input and pulldown (pBGT).
  • Two main genomic feature types are used for secondary analysis: gene bodies defined by the Gencode v.25 GRCh38 annotation set and enhancer regions defined by the Genehancer regions from the Genecards database.
  • features that do not obtain more than 30 reads per feature in all samples are excluded, resulting in 22377 genes and 16643 genehancers.
  • Largely invariant features are then excluded by requiring a coefficient of variation > 0.2 (over all samples), and features with very high variability are excluded by requiring a coefficient of variation < 0.8.
  • This feature set is referred to as the "Top Varying" set in the following text (composed of 3104 genes, and 1323 Genehancers).
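  • The read-count and coefficient-of-variation filters described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the function names are hypothetical, both filters are applied to the same per-sample values for simplicity, and the population standard deviation is assumed for the coefficient of variation.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population SD assumed)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def select_top_varying(counts, min_reads=30, cv_low=0.2, cv_high=0.8):
    """Return the features passing the read-count and variation filters.

    counts maps a feature name to its per-sample values. A feature is kept
    when every sample exceeds min_reads and cv_low < CoV < cv_high.
    """
    kept = []
    for feature, values in counts.items():
        if min(values) <= min_reads:
            continue  # not more than 30 reads in every sample
        cv = coefficient_of_variation(values)
        if cv_low < cv < cv_high:
            kept.append(feature)
    return kept
```

  • For example, a feature with values 40, 80 and 120 has a coefficient of variation of about 0.41 and is retained, whereas a feature with identical values in every sample is dropped as invariant.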
  • PCA is utilized to visually assess correlative structure in the datasets. Greater separation is observed between the biological variables of interest than any of the technical variables over the first three principal components. In particular, there is no bias based on operator or sequencing run (FIG. 47A-F). No separation is noted between gender or age group (FIG. 48A-D). In comparison, some separation of the clinical diagnoses is seen based on the top 3 PCA axes (FIG. 49A-F), with the clearest separation seen for the Genehancers. Separation by clinical diagnosis is also seen when more features are considered (read count threshold >30) as shown in FIG. 77A-77N.
  • Several of the top gene candidates for the CRC vs HV comparison are given in FIG. 51. For the full list of significant candidates from both the top varying and all gene comparisons, see FIG. 87, FIG. 90-93, and FIG. 119-121. Boxplots of the top 6 discriminating genes between CRC and HV demonstrate the level of separation in each feature (FIG. 52A-F). Boxplots of the top 6 genehancer features display similar levels of separation (FIG. 82A-F). Boxplots of the top 6 features discriminating between early CRC and HV, and late CRC and HV can be found in the figures (Genes: FIG. 83A-F and FIG. 84A-F, Genehancers: FIG. 85A-F and FIG. 86A-F).
  • FIG. 50 shows the number of discriminatory features identified at several FDR thresholds. Many discriminating features are found for CRC vs. HV and early CRC vs. HV comparisons at an FDR of 0.01.
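  • Counting features at several FDR thresholds can be illustrated with a short sketch. The text does not state which multiple-testing correction is used; the Benjamini-Hochberg procedure is assumed here because it is a common way to obtain FDR-adjusted p-values, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_at_thresholds(p_values, thresholds=(0.01, 0.05, 0.1)):
    """Count features whose adjusted p-value falls below each threshold."""
    adjusted = benjamini_hochberg(p_values)
    return {t: sum(q < t for q in adjusted) for t in thresholds}
```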
  • FIG. 51 shows the top 20 discriminatory genes ranked by adjusted p-value for the CRC vs HV comparison (Mann-Whitney U test). For each gene, its specific prediction power in terms of AUC is computed.
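  • The per-gene AUC is closely related to the Mann-Whitney U statistic: for two groups of sizes n_a and n_b, AUC = U / (n_a * n_b). The sketch below illustrates this relationship for a single feature; it is not the code used to produce FIG. 51, the function name is hypothetical, and the p-value calculation is omitted.

```python
def mann_whitney_auc(group_a, group_b):
    """Return (U, AUC) for group_a versus group_b.

    U counts the pairs in which the group_a value exceeds the group_b value,
    with ties counted as one half. AUC = U / (n_a * n_b).
    """
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u, u / (len(group_a) * len(group_b))
```

  • An AUC near 1 indicates that values in the first group (for example, CRC) are almost always higher than those in the second group (for example, HV), while an AUC near 0.5 indicates no separation.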
  • FIG. 52A-F shows boxplots of the 6 top ranked genes by p-value from CRC vs HV comparison (top varying genes), all of which show an increased level of 5hmC enrichment in CRC over HV.
  • FIG. 53A-B shows 5hmC Enrichment Profile of ZIC4 and ZIC1 genes showing increased levels of 5hmC in CRC.
  • FIG. 53A shows demonstration of 5hmC enrichment on the genomic level in patient CRC profiles. 5hmC enrichment is localised around the 4th exon of ZIC4.
  • the gene ZIC1, to the right, also shows enrichment of 5hmC in the cancer samples.
  • the genome browser tracks from top to bottom include: Average track over all HV cfDNA; Average track over all CRC cfDNA patients; Three tumour profiles; and gDNA of normal colon.
  • FIG. 53B shows ZIC4 and ZIC1 summarised over the gene body in boxplot form by CRC stage.
  • FIG. 54A-B shows 5hmC Enrichment Profile of SIX1 gene showing increased levels of 5hmC in CRC.
  • FIG. 54A shows demonstration of 5hmC enrichment on the genomic level in patient CRC profiles. Tracks from top to bottom include: Average track over all CRC stage 1-2 cfDNA patients; Three tumour profiles; and gDNA of normal colon.
  • FIG. 54B shows SIX1 levels summarised over the gene body and plotted in boxplot form by CRC stage.
  • Validation of these top feature lists is performed by subsetting the cohort into two equally sized groups, performing the MWU tests (genes and genehancers) on each, and comparing the ranks between the two groups using a Wilcoxon signed-rank test. This results in an average p-value of 0.72 and an average 30% intersection between the features with an FDR < 0.05 across the different filtering levels (read count threshold and top varying) (FIG. 90). Only the CRC vs. HV (female only) top varying genes comparison results in a p-value < 0.4.
  • the DESeq2 method is used on just the pBGT library counts. This method permits the inclusion of covariates such as operator and age.
  • the DESeq2 approach produces more features reaching statistical significance (FIG. 91).
  • Rank comparison (Wilcoxon signed-rank test) of the DESeq2 approach to the MWU test is performed across the disease types. DESeq2 is run using combinations of age, gender, HMCP operator and the sequencing run as covariates.
  • FIG. 92 includes results for CRC vs. HV and early CRC vs HV comparisons for gene level tests.
  • Genes identified as discriminatory between CRC and HV patients are assessed for functional relevance using the GeneCards suite. Genes with adjusted p-values lower than 0.05 (367 genes) are selected and submitted to the disease association algorithm in the GeneAnalytics component of the GeneCards database. The disease type "Colorectal Cancer" has the highest score (FIG. 55), which is calculated based on the number of matching colorectal cancer associated genes. The basis of this matching score is verified colorectal cancer associated genes, of which a number of the matching example genes are listed in FIG. 56. The likelihood of this result being due to database bias is tested by submitting 20 random gene sets of the same feature set size (in order to be conservative in the analysis, the random lists are selected from the top varying gene set).
  • FIG. 55 shows a disease association table from the GeneAnalytics component of the GeneCards suite, for genes with an adjusted p-value < 0.05 in the CRC vs HV comparison.
  • CRC is the top hit for the gene list.
  • FIG. 56 shows genes in the CRC vs HV set that are identified as differentially expressed in tissue samples in CRC.
  • FIG. 57 shows top 20 genes directly associated with CRC using the VarElect component of the Genecards database. CRC related terms are top hits in this analysis.
  • FIG. 58 shows a disease association table from the GeneAnalytics component of the GeneCards suite, for genes with an adjusted p-value < 0.05 in the CRC vs HV comparison using the All-genes list, which does not apply a filter based on coefficient of variation. CRC and other cancers are the top hits for the gene lists.
  • Gene ontology analysis of the top hits for all subgroups and comparisons is performed using GOseq, an R package for gene ontology analysis using biological pathways. Both the top 50 features and those with an FDR < 0.05 are tested, and separated into under- and over-enriched based on the direction of the mean change in the RPKM enrichment ratio. GO terms with a p-value < 0.001 are taken forward into the next round of analysis and the top 20 biological pathways are plotted as a histogram of the -log10(p-value). The under-enriched pathways in all-stage comparisons of CRC and HV are predominantly immune related (average 13/20) and the over-enriched pathways are related to metabolism and biosynthetic processes for early CRC and all CRC comparisons.
  • 5hmC cfDNA profiles reflect changes that are specific to the patient condition.
  • 5hmC enriched DNA fragments from known colon cancer oncogenes are overrepresented in patient cfDNA profiles when compared to HV patients.
  • diagnostic potential may be aided by both tumour impact in cfDNA along with immune cell population changes.
  • FIG. 59A-B shows histograms of the top 20 Biological pathways from GO analysis of genes with an FDR < 0.05 from early CRC vs. HV MWU results (both genders, with read count filtering). Under-enriched pathways are predominantly immune related (FIG. 59A) and over-enriched pathways are predominantly metabolism related (FIG. 59B).
  • FIG. 60A-B shows histograms of the top 20 Biological pathways from GO analysis of genes with an FDR < 0.05 from late CRC vs. HV MWU results (females only). Under-enriched pathways are immune related (FIG. 60A) and over-enriched pathways are related to adhesion, morphogenesis and development (FIG. 60B).
  • Phase I Development of a Support Vector Machine (SVM) and Logistic Regression Model (LR) under a cross validation strategy that filters features that are invariant and correlated with age or sex.
  • Phase II Addition of a Recursive Feature Elimination (RFE) strategy to the Logistic Regression approach to reduce the feature set to the top 20 best performing features.
  • An SVM is built using 6-fold cross-validation including variance filtering and chi-square tests to assess the importance of covariates (age and/or gender) for the retained features.
  • In 6-fold cross-validation, the samples are split into 6 groups, with five used for training and one for testing. This is repeated for all 6 test-set/training-set permutations, and the average of the performance measures over the six runs is computed as an estimate of the predictive performance on the dataset.
  • Feature selection occurs within each cross-validation round: the coefficient of variation (COFV) is calculated on the training set and can therefore vary for each subgroup of samples.
  • Those features that pass the COFV threshold are then tested for associations with age groups and/or gender using a chi-square test.
  • Features with a p-value greater than a chosen threshold are retained and the model is trained and tested on these features.
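  • The key point of the procedure above is that feature selection is repeated inside every cross-validation round, using only that round's training samples. A minimal sketch follows, with several assumptions: all function names are hypothetical, folds are assigned by index without shuffling or stratification, the chi-square covariate filter is omitted for brevity, and the caller supplies the `fit` and `score` callables (for example, an SVM and an AUC calculation).

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k=6):
    """Yield (train_idx, test_idx) pairs; sample i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


def cofv_filter(features, train_idx, threshold=0.2):
    """Keep features whose COFV on the training samples exceeds threshold."""
    kept = []
    for name, values in features.items():
        train_values = [values[i] for i in train_idx]
        m = mean(train_values)
        if m and pstdev(train_values) / m > threshold:
            kept.append(name)
    return kept


def cross_validate(features, labels, fit, score, k=6):
    """Average a performance score over k folds with in-fold selection."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        kept = cofv_filter(features, train_idx)  # training samples only
        model = fit(features, labels, kept, train_idx)
        scores.append(score(model, features, labels, kept, test_idx))
    return mean(scores)
```

  • Because the filter never sees the held-out samples, the averaged score is not inflated by information leaking from the test fold into feature selection.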
  • Each model has been built for both gene and genehancer feature sets and for each sample group (e.g., stage and gender comparisons).
  • the feature selection criteria are tested at multiple thresholds.
  • a COFV > 0.2 is chosen to remain consistent with the MWU test, and features that are not significant for age or sex (p-value > 0.25) are retained. Most of the feature selection is due to the COFV threshold, with <20%, and in some cases 0%, of features being removed by the addition of the chi-square tests for age and gender.
  • the average area under the curve (AUC) measure from the receiver operator characteristic curves (ROC) is > 0.8 for all comparisons (see FIG. 61A-E and FIG. 62A-E). ROC curves are the result of the mean over the 6-fold cross-validations.
  • FIG. 61A-E shows ROC curves for SVM classifiers built on gene data for disease comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and or gender (if mixed population). ROC curves are annotated with the mean area under the curve and the permutation test (PT) p-value.
  • FIG. 62A-E shows ROC curves for classifiers built on genehancer data for disease comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and or gender (if mixed population). ROC curves are annotated with the mean area under the curve and the permutation test (PT) p-value.
  • Phase II Recursive Feature Elimination with the Logistic Regression (LR) Model
  • a recursive feature elimination (RFE) method is added within the cross-validation step.
  • the number of features to select within the model is varied (10, 20, 50 and 100) and in this report RFE models are included that select the top 20 most informative features.
  • the features are recorded so a comparison across the cross-validations can be performed.
  • a summary of the genes chosen during the RFE across the 6 fold cross-validation can be found in FIG. 138 alongside their corresponding p-values from the MWU tests.
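  • Recursive feature elimination with a logistic regression model can be sketched as below. This is an illustrative toy and not the implementation used for the reported models: the function names, the plain gradient-descent fit, the learning rate, and the use of the smallest absolute weight as the elimination criterion are all assumptions, and features are assumed to be on comparable scales.

```python
import math


def fit_logistic(rows, labels, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent.

    Returns [bias, w1, w2, ...]. No regularization is applied.
    """
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def recursive_feature_elimination(rows, labels, names, n_keep=20):
    """Refit repeatedly, dropping the feature with the smallest |weight|."""
    active = list(range(len(names)))
    while len(active) > n_keep:
        sub_rows = [[row[j] for j in active] for row in rows]
        weights = fit_logistic(sub_rows, labels)[1:]
        weakest = min(range(len(active)), key=lambda j: abs(weights[j]))
        del active[weakest]
    return [names[j] for j in active]
```

  • Recording the returned feature names for each cross-validation round allows the selections to be compared across rounds, as described above.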
  • FIG. 63A-E shows ROC curves for LR RFE classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and or gender (where appropriate). ROC curves are annotated with the mean area under the curve. All ROC curves show models built with 20 features. All classifiers show high performance levels with AUCs>0.8.
  • FIG. 64A-E shows ROC curves for LR RFE classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and or gender (where appropriate). ROC curves are annotated with the mean area under the curve. All ROC curves show models built with 20 features. All classifiers show high performance levels with AUCs>0.7.
  • the performance of a classifier developed from this cohort is assessed on an independent test set.
  • the dataset is split into 2 partitions: 3/4 of the samples are utilized to build a classifier, and 1/2 as an independent test set (see FIG. 65 and FIG. 67 for the sample distribution).
  • LASSO classifier development
  • LASSO regression based classifiers are built (see "Use of LASSO regression for CRC and HV state prediction" section for description of LASSO model) to distinguish all stage CRC from HV, and early stage CRC from HV, and as before, separate classifiers are built using 5hmC enrichment in gene bodies and enhancer regions.
  • Each classifier is trained using cross-validation (All Stage vs HV: FIG. 66A-B; Early Stage vs HV: FIG. 71A-B) and the performance assessed on the independent test set (All Stage vs HV: FIG. 67 and FIG. 68; Early stage CRC vs HV: FIG. 72 and FIG. 73).
  • PCAs showing the ability of genes with non-zero weights to separate the CRC and HV can be seen in FIG. 69A-B (genes) and FIG. 70A-B (genehancers).
  • the feature sets are reduced substantially to a final informative set after training the LASSO regression classifier.
  • For all-stage cancer vs HV, 56 genes and 59 genehancers are retained, while for early stage vs HV, 40 genes and 25 genehancers are retained. Comparing the genes with non-zero weight in both the CRC vs HV and early CRC vs HV classifiers, 13 shared genes are found (FIG. 74).
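  • Finding the features shared between two LASSO classifiers amounts to intersecting the sets of features with non-zero weight. A minimal sketch follows, assuming each fitted model is available as a mapping from feature name to weight; the function names are hypothetical and no real model weights are shown.

```python
def nonzero_features(weights):
    """Return the set of feature names whose weight is not zero."""
    return {name for name, w in weights.items() if w != 0.0}


def shared_nonzero(weights_a, weights_b):
    """Return the sorted feature names with non-zero weight in both models."""
    return sorted(nonzero_features(weights_a) & nonzero_features(weights_b))
```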
  • composition of the train and test datasets (ii) the age of the volunteer involved in the study. Results show that these features cannot be considered as confounding factors in the training process.
  • the second dataset is generated using HMCP v1 technology (87 samples: 40 CRC and 47 HV).
  • the dataset is split in two different groups because it showed operator-bias (Group 1 containing 43 samples: 15 CRC - 28 HV, and Group 2 containing 44 samples: 25 CRC - 19 HV).
  • the classifier showed good prediction performance on this dataset as well (AUC of 0.817 and 0.752 for Group 1 and 2 respectively, FIG. 117A-J).
  • the difference in the sequencing technology (and likely also the issues related to the reliability of the signal of this dataset) had a negative effect on the sensitivity using the threshold of 0.36 established on the V2 training data (see "Use of LASSO regression for CRC and HV state prediction" section).
  • Results on these two independent datasets demonstrate that the LASSO model, based on 56 genes, has good potential for classification of CRC, however, tuning of the classification threshold for the technology platform differences may lead to improvements in predictive accuracy.
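  • The effect of the classification threshold on sensitivity can be illustrated with a short sketch. The default threshold of 0.36 is the value stated above for the V2 training data; the function names and the convention that a score at or above the threshold is called CRC are assumptions made for illustration, and both classes are assumed to be present in the labels.

```python
def classify(scores, threshold=0.36):
    """Call a sample CRC (1) when its score is at or above the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels (1 = CRC, 0 = HV)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)
```

  • Re-running `classify` with a different threshold on scores from another platform shows how sensitivity and specificity trade off, which is the kind of tuning referred to above.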
  • FIG. 66A-B shows performance of the LASSO regression model on Genes (AUC 0.883) and Genehancers (AUC 0.937). The final model results in 56 features using genes and 59 features using genehancers (3-fold cross-validation is used in the training process). All classifiers show high performance levels with AUCs>0.85.
  • FIG. 67 shows a summary of cross validation results using a LASSO regression model on gene features.
  • FIG. 68 shows a summary of independent test set performance using a LASSO regression model on gene features.
  • FIG. 69A-B shows a PCA based on the list containing the 56 genes having non-zero weight in the computed LASSO model, which also supports the evidence that these features can be effectively used to distinguish between CRC and HV samples (FIG. 64A-E). PCs 1-3 are shown.
  • FIG. 70A-B shows a PCA based on the list containing the 59 genehancers having non-zero weight in the computed LASSO model, which also supports the evidence that these features can be effectively used to distinguish between CRC and HV samples (FIG. 64A-E). PCs 1-3 are shown.
  • Final model results in 40 features using genes and 24 features using genehancers. 3-fold cross-validation is used. All classifiers show high performance levels with AUCs>0.85.
  • FIG. 72 shows a summary of cross validation results using a LASSO regression model on gene features for early CRC vs HV
  • FIG. 73 shows a summary of cross validation results using a LASSO regression model on genehancer features for early CRC vs HV
  • FIG. 74 shows the genes with non-zero weight shared between the early CRC vs HV and CRC vs HV classifiers. This table reports the rank of each gene, with the boxed genes having negative weight (e.g., MRPS31P2 is the gene with the most negative weight in both classifiers).
  • the experiment is designed to avoid operator bias and batch effects by actively balancing samples over operators, which has worked well, even with an eventual imbalance in the operators due to staff illness.
  • the methods used to balance samples in the project can be considered as standard protocol for future projects.
  • Genomic level evidence has been found for several of the genes showing clear enrichment in cfDNA of CRC patients over HV cfDNA profiles, and in some cases show good correspondence to regions that appear enriched in CRC Tumours.
  • the methods may further include: additional samples for validation of the signatures (additional cancers, diseases and increasing the current sample cohort), gDNA from tumour and normal tissue to aid the understanding of the tumour circulating DNA and immune cell profiling to better the understanding of the 5hmC profiles of blood cell types.
  • a sample may be assayed for one or more biomarkers, wherein the sample is a cell-free DNA sample obtained from a blood cell.
  • FIG. 75A-75E MULTIQC PLOTS - Insert size as calculated by the Picard software suite. Run461 to Run465 represent the different sequencing batches. No untoward insert size anomalies are found.
  • FIG. 76A-76L Additional QC plots.
  • A-F Uniformity and Diversity scores by library preparation strategy (conv ng) assessed over technical and biological variables.
  • G-H Results of iCNA show a mismatch in predicted gender and % tumour fraction predictions.
  • I) - L) are metrics from the deeptools plotFingerprint utility that summarise a diagnostic plot that gives an overview of aspects of genomic coverage. Both pBGT and input samples behave as expected, pBGTs expected to have higher elbow/inflection points, lower AUC and higher x-intercept. No difference is observed by operator.
  • FIG. 77A-77N PCA of samples using features (Genes (FIG. 77A-C), Genehancers (FIG. 77D-F)) that have passed the read count thresholds (>30 reads in input and pBGT) and filtered by the coefficient of variation (>0.2 & <2).
  • the variance explained by each principal component for the gene and genehancer set is given in FIG. 77G-H, demonstrating that the majority of the variance is accounted for in the first three to four principal components.
  • FIG. 77I-N Gives plots for genes (FIG. 77I-K) and genehancers (FIG. 77L-N) with only the read count thresholds (>30 reads in the input and pBGT).
  • FIG. 78A - 78D PCAs of the top 20 discriminating/ranked genes for each of the patient subgroups as determined by the MWU test. Results are based on the MWU tests of the read count threshold feature lists (>30 reads in both pBGT and input). Clustering by diagnosis is evident based on the top 20 features alone.
  • FIG. 79A - 79D PCAs of the top 20 discriminating/ranked genehancers for each of the patient subgroups. Results are based on the MWU tests of the read count threshold feature lists (>30 reads in both pBGT and input). Clustering by diagnosis is evident based on the top 20 features alone.
  • FIG. 80A - 80D PCAs of the top 20 discriminating/ranked genes for each of the patient subgroups. Results are based on the MWU tests of the top varying features (>30 reads in both pBGT and input plus coefficient of variation >0.2 & <2). Clear separation between CRC and HV samples is demonstrated.
  • FIG. 81A - 81D PCAs of the top 20 discriminating/ranked genehancers for each of the patient subgroups. Results are based on the MWU tests of the top varying features (>30 reads in both pBGT and input plus coefficient of variation >0.2 & <2). Clustering by diagnosis is evident based on the top 20 features alone.
  • FIG. 82A - 82F Boxplots of the top 6 discriminating genehancers demonstrating separation between CRC and HVs (top varying list). Increased levels of 5hmC are found for CRC over HV for these top 6 genehancers.
  • FIG. 83A - 83F Boxplots of the top 6 discriminating genes demonstrating separation between early CRC and HVs (top varying list). Increased levels of 5hmC are found for early CRC over HV for these top 6 genes.
  • FIG. 84A - 84F Boxplots of the top 6 discriminating genes demonstrating separation between late CRC and HVs (top varying list). The majority of the top 6 genes show increased levels of 5hmC for late CRC over HV.
  • FIG. 85A - 85F Boxplots of the top 6 discriminating genehancers demonstrating separation between early CRC and HVs (top varying list). Increased levels of 5hmC are found for early CRC over HV for these top 6 genehancers.
  • FIG. 86A - 86F Boxplots of the top 6 discriminating genehancers demonstrating separation between late CRC and HVs (top varying list). Increased levels of 5hmC are found for late CRC over HV for these top 6 genehancers.
  • FIG. 87 Prediction score (in terms of AUC) of the top 20 most discriminating genes (top-varying comparison) between CRC and HV based on age groups. Those with a score > 0.7 are highlighted in red. The top 20 genes do not show any clear prediction power for these three age groups.
  • FIG. 88A - 88F Boxplots of the top 6 discriminating genes demonstrating separation between CRC and HVs by age group and diagnosis (top varying list). Enrichment levels are fairly similar across the age groups for HVs, while increased 5hmC levels are evident in the CRC patients.
  • FIG. 89A - 89F Boxplots of the top 6 discriminating genehancers demonstrating separation between CRC and HVs by age group and diagnosis (top varying list). Enrichment levels are fairly similar across the age groups for HVs, while increased 5hmC levels are evident in the CRC patients.
  • FIG. 91 Summary of DESeq2 results with covariates. A high number of features are identified as significantly discriminatory based on the default DESeq2 threshold of < 0.1 adjusted p-value.
  • FIG. 92 DESeq vs. MWU rank comparison tests - Genes. Gender and age have a stronger effect in the early CRC comparisons. P-values from the rank comparison test < 0.05 are highlighted in red. The addition of the covariates makes the most difference for the early CRC vs. HV comparison.
  • FIG. 93 DESeq vs. MWU rank comparison tests - Genehancers. Gender and age have little effect on the rank comparisons. The addition of any covariates does not significantly affect the rank of the discriminating genehancer lists, with approximately 3/4 of genehancers identified by both methods (DESeq2 and MWU tests).
  • FIG. 94A - 94F Top 6 genes ranked by DESeq2 test between CRC and HV including age and gender as covariates. Many of these genes (4/6) are also in the top 6 for the MWU test.
  • FIG. 95A - 95E Receiver operator characteristic (ROC) curves for SVM classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including variance filtering (>0.1) and chi-square tests (p-value>0.25) for age and or gender (if mixed population). ROC curves are annotated with the mean area under the curve and permutation test p-value is reported for each model. All classifiers achieved a mean AUC of 0.8 or above.
  • FIG. 96A - 96E Receiver operator characteristic (ROC) curves for SVM classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including variance filtering (>0.1) and chi-square tests (p-value>0.25) for age and or gender (if mixed population). ROC curves are annotated with the mean area under the curve and permutation test p-value is reported for each model. All classifiers achieved a mean AUC of 0.76 or above.
  • FIG. 97A - 97E ROC curves for logistic regression classifiers built on gene data for disease comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including coefficient of variation filtering (>0.2) and chi-square tests (p-value>0.25) for age and or gender (if mixed population). ROC curves are annotated with the ROC curves from each cross-validation round (light grey) and the mean area under the curve (green). All mean ROC curves show good performance with a minimum mean AUC of 0.83.
  • FIG. 98A - 98E Receiver operator characteristic (ROC) curves for logistic regression classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including coefficient of variation (>0.2) and chi-square tests (p-value>0.25) for age and or gender (if mixed population). ROC curves are annotated with the ROC curves from each cross-validation round (light grey) and the mean area under the curve (green). All mean ROC curves show good performance with a minimum mean AUC of 0.79.
  • FIG. 99A - 99E Receiver operator characteristic (ROC) curves for logistic regression classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including a variance filter (>0.1) and chi-square tests (p-value>0.25) for age and or gender (if mixed population). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
  • FIG. 100A - 100E Receiver operator characteristic (ROC) curves for logistic regression classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including a variance filter (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
  • FIG. 101A - 101E ROC curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test (p-value>0.25) for age and/or gender (where appropriate) followed by RFE (20 features retained). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
  • LR Logistic Regression
  • RFE Recursive Feature Elimination
  • FIG. 102A - 102E ROC curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test (p-value>0.25) for age and/or gender (where appropriate) followed by RFE (20 features retained). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
  • FIG. 103A - 103B Receiver operator characteristic (ROC) curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on gene data for age group (≤61 and >61) comparisons. All classifiers are built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test for gender where appropriate (p-value>0.25), plus recursive feature elimination to 20 features. ROC curves are annotated with the mean area under the curve. Age classifiers perform worse than disease classifiers.
  • FIG. 104A - 104B Receiver operator characteristic (ROC) curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on genehancer data for age group (≤61 and >61) comparisons. All classifiers are built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test for gender where appropriate (p-value>0.25), plus recursive feature elimination to 20 features. ROC curves are annotated with the mean area under the curve. Age classifiers perform worse than disease classifiers.
  • the LASSO classification is based on 3 steps:
  • both a training dataset (used in the learning process), and a test dataset (used to assess the performance of the inferred model) are needed.
  • the original dataset is split into: (i) 3/4 for training; and (ii) 1/2 for testing.
  • Variance-Based Feature Filtering: The aim of this step is to filter out all low-variance features. This step looks only at the features (it does not take the labels into account). In particular, all features with a variance lower than a specified threshold are removed. To be as conservative as possible, only the features that have the same value in all samples are removed. This filtering step removed (i) 45 genes in the gene dataset; and (ii) 6 genehancers in the genehancer dataset.
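The filtering step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function name, the toy matrix, and the gene names are hypothetical, and a threshold of zero is assumed because the text states that only features with the same value in all samples are removed.

```python
from statistics import pvariance


def filter_constant_features(samples, feature_names, threshold=0.0):
    """Drop features whose variance across samples is <= threshold.

    samples: list of rows (one row per sample, one value per feature).
    Labels are never consulted, so the step is unsupervised.
    """
    columns = list(zip(*samples))  # transpose: one tuple per feature
    keep = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    kept_names = [feature_names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in samples]
    return kept_rows, kept_names


# Hypothetical toy data: "geneB" has the same value in every sample.
rows = [[1.0, 5.0, 0.2], [2.0, 5.0, 0.4], [3.0, 5.0, 0.9]]
names = ["geneA", "geneB", "geneC"]
filtered_rows, filtered_names = filter_constant_features(rows, names)
print(filtered_names)  # ['geneA', 'geneC']
```

Raising `threshold` above zero would give the stricter variance filters (for example >0.1) mentioned for the other classifiers.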
  • Model Training and Feature Selection: LASSO regression is used to train the model. This is a linear model that estimates sparse coefficients, and it is useful for obtaining solutions with fewer parameter values (it reduces the number of variables upon which the given solution depends). Mathematically, it consists of a linear model trained with an L1 prior as regularizer, where the objective function to minimize is:
  • the lasso estimate thus solves the minimization of the least-squares penalty with an L1-norm penalty on the coefficient vector added.
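The objective function referred to above is not reproduced in this text. For reference, the standard LASSO objective is shown below; the scaling by the number of samples follows a common convention and is an assumption, as are the symbol names.

```latex
\min_{w} \; \frac{1}{2\, n_{\text{samples}}} \,\lVert X w - y \rVert_2^2 \;+\; \alpha \,\lVert w \rVert_1
```

Here $X$ is the samples-by-features matrix, $y$ the vector of labels, $w$ the weight vector, and $\alpha \ge 0$ the regularization strength. Larger values of $\alpha$ drive more weights to exactly zero, which is what yields the short lists of non-zero genes and genehancers.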
  • LASSO regression has two main advantages: (i) it simultaneously performs training and feature selection, providing a sparse solution; and (ii) it associates a weight with each feature, which gives an indication of the most important features.
  • a 3-fold cross validation approach is utilized on the training dataset.
  • the regression is performed on the normalized version of the training dataset (subtracting the mean and dividing by the L2-norm).
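The normalization mentioned above (subtracting the mean and dividing by the L2-norm) can be sketched per feature as below. This is an illustrative sketch only: the function name is hypothetical, and it assumes the L2-norm is taken over the mean-centered values, which the text does not state explicitly.

```python
import math


def center_and_l2_normalize(column):
    """Subtract the feature mean, then divide by the L2-norm of the centered values."""
    mean = sum(column) / len(column)
    centered = [v - mean for v in column]
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0.0:
        # Constant feature: nothing to scale (such features are removed earlier).
        return centered
    return [v / norm for v in centered]


# Hypothetical values of one feature across three training samples.
feature = [1.0, 2.0, 3.0]
normalized = center_and_l2_normalize(feature)
```

After this step each feature has zero mean and unit L2-norm across the training samples, so the LASSO penalty treats features on a comparable scale.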
  • FIG. 92 and FIG. 93 contain the detailed lists of genes and genehancers, respectively (also shown in FIG. 105A - 105B): (i) 56 genes with non-zero weight (from the initial list of 56,788 genes); and (ii) 59 genehancers with non-zero weight (from the initial list of 218,117 genehancers).
  • the trained model had the following prediction scores: (i) 0.975 for the gene-based model; and (ii) 0.988 for the genehancer-based model.
  • FIG. 105A - 105B LASSO weights for gene and genehancer datasets. Only non-zero elements are reported.
  • FIG. 28 shows the ROC for both models.
  • the first dataset containing 21 samples (7 CRC and 14 HV, all samples are female and all CRC samples are earlyCRC) is generated by using an early version of HMCP v2.
  • the second dataset is generated with an older version of the CEGX 5hmC genome wide profile technology.
  • This dataset contained 87 samples: 40 CRC (21 early-stage and 19 late- stage), and 47 HV.
  • the dataset is split in two different groups because it showed operator-bias.
  • Group 1 containing a total of 43 samples: 15 CRC (7 early-stage and 8 late-stage) and 28 HV
  • Group 2 containing a total of 44 samples: 25 CRC (14 early-stage and 11 late-stage) and 19 HV.
  • the CRC vs HV model is tested based on gene.
  • FIG. 117A - 117J shows that, although these datasets were obtained using a different technology and presented some issues in terms of reliability of the observed signal, the classifier showed good prediction performance: AUC of 0.817 and 0.752 on Group 1 and Group 2, respectively.
  • Results in this section highlight the potential of the 56-gene LASSO model for the classification of CRC and HV samples (this is also suggested by the PCA shown in FIG. 117A - 117J); however, the different sequencing technology used for these external datasets poses a problem related to the tuning of the classification threshold.
  • LASSO classifier for early CRC vs HV: This classifier is trained using the same approach used for CRC vs HV, and FIG. 106 shows the 40 genes having non-zero weight in the classifier. In the main part of the document (FIG. 31, FIG. 72 and FIG. 73) the performance of this classifier is shown, and the 13 genes with non-zero weight shared between the CRC vs HV and early CRC vs HV classifiers are also compared (FIG. 74).
  • PCA is also performed to test the prediction power of the classifier based on:
  • FIG. 106 LASSO weights for genes in the early CRC vs HV classifier. Only non-zero elements are reported.
  • FIG. 107A - 107B PCA performed on the 40 non-zero genes in the early CRC vs HV classifier. Results show a clear split between early CRC and HV samples on PC1.
  • FIG. 108A - 108B PCA performed on the 13 non-zero genes shared between early CRC vs HV and CRC vs HV classifiers. The plots highlight early CRC and HV samples. Results show a clear split between early CRC and HV samples on PC1.
  • FIG. 109A - 109B PCA performed on the 13 non-zero genes shared between early
  • FIG. 110 shows the distribution of the AUC obtained on 1,000 permutations. This plot shows an average predictive performance of ~0.5, which indicates a random classification model (as may be expected). On the other hand, the predictive performance of the LASSO classifier trained on the dataset with the correct labels is 0.993.
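The label-permutation check described above can be sketched generically as follows. This is a simplified illustration under stated assumptions: `train_and_evaluate` is a hypothetical callback standing in for "train the LASSO model on these labels and return the test AUC", and the toy scorer below does no training at all; it only ranks a fixed score against the labels.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_test(train_and_evaluate, labels, n_permutations=1000, seed=0):
    """Compare the AUC on true labels with AUCs obtained on shuffled labels."""
    rng = random.Random(seed)
    observed = train_and_evaluate(labels)
    null_aucs = []
    for _ in range(n_permutations):
        permuted = list(labels)
        rng.shuffle(permuted)  # break any link between features and labels
        null_aucs.append(train_and_evaluate(permuted))
    exceed = sum(1 for a in null_aucs if a >= observed)
    p_value = (1 + exceed) / (1 + n_permutations)
    return observed, null_aucs, p_value


# Hypothetical toy data: 10 HV (label 0) with low scores, 10 CRC (label 1) with high scores.
toy_scores = [i / 20 for i in range(20)]
toy_labels = [0] * 10 + [1] * 10
observed_auc, null_aucs, p_value = permutation_test(
    lambda y: auc(toy_scores, y), toy_labels
)
mean_null_auc = sum(null_aucs) / len(null_aucs)
```

With shuffled labels the mean AUC settles close to 0.5, the behaviour of a random classifier, while the AUC on the true labels stays well above that null distribution.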
  • composition of the dataset, as well as the composition of the train/test datasets may not affect the performance of the prediction process.
  • the training dataset may affect the performance of the prediction process.
  • FIG. 110 Performance of the LASSO model trained on 1,000 independent permutations of the labels of the original dataset. As expected, the average AUC for the permutation test is 0.5 (random classification).
  • Reference Split indicates the train/test datasets used in the main analysis described in the document. Results on the reference split are very similar to the median obtained on the 1,000 splits, suggesting that this split does not over/under train the model.
  • Age of 61 is used as the age for young/old classification because with this value we can have a fair partition of the volunteers (44 and 61 samples respectively), and at the same time have enough HV in the oldest group (19 CRC - 42 HV in the young cohort, and 38 CRC - 6 HV in the old group).
  • FIG. 111A - 111B shows the same analysis summarized in FIG. 111A - 111B (random split of the dataset into 3/4 training and 1/2 testing), but this time the model is trained and tested on the volunteer's age (young vs old).
  • FIG. 115A - 115C shows the results of this analysis based on 100 simulations. From this figure it is clear that no split allows training a model that classifies the volunteer's age well. This becomes more evident when compared with the results obtained using the classifier for CRC-HV state (Table in FIG. 111A - 111B).
  • FIG. 121 lists the genes found in at least 10% of the simulations. It is interesting to see on the top of this list genes that are contained in the CRC-HV classification model (in red) including the genes associated with the highest (FIGN) and lowest (MRPS31P2) weight.
  • Results presented in this section highlight a very important point: despite the fact that the Lasso classifier model used in the main analyses is able to obtain very good performance, the list of non-zero genes/genehancers (and their weights) may not be considered as a "final signature for CRC detection".
  • the analysis presented in this section reveals a small instability of the results of the training process. However, at the same time, it is reassuring to see that the strongest genes of the model are also the ones showing the strongest stability (e.g., FIGN, RNF219, MRPS31P2). Increasing the size of the dataset may definitely help to obtain a more robust and stable classification model.
  • FIG. 113 PCA based on the 56 non-zero genes. The first 5 components are shown and samples are shaded based on the volunteer's age. PCA does not reveal any clear separation between volunteers of different ages.
  • FIG. 114 PCA based on the 59 non-zero genehancers. The first 5 components are shown and samples are shaded based on the volunteer's age. PCA does not reveal any clear separation between volunteers of different ages.
  • FIG. 115A - 115C From left to right and top to bottom: AUCs on 100 different splits of the original dataset, where the model is trained and tested on the volunteer's age. The table reports the median AUC for the Age and CRC classifiers and the p-value resulting from the Mann-Whitney test. List of non-zero genes/genehancers in the LASSO model trained on the volunteer's age; genes shared between this model and the model trained on CRC-HV are in red (no shared genehancers are found). Results reject the hypothesis that age can be a confounding factor in the training of the model.
  • FIG. 116 Distribution of the number of non-zero genes/genehancers found in the 200 simulations. Variability in the number of discriminating features is observed.
  • FIG. 117A - 117J Performance of the CRC-HV gene-trained model on external datasets.
  • FIG. 119 List of the 56 non-zero genes in the Lasso classifier
  • FIG. 120 List of the 59 non-zero genehancers in the Lasso classifier
  • FIG. 121 List of the non-zero genes in the 200 simulations of the Lasso classifier. Only genes occurring in more than 10% of the simulations are reported. In red are the genes shared with the list containing the 56 non-zero genes in the CRC-HV classifier used in the main analyses (FIG. 92). The Mean/Min/Max weight columns contain the Mean/Min/Max weight found in the 200 simulations (note that only non-zero weights are used in this calculation). The Training weight column contains the weight of the feature in the Lasso model used in the main analyses.
  • FIG. 122 List of the non-zero genehancers in the 200 simulations of the Lasso classifier. Only genehancers occurring in more than 10% of the simulations are reported. In red are the genehancers shared with the list containing the 59 non-zero genehancers in the CRC-HV classifier used in the main analyses (FIG. 93). The Mean/Min/Max weight columns contain the Mean/Min/Max weight found in the 200 simulations (note that only non-zero weights are used in this calculation). The Training weight column contains the weight of the feature in the Lasso model used in the main analyses.
  • the genome is split into non-overlapping regions and the GC bias of each region is calculated. Some GC bias classes are more common in the genome than others; for example, a GC bias of 40% is more common than a GC bias of 8%, so more regions have a GC bias of 40%. If reads are scattered evenly across a genome, they are expected to fall into the GC bias classes in proportion to how frequently those classes occur in the genome, e.g., more reads fall into regions with 40% GC bias than into regions with 8% GC bias.
  • Norm coverage = proportion of windows at a GC% / proportion of reads observed at a GC%
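The ratio above can be computed per GC class as in the following sketch. It follows the formula as written (window proportion divided by read proportion); the function name, the input layout, and the toy numbers are hypothetical.

```python
from collections import Counter


def gc_normalised_coverage(window_gc, read_counts):
    """Per GC class: proportion of windows / proportion of reads.

    window_gc[i]   : GC% class of window i
    read_counts[i] : number of reads falling in window i
    """
    n_windows = len(window_gc)
    total_reads = sum(read_counts)
    windows_per_gc = Counter(window_gc)
    reads_per_gc = Counter()
    for gc, n_reads in zip(window_gc, read_counts):
        reads_per_gc[gc] += n_reads
    ratios = {}
    for gc, n_win in windows_per_gc.items():
        window_fraction = n_win / n_windows
        read_fraction = reads_per_gc[gc] / total_reads
        # Undefined when no read falls in this GC class.
        ratios[gc] = window_fraction / read_fraction if read_fraction > 0 else None
    return ratios


# Hypothetical genome of four windows: three at 40% GC, one at 8% GC.
even = gc_normalised_coverage([40, 40, 40, 8], [25, 25, 25, 25])
skewed = gc_normalised_coverage([40, 40, 40, 8], [30, 30, 30, 10])
```

When reads are spread evenly, every class has a ratio of 1. In the skewed case the 8% class receives fewer reads than its share of windows, so under this formula its ratio rises above 1.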
  • Addition 2 Gene signatures determined via a Robust LASSO regression scheme. A) Brief description of the Robust LASSO regression scheme B) Resulting gene signatures and overlaps. A full description of the LASSO model parameters can be found in the appendix.
  • a meta classifier is created, such that for the gene features selected (those that occur > 5% in all 1,000 gene signatures) the median of the gene feature weight is computed over the 1,000 instances (excluding zero instances).
  • the final model is the gene features and the weights associated with each gene feature.
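The meta classifier described above can be sketched as follows. This is an illustration under assumptions: each LASSO run is represented as a dictionary of gene weights, the function name is hypothetical, and the toy weights are invented (FIGN and MRPS31P2 are used only as familiar gene names from the text).

```python
from statistics import median


def build_meta_classifier(signatures, min_frequency=0.05):
    """Combine many LASSO signatures into one set of gene weights.

    signatures    : list of {gene: weight} dicts, one per LASSO run
    min_frequency : keep genes with non-zero weight in more than this
                    fraction of runs
    The final weight is the median over the non-zero instances only.
    """
    n_runs = len(signatures)
    nonzero_weights = {}
    for signature in signatures:
        for gene, weight in signature.items():
            if weight != 0.0:
                nonzero_weights.setdefault(gene, []).append(weight)
    return {
        gene: median(weights)
        for gene, weights in nonzero_weights.items()
        if len(weights) / n_runs > min_frequency
    }


# Hypothetical weights from four runs; a 50% cut-off is used to keep the toy small.
runs = [
    {"FIGN": 0.4, "MRPS31P2": -0.3, "GENE_X": 0.1},
    {"FIGN": 0.6, "MRPS31P2": -0.5, "GENE_X": 0.0},
    {"FIGN": 0.5, "MRPS31P2": -0.4},
    {"FIGN": 0.5},
]
meta = build_meta_classifier(runs, min_frequency=0.5)
print(meta)  # {'FIGN': 0.5, 'MRPS31P2': -0.4}
```

Using the median rather than the mean keeps a single extreme run from dominating a gene's final weight, and raising `min_frequency` reproduces the more stringent selections.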
  • Sheet "gene LASSO" provides the parameters for all genes that meet the 5% criterion as described above for both the z-score and non-z-score normalised data (see headings 5% CRC-HV - Z-Normalization and 5% CRC-HV - No Z-Normalization).
  • the more stringent 10% (or 21% for z-score normalised data) lists are subsets of these tables, which can be obtained by selecting all the genes above the 10% frequency (non-z-score, 49 genes) or 21% frequency (z-score normalisation, 27 genes) values.

Abstract

Provided herein are methods and kits for analyzing a sample from a subject, such as identifying a sample as benign or malignant for a cancer. The methods as described herein may assay for an epigenetic modification in one or more biomarkers of the sample to obtain a result. The result may be input into a trained algorithm to classify the sample as benign or malignant for the cancer based on the presence or absence of the epigenetic modification in the one or more biomarkers.

Description

BIOMARKERS FOR COLORECTAL CANCER DETECTION
CROSS-REFERENCE
[0001] This provisional application is related to U.S. provisional application 62/564,164 filed on September 27, 2017, which is entirely incorporated herein by reference.
BACKGROUND
[0002] It is important to develop new methods to identify samples as benign or malignant for cancer.
SUMMARY
[0003] The methods and kits as described herein may provide identification of samples from a subject as benign or malignant for a cancer. This method may be an improvement in the field of analyzing samples from a subject.
INCORPORATION BY REFERENCE
[0004] All publications, patents, and patent applications herein are incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede or take precedence over any such contradictory material.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] The novel features herein are set forth with particularity in the appended claims. A better understanding of the features and advantages herein will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles herein are utilized, and the accompanying drawings (also "figure" and "FIG." herein), of which:
[0006] FIG. 1 shows 110 total colorectal cancer (CRC) and healthy volunteer (HV) plasma samples processed through the HMCP v2 protocol.
[0007] FIG. 2A - FIG. 2C shows the HMCP-110 study design and initial sample set.
[0008] FIG. 3A - FIG. 3E shows the HMCP-110 study design and sample set breakdown.
[0009] FIG. 4A - FIG. 4H shows the HMCP-110 data quality control outlining that technical parameters did not affect quality or bias results.
[0010] FIG. 5A - FIG. 5D shows no operator-related batch effect in the HMCP-110 dataset.
[0011] FIG. 6A - FIG. 6B shows HMCP-110 data/feature exploratory analysis.
[0012] FIG. 7A - FIG. 7B shows the HMCP-110 differential feature analysis of gene bodies, which identified a high number of discriminating genes.
[0013] FIG. 8A - FIG. 8E shows the top 20 differential genes in the HMCP-110 differential feature analysis are a mixture of hypo- and hyper-hydroxymethylated loci.
[0014] FIG. 9A - FIG. 9B shows an example gene, ZIC4, showing concordance between cell free DNA (cfDNA) and genomic DNA (gDNA) 5-hydroxymethylated cytosine (5-hmC) profiles.
[0015] FIG. 10 shows a comparison of differential genes in CRC vs. HV having functional significance based on most variable features.
[0016] FIG. 11 shows a comparison of differential genes in CRC vs. HV having functional significance based on most variable features.
[0017] FIG. 12A - FIG. 12B shows a high number of discriminating features identified in the HMCP-110 differential feature analysis of enhancers.
[0018] FIG. 13A - FIG. 13E shows a 6-fold x-validation using top varying genes with read counts over 30 in HMCP-110 classification.
[0019] FIG. 14A - FIG. 14E shows a 6-fold x-validation using top varying genehancers with read counts over 30 in HMCP-110 classification.
[0020] FIG. 15 shows HMCP-110 classification using a Lasso regression model to develop classifiers based on training sets to be assessed using test sets.
[0021] FIG. 16A - FIG. 16B shows the performance of two Lasso-based signatures (gene and genehancer) for CRC vs. HV assessed using test sets. Lasso signatures predict CRC vs. HV disease status in test set with > 91% sensitivity and 80% specificity.
[0022] FIG. 17A - FIG. 17F shows CRC vs. HV class separation based on Lasso signature features.
[0023] FIG. 18A - FIG. 18B shows the performance of two Lasso-based signatures (gene and genehancer) for early CRC vs. HV assessed using test sets. Lasso signatures predict early CRC vs. HV disease status in test set with > 93% sensitivity and 80% specificity.
[0024] FIG. 19A - FIG. 19C shows feature overlap between CRC vs. HV and early CRC vs. HV gene Lasso signatures.
[0025] FIG. 20A - FIG. 20B shows histogram data from the HMCP110 method.
[0026] FIG. 21A - FIG. 21B shows differential feature analysis of genes and genehancer filtered for read count only (>30).
[0027] FIG. 22A - FIG. 22B shows pie charts for top 50 genes.
[0028] FIG. 23A - FIG. 23D shows peak analysis.
[0029] FIG. 24A - FIG. 24B shows HMCP-110 profile of ZIC4 and ZIC1 genes.
[0030] FIG. 25A - FIG. 25C shows boxplots of key genes (FIGN, SIX1, ZIC4) with gDNA from tumours.
[0031] FIG. 26A - FIG. 26E shows 6-fold cross-validation using top varying genes for HMCP-110 classification.
[0032] FIG. 27A - FIG. 27E shows 6-fold cross-validation using top varying genehancers for HMCP-110 classification.
[0033] FIG. 28A - FIG. 28B shows permutation tests (AUC) for SVM models for genes.
[0034] FIG. 29A - FIG. 29B shows permutation tests (AUC) for SVM models for genehancers.
[0035] FIG. 30A - FIG. 30D shows HMCP-110 data/feature exploratory analysis.
[0036] FIG. 31A - FIG. 31B shows a histogram of genehancer signature and label permutation test.
[0037] FIG. 32A - FIG. 32B shows HMCP-110 study sample composition and parameters imbalance.
[0038] FIG. 33 shows the HMCP-110 protocol overview.
[0039] FIG. 34A shows a gene list of biomarkers for CRC-HV (single application) an application of the LASSO model.
[0040] FIG. 34B shows a gene list of biomarkers for earlyCRC-HV (single application) an application of the LASSO model.
[0041] FIG. 35 shows a gene list of biomarkers for 5% CRC-HV - Z-Normalization - a result of analysis to find robust gene signatures.
[0042] FIG. 36 shows a sample cohort numbers used in the HMCP003 secondary analysis.
[0043] FIG. 37A-C shows a distribution of the cohort based on three key variables - age, gender and cancer stage. An age bias is visible in (FIG. 37A) with HV younger than CRC patients. Age and gender is less biased (FIG. 37B) but there is a bias by gender and cancer stage (FIG. 37C)
[0044] FIG. 38A-D shows results of the OSAT sample balancing analysis based on key variables across the 14 strip tubes needed for the HMCP v2 workflow. Each bar of the histogram represents one strip tube processed in the workflow. Each plot shows, for strip tubes 1-14, how well balanced each is for cancer stage, gender, extraction operator and day of extraction. No strip is found to be unbalanced based on chi-square tests.
[0045] FIG. 39A-E shows assessment of the quantity of DNA (concentration and yield) achieved by DNA extraction based on both Qubit and the Bioanalyser (BA) by key cohort metadata and extraction operator. Qubit chi-square tests - Stage: chi-square X2=205, p-value=0.3361, Extraction operator: chi-square X2=197, p-value=0.505. BA chi-square tests - Stage: chi-square X2=210, p-value=0.3718, Extraction operator: chi-square X2=210, p-value=0.371. A good correlation between the two methods is achieved (Pearson's correlation R2=0.994, p-value < 2.2e-16).
[0046] FIG. 40A-B shows an association of total mass (ng) of cfDNA that went into the library preparation stage (denoted conv ng) with Sex, and cancer stage. No bias identified (Sex: chi-square X2=16.4, p-value=0.354, Stage: chi-square X2=28.433, p-value= 0.54, Age Groups: chi-square X2=33.8, p-value=0.287).
[0047] FIG. 41 shows an assessment of DNA quantity included in the workflow based on the NextFlex adapter for inputs. No biases identified (all inputs: chi-square test for NextFlex adapters and operators p-value=1, NextFlex adapters and ng/ul input cfDNA p-value=0.342). The NextFlex adapters contain the library indexes needed for sequencing, which are well balanced across the operators.
[0048] FIG. 42A-C shows a balancing of operators and runs by diagnosis and gender. From Figure 2, extraction operators are well balanced over runs (chi-square test X2=3.2, p-value=0.92) alongside the HMCP operators by diagnosis stage (chi-square test X2=2.29, p-value=0.89) and the gender (chi-square test X2=1.53, p-value=0.673). However, HMCP operators are imbalanced across runs (chi-square test X2=139, p-value<2.2e-16), which is further assessed in FIG. 47A-F.
[0049] FIG. 43A-D shows an association identified between the quantity of input cfDNA and the sequencing metrics including the diversity, uniformity, total de-duplicated reads (bamstats mapped reads) and the mitochondrial genes RPKM.
[0050] FIG. 44A-D shows histograms and boxplots of the de-duplicated sequencing reads.
[0051] FIG. 45A-F shows an assessment of spike ins by clinical diagnosis and HMCP operator.
[0052] FIG. 46A-D shows an assessment of the diversity, uniformity and mitochondrial reads based on the run, operator and clinical diagnosis. Some variation identified in the mitochondrial RPKMs for both input and pulldown (pBGT).
[0053] FIG. 47A-F shows principal components from PCA with different operators (shape = operator #) who performed library preparation and pull down experiments, demonstrating lack of clustering over the first 3 principal components. Plots are based on the top varying Genes (FIG. 47A-C) and Genehancers (FIG. 47D-F).
[0054] FIG. 48A-D shows the first two principal components of top varying regions by sex and age group (shape = subgroup). Plots are based on the top varying Genes (FIG. 48A-B) and Genehancers (FIG. 48C-D). Limited clustering is observed based on these biological variables.
[0055] FIG. 49A-F show PCA using the top varying genes (N=3104) and genehancers (N=1323). Evidence for separation between biological variables is shown. Particularly, separation on PC2 for gene bodies and PC3 for genehancers is shown.
[0056] FIG. 50 shows the number of discriminatory features identified at several FDR thresholds. Many discriminating features are found for CRC vs. HV and early CRC vs. HV comparisons at an FDR of 0.01.
[0057] FIG. 51 shows the top 20 discriminatory genes ranked by adjusted p-value for the CRC vs HV comparison (Mann-Whitney U test). For each gene, its specific prediction power in terms of AUC is computed.
[0058] FIG. 52A-F shows boxplots of the 6 top ranked genes by p-value from CRC vs HV comparison (top varying genes), all of which show an increased level of 5hmC enrichment in CRC over HV.
[0059] FIG. 53A-B shows 5hmC Enrichment Profile of ZIC4 and ZIC1 genes showing increased levels of 5hmC in CRC.
[0060] FIG. 54A-B shows 5hmC Enrichment Profile of SIX1 gene showing increased levels of 5hmC in CRC.
[0061] FIG. 55 shows a disease association table from the GeneAnalytics component of the GeneCards suite, for genes with an adjusted p-value < 0.05 in CRC vs HV comparison. CRC is the top hit for the gene list.
[0062] FIG. 56 shows genes in the CRC vs HV set that are identified as differentially expressed in tissue samples in CRC.
[0063] FIG. 57 shows top 20 genes directly associated with CRC using the VarElect component of the Genecards database. CRC related terms are top hits in this analysis.
[0064] FIG. 58 shows disease association table from the GeneAnalytics component of the GeneCards suite, for genes with an adjusted p-value < 0.05 in CRC vs HV comparison using the All-genes list which does not apply a filter based on co-efficient of variation. CRC and other cancers are the top hits for the gene lists.
[0065] FIG. 59A-B shows histograms of the top 20 Biological pathways from GO analysis of genes with a FDR<0.05 from early CRC vs. HV MWU results (both genders, with read count filtering). Under-enriched pathways are predominantly immune related (FIG. 59A) and over-enriched pathways are predominantly metabolism related (FIG. 59B).
[0066] FIG. 60A-B shows histograms of the top 20 Biological pathways from GO analysis of genes with a FDR<0.05 from late CRC vs. HV MWU results (females only). Under-enriched pathways are immune related (FIG. 60A) and over-enriched pathways are related to adhesion, morphogenesis and development (FIG. 60B).
[0067] FIG. 61A-E shows ROC curves for SVM classifiers built on gene data for disease comparisons (varied by stage and gender). All classifiers may be built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve and permutation test (PT) p-values. The ROC curve achieved during each cross-validation (CV) is shown in light grey. All classifiers show high performance levels with AUCs>0.8.
[0068] FIG. 62A-E shows ROC curves for classifiers built on genehancer data for disease comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve and permutation test (PT) p-values. The ROC curve achieved during each cross-validation (CV) is shown in light grey. All classifiers show high performance levels with AUCs>0.8.
[0069] FIG. 63A-E shows ROC curves for LR RFE classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (where appropriate). ROC curves are annotated with the mean area under the curve. All ROC curves show models built with 20 features. All classifiers show high performance levels with AUCs>0.8.
[0070] FIG. 64A-E shows ROC curves for LR RFE classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (where appropriate). ROC curves are annotated with the mean area under the curve. All ROC curves show models built with 20 features. All classifiers show high performance levels with AUCs>0.7.
[0071] FIG. 65 shows an overview of test and training sets.
[0072] FIG. 66A-B shows performance of LASSO regression model on Genes (AUC 0.883) and Genehancers (AUC 0.937). Final model results in 56 features using genes and 59 features using genehancers (3-fold cross-validation is used in the training process). All classifiers show high performance levels with AUCs>0.85.
[0073] FIG. 67 shows a summary of cross validation results using a LASSO regression model on gene features.
[0074] FIG. 68 shows a summary of independent test set performance using a LASSO regression model on gene features.
[0075] FIG. 69A-B shows PCA based on the list containing the 56 genes having non-zero weight in the computed LASSO model, which also supports the evidence that these features can be effectively used to distinguish between CRC and HV samples (FIG. 64A-E). PCs 1-3 shown.
[0076] FIG. 70A-B shows PCA based on the list containing the 59 genehancers having non-zero weight in the computed LASSO model, which also supports the evidence that these features can be effectively used to distinguish between CRC and HV samples (FIG. 64A-E). PCs 1-3 shown.
[0077] FIG. 71A-B shows performance of LASSO regression model on Genes (AUC = 0.951) and Genehancers (AUC = 0.884) for early CRC vs HV classification. Final model results in 40 features using genes and 25 features using genehancers. 3-fold cross-validation is used. All classifiers show high performance levels with AUCs>0.85.
[0078] FIG. 72 shows a summary of cross validation results using a LASSO regression model on gene features for early CRC vs HV.
[0079] FIG. 73 shows a summary of cross validation results using a LASSO regression model on genehancer features for early CRC vs HV.
[0080] FIG. 74 shows the non-zero genes shared between early CRC vs HV and CRC vs HV classifiers. The table reports the rank of each gene; genes with an outlined box have a negative weight (e.g., MRPS31P2 is the gene with the most negative weight in both classifiers).
[0081] FIG. 75A-75E. MULTIQC PLOTS - Insert size as calculated by the Picard software suite. Run461 to Run465 represent the different sequencing batches. No untoward insert size anomalies were found.
[0082] FIG. 76A-76L: Additional QC plots. FIG. 76A - FIG. 76F) Uniformity and Diversity scores by library preparation strategy (conv ng) assessed over technical and biological variables. FIG. 76G - FIG. 76H) Results of iCNA show a mismatch in predicted gender and % tumour fraction predictions. FIG. 76I - FIG. 76L are metrics from the deeptools plotFingerprint utility that summarise a diagnostic plot that gives an overview of aspects of genomic coverage. Both pBGT and input samples behave as expected, with pBGTs expected to have higher elbow/inflection points, lower AUC and higher x-intercept. No difference is observed by operator.
[0083] FIG. 77A-77N: PCA of samples using features (Genes (FIG. 77A-C), Genehancers (FIG. 77D-F)) that have passed the read count thresholds (>30 reads in input and pBGT) and filtered by the coefficient of variation (>0.2 & <2). The variance explained by each principal component for the gene and genehancer set is given in FIG. 77G-H, demonstrating that the majority of the variance is accounted for in the first three to four principal components. FIG. 77I-N gives plots for genes and genehancers with only the read count thresholds (>30 reads in the input and pBGT).
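The two feature filters named above (read count threshold and coefficient of variation window) can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: it assumes one read count per feature for each of the input and pBGT libraries, and it assumes the coefficient of variation is the sample standard deviation divided by the mean of per-sample values. The feature names, field names and numbers are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean (assumed definition)."""
    m = mean(values)
    if m == 0:
        return float("inf")
    return stdev(values) / abs(m)


def filter_features(features, min_reads=30, cv_low=0.2, cv_high=2.0):
    """Keep features with >min_reads in both libraries and cv_low < CV < cv_high."""
    kept = []
    for name, f in features.items():
        enough_reads = f["input_reads"] > min_reads and f["pbgt_reads"] > min_reads
        cv = coefficient_of_variation(f["values"])
        if enough_reads and cv_low < cv < cv_high:
            kept.append(name)
    return kept


# Hypothetical features: read counts plus per-sample enrichment values.
toy_features = {
    "feature_a": {"input_reads": 120, "pbgt_reads": 85, "values": [10.0, 14.0, 9.0, 15.0]},
    "feature_b": {"input_reads": 25, "pbgt_reads": 90, "values": [10.0, 14.0, 9.0, 15.0]},
    "feature_c": {"input_reads": 200, "pbgt_reads": 150, "values": [10.0, 10.1, 9.9, 10.0]},
}
print(filter_features(toy_features))
```

Here feature_b fails the read count threshold and feature_c varies too little across samples, so only feature_a is retained.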
[0084] FIG. 78A-78D: PCAs of the top 20 discriminating/ranked genes for each of the patient subgroups as determined by the MWU test. Results are based on the MWU tests of the read count threshold feature lists (>30 reads in both pBGT and input). Clustering by diagnosis is evident based on the top 20 features alone.
[0085] FIG. 79A - 79D: PCAs of the top 20 discriminating/ranked genehancers for each of the patient subgroups. Results are based on the MWU tests of the read count threshold feature lists (>30 reads in both pBGT and input). Clustering by diagnosis is evident based on the top 20 features alone.
[0086] FIG. 80A - 80D: PCAs of the top 20 discriminating/ranked genes for each of the patient subgroups. Results are based on the MWU tests of the top varying features (>30 reads in both pBGT and input plus coefficient of variation >0.2 & <2). Clear separation between CRC and HV samples is demonstrated.
[0087] FIG. 81A - 81D: PCAs of the top 20 discriminating/ranked genehancers for each of the patient subgroups. Results are based on the MWU tests of the top varying features (>30 reads in both pBGT and input plus coefficient of variation >0.2 & <2). Clustering by diagnosis is evident based on the top 20 features alone.
[0088] FIG. 82A - 82F: Boxplots of the top 6 discriminating genehancers demonstrating separation between CRC and HVs (top varying list). Increased levels of 5hmC are found for CRC over HV for these top 6 genehancers.
[0089] FIG. 83A - 83F: Boxplots of the top 6 discriminating genes demonstrating separation between early CRC and HVs (top varying list). Increased levels of 5hmC are found for early CRC over HV for these top 6 genes.
[0090] FIG. 84A - 84F: Boxplots of the top 6 discriminating genes demonstrating separation between late CRC and HVs (top varying list). The majority of the top 6 genes show increased levels of 5hmC for late CRC over HV.
[0091] FIG. 85A - 85F: Boxplots of the top 6 discriminating genehancers demonstrating separation between early CRC and HVs (top varying list). Increased levels of 5hmC are found for early CRC over HV for these top 6 genehancers.
[0092] FIG. 86A - 86F: Boxplots of the top 6 discriminating genehancers demonstrating separation between late CRC and HVs (top varying list). Increased levels of 5hmC are found for late CRC over HV for these top 6 genehancers.
[0093] FIG. 87: Prediction score (in terms of AUC) of the top 20 most discriminating genes (top-varying comparison) between CRC and HV based on age groups. Those with a score > 0.7 are highlighted in red. The top 20 genes do not show any clear prediction power for these three age groups.
[0094] FIG. 88A - 88F: Boxplots of the top 6 discriminating genes demonstrating separation between CRC and HVs by age group and diagnosis (top varying list). Enrichment levels are fairly similar across the age groups for HVs, while increased 5hmC levels are evident in the CRC patients.
[0095] FIG. 89A - 89F: Boxplots of the top 6 discriminating genehancers demonstrating separation between CRC and HVs by age group and diagnosis (top varying list). Enrichment levels are fairly similar across the age groups for HVs, while increased 5hmC levels are evident in the CRC patients.
[0096] FIG. 90: Rank comparison between random subgroups of patients (50:50 split). Test to see if the same top genes come up in both subgroups. RC = read count threshold. TV= top varying. The majority of the comparisons show no statistical difference in rank between the subgroups.
[0097] FIG. 91: Summary of DESeq2 results with covariates. A high number of features are identified as significantly discriminatory based on the default DESeq2 threshold of <0.1 adjusted p-value.
[0098] FIG. 92: DESeq vs. MWU rank comparison tests - Genes. Gender and age have a stronger effect in the early CRC comparisons. P-value from the rank comparison test <0.05 are highlighted in red. The addition of the covariates makes the most difference for the early CRC vs. HV comparison.
[0099] FIG. 93: DESeq vs. MWU rank comparison tests - Genehancers. Gender and age have little effect on the rank comparisons. The addition of any covariates does not significantly affect the rank of the discriminating genehancer lists, with approximately ¾ of genehancers identified by both methods (DESeq2 and MWU tests).
[00100] FIG. 94A - 94F: Top 6 genes ranked by DESeq2 test between CRC and HV including age and gender as covariates. Many of these genes (4/6) were also in the top 6 for the MWU test.
[00101] FIG. 95A - 95E: Receiver operator characteristic (ROC) curves for SVM classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including variance filtering (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve and permutation test p-value is reported for each model. All classifiers achieved a mean AUC of 0.8 or above.
[00102] FIG. 96A - 96E: Receiver operator characteristic (ROC) curves for SVM classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including variance filtering (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve and permutation test p-value is reported for each model. All classifiers achieved a mean AUC of 0.76 or above.
[00103] FIG. 97A - 97E: ROC curves for logistic regression classifiers built on gene data for disease comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including coefficient of variation filtering (>0.2) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the ROC curves from each cross-validation round (light grey) and the mean area under the curve (green). All mean ROC curves show good performance with a minimum mean AUC of 0.83.
[00104] FIG. 98A - 98E: Receiver operator characteristic (ROC) curves for logistic regression classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including coefficient of variation filtering (>0.2) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the ROC curves from each cross-validation round (light grey) and the mean area under the curve (green). All mean ROC curves show good performance with a minimum mean AUC of 0.79.
[00105] FIG. 99A - 99E: Receiver operator characteristic (ROC) curves for logistic regression classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including a variance filter (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
[00106] FIG. 100A - 100E: Receiver operator characteristic (ROC) curves for logistic regression classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including a variance filter (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
[00107] FIG. 101A - 101E: ROC curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test (p-value>0.25) for age and/or gender (where appropriate) followed by RFE (20 features retained). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
[00108] FIG. 102A - 102E: ROC curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers were built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test (p-value>0.25) for age and/or gender (where appropriate) followed by RFE (20 features retained). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
[00109] FIG. 103A - 103B: Receiver operator characteristic (ROC) curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on gene data for age groups (<61 and >61) comparisons. All classifiers were built using 6-fold cross-validation including variance filtering (>0.1 and chi-square test for gender where appropriate, p-value>0.25) plus recursive feature elimination to 20 features. ROC curves are annotated with the mean area under the curve. Age classifiers perform worse than disease classifiers.
[00110] FIG. 104A - 104B: Receiver operator characteristic (ROC) curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on genehancer data for age groups (<61 and >61). All classifiers were built using 6-fold cross-validation including variance filtering (>0.1 and chi-square test for gender where appropriate, p-value>0.25) plus recursive feature elimination to 20 features. ROC curves are annotated with the mean area under the curve. Age classifiers perform worse than disease classifiers.
[00111] FIG. 105A - 105B. LASSO weights for gene and genehancer datasets. Only non-zero elements are reported.
[00112] FIG. 106. LASSO weights for genes in the early CRC vs HV classifier. Only non-zero elements are reported.
[00113] FIG. 107A - 107B. PCA performed on the 40 non-zero genes in the early CRC vs HV classifier. Results show a clear split between early CRC and HV samples on PC1.
[00114] FIG. 108A - 108B. PCA performed on the 13 non-zero genes shared between the early CRC vs HV and CRC vs HV classifiers. The plots highlight early CRC and HV samples. Results show a clear split between early CRC and HV samples on PC1.
[00115] FIG. 109A - 109B. PCA performed on the 13 non-zero genes shared between the early CRC vs HV and CRC vs HV classifiers. The plots highlight CRC and HV samples. Although the results show a clear split between the sample groups, this separation is weaker than that observed before for early CRC and HV.
[00116] FIG. 110. Performance of the LASSO model trained on 1,000 independent permutations of the labels of the original dataset. As expected, the average AUC for the permutation test is 0.5 (random classification).
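The expectation that label permutation yields an average AUC near 0.5 can be illustrated with a simplified sketch. Note the simplification: the analysis above retrains the LASSO model on each permuted label set, whereas this sketch only shuffles the labels against a fixed set of scores to build a null distribution of AUC. The scores, labels, seed and function names are hypothetical.

```python
import random
from statistics import mean


def auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_aucs(scores, labels, n_permutations=1000, seed=0):
    """AUC of the fixed scores against n_permutations random shufflings of the labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    results = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        results.append(auc(scores, shuffled))
    return results


# Hypothetical, perfectly separated toy data: 10 negatives then 10 positives.
toy_scores = [i / 20 for i in range(20)]
toy_labels = [0] * 10 + [1] * 10

observed = auc(toy_scores, toy_labels)
null = permutation_aucs(toy_scores, toy_labels)
# Empirical p-value with the usual +1 correction so it is never exactly zero.
p_value = (1 + sum(a >= observed for a in null)) / (1 + len(null))
print(observed, round(mean(null), 2), p_value)
```

A permutation p-value of this kind is one way to express how unlikely the observed AUC would be if the labels carried no information.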
[00117] FIG. 111A - 111B. AUCs on 1,000 different splits of the original dataset. Reference Split indicates the train/test datasets used in the main analysis described in the document. Results on the reference split are very similar to the median obtained on the 1,000 splits, suggesting that this split does not over/under train our model.
[00118] FIG. 112. Volunteer age distribution. The percentage of CRC for each age is reported on the top of each bar. It is evident that most of the HV samples are in the youngest cohort (shaded), and most of the CRC samples are in the oldest cohort (solid).
[00119] FIG. 113. PCA based on the 56 non-zero genes. The first 5 components are shown and samples are shaded according to the volunteer's age. PCA does not reveal any clear separation between different volunteer ages.
[00120] FIG. 114. PCA based on the 56 non-zero genehancers. The first 5 components are shown and samples are shaded according to the volunteer's age. PCA does not reveal any clear separation between different volunteer ages.
[00121] FIG. 115A - 115C. From left to right and top to bottom: AUCs on 100 different splits of the original dataset, where the model is trained and tested on the volunteers' ages; a table reporting the median AUC for the Age and CRC classifiers and the p-value resulting from the Mann-Whitney test; and a list of non-zero genes/genehancers in the LASSO model trained on the volunteers' ages, with the genes shared between this model and the model trained on CRC-HV shown in red (no shared genehancers were found). Results reject the hypothesis that age can be a confounding factor in the training of the model.
[00122] FIG. 116A-B. Distribution of the number of non-zero genes/genehancers found in the 200 simulations. Variability in the number of discriminating features is observed.
[00123] FIG. 117A - 117J. Performance of the CRC-HV gene-trained model on external datasets. The first row shows the ROC obtained by using the CRC-HV classifier, highlighting good accuracy in terms of prediction (21 Samples AUC = 0.806, Group1 AUC = 0.817 and Group2 AUC = 0.752). The second, third, and fourth rows show PCA for the 21 samples, Group1 and Group2 respectively (PCs 1-3 are shown). The last row shows sensitivity and specificity on these datasets when the threshold (0.36) learnt in the cross validation process is used to classify CRC and HV.
[00124] FIG. 118A - 118B. Performance of the earlyCRC-HV gene-trained model on the external dataset containing 21 samples (all 7 CRC samples were earlyCRC). Results show AUC = 0.643, specificity = 0.85, and sensitivity = 0.28.
[00125] FIG. 119: List of the 56 non-zero genes in the Lasso classifier.
[00126] FIG. 120: List of the 59 non-zero genehancers in the Lasso classifier.
[00127] FIG. 121: List of the non-zero genes in the 200 simulations of the Lasso classifier. Only genes occurring in more than 10% of the simulations are reported. Genes shared with the list containing the 56 non-zero genes in the CRC-HV classifier used in the main analyses (FIG. 92) are shown in red. The Mean/Min/Max weight columns contain the Mean/Min/Max weight found in the 200 simulations (note that only non-zero weights were used in this calculation). The Training weight column contains the weight of the feature in the Lasso model used in the main analyses.
[00128] FIG. 122: List of the non-zero genehancers in the 200 simulations of the Lasso classifier. Only genehancers occurring in more than 10% of the simulations are reported. Genehancers shared with the list containing the 59 non-zero genehancers in the CRC-HV classifier used in the main analyses (FIG. 93) are shown in red. The Mean/Min/Max weight columns contain the Mean/Min/Max weight found in the 200 simulations (note that only non-zero weights were used in this calculation). The Training weight column contains the weight of the feature in the Lasso model used in the main analyses.
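The tabulation described for FIG. 121 and FIG. 122 (selection frequency across simulations, with mean/min/max computed over non-zero weights only) can be sketched as below. This is an illustrative sketch only; the representation of each simulation as a mapping from feature name to weight, the feature names GENE_A and GENE_B, and the weights are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_selection(runs, min_fraction=0.10):
    """Summarize features whose weight is non-zero in more than min_fraction of runs.

    runs: list of dicts mapping feature name -> model weight for one simulation.
    Mean/min/max are computed over the non-zero weights only.
    """
    nonzero = defaultdict(list)
    for run in runs:
        for feature, weight in run.items():
            if weight != 0:
                nonzero[feature].append(weight)
    summary = {}
    for feature, weights in nonzero.items():
        fraction = len(weights) / len(runs)
        if fraction > min_fraction:
            summary[feature] = {
                "frequency": fraction,
                "mean": mean(weights),
                "min": min(weights),
                "max": max(weights),
            }
    return summary


# Ten hypothetical simulations: GENE_A is selected in 5, GENE_B in exactly 1.
toy_runs = [{"GENE_A": w, "GENE_B": 0.0} for w in (0.5, 0.25, 0.75, 0.5, 0.5)]
toy_runs += [{"GENE_A": 0.0, "GENE_B": 0.0} for _ in range(4)]
toy_runs += [{"GENE_A": 0.0, "GENE_B": -0.5}]
print(summarize_selection(toy_runs))
```

GENE_B appears in exactly 10% of the toy runs and is therefore excluded, since the criterion is more than 10%.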
[00129] FIG. 123A-C. FIG. 123A shows the LASSO scores computed for the 21 samples. Red and blue bars highlight CRC and HV samples, respectively. The horizontal red dotted line shows the optimal classification threshold inferred from the HMCP-110 dataset (0.091). FIG. 123B shows the ROC of the classification model on this dataset (AUC = 0.79). FIG. 123C is a table showing the CRC-HV prediction performance of the model when the inferred classification threshold is applied. Notably, increasing the threshold from 0.091 to 0.15 improves the specificity of the classifier (12 of 14 HV and 5 of 7 CRC samples are correctly identified; sensitivity = 0.71 and specificity = 0.86).
[00130] FIG. 124 shows a table of selected gene subsets having above 5% frequency, 10% frequency or 21% frequency.
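The sensitivity and specificity figures quoted for FIG. 123C follow directly from the stated counts (5 of 7 CRC and 12 of 14 HV correctly identified at a threshold of 0.15). The sketch below reproduces that arithmetic. It assumes a sample is called CRC when its score is at or above the threshold, and the per-sample scores are invented solely so that the counts at 0.15 match the text; they are not the actual LASSO scores.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Call a sample positive (CRC) when score >= threshold; labels: 1 = CRC, 0 = HV."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores for 7 CRC and 14 HV samples (illustrative values only).
crc_scores = [0.40, 0.35, 0.30, 0.22, 0.18, 0.05, 0.02]
hv_scores = [0.20, 0.16, 0.12, 0.10, 0.08, 0.07, 0.06,
             0.05, 0.04, 0.03, 0.03, 0.02, 0.01, 0.01]
toy_scores = crc_scores + hv_scores
toy_labels = [1] * len(crc_scores) + [0] * len(hv_scores)

sens, spec = sensitivity_specificity(toy_scores, toy_labels, threshold=0.15)
print(round(sens, 2), round(spec, 2))
```

Raising a threshold of this kind can only keep or reduce the number of samples called positive, which is why specificity can improve while sensitivity stays the same or falls.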
[00131] FIG. 125 shows one example of the 5-hydroxymethylcytosine (5-hmC) Pulldown Label Copy Enrich (HMCP LCE) method detailed herein.
[00132] FIG. 126 shows one example of the 5-hmC Pulldown Copy Label Enrich (HMCP CLE) method detailed herein.
[00133] FIG. 127 shows one example of the 5-hmC Pulldown Label Random prime Enrich (HMCP LRE) method detailed herein.
[00134] FIG. 128 shows one example of the 5-hmC Pulldown Random primer Label Enrich (HMCP RLE) method detailed herein.
[00135] FIG. 129 shows one example of the 5-hmC Pulldown Label Loci Specific Enrich (HMCP LLSE) method detailed herein.
[00136] FIG. 130 shows one example of the 5-hmC Pulldown Loci Specific Label Enrich (HMCP LSLE) method detailed herein.
[00137] FIG. 131 shows a gene list of biomarkers for 5% CRC-HV - No Z-Normalization - an analysis to find robust gene signatures.
[00138] FIG. 132 shows a genehancer list of biomarkers for CRC-HV (single application) an application of the LASSO model.
[00139] FIG. 133 shows a genehancer list of biomarkers for earlyCRC-HV (single application) an application of the LASSO model.
[00140] FIG. 134 shows a list of biomarkers for CRC HV genes TV - a result of statistical tests between a sample and a control using a Top Varying feature set (features that have a coefficient of variation > 0.2).
[00141] FIG. 135 shows a list of biomarkers for CRC HV genehancers TV - a result of statistical tests between a sample and a control using a Top Varying feature set (features that have a coefficient of variation > 0.2).
[00142] FIG. 136 shows a list of biomarkers for earlyCRC HV genes TV - a result of statistical tests between a sample and a control using a Top Varying feature set (features that have a coefficient of variation > 0.2).
[00143] FIG. 137 shows a gene list of biomarkers for earlyCRC HV genehancers TV - a result of statistical tests between a sample and a control using a Top Varying feature set (features that have a coefficient of variation > 0.2).
[00144] FIG. 138 shows a summary of the genes chosen during the RFE across the 6-fold cross-validation alongside their corresponding p-values from the MWU tests.
[00145] FIG. 139 shows a table of genes distinguishing earlyCRC from HV. Top varying filter, Mann Whitney test, Bonferroni correction, adjusted p < 0.05, N = 29.
[00146] FIG. 140 shows a table of genes distinguishing earlyCRC from HV. FDR < 0.05, N = 405.
[00147] FIG. 141 shows a table of genehancers distinguishing earlyCRC from HV. Top varying filter, Mann Whitney test, Bonferroni correction, adjusted p < 0.05, N = 83.
[00148] FIG. 142 shows a table of genehancers distinguishing earlyCRC from HV. Top varying filter, FDR < 0.05, N = 447.
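The difference in list sizes between the Bonferroni-corrected and FDR-controlled tables above reflects how the two corrections adjust p-values. The following sketch contrasts them on toy values. It assumes the FDR control is the Benjamini-Hochberg step-up procedure, which is an assumption since the procedure is not named above, and the p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from per-feature tests (illustrative only).
toy_pvalues = [0.001, 0.012, 0.02, 0.045, 0.5]

n_bonferroni = sum(p < 0.05 for p in bonferroni(toy_pvalues))
n_fdr = sum(p < 0.05 for p in benjamini_hochberg(toy_pvalues))
print(n_bonferroni, n_fdr)
```

On the same inputs the Bonferroni correction retains fewer features than the FDR procedure, consistent with the smaller N reported for the Bonferroni-corrected tables.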
DETAILED DESCRIPTION
[00149] While various embodiments have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It should be understood that various alternatives to the embodiments herein may be employed.
[00150] A method may comprise assaying a sample for a nucleotide sequence having at least: 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99% sequence homology to a biomarker or active fragment thereof to produce a result. The biomarker or active fragment thereof may comprise a gene or portion thereof. The biomarker may comprise a genehancer or a portion thereof. The biomarker may comprise a transcription factor or a portion thereof. The biomarker may not be previously associated with a cancer. An epigenetic modification of the biomarker may not be previously associated with a cancer. The assaying may identify a presence of an epigenetic modification. The assaying may identify a presence of one or more of methylcytosine (mC), a hydroxymethylated cytosine (hmC), a carboxycytosine (caC), a formylcytosine (fC), or any combination thereof at one or more positions in the biomarker. The assaying may identify an epigenetic signature.
[00151] The sample may be obtained from a subject having been previously diagnosed to have cancer. The sample may be obtained from a subject having cancer. The sample may be obtained from a subject suspected of having cancer. The sample may be obtained from a subject asymptomatic of cancer. The sample may be obtained from a subject not previously diagnosed with cancer. The sample may be obtained from a subject during an early screening procedure. The sample may be obtained from a subject having a risk of cancer - such as a presence of a biomarker or familial genetic history.
[00152] The sample obtained from the subject may be a blood sample, a fine needle aspirate (FNA) sample, a tissue sample, a fecal sample or any combination thereof. The sample may comprise cell-free DNA. The sample may comprise a small sample volume, for example, from about 1 nanogram to about 15 ng. The sample may comprise a small sample volume, for example from about 1 cell to about 1000 cells; from about 1 cell to about 500 cells; from about 1 cell to about 100 cells. A sample may comprise a first portion comprising a blood sample and a second portion comprising a tissue sample or a fecal sample.
[00153] A result of assaying may be compared to a result obtained from a control sample. The control sample may comprise a database of control samples. The control sample may comprise at least: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200 independent samples. The control sample may comprise at least 5 independent samples. The control sample may comprise at least 10 independent samples. The control sample may comprise at least 20 independent samples. The control sample may comprise at least 50 independent samples. The control sample may comprise at least 100 independent samples. The control sample may comprise a blood sample, an FNA sample, a tissue sample, or any combination thereof. The control sample may be obtained from a healthy volunteer. The control sample may be obtained from a subject having received a positive diagnosis of cancer. The control sample may be obtained from a subject having a specific cancer type, such as a colorectal cancer, a colon cancer, etc. The control sample may include a sample previously obtained from the same subject, such as a sample obtained at an earlier point in time. The control sample may include a sample obtained from a different subject.
[00154] Comparing a result from a sample to a result obtained from a control sample may identify the sample as benign or malignant for a cancer. A comparison of a result may include a differential gene expression, a presence or absence of an epigenetic modification at a position in a gene or genehancer, a difference in an epigenetic signature, a presence or absence of a sequence variant, a difference in a copy number of a gene, or any combination thereof. A comparison to a result from a control sample may identify the sample as being indicative of a particular stage of a cancer, a particular type of cancer, a risk of developing a cancer, a risk of a cancer recurring, a risk of metastasis, or any combination thereof.
[00155] The assaying may include sequencing a nucleotide sequence present in the sample. The nucleotide sequence may have at least 85% sequence homology to a biomarker or active fragment thereof. The assaying may include selecting for or sorting for nucleotide sequences having at least 85% sequence homology to at least a portion of a biomarker. The assaying may employ one or more probes specific for one or more biomarkers or portions thereof as described herein. One or more biomarkers may be assayed. At least 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 50, 100, 150, 200 biomarkers may be assayed. At least 5 biomarkers may be assayed. At least 10 biomarkers may be assayed. At least 15 biomarkers may be assayed. At least 20 biomarkers may be assayed. At least 50 biomarkers may be assayed. At least 100 biomarkers may be assayed. At least 200 biomarkers may be assayed.
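One simple way to apply a percentage threshold such as the 85% figure above is to compute percent identity between a read and a reference after alignment. The sketch below is illustrative only: it assumes the two sequences are already aligned and of equal length, and it uses percent identity as a stand-in for sequence homology, which is not defined computationally herein. The sequences and function names are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of matching positions between two pre-aligned, equal-length sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(seq_a.upper(), seq_b.upper()) if x == y)
    return 100.0 * matches / len(seq_a)


def meets_threshold(seq_a, seq_b, min_percent=85.0):
    """True when the percent identity is at least min_percent."""
    return percent_identity(seq_a, seq_b) >= min_percent


# Hypothetical 20-base reference and a read differing at two positions.
toy_reference = "ACGTACGTACGTACGTACGT"
toy_read = "ACGTACGAACGTACGTACCT"
print(percent_identity(toy_reference, toy_read), meets_threshold(toy_reference, toy_read))
```

In practice gaps and local alignment would also need to be handled, which this sketch deliberately omits.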
[00156] The assaying may include detecting an epigenetic modification in a nucleotide sequence present in the sample. The detecting may include detecting a methylcytosine (mC), a hydroxymethylated cytosine (hmC), a carboxycytosine (caC), a formylcytosine (fC), or any combination thereof. The detecting may include distinguishing between two or more types of epigenetic modifications, such as distinguishing mC from hmC. The epigenetic modification may be detected in any number of ways including but not limited to sequencing (such as nanopore sequencing, high-throughput sequencing), bisulfite sequencing, antibody-specific labeling (such as use of radio-labeling, click chemistry, fluorescent moieties), sugar moiety addition (including glucose or gentiobiose or a combination, wherein the addition may be by an enzyme such as bGT), thin-layer chromatography, TET enzymatic modification, methyltransferase activity (such as DNMT1), blotting assays, an ELISA assay, the HMCP v2 method, or any combination thereof. In some cases, the detecting may comprise sequencing. In some cases, the detecting may comprise nanopore sequencing. In some cases, the detecting may comprise high-throughput sequencing. In some cases, the detecting may comprise associating a label with an epigenetically modified base of a nucleic acid sequence to form a labeled nucleic acid sequence; hybridizing a substantially complementary strand to the labeled nucleic acid sequence; and amplifying the substantially complementary strand in a reaction in which the labeled nucleic acid sequence is substantially not present. In some cases, the detecting may comprise contacting the sample with an enzyme or a catalytically active fragment thereof that converts a methylated residue in the sample to a modified base. In some cases, the detecting may comprise covalently labeling a hydroxyl group on a hydroxymethylated residue in the sample to generate a labeled hydroxymethylated residue; and sequencing said sample comprising said labeled hydroxymethylated residue or derivatives thereof. In some cases, the detecting may comprise contacting at least a portion of the sample with an enzyme that utilizes a labeled glucose or a labeled glucose-derivative donor substrate to add a labeled glucose molecule or a labeled glucose-derivative to a 5-hydroxymethylcytosine in the sample to generate a labeled glucosylated-5-hydroxymethylcytosine. In some cases, the detecting may comprise adding a detectable label to the epigenetic modification. In some cases, the detectable label may comprise an antibody. In some cases, the detecting may comprise a FRET assay. In some cases, the detecting may comprise an ELISA assay. In some cases, the detecting may comprise an LCMS assay. In some cases, the identifying may comprise adaptor ligation. In some cases, the detecting may comprise detecting caC or fC. In some cases, the detecting may comprise detecting a kinetic change during sequencing wherein the kinetic change is relative to the control or derivative thereof and comprises a change in interpulse duration, pulse width, or a combination thereof, wherein the presence of the kinetic change indicates the presence of the epigenetic modification in the sample. In some cases, the detecting may comprise associating a label with the epigenetic modification in a nucleic acid sequence to form a labeled nucleic acid sequence; hybridizing a substantially complementary strand to the labeled nucleic acid sequence; and amplifying the substantially complementary strand in a reaction in which the labeled nucleic acid sequence is substantially not present.
[00157] A presence or an absence of an epigenetic modification may comprise a level of an epigenetic modification. A presence or an absence of an epigenetic modification may comprise a presence or an absence at one or more specific positions in a biomarker. A presence or an absence may comprise a pattern or signature of epigenetic modifications. An epigenetic modification may comprise a 5mC, a 5hmC, a 5caC, a 5fC, or any combination thereof. A presence or an absence of an epigenetic modification may comprise a number of methylated sites in the biomarker, in the transcription factor (TF) associated with the biomarker, in a region of the genome associated with the biomarker or TF, or any combination thereof. A presence or an absence of an epigenetic modification may comprise a number of hypo-hydroxymethylated loci, a number of hyper-hydroxymethylated loci, or a combination thereof in the biomarker, in the TF associated with the biomarker, in a region of the genome associated with the biomarker or TF, or any combination thereof. A loss of an epigenetic modification may be indicative of a presence of cancer in the sample, such as a loss of 5-hmC. A gain of an epigenetic modification may be indicative of a presence of cancer in the sample.
[00158] A method may comprise assaying a sample for a metabolic-related biomarker, an immune-related biomarker, cell growth related biomarker, apoptosis related biomarker, protein degradation related biomarker, endocrine related biomarker, cell movement or morphology related biomarker, or any combination thereof to obtain a result. A biomarker may be associated with an Ingenuity Pathway. A biomarker may be a metabolic-related biomarker, an immune-related biomarker, or any combination thereof. A comparison of the result to a result from a control sample may identify the sample as benign or malignant for a cancer. A result may include assaying a sample for a population of immune cells, including a number of immune cells or immune cell subtypes. Immune cell subtypes may include T cells, B cells, neutrophils, basophils, eosinophils, or any combination thereof. A result may include assaying a sample for a population of immune cells and quantifying one or more markers expressed by the population of immune cells.
[00159] A method may comprise identifying a presence or an absence of an early stage cancer or a late stage cancer in a sample. The cancer may be colorectal cancer, a colon cancer, or others. The method may identify the sample as having a particular stage of cancer, such as stage I, II, III, or IV. The method may identify the sample as having an aggressive type of cancer. The identification may be based on a comparison to a control sample. For example, the sample may be assayed for a result and the result may be compared to a result obtained from a control sample. The control sample may comprise samples obtained from early-stage cancer and late stage cancer, aggressive types of cancer, stage I cancers, stage II cancers, stage III cancers, stage IV cancers, metastatic cancers, or any combination thereof. The assaying may include assaying for at least a portion of a biomarker. The comparison may include comparing a presence or an absence of an epigenetic modification between the control sample and the sample. The comparison may include comparing a differential gene expression, a presence or an absence of a sequence variant, a copy number, a presence or an absence of an epigenetic modification, a patient's genetic history, a patient's environmental history, or any combination thereof.
[00160] A method may identify the sample as representative of a subtype of the cancer, such as an aggressive type of cancer. A method may identify the sample as representative of a subtype of the cancer, such as a tissue type (i.e. colorectal cancer). A method may identify the sample as representative of a subtype of the cancer, such as a stage I, stage II, stage III, or stage IV cancer. A method may identify the sample as representative of a subtype of the cancer, such as a colon cancer that may be a serrated adenoma or a tubular adenoma. A method may identify the sample as representative of a subtype of the cancer, such as a colon cancer that may be CMS1, CMS2, CMS3, or CMS4.
[00161] A result obtained from assaying may be input into a computer processor. A result obtained from assaying may be input into a trained algorithm. A result including the presence or absence of an epigenetic modification may be input into the trained algorithm. A result including a number of immune cells, types of immune cells, or combinations thereof may be input into the trained algorithm.
[00162] A trained algorithm may be a classifier, a supervised machine learning algorithm, or a molecular classifier. Epigenetic data (or additionally gene expression data, sequence variant data, copy number data, immune population data, or others) may in some cases be improved through the application of algorithms designed to normalize and/or improve the reliability of the data. Data analysis may employ a computer or other device, machine or apparatus for application of the various algorithms described herein due to the large number of individual data points that may be processed. A "machine learning algorithm" may refer to a computational-based prediction methodology, also known to persons skilled in the art as a "classifier", employed for characterizing epigenetic data, gene expression data, sequence variant data, copy number data, any combination thereof or others. The data obtained from a sample may be input to the algorithm in order to classify the sample, such as benign or malignant for a cancer. Supervised learning generally involves "training" a classifier with a training set to recognize the distinctions among classes or disease states and then "testing" the accuracy of the classifier on an independent test set. For new, unknown samples the classifier can be used to predict the class in which the samples belong, such as benign or malignant for a cancer.
[00163] A trained algorithm may identify significant differences in epigenetic data, such as a significant difference in a presence or an absence of an epigenetic modification, as determined by feature selection using LIMMA (linear models for microarray data) and SVM (support vector machine) for classification of malignant vs. benign samples. Rank or weight denotes the marker significance (lower rank, higher significance) after Benjamini and Hochberg correction for False Discovery Rate (FDR). A trained algorithm may include a support vector machine (SVM) algorithm, a random forest algorithm, or a combination thereof.
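By way of illustration only, the Benjamini and Hochberg correction and the resulting marker ranking may be sketched as follows. This is a minimal sketch, not the implementation used to generate the data described herein: the function names `benjamini_hochberg` and `rank_markers`, the dictionary input, and the use of raw per-marker p-values (for example, as produced by a LIMMA-style test) are assumptions made for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def rank_markers(marker_p_values):
    """Rank markers by adjusted p-value; rank 1 is the most significant.

    marker_p_values: hypothetical dict mapping marker name -> raw p-value.
    """
    names = list(marker_p_values)
    adjusted = benjamini_hochberg([marker_p_values[n] for n in names])
    ranked = sorted(zip(names, adjusted), key=lambda pair: pair[1])
    return [(rank, name, q) for rank, (name, q) in enumerate(ranked, start=1)]
```

In this sketch a lower rank corresponds to a smaller adjusted p-value, consistent with the convention above that a lower rank denotes higher significance.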
[00164] LIMMA may be used for feature selection. Classification may be performed with a random forest algorithm or SVM methods. Markers that repeatedly appear in multiple iterative rounds of training, classification, and cross validation may be identified and ranked. A joint set of core features may be created using the top ranked features. Biomarkers with a non-zero repeatability score may be selected as significant.
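As a non-limiting sketch of how markers that repeatedly appear across iterative rounds may be identified and ranked, the example below counts how often each marker falls within the top-ranked features of a round. The exact definition of the repeatability score is not specified above; here it is assumed, for illustration, to be the fraction of rounds in which a marker is among the top `top_k` features. The function names and the list-of-lists input are likewise hypothetical.

```python
from collections import Counter


def repeatability_scores(rounds, top_k):
    """Fraction of rounds in which each marker is among the top_k features.

    rounds: list of ranked marker lists, one list per round of training,
            classification, and cross validation (most significant first).
    """
    counts = Counter()
    for ranked in rounds:
        # A marker is counted at most once per round.
        counts.update(set(ranked[:top_k]))
    n_rounds = len(rounds)
    return {marker: counts[marker] / n_rounds for marker in counts}


def core_features(rounds, top_k):
    """Markers with a non-zero score, most repeatable first (ties by name)."""
    scores = repeatability_scores(rounds, top_k)
    return sorted(scores, key=lambda m: (-scores[m], m))
```

Markers that never reach the top `top_k` in any round receive no score in this sketch and are therefore excluded from the joint set of core features.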
[00165] A result of a trained algorithm may be output in a report. Results may be presented as a report on a computer screen or as a paper record. In some cases, the report may include, but is not limited to, such information as one or more of the following: the number of biomarkers comprising an epigenetic modification, a classification of a sample as benign or malignant for a cancer, the suitability of the original sample, a diagnosis, a statistical confidence for the diagnosis, the likelihood of cancer or malignancy, a recommendation for further treatment, or any combination thereof.
[00166] The comparison to a control sample may be performed by a trained algorithm. A trained algorithm may be trained to identify feature selections within a data set. A trained algorithm may classify a sample as benign or malignant for a cancer. A cancer may include a colorectal cancer, or a colon cancer.
[00167] In some cases, the methods may include identifying a sample as benign or malignant for cancer. In some cases, the method may include identifying a sample as premalignant or precancerous. In some cases, the methods may include identifying a presence of or likelihood of developing a tumor, neoplasm, or cancer. A cancer may include a colon cancer, a rectal cancer, or any combination thereof. In some cases, the methods may include identifying a presence of a premalignant condition or a precancerous lesion or growth. A premalignant condition or precancerous lesion or growth may comprise a polyp (such as an adenomatous polyp), a nonpolyp, an adenoma, a dysplasia (such as high grade or low grade), or any combination thereof. In some cases, the methods may include distinguishing a premalignant condition from a benign condition (such as a benign polyp, benign lesion, benign hyperplastic tissue, benign hyperplasia, or the like).
[00168] The methods may include comparing a result obtained from assaying a sample to a result obtained from a control or derivative thereof. The comparing may identify the sample as a precancerous lesion or precancerous growth. The comparing may distinguish a precancerous lesion or growth from a benign condition. The comparing may be performed by a trained algorithm. A precancerous lesion or growth may be identified by performing the methods as described herein on a blood sample. The sample may comprise cell-free DNA.
[00169] Assaying a sample may be performed in the absence of a screening procedure. The methods herein may provide a replacement or alternative to a screening procedure. A screening procedure may include a colonoscopy, an assay performed on a stool sample provided by the subject, a sigmoidoscopy, or any combination thereof. A benefit of the method may include an alternative pre-screening tool that does not require a colonoscopy or providing a stool sample. The method may provide a result having greater than 90% sensitivity and greater than 80% specificity to distinguish a precancerous lesion or growth from a benign condition. When a subject receives a result identifying the sample as benign, the method may permit a subject to opt out or not receive a screening procedure.
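For illustration, sensitivity and specificity for distinguishing a precancerous lesion or growth from a benign condition may be computed from known and predicted labels as sketched below. The label strings, function names, and default targets (greater than 90% sensitivity and greater than 80% specificity, taken from the paragraph above) are assumptions of this example and do not describe any particular data set.

```python
def sensitivity_specificity(truth, predicted, positive="precancerous"):
    """Return (sensitivity, specificity) treating `positive` as the target class."""
    tp = fn = tn = fp = 0
    for t, p in zip(truth, predicted):
        if t == positive:
            if p == positive:
                tp += 1  # precancerous sample correctly identified
            else:
                fn += 1  # precancerous sample missed
        else:
            if p == positive:
                fp += 1  # benign sample called precancerous
            else:
                tn += 1  # benign sample correctly identified
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def meets_targets(sensitivity, specificity, min_sens=0.90, min_spec=0.80):
    """True when both values strictly exceed the stated targets."""
    return sensitivity > min_sens and specificity > min_spec
```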
[00170] A method may comprise assaying a sample for a nucleotide having at least 70% sequence homology to a biomarker listed in FIG. 19B-C, FIG. 34B, FIG. 74, FIG. 136, FIG. 137, FIG. 139, FIG. 140, FIG. 141, FIG. 142, any combination thereof, or any other figure described herein labeled as "earlyCRC from HV". A table described as "earlyCRC from HV" may distinguish an early stage cancer, such as stage I or II from a healthy volunteer. A table described as "earlyCRC from HV" may distinguish a premalignant lesion or growth from a healthy volunteer. Additionally, the assaying may include assaying the sample for a nucleotide having at least 70% sequence homology to a biomarker from Table 1, Table 2, Table 3 or any combination thereof. The assaying may produce a result that may be compared to a result from a control or derivative thereof. The sample may be obtained from a subject asymptomatic for cancer, at risk for developing cancer, not previously diagnosed with cancer, or as part of a routine screening. The comparing may identify the sample as a precancerous lesion or precancerous growth. The assaying may include detecting a presence or an absence of an epigenetic modification. The detecting may comprise detecting by sequence, such as by nanopore sequencing or high throughput sequencing. The control or derivative thereof may comprise samples obtained from a precancerous lesion or growth.
[00171] A method may provide a result in the absence of a further medical procedure, such as a result that may include an identification of the sample as malignant or benign for a cancer. A further medical procedure may include: obtaining a second sample from the subject, such as an invasive sample (such as a biopsy) or a blood sample; performing an imaging scan on a portion of the subject; performing surgery on the subject; or a combination thereof.
[00172] A method may include repeating the assaying. A method may include repeating the comparing to a control sample, such as comparing to a different control sample. A method may provide a result that includes a recommendation for monitoring a change over time in the result. A method may include assaying a second sample from the subject. The second sample may be obtained from the subject at a different period of time, such as an earlier period of time or a later period of time. A method may provide a result that includes a recommendation for the subject to receive a surgery.
[00173] A trained algorithm may be trained with a training set of samples. A trained algorithm may be validated with a validation set of samples. The validation set of samples may be independent of the training set. An independent sample may be input into the trained algorithm that may be independent of both the training set and the validation set.
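One way to keep the training set, the validation set, and the independent samples separate is sketched below. The split fractions, the random seed, and the function name `split_samples` are hypothetical choices made for the example and are not taken from the disclosure.

```python
import random


def split_samples(sample_ids, train_frac=0.6, validation_frac=0.2, seed=0):
    """Partition sample identifiers into three non-overlapping groups.

    Returns (training, validation, independent). Samples left over after the
    training and validation groups are filled form the independent group.
    """
    ids = list(sample_ids)
    # A seeded shuffle makes the partition reproducible.
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_validation = int(len(ids) * validation_frac)
    training = ids[:n_train]
    validation = ids[n_train:n_train + n_validation]
    independent = ids[n_train + n_validation:]
    return training, validation, independent
```

Because each identifier is assigned to exactly one group, no sample used for training can also appear in the validation set or among the independent samples.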
[00174] A training set of samples may include at least: 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500 samples. A training set of samples may include about 5 samples. A training set of samples may include about 20 samples. A training set of samples may include about 50 samples. A training set of samples may include about 100 samples. A training set of samples may include about 200 samples. A training set of samples may include about 300 samples. A training set of samples may include about 500 samples. A training set of samples may include at least: 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 300, 400, 500 cell free DNA samples. A training set of samples may include about 5 cell free DNA samples. A training set of samples may include about 20 cell free DNA samples. A training set of samples may include about 50 cell free DNA samples. A training set of samples may include about 100 cell free DNA samples. A training set of samples may include about 200 cell free DNA samples. A training set of samples may include about 300 cell free DNA samples. A training set of samples may include about 500 cell free DNA samples. A training set of samples may include samples having a malignant diagnosis, a benign diagnosis, or a combination thereof. A training set of samples may include samples obtained from healthy volunteers, subjects diagnosed with cancer, or a combination thereof. A training set of samples may include cell free DNA samples, genomic DNA samples, biopsy samples, FNA samples, tissue samples, or any combination thereof. A training set of samples may include more than one subtype of cancer. A training set of samples may include genomic DNA samples and cell free DNA samples. A training set of samples may include genomic DNA samples. A training set of samples may include cell free DNA samples. A training set of samples may include one or more samples having a sequence comprising a CpG island.
[00175] A presence or an absence of an epigenetic modification may identify a sample as comprising a benign or malignant tissue. In some cases, a read count threshold between the sample and control or derivative thereof may be at least: 10, 20, 30, 40, or 50. A read count threshold may be greater than about 10. A read count threshold may be greater than about 20. A read count threshold may be greater than about 30. A read count threshold may be greater than about 40. In some cases, a FDR threshold may be less than about: 0.5, 0.1, 0.05, or 0.01. In some cases, a FDR threshold may be less than about 0.01. In some cases, a FDR threshold may be less than about 0.05. In some cases, a FDR threshold may be less than about 0.1. In some cases, a FDR threshold may be less than about 0.5. A biomarker may be weighted or ranked. A weighting or ranking may be indicative of a discriminatory power of a biomarker to identify a sample as benign or malignant for a cancer.
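As an illustrative sketch, a read count threshold and an FDR threshold may be applied together to select and rank candidate biomarkers. The record layout (dictionaries with `name`, `read_count`, and `fdr` keys), the function name, and the default thresholds of 10 reads and an FDR of 0.05 are assumptions chosen from among the values listed above.

```python
def select_biomarkers(candidates, min_read_count=10, max_fdr=0.05):
    """Keep candidates above the read count threshold and below the FDR threshold.

    candidates: list of dicts with hypothetical keys 'name', 'read_count', 'fdr'.
    Returns the retained candidates ordered by FDR, most significant first.
    """
    kept = [
        c for c in candidates
        if c["read_count"] > min_read_count and c["fdr"] < max_fdr
    ]
    return sorted(kept, key=lambda c: c["fdr"])
```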
Kits
[00176] A kit may include one or more materials for performing the methods as described herein. A kit may include reagents for the assaying. A kit may include reagents to identify epigenetic modifications in a sample according to any method as described herein. A kit may include reagents for sequencing. A kit may include TET enzymes or fragments thereof. A kit may include a DNA methyltransferase. A kit may include a glucosyltransferase. A kit may include an excipient, such as glycerol, water, saline, dextrose, ethanol, or any combination thereof. A kit may include probes to one or more biomarkers as described herein. A kit may include a pre-programmed trained algorithm. A kit may include controls or derivative thereof, a database comprising controls or derivative thereof, or access to an online database comprising controls or derivative thereof. A kit may include reagents for obtaining a sample, storing the sample, assaying the sample, or any combination thereof. The kit may further comprise software or a license to obtain and use software for analysis of the data provided using the methods described herein.
Definitions
[00177] The terminology used herein is for the purpose of describing particular cases only and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, to the extent that the terms "including", "includes", "having", "has", "with", or variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".
[00178] The term "about" or "approximately" can mean within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. For example, "about" can mean plus or minus 10%, per the practice in the art. Alternatively, "about" can mean a range of plus or minus 20%, plus or minus 10%, plus or minus 5%, or plus or minus 1% of a given value. Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude, within 5-fold, or within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated the term "about" meaning within an acceptable error range for the particular value should be assumed. Also, where ranges and/or subranges of values are provided, the ranges and/or subranges can include the endpoints of the ranges and/or subranges.
[00179] The term "substantially" as used herein can refer to a value approaching 100% of a given value. In some cases, the term can refer to an amount that can be at least about 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.9%, or 99.99% of a total amount. In some cases, the term can refer to an amount that can be about 100% of a total amount.
[00180] The term "homology" can refer to a % identity of a sequence to a reference sequence. As a practical matter, whether any particular sequence can be at least 50%, 60%, 70%, 80%, 85%, 90%, 92%, 95%, 96%, 97%, 98% or 99% identical to any sequence described herein (which may correspond with a particular nucleic acid sequence described herein), such particular polypeptide sequence can be determined conventionally using known computer programs such as the Bestfit program (Wisconsin Sequence Analysis Package, Version 8 for Unix, Genetics Computer Group, University Research Park, 575 Science Drive, Madison, Wis. 53711). When using Bestfit or any other sequence alignment program to determine whether a particular sequence is, for instance, 95% identical to a reference sequence, the parameters can be set such that the percentage of identity is calculated over the full length of the reference sequence and that gaps in homology of up to 5% of the total reference sequence are allowed.
[00181] For example, in a specific embodiment the identity between a reference sequence (query sequence, i.e., a sequence of the present invention) and a subject sequence, also referred to as a global sequence alignment, may be determined using the FASTDB computer program based on the algorithm of Brutlag et al. (Comp. App. Biosci. 6:237-245 (1990)). In some embodiments, parameters for a particular embodiment in which identity is narrowly construed, used in a FASTDB amino acid alignment, can include: Scoring Scheme=PAM (Percent Accepted Mutations) 0, k-tuple=2, Mismatch Penalty=1, Joining Penalty=20, Randomization Group Length=0, Cutoff Score=1, Window Size=sequence length, Gap Penalty=5, Gap Size Penalty=0.05, Window Size=500 or the length of the subject sequence, whichever is shorter. According to this embodiment, if the subject sequence is shorter than the query sequence due to N- or C-terminal deletions, not because of internal deletions, a manual correction can be made to the results to take into consideration the fact that the FASTDB program does not account for N- and C-terminal truncations of the subject sequence when calculating global percent identity. For subject sequences truncated at the N- and C-termini, relative to the query sequence, the percent identity can be corrected by calculating the number of residues of the query sequence that are lateral to the N- and C-terminal of the subject sequence, which are not matched/aligned with a corresponding subject residue, as a percent of the total bases of the query sequence. A determination of whether a residue is matched/aligned can be determined by results of the FASTDB sequence alignment. This percentage can be then subtracted from the percent identity, calculated by the FASTDB program using the specified parameters, to arrive at a final percent identity score. This final percent identity score can be used for the purposes of this embodiment. In some embodiments, only residues to the N- and C-termini of the subject sequence, which are not matched/aligned with the query sequence, are considered for the purposes of manually adjusting the percent identity score. That is, only query residue positions outside the farthest N- and C-terminal residues of the subject sequence are considered for this manual correction. For example, a 90 residue subject sequence can be aligned with a 100 residue query sequence to determine percent identity. The deletion occurs at the N-terminus of the subject sequence and therefore, the FASTDB alignment does not show a matching/alignment of the first 10 residues at the N-terminus. The 10 unpaired residues represent 10% of the sequence (number of residues at the N- and C-termini not matched/total number of residues in the query sequence) so 10% is subtracted from the percent identity score calculated by the FASTDB program. If the remaining 90 residues were perfectly matched the final percent identity would be 90%. In another example, a 90 residue subject sequence is compared with a 100 residue query sequence. This time the deletions are internal deletions so there are no residues at the N- or C-termini of the subject sequence which are not matched/aligned with the query. In this case the percent identity calculated by FASTDB is not manually corrected. Once again, only residue positions outside the N- and C-terminal ends of the subject sequence, as displayed in the FASTDB alignment, which are not matched/aligned with the query sequence are manually corrected for.
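The manual correction described above may be illustrated with a short sketch. It does not call or reproduce the FASTDB program; the percent identity reported by the alignment program is simply passed in as a number. The function names and the representation of the aligned subject sequence as a string padded with "-" at unaligned query positions are assumptions of the example.

```python
def unmatched_terminal_residues(aligned_subject):
    """Count query positions lying outside the N- and C-terminal ends of the subject.

    aligned_subject: subject sequence laid out over the full query length,
    with '-' marking query positions that have no aligned subject residue.
    Internal gaps are deliberately ignored.
    """
    stripped = aligned_subject.strip("-")
    if not stripped:
        # No subject residue aligned at all; every position is unmatched.
        return len(aligned_subject)
    leading = len(aligned_subject) - len(aligned_subject.lstrip("-"))
    trailing = len(aligned_subject) - len(aligned_subject.rstrip("-"))
    return leading + trailing


def corrected_percent_identity(reported_identity, unmatched_terminal, query_length):
    """Subtract the terminal-truncation percentage from the reported identity."""
    penalty = 100.0 * unmatched_terminal / query_length
    return reported_identity - penalty
```

With the first example above (a 90 residue subject missing the first 10 residues of a 100 residue query, otherwise perfectly matched), `corrected_percent_identity(100.0, 10, 100)` gives 90.0. With internal deletions only, `unmatched_terminal_residues` returns 0 and the reported value is left unchanged.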
[00182] The term "fragment," as used herein, may be a portion of a sequence, a subset that may be shorter than a full length sequence. A fragment may be a portion of a gene. A fragment may be a portion of a peptide or protein. A fragment may be a portion of an amino acid sequence. A fragment may be a portion of an oligonucleotide sequence. A fragment may be less than about: 20, 30, 40, 50 amino acids in length. A fragment may be less than about: 20, 30, 40, 50 oligonucleotides in length.
[00183] The term "epigenetic modification" as used herein, may be any covalent modification of a nucleic acid base. In some cases, a covalent modification may comprise (i) adding a methyl group, a hydroxymethyl group, a carbon atom, an oxygen atom, or any combination thereof to one or more bases of a nucleic acid sequence, (ii) changing an oxidation state of a molecule associated with a nucleic acid sequence, such as an oxygen atom, or (iii) a combination thereof. A covalent modification may occur at any base, such as a cytosine, a thymine, a uracil, an adenine, a guanine, or any combination thereof. In some cases, an epigenetic modification may comprise an oxidation or a reduction. A nucleic acid sequence may comprise one or more epigenetically modified bases. An epigenetically modified base may comprise any base, such as a cytosine, a uracil, a thymine, adenine, or a guanine. An epigenetically modified base may comprise a methylated base, a hydroxymethylated base, a formylated base, or a carboxylic acid containing base or a salt thereof. An epigenetically modified base may comprise a 5-methylated base, such as a 5-methylated cytosine (5-mC). An epigenetically modified base may comprise a 5-hydroxymethylated base, such as a 5-hydroxymethylated cytosine (5-hmC). An epigenetically modified base may comprise a 5-formylated base, such as a 5-formylated cytosine (5-fC). An epigenetically modified base may comprise a 5-carboxylated base or a salt thereof, such as a 5-carboxylated cytosine (5-caC). In some cases, an epigenetically modified base may comprise a methyltransferase-directed transfer of an activated group (mTAG).
[00184] An epigenetically modified base may comprise one or more bases of a purine (such as Structure 1) or one or more bases of a pyrimidine (such as Structure 2). An epigenetic modification may occur at one or more of any positions. For example, an epigenetic modification may occur at one or more positions of a purine, including positions 1, 2, 3, 4, 5, 6, 7, 8, 9, as shown in Structure 1. In some cases, an epigenetic modification may occur at one or more positions of a pyrimidine, including positions 1, 2, 3, 4, 5, 6, as shown in Structure 2.
Structure 1 (purine, positions 1 to 9; image not reproduced)
Structure 2 (pyrimidine, positions 1 to 6; image not reproduced)
[00187] A nucleic acid sequence may comprise an epigenetically modified base. A nucleic acid sequence may comprise a plurality of epigenetically modified bases. A nucleic acid sequence may comprise an epigenetically modified base positioned within a CG site, a CpG island, or a combination thereof. A nucleic acid sequence may comprise different epigenetically modified bases, such as a methylated base, a hydroxymethylated base, a formylated base, a carboxylic acid containing base or a salt thereof, a plurality of any of these, or any combination thereof.
[00188] The term "nucleic acid sequence" as used herein may comprise DNA or RNA. In some cases, a nucleic acid sequence may comprise a plurality of nucleotides. In some cases, a nucleic acid sequence may comprise an artificial nucleic acid analogue. In some cases, a nucleic acid sequence comprising DNA, may comprise cell-free DNA, cDNA, fetal DNA, or maternal DNA. In some cases, a nucleic acid sequence may comprise miRNA, shRNA, or siRNA.
[00189] The term "substantially complementary strand" as used herein, may comprise from about 70% - 100% bases that base pair with bases of a nucleic acid sequence. This percentage of base pairing may be measured by UV absorption of the nucleic acid sequence. In some cases, a substantially complementary strand may be hybridized to at least a portion of a nucleic acid sequence under stringent hybridization conditions.
[00190] The term "substantially free of an epigenetically modified base" as used herein, may comprise a complementary strand having no epigenetically modified base, or a complementary strand having from about 0.000001% to about 5% of a plurality of epigenetically modified bases of a nucleic acid sequence.
[00191] The term "click-chemistry" as used herein may comprise a reaction having at least one of the following: (a) high yielding, (b) wide in scope, (c) create only byproducts that may be removed in the absence of chromatography, (d) stereospecific, (e) simple to perform, (f) conducted in easily removable or benign solvents. In some cases, click-chemistry comprises tagging, such as tagging a nucleic acid sequence or a complementary strand. In some cases, click-chemistry may associate a nucleic acid sequence with a label. Click-chemistry may comprise a reaction having a [3+2] cycloaddition; a thiol-ene reaction; a Diels-Alder reaction, an inverse electron demand Diels-Alder reaction; a [4+1] cycloaddition; a nucleophilic substitution; a carbonyl-chemistry-like formation of urea; an addition to a carbon-carbon double bond; or any combination thereof. In some cases, a [3+2] cycloaddition may comprise a Huisgen 1,3-dipolar cycloaddition. In some cases, a [4+1] cycloaddition may comprise a cycloaddition between an isonitrile and a tetrazine. Click-chemistry may comprise a copper(I)-catalyzed azide-alkyne cycloaddition (CuAAC); a strain-promoted azide-alkyne cycloaddition (SPAAC); a strain-promoted alkyne-nitrone cycloaddition (SPANC); or any combination thereof.
[00192] The term "sequencing" as used herein, may comprise bisulfite-free sequencing, bisulfite sequencing, TET-assisted bisulfite (TAB) sequencing, ACE-sequencing, high-throughput sequencing, Maxam-Gilbert sequencing, massively parallel signature sequencing, Polony sequencing, 454 pyrosequencing, Sanger sequencing, Illumina sequencing, SOLiD sequencing, Ion Torrent semiconductor sequencing, DNA nanoball sequencing, Heliscope single molecule sequencing, single molecule real time (SMRT) sequencing, nanopore DNA sequencing, shot gun sequencing, RNA sequencing, Enigma sequencing, or any combination thereof.
[00193] In some cases, a method may comprise sequencing. The sequencing may include bisulfite sequencing or bisulfite-free sequencing. In some cases, a method may comprise oxidizing one or more bases of a nucleic acid sequence or complementary strand or combination thereof. In some cases, a method may comprise selectively enriching for a nucleic acid sequence that contains at least one epigenetic modification.
[00194] The term "tissue" as used herein, may be any tissue sample. A tissue may be a tissue suspected or confirmed of having a disease or condition. A tissue may be a sample that may be substantially healthy, substantially benign, or otherwise substantially free of a disease or a condition. A tissue may be a tissue removed from a subject, such as a tissue biopsy, a tissue resection, an aspirate (such as a fine needle aspirate), a tissue washing, a cytology specimen, a bodily fluid, or any combination thereof. A tissue may comprise cancerous cells, tumor cells, non-cancerous cells, or a combination thereof. A tissue may comprise colon tissue, colorectal tissue, rectal tissue, a polyp, a blood sample (such as a cell-free DNA sample), or any combination thereof. A tissue may be a sample that may be genetically modified.
[00195] As used herein, the term "cell-free" refers to the condition of the nucleic acid sequence as it appeared in the body before the sample is obtained from the body. For example, circulating cell-free nucleic acid sequences in a sample may have originated as cell-free nucleic acid sequences circulating in the bloodstream of the human body. In contrast, nucleic acid sequences that are extracted from a solid tissue, such as a biopsy, are generally not considered to be "cell-free." In some cases, cell-free DNA may comprise fetal DNA, maternal DNA, or a combination thereof. In some cases, cell-free DNA may comprise DNA fragments released into a blood plasma. In some cases, the cell-free DNA may comprise circulating tumor DNA. In some cases, cell-free DNA may comprise circulating DNA indicative of a tissue origin, a disease or a condition. A cell-free nucleic acid sequence may be isolated from a blood sample. A cell-free nucleic acid sequence may be isolated from a plasma sample. A cell-free nucleic acid sequence may comprise a complementary DNA (cDNA). In some cases, one or more cDNAs may form a cDNA library.
[00196] The term "subject," as used herein, may be any animal or living organism. Animals can be mammals, such as humans, non-human primates, rodents such as mice and rats, dogs, cats, pigs, sheep, rabbits, and others. Animals can be fish, reptiles, or others. Animals can be neonatal, infant, adolescent or adult animals. Humans can be more than about: 1, 2, 5, 10, 20, 30, 40, 50, 60, 65, 70, 75, or about 80 years of age. The subject may have or be suspected of having a condition or a disease, such as cancer. The subject may be a patient, such as a patient being treated for a condition or a disease, such as a cancer patient. The subject may be predisposed to a risk of developing a condition or a disease such as cancer. The subject may be in remission from a condition or a disease, such as a cancer patient. The subject may be healthy.
[00197] A nucleic acid sequence may be from a sample. A sample may be isolated from a subject. A subject may be a human subject. A sample may comprise a buccal sample, a saliva sample, a blood sample, a plasma sample, a reproductive sample (such as an egg or a sperm), a mucus sample, a cerebral spinal fluid sample, a tissue sample, a tissue biopsy, a surgical resection, a fine needle aspirate sample, or any combination thereof. In some cases, a sample may comprise a blood sample. In some cases, a sample may comprise a buccal sample.
[00198] In some cases, a subject may have previously received a diagnosis of a disease or condition prior to performing a method as described herein. A subject may have previously received a positive diagnosis of a disease, such as a cancer. A subject may have previously received an indeterminate or inconclusive diagnosis of a disease, such as a cancer. A subject may be a subject in need thereof, such as a need for a definitive diagnosis or a need for a selection of a therapeutic treatment regime.
[00199] A result of the method or a result output from the trained algorithm may include a recommendation for a treatment. A treatment may include further monitoring of the subject, such as obtaining a second sample from the subject and repeating a method as described herein. A treatment may include performing surgery or removing of a tissue from the subject, performing an imaging scan on the subject, performing a diagnostic test on a sample from the subject, performing radiation, chemotherapy, or other cancer treatment procedure.
[00200] In some cases, a subject may not have previously received a diagnosis of a disease or condition prior to performing a method as described herein. In some cases, a subject may be suspected of having a disease or condition, such as having one or more symptoms of a disease or condition. In some cases, a subject may be at risk of developing a disease or condition, such as a subject having a biomarker or genetic indication that may be indicative of a risk of developing a disease or condition. In some cases, a disease or a condition may comprise a cancer.
[00201] A nucleic acid sequence may comprise a cytosine guanine (CG) site, a cytosine phosphate guanine (CpG) island, a portion of any of these, or a combination thereof. A CpG island may comprise one or more CG sites. A nucleic acid sequence may comprise one or more CG sites or portions thereof. A nucleic acid sequence may comprise dense CG sites, dense CpG islands or a combination thereof. A nucleic acid sequence may comprise a plurality of CG sites or portions thereof. A nucleic acid sequence may comprise one or more CpG islands or portions thereof. A nucleic acid sequence may comprise a plurality of CpG islands or portions thereof. One or more bases of a nucleic acid sequence comprising a CG site, a CpG island, a portion thereof, or any of these may comprise an epigenetically modified base, such as a methylated base or a hydroxymethylated base. One or more cytosines of a nucleic acid sequence comprising a CG site, a CpG island, a portion thereof, or any of these may comprise an epigenetically modified cytosine, such as a methylated cytosine or a hydroxymethylated cytosine. A CpG island (or a CG island) may be a region with a high frequency of CG sites. A CpG island may be a region of a nucleic acid sequence with at least about 200 basepairs (bp) and a GC percentage that may be greater than about 50% and with an observed-to-expected CpG ratio that may be greater than about 60 %. An "observed-to-expected CpG ratio" may be derived where the observed may be calculated as:
[00202] (number of CpGs)
[00203] and the expected may be calculated as:
[00204] (number of C * number of G) / length of sequence
[00205] or the expected may be calculated as:
[00206] ((number of C + number of G) / 2)^2 / length of sequence
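By way of non-limiting illustration, the observed-to-expected CpG ratio described above may be computed as in the following minimal sketch. The function names, the `expected_form` parameter, and the default thresholds (taken from the approximate values recited above: at least about 200 bp, GC content greater than about 50%, ratio greater than about 60%) are assumptions made for this example and are not part of any particular software package.

```python
def cpg_observed_to_expected(seq, expected_form="product"):
    """Return length, GC fraction and observed-to-expected CpG ratio of a DNA sequence.

    expected_form="product"      -> (number of C * number of G) / length
    expected_form="mean_squared" -> ((number of C + number of G) / 2)^2 / length
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    c = s.count("C")
    g = s.count("G")
    observed = s.count("CG")  # number of CG dinucleotides on the given strand
    if expected_form == "product":
        expected = (c * g) / n
    elif expected_form == "mean_squared":
        expected = ((c + g) / 2) ** 2 / n
    else:
        raise ValueError("expected_form must be 'product' or 'mean_squared'")
    ratio = observed / expected if expected > 0 else 0.0
    return {"length": n, "gc_fraction": (c + g) / n, "obs_exp": ratio}


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the approximate length, GC percentage and ratio criteria (illustrative only)."""
    stats = cpg_observed_to_expected(seq)
    return (
        stats["length"] >= min_len
        and stats["gc_fraction"] > min_gc
        and stats["obs_exp"] > min_ratio
    )


if __name__ == "__main__":
    # Synthetic sequences, not taken from any sample.
    print(cpg_observed_to_expected("CG" * 100))
    print(looks_like_cpg_island("AT" * 100))
```

Both forms of the expected value give the same result when the numbers of C and G are equal; they may differ for sequences with a skewed C-to-G composition.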
Samples
[00207] The methods of the present invention provide for storing the sample for a time such as seconds, minutes, hours, days, weeks, months, years or longer after the sample is obtained and before the sample is analyzed by one or more methods of the invention. In some cases, the sample obtained from a subject can be subdivided prior to the step of storage or further analysis such that different portions of the sample may be subject to different downstream methods or processes including but not limited to storage, cytological analysis, adequacy tests, nucleic acid extraction, molecular profiling or a combination thereof.
[00208] In some cases, a portion of the sample may be stored while another portion of said sample is further manipulated. Such manipulations may include but are not limited to molecular profiling (epigenetics, gene expression levels, sequence variant, copy number); sequencing, labeling, cytological or histological staining; flow cytometry analysis; nucleic acid (RNA or DNA) extraction, detection, or quantification; gene expression product (RNA or Protein) extraction, detection, or quantification; fixation; and examination. The sample may be fixed prior to or during storage by any method known to the art such as using glutaraldehyde, formaldehyde, or methanol. In other cases, the sample is obtained and stored and subdivided after the step of storage for further analysis such that different portions of the sample are subject to different downstream methods or processes including but not limited to storage, cytological analysis, adequacy tests, nucleic acid extraction, molecular profiling or a combination thereof. In some cases, samples are obtained and analyzed by for example cytological analysis, and the resulting sample material is further analyzed by one or more molecular profiling methods of the present invention. In such cases, the samples may be stored between the steps of cytological analysis and the steps of molecular profiling. Samples may be stored upon acquisition to facilitate transport, or to wait for the results of other analyses. In another embodiment, samples may be stored while awaiting instructions from a physician or other medical professional.
Classifiers
[00209] The results obtained from the assaying can be analyzed using feature selection techniques, including filter techniques, which assess the relevance of features by looking at the intrinsic properties of the data; wrapper methods, which embed the model hypothesis within a feature subset search; and embedded techniques, in which the search for an optimal set of features is built into a classifier algorithm.
[00210] Filter techniques useful in the methods of the present invention include (1) parametric methods such as the use of two sample t-tests, ANOVA analyses, Bayesian frameworks, and Gamma distribution models; (2) model-free methods such as the use of Wilcoxon rank sum tests, between-within class sum of squares tests, rank products methods, random permutation methods, or TNoM, which involves setting a threshold point for fold-change differences in expression between two datasets and then detecting a threshold point in each gene that minimizes the number of misclassifications; and (3) multivariate methods such as bivariate methods, correlation based feature selection methods (CFS), minimum redundancy maximum relevance methods (MRMR), Markov blanket filter methods, and uncorrected shrunken centroid methods. Wrapper methods useful in the methods of the present invention include sequential search methods, genetic algorithms, and estimation of distribution algorithms. Embedded methods useful in the methods of the present invention include random forest algorithms, weight vector of support vector machine algorithms, and weights of logistic regression algorithms.
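By way of non-limiting illustration, a parametric filter technique of the kind listed above may be sketched as follows. The sketch ranks features by the absolute value of a two sample t-statistic (the unequal-variance, or Welch, form is assumed here) and keeps the top-ranked features. The function names, feature names and numeric values are hypothetical and serve only to show the mechanics.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Two sample t-statistic allowing unequal variances (Welch form)."""
    diff = mean(a) - mean(b)
    denom = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if denom == 0:
        # Both groups are constant: no evidence if equal, unbounded otherwise.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / denom


def rank_features(case, control, top_k):
    """Return the top_k feature names ordered by decreasing |t|.

    case and control map a feature name to a list of per-sample values.
    """
    scores = {name: abs(welch_t(case[name], control[name])) for name in case}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    # Hypothetical per-sample signal values for three placeholder features.
    case = {"feat_1": [5.0, 6.0, 7.0], "feat_2": [1.0, 2.0, 3.0], "feat_3": [2.0, 2.5, 3.0]}
    control = {"feat_1": [1.0, 2.0, 3.0], "feat_2": [1.0, 2.0, 3.0], "feat_3": [2.0, 3.0, 2.5]}
    print(rank_features(case, control, top_k=1))
```

A filter of this kind scores each feature independently of any classifier, which is what distinguishes it from the wrapper and embedded methods described above.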
[00211] Selected features may then be classified using a classifier algorithm. Illustrative algorithms include but are not limited to methods that reduce the number of variables such as principal component analysis algorithms, partial least squares methods, and independent component analysis algorithms. Illustrative algorithms further include but are not limited to methods that handle large numbers of variables directly such as statistical methods and methods based on machine learning techniques. Statistical methods include penalized logistic regression, prediction analysis of microarrays (PAM), methods based on shrunken centroids, support vector machine analysis, and regularized linear discriminant analysis. Machine learning techniques include bagging procedures, boosting procedures, random forest algorithms, and combinations thereof.
[00212] Classifiers may be developed using top varying genes, enhancers, or a combination thereof to demonstrate the predictive power of 5-hmC in diagnosing cancer, early detection of cancer, recurrence of cancer, metastasis of cancer, presence of a malignant tissue, or any combination thereof. A trained model may successfully predict a disease status, a risk of occurrence or recurrence of a disease, or any combination thereof in a test set with greater than about 90% sensitivity and greater than about 80% specificity.
[00213] In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 80% specificity. In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 85% specificity. In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 90% specificity. In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 95% specificity.
[00214] In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 80% specificity. In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 85% specificity. In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 90% specificity. In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 95% specificity.
[00215] In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 80% specificity. In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 85% specificity. In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 90% specificity. In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 95% specificity.
[00216] In some cases, the trained algorithm provides a result having greater than about 80% sensitivity. In some cases, the trained algorithm provides a result having greater than about 85% sensitivity. In some cases, the trained algorithm provides a result having greater than about 90% sensitivity. In some cases, the trained algorithm provides a result having greater than about 95% sensitivity. In some cases, the trained algorithm provides a result having greater than about 96% sensitivity. In some cases, the trained algorithm provides a result having greater than about 97% sensitivity. In some cases, the trained algorithm provides a result having greater than about 98% sensitivity.
[00217] In some cases, the trained algorithm provides a result having greater than about 70% specificity. In some cases, the trained algorithm provides a result having greater than about 75% specificity. In some cases, the trained algorithm provides a result having greater than about 80% specificity. In some cases, the trained algorithm provides a result having greater than about 85% specificity. In some cases, the trained algorithm provides a result having greater than about 90% specificity. In some cases, the trained algorithm provides a result having greater than about 95% specificity. In some cases, the trained algorithm provides a result having greater than about 96% specificity.
[00218] In some cases, the trained algorithm provides a result having greater than about 80% clinical diagnostic accuracy. In some cases, the trained algorithm provides a result having greater than about 85% clinical diagnostic accuracy. In some cases, the trained algorithm provides a result having greater than about 90% clinical diagnostic accuracy. In some cases, the trained algorithm provides a result having greater than about 95% clinical diagnostic accuracy.
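By way of non-limiting illustration, the sensitivity and specificity levels recited above may be evaluated from confusion-matrix counts using the standard definitions TP/(TP+FN) and TN/(TN+FP), together with the positive and negative predictive values. The following is a minimal sketch; the function names, the default thresholds and the counts in the usage example are hypothetical and do not represent results of any study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard performance measures from confusion-matrix counts."""

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # TP / (TP + FN)
        "specificity": ratio(tn, tn + fp),  # TN / (TN + FP)
        "ppv": ratio(tp, tp + fp),          # TP / (TP + FP)
        "npv": ratio(tn, tn + fn),          # TN / (TN + FN)
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


def meets_thresholds(metrics, min_sensitivity=0.90, min_specificity=0.80):
    """Check a 'greater than about X% sensitivity and Y% specificity' criterion."""
    return (
        metrics["sensitivity"] > min_sensitivity
        and metrics["specificity"] > min_specificity
    )


if __name__ == "__main__":
    # Hypothetical test set: 50 malignant and 50 benign samples.
    m = diagnostic_metrics(tp=46, fp=5, tn=45, fn=4)
    print(m)
    print(meets_thresholds(m))
```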
[00219] Sensitivity typically refers to TP/(TP+FN), where TP is true positive and FN is false negative; that is, the number of Continued Indeterminate results divided by the total number of malignant results based on adjudicated histopathology diagnosis. Specificity typically refers to TN/(TN+FP), where TN is true negative and FP is false positive; that is, the number of benign results divided by the total number of benign results based on adjudicated histopathology diagnosis. Positive Predictive Value (PPV) typically refers to TP/(TP+FP) and Negative Predictive Value (NPV) typically refers to TN/(TN+FN). The clinical accuracy as used herein includes specificity, sensitivity, positive predictive value, negative predictive value, or any combination thereof.
Biomarkers
[00220] Methods as described herein may assay for at least one biomarker or an active fragment thereof. In some cases, about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 or more biomarkers may be assayed. In some cases, about 2 biomarkers may be assayed. In some cases, about 5 biomarkers may be assayed. In some cases, about 10 biomarkers may be assayed. In some cases, about 15 biomarkers may be assayed. In some cases, at least 20 biomarkers may be assayed.
[00221] Methods as described herein may utilize at least one biomarker or an active fragment thereof to classify a sample. In some cases, about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 biomarkers may be utilized to classify a sample. In some cases, about 2 biomarkers may be utilized to classify a sample. In some cases, about 5 biomarkers may be utilized to classify a sample. In some cases, about 10 biomarkers may be utilized to classify a sample. In some cases, about 15 biomarkers may be utilized to classify a sample. In some cases, about 20 biomarkers may be utilized to classify a sample.
[00222] Methods as described herein may select at least one biomarker or an active fragment thereof. In some cases, about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 biomarkers may be selected. In some cases, about 2 biomarkers may be selected. In some cases, at least 5 biomarkers may be selected. In some cases, about 10 biomarkers may be selected. In some cases, about 15 biomarkers may be selected. In some cases, about 20 biomarkers may be selected.
[00223] Methods as described herein may compare a result to at least one biomarker or an active fragment thereof of a control or derivative thereof, such as a reference sample. In some cases, a result may be compared to about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 biomarkers. In some cases, a result may be compared to about 2 biomarkers. In some cases, a result may be compared to about 5 biomarkers. In some cases, a result may be compared to about 10 biomarkers. In some cases, a result may be compared to about 15 biomarkers. In some cases, a result may be compared to about 20 biomarkers.
[00224] A biomarker or active fragment thereof may be a gene, a portion of a gene, a genehancer, a transcription factor, or any combination thereof. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least: 70%, 75%, 80%, 85%, 90%, 95%, 99% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 70% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 75% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 80% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 85% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 90% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 95% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 96% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 97% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 98% sequence homology to the biomarker. A biomarker may be a genehancer. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 99% sequence homology to the biomarker. A biomarker may be a transcription factor. A biomarker may be a site that is proximal to a gene. A biomarker may be a site associated with a gene but more than 10 basepairs away from the gene.
[00225] A biomarker may not have been previously associated with a cancer. An expression of a biomarker may be associated with cancer but a change in an epigenetic modification in the biomarker may not have been previously associated with a cancer. A presence or absence of an epigenetic modification may be indicative of a cancer.
[00226] A presence of an epigenetic modification may comprise a level of methylation or a level of hydroxymethylation. A presence of an epigenetic modification may comprise a number of methylated sites, hydroxymethylated sites, hypo-hydroxymethylated sites, hyper-hydroxymethylated sites, or any combination thereof.
[00227] One or more biomarkers or active fragments thereof may be selected for use in the methods described herein. About: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100 biomarkers or more may be selected. From 1 to 5 biomarkers may be selected. From 1 to 10 biomarkers may be selected. From 1 to 20 biomarkers may be selected. From 1 to 40 biomarkers may be selected. From 1 to 50 biomarkers may be selected. From 1 to 60 biomarkers may be selected. From 1 to 100 biomarkers may be selected. From 2 to 5 biomarkers may be selected. From 2 to 10 biomarkers may be selected. From 2 to 20 biomarkers may be selected. From 2 to 50 biomarkers may be selected. From 2 to 100 biomarkers may be selected. From 5 to 10 biomarkers may be selected. From 5 to 20 biomarkers may be selected. From 5 to 30 biomarkers may be selected. From 5 to 40 biomarkers may be selected.
[00228] One or more biomarkers may be assayed according to the methods described herein. At least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100 biomarkers or more may be assayed. From 1 to 5 biomarkers may be assayed. From 1 to 10 biomarkers may be assayed. From 1 to 20 biomarkers may be assayed. From 1 to 40 biomarkers may be assayed. From 1 to 50 biomarkers may be assayed. From 1 to 60 biomarkers may be assayed. From 1 to 100 biomarkers may be assayed. From 2 to 5 biomarkers may be assayed. From 2 to 10 biomarkers may be assayed. From 2 to 20 biomarkers may be assayed. From 2 to 50 biomarkers may be assayed. From 2 to 100 biomarkers may be assayed. From 5 to 10 biomarkers may be assayed. From 5 to 20 biomarkers may be assayed. From 5 to 30 biomarkers may be assayed. From 5 to 40 biomarkers may be assayed.
[00229] A result from one or more biomarkers may be compared to a result from a control sample. A result from at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100 biomarkers or more may be compared. From 1 to 5 biomarkers may be compared. From 1 to 10 biomarkers may be compared. From 1 to 20 biomarkers may be compared. From 1 to 40 biomarkers may be compared. From 1 to 50 biomarkers may be compared. From 1 to 60 biomarkers may be compared. From 1 to 100 biomarkers may be compared. From 2 to 5 biomarkers may be compared. From 2 to 10 biomarkers may be compared. From 2 to 20 biomarkers may be compared. From 2 to 50 biomarkers may be compared. From 2 to 100 biomarkers may be compared. From 5 to 10 biomarkers may be compared. From 5 to 20 biomarkers may be compared. From 5 to 30 biomarkers may be compared. From 5 to 40 biomarkers may be compared.
[00230] In some cases, one or more biomarkers not previously associated with a cancer may be selected to use in the methods as described herein to identify a sample as benign or malignant for the cancer. In some cases, one or more biomarkers having an epigenetic marker or epigenetic change not previously associated with a cancer may be selected for use in the methods as described herein to identify a sample as benign or malignant for the cancer.
[00231] In some cases, a panel of biomarkers may comprise one or more biomarkers from FIG. 8A, FIG. 19A-C, FIG. 31A, FIG. 34A-B, FIG. 35, FIG. 51, FIG. 74, FIG. 87, FIG. 105A-B, FIG. 106, FIG. 119, FIG. 120, FIG. 121, FIG. 122, FIG. 124, FIG. 131, FIG. 132, FIG. 133, FIG. 134, FIG. 135, FIG. 136, FIG. 137 or any combination thereof. One or more biomarkers may be selected based on a ranking or a weighting value assigned to the biomarker. One or more biomarkers may comprise a gene or portion thereof, a genehancer, or a combination thereof. One or more biomarkers may be selected based on a cancer type or stage of disease. One or more biomarkers may include 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, or more biomarkers selected from any one of FIG. 8A, FIG. 19A-C, FIG. 31A, FIG. 34A-B, FIG. 35, FIG. 51, FIG. 74, FIG. 87, FIG. 105A-B, FIG. 106, FIG. 119, FIG. 120, FIG. 121, FIG. 122, FIG. 124, FIG. 131, FIG. 132, FIG. 133, FIG. 134, FIG. 135, FIG. 136, FIG. 137 or any combination thereof.
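By way of non-limiting illustration, selection of biomarkers based on a ranking or a weighting value may be sketched as follows. The function name, the placeholder biomarker names and the weights are hypothetical; they do not correspond to values in any figure. Ties are broken alphabetically so that the selection is reproducible, which is an assumption of this sketch.

```python
def select_panel(weights, n):
    """Return the n biomarker names with the largest weighting values.

    weights maps a biomarker name to its weighting value; ties are
    broken alphabetically by name.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]


if __name__ == "__main__":
    # Placeholder names and weights for illustration only.
    example_weights = {
        "marker_a": 0.9,
        "marker_b": 0.4,
        "marker_c": 0.7,
        "marker_d": 0.7,
    }
    print(select_panel(example_weights, 3))
```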
[00232] In some cases, a biomarker may distinguish a premalignant condition from a benign condition. In some cases, a biomarker may identify a sample as having a premalignant condition. A panel of biomarkers may comprise one or more biomarkers from FIG. 19B-C, FIG. 34B, FIG. 74, FIG. 136, FIG. 137, FIG. 139, FIG. 140, FIG. 141, FIG. 142, or any combination thereof. A panel of biomarkers may comprise one or more biomarkers from FIG. 19B. A panel of biomarkers may comprise one or more biomarkers from FIG. 19C. A panel of biomarkers may comprise one or more biomarkers from FIG. 34B. A panel of biomarkers may comprise one or more biomarkers from FIG. 74. A panel of biomarkers may comprise one or more biomarkers from FIG. 136. A panel of biomarkers may comprise one or more biomarkers from FIG. 137. A panel of biomarkers may comprise one or more biomarkers from FIG. 139, FIG. 141, or a combination thereof. A biomarker panel may comprise FIGN. A biomarker panel may comprise MRPS31P2. A biomarker panel may comprise RP11-797H7.1. A biomarker panel may comprise GCOM2. A biomarker panel may comprise RP11-95F22.1. A biomarker panel may comprise USP32P2. A biomarker panel may comprise RP1-155D22.1.
[00233] In some cases, a biomarker may be a top ranked biomarker, such as a top ranked gene. A panel of biomarkers may comprise one or more biomarkers from FIG. 8A. A panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, SIX1, FRMD1, OTX1, CYP26C1, TMEM200B, NOL2, CXCL12, RP11-522B15.3, TBX2, TJP1, IHH, MAGI1-AS1, ZIC1, CNPY2, LRIG3, PINK1-AS, or any combination thereof. A panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, SIX1, FRMD1, OTX1, CYP26C1, TMEM200B, NOL2, CXCL12, RP11-522B15.3, TBX2, TJP1, IHH, or any combination thereof. A panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, SIX1, FRMD1, OTX1, CYP26C1, TMEM200B, NOL2, or any combination thereof. A panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, SIX1, or any combination thereof. A panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, or any combination thereof. A panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, or any combination thereof. A panel of biomarkers may comprise one or more of C2CD4C, ZIC4, or a combination thereof. A panel of biomarkers may comprise C2CD4C.
[00234] In some cases, a biomarker may be a top ranked biomarker. A biomarker may be top ranked for distinguishing a malignant sample from a normal sample. A biomarker may be top ranked for distinguishing an early stage cancer from a normal sample. A biomarker may be top ranked for an early screening molecular classifier or for a subject not suspected of having a cancer. A panel of biomarkers may comprise one or more biomarkers from FIG. 19A-C. A panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, TXLNA, RP11-95F22.1, CTC-273B12.10, RNA5SP129, RASSF10, or any combination thereof. A panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, or any combination thereof. A panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, or any combination thereof. A panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, or any combination thereof. A panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, or any combination thereof. A panel of biomarkers may comprise one or more of FIGN, MRPS31P2, or any combination thereof. A panel of biomarkers may comprise one or more of MRPS21P2, USP32P2, RP1-155D22.1, or any combination thereof.
[00235] In some cases, a biomarker may not previously be associated with a cancer. A biomarker may be a gene or genehancer. A panel of biomarkers may comprise one or more of RP11-522B15.3, CYP26C1, TMEM200B, NOL6, FRMD1, FIGN, C2CD4C, MAGI1-AS1, PINK1-AS, or any combination thereof. A panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, TXLNA, RP11-95F22.1, CTC-273B12.10, RNA5SP129, or any combination thereof. A panel of biomarkers may comprise one or more of MCRIP2P1, RNU6-1265P, TRAPPC3, TXLNA, AC073257.2, FIGN, IGKV1-33, KLF2P3, RP11-523H20.3, DHX30, RNA5SP129, SNORA6, FKBP4P1, GCOM2, MTND3P24, RNA5SP152, RNU1-36P, RP11-96A1.5, ADAMTS19-AS1, PCBD2, SIL1, DUTP5, RP1-155D22.1, RP11-279022.1, RP11-797H7.1, AC083843.4, PRR13P7, RNU4-50P, FAM210CP, RP11-481H12.1, RP1-65P5.3, RP11-22B23.2, RP11-121C6.4, RP11-128P10.1, MRPS31P2, RP11-95F22.1, AE000662.93, CTD-2302E22.5, MIR4509-2, AC009120.6, IRX3, MRPS21P7, RP11-45506.2, USP32P2, CTC-273B12.10, NAA20, FAM118A, or any combination thereof. A panel of biomarkers may comprise one or more of RP1-155D22.1, FIGN, RP11-95F22.1, AHRR, NAA20, RP11-797H7.1, RPS2P46, NDUFA8, MRPS31P2, AC009120.6, C2CD4C, RN7SL635P, PCBD2, SLC24A1, KARS, CH17-11806.3, BEND7, RN7SKP69, PNMA1, RP11-21C4.1, LINC01607, AC005253.4, CTC-301O7.4, RP11-137H2.4, IRX3, RELL2, RP11-26J3.1, AE000662.93, FAM118A, AC006028.11, GAPDHP65, or any combination thereof. A panel of biomarkers may comprise one or more biomarkers from FIG. 120. In some cases, the one or more biomarkers may be selected from FIG. 120 in the absence of GH15F067182.
Table 1
RP11-522B15.3, CYP26C1, TMEM200B, NOL6, FRMD1, FIGN, C2CD4C, MAGI1-AS1, PINK1-AS, FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, TXLNA, RP11-95F22.1, CTC-273B12.10, RNA5SP129, MCRIP2P1, RNU6-1265P, TRAPPC3, TXLNA, AC073257.2, FIGN, IGKV1-33, KLF2P3, RP11-523H20.3, DHX30, RNA5SP129, SNORA6, FKBP4P1, GCOM2, MTND3P24, RNA5SP152, RNU1-36P, RP11-96A1.5, ADAMTS19-AS1, PCBD2, SIL1, DUTP5, RP1-155D22.1, RP11-279022.1, RP11-797H7.1, AC083843.4, PRR13P7, RNU4-50P, FAM210CP, RP11-481H12.1, RP1-65P5.3, RP11-22B23.2, RP11-121C6.4, RP11-128P10.1, MRPS31P2, RP11-95F22.1, AE000662.93, CTD-2302E22.5, MIR4509-2, AC009120.6, IRX3, MRPS21P7, RP11-45506.2, USP32P2, CTC-273B12.10, NAA20, FAM118A, RP1-155D22.1, FIGN, RP11-95F22.1, AHRR, NAA20, RP11-797H7.1, RPS2P46, NDUFA8, MRPS31P2, AC009120.6, C2CD4C, RN7SL635P, PCBD2, SLC24A1, KARS, CH17-11806.3, BEND7, RN7SKP69, PNMA1, RP11-21C4.1, LINC01607, AC005253.4, CTC-301O7.4, RP11-137H2.4, IRX3, RELL2, RP11-26J3.1, AE000662.93, FAM118A, AC006028.11, GAPDHP65, one or more biomarkers from FIG. 120 or any combination thereof.
[00236] In some cases, an epigenetic modification in a biomarker may not previously be associated with a cancer. In some cases, a biomarker may be a gene or genehancer. A panel of biomarkers may comprise one or more of INHBB, SIX1, TJP1, IHH, CNPY2, or any combination thereof. A panel of biomarkers may comprise one or more of MIR101-1, RBP7, CSNK1A1, CYP26C1, NDUFAB1, PES1, or any combination thereof. A panel of biomarkers may comprise one or more of DSTN, BCAP29, NDUFAB1, STMN4, or any combination thereof.
Table 2
INHBB, SIX1, TJP1, IHH, CNPY2, MIR101-1, RBP7, CSNK1A1, CYP26C1, NDUFAB1, PES1, DSTN, BCAP29, NDUFAB1, STMN4, or any combination thereof.
[00237] In some cases, a biomarker may distinguish samples having an early stage cancer from samples having a late stage cancer. A panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, TXLNA, RP11-95F22.1, CTC-273B12.10, RNA5SP129, RASSF10, or any combination thereof. A panel of biomarkers may comprise one or more of FIGN, SIX1, ZIC4, or any combination thereof.
Table 3
MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, TXLNA, RP11-95F22.1, CTC-273B12.10, RNA5SP129, RASSF10, FIGN, SIX1, ZIC4, or any combination thereof.
[00238] A panel of biomarkers may comprise one or more of C2CD4C, ZIC4, INHBB, FIGN, SIX1, FRMD1, OTX1, CYP26C1, TMEM200B, NOL6, CXCL12, RP11-522B15.3, TBX2, TJP1, IHH, MAGI1-AS1, ZIC1, CNPY2, LRIG3, PINK1-AS, or any combination thereof. A panel of biomarkers may comprise one or more of SIX1, ZIC4, INHBB, C2CD4C, or any combination thereof.
[00239] A panel of biomarkers may comprise one or more biomarkers from FIG. 51. A panel of biomarkers may comprise one or more of RP11-522B15.3, CYP26C1, TMEM200B, NOL6, FRMD1, FIGN, C2CD4C, MAGI1-AS1, PINK1-AS, or any combination thereof. A panel of biomarkers may comprise one or more of INHBB, SIX1, TJP1, IHH, CNPY2, or any combination thereof. A panel of biomarkers may comprise one or more of RP11-522B15.3, CYP26C1, TMEM200B, NOL6, FRMD1, FIGN, C2CD4C, MAGI1-AS1, PINK1-AS, INHBB, SIX1, TJP1, IHH, CNPY2, or any combination thereof.
[00240] A panel of biomarkers may comprise one or more biomarkers from FIG. 74. A panel of biomarkers may comprise one or more of FIGN, MRPS31P2, DHX30, USP32P2, GCOM2, RP11-797H7.1, RP11-45506.2, RP1-155D22.1, TXLNA, RP11-95F22.1, CTC-273B12.10, RNA5SP129, or any combination thereof.
[00241] A panel of biomarkers may comprise one or more biomarkers from FIG. 119. A panel of biomarkers may comprise one or more of MCRIP2P1, RNU6-1265P, TRAPPC3, TXLNA, AC073257.2, FIGN, IGKV1-33, KLF2P3, RP11-523H20.3, DHX30, RNA5SP129, SNORA6, FKBP4P1, GCOM2, MTND3P24, RNA5SP152, RNU1-36P, RP11-96A1.5, ADAMTS19-AS1, PCBD2, SIL1, DUTP5, RP1-155D22.1, RP11-279022.1, RP11-797H7.1, AC083843.4, PRR13P7, RNU4-50P, FAM210CP, RP11-481H12.1, RP1-65P5.3, RP11-22B23.2, RP11-121C6.4, RP11-128P10.1, MRPS31P2, RP11-95F22.1, AE000662.93, CTD-2302E22.5, MIR4509-2, AC009120.6, IRX3, MRPS21P7, RP11-45506.2, USP32P2, CTC-273B12.10, NAA20, FAM118A, or any combination thereof. A panel of biomarkers may comprise one or more of MIR101-1, RBP7, CSNK1A1, CYP26C1, NDUFAB1, PES1, or any combination thereof. A panel of biomarkers may comprise one or more of MCRIP2P1, RNU6-1265P, TRAPPC3, TXLNA, AC073257.2, FIGN, IGKV1-33, KLF2P3, RP11-523H20.3, DHX30, RNA5SP129, SNORA6, FKBP4P1, GCOM2, MTND3P24, RNA5SP152, RNU1-36P, RP11-96A1.5, ADAMTS19-AS1, PCBD2, SIL1, DUTP5, RP1-155D22.1, RP11-279022.1, RP11-797H7.1, AC083843.4, PRR13P7, RNU4-50P, FAM210CP, RP11-481H12.1, RP1-65P5.3, RP11-22B23.2, RP11-121C6.4, RP11-128P10.1, MRPS31P2, RP11-95F22.1, AE000662.93, CTD-2302E22.5, MIR4509-2, AC009120.6, IRX3, MRPS21P7, RP11-45506.2, USP32P2, CTC-273B12.10, NAA20, FAM118A, MIR101-1, RBP7, CSNK1A1, CYP26C1, NDUFAB1, PES1, or any combination thereof.
[00242] A panel of biomarkers may comprise one or more biomarkers of FIG. 121. A panel of biomarkers may comprise one or more of RP1-155D22.1, FIGN, RP11-95F22.1, AHRR, NAA20, RP11-797H7.1, RPS2P46, NDUFA8, MRPS31P2, AC009120.6, C2CD4C, RN7SL635P, PCBD2, SLC24A1, KARS, CH17-11806.3, BEND7, RN7SKP69, PNMA1, RP11-21C4.1, LINC01607, AC005253.4, CTC-301O7.4, RP11-137H2.4, IRX3, RELL2, RP11-26J3.1, AE000662.93, FAM118A, AC006028.11, GAPDHP65, or any combination thereof. A panel of biomarkers may comprise one or more of DSTN, BCAP29, NDUFAB1, STMN4, or any combination thereof. A panel of biomarkers may comprise one or more of RP1-155D22.1, FIGN, RP11-95F22.1, AHRR, NAA20, RP11-797H7.1, RPS2P46, NDUFA8, MRPS31P2, AC009120.6, C2CD4C, RN7SL635P, PCBD2, SLC24A1, KARS, CH17-11806.3, BEND7, RN7SKP69, PNMA1, RP11-21C4.1, LINC01607, AC005253.4, CTC-301O7.4, RP11-137H2.4, IRX3, RELL2, RP11-26J3.1, AE000662.93, FAM118A, AC006028.11, GAPDHP65, DSTN, BCAP29, NDUFAB1, STMN4, or any combination thereof.
HMCP-110 Workflow
[00243] The HMCP-110 workflow may reduce sample attrition from 30% to 5% and eliminate strong operator biases seen in the HMCP-150 study. The analysis may identify many significantly differentially hydroxymethylated features (both gene bodies and enhancers) that have been previously associated with cancer (such as CRC) or not previously associated with cancer.
[00244] FIG. 1 shows key improvements of the HMCP-110 protocol as compared to the HMCP-150 protocol. A total of 110 colorectal cancer (CRC) and healthy volunteer (HV) plasma samples are processed through the HMCP v2 protocol with significant improvements to project management, data analysis and overall execution. Improvements may include a reduction in operator bias, a reduction in attrition rate, or a combination thereof.
[00245] The HMCP-110 protocol is shown in FIG. 33. Day 1 summary: cfDNA samples will undergo end repair, addition of an A-base overhang, adaptor ligation, and post-ligation purification. ~3.8% of the ligation product will be amplified, purified and QC'ed by Qubit and BioAnalyzer while the remainder is reserved for processing on day 2. Day 2 summary: the remaining purified ligation product from day 1 is denatured into single strands; these are copied to produce double-stranded material; and 5-hydroxymethylated cytosines are chemically labeled and then bound to a biotin conjugate, followed by a clean-up of this reaction. Day 3 summary: the biotin-conjugated 5hmC-containing DNA fragment material is bound to streptavidin beads. Using a magnet, the unbound material (non-5hmC-containing fragments) is washed away. Following this, the bound DNA fragments are denatured into single-stranded DNA, leaving the copy strand in solution while the biotin-conjugated original strand remains bound to the streptavidin beads. The single-stranded copy strand is amplified. The library size and molarity are determined for both the amplified enriched (5hmC-containing) libraries by the bioanalyzer.
HMCP v2 protocol
[00246] The HMCP-110 protocol may be a modified version of the HMCP v2 protocol as described above and in FIG. 33 and FIG. 1.
[00247] A method as described herein may comprise associating a label with an epigenetically modified base of a nucleic acid sequence to form a labeled nucleic acid sequence; hybridizing a substantially complementary strand to the labeled nucleic acid sequence; and amplifying the substantially complementary strand in a reaction in which the labeled nucleic acid sequence is substantially not present. One or more individual elements of the method need not be performed in a particular order. For example, associating a label may occur after the hybridizing. One or more individual elements of a given method may be performed in a different order than described herein.
Variation 1
[00248] FIG. 125 shows one example of the 5-hmC Pulldown Label Copy Enrich (HMCP LCE) method detailed herein. In some cases, the HMCP LCE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (c) any combination thereof.
[00249] In this example of FIG. 125, a first element 201 may be to prepare a plurality of double-stranded fragments 202, such as a library of oligonucleotide fragments. The plurality of double-stranded fragments may comprise cell-free DNA. The plurality of double-stranded fragments may comprise one or more epigenetic modifications on one or both strands. A second element 203 may be to associate a label (such as an azido-glucose label) with at least one of the oligonucleotide fragments from the plurality of double-stranded fragments to form a modified oligonucleotide fragment 204. The label may associate with an epigenetic modification present at one or more bases of the modified oligonucleotide fragment. A third element 205 may be to separate the modified oligonucleotide fragment to form one or more single-stranded modified oligonucleotide fragments 206. A fourth element 207 may be to hybridize a complementary strand, such as a substantially complementary strand, to a single-stranded modified oligonucleotide fragment to form a modified oligonucleotide fragment 208, such as a labeled chimeric library. The complementary strand may lack one or both of the label and the epigenetic modification. A fifth element 210 may be to associate a label 209 with the modified oligonucleotide fragment wherein the label 209 may also associate with a substrate. The label 209 may bind to an epigenetic modification or to a label previously associated with an epigenetic modification. The label 209 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The association between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. A sixth element 211 may be to enrich a sample for one or more complementary strands 212 by removing or separating or washing away from the substrate one or more complementary strands (such as by disrupting the bond between the complementary strand and the opposing strand) and then separating the complementary strand from the modified oligonucleotide fragment that remains associated with the substrate. A seventh element 213 may be to amplify the enriched complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 214 of the complementary strand.
[00250] In FIG. 125, the library may comprise double-stranded oligonucleotide fragments or single-stranded oligonucleotide fragments. The oligonucleotide fragments may be DNA or RNA. The library may be a next-generation sequencing (NGS) library. The library may comprise an oligonucleotide fragment having an adaptor (such as an NGS adaptor) at (a) one or both ends of the fragment, (b) one or both strands of the double-stranded oligonucleotide fragment, or (c) a combination thereof. The adaptor may uniquely identify the oligonucleotide fragment from other oligonucleotide fragments in a sample or in a library. The adaptor may be specific to or selective for double-stranded DNA.
[00251] In FIG. 125, a label may associate with an epigenetic modification (such as 5-hmC) or a type of epigenetic modification present at a base of the oligonucleotide fragment. A label may associate with a plurality of epigenetic modifications present on one or both strands of a double-stranded oligonucleotide fragment. A label may associate with a type of epigenetic modification (such as 5-hmC). A label may be selective for a type of epigenetic modification (such as 5-hmC). The label may be selective for double-stranded oligonucleotide fragments and may not label single-stranded fragments. The label may be selective for single-stranded oligonucleotide fragments. The label may associate with (such as bind to) the epigenetic modification with an aid, such as an enzyme. The enzyme may be selective for double-stranded oligonucleotide fragments, such as beta-glucosyltransferase (bGT). The label may associate with the epigenetic modification by click chemistry. The label may be an azido-sugar, such as an azido-glucose.
[00252] In FIG. 125, a double-stranded oligonucleotide fragment may be separated to form single stranded fragments, such as separating by denaturation. A complementary strand may be hybridized to at least a portion of a single stranded oligonucleotide. A complementary strand may be a primer, such as a primer that may be complementary to the adaptor (such as an NGS adaptor). A complementary strand may be a substantially complementary strand, such as substantially complementary along an entire length of the oligonucleotide fragment. The substantially complementary strand may be absent (a) the label that may be present in the parent oligonucleotide fragment, (b) the epigenetic modification that may be present in the parent oligonucleotide fragment, or (c) a combination thereof. The substantially complementary strand may be hybridized to the parent oligonucleotide fragment by DNA extension or cDNA extension.
[00253] In FIG. 125, parent oligonucleotide fragments and the substantially complementary strand may be indirectly associated with a substrate. The association to the substrate may occur via the label associated with the epigenetic modification on the parent oligonucleotide fragment. The substantially complementary strand may be free of any label and/or free of any epigenetic modification. The association between the label and the substrate may be disrupted.
[00254] In FIG. 125, oligonucleotide fragments comprising an epigenetic modification may be separated from oligonucleotide fragments absent any epigenetic modifications or absent a type of epigenetic modification. Separation may occur by associating the label with a substrate, such that any fragment absent the epigenetic modification or the type of epigenetic modification may be removed. Removal may occur by washing, such as stringent washing of the substrate. Following removal of oligonucleotide fragments lacking an epigenetic modification or a type of epigenetic modification, the substantially complementary strand may be separated from the parent oligonucleotide fragment strand. The parent oligonucleotide fragment strand may remain associated with the substrate. The parent oligonucleotide fragment strand and the substrate may be discarded. The substantially complementary strand may be amplified in a reaction vessel that may be free of the parent oligonucleotide fragment strand.
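The label, copy, pulldown, wash and elute sequence described for FIG. 125 can be pictured with a deliberately simplified model. The sketch below is illustrative only and is not part of the disclosed method: the `Duplex` class, the `enrich_copy_strands` function and the fragment names are hypothetical, and the model assumes perfectly efficient labeling, binding, washing and elution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Duplex:
    """Hypothetical model of a parent strand paired with its unmodified copy."""
    fragment_id: str
    parent_has_5hmc: bool  # True if the parent strand carries a labelable 5-hmC


def enrich_copy_strands(duplexes):
    """Return IDs of copy strands recovered after pulldown, wash and elution.

    Toy assumptions: every 5-hmC is labeled, every labeled parent binds the
    substrate, washing removes every unlabeled duplex, and elution releases
    every copy strand while the labeled parent stays on the substrate.
    """
    # Only duplexes whose parent strand is labeled are retained on the substrate.
    bound = [d for d in duplexes if d.parent_has_5hmc]
    # The copy strand of each retained duplex is eluted and taken forward.
    return [d.fragment_id for d in bound]


library = [
    Duplex("frag_A", True),
    Duplex("frag_B", False),
    Duplex("frag_C", True),
]
print(enrich_copy_strands(library))  # ['frag_A', 'frag_C']
```

In this model the amplification template consists only of copy strands, which mirrors the statement that amplification may occur in a reaction vessel free of the parent strand.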
Variation 2
[00255] FIG. 126 shows one example of the 5-hmC Pulldown Copy Label Enrich (HMCP CLE) method detailed herein. In some cases, the HMCP CLE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (c) any combination thereof.
[00256] In this example of FIG. 126, a first element 301 may be to prepare a plurality of double-stranded oligonucleotide fragments 302, such as a library. The double-stranded oligonucleotide fragments may comprise cell-free DNA. The double-stranded oligonucleotide fragments may have epigenetic modifications on one or more bases of one or both strands. A second element 303 may be to separate the strands of a double-stranded oligonucleotide fragment of the plurality to form one or more single-stranded oligonucleotide fragments 304. The one or more single-stranded oligonucleotide fragments may comprise one or more bases having an epigenetic modification. A third element 305 may be to hybridize a complementary strand, such as a substantially complementary strand, to at least one single-stranded oligonucleotide fragment to form a modified oligonucleotide fragment 306. The complementary strand may be substantially free of the epigenetic modification present in the opposing single-stranded oligonucleotide fragment. A fourth element 307 may be to associate a label (such as an azido-glucose label) with the modified oligonucleotide fragment to form a labeled modified oligonucleotide fragment 308, such as a labeled chimeric library. The label may associate with an epigenetic modification present in the modified oligonucleotide fragment. The label may not be associated with the substantially complementary strand that may lack an epigenetic modification. A fifth element 310 may be to associate a label 309 with the modified oligonucleotide fragment wherein the label 309 may also associate with a substrate. The label 309 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The association between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. A sixth element 311 may be to enrich a sample for one or more complementary strands 312 by removing or separating or washing away from the substrate one or more complementary strands (such as by disrupting the bond between the complementary strand and the opposing strand). Upon separation, the modified oligonucleotide fragment may remain associated with the substrate. In some cases, enriching a sample for one or more complementary strands may comprise washing a substrate, such as stringent washing of a substrate. Washing may remove one or more non-covalently bound fragments, one or more non-specifically physisorbed fragments, or a combination thereof. Washing may not disrupt or alter an association between a modified oligonucleotide fragment and a substrate, such that a sample may be enriched for the complementary strand. A seventh element 313 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 314 of the complementary strand.
[00257] In FIG. 126, the library may comprise double-stranded oligonucleotide fragments or single-stranded oligonucleotide fragments. The oligonucleotide fragments may be DNA or RNA. The library may be a next-generation sequencing (NGS) library. The library may comprise an oligonucleotide fragment having an adaptor (such as an NGS adaptor) at (a) one or both ends of the fragment, (b) one or both strands of the double-stranded oligonucleotide fragment, or (c) a combination thereof. The adaptor may uniquely identify the oligonucleotide fragment from other oligonucleotide fragments in a sample or in a library. The adaptor may be specific to or selective for double-stranded DNA.
[00258] In FIG. 126, a double-stranded oligonucleotide fragment may be separated to form single-stranded fragments, such as separating by denaturation. A complementary strand may be hybridized to at least a portion of a single-stranded oligonucleotide. A complementary strand may be a primer, such as a primer that may be complementary to the adaptor (such as an NGS adaptor). A complementary strand may be a substantially complementary strand, such as substantially complementary along an entire length of the oligonucleotide fragment. The substantially complementary strand may be absent the epigenetic modification that may be present in the parent oligonucleotide fragment. The substantially complementary strand may be hybridized to the parent oligonucleotide fragment by cDNA extension.
[00259] In FIG. 126, a label may associate with an epigenetic modification (such as 5-hmC) or a type of epigenetic modification present at a base of the parent oligonucleotide fragment. A label may associate with a plurality of epigenetic modifications present on the parent oligonucleotide fragment. A label may associate with a type of epigenetic modification (such as 5-hmC). A label may be selective for a type of epigenetic modification (such as 5-hmC). The label may be selective for double-stranded fragments and may not label single-stranded fragments. The label may be selective for single-stranded fragments. The label may associate with (such as bind to) the epigenetic modification of the parent strand with an aid, such as an enzyme. The enzyme may be selective for double-stranded oligonucleotide fragments, such as beta-glucosyltransferase (bGT). The label may associate with the epigenetic modification by click chemistry. The label may be an azido-sugar, such as an azido-glucose.
[00260] In FIG. 126, parent oligonucleotide fragments and the substantially complementary strand may be indirectly associated with a substrate. The association to the substrate may occur via the label associated with the epigenetic modification on the parent oligonucleotide fragment. The substantially complementary strand may be free of any label and/or free of any epigenetic modification. The association between the label and the substrate may be disrupted.
[00261] In FIG. 126, oligonucleotide fragments comprising an epigenetic modification may be separated from oligonucleotide fragments absent any epigenetic modifications or absent a type of epigenetic modification. Separation may occur by associating the label with a substrate, such that any fragment absent the epigenetic modification or the type of epigenetic modification may be removed. Removal may occur by washing, such as stringent washing of the substrate. Following removal of oligonucleotide fragments lacking an epigenetic modification or a type of epigenetic modification, the substantially complementary strand may be separated from the parent oligonucleotide fragment strand. The parent oligonucleotide fragment strand may remain associated with the substrate. The parent oligonucleotide fragment strand and the substrate may be discarded. The substantially complementary strand may be amplified in a reaction vessel that may be free of the parent oligonucleotide fragment strand.
Variation 3
[00262] FIG. 127 shows one example of the 5-hmC Pulldown Label Random prime Enrich (HMCP LRE) method detailed herein. In some cases, the HMCP LRE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (d) any combination thereof.
[00263] In this example of FIG. 127, a first element 401 may be to associate a label (such as an azido-glucose label) with a double-stranded oligonucleotide fragment to yield a modified oligonucleotide fragment 402. The double-stranded oligonucleotide may comprise cell-free DNA. The label may associate with an epigenetic modification or a type of epigenetic modification present at a base of one or both strands of the double-stranded oligonucleotide fragment to form the modified oligonucleotide fragment 402. A second element 403 may be to separate the strands of the modified oligonucleotide fragment to form one or more single-stranded modified oligonucleotide fragments and then to hybridize a complementary strand, such as a substantially complementary strand, to at least one of the single-stranded modified oligonucleotide fragments to form a double-stranded modified oligonucleotide fragment 404 having a complementary strand and a modified oligonucleotide fragment having the label. The complementary strand may be absent the label and absent the epigenetic modification. A third element 405 may associate an adaptor to the double-stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double-stranded modified oligonucleotide fragment) to form a double-stranded modified oligonucleotide fragment having one or more adaptors 406, such as a labeled chimeric library. A fourth element 408 may be to associate a label 407 with the modified oligonucleotide fragment wherein the label 407 may also associate with a substrate. The label 407 may bind to an epigenetic modification or to the label previously associated with an epigenetic modification. The label 407 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The interaction between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. A fifth element 409 may be to enrich a sample for one or more complementary strands 410 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the bond between the complementary strand and the opposing strand) and then separating the complementary strand from the modified oligonucleotide fragment that remains associated with the substrate. A sixth element 411 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 412 of the complementary strand.
[00264] In FIG. 127, a label may associate with an epigenetic modification (such as 5-hmC) present at a base of the parent oligonucleotide fragment. A label may associate with a plurality of epigenetic modifications present on the parent oligonucleotide fragment. A label may associate with a type of epigenetic modification (such as 5-hmC). A label may be selective for a type of epigenetic modification (such as 5-hmC). The label may be selective for double-stranded fragments and may not label single-stranded fragments. The label may be selective for single-stranded fragments. The label may associate with (such as bind to) the epigenetic modification of the parent strand with an aid, such as an enzyme. The enzyme may be selective for double-stranded oligonucleotide fragments, such as beta-glucosyltransferase (bGT). The label may associate with the epigenetic modification by click chemistry. The label may be an azido-sugar, such as an azido-glucose.
[00265] In FIG. 127, a position of a label may be determined by the presence or absence of 5-hmC in a dsDNA parent fragment. A label may be an azido-glucose, transferred to a 5-hmC from UDP-6-azide-glucose (UDP-N3-glc) by beta-glucosyltransferase (bGT). Labeling may be performed directly on a purified circulating tumor DNA (ctDNA) extract. An advantage may be that the ctDNA may not have been through a series of library preparation steps ahead of labeling. More material may therefore be available at labeling (improved efficiency), and the sample presented for labeling may be more representative than may be the case after NGS library preparation.
[00266] In some cases, hybridizing may comprise (i) priming (such as random priming), (ii) ligation (such as adapter ligation), or (iii) a combination thereof. For example, in FIG. 127, random priming may be performed by incubating an azido-labeled double-stranded DNA (dsDNA) duplex in the presence of an oligomer pool (where each oligo in the pool may comprise a degenerate N6, N7, N8, N9, N10 or longer "head" attached to an "NGS-adapter" tail), a DNA polymerase (e.g. Klenow) and a native nucleoside triphosphate comprising deoxyribose (dNTP) mix in a given buffer, and performing a single extension reaction at 37 °C for a defined time (e.g. 10 minutes). A degenerate primer "head" may randomly prime a template DNA and may make multiple copies for each of the parent strands. If using a strand-displacing polymerase, the random primer that primes closest to the 3' end of the template strand may extend and displace the other copies, leading to a long, double-stranded chimeric product with a 3' A-overhang at the end of the daughter copy. Random priming may achieve two elements in one by: 1) introducing an NGS-specific adapter sequence and 2) generating a modification-free copy (daughter strand) of the modified parent strand.
[00267] In FIG. 127, adapter ligation may occur by incubating a mono-adapted chimeric labeled duplex template with an NGS-platform-specific adapter (a forked adapter, a linear duplex adapter, a hairpin adapter, or a combination thereof) with a 3' T overhang and a 5' PO4 end, a dsDNA ligase (e.g. T4 ligase) and necessary cofactors (e.g. Mg2+, adenosine triphosphate (ATP), polyethylene glycol (PEG)) in a given buffer, at 20 °C for a defined period of time (e.g. 15 minutes). The A overhang of the mono-adapted chimeric labeled duplex may match with the T overhang of the adapter and may promote ligation efficiency. Only one end of each duplex (that being formed by the 3' end of the daughter strand) may be adapted. A successful ligation product may have a singly adapted azido-labeled parent strand (5' adapted) and a doubly adapted non-modified daughter strand (both 3' and 5' ends). In some cases, upon amplification of such a "library", only a bottom strand may be amplifiable with an adapter-specific polymerase chain reaction (PCR) primer.
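The size of a degenerate random-primer pool follows directly from the length of the degenerate "head": a head of k positions over four bases gives 4^k possible sequences. The sketch below is a minimal illustration under stated assumptions and is not the oligo design used in the method: `ADAPTER_TAIL` is a made-up placeholder rather than a real platform adapter, the function names are hypothetical, and the primer is written with the tail first followed by the head purely as a modeling choice.

```python
import random

BASES = "ACGT"
# Hypothetical placeholder tail; NOT a real NGS platform adapter sequence.
ADAPTER_TAIL = "GATTACAGATTACA"


def pool_complexity(head_length):
    """Number of distinct degenerate heads of the given length (4^k)."""
    return len(BASES) ** head_length


def sample_primers(head_length, n, tail=ADAPTER_TAIL, seed=0):
    """Draw n primers, each written as the adapter tail followed by a random head."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same draw
    primers = []
    for _ in range(n):
        head = "".join(rng.choice(BASES) for _ in range(head_length))
        primers.append(tail + head)
    return primers


for k in (6, 7, 8, 9, 10):
    print(f"N{k}: {pool_complexity(k)} possible heads")
print(sample_primers(6, 3))
```

An N6 head gives 4,096 possible sequences and an N10 head gives 1,048,576, which illustrates why longer heads increase pool diversity.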
[00268] In FIG. 127, magnetic bead binding may enable selective enrichment of labeled chimeric next generation sequencing (NGS) library fragments. This may be achieved directly (i.e. by Sharpless azide-alkyne cycloaddition reaction (CLICK) chemistry between the azido-glucose label and a dibenzocyclooctyne (DBCO)-magbead) or indirectly (i.e. by Sharpless azide-alkyne cycloaddition reaction (CLICK) of a dibenzocyclooctyne (DBCO)-biotin linker and then conjugation of the product to streptavidin-magbeads). In some cases, only azido-labeled fragments (i.e. 5-hmC-containing) may bind to the magbead. Azido-labeled fragments may be immobilized to a bead, such as a magnetic bead. In some cases, this interaction may only occur via a labeled parent strand of the chimeric NGS library duplex. A copied complement may not be azido-labeled and thus may be immobilized to a bead only by virtue of the hydrogen-bonding interaction between the complementary duplex strands. As this H-bonding interaction may be non-covalent, it may be disrupted and exploited in downstream steps.
[00269] In FIG. 127, enrichment by stringent washing may be essential to maximize a signal-to-noise ratio of an enrichment process. Chimeric NGS library immobilized beads may be washed stringently (e.g. specific buffers; mild heat; mild denaturants etc.) to selectively remove non-covalently bound NGS library fragments that are non-specifically physisorbed to their surface. In some cases, such types of fragments may cause noise in a final sequencing result. Chimeric NGS library fragments covalently bound to the bead surface may be selected for in the enrichment (i.e. signal, those whose insert may originally have contained 5-hmC). After stringent washing, a daughter strand may be eluted from the bead (e.g. heat, high pH, low ionic strength buffer etc.) and taken forward to a PCR reaction. In some cases, the bead-immobilized fraction may be discarded. In some cases, these daughter strands may be exact complements of the labeled strands immobilized to a bead. However, they may not contain any epigenetic modifications and hence may be free from "5-hmC-density" amplification bias. Amplification of these eluted daughter strands may give a superior result over existing methodologies for two reasons: 1) an improved resolution (higher signal-to-noise) and 2) an improved representation (decreased selection bias).
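The effect of stringent washing on signal-to-noise can be illustrated with simple arithmetic. The sketch below is an assumption-laden illustration only: the function name is hypothetical and the fragment counts are invented for the example, not measured values from the method.

```python
def signal_to_noise(specific, nonspecific):
    """Ratio of covalently bound (5-hmC-derived) to physisorbed fragments."""
    if nonspecific <= 0:
        raise ValueError("nonspecific count must be positive")
    return specific / nonspecific


# Made-up counts for illustration only (not data from the disclosure).
before_wash = signal_to_noise(specific=100, nonspecific=400)
after_wash = signal_to_noise(specific=95, nonspecific=20)

print(before_wash)               # 0.25
print(after_wash)                # 4.75
print(after_wash / before_wash)  # 19.0
```

In this invented example the wash removes most physisorbed fragments while retaining most covalently bound ones, so the ratio improves 19-fold.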
[00270] The methods and systems as described herein may provide a result that may be far more representative of an extent to which a nucleic acid may be marked epigenetically. In some cases, the methods and systems may be superior to other methods of identification of epigenetic modifications. Other methods of identification may include the HMCP method or a method that comprises associating a sugar, a protein, an antibody, or a fragment of any of these with an epigenetic modification and detecting a presence of the sugar, the protein, the antibody, or fragment thereof. In some cases, nucleic acid sequences, such as fragments containing a high density of epigenetic modifications, may not be detected using other methods of identification of epigenetic modifications. The unbiased approach of the present methods and systems provides for detection of high-density epigenetic modifications of nucleic acid sequences, such as short fragments, yielding unbiased detection.
[00271] In FIG. 127, a daughter strand PCR amplification may occur. In some cases, PCR may be employed using only an eluted daughter strand as amplification template using standard protocols and procedures. In some cases, minimizing a number of PCR cycles may minimize duplicates. In some cases, using UMI-codes within an adapter sequence may help quantitation during downstream analysis. In some cases, a genome wide library of enriched fragments may be ready for sequencing.
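One common way UMI codes support quantitation is by collapsing PCR duplicates so that each original molecule is counted once. The sketch below is a minimal illustration of that idea, not the analysis pipeline of the method: the function name, the read tuple layout `(chrom, start, umi)` and the example reads are all hypothetical, and real tools typically also tolerate sequencing errors within the UMI.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Count distinct UMIs per locus; reads sharing locus and UMI count once."""
    umis_by_locus = defaultdict(set)
    for chrom, start, umi in reads:
        umis_by_locus[(chrom, start)].add(umi)
    return {locus: len(umis) for locus, umis in umis_by_locus.items()}


# Hypothetical reads: (chromosome, fragment start, UMI sequence).
reads = [
    ("chr1", 100, "AACG"),
    ("chr1", 100, "AACG"),  # PCR duplicate of the first read
    ("chr1", 100, "TTGC"),
    ("chr2", 500, "AACG"),
]
print(count_unique_molecules(reads))
# {('chr1', 100): 2, ('chr2', 500): 1}
```

Exact matching is used here for brevity; it would over-count molecules whenever a UMI contains a sequencing error.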
Variation 4
[00272] FIG. 128 shows one example of the 5-hmC Pulldown Random prime Label Enrich (HMCP RLE) method detailed herein. In some cases, the HMCP RLE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (d) any combination thereof.
[00273] FIG. 128 is similar to the method of FIG. 127 except that in some cases, priming (such as random priming) and ligation (such as adapter ligation) may occur before labeling as shown in FIG. 128 and in some cases, priming and ligation may occur after labeling as shown in FIG. 127.
[00274] As shown in FIG. 128, a first element 501 may (i) separate strands of a double-stranded oligonucleotide fragment, such as a cell-free DNA fragment (having one or more epigenetic modifications at one or more bases on one or both strands) and (ii) initiate random priming to form a complementary strand, such as a substantially complementary strand, to at least one of the single-stranded oligonucleotide fragments. Random priming may form a double-stranded modified oligonucleotide fragment 502. The complementary strand formed by random priming may not have epigenetic modifications or may be substantially free of epigenetic modifications. A second element 503 may associate an adaptor to the double-stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double-stranded modified oligonucleotide fragment) to form a double-stranded modified oligonucleotide fragment having one or more adaptors 504. A third element 505 may associate a label (such as an azido-glucose label) with the double-stranded modified oligonucleotide fragment to yield a labeled fragment 506, such as a labeled chimeric library. The label may associate with an epigenetic modification or a type of epigenetic modification present at a base of the double-stranded oligonucleotide fragment to form the labeled fragment 506. A fourth element 508 may be to associate a label 507 with the double-stranded modified oligonucleotide fragment wherein the label 507 may also associate with a substrate. The label 507 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The interaction between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. A fifth element 509 may be to enrich a sample for one or more complementary strands 510 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the interaction between the complementary strand and the opposing strand). Upon separation, the modified oligonucleotide fragment may remain associated with the substrate. A sixth element 511 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 512 of the complementary strand.
Variation 5
[00275] FIG. 129 shows one example of the 5-hmC Pulldown Label Loci Specific Enrich (HMCP LLSE) method detailed herein. In some cases, the HMCP LLSE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (d) targeted regions of 5-hmC enriched DNA as compared with other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (e) any combination thereof.
[00276] As shown in FIG. 129, a first element 601 may associate a label (such as an azido-glucose label) with the double stranded oligonucleotide fragment, such as a cell-free DNA fragment to yield a labeled fragment 602. The label may associate with an epigenetic
modification or a type of epigenetic modification present at one or more bases of the double stranded oligonucleotide fragment to form the labeled fragment 602. A second element 603 may (i) separate strands of a labeled fragment and (ii) initiate loci specific priming to form a complementary strand, such as a substantially complementary strand, to at least one of the single stranded oligonucleotide fragments. Loci specific priming may form a double stranded modified oligonucleotide fragment 604 having a label associated with an epigenetic modification of the parent strand. The complementary strand may be absent both epigenetic modifications and the associated label. A third element 605 may associate an adaptor to the double stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double stranded modified oligonucleotide fragment) to form a double stranded modified oligonucleotide fragment having one or more adaptors 606, such as a labeled and loci-enriched chimeric library. A fourth element 608 may be to associate a label 607 with the double stranded modified oligonucleotide fragment wherein the label 607 may also associate with a substrate. The label 607 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The interaction between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. A fifth element 609 may be to enrich a sample for one or more complementary strands 610 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the bond between the complementary strand and the opposing strand). Upon separation, the opposing strand may remain associated with the substrate. 
A sixth element 611 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 612 of the complementary strand.
[00277] In this example, both strands of double stranded DNA (dsDNA) fragments containing 5-hmC may be labeled using beta-glucosyltransferase (βGT) and UDP-6-azide-glucose (UDP-N3-glc). This step may be dsDNA selective (βGT may not work on single stranded DNA (ssDNA)). The position of the label may be determined by the presence or absence of 5-hmC in the dsDNA parent fragment. The label may be azido-glucose, transferred to the 5-hmC from UDP-N3-glc by βGT. The labeling may be performed directly on the purified circulating tumor DNA (ctDNA) extract. An advantage of this may be that the ctDNA has not been through a series of library preparation steps ahead of labeling, so there may be more material available at the labeling step (improved efficiency) and the material presented for labeling may be more representative of the sample than may be the case after NGS library preparation.
[00278] In some cases, hybridizing may comprise (i) priming (such as loci specific priming), (ii) ligation (such as adapter ligation), or (iii) a combination thereof. For example, in FIG. 129, loci specific priming may be performed by incubating azido-labeled dsDNA duplexes in the presence of an oligomer pool (where each oligo in the pool may comprise a loci specific "head" attached to an "NGS-adapter" tail), a DNA polymerase (e.g. Klenow) and a native dNTP mix in a given buffer, and performing a single extension reaction at 37 °C for a defined time (e.g. 10 minutes). A loci specific head may be designed to be complementary to specific, defined regions of interest (ROI). Extension from an annealed loci specific primer may result in an A-overhang at an end of a daughter copy. The loci specific priming may achieve two elements in one: 1) it may introduce an NGS-specific adapter sequence in a loci-specific manner and 2) it may generate a modification-free copy (daughter strand) of the modified parent strand.
[00279] In FIG. 129, a labelled loci-monoadapted chimeric duplex template may be incubated with an NGS-platform specific adapter (the illustration shows a forked adapter, but a linear duplex adapter or a hairpin adapter may be substituted) with a 3' T overhang and a 5' PO4 end, a dsDNA ligase (e.g. T4 ligase) and necessary cofactors (e.g. Mg2+, adenosine triphosphate (ATP), polyethylene glycol (PEG)) in a given buffer, at 20 °C for a defined period of time (e.g. 15 minutes). The A overhang of the monoadapted chimeric labelled duplex may match the T overhang of the adapter and promote ligation efficiency. In some cases, only one end of each duplex (that being formed by the 3' end of the daughter strand) may be adapted. A successful ligation product may have a singly adapted azido-labeled parent strand (5' adapted) and a doubly adapted non-modified daughter strand (both 3' and 5' ends). Were one to amplify this "library", it may be that only a bottom strand may be amplifiable with adapter-specific PCR primers.
[00280] In FIG. 129, following adapter ligation, an enrichment of the daughter strand by a substrate may be employed followed by PCR amplification of the daughter strand that may be substantially free of epigenetic modifications.
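The selection logic of the HMCP LLSE workflow described above can be illustrated with a deliberately simplified sketch. It assumes an idealized, fully efficient reaction in which a daughter strand is recovered only when its parent fragment carries at least one 5-hmC label and its locus is targeted by the primer pool. The `Fragment` class, the `llse_enrich` function and the locus names are hypothetical and are not part of the described method.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    locus: str       # hypothetical locus name
    hmc_sites: int   # number of 5-hmC bases on the parent duplex


def llse_enrich(fragments, regions_of_interest):
    """Return the loci whose daughter strands would be recovered for amplification."""
    recovered = []
    for frag in fragments:
        # The label attaches only where 5-hmC is present on the parent duplex.
        labeled = frag.hmc_sites > 0
        # A loci specific primer anneals only inside a region of interest.
        primed = frag.locus in regions_of_interest
        # A daughter strand exists only if priming occurred, and it is retained
        # on the substrate only through a labeled parent strand.
        if labeled and primed:
            recovered.append(frag.locus)
    return recovered


fragments = [
    Fragment("locus_A", 3),  # modified and targeted
    Fragment("locus_B", 0),  # targeted but unmodified: washed away
    Fragment("locus_C", 2),  # modified but not targeted: no daughter strand
]
print(llse_enrich(fragments, {"locus_A", "locus_B"}))
```

In this toy model only `locus_A` is recovered, which mirrors the combined effect of labeling, loci specific priming and substrate enrichment; real reactions would have incomplete efficiencies at each step.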
Variation 6
[00281] FIG. 130 shows one example of the 5-hmC Pulldown Loci Specific Label Enrich (HMCP LSLE) method detailed herein. In some cases, the HMCP LSLE method may provide (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in a 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (d) targeted regions of 5-hmC enriched DNA as compared with other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (e) any combination thereof.
[00282] FIG. 130 is similar to the method of FIG. 129 except that in some cases, priming (such as loci specific priming) and ligation (such as adapter ligation) may occur before labeling as shown in FIG. 130 and in some cases, priming and ligation may occur after labeling as shown in FIG. 129.
[00283] As shown in FIG. 130, a first element 701 may (i) separate strands of a double stranded oligonucleotide fragment, such as a cell-free DNA fragment and (ii) initiate loci specific priming to form a complementary strand, such as a substantially complementary strand, to at least one of the single stranded parent strands. Loci specific priming may form a double stranded modified oligonucleotide fragment 702. The double stranded oligonucleotide fragment may have one or more epigenetic modifications at one or more bases on one or both strands. The complementary strand, such as a substantially complementary strand, formed by loci specific priming may not have epigenetic modifications. A second element 703 may associate an adaptor to the double stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double stranded modified oligonucleotide fragment) to form a double stranded modified oligonucleotide fragment having one or more adaptors 704. A third element 705 may associate a label (such as an azido-glucose label) with the double stranded modified
oligonucleotide fragment to yield a labeled fragment 706, such as a labeled chimeric library. The label may associate with an epigenetic modification or a type of epigenetic modification present at a base of the double stranded modified oligonucleotide fragment to form the labeled fragment 706. A fourth element 708 may be to associate a label 707 with the double stranded modified oligonucleotide fragment wherein the label 707 may also associate with a substrate. The label 707 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The association between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. A fifth element 709 may be to enrich a sample for one or more complementary strands 710 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the bond between the complementary strand and the opposing strand). Upon separation, the opposing strand may remain associated with the substrate. A sixth element 711 may be to amplify the complementary strand in the absence of the parent strand to form one or more daughter strands 712 of the complementary strand.
[00284] The HMCP method may be referred to herein as the 'standard' method. The HMCP method may be referred to herein as HMCP, HMCP-v1, HMCPv1, v1HMCP, v1 HMCP, or V1. The CLE method may be referred to herein as HMCP CLE, HMCP-v2, HMCPv2, CLE-HMCP, v2HMCP, v2 HMCP, or V2.
[00285] For any of the methods described herein, including CLE, HMCP LCE, HMCP CLE, HMCP LRE, HMCP RLE, HMCP LLSE, HMCP LSLE, one or more individual elements of a given method may be performed in the order as described herein. In some cases, one or more individual elements of a given method need not be performed in a particular order described herein. In some cases, one or more individual elements of a given method may be performed in a different order than described herein.
[00286] In some cases, the complementary strand may be a substantially complementary strand or may comprise a portion that may be substantially complementary to a portion of a nucleic acid sequence.
[00287] Hybridizing may comprise hybridizing at least two complementary strands to at least two portions of a nucleic acid sequence. Hybridizing may comprise hybridizing at least a portion of a complementary strand to an adapter sequence of the nucleic acid sequence.
Hybridizing may comprise extension, such as cDNA extension. Hybridizing may comprise priming, such as loci specific priming or random priming. Hybridizing may comprise ligation, such as adapter ligation. Hybridizing may comprise hybridizing a primer to a nucleic acid sequence and elongating from the primer to form a complementary strand. Hybridizing may comprise obtaining a complementary strand and hybridizing the complementary strand to the nucleic acid sequence.
[00288] A label may be associated with an epigenetically modified base of a nucleic acid sequence. A label may be associated with an epigenetically modified base before hybridizing. A label may be associated with an epigenetically modified base after hybridizing.
[00289] The method may comprise amplifying the complementary strand in a reaction in which the nucleic acid sequence may be substantially not present. The amplifying may comprise associating the nucleic acid sequence and complementary strand with a substrate, such as by a label. The amplifying may comprise washing a substrate that may be associated with the nucleic acid sequence and complementary strand, such as stringent washing. The amplifying may comprise eluting a complementary strand from the substrate on which the nucleic acid sequence remains. The amplifying may comprise amplifying the complementary strand.
[00290] An epigenetic modification may comprise a DNA methylation. A DNA methylation may comprise a hyper-methylation or a hypo-methylation. A DNA methylation may comprise a modification of a DNA base, such as a 5-methylcytosine (5-mC), a 4-methylcytosine, a 6-methyladenine, or a combination thereof.
Specific Embodiments
[00291] Embodiment 1. A method comprising: (a) assaying a sample for a nucleotide sequence having at least 70% sequence homology to a biomarker listed in Table 1 to produce a result, wherein the sample is from a subject having cancer or suspected of having cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
[00292] Embodiment 2. The method of claim 1, wherein based on the comparing of (b) the sample is identified as benign or malignant for the cancer.
[00293] Embodiment 3. The method of any one of claims 1-2, wherein the nucleotide sequence has at least 85% sequence homology to the biomarker listed in Table 1.
[00294] Embodiment 4. The method of any one of claims 1-3, wherein at least five biomarkers listed in Table 1 or Table 2 are assayed in (a).
[00295] Embodiment 5. The method of any one of claims 1-4, wherein the assaying of (a) comprises assaying for a presence or an absence of an epigenetic modification.
[00296] Embodiment 6. The method of any one of claims 1-5, wherein the biomarker is a transcription factor.
[00297] Embodiment 7. A method comprising: (a) assaying a sample for a presence or an absence of an epigenetic modification in a nucleotide sequence having at least 70% sequence homology to a biomarker listed in Table 2 to produce a result, wherein the sample is from a subject having cancer or suspected of having cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
[00298] Embodiment 8. The method of claim 7, wherein based on the comparing of (b) the sample is identified as benign or malignant for the cancer.
[00299] Embodiment 9. The method of any one of claims 7-8, wherein at least five biomarkers listed in Table 1 or Table 2 are assayed in (a).
[00300] Embodiment 10. The method of any one of claims 7-9, wherein the biomarker comprises a transcription factor.
[00301] Embodiment 11. A method comprising: (a) assaying a cell-free DNA sample for a metabolic-related biomarker or an immune-related biomarker to produce a result, wherein the cell-free DNA sample is from a subject having cancer or suspected of having cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
[00302] Embodiment 12. The method of claim 11, wherein based on the comparing of (b) the cell-free DNA sample is identified as benign or malignant for the cancer.
[00303] Embodiment 13. The method of any one of claims 11-12, wherein the assaying of (a) comprises assaying for a presence or an absence of an epigenetic modification.
[00304] Embodiment 14. The method of any one of claims 11-13, wherein at least five biomarkers are assayed in (a).
[00305] Embodiment 15. The method of any one of claims 11-14, wherein the biomarker is a transcription factor.
[00306] Embodiment 16. A method comprising: identifying a presence or an absence of (i) an early stage colorectal cancer or (ii) a late stage colorectal cancer in a sample, wherein the identifying comprises assaying for a presence or an absence of an epigenetic modification in a nucleotide sequence of the sample to produce a result, wherein the sample is from a subject having cancer or suspected of having cancer.
[00307] Embodiment 17. The method of claim 16, wherein the nucleotide sequence has at least 70% sequence homology to a biomarker of Table 3.
[00308] Embodiment 18. The method of any one of claims 1, 7 or 11, wherein the result from (a) is input into a trained algorithm and the comparing of (b) is performed by the trained algorithm to classify the sample as benign or malignant for the cancer.
[00309] Embodiment 19. The method of any one of claims 5, 7, 13 or 16, wherein the presence or the absence of the epigenetic modification comprises a number of methylated sites in the biomarker.
[00310] Embodiment 20. The method of any one of claims 5, 7, 13 or 16, wherein the presence or the absence of the epigenetic modification comprises a number of hypo-hydroxymethylated loci, a number of hyper-hydroxymethylated loci, or a combination thereof in the biomarker.
[00311] Embodiment 21. The method of claim 18, further comprising (c) assaying the sample for a population of immune cells.
[00312] Embodiment 22. The method of claim 21, further comprising inputting the population of immune cells from (c) into the trained algorithm.
[00313] Embodiment 23. The method of claim 21 or claim 22, wherein the population of immune cells comprises more than one type of immune cell.
[00314] Embodiment 24. The method of claim 21 or claim 22, wherein the population of immune cells comprises a single type of immune cell.
[00315] Embodiment 25. The method of claim 18, wherein the trained algorithm classifies the sample as benign or malignant for the cancer at greater than about 90% sensitivity, greater than about 80% specificity, or a combination thereof.
[00316] Embodiment 26. The method of claim 25, wherein the trained algorithm classifies the sample as benign or malignant for the cancer at greater than about 90% sensitivity.
[00317] Embodiment 27. The method of claim 25, wherein the trained algorithm classifies the sample as benign or malignant for the cancer at greater than about 80% specificity.
[00318] Embodiment 28. The method of any one of claims 5, 7, 13 or 16, wherein the epigenetic modification comprises a 5-methylcytosine (5-mC), a 5-hydroxymethylcytosine (5-hmC), a 5-formylcytosine (5-fC), a 5-carboxylcytosine (5-caC), or any combination thereof.
[00319] Embodiment 29. The method of claim 28, wherein the epigenetic modification comprises the 5-hmC.
[00320] Embodiment 30. The method of any one of claims 5, 7 or 13, wherein a loss in the epigenetic modification as compared to the control or the derivative thereof is indicative of the cancer.
[00321] Embodiment 31. The method of claim 30, wherein the epigenetic modification is the 5-hmC.
[00322] Embodiment 32. The method of any one of claims 1-31, wherein the subject is suspected of having the cancer.
[00323] Embodiment 33. The method of any one of claims 1-32, wherein said subject is asymptomatic for the cancer.
[00324] Embodiment 34. The method of any one of claims 1-33, wherein the subject has not previously been diagnosed with the cancer.
[00325] Embodiment 35. The method of any one of claims 1-34, wherein the cancer is colorectal cancer (CRC).
[00326] Embodiment 36. The method of any one of claims 2, 8, 12, or 18, wherein when the method identifies the sample as malignant for the cancer, the method further classifies the sample as representative of a stage of cancer.
[00327] Embodiment 37. The method of claim 36, wherein the stage of the cancer is stage I.
[00328] Embodiment 38. The method of any one of claims 2, 8, 12 or 18, wherein when the method identifies the sample as malignant for the cancer, the method further classifies the sample as representative of a subtype of cancer.
[00329] Embodiment 39. The method of claim 38, wherein the cancer is colon cancer and the subtype of the colon cancer is serrated adenoma or a tubular adenoma.
[00330] Embodiment 40. The method of claim 38, wherein the cancer is colon cancer and the subtype of the colon cancer is CMS1, CMS2, CMS3, or CMS4.
[00331] Embodiment 41. The method of any one of claims 1-10 or 16-40, wherein the sample comprises cell-free DNA.
[00332] Embodiment 42. The method of claim 41, wherein an amount of the cell-free DNA is from about 5 nanogram (ng) to about 15 ng.
[00333] Embodiment 43. The method of claim 41 or 42, wherein the sample further comprises a blood sample, a tissue sample, a fine needle aspirate sample, a fecal sample, or any combination thereof.
[00334] Embodiment 44. The method of any one of claims 1-43, wherein the sample is identified as benign for the cancer in an absence of the subject having a further medical procedure.
[00335] Embodiment 45. The method of claim 44, wherein the further medical procedure comprises: obtaining a biopsy from the subject, performing an imaging scan of the subject, or a combination thereof.
[00336] Embodiment 46. The method of any one of claims 18-45, wherein when the trained algorithm identifies the sample as benign, assaying a second sample from the subject to monitor a change over time in the result from (a).
[00337] Embodiment 47. The method of any one of claims 18-46, wherein the trained algorithm is trained using a training set of samples.
[00338] Embodiment 48. The method of any one of claims 18-47, wherein the training set of samples comprises cell-free DNA samples.
[00339] Embodiment 49. The method of any one of claims 18-48, wherein the training set of samples comprises cell-free DNA samples and genomic DNA samples.
[00340] Embodiment 50. The method of any one of claims 18-49, wherein the training set of samples comprises a sample having a sequence comprising a CpG island.
[00341] Embodiment 51. The method of any one of claims 18-50, wherein the training set of samples comprises a combination of malignant samples and benign samples.
[00342] Embodiment 52. The method of claim 5, 7 or 13, wherein the assaying of (a) comprises detecting the epigenetic modification.
[00343] Embodiment 53. The method of claim 52, wherein the detecting is by nanopore sequencing.
[00344] Embodiment 54. The method of claim 52, wherein the detecting is by high throughput sequencing.
[00345] Embodiment 55. The method of claim 52, wherein the detecting comprises associating a label with an epigenetic modification in a sequence of the sample to form a labeled sequence; hybridizing a substantially complementary strand to the labeled sequence; and amplifying the substantially complementary strand in a reaction in which the labeled sequence is substantially not present.
[00346] Embodiment 56. The method of claim 52, wherein the detecting comprises contacting the sample with an enzyme or a catalytically active fragment thereof that converts a methylated residue in the sample to a modified base.
[00347] Embodiment 57. The method of claim 52, wherein the detecting comprises covalently labeling a hydroxyl group on a hydroxymethylated residue in the sample to generate a labeled hydroxymethylated residue; and sequencing the sample comprising the labeled hydroxymethylated residue or derivatives thereof.
[00348] Embodiment 58. The method of claim 52, wherein the detecting comprises contacting at least a portion of the sample with an enzyme that utilizes a labeled glucose or a labeled glucose-derivative donor substrate to add a labeled glucose molecule or a labeled glucose-derivative to an epigenetic modification in the sample to generate a labeled glucosylated-epigenetic modification.
[00349] Embodiment 59. The method of claim 52, wherein the detecting comprises adding a detectable label to the epigenetic modification.
[00350] Embodiment 60. The method of claim 59, wherein the detectable label comprises an antibody.
[00351] Embodiment 61. The method of any one of claims 52-60, wherein the detecting is by a method comprising a fluorescence resonance energy transfer (FRET) assay, an enzyme-linked immunosorbent assay (ELISA), a liquid chromatography-mass spectrometry (LCMS) assay, or any combination thereof.
[00352] Embodiment 62. The method of any one of claims 52-61, wherein the detecting comprises adaptor ligation.
[00353] Embodiment 63. The method of claim 1, wherein the control or derivative thereof is from a subject having cancer, a subject not having cancer, a subject having a stage I cancer, a subject having a stage II cancer, a subject having a stage III cancer, a subject having a stage IV cancer, or any combination thereof.
[00354] Embodiment 64. The method of claim 52, wherein the detecting comprises detecting 5-caC or 5-fC.
[00355] Embodiment 65. The method of claim 17, wherein the nucleotide sequence has at least 70% sequence homology to a biomarker of Table 1, Table 2, or a combination thereof.
[00356] Embodiment 66. The method of any one of claims 1, 7, or 11, wherein based on the comparing of (b) the sample is identified as a precancerous lesion or a precancerous growth.
[00357] Embodiment 67. The method of claim 66, wherein the precancerous lesion or the precancerous growth comprises a polyp, a nonpolyp, an advanced adenoma, or any combination thereof.
[00358] Embodiment 68. The method of claim 66, wherein the assaying of (a) is performed in the absence of a screening procedure.
[00359] Embodiment 69. The method of claim 68, wherein the screening procedure comprises a colonoscopy, an assay performed on a stool sample provided by the subject, a sigmoidoscopy, or any combination thereof.
[00360] Embodiment 70. The method of claim 68, wherein the sample is a blood sample.
[00361] Embodiment 71. The method of claim 68, wherein the sample comprises cell-free DNA.
[00362] Embodiment 72. A method comprising: (a) assaying a sample for a nucleotide sequence having at least 70% sequence homology to a biomarker listed in FIG. 34B, FIG. 139, FIG. 140, FIG. 141, FIG. 142, or any combination thereof to produce a result, wherein the sample is from a subject asymptomatic for a cancer or not previously diagnosed with a cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
[00363] Embodiment 73. The method of claim 72, wherein based on the comparing of (b) the sample is identified as a precancerous lesion or a precancerous growth.
[00364] Embodiment 74. The method of claim 73, wherein the precancerous lesion or precancerous growth comprises a polyp, nonpolyp, an advanced adenoma, or any combination thereof.
[00365] Embodiment 75. The method of claim 72, wherein the assaying of (a) is performed in the absence of a screening procedure.
[00366] Embodiment 76. The method of claim 75, wherein the screening procedure comprises a colonoscopy, an assay performed on a stool sample provided by the subject, a sigmoidoscopy, or any combination thereof.
[00367] Embodiment 77. The method of claim 72, wherein the sample is a blood sample.
[00368] Embodiment 78. The method of claim 72, wherein the sample comprises cell-free DNA.
[00369] Embodiment 79. The method of claim 72, wherein the nucleotide sequence has at least 70% sequence homology to a biomarker of Table 1, Table 2, Table 3, or any combination thereof.
[00370] Embodiment 80. The method of claim 72, wherein the assaying of (a) comprises assaying for a presence or an absence of an epigenetic modification.
[00371] Embodiment 81. The method of claim 80, wherein the assaying of (a) comprises detecting the epigenetic modification.
[00372] Embodiment 82. The method of claim 81, wherein the detecting is by nanopore sequencing.
[00373] Embodiment 83. The method of claim 81, wherein the detecting is by high throughput sequencing.
[00374] Embodiment 84. The method of claim 72, wherein the control or derivative thereof comprises samples obtained from a precancerous lesion or a precancerous growth.
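The performance thresholds recited in Embodiments 25 to 27 can be checked against a labeled test set as in the following minimal sketch. The function names, the label strings and the example predictions are hypothetical illustrations; the thresholds of 90% sensitivity and 80% specificity are taken from those embodiments, and the sketch assumes the test set contains at least one malignant and one benign sample.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive="malignant"):
    """Compute sensitivity and specificity of a binary classification."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1   # malignant sample called malignant
            else:
                fn += 1   # malignant sample missed
        else:
            if pred == positive:
                fp += 1   # benign sample called malignant
            else:
                tn += 1   # benign sample called benign
    return tp / (tp + fn), tn / (tn + fp)


def meets_thresholds(sensitivity, specificity, min_sens=0.90, min_spec=0.80):
    """True when both values exceed the thresholds of Embodiments 25 to 27."""
    return sensitivity > min_sens and specificity > min_spec


# Hypothetical test set: 20 malignant samples (19 detected), 10 benign (9 correct).
truth = ["malignant"] * 20 + ["benign"] * 10
predicted = ["malignant"] * 19 + ["benign"] + ["benign"] * 9 + ["malignant"]
sens, spec = sensitivity_specificity(truth, predicted)
print(sens, spec, meets_thresholds(sens, spec))
```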
EXAMPLES
[00375] As shown in FIG. 10, a list of CRC vs. HV differential genes (padj < 0.05) is run against a GeneAnalytics database to identify potential disease associations. As a control comparison, the same analysis is run on 20 random gene lists of the same size. Colorectal cancer is the highest scoring hit for the differential gene list but is not flagged for the random gene lists, suggesting that the association is specific and that HMCP is detecting physiologically relevant pathways.
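The random gene list control described for FIG. 10 can be sketched as follows. This is not the GeneAnalytics scoring, which is not reproduced here; a simple overlap count with a disease gene set is swapped in as a stand-in score. The gene identifiers, the size of the background and the function names are hypothetical.

```python
import random


def overlap_score(gene_list, disease_genes):
    """Stand-in association score: number of genes shared with the disease set."""
    return len(set(gene_list) & set(disease_genes))


def random_list_control(background, disease_genes, list_size, n_lists=20, seed=0):
    """Score n_lists random gene lists of the same size as the differential list."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    return [
        overlap_score(rng.sample(background, list_size), disease_genes)
        for _ in range(n_lists)
    ]


# Hypothetical data: 1000 background genes, 30 of them associated with the disease.
background = [f"GENE{i}" for i in range(1000)]
disease_genes = set(background[:30])
# Hypothetical differential list of 50 genes, 20 of which are disease-associated.
differential = background[:20] + background[500:530]

observed = overlap_score(differential, disease_genes)
control = random_list_control(background, disease_genes, len(differential))
print("observed:", observed, "max of random lists:", max(control))
```

An observed score well above every random list score is the pattern the paragraph describes as a specific association.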
[00376] As shown in FIG. 11, the top 20 genes directly associated with CRC based on the VarElect component of the GeneCards database are present among the differential genes in the CRC vs. HV comparison (MCC, IGF2, FGFR2). A total of 56 genes from the CRC list of VarElect are present in the list of genes that are differential between CRC and HV with FDR < 0.05.
[00377] As shown in FIG. 20, the data quality control (QC) parameters used are: (i) de-duplicated read count for input and pull-down and (ii) uniformity score for input and pull-down. The parameters are reviewed for their relationships with each other and with sample characteristics. No HMCP profiles are excluded on the basis of data QC.
[00378] As shown in FIG. 23, the Epic peak caller is used. The number of peaks and the peak lengths are recorded. A shift to more peaks, shorter peaks, or a combination thereof occurs in early stage cancer.
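Recording the number of peaks and the peak lengths, as described for FIG. 23, can be sketched as below. The sketch assumes peaks are available as (chromosome, start, end) tuples; this is an assumed representation and not the output format of the Epic peak caller. The coordinates and the `summarize_peaks` name are hypothetical.

```python
from statistics import median


def summarize_peaks(peaks):
    """Return the peak count and median peak length for (chrom, start, end) tuples."""
    lengths = [end - start for _chrom, start, end in peaks]
    return {
        "n_peaks": len(lengths),
        "median_length": median(lengths) if lengths else 0,
    }


# Hypothetical peak sets for one healthy volunteer and one early stage CRC profile.
hv_peaks = [("chr1", 100, 1100), ("chr1", 5000, 7000)]
crc_peaks = [("chr1", 100, 400), ("chr1", 900, 1300), ("chr2", 50, 550)]

print(summarize_peaks(hv_peaks))
print(summarize_peaks(crc_peaks))
```

With these made-up coordinates the CRC profile has more and shorter peaks, which is the direction of the shift the paragraph reports.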
[00379] As shown in FIG. 24, clear enrichment of 5-hmC in CRC patients is shown in a portion of the ZIC4 gene. ZIC1 also shows enrichment of 5-hmC in cancer samples. Tracks from top to bottom show: 10 profiles of HV cfDNA; 10 profiles of CRC cfDNA; average track over all HV cfDNA; average track over all CRC cfDNA patients; average track over all CRC stage 1-2 cfDNA patients; four tumor profiles; two technical replicates from gDNA of normal colon.
[00380] Sample cohort description
[00381] The final cohort for analysis is composed of 105 samples, distributed over ages 55-70, with ~60% of the cohort being female samples (demonstrated in FIG. 37A-C). Late stage cancers are all female samples (FIG. 36). The cohort overall is biased for age by clinical diagnosis and stage (Chisq test P-value=6.6E-04) with gender vs. age close to traditional statistical significance thresholds (Chisq test P-value=0.1195). FIG. 37A-C shows the distribution of the cohort based on three key variables - age, gender and cancer stage. CRC patients are significantly older than healthy volunteers (HV) (FIG. 37A). Age and gender are less biased (FIG. 37B) but there is a bias by gender and cancer stage (FIG. 37C).
[00382] Quality Control
[00383] Sample Balancing
[00384] Sample balancing is performed using the R package OSAT and demonstrates no bias in the allocation of DNA samples across the strip tubes going into the HMCP v2 workflow. Chi-square p-values for all desired variables are p>0.5, including clinical diagnosis and stage, sex, extraction operator, day of extraction and age. The distribution of samples across the 14 strip tubes is shown (FIG. 38A-D). The only alterations to the desired balancing are to move empty wells to the end of the strip tube.
[00385] FIG. 38A-D shows results of the OSAT sample balancing analysis based on key variables across the 14 strip tubes needed for the HMCP v2 workflow. Each bar of the histogram represents one strip tube processed in the workflow. Each of the plots shows, for strip tubes 1-14, how well balanced each tube is for cancer stage, gender, extraction operator and day of extraction. No strip is found to be unbalanced based on chi-square tests.
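The chi-square checks used to assess balance across strip tubes can be illustrated with a minimal sketch of the Pearson chi-square statistic for a contingency table. This is not the OSAT implementation, which is an R package; the function name and the counts are hypothetical, and the sketch assumes every row and column total is nonzero. A p-value would be obtained by comparing the statistic with a chi-square distribution on the returned degrees of freedom.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    Rows are strip tubes and columns are levels of one variable (e.g. HV vs. CRC).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if allocation were independent of the variable.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical allocations: three perfectly balanced tubes, then two segregated tubes.
balanced = [[4, 4], [4, 4], [4, 4]]
segregated = [[8, 0], [0, 8]]
print(chi_square_statistic(balanced))
print(chi_square_statistic(segregated))
```

A statistic near zero, as for the balanced table, corresponds to the large p-values reported for the strip tube allocation.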
[00386] Laboratory Metrics
[00387] No bias is identified in the lab metrics based on Qubit, Bioanalyser (BA), clinical diagnosis or extraction operator (chi-square test all p-values >0.3) and a good correlation (Pearson's correlation R2=0.994) is achieved between the Qubit and BA results (FIG. 39A-E). No association is found between the ng of starting material in the pBGT and the cohort meta-data (FIG. 40A-B) (chi-square minimum p-value 0.287 by age group). No biases are identified based on the library operator, the strip tube or the Nextflex adapters (FIG. 41 - chi-square test X2=1805, p-value=0.328). However, an imbalance in the HMCP operators across the runs is identified (FIG. 42A-C - chi-square test X2=139, p-value<2.2e-16).
[00388] FIG. 39A-E shows assessment of the quantity of DNA (concentration and yield) achieved by DNA extraction based on both Qubit and the Bioanalyser (BA) by key cohort metadata and extraction operator. Qubit chi-square tests - Stage: chi-square X2=205, p-value=0.3361, Extraction operator: chi-square X2=197, p-value=0.505. BA chi-square tests - Stage: chi-square X2=210, p-value=0.3718, Extraction operator: chi-square X2=210, p-value=0.371. A good correlation between the two methods is achieved (Pearson's correlation R2=0.994, p-value < 2.2e-16).
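The agreement between the two quantification methods is summarized by a Pearson correlation. The following minimal sketch computes the correlation coefficient and its square from paired measurements; the concentration values are invented for illustration and are not data from the study, and the `pearson_r` name is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical paired concentrations (ng/ul) for the same four extracts.
qubit = [0.5, 1.0, 1.5, 2.0]
bioanalyser = [0.55, 1.05, 1.45, 2.1]

r = pearson_r(qubit, bioanalyser)
print("r =", r, "R2 =", r ** 2)
```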
[00389] FIG. 40A-B shows association of total mass (ng) of cell free DNA (cfDNA) that went into the library preparation stage (denoted conv_ng) with sex and cancer stage. No bias is identified (Sex: chi-square X2=16.4, p-value=0.354; Stage: chi-square X2=28.433, p-value=0.54; Age Groups: chi-square X2=33.8, p-value=0.287).
[00390] FIG. 41 shows assessment of DNA quantity included in the workflow based on the NextFlex adapter for inputs. No biases are identified (all inputs: chi-square test for NextFlex adapters and operators p-value=1; NextFlex adapters and ng/ul input cfDNA p-value=0.342). The NextFlex adapters contain the library indexes needed for sequencing, which are well balanced across the operators.
[00391] FIG. 42A-C shows balancing of operators and runs by diagnosis and gender. From FIG. 38A-D, extraction operators are well balanced over runs (chi-square test X2=3.2, p-value=0.92), as are the HMCP operators by diagnosis stage (chi-square test X2=2.29, p-value=0.89) and by gender (chi-square test X2=1.53, p-value=0.673). However, HMCP operators are imbalanced across runs (chi-square test X2=139, p-value<2.2e-16), which is further assessed in FIG. 47A-F.
[00392] Post-sequencing Metrics
[00393] An association between the sequencing metrics and the starting material (in ng) of cfDNA used in the experimental workflow is identified (FIG. 43A-D), with several significant correlations identified, particularly for input. A higher de-duplicated read count is achieved for the input samples than for the pBGT libraries. This measure exceeded the expectations for the HiSeq4000 (60M fragments per sample). Input and pBGT libraries show similar distributions across the cohort meta-data (FIG. 44A-D), and no effect of the biological or technical variables tested is identified (chi-square tests, all p-values >0.4). Little variation is identified based on run or operator. The quantification of the spike-in sequences shows no difference based on the operator or run (FIG. 45A-F). A small difference in the diversity of the inputs is identified based on the run (FIG. 46A-D).
[00394] FIG. 43A-D shows the association identified between the quantity of input cfDNA and the sequencing metrics, including the diversity, uniformity, total de-duplicated reads (bamstats mapped reads) and the mitochondrial genes RPKM.
[00395] The conv_ng is the total mass (ng) of cfDNA that went into the library prep step. Data are shown for both the input and pulldown (pBGT). Pearson correlation is performed between these metrics. Input: conv_ng and diversity - R2=0.49, p-value=1.35e-07; conv_ng and uniformity - R2=0.55, p-value=1.52e-10; conv_ng and de-duplicated read count - R2=0.43, p-value=4e-06; conv_ng and mitochondrial reads - R2=-0.11, p-value=0.27. pBGT: conv_ng and diversity - R2=0.13, p-value=0.15; conv_ng and uniformity - R2=0.329, p-value=0.0008; conv_ng and de-duplicated read count - R2=0.229, p-value=0.0186; conv_ng and mitochondrial reads - R2=-0.07, p-value=0.456.
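By way of non-limiting illustration, the Pearson correlation between the cfDNA mass entering library preparation and a sequencing metric can be sketched as follows. The function name and the example values are hypothetical; the sketch returns the correlation coefficient r, which can be negative, and does not compute the associated p-value.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values: cfDNA mass (ng) and a diversity score per sample
conv_ng = [2.0, 4.0, 6.0, 8.0, 10.0]
diversity = [0.30, 0.42, 0.41, 0.55, 0.60]
print(round(pearson_r(conv_ng, diversity), 3))
```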
[00396] FIG. 44A-D shows histograms and boxplots of the de-duplicated sequencing reads. The de-duplicated read count is based on bamstats mapped reads and is a paired end read count. For the number of fragments this number can be halved. Greater read count is achieved for the input samples over the pBGTs and both reached or exceeded the expectations of the sequencing depth. Inputs and pBGTs show similar distributions across the cohort meta-data. Chi-square tests are performed for each (as well as Age Group) and the minimum p-value identified is 0.436 for all tests. Run464 had a slight variation in the number of de-duplicated reads but this is not significantly different to the other runs.
[00397] FIG. 45A-F shows assessment of spike-ins by clinical diagnosis and HMCP operator. Spike-in levels are shown as log2(ratios) for both the input and the pBGT. Ratio 2hmC vs. mCpC is the ratio of hmC control reads divided by the sum of mC and C control reads. Ratio 2mC vs. hmCpC is the ratio of mC control reads divided by the sum of hmC and C control reads. The figures show that the hmC reads are specifically enriched in the pBGT and not in the input, and that the mC reads are rare in the pBGT. No significant difference is identified in the 2hmC:mCpC ratio or the 2mC:hmCpC ratio based on clinical diagnosis or the HMCP operator. Chi-square tests: Input - 2hmC ratio vs. Stage X2=161, p-value=0.57; 2hmC ratio vs. HMCP operator X2=311, p-value=0.46; 2mC ratio vs. Stage X2=210, p-value=0.49; 2mC ratio vs. HMCP operator X2=315, p-value=0.49. pBGT - 2hmC ratio vs. Stage X2=315, p-value=0.44; 2hmC ratio vs. HMCP operator X2=210, p-value=0.49; 2mC ratio vs. Stage X2=263, p-value=0.253; 2mC ratio vs. HMCP operator X2=162, p-value=0.58.
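By way of non-limiting illustration, the two spike-in ratios defined above can be computed from control read counts as in the following sketch. The function name and the read counts are hypothetical.

```python
import math


def spike_in_log2_ratios(hmc_reads, mc_reads, c_reads):
    """Return (log2 of hmC / (mC + C), log2 of mC / (hmC + C))
    for the spike-in control read counts of one library."""
    ratio_hmc = hmc_reads / (mc_reads + c_reads)
    ratio_mc = mc_reads / (hmc_reads + c_reads)
    return math.log2(ratio_hmc), math.log2(ratio_mc)


# Hypothetical pBGT library: hmC controls dominate, mC controls are rare
print(spike_in_log2_ratios(hmc_reads=800, mc_reads=100, c_reads=100))
```

A positive first value together with a negative second value is the pattern expected for a pulldown library in which hmC control reads are enriched and mC control reads are rare.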
[00398] ANOVA tests are performed for input and pBGT libraries (separately) against key technical variables (shown in FIG. 46A-D) and biological variables (age, clinical diagnosis and gender - FIG. 76A-76L). The ANOVA tests revealed no significant differences between diversity, uniformity, run and HMCP operator for the pBGT library. Mitochondrial RPKM is associated with sequencing run and HMCP operator (sequencing run p-value 5.11e-06, HMCP operator p-value 0.053) for the pBGT library. For the pBGT library biological variables, only uniformity and stage are significantly associated (p-value 0.0138). For input libraries, uniformity and diversity are significantly associated with sequencing run (p-value 0.00156 and 2.85e-05, respectively). The input library mitochondrial RPKM is significantly associated with age groups (p-value 0.045). In addition, only diversity score and sex are associated for the input library (p-value = 0.08).
[00399] FIG. 46A-D shows assessment of the diversity, uniformity and mitochondrial reads based on the run, operator and clinical diagnosis. Some variation is identified in the mitochondrial RPKMs for both input and pulldown (pBGT).
[00400] Feature Definition and Selection
[00401] Two main genomic feature types are used for secondary analysis: gene bodies defined by the Gencode v.25 GRCh38 annotation set and enhancer regions defined by the Genehancer regions from the Genecards database. For the following analysis, features that do not obtain more than 30 reads per feature in all samples are excluded, resulting in 22377 genes and 16643 genehancers. Features that are largely invariant are then excluded by requiring a coefficient of variation > 0.2 (over all samples), and features that have high variability are excluded by requiring a coefficient of variation < 0.8. This feature set is referred to as the "Top Varying" set in the following text (composed of 3104 genes and 1323 Genehancers).
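By way of non-limiting illustration, the two-step feature filter described above (a minimum read count in every sample followed by a coefficient of variation window) can be sketched as follows. The function names, the feature names and the values are hypothetical. As simplifying assumptions, the sketch uses the population standard deviation and applies both filters to the same table of values, whereas the read-count and variability filters may be applied to differently normalised data in practice.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (0.0 when the mean is zero)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def select_top_varying(counts, min_reads=30, cov_low=0.2, cov_high=0.8):
    """Keep features with more than min_reads in every sample and a
    coefficient of variation strictly between cov_low and cov_high."""
    selected = []
    for feature, values in counts.items():
        if min(values) <= min_reads:
            continue  # fails the per-sample read count requirement
        if cov_low < coefficient_of_variation(values) < cov_high:
            selected.append(feature)
    return selected


# Hypothetical per-sample values for four features
counts = {
    "GENE_A": [100, 200, 300, 400],  # moderate variation: kept
    "GENE_B": [100, 101, 99, 100],   # largely invariant: dropped
    "GENE_C": [10, 500, 40, 60],     # a sample at or below 30 reads: dropped
    "GENE_D": [31, 31, 31, 2000],    # very high variability: dropped
}
print(select_top_varying(counts))
```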
[00402] Exploration of technical and biological variables
[00403] PCA is utilized to visually assess correlative structure in the datasets. Greater separation is observed between the biological variables of interest than between any of the technical variables over the first three principal components. In particular, there is no bias based on operator or sequencing run (FIG. 47A-F). No separation is noted by gender or age group (FIG. 48A-D). In comparison, some separation of the clinical diagnoses is observed on the top 3 PCA axes (FIG. 49A-F), with the clearest separation seen for the Genehancers. Separation by clinical diagnosis is also seen when more features are considered (read count threshold >30), as shown in FIG. 77A-77N.
[00404] FIG. 47A-F shows principal components from PCA by operators (shape = operator) who performed library preparation and pull down experiments demonstrating lack of clustering over the first 3 Principal Components. Plots based on the top varying Genes (FIG. 47A-C) and Genehancers (FIG. 47D-F).
[00405] FIG. 48A-D shows first two principal components of top varying regions by sex and age group (shape = subgroup). Plots based on the top varying Genes (FIG. 48A-B) and Genehancers (FIG. 48C-D). Limited clustering observed based on these biological variables.
[00406] FIG. 49A-F shows PCA using the top varying genes (N=3104) and genehancers (N=1323). Evidence for separation is shown between biological variables. Particularly, separation on PC2 for gene bodies and PC3 for genehancers is shown.
[00407] Statistical Identification of Discriminating Features
[00408] Features are tested for discriminatory power between biological conditions using both gene bodies and genehancers. Over 300 genes are identified in the CRC and HV patient comparison using a Mann-Whitney U (MWU) test at a multiple testing corrected (Benjamini-Hochberg, BH) p-value of less than 0.05 (FIG. 50). This dropped to ~100 features for the Early CRC and HV comparison, which is likely due to the lower sample size for this comparison. PCA plots of the top 20 features demonstrate a separation and clustering of the stage comparisons between CRC and HV (FIG. 78A-D - FIG. 81A-D). Discriminating features between early and late stage cancer at an adjusted p-value of < 0.05 are not identified. Several of the top gene candidates for the CRC vs HV comparison are given in FIG. 51. For the full list of significant candidates from both the top varying and all gene comparisons, see FIG. 87, FIG. 90-93, and FIG. 119-121. Boxplots of the top 6 discriminating genes between CRC and HV demonstrate the level of separation in each feature (FIG. 52A-F). Boxplots of the top 6 genehancer features display similar levels of separation (FIG. 82A-F). Boxplots of the top 6 features discriminating between early CRC and HV, and late CRC and HV can be found in the figures (Genes: FIG. 83A-F and FIG. 84A-F, Genehancers: FIG. 85A-F and FIG. 86A-F).
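By way of non-limiting illustration, a per-feature Mann-Whitney U test followed by Benjamini-Hochberg adjustment can be sketched as follows. The function names and values are hypothetical. As simplifying assumptions, the two-sided p-value uses the normal approximation without tie or continuity correction, so it will differ slightly from exact implementations, particularly for small groups.

```python
import math


def rank_with_ties(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U statistic for group_a, two-sided normal-approximation p-value)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = rank_with_ties(list(group_a) + list(group_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    return u1, math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical enrichment values for one feature in CRC and HV samples
u_stat, p_value = mann_whitney_u([5, 6, 7, 8], [1, 2, 3, 4])
print(u_stat, p_value)
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Features whose adjusted p-value falls below the chosen threshold (for example 0.05) would be retained as discriminatory candidates.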
[00409] As there is an imbalance in the cohort, the importance of age group and gender is investigated for these top features. The predictive power of age groups can be found in FIG. 87; however, only two genes in the top 20 from the CRC vs. HV comparison had an AUC>0.7. Furthermore, box plots of these features by age group and clinical diagnosis demonstrate limited impact of age on these top features (FIG. 88A-F - genes, FIG. 89A-F - genehancers). Two examples of clear genomic discrimination in the HV and CRC samples are shown in FIG. 53A-B (ZIC4) and FIG. 54A-B (SIX1). It is concluded that there are on the order of at least 100s of genes that may serve as candidates for discriminatory markers between CRC and HV patients.
[00410] FIG. 50 shows number of discriminatory features identified at several FDR thresholds. Many discriminating features are found for CRC vs. HV and early CRC vs. HV comparisons at an FDR<0.01.
[00411] FIG. 51 shows top 20 discriminatory genes ranked by adjusted p-value for the CRC vs HV comparison (Mann-Whitney U test). For each gene, its specific prediction power in terms of AUC is computed.
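By way of non-limiting illustration, the prediction power of a single gene can be expressed as an AUC by counting, over all CRC/HV sample pairs, how often the CRC value exceeds the HV value. The description does not state how the per-gene AUC is computed, so this pairwise formulation, the function name and the values below are assumptions.

```python
def single_feature_auc(case_values, control_values):
    """AUC of one feature: fraction of (case, control) pairs in which the
    case value is higher, counting ties as one half."""
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Hypothetical enrichment values for one gene
crc = [2.1, 2.8, 3.0, 1.4]
hv = [1.0, 1.5, 1.2, 2.0]
print(single_feature_auc(crc, hv))
```

An AUC of 0.5 corresponds to no discrimination, and values approaching 1.0 indicate that the feature is consistently higher in CRC than in HV samples.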
[00412] FIG. 52A-F shows boxplots of the 6 top ranked genes by p-value from CRC vs HV comparison (top varying genes), all of which show an increased level of 5hmC enrichment in CRC over HV.
[00413] FIG. 53A-B shows 5hmC Enrichment Profile of ZIC4 and ZIC1 genes showing increased levels of 5hmC in CRC. FIG. 53A shows demonstration of 5hmC enrichment on the genomic level in patient CRC profiles. 5hmC enrichment is localised around the 4th exon of ZIC4. The gene ZIC1, to the right, also shows enrichment of 5hmC in the cancer samples. The genome browser tracks from top to bottom include: Average track over all HV cfDNA; Average track over all CRC cfDNA patients; Three tumour profiles; and gDNA of normal colon. FIG. 53B shows ZIC4 and ZIC1 summarised over the gene body in boxplot form by CRC stage.
[00414] FIG. 54A-B shows 5hmC Enrichment Profile of SIX1 gene showing increased levels of 5hmC in CRC. FIG. 54A shows demonstration of 5hmC enrichment on the genomic level in patient CRC profiles. Tracks from top to bottom include: Average track over all CRC stage 1-2 cfDNA patients; Three tumour profiles; and gDNA of normal colon. FIG. 54B shows SIX1 levels summarised over the gene body and plotted in boxplot form by CRC stage.
[00415] Validation of these top feature lists is performed by subsetting the cohort into two equally sized groups, performing the MWU tests (genes and genehancers) and comparing the ranks between the two groups using a Wilcoxon signed-rank test. This resulted in an average p-value of 0.72 and an average 30% intersection between the features with an FDR<0.05 across the different filtering levels (read count threshold and top varying) (FIG. 90). Only the CRC vs. HV (female only) top varying genes comparison resulted in a p-value <0.4.
[00416] In addition to these tests, the DESeq2 method is used on just the pBGT library counts. This method permits the inclusion of covariates such as operator and age. The DESeq2 approach produces more features reaching statistical significance (FIG. 91). Rank comparison (Wilcoxon signed-rank test) of the DESeq2 approach to the MWU test is performed across the disease types. DESeq2 is run using combinations of age, gender, HMCP operator and the sequencing run as covariates. FIG. 92 includes results for CRC vs. HV and early CRC vs HV comparisons for gene level tests. From these results it is observed that once the same features are considered by both tests, the numbers identified with an FDR<0.1 are similar, with a high level of overlap. However, the addition of the covariates does influence the rankings of the results, particularly for the early CRC comparison (based on a p-value<0.05). For genehancer features, no significant difference in the rank is identified (FIG. 93), with a minimum p-value of 0.266 from all the tests. Boxplots of the top 6 hits from the DESeq2 tests with age and gender can be seen in FIG. 94A-F and include FIGN, C2CD4C, INHBB and ZIC4, which are also seen in the top 20 from the top varying gene list comparison.
[00417] Biological Relevance
[00418] Genes identified as discriminatory between CRC and HV patients are assessed for functional relevance using the GeneCards suite. Genes with adjusted p-values lower than 0.05 (367 genes) are selected and submitted to the disease association algorithm in the GeneAnalytics component of the GeneCards database. The disease type "Colorectal Cancer" had the highest score (FIG. 55), which is calculated based on the number of matching colorectal cancer associated genes. The basis of this matching score is verified colorectal associated genes, of which a number of the matching example genes are listed in FIG. 56. The likelihood of this result being due to database bias is tested by submitting 20 random gene sets of the same feature set size (in order to be conservative in the analysis, the random lists are selected from the top varying gene set). None of the randomised gene sets obtained a higher score, and in 3/20 submissions CRC ranked 1st (but with lower scores). This lends evidence that the genes identified as discriminatory between CRC and healthy patients in cfDNA are enriched for colon cancer specific markers (many of which are derived from tumour-normal studies). In this dataset 11 genes matched the Genecards differential gene set data in both colon and blood based studies. Discriminatory genes are further analyzed with the Genecards VarElect component, which finds that ~50 of the genes have evidence for a direct connection to CRC (see FIG. 57) and a further ~150 have an indirect connection with CRC. Using different filtering criteria may enhance the ability to distinguish the cancer type. FIG. 58 shows the same disease association analysis on a wider gene list not subject to variance filtering, which leads to a more specific ranking to four cancer types, with CRC scoring substantially higher than the second ranked cancer type.
[00419] FIG. 55 shows a disease association table from the GeneAnalytics component of the GeneCards suite, for genes with an adjusted p-value < 0.05 in CRC vs HV comparison. CRC is the top hit for the gene list.
[00420] FIG. 56 shows genes in the CRC vs HV set that are identified as differentially expressed in tissue samples in CRC.
[00421] FIG. 57 shows top 20 genes directly associated with CRC using the VarElect component of the Genecards database. CRC related terms are top hits in this analysis.
[00422] FIG. 58 shows disease association table from the GeneAnalytics component of the GeneCards suite, for genes with an adjusted p-value < 0.05 in CRC vs HV comparison using the All-genes list which does not apply a filter based on co-efficient of variation. CRC and other cancers are the top hits for the gene lists.
[00423] Further analysis of the top hits (for all subgroups and comparisons) is performed using GOseq, an R package for gene ontology analysis using biological pathways. Both the top 50 features and those with an FDR<0.05 are tested, and separated into under-enriched and over-enriched based on the direction of the mean change in the RPKM enrichment ratio. GO terms with a p-value <0.001 are taken forward into the next round of analysis and the top 20 biological pathways are plotted as a histogram of the -log10(p-value). The under-enriched pathways in all-stage comparisons of CRC and HV are predominantly immune related (average ~13/20) and the over-enriched pathways are related to metabolism and biosynthetic processes for early CRC and all CRC comparisons. Interestingly, for late CRC vs. HV comparisons the over-enriched pathways are related to morphogenesis and development. Examples of the analysis are shown in FIG. 59A-B, which is based on genes with an FDR<0.05 from early CRC vs. HV (both genders), and FIG. 60A-B, late CRC vs HV (female only). One naive explanation for these results is that immune system related DNA fragments are displaced by tumour DNA fragments in the cancer patient. Another explanation is that the immune cell population of cancer patients shifts to cell populations that are less likely to be 5hmC enriched.
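By way of non-limiting illustration, the selection of GO terms for plotting (a p-value cut-off followed by ranking on -log10(p-value)) can be sketched as follows. The function name, the GO term labels and the p-values are hypothetical; the enrichment analysis itself is performed with the R package GOseq, not with this code.

```python
import math


def top_pathways(go_p_values, p_cutoff=0.001, top_n=20):
    """Keep GO terms with p-value below p_cutoff and return up to top_n
    (term, -log10(p-value)) pairs, most significant first."""
    kept = [
        (term, -math.log10(p))
        for term, p in go_p_values.items()
        if p < p_cutoff
    ]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_n]


# Hypothetical GO term p-values
go_p_values = {
    "immune response": 1e-8,
    "metabolic process": 1e-4,
    "cell adhesion": 0.01,  # above the cut-off, so not plotted
}
for term, score in top_pathways(go_p_values):
    print(term, round(score, 2))
```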
[00424] To summarize, the functional analysis conducted demonstrates that 5hmC cfDNA profiles reflect changes that are specific to the patient condition. In particular, 5hmC enriched DNA fragments from known colon cancer oncogenes are overrepresented in patient cfDNA profiles when compared to HV patients. Furthermore, diagnostic potential may be aided by both tumour impact in cfDNA along with immune cell population changes.
[00425] FIG. 59A-B shows histograms of the top 20 Biological pathways from GO analysis of genes with an FDR<0.05 from early CRC vs. HV MWU results (both genders, with read count filtering). Under-enriched pathways are predominantly immune related (FIG. 59A) and over-enriched pathways are predominantly metabolism related (FIG. 59B).
[00426] FIG. 60A-B shows histograms of the top 20 Biological pathways from GO analysis of genes with a FDR<0.05 from late CRC vs. HV MWU results (females only). Under-enriched pathways are immune related (FIG. 60A) and over-enriched pathways are related to adhesion, morphogenesis and development (FIG. 60B).
[00427] Predictive Potential of the Dataset
[00428] A cross-validation approach is utilized to demonstrate the predictive potential within the entire dataset. Two phases of classifier development are performed: Phase I. Development of a Support Vector Machine (SVM) and a Logistic Regression (LR) model under a cross-validation strategy that filters out features that are invariant or correlated with age or sex. Phase II. Addition of a Recursive Feature Elimination (RFE) strategy to the Logistic Regression approach to reduce the feature set to the top 20 best performing features.
[00429] Phase I: Classifier Development with Cross-Validation
[00430] An SVM is built using 6-fold cross-validation including variance filtering and chi-square tests to assess the importance of covariates (age and/or gender) for the retained features. In 6-fold cross-validation the samples are split into 6 groups, with five used for training and one for testing. This is repeated for all 6 training set/test set permutations, and an average of the performance measures over the six runs is computed as an estimate of the predictive performance of the dataset. Feature selection occurs within each cross-validation fold, which means that the coefficient of variation (COFV) is calculated on the training set and can vary for each subgroup of samples. Those features that pass the COFV threshold are then tested for associations with age groups and/or gender using a chi-square test. Features with a p-value greater than a chosen threshold are retained and the model is trained and tested on these features. Each model has been built for both gene and genehancer feature sets and for each sample group (e.g., stage and gender comparisons).
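By way of non-limiting illustration, the cross-validation scheme described above, in which feature selection is repeated inside each fold using only the training samples, can be sketched as follows. All names are hypothetical. As simplifying assumptions, folds are assigned by sample index without shuffling or stratification, the chi-square covariate filter is omitted, and the classifier is represented by a caller-supplied `fit_and_score` function rather than an actual SVM.

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k=6):
    """Yield (train, test) index lists; sample i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


def cov_filter(rows, train, threshold=0.2):
    """Indices of features whose COFV over the training rows exceeds threshold."""
    keep = []
    for j in range(len(rows[0])):
        column = [rows[i][j] for i in train]
        m = mean(column)
        if m and pstdev(column) / m > threshold:
            keep.append(j)
    return keep


def cross_validate(rows, labels, fit_and_score, k=6):
    """Average the score returned by fit_and_score over k folds, with
    feature selection performed on the training samples of each fold."""
    scores = []
    for train, test in k_fold_indices(len(rows), k):
        features = cov_filter(rows, train)
        scores.append(fit_and_score(rows, labels, train, test, features))
    return mean(scores)


# Hypothetical data: feature 0 is constant, feature 1 varies across samples
rows = [[5.0, float(i + 1)] for i in range(12)]
labels = [i % 2 for i in range(12)]


def count_selected(rows, labels, train, test, features):
    """Stand-in scorer that reports how many features survived the filter."""
    return len(features)


print(cross_validate(rows, labels, count_selected))
```

Selecting features inside each fold keeps the held-out samples from influencing which features the model sees, so the averaged score is a less optimistic estimate than one obtained by filtering on the full dataset first.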
[00431] The feature selection criteria are tested at multiple thresholds. A COFV > 0.2 is chosen to remain consistent with the MWU test, and features that are not significantly associated with age or sex (p-value > 0.25) are retained. Most of the feature selection is due to the COFV threshold, with <20%, and in some cases 0%, of features being removed by the addition of the chi-square tests for age and gender. Using this approach, the average area under the curve (AUC) measure from the receiver operating characteristic (ROC) curves is > 0.8 for all comparisons (see FIG. 61A-E and FIG. 62A-E). ROC curves are the result of the mean over the 6-fold cross-validations. [Previous iterations of these models are implemented with variance filtering instead of the COFV filtering, which is a stricter filter. The ROC curves for these models can be found in FIG. 95A-E and FIG. 96A-E; while the numbers of features in the two different models are somewhat different, similar AUCs and permutation scores resulted.]
[00432] This analysis is also repeated with a logistic regression (LR) classifier including the same feature filtering methods. The ROC curves for this analysis can be seen in FIG. 97A-97E for the gene models and S19 for the genehancers. Again the ROCs show that these models have performed well, with AUCs > 0.80 for genes and 0.79 for genehancers. Similarly, LR classifiers with variance filtering are also implemented, the results of which can be found in FIG. 99A-E and FIG. 100A-E; while the numbers of features included in the models are reduced, very similar AUCs and permutation scores are achieved.
[00433] FIG. 61A-E shows ROC curves for SVM classifiers built on gene data for disease comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve and permutation test (PT) p-values. The ROC curve achieved during each cross-validation (CV) is shown in light grey. All classifiers show high performance levels with AUCs>0.8.
[00434] FIG. 62A-E shows ROC curves for classifiers built on genehancer data for disease comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve and permutation test (PT) p-values. The ROC curve achieved during each cross-validation (CV) is shown in light grey. All classifiers show high performance levels with AUCs>0.8.
[00435] Phase II: Recursive Feature Elimination with the Logistic Regression (LR) Model
[00436] To the above mentioned variance filtering and chi-square tests for age and sex covariates, a recursive feature elimination (RFE) method is added within the cross-validation step. The number of features to select within the model is varied (10, 20, 50 and 100), and in this report RFE models are included that select the top 20 most informative features. For each cross-validation the features are recorded so that a comparison across the cross-validations can be performed. A summary of the genes chosen during the RFE across the 6-fold cross-validation can be found in FIG. 138 alongside their corresponding p-values from the MWU tests. Several genes are identified in greater than 50% of the cross-validations (CRC vs HV : 12; CRC vs. HV: 9), including C2CD4C, INHBB and TMEM200B, all of which are in the top 10 for the CRC vs. HV comparison (top varying genes). The ROC curves for the LR RFE models with 20 features are shown in FIG. 63A-E (genes) and FIG. 64A-E (genehancers). As a result of this analysis the gene based models still performed well with AUCs>0.82; however, some of the genehancer models are reduced in performance (AUC>0.70). Additional models are created with a variance filter, the results of which can be found in FIG. 101A-E and FIG. 102A-E.
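By way of non-limiting illustration, the comparison of RFE-selected features across cross-validation folds can be sketched as a count of how many folds selected each feature. The function name and the placeholder feature names are hypothetical, and the fold contents are invented for the example rather than taken from the study.

```python
from collections import Counter


def frequently_selected(selected_per_fold, min_fraction=0.5):
    """Return features chosen in more than min_fraction of the folds,
    sorted by name. Each fold is a list of selected feature names."""
    counts = Counter(
        feature for fold in selected_per_fold for feature in set(fold)
    )
    n_folds = len(selected_per_fold)
    return sorted(
        feature
        for feature, count in counts.items()
        if count / n_folds > min_fraction
    )


# Invented selections from a 6-fold cross-validation
selected_per_fold = [
    ["GENE_A", "GENE_B"],
    ["GENE_A", "GENE_C"],
    ["GENE_A", "GENE_B"],
    ["GENE_A", "GENE_B"],
    ["GENE_B"],
    ["GENE_A", "GENE_D"],
]
print(frequently_selected(selected_per_fold))
```

Features that recur in most folds are less likely to reflect the composition of a single training subset and are therefore stronger candidates for a final signature.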
[00437] FIG. 63A-E shows ROC curves for LR RFE classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (where appropriate). ROC curves are annotated with the mean area under the curve. All ROC curves show models built with 20 features. All classifiers show high performance levels with AUCs>0.8.
[00438] FIG. 64A-E shows ROC curves for LR RFE classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including COFV filtering (>0.2) and chi-square test (p-value>0.25) for age and/or gender (where appropriate). ROC curves are annotated with the mean area under the curve. All ROC curves show models built with 20 features. All classifiers show high performance levels with AUCs>0.7.
[00439] Classification Performance on an Independent Test Set
[00440] Having demonstrated the predictive performance of the entire dataset using cross-validation approaches, the performance of a classifier developed from this cohort is assessed on an independent test set. To approach this, the dataset is split into 2 partitions: ¾ of the samples are utilized to build a classifier, and the remaining ¼ are used as an independent test set (see FIG. 65 and FIG. 67 for the sample distribution).
[00441] LASSO classifier development
[00442] LASSO regression based classifiers are built (see "Use of LASSO regression for CRC and HV state prediction" section for description of LASSO model) to distinguish all stage CRC from HV, and early stage CRC from HV. As before, separate classifiers are built using 5hmC enrichment in gene bodies and enhancer regions.
[00443] Each classifier is trained using cross-validation (All Stage vs HV: FIG. 66A-B; Early Stage vs HV: FIG. 71A-B) and the performance assessed on the independent test set (All Stage vs HV: FIG. 67 and FIG. 68; Early stage CRC vs HV: FIG. 72 and FIG. 73). PCAs showing the ability of genes with non-zero weights to separate the CRC and HV can be seen in FIG. 69A-B (genes) and FIG. 70A-B (genehancers).
[00444] As expected, the feature sets are reduced substantially to a final informative set after training the LASSO regression classifier. For all-stage cancer vs HV, 56 genes and 59 genehancers are retained, while for early stage vs HV, 40 genes and 25 genehancers are retained. Comparing the genes with non-zero weight in both the CRC vs HV and early CRC vs HV classifiers, 13 shared genes are found (FIG. 74).
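By way of non-limiting illustration, the comparison of the two trained classifiers can be sketched as an intersection of the features that retain a non-zero weight in each. The function names, feature names and weights below are hypothetical and are not taken from the trained models.

```python
def non_zero_features(weights):
    """Names of features whose fitted weight is not exactly zero."""
    return {name for name, weight in weights.items() if weight != 0.0}


def shared_features(weights_a, weights_b):
    """Sorted names of features with non-zero weight in both models."""
    return sorted(non_zero_features(weights_a) & non_zero_features(weights_b))


# Hypothetical fitted weights for two classifiers
all_stage_weights = {"GENE_A": 0.8, "GENE_B": 0.0, "GENE_C": -0.3, "GENE_D": 0.1}
early_stage_weights = {"GENE_A": 0.5, "GENE_B": 0.2, "GENE_C": -0.6, "GENE_D": 0.0}

print(shared_features(all_stage_weights, early_stage_weights))
```

Because the L1 penalty of a LASSO model drives the weights of uninformative features to exactly zero, testing for a non-zero weight is sufficient to recover the retained feature set.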
[00445] It is also assessed whether the computed LASSO classifiers are affected by: (i) the composition of the train and test datasets, (ii) the age of the volunteer involved in the study. Results show that these cannot be considered as confounding factors in the training process.
[00446] Two further independent datasets are also used to assess the performance of the model. The first dataset, containing 21 samples (7 CRC and 14 HV), is generated with an early version of HMCP v2 technology. When the CRC-HV classifier is applied to these samples good prediction results are obtained (AUC of 0.806). Applying the threshold learnt from the cross-validation training process (0.36), all 7 CRC samples are correctly classified (sensitivity=1); however, only half of the HV samples are correctly identified (specificity=0.5). The second dataset is generated using HMCP v1 technology (87 samples: 40 CRC and 47 HV). The dataset is split into two different groups because it showed operator bias (Group 1 containing 43 samples: 15 CRC - 28 HV, and Group 2 containing 44 samples: 25 CRC - 19 HV). The classifier showed good prediction performance on this dataset as well (AUC of 0.817 and 0.752 for Group 1 and 2 respectively, FIG. 117A-J). However, the difference in the sequencing technology (and likely also the issues related to the reliability of the signal of this dataset) had a negative effect on the sensitivity using the threshold of 0.36 established on the v2 training data (see "Use of LASSO regression for CRC and HV state prediction" section). Results on these two independent datasets demonstrate that the LASSO model, based on 56 genes, has good potential for classification of CRC; however, tuning of the classification threshold for the technology platform differences may lead to improvements in predictive accuracy.
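By way of non-limiting illustration, applying a fixed classification threshold to classifier scores and summarising the result as sensitivity and specificity can be sketched as follows. The function name is hypothetical, and the scores are invented so that the counts match the 7 CRC and 14 HV samples described above; they are not the scores produced by the model. Treating a score equal to the threshold as a CRC call is also an assumption.

```python
def sensitivity_specificity(scores, labels, threshold=0.36):
    """Return (sensitivity, specificity) when samples scoring at or above
    threshold are called CRC. Labels are 1 for CRC and 0 for HV."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_crc = score >= threshold
        if label == 1:
            if predicted_crc:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_crc:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Invented scores: 7 CRC samples, then 14 HV samples
crc_scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.45, 0.40]
hv_scores = [0.10, 0.15, 0.20, 0.25, 0.30, 0.32, 0.35,
             0.37, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
scores = crc_scores + hv_scores
labels = [1] * len(crc_scores) + [0] * len(hv_scores)

print(sensitivity_specificity(scores, labels))
print(sensitivity_specificity(scores, labels, threshold=0.66))
```

Raising the threshold trades sensitivity for specificity, which is why re-tuning the threshold for a different technology platform may improve predictive accuracy.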
[00447] In conclusion, evidence is found for the potential to develop predictive biomarkers using cfDNA and an appropriate model of relevant genomic loci.
[00448] FIG. 66A-B shows performance of LASSO regression model on Genes (AUC 0.883) and Genehancers (AUC 0.937). Final model results in 56 features using genes and 59 features using genehancers (3-fold cross-validation is used in the training process). All classifiers show high performance levels with AUCs>0.85.
[00449] FIG. 67 shows a summary of cross validation results using a LASSO regression model on gene features.
[00450] FIG. 68 shows a summary of independent test set performance using a LASSO regression model on gene features.
[00451] FIG. 69A-B shows a PCA based on the list containing the 56 genes having non-zero weight in the computed LASSO model, which also supports the evidence that these features can be effectively used to distinguish between CRC and HV samples (FIG. 64A-E). PCs 1-3 are shown.
[00452] FIG. 70A-B shows a PCA based on the list containing the 59 genehancers having non-zero weight in the computed LASSO model, which also supports the evidence that these features can be effectively used to distinguish between CRC and HV samples (FIG. 64A-E). PCs 1-3 are shown.
[00453] FIG. 71A-B shows performance of LASSO regression model on Genes (AUC = 0.951) and Genehancers (AUC = 0.884) for early CRC vs HV classification. Final model results in 40 features using genes and 24 features using genehancers. 3-fold cross-validation is used. All classifiers show high performance levels with AUCs>0.85.
[00454] FIG. 72 shows a summary of cross validation results using a LASSO regression model on gene features for early CRC vs HV.
[00455] FIG. 73 shows a summary of cross validation results using a LASSO regression model on genehancer features for early CRC vs HV.
[00456] FIG. 74 shows the genes with non-zero weight shared between the early CRC vs HV and CRC vs HV classifiers. This table reports the rank of each gene, with the boxed genes having negative weight (e.g., MRPS31P2 is the gene with the most negative weight in both classifiers).
[00457] Summary
[00458] The experiment is designed to avoid operator bias and batch effects by actively balancing samples over operators, which has worked well, even with an eventual imbalance in the operators due to staff illness. The methods used to balance samples in the project can be considered as standard protocol for future projects.
[00459] Due to the availability of samples the cohort is imbalanced with respect to age and gender. The potential contribution of this bias in classifier development has been assessed and managed.
[00460] The imbalance in the HMCP operators is assessed at the data QC stage as well as the exploratory data analysis stage. Low numbers of features are associated with technical factors, and minimal overlap with features associated with biological variables is found. A decision may be made not to perform any correction for this.
[00461] All samples passed based on post sequencing QC metrics, giving 105 samples in downstream analysis. A more thorough and automated QC metric assessment may be needed to speed up this process.
[00462] Exploratory analysis revealed separation based on clinical diagnosis (an improvement over the HMCP 150 dataset) with minimal clustering based on technical factors or additional biological variables such as gender and age groups.
[00463] The level of significance of the features (genes and genehancers) has also improved from the HMCP150 project, with 100s of features identified with an FDR<0.05, depending on the comparison. However, no differences are identified between early and late CRC based on these significance thresholds.
[00464] The features identified between CRC and HV are very similar to those identified in early CRC vs. HV, however with different rankings.
[00465] Genomic-level evidence has been found for several of the genes, showing clear enrichment in cfDNA of CRC patients over HV cfDNA profiles and, in some cases, good correspondence to regions that appear enriched in CRC tumours.
[00466] Functional analysis of these top lists revealed many links to known cancers and immune related pathways. Several of the key genes have been previously identified as key CRC genes. Disease association databases identify the set of discriminatory genes as likely indicative of CRC above other diseases.
[00467] Predictive potential of the dataset is encouraging, with cross-validation using logistic regression models achieving ~0.90 AUC.
[00468] The addition of a recursive feature elimination step within the cross-validation is able to produce models with 20 features that show similar performance (AUC>0.8) to those using the COFV and covariate filtering.
[00469] Classification on an independent test set using a LASSO model performs well for both all CRC vs. HV and early CRC vs. HV.
[00470] The methods may further include: additional samples for validation of the signatures (additional cancers, diseases and increasing the current sample cohort), gDNA from tumour and normal tissue to aid the understanding of the tumour circulating DNA and immune cell profiling to better the understanding of the 5hmC profiles of blood cell types. In some cases, a sample may be assayed for one or more biomarkers, wherein the sample is a cell-free DNA sample obtained from a blood cell.
[00471] QC, Exploratory Analysis, Feature Discrimination and Predictive Performance
[00472] FIG. 75A-75E. MULTIQC PLOTS - Insert size as calculated by the Picard software suite. Run461 to Run465 represent the different sequencing batches. No untoward insert size anomalies are found.
[00473] FIG. 76A-76L: Additional QC plots. A-F) Uniformity and Diversity scores by library preparation strategy (conv ng) assessed over technical and biological variables. G-H) Results of iCNA show a mismatch in predicted gender and % tumour fraction predictions. I) - L) are metrics from the deeptools plotFingerprint utility that summarise a diagnostic plot that gives an overview of aspects of genomic coverage. Both pBGT and input samples behave as expected, pBGTs expected to have higher elbow/inflection points, lower AUC and higher x-intercept. No difference is observed by operator.
[00474] FIG. 77A-77N: PCA of samples using features (genes (FIG. 77A-C), genehancers (FIG. 77D-F)) that have passed the read count thresholds (>30 reads in input and pBGT) and are filtered by the coefficient of variation (>0.2 & <2). The variance explained by each principal component for the gene and genehancer sets is given in FIG. 77G-H, demonstrating that the majority of the variance is accounted for by the first three to four principal components. FIG. 77I-N give plots for genes (FIG. 77I-K) and genehancers (FIG. 77L-N) with only the read count thresholds applied (>30 reads in the input and pBGT).
[00475] FIG. 78A - 78D: PCAs of the top 20 discriminating/ranked genes for each of the patient subgroups as determined by the MWU test. Results are based on the MWU tests of the read count threshold feature lists (>30 reads in both pBGT and input). Clustering by diagnosis is evident based on the top 20 features alone.
[00476] FIG. 79A - 79D: PCAs of the top 20 discriminating/ranked genehancers for each of the patient subgroups. Results are based on the MWU tests of the read count threshold feature lists (>30 reads in both pBGT and input). Clustering by diagnosis is evident based on the top 20 features alone.
[00477] FIG. 80A - 80D: PCAs of the top 20 discriminating/ranked genes for each of the patient subgroups. Results are based on the MWU tests of the top varying features (>30 reads in both pBGT and input plus coefficient of variation >0.2 & <2). Clear separation between CRC and HV samples is demonstrated.
[00478] FIG. 81A - 81D: PCAs of the top 20 discriminating/ranked genehancers for each of the patient subgroups. Results are based on the MWU tests of the top varying features (>30 reads in both pBGT and input plus coefficient of variation >0.2 & <2). Clustering by diagnosis is evident based on the top 20 features alone.
[00479] FIG. 82A - 82F: Boxplots of the top 6 discriminating genehancers demonstrating separation between CRC and HVs (top varying list). Increased levels of 5hmC are found for CRC over HV for these top 6 genehancers.
[00480] FIG. 83A - 83F: Boxplots of the top 6 discriminating genes demonstrating separation between early CRC and HVs (top varying list). Increased levels of 5hmC are found for early CRC over HV for these top 6 genes.
[00481] FIG. 84A - 84F: Boxplots of the top 6 discriminating genes demonstrating separation between late CRC and HVs (top varying list). The majority of the top 6 genes show increased levels of 5hmC for late CRC over HV.
[00482] FIG. 85A - 85F: Boxplots of the top 6 discriminating genehancers demonstrating separation between early CRC and HVs (top varying list). Increased levels of 5hmC are found for early CRC over HV for these top 6 genehancers.
[00483] FIG. 86A - 86F: Boxplots of the top 6 discriminating genehancers demonstrating separation between late CRC and HVs (top varying list). Increased levels of 5hmC are found for late CRC over HV for these top 6 genehancers.
[00484] FIG. 87: Prediction score (in terms of AUC) of the top 20 most discriminating genes (top-varying comparison) between CRC and HV based on age groups. Those with a score > 0.7 are highlighted in red. The top 20 genes do not show any clear prediction power for these three age groups.
[00485] FIG. 88A - 88F: Boxplots of the top 6 discriminating genes demonstrating separation between CRC and HVs by age group and diagnosis (top varying list). Enrichment levels are fairly similar across the age groups for HVs, while increased 5hmC levels are evident in the CRC patients.
[00486] FIG. 89A - 89F: Boxplots of the top 6 discriminating genehancers demonstrating separation between CRC and HVs by age group and diagnosis (top varying list). Enrichment levels are fairly similar across the age groups for HVs, while increased 5hmC levels are evident in the CRC patients.
[00487] FIG. 90: Rank comparison between random subgroups of patients (50:50 split). Test to see if the same top genes come up in both subgroups. RC = read count threshold. TV= top varying. The majority of the comparisons show no statistical difference in rank between the subgroups.
[00488] FIG. 91: Summary of DESeq2 results with covariates. A high number of features are identified as significantly discriminatory based on the default DESeq2 threshold of <0.1 adjusted p-value.
[00489] FIG. 92: DESeq vs. MWU rank comparison tests - Genes. Gender and age have a stronger effect in the early CRC comparisons. P-value from the rank comparison test <0.05 are highlighted in red. The addition of the covariates makes the most difference for the early CRC vs. HV comparison.
[00490] FIG. 93: DESeq vs. MWU rank comparison tests - Genehancers. Gender and age have little effect on the rank comparisons. The addition of any covariates does not significantly affect the rank of the discriminating genehancer lists, with approximately ¾ of genehancers identified by both methods (DESeq2 and MWU tests).
[00491] FIG. 94A - 94F: Top 6 genes ranked by DESeq2 test between CRC and HV including age and gender as covariates. Many of these genes (4/6) are also in the top 6 for the MWU test.
[00492] FIG. 95A - 95E: Receiver operator characteristic (ROC) curves for SVM classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including variance filtering (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve, and the permutation test p-value is reported for each model. All classifiers achieved a mean AUC of 0.8 or above.
[00493] FIG. 96A - 96E: Receiver operator characteristic (ROC) curves for SVM classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including variance filtering (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve, and the permutation test p-value is reported for each model. All classifiers achieved a mean AUC of 0.76 or above.
[00494] FIG. 97A - 97E: ROC curves for logistic regression classifiers built on gene data for disease comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including coefficient of variation filtering (>0.2) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the ROC curves from each cross-validation round (light grey) and the mean area under the curve (green). All mean ROC curves show good performance with a minimum mean AUC of 0.83.
[00495] FIG. 98A - 98E: Receiver operator characteristic (ROC) curves for logistic regression classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including coefficient of variation filtering (>0.2) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the ROC curves from each cross-validation round (light grey) and the mean area under the curve (green). All mean ROC curves show good performance with a minimum mean AUC of 0.79.
[00496] FIG. 99A - 99E: Receiver operator characteristic (ROC) curves for logistic regression classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including a variance filter (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
[00497] FIG. 100A - 100E: Receiver operator characteristic (ROC) curves for logistic regression classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including a variance filter (>0.1) and chi-square tests (p-value>0.25) for age and/or gender (if mixed population). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
[00498] FIG. 101A - 101E: ROC curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on gene data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test (p-value>0.25) for age and/or gender (where appropriate) followed by RFE (20 features retained). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
[00499] FIG. 102A - 102E: ROC curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on genehancer data for CRC vs. HV comparisons (varied by stage and gender). All classifiers are built using 6-fold cross-validation including variance filtering (>0.1) and chi-square test (p-value>0.25) for age and/or gender (where appropriate) followed by RFE (20 features retained). ROC curves are annotated with the mean area under the curve. The mean ROC curves show that even with a stricter variance threshold the models can still perform well.
[00500] FIG. 103A - 103B: Receiver operator characteristic (ROC) curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on gene data for age groups (<61 and >61) comparisons. All classifiers are built using 6-fold cross-validation including variance filtering (>0.1 and chi-square test for gender where appropriate, p-value>0.25) plus recursive feature elimination to 20 features. ROC curves are annotated with the mean area under the curve. Age classifiers perform worse than disease classifiers.
[00501] FIG. 104A - 104B: Receiver operator characteristic (ROC) curves for Logistic Regression (LR) Recursive Feature Elimination (RFE) classifiers built on genehancer data for age groups (<61 and >61). All classifiers are built using 6-fold cross-validation including variance filtering (>0.1 and chi-square test for gender where appropriate, p-value>0.25) plus recursive feature elimination to 20 features. ROC curves are annotated with the mean area under the curve. Age classifiers perform worse than disease classifiers.
[00502] Use of LASSO regression for CRC and HV state prediction
[00503] The LASSO classification is based on 3 steps:
[00504] (1) Variance-based feature filtering; (2) Model training and feature selection based on LASSO regression; and (3) Assessment of the performance on an independent dataset.
[00505] In order to perform the analysis, both a training dataset (used in the learning process), and a test dataset (used to assess the performance of the inferred model) are needed. In order to have a good size for both datasets, the original dataset is split in: (i) ¾ for training; and (ii) ½ for testing.
[00506] The analysis is performed by using Python and the sklearn library (the script with its input and output files can be found in DNAnexus project HMCP0110 folder Scripts/LASSO).
[00507] Variance-Based Feature Filtering. The aim of this step is to filter out all low-variance features. This step looks only at the features (it does not take the labels into account). In particular, all features with a variance lower than a specified threshold are removed. In order to be as conservative as possible, only the features that have the same value in all samples are removed. This filtering step removed (i) 45 genes in the gene dataset; and (ii) 6 genehancers in the genehancer dataset.
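The filtering step described above can be sketched as follows. This is a minimal standard-library illustration, not the script used in the analysis: the function name `variance_filter`, the toy matrix, and the feature names are hypothetical. With a threshold of zero, only features that take the same value in every sample are dropped, and the labels are never consulted.

```python
from statistics import pvariance

def variance_filter(rows, feature_names, threshold=0.0):
    """Keep only features whose variance across samples exceeds `threshold`.

    rows: list of samples, each a list of feature values.
    Labels are not an input: the filter is unsupervised.
    """
    columns = list(zip(*rows))  # one tuple of values per feature
    keep = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    filtered = [[row[j] for j in keep] for row in rows]
    return filtered, [feature_names[j] for j in keep]

# Hypothetical toy matrix: 3 samples x 3 features; "geneB" is constant.
rows = [[1.0, 5.0, 0.2], [2.0, 5.0, 0.1], [3.0, 5.0, 0.4]]
names = ["geneA", "geneB", "geneC"]
filtered_rows, kept_names = variance_filter(rows, names)
```

In sklearn, `VarianceThreshold` from `sklearn.feature_selection` provides a comparable zero-variance filter; whether the original script used it is not stated in the text.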
[00508] Model Training and Feature Selection. LASSO regression is used to train the model. This is a linear model that estimates sparse coefficients and it is useful to obtain solutions with fewer parameter values (it reduces the number of variables upon which the given solution is dependent). Mathematically, it consists of a linear model trained, where the objective function to minimize is:
[00509] $$\min_{w} \frac{1}{2\, n_{\text{samples}}} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1$$
[00510] The lasso estimate thus solves the minimization of the least-squares penalty with $\alpha \lVert w \rVert_1$ added, where $\alpha$ is a constant and $\lVert w \rVert_1$ is the $\ell_1$-norm of the parameter vector.
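To make the objective concrete, the sketch below minimizes it by cyclic coordinate descent with soft-thresholding, written in plain Python. It is an illustrative stand-in under stated assumptions, not the sklearn solver used in the analysis: there is no intercept, the function names are hypothetical, and the toy data are invented. It shows how the L1 term drives the weight of a weakly informative feature exactly to zero.

```python
def soft_threshold(value, alpha):
    """Shrink `value` toward zero by `alpha` (proximal step for the L1 penalty)."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0

def lasso_objective(X, y, w, alpha):
    """(1 / (2 n)) * ||Xw - y||_2^2 + alpha * ||w||_1."""
    n = len(X)
    residual_sq = sum(
        (sum(xi * wi for xi, wi in zip(row, w)) - yi) ** 2 for row, yi in zip(X, y)
    )
    return residual_sq / (2 * n) + alpha * sum(abs(wi) for wi in w)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize the objective above one coordinate at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for row, yi in zip(X, y):
                pred_without_j = sum(row[k] * w[k] for k in range(p) if k != j)
                rho += row[j] * (yi - pred_without_j)
            rho /= n
            z = sum(row[j] ** 2 for row in X) / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w

# Hypothetical toy data: y depends on the first feature only.
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [2.0, 4.0, 6.0, 8.0]
alpha = 0.1
w = lasso_coordinate_descent(X, y, alpha)
```

On this toy input the second weight is exactly zero and the first is slightly shrunk below 2, which is the sparsity behaviour the text relies on for feature selection.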
[00511] The use of LASSO regression has two main advantages: (i) It simultaneously performs training and feature selection providing a sparse solution; and (ii) It associates a weight to each feature, in this way one can have an idea of the most important features.
[00512] A 3-fold cross validation approach is utilized on the training dataset. The regression is performed on the normalized version of the training dataset (subtracting the mean and dividing by the L2-norm).
[00513] As expected, LASSO regression significantly reduced the number of features (FIG. 92 and FIG. 93 contain the detailed list of gene and genehancers respectively and shown in FIG. 105A - 105B): (i) 56 genes with non-zero weight (from the initial list of 56,788 genes); and (ii) 59 genehancers with non-zero weight (from the initial list of 218,117 genehancers).
[00514] The trained model had the following prediction scores: (i) 0.975 for the gene-based model; and (ii) 0.988 for the genehancer-based model.
[00515] FIG. 105A - 105B. LASSO weights for gene and genehancer datasets. Only non-zero elements are reported.
[00516] Assessment of the performance on the independent test dataset. The performance of the classifier is assessed on the test dataset obtaining the following results: (i) AUC of 0.833 for the gene-based model and (ii) AUC of 0.937 for the genehancer-based model.
[00517] FIG. 28 shows the ROC for both models.
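For reference, the AUC values reported above can be understood as the probability that a randomly chosen CRC sample receives a higher model score than a randomly chosen HV sample. The sketch below computes that quantity directly from pairwise comparisons. It is a minimal illustration with hypothetical labels and scores (1 = CRC, 0 = HV), not the evaluation code used in the analysis.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores: one CRC sample (0.3) is ranked below one HV sample (0.4).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]
auc_value = roc_auc(labels, scores)
```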
[00518] Assessment of the performance on independent external datasets. Two datasets coming from external studies are utilized to assess the performance of the model.
[00519] The first dataset, containing 21 samples (7 CRC and 14 HV; all samples are female and all CRC samples are earlyCRC), is generated by using an early version of HMCP v2. FIG. 117A - 117J shows the ROC for this dataset (AUC = 0.806), and also the sensitivity/specificity of the classifier when the threshold learnt from the cross-validation process (0.36) is used to distinguish between CRC and HV (sensitivity=1, specificity=0.5). Given that all the CRC samples are earlyCRC, the earlyCRC-HV classifier is also applied to this dataset (FIG. 118A - 118B), obtaining an AUC of 0.643, with sensitivity=0.85 and specificity=0.28.
[00520] The second dataset is generated with an older version of the CEGX 5hmC genome-wide profile technology. This dataset contains 87 samples: 40 CRC (21 early-stage and 19 late-stage) and 47 HV. The dataset is split into two groups because it shows operator bias: Group1 contains a total of 43 samples, 15 CRC (7 early-stage and 8 late-stage) and 28 HV; Group2 contains 44 samples, 25 CRC (14 early-stage and 11 late-stage) and 19 HV. The gene-based CRC vs HV model is tested. FIG. 117A - 117J shows that, although these datasets are obtained by using a different technology and present some issues in terms of reliability of the observed signal, the classifier shows good prediction performance: AUC of 0.817 and 0.752 on Group1 and Group2, respectively. When the classification threshold of 0.36 is applied to distinguish between CRC and HV, sensitivity=0.533 and specificity=0.786 are obtained for Group1, and sensitivity=0.12 and specificity=0.89 for Group2.
[00521] Results in this section highlight the potential of the 56-gene LASSO model for the classification of CRC and HV samples (this is also suggested by the PCA shown in FIG. 117A - 117J); however, the different sequencing technology used for these external datasets poses a problem related to the tuning of the classification threshold.
[00522] LASSO classifier for early CRC vs HV. This classifier is trained by using the same approach used for CRC vs HV, and FIG. 106 shows the 40 genes having non-zero weight in the classifier. The main part of the document (FIG. 31, FIG. 72 and FIG. 73) shows the performance of this classifier and compares the 13 genes with non-zero weight shared between the CRC vs HV and early CRC vs HV classifiers (FIG. 74).
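The sensitivity and specificity figures quoted for a fixed classification threshold can be computed as in the sketch below. The labels and scores are hypothetical (1 = CRC, 0 = HV), and treating a score at or above the threshold as a CRC call is an assumption; the text does not state which side of the threshold is inclusive.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called CRC."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
    fn = sum(1 for l, s in zip(labels, scores) if l == 1 and s < threshold)
    tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < threshold)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical scores for 4 CRC and 4 HV samples, evaluated at the 0.36 threshold.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.2, 0.5, 0.3, 0.1, 0.05]
sens, spec = sensitivity_specificity(labels, scores, threshold=0.36)
```

Because the threshold is fixed in advance, a shift in the score distribution between technologies changes both values even when the ranking of samples, and hence the AUC, is preserved.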
[00523] PCA is also performed to test the prediction power of the classifier based on:
[00524] The 40 non-zero genes in the early CRC vs HV classifier
[00525] The 13 non-zero genes shared between early CRC vs HV and CRC vs HV classifiers
[00526] Results shown in FIG.s S27-S29 suggest that the 40 genes can be used to split early CRC and HV samples. Focusing only on the list containing the 13 shared genes, prediction power drops for both CRC vs HV and early CRC vs HV.
[00527] FIG. 106. LASSO weights for genes in the early CRC vs HV classifier. Only nonzero elements are reported.
[00528] FIG. 107A - 107B. PCA performed on the 40 non-zero genes in the early CRC vs HV classifier. Results show a clear split between early CRC and HV samples on PC1.
[00529] FIG. 108A - 108B. PCA performed on the 13 non-zero genes shared between early CRC vs HV and CRC vs HV classifiers. The plots highlight early CRC and HV samples. Results show a clear split between early CRC and HV samples on PC1.
[00530] FIG. 109A - 109B. PCA performed on the 13 non-zero genes shared between early CRC vs HV and CRC vs HV classifiers. The plots highlight CRC and HV samples. Although the results show a clear split between CRC and HV samples, this separation is weaker than that observed before for early CRC and HV.
[00531] Is the composition of datasets a confounder for the results? In order to check whether the observed results are a consequence of the specific composition of the dataset, a permutation test is performed on the gene dataset in which: (i) the true labels of the whole dataset are randomly shuffled; (ii) the whole dataset is used to train the LASSO model by using the shuffled labels; and (iii) the model obtained is finally used on the original dataset with the correct labels to assess the prediction power of the classifier. FIG. 110 shows the distribution of the AUC obtained on 1,000 permutations. This plot shows an average predictive performance of ~0.5, which indicates a random classification model (as may be expected). On the other hand, the predictive performance of the LASSO classifier trained on the dataset with the correct labels is 0.993.
[00532] A further test determines whether the composition of the training and testing datasets might affect the observed performance. This test is performed on 1,000 different (random) splits of the original dataset (¾ of samples for training and ½ for testing) and the average AUC for each split is computed. In order to have fair splits, each training dataset is required to contain a proportion of HV in the range [0.45-0.55]. Results in FIG. 111A - 111B show that the median AUC on the 1,000 splits is very similar to the one obtained on the reference dataset.
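The three permutation steps can be sketched as below. To stay dependency-free, a class-centroid difference is used as a simple linear scorer standing in for the LASSO model, and the data are synthetic; both are assumptions made for illustration only. The point is the procedure: train on shuffled labels, evaluate against the true labels, and compare the resulting AUC distribution with the AUC obtained from the correctly labelled training.

```python
import random

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def centroid_direction(X, labels):
    """Stand-in 'training': difference between the two class centroids."""
    pos = [x for x, l in zip(X, labels) if l == 1]
    neg = [x for x, l in zip(X, labels) if l == 0]
    return [
        sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
        for j in range(len(X[0]))
    ]

def project(X, direction):
    """Score each sample by its projection on the trained direction."""
    return [sum(a * b for a, b in zip(x, direction)) for x in X]

rng = random.Random(0)
labels = [1] * 20 + [0] * 20  # synthetic: 20 "CRC" and 20 "HV" samples
X = [
    [rng.gauss(3.0 if l == 1 else 0.0, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)]
    for l in labels
]

# Reference: train and evaluate with the correct labels.
true_auc = roc_auc(labels, project(X, centroid_direction(X, labels)))

# Permutation test: train on shuffled labels, evaluate against the true labels.
permuted_aucs = []
for _ in range(300):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    direction = centroid_direction(X, shuffled)
    permuted_aucs.append(roc_auc(labels, project(X, direction)))
mean_permuted_auc = sum(permuted_aucs) / len(permuted_aucs)
```

With informative features, the correctly labelled model scores well above the permuted average, which hovers around 0.5; that gap is what the figure discussed above is meant to display.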
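The constrained random splitting can be sketched with simple rejection sampling: draw a random training set and keep it only if its HV proportion falls in the required range. The function name and the `max_tries` guard are hypothetical, the training fraction is left as a parameter (0.75 is used here purely for illustration), and the class counts of 57 CRC and 48 HV are taken from the age-group breakdown given later in the text.

```python
import random

def constrained_split(labels, train_fraction, hv_range=(0.45, 0.55),
                      rng=None, max_tries=10000):
    """Draw a random train/test split whose training HV share lies in hv_range."""
    rng = rng or random.Random()
    n = len(labels)
    n_train = round(n * train_fraction)
    indices = list(range(n))
    for _ in range(max_tries):
        train = rng.sample(indices, n_train)
        hv_share = sum(1 for i in train if labels[i] == "HV") / n_train
        if hv_range[0] <= hv_share <= hv_range[1]:
            train_set = set(train)
            test = [i for i in indices if i not in train_set]
            return sorted(train), test
    raise RuntimeError("no split satisfied the HV proportion constraint")

# 105 samples: 57 CRC and 48 HV.
labels = ["CRC"] * 57 + ["HV"] * 48
train_idx, test_idx = constrained_split(labels, 0.75, rng=random.Random(1))
train_hv_share = sum(1 for i in train_idx if labels[i] == "HV") / len(train_idx)
```

Repeating the call with different seeds yields the population of splits over which a median AUC can be computed.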
[00533] The composition of the dataset, as well as the composition of the train/test datasets may not affect the performance of the prediction process. In some cases, the training dataset may affect the performance of the prediction process.
[00534] FIG. 110. Performance of the LASSO model trained on 1,000 independent permutations of the labels of the original dataset. As expected, the average AUC for the permutation test is 0.5 (random classification).
[00535] FIG. 111A - 111B. AUCs on 1,000 different splits of the original dataset. Reference Split indicates the train/test datasets used in the main analysis described in the document. Results on the reference split are very similar to the median obtained on the 1,000 splits, suggesting that this split does not over/under train the model.
[00536] Analysis of the volunteer's age as a confounding factor. The ages of the volunteers in the dataset are not uniformly distributed. Importantly, CRC patients are significantly older than the HV (FIG. 112). Although age is a risk factor for cancer, the age bias in the dataset may lead to overweighting age in the analysis. We decided to investigate whether this dataset bias might be an overriding confounding factor for the classification analysis. In order to perform the analyses described in this section, the volunteers are split into "young" (age <= 60 years) and "old" (age >= 61 years). An age of 61 is used as the young/old boundary because this value gives a fair partition of the volunteers (61 and 44 samples, respectively) while keeping enough HV in the oldest group (19 CRC - 42 HV in the young cohort, and 38 CRC - 6 HV in the old group).
[00537] The first investigation is based on PCA of the final feature sets (for both genes and genehancers) produced by the LASSO model. This is the same analysis summarized in FIG. 107A - 107B, but focused on the age of the volunteers. FIG. 113 and FIG. 114 show the first 5 components, where samples are given different shapes based on the volunteer's age. From the PCA, there is no clear separation between different ages (in contrast to what is observed for CRC vs HV, FIG. 107A - 107B).
[00538] The same analysis summarized in FIG. 111A - 111B (random split of the dataset in ¾ training and ½ testing) is performed, but this time the model is trained and tested on the volunteer's age (young vs old). FIG. 115A - 115C shows the results of this analysis based on 100 simulations. From this figure it is clear that no split enables training a model that classifies the volunteer's age well. This becomes more evident when these results are compared with those obtained by using the classifier for CRC-HV state (table in FIG. 111A - 111B). In addition, it is interesting that two genes used in the age classifier are also used for the classification of CRC-HV state (i.e., FIGN and IRX3). However, given the poor performance of the age-trained model when compared with the CRC-HV classifier, it is more likely that the CRC-HV state is a confounding factor for the age classifier, and not vice versa. Note that when the genehancer dataset is used to train the age model, it does not share any genehancer with the CRC-HV models. This suggests that the dataset cannot be used to train a model for age classification, and that the model obtained to distinguish between CRC and HV is not related to the volunteer's age. Results presented in this section may reject the hypothesis that the volunteer's age represents a confounding factor for the classifier.
[00539] Assess the robustness of the training process. A very important point of the analysis relates to the robustness of the training process and, consequently, to the robustness of the features having non-zero weight in the model. Indeed, it is important to highlight that the model may associate a non-zero weight with a gene/genehancer only because this improves the classification performance on the training dataset, even though this gene/genehancer has no prediction power on the testing dataset. In order to have an overview of the robustness of the training process, 200 random splits of the dataset into training and testing sets are generated as described in the previous section (¾ of the samples in the train and ½ in the test, imposing that the training dataset had to contain a proportion of HV in the range [0.45-0.55]).
[00540] The first observation relates to the variability in the number of non-zero genes found in each simulation: minimum=6, maximum=178, median=28, mode=16 and 20 (with 9 occurrences), with AUC ranging from 0.749 to 1 (median of 0.891), FIG. 116. FIG. 121 lists the genes found in at least 10% of the simulations. It is interesting to see at the top of this list genes that are contained in the CRC-HV classification model (in red), including the genes associated with the highest (FIGN) and lowest (MRPS31P2) weight. However, this simulation also enables the detection of new genes that can have an important role in the classification of CRC-HV (e.g., AHRR, RPS2P46, DSTN, NDUFA8, and C2CD4C, which is at the top of the list of the 20 most discriminating genes, FIG. 51). Similar results are observed for genehancers (FIG. 117A - 117J and FIG. 122).
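The stability summary described above (features retained in at least 10% of the simulations) amounts to counting how often each feature receives a non-zero weight across runs. The sketch below shows that tally with placeholder feature names and invented selections; the function name `stable_features` is hypothetical.

```python
from collections import Counter

def stable_features(selected_per_run, min_fraction=0.10):
    """Map each feature to the fraction of runs that selected it,
    keeping only features selected in at least `min_fraction` of runs."""
    n_runs = len(selected_per_run)
    counts = Counter(f for run in selected_per_run for f in set(run))
    return {
        feature: count / n_runs
        for feature, count in counts.most_common()
        if count / n_runs >= min_fraction
    }

# Hypothetical selections from 20 simulated training runs.
runs = (
    [["gene_a", "gene_b"]] * 16
    + [["gene_a", "gene_c"]] * 3
    + [["gene_a", "gene_d"]] * 1
)
frequencies = stable_features(runs)
```

Here `gene_d` is selected in only 1 of 20 runs (5%) and is therefore dropped, while features selected in most runs appear at the top of the result.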
[00541] Results presented in this section highlight a very important point: despite the fact that the Lasso classifier model used in the main analyses is able to obtain very good performance, the list of non-zero genes/genehancers (and their weights) may not be considered as a "final signature for CRC detection". The analysis presented in this section reveals a small instability of the results of the training process. However, at the same time, it is reassuring to see that the strongest genes of the model are also the ones showing the strongest stability (e.g., FIGN, RNF219, MRPS31P2). Increasing the size of the dataset may definitely help to obtain a more robust and stable classification model.
[00542] FIG. 112. Volunteer's age distribution. The percentage of CRC for each age is reported on the top of each bar. It is evident that most of HV samples are in the youngest cohort (shaded), and most of the CRC samples are in the oldest cohort (solid).
[00543] FIG. 113. PCA based on the 56 non-zero genes. The first 5 components are shown and samples are shaded based on the volunteer's age. PCA does not reveal any clear separation between different volunteer's ages.
[00544] FIG. 114. PCA based on the 59 non-zero genehancers. The first 5 components are shown and samples are shaded based on the volunteer's age. PCA does not reveal any clear separation between different volunteer's ages.
[00545] FIG. 115A - 115C. From left to right and top to bottom: AUCs on 100 different splits of the original dataset, where the model is trained and tested on the volunteer's age. The table reports the median AUC for the Age and CRC classifiers and the p-value resulting from the Mann-Whitney test. List of non-zero genes/genehancers in the LASSO model trained on the volunteer's age; in red, the genes shared between this model and the model trained on CRC-HV (no shared genehancers are found). Results reject the hypothesis that age can be a confounding factor in the training of the model.
[00546] FIG. 116. Distribution of the number of non-zero genes/genehancers found in the 200 simulations. Variability in the number of discriminating features is observed.
[00547] FIG. 117A - 117J. Performance of the CRC-HV gene-trained model on external datasets. The first row shows the ROC obtained by using the CRC-HV classifier, highlighting good accuracy in terms of prediction (21 Samples AUC = 0.806, Group1 AUC = 0.817 and Group2 AUC = 0.752). The second, third, and fourth rows show PCA for the 21 samples, Group1 and Group2, respectively (PC 1-3 are shown). The last row shows the sensitivity and specificity on these datasets when the threshold (0.36) learnt in the cross-validation process is used to classify CRC and HV.
[00548] FIG. 118A - 118B. Performance of the earlyCRC-HV gene-trained model on the external dataset containing 21 samples (all 7 CRC samples are earlyCRC). Results show AUC = 0.643, specificity = 0.85, and sensitivity = 0.28.
[00549] FIG. 119: List of the 56 non-zero genes in the Lasso classifier
[00550] FIG. 120: List of the 59 non-zero genehancers in the Lasso classifier
[00551] FIG. 121: List of the non-zero genes in the 200 simulations of the Lasso classifier. Only genes occurring in more than 10% of the simulations are reported. In red, the genes shared with the list containing the 56 non-zero genes in the CRC-HV classifier used in the main analyses (FIG. 92). The Mean/Min/Max weight columns contain the mean/min/max weight found in the 200 simulations (note that only non-zero weights are used in this calculation). The Training weight column contains the weight of the feature in the Lasso model used in the main analyses.
[00552] FIG. 122: List of the non-zero genehancers in the 200 simulations of the Lasso classifier. Only genehancers occurring in more than 10% of the simulations are reported. In red, the genehancers shared with the list containing the 59 non-zero genehancers in the CRC-HV classifier used in the main analyses (FIG. 93). The Mean/Min/Max weight columns contain the mean/min/max weight found in the 200 simulations (note that only non-zero weights are used in this calculation). The Training weight column contains the weight of the feature in the Lasso model used in the main analyses.
[00553] QC Measures Definitions
[00554] Uniformity Score
[00555] First, the genome is split into non-overlapping regions and the GC bias of each region is calculated. The genome is biased towards certain GC bias classes over others: for example, a GC bias of 40% is more common than a GC bias of 8%, so more regions may have a GC bias of 40%. If reads are scattered evenly across a genome, they are expected to fall into the GC bias classes according to how frequently those classes occur in the genome; e.g., more reads fall into regions with 40% GC bias than into the parts of the genome with 8% GC bias.
[00556] First compute the norm coverage:
[00557] Norm coverage = proportion of windows at a GC% / proportion of reads observed at a GC%
[00558] For even coverage, you expect Norm coverage to be a value around 1.
[00559] Then, for each GC bias class (0 to 100), compute the distance of the Norm coverage from 1; call this the diff from uniform.
[00560] Then multiply this value by the % of reads that fall in the class.
[00561] The sum of this over all GC bias classes is the GCBiasGlobalError.
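The steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used for the reported QC values. The function name, the use of an absolute difference as the "distance away from 1", and the choice to skip GC classes that received no reads (to avoid dividing by zero) are assumptions that do not appear in the text.

```python
def gc_bias_global_error(window_counts, read_counts):
    """Sketch of the GCBiasGlobalError described above.

    window_counts[g]: number of genomic windows in GC class g (normally 0..100).
    read_counts[g]:   number of reads observed in windows of GC class g.
    """
    total_windows = sum(window_counts)
    total_reads = sum(read_counts)
    error = 0.0
    for windows, reads in zip(window_counts, read_counts):
        if reads == 0:
            # Assumption: a class with no reads has zero read weight, so it is
            # skipped; this also avoids a division by zero.
            continue
        prop_windows = windows / total_windows
        prop_reads = reads / total_reads
        # Norm coverage as written in the text: windows proportion / reads proportion.
        norm_coverage = prop_windows / prop_reads
        # Assumption: "distance away from 1" is taken as an absolute difference.
        diff_from_uniform = abs(norm_coverage - 1.0)
        # Weight by the percentage of reads that fall in this class.
        error += diff_from_uniform * (100.0 * prop_reads)
    return error


# Perfectly even coverage gives an error of 0.
print(gc_bias_global_error([10, 30, 60], [100, 300, 600]))
```

Under these assumptions, reads that follow the window distribution exactly give an error of 0, and the error grows as reads concentrate in GC classes that are rare in the genome.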
[00562] In the reporting, this value is transformed into a score ranging roughly from 0 to 10, using the transformation defined in Table 4.
Table 4
GCBiasGlobalError    uniformcov score
350                  0
290                  1
220                  2
160                  3
100                  4
60                   5
30                   6
1                    7
8                    8
2                    9
0                    10
y = -228.8521 + (10.08137 - (-228.8521)) / (1 + (x / 957297.1)^0.3976487), where x is the GCBiasGlobalError and y is the uniformcov score.
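The curve given with Table 4 can be evaluated directly. The sketch below is a hedged illustration: the function name and the optional clamping to the 0 to 10 range are assumptions (the text only says the score falls "roughly" in that range), while the four constants are taken from the formula as printed.

```python
def uniformity_score(gc_bias_global_error, clamp=True):
    """Map a GCBiasGlobalError (x) to a uniformity score (y) with the Table 4 curve."""
    if gc_bias_global_error < 0:
        raise ValueError("GCBiasGlobalError is expected to be non-negative")
    lower = -228.8521      # value the curve approaches for very large errors
    upper = 10.08137       # value of the curve at an error of 0
    midpoint = 957297.1    # scale constant from the printed formula
    slope = 0.3976487      # exponent from the printed formula
    y = lower + (upper - lower) / (1.0 + (gc_bias_global_error / midpoint) ** slope)
    if clamp:
        # Assumption: scores are reported within the 0..10 range.
        y = max(0.0, min(10.0, y))
    return y


for error in (350, 100, 60, 0):
    print(error, round(uniformity_score(error), 2))
```

Evaluating the curve at errors of 100 and 60 gives scores close to 4 and 5, in line with the corresponding rows of Table 4.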
[00563] Additions
[00564] Addition 1 : Classifier Performance Improvements
[00565] The LASSO-based classifier shows better performance on the 21-sample dataset after a by-sample Z-score transformation (each sample is transformed so that it has mean 0 and standard deviation 1). The classification results are reported in FIG. 123.
[00566] FIG. 123. FIG. 123A shows the LASSO scores computed for the 21 samples. Gray and dark gray bars highlight CRC and HV samples, respectively. The horizontal red dotted line shows the optimal classification threshold inferred from the HMCP-110 dataset (0.091). FIG. 123B shows the ROC of the classification model on this dataset (AUC = 0.79). FIG. 123C is a table showing the CRC-HV prediction performance of the model when the inferred classification threshold is applied. Notably, increasing the threshold from 0.091 to 0.15 improves the specificity of the classifier (correctly identifying 12 of 14 HV and 5 of 7 CRC samples; sensitivity = 0.71 and specificity = 0.86).
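The sensitivity and specificity quoted for the 0.15 threshold follow directly from the counts. The Python sketch below reproduces that arithmetic. The scores are hypothetical placeholders chosen only so that 5 of 7 CRC samples and 12 of 14 HV samples are called correctly; they are not the values plotted in FIG. 123A, and the rule "score at or above the threshold is called CRC" is an assumption.

```python
def sensitivity_specificity(scores, is_crc, threshold):
    """Return (sensitivity, specificity) when score >= threshold is called CRC."""
    tp = fn = tn = fp = 0
    for score, crc in zip(scores, is_crc):
        called_crc = score >= threshold  # assumed decision rule
        if crc and called_crc:
            tp += 1
        elif crc:
            fn += 1
        elif called_crc:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores: 5 CRC above and 2 CRC below 0.15; 12 HV below and 2 HV above.
scores = [0.30] * 5 + [0.05] * 2 + [0.00] * 12 + [0.20] * 2
is_crc = [True] * 7 + [False] * 14

sens, spec = sensitivity_specificity(scores, is_crc, threshold=0.15)
print(round(sens, 2), round(spec, 2))  # 0.71 0.86
```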
[00567] Addition 2: Gene signatures determined via a Robust LASSO regression scheme. A) Brief description of the Robust LASSO regression scheme B) Resulting gene signatures and overlaps. A full description of the LASSO model parameters can be found in the appendix.
[00568] A)
[00569] 1. Training Phase
[00570] a) Select 75% of the data set randomly, and build a LASSO classifier using 3-fold validation, resulting in a gene signature and associated parameters.
[00571] b) Repeat a) 1,000 times (i.e. 1,000 random selections) and record those genes that are present in > 5% of all gene signatures, that is, in at least 50 of the gene models.
[00572] c) Create a meta classifier such that, for the gene features selected (those that occur in > 5% of all 1,000 gene signatures), the median of the gene feature weight is computed over the 1,000 instances (excluding zero instances). The final model thus consists of the gene features and the weight associated with each gene feature.
[00573] d) Using the list of gene features produced in c), a more stringent gene signature is determined by starting with the most informative gene feature, calculating the performance, and iteratively adding each gene feature in descending order of importance until performance plateaus. This procedure (termed forward selection) resulted in choosing gene features that occur in 10% or more of the random simulations.
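The aggregation of the repeated LASSO fits into a meta classifier can be sketched as below. This is an illustrative Python outline only: it assumes the individual signatures have already been produced by some external LASSO fit and are supplied as dictionaries mapping a gene feature to its weight. The function name and data layout are hypothetical. The text states the cutoff both as "> 5%" and as "at least 50" of 1,000 models; the sketch uses the inclusive reading.

```python
from collections import defaultdict
from statistics import median


def build_meta_classifier(signatures, min_fraction=0.05):
    """Combine repeated LASSO signatures into {gene: (frequency, median_weight)}.

    signatures: list of dicts, one per random selection, mapping gene -> weight.
    A gene that is missing or has weight 0 counts as not selected in that run.
    """
    nonzero_weights = defaultdict(list)
    for signature in signatures:
        for gene, weight in signature.items():
            if weight != 0.0:
                nonzero_weights[gene].append(weight)

    n_runs = len(signatures)
    meta = {}
    for gene, weights in nonzero_weights.items():
        frequency = len(weights) / n_runs
        # Assumption: inclusive cutoff ("at least 50 of the gene models").
        if frequency >= min_fraction:
            # Median over the non-zero instances only, as described in the text.
            meta[gene] = (frequency, median(weights))
    return meta


# Toy example with four runs and a 75% cutoff (gene names are placeholders).
runs = [
    {"GENE_A": 1.0, "GENE_B": 0.5},
    {"GENE_A": 2.0, "GENE_B": 0.0},
    {"GENE_A": 3.0},
    {"GENE_A": 0.0, "GENE_B": 0.7},
]
print(build_meta_classifier(runs, min_fraction=0.75))
```

The more stringent signatures can be obtained from the same structure by raising `min_fraction`, for example to 0.10, although the forward selection step that chose that cutoff is not reproduced here.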
[00574] 2. Testing Phase
[00575] i. Randomly select 50% of the samples in the dataset, classify each sample, and record the LASSO threshold at which the sum of the sensitivity and specificity is at a maximum.
[00576] ii. The process in i is repeated 1,000 times and the median threshold is chosen.
[00577] iii. This median threshold is then applied to several independent data sets:
[00578] 43 workflow version 1 samples
[00579] 21 workflow version 2 samples
[00580] 150 external publicly available samples (Li, 2017)
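The threshold selection in the testing phase can be illustrated with the following Python sketch. It assumes that classifier scores and true labels are already available, that candidate thresholds are the observed scores, and that a score at or above the threshold is called CRC; the function names, the fixed random seed and the handling of subsets that contain only one class are all assumptions made for this example.

```python
import random
from statistics import median


def best_threshold(scores, is_crc):
    """Threshold maximising sensitivity + specificity; None if a class is absent."""
    n_pos = sum(1 for crc in is_crc if crc)
    n_neg = len(is_crc) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None
    best_t, best_sum = None, -1.0
    for t in sorted(set(scores)):  # assumed candidates: the observed scores
        tp = sum(1 for s, crc in zip(scores, is_crc) if crc and s >= t)
        tn = sum(1 for s, crc in zip(scores, is_crc) if not crc and s < t)
        total = tp / n_pos + tn / n_neg
        if total > best_sum:
            best_t, best_sum = t, total
    return best_t


def median_threshold(scores, is_crc, n_repeats=1000, fraction=0.5, seed=0):
    """Median of the best thresholds found on repeated random subsets."""
    rng = random.Random(seed)
    indices = list(range(len(scores)))
    subset_size = max(1, round(fraction * len(indices)))
    thresholds = []
    for _ in range(n_repeats):
        subset = rng.sample(indices, subset_size)
        t = best_threshold([scores[i] for i in subset], [is_crc[i] for i in subset])
        if t is not None:  # assumption: single-class subsets are ignored
            thresholds.append(t)
    return median(thresholds)


# Hypothetical, well separated scores: 10 HV samples low, 10 CRC samples high.
demo_scores = [0.01 * i for i in range(10)] + [0.5 + 0.01 * i for i in range(10)]
demo_labels = [False] * 10 + [True] * 10
print(median_threshold(demo_scores, demo_labels, n_repeats=200))
```

Taking the median over many random subsets makes the chosen threshold less sensitive to any single split before it is applied to the independent data sets.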
[00581] The procedure described in 1 and 2 is conducted twice, once using the 5hmC enrichment values for each gene feature, and then again using a z-score transformation of the 5hmC enrichment values. Z-score normalisation is employed to aid generalising the classifier to independent test datasets that are profiled using variations on the HMCP experimental workflow used in developing the classifier. The more stringent gene signature for the z-score normalised case resulted in choosing gene features that occurred in 21% or more of the random simulations (step 1.d above).
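The by-sample z-score transformation, in which each sample is rescaled to mean 0 and standard deviation 1 across its gene features, can be sketched as follows. The function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which is used.

```python
from statistics import mean, pstdev


def zscore_sample(enrichment_values):
    """Rescale one sample's 5hmC enrichment values to mean 0 and standard deviation 1."""
    mu = mean(enrichment_values)
    sd = pstdev(enrichment_values)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(value - mu) / sd for value in enrichment_values]


# One hypothetical sample with four gene features.
print(zscore_sample([1.0, 2.0, 3.0, 4.0]))
```

Because the transformation is applied within each sample, it needs no information from other samples, which is consistent with its stated purpose of helping the classifier generalise to datasets produced with workflow variations.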
[00582] The four gene signatures reported from the LASSO model in this document are thus:
[00583] Non-normalised
[00584] "5% gene signature"
[00585] "10% gene signature"
[00586] Z-score normalized
[00587] "5% gene signature"
[00588] "21% gene signature"
[00589] In the appendix the model parameters are described in detail. Sheet "gene LASSO" provides the parameters for all genes that meet the 5% criterion as described above for both the z-score and non-z-score normalised data (see headings 5% CRC-HV - Z-Normalization and 5% CRC-HV - No Z-Normalization). The more stringent 10% (or 21% for z-score normalised data) gene signatures are subsets of these tables, which can be obtained by selecting all the genes above the 10% frequency (non-z-score, 49 genes) or 21% frequency (z-score normalisation, 27 genes) values.

Claims

WHAT IS CLAIMED IS:
1. A method comprising: (a) assaying a sample for a nucleotide sequence having at least 70% sequence homology to a biomarker listed in Table 1 to produce a result, wherein the sample is from a subject having cancer or suspected of having cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
2. The method of claim 1, wherein based on the comparing of (b) the sample is identified as benign or malignant for the cancer.
3. The method of any one of claims 1-2, wherein the nucleotide sequence has at least 85% sequence homology to the biomarker listed in Table 1.
4. The method of any one of claims 1-3, wherein at least five biomarkers listed in Table 1 or Table 2 are assayed in (a).
5. The method of any one of claims 1-4, wherein the assaying of (a) comprises assaying for a presence or an absence of an epigenetic modification.
6. The method of any one of claims 1-5, wherein the biomarker is a transcription factor.
7. A method comprising: (a) assaying a sample for a presence or an absence of an epigenetic modification in a nucleotide sequence having at least 70% sequence homology to a biomarker listed in Table 2 to produce a result, wherein the sample is from a subject having cancer or suspected of having cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
8. The method of claim 7, wherein based on the comparing of (b) the sample is identified as benign or malignant for the cancer.
9. The method of any one of claims 7-8, wherein at least five biomarkers listed in Table 1 or Table 2 are assayed in (a).
10. The method of any one of claims 7-9, wherein the biomarker comprises a transcription factor.
11. A method comprising: (a) assaying a cell-free DNA sample for a metabolic-related biomarker or an immune-related biomarker to produce a result, wherein the cell-free DNA sample is from a subject having cancer or suspected of having cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
12. The method of claim 11, wherein based on the comparing of (b) the cell-free DNA sample is identified as benign or malignant for the cancer.
13. The method of any one of claims 11-12, wherein the assaying of (a) comprises assaying for a presence or an absence of an epigenetic modification.
14. The method of any one of claims 11-13, wherein at least five biomarkers are assayed in (a).
15. The method of any one of claims 11-14, wherein the biomarker is a transcription factor.
16. A method comprising: identifying a presence or an absence of (i) an early stage colorectal cancer, (ii) a late stage colorectal cancer in a sample, wherein the identifying comprises assaying for a presence or an absence of an epigenetic modification in a nucleotide sequence of the sample to produce a result, wherein the sample is from a subject having cancer or suspected of having cancer.
17. The method of claim 16, wherein the nucleotide sequence has at least 70% sequence homology to a biomarker of Table 3.
18. The method of any one of claims 1, 7 or 11, wherein the result from (a) is input into a trained algorithm and the comparing of (b) is performed by the trained algorithm to classify the sample as benign or malignant for the cancer.
19. The method of any one of claims 5, 7, 13 or 16, wherein the presence or the absence of the epigenetic modification comprises a number of methylated sites in the biomarker.
20. The method of any one of claims 5, 7, 13 or 16, wherein the presence or the absence of the epigenetic modification comprises a number of hypo-hydroxymethylated loci, a number of hyper-hydroxymethylated loci, or a combination thereof in the biomarker.
21. The method of claim 18, further comprising (c) assaying the sample for a population of immune cells.
22. The method of claim 21, further comprising inputting the population of immune cells from (c) into the trained algorithm.
23. The method of claim 21 or claim 22, wherein the population of immune cells comprises more than one type of immune cell.
24. The method of claim 21 or claim 22, wherein the population of immune cells comprises a single type of immune cell.
25. The method of claim 18, wherein the trained algorithm classifies the sample as benign or malignant for the cancer at greater than about 90% sensitivity, greater than about 80% specificity, or a combination thereof.
26. The method of claim 25, wherein the trained algorithm classifies the sample as benign or malignant for the cancer at greater than about 90% sensitivity.
27. The method of claim 25, wherein the trained algorithm classifies the sample as benign or malignant for the cancer at greater than about 80% specificity.
28. The method of any one of claims 5, 7, 13 or 16, wherein the epigenetic modification comprises a 5-methylcytosine (5mC), a 5-hydroxymethylcytosine (5-hmC), a 5-formylcytosine (5-fC), a 5-carboxylcytosine (5-caC), or any combination thereof.
29. The method of claim 28, wherein the epigenetic modification comprises the 5-hmC.
30. The method of any one of claims 5, 7 or 13, wherein a loss in the epigenetic modification as compared to the control or the derivative thereof is indicative of the cancer.
31. The method of claim 30, wherein the epigenetic modification is the 5-hmC.
32. The method of any one of claims 1-31, wherein the subject is suspected of having the cancer.
33. The method of any one of claims 1-32, wherein said subject is asymptomatic for the cancer.
34. The method of any one of claims 1-33, wherein the subject has not previously been diagnosed with the cancer.
35. The method of any one of claims 1-34, wherein the cancer is colorectal cancer (CRC).
36. The method of any one of claims 2, 8, 12, or 18, wherein when the method identifies the sample as malignant for the cancer, the method further classifies the sample as representative of a stage of cancer.
37. The method of claim 36, wherein the stage of the cancer is stage I.
38. The method of any one of claims 2, 8, 12 or 18, wherein when the method identifies the sample as malignant for the cancer, the method further classifies the sample as representative of a subtype of cancer.
39. The method of claim 38, wherein the cancer is colon cancer and the subtype of the colon cancer is serrated adenoma or a tubular adenoma.
40. The method of claim 38, wherein the cancer is colon cancer and the subtype of the colon cancer is CMS1, CMS2, CMS3, or CMS4.
41. The method of any one of claims 1-10 or 16-40, wherein the sample comprises cell-free DNA.
42. The method of claim 41, wherein an amount of the cell-free DNA is from about 5 nanogram (ng) to about 15 ng.
43. The method of claim 41 or 42, wherein the sample further comprises a blood sample, a tissue sample, a fine needle aspirate sample, a fecal sample, or any combination thereof.
44. The method of any one of claims 1-43, wherein the sample is identified as benign for the cancer in an absence of the subject having a further medical procedure.
45. The method of claim 44, wherein the further medical procedure comprises: obtaining a biopsy from the subject, performing an imaging scan of the subject, or a combination thereof.
46. The method of any one of claims 18-45, wherein when the trained algorithm identifies the sample as benign, assaying a second sample from the subject to monitor a change over time in the result from (a).
47. The method of any one of claims 18-46, wherein the trained algorithm is trained using a training set of samples.
48. The method of any one of claims 18-47, wherein the training set of samples comprises cell-free DNA samples.
49. The method of any one of claims 18-48, wherein the training set of samples comprises cell-free DNA samples and genomic DNA samples.
50. The method of any one of claims 18-49, wherein the training set of samples comprises a sample having a sequence comprising a CpG island.
51. The method of any one of claims 18-50, wherein the training set of samples comprises a combination of malignant samples and benign samples.
52. The method of claim 5, 7 or 13, wherein the assaying of (a) comprises detecting the epigenetic modification.
53. The method of claim 52, wherein the detecting is by nanopore sequencing.
54. The method of claim 52, wherein the detecting is by high throughput sequencing.
55. The method of claim 52, wherein the detecting comprises associating a label with an epigenetic modification in a sequence of the sample to form a labeled sequence; hybridizing a substantially complementary strand to the labeled sequence; and amplifying the substantially complementary strand in a reaction in which the labeled sequence is substantially not present.
56. The method of claim 52, wherein the detecting comprises contacting the sample with an enzyme or a catalytically active fragment thereof that converts a methylated residue in the sample to a modified base.
57. The method of claim 52, wherein the detecting comprises labeling covalently, a hydroxyl group on a hydroxymethylated residue in the sample to generate a labeled hydroxymethylated residue; and sequencing the sample comprising the labeled hydroxymethylated residue or derivatives thereof.
58. The method of claim 52, wherein the detecting comprises contacting at least a portion of the sample with an enzyme that utilizes a labeled glucose or a labeled glucose-derivative donor substrate to add a labeled glucose molecule or a labeled glucose-derivative to an epigenetic modification in the sample to generate a labeled glucosylated-epigenetic modification.
59. The method of claim 52, wherein the detecting comprises adding a detectable label to the epigenetic modification.
60. The method of claim 59, wherein the detectable label comprises an antibody.
61. The method of any one of claims 52-60, wherein the detecting is by a method comprising a fluorescence resonance energy transfer (FRET) assay, an enzyme-linked immunosorbent assay (ELISA), a liquid chromatography-mass spectrometry (LCMS) assay, or any combination thereof.
62. The method of any one of claims 52-61, wherein the detecting comprises adaptor ligation.
63. The method of claim 1, wherein the control or derivative thereof is from a subject having cancer, a subject not having cancer, a subject having a stage I cancer, a subject having a stage II cancer, a subject having a stage III cancer, a subject having a stage IV cancer, or any combination thereof.
64. The method of claim 52, wherein the detecting comprises detecting 5-caC or 5-fC.
65. The method of claim 17, wherein the nucleotide sequence has at least 70% sequence homology to a biomarker of Table 1, Table 2, or a combination thereof.
66. The method of any one of claims 1, 7, or 11, wherein based on the comparing of (b) the sample is identified as a precancerous lesion or a precancerous growth.
67. The method of claim 66, wherein the precancerous lesion or the precancerous growth comprises a polyp, a nonpolyp, an advanced adenoma, or any combination thereof.
68. The method of claim 66, wherein the assaying of (a) is performed in the absence of a screening procedure.
69. The method of claim 68, wherein the screening procedure comprises a colonoscopy, an assay performed on a stool sample provided by the subject, a sigmoidoscopy, or any combination thereof.
70. The method of claim 68, wherein the sample is a blood sample.
71. The method of claim 68, wherein the sample comprises cell-free DNA.
72. A method comprising: (a) assaying a sample for a nucleotide sequence having at least 70% sequence homology to a biomarker listed in FIG. 139, FIG. 141 or a combination thereof to produce a result, wherein the sample is from a subject asymptomatic for a cancer or not previously diagnosed with a cancer; and (b) comparing the result of (a) to a result obtained from assaying a control or a derivative thereof.
73. The method of claim 72, wherein based on the comparing of (b) the sample is identified as a precancerous lesion or a precancerous growth.
74. The method of claim 73, wherein the precancerous lesion or precancerous growth comprises a polyp, nonpolyp, an advanced adenoma, or any combination thereof.
75. The method of claim 72, wherein the assaying of (a) is performed in the absence of a screening procedure.
76. The method of claim 75, wherein the screening procedure comprises a colonoscopy, an assay performed on a stool sample provided by the subject, a sigmoidoscopy, or any combination thereof.
77. The method of claim 72, wherein the sample is a blood sample.
78. The method of claim 72, wherein the sample comprises cell-free DNA.
79. The method of claim 72, wherein the nucleotide sequence has at least 70% sequence homology to a biomarker of Table 1, Table 2, Table 3, or any combination thereof.
80. The method of claim 72, wherein the assaying of (a) comprises assaying for a presence or an absence of an epigenetic modification.
81. The method of claim 80, wherein the assaying of (a) comprises detecting the epigenetic modification.
82. The method of claim 81, wherein the detecting is by nanopore sequencing.
83. The method of claim 81, wherein the detecting is by high throughput sequencing.
84. The method of claim 72, wherein the control or derivative thereof comprises samples obtained from a precancerous lesion or a precancerous growth.
PCT/IB2018/001169 2017-09-27 2018-09-27 Biomarkers for colorectal cancer detection WO2019064063A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP18807398.5A EP3688195A1 (en) 2017-09-27 2018-09-27 Biomarkers for colorectal cancer detection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762564164P 2017-09-27 2017-09-27
US62/564,164 2017-09-27

Publications (1)

Publication Number Publication Date
WO2019064063A1 true WO2019064063A1 (en) 2019-04-04

Family

ID=64426970

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2018/001169 WO2019064063A1 (en) 2017-09-27 2018-09-27 Biomarkers for colorectal cancer detection

Country Status (2)

Country Link
EP (1) EP3688195A1 (en)
WO (1) WO2019064063A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111454953A (en) * 2020-04-16 2020-07-28 山东殷氏干细胞有限公司 Bone marrow mesenchymal stem cell adipogenic transformation promoter
CN112029860A (en) * 2020-09-03 2020-12-04 首都医科大学 Marker molecule related to colorectal cancer prognosis and detection kit
US11410750B2 (en) 2018-09-27 2022-08-09 Grail, Llc Methylation markers and targeted methylation probe panel
US12027237B2 (en) 2018-03-13 2024-07-02 Grail, Llc Anomalous fragment detection and classification
US12024750B2 (en) 2018-04-02 2024-07-02 Grail, Llc Methylation markers and targeted methylation probe panel

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106755464A (en) * 2017-01-11 2017-05-31 上海易毕恩基因科技有限公司 For the method for screening the gene marker of intestinal cancer and/or stomach cancer, the gene marker and application thereof that is screened with the method

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
BRUTLAG ET AL., COMP. APP. BIOSCI., vol. 6, 1990, pages 237 - 245
CHUN-XIAO SONG ET AL: "5-Hydroxymethylcytosine signatures in cell-free DNA provide information about tumor types and stages", CELL RESEARCH - XIBAO YANJIU, vol. 27, no. 10, 18 August 2017 (2017-08-18), GB, CN, pages 1231 - 1242, XP055540434, ISSN: 1001-0602, DOI: 10.1038/cr.2017.106 *
DATABASE WPI Week 201759, Derwent World Patents Index; AN 2017-381459 *
HANYANG HU ET AL: "Epigenomic landscape of 5-hydroxymethylcytosine reveals its transcriptional regulation of lncRNAs in colorectal cancer", BRITISH JOURNAL OF CANCER, vol. 116, no. 5, 31 January 2017 (2017-01-31), GB, pages 658 - 668, XP055540452, ISSN: 0007-0920, DOI: 10.1038/bjc.2016.457 *
JAMES R. BRADFORD ET AL: "Consensus Analysis of Whole Transcriptome Profiles from Two Breast Cancer Patient Cohorts Reveals Long Non-Coding RNAs Associated with Intrinsic Subtype and the Tumour Microenvironment", PLOS ONE, vol. 11, no. 9, 29 September 2016 (2016-09-29), pages e0163238, XP055540557, DOI: 10.1371/journal.pone.0163238 *
NOA GILAT ET AL: "Single-molecule quantification of 5-hydroxymethylcytosine for diagnosis of blood and colon cancers", CLINICAL EPIGENETICS, BIOMED CENTRAL LTD, GB, vol. 9, no. 1, 14 July 2017 (2017-07-14), pages 1 - 8, XP021247144, ISSN: 1868-7075, DOI: 10.1186/S13148-017-0368-9 *
WENSHUAI LI ET AL: "5-Hydroxymethylcytosine signatures in circulating cell-free DNA as diagnostic biomarkers for human cancers", CELL RESEARCH - XIBAO YANJIU, vol. 27, no. 10, 19 September 2017 (2017-09-19), GB, CN, pages 1243 - 1257, XP055539921, ISSN: 1001-0602, DOI: 10.1038/cr.2017.121 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12027237B2 (en) 2018-03-13 2024-07-02 Grail, Llc Anomalous fragment detection and classification
US12024750B2 (en) 2018-04-02 2024-07-02 Grail, Llc Methylation markers and targeted methylation probe panel
US11410750B2 (en) 2018-09-27 2022-08-09 Grail, Llc Methylation markers and targeted methylation probe panel
US11685958B2 (en) 2018-09-27 2023-06-27 Grail, Llc Methylation markers and targeted methylation probe panel
US11725251B2 (en) 2018-09-27 2023-08-15 Grail, Llc Methylation markers and targeted methylation probe panel
US11795513B2 (en) 2018-09-27 2023-10-24 Grail, Llc Methylation markers and targeted methylation probe panel
CN111454953A (en) * 2020-04-16 2020-07-28 山东殷氏干细胞有限公司 Bone marrow mesenchymal stem cell adipogenic transformation promoter
CN112029860A (en) * 2020-09-03 2020-12-04 首都医科大学 Marker molecule related to colorectal cancer prognosis and detection kit

Also Published As

Publication number Publication date
EP3688195A1 (en) 2020-08-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18807398

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018807398

Country of ref document: EP

Effective date: 20200428