US20210383890A1 - Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network - Google Patents

Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network

Info

Publication number
US20210383890A1
Authority
US
United States
Prior art keywords
variant
condition
cell variable
specific
dna
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/369,499
Inventor
Brendan Frey
Michael K. K. Leung
Andrew Thomas DELONG
Hui Yuan XIONG
Babak ALIPANAHI
Leo J. Lee
Hannes BRETSCHNEIDER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Toronto
Deep Genomics Inc
Original Assignee
University of Toronto
Deep Genomics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Toronto and Deep Genomics Inc
Priority to US17/369,499
Assigned to DEEP GENOMICS INCORPORATED. Assignment of assignors interest (see document for details). Assignors: THE GOVERNING COUNCIL OF THE UNIVERSITY OF TORONTO
Assigned to THE GOVERNING COUNCIL OF THE UNIVERSITY OF TORONTO. Assignment of assignors interest (see document for details). Assignors: BRETSCHNEIDER, HANNES; DELONG, ANDREW THOMAS; XIONG, Hui Yuan; ALIPANAHI, BABAK; LEE, LEO J.; LEUNG, MICHAEL K.K.; FREY, BRENDAN
Publication of US20210383890A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • the following relates generally to systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network.
  • an example workflow may be as follows: a blood or tissue sample is obtained from a patient; variants (mutations) are identified, by either sequencing the genome, the exome or a gene panel; the variants are individually examined manually (e.g. by a technician), using literature databases and internet search engines; a diagnostic report is prepared. Manually examining the variants is costly and prone to human error, which may lead to incorrect diagnosis and potential patient morbidity. Automating or semi-automating this step is thus beneficial. Since the number of possible genetic variants is large, evaluating them manually is time-consuming, highly dependent on previous literature, and involves experimental data that has poor coverage and therefore can lead to high false negative rates, or “variants of unknown significance”. The same issues arise in therapeutic design, where the number of possible therapies (molecules) to be evaluated is extremely large.
  • Some other machine learning approaches to genetic analysis have been proposed.
  • One such approach predicts a cell variable that combines information across conditions, or tissues.
  • Markov Chain Monte Carlo (MCMC)
  • computationally, it is relatively expensive to obtain predictions from a Bayesian neural network (BNN), which requires computing the average predictions of many models.
  • a method for computing variant-induced changes in one or more condition-specific cell variables for one or more variants comprising: computing a set of variant features from a DNA or RNA variant sequence; applying a deep neural network of at least two layers of processing units to the variant features to compute one or more condition-specific variant cell variables; computing a set of reference features from a DNA or RNA reference sequence; applying the deep neural network to the reference features to compute one or more condition-specific reference cell variables; computing a set of variant-induced changes in the one or more condition-specific cell variables by comparing the one or more condition-specific reference cell variables to the one or more condition-specific variant cell variables.
  • a deep neural network for computing variant-induced changes in one or more condition-specific cell variables for one or more variants
  • the deep neural network comprising: an input layer configured to receive as input a set of variant features from a DNA or RNA variant sequence; and at least two layers of processing units operable to: compute one or more condition-specific variant cell variables; compute a set of reference features from a DNA or RNA reference sequence; compute one or more condition-specific reference cell variables; compute a set of variant-induced changes in the one or more condition-specific cell variables by comparing the one or more condition-specific reference cell variables to the one or more condition-specific variant cell variables.
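  • The comparison of reference and variant predictions described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the one-hot feature extractor, the untrained random weights, the network sizes and the names `one_hot_features`, `make_toy_cvp` and `variant_induced_changes` are all assumptions made for the example, and the "change" is taken to be a simple per-condition difference.

```python
import math
import random

ALPHABET = "ACGT"

def one_hot_features(seq):
    """Flat one-hot encoding of a DNA sequence (hypothetical feature extractor)."""
    return [1.0 if base == b else 0.0 for base in seq.upper() for b in ALPHABET]

def make_toy_cvp(n_inputs, n_hidden, n_conditions, seed=0):
    """Stand-in network with random, untrained weights:
    a tanh hidden layer followed by one sigmoid output per condition."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_conditions)]

    def predict(features):
        hidden = [math.tanh(sum(w * x for w, x in zip(row, features))) for row in w1]
        return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
                for row in w2]

    return predict

def variant_induced_changes(cvp, reference_seq, variant_seq):
    """Apply the same network to both sequences and compare per condition."""
    reference_vars = cvp(one_hot_features(reference_seq))
    variant_vars = cvp(one_hot_features(variant_seq))
    return [v - r for r, v in zip(reference_vars, variant_vars)]

# Three hypothetical conditions (e.g. tissues), sequences of length 8.
cvp = make_toy_cvp(n_inputs=4 * 8, n_hidden=5, n_conditions=3)
print(variant_induced_changes(cvp, "ACGTACGT", "ACGTTCGT"))
```

  • Because the same network is applied to both inputs, an unchanged sequence yields a change of zero in every condition.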
  • a method for training a deep neural network to compute one or more condition-specific cell variables comprising: establishing a neural network comprising at least two connected layers of processing units; repeatedly updating one or more parameters of the neural network so as to decrease the error for a set of training cases chosen randomly or using a predefined pattern, where each training case comprises features extracted from a DNA or RNA sequence and corresponding targets derived from measurements of one or more condition-specific cell variables, until a condition for convergence is met at which point the parameters are no longer updated.
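  • A minimal sketch of such a training loop, under stated assumptions: a single tanh hidden layer with a sigmoid output, squared error, per-case stochastic gradient updates in random order, and a convergence test on the change in mean error between epochs. The function name `train`, the layer sizes, the learning rate and the tolerance are hypothetical choices for illustration only.

```python
import math
import random

def train(cases, n_hidden=4, lr=0.5, tol=1e-6, max_epochs=2000, seed=0):
    """cases: list of (features, target) pairs, with targets in [0, 1]."""
    rng = random.Random(seed)
    n_in = len(cases[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

    def forward(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
        y = 1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(w2, h))))
        return h, y

    def mean_error():
        return sum((forward(x)[1] - t) ** 2 for x, t in cases) / len(cases)

    history = [mean_error()]
    for _ in range(max_epochs):
        order = list(range(len(cases)))
        rng.shuffle(order)  # training cases chosen randomly
        for i in order:
            x, t = cases[i]
            h, y = forward(x)
            dz = 2.0 * (y - t) * y * (1.0 - y)  # error gradient at the output pre-activation
            for j in range(n_hidden):
                dh = dz * w2[j] * (1.0 - h[j] ** 2)  # backpropagate through tanh (old w2[j])
                w2[j] -= lr * dz * h[j]
                for k in range(n_in):
                    w1[j][k] -= lr * dh * x[k]
        history.append(mean_error())
        if abs(history[-2] - history[-1]) < tol:  # condition for convergence
            break

    return (lambda x: forward(x)[1]), history

# Toy training cases: the last feature acts as a constant bias input.
cases = [([1.0, 0.0, 1.0], 0.9), ([0.0, 1.0, 1.0], 0.1),
         ([1.0, 1.0, 1.0], 0.5), ([0.0, 0.0, 1.0], 0.5)]
predict, history = train(cases)
print(len(history) - 1, "epochs, final mean squared error:", history[-1])
```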
  • FIG. 1 shows a system for cell variable prediction
  • FIG. 2 shows a comparison of approaches to predict phenotypes, such as disease risks, from an input
  • FIG. 3 shows a method of generating target cell variables for training
  • FIG. 4 shows an example deep neural network architecture for a cell variable predictor that predicts splicing levels
  • FIG. 5 shows a further example deep neural network architecture for a cell variable predictor that predicts splicing levels
  • FIG. 6 shows yet a further example deep neural network architecture for a cell variable predictor that predicts splicing levels
  • FIG. 7 shows yet a further example deep neural network architecture for a cell variable predictor that predicts splicing levels
  • FIG. 8 shows yet a further example deep neural network architecture for a cell variable predictor that predicts splicing levels
  • FIG. 9 shows yet a further example deep neural network architecture for a cell variable predictor that predicts splicing levels
  • FIG. 10 shows a method for training cell variable predictors
  • FIG. 11 shows a system to perform non-uniform sampling of training cases for determining a mini-batch for training a deep neural network
  • FIG. 12 shows a method for training cell variable predictors for ensuring a consistent backpropagation signal that updates the weights connected to tissue inputs and biases learning towards the event with large tissue variability early on before overfitting occurs
  • FIG. 13 shows a method for using the outputs of the CVP for scoring, classifying and prioritizing genetic variants
  • FIG. 14 shows a method for scoring variants by associating cell variable changes with those of other variants
  • FIG. 15 shows a method for interpreting which genetic features account for variant-induced cell variable changes
  • FIG. 16 shows a further method for interpreting which genetic features account for variant-induced cell variable changes
  • FIG. 17 shows a further method for interpreting which genetic features account for variant-induced cell variable changes
  • FIG. 18 shows a method to generate a visualization for tissue-specific feature importance
  • FIG. 19 shows a detailed illustration of the method to generate a visualization for tissue-specific feature importance.
  • Any module, unit, component, server, computer, terminal, engine or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape.
  • Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the device or accessible or connectable thereto.
  • any processor or controller set out herein may be implemented as a singular processor or as a plurality of processors. The plurality of processors may be arrayed or distributed, and any processing function referred to herein may be carried out by one or by a plurality of processors, even though a single processor may be exemplified. Any method, application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.
  • Systems and methods described herein relate, in part, to the problem of assessing genetic variants with respect to phenotypes, such as deleteriousness for human diseases. This problem has implications in several industrial categories under the broad umbrella of ‘personalized medicine’, including molecular diagnostics, whole genome sequencing, and pharmaceutical development.
  • variants depend on genetic context, which includes which other variants are present and, more generally, on the genomic sequence within the individual, or patient, being tested. So, whereas a particular variant may be benign in one genetic context, it may cause a disease in another genetic context. This impacts prioritization and interpretation.
  • the following describes a process for context-dependent genetic variant assessment, wherein variants may be ranked and presented as a priority list. Variant prioritization can be used to increase the efficiency and accuracy of manual interpretation, since it enables the technician to focus on a small subset of candidates.
  • Deep learning generally refers to methods that map data through multiple levels of abstraction, where higher levels represent more abstract entities.
  • the goal of deep learning is to provide a fully automatic system for learning complex functions that map inputs to outputs, without using hand crafted features or rules.
  • One implementation of deep learning comes in the form of feedforward neural networks, where levels of abstraction are modeled by multiple non-linear hidden layers.
  • embodiments described herein provide systems and methods that receive as input a DNA or RNA sequence, extract features, and apply multiple layers of nonlinear processing units of a cell variable predictor (“CVP”) to compute a cell variable, which corresponds to a measurable quantity within a cell, for different conditions, such as tissue types.
  • to distinguish between a cell variable that corresponds to a measurable quantity for a specific condition, such as a tissue type, and a cell variable that is a combination of measurable quantities from multiple conditions, we refer to the former as a “condition-specific cell variable” and the latter as a “non-specific cell variable”.
  • the CVP is applied to a DNA or RNA sequence and/or features extracted from the sequences, containing a genetic variant, and also to a corresponding reference (e.g., wild type) sequence to determine how much the cell variable changes because of the variant.
  • the systems and methods can be applied to naturally occurring genomic sequences, mini-gene reporters, edited genomic sequences, such as those edited using CRISPR-Cas9, genomic sequences targeted by therapies, and other genomic sequences.
  • the change in the cell variable in different conditions may be used to classify disease-causing variants, compute a score for how deleterious a variant is, prioritize variants for subsequent processing, interpret the mechanism by which a variant operates, and determine the effect of a therapy.
  • an unknown variant can be given a high score for deleteriousness if it induces a change in a particular cell variable that is similar to changes in the same cell variable that are induced by one or more variants that are known to be deleterious.
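  • One way to illustrate this idea is a nearest-neighbour score over change vectors. The sketch below is an assumption-laden example rather than the scoring method of the disclosure: the Euclidean distance, the `exp(-distance)` mapping to a score and the function name `similarity_score` are all hypothetical.

```python
import math

def similarity_score(unknown_change, known_deleterious_changes):
    """Score in (0, 1]: higher when the unknown variant's change vector
    lies close to that of at least one known deleterious variant."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = min(dist(unknown_change, k) for k in known_deleterious_changes)
    return math.exp(-nearest)

# Hypothetical changes in one cell variable across two conditions.
known = [[0.4, -0.1], [-0.3, 0.2]]
print(similarity_score([0.35, -0.1], known))  # close to a known deleterious change
print(similarity_score([0.0, 0.0], known))    # little change, lower score
```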
  • the CVP comprises a deep neural network having multiple layers of processing units and possibly millions of parameters.
  • the CVP may be trained using a dataset of DNA or RNA sequences and corresponding measurements of cell variables, using a deep learning training method that adjusts the strengths of the connections between processing units in adjacent layers. Specialized training methods are described, including a multi-task training method that improves accuracy.
  • the mechanism by which a mutation causes a deleterious change in a cell variable may in some instances be determined by identifying features or groups of features that are changed by the mutation and that cause the cell variable to change, which can be computed by substituting features derived from the variant sequence one by one into the reference sequence or by backpropagating the cell variable change back to the input features.
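  • The one-by-one substitution approach can be sketched as below. The predictor here is a hypothetical hand-written linear function standing in for a trained CVP, and `feature_substitution_effects` is an illustrative name; the backpropagation-based alternative is not shown.

```python
def feature_substitution_effects(predict, reference_features, variant_features):
    """For each feature that differs, substitute the variant value into the
    reference feature vector alone and record the change in the prediction."""
    baseline = predict(reference_features)
    effects = {}
    for i, (ref_value, var_value) in enumerate(zip(reference_features, variant_features)):
        if ref_value == var_value:
            continue  # unchanged features cannot explain the difference
        trial = list(reference_features)
        trial[i] = var_value
        effects[i] = predict(trial) - baseline
    return effects

def toy_predict(f):
    """Hypothetical stand-in predictor: feature 1 has no influence."""
    return 0.5 + 0.25 * f[0] - 0.5 * f[2] + 0.0 * f[1]

reference = [1.0, 0.0, 1.0, 0.0]
variant = [1.0, 1.0, 0.0, 0.0]
print(feature_substitution_effects(toy_predict, reference, variant))
```

  • In this toy case features 1 and 2 both differ between the sequences, but only feature 2 accounts for the predicted change.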
  • the invention can assign ‘blame’ to variants that are disease causing, and generate appropriate user visualizations. For example, a variant that changes the splicing ‘cell variable’ may be targeted by a therapy that targets the splicing pathway to remediate the disease.
  • the term “reference sequence” means: in the context of evaluating a variant (as described below), whereupon the systems described herein compare the variant to a ‘reference sequence’, the reference sequence is a DNA or RNA sequence obtained using genome sequencing, exome sequencing or gene sequencing of an unrelated individual or a closely related individual (e.g., parent, sibling, child). Alternatively, the reference sequence may be derived from the reference human genome, or it may be an artificially designed sequence.
  • the term “variant” means: a DNA or RNA sequence that differs from a reference sequence in one or more nucleotides, by substitutions, insertions, deletions or any other changes.
  • the variant sequence may be obtained using genome sequencing, exome sequencing or gene sequencing of an individual.
  • the variant sequence may be derived from the reference human genome, or it may be an artificially designed sequence.
  • the sequence containing the variant as well as surrounding DNA or RNA sequence is included in the ‘variant’.
  • single nucleotide variant means: a variant that consists of a substitution to a single nucleotide.
  • variant analysis means: the procedure (computational or otherwise) of processing a variant, possibly in addition to surrounding DNA or RNA sequence that establishes context, for the purpose of variant scoring, categorization, prioritization, and interpretation.
  • score means: a numeric value that indicates how deleterious a variant is expected to be.
  • classification refers to the classification of a variant.
  • a variant may be classified in different ways, such as by applying a threshold to the score to determine if the variant is deleterious or not.
  • the American College of Medical Genetics recommends a five-way classification: pathogenic (very likely to contribute to the development of disease); likely pathogenic (there is strong evidence that the variant is pathogenic, but the evidence is inconclusive); unknown significance or VUS (there is not enough evidence to support classification one way or another); likely benign (there is strong evidence that the variant is benign, but the evidence is inconclusive); benign (very likely to be benign).
  • the terms “rank”/“prioritization” mean: the process of sorting the scores of a set of variants to determine which variant should be further investigated. The pathogenic variants will be at the top, with the benign variants at the bottom.
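  • Threshold-based classification and score-based ranking, as defined above, can be illustrated with a short sketch. The threshold value of 0.5, the variant identifiers and the function names `classify` and `prioritize` are hypothetical.

```python
def classify(score, threshold=0.5):
    """Two-way classification by thresholding the score (threshold is illustrative)."""
    return "deleterious" if score >= threshold else "not deleterious"

def prioritize(scores):
    """Return variant identifiers sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

scores = {"var_a": 0.12, "var_b": 0.91, "var_c": 0.55}  # hypothetical scores
for variant_id in prioritize(scores):
    print(variant_id, scores[variant_id], classify(scores[variant_id]))
```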
  • cell variable means: a quantity, level, potential, or process outcome in the cell that is potentially relevant to the function of a living cell, and that is computed by a CVP (see below).
  • a condition-specific cell variable is a cell variable that is measured or predicted under a specific condition, such as a tissue type
  • non-specific cell variable is a cell variable that is derived by combining information from across multiple conditions, for example by subtracting the average cell variable values across conditions from the cell variable for each condition.
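  • The distinction can be made concrete with a small sketch of the mean-subtraction example. The tissue names and values are hypothetical. It also shows why condition-specific values matter: a variant that shifts every condition by the same amount leaves the mean-subtracted values unchanged.

```python
def non_specific(values_by_condition):
    """Subtract the across-condition average from each condition's value."""
    mean = sum(values_by_condition.values()) / len(values_by_condition)
    return {cond: value - mean for cond, value in values_by_condition.items()}

# Hypothetical condition-specific values for a reference sequence.
reference = {"brain": 0.8, "heart": 0.6, "liver": 0.4}
# Hypothetical variant that lowers the value by 0.3 in every tissue.
variant = {cond: value - 0.3 for cond, value in reference.items()}

print(non_specific(reference))
print(non_specific(variant))  # same as the reference up to rounding: the uniform shift is hidden
print({cond: variant[cond] - reference[cond] for cond in reference})  # about -0.3 each
```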
  • a cell variable can often be quantified by a vector of one or more real-valued numbers, or by a probability distribution over such a vector. Examples include the strength of binding between two molecules (e.g.
  • exon splicing levels (the fraction of mRNA transcripts in a particular tissue that contain a particular exon, i.e. percent spliced in)
  • DNA curvature
  • DNA methylation
  • RNA folding interactions
  • the term “event” means: in the context of a splicing-related cell variable (e.g. the fraction of transcripts with an exon spliced in), an observed (measured) alternative splicing event in the cell where both the genomic features and the corresponding splicing levels are known for that particular event.
  • Each event can be used as either a training case or a testing case for a machine learning system.
  • a system 100 for cell variable prediction comprising a machine learning unit.
  • the machine learning unit is preferably implemented by a deep neural network, which is alternatively referred to herein as a “cell variable predictor” (“CVP”) 101.
  • the CVP takes as input a set of features, including genomic features, and produces an output intended to mimic a specific cell variable.
  • the quantification of a cell variable can be represented in such a system by one or more real-valued numbers on an absolute or relative scale, with or without meaningful units.
  • the CVP may provide other outputs in addition to outputs intended to mimic a specific cell variable.
  • the system 100 further comprises a memory 106 communicatively linked to the CVP 101.
  • Each layer comprises one or more processing units 104, each of which implements a feature detector and/or a computation that maps an input to an output.
  • the processing units 104 accept a plurality of parameter inputs from other layers and apply activation functions with associated weights for each such parameter input to the respective processing unit 104.
  • the output of a processing unit of layer l may be provided as input to one or more processing units of layer l+1.
  • Each processing unit may be considered as a processing “node” of the network and one or more nodes may be implemented by processing hardware, such as a single or multi-core processor and/or graphics processing unit(s) (GPU(s)). Further, it will be understood that each processing unit may be considered to be associated with a hidden unit or an input unit of the neural network for a hidden layer or an input layer, respectively.
  • the use of large (many hidden variables) and deep (multiple hidden layers) neural networks may improve the predictive performance of the CVP compared to other systems.
  • inputs to the input layer of the CVP can include genetic information, such as sequences representing DNA, RNA, features derived from DNA and RNA, and features providing extra information (e.g. tissue type, age, sex), while outputs at the output layer of the CVP can include cell variables.
  • although an illustrative feedforward network is described, the type of neural network implemented is not limited to feedforward neural networks; the approach can also be applied to other neural networks, including convolutional neural networks, recurrent neural networks, auto-encoders and Boltzmann machines.
  • system 100 comprises a secondary analysis unit 114 for receiving the cell variables from the output layer and providing further analysis, as described below.
  • the memory 106 may comprise a database for storing activations and learned weights for each feature detector, as well as for storing datasets of genetic information and extra information and optionally for storing outputs from the CVP 101.
  • the genetic information may provide a training set comprising training data.
  • the training data may, for example, be used for training the CVP 101 to predict cell variables, in which case DNA and RNA sequences with known cell variables and/or phenotypes may be provided.
  • the memory 106 may further store a validation set comprising validation data.
  • the neural network learns optimized weights for each processing unit. After learning, the optimized weight configuration can then be applied to test data. Stochastic gradient descent can be used to train feedforward neural networks. The learning process (backpropagation) consists largely of matrix multiplications, which makes it well suited to acceleration on GPUs. Furthermore, the dropout technique may be utilized to prevent overfitting.
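  • As an illustration of the dropout technique mentioned above, the sketch below applies a random mask to a layer's activations during training. It assumes the common "inverted dropout" convention, in which kept activations are rescaled so that no adjustment is needed at test time; the text does not state which convention is used, and the function name `dropout` is hypothetical.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training and
    rescale the survivors by 1 / (1 - rate); pass through unchanged otherwise."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout(hidden, 0.5, rng))                  # some units zeroed, others doubled
print(dropout(hidden, 0.5, rng, training=False))  # unchanged at test time
```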
  • the system may further comprise a computing device 110 communicatively linked to the CVP 101 for controlling operations carried out in the CVP.
  • the computing device may comprise further input and output devices, such as input peripherals (such as a computer mouse or keyboard), and/or a display.
  • the computing device 110 may further be linked to a remote device 112 over a wired or wireless network 108 for transmitting and receiving data.
  • genetic information is received over the network 108 from the remote device 112 for storage in memory 106.
  • Cell variable predictions and lists of variant priorities may be displayed to a user via the display.
  • the inputs 204 to a CVP can include sequences representing DNA, RNA, features derived from DNA and RNA, and features providing extra information (e.g. tissue type, age, sex).
  • the cell variables 206 could be, for example, the distribution of proteins along a strand of DNA containing a gene, the number of copies of a gene (transcripts) in a cell, the distribution of proteins along the transcript, and the number of proteins.
  • the cell variables can be used by the system to determine how much a variant causes the cell variable to change. By examining how much a mutation causes the cell variable to change, the CVP can be used to score, categorize, and prioritize variants.
  • the cell variable predictions can act as high-level features to facilitate more accurate phenotypic predictions, optionally performed at the secondary analysis unit 114.
  • the resultant machine learning problem is modularized. Moreover, it allows variants to be related to particular cell variables, thereby providing a mechanism to explain variants.
  • the variant and a reference sequence are fed into the input layer of the CVP 101 and the amount of change in the cell variable is quantified and used to score, categorize and prioritize the variant by the secondary analysis unit 114.
  • the secondary analysis unit 114 comprises a second system (of similar architecture to the CVP) trained to predict a phenotype based on the outputs of the cell variable prediction systems (as illustrated in FIG. 2b).
  • the cell variable could be the frequency with which the exon is included when the gene is being copied to make a protein.
  • Other examples of cell variables include the distribution of proteins along a strand of DNA containing a gene, the number of copies of a gene (transcripts) in a cell, the distribution of proteins along the transcript, and the number of proteins.
  • the CVP comprises multiple layers of nonlinear processing units to compute the cell variable using the raw DNA or RNA sequence, or features derived from the sequence.
  • the system may first construct a pair of feature vectors corresponding to the reference sequence and the variant sequence. Due to the variant, these genomic feature vectors will be different, but without a further cell variable predictor it may not be possible to predict whether those differences would result in any change in phenotype. Embodiments of the predictive system may therefore infer both the reference cell variable value and the variant cell variable value using these two distinct feature vectors.
  • a distance function that combines the reference and the variant predictions may be used to produce a single score which summarizes the magnitude of predicted effect induced by the mutations.
  • Example distance functions include the absolute difference in expectation, Kullback-Leibler divergence, and variation distance. Detailed mathematical formulas of these will be described in a later paragraph.
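  • The three example distance functions can be sketched for discretized predictions, where the reference and variant outputs are each a probability distribution over a small set of cell variable values. This is an illustration under assumptions: the discretization, the clamping constant `eps`, the direction of the Kullback-Leibler divergence and the factor of 1/2 in the variation distance are choices made here and may differ from the formulas given in the disclosure.

```python
import math

def expectation(p, values):
    return sum(pi * vi for pi, vi in zip(p, values))

def abs_diff_expectation(p, q, values):
    """Absolute difference between the expected values under p and q."""
    return abs(expectation(p, values) - expectation(q, values))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; q is clamped to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def variation_distance(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical predictions over two cell variable values (0.0 and 1.0).
values = [0.0, 1.0]
reference = [0.5, 0.5]
variant = [0.25, 0.75]
print(abs_diff_expectation(reference, variant, values))
print(kl_divergence(reference, variant))
print(variation_distance(reference, variant))
```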
  • process 250 can rely on input features derived from other types of data besides DNA sequences (e.g. age, sex, known biomarkers)—the above described inputs are merely illustrative.
  • An aspect of the embodiments described herein is the use of machine learning to infer predictors that are capable of generalizing to new genetic contexts and to new cell states.
  • a predictor may be inferred using reference genome and data profiling transcripts in healthy tissues, but then applied to the genome of a cancer cell to ascertain how the distribution of transcripts changes in the cancer cell.
  • This notion of generalization is a crucial aspect of the predictors that need to be inferred. If a predictor is good at generalization, it can analyze variant sequences that lead to changes in cell variables that may be indicative of disease state, without needing experimental measurements from diseased cells.
  • Process 250 may address the two problems discussed with respect to approach 200. Since the cell variables are more closely related to and more easily determined from genomic sequences than are phenotypes, learning predictors that map from DNA to cell variables is usually more straightforward. High-throughput sequencing technologies are currently generating massive amounts of data profiling these cell variables under diverse conditions; these datasets can be used to train larger and more accurate predictors. Also, since the cell variables correspond to intermediate biochemically active quantities, such as the concentration of a gene transcript, they may be good targets for therapies. If high disease risk is associated with a change in a cell variable compared to a healthy individual, an effective therapy may consist of restoring that cell variable to its normal state.
  • Embodiments may include such cell variables as ‘exon inclusion or exclusion’, ‘alternative splice site selection’, ‘alternative polyadenylation site selection’, ‘RNA- or DNA-binding protein or microRNA specificity’, and ‘phosphorylation’.
  • the method can be applied to raw DNA or RNA sequence or features extracted from the sequence, such as RNA secondary structures and nucleosome positions; the method can compute one or more condition-specific cell variables, without the need for a baseline average across conditions; the method can detect variants that affect all condition-specific cell variables in the same way; the method can compare a variant sequence to a reference sequence, enabling it to make different predictions for the same variant, depending on genetic context; the method can compute the condition-specific cell variables using a deep neural network, which has at least two layers of processing units; the method does not require disease labels (e.g., a case population and a control population); the method can score a variant that has never been seen before; the method can be used to compute a ‘distance’ between a variant sequence and a reference sequence, which can be used to rank the variant; the method can be used to compute a ‘distance’ between variants, which is useful for classifying unknown variants based on how similar they are to known variants.
  • Referring now to FIG. 3, shown therein is a method of generating target cell variables for training.
  • a family of gradient-following procedures is performed where weights (“θ”) of a neural network are changed according to the gradient of a cost function evaluated using the prediction and the target in a training dataset.
  • the measured cell variable to be modeled is represented in a mathematical form, also referred to as the ‘target’ in a dataset.
  • PSI denotes percent-spliced-in values.
  • the biological measurements such as RNA-Seq datasets are processed to produce a posterior probability distribution p of PSI, using methods such as Cufflinks and the bootstrap binomial model.
  • a regression model to predict the expected PSI can be trained, with the cost function being the squared loss or the cross-entropy based on a binomial distribution with E(Ψ) as the probability of success.
  • While the preparation of training targets according to method 300 may be different for different cell variables, the system architecture applied may be the same or similar.
  • Referring now to FIGS. 4 to 9, shown therein are example DNN architectures for CVPs that predict splicing levels (Ψ).
  • the number of hidden layers and the number of processing units in each layer can range widely and may be determined by hand, using data or using other information;
  • the nodes of the DNN are fully connected, where each connection is parameterized by a real-valued weight θ.
  • the DNN has multiple layers of non-linearity consisting of hidden units.
  • the output activation a of each hidden unit v in layer l processes a sum of weighted outputs from the previous layer, using a non-linear function f:
  • $a_v^{l} = f\left( \sum_{m}^{M^{l-1}} \theta_{v,m}^{l} \, a_m^{l-1} \right)$
  • $M^{l}$ represents the number of hidden units in layer $l$
  • $a^{0}$ and $M^{0}$ are the input into the model and its dimensionality, respectively.
  • Different activation functions for the hidden units can be used, such as the TANH function, SIGMOID, and the rectified linear unit (RELU).
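  • The layer-wise activation rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the function names, the nested-list weight layout (`theta[v][m]` connects unit m of the previous layer to unit v) and the toy weights are assumptions made for the example, and, like the formula above, the sketch has no bias terms.

```python
import math

# Candidate non-linearities f named in the text.
ACTIVATIONS = {
    "tanh": math.tanh,
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "relu": lambda z: max(0.0, z),
}

def layer_forward(theta, a_prev, f):
    """Compute a_v = f(sum_m theta[v][m] * a_prev[m]) for every unit v."""
    return [f(sum(w * a for w, a in zip(row, a_prev))) for row in theta]

def forward(weights, a0, activation="tanh"):
    """Propagate the input a0 through every fully connected layer in turn."""
    f = ACTIVATIONS[activation]
    a = list(a0)
    for theta in weights:
        a = layer_forward(theta, a, f)
    return a

# Toy network (hypothetical weights): 2 inputs -> 2 hidden units -> 1 output.
weights = [
    [[0.5, -0.5], [1.0, 1.0]],
    [[1.0, -1.0]],
]
output = forward(weights, [1.0, 2.0], activation="relu")
```

  • With the rectified linear unit, the first hidden unit receives -0.5 and outputs 0, the second receives 3 and outputs 3, so the final unit receives -3 and outputs 0.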
  • Referring now to FIG. 4, shown therein is an example architecture 400 of a deep neural network that predicts alternative splicing inclusion levels in a single tissue type i, where the inclusion level is represented by a real-valued number Ψ_i.
  • Inputs into the first hidden layer 406 consist of genomic features 402 describing a genomic region; these features may include binding specificities of RNA- and DNA-binding proteins, RNA secondary structures, nucleosome positions, position-specific frequencies of short nucleotide sequences, and many others. To improve learning, the features can be normalized by the maximum of the absolute value across all training examples.
  • the purpose of the first hidden layer is to reduce the dimensionality of the input and learn a better representation of the feature space.
  • condition e.g., tissues
  • T represents the number of conditions
  • the final output 412 may be a regression model that predicts the expected PSI.
  • the discretized PSI may be predicted by a classification model 512 .
  • the DNN can predict the difference in PSI (ΔPSI) between two conditions for a particular exon.
  • FIG. 6 shows an example architecture 600 of a deep neural network that predicts the difference between the alternative splicing inclusion levels of two tissue types (conditions) i 602 and j 604 .
  • two different tissues can be supplied to the inputs.
  • three classes can be generated, called decreased inclusion 606 , no change 608 , and increased inclusion 610 , which can be similarly generated, but from the ΔPSI distributions.
  • An interval can be chosen that more finely differentiates tissue-specific alternative splicing for this task, where a difference of greater than 0.15 could be labeled as a change in PSI levels.
  • the probability mass could be summed over the intervals of -1 to -0.15 for decreased inclusion, -0.15 to 0.15 for no change, and 0.15 to 1 for increased inclusion.
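  • As a hedged illustration of the three-class target construction, the sketch below sums the probability mass of a discretized ΔPSI distribution over the three intervals. The representation of the distribution as (value, probability) pairs, the function name, and the choice to count values exactly at plus or minus 0.15 as "no change" are assumptions for this example.

```python
def delta_psi_classes(distribution, threshold=0.15):
    """Sum probability mass of a discretized delta-PSI distribution into 3 classes.

    distribution: list of (delta_psi, probability) pairs with delta_psi in [-1, 1].
    Boundary values (exactly +/- threshold) are counted as "no change" (assumed).
    """
    decreased = sum(p for d, p in distribution if d < -threshold)
    increased = sum(p for d, p in distribution if d > threshold)
    no_change = sum(p for d, p in distribution if -threshold <= d <= threshold)
    return {"decreased": decreased, "no_change": no_change, "increased": increased}

# Hypothetical posterior over delta-PSI for one exon and one tissue pair.
dist = [(-0.5, 0.1), (-0.1, 0.2), (0.0, 0.3), (0.2, 0.25), (0.6, 0.15)]
classes = delta_psi_classes(dist)
```

  • For this toy distribution the three class targets are approximately 0.1, 0.5 and 0.4.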
  • Referring now to FIG. 7, shown therein is an example architecture 700 of a deep neural network that predicts the alternative splicing inclusion levels of two tissue types i and j, where the inclusion levels are represented by real-valued numbers Ψ_i 702 and Ψ_j 704 and the difference in alternative splicing inclusion levels between the two tissue types 706 is also represented by a real-valued number.
  • the classification, regression, and tissue difference codes may be trained jointly.
  • the benefit is to reuse the same hidden representations learned by the model, and for each learning task to improve the performance of another.
  • FIG. 9 shows an example architecture of such a system.
  • the first hidden layer can be trained using an autoencoder to reduce the dimensionality of the feature space in an unsupervised manner.
  • An autoencoder is trained by supplying the input through a non-linear hidden layer, and reconstructing the input, with tied weights going into and out of the hidden layer. Alternatively, the weights can be untied. This method of pretraining the network may initialize learning near a good local minimum.
  • An autoencoder may be used instead of other dimensionality reduction techniques like principal component analysis, because it naturally fits into the CVP's architecture and because a non-linear technique may discover a better and more compact representation of the features.
  • the weights from the input layer to the first hidden layer (learned from the autoencoder) are fixed, and the inputs corresponding to tissues are appended.
  • a one-hot encoding representation may be used, such that specifying a tissue for a particular training example can take the form [0 1 0 0 0] to denote the second tissue out of 5 possible types.
  • the reduced feature set and tissue variables become input into the second hidden layer.
  • the weights connected to the second hidden layer and the final hidden layer of the CVP are then trained together in a supervised manner, with targets being the expected value of PSI, the discretized version of PSI, the expected value of ΔPSI, and/or the discretized version of ΔPSI, depending on architecture.
  • weights from all layers of the CVP may be fine-tuned by backpropagation.
  • the autoencoder may be omitted altogether, and all weights of the neural network may be trained at once.
  • the targets consist of (1) PSI for each of the two tissues, and (2) ΔPSI between the two tissues.
  • N × N training examples can be constructed. This construction has redundancy in that it generates examples where both tissues are the same in the input to teach the model that it should predict no change for ΔPSI given identical tissue indices. Additionally, if the tissues are swapped in the input, a previously increased inclusion label should become decreased inclusion. The same rationale extends to the LMH classifier. Generating these additional examples is one method to incorporate this knowledge without explicitly specifying it in the model architecture.
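  • The pairwise construction can be sketched as follows, assuming N tissue types per exon, a one-hot encoding for each of the two tissue inputs, and a sign convention in which the difference is taken as the second tissue minus the first. The function names, the dictionary layout and the sign convention are assumptions for this illustration; the text above does not fix them.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    return [1 if k == index else 0 for k in range(size)]

def pairwise_examples(features, psi_by_tissue):
    """Build all N x N (tissue i, tissue j) training examples for one exon.

    The difference target is psi_j - psi_i (assumed sign convention), so pairs
    with i == j get a target of exactly 0 and swapping i and j flips the sign.
    """
    n = len(psi_by_tissue)
    examples = []
    for i in range(n):
        for j in range(n):
            examples.append({
                "input": list(features) + one_hot(i, n) + one_hot(j, n),
                "psi_i": psi_by_tissue[i],
                "psi_j": psi_by_tissue[j],
                "delta_psi": psi_by_tissue[j] - psi_by_tissue[i],
            })
    return examples

# Hypothetical exon with two reduced features and PSI measured in three tissues.
examples = pairwise_examples([0.3, 0.7], [0.2, 0.9, 0.5])
```

  • Three tissues yield nine examples, three of which have identical tissue indices and a difference target of zero.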
  • a threshold can be applied to exclude examples from training if the total number of RNA-Seq junction reads is below a number, such as 10, to remove low signal training examples.
  • multiple tasks may be trained together. Since each of these tasks might learn at different rates, learning rates may be allowed to differ. This is to prevent one task from overfitting too soon and negatively affecting the performance of another task before the complete model is fully trained. This may be implemented by having different learning rates for the weights between the connections of the last hidden layer and the functions used for classification or regression for each task.
  • data may be split into folds at random for cross validation, such as five approximately equal folds.
  • Each fold may contain a unique set of genetic information, such as exons that are not found in any of the other folds. Where five folds are provided, three of the folds could be used for training, one used for validation, and one held out for testing.
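  • A minimal sketch of this fold assignment is given below, assuming exons are identified by unique string IDs and that the validation fold is simply the one following the test fold. Both helper names and the rotation rule are assumptions made for illustration.

```python
import random

def split_into_folds(exon_ids, n_folds=5, seed=0):
    """Randomly partition unique exon IDs into approximately equal folds."""
    ids = sorted(set(exon_ids))
    random.Random(seed).shuffle(ids)
    # Striding guarantees each exon lands in exactly one fold.
    return [ids[k::n_folds] for k in range(n_folds)]

def assign_roles(folds, test_index):
    """Hold out one fold for testing, one for validation, train on the rest."""
    n = len(folds)
    val_index = (test_index + 1) % n  # assumed rotation rule
    train = [e for k, fold in enumerate(folds)
             if k not in (test_index, val_index) for e in fold]
    return {"train": train, "validation": folds[val_index], "test": folds[test_index]}

folds = split_into_folds(["exon%d" % k for k in range(23)])
roles = assign_roles(folds, test_index=0)
```

  • Rotating `test_index` over the five folds gives the multiple models mentioned in the text, each evaluated on exons it never saw during training.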
  • Training can be performed for a fixed number of epochs and hyperparameters can be selected that give optimal area under curve (“AUC”) performance or data likelihood on the validation data.
  • the model can then be re-trained using the selected hyperparameters with both the training and validation data. Multiple models can be trained this way from the different folds of data. Predictions from the models on their corresponding test set can then be used to evaluate the code's performance.
  • the data can be randomly partitioned, and the above training procedure can be repeated.
  • the CVP's processing unit weights may be initialized with small random values sampled from a zero-mean Gaussian distribution. Alternatively, they may be initialized with small random values from a zero-mean uniform distribution. Learning may be performed with stochastic gradient descent with momentum and dropout, where mini-batches are constructed as described below. An L1 weight penalty may be included in the cost function to improve the model performance by disconnecting features deemed to be not useful by the predictor. The model's weights may be updated after each mini-batch.
  • Referring now to FIG. 11, shown therein is a system to perform non-uniform sampling of training cases for creating a mini-batch for training a deep neural network.
  • a system for biasing the distribution of training events in the mini-batches.
  • the system comprises training cases separated into “high-variance” cases and “low-variance” cases.
  • the set of high-variance training cases is thus selected by thresholding each case's variance across tissue types or genomic features.
  • the “high-variance” cases are provided in a database 1106
  • the “low-variance” cases are provided in a database 1108 .
  • the system further comprises switches 1104 and multiplexers 1102 .
  • each row of a mini-batch 1110 is sampled either from a list of high- or low-variance training cases, depending on a probabilistic {0,1} switch value.
  • the resulting mini-batch of genomic features and corresponding cell variable targets can be used for training, such as for training the architectures in FIGS. 6 and 7 .
  • a method for training cell variable predictors for ensuring a consistent backpropagation signal that updates the weights connected to tissue inputs and biases learning towards the event with large tissue variability early on before overfitting occurs.
  • in a method 1200 , at block 1202 , all training cases are separated into a database of “high-variance” cases and a database of “low-variance” cases, where the variance of each training case is measured as “variance of the Ψ training targets across tissue types” and the threshold for separating high/low is any pre-determined constant.
  • all events that exhibit large tissue variability are selected, and mini-batches are constructed based only on these events.
  • training cases can be further sampled (with or without replacement) from the larger pool of events with low tissue variability, of some pre-determined or randomized size typically smaller than or equal to one fifth of the mini-batch size.
  • a purpose of method 1200 is to have a consistent backpropagation signal that updates the weights connected to the tissue inputs and bias learning towards the event with large tissue variability early on before overfitting occurs. As training progresses, the splicing pattern of the events with low tissue variability is also learned. This arrangement effectively gives the events with large tissue variability greater importance (i.e. more weight) during optimization. This may be beneficial to improve the models' tissue specificity.
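  • The biased mini-batch construction can be sketched as below. This is an illustrative reading of the switch-and-multiplexer arrangement, not the exact system of FIG. 11: the function names, the dictionary layout of a training case, and the default switch probability of 0.2 (chosen to mirror the "one fifth" guideline) are assumptions.

```python
import random

def variance(values):
    """Population variance of the PSI targets across tissue types."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_by_variance(cases, threshold):
    """Separate training cases into high- and low-variance pools."""
    high = [c for c in cases if variance(c["targets"]) > threshold]
    low = [c for c in cases if variance(c["targets"]) <= threshold]
    return high, low

def sample_minibatch(high, low, batch_size, p_low=0.2, rng=None):
    """Fill each mini-batch row from the low pool with probability p_low,
    otherwise from the high pool (sampling with replacement)."""
    rng = rng or random.Random(0)
    batch = []
    for _ in range(batch_size):
        use_low = bool(low) and (not high or rng.random() < p_low)
        batch.append(rng.choice(low if use_low else high))
    return batch

# Hypothetical training cases: PSI targets for three tissues per exon.
cases = [
    {"id": "e1", "targets": [0.1, 0.9, 0.5]},
    {"id": "e2", "targets": [0.5, 0.5, 0.5]},
    {"id": "e3", "targets": [0.2, 0.8, 0.9]},
    {"id": "e4", "targets": [0.4, 0.45, 0.5]},
]
high, low = split_by_variance(cases, threshold=0.01)
batch = sample_minibatch(high, low, batch_size=10)
```

  • Setting `p_low` to zero reproduces the early stage in which mini-batches are built only from events with large tissue variability.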
  • CVPs comprising deep neural networks may be a competitive technique for conducting learning and prediction on biological datasets, with the advantage that they can be trained quickly, have enough capacity to model complex relationships, and scale well with the number of hidden variables and volume of data, making them potentially highly suitable for ‘omic’ datasets.
  • the performance of a CVP depends on a good set of hyperparameters.
  • Bayesian frameworks can be used to automatically select a model's hyperparameters. These methods use a Gaussian Process to search for a joint setting of hyperparameters that optimizes a process's performance on validation data. They use the performance measures from previous experiments to decide which hyperparameters to try next, taking into account the trade-off between exploration and exploitation. This approach eliminates many of the human judgments involved with hyperparameter optimization and reduces the time required to find such hyperparameters.
  • randomized hyperparameter search can be performed, where the hyperparameters to be optimized are sampled from a uniform distribution. These methods require only the search range of hyperparameter values to be specified, as well as how long to run the optimization for.
  • the systems described above can be used to compute a set of condition-specific scores for how deleterious a variant is. For instance, a variant may be found to have a high deleteriousness score in brain tissue, but not in liver tissue. In this way the condition-specific cell variables computed as described above can be used to compute condition-specific deleteriousness scores. To classify variants as pathogenic, likely pathogenic, unknown significance (VUS), likely benign or benign, and to prioritize or rank a set of variants, these sets of scores can be combined.
  • a pair of feature vectors are constructed corresponding to the reference sequence and the variant sequence. Due to the mutation, these genomic feature vectors will be different, but without a further CVP it may not be possible to predict whether those differences will result in any change in phenotype.
  • the predictive system is therefore used to compute both the reference cell variable value and the mutant cell variable value for each condition, using these two distinct feature vectors.
  • a distance function that combines the reference and the mutant predictions can be used to produce a single score for each condition, which summarizes the magnitude of the predicted effect induced by the mutations. Because a large change in a cell variable is likely to cause disease, high-scoring mutations are assumed to cause disease in the absence of further information about a particular disease and a particular cell variable.
  • examples of distance functions are the expected difference, Kullback-Leibler divergence, and variation distance.
  • the expected difference represents the absolute value of the difference induced by the mutation in the expected value of a cell variable.
  • the predicted reference splicing patterns $\{p_{low}^{wt}, p_{mid}^{wt}, p_{high}^{wt}\}$ and the predicted mutant splicing patterns $\{p_{low}^{mut}, p_{mid}^{mut}, p_{high}^{mut}\}$ are computed using the reference and mutant feature vectors as inputs.
  • the expected value of the predicted cell variable with and without the mutation is computed, denoted as $\Psi_{mut}$ and $\Psi_{wt}$, respectively.
  • the expected value is a weighted average of the PSI values corresponding to the center of the bins used to define the splicing pattern.
  • KL divergence is an information theoretic measure of difference between probability distributions P and Q: $D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$
  • the KL divergence can be computed for each condition and the sum (or average) KL divergence can be computed across conditions, or the maximum KL divergence can be computed across tissues.
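  • the variation distance is another measure of difference between probability distributions. It is half the sum of the absolute differences between the predicted reference and mutant probabilities:
  • $s = \frac{1}{2} \sum_{c \in \{low, mid, high\}} \left| p_{c}^{mut} - p_{c}^{wt} \right|$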
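  • The three distance functions named above (expected difference, Kullback-Leibler divergence, and variation distance, taken here as half the sum of absolute probability differences) can be sketched for a single condition as follows. The bin centers of 1/6, 1/2 and 5/6 assume three equal-width PSI bins, the KL direction (reference relative to mutant) is chosen arbitrarily, and the small `eps` guard against division by zero is an implementation convenience; none of these choices is stated in the text.

```python
import math

# Assumed centers of three equal-width PSI bins on [0, 1].
BIN_CENTERS = {"low": 1.0 / 6, "mid": 0.5, "high": 5.0 / 6}

def expected_psi(pattern):
    """Weighted average of bin centers under a low/mid/high splicing pattern."""
    return sum(pattern[b] * BIN_CENTERS[b] for b in BIN_CENTERS)

def expected_difference(p_wt, p_mut):
    """Absolute change in expected PSI induced by the mutation."""
    return abs(expected_psi(p_mut) - expected_psi(p_wt))

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) over the shared bins; terms with p[b] == 0 contribute 0."""
    return sum(p[b] * math.log((p[b] + eps) / (q[b] + eps)) for b in p if p[b] > 0)

def variation_distance(p_wt, p_mut):
    """Half the sum of absolute differences between the two patterns."""
    return 0.5 * sum(abs(p_mut[b] - p_wt[b]) for b in p_wt)

# Hypothetical predictions for one condition.
p_wt = {"low": 0.1, "mid": 0.2, "high": 0.7}
p_mut = {"low": 0.7, "mid": 0.2, "high": 0.1}

scores = {
    "expected_difference": expected_difference(p_wt, p_mut),
    "kl": kl_divergence(p_wt, p_mut),
    "variation": variation_distance(p_wt, p_mut),
}
```

  • For this pair the expected PSI falls from about 0.7 to about 0.3, so the expected difference is about 0.4 and the variation distance is about 0.6. Per-condition scores such as these can then be summed, averaged or maximized across conditions.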
  • the score of a variant can be thresholded and/or combined with other information to classify the variant as pathogenic, likely pathogenic, unknown significance (VUS), likely benign or benign.
  • the score of every variant can be computed and the set of variants can be reordered so that the highest-scoring (most deleterious) variants are at the top of the list and the lowest-scoring variants are at the bottom of the list.
  • the method 1400 comprises, at block 1402 , associating the cell variable changes of variants with those of other variants with known function. For instance, suppose the system 100 determines that a variant that has never been seen before causes a change in a particular cell variable, say the cassette splicing level of a specific exon. Suppose a nearby variant whose disease function is well-characterized causes a similar change in the exact same cell variable, e.g., the splicing level of the same exon.
  • since mutations act by changing cellular chemistry, such as the splicing level of the exon, it can be inferred that the unknown variant likely has the same functional impact as the known variant.
  • the system can ascertain the ‘distance’ between two variants in this fashion using a variety of different measures. Because the system computes variant-induced changes in a cell variable for different conditions, this information can be used to more accurately associate variants with one another. For example, two variants that induce a similar cell variable change in brain tissue would be associated more strongly than two variants that induce similar cell variable changes, but in different tissues.
  • the methods and systems described here can be used to score, classify, prioritize and interpret a variant in the context of different reference sequences. For instance, when a child's variant is compared to a reference sequence obtained from the reference human genome, the variant may have a high score, but when the same variant is compared to the reference sequences obtained from his or her unaffected parents, the variant may have a low score, indicating that the variant is likely not the cause of the disease. In contrast, if the child's variant is found to have a high score when it is compared to the reference sequences obtained from his or her parents, then it is more likely to be the cause of the disease.
  • Another circumstance in which different reference sequences arise is when the variant may be present in more than one transcript, which can occur because transcription occurs bidirectionally in the genome, there may be alternative transcription start sites, there may be alternative splicing, and for other reasons.
  • a variant leads to a change in DNA/RNA sequence and/or a change in the DNA/RNA features extracted from the sequence. However, it may not be evident which particular changes in the sequence or features are important.
  • An SNV may change more than one feature (e.g., a protein binding site and RNA secondary structure), but because of contextual dependence only some of the affected features play an important role.
  • the system 100 can determine which inputs (nucleotides or DNA/RNA features) are responsible for changes in cell variables. In other words, it is useful to know how important a feature is overall for making a specific prediction, and it is also useful to know in what way the feature contributes to the prediction (positively or negatively).
  • a first method 1500 to identify the impact of features on a CVP prediction works by computing, at block 1502 , the features for the sequence containing the variant and the features for the sequence that does not have the variant.
  • both feature vectors are fed into the cell variable predictor to obtain the two sets of condition-specific cell variables.
  • a single feature from the variant sequence is copied into the corresponding feature in the non-variant sequence and the system is used to compute the set of condition-specific cell variables.
  • this is repeated for all features and the feature that produces the set of condition-specific cell variables that is most similar to the set of condition-specific cell variables for the variant sequence is identified. This approach can be extended to test a set of pairs of features or a set of arbitrary combinations of features.
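  • The single-feature substitution procedure of method 1500 can be sketched as below. The `predict` argument stands in for a trained CVP that maps a feature vector to a list of condition-specific cell variables; the toy predictor, the squared-error similarity measure and all names are assumptions made for the example.

```python
def most_explanatory_feature(predict, ref_features, var_features):
    """Copy one variant feature at a time into the reference feature vector and
    return the index whose substitution best reproduces the variant prediction."""
    target = predict(var_features)
    best_index, best_distance = None, float("inf")
    for k in range(len(ref_features)):
        hybrid = list(ref_features)
        hybrid[k] = var_features[k]  # substitute a single feature
        prediction = predict(hybrid)
        # Squared distance across conditions (assumed similarity measure).
        distance = sum((a - b) ** 2 for a, b in zip(prediction, target))
        if distance < best_distance:
            best_index, best_distance = k, distance
    return best_index, best_distance

def toy_predict(x):
    """Hypothetical stand-in for a CVP with two conditions."""
    return [0.2 + 0.5 * x[0] + 0.01 * x[2], 0.3 + 0.1 * x[1]]

ref = [0.0, 0.0, 0.0]
var = [1.0, 1.0, 1.0]
best_index, best_distance = most_explanatory_feature(toy_predict, ref, var)
```

  • In this toy case feature 0 dominates the first condition, so substituting it alone brings the prediction closest to the variant prediction. Testing pairs or larger subsets follows the same pattern with more than one index substituted per trial.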
  • the impact of feature subsets of the same size are comparable, including all cases when
  • 1.
  • the overall importance of a feature (as opposed to its importance for a specific training or test case) with regard to a particular dataset (e.g. a training or test set) can be determined as the average or median of all its impact scores across all cases in that dataset.
  • a third method 1700 is described to identify the impact of features on a CVP prediction.
  • an example from the dataset is given as input to the trained model and forward propagated through a CVP comprising a neural network to generate an output.
  • the target is modified to a different value compared to the predicted output; for example, in classification, the class label would be modified so that it differs from the prediction.
  • the error signal is backpropagated to the inputs. The resulting signal describes how much each input feature needs to change in order to make the modified prediction, as well as the direction.
  • the computation is extremely quick, as it only requires a single forward and backward pass through the CVP, and all examples can be calculated in parallel. Features that need to be changed the most are deemed to be important.
  • the overall importance of a feature (as opposed to its importance for a specific training or test case) with regard to a particular dataset (e.g. a training or test set) can be determined as the average or median of the amount of change across all cases in that dataset. The benefit of this approach compared to the first is that it can model how multiple features operate simultaneously.
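  • A minimal sketch of method 1700 for a tiny network is shown below: one forward pass, a target flipped away from the prediction, and one backward pass to the inputs. The architecture (one tanh hidden layer, one sigmoid output), the cross-entropy loss, the weights and the function names are all assumptions chosen to keep the gradient computation explicit.

```python
import math

def hidden(W1, x):
    """Hidden activations h_v = tanh(sum_m W1[v][m] * x[m])."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def predict(W1, w2, x):
    """Sigmoid output of a one-hidden-layer network."""
    z = sum(w * h for w, h in zip(w2, hidden(W1, x)))
    return 1.0 / (1.0 + math.exp(-z))

def loss(W1, w2, x, target):
    """Cross-entropy between the prediction and a binary target."""
    y = predict(W1, w2, x)
    return -(target * math.log(y) + (1.0 - target) * math.log(1.0 - y))

def input_gradient(W1, w2, x, target):
    """Backpropagate the error signal to the inputs: d(loss)/d(x[m])."""
    h = hidden(W1, x)
    y = predict(W1, w2, x)
    dz2 = y - target                                   # sigmoid + cross-entropy
    dz1 = [dz2 * w * (1.0 - hv * hv) for w, hv in zip(w2, h)]  # through tanh
    return [sum(W1[v][m] * dz1[v] for v in range(len(W1))) for m in range(len(x))]

# Hypothetical trained weights and one input example.
W1 = [[0.5, -1.0, 0.0], [0.3, 0.8, -0.2]]
w2 = [1.5, -0.7]
x = [0.2, -0.4, 1.0]

y = predict(W1, w2, x)
target = 0.0 if y >= 0.5 else 1.0      # modified target that disagrees with y
grad = input_gradient(W1, w2, x, target)
importance = [abs(g) for g in grad]    # larger magnitude = more important
```

  • The sign of each gradient entry gives the direction: moving a feature against its gradient reduces the loss toward the modified target. Averaging `importance` over all cases in a dataset gives the dataset-level ranking described above.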
  • a complementary method 1800 based on the method of 1700 to analyze a CVP is to see how features are used in a tissue-specific manner.
  • this extension simply receives examples from the dataset corresponding to particular tissues, and, at block 1804 , performs the procedure as described above.
  • this procedure yields tissue-specific feature importance information.
  • Referring now to FIG. 19, shown therein is a detailed illustration of a method 1900 to generate a visualization for tissue-specific feature importance based on the methods 1700 and 1800 .
  • input comprising examples from a dataset corresponding to a particular tissue is provided to the CVP.
  • tissue-specific cell variable predictions are provided by the CVP.
  • targets are constructed based on the cell variable predictions, such that there is a mismatch between the prediction and the target.
  • an update signal is computed which describes how the weights of the connection need to change to make the prediction match the target.
  • an update signal backpropagated to the input, Δfeature, is further computed.
  • examples from the dataset are sorted by tissue types.
  • the overall importance of features for each tissue is computed by taking the mean of the magnitude of the update signal over the entire dataset.
  • a visualization is generated, where the importance of each feature is colored accordingly for each tissue.
  • the systems and methods described here can also be used to determine whether a therapy reverses the effect of a variant on a pertinent cell variable.
  • an SNV within an intron may cause a decrease in the cell variable that corresponds to the inclusion level of a nearby exon, but an oligonucleotide therapy that targets the same region as the SNV or a different one may cause the cell variable (inclusion level) to rise to its original level.
  • a DNA editing system such as CRISPR-Cas9 may be used to edit the DNA, adding, removing or changing a sequence such that the cell variable (inclusion level) of the exon rises to its original level.
  • if the method described here is applied to a variant and a reference sequence obtained from the reference genome or an unaffected family member, and the cell variable is found to change by a certain amount, or if the cell variable has been measured to change by a certain amount, the following technique can be used to evaluate putative therapies to see if they correct the change.
  • therapies that target the variant sequence such as by protein-DNA or protein-RNA binding or by oligonucleotide hybridization
  • the effect of the therapy on the variant can be computed using the CVP, where the reference is taken to be the variant sequence and the “variant sequence” is now taken to be the variant sequence modified to account for the effect of the therapy.
  • That subsequence may be, in silico, modified by randomly changing the nucleotides, setting them all to a particular value, or some other method.
  • features that overlap, fully or partially, with the targeted subsequence may be set to values that reflect absence of the feature.
  • the reference (the original variant) and the modified variant are then fed into the CVP and the change in the cell variable is computed. This is repeated with a wide range of therapies, and the efficacy of each therapy can be determined by how much the therapy-induced change in the cell variable corrects for the original variant-induced change.
  • the procedure is even more straightforward.
  • the reference is taken to be the original variant, and the variant is taken to be the edited version of the variant.
  • the output of the CVP then indicates by how much the cell variable will change because of the editing.
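  • One hedged way to turn the therapy comparison into a number is to express the therapy-induced change as a fraction of the variant-induced change, as sketched below. The correction-fraction formula, the ranking rule, the therapy names and the cell variable values are all assumptions for illustration; in practice the three values would come from CVP outputs for the reference, the variant, and the therapy-modified or edited variant.

```python
def correction_fraction(reference, variant, treated):
    """Fraction of the variant-induced change that the therapy reverses.

    1.0 means the cell variable is restored to the reference level, 0.0 means
    no correction, and values above 1.0 indicate over-correction.
    """
    variant_change = variant - reference
    if variant_change == 0:
        return 0.0  # nothing to correct
    return (variant - treated) / variant_change

def rank_therapies(reference, variant, treated_by_therapy):
    """Order therapies by how close their correction fraction is to 1.0."""
    scored = [(name, correction_fraction(reference, variant, value))
              for name, value in treated_by_therapy.items()]
    return sorted(scored, key=lambda item: abs(1.0 - item[1]))

# Hypothetical exon inclusion levels for one condition.
reference_psi = 0.8
variant_psi = 0.3
treated = {"oligo_A": 0.7, "oligo_B": 0.35, "edit_C": 0.95}
ranking = rank_therapies(reference_psi, variant_psi, treated)
```

  • Here the first oligonucleotide recovers about 80% of the lost inclusion, the edit over-corrects by about 30%, and the second oligonucleotide recovers only about 10%.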
  • An exemplary method comprises computing a set of features from the DNA or RNA sequence containing the variant, applying a network of at least two layers of processing units (the deep neural network) to the variant features to compute the one or more condition-specific variant cell variables, computing a set of features from a reference DNA or RNA sequence, applying the deep network to the reference features to compute the one or more condition-specific reference cell variables, and computing the variant-induced changes in the one or more condition-specific cell variables by comparing the one or more condition-specific reference cell variables to the one or more condition-specific variant cell variables.
  • the number of condition-specific cell variables is at least two.
  • the deep neural network may be trained using a dataset of examples, where each example is a measured DNA or RNA sequence and a corresponding set of measured values of the condition-specific cell variables, one for each condition, and where the condition-specific cell variables are not normalized using a baseline that is determined by combining the condition-specific cell variables across two or more conditions.
  • the set of features may include a binary matrix with 4 rows and a number of columns equal to the length of the DNA or RNA sequence and where each column contains a single ‘1’ and three ‘0’s and where the row in which each ‘1’ occurs indicates the nucleotide at the corresponding position in the DNA or RNA sequence.
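  • The binary matrix encoding can be sketched as follows. The row order A, C, G, T and the treatment of U as T for RNA input are assumptions, as is the decision to reject ambiguous bases so that every column contains exactly one ‘1’.

```python
def one_hot_matrix(sequence, alphabet="ACGT"):
    """Encode a DNA or RNA sequence as a 4 x L binary matrix (list of rows).

    Row order follows `alphabet` (assumed A, C, G, T); RNA 'U' is mapped to 'T'.
    """
    seq = sequence.upper().replace("U", "T")
    unknown = set(seq) - set(alphabet)
    if unknown:
        raise ValueError("unsupported bases: %s" % sorted(unknown))
    return [[1 if base == nt else 0 for base in seq] for nt in alphabet]

matrix = one_hot_matrix("GATTACA")
column_sums = [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]
```

  • Each column of `matrix` sums to one, and the row holding the ‘1’ identifies the nucleotide at that position.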
  • the set of features may include a set of features computed using the recognition path of an autoencoder that is applied to the binary matrix.
  • the autoencoder may be trained using a dataset of binary matrices computed using a set of DNA or RNA sequences of fixed length.
  • the set of features may also include real and binary features derived from the DNA or RNA sequence.
  • At least part of the deep network may be configured to form a convolutional network and/or recurrent network.
  • Part of the deep network that is a recurrent network may be configured to use long short-term memory (LSTM).
  • the deep neural network may be trained using a dataset of feature vectors extracted from DNA or RNA and a corresponding set of measured values of cell variables.
  • the training method may adjust the parameters of the deep neural network so as to minimize the sum of the error between the measured cell variables and the output of the deep neural network.
  • the error may be the squared difference between the measured cell variable and the corresponding output of the neural network.
  • the error may be the absolute difference between the measured cell variable and the corresponding output of the neural network.
  • the error may be the Kullback-Leibler divergence between the measured cell variable and the corresponding output of the neural network.
  • Stochastic gradient descent may be used to train the deep neural network.
  • Dropout may be used to train the deep neural network.
  • the hyperparameters of the deep neural network may be adjusted so as to minimize the error on a separate validation set.
  • the deep neural network may be trained using multitask learning, where the outputs of the deep neural network comprise at least two of the following: a real-valued cell variable, a probability distribution over a discretized cell variable, a probability distribution over a real-valued cell variable, a difference between two real-valued cell variables, a probability distribution over a discretized difference between two real-valued cell variables, a probability distribution over the difference between two real-valued cell variables.
  • An input to the deep neural network may indicate the condition for which the cell variable is computed and the deep neural network is applied repeatedly to compute each condition-specific cell variable.
  • the output of the deep neural network may comprise one real value for each condition and the variant-induced change for each condition may be computed by subtracting the computed reference cell variable from the computed variant cell variable.
  • the output of the deep neural network may comprise a probability distribution over a discrete variable for each condition and the variant-induced change for each condition may be computed by summing the absolute difference between the computed probabilities for the reference cell variable and the variant cell variable.
  • the output of the deep neural network may comprise a probability distribution over a discrete variable for each condition and the variant-induced change for each condition may be computed using the Kullback-Leibler divergence between the computed probabilities for the reference cell variable and the variant cell variable.
  • the output of the deep neural network may comprise a probability distribution over a discrete variable for each condition and the variant-induced change for each condition may be computed by first computing the expected value of the reference cell variable and the variant cell variable, and then subtracting the expected value of the reference cell variable from the expected value of the variant cell variable.
  • the variant-induced changes in the one or more condition-specific cell variables may be combined to output a single numerical variant score.
  • the variant score may be computed by summing the variant-induced changes across conditions.
  • the variant score may be computed by summing the squares of the variant-induced changes across conditions.
  • the variant score may be computed by summing the outputs of a nonlinear function that are computed by applying the nonlinear function to the variant-induced changes across conditions.
  • At least two variants and corresponding reference sequences may be independently processed to compute the variant-induced changes in one or more condition-specific cell variables for each variant and corresponding reference sequence. At least two variants and corresponding reference sequences may be independently processed to compute the variant score for each variant and corresponding reference sequence.
  • the variant scores may be used to prioritize the variants by sorting them according to their scores. Thresholds may be applied to the score to classify the variant as deleterious or non-deleterious, or to classify the variant as pathogenic, likely pathogenic, unknown significance, likely benign or benign, or to classify the variant using any other discrete set of labels.
  • a validation data set consisting of variants, reference sequences, and labels may be used to compute the thresholds that minimize classification error.
  • the scores may be combined with additional numerical information before the variants are sorted.
  • the scores may be combined with additional numerical information before the thresholds are applied.
  • the distance between the two variants in each pair may be computed by summing the output of a nonlinear function applied to the difference between the change in the condition-specific cell variable for the first variant and the change in the condition-specific cell variable for the second variant.
  • the nonlinear function may be the square operation.
  • the nonlinear function may be the absolute operation.
  • the deleteriousness label of an unknown variant may be determined by computing the distance of the variant to one or more variants of known deleteriousness and outputting the label or the score of the closest known variant.
  • the deleteriousness value of an unknown variant may be determined by computing the distance of the variant to one or more variants of known deleteriousness and then computing the weighted average of their labels or scores, where the weights are nonlinear functions of the distances. Two or more unknown variants may be prioritized, by sorting them according to their deleteriousness values.
  • the mini-batches used during multitask training may be balanced so that the number of cases that exhibit a large difference is similar to the number of cases that exhibit a small difference.
  • the genetic variant may be a single nucleotide variant.
  • the genetic variant may contain two or more distinct single nucleotide variants.
  • the genetic variant may be a combination of substitutions, insertions and deletions and not be a single nucleotide variant.
  • the genetic variant may be obtained by sequencing the DNA from a patient sample.
  • the reference sequence may be obtained by sequencing the DNA from a close relative of the patient.
  • the reference sequence may be any DNA or RNA sequence and the variant sequence may be any DNA or RNA sequence, but where the reference sequence and the variant sequence are not identical.
  • the features may include position-dependent genetic features such as conservation.
  • the most explanatory feature may be determined by examining each feature in turn, and computing a feature-specific variant feature vector by copying the feature derived from the variant sequence onto the features derived from the reference sequence; using the deep neural network to compute the variant-induced changes in the one or more condition-specific cell variables for that feature-specific variant; and identifying the feature whose corresponding feature-specific variant-induced changes in the one or more condition-specific cell variables are most similar to the variant-induced changes in the one or more condition-specific cell variables.
  • the similarity between the feature-specific variant-induced changes in the one or more condition-specific cell variables and the variant-induced changes in the one or more condition-specific cell variables may be computed by summing the squares of their differences.
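  • The three ways of measuring a variant-induced change from a predicted discrete distribution, as listed above, can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names, the `bin_values` argument (the numeric value assigned to each discrete outcome), the direction of the Kullback-Leibler divergence (variant relative to reference) and the small `eps` guard are all assumptions made for the example.

```python
import math
from typing import Sequence


def l1_change(p_ref: Sequence[float], p_var: Sequence[float]) -> float:
    """Sum of absolute differences between the two predicted distributions."""
    return sum(abs(v - r) for r, v in zip(p_ref, p_var))


def kl_change(p_ref: Sequence[float], p_var: Sequence[float],
              eps: float = 1e-12) -> float:
    """KL(variant || reference). Direction and eps guard are assumptions."""
    return sum(v * math.log((v + eps) / (r + eps))
               for r, v in zip(p_ref, p_var) if v > 0.0)


def expected_value_change(p_ref: Sequence[float], p_var: Sequence[float],
                          bin_values: Sequence[float]) -> float:
    """Expected value under the variant minus that under the reference."""
    e_ref = sum(p * x for p, x in zip(p_ref, bin_values))
    e_var = sum(p * x for p, x in zip(p_var, bin_values))
    return e_var - e_ref


# Hypothetical predictions for one condition over three discrete levels.
p_reference = [0.2, 0.5, 0.3]
p_variant = [0.6, 0.3, 0.1]
levels = [0.0, 0.5, 1.0]

changes = {
    "l1": l1_change(p_reference, p_variant),
    "kl": kl_change(p_reference, p_variant),
    "expected": expected_value_change(p_reference, p_variant, levels),
}
```

  • The absolute-difference and divergence measures are non-negative and ignore the direction of the change, whereas the expected-value difference is signed, so the choice between them depends on whether direction matters downstream.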

Abstract

Described herein are systems and methods that receive as input a DNA or RNA sequence, extract features, and apply layers of processing units to compute one or more condition-specific cell variables, corresponding to cellular quantities measured under different conditions. The system may be applied to a sequence containing a genetic variant, and also to a corresponding reference sequence to determine how much the condition-specific cell variables change because of the variant. The changes in the condition-specific cell variables are used to compute a score for how deleterious a variant is, to classify a variant's level of deleteriousness, to prioritize variants for subsequent processing, and to compare a test variant to variants of known deleteriousness. By modifying the variant or the extracted features so as to incorporate the effects of DNA editing, oligonucleotide therapy, DNA- or RNA-binding protein therapy or other therapies, the system may be used to determine if the deleterious effects of the original variant can be reduced.

Description

    CROSS-REFERENCE
  • This application is a continuation of U.S. application Ser. No. 16/197,146, filed Nov. 20, 2018, which is a divisional of U.S. application Ser. No. 14/739,432, filed Jun. 15, 2015 (now U.S. Pat. No. 10,185,803, issued Jan. 22, 2019), each of which is incorporated by reference herein in its entirety.
  • TECHNICAL FIELD
  • The following relates generally to systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network.
  • BACKGROUND
  • Precision medicine, genetic testing, therapeutic development and whole genome, exome, gene panel and mini-gene reporter analysis require the ability to accurately interpret how diverse features encoded in the genome, such as protein binding sites, RNA secondary structures, and nucleosome positions, impact processes within cells. Most existing approaches to identifying disease variants ignore their impact on these genomic features. Many genome studies are restricted to mutations in exons that either change an amino acid in a protein or prevent the production of the protein.
  • Over the past decade, the importance of understanding regulatory genomic instructions and not just the protein-coding exons and genes that they control has been underscored by several observations: While evolution is estimated to preserve at least 5.5% of the human genome, only 1% accounts for exons within genes; biological complexity often cannot be accounted for by the number of genes (e.g. balsam poplar trees have twice as many genes as humans); differences between organisms cannot be accounted for by differences between their genes (e.g. less than 1% of human genes are distinct from those of mice and dogs); increasingly, disease-causing variants have been found outside of exons, indicating that crucial information is encoded outside of those sequences.
  • In traditional molecular diagnostics, an example workflow may be as follows: a blood or tissue sample is obtained from a patient; variants (mutations) are identified, by either sequencing the genome, the exome or a gene panel; the variants are individually examined manually (e.g. by a technician), using literature databases and internet search engines; a diagnostic report is prepared. Manually examining the variants is costly and prone to human error, which may lead to incorrect diagnosis and potential patient morbidity. Automating or semi-automating this step is thus beneficial. Since the number of possible genetic variants is large, evaluating them manually is time-consuming, highly dependent on previous literature, and involves experimental data that has poor coverage and therefore can lead to high false negative rates, or “variants of unknown significance”. The same issues arise in therapeutic design, where the number of possible therapies (molecules) to be evaluated is extremely large.
  • Techniques have been proposed in which predicting phenotypes (e.g., traits and disease risks) from the genome is characterized as a problem suitable for solution by machine learning, and more specifically by supervised machine learning, where the inputs are features extracted from a DNA sequence (genotype) and the outputs are the phenotypes. Such an approach is shown in FIG. 2(a). A DNA sequence 204 is fed to a predictor 202 to generate outputs 208, such as disease risks. This approach is unsatisfactory for most complex phenotypes and diseases for two reasons. The first is the sheer complexity of the relationship between genotype (represented by 204) and phenotype (represented by 208). Even within a single cell, the genome directs the state of the cell through many layers of intricate biophysical processes and control mechanisms that have been shaped by evolution. It is extremely challenging to infer these regulatory processes by observing only the genome and phenotypes, for example due to ‘butterfly effects’. For many diseases, the amount of data necessary would be cost-prohibitive to acquire with currently available technologies, due to the size of the genome and the exponential number of possible ways a disease can be traced to it. Second, even if one could infer such models (those that are predictive of disease risks), it is likely that the hidden variables of these models would not correspond to biological mechanisms that can be acted upon, unless strong priors, such as cause-effect relationships, have been built in. This is important for the purpose of developing therapies. Insisting on how a model ought to work by using these priors can hurt model performance if the priors are inaccurate, which they usually are.
  • Some other machine learning approaches to genetic analysis have been proposed. One such approach predicts a cell variable that combines information across conditions, or tissues. Another describes a shallow, single-layer Bayesian neural network (BNN), which often relies on methods like Markov Chain Monte Carlo (MCMC) to sample models from a posterior distribution; such methods can be difficult to speed up and to scale up to a large number of hidden variables and a large volume of training data. Furthermore, it is computationally expensive to obtain predictions from a BNN, which requires computing the average predictions of many models.
  • SUMMARY
  • In one aspect, a method for computing variant-induced changes in one or more condition-specific cell variables for one or more variants is provided, the method comprising: computing a set of variant features from a DNA or RNA variant sequence; applying a deep neural network of at least two layers of processing units to the variant features to compute one or more condition-specific variant cell variables; computing a set of reference features from a DNA or RNA reference sequence; applying the deep neural network to the reference features to compute one or more condition-specific reference cell variables; computing a set of variant-induced changes in the one or more condition-specific cell variables by comparing the one or more condition-specific reference cell variables to the one or more condition-specific variant cell variables.
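  • The method of this aspect (extract features from the variant and reference sequences, apply the same predictor to both, and compare the outputs) can be sketched as follows. The feature extractor and the `toy_predictor` below are hypothetical stand-ins chosen only to keep the example self-contained; in particular, `toy_predictor` is not a deep neural network, and any callable that maps a feature vector to one value per condition could be substituted. Subtraction is used as the comparison, which is one of several options.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[Sequence[float]], List[float]]


def extract_features(sequence: str) -> List[float]:
    """Hypothetical features: fraction of A, C, G and T (U is read as T)."""
    seq = sequence.upper().replace("U", "T")
    n = max(len(seq), 1)
    return [seq.count(base) / n for base in "ACGT"]


def variant_induced_changes(reference_seq: str, variant_seq: str,
                            predictor: Predictor) -> List[float]:
    """Per-condition change: variant cell variable minus reference cell variable."""
    ref_out = predictor(extract_features(reference_seq))
    var_out = predictor(extract_features(variant_seq))
    return [v - r for r, v in zip(ref_out, var_out)]


def toy_predictor(features: Sequence[float]) -> List[float]:
    """Stand-in for a trained network; returns one value for each of two conditions."""
    a, c, g, _t = features
    return [c + g, a]


# A single nucleotide substitution (T -> G) in a four-base toy sequence.
delta = variant_induced_changes("ACGT", "ACGG", toy_predictor)
```

  • Because the same predictor is applied to both sequences, identical sequences always yield a zero change for every condition.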
  • In another aspect, a deep neural network for computing variant-induced changes in one or more condition-specific cell variables for one or more variants is provided, the deep neural network comprising: an input layer configured to receive as input a set of variant features from a DNA or RNA variant sequence; and at least two layers of processing units operable to: compute one or more condition-specific variant cell variables; compute a set of reference features from a DNA or RNA reference sequence; compute one or more condition-specific reference cell variables; compute a set of variant-induced changes in the one or more condition-specific cell variables by comparing the one or more condition-specific reference cell variables to the one or more condition-specific variant cell variables.
  • In another aspect, a method for training a deep neural network to compute one or more condition-specific cell variables is provided, the method comprising: establishing a neural network comprising at least two connected layers of processing units; repeatedly updating one or more parameters of the neural network so as to decrease the error for a set of training cases chosen randomly or using a predefined pattern, where each training case comprises features extracted from a DNA or RNA sequence and corresponding targets derived from measurements of one or more condition-specific cell variables, until a condition for convergence is met at which point the parameters are no longer updated.
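  • A training loop of the kind described in this aspect can be sketched as follows, assuming a small network with one tanh hidden layer and a linear output, a squared-error loss, plain stochastic gradient descent, randomly ordered training cases, and a convergence condition based on the mean loss. All of these choices, together with the hyperparameters and the toy training cases, are assumptions for illustration only.

```python
import math
import random


def init_params(n_in, n_hidden, rng):
    """Small random weights for a hidden layer and a linear output unit."""
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(params, x):
    """Return hidden activations and the predicted cell variable."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(params["W1"], params["b1"])]
    y = sum(w * hj for w, hj in zip(params["W2"], h)) + params["b2"]
    return h, y


def sgd_step(params, x, target, lr):
    """One gradient update for the loss 0.5 * (y - target)^2; returns that loss."""
    h, y = forward(params, x)
    err = y - target
    for j, hj in enumerate(h):
        # Gradient w.r.t. the hidden pre-activation, using W2 before its update.
        grad_pre = err * params["W2"][j] * (1.0 - hj * hj)
        params["W2"][j] -= lr * err * hj
        for i, xi in enumerate(x):
            params["W1"][j][i] -= lr * grad_pre * xi
        params["b1"][j] -= lr * grad_pre
    params["b2"] -= lr * err
    return 0.5 * err * err


def train(cases, n_hidden=4, lr=0.1, tol=1e-4, max_epochs=2000, seed=0):
    """Update parameters until the mean loss falls below tol or max_epochs is reached."""
    rng = random.Random(seed)
    params = init_params(len(cases[0][0]), n_hidden, rng)
    history = []
    for _ in range(max_epochs):
        order = list(cases)
        rng.shuffle(order)  # training cases chosen in random order
        mean_loss = sum(sgd_step(params, x, t, lr) for x, t in order) / len(order)
        history.append(mean_loss)
        if mean_loss < tol:  # convergence condition met: stop updating
            break
    return params, history


# Hypothetical training cases: (features, measured cell variable).
cases = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.5), ([1.0, 0.0], 0.5), ([1.0, 1.0], 1.0)]
params, history = train(cases)
```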
  • DESCRIPTION OF THE DRAWINGS
  • The features of the invention will become more apparent in the following detailed description in which reference is made to the appended drawings wherein:
  • FIG. 1 shows a system for cell variable prediction;
  • FIG. 2 shows a comparison of approaches to predict phenotypes, such as disease risks, from an input;
  • FIG. 3 shows a method of generating target cell variables for training;
  • FIG. 4 shows an example deep neural network architecture for a cell variable predictor that predicts splicing levels;
  • FIG. 5 shows a further example deep neural network architecture for a cell variable predictor that predicts splicing levels;
  • FIG. 6 shows yet a further example deep neural network architecture for a cell variable predictor that predicts splicing levels;
  • FIG. 7 shows yet a further example deep neural network architecture for a cell variable predictor that predicts splicing levels;
  • FIG. 8 shows yet a further example deep neural network architecture for a cell variable predictor that predicts splicing levels;
  • FIG. 9 shows yet a further example deep neural network architecture for a cell variable predictor that predicts splicing levels;
  • FIG. 10 shows a method for training cell variable predictors;
  • FIG. 11 shows a system to perform non-uniform sampling of training cases for determining a mini-batch for training a deep neural network;
  • FIG. 12 shows a method for training cell variable predictors for ensuring a consistent backpropagation signal that updates the weights connected to tissue inputs and biases learning towards the event with large tissue variability early on before overfitting occurs;
  • FIG. 13 shows a method for using the outputs of the CVP for scoring, classifying and prioritizing genetic variants;
  • FIG. 14 shows a method for scoring variants by associating cell variable changes with those of other variants;
  • FIG. 15 shows a method for interpreting which genetic features account for variant-induced cell variable changes;
  • FIG. 16 shows a further method for interpreting which genetic features account for variant-induced cell variable changes;
  • FIG. 17 shows a further method for interpreting which genetic features account for variant-induced cell variable changes;
  • FIG. 18 shows a method to generate a visualization for tissue-specific feature importance; and
  • FIG. 19 shows a detailed illustration of the method to generate a visualization for tissue-specific feature importance.
  • DETAILED DESCRIPTION
  • For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the Figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practised without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Also, the description is not to be considered as limiting the scope of the embodiments described herein.
  • Various terms used throughout the present description may be read and understood as follows, unless the context indicates otherwise: “or” as used throughout is inclusive, as though written “and/or”; singular articles and pronouns as used throughout include their plural forms, and vice versa; similarly, gendered pronouns include their counterpart pronouns so that pronouns should not be understood as limiting anything described herein to use, implementation, performance, etc. by a single gender; “exemplary” should be understood as “illustrative” or “exemplifying” and not necessarily as “preferred” over other embodiments. Further definitions for terms may be set out herein; these may apply to prior and subsequent instances of those terms, as will be understood from a reading of the present description.
  • Any module, unit, component, server, computer, terminal, engine or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the device or accessible or connectable thereto. Further, unless the context clearly indicates otherwise, any processor or controller set out herein may be implemented as a singular processor or as a plurality of processors. The plurality of processors may be arrayed or distributed, and any processing function referred to herein may be carried out by one or by a plurality of processors, even though a single processor may be exemplified. Any method, application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.
  • Systems and methods described herein relate, in part, to the problem of assessing genetic variants with respect to phenotypes, such as deleteriousness for human diseases. This problem has implications in several industrial categories under the broad umbrella of ‘personalized medicine’, including molecular diagnostics, whole genome sequencing, and pharmaceutical development.
  • It has been found that the effect of a variant depends on genetic context, which includes which other variants are present and, more generally, on the genomic sequence within the individual, or patient, being tested. So, whereas a particular variant may be benign in one genetic context, it may cause a disease in another genetic context. This impacts prioritization and interpretation. The following describes a process for context-dependent genetic variant assessment in which variants may be ranked and presented as a priority list. Variant prioritization can be used to increase the efficiency and accuracy of manual interpretation, since it enables the technician to focus on a small subset of candidates.
  • Computational procedures for prioritizing and/or interpreting variants must generalize well. Generalization refers to the ability of the computational procedure to assess variants that have not been seen before and that may be involved in a disease that has not been previously analyzed. A method that generalizes well should even be able to assess variants within genes that have not been previously analyzed for variants. Finally, a crucial aspect of enabling computational procedures to operate effectively is computational efficiency since these procedures may involve aggregating, organizing and sifting through large amounts of data.
  • The systems and methods described herein apply deep learning to genetic variant analysis. Deep learning generally refers to methods that map data through multiple levels of abstraction, where higher levels represent more abstract entities. The goal of deep learning is to provide a fully automatic system for learning complex functions that map inputs to outputs, without using hand crafted features or rules. One implementation of deep learning comes in the form of feedforward neural networks, where levels of abstraction are modeled by multiple non-linear hidden layers.
  • In brief, embodiments described herein provide systems and methods that receive as input a DNA or RNA sequence, extract features, and apply multiple layers of nonlinear processing units of a cell variable predictor (“CVP”) to compute a cell variable, which corresponds to a measurable quantity within a cell, for different conditions, such as tissue types. To distinguish a cell variable that corresponds to a measurable quantity for a specific condition, such as a tissue type, from a cell variable that is a combination of measurable quantities from multiple conditions, we refer to the former as a “condition-specific cell variable” and the latter as a “non-specific cell variable”. In embodiments, the CVP is applied to a DNA or RNA sequence and/or features extracted from the sequences, containing a genetic variant, and also to a corresponding reference (e.g., wild type) sequence to determine how much the cell variable changes because of the variant. The systems and methods can be applied to naturally occurring genomic sequences, mini-gene reporters, edited genomic sequences, such as those edited using CRISPR-Cas9, genomic sequences targeted by therapies, and other genomic sequences. The change in the cell variable in different conditions may be used to classify disease-causing variants, compute a score for how deleterious a variant is, prioritize variants for subsequent processing, interpret the mechanism by which a variant operates, and determine the effect of a therapy. Further, an unknown variant can be given a high score for deleteriousness if it induces a change in a particular cell variable that is similar to changes in the same cell variable that are induced by one or more variants that are known to be deleterious.
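  • Scoring an unknown variant by its similarity to variants of known deleteriousness can be sketched as follows. The distance sums a nonlinear function (here the square) of the per-condition differences between two variants' cell variable changes. The exponential weighting, the `bandwidth` parameter and the example numbers are assumptions for illustration; any decreasing nonlinear function of the distance could be used instead.

```python
import math
from typing import Callable, List, Sequence, Tuple

Known = Tuple[Sequence[float], float]  # (per-condition changes, label or score)


def distance(changes_a: Sequence[float], changes_b: Sequence[float],
             fn: Callable[[float], float] = lambda d: d * d) -> float:
    """Sum of a nonlinear function of the per-condition differences."""
    return sum(fn(a - b) for a, b in zip(changes_a, changes_b))


def nearest_label(unknown: Sequence[float], known: List[Known]) -> float:
    """Label or score of the closest known variant."""
    return min(known, key=lambda item: distance(unknown, item[0]))[1]


def weighted_deleteriousness(unknown: Sequence[float], known: List[Known],
                             bandwidth: float = 0.1) -> float:
    """Weighted average of known scores; weights decay with distance (assumed form)."""
    weights = [math.exp(-distance(unknown, changes) / bandwidth)
               for changes, _ in known]
    total = sum(weights)
    return sum(w * score for w, (_, score) in zip(weights, known)) / total


# Hypothetical known variants: changes in two conditions and a deleteriousness score.
known_variants = [([0.5, 0.4], 1.0), ([0.0, 0.05], 0.0)]
unknown_variant = [0.45, 0.35]

closest = nearest_label(unknown_variant, known_variants)
value = weighted_deleteriousness(unknown_variant, known_variants)
```

  • Passing `fn=abs` to `distance` gives the absolute-difference form of the distance instead of the squared form.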
  • In embodiments, the CVP comprises a deep neural network having multiple layers of processing units and possibly millions of parameters. The CVP may be trained using a dataset of DNA or RNA sequences and corresponding measurements of cell variables, using a deep learning training method that adjusts the strengths of the connections between processing units in adjacent layers. Specialized training methods are described, including a multi-task training method that improves accuracy. The mechanism by which a mutation causes a deleterious change in a cell variable may in some instances be determined by identifying features or groups of features that are changed by the mutation and that cause the cell variable to change, which can be computed by substituting features derived from the variant sequence one by one into the reference sequence or by backpropagating the cell variable change back to the input features.
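  • The feature-substitution approach to identifying the mechanism of a variant can be sketched as follows: each feature derived from the variant is copied, one at a time, onto the reference features, and the feature whose substitution best reproduces the full variant-induced change is reported. The `toy_predictor` is a hypothetical stand-in for a trained CVP, and the sum of squared differences is used as the similarity measure.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[Sequence[float]], List[float]]


def most_explanatory_feature(ref_features: Sequence[float],
                             var_features: Sequence[float],
                             predictor: Predictor) -> int:
    """Index of the feature whose substitution best explains the full change."""
    ref_out = predictor(ref_features)
    full_change = [v - r for r, v in zip(ref_out, predictor(var_features))]

    best_index, best_dist = -1, float("inf")
    for i in range(len(ref_features)):
        hybrid = list(ref_features)
        hybrid[i] = var_features[i]  # copy one variant feature onto the reference
        change = [h - r for r, h in zip(ref_out, predictor(hybrid))]
        dist = sum((c - f) ** 2 for c, f in zip(change, full_change))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def toy_predictor(features: Sequence[float]) -> List[float]:
    """Stand-in predictor: condition 0 depends mostly on feature 0."""
    return [2.0 * features[0] + 0.1 * features[1], features[2]]


# Hypothetical feature vectors; the variant alters features 0 and 1.
reference = [1.0, 1.0, 1.0]
variant = [3.0, 2.0, 1.0]
explained_by = most_explanatory_feature(reference, variant, toy_predictor)
```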
  • If a change related to a variant of any cell variable is large enough compared to a reference, the variant warrants investigation for deleteriousness. The systems described herein can thus be used to prioritize genetic variants for further ‘wet-lab’ investigations, significantly aiding and reducing the costs of variant discovery. Furthermore, because of the presence of cell variables in the predictor, the invention can assign ‘blame’ to variants that are disease causing, and generate appropriate user visualizations. For example, a variant that changes the splicing ‘cell variable’ may be targeted by a therapy that targets the splicing pathway to remediate the disease.
  • As used herein, the term “reference sequence” means: in the context of evaluating a variant (as described below), where the systems described herein compare the variant to a ‘reference sequence’, a DNA or RNA sequence obtained using genome sequencing, exome sequencing or gene sequencing of an unrelated individual or a closely related individual (e.g., parent, sibling, child). Alternatively, the reference sequence may be derived from the reference human genome, or it may be an artificially designed sequence.
  • As used herein, the term “variant” means: a DNA or RNA sequence that differs from a reference sequence in one or more nucleotides, by substitutions, insertions, deletions or any other changes. The variant sequence may be obtained using genome sequencing, exome sequencing or gene sequencing of an individual. Alternatively, the variant sequence may be derived from the reference human genome, or it may be an artificially designed sequence. For the purpose of this invention, when a variant is being evaluated by the system, the sequence containing the variant as well as surrounding DNA or RNA sequence is included in the ‘variant’.
  • As used herein, the term “single nucleotide variant” (“SNV”) means: a variant that consists of a substitution to a single nucleotide.
  • As used herein, the term “variant analysis” means: the procedure (computational or otherwise) of processing a variant, possibly in addition to surrounding DNA or RNA sequence that establishes context, for the purpose of variant scoring, categorization, prioritization, and interpretation.
  • As used herein, the term “score” means: a numeric value that indicates how deleterious a variant is expected to be.
  • As used herein, the term “classification” refers to the classification of a variant. A variant may be classified in different ways, such as by applying a threshold to the score to determine if the variant is deleterious or not. The American College of Medical Genetics recommends a five-way classification: pathogenic (very likely to contribute to the development of disease); likely pathogenic (there is strong evidence that the variant is pathogenic, but the evidence is inconclusive); unknown significance or VUS (there is not enough evidence to support classification one way or another); likely benign (there is strong evidence that the variant is benign, but the evidence is inconclusive); benign (very likely to be benign).
  • As used herein, the terms “rank”/“prioritization” mean: the process of sorting the scores of a set of variants to determine which variant should be further investigated. The pathogenic variants will be at the top, with the benign variants at the bottom.
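  • Scoring, ranking and threshold-based classification as defined above can be sketched as follows. The per-condition changes are combined into a single score by summing a nonlinear function of each change (the square by default), variants are sorted by score, and cutoffs map a score to a label. The cutoff values, the variant names and the exhaustive threshold search over observed validation scores are assumptions for illustration.

```python
import bisect
from typing import Callable, Dict, List, Sequence

FIVE_WAY = ["benign", "likely benign", "unknown significance",
            "likely pathogenic", "pathogenic"]


def variant_score(changes: Sequence[float],
                  fn: Callable[[float], float] = lambda d: d * d) -> float:
    """Combine per-condition changes into one score."""
    return sum(fn(d) for d in changes)


def prioritize(scores: Dict[str, float]) -> List[str]:
    """Variant names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def classify(score: float, cutoffs: Sequence[float],
             labels: Sequence[str] = FIVE_WAY) -> str:
    """Map a score to a label using ascending cutoffs (len(labels) - 1 of them)."""
    return labels[bisect.bisect_right(list(cutoffs), score)]


def best_threshold(scores: Sequence[float], is_deleterious: Sequence[bool]) -> float:
    """Threshold among observed scores that minimizes validation error."""
    best_t, best_err = float("inf"), len(scores) + 1
    for t in sorted(set(scores)):
        err = sum((s >= t) != y for s, y in zip(scores, is_deleterious))
        if err < best_err:
            best_t, best_err = t, err
    return best_t


# Hypothetical variant-induced changes in two conditions for three variants.
changes_by_variant = {"v1": [0.1, -0.2], "v2": [0.5, 0.4], "v3": [0.0, 0.05]}
scores = {name: variant_score(ch) for name, ch in changes_by_variant.items()}
ranking = prioritize(scores)
```

  • Summing squares (or absolute values) prevents changes of opposite sign in different conditions from cancelling, which a plain sum would allow.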
  • As used herein, the term “cell variable” means: a quantity, level, potential, or process outcome in the cell that is potentially relevant to the function of a living cell, and that is computed by a CVP (see below). There are two types of cell variables: a “condition-specific cell variable” is a cell variable that is measured or predicted under a specific condition, such as a tissue type; a “non-specific cell variable” is a cell variable that is derived by combining information from across multiple conditions, for example by subtracting the average cell variable values across conditions from the cell variable for each condition. A cell variable can often be quantified by a vector of one or more real-valued numbers, or by a probability distribution over such a vector. Examples include the strength of binding between two molecules (e.g. protein-protein or protein-DNA binding), exon splicing levels (the fraction of mRNA transcripts in a particular tissue that contain a particular exon, i.e. percent spliced in), DNA curvature, DNA methylation, RNA folding interactions.
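  • The example given in this definition, deriving a non-specific cell variable by subtracting the average across conditions from each condition's value, can be sketched as follows. The tissue names and percent-spliced-in values are hypothetical.

```python
from typing import Dict


def subtract_condition_mean(values: Dict[str, float]) -> Dict[str, float]:
    """Subtract the across-condition average from each condition's cell variable."""
    mean = sum(values.values()) / len(values)
    return {condition: value - mean for condition, value in values.items()}


# Hypothetical percent-spliced-in values for one exon in three tissues.
psi_by_tissue = {"brain": 0.9, "heart": 0.5, "liver": 0.1}
relative_psi = subtract_condition_mean(psi_by_tissue)
```

  • Each resulting value depends on the measurements from every condition, which is what distinguishes it from a condition-specific cell variable.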
  • As used herein, the term “event” means: in the context of a splicing-related cell variable (e.g. the fraction of transcripts with an exon spliced in), an observed (measured) alternative splicing event in the cell where both the genomic features and the corresponding splicing levels are known for that particular event. Each event can be used as either a training case or a testing case for a machine learning system.
  • Referring now to FIG. 1, shown therein is a system 100 for cell variable prediction, comprising a machine learning unit. The machine learning unit is preferably implemented by a deep neural network, which is alternatively referred to herein as a “cell variable predictor” (“CVP”) 101. The CVP takes as input a set of features, including genomic features, and produces an output intended to mimic a specific cell variable. The quantification of a cell variable can be represented in such a system by one or more real-valued numbers on an absolute or relative scale, with or without meaningful units. In embodiments, the CVP may provide other outputs in addition to outputs intended to mimic a specific cell variable.
  • The system 100 further comprises a memory 106 communicatively linked to the CVP 101.
  • An illustrated embodiment of the CVP 101 comprising a feedforward neural network having a plurality of layers 102 (i.e. deep) is shown. Each layer comprises one or more processing units 104, each of which implements a feature detector and/or a computation that maps an input to an output. The processing units 104 accept a plurality of parameter inputs from other layers and apply activation functions with associated weights for each such parameter input to the respective processing unit 104. Generally, the output of a processing unit of layer l may be provided as input to one or more processing units of layer l+1.
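  • The layer-to-layer computation described above, in which each processing unit applies an activation function to a weighted sum of its inputs and the outputs of layer l feed layer l+1, can be sketched as follows. The weights, biases and activation functions shown are arbitrary illustrative values, not parameters from any trained CVP.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float], Callable[[float], float]]


def relu(z: float) -> float:
    return max(0.0, z)


def identity(z: float) -> float:
    return z


def layer_forward(inputs: Sequence[float], weights: List[List[float]],
                  biases: List[float],
                  activation: Callable[[float], float]) -> List[float]:
    """Each row of weights defines one processing unit of the layer."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def network_forward(features: Sequence[float], layers: Sequence[Layer]) -> List[float]:
    """Feed the output of each layer into the next one."""
    activations = list(features)
    for weights, biases, activation in layers:
        activations = layer_forward(activations, weights, biases, activation)
    return activations


# Two inputs -> two hidden units (ReLU) -> one output unit (linear).
example_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[2.0, 1.0]], [0.5], identity),
]
output = network_forward([1.0, 2.0], example_layers)
```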
  • Each processing unit may be considered as a processing “node” of the network and one or more nodes may be implemented by processing hardware, such as a single or multi-core processor and/or graphics processing unit(s) (GPU(s)). Further, it will be understood that each processing unit may be considered to be associated with a hidden unit or an input unit of the neural network for a hidden layer or an input layer, respectively. The use of large (many hidden variables) and deep (multiple hidden layers) neural networks may improve the predictive performance of the CVP compared to other systems.
  • In embodiments, inputs to the input layer of the CVP can include genetic information, such as sequences representing DNA, RNA, features derived from DNA and RNA, and features providing extra information (e.g. tissue type, age, sex), while outputs at the output layer of the CVP can include cell variables.
  • It will be appreciated that though an illustrative feedforward network is described herein, the type of neural network implemented is not limited merely to feedforward neural networks but can also be applied to any neural networks, including convolutional neural networks, recurrent neural networks, auto-encoders and Boltzmann machines.
  • In embodiments the system 100 comprises a secondary analysis unit 114 for receiving the cell variables from the output layer and providing further analysis, as described below.
  • The memory 106 may comprise a database for storing activations and learned weights for each feature detector, as well as for storing datasets of genetic information and extra information and optionally for storing outputs from the CVP 101. The genetic information may provide a training set comprising training data. The training data may, for example, be used for training the CVP 101 to predict cell variables, in which case DNA and RNA sequences with known cell variables and/or phenotypes may be provided. The memory 106 may further store a validation set comprising validation data.
  • Generally, during the training stage, the neural network learns optimized weights for each processing unit. After learning, the optimized weight configuration can then be applied to test data. Stochastic gradient descent can be used to train feedforward neural networks. The learning process (backpropagation) consists for the most part of matrix multiplications, which makes it suitable for speed-up using GPUs. Furthermore, the dropout technique may be utilized to prevent overfitting.
  • The system may further comprise a computing device 110 communicatively linked to the CVP 101 for controlling operations carried out in the CVP. The computing device may comprise further input and output devices, such as input peripherals (such as a computer mouse or keyboard), and/or a display. The computing device 110 may further be linked to a remote device 112 over a wired or wireless network 108 for transmitting and receiving data. In embodiments, genetic information is received over the network 108 from the remote device 112 for storage in memory 106. Cell variable predictions and lists of variant priorities may be displayed to a user via the display.
  • Referring now to FIG. 2, shown therein is a comparison of a prior (FIG. 2(a)) and currently described (FIG. 2(b)) machine learning process to predict phenotypes, such as disease risks or deleteriousness from a genotype. Contrary to the prior approach, which was described above, the currently described process predicts a cell variable as an intermediate to the phenotype. As described above, the inputs 204 to a CVP can include sequences representing DNA, RNA, features derived from DNA and RNA, and features providing extra information (e.g. tissue type, age, sex). The cell variables 206 could be, for example, the distribution of proteins along a strand of DNA containing a gene, the number of copies of a gene (transcripts) in a cell, the distribution of proteins along the transcript, and the number of proteins. Once determined, the cell variables can be used by the system to determine how much a variant causes the cell variable to change. By examining how much a mutation causes the cell variable to change, the CVP can be used to score, categorize, and prioritize variants. Specifically, once determined, the cell variable predictions can act as high-level features to facilitate more accurate phenotypic predictions, optionally performed at the secondary analysis unit 114. By training predictors that predict how genotype influences cell variables, such as concentrations of proteins, the resultant machine learning problem is modularized. Moreover, it allows variants to be related to particular cell variables, thereby providing a mechanism to explain variants.
  • In one embodiment, the variant and a reference sequence are fed into the input layer of the CVP 101 and the amount of change in the cell variable is quantified and used to score, categorize and prioritize the variant by the secondary analysis unit 114.
  • In another embodiment, the secondary analysis unit 114 comprises a second system (of similar architecture to the CVP) trained to predict a phenotype based on the outputs of the cell variable prediction systems (as illustrated in FIG. 2(b)). For example, in the case of spinal muscular atrophy, the cell variable could be the frequency with which the exon is included when the gene is being copied to make a protein. Other examples of cell variables include the distribution of proteins along a strand of DNA containing a gene, the number of copies of a gene (transcripts) in a cell, the distribution of proteins along the transcript, and the number of proteins.
  • The CVP comprises multiple layers of nonlinear processing units to compute the cell variable using the raw DNA or RNA sequence, or features derived from the sequence. In embodiments, in order to quantify the effect of a variant, the system may first construct a pair of feature vectors corresponding to the reference sequence and the variant sequence. Due to the variant, these genomic feature vectors will be different, but without a further cell variable predictor it may not be possible to predict whether those differences would result in any change in phenotype. Embodiments of the predictive system may therefore infer both the reference cell variable value and the variant cell variable value using these two distinct feature vectors. After that, a distance function that combines the reference and the variant predictions may be used to produce a single score which summarizes the magnitude of predicted effect induced by the mutations. Example distance functions include the absolute difference in expectation, Kullback-Leibler divergence, and variation distance. Detailed mathematical formulas of these will be described in a later paragraph.
  • It will be appreciated that process 250 can rely on input features derived from other types of data besides DNA sequences (e.g. age, sex, known biomarkers)—the above described inputs are merely illustrative.
  • An aspect of the embodiments described herein is the use of machine learning to infer predictors that are capable of generalizing to new genetic contexts and to new cell states. For example, a predictor may be inferred using reference genome and data profiling transcripts in healthy tissues, but then applied to the genome of a cancer cell to ascertain how the distribution of transcripts changes in the cancer cell. This notion of generalization is a crucial aspect of the predictors that need to be inferred. If a predictor is good at generalization, it can analyze variant sequences that lead to changes in cell variables that may be indicative of disease state, without needing experimental measurements from diseased cells.
  • Process 250 may address the two problems discussed with respect to approach 200. Since the cell variables are more closely related to and more easily determined from genomic sequences than are phenotypes, learning predictors that map from DNA to cell variables is usually more straightforward. High-throughput sequencing technologies are currently generating massive amounts of data profiling these cell variables under diverse conditions; these datasets can be used to train larger and more accurate predictors. Also, since the cell variables correspond to intermediate biochemically active quantities, such as the concentration of a gene transcript, they may be good targets for therapies. If high disease risk is associated with a change in a cell variable compared to a healthy individual, an effective therapy may consist of restoring that cell variable to its normal state. Embodiments may include such cell variables as ‘exon inclusion or exclusion’, ‘alternative splice site selection’, ‘alternative polyadenylation site selection’, ‘RNA- or DNA-binding protein or microRNA specificity’, and ‘phosphorylation’.
  • Various aspects of the current system and method include: the method can be applied to raw DNA or RNA sequence or features extracted from the sequence, such as RNA secondary structures and nucleosome positions; the method can compute one or more condition-specific cell variables, without the need for a baseline average across conditions; the method can detect variants that affect all condition-specific cell variables in the same way; the method can compare a variant sequence to a reference sequence, enabling it to make different predictions for the same variant, depending on genetic context; the method can compute the condition-specific cell variables using a deep neural network, which has at least two layers of processing units; the method does not require disease labels (e.g., a case population and a control population); the method can score a variant that has never been seen before; the method can be used to compute a ‘distance’ between a variant sequence and a reference sequence, which can be used to rank the variant; the method can be used to compute a ‘distance’ between variants, which is useful for classifying unknown variants based on how similar they are to known variants.
  • In the following sections, systems and methods for creating a condition-specific cell variable predictor for cassette splicing are described in further detail. First, production of training targets, and generation of outputs using the systems and methods will be described. Subsequently, the procedure for training and optimizing a deep neural network (DNN), such as the CVPs, on a sparse and unbalanced biological dataset will be described. Subsequently, example methods to analyze the outputs of the systems will be described. Further, techniques to analyze the behaviour of such a DNN in terms of its inputs and gradients will be described.
  • Referring now to FIG. 3, shown therein is a method of generating target cell variables for training. During training of a neural network, a family of gradient-following procedures are performed where weights (“θ”) of a neural network are changed according to the gradient of a cost function evaluated using the prediction and the target in a training dataset. To construct the training procedure, the measured cell variable to be modeled is represented in a mathematical form, also referred to as the ‘target’ in a dataset. For example, in predicting the percent-spliced-in values (“PSI”), two distinct forms could be provided, the expected PSI and a discretized version of PSI.
  • To compute these targets, at block 302, the biological measurements such as RNA-Seq datasets are processed to produce a posterior probability distribution p of PSI, using methods such as cufflinks and the bootstrap binomial model. With the posterior probability of PSI, at block 304, the expected PSI can be computed by an exact evaluation or an approximation to the following integral: $E(\psi)=\int_{\psi} \psi\, p(\psi)\, d\psi$. The result is a scalar value between 0 and 1. A regression model to predict the expected PSI can be trained, with the cost function being the squared loss function or the cross-entropy based on a binomial distribution with $E(\psi)$ as the probability of success. In addition to the expected PSI, a discretized version of PSI may also be determined at block 306, which is defined by the probability mass of PSI in k predefined bins with boundaries ranging between 0 and 1. For example, using k=3 bins with a uniform bin width, we arrive at the ‘low, mid, high’ (LMH) formulation of PSI, which we also call a ‘splicing pattern’. With this formulation, $p(\psi)$ is discretized to three probabilities $\{p_{low}, p_{mid}, p_{high}\}$ for use during training. In particular, $p_{low}$ is equal to the probability that PSI is between 0 and ⅓: $p_{low}=\int_{0}^{1/3} p(\psi)\, d\psi$. For the discretized splicing patterns, the cross entropy cost function can be used for a classification model.
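  • As an illustration only, the two target forms can be approximated from a set of posterior samples of PSI. The sketch below assumes the posterior is available as a list of sampled ψ values (for example from a bootstrap procedure); the function names and the half-open bin convention are assumptions made for this example and are not taken from the described method.

```python
def expected_psi(samples):
    """Monte Carlo approximation of E(psi) from posterior samples in [0, 1]."""
    return sum(samples) / len(samples)


def discretize_psi(samples, k=3):
    """Approximate the probability mass of PSI in k uniform-width bins.

    Assumption: bins are half-open [lo, hi), except that psi == 1.0
    is counted in the last bin.
    """
    counts = [0] * k
    for psi in samples:
        index = min(int(psi * k), k - 1)
        counts[index] += 1
    total = len(samples)
    return [count / total for count in counts]


# Hypothetical posterior samples for one exon in one tissue.
samples = [0.1, 0.2, 0.5, 0.9]
target_regression = expected_psi(samples)        # scalar in [0, 1]
target_lmh = discretize_psi(samples, k=3)        # [p_low, p_mid, p_high]
```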
  • Though the preparation of training targets according to method 300 may be different for different cell variables, the system architecture applied may be the same or similar.
  • Referring now to FIGS. 4 to 9, shown therein are example DNN architectures for CVPs that predict splicing levels (Ψ).
  • Though the figures depict possible architecture embodiments, the number of hidden layers and the number of processing units in each layer can range widely and may be determined by hand, using data or using other information.
  • In an embodiment, the nodes of the DNN are fully connected, where each connection is parameterized by a real-valued weight θ. The DNN has multiple layers of non-linearity consisting of hidden units. The output activation a of each hidden unit v in layer l processes a sum of weighted outputs from the previous layer, using a non-linear function f:

  • $a_v^{l} = f\left(\sum_{m}^{M^{l-1}} \theta_{v,m}^{l}\, a_m^{l-1}\right)$
  • where $M^{l}$ represents the number of hidden units in layer $l$, and $a^{0}$ and $M^{0}$ are the input into the model and its dimensionality, respectively. Different activation functions for the hidden units can be used, such as the TANH function, SIGMOID, and the rectified linear unit (RELU).
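  • The activation rule above can be sketched in a few lines. This is a minimal illustration, assuming weights are stored as one row per hidden unit and that no bias term is used (none appears in the formula); the helper names are hypothetical.

```python
import math


def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def layer_forward(prev_activations, weights, f=math.tanh):
    """Compute a_v = f(sum_m theta[v][m] * a_prev[m]) for every unit v.

    weights[v][m] is the weight from unit m of the previous layer
    to unit v of the current layer.
    """
    return [
        f(sum(theta * a for theta, a in zip(row, prev_activations)))
        for row in weights
    ]


def forward(inputs, layers, f=math.tanh):
    """Propagate the input a^0 through a list of weight matrices."""
    activations = list(inputs)
    for weights in layers:
        activations = layer_forward(activations, weights, f)
    return activations
```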
  • Referring now to FIG. 4, shown therein is an example architecture 400 of a deep neural network that predicts alternative splicing inclusion levels in a single tissue type i, where the inclusion level is represented by a real-valued number Ψi.
  • Inputs into the first hidden layer 406 consist of genomic features 402 describing a genomic region; these features may include binding specificities of RNA- and DNA-binding proteins, RNA secondary structures, nucleosome positions, position-specific frequencies of short nucleotide sequences, and many others. To improve learning, the features can be normalized by the maximum of the absolute value across all training examples. The purpose of the first hidden layer is to reduce the dimensionality of the input and learn a better representation of the feature space.
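  • The normalization mentioned above can be sketched as follows. The example assumes the maximum absolute value is taken separately for each feature (column) across all training examples, which is one reading of the text; the function name is hypothetical.

```python
def max_abs_normalize(rows):
    """Scale each feature by its maximum absolute value over all examples.

    rows: list of feature vectors (one per training example).
    Features that are zero everywhere are left as zeros.
    """
    num_features = len(rows[0])
    scales = [max(abs(row[j]) for row in rows) for j in range(num_features)]
    return [
        [row[j] / scales[j] if scales[j] else 0.0 for j in range(num_features)]
        for row in rows
    ]
```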
  • The identity of the condition (e.g., tissue) 404, which is encoded as 1-of-T binary variables where T represents the number of conditions, is then appended to the vector of outputs of the first hidden layer, together forming the input into the second hidden layer 408. A third hidden layer 410, or additional hidden layers, may be included if found to be necessary to improve generalization performance.
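  • A minimal sketch of how the condition identity could be appended to the first hidden layer's output is shown below. The function names are hypothetical and the hidden activations are placeholder values.

```python
def one_hot(index, num_conditions):
    """1-of-T encoding of a condition (e.g., tissue) index."""
    return [1.0 if t == index else 0.0 for t in range(num_conditions)]


def second_layer_input(hidden1_output, condition_index, num_conditions):
    """Concatenate first-hidden-layer outputs with the condition encoding."""
    return list(hidden1_output) + one_hot(condition_index, num_conditions)


# Second tissue out of five, appended to three placeholder activations.
x2 = second_layer_input([0.2, -0.7, 0.4], condition_index=1, num_conditions=5)
```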
  • In an embodiment, the final output 412 may be a regression model that predicts the expected PSI.
  • Referring now to FIG. 5, in another embodiment, the discretized PSI may be predicted by a classification model 512. FIG. 5 shows an example architecture 500 of a deep neural network that predicts alternative splicing inclusion levels in a single tissue type i, where the probability mass function over inclusion levels is represented by a k-valued vector, depicted here with k=3 values labeled (Low, Medium, High).
  • Referring now to FIG. 6, alternatively, the DNN can predict the difference in PSI (ΔPSI) between two conditions for a particular exon. FIG. 6 shows an example architecture 600 of a deep neural network that predicts the difference between the alternative splicing inclusion levels of two tissue types (conditions) i 602 and j 604. Here, instead of one tissue as input, two different tissues can be supplied to the inputs.
  • Further, three classes can be generated, called decreased inclusion 606, no change 608, and increased inclusion 610, which can be similarly generated, but from the ΔPSI distributions. An interval can be chosen that more finely differentiates tissue-specific alternative splicing for this task, where a difference of greater than 0.15 could be labeled as a change in PSI levels. The probability mass could be summed over the intervals of −1 to −0.15 for decreased inclusion, −0.15 to 0.15 for no change, and 0.15 to 1 for increased inclusion.
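  • For illustration, the three ΔPSI classes can be approximated from samples of the ΔPSI distribution. The sketch assumes such samples are available and assigns values exactly at ±0.15 to the no-change class; both choices are assumptions of this example.

```python
def delta_psi_classes(delta_samples, threshold=0.15):
    """Approximate probability mass for decreased / no change / increased.

    delta_samples: samples of delta-PSI, each in [-1, 1].
    """
    total = len(delta_samples)
    decreased = sum(1 for d in delta_samples if d < -threshold)
    increased = sum(1 for d in delta_samples if d > threshold)
    unchanged = total - decreased - increased
    return {
        "decreased_inclusion": decreased / total,
        "no_change": unchanged / total,
        "increased_inclusion": increased / total,
    }
```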
  • Referring now to FIG. 7, shown therein is an example architecture 700 of a deep neural network that predicts the alternative splicing inclusion levels of two tissue types i and j, where the inclusion levels are represented by real-valued numbers Ψi 702 and Ψj 704 and the difference in alternative splicing inclusion levels between the two tissue types 706 is also represented by a real-valued number.
  • In embodiments, the classification, regression, and tissue difference codes may be trained jointly. The benefit is to reuse the same hidden representations learned by the model, and for each learning task to improve the performance of another.
  • Referring now to FIG. 8, shown therein is an example architecture 800 of a deep neural network that predicts the difference between the alternative splicing inclusion levels of two tissue types i and j, where the probability mass function over inclusion levels is represented by a k-valued vector, depicted here with k=3 values labeled (Low, Medium, High) 802 and the probability mass function over inclusion level differences is represented by a d-valued vector, here depicted with d=3 values labeled (Decrease, No Change, Increase) 804.
  • Referring now to FIG. 9, shown therein is an example architecture 900 of a deep neural network that predicts alternative splicing inclusion levels in T tissue types, where the probability mass function over inclusion levels is represented by a k-valued vector, depicted here with k=3 values labeled (Low, Medium, High). Accordingly, multiple tissues may be trained as different predictors via multitask learning. The learned representation from features may be shared across all tissues. FIG. 9 shows an example architecture of such a system.
  • Training of the systems will now be described with reference to FIGS. 10 to 12. Referring now to FIG. 10, shown therein is a method 1000 for training the cell variable predictors of the systems described above. At block 1002, the first hidden layer can be trained using an autoencoder to reduce the dimensionality of the feature space in an unsupervised manner. An autoencoder is trained by supplying the input through a non-linear hidden layer, and reconstructing the input, with tied weights going into and out of the hidden layer. Alternatively, the weights can be untied. This method of pretraining the network may initialize learning near a good local minimum. An autoencoder may be used instead of other dimensionality reduction techniques like principal component analysis, because it naturally fits into the CVP's architecture and because a non-linear technique may discover a better and more compact representation of the features. At block 1004, in the second stage of training, the weights from the input layer to the first hidden layer (learned from the autoencoder) are fixed, and the inputs corresponding to tissues are appended. A one-hot encoding representation may be used, such that specifying a tissue for a particular training example can take the form [0 1 0 0 0] to denote the second tissue out of 5 possible types. At block 1006, the reduced feature set and tissue variables become input into the second hidden layer. At block 1008, the weights connected to the second hidden layer and the final hidden layer of the CVP are then trained together in a supervised manner, with targets being the expected value of PSI, the discretized version of PSI, the expected value of ΔPSI, and/or the discretized version of ΔPSI, depending on architecture. At block 1010, after training these final two layers, weights from all layers of the CVP may be fine-tuned by backpropagation.
  • In an alternate embodiment, the autoencoder may be omitted altogether, and all weights of the neural network may be trained at once.
  • In one embodiment, the targets consist of (1) PSI for each of the two tissues, and (2) ΔPSI between the two tissues. Given a particular exon and N possible tissue types, N×N training examples can be constructed. This construction has redundancy in that it generates examples where both tissues are the same in the input to teach the model that it should predict no change for ΔPSI given identical tissue indices. Additionally, if the tissues are swapped in the input, a previously increased inclusion label should become decreased inclusion. The same rationale extends to the LMH classifier. Generating these additional examples is one method to incorporate this knowledge without explicitly specifying it in the model architecture.
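  • The N×N construction described above can be sketched as follows for a single exon. The sign convention ΔPSI = ψ_i − ψ_j and the dictionary field names are assumptions made for this example.

```python
from itertools import product


def tissue_pair_examples(psi_by_tissue):
    """Build N x N training examples for one exon.

    psi_by_tissue: list of PSI targets, one per tissue.
    Includes i == j pairs (delta of zero) and both orderings of
    each pair, so that swapping tissues flips the sign of delta-PSI.
    """
    num_tissues = len(psi_by_tissue)
    examples = []
    for i, j in product(range(num_tissues), repeat=2):
        examples.append({
            "tissue_i": i,
            "tissue_j": j,
            "psi_i": psi_by_tissue[i],
            "psi_j": psi_by_tissue[j],
            "delta_psi": psi_by_tissue[i] - psi_by_tissue[j],  # assumed sign
        })
    return examples
```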
  • A threshold can be applied to exclude examples from training if the total number of RNA-Seq junction reads is below a chosen value, such as 10, to remove low-signal training examples.
  • In some of the embodiments, multiple tasks may be trained together. Since each of these tasks might learn at different rates, learning rates may be allowed to differ. This is to prevent one task from overfitting too soon and negatively affecting the performance of another task before the complete model is fully trained. This may be implemented by having different learning rates for the weights between the connections of the last hidden layer and the functions used for classification or regression for each task.
  • To train and test CVPs of the systems described herein, data may be split into folds at random for cross validation, such as five approximately equal folds. Each fold may contain a unique set of genetic information, such as exons that are not found in any of the other folds. Where five folds are provided, three of the folds could be used for training, one used for validation, and one held out for testing. Training can be performed for a fixed number of epochs and hyperparameters can be selected that give optimal area under curve (“AUC”) performance or data likelihood on the validation data. The model can then be re-trained using the selected hyperparameters with both the training and validation data. Multiple models can be trained this way from the different folds of data. Predictions from the models on their corresponding test set can then be used to evaluate the code's performance. To estimate the confidence intervals, the data can be randomly partitioned, and the above training procedure can be repeated.
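  • One way to realize the fold-based procedure is sketched below: exons are shuffled into five disjoint folds, and each rotation uses one fold for testing, one for validation and the remaining three for training. The rotation order and function names are assumptions of this example.

```python
import random


def make_folds(exon_ids, k=5, seed=0):
    """Randomly partition exon identifiers into k disjoint folds."""
    ids = list(exon_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def rotate_splits(folds):
    """Yield (train, validation, test) lists, one rotation per fold."""
    k = len(folds)
    for t in range(k):
        v = (t + 1) % k  # assumed choice of validation fold
        train = [e for i, fold in enumerate(folds) if i not in (t, v) for e in fold]
        yield train, folds[v], folds[t]
```

  • Because each exon appears in exactly one fold, no exon seen during training or validation is used for testing within the same rotation.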
  • The CVP's processing unit weights may be initialized with small random values sampled from a zero-mean Gaussian distribution. Alternatively, they may be initialized with small random values sampled from a zero-mean uniform distribution. Learning may be performed with stochastic gradient descent with momentum and dropout, where mini-batches are constructed as described below. An L1 weight penalty may be included in the cost function to improve the model performance by disconnecting features deemed to be not useful by the predictor. The model's weights may be updated after each mini-batch.
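  • A single weight update of the kind described can be sketched as below. This uses the classical momentum form and a subgradient of the L1 penalty; the exact update rule, the hyperparameter names and their values are assumptions of this example, and dropout is omitted for brevity.

```python
def sgd_momentum_step(weights, grads, velocity, lr, momentum, l1):
    """One mini-batch update with momentum and an L1 weight penalty.

    weights, grads, velocity: flat lists of equal length.
    Returns the updated (weights, velocity).
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        sign = (w > 0) - (w < 0)          # subgradient of |w|
        total_grad = g + l1 * sign        # data gradient plus L1 term
        v = momentum * v - lr * total_grad
        new_velocity.append(v)
        new_weights.append(w + v)
    return new_weights, new_velocity
```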
  • Referring now to FIG. 11, shown therein is a system to perform non-uniform sampling of training cases for creating a mini-batch for training a deep neural network.
  • To encourage neural networks to better discover patterns in the inputs that help to distinguish tissue types or genomic features, a system is provided for biasing the distribution of training events in the mini-batches. The system comprises training cases separated into “high-variance” cases and “low-variance” cases. The set of high-variance training cases is selected by thresholding each case's variance across tissue types or genomic features. In the illustrated embodiment the “high-variance” cases are provided in a database 1106, and the “low-variance” cases are provided in a database 1108. The system further comprises switches 1104 and multiplexers 1102. In use, each row of a mini-batch 1110 is sampled either from a list of high- or low-variance training cases, depending on a probabilistic {0,1} switch value. The resulting mini-batch of genomic features and corresponding cell variable targets can be used for training, such as for training the architectures in FIGS. 6 and 7.
  • Referring now to FIG. 12, shown therein is a method for training cell variable predictors for ensuring a consistent backpropagation signal that updates the weights connected to tissue inputs and biases learning towards the events with large tissue variability early on before overfitting occurs. According to a method 1200, at block 1202, all training cases are separated into a database of “high-variance” cases and a database of “low-variance” cases, where the variance of each training case is measured as “variance of the Ψ training targets across tissue types” and the threshold for separating high/low is any pre-determined constant. At block 1204, all events that exhibit large tissue variability are selected, and mini-batches are constructed based only on these events. At each training epoch, training cases can be further sampled (with or without replacement) from the larger pool of events with low tissue variability, of some pre-determined or randomized size typically smaller than or equal to one fifth of the mini-batch size. A purpose of method 1200 is to have a consistent backpropagation signal that updates the weights connected to the tissue inputs and to bias learning towards the events with large tissue variability early on before overfitting occurs. As training progresses, the splicing pattern of the events with low tissue variability is also learned. This arrangement effectively gives the events with large tissue variability greater importance (i.e. more weight) during optimization. This may be beneficial to improve the models' tissue specificity.
  • With the above methods for training, techniques to reduce overfitting can be applied to the system to provide an embodiment of a CVP with dropout. Along with the use of GPUs, CVPs comprising deep neural networks may be a competitive technique for conducting learning and prediction on biological datasets, with the advantage that they can be trained quickly, have enough capacity to model complex relationships, and scale well with the number of hidden variables and volume of data, making them potentially highly suitable for ‘omic’ datasets.
  • Additionally, the performance of a CVP depends on a good set of hyperparameters. Instead of conducting a grid search over the hyperparameter space, Bayesian frameworks can be used to automatically select a model's hyperparameters. These methods use a Gaussian Process to search for a joint setting of hyperparameters that optimizes a process's performance on validation data. They use the performance measures from previous experiments to decide which hyperparameters to try next, taking into account the trade-off between exploration and exploitation. This approach eliminates many of the human judgments involved with hyperparameter optimization and reduces the time required to find such hyperparameters. Alternatively, randomized hyperparameter search can be performed, where the hyperparameters to be optimized are sampled from a uniform distribution. These methods require only the search range of hyperparameter values to be specified, as well as how long to run the optimization for.
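  • The switch-based sampling of FIG. 11 can be sketched in software as follows. The record layout (a `psi_by_tissue` field), the use of population variance, and the switch probability `p_high` are assumptions made for this illustration.

```python
import random


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_by_variance(cases, threshold):
    """Separate training cases by the variance of their targets across tissues."""
    high = [c for c in cases if variance(c["psi_by_tissue"]) > threshold]
    low = [c for c in cases if variance(c["psi_by_tissue"]) <= threshold]
    return high, low


def sample_minibatch(high, low, batch_size, p_high, rng):
    """Fill each mini-batch row from the high- or low-variance pool.

    A probabilistic {0,1} switch selects the pool for each row:
    the high-variance pool is chosen with probability p_high.
    """
    batch = []
    for _ in range(batch_size):
        pool = high if rng.random() < p_high else low
        batch.append(rng.choice(pool))
    return batch
```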
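  • The randomized search alternative can be sketched as below. The `evaluate` callback stands in for training a model and measuring validation performance; it, the parameter names and the trial budget are hypothetical. The Gaussian Process based search is not shown.

```python
import random


def random_search(evaluate, search_ranges, num_trials, seed=0):
    """Sample hyperparameters uniformly and keep the best-scoring setting.

    evaluate: function mapping a dict of hyperparameters to a validation
              score (higher is better); supplied by the caller.
    search_ranges: dict of name -> (low, high) bounds.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in search_ranges.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```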
  • In the following paragraphs, methods for using the outputs of the CVP for scoring, classifying and prioritizing genetic variants (with reference to FIG. 13); for scoring variants by associating cell variable changes with those of other variants (with reference to FIG. 14); and for interpreting which genetic features account for variant-induced cell variable changes (with reference to FIGS. 15 to 18), will be described.
  • The systems described above can be used to compute a set of condition-specific scores for how deleterious a variant is. For instance, a variant may be found to have a high deleteriousness score in brain tissue, but not in liver tissue. In this way the condition-specific cell variables computed as described above can be used to compute condition-specific deleteriousness scores. To classify variants as pathogenic, likely pathogenic, unknown significance (VUS), likely benign or benign, and to prioritize or rank a set of variants, these sets of scores can be combined.
  • According to a method 1300, to quantify the effect of an SNV (single nucleotide variation) or a combination of mutations (called in general a variant) using a CVP, at block 1302, a pair of feature vectors is constructed corresponding to the reference sequence and the variant sequence. Due to the mutation, these genomic feature vectors will be different, but without a further CVP it may not be possible to predict whether those differences will result in any change in phenotype. At block 1304, the predictive system is therefore used to compute both the reference cell variable value and the mutant cell variable value for each condition, using these two distinct feature vectors. After that, at block 1306, a distance function that combines the reference and the mutant predictions can be used to produce a single score for each condition, which summarizes the magnitude of predicted effect induced by the mutations. Because a large change in a cell variable is likely to cause disease, in the absence of further information about a particular disease and a particular cell variable, high-scoring mutations are assumed to cause disease.
  • Examples of distance functions are the expected difference, Kullback-Leibler divergence, and variation distance. In the following, we describe each of these distance functions in detail using a LMH splicing predictor as an example.
  • The expected difference represents the absolute value of the difference induced by the mutation in the expected value of a cell variable. For an LMH PSI predictor, the predicted reference splicing patterns $\{p_{low}^{wt}, p_{mid}^{wt}, p_{high}^{wt}\}$ and the predicted mutant splicing patterns $\{p_{low}^{mut}, p_{mid}^{mut}, p_{high}^{mut}\}$ are computed using the reference and mutant feature vectors as inputs. Then, the expected value of the predicted cell variable with and without the mutation is computed, denoted as $\psi^{wt}$ and $\psi^{mut}$. The expected value is a weighted average of the PSI values corresponding to the center of the bins used to define the splicing pattern. As described above, if three bins are used with uniform spacing, reference PSI is computed by $\psi^{wt}=\tfrac{1}{6}p_{low}^{wt}+\tfrac{1}{2}p_{mid}^{wt}+\tfrac{5}{6}p_{high}^{wt}$. In the same way, mutant PSI is computed by $\psi^{mut}=\tfrac{1}{6}p_{low}^{mut}+\tfrac{1}{2}p_{mid}^{mut}+\tfrac{5}{6}p_{high}^{mut}$. The final score is the absolute difference between the expected PSI values: $s=|\psi^{mut}-\psi^{wt}|$. This can be combined across conditions, by computing the maximum absolute difference across conditions.
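  • The expected-difference score can be sketched directly from the formulas above. The function names and the representation of each splicing pattern as a (low, mid, high) list are assumptions of this example.

```python
BIN_CENTERS = (1 / 6, 1 / 2, 5 / 6)  # centers of three uniform bins on [0, 1]


def expected_psi_from_pattern(pattern):
    """Weighted average of bin centers under (p_low, p_mid, p_high)."""
    return sum(center * p for center, p in zip(BIN_CENTERS, pattern))


def expected_difference(p_wt, p_mut):
    """s = |psi_mut - psi_wt| for one condition."""
    return abs(expected_psi_from_pattern(p_mut) - expected_psi_from_pattern(p_wt))


def max_expected_difference(wt_by_condition, mut_by_condition):
    """Combine per-condition scores by taking the maximum."""
    return max(
        expected_difference(wt, mut)
        for wt, mut in zip(wt_by_condition, mut_by_condition)
    )
```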
  • Kullback-Leibler (KL) divergence is an information theoretic measure of difference between probability distributions P and Q:
  • $D_{KL}(P, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$.
  • Due to the asymmetric nature of the KL divergence, either $s=D_{KL}(P^{wt}, P^{mut})$ or $s=D_{KL}(P^{mut}, P^{wt})$ can be used as the distance measure. The KL divergence can be computed for each condition and the sum (or average) KL divergence can be computed across conditions, or the maximum KL divergence can be computed across tissues.
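  • A minimal sketch of the KL divergence score is given below. The small floor `eps` applied to Q(i) to avoid division by zero is an assumption of this example and is not part of the formula; terms with P(i) = 0 contribute zero, following the usual convention.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P, Q) = sum_i P(i) * log(P(i) / Q(i)), natural logarithm."""
    return sum(
        p_i * math.log(p_i / max(q_i, eps))
        for p_i, q_i in zip(p, q)
        if p_i > 0
    )


def combine_across_conditions(scores, how="sum"):
    """Combine per-condition scores by sum, mean or max."""
    if how == "sum":
        return sum(scores)
    if how == "mean":
        return sum(scores) / len(scores)
    if how == "max":
        return max(scores)
    raise ValueError("how must be 'sum', 'mean' or 'max'")
```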
  • The variation distance is another measure of difference between probability distributions. It is half the sum of the absolute differences between the predicted probabilities. In the LMH splicing predictor example, $s=\tfrac{1}{2}\sum_{c\in\{low,mid,high\}} |p_{c}^{mut}-p_{c}^{wt}|$. Again, this can be computed for each condition and then the sum or maximum can be taken across conditions.
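  • The variation distance for one condition can be sketched as below, with each splicing pattern given as a list of probabilities in the same bin order; the function name is hypothetical.

```python
def variation_distance(p_wt, p_mut):
    """s = 0.5 * sum over bins of |p_mut - p_wt|; ranges from 0 to 1."""
    return 0.5 * sum(abs(mut - wt) for wt, mut in zip(p_wt, p_mut))
```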
  • Once the score of a variant has been computed at block 1306, at block 1308 the score can be thresholded and/or combined with other information to classify the variant as pathogenic, likely pathogenic, unknown significance (VUS), likely benign or benign.
  • Further, at block 1310, given a set of variants, the score of every variant can be computed and the set of variants can be reordered so that the highest-scoring (most deleterious) variants are at the top of the list and the lowest-scoring variants are at the bottom of the list.
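  • The reordering at block 1310 amounts to a descending sort on the score. The sketch below assumes scores are held in a dictionary keyed by a variant identifier; the identifiers shown are placeholders.

```python
def rank_variants(scores):
    """Return (variant, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder identifiers and scores, for illustration only.
ranked = rank_variants({"variant_a": 0.1, "variant_b": 0.9, "variant_c": 0.5})
```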
  • Referring now to FIG. 14, a method 1400 is shown for scoring, classifying and prioritizing variants. The method 1400 comprises, at block 1402, associating the cell variable changes of variants with those of other variants with known function. For instance, suppose the system 100 determines that a variant that has never been seen before causes a change in a particular cell variable, say the cassette splicing level of a specific exon. Suppose a nearby variant whose disease function is well-characterized causes a similar change in the exact same cell variable, e.g., the splicing level of the same exon. Since mutations act by changing cellular chemistry, such as the splicing level of the exon, it can be inferred that the unknown variant likely has the same functional impact as the known variant. The system can ascertain the ‘distance’ between two variants in this fashion using a variety of different measures. Because the system computes variant-induced changes in a cell variable for different conditions, this information can be used to more accurately associate variants with one another. For example, two variants that induce a similar cell variable change in brain tissue would be associated more strongly than two variants that induce similar cell variable changes, but in different tissues.
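  • As one possible illustration of associating variants, each variant can be represented by its vector of predicted cell variable changes, one entry per condition, and compared with known variants. The Euclidean distance used here is only one of the many measures the text allows, and the function and variant names are hypothetical.

```python
import math


def change_profile_distance(profile_a, profile_b):
    """Euclidean distance between per-condition cell variable changes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


def nearest_known_variant(unknown_profile, known_profiles):
    """Return the name of the known variant with the most similar profile.

    known_profiles: dict of name -> list of changes, one per condition,
    in the same condition order as unknown_profile.
    """
    return min(
        known_profiles,
        key=lambda name: change_profile_distance(unknown_profile, known_profiles[name]),
    )
```

  • With conditions ordered as (brain, liver), an unknown variant with changes [0.3, 0.0] is closer to a known variant with [0.25, 0.0] than to one with [0.0, 0.3], reflecting that a similar change in the same tissue is a stronger association than a similar change in a different tissue.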
  • Unlike many existing systems, the methods and systems described here can be used to score, classify, prioritize and interpret a variant in the context of different reference sequences. For instance, when a child's variant is compared to a reference sequence obtained from the reference human genome, the variant may have a high score, but when the same variant is compared to the reference sequences obtained from his or her unaffected parents, the variant may have a low score, indicating that the variant is likely not the cause of the disease. In contrast, if the child's variant is found to have a high score when it is compared to the reference sequences obtained from his or her parents, then it is more likely to be the cause of the disease. Another circumstance in which different reference sequences arise is when the variant may be present in more than one transcript, which can occur because transcription occurs bidirectionally in the genome, there may be alternative transcription start sites, there may be alternative splicing, and for other reasons.
  • Referring now to FIGS. 15 to 19, methods will now be described to identify the impact of features (which may include nucleotides) on a cell variable CVP prediction.
  • It can be useful to determine why a variant changes a cell variable and leads to disease. A variant leads to a change in DNA/RNA sequence and/or a change in the DNA/RNA features extracted from the sequence. However, it may not be apparent which particular changes in the sequence or features are important. An SNV may change more than one feature (e.g., a protein binding site and RNA secondary structure), but because of contextual dependence only some of the affected features play an important role.
  • To ascertain this, the system 100 can determine which inputs (nucleotides or DNA/RNA features) are responsible for changes in cell variables. In other words, it is useful to know how important a feature is overall for making a specific prediction, and it is also useful to know in what way the feature contributes to the prediction (positively or negatively).
  • Referring now to FIG. 15, a first method 1500 to identify the impact of features on a cell variable CVP prediction works by computing, at block 1502, the features for the sequence containing the variant and the features for the sequence that does not have the variant. At block 1504, both feature vectors are fed into the cell variable predictor to obtain the two sets of condition-specific cell variables. At block 1506, a single feature from the variant sequence is copied into the corresponding feature in the non-variant sequence and the system is used to compute the set of condition-specific cell variables. At block 1508, this is repeated for all features and the feature that produces the set of condition-specific cell variables that is most similar to the set of condition-specific cell variables for the variant sequence is identified. This approach can be extended to test a set of pairs of features or a set of arbitrary combinations of features.
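  • The feature-copy procedure of method 1500 can be sketched as below. The `predict` callable stands in for the cell variable predictor, and the toy linear predictor in the example is purely hypothetical. Similarity is measured here by the sum of squared differences across conditions, which is one of the options mentioned later in this description.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[Sequence[float]], List[float]]


def most_explanatory_feature(predict: Predictor,
                             ref_features: Sequence[float],
                             var_features: Sequence[float]) -> int:
    """Index of the single feature that best reproduces the variant output."""
    target = predict(var_features)  # condition-specific outputs for the variant
    best_index, best_dist = -1, float("inf")
    for i in range(len(ref_features)):
        hybrid = list(ref_features)
        hybrid[i] = var_features[i]  # copy one variant feature into the reference
        out = predict(hybrid)
        dist = sum((a - b) ** 2 for a, b in zip(out, target))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


# Hypothetical predictor with two conditions and three input features.
def toy_predict(x: Sequence[float]) -> List[float]:
    return [x[0] + 2.0 * x[2], x[1] - x[2]]


best = most_explanatory_feature(toy_predict, [0.0, 0.0, 0.0], [0.1, 0.0, 1.0])
```

In the toy example the third feature (index 2) accounts for nearly all of the variant-induced change, so it is the one selected.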
  • Referring now to FIG. 16, a second method 1600 to identify the impact of features on a cell variable CVP prediction evaluates the impact of a subset $S \subseteq \{1, \ldots, n\}$ of input features $x=(x_1, \ldots, x_n)$ on the corresponding cell variable prediction $z=f(x)$. The method consists of, at block 1602, constructing a new set of input features $\hat{x}=(\hat{x}_1, \ldots, \hat{x}_n)$ where for each feature index $i \in S$ in the subset the value $\hat{x}_i$ has been replaced with the median value of $x_i$ across the training dataset. At block 1604, this new feature vector is then sent through the cell variable prediction system in question, resulting in a new prediction $\hat{z}=f(\hat{x})$. For a splicing cell variable predictor, this entails replacing genomic feature $x_i$ with its median value across all events (all exons) in the training set. The impacts of feature subsets of the same size are comparable, including all cases when $|S|=1$. Among comparable feature subsets, those that correspond to the largest decrease in performance may be deemed to have high impact. At block 1606, the overall importance of a feature (as opposed to its importance for a specific training or test case) with regard to a particular dataset (e.g. a training or test set) can be determined as the average or median of all its impact scores across all cases in that dataset.
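  • A minimal sketch of the median-substitution procedure of method 1600 follows. It assumes a scalar-valued predictor and measures the decrease in performance as the increase in squared error against a measured target; both choices, as well as the toy predictor and data, are assumptions made for illustration.

```python
from statistics import mean, median
from typing import Callable, List, Sequence

ScalarPredictor = Callable[[Sequence[float]], float]


def median_substituted(x: Sequence[float], subset: Sequence[int],
                       training_inputs: Sequence[Sequence[float]]) -> List[float]:
    """Replace each feature in `subset` by its median over the training set."""
    x_hat = list(x)
    for i in subset:
        x_hat[i] = median(row[i] for row in training_inputs)
    return x_hat


def impact(predict: ScalarPredictor, x: Sequence[float], y: float,
           subset: Sequence[int],
           training_inputs: Sequence[Sequence[float]]) -> float:
    """Increase in squared error caused by the substitution (one case)."""
    z = predict(x)
    z_hat = predict(median_substituted(x, subset, training_inputs))
    return (z_hat - y) ** 2 - (z - y) ** 2


def overall_importance(predict: ScalarPredictor,
                       inputs: Sequence[Sequence[float]],
                       targets: Sequence[float], subset: Sequence[int],
                       training_inputs: Sequence[Sequence[float]]) -> float:
    """Average impact of the feature subset over a dataset."""
    return mean(impact(predict, x, y, subset, training_inputs)
                for x, y in zip(inputs, targets))


# Hypothetical predictor and training data.
def toy_predict(x: Sequence[float]) -> float:
    return 2.0 * x[0] + 0.1 * x[1]


train_x = [[0.0, 0.0], [1.0, 10.0], [2.0, 20.0]]
train_y = [toy_predict(x) for x in train_x]
importance_0 = overall_importance(toy_predict, train_x, train_y, [0], train_x)
importance_1 = overall_importance(toy_predict, train_x, train_y, [1], train_x)
```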
  • Referring now to FIG. 17, a third method 1700 is described to identify the impact of features on a cell variable CVP prediction. At block 1702, an example from the dataset is given as input to the trained model and forward propagated through a CVP comprising a neural network to generate an output. At block 1704, the target is modified to a different value compared to the predicted output; for example, in classification, the class label would be modified so that it differs from the prediction. At block 1706, the error signal is backpropagated to the inputs. The resulting signal describes how much each input feature needs to change in order to make the modified prediction, as well as the direction. The computation is extremely quick, as it only requires a single forward and backward pass through the CVP, and all examples can be calculated in parallel. Features that need to be changed the most are deemed to be important. At block 1708, the overall importance of a feature (as opposed to its importance for a specific training or test case) with regard to a particular dataset (e.g. a training or test set) can be determined as the average or median of the amount of change across all cases in that dataset. The benefit of this approach compared to the first is that it can model how multiple features operate simultaneously.
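  • The backpropagation-to-inputs idea of method 1700 can be illustrated with a very small network written out by hand. The architecture (one tanh hidden layer and a sigmoid output), the cross-entropy error and all parameter values are assumptions chosen so that the sketch stays self-contained; they are not the described CVP.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def forward(x: Sequence[float], W1: Matrix, b1: Sequence[float],
            w2: Sequence[float], b2: float) -> Tuple[List[float], float]:
    """Forward pass: tanh hidden layer followed by a sigmoid output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    a2 = sum(w * hj for w, hj in zip(w2, h)) + b2
    return h, 1.0 / (1.0 + math.exp(-a2))


def flipped_target(y: float) -> float:
    """Modified target: the class label that differs from the prediction."""
    return 0.0 if y >= 0.5 else 1.0


def input_gradient(x: Sequence[float], W1: Matrix, b1: Sequence[float],
                   w2: Sequence[float], b2: float, target: float) -> List[float]:
    """Gradient of the cross-entropy error with respect to each input."""
    h, y = forward(x, W1, b1, w2, b2)
    delta_out = y - target  # error signal at the output pre-activation
    delta_hidden = [delta_out * w * (1.0 - hj * hj) for w, hj in zip(w2, h)]
    return [sum(d * row[i] for d, row in zip(delta_hidden, W1))
            for i in range(len(x))]


# Hypothetical parameters and input.
W1 = [[0.5, -0.3], [0.2, 0.8]]
b1 = [0.1, -0.1]
w2 = [1.0, -1.5]
b2 = 0.05
x = [0.4, -0.7]
_, y_pred = forward(x, W1, b1, w2, b2)
target = flipped_target(y_pred)
grad = input_gradient(x, W1, b1, w2, b2, target)
importance = [abs(g) for g in grad]  # larger magnitude = more important
```

Moving an input against the sign of its gradient entry reduces the error with respect to the modified target, which gives the direction of change; the magnitude gives the amount.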
  • Referring now to FIG. 18, a complementary method 1800 based on the method of 1700 to analyze a CVP is to see how features are used in a tissue-specific manner. At block 1802, this extension simply receives examples from the dataset corresponding to particular tissues, and, at block 1804, performs the procedure as described above [110]. In cases where the cell variable predictor is tissue-specific (e.g. FIGS. 4-9) this procedure yields tissue-specific feature importance information.
  • Referring now to FIG. 19, shown therein is a detailed illustration of a method 1900 to generate a visualization for tissue-specific feature importance based on the method described in 1700 and 1800. At block 1902, input comprising examples from a dataset corresponding to a particular tissue is provided to the CVP. At block 1904, tissue-specific cell variable predictions are provided by the CVP. At block 1906, targets are constructed based on the cell variable predictions, such that there is a mismatch between the prediction and the target. At block 1908, an update signal is computed which describes how the weights of the connections need to change to make the prediction match the target. At block 1910, an update signal backpropagated to the input, Δfeature, is further computed. At block 1912, examples from the dataset are sorted by tissue types. At block 1914, the overall importance of features for each tissue is computed by taking the mean of the magnitude of the update signal over the entire dataset. At block 1916, a visualization is generated, where the importance of each feature is colored accordingly for each tissue.
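  • The aggregation at blocks 1912 and 1914 might look like the following sketch, which groups per-example input update signals by tissue and averages their magnitudes. The input format (pairs of a tissue name and a Δfeature vector) and the example values are assumptions; how the update signals are produced is outside this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tissue_feature_importance(
        signals: Iterable[Tuple[str, List[float]]]) -> Dict[str, List[float]]:
    """Mean magnitude of the input update signal, per tissue and feature."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for tissue, delta_feature in signals:
        if tissue not in sums:
            sums[tissue] = [0.0] * len(delta_feature)
        for i, d in enumerate(delta_feature):
            sums[tissue][i] += abs(d)
        counts[tissue] += 1
    return {t: [s / counts[t] for s in total] for t, total in sums.items()}


# Hypothetical update signals for three examples and two features.
example_signals = [("brain", [1.0, -2.0]),
                   ("brain", [-3.0, 0.0]),
                   ("heart", [0.5, 0.5])]
importance_by_tissue = tissue_feature_importance(example_signals)
```

The resulting table of tissues by features is the kind of matrix that could then be rendered as a colored heat map at block 1916.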
  • The systems and methods described here can also be used to determine whether a therapy reverses the effect of a variant on a pertinent cell variable. For example, an SNV within an intron may cause a decrease in the cell variable that corresponds to the inclusion level of a nearby exon, but an oligonucleotide therapy that targets the same region as the SNV or a different one may cause the cell variable (inclusion level) to rise to its original level. Or, a DNA editing system such as CRISPR-Cas9 may be used to edit the DNA, adding, removing or changing a sequence such that the cell variable (inclusion level) of the exon rises to its original level. If the method described here is applied to a variant and a reference sequence obtained from the reference genome or an unaffected family member, and the cell variable is found to change by a certain amount, or if the cell variable has been measured to change by a certain amount, the following technique can be used to evaluate putative therapies to see if they correct the change. In the case of therapies that target the variant sequence, such as by protein-DNA or protein-RNA binding or by oligonucleotide hybridization, the effect of the therapy on the variant can be computed using the CVP, where the reference is taken to be the variant sequence and the “variant sequence” is now taken to be the variant sequence modified to account for the effect of the therapy. If the therapy targets a subsequence of the variant, that subsequence may be, in silico, modified by randomly changing the nucleotides, setting them all to a particular value, or some other method. Alternatively or additionally, when features are extracted from the modified sequence, features that overlap, fully or partially, with the targeted subsequence may be set to values that reflect absence of the feature. The reference (the original variant) and the modified variant are then fed into the CVP and the change in the cell variable is computed. This is repeated with a wide range of therapies, and the efficacy of each therapy can be determined by how much the therapy-induced change in the cell variable corrects for the original variant-induced change. In the case of a DNA editing system, such as CRISPR-Cas9, the procedure is even more straightforward. The reference is taken to be the original variant, and the variant is taken to be the edited version of the variant. The output of the CVP then indicates by how much the cell variable will change because of the editing.
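  • One way to compare putative therapies, sketched below, is to rank them by how much of the variant-induced change remains after the therapy-induced change is added. The cell variable values are assumed to have been produced already by a CVP; the therapy names and numbers are hypothetical, and the residual used for ranking is an illustrative choice rather than a measure prescribed by this description.

```python
from typing import Dict, List, Tuple


def rank_therapies(cv_reference: float, cv_variant: float,
                   cv_treated: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank therapies by the uncorrected residual change (smaller is better)."""
    variant_change = cv_variant - cv_reference
    ranked = []
    for name, cv in cv_treated.items():
        therapy_change = cv - cv_variant  # reference here is the variant itself
        residual = abs(variant_change + therapy_change)
        ranked.append((name, residual))
    return sorted(ranked, key=lambda item: item[1])


# Hypothetical exon inclusion levels: reference 0.8, variant 0.4.
ranking = rank_therapies(0.8, 0.4, {"oligo_A": 0.75, "oligo_B": 0.5, "edit_C": 0.3})
best_therapy = ranking[0][0]
```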
  • Thus, what has been provided is, essentially, a system and method for computing variant-induced changes in one or more condition-specific cell variables. An exemplary method comprises computing a set of features from the DNA or RNA sequence containing the variant, applying a network of at least two layers of processing units (the deep neural network) to the variant features to compute the one or more condition-specific variant cell variables, computing a set of features from a reference DNA or RNA sequence, applying the deep network to the reference features to compute the one or more condition-specific reference cell variables, and computing the variant-induced changes in the one or more condition-specific cell variables by comparing the one or more condition-specific reference cell variables to the one or more condition-specific variant cell variables. In embodiments, the number of condition-specific cell variables is at least two.
  • The deep neural network may be trained using a dataset of examples, where each example is a measured DNA or RNA sequence and a corresponding set of measured values of the condition-specific cell variables, one for each condition, and where the condition-specific cell variables are not normalized using a baseline that is determined by combining the condition-specific cell variables across two or more conditions.
  • The set of features may include a binary matrix with 4 rows and a number of columns equal to the length of the DNA or RNA sequence, where each column contains a single ‘1’ and three ‘0’s and where the row in which each ‘1’ occurs indicates the nucleotide at the corresponding position in the DNA or RNA sequence. The set of features may include a set of features computed using the recognition path of an autoencoder that is applied to the binary matrix. The autoencoder may be trained using a dataset of binary matrices computed using a set of DNA or RNA sequences of fixed length. The set of features may also include real and binary features derived from the DNA or RNA sequence.
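  • The binary matrix described above can be built as in the following sketch. The row order (A, C, G, T), the mapping of U to T for RNA, and the rejection of ambiguous bases are assumptions, since the text only requires a single ‘1’ per column.

```python
from typing import List

ALPHABET = "ACGT"  # assumed row order


def one_hot(sequence: str) -> List[List[int]]:
    """Return a 4 x len(sequence) binary matrix, one '1' per column."""
    seq = sequence.upper().replace("U", "T")  # treat RNA 'U' like 'T'
    for base in seq:
        if base not in ALPHABET:
            raise ValueError(f"unsupported nucleotide: {base!r}")
    return [[1 if base == row_base else 0 for base in seq]
            for row_base in ALPHABET]


matrix = one_hot("GATTACA")
column_sums = [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]
```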
  • At least part of the deep network may be configured to form a convolutional network and/or recurrent network. Part of the deep network that is a recurrent network may be configured to use long-term short-term memory.
  • The deep neural network may be trained using a dataset of feature vectors extracted from DNA or RNA and a corresponding set of measured values of cell variables. The training method may adjust the parameters of the deep neural network so as to minimize the sum of the error between the measured cell variables and the output of the deep neural network. The error may be the squared difference between the measured cell variable and the corresponding output of the neural network. The error may be the absolute difference between the measured cell variable and the corresponding output of the neural network. The error may be the Kullback-Leibler divergence between the measured cell variable and the corresponding output of the neural network. Stochastic gradient descent may be used to train the deep neural network.
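  • To illustrate the training objective described above, the sketch below uses stochastic gradient descent to minimize the summed squared error. A one-parameter linear model is used as a deliberately simple stand-in for the deep neural network, and the learning rate, number of epochs and data are hypothetical.

```python
import random
from typing import List, Tuple


def train_sgd(xs: List[float], ys: List[float], lr: float = 0.1,
              epochs: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Fit y ~ w * x + b by per-example gradient steps on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a random order each epoch
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            w -= lr * 2.0 * err * xs[i]  # d(err^2)/dw
            b -= lr * 2.0 * err          # d(err^2)/db
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2.0 * x + 1.0 for x in xs]
w_fit, b_fit = train_sgd(xs, ys)
```

Swapping the squared difference for the absolute difference or a Kullback-Leibler divergence changes only the per-example gradient expressions.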
  • Dropout may be used to train the deep neural network.
  • The hyperparameters of the deep neural network may be adjusted so as to minimize the error on a separate validation set.
  • The deep neural network may be trained using multitask learning, where the outputs of the deep neural network comprise at least two of the following: a real-valued cell variable, a probability distribution over a discretized cell variable, a probability distribution over a real-valued cell variable, a difference between two real-valued cell variables, a probability distribution over a discretized difference between two real-valued cell variables, a probability distribution over the difference between two real-valued cell variables.
  • An input to the deep neural network may indicate the condition for which the cell variable is computed and the deep neural network is applied repeatedly to compute each condition-specific cell variable.
  • The output of the deep neural network may comprise one real value for each condition and the variant-induced change for each condition may be computed by subtracting the computed reference cell variable from the computed variant cell variable.
  • The output of the deep neural network may comprise a probability distribution over a discrete variable for each condition and the variant-induced change for each condition may be computed by summing the absolute difference between the computed probabilities for the reference cell variable and the variant cell variable.
  • The output of the deep neural network may comprise a probability distribution over a discrete variable for each condition and the variant-induced change for each condition may be computed using the Kullback-Leibler divergence between the computed probabilities for the reference cell variable and the variant cell variable.
  • The output of the deep neural network may comprise a probability distribution over a discrete variable for each condition and the variant-induced change for each condition may be computed by first computing the expected value of the reference cell variable and the variant cell variable, and then subtracting the expected value of the reference cell variable from the expected value of the variant cell variable.
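  • Two of the options above for turning predicted distributions into a variant-induced change, the expected-value difference and the Kullback-Leibler divergence, can be sketched as follows. The representative values assigned to the discrete levels and the direction of the divergence (variant relative to reference) are assumptions, as is the small constant used to avoid taking the logarithm of zero.

```python
import math
from typing import Sequence


def expected_value(probs: Sequence[float], levels: Sequence[float]) -> float:
    """Expected cell variable under a distribution over discrete levels."""
    return sum(p * v for p, v in zip(probs, levels))


def change_expected(p_ref: Sequence[float], p_var: Sequence[float],
                    levels: Sequence[float]) -> float:
    """Expected variant value minus expected reference value."""
    return expected_value(p_var, levels) - expected_value(p_ref, levels)


def change_kl(p_ref: Sequence[float], p_var: Sequence[float],
              eps: float = 1e-12) -> float:
    """KL divergence of the variant distribution from the reference one."""
    return sum(pv * math.log((pv + eps) / (pr + eps))
               for pr, pv in zip(p_ref, p_var) if pv > 0.0)


# Hypothetical distributions over three levels with assumed level values.
level_values = [0.0, 0.5, 1.0]
p_reference = [0.1, 0.2, 0.7]
p_variant = [0.6, 0.2, 0.2]
delta_expected = change_expected(p_reference, p_variant, level_values)
delta_kl = change_kl(p_reference, p_variant)
```

The expected-value difference keeps the sign of the change, whereas the divergence is non-negative and only reports how far the distribution moved.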
  • The variant-induced changes in the one or more condition-specific cell variables may be combined to output a single numerical variant score. The variant score may be computed by summing the variant-induced changes across conditions. The variant score may be computed by summing the squares of the variant-induced changes across conditions. The variant score may be computed by summing the outputs of a nonlinear function that are computed by applying the nonlinear function to the variant-induced changes across conditions.
  • At least two variants and corresponding reference sequences may be independently processed to compute the variant-induced changes in one or more condition-specific cell variables for each variant and corresponding reference sequence. At least two variants and corresponding reference sequences may be independently processed to compute the variant score for each variant and corresponding reference sequence. The variant scores may be used to prioritize the variants by sorting them according to their scores. Thresholds may be applied to the score to classify the variant as deleterious or non-deleterious, or to classify the variant as pathogenic, likely pathogenic, unknown significance, likely benign or benign, or to classify the variant using any other discrete set of labels. A validation dataset consisting of variants, reference sequences, and labels may be used to compute the thresholds that minimize classification error. The scores may be combined with additional numerical information before the variants are sorted. The scores may be combined with additional numerical information before the thresholds are applied.
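  • Selecting a threshold from labeled validation data can be sketched for the two-class case (deleterious versus non-deleterious) as below. The exhaustive search over midpoints between sorted scores is an illustrative choice, and the validation scores and labels are hypothetical.

```python
from typing import List, Sequence


def classification_errors(cut: float, scores: Sequence[float],
                          labels: Sequence[bool]) -> int:
    """Number of validation variants misclassified by 'score > cut'."""
    return sum((s > cut) != y for s, y in zip(scores, labels))


def best_threshold(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Threshold with the fewest errors among candidate cut points."""
    ordered = sorted(set(scores))
    cuts: List[float] = [ordered[0] - 1.0]
    cuts += [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    cuts.append(ordered[-1] + 1.0)
    return min(cuts, key=lambda c: classification_errors(c, scores, labels))


# Hypothetical validation scores; True marks a deleterious variant.
val_scores = [0.1, 0.2, 0.7, 0.9]
val_labels = [False, False, True, True]
threshold = best_threshold(val_scores, val_labels)
```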
  • For one or more pairs of variants, the distance between the two variants in each pair may be computed by summing the output of a nonlinear function applied to the difference between the change in the condition-specific cell variable for the first variant and the change in the condition-specific cell variable for the second variant. The nonlinear function may be the square operation. The nonlinear function may be the absolute operation.
  • The deleteriousness label of an unknown variant may be determined by computing the distance of the variant to one or more variants of known deleteriousness and outputting the label or the score of the closest known variant. The deleteriousness value of an unknown variant may be determined by computing the distance of the variant to one or more variants of known deleteriousness and then computing the weighted average of their labels or scores, where the weights are nonlinear functions of the distances. Two or more unknown variants may be prioritized, by sorting them according to their deleteriousness values.
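  • The distance between variants and the two ways of transferring deleteriousness from known variants can be sketched as follows. The squared difference is used as the nonlinear function, and the exponential weighting with a `bandwidth` parameter is an assumed example of weights that are nonlinear functions of the distances; the known variants and their scores are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Known = List[Tuple[Sequence[float], float]]  # (per-condition changes, score)


def variant_distance(changes_a: Sequence[float], changes_b: Sequence[float],
                     nonlinearity: Callable[[float], float] = lambda d: d * d
                     ) -> float:
    """Sum over conditions of a nonlinear function of the difference."""
    return sum(nonlinearity(a - b) for a, b in zip(changes_a, changes_b))


def nearest_score(unknown: Sequence[float], known: Known) -> float:
    """Score of the closest known variant."""
    return min(known, key=lambda item: variant_distance(unknown, item[0]))[1]


def weighted_score(unknown: Sequence[float], known: Known,
                   bandwidth: float = 1.0) -> float:
    """Distance-weighted average of the known variants' scores."""
    weights = [math.exp(-variant_distance(unknown, changes) / bandwidth)
               for changes, _ in known]
    return sum(w * s for w, (_, s) in zip(weights, known)) / sum(weights)


# Hypothetical known variants: changes in two conditions and a score.
known_variants: Known = [([0.5, 0.4], 1.0), ([0.0, 0.0], 0.0)]
query = [0.45, 0.35]
closest = nearest_score(query, known_variants)
blended = weighted_score(query, known_variants)
```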
  • The mini-batches used during multitask training may be balanced so that the number of cases that exhibit a large difference is similar to the number of cases that exhibit a small difference.
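  • A balanced mini-batch of the kind described above might be drawn as in this sketch, which samples with replacement from the cases with a large difference and from the cases with a small difference in roughly equal numbers. The cut-off that separates large from small differences, the batch size and the example values are hypothetical.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def balanced_minibatch(examples: Sequence[T], is_large: Callable[[T], bool],
                       batch_size: int, rng: random.Random) -> List[T]:
    """Draw a batch with about half large-difference and half small cases."""
    large = [e for e in examples if is_large(e)]
    small = [e for e in examples if not is_large(e)]
    if not large or not small:
        raise ValueError("need at least one case of each kind")
    half = batch_size // 2
    batch = rng.choices(large, k=half) + rng.choices(small, k=batch_size - half)
    rng.shuffle(batch)
    return batch


# Hypothetical per-example differences; 'large' means magnitude above 0.2.
differences = [0.01, 0.02, 0.0, 0.05, 0.03, 0.6, -0.4, 0.01, 0.02, 0.0]
batch = balanced_minibatch(differences, lambda d: abs(d) > 0.2, 8,
                           random.Random(0))
```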
  • The genetic variant may be a single nucleotide variant. The genetic variant may contain two or more distinct single nucleotide variants. The genetic variant may be a combination of substitutions, insertions and deletions and not be a single nucleotide variant. The genetic variant may be obtained by sequencing the DNA from a patient sample.
  • The reference sequence may be obtained by sequencing the DNA from a close relative of the patient. The reference sequence may be any DNA or RNA sequence and the variant sequence may be any DNA or RNA sequence, but where the reference sequence and the variant sequence are not identical.
  • The features may include position-dependent genetic features such as conservation.
  • The most explanatory feature may be determined by examining each feature in turn, and computing a feature-specific variant feature vector by copying the feature derived from the variant sequence onto the features derived from the reference sequence; using the deep neural network to compute the variant-induced changes in the one or more condition-specific cell variables for that feature-specific variant; and identifying the feature whose corresponding feature-specific variant-induced changes in the one or more condition-specific cell variables are most similar to the variant-induced changes in the one or more condition-specific cell variables.
  • The similarity between the feature-specific variant-induced changes in the one or more condition-specific cell variables and the variant-induced changes in the one or more condition-specific cell variables may be computed by summing the squares of their differences.
  • Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the claims appended hereto.

Claims (20)

What is claimed is:
1. A computer-implemented method for computing a set of variant-induced changes in a condition-specific cell variable for a genetic variant, comprising processing a set of variant features using a cell variable predictor to quantify a condition-specific variant cell variable without obtaining a reference measurement of the genetic variant across a plurality of conditions.
2. The method of claim 1, wherein the genetic variant comprises a variant in a deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) variant sequence relative to a DNA or RNA reference sequence, and wherein the method further comprises extracting the set of variant features from the DNA or RNA variant sequence.
3. The method of claim 1, further comprising extracting a set of reference features from the DNA or RNA reference sequence, and processing the set of reference features using the cell variable predictor to quantify a condition-specific reference cell variable.
4. The method of claim 3, wherein the set of variant features is extracted from the DNA or RNA variant sequence by generating:
a. a binary matrix with 4 rows and a number of columns equal to a length of the DNA or RNA variant sequence or the DNA or RNA reference sequence, wherein each column contains a bit indicating the nucleotide value at the corresponding position in the DNA or RNA variant sequence or the DNA or RNA reference sequence;
b. a set of features computed using one or more layers of an autoencoder other than the input and output layers of the cell variable predictor; or
c. a set of features that correspond to one or more of: RNA secondary structures, nucleosome positions, and retroviral repeat elements.
5. The method of claim 3, further comprising computing, using the cell variable predictor, probabilities for discrete levels of the condition-specific cell variable, wherein each of the set of variant-induced changes is computed by:
a. summing an absolute difference between the computed probabilities for the condition-specific reference cell variable and the condition-specific variant cell variable;
b. summing a Kullback-Leibler divergence between the computed probabilities of the condition-specific reference cell variable and the condition-specific variant cell variable for each condition; or
c. computing an expected value of the condition-specific reference cell variable and the condition-specific variant cell variable, and subtracting the expected value of the condition-specific reference cell variable from the expected value of the condition-specific variant cell variable.
6. The method of claim 1, wherein the cell variable predictor comprises a deep neural network.
7. The method of claim 6, wherein the deep neural network comprises a convolutional neural network, a recurrent neural network, or a long-term short-term memory recurrent neural network.
8. The method of claim 1, further comprising combining the set of variant-induced changes in the condition-specific cell variable to compute a single numerical variant score for the genetic variant, the single numerical variant score computed by:
a. outputting the score for a fixed condition;
b. summing the variant-induced changes across a plurality of conditions; or
c. computing the maximum of the absolute variant-induced changes across a plurality of conditions.
9. The method of claim 1, further comprising computing, for a pair of genetic variants, a distance between the two genetic variants in the pair by summing the output of a nonlinear function applied to a difference between the change in the condition-specific cell variable for the first of the two genetic variants and the change in the condition-specific cell variable for the second of the two genetic variants.
10. The method of claim 1, wherein the genetic variant comprises a) two or more distinct single nucleotide variants (SNVs); or b) a combination of substitutions, insertions, and deletions, wherein the combination is not a single nucleotide variant (SNV).
11. A computer-implemented method for computing a set of variant-induced changes in a condition-specific cell variable for a genetic variant, comprising processing a set of variant features using a cell variable predictor to quantify a condition-specific variant cell variable, wherein the cell variable predictor comprises a deep neural network comprising at least two connected layers of processing units.
12. The method of claim 11, wherein the genetic variant comprises a variant in a deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) variant sequence relative to a DNA or RNA reference sequence, and wherein the method further comprises extracting the set of variant features from the DNA or RNA variant sequence.
13. The method of claim 11, further comprising extracting a set of reference features from the DNA or RNA reference sequence, and processing the set of reference features using the cell variable predictor to quantify a condition-specific reference cell variable.
14. The method of claim 13, wherein the set of variant features is extracted from the DNA or RNA variant sequence by generating:
a. a binary matrix with 4 rows and a number of columns equal to a length of the DNA or RNA variant sequence or the DNA or RNA reference sequence, wherein each column contains a bit indicating the nucleotide value at the corresponding position in the DNA or RNA variant sequence or the DNA or RNA reference sequence;
b. a set of features computed using one or more layers of an autoencoder other than the input and output layers of the cell variable predictor; or
c. a set of features that correspond to one or more of: RNA secondary structures, nucleosome positions, and retroviral repeat elements.
15. The method of claim 13, further comprising computing, using the cell variable predictor, probabilities for discrete levels of the condition-specific cell variable, wherein each of the set of variant-induced changes is computed by:
a. summing an absolute difference between the computed probabilities for the condition-specific reference cell variable and the condition-specific variant cell variable;
b. summing a Kullback-Leibler divergence between the computed probabilities of the condition-specific reference cell variable and the condition-specific variant cell variable for each condition; or
c. computing an expected value of the condition-specific reference cell variable and the condition-specific variant cell variable, and subtracting the expected value of the condition-specific reference cell variable from the expected value of the condition-specific variant cell variable.
16. The method of claim 11, wherein the deep neural network comprises a convolutional neural network, a recurrent neural network, or a long-term short-term memory recurrent neural network.
17. The method of claim 11, further comprising combining the set of variant-induced changes in the condition-specific cell variable to compute a single numerical variant score for the genetic variant, the single numerical variant score computed by:
a. outputting the score for a fixed condition;
b. summing the variant-induced changes across a plurality of conditions; or
c. computing the maximum of the absolute variant-induced changes across a plurality of conditions.
18. The method of claim 11, further comprising applying thresholds that are fixed or selected using labeled data to the single numerical variant score for the genetic variant to classify the genetic variant (i) as one of deleterious or non-deleterious, (ii) as one of pathogenic, likely pathogenic, unknown significance, likely benign, or benign, or (iii) using another discrete set of labels.
19. The method of claim 11, further comprising computing, for a pair of genetic variants, a distance between the two genetic variants in the pair by summing the output of a nonlinear function applied to a difference between the change in the condition-specific cell variable for the first of the two genetic variants and the change in the condition-specific cell variable for the second of the two genetic variants.
20. The method of claim 11, wherein the genetic variant comprises a) two or more distinct single nucleotide variants (SNVs); or b) a combination of substitutions, insertions, and deletions, wherein the combination is not a single nucleotide variant (SNV).
US17/369,499 2015-06-15 2021-07-07 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network Pending US20210383890A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/369,499 US20210383890A1 (en) 2015-06-15 2021-07-07 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US14/739,432 US10185803B2 (en) 2015-06-15 2015-06-15 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network
US16/197,146 US11887696B2 (en) 2015-06-15 2018-11-20 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network
US17/369,499 US20210383890A1 (en) 2015-06-15 2021-07-07 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/197,146 Continuation US11887696B2 (en) 2015-06-15 2018-11-20 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network

Publications (1)

Publication Number Publication Date
US20210383890A1 true US20210383890A1 (en) 2021-12-09

Family

ID=57517141

Family Applications (5)

Application Number Title Priority Date Filing Date
US14/739,432 Active 2036-01-25 US10185803B2 (en) 2015-06-15 2015-06-15 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network
US15/841,106 Active 2037-04-01 US11183271B2 (en) 2015-06-15 2017-12-13 Neural network architectures for linking biological sequence variants based on molecular phenotype, and systems and methods therefor
US16/197,146 Active 2039-04-08 US11887696B2 (en) 2015-06-15 2018-11-20 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network
US17/369,499 Pending US20210383890A1 (en) 2015-06-15 2021-07-07 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network
US17/378,404 Pending US20210407622A1 (en) 2015-06-15 2021-07-16 Neural network architectures for linking biological sequence variants based on molecular phenotype, and systems and methods therefor

Family Applications Before (3)

Application Number Title Priority Date Filing Date
US14/739,432 Active 2036-01-25 US10185803B2 (en) 2015-06-15 2015-06-15 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network
US15/841,106 Active 2037-04-01 US11183271B2 (en) 2015-06-15 2017-12-13 Neural network architectures for linking biological sequence variants based on molecular phenotype, and systems and methods therefor
US16/197,146 Active 2039-04-08 US11887696B2 (en) 2015-06-15 2018-11-20 Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network

Family Applications After (1)

Application Number Title Priority Date Filing Date
US17/378,404 Pending US20210407622A1 (en) 2015-06-15 2021-07-16 Neural network architectures for linking biological sequence variants based on molecular phenotype, and systems and methods therefor

Country Status (3)

Country Link
US (5) US10185803B2 (en)
EP (1) EP3308309B1 (en)
WO (1) WO2016201564A1 (en)

Families Citing this family (97)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10410118B2 (en) 2015-03-13 2019-09-10 Deep Genomics Incorporated System and method for training neural networks
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
US10185803B2 (en) 2015-06-15 2019-01-22 Deep Genomics Incorporated Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network
US10529318B2 (en) * 2015-07-31 2020-01-07 International Business Machines Corporation Implementing a classification model for recognition processing
US10733979B2 (en) 2015-10-09 2020-08-04 Google Llc Latency constraints for acoustic modeling
US10546650B2 (en) 2015-10-23 2020-01-28 Google Llc Neural network for processing aptamer data
KR102341129B1 (en) 2016-02-12 2021-12-21 Regeneron Pharmaceuticals, Inc. Methods and systems for detecting abnormal karyotypes
US11514289B1 (en) * 2016-03-09 2022-11-29 Freenome Holdings, Inc. Generating machine learning models using genetic data
EP3455759A4 (en) * 2016-05-13 2020-01-01 Deep Genomics Incorporated Neural network architectures for scoring and visualizing biological sequence variations using molecular phenotype, and systems and methods therefor
BR112018074572B1 (en) 2016-06-01 2024-02-27 Quantum-Si Incorporated METHODS FOR IDENTIFYING NUCLEOTIDES AND FOR CALIBRATING A SEQUENCING INSTRUMENT, NON-TRANSITORY COMPUTER READABLE STORAGE MEDIUM, SEQUENCING DEVICE AND INSTRUMENT
US9807037B1 (en) 2016-07-08 2017-10-31 Asapp, Inc. Automatically suggesting completions of text
US10083451B2 (en) 2016-07-08 2018-09-25 Asapp, Inc. Using semantic processing for customer support
CN109952382B (en) 2016-08-08 2023-11-14 F. Hoffmann-La Roche AG Base calling for stochastic sequencing processes
CN109952581A (en) 2016-09-28 2019-06-28 D5Ai Llc Learning coach for machine learning system
US10216899B2 (en) * 2016-10-20 2019-02-26 Hewlett Packard Enterprise Development Lp Sentence construction for DNA classification
US11250327B2 (en) 2016-10-26 2022-02-15 Cognizant Technology Solutions U.S. Corporation Evolution of deep neural network structures
US10650311B2 (en) * 2016-12-19 2020-05-12 Asapp, Inc. Suggesting resources using context hashing
US10109275B2 (en) 2016-12-19 2018-10-23 Asapp, Inc. Word hash language model
CA3053368A1 (en) 2017-02-14 2018-08-23 Dignity Health Systems, methods, and media for selectively presenting images captured by confocal laser endomicroscopy
US10134131B1 (en) * 2017-02-15 2018-11-20 Google Llc Phenotype analysis of cellular image data using a deep metric network
US10467754B1 (en) * 2017-02-15 2019-11-05 Google Llc Phenotype analysis of cellular image data using a deep metric network
US10769501B1 (en) 2017-02-15 2020-09-08 Google Llc Analysis of perturbed subjects using semantic embeddings
KR101864986B1 (en) * 2017-02-27 2018-06-05 Korea Advanced Institute of Science and Technology Disease susceptibility and causal element prediction method based on genome information and apparatus therefor
US11507844B2 (en) 2017-03-07 2022-11-22 Cognizant Technology Solutions U.S. Corporation Asynchronous evaluation strategy for evolution of deep neural networks
US20180349158A1 (en) * 2017-03-22 2018-12-06 Kevin Swersky Bayesian optimization techniques and applications
US11915152B2 (en) 2017-03-24 2024-02-27 D5Ai Llc Learning coach for machine learning system
US11295210B2 (en) 2017-06-05 2022-04-05 D5Ai Llc Asynchronous agents with learning coaches and structurally modifying deep neural networks without performance degradation
CA3067642A1 (en) * 2017-06-19 2018-12-27 Jungla Llc Interpretation of genetic and genomic variants via an integrated computational and experimental deep mutational learning framework
US10762423B2 (en) 2017-06-27 2020-09-01 Asapp, Inc. Using a neural network to optimize processing of user requests
US9922285B1 (en) * 2017-07-13 2018-03-20 HumanCode, Inc. Predictive assignments that relate to genetic information and leverage machine learning models
US11699069B2 (en) 2017-07-13 2023-07-11 Helix, Inc. Predictive assignments that relate to genetic information and leverage machine learning models
US11139048B2 (en) * 2017-07-18 2021-10-05 Analytics For Life Inc. Discovering novel features to use in machine learning techniques, such as machine learning techniques for diagnosing medical conditions
US11481218B2 (en) 2017-08-02 2022-10-25 Intel Corporation System and method enabling one-hot neural networks on a machine learning compute platform
US11861491B2 (en) 2017-10-16 2024-01-02 Illumina, Inc. Deep learning-based pathogenicity classifier for promoter single nucleotide variants (pSNVs)
SG10202108020VA (en) 2017-10-16 2021-09-29 Illumina Inc Deep learning-based techniques for training deep convolutional neural networks
MY195477A (en) 2017-10-16 2023-01-26 Illumina Inc Deep Learning-Based Splice Site Classification
WO2019079464A1 (en) * 2017-10-17 2019-04-25 Jungla Inc. Molecular evidence platform for auditable, continuous optimization of variant interpretation in genetic and genomic testing and analysis
US11250314B2 (en) 2017-10-27 2022-02-15 Cognizant Technology Solutions U.S. Corporation Beyond shared hierarchies: deep multitask learning through soft layer ordering
WO2019084559A1 (en) * 2017-10-27 2019-05-02 Apostle, Inc. Predicting cancer-related pathogenic impact of somatic mutations using deep learning-based methods
US20190156204A1 (en) * 2017-11-20 2019-05-23 Koninklijke Philips N.V. Training a neural network model
US10497004B2 (en) 2017-12-08 2019-12-03 Asapp, Inc. Automating communications using an intent classifier
WO2019118299A1 (en) * 2017-12-13 2019-06-20 Sentient Technologies (Barbados) Limited Evolving recurrent networks using genetic programming
WO2019118290A1 (en) 2017-12-13 2019-06-20 Sentient Technologies (Barbados) Limited Evolutionary architectures for evolution of deep neural networks
US10489792B2 (en) 2018-01-05 2019-11-26 Asapp, Inc. Maintaining quality of customer support messages
JP2021511829A (en) 2018-01-26 2021-05-13 Quantum-Si Incorporated Machine-learnable pulse and base determination for sequencing devices
US11321612B2 (en) 2018-01-30 2022-05-03 D5Ai Llc Self-organizing partially ordered networks and soft-tying learned parameters, such as connection weights
US11527308B2 (en) 2018-02-06 2022-12-13 Cognizant Technology Solutions U.S. Corporation Enhanced optimization with composite objectives and novelty-diversity selection
US12033079B2 (en) 2018-02-08 2024-07-09 Cognizant Technology Solutions U.S. Corporation System and method for pseudo-task augmentation in deep multitask learning
US10210244B1 (en) 2018-02-12 2019-02-19 Asapp, Inc. Updating natural language interfaces by processing usage data
US11380422B2 (en) * 2018-03-26 2022-07-05 Uchicago Argonne, Llc Identification and assignment of rotational spectra using artificial neural networks
US11715001B2 (en) * 2018-04-02 2023-08-01 International Business Machines Corporation Water quality prediction
NL2020861B1 (en) * 2018-04-12 2019-10-22 Illumina Inc Variant classifier based on deep neural networks
US20210166782A1 (en) * 2018-04-12 2021-06-03 Dana-Farber Cancer Institute, Inc. Clinical interpretation of genomic and transcriptomic data at the point of care for precision cancer medicine
US20210158895A1 (en) * 2018-04-13 2021-05-27 Dana-Farber Cancer Institute, Inc. Ultra-sensitive detection of cancer by algorithmic analysis
CN108959841A (en) * 2018-04-16 2018-12-07 South China Agricultural University Drug-target protein interaction prediction method based on the DBN algorithm
US10169315B1 (en) 2018-04-27 2019-01-01 Asapp, Inc. Removing personal information from text using a neural network
US11482303B2 (en) 2018-06-01 2022-10-25 Grail, Llc Convolutional neural network systems and methods for data classification
US11443181B2 (en) * 2018-06-18 2022-09-13 Peraton Inc. Apparatus and method for characterization of synthetic organisms
CN109192316B (en) * 2018-07-02 2021-09-07 Hangzhou Normal University Disease subtype prediction system based on gene network analysis
US11126649B2 (en) 2018-07-11 2021-09-21 Google Llc Similar image search for radiology
US11216510B2 (en) 2018-08-03 2022-01-04 Asapp, Inc. Processing an incomplete message with a neural network to generate suggested messages
WO2020028989A1 (en) * 2018-08-08 2020-02-13 Deep Genomics Incorporated Systems and methods for determining effects of therapies and genetic variation on polyadenylation site selection
WO2020041204A1 (en) 2018-08-18 2020-02-27 Sf17 Therapeutics, Inc. Artificial intelligence analysis of rna transcriptome for drug discovery
US10747957B2 (en) 2018-11-13 2020-08-18 Asapp, Inc. Processing communications using a prototype classifier
US11551004B2 (en) 2018-11-13 2023-01-10 Asapp, Inc. Intent discovery with a prototype classifier
US10657447B1 (en) 2018-11-29 2020-05-19 SparkCognition, Inc. Automated model building search space reduction
US11657897B2 (en) 2018-12-31 2023-05-23 Nvidia Corporation Denoising ATAC-seq data with deep learning
CN109840501B (en) * 2019-01-31 2021-06-01 Shenzhen SenseTime Technology Co., Ltd. Image processing method and device, electronic equipment and storage medium
US11481639B2 (en) 2019-02-26 2022-10-25 Cognizant Technology Solutions U.S. Corporation Enhanced optimization with composite objectives and novelty pulsation
US11443832B2 (en) 2019-03-07 2022-09-13 Nvidia Corporation Genetic mutation detection using deep learning
CN110246541A (en) * 2019-03-08 2019-09-17 Sun Yat-sen University circRNA identification method based on LightGBM
EP3938898A4 (en) 2019-03-13 2023-03-29 Cognizant Technology Solutions U.S. Corporation System and method for implementing modular universal reparameterization for deep multi-task learning across diverse domains
TWI696129B (en) 2019-03-15 2020-06-11 Winbond Electronics Corp. Memory chip capable of performing artificial intelligence operation and operation method thereof
US11783917B2 (en) 2019-03-21 2023-10-10 Illumina, Inc. Artificial intelligence-based base calling
US11210554B2 (en) 2019-03-21 2021-12-28 Illumina, Inc. Artificial intelligence-based generation of sequencing metadata
US11783195B2 (en) 2019-03-27 2023-10-10 Cognizant Technology Solutions U.S. Corporation Process and system including an optimization engine with evolutionary surrogate-assisted prescriptions
WO2020198732A1 (en) * 2019-03-28 2020-10-01 Themba Inc. Use of gene expression data and gene signaling networks along with gene editing to determine which variants harm gene function
US11562249B2 (en) 2019-05-01 2023-01-24 International Business Machines Corporation DNN training with asymmetric RPU devices
US11593649B2 (en) 2019-05-16 2023-02-28 Illumina, Inc. Base calling using convolutions
US11423306B2 (en) 2019-05-16 2022-08-23 Illumina, Inc. Systems and devices for characterization and performance analysis of pixel-based sequencing
US12026624B2 (en) 2019-05-23 2024-07-02 Cognizant Technology Solutions U.S. Corporation System and method for loss function metalearning for faster, more accurate training, and smaller datasets
US11425064B2 (en) 2019-10-25 2022-08-23 Asapp, Inc. Customized message suggestion with user embedding vectors
TWI769418B (en) 2019-12-05 2022-07-01 Industrial Technology Research Institute Method and electronic device for selecting neural network hyperparameters
EP4091171A1 (en) * 2020-01-16 2022-11-23 Congenica Ltd. Screening system and method for acquiring and processing genomic information for generating gene variant interpretations
BR112022016415A2 (en) 2020-02-20 2022-10-11 Illumina Inc ARTIFICIAL INTELLIGENCE-BASED MULTIPLE BASE CALLS
US12099934B2 (en) * 2020-04-07 2024-09-24 Cognizant Technology Solutions U.S. Corporation Framework for interactive exploration, evaluation, and improvement of AI-generated solutions
CA3179932A1 (en) * 2020-05-26 2021-12-02 Vadthyavath RAMU Adaptive-learning, auto-labeling method and system for predicting and diagnosing web breaks in paper machine
US11775841B2 (en) 2020-06-15 2023-10-03 Cognizant Technology Solutions U.S. Corporation Process and system including explainable prescriptions through surrogate-assisted evolution
US20220044133A1 (en) * 2020-08-07 2022-02-10 Sap Se Detection of anomalous data using machine learning
US12014281B2 (en) * 2020-11-19 2024-06-18 Merative Us L.P. Automatic processing of electronic files to identify genetic variants
US20220237471A1 (en) * 2021-01-22 2022-07-28 International Business Machines Corporation Cell state transition features from single cell data
US20220328155A1 (en) * 2021-04-09 2022-10-13 Endocanna Health, Inc. Machine-Learning Based Efficacy Predictions Based On Genetic And Biometric Information
US20220336054A1 (en) 2021-04-15 2022-10-20 Illumina, Inc. Deep Convolutional Neural Networks to Predict Variant Pathogenicity using Three-Dimensional (3D) Protein Structures
WO2022272251A2 (en) * 2021-06-21 2022-12-29 The Trustees Of Princeton University Systems and methods for analyzing genetic data for assessment of gene regulatory activity
WO2023196872A1 (en) * 2022-04-06 2023-10-12 Predictiv Care, Inc. Disease or drug association providing system for digital twins with genetic information screened by artificial intelligence
WO2023196868A1 (en) * 2022-04-06 2023-10-12 Predictiv Care, Inc. Gene-based digital twin system that can predict medical risk
WO2024081769A2 (en) * 2022-10-13 2024-04-18 Foundation Medicine, Inc. Methods and systems for detection of cancer based on dna methylation of specific cpg sites

Family Cites Families (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE69332980T2 (en) 1992-11-24 2004-03-04 Pavilion Technologies, Inc., Austin OPERATING A NEURONAL NETWORK WITH MISSING AND / OR INCOMPLETE DATA
US6128609A (en) 1997-10-14 2000-10-03 Ralph E. Rose Training a neural network using differential input
AU2002232536A1 (en) 2000-11-09 2002-05-21 Cold Spring Harbor Laboratory Chimeric molecules to modulate gene expression
US8576232B2 (en) * 2001-12-31 2013-11-05 Siemens Product Lifecycle Management Software Inc. Apparatus, method, and system for drafting multi-dimensional drawings
CA2486431A1 (en) * 2002-05-20 2003-12-04 Rosetta Inpharmatics Llc Computer systems and methods for subdividing a complex disease into component diseases
US9740817B1 (en) * 2002-10-18 2017-08-22 Dennis Sunga Fernandez Apparatus for biological sensing and alerting of pharmaco-genomic mutation
US7217807B2 (en) * 2002-11-26 2007-05-15 Rosetta Genomics Ltd Bioinformatically detectable group of novel HIV regulatory genes and uses thereof
US20150235143A1 (en) * 2003-12-30 2015-08-20 Kantrack Llc Transfer Learning For Predictive Model Development
EP1938231A1 (en) * 2005-09-19 2008-07-02 BG Medicine, Inc. Correlation analysis of biological systems
US20080030797A1 (en) * 2006-08-04 2008-02-07 Eric Circlaeys Automated Content Capture and Processing
US20080300797A1 (en) 2006-12-22 2008-12-04 Aviir, Inc. Two biomarkers for diagnosis and monitoring of atherosclerotic cardiovascular disease
US20110172929A1 (en) * 2008-01-16 2011-07-14 The Trustees Of Columbia University In The City Of System and method for prediction of phenotypically relevant genes and perturbation targets
NZ572036A (en) * 2008-10-15 2010-03-26 Nikola Kirilov Kasabov Data analysis and predictive systems and related methodologies
US20130332081A1 (en) 2010-09-09 2013-12-12 Omicia Inc Variant annotation, analysis and selection tool
US20120310539A1 (en) 2011-05-12 2012-12-06 University Of Utah Predicting gene variant pathogenicity
US20130096838A1 (en) 2011-06-10 2013-04-18 William Fairbrother Gene Splicing Defects
US20140359422A1 (en) 2011-11-07 2014-12-04 Ingenuity Systems, Inc. Methods and Systems for Identification of Causal Genomic Variants
EP2776962A4 (en) 2011-11-07 2015-12-02 Ingenuity Systems Inc Methods and systems for identification of causal genomic variants
WO2014026152A2 (en) * 2012-08-10 2014-02-13 Assurerx Health, Inc. Systems and methods for pharmacogenomic decision support in psychiatry
US8697359B1 (en) 2012-12-12 2014-04-15 The Broad Institute, Inc. CRISPR-Cas systems and methods for altering expression of gene products
US9406017B2 (en) 2012-12-24 2016-08-02 Google Inc. System and method for addressing overfitting in a neural network
US20140199698A1 (en) 2013-01-14 2014-07-17 Peter Keith Rogan METHODS OF PREDICTING AND DETERMINING MUTATED mRNA SPLICE ISOFORMS
US9418203B2 (en) 2013-03-15 2016-08-16 Cypher Genomics, Inc. Systems and methods for genomic variant annotation
US20150066378A1 (en) 2013-08-27 2015-03-05 Tute Genomics Identifying Possible Disease-Causing Genetic Variants by Machine Learning Classification
US9679258B2 (en) 2013-10-08 2017-06-13 Google Inc. Methods and apparatus for reinforcement learning
EP3069305B1 (en) * 2013-11-15 2020-11-04 Intel Corporation Methods, systems and computer program products for using a distributed associative memory base to determine data correlations and convergence therein
US20200097835A1 (en) * 2014-06-17 2020-03-26 Ancestry.Com Dna, Llc Device, system and method for assessing risk of variant-specific gene dysfunction
US20160314245A1 (en) * 2014-06-17 2016-10-27 Genepeeks, Inc. Device, system and method for assessing risk of variant-specific gene dysfunction
US10410118B2 (en) 2015-03-13 2019-09-10 Deep Genomics Incorporated System and method for training neural networks
US10185803B2 (en) 2015-06-15 2019-01-22 Deep Genomics Incorporated Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network
US20170213127A1 (en) * 2016-01-24 2017-07-27 Matthew Charles Duncan Method and System for Discovering Ancestors using Genomic and Genealogic Data
WO2017190211A1 (en) 2016-05-04 2017-11-09 Deep Genomics Incorporated Methods and systems for producing an expanded training set for machine learning using biological sequences
EP3455759A4 (en) * 2016-05-13 2020-01-01 Deep Genomics Incorporated Neural network architectures for scoring and visualizing biological sequence variations using molecular phenotype, and systems and methods therefor
US20180107927A1 (en) 2016-06-15 2018-04-19 Deep Genomics Incorporated Architectures for training neural networks using biological sequences, conservation, and molecular phenotypes
WO2018031485A1 (en) * 2016-08-08 2018-02-15 Och Franz J Identification of individuals by trait prediction from the genome
US9922285B1 (en) * 2017-07-13 2018-03-20 HumanCode, Inc. Predictive assignments that relate to genetic information and leverage machine learning models

Also Published As

Publication number Publication date
US10185803B2 (en) 2019-01-22
US11887696B2 (en) 2024-01-30
US20160364522A1 (en) 2016-12-15
WO2016201564A1 (en) 2016-12-22
US20190252041A1 (en) 2019-08-15
EP3308309A4 (en) 2019-02-13
EP3308309B1 (en) 2024-08-07
US20180165412A1 (en) 2018-06-14
US20210407622A1 (en) 2021-12-30
EP3308309A1 (en) 2018-04-18
US11183271B2 (en) 2021-11-23

Similar Documents

Publication Publication Date Title
US20210383890A1 (en) Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network
CA2894317C (en) Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network
Rifaioglu et al. MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery
Caudai et al. AI applications in functional genomics
EP2864919B1 (en) Systems and methods for generating biomarker signatures with integrated dual ensemble and generalized simulated annealing techniques
US20180107927A1 (en) Architectures for training neural networks using biological sequences, conservation, and molecular phenotypes
Lai et al. Artificial intelligence and machine learning in bioinformatics
Zou et al. Approaches for recognizing disease genes based on network
Zhang et al. Discerning novel splice junctions derived from RNA-seq alignment: a deep learning approach
US8572018B2 (en) Method, system and software arrangement for reconstructing formal descriptive models of processes from functional/modal data using suitable ontology
Conard et al. A spectrum of explainable and interpretable machine learning approaches for genomic studies
Liñares Blanco et al. Differential gene expression analysis of RNA-seq data using machine learning for Cancer research
US20230335228A1 (en) Active Learning Using Coverage Score
Wong The practical bioinformatician
Imoto et al. Analysis of gene networks for drug target discovery and validation
Sharma et al. Evolutionary algorithms and artificial intelligence in drug discovery: opportunities, tools, and prospects
CN114300036A (en) Genetic variation pathogenicity prediction method and device, storage medium and computer equipment
Pe'er From gene expression to molecular pathways
Wu et al. Single-cell Ca2+ parameter inference reveals how transcriptional states inform dynamic cell responses
Kabir et al. DRBpred: A sequence-based machine learning method to effectively predict DNA-and RNA-binding residues
US20240273359A1 (en) Apparatus and method for discovering biomarkers of health outcomes using machine learning
Ünsal A deep learning based protein representation model for low-data protein function prediction
Gu Applying Machine Learning Algorithms for the Analysis of Biological Sequences and Medical Records
Assefa Statistical methods for testing differential gene expression in bulk and single-cell RNA sequencing data
Sha et al. Splice site recognition-deciphering Exon-Intron transitions for genetic insights using Enhanced integrated Block-Level gated LSTM model

Legal Events

Date Code Title Description
AS Assignment

Owner name: DEEP GENOMICS INCORPORATED, CANADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:THE GOVERNING COUNCIL OF THE UNIVERSITY OF TORONTO;REEL/FRAME:057329/0476

Effective date: 20161123

Owner name: THE GOVERNING COUNCIL OF THE UNIVERSITY OF TORONTO, CANADA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FREY, BRENDAN;LEUNG, MICHAEL K.K.;DELONG, ANDREW THOMAS;AND OTHERS;SIGNING DATES FROM 20161125 TO 20161212;REEL/FRAME:057329/0401

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION