AU2013202555A1 - System and Method for Cleaning Noisy Genetic Data and Using Data to Make Predictions - Google Patents

System and Method for Cleaning Noisy Genetic Data and Using Data to Make Predictions Download PDF

Info

Publication number
AU2013202555A1
AU2013202555A1 AU2013202555A AU2013202555A AU2013202555A1 AU 2013202555 A1 AU2013202555 A1 AU 2013202555A1 AU 2013202555 A AU2013202555 A AU 2013202555A AU 2013202555 A AU2013202555 A AU 2013202555A AU 2013202555 A1 AU2013202555 A1 AU 2013202555A1
Authority
AU
Australia
Prior art keywords
data
chromosome
genetic
dna
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2013202555A
Inventor
Milena Banjevic
Zachary Paul Demko
David Scott Johnson
Matthew Rabinowitz
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Natera Inc
Original Assignee
Natera Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2006318425A external-priority patent/AU2006318425B2/en
Application filed by Natera Inc filed Critical Natera Inc
Priority to AU2013202555A priority Critical patent/AU2013202555A1/en
Publication of AU2013202555A1 publication Critical patent/AU2013202555A1/en
Priority to AU2016201386A priority patent/AU2016201386B2/en
Abandoned legal-status Critical Current

Links

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A system and method for determining the genetic data for one or a small set of cells, or from fragmentary DNA, where a limited quantity of genetic data is available, and also for predicting likely phenotypic outcomes using mathematical models and given 5 genetic, phenotypic and/or clinical data of an individual, and also relevant aggregated medical data consisting of genotypic, phenotypic, and/or clinical data from germane patient subpopulations. Genetic data for the target individual is acquired and amplified using known methods, and poorly measured base pairs, missing alleles and missing regions are reconstructed using expected similarities between the target genome and the 10 genome of genetically related subjects.

Description

SYSTEM AND METHOD FOR CLEANING NOISY GENETIC DATA AND USING DATA TO MAKE PREDICTIONS Cross-References To Related Applications 5 This application is a divisional of Australian Application No. 2006318425, filed on 22 November 2006, and is related to International Patent Application No. PCT/US2006/045821, filed on 22 November 2006 and claiming priority from U.S. Provisional Patent Applications: Serial No. 60/739,882, filed on 26 November 2005; 10 Serial No. 60/742,305, filed on 6 December 2005; Serial No. 60/754,396, filed on 29 December 2005; Serial No. 60/774,976, filed on 21 February 2006; Serial No. 60/789,506, filed on 4 April 2006; Serial No. 60/817,741, filed on 30 June 2006; Serial No. 11/496,982, filed on 31 July 2006; and Serial No. 60/846,610, filed on 22 September 2006; each of which is incorporated herein by reference. 15 Field of the Technology The invention relates generally to the field of acquiring, manipulating and using genetic data for medically predictive purposes, and specifically to a system in which 20 imperfectly measured genetic data is made more precise by using known genetic data of genetically related individuals, thereby allowing more effective identification of genetic irregularities that could result in various phenorypic outcomes. It also relates generally to the field of analyzing, managing and acting upon genetic, phenorypic and clinical information, and using that information to predict phenotypic outcomes of medical 25 decisions. More specifically, it relates to methods and systems which use integrated, validated genetic and phenotypic data from a group of subjects to make better decisions regarding a particular subject. Description of the Related Art 30 Prenatal and Preimplantation Genetic Diagnosis Current methods of prenatal diagnosis can alert physicians and parents to abnormalities in growing fetuses. Without prenatal diagnosis, one in 50 babies is born with serious physical or mental handicap, and as many as one in 30 will have some form 35 of congenital malformation. Unfortunately, standard methods require invasive testing and
I
carry a roughly I per cent risk of miscarriage. These methods include amniocentesis, chorion villus biopsy and fetal blood sampling. Of these, amniocentesis is the most common procedure; in 2003, it was performed in approximately 3% of all pregnancies, though its frequency of use has been decreasing over the past decade and a half. A major 5 drawback of prenatal diagnosis is that given the limited courses of action once an abnormality has been detected, it is only valuable and ethical to test for very serious defects. As result, prenatal diagnosis is typically only attempted in cases of high-risk pregnancies, where the elevated chance of a defect combined with the seriousness of the potential abnormality outweighs the risks. A need exists for a method of prenatal 10 diagnosis that mitigates these risks. It has recently been discovered that cell-free fetal DNA and intact fetal cells can enter maternal blood circulation. Consequently, analysis of these cells can allow early Non-Invasive Prenatal Genetic Diagnosis (NIPGD). A key challenge in using NIPGD is the task of identifying and extracting fetal cells or nucleic acids from the mother's blood. 15 The fetal cell concentration in maternal blood depends on the stage of pregnancy and the condition of the fetus, but estimates range from one to forty fetal cells in every milliliter of maternal blood, or less than one fetal cell per 100,000 maternal nucleated cells. Current techniques are able to isolate small quantities of fetal cells from the mother's blood, although it is very difficult to enrich the fetal cells to purity in any quantity. The most 20 effective technique in this context involves the use of monoclonal antibodies, but other techniques used.to isolate fetal cells include density centrifugation, selective lysis of adult erythrocytes, and FACS. Fetal DNA isolation has been demonstrated using PCR amplification using primers with fetal-specific DNA sequences. Since only tens of molecules of each embryonic SNP are available through these techniques, the genotyping 25 of the fetal tissue with high fidelity is not currently possible. Normal humans have two sets of 23 chromosomes in every diploid cell, with one copy coming from each parent. Aneuploidy, a cell with an extra or missing chromosomes, and uniparental disomy, a cell with two of a given chromosome that originate from one parent, are believed to be responsible for a large percentage of failed 30 implantations, miscarriages, and genetic diseases. When only certain cells in an individual are aneuploid, the individual is said to exhibit mosaicism. Detection of chromosomal abnormalities can identify individuals or embryos with conditions such as Down syndrome, Klinefelters syndrome, and Turner syndrome, among others, in addition 2 to increasing the chances of a successful pregnancy. Testing for chromosomal abnormalities is especially important as mothers age: between the ages of 35 and 40 it is estimated that between 40% and 50% of the embryos are abnormal, and above the age of 40, more than half of the embryos are abnormal. 5 Karyotyping, the traditional method used for the prediction of aneuploides and mosaicism is giving way to other more high throughput, more cost effective methods. One method that has attracted much attention recently is Flow cytometry (FC) and fluorescence in situ hybridization (FISH) which can be used to detect aneuploidy in any phase of the cell cycle. One advantage of this method is that it is less expensive than 10 karyotyping, but the cost is significant enough that generally a small selection of chromosomes are tested (usually chromosomes 13, 18, 21, X, Y; also sometimes 8, 9, 15, 16, 17, 22); in addition, FISH has a low level of specificity. Using FISH to analyze 15 cells, one can detect mosaicism of 19% with 95% confidence. The reliability of the test becomes much lower as the level of mosaicism gets lower, and as the number of cells to 15 analyze decreases. The test is estimated to have a false negative rate as high as 15% when a single cell is analysed. There is a great demand for a method that has a higher throughput, lower cost, and greater accuracy. Much research has been done towards the use of pre-implantation genetic diagnosis (PGD) as an alternative to classical prenatal diagnosis of inherited disease. 20 Most PGD today focuses on high-level chromosomal abnormalities such as aneuploidy and balanced translocations with the primary outcomes being successful implantation and a take-home baby. A need exists for a method for more extensive genotyping of embryos at the pre-implantation stage. The number of known disease associated genetic alleles is currently at 389 according to OMIM and steadily climbing. Consequently, it is becoming 25 increasingly relevant to analyze multiple embryonic SNPs that are associated with disease phenotypes. A clear advantage of pre-implantation genetic diagnosis over prenatal diagnosis is that it avoids some of the ethical issues regarding possible choices of action once undesirable phenotypes have been detected. 30 Genotyping Many techniques exist for isolating single cells. The FACS machine has a variety of applications; one important application is to discriminate between cells based on size, shape and overall DNA content. The FACS machine can be set to sort single cells into 3 any desired container. Many different groups have used single cell DNA analysis for a number of applications, including prenatal genetic diagnosis, recombination studies, and analysis of chromosomal imbalances. Single-sperm genotyping has been used previously for forensic analysis of sperm samples (to decrease problems arising from mixed samples) 5 and for single-cell recombination studies. Isolation of single cells from human embryos, while highly technical, is now routine in in vitro fertilization clinics. To date, the vast majority of prenatal diagnoses have used fluorescent in situ hybridization (FISH), which can determine large chromosomal aberrations (such as Down syndrome, or trisomy 21) and 10 PCR/electrophoresis, which can determine a handful of SNPs or other allele calls. Both polar bodies and blastomeres have been isolated with success. It is critical to isolate single blastomeres without compromising embryonic integrity. The most common technique is to remove single blastomeres from day 3 embryos (6 or 8 cell stage). Embryos are transferred to a special cell culture medium (standard culture medium 15 lacking calcium and magnesium), and a hole is introduced into the zona pellucida using an acidic solution, laser, or mechanical drilling. The technician then uses a biopsy pipette to remove a single visible nucleus. Clinical studies have demonstrated that this process does not decrease implantation success, since at this stage embryonic cells are undifferentiated. 20 There are three major methods available for whole genome amplification (WGA): ligation-mediated PCR (LM-PCR), degenerate oligonucleotide primer PCR (DOP-PCR), and multiple displacement amplification (MDA). In LM-PCR, short DNA sequences called adapters are ligated to blunt ends of DNA. These adapters contain universal amplification sequences, which are used to amplify the DNA by PCR. In DOP-PCR, 25 random primers that also contain universal amplification sequences are used in a first round of annealing and PCR. Then, a second round of PCR is used to amplify the sequences further with the universal primer sequences. Finally, MDA uses the phi-29 polymerase, which is a highly processive and non-specific enzyme that replicates DNA and has been used for single-cell analysis. Of the three methods, DOP-PCR reliably 30 produces large quantities of DNA from small quantities of DNA, including single copies of chromosomes. On the other hand, MDA is the fastest method, producing hundred-fold amplification of DNA in a few hours. The major limitations to amplification material from a single cells are (1) necessity of using extremely dilute DNA concentrations or 4 extremely small volume of reaction mixture, and (2) difficulty of reliably dissociating DNA from proteins across the whole genome. Regardless, single-cell whole genome amplification has been used successfully for a variety of applications for a number of years. 5 There are numerous difficulties in using DNA amplification in these contexts. Amplification of single-cell DNA (or DNA from a small number of cells, or from smaller amounts of DNA) by PCR can fail completely, as reported in 5-10% of the cases. This is often due to contamination of the DNA, the loss of the cell, its DNA, or accessibility of the DNA during the PCR reaction. Other sources of error that may arise in measuring the 10 embryonic DNA by amplification and microarray analysis include transcription errors introduced by the DNA polymerase where a particular nucleotide is incorrectly copied during PCR, and microarray reading errors due to imperfect hybridization on the array. The biggest problem, however, remains allele drop-out (ADO) defined as the failure to amplify one of the two alleles in a heterozygous cell. ADO can affect up to more than 15 40% of amplifications and has already caused PGD misdiagnoses. ADO becomes a -health issue especially in the case of a dominant disease, where the failure to amplify can lead to implantation of an affected embryo. The need for more than one set of primers per each marker (in heterozygotes) complicate the PCR process. Therefore, more reliable PCR assays are being developed based on understanding the ADO origin. Reaction 20 conditions for single-cell amplifications are under study. The amplicon size, the amount of DNA degradation, freezing and thawing, and the PCR program and conditions can each influence the rate of ADO. All those techniques, however, depend on the minute DNA amount available for amplification in the single cell. This process is often accompanied by contamination. 25 Proper sterile conditions and microsatellite sizing can exclude the chance of contaminant DNA as microsatellite analysis detected only in parental alleles exclude contamination. Studies to reliably transfer molecular diagnostic protocols to the single-cell level have been recently pursued using first-round multiplex PCR of microsatellite markers, followed by real-time PCR and microsatellite sizing to exclude chance contamination. 30 Multiplex PCR allows for the amplification of multiple fragments in a single reaction, a crucial requirement in the single-cell DNA analysis. Although conventional PCR was the first method used in PGD, fluorescence in situ hybridization (FISH) is now common. It is a delicate visual assay that allows the detection of nucleic acid within undisturbed cellular 5 and tissue architecture. It relies firstly on the fixation of the cells to be analyzed. Consequently, optimization of the fixation and storage condition of the sample is needed, especially for single-cell suspensions. Advanced technologies that enable the diagnosis of a number of diseases at the 5 single-cell level include interphase chromosome conversion, comparative genomic hybridization (CGH), fluorescent PCR, and whole genome amplification. The reliability of the data generated by all of these techniques rely on the quality of the DNA preparation. PGD is also costly, consequently there is a need for less expensive approaches, such as mini-sequencing. Unlike most mutation-detection techniques, mini 10 sequencing permits analysis of very small DNA fragments with low ADO rate. Better methods for the preparation of single-cell DNA for amplification and PGD are therefore needed and are under study. The more novel microarrays and comparative genomic hybridization techniques, still ultimately rely on the quality of the DNA under analysis. Several techniques are in development to measure multiple SNPs on the DNA of a 15 small number of cells, a single cell (for example, a blastomere), a small number of chromosomes, or from fragments of DNA. There are techniques that use Polymerase Chain Reaction (PCR), followed by microarray genotyping analysis. Some PCR-based techniques include whole genome amplification (WGA) techniques such as multiple displacement amplification (MDA), and Molecular Inversion Probes (MIPS) that perform 20 genotyping using multiple tagged oligonucleotides that may then be amplified using PCR with a singe pair of primers. An example of a non-PCR based technique is fluorescence in situ hybridization (FISH). It is apparent that the techniques will be severely error prone due to the limited amount of genetic material which will exacerbate the impact of effects such as allele drop-outs, imperfect hybridization, and contamination. 25 Many techniques exist which provide genotyping data. Taqman is a unique genotyping technology produced and distributed by Applied Biosystems. Taqman uses polymerase chain reaction (PCR) to amplify sequences of interest. During PCR cycling, an allele specific minor groove binder (MGB) probe hybridizes to amplified sequences. Strand synthesis by the polymerase enzymes releases reporter dyes linked to the MGB 30 probes, and then the Taqman optical readers detect the dyes. In this manner, Taqman achieves quantitative allelic discrimination. Compared with array based genotyping technologies, Taqman is quite expensive per reaction (-$0.40/reaction), and throughput is relatively low (384 genotypes per run). While only Ing of DNA per reaction is 6 necessary, thousands of genotypes by Taqman requires microgram quantities of DNA, so Taqman does not necessarily use less DNA than microarrays. However, with respect to the IVF genotyping workflow, Taqman is the most readily applicable technology. This is due to the high reliability of the assays and, most importantly, the speed and ease of the 5 assay (-3 hours per run and minimal molecular biological steps). Also unlike many array technologies (such as 500k Affymetrix arrays), Taqman is highly customizable, which is important for the IVF market. Further, Taqman is highly quantitative, so anueploidies could be detected with this technology alone. Illumina has recently emerged as a leader in high-throughput genotyping. Unlike 10 Affyinetrix, Illumina genotyping arrays do not rely exclusively on hybridization. Instead, Illumina technology uses an allele-specific DNA extension step, which is much more sensitive and specific than hybridization alone, for the original sequence detection. Then, all of these alleles are amplified in multiplex by PCR, and then these products hybridized to bead arrays. The beads on these arrays contain unique "address" tags, not 15 native sequence, so this hybridization is highly specific and sensitive. Alleles are then called by quantitative scanning of the bead arrays. The Illumina Golden Gate assay system genotypes up to 1536 loci concurrently, so the throughput is better than Taqman but not as high as Affymetrix 500k arrays. The cost of Ilumina genotypes is lower than Taqman, but higher than Affymetrix arrays. Also, the Illumina platform takes as long to 20 complete as the 500k Affymetrix arrays (up to 72 hours), which is problematic for IVF genotyping. However, Illumina has a much better call rate, and the assay is quantitative, so anueploidies are detectable with this technology. Illumina technology is much more flexible in choice of SNPs than 500k Affymetrix arrays. One of the highest throughput techniques, which allows for the measurement of 25 up to 250,000 SNPs at a time, is the Affymetrix GeneChip 500K genotyping array. This technique also uses PCR, followed by analysis by hybridization and detection of the amplified DNA sequences to DNA probes, chemically synthesized at different locations on a quartz surface. A disadvantage of these arrays are the low flexibility and the lower sensitivity. There are modified approaches that can increase selectivity, such as the 30 "perfect match" and "mismatch probe" approaches, but these do so at the cost of the number of SNPs calls per array. Pyrosequencing, or sequencing by synthesis, can also be used for genotyping and SNP analysis. The main advantages to pyrosequencing include an extremely fast 7 turnaround and unambiguous SNP calls, however, the assay is not currently conducive to high-throughput parallel analysis. PCR followed by gel electrophoresis is an exceedingly simple technique that has met the most success in preimplantation diagnosis. In this technique, researchers use nested PCR to amplify short sequences of interest. Then, they 5 run these DNA samples on a special gel to visualize the PCR products. Different bases have different molecular weights, so one can determine base content based on how fast the product runs in the gel. This technique is low-throughput and requires subjective analyses by scientists using current technologies, but has the advantage of speed (1-2 hours of PCR, I hour of gel electrophoresis). For this reason, it has been used previously 10 for prenatal genotyping for a myriad of diseases, including: thalassaemia, neurofibromatosis type 2, leukocyte adhesion deficiency type I, Hallopeau-Siemens disease, sickle-cell anemia, retinoblastoma, Pelizaeus-Merzbacher disease, Duchenne muscular dystrophy, and Currarino syndrome. Another promising technique that has been developed for genotyping small 15 quantities of genetic material with very high fidelity is Molecular Inversion Probes (MIPs), such as Affymetrix's Genflex Arrays. This technique has the capability to measure multiple SNPs in parallel: more than 10,000 SNPS measured in parallel have been verified. For small quantities of genetic material, call rates for this technique have been established at roughly 95%, and accuracy of the calls made has been established to 20 be above 99%. So far, the technique has been implemented for quantities of genomic data as small as 150 molecules for a given SNP. However, the technique has not been verified for genomic data from a single cell, or a single strand of DNA, as would be required for pre-implantation genetic diagnosis. The MIP technique makes use of padlock probes which are linear oligonucleotides 25 whose two ends can be joined by ligation when they hybridize to immediately adjacent target sequences of DNA. After the probes have hybridized to the genomic DNA, a gap fill enzyme is added to the assay which can add one of the four nucleotides to the gap. If the added nucleotide (AC,TG) is complementary to the SNP under measurement, then it will hybridize to the DNA, and join the ends of the padlock probe by ligation. The 30 circular products, or closed padlock probes, are then differentiated from linear probes by exonucleolysis. The exonuclease, by breaking down the linear probes and leaving the circular probes, will change the relative concentrations of the closed vs. the unclosed probes by a factor of 1000 or more. The probes that remain are then opened at a cleavage 8 site by another enzyme, removed from the DNA, and amplified by PCR. Each probe is tagged with a different tag sequence consisting of 20 base tags (16,000 have been generated), and can be detected, for example, by the Affymetrix GenFlex Tag Array. The presence of the tagged probe from a reaction in which a particular gap-fill enzyme was 5 added indicates the presence of the complimentary amino acid on the relevant SNP. The molecular biological advantages of MIPS include: (1) multiplexed genotyping in a single reaction, (2) the genotype "call" occurs by gap fill and ligation, not hybridization, and (3) hybridization to an array of universal tags decreases false positives inherent to most array hybridizations. In traditional 500K, TaqMan and other genotyping 10 arrays, the entire genomic sample is hybridized to the array, which contains a variety of perfect match and mismatch probes, and an algorithm calls likely genotypes based on the intensities of the mismatch and perfect match probes. Hybridization, however, is inherently noisy, because of the complexities of the DNA sample and the huge number of probes on the arrays. MIPs, on the other hand, uses mutliplex probes (i.e., not on an 15 array) that are longer and therefore more specific, and then uses a robust ligation step to circularize the probe. Background is exceedingly low in this assay (due to specificity), though allele dropout may be high (due to poor performing probes). When this technique is used on genomic data from a single cell (or small numbers of cells) it will - like PCR based approaches - suffer from integrity issues. For example, 20 the inability of the padlock probe to hybridize to the genomic DNA will cause allele dropouts. This will be exacerbated in the context of in-vitro fertilization since the efficiency of the hybridization reaction is low, and it needs to proceed relatively quickly in order to genotype the embryo in a limited time period. Note that the hybridization reaction can be reduced well below vendor-recommended levels, and micro-fluidic 25 techniques may also be used to accelerate the hybridization reaction. These approaches to reducing the time for the hybridization reaction will result in reduced data quality. Predictive Genomics Once the genetic data has been measured, the next step is to use the data for 30 predictive purposes. Much research has been done in predictive genomics, which tries to understand the precise functions of proteins, RNA and DNA so that phenotypic predictions can be made based on genotype. Canonical techniques focus on the function of Single-Nucleotide Polymorphisms (SNP); but more advanced methods are being 9 brought to bear on multi-factorial phenotypic features. These methods include techniques, such as linear regression and nonlinear neural networks, which attempt to determine a mathematical relationship between a set of genetic and phenotypic predictors and a set of measured outcomes. There is also a set of regression analysis techniques, such as Ridge 5 regression, log regression and stepwise selection, that are designed to accommodate sparse data sets where there are many potential predictors relative to the number of outcomes, as is typical of genetic data, and which apply additional constraints on the regression parameters so that a meaningful set of parameters can be resolved even when the data is underdetermined. Other techniques apply principal component analysis to 10 extract information from undetermined data sets. Other techniques, such as decision trees and contingency tables, use strategies for subdividing subjects based on their independent variables in order to place subjects in categories or bins for which the phenotypic outcomes are similar. A recent technique, termed logical regression, describes a method to search for different logical interrelationships between categorical independent variables 15 in order to model a variable that depends on interactions between multiple independent variables related to genetic data. Regardless of the method used, the quality of the prediction is naturally highly dependant on the quality of the genetic data used to make the prediction. The cost of DNA sequencing is dropping rapidly, and in the near future 20 individual genomic sequencing 'for personal benefit will become more common. Knowledge of personal genetic data will allow for extensive phenotypic predictions to be made for the individual. In order to make accurate phenotypic predictions high quality genetic data is critical, whatever the context. In the case of prenatal or pre-implantation genetic diagnoses a complicating factor is the relative paucity of genetic material 25 available. Given the inherently noisy nature of the measured genetic data in cases where limited genetic material is used for genotyping, there is a great need for a method which can increase the fidelity of, or clean, the primary data. The current methods by which clinical decisions are made do not make the best possible use of existing information. As medical, biochemical and information 30 technology advance, increasing amounts of data are generated and stored both for individual patients, and also in the context of academic and clinical studies. With the recent upsurge in the amounts of genetic, phenotypic and clinical information available for analysis, much effort has gone into finding clinically relevant correlations to help 10 people lead longer, healthier and more enjoyable lives. Whereas previously clinicians and researchers would concentrate their analysis on a handful of obvious potential factors and use a local store of data, it is becoming clear the potential benefit of being able to leverage data measured by scores of other agents, and using more complex models that 5 can identify previously unsuspected factors which correlate with a given genotype or phenotype. This situation will become considerably more complicated once personal genetic data occupies a more central role in understanding the causes and treatments of diseases and other predispositions of subjects. Within the next decade it may be possible to scan the entire genome of a patient as well as to collect a myriad of phenotypic data 10 points, either for clinical trials, or for the purpose of personalized treatments and or drug assignment. As the amount of data available has become enormous, and is still increasing rapidly, the crux of the problem has become designing and implementing good methods that allow the most appropriate correlations to be uncovered and used to benefit people. 15 As the number of variables available to analyze has increased, it has become more important to develop methods that are able to digest the astronomical number of potential correlations, and do not a-priori rule any of them out. At the same time it is important to develop methods that can integrate and utilize the findings of multiple studies, even when those studies were not conducted with identical protocols. It is also becoming 20 increasingly important given the large number of prediction models which have been studied, to develop systems that can correctly identify the optimal method to use in a given analysis. Bioinformatics in the Context of HIV 25 HIV is considered pandemic in humans with more than 30 million people currently living with lIV, and more than 2 million deaths each year attributable to HIV. One of the major characteristics of HIV is its high genetic variability as a result of its fast replication cycle and the high error rate and recombinogenic properties of reverse transcriptase. As a result, various strains of the HiIV virus show differing levels of 30 resistance to different drugs, and an optimal treatment regimen may take into account the identity of the infective strain and its particular susceptibilities. As of today, approved ART drugs consist of a list of eleven RTIs: seven nucleoside, one nucleotide and three non-nucleoside; seven PIs; and one fusion/entry 11 inhibitor. Given the current rollout of ART drugs around the world the appearance of resistance strains of the virus is inevitable, both due to the low genetic barrier to resistance and to poor drug adherence. Consequently, techniques to predict how mutated viruses will respond to anti-retroviral therapy are increasingly important as they will 5 influence the outcome for salvage therapies. The rapidly decreasing cost of viral genetic sequencing - with volume pricing as low as $5 for pre-prepared sequences - makes the selection of drugs based on viral genetic sequence data an attractive option, rather than the more costly and involved in-vitro phenotype measurement. The use of sequence data, however, necessitates accurate predictions of viral drug response, based on the 10 appearance of viral genetic mutations. The many different combinations of viral mutations make it difficult to design a model that includes all the genetic cofactors and their interactions, and to train the model with limited data. The latter problem is exacerbated in the context of modeling in-vivo drug response, where the many different combinations of drug regimens make it difficult to collect sufficiently large data sets for 15 -any particular regimen that contain the variables, namely baseline clinical status, treatment history, clinical outcome and genetic sequence. Resistance to antiviral drugs can be the result of one mutation within the RT or protease sequences, or the combination of multiple mutations. The RT enzyme is coded by a key set of 560 codons; the protease enzyme by 99 codons. By considering only 20 mutations that alter the amino acids, each amino acid locus has 19 possible mutations; so there are a total of 10,640 possible mutations that differ from wild type on the RT enzyme, and 1,981 possible mutations on the protease enzyme. Using a simple linear model, where each mutation encountered in the data (not all mutations will occur) is associated with a particular weighting, or linear regression parameter, several thousand 25 parameters may exist. If only several hundred patient samples are available for each drug, the problem is overcomplete, or ill-posed in the Hadamard sense, since there are more parameters to estimate than independent equations. Many techniques exist that can be applied to the problem of constructing models for the ill-posed problem. These include combining a-priori expert knowledge with observations to create expert-rule based 30 systems, as well as statistical methods including i) ridge regression, ii) principal component analysis, iii) decision trees, iv) stepwise selection techniques, v) Neural Networks, vi) the Least Absolute Shrinkage and Selection Operator (LASSO), and vii) Support Vector Machines (SVM). 12 Three main industry-standard expert systems are typically used to predict the susceptibility of HIV viruses to ART drugs: the ANRS-AC 11 System, the Rega System, and the Stanford HIVdb system. It is commonplace in the literature for new algorithms to be benchmarked against these expert systems. None of these expert systems, however, is 5 designed to perform direct prediction of phenotypic response, but rather to provide a numeric score by which different drugs can be compared, or to classify the drugs into discrete groupings such as Sensitive, Intermediate and Resistant. In addition, it has been clearly established that statistical algorithms, such as linear regression models trained with stepwise selection, substantially outperform expert systems in prediction of 10 phenotypic outcome. Consequently, only a set of statistical techniques are compared with the novel methods in the detailed description, which includes the best performing methods recently disclosed in the literature. Current approaches to predicting clinical outcomes of salvage ART do not demonstrate good predictive power, largely due to a lack of statistically significant 15 outcome data, combined with the many different permutations of drug regimens and genetic mutations. This field has a pressing need both for the integration of multiple heterogeneous data sets and the enhancement of drug response prediction. Bioinfonnatics in the Context of Cancer 20 Of the estimated 80,000 annual clinical trials, 2,100 are for cancer drugs. Balancing the risks and benefits for cancer therapy represents a clinical vanguard for the combined use of phenotypic and genotypic information. Although there have been great advances in chemotherapy in the past few decades, oncologists still treat their cancer patients with primitive systemic drugs that are frequently as toxic to normal cells as to 25 cancer cells. Thus, there is a fine line between the maximum toxic dose of chemo and the therapeutic dose. Moreover, dose-limiting toxicity may be more severe in some patients than others, shifting the therapeutic window higher or lower. For example, anthracyclines used for breast cancer treatment can cause adverse cardiovascular events. Currently, all patients are treated as though at risk for cardiovascular toxicity, though if a patient could 30 be determined to be at low-risk for heart disease, the therapeutic window could be shifted to allow for a greater dose of anthracycline therapy. To balance the benefits and risks of chemotherapy for each patient, one may predict the side effect profile and therapeutic effectiveness of pharmaceutical 13 interventions. Cancer therapy often fails due to inadequate adjustment for unique host and tumor genotypes. Rarely does a single polymorphism cause significant variation in drug response; rather, manifold polymorphisms result in unique biomolecular compositions, making clinical outcome prediction difficult. "Pharmacogenetics" is broadly defined as 5 the way in which genetic variations affect patient response to drugs. For example, natural variations in liver enzymes affect drug metabolism. The future of cancer chemotherapy is targeted pharmaceuticals, which require understanding cancer as a disease process encompassing multiple genetic, molecular, cellular, and biochemical abnormalities. With the advent of enzyme-specific drugs, care may be taken to insure that tumors express the 10 molecular target specifically or at higher levels than normal tissues. Interactions between tumor cells and healthy cells may be considered, as a patient's normal cells and enzymes may limit exposure of the tumor drugs or make adverse events more likely. Bioinformatics will revolutionize cancer treatment, allowing for tailored treatment to maximize benefits and minimize adverse events. Functional markers used to predict 15 response may be analyzed by computer algorithms. Breast, colon, lung and prostate cancer are the four most common cancers. An example of two treatments for these cancers are tamoxifen, which is used to treat breast cancer, and irinotecan which is used in colon cancer patients. Neither tamoxifen or irinotecan are necessary or sufficient for treating breast or colon cancer, respectively. Cancer and cancer treatment are dynamic 20 processes that require therapy revision, and frequently combination therapy, according to a patient's side effect profile and tumor response. If one imagines cancer treatment as a decision tree, to give or withhold any one treatment before, after, or with other therapies, then this tree comprises a subset of decision nodes, where much of the tree (i.e. other treatments) can be considered a black box. Nonetheless, having data to partially guide a 25 physician to the most effective treatment is advantageous, and as more data is gathered, an effective method for making treatment decisions based on this data could significantly improve life expectancies and quality of living in thousands of cancer patients. The colon, or large intestine, is the terminal 6-foot section of the gastrointestinal (GI) tract. The American Cancer Society estimates that 145,000 cases of colorectal cancer 30 will be diagnosed in 2005, and 56,000 will die as a result. Colorectal cancers are assessed for grade, or cellular abnormalities, and stage, which is subcategorized into tumor size, lymph node involvement, and presence or absence of distant metastases. 95% of colorectal cancers are adenocarcinomas that develop from genetically-mutant epithelial 14 cells lining the lumen of the colon. In 80-90% of cases, surgery alone is the standard of care, but the presence of metastases calls for chemotherapy. One of many first-line treatments for metastatic colorectal cancer is a regimen of 5-fluorouracil, leucovorin, and irinotecan. 5 Irinotecan is a camptothecin analogue that inhibits topoisomerase, which untangles super-coiled DNA to allow DNA replication to proceed in mitotic cells, and sensitizes cells to apoptosis. Irinotecan does not have a defined role in a biological pathway, so clinical outcomes are difficult to predict. Dose-limiting toxicity includes severe (Grade III-IV) diarrhea and myelosuppression, both of which require immediate 10 medical attention. Irinotecan is metabolized by uridine diphosphate glucuronosyl transferase isoform lal (UGT1AI) to an active metabolite, SN-38. Polymorphisms in UGTIA1 are correlated with severity of GI and bone marrow side effects. Prior Art 15 Listed here is a set of prior art which is related to the field of the current invention. None of this prior art contains or in any way refers to the novel elements of the current invention. In US Patent 6,720,140, Hartley et al describe a recombinational cloning method for moving or exchanging segments of DNA molecules using engineered recombination sites and recombination proteins. In US Patent 6,489,135 Parrott et al. 20 provide methods for determining various biological characteristics of in vitro fertilized embryos, including overall embryo health, implantability, and increased likelihood of developing successfully to term by analyzing media specimens of in vitro fertilization cultures for levels of bioactive lipids in order to determine these characteristics. In US Patent Application 20040033596 Threadgill et al. describe a method for preparing 25 homozygous cellular libraries useful for in vitro phenotyping and gene mapping involving site-specific mitotic recombination in a plurality of isolated parent cells. In US Patent 5,994,148 Stewart et al. describe a method of determining the probability of an in vitro fertilization (IVF) being successful by measuring Relaxin directly in the serum or indirectly by culturing granulosa lutein cells extracted from the patient as part of an 30 IVF/ET procedure. In US Patent application 5,635,366 Cooke et al. provide a method for predicting the outcome of IVF by determining the level of 11 [-hydroxysteroid dehydrogenase in a biological sample from a female patient. In U.S. Patent No. 7,058,616 Larder et al. describe a method for using a neural network to predict the 15 resistance of a disease to a therapeutic agent. In U.S. Patent No. 6,958,211 Vingerhoets et al. describe a method wherein the integrase genotype of a given HIV strain is simply compared to a known database of HIV integrase genotype with associated phenotypes to find a matching genotype. In U.S. Patent 7,058,517 Denton et al. describe a method 5 wherein an individual's haplotypes are compared to a known database of haplotypes in the general population to predict clinical response to a treatment. In U.S. Patent 7,035,739 Schadt at al. describe a method is described wherein a genetic marker map is constructed and the individual genes and traits are analyzed to give a gene-trait locus data, which are then clustered as a way to identify genetically interacting pathways, which are validated 10 using multivariate analysis. In U.S. Patent No. 6,025,128 Veltri et al. describe a method involving the use of a neural network utilizing a collection of biomarkers as parameters to evaluate risk of prostate cancer recurrence. In U.S. Patent No. 5,824,467 Mascarenhas describes a method to predict drug responsiveness by establishing a biochemical profile for patients and measuring responsiveness in members of the test cohort, and then 15 individually testing the parameters of the patients' biochemical profile to find correlations with the measures of drug responsiveness. SUMMARY OF THE INVENTION 20 The system disclosed enables the cleaning of incomplete or noisy genetic data using secondary genetic data as a source of information, and also using that genetic data to make phenotypic and clinical predictions, While the disclosure focuses on genetic data from human subjects, it should be noted that the methods disclosed apply to the genetic data of a range of organisms, in a range of contexts. The techniques described for 25 cleaning genetic data are most relevant in the context of pre-implantation diagnosis during in-vitro fertilization, prenatal diagnosis in conjunction with amniocentesis, chorion villus biopsy, and fetal blood sampling, and non-invasive prenatal diagnosis, where a small quantity of fetal genetic material is isolated from maternal blood. The diagnoses may focus on inheritable diseases, increased likelihoods of defects or abnormalities, as 30 well as making phenotype predictions for individuals to enhance clinical and lifestyle decisions. The invention addresses the shortcomings of prior art that are discussed above. The techniques described here for making phenotypic and clinical predictions are relevant in multiple contexts, including in the context of pre-implantation diagnosis, prenatal 16 diagnosis, and also in the context of individuals with medical conditions, or susceptibilities. Certain embodiments of the technology disclosed herein describe a system for making accurate predictions of phenotypic outcomes or phenotype susceptibilities for an individual given a set of genetic, phenotypic and or clinical 5 information for the individual. In one aspect, a technique for building linear and nonlinear regression models that can predict phenotype accurately when there are many potential predictors compared to the number of measured outcomes, as is typical of genetic data, is disclosed; in another aspect of the invention the models are based on contingency tables and built from information available in the public domain. In yet another invention, a 10 system is described wherein a number of models are trained on a relevant dataset, and that model which is most accurate in making the relevant prediction is used. In one aspect of the invention, methods make use of imperfect knowledge of the genetic data of the mother and the father, together with the knowledge of the mechanism of meiosis and the imperfect measurement of the embryonic DNA, in order to reconstruct, 15 in silico, the embryonic DNA at the location of key SNPs with a high degree of confidence. It is important to note that the parental data allows the reconstruction not only of SNPs that were measured poorly, but also of insertions, deletions, and of SNPs or whole regions of DNA that were not measured at all. The disclosed method is applicable in the context of in-vitro fertilization, where a 20 very small number of blastomeres are available for genotyping from each embryo being considered for implantation. The disclosed method is equally applicable to the context of Non-Invasive Prenatal Diagnosis (NIPD) where only a small number of fetal cells, or fragments of fetal DNA, have been isolated from the mother's blood. The disclosed method is equally applicable in the case of amniocentesis, and other methods where fetal 25 blood is sampled directly. The disclosed method is more generally applicable in any case where a limited amount of genetic data is available from the target individual, and additional genetic data is available from individuals who are genetically related to the target. In one aspect of the invention, the fetal or embryonic genomic data which has 30 been reconstructed can be used to detect if the cell is aneuploid, that is, if fewer or more than two of a particular chromosome is present in a cell. A common example of this condition is trisomy-21, which gives rise to Down syndrome. The reconstructed data can also be used to detect for uniparental disomy, a condition in which two of a given 17 chromosome are present, and both of which originate from one parent. This is done by creating a set of hypotheses about the potential states of the DNA, and testing to see which one has the highest probability of being true given the measured data. Note that the use of high throughput genotyping data for screening aneuploidy enables a single 5 blastomere from each embryo to be used both to measure multiple disease-linked loci as well as screen for chromosomal abnormalities. In another aspect of the invention, the direct measurements of the amount of genetic material, amplified or unamplified, present at a plurality of loci, can be used to detect for aneuploides, or uniparental disomy. The idea behind this method is simply that 10 the amount of genetic material present during amplification is proportional to the amount of genetic information in the initial sample, and measuring these levels at multiple loci will give a statistically significant result. This method of screening for chromosomal abnormalities can be used in conjunction with the related method described herein for cleaning genetic data. 15 In another aspect of the invention, the disclosed method can clean genetic material of the individual which has been contaminated by foreign DNA or RNA by identifying the data generated by extraneous genetic materials. The spurious signals generated by the contaminating DNA can be recognized in a manner similar to that way that chromosome wide anomalous signals generated by aneuploides can be detected. 20 In another aspect of the invention, target cells are isolated, the genetic data contained in those cells are amplified, and measurements of multiple SNPs are made using a combination of one or more of the following techniques: PCR-based amplification techniques, PCR-based measurement techniques, or detection techniques based on Molecular Inversion Probes, or microarrays such as the GeneChip or TaqMan systems. 25 This genetic data is then used in the system described herein. In another aspect of the invention, the genetic data of an individual can be cleaned using diploid and haploid data from both parents. Alternately, haploid data from a parent can be simulated if diploid and haploid data of the parent's parent can be measured. In another aspect, genetic data from any person of a known genetic relation to the individual 30 can be used to clean the data of the individual, including parents, siblings, grandparents, offspring, cousins, uncles, aunts etc. In another aspect of the invention, the target and/or related individual's genetic data may be partly or wholly known in silico, obviating the need for some direct 18 measurements. Portions of the genetic data can be generated in silico by means of an informatics approach utilizing a hidden Markov model. In one aspect of the invention it is possible to estimate the confidence one has in the determination of those SNPs. 5 Note that the techniques described herein are relevant both to measurements of genetic material in one, or a small number of cells, as well as to measurements on smaller amounts of DNA such as that which can be isolated from the mother's blood in the context of Non-invasive Prenatal Diagnosis (NIPD). Also note that this method can be equally applied to genomic data in silico, i.e. not directly measured from genetic material. 10 In one aspect of the invention, a technique for creating models based on contingency tables that can be constructed from data that is available through publications such as through the OMIM (Online Mendelian Inheritance in Man) database and using data that is available through the HapMap project and other aspects of the human genome project is provided. Certain embodiments of this technique use emerging public data 15 about the association between genes and about association between genes and diseases in order to improve the predictive accuracy of models. In yet another aspect, a technique by which the best model can be found for the data that is available for a particular patient is disclosed. In this aspect, many different combinations of variables may be examined, together with many different modeling 20 techniques, and that combination may be chosen which will produce the best prediction for an individual subject based on cross-validation with testing data from other subjects. In some cases the models that may produce the best at making accurate predictions of phenotypic outcomes or phenotype susceptibilities for an individual are trained using convex optimization techniques to perform continuous subset selection of 25 predictors so that one is guaranteed to find the globally optimal parameters for a particular set of data. This feature is particularly advantageous when the model may be complex and may contain many potential predictors such as genetic mutations or gene expression levels. Furthermore, in some examples convex optimization techniques may be used to make the models sparse so that they explain the data in a simple way. This 30 feature enables the trained models to generalize accurately even when the number of potential predictors in the model is large compared to the number of measured outcomes in the training data. Similar techniques have been published in an academic journal (Rabinowitz, M., et al., 2006, "Accurate prediction of HIV-1 ding response from the 19 reverse transcriptase and protease amino acid sequences using sparse models created by convex optimization." Bioinformatics 22(5): 541-9.). Note that the information from this paper has been included in this document for background and context. While certain illustrative embodiments disclosed herein focus on genetic data 5 from human subjects, and provide specific embodiments for people suffering from cancer or HIV, or for people who seek to understand their susceptibility to diseases such as Alzheimer's or Myocardial Infarction, it should be noted that the methods disclosed apply to the genetic data of a range of organisms, in a range of numerous, different contexts. The techniques described herein for phenotypic prediction and drug response prediction 10 may be relevant in the context of the treatment of a variety of cancers, genetic illnesses, bacterial, fungal or viral infections, as well as in making phenotypic predictions for individuals to enhance clinical and lifestyle decisions. Furthermore, the system can be used to determine the likelihood of particular phenotypic outcomes given genetic data, specifically SNP (single nucleotide polymorphism) data of an embryo (pre-implantation) 15 in the context of IVF, or of a fetus in the context of non-invasive or invasive prenatal diagnosis including amniocentesis. In one embodiment, the predictive models may be applied to genetic data for a particular individual that has been stored in a standardized computable format. The individual may describe particular issues that are relevant to them, or the system may 20 automatically determine which phenotypic susceptibilities are relevant to that individual. As new research data becomes available on disease-gene associations, treatments, or lifestyle habits, the individual can be notified of the impact of this information on their decisions and habits, based on predictive models developed from the aggregated genomic and clinical data. Alternately, the system can use new research data to detect hitherto 25 unsuspected risks to the individual and that individual can be notified of the impact of this information. In another embodiment, enhanced reports can be generated for clinicians using outcome prediction models trained on data integrated from databases of genetic data, phenotypic data, and clinical records including relevant diagnostic tests. This system may 30 provide for the creation of enhanced reports for individuals with diseases and/or disease predispositions, including but not limited to HIV, cancer, Alzheimers and heart diseases. These enhanced reports will indicate to a treating physician which disease-management or preventative treatments may be more or less suitable for a given individual. The report 20 will include predictions and confidence bounds for key outcomes for that individual using models trained on aggregated subject data. According to another embodiment, a system and method where data about a specific individual is used to make predictions about said individual using models based on 5 contingency tables and built from information available in the public domain, where said data is taken from a group consisting of said individual's genetic data, said individual's phenotypic data, said individual's clinical data, and combinations thereof, and where said predictions concern topics taken from a group comprising said individual's phenotypes, phenotype susceptibilities, and possible clinical outcomes, and where said information is 10 taken from a group comprising information about genotype-phenotype associations, information about the frequency of certain genetic alleles, information about the frequency of certain associations among genetic alleles, information about the probability of one or more states of certain phenotypes given certain combinations of genetic alleles, information about the probability of a certain* combinations of genetic alleles given the 15 state of a certain phenotypes, and combinations thereof is disclosed. According to yet another embodiments, a system and method whereby data about a specific individual can be used to make predictions about said individual using a variety of mathematical models trained on aggregated data in a way that the model which shows the best accuracy can be utilized, where said individual's data is taken from a group 20 consisting of said individual's genetic data, said individual's phenotypic data, and said individual's clinical data, and where said predictions concern topics taken from a group comprising said individual's phenotypes, phenotype susceptibilities, possible clinical outcomes, and combinations thereof is provided. In certain embodiments, the method may examine many or all of the different independent variable and dependant variable 25 combinations in a given set of data, using multiple models and multiple tuning parameters, and then selects that combination of independent variables and dependant variables, that model and those tuning parameters that achieved the highest correlation coefficient with the test data for the purpose of making the best phenotypic predictions. According to another embodiment, any of the methods disclosed herein may use 30 predictions to generate reports for a specific individual concerning one or more topics that are relevant to said individual, where said topics are taken from a group comprising lifestyle decisions, dietary habits, hormonal supplements, possible treatment regimens for a disease, possible treatment regimens for a pathogen, drug interventions, and 21 combinations thereof, and where said prediction is based on data concerning said individual's genetic makeup, said individual's phenotypic characteristics, said individual's clinical history, and combinations thereof. According to other embodiments, any of the methods disclosed herein may use 5 predictions to generate reports for an agent of a specific individual, such as a physician or clinician, and where said predictions could aid said agent by providing information relevant to said individual, and where the subject of said information is taken from a group of topics comprising lifestyle decisions, dietary habits, hormonal supplements, possible treatment regimens for a disease, possible treatment regimens for a pathogen, 10 drug interventions, other therapeutic interventions, and combinations thereof, and where said prediction is based on data concerning said individual's genetic makeup, said individual's phenotypic characteristics, said individual's clinical history, and combinations thereof. According to another embodiment, any of the methods disclosed herein may use 15 predictions to benefit a specific individual afflicted with cancer, and where said predictions could aid clinicians by providing information relevant to that individual and or to the specific cancer of said individual, and where the subject of said information is taken from a group of topics comprising treatment regimens, lifestyle decisions, and dietary habits, drug interventions, other therapeutic interventions, and combinations 20 thereof, and where said prediction is based on data concerning said individual's genetic makeup, said individual's phenotypic characteristics, said individual's clinical history, and combinations thereof. According to one embodiment, any of the methods disclosed herein may be used to benefit a specific individual afflicted with a pathogen, and where said predictions could 25 aid clinicians by providing information relevant to that individual and or to the specific pathogen infecting said individual, where said pathogen is of a class taken from a group consisting of bacteria, virus, microbe, amoeba, fungus and other parasites, and where the subject of said information is taken from a group of topics comprising treatment regimens, lifestyle decisions, and dietary habits drug interventions, other therapeutic 30 interventions, and combinations thereof, and where said prediction is based on data concerning said individual's genetic makeup, said individual's phenotypic characteristics, said individual's clinical history, and combinations thereof. 22 According to another embodiment, any of the methods disclosed herein may use predictions regarding a specific individual, new knowledge and data as that knowledge and data becomes available, and which could be used to generate informational reports, automatically or on-demand, regarding topics that are relevant to said individual, where 5 the topics are taken from a group comprising lifestyle decisions, dietary habits, hormonal supplements, possible treatment regimens for a disease, possible treatment regimens for a pathogen, drug interventions, other therapeutic interventions, and combinations thereof, and where the new knowledge and data are medical in nature, and where the prediction is based on data concerning said individual's genetic makeup, said individual's phenotypic 10 characteristics, said individual's clinical history, and combinations thereof. According to another embodiment, any of the methods disclosed herein may use predictions using genetic data from a specific embryo and said predictions can be used to aid in selection of embryos in the context of IVF based on predicted susceptibility to certain phenotypes of said embryo. 15 According to one embodiment, any of the methods disclosed herein may use predictions using genetic data from a specific fetus, and said predictions can be used to estimate particular phonotypic outcomes for the potential progeny, such as life expectancy, the probability of psoriasis, or the probability of a particular level of mathematical ability. 20 Definitions of the specific embodiments of the invention as claimed herein follow. According to a first embodiment of the invention, there is provided a method for detecting the presence or absence of a chromosomal abnormality in a target individual, the method comprising: (a) measuring the amount of genetic material at multiple loci on a 25 chromosome or chromosome segment of interest in a sample comprising DNA from the target individual; (b) comparing the amount from step (a) to either (i) a threshold value or (ii) an expected amount for a particular copy number hypothesis; and (c) identifying the presence or absence of a chromosomal abnormality 30 based on the comparison.
According to a second embodiment of the invention, there is provided a method for determining the number of copies of a chromosome or chromosome segment of interest in the genome of a target individual, the method comprising: (a) measuring the amount of genetic material at multiple loci on a 5 chromosome or chromosome segment of interest in a sample comprising DNA from the target individual; (b) creating a set of one or more hypotheses about the number of copies of the chromosome or chromosome segment of interest in the genome of the target individual; 10 (c) determining on a computer the probability of each of the hypotheses being true or false given the measurements from step (a); and (d) determining the number of copies of the chromosome or chromosome segment of interest in the genome of the target individual using the probabilities associated with each hypothesis. 15 It will be recognized by the person of ordinary skill in the art, given the benefit of this disclosure, that other aspects, features and embodiments may implement one or more of the methods and systems disclosed herein. BRIEF DESCRIPTION OF THE DRAWINGS 20 Figure 1: an illustration of the concept of recombination in meiosis for gamete formation. Figure 2: an illustration of the variable rates of recombination along one region of Human Chromosome 1. Figure 3: determining probability of false negatives and false positives for different hypotheses. 25 Figure 4: the results from a mixed female sample, all loci hetero. Figure 5: the results from a mixed male sample, all loci hetero. [Text continues on page 24.] Figure 6: Ct measurements for male sample differenced from Ct measurements for female sample. Figure 7: the results from a mixed female sample; Taqman single dye. Figure 8: the results from a mixed male; Taqman single dye. 5 Figure 9: the distribution of repeated measurements for mixed male sample. Figure 10: the results from a mixed female sample; qPCR measures. Figure 11: the results from a mixed male sample; qPCR measures. Figure 12: Ct measurements for male sample differenced from Ct measurements for female sample. 10 Figure 13: detecting aneuploidy with a third dissimilar chromosome.. Figure 14: an illustration of two amplification distributions with constant allele dropout rate. Figure 15: a graph of the Gaussian probability density function of alpha. Figure 16: the general relationship diagram of the input data, the database data, the 15 algorithm and the output. Figure 17: a visual overview of how to derive P(HIM). Figure 18: a visual representation of the flow chart describing the algorithm used to demonstrate the effectiveness of the cleaning algorithm on simulated data. Figure 19: an illustration of a system that is configured to accomplish the method 20 disclosed herein, in the context of phenotype prediction of embryos during IVF. Figure 20: an illustration of the LASSO tendency to produce sparse solutions. The Ridge regression solution lies at the meeting of the two circles, and the LASSO solution lies at the meeting of the circle and square. Figure 21: a table of the correlation coefficients (R in %) of measured and predicted 25 response, averaged over ten different 9:1 splits of training and testing data, and then averaged over seven PIs or ten RTIs respectively. Figure 22: graphic representation of the value of LASSO model parameters associated with mutations in the protease enzyme for predicting PI response. Only 40 parameters with the largest absolute magnitudes are shown. 30 Figure 23: graphic representation of the value of LASSO model parameters associated with mutations in the RT enzyme for predicting NRTI drug response. Only the 40 parameters with the largest absolute magnitudes are shown. 24 Figure 24: graphic representation the value of LASSO model parameters associated with mutations in the RT enzyme for predicting NNRTI drug response. Only the 40 parameters with the largest absolute magnitudes are shown. Table 1: a summary of disease genes as found in OMIM/NCBI. 5 Table 2: a summary of different aneuploidy detection techniques Table 3: an example of input data for the method described using SNPs with a low degree of cosegregation. Table 4: an example of input data for the method described using SNPs with a high degree of cosegregation. 10 Table 5: an example of the output data for the input data shown in Table 2. Table 6: an example of the output data for the input data shown in Table 4. Table 7: the results of the preliminary simulation. Table 8: the results of the full simulation of the method. Table 9: three contingency tables representing the results of Farrer (2005), Labert (1998), 15 and Alvarez (1999) for understanding the role of mutations in APOE and ACE in affecting the onset of Alzheimers. Table 10: results generated from meta-analysis of the studies of Table 7. Table 11: a table of correlation coefficients (R in %) of measured and predicted response to Protease Inhibitor (PI) drugs for various methods, averaged over ten different 20 9:1 splits of training and testing data. The standard deviation (Std. dev.) of the results is shown in gray; the number of measured drug responses is shown in the last row. Table 12: a table of correlation coefficients (R in %) of measured and predicted response to Reverse Transcriptase Inhibitor (RTI) drugs for various methods, averaged over 25 ten different 9:1 splits of training and testing data. The standard deviation (Std. dev.) of the results is shown in gray; the number of measured drug responses is shown in the last row. Table 13: the number of samples, and total number of mutations used for training for various regression methods, together with the number of mutations with non-zero 30 weights selected by the Least Absolute Selection and Shrinkage Operator (LASSO) as predictors for Protease Inhibitor (PI) drug response. Table 14: the number of samples, and total number of mutations used for training with various methods, together with the number of mutations with non-zero weights 25 selected by LASSO as predictors for Reverse Transcriptase Inhibitor (RTI) response. Table 15: phenotypic data for the irinotecan study. 5 DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT Conceptual Overview of the System One goal of the disclosed system is to provide highly accurate genomic data for 10 the purpose of genetic diagnoses. In cases where the genetic data of an individual contains a significant amount of noise, or errors, the disclosed system makes use of the similarities between genetic data of related individuals, and the information contained in that secondary genetic data, to clean the noise in the target genome. This is done by determining which segments of chromosomes were involved in gamete formation and 15 where crossovers occurred during meiosis, and therefore which segments of the secondary genomes are expected to be nearly identical to sections of the target genome. In certain situations this method can be used to clean noisy base pair measurements, but it also can be used to infer the identity of individual base pairs or whole regions of DNA that were not measured. In addition, a confidence can be computed for each 20 reconstruction call made. A highly simplified explanation is presented first, making unrealistic assumptions in order to illustrate the concept of the invention. A detailed statistical approach that can be applied to the technology of today is presented afterward. Another goal of the system is to detect abnormal numbers of chromosomes, sections of chromosomes, and origins of chromosomes. In genetic samples that are 25 aneuploid, have unbalanced translocations, uniparental disomy, or other gross chromosomal abnormalities, the amount of genetic material present at a plurality of loci can be used to determine the chromosomal state of the sample. There are multiple approaches to this method, and several of them are described here. In some approaches, the amount of genetic material present in a sample is sufficient to directly detect 30 aneuploides. In other approaches, the method for cleaning the genetic material can be used to enhance the efficiency of detection of chromosomal imbalances. A confidence can be computed for each chromosomal call made. 26 Another goal of the system is to provide an effective and efficient means of extracting the most simple and intelligible statistical models from from genetic data by exploring a wide array of terms that are designed to model the effect of variables related to genetic data. More specifically, most or all of the currently available methods for 5 modeling phenotype or phenotype susceptibility based on genetic data have the following drawbacks: (i) they do not use convex optimization techniques and thus are not guaranteed to find the local minimum solution for the model parameters for a given training data set; (ii) they do not use techniques that minimize the complexity of the model and thus they do not build models that generalize well when there are a small 10 number of outcomes relative to the number of independent variables; (iii) they do not enable the extraction of the most simple intelligible rules from the data in the context of logistic regression without making the simplifying assumption of normally distributed data; (iv) they do not leverage a-priori information about gene-gene associations, gene phenotype associations, and gene-disease associations in order to make the best possible 15 prediction of phenotype or phenotype susceptibility; (v) they do not provide more than one model, and thus do not provide a general approach for selecting the best possible data based on cross-validating various models against training data. These shortcomings are critical in the context of predicting outcomes based on the analysis of vast amounts of data classes relating to genetic and phenotypic information. In summary the currently 20 available methods do not effectively empower individuals to ask questions about the likelihood of particular phenotypic features given genotype, or about the likelihood of particular phenotypic features in an offspring given the genotypic features of the parents. Note that some of the explanation given below includes work that has been previously published by authors of this document. It is provided as background 25 information to facilitate understanding of and to give a greater context to the material disclosed herein. One may consider genotype-phenotype predictive models in three categories: i) genetic defects or alleles are known to cause the disease phenotype with 100% certainty; ii) genetic defects and alleles that increase the probability of disease phenotype, where the 30 number of predictors is small enough that the phenotype probability can be modeled with a contingency table; and iii) complex combinations of genetic markers that can be used to predict phenotype using multidimensional linear or nonlinear regression models. Of the 359 genes (See Table 1, row 2) with currently known sequences and disease phenotypes *27 in the Online Mendelian Inheritance Database (OMIM), the majority fall into category (i); and the remainder fall predominantly into category (ii). However, over time, it is expected that multiple genotype-phenotype models will arise in category (iii), where the interaction of multiple alleles or mutations will need to be modeled in order to estimate the 5 probability of a particular phenotype. For example, scenario (iii) is certainly the case today in the context of predicting the response of HIV viruses to anti-retroviral therapy based on the genetic data of the HIV virus. For scenario (i), it is usually straightforward to predict the occurrence of the phenotype based on expert rules. In one aspect, statistical techniques are described that 10 can be used to make accurate predictions of phenotype for scenarios (ii). In another aspect, statistical techniques are described that can be used to make accurate predictions for scenario (iii). In another aspect, methods are described by which the best model can be selected for a particular phenotype, a particular set of aggregated data, and a particular individual's data. 15 Certain embodiments of the methods disclosed herein implement contingency tables to accurately make predictions in scenario (ii). These techniques leverage a-priori information about gene-gene associations and gene-disease associations in order to improve the prediction of phenotype or phenotype susceptibility. These techniques enable one to leverage data from previous studies in which not all the relevant independent 20 variables were sampled. Instead of discarding these previous results because they have missing data, the technique leverages data from the HapMap project and elsewhere to make use of the previous studies in which only a subset of the relevant independent variables were measured. In this way, a predictive model can be trained based on all the aggregated data, rather than simply that aggregated data from subjects for which all the 25 relevant independent variables were measured. Certain methods described herein use convex optimization to create sparse models that can be used to make accurate predictions in scenario (iii). Genotype-phenotype modeling problems are often overcomplete, or ill-posed, since the number of potential predictors - genes, proteins, mutations and their interactions - is large relative to the 30 number of measured outcomes. Such data sets can still be used to train sparse parameter models that generalize accurately, by exerting a principle similar to Occam's Razor: When many possible theories can explain the observations, the most simple is most likely to be correct. This philosophy is embodied in one aspect relating to building genotype 28 phenotype models in scenario (iii) discussed above. The techniques described here for application to genetic data involve generating sparse parameter models for underdetermined or ill-conditioned genotype-phenotype data sets. The selection of a sparse parameter set exerts a principle similar to Occam's Razor and consequently 5 enables accurate models to be developed even when the number of potential predictors is large relative to the number of measured outcomes. In addition, certain embodiments of the techniques described here for building genotype-phenotype models in scenario (iii) use convex optimization techniques which are guaranteed to find the global minimum solution for the model parameters for a given training data set. 10 Given a set of aggregated data, and a set of available data for an individual, it is seldom clear which prediction approach is most appropriate for making the best phenotypic prediction for that individual. In addition to describing a set of methods that tend to make accurate phenotypic predictions, embodiments disclosed herein present a system that tests multiple methods and selects the optimal method for a given phenotypic 15 prediction, a given set of aggregated data, and a given set of available data for the individual for whom the prediction is to be made. The disclosed methods and systems examine all the different independent variable and dependant variable combinations in a given set of data using multiple models and multiple tuning parameters, and then selects that combination of independent variables, dependant variables, and those tuning 20 parameters that achieve the best modeling accuracy as measured with test data. In cases corresponding to scenario (i) expert rules may be drawn; in other cases with few independent variables, such as in category (ii), contingency tables will provide the best phenotype prediction; and in other cases such as scenario (iii) linear or non-linear regression techniques may be used to provide the optimal method of prediction. Note that 25 it will be clear to one skilled in the art, after reading this disclosure, how the approach to selecting the best model to make a prediction for an individual may be used to select from amongst many modeling techniques beyond those disclosed here. Certain embodiments of the technology are demonstrated in several contexts. First, it is demonstrated in the context of predicting the likelihood of developing 30 Alzheimer's disease using contingency tables and an incomplete set of data integrated from many clinical studies focusing on predicting Alzheimer's disease based on genetic markers. Next, the system is demonstrated in the context of modeling the drug response of Type-l Human Immunodeficiency Virus (HIV-1) using regression analysis and the 29 knowledge of genetic markers in the viral genome. Finally the system is demonstrated in the context of predicting the side-effects caused by the usage of tamoxifen and irinotecan in the treatment of various cases of breast and colon cancer, respectively, using regression analysis and the incomplete data of both genetic markers on the individuals and also 5 laboratory and clinical subject information relevant to the cancer. Due to the decreasing expense of genotypic testing, statistical models that reliably predicts viral drug response, cancer drug response, and other phenotypic responses or outcomes from genetic data are important tools in the selection of appropriate courses of action whether they be disease treatments, lifestyle or habit 10 decisions, or other actions. The optimization techniques described will have application to many genotype-phenotype modeling problems for the purpose of enhancing clinical decisions. Technical Description of the System 15 Cleaning Data: A Simplified Example Figure 1 illustrates the process of recombination that occurs during meiosis for the formation of gametes in a parent. The chromosome 101 from the individual's mother is shown in orange (or grey). The chromosome 102 from the individual's father is shown in 20 white. During this interval, known as Diplotene, during Prophase I of Meiosis, a tetrad of four chromatids 103 is visible. Crossing over between non-sister chromatids of a homologous pair occurs at the points known as recombination nodules 104. For the purpose of illustration, the example will focus on a single chromosome, and three Single Nucleotide Polymorphisms (SNPs), which are assumed to characterize the alleles of three 25 genes. For this discussion it is assumed that the SNPs may be measured separately on the maternal and paternal chromosomes. This concept can be applied to many SNPs, many alleles characterized by multiple SNPs, many chromosomes, and to the current genotyping technology where the maternal and paternal chromosomes cannot be individually isolated before genotyping. 30 Attention must be paid to the points of potential crossing over in between the SNPs of interest. The set of alleles of the three maternal genes may be described as (am,, am2, am 3 ) corresponding to SNPs (SNPI, SNP 2 , SNP 3 ). The set of alleles of the three paternal genes may be described as (apl, ap 2 , ap3). Consider the recombination nodules 30 formed in Figure 1, and assume that there is just one recombination for each pair of recombining chromatids. The set of gametes that are formed in this process will have gene alleles: (ami, am2, ap 3 ), (ami, ap 2 , ap3), (api, am2, ap 3 ), (apt, ap 2 , am3). In the case With no crossing over of chromatids, the gametes will have alleles (ami, am2, am3), (api, ap 2 , ap 3 ). In 5 the case with two points of crossing over in the relevant regions, the gametes will have alleles (a.i, ap 2 , am3), (api, am2, ap 3 ). These eight different combinations of alleles will be referred to as the hypothesis set of alleles, for that particular parent. The measurement of the alleles from the embryonic DNA will be noisy. For the purpose of this discussion take a single chromosome from the embryonic DNA, and 10 assume that it came from the parent whose meiosis is illustrated in Figure 1. The measurements of the alleles on this chromosome can be described in terms of a vector of indicator variables: A = [A, A 2
A
3 ]IT where A 1 = 1 if the measured allele in the embryonic chromosome is ami, A 1 = -1 if the measured allele in the embryonic chromosome is ap , and A 1 = 0 if the measured allele is neither ami or api. Based on the hypothesis set of 15 alleles for the assumed parent, a set of eight vectors may be created which correspond to all the possible gametes describe above. For the alleles described above, these vectors would be a, = [1 1 1]T, a 2 =[l 1 -1I, as = [I -1 1Ia a4= [1 -1 -1] , as = [-1 1 ]I, a 6 = [-1 1 -1I, a7 = [-1 -1 1 IT a8 = [-I 1 -]. In this highly simplified application of the system, the likely alleles of the embryo can be determined by perfonning a simple correlation 20 analysis between the hypothesis set and the measured vectors: i*=argma x
A
T
a,, i=l...8 (1) Once i* is found, the hypothesis ai* is selected as the most likely set of alleles in the embryonic DNA. This process is then repeated twice, with two different assumptions, 25 namely that the embryonic chromosome came from the mother or the father. That assumption which yields the largest correlation ATai* would be assumed to be correct. In each case a hypothesis set of alleles is used, based on the measurements of the respective DNA of the mother or the father. Note that in a typical embodiment of the disclosed method, one measures a large number of SNPs between those SNIs that are important 30 due to their association with particular disease phenotypes - these will be referred to these as Phenotype-associated SNPs or PSNPs. The Non-phenotype-associated SNPs (NSNPs) between the PSNPs may be chosen a-priori (for example, for developing a specialized 31 genotyping array) by selecting from the NCBI dbSNP database those RefSNPs that tend to differ substantially between individuals. Alternatively, the NSNPs between the PSNPs may be chosen for a particular pair of parents because they differ between the parents. The use of the additional SNPs between the PSNPs enables one to determine with a 5 higher level of confidence whether crossover occurs between the PSNPs. It is important to note that while different "alleles" are referred to in this notation, this is merely a convenience; the SNPs may not be associated with genes that encode proteins. The System in the Context of Current Technology 10 In another more complex embodiment, the a-posteriori probability of a set of alleles is computed given a particular measurement, taking into account the probability of particular crossovers. In addition, the scenario typical of microarrays and other genotyping technologies is addressed where SNPs are measured for pairs of chromosomes, rather than for a single chromosome at a time. The measurements of the 15 genotype at the locus i for the embryonic, paternal and maternal chromosomes may be characterized respectively by random variables representing the pairs of SNP measurements (ei,, e2,), (p1j, p2,1) and (mi,, m2, 1 ). Since one cannot determine the presence of crossovers in the maternal and paternal chromosomes if all measurements are made as pairs, the method is modified: in addition to genotyping the fertilized embryos 20 and paternal and maternal diploid tissue, one haploid cell from each parent, namely, a sperm cell and an egg cell, is also genotyped. The measured alleles of the sperm cell are represented by pIyi, i=1 ... N and the complementary alleles measured from the paternal diploid tissue by p2,i. Similarly, the measured alleles of the egg cell are represented by mi and their complement in the mother's diploid cell by m2,i. These measurements provide 25 no information on where the parental chromosomes crossed over in generating the measured sperm and egg cells. However, one can assume that the sequence of N alleles on the egg or sperm was created from the parental chromosomes by a small number of, or no, crossovers. This is sufficient information to apply the disclosed algorithm. A certain error probability is associated with calling the paternal and maternal SNPs. The 30 estimation of this error probability will vary based on the measurements made (pjj, p2,j) and (mlj, m 2 ,1) and the signal-to-noise ratio for the technology used. Although these error probabilities can be uniquely computed for each locus without affecting the disclosed 32 method, the algebra is simplified here by assuming that the probabilities of correctly calling the paternal and maternal SNPs are constant at pp and pm respectively. Assume that a measurement is performed on the embryonic DNA which is termed measurement M. In addition, the notation is slightly modified so that A is now a set and 5 not a vector: A refers to a particular hypothesis about the combination (or set) of alleles derived from each parent. The set of all possible combinations of alleles A from both parents is denoted as SA The goal is to determine the combination of alleles (or that hypothesis) A c SA with the highest a-posteriori probability, given the measurement M: A* = argmax 4 P(A I M), VA e S 10 (2) Using the law of conditional probabilities, P(AIM) = P(MIA)P(A)/P(M). Since P(M) is common for all different A's, the optimizing search can be recast as: A* = arg max, PM / A)P(A), VA e S, (3) 15 Now consider the computation of P(M/A). Begin with a single locus i, and let the hypothesis be that this locus on the embryo is derived from the parental SNPs pt,1,i and m,1,i, where the underscore t is used to denote the true value of these Parental SNPs, as opposed to the measurements performed, pi,i and mIj, which may or may not be correct. The true value of the embryonic SNPs is denoted as (e, 1 ,i, et, 2 ,). If hypothesis A is true, 20 then (et,I,i, et2,i) = (pt, i,, mt,,1) or (mt,1,1, pt,1,i)- Since one cannot differentiate which of the measurements (ei,i, e 2 ,j) comes from which parent, both orders must be considered so the hypothesis set A = [(pti,j, mi, 1 ), (mt,1,j, ptj,1)]. The probability of a particular measurement M depends on the true values or the underlying states of the parental SNPs, namely (pr,I,i, Pt,2,i) and (mj,, t,2,). Since there are four SNPs, pt,,i, pt,2,i, m ,1,, Mt2,j, and each of these 25 can assume the value of four nucleotide bases, AC,T,G, there are 44 or 256 possible states. The algorithm is illustrated for one state si for which it is assumed that p,1,Wt2,iftI,iimt,2,. From this explanation, it will be clear how to apply the method to all 256 possible states, sk, k-1...256. Assume a measurement M of embryonic SNPs (e 1 ,i, e2,) is performed, and the result e,i=pIi,, e2,j=m, 1 is obtained. The a priori probability for 30 this measurement given that hypothesis A and state sI are true is computed: 33 P(egj = p, ,eyj =m |1, A-, s) )= P(e, = p I e,, 2 ,, =min,,,, 1,s,)P(e = p, I eI = p= in)P(e = e = m ) + P(eL = in,,, e,, = p, 1-A, s,)Pe = p I e, =n,,, p,, # m )P(el = mi. e = Pt, P2. 1 (4) Consider the first expressions in the first term and second term: P(ei,i=plj,e2,=g|iA,si)= P(el,i=myi,e2,i=pil}A,si)=0.5 since the hypothesis A=[(pglj, mt,,), (mii, ptjj)] makes two 5 orderings for the embryonic SNPs equally likely. Now consider the second expression of the first term, P(eli=p,i =pj), the probability of measuring ey=pg given the assumption that embryonic SNP etj actually is derived from paternal SNP p,1,i. The probabilities for correctly measuring the paternal SNPs, maternal SNPs, and embryonic SNPs are pp, pm, and p.. Given the assumption (et,1j=pt,j), the measurement (eij=pj) 10 requires either that both embryonic and paternal SNPs are correctly measured, or that both are incorrectly measured and they happen to be incorrectly measured as the same nucleotide (A,C,T, or G). So, P(eij-piIjeii=pij) = pepp+(l-pe)(1-pp)/ 3 where it is assumed for simplicity that the probability of incorrectly calling all of the four nucleotides is equally likely - the algorithm can be easily modified to accommodate 15 different probabilities of calling a particular nucleotide (A,C,T,G) given a measurement on another particular nucleotide. The same approach may be applied to the third expression in the first term to'obtain P(e2;=mgi let, 2 ,i=m,5,) = pepm+(l-pe)(l-pm)/ 3 . Now consider the second expression of the second term. P(eij=pjj lel,1,=mj, mtljfpx,) requires either that e 1 ,i or pLj be an incorrect measurement, or that both be incorrect 20 measurements, so that the measured values happen to be equal: P(elf=pgi jet,=mti, mtl, jfpt j,) = pe(l-pp)/ 3 +(1-pe)pp/ 3 +(l-pe)(1-pp)2/9. The same argument can be applied to the last expression of the second term to yield P(e 2 ;=m 1 , etp,2=pt,, mIJIpt,) = pe( pm)/3+(-pe)pm/3+(1-pe)(1-pm)2/9. Now, combining all of these terms, and making the assumption - merely to simplify the algebra - that pe-pp=pm=p, one can compute: 25 P(M(e, = p 1 ,e 2 j = M)|A,s 1 ,)= (160p4 -160p 3 + 96p 2 - 28p + 13) 162 (5) Although the computation will vary, a similar conceptual approach to that described here would be used for all 256 possible states, s, k=1 ...256. Computing P(ejj=pi,b e2;=mjI IA,si) for all 256 states si and summing over the probability of each si one obtains 30 P(e i,=pj, e2,i=mli JA). In other words: 34 P(M | A) X P(M I A,s,)P(s,) 1=1.256 (6) In order to compute the probabilities of each state si, P(si), one must treat all the separate alleles making up a state as separate events since they are on separate chromosomes, in 5 other words: P(si) = P(ptj, p12,I, m, m,i) = P(pI,)P(p2,i)P(mt,)P(mt 2 ,i). Bayesian techniques may be applied to estimate the probability distribution for the individual measurements. Every measurement of an allele on the maternal or paternal chromosomes at locus i may be treated as a coin toss experiment to measure the probability of this allele being a particular value (A,C,T or G). These measurements are made on the adult tissue 10 samples and may be treated as being totally reliable, even though pairs of alleles are measured for each SNP, and it is not possible to determine which allele comes from which chromosome. Let wp,u = P(po,), corresponding to the probability of the SNP i on the father's chromosome being value pt,,i. In the following explanation, w is used instead of wp, 1 ,j. Let the measurements performed on SNP i of the father's chromosome be 15 characterized as collecting data D. One can create a probability distribution for w, p(w) and update this after the data is measurement according to Bayes Theorem: p(w|D)= p(w)p(Djw)/p(D). Assume n alleles of SNP i are observed and that the particular allele corresponding to w comes up h times - in other words, heads is observed h times. The probability of this observation can be characterized by the binomial distribution 20 p(DIw)= w "-' (7) Before data is collected, assume there is a prior distribution p(w) which is uniform between 0 and 1. By applying the Bayes theorem, it is straightforward to show that the resulting distribution for p(wjD) will be a beta distribution of the form: 25 p(w I D)= wh(I -w)" where c= w'(1 - w-hdw C (8) and c is a normalizing constant However many times p(w|D) is then updated by applying Bayes theorem and new measurements, it will continue to have a beta distribution as above. The estimates of p(w) are updated every time a new measurement is collected. 30 Note that there will be a different function p(w) for different races and different genders, 35 using the same groupings used in the Hapmap project, since the probability of different alleles at particular SNPs is dependent on these groupings of race and gender. For the computation of P(si), each allele on each chromosome will be associated with an estimated probability distribution, namely pp,1,i(wp,1,j), Pp,2,i(Wp,2), PmJ,i(Wm,1,i) and 5 pn,2,i(Wm,2J). One may then compute the maximum a-posteriori (MAP) estimate for P(si) according to the MAP estimate for each of the individual distributions. For example, let wp, 1 ,i* be the argument that maximizes pp,1,i(wp,1,i). The MAP estimate of P(si) may be found according to P(si) = W *w * W., 1 * w,,, * 10 (9) Since there is the a probability distribution for each w, one can also compute conservative estimates of the values P(si) to any specified confidence level, by integrating over the probability distribution, rather than simply using the MAP estimates. It is possible to do this, for example, to conservatively estimate P(MIA) to within some confidence level. 15 Whether a conservative estimate or a MAP estimate is used, the estimate of P(si) is continually refined for the computation of P(MIA). In what follows, reference to the assumed state will be eliminated to simplify the notation, and state si is assumed for all explanations of detailed computation. Bear in mind that in actuality these calculations would be performed for each of 256 states and be summed over the probability of each. 20 The method of computing P(MIA) is now extended to multiple SNP loci, assuming that M represents the set of measurements of N pairs of SNPs on the embryo, M = [M 1 ,...,MN]. Assume also that A represents the set of hypotheses for each SNP about which parental chromosomes contributed to that SNP, A = [A 1 ,...,AN]. Let SA. represent the set of all other possible hypotheses that are different from A or are in the set A'. 25 P(MIA) and P(MJA') may be computed: P(M I A)= HP(M, I A,), P(M |A')= IP(A) P(M, I A,) (10) 1=1...N AeSA. i=I.-N Consider the computation of P(A). In essence, this is based on the likelihood of particular crossovers occurring in the formation of the gametes that form the embryo. The probability of a particular allele set depends on two factors, namely the probability that 30 the embryonic chromosome comes from the mother or the father, and the probability of a particular combination of crossovers. For a normal set of embryonic chromosomes that do not suffer from aneuploidy, the a-priori probability that the embryonic chromosome 36 comes from the mother or father is -50% and is consequently common for all A. Now, consider the probability of a particular set of recombination nodes. The number of relevant recombination sites R depends on the number of measured SNPS: R=N-1. Since the DNA segment constituting N NSNPs around the PSNP of interest will be relatively 5 short, crossover interference makes it highly improbable that two crossovers on the same chromosome can occur in one region. For reasons of computational efficiency this method assumes that only one crossover will occur in each region for each relevant chromosome, and this can occur at R possible sites. It will be obvious to someone skilled in the art how this method may be extended to include the possibility where there are 10 multiple crossovers in a given region. Let the probability of a crossover in each region between SNPs be denoted Pr, r 1.. .N-1. To first order, the probability of a recombination node in a region r between two SNPs is proportional to the genetic distance between those SNPs (measured in cMorgans). However, much recent research has enabled a precise modeling of the 15 probability of recombination between two SNP loci. Observations from sperm studies and pattems of genetic variation show that recombination rates vary extensively over kilobase scales and that much recombination occurs in recombination hotspots, and causes linkage disequilibrium to display a block-like structure. The NCBI data about recombination rates on the Human Genome is publicly available through the UCSC Genome Annotation 20 Database. Various data sets can be used singly or in combination. Two of the most common data sets are from the Hapmap Project and from the Perlegen Human Haplotype Project. The latter is higher density; the former is higher quality. See Figure 2 for the regional recombination rates from positions 1,038,423 to 4,467,775 of chromosome 1, based on 25 the HapMap Phase I data, release 16a. These rates were estimated using the reversible jump Markov Chain Monte Carlo (MCMC) method which is available in the package LDHat. The state-space considered is the distribution of piece-wise constant recombination rate maps. The Markov chain explores the distribution of the number and :location of rate change-points, in addition to the rates for each segment, 201. These 30 results may be used to generate an estimate of Pr by integrating over the recombination rates times by the length of each constant segment between the SNPS. The cumulative recombination rate over the nucleotides 202 is shown in Figure 2 in red. 37 Let C be a set of indicator variables c, such that cr=l if a crossover occurred in region r and 0 otherwise. co=l if no crossovers occurred and 0 otherwise. Since it is assumed that only one crossover can occur in a region of N SNPs, only one element of the set C is non-zero. Hence, the probability of crossover represented by set C is found to be: 5 P =f (11) P ( r-I.-N-1 r-1 In the hypothesis A about SNPs 1.. .N, there are four potential crossovers of relevance. Namely, the potential crossovers in i) the paternal chromosomes that formed the embryo (denoted by set Cpe of indicator variables), ii) the paternal chromosomes that formed the sequenced sperm (set CPS), iii) the maternal chromosomes that formed the embryo (set 10 Cme) and iv) the maternal chromosomes that formed the sequenced egg (set Cee). Two additional assumptions are v) whether the first paternal embryonic SNP comes from proj or Pt,2,1 and vi) whether the first maternal embryonic SNP comes from mtii or m 2
,
1 . Since the probabilities of crossovers between SNPs is found to differ between races and sexes, different crossover probabilities will be denoted as PP,r for the paternal 15 chromosomes, and Pm,r for the maternal chromosomes. Therefore, the probability of a particular hypothesis A, which subsumes the sets Cpe, CPS, Cme, Cee is expressed as: P(A) 1- P P 1- HP - P--('_ Zp~ lp; 4 r,-jv") 4- ' r=4..A N-i \ -1 ,= 4 4 -4 -.. j- (12) Now with the equations for determining P(A) and P(M/A), all the elements 20 necessary to compute A* per Equation 3 above have been defined. Hence, it is possible to determine from the highly error-prone measurements of the embryonic SNPs where crossovers occurred, and to consequently clean the embryonic measurements with a high degree of confidence. It remains to determine the degree of confidence in the best hypothesis A*. To determine this, it is necessary to find the odds ratio 25 P(A*IM)/P(A*'IM). The tools have all been described above for this computation: P(A* \ M) P(A* I M) _ P(A*)P(M I A*) - P(A*)P(M I A*) (13 P(A*'\ M) 1 - P(A* \M) P(A*')P(M I A*') (I - P(A*))P(M I A*') =R (13) The confidence in A* is then given as P(A*JM) = ORA/(1+OR A*). This computation indicates the confidence in a particular hypothesis A*, but it does not indicate a 30 confidence in a particular determination of a SNP. In order to compute the confidence in a 38 determination of embryonic PSNP n, it is necessary to create the set of all hypotheses A that don't change the value of this SNP. This set will be denoted as SA.,n, which corresponds to all hypothesis that result in PSNP n on the embryo having the same value as is predicted by hypothesis A*. Similarly, create a set SA.,*n which corresponds to all 5 hypothesis that result in PSNP n having a different value to that predicted by hypothesis A*. Now, it is possible to compute the odds ratio of the probability that the SNP is correctly called versus the probability that the SNP is incorrectly called: IZP(A|M) ZP(AIM) ZP(A)P(M IA) OR, = A' = _,,'_ = _' (14) " P(AM) 1- ZP(A|M) ZP(A)P(M IA) A SA, AESA., AA. The confidence in the particular call of embryonic SNP n based on the odds ratio ORA.,n 10 can be computed as: P(correctly called SNP n) = ZP(A - ORAM" (15) AeS, + OR,, Note that this technique could also be used to detect such defects as uniparental disomy (UPD) wherein two of the same chromosomes are from the same parent, while 15 none of that chromosomes from the other parent is present. Upon attempting to deduce the crossovers in the parent chromosomes, there will be no hypothesis which adequately explains the data with a high confidence, and if alternate hypotheses are allowed that include the possibility of UPD, they will found to be more likely. 20 Bounding the Effect of Uncertainty in Recombination Rates.and SNP Measurement Reliability The disclosed method depends on: assumptions about the probability of recombination between particular SNPs; assumptions about the probability of the correct measurement of each SNP on the embryonic, sperm, egg, paternal and maternal 25 chromosomes; and assumptions about the likelihood of certain alleles within different population groups. Consider each of these assumptions: the mechanism of recombination is not perfectly understood and modeled, and the crossover probability has been established to vary based on an individual's genotype. Furthermore, the techniques by which the recombination rates are measured show substantial variability. For example, 39 the package LDHat, which implements the reversible-jump Markov Chain Monte Carlo (MCMC) method, makes a set of assumptions and requires a set of user inputs about the mechanism and characterization of recombination. These assumptions can affect predicted recombination rates between SNPs as is evinced by the different results 5 obtained by various studies. It is anticipated that the assumptions about recombination rates, out of all assumptions listed above, will have the most impact on Equation 15. The computations described above should be based on the best estimates of the probability for crossover between SNPS, Pr. Thereafter, conservative estimates can be used for Pr using values at, 10 for example, the 95% confidence bounds for the recombination rates, in the direction that reduces the confidence measure P(correctly called SNP n). The 95% confidence bounds can be derived from confidence data produced by various studies of recombination rates, and this can be corroborated by looking at the level of discordance between published data from different groups using different methods. 15 Similarly, the 95% confidence bounds can be used for the estimates of the probability that each SNP is correctly called: pp, Pm, Pe. These numbers can be computed based on the actual measured array intensities included in the genotyping assay output files, combined with empirical data on the reliability of the measurement technique. Note that those NSNPs for which these parameters pp, pm and pe are, not well established may 20 be ignored. For example, since the diploid parental data is reliably measured, one may ignore NSNP measurements of the parents' haploid cells and on the embryo that do not correspond to any of the alleles on the relevant SNPs of the parent's diploid tissue. Lastly, consider the assumptions about the likelihood of certain alleles within different population groups, which give rise to the computation P(si). These assumptions 25 also will not have a large impact on the disclosed method since the measurement of the parental diploid data is reliable i.e. direct measurement of the state si from the parental samples typically result in data with high confidence. Nonetheless, it is possible to use the probability distribution for each w as described in Equation 8 in order to compute a confidence bound for the probability of each state P(si). As above, one can compute the 30 95% confidence bound for each P(si) in the conservative direction that reduces confidence measure P(correctly called SNP n). 40 The determination of P(correctly called SNP n) will inform the decision about how many NSNPs need to be measured around each PSNP in order to achieve the desired level of confidence. Note that there are different approaches to implementing the concept of the 5 disclosed method, namely combining the measurement of the parent's DNA, the measurement of the DNA of one or more embryos, and the a-priori knowledge of the process of meiosis, in order to obtain a better estimate of the embryonic SNPs. It will be clear to one skilled in the art, how similar methods can be applied when different subsets of the a-priori knowledge are known or not known, or known to a greater or lesser degree 10 of certainty. For example, one can use the measurements of multiple embryos to improve the certainty with which one can call the SNPs of a particular embryo or to accommodate missing data from the parents. Note also that one does not need a PSNP of interest to be measured by the measurement technique. Even if that PSNPs is not determined by the measurement system, it can still be reconstructed with a high degree of confidence by the 15 disclosed method. Also note that once the points of crossover that occurred during meiosis have been determined, and the regions of the target genome have been mapped to the pertinent regions of the parental DNA, it is possible to infer not only the identity of individual SNPs of interest, but also whole regions of DNA that may be missing in the measured 20 target genome due to allele drop-out or other errors in measurement. It is also possible to measure insertions and deletions in the parental DNA, and use the disclosed method to infer that they exist in the target DNA. Various techniques may be used to improve the computational complexity of the disclosed algorithm described above. For example, one may only or predominantly select 25 those NSNPs that differ between the mother and the father. Another consideration would be to only use NSNPs that are spaced nearby the PSNPs to minimize the chance of crossovers occurring between the NSNPs and PSNPs of interest. One could also use NSNPs that were spaced along the chromosome so as to maximize coverage of multiple PSNPs. Another consideration will be to initially use only a small number of NSNPs to 30 determine roughly where crossovers occurred, and with only a limited degree of certainty. Additional NSNPs can then be used to refine the crossover model and increase the probability of correctly calling the PSNPs. The number of crossover combinations to consider scales roughly as Nc where N is the number of SNPs and C is the maximum 41 number of crossovers. Consequently, for C=4 it is possible to accommodate roughly N=100 for each PSNP while remaining computationally tractable on a Pentium-IV processor. Using the approaches described above and other approaches for increased computational efficiency, N>100, C>4 can be easily accommodated. One such approach 5 will be described below. Note that there are many other approaches to make a call on a PSNP and generate an estimate of the probability that a PSNPs has been correctly determined, based on a particular set of embryonic data, parent data, and algorithm used, without changing the underlying concept. This probability can be used for individual decision-making, and for 10 implementing a reliable service in the context of IVF or NIPGD. Recursive solution to the genetic data cleaning algorithm Another embodiment of the invention involving an algorithm that scales linearly is described here. Given the limited nature of computation power, the length of the 15 computation may be a significant factor in the use of the disclosed method. When running computations, any algorithm that must compute certain values where the number of computations needed rises exponentially with the number of SNPs can become unwieldy. A solution that involves a number of calculations that increase linearly with the number of SNPs will always be preferred from a time standpoint as the number of 20 SNPs gets large. Below this approach is described. A simple approach, which is to consider all possible hypotheses must contend with the running time being an exponential function in number of SNPs. Suppose, as before, that measured data are a collection of measured embryo, father and mother chromosome measurements on k SNPs, i.e. M {M 1 ,...,Mk} where Mi = 25 (eli,e2i,pjp2i,mii,m2i). As before, the hypotheses space is SH = {H ,...,H4}={set of all the hypotheses}, where each hypothesis is of the format Hi = (H,...H1k) where H~i is the "mini" hypothesis for snip i, of the format H's = (p;*,mi*) where p,* e {p,p7) and I, E (m, mI)}. There are 4 different "mini" hypotheses Hi, in particular: Hhl: (e,,,e 2 i)={(pui,m,,) or (m,,,p,,)} 30 H 12: (euezi)={(pj,,m 2 ,) or (m2 1 ,p)} H 3: (e,,,e 2 )={(p2i,m,,) or (m,,,p)} H '4: (e,,,e 2 )={(p 2 ,m 2 ,) or (m2,pzi)} 42 The goal is to choose the most likely hypothesis H* as: H*= arg maxH.S, P(H I M) = arg max,,sN F(M, H) where function F(M,H)=P(H|M) There are 4 k different hypotheses in the space SH. By trying to find the best hypothesis by exhaustively exploring the entire space SH, the necessary algorithm would 5 be of exponential order in k O(exp(k)), where k is the number of SNPs involved. For large k, even k>5, this is immensely slow and unpractical. Therefore, it is more practical to resort to a recursive solution which solves the problem of size k as a function of the problem of size (k-1) in constant time. The solution shown here is of the linear order in k, O(k). 10 Recursive solution linear in the number of SNPs Begin with F(M,H)=P(HIjM) = P(MIH)*P(H)/P(M). Then argmax H F(M,H) argmax H P(MlH)*P(H) and the goal is to solve P(MIH)*P(H) in linear time. Suppose that ,k)= measurement on SNPs s to k, H(s,k) = hypothesis on SNPs s to k, and to simplify 15 notation M(Ik) = Mk, H(k) = Hk = measurement and hypothesis on SNP k. As shown before: P(M(I) I H ,,i)) = P(MI I H) =P(Mk I HI) * ]7 P(M, |H) =P(M I Hk)* P(M(,k_[) H(ik-) i==1 Also: k k-I P(H(k)) = 1/ 4 * ]PF(H,, H 1 ) = PF(Hk-_, H,) * 1/4* PF(H,- , H,) PF(H,,, Hk) *PH i2i=2 20 where PF(H,_,, H,) =1-PC(H,-, H,) H,= H,
PC(H,
1 , H,) H, 1 H, and PC(Hi-,Hi)= probability of crossover between Hi 1 , Hi Finally, for k SNPs: F(M, H)= P(M I H) * P(H) = P(M,k) H( 1 k)* P(H ) = P(MJ,, I Hz 1 ,,-,) * P(H(I ,_)) * P(Mk | H)*PF(Hk-_, Hk) 25 so in short F(M, H) = F(M(Z.k), H(,,,)))= F(M(,k-1) , H(1) * P(Mk I H,) * PF(Hk-_, H,) i.e. we can reduce the calculation of F on k SNPs to the calculation of F on k-l SNPs. For H = (H 1 ,...Hk),the hypothesis. on k SNPs: maxF(M,H)= max F(M,(H( ,Hk) =max maxF(M,(H(lo),Hk) =maxG(M( 1 ,Hk) H IlI 1 H)14 1k 1 14 where 43 G(MH,) = max F(M(,,) (H,,_), H,)= max F(M,, H,_- * P(M,, I H,) * PF(H,,, H,)= P(M,,IH,)*maxF(M(,, I H(I,_,H )) * PF(H,-, H,)= P(M. I H,) * max max F(M,,(H(,-2>, H,,) * PF(H, , ,) P(M. I H) * maxPF(H,-,, H,)* G(M gH,-,)
H.
To summarize this: max F(M, H) = max G(M(I,,, Hk) H H where G can be found recursively: for n=2,..,k 5 G(M ,,H,) = P(M, I H,) * max[PF(H,-,H, ) * G(Mf(l,) , H,-,)J and G(M(II) , H) =0.25 * P(M H,) The algorithm is as follows: For n= 1: Generate 4 hypotheses Hii, calculate G(MI 1 1Hi) for i=l,...,4. For n 2: Generate 4 hypothesis for H 2 i, calculate G(M(, 2
)IH
2 i) ,i=1,...,4 in constant time 10 using the formula: G(M(I2) , H 2 i)= P(M 2 I H 2 i) * max[PF(H,, H 2 i) * G(M,, Hj)] For n = k: Generate 4 hypothesis for Hki, make G(M(1,k)Hki), i=1,...,4 by G(M(lk), Hki) = P(M\k Hi) * max [PF(Hk-_,Hki) * G(Mekf,1) H,,.j)I At any time there are only 4 hypotheses to remember and a constant number of 15 operations. So the algorithm is linear in k, number of SNPs, as opposed to exponential. Solving P(M) in linear time It is not necessary to solve for P(M) to get the best hypothesis, since it is constant for all H. But in order to get the actual, meaningful number for the conditional probability 20 P(HIM) = P(MIH)*P(H)/P(M), it is also necessary to derive P(M). As above, we can write: P(M) = P(M(,,))= P(Mk I H(Ik)) * P(H,k)) =EP(MK |Hk) Z P(M(I, I)IHo 1 ,k) * P(H(l.) )*PF(Hk_,,Hk) . HA H(L-l = P(MK I Hk) * W(M(,,_) I Hk) H4 where W(MA,,I,_) I H)= P(MI,_k-1) I H(l,,_,) * P(Hgg1 * PF(Hk-_, Hk) We can solve for W(M,H) by recursion: W(M(II IH,)= E P(M,,-.I| H- 1 )) * P(H ,_,)* PF(Hk, H ) = X P(M,_, H,_,) Z P(M(1.-2) I H(I,-2>) * P(H(1,k-2)* PF(HI- -2,HkI) * PF(H;, HI) =>P(Mk-| Hki) * PF(H, H,) * W(M( 1 ,_) I H,) so in short, problem of size k is reduced to the problem of size (k-1) by 5 W(MlIy | Hk) = Z P(MkI I H,) * PF(H,_,, H.) * W(M( 1
,-
2 ) I Hki) Hk and W(M(,) IH 2 )=ZP(M, IH,)* 0.25* PF(Hi, H 2 ) H, As before, for n= 2:k, generate W(2),..., W(K) = W(M(IkI) I Hk) recursively, until finally, it is possible to derive P(M) = EP(M, Hk)* W(MI H. l H,) HA At each level there are only four different hypotheses Hk, so the algorithm is again 10 linear in the number of SNPs Ic Individual SNP confidence in linear time Once we the best hypothesis H* = (H 1 *,...,Hk*), has been computed, it is now may be desired to derive the confidence in the final answer for each SNP, namely 15 P(Hi*|M), for i=1,...,k. As before P(H,*IM) = P(MI Hi*)P(Hi*)/P(M)=W(Hi*,M)/P(M), where P(M) is already known. W(M,H,*)= ZP(MIH)*P(H)= ZP(MIH)*P(H), i.e. hypothesis H has been broken up to the hypothesis on first i-1 SNPs, ith SNP, and hypothesis on the i+1 to kth SNP. As before: k 1-4 A; 20 P(M|k) I HH,)=) 7P(M, | H) =I7P(vf 1 Hj) *P(M I H*)* JP(Mj I H) I1 J=4 =i+I = P(M(,) I H( 1 j 1 ,) * P(Mj {H,*) * P(M(,,,k) I H,,,k)) and 45 P(H(,,k)) =1/4* yPF(H,-,H.) J=2 '-I k =1/4*]7PF(Hb,Hj)*PF(H,_,,H,*)*PF(H,-,,H,*)* JJPF(Hb_,,H ) j=2 j=j+2 =1/4*T(H(._q))*PF(H,_,,H,*)*PF(H,- 1 ,H,*)*T(H(lk)) So P(H(l,))=1/4*T(H(Ik))=114*T(H( 1
,,
1 ))*PF(H, ,H,*)*PF(H, ,H,*)*T(H(Ltk)) where T(H(, ))= PF(Hj.
1 , H 1 ). From this it is possible to show that W(M(,,),H,*)= EP(MIH)*P(H)= Z(MIH)*14*T(H) fIH,=H,* H,H,=U,*
P(MQ
1 ) I H(1I 1)) * P(M,|I H*) * P(M( 1
±
1 | H(,,Ik)) H1tTH,*,Hg)))* PF(H,II,H*)*PF(H, -,H,*)*T(H ) 4*P(M,IH,*)* P(M,,, 1 I'H(,_1-)*1/4*T(H,,,I)*PF(H,,,H,*) * 1 P(M1|H )*1/4*T(H(p))*PF(H,*,H,q) =4*P(M,\ HI*)* (W(M(IJ 1 ),H,-)*PF(H,-,H,*) * W(M+,k), ,H,)*PF(H,*, H, ) Again a case of size k has been reduced to two pieces of smaller size, albeit a bit more complicated than before. Each of the pieces can be calculated as W(MQ,,), H,) = P(M,, I H,)* W(M(I,_,,)I H,, 1 ) * PF(H,,_, H,, W(M,,),H,,,) = P(M,,, I H,,)* (W(M,,,k), H,,) * PF(H.,H,, ) So the algorithm will, for n = 1,..,k, m = k,..1, for each of 4 different H,, Hm calculate 10 W(M,,,,H,,), W(M.,k), H) and then combine them as needed to calculate W(M(,,),H,*), for i=l,...,k. The number of operations is still linear in k. Application of the disclosed method to embryonic data when a smaller or different set of data is available 15 In one embodiment of the system it is only necessary to make use of diploid data from one parent (presumably the mother), with or without haploid data from either or both of the parents, and when that data is known to a greater or lesser degree of certainty. 46 For example it is expected that, given the grueling nature of egg donation, there will be occasions when maternal haploid data is not readily available. It will be clear to one skilled in the art, after reading this description, how the statistical methods for computing the likelihood of a particular SNP can be modified given a limited data set, 5 An alternative approach uses data from more distant relatives to make up for missing diploid or haploid data of one or both parents. For example, since it is known that one set of an individual's chromosomes come from each of his or her parents, diploid data from the maternal grandparents could be used to partially reconstruct missing or poorly measured maternal haploid data. 10 Note the recursive nature of this method: given the naturally noisy measurement of single cell parental haploid data, along with the diploid and/or haploid data of the appropriate grandparents, the disclosed method could be used to clean the parental haploid data, which in turn will provide more accurate genotyping of the embryo. It should be obvious to one skilled in the arts how to modify the method for use in these 15 cases. It is preferable to use more information rather than less, as this can increase the chances of making the right call at a given SNP, and can increase the confidence in those calls. This must be balanced with the increasing complexity of the system as additional techniques and sources of data are used. There are many sources of additional 20 information, as well as techniques available to use the information to augment the data. For example, there are informatics based approaches which take advantage of correlations which can be found in Hapmap data, or other repositories of genomic data. In addition there are biological approaches which can allow for the direct measurement of genetic data that otherwise would need to be recreated in silico. For example, haploid data 25 otherwise unavailable may be measureable by extracting individual chromosomes from diploid cells using flow cytometry techniques to isolate fluorescently tagged chromosomes. Alternately, one may use cell fusion to create monoallelic hybrid cells to effect diploid to haploid conversion. Application of the disclosed method to selecting which embryo is likely to implant 30 In one embodiment, the system can be used to determine the likelihood of an embryo to implant in the mother and develop into a baby. To the extent that the likelihood of the embryo implanting is determined by SNPs of the embryo, and/or their relation to 47 SNPs of the mother, the disclosed method will be important in helping the selection of embryos, based on making a reliable prediction of which will successfully implant based on the clean SNP data. To best predict the likelihood it will be necessary to take into account the determined genotype of the embryo possibly combined with the levels of 5 gene expression in the embryo, the levels of gene expression in the mother, and/or the determined genotype of the mother. In addition, it is well known that aneuploid embryos are less likely to implant, less likely to result in a successful pregnancy, and less likely to result in a healthy child. Consequently, screening for aneuploides is an important facet to selecting the embryo that 10 is most likely to result in a successful outcome. More detail on this approach is given below. Deducing Parental Haploid Data In one embodiment of the method, it may be necessary to deduce parental 15 haplotypes, given detailed knowledge of the diploid data of a parent. There are multiple ways this can be done. In the simplest case, haplotypes have already been inferred by molecular assay of single haploid cells of a direct relation (mother, father, son or daughter). In this case, it is a trivial matter to one skilled in the art to deduce the sister haplotype by subtracting the known haplotype from the diploid genotype measured by 20 molecular assay. For example, if a particular locus is heterozygous, an unknown parental haplotype is the opposite allele from the known parental haplotype. In another case, the noisy haploid data of the parent may be known from molecular biological haplotyping of individual parental haploid cells, such as a sperm cell, or from individual chromosomes, which may be isolated by various methods 25 including magnetic beads and flow cytometry. In this case, the same procedure can be used as above, except that the determined haplotype will be as noisy as the measured haplotype. There are also methods for deducing haploid data sets directly from diploid data, using statistical methods that utilize known haplotype blocks in the general population 30 (such as those created for the public Hapmap project). A haplotype block is essentially a series of correlated alleles that occur repeatedly in a variety of populations. Since these haplotype blocks are often ancient and common, they may be used to predict haplotypes from 48 diploid genotypes. The parents' inferred haplotype blocks can then be used as input for the method described herein to clean the noisy data from the embryos. Publicly available algorithms that would accomplish this task include an imperfect phylogeny approach, Bayesian approaches based on conjugate priors, and priors from population genetics. 5 Some of these algorithms use hidden Markov models. One study used public trio and unrelated individual data to demonstrate that these algorithms perform with error rates as low as 0.05% across 1MB of sequence. However, as expected, accuracy is lower for individuals with rare haplotype blocks. In one estimate, computational methods failed to phase as many as 5.1% of loci with minor allele frequency of 20%. 10 In one embodiment of the invention, genetic data from multiple blastomeres taken from different embryos during an IVF cycle is used to infer the haplotype blocks of the parents with greater reliability. Techniques for Screening Aneuploidy using High and Medium Throughput Genotyping 15 In one embodiment of the system the measured genetic data can be used to detect for the presence of aneuploides and/or mosaicism in an individual. Disclosed herein are several methods of using medium or high-throughput genotyping to detect the number of chromosomes or DNA segment copy number from amplified or unamplified DNA from tissue samples. The goal is to estimate the reliability that can be achieved in detecting 20 certain types of aneuploidy and levels of mosaicism using different quantitative and/or qualitative genotyping platforms such as ABI Taqman, MIPS, or Microarrays from Illumina, Agilent and Affymetrix. In many of these cases, the genetic material is amplified by PCR before hybridization to probes on the genotyping array to detect the presence of particular alleles. How these assays are used for genotyping is described 25 elsewhere in this disclosure. Described below are several methods for screening for abnormal numbers of DNA segments, whether arising from deletions, aneuploides and/or mosaicism. The methods are grouped as follows: (i) quantitative techniques without making allele calls; (ii) qualitative techniques that leverage allele calls; (iii) quantitative techniques that leverage 30 allele calls; (iv) techniques that use a probability distribution function for the amplification of genetic data at each locus. All methods involve the measurement of multiple loci on a given segment of a given chromosome to determine the number of instances of the given segment in the genome of the target individual. In addition, the 49 methods involve creating a set of one or more hypotheses about the number of instances of the given segment; measuring the amount of genetic data at multiple loci on the given segment, determining the relative probability of each of the hypotheses given the measurements of the target individual's genetic data; and using the relative probabilities 5 associated with each hypothesis to determine the number of instances of the given segment. Furthermore, the methods all involve creating a combined measurement M that is a computed function of the measurements of the amounts of genetic data at multiple loci. In all the methods, thresholds are determined for the selection of each hypothesis Hi based on the measurement M, and the number of loci to be measured is estimated, in 10 order to have a particular level of false detections of each of the hypotheses. The probability of each hypothesis given the measurement M is P(H 1 IM)= P(MHi)P(Hi)/P(M). Since P(M) is independent of Hi, we can determine the relative probability of the hypothesis given M by considering only P(MHi)P(Hi). In what follows, in order to simplify the analysis and the comparison of different techniques, we assume 15 that P(Hg) is the same for all {HI}, so that we can compute the relative probability of all the P(HiIM) by considering only P(M|Hi). Consequently, our determination of thresholds and the number of loci to be measured is based on having particular probabilities of selecting false hypotheses under the assumption that P(Hi) is the same for all {Hi}. It will be clear to one skilled in the art after reading this disclosure how the approach would be 20 modified to accommodate the fact that P(H) varies for different hypotheses in the set {Hi}. In some embodiments, the thresholds are set so that hypothesis Hi. is selected which maximizes P(HilM) over all i. However, thresholds need not necessarily be set to maximize P(Hi|M), but rather to achieve a particular ratio of the probability of false detections between the different hypotheses in the set {Hi}. 25 It is important to note that the techniques referred to herein for detecting aneuploides can be equally well used to detect for uniparental disomy, unbalanced translocations, and for the sexing of the chromosome (male or female; XY or XX). All of the concepts concern detecting the identity and number of chromosomes (or segments of chromosomes) present in a given sample, and thus are all addressed by the methods 30 described in this document. It should be obvious to one skilled in the art how to extend any of the methods described herein to detect for any of these abnormalities. 50 The Concept of Matched Filtering The methods applied here are similar to those applied in optimal detection of digital signals. It can be shown using the Schwartz inequality that the optimal approach to maximizing Signal to Noise Ratio (SNR) in the presence of normally distributed noise is 5 to build an idealized matching signal, or matched filter, corresponding to each of the possible noise-free signals, and to correlate this matched signal with the received noisy signal. This approach requires that the set of possible signals are known as well as the statistical distribution - mean and Standard Deviation (SD) - of the noise. Herein is described the general approach to detecting whether chromosomes, or segments of DNA, 10 are present or absent in a sample. No differentiation will be made between looking for whole chromosomes or looking for chromosome segments that have been inserted or deleted. Both will be referred to as DNA segments. It should be clear after reading this description how the techniques may be extended to many scenarios of aneuploidy and sex determination, or detecting insertions and deletions in the chromosomes of embryos, 15 fetuses or born children. This approach can be applied to a wide range of quantitative and qualitative genotyping platforms including Taqman, qPCR, Illumina Arrays, Affymetrix Arrays, Agilent Arrays, the MIPS kit etc. Formulation of the General Problem 20 Assume that there are probes at SNPs where two allelic variations occur, x and y. At each locus i, i=1...N, data is collected corresponding to the amount of genetic material from the two alleles. In the Taqman assay, these measures would be, for example, the cycle time, Ct, at which the level of each allele-specific dye crosses a threshold. It will be clear how this approach can be extended to different measurements of the amount of 25 genetic material at each locus or corresponding to each allele at a locus. Quantitative measurements of the amount of genetic material may be nonlinear, in which case the change in the measurement of a particular locus caused by the presence of the segment of interest will depend on how many other copies of that locus exist in the sample from other DNA segments. In some cases, a technique may require linear measurements, such that 30 the change in the measurement of a particular locus caused by the presence of the segment of interest will not depend on how many other copies of that locus exist in the sample from other DNA segments. An approach will be described for how the measurements from the Taqman or qPCR assays may be linearized, but there are many 51 other techniques for linearizing nonlinear measurements that may be applied for different assays. The measurements of the amount of genetic material of allele x at loci 1... N is given by data d, = [d.,,... dN]. Similarly for allele y, dy = [dy,... dyN). Assume that each 5 segment j has alleles aj = [aji .... ajN] where each element aji is either x or y. Describe the measurement data of the amount of genetic material of allele x as dx = sx + ux where s., is the signal and i is a disturbance. The signal s,= [fx(au,...,asi) ... fx(aN,..., amN)] where fx is the mapping from the set of alleles to the measurement, and J is the number of DNA segment copies. The disturbance vector ox is caused by measurement error and, in the 10 case of nonlinear measurements, the presence of other genetic material besides the DNA segment of interest. Assume that measurement errors are normally distributed and that they are large relative to disturbances caused by nonlinearity (see section on linearizing measurements) so that oxi ~ nxi where nxi has variance oxiz and vector n, is normally distributed -N(O,R), R=E(nxnxj). Now, assume some filter h is applied to this data to 15 perform the measurement m = h=d, = hTsz + hTox. In order to maximize the ratio of signal to noise (hIsx/hInx) it can be shown that h is given by the matched filter h = pR'sx where pi is a scaling constant. The discussion for allele x can be repeated for allele y. Method la: Measuring Aneuploidy or Sex by Quantitative Techniques that Do Not Make 20 Allele Calls When the Mean and Standard Deviationfor Each Locus is Known Assume for this section that the data relates to the amount of genetic material at a locus irrespective of allele value (e.g. using qPCR), or the data is only for alleles that have 100% penetrance in the population, or that data is combined on multiple alleles at each locus (see section on linearizing measurements) to measure the amount of genetic 25 material at that locus. Consequently, in this section one may refer to data d, and ignore dy. Assume also that there are two hypotheses: ho that there are two copies of the DNA segment (these are typically not identical copies), and hi that there is only I copy. For each hypothesis, the data may be described as dri(ho) = sxi(ho)+nxi and dj,(hi) = si(hi)+ni respectively, where sxi(ho) is the expected measurement of the genetic material at locus i 30 (the expected signal) when two DNA segments are present and sxi(hl) is the expected data for one segment. Construct the measurement for each locus by differencing out the expected signal for hypothesis ho: mi= dxr-sx(ho). If hi is true, then the expected value of the measurement is E(mxi) = sxi(hi)-sxi(ho). Using the matched filter concept discussed 52 above, set h = (1/N)R'(si(hi)-sxi(ho)). The measurement is described as m = hidx If h, is true, the expected value of E(mJhi) = mi = (1/N)FEi= 1 ...N(sxi(hi)-sx;(ho)) 2
I
2 and the standard deviation of m is amph = (1/N 2 )E-i=. N((sXi(I)-sziOIo))2/G 4)a 5 (i/N2)i=1 ...N(sxi(hl)-sxi(ho))2 X If ho is true, the expected value of m is E(mlho) = mo = 0 and the standard deviation of m is again amiho2 (/N 2 )yi= 1 ...N(si(hI)-s(h))2/ax. Figure 3 illustrates how to determine the probability of false negatives and false positive detections. Assume that a threshold t is set half-way between mi and mo in order 10 to. make the probability of false negatives and false positives equal (this need not be the case as is described below). The probability of a false negative is determined by the ratio of (mi-t)/am(h1=(mi-mo)/(2amihl). "5-Sigma" statistics may be used so that the probability of false negatives is 1-nonnedf(5,0,1) = 2.87e-7. In this case, the goal is for (mi mo)/(2ampo) > 5 or 10sqrt((/N 2 )Zi=l...N(si(hi)-sXi(h)) 2 /oxl 2 ) < (l/N)i=1...N(xi(h1) 15 s,(ho)) 2 /ax or sqrt(Zi=1...N(sx.(h2)-sX(o)) 2 F2) > 10. In order to compute the size of N, Mean Signal to Noise Ratio can be computed from aggregated data: MSNR = (1/N)Ei=1...N(sxi(hi)-sxi(ho)) 2 i 2 . N can then be found from the inequality above: sqrt(N).sqrt(MSNR)> 10 or N > IOO/MSNR. This approach was applied to data measured with the Taqman Assay from Applied 20 BioSystems using 48 SNPs on the X chromosome. The measurement for each locus is the time, Cb that it takes the die released in the well corresponding to this locus to exceed a threshold. Sample 0 consists of roughly 0.3ng (50 cells) of total DNA per well of mixed female origin where subjects had two X chromosomes;, sample 1 consisted of roughly 0.3ng of DNA per well of mixed male origin where subject had one X chromosome. 25 Figure 4 and Figure 5 show the histograms of measurements for samples 1 and 0. The distributions for these samples are characterized by mo= 29.97; SDo=1.32, mi=31.44, SDI=1.592. Since this data is derived from mixed male and female samples, some of the observed SD is due to the different allele frequencies at each SNP in the mixed samples. In addition, some of the observed SD will be due to the varying efficiency of the different 30 assays at each SNP, and the differing amount of dye pipetted into each well. Figure 6 provides a histogram of the difference in the measurements at each locus for the male and female sample. The mean difference between the male and female samples is 1.47 and the SD of the difference is 0.99. While this SD will still be subject to the different allele 53 frequencies in the mixed male and female samples, it will no longer be affected the different efficiencies of each assay at each locus. Since the goal is to differentiate two measurements each with a roughly similar SD, the adjusted SD may be approximated for each measurement for all loci as 0.99/sqrt(2)=0.70. Two runs were conducted for every 5 locus in order to estimate axi for the assay at that locus so that a matched filter could be applied. A lower limit of axi was set at 0.2 in order to avoid statistical anomalies resulting from only two runs to compute a,,i. Only those loci (numbering 37) for which there were no allele dropouts over both alleles, over both experiment runs and over both male and female samples were used in the plots and calculations. Applying the approach above to 10 this data, it was found that MSNR=2.26, hence N= 2252/2.26^2= 17 loci. Method ib: Measuring Aneuploidy or Sex by Quantitative Techniques that Do Not Make Allele Calls When the Mean and Std. Deviation is Not Known or is Unform When the characteristics of each locus are not known well, the simplifying 15 assumptions that all the assays at each locus will behave similarly can be made, namely that E(mxi) and axi are constant across all loci i, so that it is possible to refer instead only to E(mx) and oYx. In this case, the matched filtering approach m= hidx reduces to finding the mean of the distribution of d,. This approach will be referred to as comparison of means, and it will be used to estimate the number of loci required for different kinds of 20 detection using real data. As above, consider the scenario when there are two chromosomes present in the sample (hypothesis ho) or one chromosome present (hi). For ho, the distribution is N(po,ao 2 ) and for hi the distribution is N( ii, 1 2 ). Measure each of the distributions using No and Ni samples respectively, with measured sample means and SDs: mi, m, si, and so. 25 The means can be modeled as random variables Mo, Mi that are normally distributed as Mo -N(po, aYo 2 /No) and Mi-N(pii, a 1 2 /Ni). Assume N 1 and No are large enough (> 30) so that one can assume that Mi-N(mi, sI2/Ni) and Mo -N(mo, so 2 /No). In order to test whether the distributions are different, the difference of the means test may be used, where d = mi-mo. The variance of the random variable D is af - ai 2 /Nt +(o/No which 30 may be approximated as ad - sj 2 /Ni+so2/No. Given ho, E(d) = 0; given hi, E(d)=11i-po. Different techniques for making the call between hi for he will now be discussed. Data measured with a different run of the Taqman Assay using 48 SNPs on the X chromosome was used to calibrate performance. Sample 1 consists of roughly 0.3ng of 54 DNA per well of mixed male origin containing one X chromosome; sample 0 consisted of roughly 0.3ng of DNA per well of mixed female origin containing two X chromosomes. N, = 42 and No = 45. Figure 7 and Figure 8 show the histograms for samples 1 and 0. The distributions for these samples are characterized by mi=32.259, s 1 =1.460, 5 a-mi=s/sqrt(Ni)=0.225; mo= 30.75; so=1.202, omo=so/sqrt(No)=0.l79. For these samples d=1.509 and ud=0.
2 8 7 9 . Since this data is derived from mixed male and female samples, much of the standard deviation is due to the different allele frequencies at each SNP in the mixed samples. SD is estimated by considering the variations in C for one SNP at a time, over 10 multiple runs. This data is shown in Figure 9. The histogram is symmetric around 0 since Ct for each SNP is measured in two runs or experiments and the mean value of Ct for each SNP is subtracted out. The average std. dev. across 20 SNPs in the mixed male sample using two runs is s=0.597. This SD will be conservatively used for both male and female samples, since SD for the female sample will be smaller than for the male sample. 15 In addition, note that the measurement from only one dye is being used, since the mixed samples are assumed to be heterozygous for all SNPs. The use of both dyes requires the measurements of each allele at a locus to be combined, which is more complicated (see section on linearizing measurements). Combining measurements on both dyes would double signal amplitude and increase noise amplitude by roughly sqrt(2), resulting in an 20 SNR improvement of roughly sqrt(2) or 3dB. Detection Assuming No Mosaicism and No Reference Sample Assume that mo is known perfectly from many experiments, and every experiment runs only one sample to compute mi to compare with mo. N 1 is the number of assays and 25 assume that each assay is a different SNP locus. A threshold t can be set half way between mo and mi to make the likelihood of false positives equal the number of false negatives, and a sample is labeled abnormal if it is above the threshold. Assume si = S2= S = 0.597 and use the 5-sigma approach so that the probability of false negatives or positives is l-normcdf(5,0,l) = 2.87e-7. The goal is for 5si/sqrt(NI) < (mi-mo)/2, hence 30 NJ = 100 s I 2 /(m 1 -mo) 2 = 16. Now, an approach where the probability of a false positive is allowed to be higher than the probability of a false negatives, which is the harmful scenario, may also be used. If a positive is measured, the experiment may be rerun. Consequently, it is possible to say that the probability of a false negative should be equal 55 to the square of the probability of a false positive. Consider Figure 3, let t = threshold, and assume Sigma_0 = Sigma__1 = s. Thus (l-normcdf((t-mo)/s,0,1)) 2 = 1-normcdf((mi t)/s,0,1). Solving this, it can be shown that t = mo+0.32(m-mo). Hence the goal is for 5s/sqrt(N1)<m-mo-0.
3 2 (m-mo) =(mrmo)/1.
4 7, hence NJ = (52)( 47 2
)S
2 /(m-mo) 2 = 9. 5 Detection with Mosaicism without Running a Reference Sample Assume the same situation as above, except that the goal is to detect mosaicism with a probability of 97.7% (i.e. 2-sigma approach). This is better than the standard approach to amniocentesis which extracts roughly 20 cells and photographs them. If one 10 assumes that 1 in 20 cells is aneuploid and this is detected with 100% reliability, the probability of having at least one of the group being aneuploid using the standard approach is 1-0.9520 = 64%. If 0.05% of the cells are aneuploid (call this sample 3) then m3 = 0.95mo + 0.05mi and var(m 3 ) = (0.95s2+0.05s 1 2 )/NI. Thus std(m 3 )2<(m3-mo)/2 => sqrt(0.95so 2 40.05s,)/sqrt(Ni) < 0.05(mi-m 2 )/4 => Ni = 16(0.95s22+0.05s1 2 )/(0.05 2 (my_ 15 m2) 2 ) = 1001. Note that using the goal of 1-sigma statistics, which is still better than can be achieved using the conventional approach (i.e. detection with 84.1% probability), it can be shown in a similar manner that Ni =250. Detection with No Mosaicism and Using a Reference Sample 20 Although this approach may not be necessary, assume that every experiment runs two samples in order to compare m, with truth sample m 2 . Assume that N = NJ = No. Compute d = mi-mo and, assuming a, = ao, set a threshold t = (mo+mi)/2 so that the probability of false positives and false negatives is equal. To make the probability of false negatives 2.87e-7, it must be the case that (ml-m2)/2>5sqrt(s 2 /N+s2 2 /N) => N = 25 1 00(sl 2 +s2)/(m1-m2) 2 =32. Detection with Mosaicism and Running a Reference Sample As above, assume the probability of false negatives is 2.3% (i.e. 2-sigma approach). If 0.05% of the cells are aneuploid (call this sample 3) then m3 = 0.95mo + 30 0.05m, and var(m 3 ) = (0.95so 2 +0.05s,)/N 1 . d = m 3 -m2 and ad= (1.95sO 2 +0.05s, 2 )/N. It must be that std(m 3 )2<(mo-m2)/2 => sqrt(l.95s 2 2 -0.05sj 2 )/sqrt(N) < 0.05(m-m2)/4 => N = 16(l.95s +0.05s,2)/(0.05 2 (mIm2) 2 ) = 2002. Again using 1-sigma approach, it can be shown in a similar manner that N = 500. 56 Consider the case if the goal is only to detect 5% mosaicism with a probability of 64% as is the current state of the art. Then, the probability of false negative would be 36%. In other words, it would be necessary to find x such that 1-norrncdf(x,0,1)=36%. Thus N= 4(0.36A2)(I.95s 2 2 +0.05s 2 )/(O.05 2 (mi-m 2
)
2 ) = 65 for the 2-sigma approach, or N 5 = 33 for the 1-sigma approach. Note that this would result in a very high level of false positives, which needs to be addressed, since such a level of false positives is not currently a viable alternative. Also note that if N is limited to 384 (i.e. one 384 well Taqman plate per chromosome), and the goal is to detect mosaicism with a probability of 97.72%, then it 10 will be possible to detect mosaicism of 8.1% using the 1-sigma approach. In order to detect mosaicism with a probability of 84.1% (or with a 15.9% false negative rate), then it will be possible to detect mosaicism of 5.8% using the 1-sigma approach. To detect mosaicism of 19% with a confidence of 97.72% it would require roughly 70 loci. Thus one could screen for 5 chromosomes on a single plate. 15 The summary of each of these different scenarios is provided in Table 2. Also included in this table are the results generated from qPCR and the SYBR assays. The methods described above were used and the simplifying assumption was made that the performance of the qPCR assay for each locus is the same. Figure 10 and Figure 11 show the histograms for samples 1 and 0, as described above. No = N! = 47. The distributions of 20 the measurements for these samples are characterized by mi = 27.65, s, = 1.40, ami=si/sqrt(Ni)=0.204; mo= 26.64; so=1.146, amo=so/sqrt(No)=0.167. For these samples d=1.01 and yd=0.
2 636 . Figure 12 shows the difference between Ct for the male and female samples for each locus, with a standard deviation of the difference over all loci of 0.75. The SD was approximated for each measurement of each locus on the male or 25 female sample as 0.75/sqrt(2)=0.53. Method 2: Qualitative Techniques that Use Allele Calls In this section, no assumption is made that the assay is quantitative. Instead, the assumption is that the allele calls are qualitative, and that there is no meaningful 30 quantitative data coming from the assays. This approach is suitable for any assay that makes an allele call. Figure 13 describes how different haploid gametes form during meiosis, and will be used to describe the different kinds of aneuploidy that are relevant 57 for this section. The best algorithm depends on the type of aneuploidy that is being detected. Consider a situation where aneuploidy is caused by a third segment that has no section that is a copy of either of the other two segments. From Figure 13, the situation 5 would arise, for example, if pi and p4, or P2 and p3, both arose in the child cell in addition to one segment from the other parent This is very common, given the mechanism which causes aneuploidy. One approach is to start off with a hypothesis h that there are two segments in the cell and what these two segments are. Assume, for the purpose of illustration, that ho is. for p3 and rn4 from Figure 13 In a preferred embodiment this 10 hypothesis comes from algorithms described elsewhere in this document Hypothesis hi is that there is an additional segment that has no sections that are a copy of the other segments. This would arise, for example, if p2 or mi was also present. It is possible to identify all loci that are homozygous in P3 and m4. Aneuploidy can be detected by searching for heterozygous genotype calls at loci that are expected to be homozygous. 15 Assume every locus has two possible alleles, x and y. Let the probability of alleles x and y in general be px and py respectively, and px+py=l. If hi is true, then for each locus i for which p3 and m4 are homozygous, then the probability of a non-homozygous call is py or px, depending on whether the locus is homozygous in x or y respectively. Note: based on knowledge of the parent data, i.e. Pi, P2, p4 and mi, mn 2 , M 3 , it is possible to 20 further refine the probabilities for having non-homozygous alleles x or y at each locus. This will enable more reliable measurements for each hypothesis with the same number of SNPs, but complicates notation, so this extension will not be explicitly dealt with. It should be clear to someone skilled in the art how to use this information to increase the reliability of the hypothesis. 25 The probability of allele dropouts is pd. The probability of finding a heterozygous genotype at locus i is poi given hypothesis ho and ph given hypothesis hi. Given ho: poi= 0 Given hi: pu = px(1-pd) or pu = py(1-pd) depending on whether the locus is homozygous for x or y. 30 Create a measurement m = I/Nh Yi=...,Nh Ii where Ii is an indicator variable, and is 1 if a heterozygous call is made and 0 otherwise. Nh is the number of homozygous loci. One can simplify the explanation by assuming that px=py and p0i, pli for all loci are the same two values po and pl. Given ho, E(m) = po = 0 and a 2 ,po = po(l- po)/N. Given hi, 58 E(m)= pi and a 2 ,m = pI(l-p)/Nh. Using 5 sigma-statistics, and making the probability of false positives equal the probability of false negatives, it can be shown that (pi-po)/2 > 5ay hence Nh = 100(po(1-po)+pi(1-pi))/(pi-po) 2 . For 2-sigma confidence instead of 5 sigma confidence, it can be shown that Nh = 4.22(po(1-po)+pi(l-p))/(pi-po) 2 . 5 It is necessary to sample enough loci N that there will be sufficient available homozygous loci Nh-avai such that the confidence is at least 97.7% (2-sigma). Characterize Nh-avai = Zi=]...N Ji where Ji is an indicator variable of value 1 if the locus is homozygous and 0 otherwise. The probability of the locus being homozygous is px 2 +py2. Consequently, E(Nh-avai!)=N(px2+py2) and aNh-avai 2 = N(px 2 +py 2 1 p 2 -py 2 ). To guarantee N is large 10 enough with 97.7% confidence, it must be that E(Nh-avai) - 2 aNh-avai = Nh where Nh is found from above. For example, if one assumes pd = 0.3, p, = py = 0.5, one can find Nh = 186 and N = 391 for 5-sigma confidence. Similarly, it is possible to show that Nh = 30 and N = 68 for 2-sigma confidence i.e. 97.7% confidence in false negatives and false positives. 15 Note that a similar approach can be applied to looking for deletions of a segment when ho is the hypothesis that two known chromosome segment are present, and h, is the hypothesis that one of the chromosome segments is missing. For example, it is possible to look for all of those loci that should be heterozygous but are homozygous, factoring in the effects of allele dropouts as has been done above. 20 Also note that even though the assay is qualitative, allele dropout rates may be used to provide a type of quantitative measure on the number of DNA segments present. Method 3: Making use of Known Alleles of Reference Sequences, and Quantitative Allele Measurements 25 Here, it is assumed that the alleles of the normal or expected set of segments are known. In order to check for three chromosomes, the first step is to clean the data, assuming two of each chromosome. In a preferred embodiment of the invention, the data cleaning in the first step is done using methods described elsewhere in this document. Then the signal associated with the expected two segments is subtracted from the 30 measured data. One can then look for an additional segment in the remaining signal. A matched filtering approach is used, and the signal characterizing the additional segment is based on each of the segments that are believed to be present, as well as their complementary chromosomes. For example, considering Figure 13, if the results of PS 59 indicate that segments p2 and ml are present, the technique described here may be used to check for the presence of p2, p3, ml and m4 on the additional chromosome. If there is an additional segment present, it is guaranteed to have more than 50% of the alleles in common with at least one of these test signals. Note that another approach, not described 5 in detail here, is to use an algorithm described elsewhere in the document to clean the data, assuming an abnormal number of chromosomes, namely 1, 3, 4 and 5 chromosomes, and then to apply the method discussed here. The details of this approach should be clear to someone skilled in the art after having read this document. Hypothesis ho is that there are two chromosomes with allele vectors a,, a 2 . 10 Hypothesis hi is that there is a third chromosome with allele vector a 3 . Using a method described in this document to clean the genetic data, or another technique, it is possible to determine the alleles of the two segments expected by ho: aI = [an ... aIN) and a 2 = [a 21
-.
a2N] where each element aji is either x or y. The expected signal is created for hypothesis ho: sox= [fox(auI, a 21 ) ... fxo(alN, a2N)I, Soy fy(all, a 21 ) ... fy(alN, a2N)] where fx, fy describe 15 the mapping from the set of alleles to the measurements of each allele. Given ho, the data may be described as dxi = so.,+nx, nxi~N(O,Yxi 2 ); dyi = soyi+nyi, nyi-N(0,ayi 2 ). Create a measurement by differencing the data and the reference signal: mxi=dxi-sxi; myi=dy-syi. The full measurement vector is m=[mxT myT]I. Now, create the signal for the segment of interest - the segment whose presence is 20 suspected, and will be sought in the residual - based on the assumed alleles of this segment: a 3 = [a 31 ... a3N]. Describe the signal for the residual as: sr= [s, T SryT]I where s, = [fx(a 3 1 ) ... f,.(a3N)], sry ry[f(a 31 ) ... fry(3N)] where frx(ai) = 5,j if a 3 ; = x and 0 otherwise, fry(a 3 i) = Syi if a 3 ; = y and 0 otherwise. This analysis assumes that the nieasurements have been linearized (see section below) so that the presence of one copy 25 of allele x at locus i generates data i+ndi and the presence of Kx copies of the allele x at locus i generates data cx~,i+nxi. Note however that this assumption is not necessary for the general approach described here. Given hi, if allele a3i= x then mxj= Sxi+nxi, my= ny i and if a3i y then mxi= nxi, myi= Syi+ni. Consequently, a matched filter h = (1/N)Rsr can be created where R = diag([x1 2 --.. .xN2Y1 2 ... -yN 2 ]). The measurement is m = hrd. 30 ho: m = (1/N) E=..N srxinxi ax2+srynyi Gyi2 hi: m = (1/N) Ei=1..N Srxi(8xi+nxi) ax12 2 60 In order to estimate the number of SNPs required, make the simplifying assumptions that all assays for all alleles and all loci have similar characteristics, namely that =y and ax(Ya for i1 ... N. Then, the mean and standard deviation may be found as follows: ho: E(m)=mo=0; aminoz2t= (11N 2o4)(N2)(&5 2 -t-& 2 )= 8 2
I(Y
2 ) 5 hi: E(m)=mi=(l/N)(N/202)(52+2) 82/2; am~p 2 =(l/N2a 4 ) 2(oS 2 )= 5 2 /(N0 2 ) Now compute a signal-to-noise ratio (SNR) for this test of h, versus ho. The signal is in, mo= 52/2, and the noise variance of this measurement is amiho 2 +aI 2 _ 28 2 I(Na2). Consequently, the SNR for this test is (5 4 /a 4 )/(282/(Nol))= NS 2 /(22). Compare this SNR to the scenario where the genetic information is simply 10 summed at each locus without performing a matched filtering based on the allele calls. Assume that h=(l/N)i where i is the vector of N ones, and make the simplifying assumptions as above that Si=Syi= 8 and aji=ay;=a for i 1.. .N. For this scenario, it is straightforward to show that if m-h-d: ho: E(m)=mo=0; CmihO 2 No/N 2 +Na2/N 2 =2 2 /N 15 hi: E(m)=n=(1/N)(N/2+N8/2)= 6; amihi2 (/N2)(N2+ Na 2 )= 2oz/N Consequently, the SNR for this test is NS 2 /(42). In other words, by using a matched filter that only sums the allele measurements that are expected for segment a 3 , the number of SNPs required is reduced by a factor of 2. This ignores the SNR gain achieved by using matched filtering to account for the different efficiencies of the assays at each locus. 20 Note that if we do not correctly characterize the reference signals sai and syi then the SD of the noise or disturbance on the resulting measurement signals mxi and myi will be increased. This will be insignificant if 5 << a, but otherwise it will increase the probability of false detections. Consequently, this technique is well suited to test the hypothesis where three segments are present and two segments are assumed to be exact 25 copies of each other. In this case, sxij and sy, will be reliably known using techniques of data cleaning based on qualitative allele calls described elsewhere. In one embodiment method 3 is used in combination with method 2 which uses qualitative genotyping and, aside from the quantitative measurements from allele dropouts, is not able to detect the presence of a second exact copy of a segment. 30 We now describe another quantitative technique that makes use of allele calls. The method involves comparing the relative amount of signal at each of the four registers for a given allele. One can imagine that in the idealized case involving a single, normal cell, where homogenous amplification occurs, (or the relative amounts of amplification are 61 normalized), four possible situations can occur: (i) in the case of a heterozygous allele, the relative intensities of the four registers will be approximately 1:1:0:0, and the absolute intensity of the signal will correspond to one base pair, (ii) in the case of a homozygous allele, the relative intensities will be approximately 1:0:0:0, and the absolute intensity of 5 the signal will correspond to two base pairs; (iii) in the case of an allele where ADO occurs for one of the alleles, the relative intensities will be approximately 1:0:0:0, and the absolute intensity of the signal will correspond to one base pair; and (iv) in the case of an allele where ADO occurs for both of the alleles, the relative intensities will be approximately 0:0:0:0, and the absolute intensity of the signal will correspond to no base 10 pairs. In the case of aneuploides, however, different situations will be observed. For example, in the case of trisomy, and there is no ADO, one of three situations will occur: (i) in the case of a triply heterozygous allele, the relative intensities of the four registers will be approximately 1:1:1:0, and the absolute intensity of the signal will correspond to 15 one base pair, (ii) in the case where two of the alleles are homozygous, the relative intensities will be approximately 2:1:0:0, and the absolute intensity of the signal will correspond to two and one base pairs, respectively; (iii) in the case where are alleles are homozygous, the relative intensities will be approximately 1:0:0:0, and the absolute intensity of the signal will correspond to three base pairs. If allele dropout occurs in the 20 case of an allele in a cell with trisomy, one of the situations expected for a normal cell will be observed. In the case of monosomy, the relative intensities of the four registers will be approximately 1:0:0:0, and the absolute intensity of the signal will correspond to one base pair. This situation corresponds to the case of a normal cell where ADO of one of the alleles has occurred, however in the case of the normal cell, this will only be 25 observed at a small percentage of the alleles. In the case of uniparental disomy, where two identical chromosomes are present, the relative intensities of the four registers will be approximately 1:0:0:0, and the absolute intensity of the signal will correspond to two base pairs. In the case of UPD where two different chromosomes from one parent are present, this method will indicate that the cell is normal, although further analysis of the data 30 using other methods described in this patent will uncover this. In all of these cases, either in cells that are normal, have aneuploides or UPD, the data from one SNP will not be adequate to make a decision about the state of the cell. However, if the probabilities of each of the above hypothesis are calculated, and those 62 probabilities are combined for a sufficient number of SNPs on a given chromosome, one hypothesis will predominate, it will be possible to detennine the state of the chromosome with high confidence. 5 Methods for linearizing quantitative measurements Many approaches may be taken to linearize measurements of the amount of genetic material at a specific locus so that data from different alleles can be easily summed or differenced. We first discuss a generic approach and then discuss an approach that is designed for a particular type of assay. 10 Assume data dxi refers to a nonlinear measurement of the amount of genetic material of allele x at locus i. Create a training set of data using N measurements, where for each measurement, it is estimated or known that the amount of genetic material corresponding to data di is pai. The training set pj, i=l ... N, is chosen to span all the different amounts of genetic material that might be encountered in practice. Standard 15 regression techniques can be used to train a function that maps from the nonlinear measurement, dx, to the expectation of the linear measurement, E(p3i). For example, a linear regression can be used to train a polynomial function of order P, such that E(pI3,) = [1 dxi dxi 2 ... dip]c where c is the vector of coefficients c = [co ci ... ce]T. To train this linearizing function, we create a vector of the amount of genetic material for N 20 measurements px= [p.. $xN]T and a matrix of the measured data raised to powers 0...P: D = [[1 dxi d.
1 2 .. d.IP]T [1 dx 2 d 2,
.
dx2] T ... [1 dxN dxN 2 .... dxN PT]T. The coefficients can then be found using a least squares fit c = (DTD)IDI x. Rather than depend on generic functions such as fitted polynomials, we may also create specialized functions for the characteristics of a particular assay. We consider, for 25 example, the Taqman assay or a qPCR assay. The amount of die for allele x and some locus i, as a function of time up to the point where it crosses some threshold, may be described as an exponential curve with a bias offset: gi(t) = axi + Priexp(yxit) where axi is the bias offset, yxi is the exponential growth rate, and pxi corresponds to the amount of genetic material. To cast the measurements in terms of pfd, compute the parameter axi by 30 looking at the asymptotic limit of the curve gxi(-oo) and then may find p1x; and yxi by taking -the log of the curve to obtain log(gxi(t)- axi) = log(p.i) + 'yxit and performing a standard linear regression. Once we have values for axi and yx, another approach is to compute pxi 63 from the time, t,,, at which the threshold g, is exceeded. pI3 - (g - axi)exp(-yit). This will be a noisy measurement of the true amount of genetic data of a particular allele. Whatever techniques is used, we may model the linearized measurement as xc, 6 1 ,;+ni where Kc is the number of copies of allele x, 5,,j is a constant for allele x and 5 locus i, and nxr-N(0, a, 2 ) where a 2 can be measured empirically. Method 4: Using a probability distribution function for the amplification of genetic data at each locus The quantity of material for a particular SNP will depend on the number of initial 10 segments in the cell on which that SNP is present. However, due to the random nature of the amplification and hybridization process, the quantity of genetic material from a particular SNP will not be directly proportional to the starting number of segments. Let qs,A, qs,G, qsT, qs,c represent the amplified quantity of genetic material for a particular SNP s for each of the four nucleic acids (A,C,T,G) constituting the alleles. Note that these 15 quantities may be exactly zero, depending on the technique used for amplification. Also note that these quantities are typically measured from the intensity of signals from particular hybridization probes This intensity measurement can be used instead of a measurement of quantity, or can be converted into a quantity estimate using standard techniques without changing the nature of the invention. Let qs be the sum of all the 20 genetic material generated from all alleles of a particular SNP: qs = qs,A + qs,o + qsT + qsC, Let N be the number of segments in a cell containing the SNP s. N is typically 2, but may be 0,1 or 3 or more. For any high or medium throughput genotyping method discussed, the resulting quantity of genetic material can be represented as q, = (A+A,,)N+O 5 where A is the total amplification that is either estimated a-priori or easily measured empirically, 25 Ae, is the error in the estimate of A for the SNP s, and 3 is additive noise introduced in the amplification, hybridization and other process for that SNP. The noise terms AO,, and 0, are typically large enough that q, will not be a reliable measurement of N. However, the effects of these noise terms can be mitigated by measuring multiple SNPs on the chromosome. Let S be the number of SNPs that are measured on a particular 30 chromosome, such as chromosome 21. It is possible to generate the average quantity of genetic material over all SNPs on a particular chromosome as follows: 64 q =-Eq, =MN+-jAoN +0, (16) S1= S ,=1 Assuming that Ao,, and 0, are normally distributed random variables with 0 means and variances o2... and o 2 e,, one can model q =NA-p where (p is a normally distributed 5 random variable with 0 mean and variance -(N20u,, + o-,). Consequently, if sufficient S number of SNPs are measured on the chromosome such that S >> (N2 2A4 + or2e), then N=q/A can be accurately estimated. In a another embodiment, assume that the amplification is according to a model where the signal level from one SNP is s=a+a where (a+a) has a distribution that looks 10 like the picture in Figure 14, left. The delta function at 0 models the rates of allele dropouts of roughly 30%, the mean is a, and if there is no allele dropout, the amplification has uniform distribution from 0 to ao. In terms of the mean of this distribution ao is found to be ao=2.86a. Now model the probability density function of a using the picture in Figure 14, right. Let so be the signal arising from c loci; let n be the number of segments; 15 let Oi be a random variable distributed according to Figure 14 that contributes to the signal from locus i; and let a be the standard deviation for all (ai}. sc=anc+Ei=--1-ne aj; mean(se) = ano; std(s) = sqrt(nc)a. If a is computed according to the distribution in Figure 14, right, it is found to be ca=0.907a 2 . We can find the number of segments from n-se/(ac) and for "5-sigma statistics" we require std(n)<0. 1 so std(se)/(ac) = 0.1 => 0.95a.sqrt(nc)/(ac) = 0.1 20 so c =0.95 2 n/0.1 2 = 181. Another model to estimate the confidence in the call, and how many loci or SNPs must be measured to ensure a given degree of confidence, incorporates the random variable as a multiplier of amplification instead of as an additive noise source, namely s--a(l+a). Taking logs, log(s) = log(a) + log(l+a). Now, create a new random variable 25 y=log(l+a) and this variable may be assumed to be nonnally distributed -N(0,u). In this model, amplification can range from very small to very large, depending on a, but never negative. Therefore a=el-1; and sc=E=i... ca(1+ai). For notation, mean(se) and expectation value E(se) are used interchangeably 30 E(se)=acn+aE Z,=. a,)= acn + aE( , ) =acn(1+E(a)) 65 To find E(cc) the probability density function (pdf) must be found for a which is possible since a is a function of y which has a known Gaussian pdf. p 0 (a)=py(y)(dy/da). So: I dy d Pr I= 3, a dr =---(log(1 +a))- =e-' p,()V= e 2 a and -_-W -da da 1+a 5 and: p,(a) e2 e -r = e 2a2 1 This has the form shown in Figure 15 for o= 1. Now, E(u) can be found by integrating 10 over this pdf E(a) = ap, (a)da which can be done numerically for multiple different a. This gives E(sc) or mean(sc) as a function of a. Now, this pdf can also be used to find var(sc): var(s.) = E(s 0 - E(s)) 2 = E( a(1+ a,) - acn - aE$ al = E$ aa, -aE aY = a2E( a, - cnE(a) = a2E(. a 1 L -2cnE(a)(Z a, 1 )+ c2n2E(a) = a2E(cna2 + cn(cn - 1)ct.a - 2cnE(ac)(K. a,)+ cn2E(a)2) = a2c2n2(E(a2) + (cn - 1)E(a,) - 2cnE(a) 2 + cnE(a)2) = a2c2n2E(a2) + (cn -1)E(ap,) - cnE(a)) 15 which can also be solved numerically using p,(a) for multiple different a to get var(se) as a function of a. Then, we may take a series of measurements from a sample with a known number of loci c and a known number of segments n and find std(se)/E(se) from this data. That will enable us to compute a value for a. In order to estimate n, E(sc)=nac(1+E(a)) so 20 R = 17 can be measured so that std(n)= std(s) s ac(1 + E(a)) ac(1+ E(a)) std(n) 66 When summing a sufficiently large number of independent random variables of 0-mean, the distribution approaches a Gaussian form, and thus s, (and it) can be treated as normally distributed and as before we may use 5-sigma statistics: std(n) = sds) <0.1 ac(1 + E(a)) 5 in order to have an error probability of 2normcdf(5,0,1) = 2.7e-7. From this, one can solve for the number of loci c. Sexing In one embodiment of the system, the genetic data can be used to determine the 10 sex of the target individual. After the method disclosed herein is used to determine which segments of which chromosomes from the parents have contributed to the genetic material of the target, the sex of the target can be determined by checking to see which of the sex chromosomes have been inherited from the father: X indicates a female, and Y indicates a make. It should be obvious to one skilled in the art how to use this method to 15 determine the sex of the target. Validation of the Hypotheses In some embodiments of the system, one drawback is that in order to make a prediction of the correct genetic state with the highest possible confidence, it is necessary 20 to make hypotheses about every possible states. However, as the possible number of genetic states are exceptionally large, and computational time is limited, it may not be reasonable to test every hypothesis. In these cases, an alternative approach is to use the concept of hypothesis validation. This involves estimating limits on certain values, sets of values, properties or patterns that one might expect to observe in the measured data if a 25 certain hypothesis, or class of hypotheses are true. Then, the measured values can tested to see if they fall within those expected limits, and/or certain expected properties or patterns can be tested for, and if the expectations are not met, then the algorithm can flag those measurements for further investigation. For example, in a case where the end of one arm of a chromosome is broken off in 30 the target DNA, the most likely hypothesis may be calculated to be "normal" (as opposed, for example to "aneuploid. This is because the particular hypotheses that corresponds to the true state of the genetic material, namely that one end of the chromosome has 67 broken off, has not been tested, since the likelihood of that state is very low. If the concept of validation is used, then the algorithm will note that a high number of values, those that correspond to the alleles that lie on the broken off section of the chromosome, lay outside the expected limits of the measurements. A flag will be raised, inviting 5 further investigation for this case, increasing the likelihood that the true state of the genetic material is uncovered. It should be obvious to one skilled in the art how to modify the disclosed method to include the validation technique. Note that one anomaly that is expected to be very difficult to detect using the disclosed method is balanced translocations. 10 Application of the method with contaminated DNA In one embodiment of the system, genetic data from target DNA which has been definitely or possibly contaminated with foreign DNA can also be cleaned using the disclosed method. The concept outlined above, that of hypothesis validation, can be used 15 to identify genetic samples that fall outside of expected limits; in the case of contaminated samples it is expected that this validation will cause a flag to be raised, and the sample can be identified as contaminated. Since large segments of the target DNA will be known from the parental genetic data, and provided the degree of contamination is sufficiently low and sufficient SNPs are 20 measured, the spurious data due to the foreign genetic material can be identified. The method disclosed herein should still allow for the reconstruction of the target genome, albeit with lower confidence levels. Provided that the level of contamination is sufficiently low, the hypothesis that is calculated to be most likely is still expected to correspond to the true state of the genetic material in the target DNA sample. 25 It should be obvious to one skilled in the art how to optimize these methods for the purpose cleaning genetic data contaminated with spurious signals due to foreign DNA. Example of Reduction to Practice 30 In one embodiment of the system, the method described above can be implemented using a set of algorithms which will calculate the most likely identity of each SNP in a list of relevant SNPs, as well as a confidence level for each SNP call. 68 Described here is one possible way to implement the method disclosed in this patent. Figure 16 and Figure 17 visually represent the breakdown of this implementation of the disclosed method, the input requirements and the format of the output. Figure 16 focuses on the input data (1601) and its format and requirements, as 5 well as the output data (1605) and its format. Input to the algorithm consists of the measured data (1602), including input by the user, and existing data (1603) preserved in the database, that is consequently updated by the newly collected data. The measured data (MD, 1602) consists of the genetic data as measured for desired SNPs for the embryo, and the paternal and maternal alleles, as well as the accuracy, or confidence with which 10 each of the alleles is known. The existing data (1603) consists of the population frequency data (FD), measurement bias data (BD), and crossover data (CD). The population frequency data (FD) contains the allele frequency (for each of the values A,C,T,G) for each of the SNPs available. These data can be previously known or measured, and can be updated with newly collected data as described elsewhere in this 15 document. Measurement bias data (BD) captures the bias of the measurement process towards certain values. For example, assuming the true value of the allele is X=A, and probability of the correct measurement is px, the distribution of the measured value x is: A C T G Probability px Pc PT PG probability with no bias px (1-px)/ 3 (1-px)/ 3 (1-px)/3 20 where px +pc +PT +pG = 1. If there is no bias of measurement towards any of the values then PC= Pr = PG = (1-px)/ 3 . This information can be discerned from empirical and theoretical knowledge about the mechanism of the measurement process and the relevant 25 instruments. Crossover data (CD) consists of a database of genetic distances and crossover probabilities between pairs of snips, collected from HAPMAP data. Together, (MD), (FD), (BD), (CD) make up the necessary input to the disclosed method (termed 'Parental Support', 1604) algorithm. This algorithm (1604) then operates 30 on the input data to generate the output data (1605), which describes the most likely 69 "true" value of the target's genetic data given the measured values, as well as the most likely origin of each SNP in terms of the parental alleles. Figure 17 focuses on the structure of the algorithm itself (termed 'Parental Support') and how each of these input data are utilized by the algorithm. Working 5 backwards: to find the most likely hypotheses it is necessary to calculate P(HIM) 1707, the probability of the hypothesis given the measurement, for all the possible hypotheses H. As described previously: P(H I M) = P(M-| H) P(H), P(M)= E P(M h)P(h) P(M) heSH In order to find P(HIM) (1710), it is first necessary to find P(MIH) (1707), and P(H) 10 (1708), for all hypotheses H. This allows the calculation of P(M), 1709 by the equation shown above. The probability of the hypothesis P(H) 1708 depends on how many crossovers are assumed and the likelihood of each of these crossovers (CD, 1704), as explained above. P(MIH) can be calculated using the following equation: 15 P(MI H)= P(M I H & t)P(t), as explained previously. P(t), 1706 is the frequency of a particular value t for paternal and maternal alleles and is derived from population frequency data (FD, 1703). P(M|H&t), 1705 is the probability of correctly measuring the allele values of the embryo, the father, and the mother, assuming a particular "true" value t. The measurement data and accuracy entered 20 by the user (MD, 1701), and the measurement bias database (BD, 1702) are the inputs required to calculate P(MIH&t), 1705. A more detailed description of the method is given forthwith. Begin with SNPs R {ri,...,rk}, (a set of k SNPs), and the corresponding measured identities of parents and embryo, M =(ei,e 2 ,p1,p2,mI,m2), for k SNPs, identified with id's si,. ..,sk, where: 25 el= (ell,e12,-.,elk) is the measurement on one of the chromosomes of the embryo (they don't all have to come from the same parental chromosome) for all the SNPs e 2 = (e21,e22,.. .,e 2 ) is the measurement on the other chromosome of the embryo PI = (pli,P12,...,pik) is the measurement on the FIRST chromosome of the father (all coming from the same chromosome 30 p2= (p21,p22,...,p2k) is the measurement on the SECOND chromosome of the father (all coming from the same chromosome) 70 mi= (mnj,m 2 ,.. .,mk) is the measurement on the FIRST chromosome of the mother (all coming from the same chromosome) m2 = (m21,m22,. ..,m 2) is the measurement on the SECOND chromosome of the mother (all coming from the same chromosome) 5 One can also write M = {M 1 ,...,Mk} where Mi=(e,,e 2 i,puP2i) The goal of the method is to determine the "true" embryo valueT= (El,E2), i.e. the most likely case given the measurement M, where: Ei = (E,,Ei2,...,Eig) is the measurement on the FIRST chromosome of the embryo, corresponding to the PATERNAL chromosome, E, e {pl, P2} 10 E 2 = (E 2 1,E 22 ,...,E2k) is the measurement on the SECOND chromosome of the embryo, corresponding to the MATERNAL value, E2, e {ml,m} 21 One can also write T = {T,,...,Tk} where Ti = (EI,,E 2 1). Effectively, the parental chromosome values (pi,p 2 ,mI,m2) are being used as support to check, validate and correct measured values of (e,,e 2 ), hence the term "Parental 15 Support Algorithm". To achieve this goal, all the possible hypotheses for the origin of embryo values are developed and the most likely one is chosen, given the measurement M. The hypotheses space is SH = {H1,...,j= {set of all the hypotheses), where each hypothesis is of the format Hi = (H,... .H) where Hij is the "mini" hypothesis for SNP i, of the 20 format H'i = (pi*,mi*) where p,* e {p, p} and m,* e {m 1 ,m 2 }. There are 4 different "mini" hypotheses Hii, in particular: Hlil: (elu,e2i) = {(pu,mu) or (muI,p)} Hji2: (eu,,e2,) = {(puim2i) or (m2i,pu)} H',3: (eu,ez,) = {(p2i,mln) or (mu,p2)} 25 Hjj4: (eu,,ez) = {(p 2 ,,mz,) or (m 2 ,p 2 i)} In theory, SH can have q = 4 k different members to pick from, though later this space will be limited with a maximal number of crossovers of paternal and maternal chromosomes. The most likely hypothesis H* is chosen to be as: H*= arg max,_, P(H M) 30 For a particular H: 71 P(HIM)- P(M|H) X P(M | h)P(H) So deriving for each hypothesis: (1) P(M/H) is the probability of measurement M given the particular hypothesis H (2) P(H) is the probability of the particular hypothesis H 5 (3) P(M) is the probability of the measurement M After deriving P(HIM) for all H, the one with the greatest probability is chosen. Deriving P(M1H) Since measurements on each SNP are independent, for M = (Mi,...Mk) and the 10 particular hypothesis H=(H ,....Hk) on all k SNPs then: P(MIH)=P(MIH)*...*P(MIHk) For the particular SNP r, derive P(MrIHr). For Q = {A,C,T,G}X{A,C,T,G}X={A,C,T,G}X{A,C,T,G}, the space of all the possible values for "true" parent values (PIr,P2r,Mir,M2r), by Bayes formula is: 15 P(M,/H,)=>P(M,/H,&(P ,P ,M,,M2,)=t)*P((PP,M,,M,)= t) Deriving P(MIHr & (PinP 2 r,M,M 2 r) = t) Mr =(eir,e2r,pir,p2r,mIe,m2r) is a given measurement on this SNP. T=(EIr,E2r,PIr,P2r,MIr,M2r) is the supposed "true" value, for t = (Pir,P2r,MIr,M2r) and (Eir,E2r) fixed from T by hypothesis. (Eir is one of Pir,P 2 r, E2r is one of Mir,M2r) 20 P(M,=(ei,,e 2 ,pir,p2r,mi,,n,)T =(Ei,,E,,P,,,P,,M1,,M ))= P(e, / E,) * P(e, /E,)* P(p/ I P,1*P pr / P,)* P(mr IM1,)* PQM 2 , /M,) Given: pe=P(accurately measuring the embryo value i,on SNP r) ppri=P(accurately measuring the father value i,on SNP r) pm,-=P(accurately measuring the mother value i,on SNP r) P(e ir EI,) = pe 1 , = E. r 25 ' (1- pe,,*p(e., E,,r) e # El =I ,*pr1+Ies,*(l-p)*p(e 1
,,E
1 ,r) =F(er, Ei,pai,r) where p(eir,Eir,r) = 1/3 if there is no measurement bias, otherwise it can be determined from experimental data, such as data from the Hapmap project.. 72 Deriving P((P,P 2 ,M1,,M2r)=t) For t =(ti,t 2 ,t 3 ,t 4 ): P((P,P, MI,, IM2,) = (tIt 2 ,t2,t )) = P(P, = ti)* P(P, = t 2 )* PM 1 , = t 3 )* P(M2, =t) 5 Suppose there are n samples of (P 1
,P
2
,M
1
,M
2 ), all paternal and maternal values are assumed to be independent, and t =(t,t 2 ,t 3 ,t 4 ) for ti in {A,C,T,G} To get a particular PIA = P(PI - ti), for ti=A, assume that in absence of any data this probability could be anything between 0 and 1, so it is assigned a value of U(0,1). With the acquisition of data, this is updated with the new values and the distribution of 10 this parameter becomes a beta distribution. Suppose that out of n observations of P1, there are h values Pl=A, and w- (event Pi=A) and D=(data given). It is described in a prior section the form of the beta distribution B(a,p) with a = h+1, p = n-h+1 for p(wjData) (see equation (8)). The expected value and variance of X-B(ap) distribution are: EX = a a+#3 (a+#8)2(a+#8+1) 15 So the posterior mean value of the parameter pirA = P(Pir= AlData)= (h+1)/(n+2) Similarly pirB = (#(pir = B)+1)/(n+2),... m2,o = (#(m 2 r = G)+1)/(n+2), etc. Thus all the values pirA,...,m2ro have been derived and: P((P,,P, MiM,)= 1 (tt,t I t 4 ))= p, *P2, 2 *m,, * m 2 , 20 Deriving P(H) The probability of the hypothesis H = (Hi,...,Hk) with Hi = (pi*,mi*) depends on the amount of chromosome crossover. For example, with P(crossover) = 0 then P(H) = % and H = (p*,m*) if p* in{(pll,p21,...psl), (p12,p22,...,ps2), m* in {(m11,m21,...,msl),(m12,m22,...,ms2)}, 0 otherwise 25 with P(crossover)>0 it is important to incorporate the probability of crossover between each SNP. Hypothesis H consists of the hypothesis for paternal and maternal chromosomes for each SNP, p,*E{p 1 ,p,} and n,*e={m ,m},i.e. H = (Hp,Hm) where Hp=(pi* ,... *), and Hm=(m *,...mk*), which are independent. 73 P(H)= P(Hp)*P(Hm). Suppose that SNP are ordered by increasing location, k PCH,)= IF(PC,*(1-Il)+(1-PC,)*Il 1=2 where PC, = P(crossover(r-,r,)) i.e. probability of crossover somewhere between SNPs rI,r, and Ii =1 if pi*,pi.i* are coming both from p, or P2, and it is 0 otherwise. 5 Deriving P(crossover(a, b)) Given SNPs ab, at base locations la,lb (given in bases), the probability of crossover is approximated as: P(l, 1b)= 0.5(1 - exp(-2G(l.,lb 10 where G(la,lb) = genetic distance in Morgans between locations la,lb. There is no precise closed form function for G but it is loosely estimated as G(la,Ib) - la-lbl*le-8. A better approximation can be used by taking advantage of the HapMap database of base locations si, and distances G(si,sm± 1 ), for i spanning over all locations. In particular, .G(l.,ls)= XG(s ,,sm), so it can be used in crossover probability. 15 Deriving P(M) Once P(MIH) is known, P(H) can be found for all the different H in SH, P(M)= XP(M|H)P(H) HeS, 20 A more expedient method to derive the hypothesis ofimaximal probability Given the limitation of computer time, and the exponential scaling of complexity of the above method as the number of SNPs increases, in some cases it may be necessary to use more expedient methods to determine the hypothesis of maximal probability, and thus make the relevant SNP calls. A more rapid way to accomplish this follows: 25 From before: P(HIM) = P(MIH)*P(H)/P(M), argmax H P(HIM) = argmax H and P(MjH)*P(H) = argmax H F(M,1), and the object is to find H, maximizing F(M,H). Suppose M(,k)= measurement on snips s to k, H(,k) = hypothesis on snips s to k, and for shorts M,(kk) Mk, H(ok) = Hk= measurement and hypothesis on snip k. As shown 30 before: 74 k k-I P(M(,,) | H r 1 k) = P(M, I H,) =P(Mk I Hk )* P(M, | H,) =P(Mk I Hk) * P(M(I k_, | H _ )) and also P(H(,,k)) =1/4 * fj PF(H,,H,) = PF(Hk,, Hk) * 1/4 * J7 PF(H,, H) =PF(H,,,H,) *P(H where 1l-PC(H, 1 ,H,) H 1
=H
1 5 PF(H,,,H,)= I PC(-I , H,) H', 1 H PC(Hi,,H,) Hi,;tH, and PC(H.
1 ,Hi) = probability of crossover between Hi-i, Hi So finally, for n snips: F(M, H) = P(M I H) * P(H) = P(M( 1 ,) I H(,,)* PH og) = P(AMI, \ H ,, )* P(H(,,)) * P(M. I H, * PF(H,-,, H,) therefore: F(M, H) = F(M(I',) I He,,)) = F(M(I,, 1 ) H,) * P(M, I H)* PF(H,_,,H,) 10 Thus, it is possible to reduce the calculation on n snips to the calculation on n-i snips. For H = (Hi,...Hn) hypothesis on n snips: max F(M, H)= max F(M,(H(J,,), H,,)= max max F(M,(H(,,), H,,)= max G(M,, , H,,) where G(MI,) , H,) = max F(M,,,) , (H( 1 .), H.)= max F(M,(I.,I H,_ * P(M,, \ H,) * PF(H,,_,, H,) = P(M. | H * max F(M(...), H(,) * PF(H,_,, H.)= P(M. I H,)* max max F(M,, 1 _, (H( 1
,,
2 ), H.-.) * PF(H,. , H,)= P(M,, I H,) * max PF(H,,,H,,) * G(M(,,), H,,_) 15 In summary: maxF(M, H)= max G(M,,,),H,) H H. where G can be found recursively:, for i=2,..n G(M.,,), H.) = P(M,, IH) * rna4PF(H,-,, H,) * G(M(I,,) I H,_,)J and G(M(,) H,) = 0.25 * P(M, I H,). The best hypothesis can be found by following the following algorithm: 20 Step 1: For 1=1, generate 4 hypotheses for H 1 , make G(MiIHi) for each of these, and remember G 1
,G
2
,.G
3
,G
4 Step 2: For I =2: generate 4 hypothesis for H2, make G(M 1, 2 )fH 2 ) using the above formula: 75 G(M(1,, H 2 ) = P(M 2
H
2 ) * max[PF(H, H 2 ) * G(M, H,)], remember these new four G,. Repeat step 2 for I=k with k;=ki+l until k-n: generate 4 hypothesis for Hk, make G(M(l,k)|Hk) G(M,k), Hk) = P(Mk I Hk) * maxtPF(H,,-, Hk) * G(MlIk1), H,_,)] and remember these four ilk-I 5 G,. Since there are only four hypotheses to remember at any time, and a constant number of operations, the algorithm is linear. To find P(M): P(HIM)= P(MIH)*P(H)/P(M) = F(M,H)/P(M)) As above: P(M) = P(M,,k)) ZP(M(Ilk) IH(,,)* P(HHlk))* 10 = P(M, |H,) E P(M(Ik,,_) HIk- 1 ,)* P(H,l))* PF(HkI-,, Hk) P(M K I H,)* W(MIkI k,) H, where W(M,,, 1 _) I H) P(M,,k_,) I H(,,k-)) * P(H(k,,,)) * PF(HkI Hk) W(M,H) can be solved by using recursion: W(M,,k_ 1 ) I Hk)= Y P(Mk 1 ) I H(,k, 1 )) * P(H(Ik 1 )) * PF(Hk_, Hk) HMA-4 = P(M_, I Hk_ 1 ) Y P(M(l,k-2) I H( 1 k 2 ,) * P(H(,k--2)) * PF(H-2, H_.) * PF(Hk-_,Hk) = IP(M_, \ Hk_ 1 ) * PF(HkI, Hk) * W(M(.,-2) Hk_ 1 ) Therefore: W(M(I |H1) )= Z P(M, I Hk_.)* PF(Hk-_, H )* W(M( 1
,-
2 ) I H-,) 15 and W(M(,,I H 2 ) =ZP(M jH)*0.25* PF(H,,H 2 ) H, The algorithm is similar to the case above, where i=2:n and in each step a new set of W(i) are generated until the final step yields the optimized W. Deriving the pi, P2, ppI, PP2 values from di, d 2 , h, pdI, pd 2 , ph 20 For the purpose of explanation, this section will focus on the father's diploid and haploid data, but it is important to note that the same algorithm can be applied to the mother. Let: o di, d 2 - allele calls on the diploid measurements 76 o h- allele call on the haploid measurement o PdI, pd2-probabilities of a correct allele call on the each of the diploid measurements o Ph- probability of a correct allele call on the haploid measurement 5 These data should be mapped to the following input parameters for disclosed algorithm: o pi- allele corresponding to haploid cell and one of the diploid cells o p2- allele corresponding to the remaining diploid cell o pp, pp2- probabilities of correct allele call Since h corresponds to di, then to find the value of p, it is necessary to use h and di. Then 10 P2 will automatically correspond to d 2 . Similarly, if h corresponds to d 2 , then to find the value of pi it is necessary to use h and d 2 , and thne P2 will correspond to dI. The term "correspond" is used since it can mean either "be equal" or "originate with higher probability from" depending on different measurement outcomes and population frequency. 15 The goal of the algorithm is to calculate probabilities of "true" allele values hidden beyond results of raw measurement h, d 1 , d 2 , Ph, Pdi, pd2 and population frequencies. The basic algorithm steps are the following: (i) determine whether h corresponds to di or d 2 based on h, di, d 2 , pa, pdI, pd2 values and the population frequency data 20 (ii) assign the allele calls to pi and p2; calculate the probabilities pp1 and pp2 based on step (1) Assigning h to di or d2 Establish two hypotheses: 25 HI: h corresponds to d, (h originates from di)
H
2 : h corresponds to d 2 (h originates from d 2 ) The task is to calculate probabilities of these two hypotheses given the measurement M: P(HI/M(h,di,d 2 ,ph,pdpd 2 )) and P(H2/M(h,d, d2,PhPdPd 2 )) 30 (To simplify the text, these will be referred to as P(Hi/M) and P(H2/M)) hereafter. In order to calculate these probabilities, apply the Bayesian rule:
P(H
1 \M)=P(MI H) *P(H,); P(Hz IM)- P(MIH 2
)*P(H
2 ) P(M) P(M) 77 where P(M)=P(M/H)*P(Hi)+P(M/H 2
)*P(H
2 ). Since hypotheses HI and H2 are equally likely, P(Hi)=P(H 2 )=0.5, therefore: 5 P(H,\M)= P(MIH,) and P(H 2 IM) P(MIH) P(M| H) + P(M I H 2 )
P(MIH,)+P(MIH
2 ) In order to calculate P(M/Hj) and P(M/H 2 ), one must consider the set of all possible values of diploid outcomes di and d 2 , n ={AA,AC,...,GG}, i.e. any combination of A,C,TG, so called underlying states. When the hypotheses are applied to the 10 underlying states (i.e. accompany the assumed value of h based on hypothesis H, or H 2 , to values di and d 2 ), the following tables of all possible combinations (states S={sis2,...,si}) of "true values" H, Di and D 2 for h, di and d 2 , can be generated, respectively: Hypothesis HI: h=dj k={AA,AC,...,G Hypothesis H 2 : K G} h=d 2 ={AA,AC,...,GG } state H Di D2 state H Di D2 si A A A s1 A A A
S
2 A A C S2 C A C S3 A A T S3 T A T s4 A A G S 4 G A G Ss C C A S 5 A C A S6 C C C C C C
S
7 C C T S7 T C T S8 C C Gs G C G S9 T T A S 9 A T A sto T T C sio C T C si T T s T T T S12 T T G S12 G T G sj3 G G A S13 A G A 78 S14 G G C S14 C G C s1 5 G G T s 1 5 T G T S16 G G G S16 G G G Since the "true values" H, Di and D 2 are unknown, and only the raw measurement outcomes h, dj, d 2 , ph, Pdl, pd2, are known, the calculation of the P(M/Hj) and P(M/H 2 ) over the entire set Q must be performed in the following manner: 5 P(MI H,)= EP(M(h,d,d 2 )| H 1 & D 1
,D
2 ) *P(D,D 2 ) ~P(MI H 2 )= EP(M(h,d,d 2 )I H 2
&D,,D
2
)*P(D,D
2 ) (DI A 2 )ef If, for the purpose of the calculation, one assuems that di and d 2 , as well as PdI and pd2 are independent variables, it can be shown that: P(M I H)= P(M(h, d, d 2 )I H & D,D 2 )* P(D,D 2 )= ZP(M(h) H)*P(M(d) I D,))*P(M(d 2 ) D 2
)*P(D)*P(D
2 ) -S 10 Consider the first three terms under the last sum above: P(M(x)/X), for x in {h,di,d 2 }. The calculation of the probability of correct allele call (hitting the "true allele value") is based on measurement of outcome x given the true value of allele X. If the measured value x and the true value X are equal, that probability is px (the probability of correct measurement). If x and X are different, that probability is (1-px)/ 3 . For example, 15 calculate the probability that the "true value" C is found under the conditions that X=C, and the measured value is x=A. The probability of getting A is px. The probability of getting C, T or G is (l-p). So, the probability of hitting C is (1-px)/ 3 , since one can assume that C, T and G are equally likely. If the indicator variable IX is included in the calculation, where I,=1 if x=X and 20 I.=O if x#X, the probabilities are as follows: P(M(x)/X)=Ix=x*p+(I{x=x))*(1/3)*(l-px), x in {h,d,d 2 } Now consider the last two terms in P(MIHI). P(DI) and P(D 2 ) are population frequencies of alleles A,C,T and G, that may be known from prior knowledge. Consider the expression shown above for a particular state s 2 , given the particular 25 measurement M(h = A,d= G,d 2 = C): 79 P(M(h) I H)*P(M(d,) D,)*P(M(d 2 ) ID 2 )*P(D) *P(D) = =P(M(h)=A\H=A)*P(M(d,)=G|D, =A)*P(M(d 2
)=C\D
2 =C)*P(D, =A)*P(D 2 = C)= =Ph *((Ps1 3 )*P2 *f(D, =A)*f(D 2 = C) Similarly, calculate (1) given the particular measurement (in this case M(h=A,d 1 =G,d 2 =C)) for remaining 15 states and sum over the set 92. Now P(M/H 1 ) and P(M/H 2 ) have been calculated. Finally, calculate P(HI/M) and 5 P(Hi/M) as described before: P(H \M)= P(M I H,) PPM| HM)+P(M IH 2 ) P(H \M) = P(MIHZ) 2
P(MIH,)+P(MIH
2 ) Assigning the Allele Calls and Corresponding Probabilities 10 Now establish four different hypotheses: Hp2A: "true value" of P2 is A Hp2c: "true value" of p2 is C Hp2T: "true value" of p2 is T Hp2G: "true value" Of P2 is G 15 and calculate P(Hp2A/M), P(Hp 2 c/M), P(Hp21/M), P(Hp2o/M). The highest value determines the particular allele call and corresponding probability. Since the origin of p2 is unknown (it is derived from di with probability of
P(H
2 /M) and from d 2 with probability P(HI/M)), one must consider both cases that p2 allele originates from dl or d 2 . For Hypothesis HA , applying Bayes rule, give: 20 P(H 2 , I M) = P(H2|A I M, H 1 ) * P(H IM) + P(H,2 I M, H 2 ) * P(H 2 M) P(HI/M) and P(H 2 /M) have already been determined in step 1. By Bayes rule: P(H,2A I HIM) = P(M I HHp2A)*P(HI Hp 2 A) P(H,M) Since H 1 implies that p2 originates from dz: P(M | Hzp 2 A)= P(M(d 2 )| D 2 = A) =-(d,=D)* Pd2 + Id=D2} Pd2/3 25 P(HI,M/Hp 2 A)=P(M(d 2
)/D
2 =A)= I{d2=D2)*p2+(1-Ifd2=D2)*(1/ 3 )*(l-pd2), as described before. P(Hp2A)=P(D 2 =A)= fz(A), where faz(A) is obtained from population frequency data. 80 P(HI,M)=P(HI,M/Hp2A)*P(Hp2A)+P(HI,M/Hp 2 c)*P(Hp 2 c)+P(H,M/Hp2T)*P(Hp 2 T)+P(HI,M /Hp 2 G)*P(Hp2G) Similarly, calculate P(Hp2A&H 2 /M). P(Hp 2 A/M)=P(Hp 2 A&H1/M)+ P(Hp2A&H 2 /M), therefore the probability that p2 is equal to 5 A has been calcualted. Repeat the calculation for C,T, and G. The highest value will give the answer of p2 allele call and the corresponding probability. Assigning the allele call to p, (allele corresponding to the haploid cell and one of the diploid cells) 10 As before, we establish four different hypotheses: HpIA: "true value" of pi is A Hpic: "true value" of pi is C HpT: "true value" of pi is T Hpio: "true value" of pi is G 15 and calculate P(HpIA/M), P(Hpic/M), P(HpIT/M), P(Hpio/M) Here is an elaboration of HpIA. In the "true case" case, p1 will be equal to A only if the haploid and the corresponding diploid cell are equal to A. Therefore, in order to calculate pi and ppi one must consider situations where haploid and corresponding diploid cell are equal. So, the hypothesis HpIA: the "true value" of p, is A and becomes HMA: the 20 "true value" of the haploid cell and corresponding diploid cell is A. Since the origin of h is unknown (it is derived from d, with probability of P(Hi/M) and from d 2 with probability P(H 2 /M)), one must consider both cases, that h allele originates from di or d 2 , and implement that in determination of pl. That means, using Bayes rule: 25 P(H I M) = P(H\dA IM,HI)*P(H IM)+P(H IM,H2)*P(H 2 I M) As before, P(HI/M) and P(H 2 /M) are known from previous calculations. P(HAI H 1 ,M) =P(H,,M I HhdA)*P(HhdA) P(Hi,M) P(HI,M/HhdA)= P(M(h)/H = A)*P(M(di)/Di = A) = [I{h=H} *ph+(1-I{h=H}1)*(/ 3 )*(1ph)I *[I(dID}I*pdl+(1-ald=Dil))*(1/3)*(1-Pdi)], 30 since HI implies that pi originates from dj. P(HhA) = P(h = A)*P(DI = A) = fh(A)*fdl(A) ,where fh(A) and fd2(A) are obtained from population frequency data. P(Hi,M)= 81 P(HI,M/HhdA)*P(HhdA)P(HIM/Hhdc)*P(Hhdc)+P(Hi,M/IhdT)*P(HhdT)+P(H,M/HhdG)*P( HhdG) Similarly, calculate P(HhdA&H2/M). P(HhA/M) = P(HhdA&HI/M)+ P(HhdA&H 2 /M) and now we have calculated the probability 5 that pi is equal to A. Repeat the calculation for C,T, and G. The highest value will give the answer of p, allele call and corresponding probability. Example Input Two input examples are shown. The first example is of a set of SNPs with a low 10 tendency to cosegregate, that is, SNPs spread throughout a chromosome, and the input data is shown in Table 3. The second example is of a set of SNPs with a high tendency to cosegregate, that is SNPs clustered on a chromosome, and the input data is shown in Table 4. Both sets of data include an individual's measured SNP data, the individual's parents SNP data, and the corresponding confidence values. Note that this data is actual 15 data measured from actual people. Each row represent measurements for one particular SNP location. The columns contain the data denoted by the column header. The key to the abbreviations in the column headers is as follows: o familyjid = the unique id for each person (included for clerical reasons) o snpid = the SNP identification number 20 o el, e2 = the SNP nucleotide values for the embryo o pl, p2 = the SNP nucleotide values for the father o ml, m2 = the SNP nucleotide values for the mother o pel, pe2 = the measurement accuracy for el,e2 o ppl, pp2 = the measurement accuracy for p l,p2 25 o pml, pm2 = the measurement accuracy for ml,m2 Example Output The two examples of output data are shown in Table 5 and Table 6, and correspond to the output data from the data given in Table 3 and Table 4 respectively. 30 Both tables show an individual's measured SNP data, the individual's parents SNP data, the most likely true value of the individual's SNP data, and the corresponding confidences. Each row represents the data corresponding to one particular SNP. The 82 columns contain the data denoted by the column header. The key to the abbreviations in the column headers is as follows: o snp_id = the SNP identification number o true value = the proposed nucleotide value for el,e2 5 o true hyp = the hypothesis for the origin of e l,e2 o ee = the measured SNP nucleotide values for e 1,e2 o pp = the measured SNP nucleotide values for p l,p2 o mm= the measured SNP nucleotide values for ml,m2 o HypProb = the probability of the final hypothesis. There is only one number for 10 the output, but due to the excel column structure, this number is replicated in all rows. Note that this algorithm can be implemented manually, or by a computer. Table 3 and Table 4 show examples of input data for a computer implemented version of the method. Table 5 shows the output data for the input data shown in Table 3. Table 6 shows the output data for the input data shown in Table 4. 15 Simulation Algorithm Below is a second simulation which was done to ensure the integrity of the system, and to assess the actual efficacy of the algorithm in a wider variety of situations. In order to do this, 1,000 full system simulations were run. This involves randomly 20 creating parental genetic data, emulating meiosis in silico to generate embryonic data, simulating incomplete measurement of the embryonic data, and then running the method disclosed herein to clean the simulated measured embryonic data, and then comparing that "cleaned" data with the "real" data. A more detailed explanation of the simulation is given below, and the visual representation of the flow of events is given in Figure 18. 25 Two different implementations of the theory were tested. A fuller explanation is given below. Simulation algoritluns for DH and PS and results For both algorithms, the initial input variables are: 30 (i) the list of SNPs to test, (ii) the population frequency of the maternal (popfreqlistMM) and paternal (popfreqlistPP) chromosomes, 83 (iii) the probabilities of a correct allele call for haploid measurement (phpe), and for unordered diploid measurements (pd). These values should be fixed based on the results from empirical data (population frequency) on relevant SNPs, and from measuring instrumentation performance 5 (ph,pd,pe). The simulation was run for several scenarios such as most likely (informed), uniform (uninformed) and very unlikely (extreme case). Once the above static parameters are fixed , crossover probabilities given the particular SNPs are the same for all the simulations, and will be derived ahead of the time given the databases for snip location(SNIPLOCNAME_MAT) and genetic distance 10 (HAPLOCNAMEMAT). [crossprob,snips]= GetCrossProb(snips,SNIPLOCNAMEMAT,parameters,HAPLOCNAMEMAT); Preliminary Simulation Loop 15 The preliminary simulation loop is to demonstrate that the genetic data that will be used for the full simulation is realistic. Steps 1 through 5 were repeated 10,000 times. Note that this simulation can be run for either or both parents; the steps are identical. In this case, the simulation will be run for the paternal case for the purposes of illustration, and the references to Figure 18 will also include the corresponding maternal entry in 20 Figure 18 in parentheses. Step 1: Generate originalparental diploid cells (PI,P2), [P1,P2]=GenerateOriginalChromosomes(snips,popfreqlistPP); 1801 (1802) Generate original paternal cells depending on the population frequency for each SNP 25 for father cells. Step 2: Generate haploid and unordered diploid data for DHAlgo Simulate crossover of the parental chromosomes 1803 to give two sets of chromosomes, crossed over: PIC1, P2CI and P1C2; P2C2; 1804 (1805). Pick one of the 30 father alleles after the crossover 1806 (from the first set) for haploid allele HP 1807 (1808) in this case P1 (since there is no difference which one), and mix up the order in the diploid alleles to get (D1P,D2P) 1807 (1808). HP = PickOne(P1C1,P2C1); 84 [D1P,D2P]= Jumble(P1,P2). Step 3: Introduce error to the original dataset in order to simulate measurements Based on given probabilities of correct measurement (ph-haploid, pd- diploid 5 measurement), introduce error into the measurements to give the simulated measured parental data 1811 (1812). hp = MakeError(HPph); dip MakeError(D1P,pd); d2p MakeError(D2P,pd). 10 Step 4: Apply DHAIgo to get (plp2), (ppI,pp 2 ) DHAlgo takes alleles from haploid cell and an unordered alleles from diploid cell and returns the most likely ordered diploid alleles that gave rise to these. DHAIgo attempts to rebuild (P1,P2), also returns estimation error for father (ppl,pp2). For comparison, the 15 empirical algorithm that does simple allele matching is also used. The goal is to compare how much better is the disclosed algorithm, compared to the simple empirical algorithm. [p1, p2, ppl, pp2] =DHAlgo(hp,dlp,d2p,phpd,snips,popfreqlistPP,'DH'); [p1s,p2s,ppls,pp2s]=DHAlgo(hp,dp,d2p,ph,pd,snips,popfreqlistPP,'ST); 20 Step 5: Collect statistics for the run Compare (P1,P2) to derived (pl,p2). [Plcmp( :,i), P2cmp( :,i),Plprob( :,i), P2prob( :,i),Plmn(i), P2mn(i)]=DHSimValidate(P1,P2,pl, p2,pplpp2); Note: (P1Sj ,P2Sj,P1PjP2Pj,P1AjP2Aj)= (I(PI=pi), I(p2=p2), pplpp2,place, p 2 ace), where 25 Itp1=p1} is binary indicator array for estimation of DH algorithm accuracy for all the SNPs, similarly, for Ip2.p2). pp,pp2 are probabilities of a correct allele call derived from the algorithm, and placc = mean(Ir1=pi)), i.e. average accuracy for this run for pl, similar for p 2 acc 30 Preliminary Simulation Results Ten thousand simulations were used to estimate the algorithm accuracy DHAccuracy.P1 = mean(P1Ai), DHAccuracy.P2= mean(P2A), which shows the overall accuracy of the DH algorithm from P1,P2. On an individual SNP basis, the average 85 accuracy on each SNP SNPAcc.P1 = mean(PlS;) should agree with the average of the estimated probability of correctly measuring that SNP, SNPProb.PI = mean(P2Pi), i.e. if the algorithm works correctly, the value for SNPAcc.Plshould correspond closely to SNPProb.Pl. The relationship between these two is reflected by their correlation. 5 The 10000 loops of the simulation were run for different setup scenarios: (1) The underlying population frequency was given by existing genotyping data which is more realistic, and uniform population frequencies where A,C,TG have the same probability on each SNP. (2) Several combinations for measurement accuracy for haploid and unordered diploid 10 measurements (ph,pd). Varying assumptions were made; that the measurements are both very accurate (0.95,0.95), less accurate (0.75,0.75) and inaccurate or random (0.25,0.25), as well as unbalanced combinations of (0.9, 0.5), (0.5, 0.9). What might be closest to reality may be accuracies of approximately 0.6 to 0.8. (3) The simulation was run in all these cases for both the DHAlgorithm and simple 15 matching STAlgorithm, in order to assess the performance of the disclosed algorithm. The results of all these runs are summarized in Table 7. The disclosed algorithm is performs better than the existing empirical algorithm in these simulations, especially for the realistic cases of non-uniform population frequency, 20 and unbalanced or reduced probabilities of correct measurements. It has also been confirmed that our estimates of the algorithm accuracy for individual SNPs are very good in these cases, since the correlation between the estimated accuracy of correct allele call and simulation average accuracy is around 99%, with average ratio of 1. In the most realistic case, for data population frequency and (ph,pd) = (0.6, 0.8), 25 the average percent of correctly retrieved SNPs for (Pl,P2) is (0.852, 0.816) in implementation 1, and (0.601, 0.673 in implementation 2. . Note that for Table 7 and Table 8 the rows beginning with "data" use population frequency data was taken from empirical results, while the rows beginning with "uniform" assume uniform populations. 30 It is important to note that in Table 7 and Table 8 the accuracy is defined as the average percent of SNPs where the correct SNP call was made and the correct chromosome of origin was identified. It is also important to note that these simulations reflect two possible implementations of the algorithm. There may be other ways to 86 implement the algorithm that may give better results. This simulation is only meant to demonstrate that the method can be reduced to practice. Full Simulation Loop 5 Steps 1-8 were repeated 10000 times. This is the simulation to test the full disclosed method to clean measured genetic data for a target individual using genetic data measured from related individuals, in this case, the parents. Step 1: Generate original parental diploid cells (PJ,P2), (MI,M2) 10 [P1,P2]=GenerateOriginalChromosomes(snips,popfreqlistPP); (1801) [M1,M2]=GenerateOriginalChromosomes(snips,popfreqlistMM); (1802) Generate original parental cells depending on the population frequency for each SNP for mother and father cells. 15 Step 2: Crossover parental cells (PIC,P2C), (M1C,M2C) (1803) Generate two sets of paternal cells with crossovers: first to get (P1C1,P2C1) used in DHAlgo, and second time to get (PI C2,P2C2) used in PSAIgo. (1804) Generate two sets of maternal cells with crossovers: first to get (MIC1,M2C1) used in DHAlgo, and (MIC2,M2C2) used in PSAlgo. (1805) 20 .[P1C1,P2Cl]=Cross(P1,P2,snips,fullprob); [P1C2,P2C2]=Cross(P1,P2,snips,fullprob); [MLC1,M2C1]=Cross(M1,M2,snips,fullprob); [MIC2,M2C2]=Cross(MI,M2,snips,fullprob); 25 Step 3 Make haploid cell and unordered diploid cells for DHAlgo (1806) Pick one of the sets of paternal cells (1804, first set) for haploid cell HP, and mix up the order in the diploid cell to get (DlP,D2P) (1807). Do the same for mother cells (1805, first set) to get NM, (D1M,D2M). (1808). HP = PickOne(PIC1,P2C1); 30 HM= PickOne(MlC1,M2C1); [D1P,D2P]= Jumble(P1,P2); [D1M,D2M] = Jumble(M1,M2); 87 Step 4: Make diploid embryo cell (1809) Pick one of the paternal cells (1804, second set) and one of the maternal cells (1805, second set) for embryo cell. Mix up the order for measurement purposes. El = PickOne(PlC2,P2C2); 5 E2= PickOne(M1C2,M2C2); [E1J,E2J] = Jumble(E1,E2); (1810) Step 5: Introduce error to the measurements (1811, 1812, 1813) Based on given measurement error (ph-haploid cells, pd- unordered diploid cells, pe 10 embryo cells), introduce error into the measurements. hp = MakeError(HP,ph); (1811) dlp =MakeError(DlP,pd); (1811) d2p = MakeError(D2P,pd); (1811) hm = MakeError(HM,ph); (1812) 15 dlm=MakeError(DlM,pd); (1812) d2m = MakeError(D2M,pd); (1812) el = MakeError(E1J,pel); (1813) e2 = MakeError(E2J,pe2); (1813) 20 Step 6: Apply DHAlgo to get (pl,p2),(mn2), (pplpp2),(pmlpm2) DHAlgo takes a haploid cell and an unordered diplod cell and returns the most likely ordered diploid cell that gave rise to these. DHAlgo attempts to rebuild (PlC1,P2C1) for father and (M1C1,M2C1) for mother chromosomes, also returns estimation error for father (ppl,pp2) and mother (pml,pm2) cells. 25 [p 1 ,p2,ppl,pp2l=DHAlgo(hp,dlp,d2p,snips,popfreqlistPP); (1814) [ml,m2,pml,pm2]=DHAIgo(hm,dlm,d2m,snips,popfreqistMMlv); (1815) Step 7: Apply PSAlgo to get (DEJ,DE2) (1816) PSAIgo takes rebuilt parent cells (p1,p2,ml,m2) and unordered measured embryo cell 30 (el,e2) to return most likely ordered true embryo cell (DE1,DE2). PS Algo attempts to rebuild (E1,E2). [DE1,DE2,alldata]=PSAIgo(snips,el,e2,pl,p2,ml,n2,pe,pp1,pp2,pml,pm2,parameter s,crossprob,popfreqlistPP,popreqlistMM); 88 Step 8: Collect desired statistics from this simulation run Get statistics for the run: simdata=SimValidate(alldata,DE1,DE2,P1,P2,MI,M2,E1,E2,pl,p2,ml,m2,el,e2,pe,pe,pp 5 1,pp2,pml,pm2); Simulation results Ten thousand simulations were run and the final estimates for the algorithm accuracy PSAccuracy.El = mean(ElAi), PSAccuracy.E2 = mean(E2Ai), which tells us 10 the overall accuracy of the PS algorithm from E1,E2 were calculated. On an individual SNP basis, the average accuracy on each SNP SNPAcc.El = mean(ElSi) should agree with the average of the estimated probability of correctly measuring that SNP, SNPProb.El = mean(E2Pi), i.e. if the algorithm is written correctly, then SNPAcc.El should be observed to correlate to SNPProb.El. The relationship between these two is 15 reflected by their correlation. Ten thousand loops of the simulation has been run for different setup scenarios: (1) Underlying population frequency given by existing genotyping data which is more realistic, and uniform population frequencies where A,C,T,G have the same probability on each SNP. 20 (2) Several combinations of measurement accuracy for haploid, unordered diploid and embryo measurements (phpd,pe). A variety of accuracies were simulated: very accurate (0.95,0.95,0.95), less accurate (0.75,0.75,0.75) and inaccurate or random (0.25,0.25,0.25), as well as unbalanced combinations of (0.9, 0.5,0.5), (0.5, 0.9,0.9). What may be closest to reality is approximately (0.6,0.8,0.8). 25 (3) We ran the simulation in all these cases for both our PSAlgorithm and simple matching STPSAlgorithm, in order to assess the performance of the disclosed algortihm. The results of these runs are summarized in the Table 8. The disclosed algorithm is performs better than the existing empirical algorithm in 30 these simulations, especially for the realistic cases of non-uniform population frequency, and unbalanced or reduced probabilities of correct measurements. It has also been shown that the estimates of the algorithm accuracy for individual SNPs are very good in these 89 cases, since the correlation between the estimated accuracy of correct allele call and simulation average accuracy is around 99% , with average ratio of 1. In the most realistic case, for data population frequency and (ph,pd,pe)= (0.6, 0.8, 0.8), the average percent of correctly retrieved SNPs for (El,E2) is (0.777, 0.788) in 5 implementation 1 and (0.835, 0.828) in implementation 2.. As mentioned above, the number denoting the average accuracy of algorithm refers not only to the correct SNP call, but also the identification of correct parental origin of the SNP. To be effective, an algorithm must return better results than the algorithm that simply accepts the data as it is measured. One might be surprised to see that in some cases, the accuracy of the 10 algorithm is lower than the listed accuracy of measurement. It is important to remember that for the purposes of this simulation a SNP call is considered accurate only if it is both called correctly, and also its parent and chromosome of origin is correctly identified. The chance of getting this correct by chance is considerably lower than the measurement accuracy. 15 Laboratory Techniques Necessary for Obtaining Prenatal and Embryonic Genetic Material There are many techniques available allowing the isolation of cells and DNA fragments for genotyping. The system and method described here can be applied to any 20 of these techniques, specifically those involving the isolation of fetal cells or DNA fragments from maternal blood, or blastocysts from embryos in the context of IVF. It can be equally applied to genomic data in silico, i.e. not directly measured from genetic material. In one embodiment of the system, this data can be acquired as described below. 25 Isolation of cells Adult diploid cells can be obtained from bulk tissue or blood samples. Adult diploid single cells can be obtained from whole blood samples using FACS, or fluorescence activated cell sorting. Adult haploid single sperm cells can also be isolated 30 from sperm sample using FACS. Adult haploid single egg cells can be isolated in the context of egg harvesting during IVF procedures. Isolation of the target single blastocysts from human embryos can be done following techniques common in in vitro fertilization clinics. Isolation of target fetal cells 90 in maternal blood can be accomplished using monoclonal antibodies, or other techniques such as FACS or density gradient centrifugation. DNA extraction also might entail non-standard methods for this application. Literature reports comparing various methods for DNA extraction have found that in 5 some cases novel protocols, such as the using the addition of N-lauroylsarcosine, were found to be more efficient and produce the fewest false positives. Amplification of genondc DNA Amplification of the genome can be accomplished by multiple methods inluding: 10 ligation-mediated PCR (LM-PCR), degenerate oligonucleotide primer PCR (DOP-PCR), and multiple displacement amplification (M]DA). Of the three methods, DOP-PCR reliably produces large quantities of DNA from small quantities of DNA, including single copies of chromosomes; this method may be most appropriate for genotyping the parental diploid data, where data fidelity is critical. MDA is the fastest method, producing 15 hundred-fold amplification of DNA in a few hours; this method may be most appropriate for genotyping embryonic cells, or in other situations where time is of the essence. Background amplification is a problem for each of these methods, since each method would potentially amplify contaminating DNA. Very tiny quantities of contamination can irreversibly poison the assay and give false data. Therefore, it is 20 critical to use clean laboratory conditions, wherein pre- and post- amplification workflows are completely, physically separated. Clean, contamination free workflows for DNA amplification are now routine in industrial molecular biology, and simply require careful attention to detail. 25 Genotyping assay and hybridization The genotyping of the amplified DNA can be done by many methods including molecular inversion probes (MIPs) such as Affymetrix's Genflex Tag Array, microarrays such as Affymetrix's 500K array or the Illumina Bead Arrays, or SNP genotyping assays such as AppliedBioscience's Taqman assay. The Affymetrix 500K array, 30 MIPs/GenFlex, TaqMan and Illumina assay all require microgram quantities of DNA, so genotyping a single cell with either workflow would require some kind of amplification. Each of these techniques has various tradeoffs in terms of cost, quality of data, quantitative vs. qualitative data, customizability, time to complete the assay and the 91 number of measurable SNPs, among others. An advantage of the 500K and Illumina arrays are the large number of SNPs on which it can gather data, roughly 250,000, as opposed to MIIPs which can detect on the order of 10,000 SNPs, and the TaqMan assay which can detect even fewer. An advantage of the MIPs, TaqMan and Illumina assay 5 over the 500K arrays is that they are inherently customizable, allowing the user to choose SNPs, whereas the 500K arrays do not permit such customization. In the context of pre-implantation diagnosis during IVF, the inherent time limitations are significant; in this case it may be advantageous to sacrifice data quality for turn-around time. Although it has other clear advantages, the standard MIPs assay 10 protocol is a relatively time-intensive process that typically takes 2.5 to three days to complete. In MIPs, annealing of probes to target DNA and post-amplification hybridization are particularly time-intensive, and any deviation from these times results in degradation in data quality. Probes anneal overnight (12-16 hours) to DNA sample. Post-amplification hybridization anneals to the arrays overnight (12-16 hours). A number 15 of other steps before and after both annealing and amplification bring the total standard timeline of the protocol to 2.5 days. Optimization of the MIPs assay for speed could potentially reduce the process to fewer than 36 hours. Both the 500K arrays and the Illumina assays have a faster turnaround: approximately 1.5 to two days to generate highly reliable data in the standard protocol. Both of these methods are optimizable, and 20 it is estimated that the tum-around time for the genotyping assay for the 500k array and/or the Illumina assay could be reduced to less than 24 hours. Even faster is the Taqman assay which can be run in three hours. For all of these methods, the reduction in assay time will result in a reduction in data quality, however that is exactly what the disclosed invention is designed to address. Some available techniques that are faster are not 25 particularly high-throughput, and therefore are not feasible for highly parallel prenatal genetic diagnosis at this time. Naturally, in situations where the timing is critical, such as genotyping a blastocyst during IVF, the faster assays have a clear advantage over the slower assays, whereas in cases that do not have such time pressure, such as when genotyping the 30 parental DNA before IVF has been initiated, other factors will predominate in choosing the appropriate method. For example, another tradeoff that exists from one technique to another is one of price versus- data quality. It may make sense to use more expensive techniques that give high quality data for measurements that are more important, and less 92 expensive techniques that give lower quality data for measurements where the fidelity is not critical. Any techniques which are developed to the point of allowing sufficiently rapid high-throughput genotyping could be used to genotype genetic material for use with this method. 5 A Contextual Example of the Method An example of how the disclosed method may be used in the context of an IVF laboratory that would allow full genotyping of all viable embryos within the time constraints of the IVF procedure is described here. The turn-around time required in an 10 IVF laboratory, from egg fertilization to embryo implantation, is under three days. This means that the relevant laboratory work, the cleaning of the data, and the phenotypic prediction must be completed in that time. A schematic diagram of this system is shown in see Figure 19, and decribed here. This system may consist of parental genetic samples 1901 from IVF user (mother) 1902 and IVF user (father) 1903 being analyzed at IVF lab 15 1904 using a genotyping system. It may involve multiple eggs that are harvested from the mother 1902 and fertilized with sperm from the father 1903 to create multiple fertilized embryos 1905. It may involve a laboratory technician extracting a blastocyst for each embryo, amplifying the DNA of each blastocyst, and analyzing them using a high throughput genotyping system 1906. It may involve sending the genetic data from the 20 parents and from the blastocyst to a secure data processing system 1907 which validates and cleans the embryonic genetic data. It may involve the cleaned embryonic data 1908 being operated on by a phenotyping algorithm 1909 to predict phenotype susceptibilities of each embryo. It may involve these predictions, along with relevant confidence levels, being sent to the physician 1910 who helps the IVF users 1902 and 1903 to select 25 embryos for implantation in the mother 1901. Miscellaneous Notes Relating to Cleaning of Genetic Data It is important to note that the method described herein concerns the cleaning of genetic data, and as all living creatures contain genetic data, the methods are equally 30 applicable to any human, animal, or plant that inherits chromosomes from parents. The list of animals and plants could include, but is not limited to: gorillas, chimpanzees, bonobos, cats, dogs, pandas, horses, cows, sheep, goats, pigs, cheetahs, tigers, lions, salmon, sharks, whales, camels, bison, manatees, elk, swordfish, dolphins, armadillos, 93 wasps, cockroaches, worms, condors, eagles, sparrows, butterflies, sequoia, corn, wheat, rice, petunias, cow's vetch, sun flowers, ragweed, oak trees, chestnut trees, and head lice. The measurement of genetic data is not a perfect process, especially when the sample of genetic material is small. The measurements often contain incorrect 5 measurements, unclear measurements, spurious measurements, and missing measurements. The purpose of the method described herein is to detect and correct some or all of these errors. Using this method can improve the confidence with which the genetic data is known to a great extent. For example, using current techniques, uncleaned measured genetic data from DNA amplified from a single cell may contain between 20% 10 and 50% unmeasured regions, or allele dropouts. In some cases the genetic data could contain between 1% and 99% unmeasured regions, or allel dropouts. In addition, the confidence of a given measured SNP is subject to errors as well. In a case where the uncleaned data has an allele dropout rate of approximately 50%, it is expected that after applying the method disclosed herein the cleaned data will 15 have correct allele calls in at least 90% of the cases, and under ideal circumstances, this could rise to 99% or even higher. In a case where the uncleaned data has an allele dropout rate of approximately 80%, it is expected that after applying the method disclosed herein the cleaned data will have correct allele calls in at least 95% of the cases, and under ideal circumstances, this could rise to 99.9% or even higher. In a case where the 20 uncleaned data has an allele dropout rate of approximately 90%, it is expected that after applying the method disclosed herein the cleaned data will have correct allele calls in at least 99% of the cases, and under ideal circumstances, this could rise to 99.99% or even higher. In cases where a particular SNP measurement is made with a confidence rate close to 90%, the cleaned data is expected to have SNP calls with confidence rate of over 25 95%, and in ideal cases, over 99%, or even higher. In cases where a particular SNP measurement is made with a confidence rate close to 99%, the cleaned data is expected to have SNP calls with confidence rate of over 99.9%, and in ideal cases, over 99.99%, or even higher. It is also important to note that the embryonic genetic data that can be generated 30 by measuring the amplified DNA from one blastomere can be used for multiple purposes. For example, it can be used for detecting aneuploides, uniparental disomy, sexing the individual, as well as for making a plurality of phenotypic predictions. Currently, in IVF laboratories, due to the techniques used, it is often the case that one blastomere can only 94 provide enough genetic material to test for one disorder, such as aneuploidy, or a particular monogenic disease. Since the method disclosed herein has the common first step of measuring a large set of SNPs from a blastomere, regardless of the type of prediction to be made, a physician or parent is not forced to choose a limited number of 5 disorders for which to screen. Instead, the option exists to screen for as many genes and/or phenotypes as the state of medical knowledge will allow. With the disclosed method, the only advantage to identifying particular conditions to screen for prior to genotyping the blastomere is that if it is decided that certain PSNPs are especially relevant, then a more appropriate set of NSNPs which are more likely to cosegregate with 10 the PSNPs of interest, can be selected, thus increasing the confidence of the allele calls of interest. Note that even in the case where SNPs are not personalized ahead of time, the confidences are expected to be more than adequate for the various purposes described herein. 15 Phenotypic and Clinical Prediction There are many models available for predicting phenotypic data from genotypic and clinical information. Different models are more appropriate in different situations, based on the amount and type of data available. In order to choose the most appropriate method for phenotype prediction, it is often best to test multiple methods on a set of 20 testing data, and determine which method provides the best accuracy of predictions when compared to the measured outcomes of the test data. Certain embodiments described herein include a set of methods which, when taken in combination and selected based on performance with test data, will provide a high likelihood of making accurate phenotypic predictions. First, a technique for genotype-phenotype modeling in scenario (ii) using 25 contingency tables is described. Next, a technique for genotype-phenotype modeling in scenario (iii) using regression models built by convex optimization is described. Then, a technique for choosing the best model given a particular phenotype to be predicted, a particular patient's data, and a particular set of data for training and testing a model is described. 30 The Data of Today: Modeling Phenotypic Outcomes based on Contingency Tables In cases where there are known genetic defects and alleles that increase the' probability of disease phenotype, and where the number of predictors is sufficiently 95 small, the phenotype probability can be modeled with a contingency table. If there is only one relevant genetic allele, the presence/absence of a particular allele can be described as A+/A- and the presence absence of a disease phenotype as D+/D-. The contingency table containing (fi, NI, f 2 , N 2 ) is: 5 D+ D- # 2 1
N
2 (NI +N 2 )(P,(l-P 2
)-(~P
1
)P
2
)
2 G+ f 14f 1 N (pN, + P 2
N
2 )((C - P)N. +(1- P 2
)N
2 ) G- f 2 1-f 2
N
2 Where f, and f 2 represent the measured frequencies or probabilities of different outcomes and the total number of subjects is N=N 1
+N
2 . From 10 this table, the odds ratio for the probability of having disease state D+ in the two cases of having independent variable (IV) G+ or G- can be reported as OR = fi(1-f 2 )/f 2 (l-fi) with a with 95% confidence interval: ORl±.
9 6 1 s, where S is a standard deviation. For example, using a study of breast cancer in 10,000 individuals, where M+ represents the presence of BRCA1 or BRCA2 allele: 15 D+ D- # M+ .563 .437 1720 M- .468 .532 8280 This data results in an odds ratio, OR = 1.463, with confidence interval [1.31; 1.62], which 20 can be used to predict the increased probability of the occurrence of breast cancer with the given mutation. Note that contingency tables greater than two by two can be used to accommodate more independent variables or outcome variables. For example, in the case of breast cancer, the contingencies M+ and M- could be replaced with the four contingencies: BRCAI and BRCA2, BRCA1 and not BRCA2, not BRCA1 and BRCA2, 25, and finally not BRCA1 and not BRCA2. It is well understood by those knowledgeable in the art how to determine confidence intervals for contingency tables greater than two by two. This technique will be used when there are few enough IVs and enough data to build models with low standard deviations by counting the patients in different groups defined by different contingencies of the independent variables. This approach avoids the 30 difficulty of designing a mathematical model that relates the different IV's to the outcome that is to be modeled, as is needed when constructing a regression model. 96 Note that the genetic data from particular SNPs may also be projected onto other spaces of independent variables, such as in particular the different patterns of SNPs that are recognized in the HapMap project. The HapMap project clusters individuals into bins, and each bin will be characterized by a particular pattern of SNPs. For example, consider 5 one bin (B 1) has a SNP pattern that contains BRCA1 and BRCA2, another bin (B2) has a SNP pattern that contains BRCA1 and not BRCA2, and a third bin contains a SNP pattern (B3) that is associated all other combinations of mutations. Rather than creating a contingency table representing all the different combinations of these SNPs, one may create a contingency table representing the contingencies B 1, B2, and B3. 10 Note furthermore that the tendency of certain SNPs to occur together, as described by the HapMap project, can be used to create models that use multiple SNPs as predictors, even then the data consists of separate groups of patients where each group has had only one of the SNPs measured. This problem is commonly encountered when creating models from publicly available research papers, such as those available from 15 OMIM, where each paper contains data on a cohort that has only one relevant SNP measured, although multiple SNPs are predictive of the phenotype. In order to illustrate this aspect which is useful for building predictive models using data available today, specific reference is made to Alzheimer's disease for which predictive models can be built based on the lVs: family history of Alzheimer's, gender, race, age, and the various 20 alleles of three genes, namely APOE, NOS3, and ACE. In the context of this disease, a pervasive issue that applies to many diseases beyond Alzheimers is discussed: although many genes are involved in determining propensity for a particular phenotype, the vast majority of historical studies only sampled the alleles of a particular gene. In the case of Alzheimers disease, almost all study cohorts have only one gene sampled, namely APOE, 25 NOS3, or ACE. Nonetheless, it is important to build models that input multiple genetic alleles even when the majority of available data comes from studies where only one gene is investigated. This problem is addressed in one aspect which is illustrated by considering a simplified case of two phenotype states and only two independent variables representing two relevant genes, each with just two states. Given a random variable 30 describing the disease phenotype DE[D+,D-], and two random variables describing the genes AE [A+, A-] and BE [B+, B] the goal is to find the best possible estimate of P(D/A,B). This can be found by applying Bayes Rule using P(D/A,B)=P(A,B/D)P(D)/P(A,B). P(D) and P(AB) are available from public data. 97 Specifically, P(D) refers to the overall prevalence of the disease in the population and this can be found from publicly available statistics. In addition, P(A,B) refers to the prevalence of particular states of the genes A and B occurring together in an individual and this can be found from public databases such as the HapMap Project which has 5 measured many different SNPs on multiple individuals in different racial groups. Note that in a preferred embodiment, all of these probabilities will be computed for particular racial groups and particular genders, for which there are probability biases, rather than for the whole human population. Once these probabilities have been determined, the challenge comes from accurately estimating P(A,B/D) since the majority of cohort data 10 provides estimates of P(A/D) and P(B/D). Relevant information can be found in various public databases, such as the HapMap Project, about the statistical associations between different genetic alleles i.e. about P(A/B). However, given only P(A/B), P(A/D), P(B/D) still nothing can be said of P(A,B/D) since there is an unconstrained degree of freedom. Nonetheless, if some information is known about P(A,B/D) from a cohort for which both 15 genes A and B were sampled, even for just a single contingency such as (A-,B-) then the wealth of information about P(A/D), P(B/D), P(A/B) may be leveraged to improve estimates of P(AB/D). This concept will be illustrated using contingency tables. Consider the two contingency tables below, representing the probabilities of outcomes D+ and D- subject to the genetic states A+ and A-. This study is referred to as 20 A. The measured frequencies for A are referred to with f and the actual probabilities that one seeks to estimate are referred to with p. A D+D A+ f_ 2_ A D+ D A+ pj2 A- P3 P4 25 where f 3 =1-f 1 , f 4 =1-f 2 and P3=1-PI , p4-=l-p2. Let KI represent the number of subjects in the case group for A, that is, the number of subjects that have outcome D+. Let K 2 be the number in the control group for A, that is, the number of subjects that have outcome D-. Similarly, consider the two contingency tables below, representing the probabilities of outcomes D+ and D- subject to the genetic states B+ and B-. This study is 98 referred to as B. The measured frequencies are referred to with f and the actual probabilities that one seeks to estimate are referred to with p. B D+ D B+ f5 F6 B- f 7
F
8 B D+ D Bi- P7 ps 5 where f 7 =1-f 5 , f 8 =1-f 6 and p7=1-ps , PS=1-p6. Let K 3 represent the number in the case group for B and let K4 be the number in the control group for B. The contingency tables above represent trials where the genetic states A and B are measured separately. However, the contingency table that is ideally sought out involves the different states of A and B combined. The contingency table is shown below for a hypothetical study, 10 referred to as AB, where f represents the measured probabilities and p represents the actual probabilities. AB_ D+ D A+B+ f, flo A+B- ful fiz A-B+ f 1 3 f 1 4 A-B- fis f16 AB D+ D A+B+ 9q po A+B- pil p1z A-B+ ps p14 A-B- p1s p16 15 where fis=1-f 9 -f 1 -f 13 , fi16 -f 10 -f 1 2 -f 14 and P15 = I-P9-PII-pI3, P16 1-PIo-Pi2-PI4 Let K5 be the number in case group for AB and let K 6 be the number in control group for AB. For notational purposes, note that K 7 = K9 =K5 and Ks = K1o = K 6 .So in fact, group sizes 20 are: B K3 K4 AB Ks K& 99 Basic rules of statistics may be used to enforce dependencies between the cells of the hypothetical contingency table AB. In this example, for cells corresponding to D+, the following relationships may be enforced: 5 P(A+B-/D+) = P(A+/D+)-P(A+B+/D+) P(A-B+/D+)= P(B+/D+)-P(A+B+/D+) P(A-B-/D+)= 1 - P(A+/D+) - P(B+/D+)+P(A+B+/D+) And similarly for cells corresponding to D-: 10 P(A+B-/D-) = P(A+/D-)-P(A+B+/D-) P(A-B+/D-)= P(B+/D-)-P(A+B+/D-) P(A-B-/D-) = 1 - P(A+/D-) - P(B+/D-)+P(A+B+/D-) 15 Using the notation in the contingency tables above, and leaving out the superfluous last relationship, these relationships translate to: PI Ipi-p9 P13 =P5-P9 20 P1=p2-P2 O P14 =P6-PIo or equivalently 25 pi=p9+pj P2 poP+p2 PS = Ip9+p3 P6 = piO+PI4 30 To summarize all the relationships, below is the table of all the dependencies of p1,..., p16 on p9,.. .,p.16 To get a dependency between the values, the probability within the row is the summation of probabilities within the column that has value=l, for example the first row gives p1=p9+p1. 100 P9 Plio P11 P12 P13 P14 PIS P 16 P2 1 ps 1 P3 P4 P61 P-7 pis P39 1 Pi0 Ph P121 P13 P14 P161 From the relationship between the frequencies and probabilities, the measurement equations fi = pi+ni for n=9 ... 16 may be created, where ni is a noise term representing the imperfect measurement of the probability pi based on frequency of occurrence fi. 5 Applying this to the relationships described above, and assuming that all the cells of contingency table AB have been measured (this is just for illustrative purposes and will be discussed below), these 10 observations may be represented: These measurement equations may be presented in matrix notation as: 10 F = XP+N Where F = [F 1 , ... , F iSI, P - [p, -- p 1 T and N = [ng, ... , n 16 ]T and X is the matrix represented in the table above. This matrix equation may be used to solve for the 8 15 unknown coefficients, p9... P16. In this particular case we are solving for all the 101 parameters P9.. .p1 If we do not have all the measurements for combined A,B genes, we need at least one measurement for D+ and one for D-. Given the relationships above, we can then fill out the rest of the table. In other words, in order to be able to fill out the contingency table for the hypothetical study AB, there desirably is at least one sample 5 where a particular state of A and B were simultaneously measured on subjects that had outcomes of both D+ and D-. This enables one to achieve full rank for the matrix X representing the measurements made, so that the values p9... P16 are solved and filled in the contingency table AB. If more study data exists, further rows may be added to the bottom of the matrix X with a similar structure to that shown above. 10 To perform an accurate regression, a weighted regression with weights for each observation fi determined by the size of the group sample is desirable, so that studies and cells with many more observations get more weight. For the measurement equations fi pi+ni, the ni do not all have the same variance, and the regression is not homoscedastic. Specifically, fi = 1/Ki*Binomial(pi, Ki) N(pi, pi(1-pi)/Ki) where Binomial(pi, Ki) 15 represents a binomial distribution where each test has probability of the case outcome pi and K; tests are performed. This binomial distribution can be approximated by N(pi, pi(l pi)/Ki) which is the normal distribution with mean pi and variance pi(1-pi)/Ki. Consequently, the noise may be modeled as a normal variable ni ~ N(O, pi(l -pi)/Ki) which has theoretical variance Vi = pi*(1-pi)/Ki. This variance can be approximated with the 20 sample frequency vi = fi*(1-fi)/Ki. A weighted regression with weights for each observation i inversely proportional to variance vi was performed. The distribution of the noise matrix N as -N(0, V) where V is a matrix with diagonal elements [v 9 ,...,v,6] and all other elements are 0 may now be described. This is denoted as V = diag([v 9 ,...,v 16 ]). Similarly, let W=diag([ 1/v 9 ,..., 1/vi6]). 25 Now it is possible to solve for P using a weighted regression: P= (X'WXy X'WY It is straightforward to show that the variance of P will be 30 Var(P) =(X'WX)~ 1 which can be used to indicate the confidence in the determination of P. 102 To summarize, we have used the data from individual genes (A: fi,...,f4,B: f 5 ,...,fs) together with data from the combination of A and B (AB:f 9 ,. .,fJ 6 ) to help with estimating the probabilities for combination of A and B (ps,...,pi) and their variances (v 9 ,...,v1 6 ). 5 Finally, in our studies we mostly deal with log odds ratios, not probabilities, so we need to translate these probabilities into LORs. Generally, given the probabilities and variances for an event H as below. D+ D H+ pl p2 H- I pII-p2 V V1 v2 10 The formula for the LOR is LOR = [log(pl) -log(l-pl) ]-[log(p2)-log(1-p2)], with variance (by delta method). V = [(p1)-' +(1-p1)~1] 2 *V(pl) + [(p2)'+(l-p2)'] 2 *V(p2). The table below shows the probabilities, corresponding LOR and variance for combination of A,B D+ D- LOR Var A+B+ P 9 plo lor, VI [1/p9 +1/(1-p9)] 2 v 9 + [1/po +1/(-pro)] 2 vio A+B- P 1 1 p12 lor 2
V
2 = [I/Pii +1/(1-pI)]2vI+ [1/p12 +1/(1-pI2)]v2 I A-B+ P, 3 p14 1or 3
V
3 = [l/p13 +1/(l-p13)] 2 v3+[1/pi4+l(-p14)] 2
ZV
14 A-B- P 15 P16 or 4
V
4 =[1/pis +1/(I-pIs)] v5+ [l/p16 +l/(1-pi6)] 2vi
ZV
16 15 This provides an estimate of the log odds ratios and respective variances. As an illustration of this method, the technique was employed to obtain improved estimates of P(A,B/D) where D represents the state of having Alzheimers and where A 20 and B represents two different states of the APOE and ACE gene respectively. Table 9 represents three different studies conducted by Alvarez in 1999 where only gene A was sampled; by Labert in 1998 where only gene B was sampled; and by Farrer in 2005 where genes A and B were sampled. Two sets of results have been generated from these studies, and are shown in Table 10. The first set (See Table 10, columns 2, 3, 4 and 5) analyzes 103 all the cohorts and improves estimates of P(A,B/D) given P(A/D) and P(B/D) using the methods disclosed here. The second set (see Table 10, columns 6, 7, 8 and 9) uses only those results generated from the modem cohort of Farrer (2005) for P(A,B/D) in which both genes were sampled. The confidence bounds of predictions in the former case are 5 considerably reduced. Note that these predictions can be further improved using data describing P(A/B) from public sources - these measurements can be added to the X matrix as described above. Note also that the techniques described here may be used to improve the estimates on the separate A, B probabilities such as P(A+/D+), P(A+/D-), P(B+/D+), and P(B-/D-) using the relationship such as p1 = p5+p7 as described above. 10 Note that while this method has been illustrated for only two variables A and B, it should be noted that the contingency tables can included many different IVs such as those mentioned above in the context of Alzheimer's prediction: family history of Alzheimer's, gender, race, age, and the various alleles of three genes, namely APOE, NOS3, and ACE. Continuous variables such as age can be made categorical by being categorized in bins of 15 values in order to be suitable to contingency table formulation. In a preferred embodiment, the maximum number of IV's is used to model the probability of an outcome, with the standard deviation of the probability typically being below some specified threshold. In other words, the most specific contingencies possible may be created given the IV's available for a particular patient, while maintaining enough 20 relevant training data for that contingency to make the estimate of the associated probability meaningful. Note that it will also be clear to one skilled in the art, after reading this disclosure, how a similar technique for using data about disease-gene associations, gene-gene associations, and/or gene frequencies in the population can be applied to improve the 25 accuracy of multivariable linear and nonlinear regression and logistic regression models. Furthermore, it will be clear to one skilled in the art, after reading this disclosure, how a similar technique for using data about disease-gene associations, gene-gene associations, *and/or gene frequencies in the population can be applied to improve the accuracy of multivariable linear and nonlinear regression and logistic regression models by enabling 30 the leveraging of outcome data to train the models where not all the independent variables of that are relevant to the model were measured for that outcome data. Furthermore, it will be clear to one skilled in the art, after reading this disclosure, how a similar technique for using data about disease-gene associations, gene-gene associations, and/or gene 104 frequencies in the population can be applied to improve the accuracy of contingency table models built using other techniques such as the Expectation Maximization (EM) algorithm which is well understood in the art. These techniques will be particularly relevant to leveraging data from the HapMap project and other data contained in public 5 databases such as National Center for Biotechnology Information (NCBI) Online Mendelian Inheritance in Man (OMIM) and dbSNP databases. Note also, throughout the patent, that where we refer to data pertaining to an individual or a subject, this also assumes that the data may refer to any pathogen that may have infected the subject or any cancer that is infecting the subject. The individual or 10 subject data may also refer to data about a human embryo, a human blastomere, a human fetus, some other cell or set of cells, or to an animal or plant of any kind. Tomorrow's Data: Modeling Multi-factorial Phenotype with Regression Models As more data is accumulated correlating genotype with multi-factorial phenotype, 15 the predominant scenario will become (iii) as described above, namely it will be desirable to consider complex combinations of genetic markers in order to accurately predict phenotype, and multidimensional linear or nonlinear regression models will be invoked. Typically, in training a model for this scenario, the number of potential predictors will be large in comparison to the number of measured outcomes. Examples of the systems and 20 methods described here include a novel technology that generates sparse parameter models for underdetermined or ill-conditioned genotype-phenotype data sets. The technique is illustrated by focusing on modeling the response of HIV/AIDS to Anti Retroviral Therapy (ART) for which much modeling work is available for comparison, and for which data is available involving many potential genetic predictors. When tested 25 by cross-validation with actual laboratory measurements, these models predict drug response phenotype more accurately than models previously discussed in the literature, and other canonical techniques described here. Two regression techniques are described and illustrated in the context of predicting viral phenotype in response to Anti-Retroviral Therapy from genetic sequence 30 data. Both techniques employ convex optimization for the continuous subset selection of a sparse set of model parameters. The first technique uses the Least Absolute Shrinkage and Selection Operator (LASSO) which applies the I norm loss function to create a sparse linear model; the second technique uses the Support Vector Machine (SVM) with 105 radial basis kernel functions, which applies the 8-insensitive loss function to create a sparse nonlinear model. The techniques are applied to predicting the response of the HIV 1 virus to ten Reverse Transcriptase Inhibitors (RTIs) and seven Protease Inhibitor drugs (PIs). The genetic data is derived from the HIV coding sequences for the reverse 5 transcriptase and protease enzymes. Key features of these methods that enable this performance are that the loss functions tend to generate simple models where many of the parameters are zero, and that the convexity of the cost function assures that one can find model parameters to globally minimize the cost function for a particular training data set. 10 The LASSO and the L' Selection Function When the number of predictors M exceeds the number of training samples N, the modeling problem is overcomplete, or ill-posed, since any arbitrary subset of N predictors is sufficient to yield a linear model with zero error on the training data, so long as the associated columns in the X matrix are linearly independent. Consequently, one is 15 disinclined to put faith in an N-predictor model returned by a linear regression method. Suppose, however, a model with significantly fewer than N variables has low training error. The more sparse the model, the less probable that low training error could be a chance artifact, hence the more likely that the predictors are causally related to the dependent variable. This underlies the importance of sparse solutions in overcomplete 20 problems, as is the case for the RTI data. A similar argument can be applied to ill conditioned problems characterized by a large condition number on the matrix XTX, as is the case for the PI data. In this case, the estimated parameters b are highly susceptible to the model error, as well as to measurement noise, and as a result are unlikely to generalize accurately. Overcomplete and ill-conditioned problems are typical of genetic 25 data, where the number of possible predictors-genes, proteins, or, in our case, mutation sites-is large relative to the number of measured outcomes. One canonical approach to such cases is subset selection. For example, with stepwise selection, at each step a single predictor is added to the model, based on having the highest F-test statistic indicating the level of significance with which that variable is 30 correlated with prediction error. After each variable is added, the remaining variables may all be checked to ensure that none of them have dropped below a threshold of statistical significance in their association with the prediction error of the model. This technique has 106 been successfully applied to the problem of drug response prediction. However, due to the discrete nature of the selection process, small changes in the data can considerably alter the chosen set of predictors. The presence or absence of one variable may affect the statistical significance associated with another variable and whether that variable is 5 included or rejected from the model. This affects accuracy in generalization, particularly for ill-conditioned problems. Another approach is for the values of the estimated parameters b to be constrained by means of a shrinkagefunction. A canonical shrinkage function is the sum of the squares of the parameters, and this is applied in ridge regression which finds the 10 parameters according to: b= arg min, fjy - XbJ1 2 + A| |bII 2 (17) where A is a tuning parameter, typically determined by cross-validation. This method is 15 non-sparse and does not set parameters to 0. This tends to undermine accuracy in generalization, and makes solutions difficult to interpret. These problems are addressed by the LASSO technique. In contrast to subset selection, the LASSO does not perform discrete acceptance or rejection of predictor variables; rather it allows one to select en-masse, via a continuous subset optimization, 20 the set of variables that together are the most effective predictors. It uses the 11 norm shrinkage function: b=argmin Iy - b112 + A ZbI (18) i=1...M 25 where A is typically set by cross-validation. The LASSO will tend to set many of the parameters to 0. Figure 20 provides insight into this feature of the LASSO, termed selectivity. Suppose that a model based on just two mutations is created with the training data X= [1 0; 01]"y = [2 Ir and the x-axis and y-axis represent the two parameters b, and b 2 respectively. Compare the use of the 11 and 12 shrinkage functions, where in both 30 cases a solution is found that fits the training data equally well such that |y-Xb| 2=2. The large circle (2001), small circle (2002), and square (2003) respectively represent level 107 curves for the cost functions llv-Xb1 2 , the 12 norm 11b1 2 , and the 11 norm IbjI+|b 2 1. A solution for ridge regression (12) is found where the two circles meet (2004); a solution for the LASSO (II) is found where the square and the large circle intersect (2005). Due to the "pointiness" of the level curve for the 1 norm, a solution is found that lies on the axis b, 5 and is therefore sparse. This argument, extended into higher dimensions, explains the tendency of LASSO to produce sparse solutions, and suggests why the results achieved are measurably better than those reported in the literature. The 11 norm can be viewed as the most selective shrinkage function, while remaining convex. Convexity guarantees that one can find the one global solution for a 10 given data set. A highly efficient recent algorithm, termed Least Angle Regression, is guaranteed to converge to the global solution of the LASSO in Msteps. Note that it will be clear to one skilled in the art, after reading this disclosure, how the 11 norm can also be used in the context of logistic regression to model the probability of each state of a categorical variable. In logistic regression, a convex cost function may 15 be formed that corresponds to the inverse of the a-posteriori probability of a set of measurements. The a-posteriori probability is the probability of the observed training data assuming the models estimates of the likelihood of each outcome. By adding to the 11 norm to the convex cost function, the resulting convex cost function can be minimized to find a sparse parameter model for modeling the probability of particular outcomes. The 20 use of 11 norm for logistic regression may be particularly relevant when the number of measured outcomes is small relative to the number of predictors. Support Vector Machines and the Li-Norm SVM's may be configured to achieve good modeling of drug response and other 25 phenotypes, especially in cases where the model involves complex interactions between the independent variables. The training algorithm for the SVM makes implicit use of the 11 norm selection function. SVM's are learning algorithms that can perform real-valued function approximation and can achieve accurate generalization of sample data even when the estimation problem is ill-posed in the Hadamard sense. The ability of SVM's to 30 accurately generalize is typically influenced by two selectable features in the SVM model and training algorithm. The first is the selection of the cost function, or the function that is to be minimized in training. The second is the selection of the kernels of the SVM, or those functions that enable the SVM to map complex nonlinear functions, involving 108 interactions between the independent variables, using a relatively small set of linear regression parameters. These features are discussed below. Consider modeling the phenotype for a subject i y, with a linear function approximation: f, = f(xb)= b x, . First, estimate b by minimizing a cost function 5 consisting of a 12 shrinkage function on the parameters, together with the "s-insensitive loss" function, which does not penalize errors below some E>0. The SV regression may be formulated as the following optimization: b=argmin .( +C (+*) (19) 2 10 subject to the constraints: - y -bx TI :8+*, =(20) bTx, -y,: e+ ~,i =1 ... N (21) 15 4* >0,[~ 20, i=1...N (22) The second term of the cost function minimizes the absolute value of the modeling errors, beyond the "insensitivity" threshold F. Parameter C allows one to scale the relative importance of the error vs. the shrinkage on the weights. This constrained optimization 20 can be solved using the standard technique of finding the saddle-point of a Lagrangian, in order to satisfy the Kuhn-Tucker constraints. The Lagrangian, which accommodates the cost and the constraints described above, is: L(b,g -a ,,r = + C (+ ) ay, -bx± + - 2 ±=' (2 3) -Z , q,--brTX+ 6+~-$r ~+,* 1=1 1=1 25 Minimize with respect to the vectors of parameters b, g~,g*, and maximize with respect to the vectors of Lagrange multipliers a~,a, a ,A. Note that the Lagrange multipliers 109 are desirably positive in accordance with the Kuhn-Tucker constraints. Hence, the optimal set of parameters can be found according to: (b.,.* =arg min .- max,.,,-,+, L(b,~,, a+, a-,*r (24) 5 subject to aaA 0, i =I...N (25) Since the order of minimization/maximization can be interchanged, first minimize with 10 respect to variables b, ,*, by setting the partial derivatives of L with respect to these variables to 0. From the resultant equations, one finds that the weight vector can be expressed in terms of N a b=Z a* -a (26) 15 Also from the resultant equations, eliminate variables from the Lagrangian so that one may find the coefficients a,+, a,-, i=1 ...N by maximizing the quadratic form: W(a*,a~ c= a + N a,*)+ y,"( -a a--a,'a~-a,* 'Xx (27) 20 subject to N N a = a (28) i=1 i=1 0<5 a,+ C, i=1...N (29) 25 0!( a,- C, i=...N (30) This enables the vector b to be computed and fully defines the SVM model for the e insensitive loss function. Note from equation (11) that the model may be characterized as 110 M f(x) E,6, xMx, + b (31) ,=1 where , = a,+ - ai . The resulting model will tend to be sparse in that many of the 5 parameters in the set {p,6, i = 1... M} will be 0. Those vectors xi corresponding to non zero valued , are known as the support vectors of the model. The number of support vectors depends on the value of the tunable parameter C, the training data, and the suitability of the mod6l. In an illustration below, it is shown how the model can now be augmented to accommodate complex nonlinear functions with the use of kernelfunctions. 10 Next, it will be shown that the s-insensitive loss function is related to the 11 norn shrinkage function, and essentially achieves the same thing, namely the en-masse selection of a sparse parameter set by means of the 11 norm. In order to model a complex function, with possible coupling between variables, the simple inner product of Equation (17) is replaced with a kernel functions that 15 computes a more complex interaction between the vectors. Inserting kernel functions, our function approximation in (17) takes the form: N N f(x) = AK(x,x,)+ p8 = L,8K(x,x,) (32) 20 whereK(x,x 0 )=1 by definition. To find these parameters, use exactly the same optimization methods described above, and replace all terms x'x, with K(x,x,). As before, compute the parameter set according to A =a,+ - a[, by finding the arguments that maximize W~a*, a~)= Ee a + a,')+ Ey,a, -a- E , -, ,~-,)~~ 25 i=1 i=1 2d=1 (33) subject to the same constraints as above. For the SVM results described above, radial basis kernel functions were selected. 111 Now, to illustrate the implicit use of the 11 norm: consider that instead of trying to optimize equation (17) one begins with the optimization: N2 N p' = arg min, f(x)-- ,K(x,,x) dx+e p,\ (34) 5 where the I shrinkage has been explicitly used to constrain the values offp, and the data fitting error, instead of being defined over discrete samples of training data, is defined over the domain of the hypothetical function being modeled. Now, make the variable substitutions: 8, =a,+ -a,; a,a, 0, aa-0, i=I...N. Then the optimization may 10 be recast as: W(aa-c)= -- (sa, +a,' + Ny, (a,* - a - (a, -- i aja, - a,*)K(x,,xy) 2 (35) subject to the constraints 15 a,,a, 0 (36) aa, = 0 (37) This solution, which has different constraints, will nonetheless coincide with that of the e insensitive loss function if both the value C for the SV method is chosen sufficiently large 20 that the constraints 0 <a + a - C can simply become the constraints (21) and (22) and also one of the basis functions is constant, as in equation (17) for our case. In this case, one does not require the additional constraint a, = a, that is used by the SV method. Note that constraint (25) is already implicit in Equations (15) since the constraints (8) and (9) cannot be simultaneously active, so one of the Lagrange 25 multipliers a,+ or a,- should be slack, or 0. Under these conditions, one can see that the &-insensitive loss function achieves sparse function approximation, implicitly using the approach of an 11 shrinkage function. 112 Example of Multi-factorial Phenotype Prediction: Modeling I-I V-i Drug Response Current approaches to predicting phenotypic outcomes of salvage ART do not demonstrate good predictive power, largely due to a lack of statistically significant outcome data, combined with the many different permutations of drug regimens and 5 genetic mutations. This field has a pressing need both for the integration of multiple heterogeneous data sets and the enhancement of drug response prediction. The models demonstrated herein used data from the Stanford HIVdb RT and Protease Drug Resistance Database for training and testing purposes. This data consists of 6644 in vitro phenotypic tests of HIV-1 viruses for which reverse transcriptase (RT) or 10 protease encoding segments have been sequenced. Tests have been performed on ten reverse transcriptase inhibitors (RTI) and seven protease inhibitors (PI). The RTIs include lamivudine (3TC), abacavir (ABC), zidovudine (AZT), stavudine (D4T), zalcitabine (DDC), didanosine (DDI), delaviradine (DLV), efavirenz (EFV), nevirapine (NVP) and tenofovir (TDF). The PIs include ampranavir (APV), atazanavir (ATV), nelfinavir (NFV), 15 ritonavir (RTV), saquinavir (SQV), lopinavir (LPV) and indinavir (IDV)). For each drug, the data has been structured into pairs of the form (x,, y,), i =1... N, where N is the number of samples constituting the training data, y, is the measured drug fold resistance (or phenotype), and x, is the vector of mutations plus a constant term, x, [1 x,xi 2 ... xL]T, where M is the number of possible mutations on 20 the relevant enzyme. Assume element x,,=1 if the m' mutation is present on i'l sample, and set xi,'=O otherwise. Each mutation is characterized both by a codon locus and a substituted amino acid. Mutations that do not affect the amino acid sequence are ignored. Note that only mutations present in more than 1% percent of the samples for each drug are included in the set of possible predictors for a model, since it is improbable that 25 mutations associated with resistance would occur so infrequently. The measurement y; represents the fold resistance of the drug for the mutated virus as compared to the wild type. Specifically, y, is the log of the ratio of the IC 50 (the concentration of the drug required to slow down replication by 50%) of the mutated virus, as compared to the IC 50 of the wild type virus. The goal is to develop a model for each drug that accurately 30 predicts y from x;. In order to perform batch optimization on the data, stack the independent variables in an N by M+i matrix,X=[x,,x 2 ... X"]T, and stack all observations in a vector y = [y, ,2 --- . 113 The performance of each algorithm is measured using cross-validation. For each drug, the first-order correlation coefficient R is calculated between the predicted phenotypic response of the model and the actual measured in vitro phenotypic response of the test data. 5 R - y y. (38) Where vector f is the prediction of phenotypes y , Y denotes the mean of the elements in vector y and I denotes the vector of all ones. For each drug and each method, the data is 10 randomly subdivided in the ratio 9:1 for training and testing, respectively. In one example, ten different subdivisions are performed in order to generate the vector 9 and R without any overlap of training and testing data. This entire process may then be repeated ten times to generate ten different values of R. The ten different values of R are averaged to generate the R reported. The standard deviation of R is also determined for each of the 15 models measured over the ten different experiments to ensure that models are being compared in a statistically significant manner. Table 11 displays the results of the above mentioned models for the PI drugs; Table 12 displays the results for the ten RTI drugs. Results are displayed in terms of correlation coefficient R, averaged over ten subdivisions of the training and test data. The 20 estimated standard deviation of the mean value of R, computed from the sample variance, is also displayed. The number of available samples for each drug is shown in the last row. The methods tested, in order of increasing average performance, are: i) RR - Ridge Regression, ii) DT - Decision Trees, iii) NN - Neural Networks, iv) PCA - Principal Component Analysis, v) SS - Stepwise Selection, vi) SVM__L - Support Vector Machines 25 with Linear Kernels, vii) LASSO - Least Absolute Shrinkage and Selection Operator, and viii) SVM - Support Vector Machines with Radial Basis Kernels. The information in the last columns of Table 11 and Table 12 is depicted in Figure 21. The circles in Figure 21 display the correlation coefficient R averaged over ten different experiments for each PI, and averaged over the seven different PIs. The diamonds in Figure 21 display the 30 correlation coefficient R averaged over ten different experiments for each RTI, and 114 averaged over the ten different RTIs. The one standard deviation error bars are also indicated. Wherever modeling techniques involve tuning parameters, these have been adjusted for optimal performance of the technique as measured by cross-validation, using 5 a grid search approach. In all cases, the grid quantization was fine enough that the best performing parameters from the grid were practically indistinguishable from the optimal parameters for the given data, since the difference in the prediction due to grid quantization lay below the experimental noise floor. Although there are strong trends in the data, it should be noted that due to 10 differences in the number of samples, interactions of the underlying genetic predictors, and other idiosyncrasies in the data that vary between drugs, the R achieved by each algorithm may vary from drug to drug. This variation may be seen by studying the individual drug columns of Table 11 (columns 3 to 9) and Table 12 (columns 3 to 12). Of all the methods, SVM performs best, slightly outperforming LASSO (P<0.001 15 for the RTIs; P=0. 18 for the PIs). The performance of SVM trained with the e-insensitive loss function is considerably better than that of previously reported methods based on the support vector machine. SVM, which uses nonlinear kernel functions, outperforms SVML which uses linear kernel functions, and which is also trained using the & insensitive loss function (P = 0.003 for RTIs; P < 0.001 for PIs). The SVM considerably 20 outperforms the other nonlinear technique which uses neural networks and which does not create a convex cost function (P<0.001 for both RTIs and PIs). The LASSO technique, which trains a linear regression model using a convex cost function and continuous subset selection, considerably outperforms the SS technique (P<0.00 I for both PIs and RTIs). The top five methods, namely SS, PCA, SVML, LASSO,. SVMR, all 25 tend to generate models that are sparse, or have a limited number of non-zero parameters. In order to illustrate the subset of mutations selected as predictors, certain embodiments disclosed herein focus on the second-best performing model, namely the LASSO, which creates a linear regression model that, unlike SVM, does not attempt to emulate nonlinear or logical coupling between the predictors. Consequently, it is 30 straightforward to show how many predictors are selected. Table 13 shows the number of mutations selected by the LASSO as predictors for each PI drug (Table 13, row 4), together with the number of mutations (Table 13, row 3), and the total number of samples 115 (Table 13, row 2), used in training each model. The same table is shown for the RT~s (Table 14, same rows correspond to the same items). The selected mutations may also enhance understanding of the causes of drug resistance. Figures 22, 23 and 24 show the value of the parameters selected by the 5 LASSO for predicting response to P1, Nucleoside RTIs (NRTIs) and Non-Nucleoside RTIs (NNRTIs) respectively. Each row in the figures represents a drug; each column represents a mutation. Relevant mutations are on the protease enzyme for PI drugs, and on the RT enzyme for NRTI and NNRTI drugs. The shading of each square indicates the value of the parameter associated with that mutation for that drug. As indicated by the 10 color-bar on the right (2201, 2301 and 2401, respectively), those predictors that are shaded darker are associated with increased resistance; those parameters that are shaded lighter are associated with increased susceptibility. The mutations are ordered from left to right in order of decreasing magnitude of the average of the associated parameter. The associated parameter is averaged over all rows, or drugs, in the class. Those mutations 15 associated with the forty largest parameter magnitudes are shown. Note that for a particular mutation, or column, the value of the parameter varies considerably over the rows, or the different drugs in the same class. For the algorithms RR, DT, NN, and SS, the model was not trained on all genetic mutations, but rather on a subset of mutations occurring at those sites that have been 20 determined to affect. resistance by the Department of Health and Human Services (DHHS). The reduction in the number of independent variables was found to improve the performance of these algorithms. In the case of the SVML algorithm, best performance for RTIs was achieved using only the DHHS mutation subset, while best performance for PIs was achieved by training the model on all mutations. For all other algorithms, best 25 overall performance was achieved by training the model on all mutations. The set of mutations shown in Figures 22, 23 and 24 that were selected by the LASSO as predictors, but are not currently associated with loci determined by the DHHHS to affect resistance, are: for PIs - 19P, 91S, 67F, 4S, 37C, III, 14Z; for NRTIs - 68G, 203D, 245T, 208Y, 218E, 208H, 351, 11K, 40F, 281K; and for NNRTIs - 139R, 30 317A, 35M, 102R, 241L, 322T, 379G, 2921, 294T, 21 IT, 142V. Note that in some cases, such as for the LASSO and the SVM, the performance for particular drugs, such as LPV, was significantly improved (P<0.001) when all mutations were included in the model (R = 86.78, Std. dev = 0.17) as compared to the case when only those loci recognized to 116 affect resistance by DHHS were included (R = 81.72, Std. dev. = 0.18). This illustrates that other mutations, beyond those recognized by the DHHS, may play a role in drug resistance. The use of convex optimization techniques has herein been demonstrated to 5 achieve continuous subset selection of sparse parameter sets in order to train phenotype prediction models that generalize accurately. The LASSO applies the 11 norm shrinkage function to generate a sparse set of linear regression parameters. The SVM with radial basis kernel functions and trained with the s-insensitive loss function generates sparse nonlinear models. The superior performance of these techniques may be explained in 10 terms of the convexity of their cost functions, and their tendency to produce sparse models. Convexity assures that one can find the globally optimal parameters for a particular training data set when there are many potential predictors. Sparse models tend to generalize well, particularly in the context of underdetermined or ill-conditioned data, as is typical of genetic data. The 11 norm may be viewed as the most selective convex 15 function. The selection of a sparse parameter set using a selective shrinkage function exerts a principle similar to Occam's Razor: when many possible theories can explain the observed data, the most simple is most likely to be correct. The SVM, which uses an 12 shrinkage function together with an s-insensitive loss function, tends to produce an effect similar to the explicit use of the 11 norm as a shrinkage function applied to the parameters 20 associated with the support vectors. Techniques using the 11 shrinkage function are often able to generalize accurately when the number of IVs is large, and the data is undetermined or ill-conditioned. Consequently, it is possible to add nonlinear or logical combinations of the independent variables to the model, and expect that those combinations that are good predictors will 25 be selected in training. The SVM is able to model interactions amongst the independent variables with the use of nonlinear kernel functions, such as radial basis functions, which perform significantly better than linear kernel functions. Consequently, without changing the basic concepts disclosed herein, the performance of the LASSO may be enhanced by adding logical combinations of the independent variables to the model. Logical terms can 30 be derived from those generated by a decision tree, from those logical interactions described by expert rules, from the technique of logic regression, or even from a set of random permutations of logical terms. An advantage of LASSO is that the resulting model will be easy to interpret, since the parameters directly combine independent 117 variables, or expressions involving independent variables, rather than support vectors. The robustness of the LASSO to a large number of independent variables in the model is due both to the selective nature of the 11 norm, and to its convexity. Other techniques exist that use shrinkage function more selective than the 1l norm. 5 For example, log-shrinkage regression uses a shrinkage function derived from coding theory which measures the amount of information residing in the model parameter set. This technique uses the log function as a shrinkage function instead of the Il -norm and is consequently non-convex. While offering a theoretically intriguing approach for seeking a sparse set of parameters, the non-convexity of the penalty function means that solving 10 the corresponding regression is still computationally less tractable than the LASSO, and for large sets of predictors may yield only a local rather than a global minimum for the given data. The techniques described here may be applied to creating linear and nonlinear regression models for a vast range of phenotype prediction problems. They are 15 particularly relevant when the number of potential genetic predictors is large compared to the number of measured outcomes. Simplifying a Regression Model by Mapping Genetic Independent Variables into a Different Space 20 Note that, as described above, in cases where complex combinations of genetic markers are considered, it is possible to project the SNP variables onto another variable space in order to simplify the analysis. This variable space may represent known patters of mutations, such as the clusters or bins described by the HapMap project. In other words, rather than the vector xi representing particular SNP mutations as described above, 25 it may represent whether the individual falls into particular HapMap clusters or bins. For example, following the notation above, imagine there is a vector x"=[xu, xa ... x 1 fr where B is the number of relevant HapMap bins. One can set element xb=1 if the individuals SNPS pattern falls into the bM bin and 0 otherwise. Alternatively, if the overlap between the individuals SNPs and a particular bin is incomplete, and it may not be desirable to 30 simply place the individual in a category "other", then one may set each xi equal to the fraction of overlap between his pattern of SNPs and that of bin b. Many other techniques are possible to formulate the regression problem without changing the concepts disclosed herein. 118 Model Selection by Cross Validationfor Outcome Prediction In what has preceded this discussion, different phenotype prediction techniques involving expert rules, contingency tables, linear and nonlinear regression were 5 described. Now a general approach to selecting from a set of modeling techniques which is best to model a particular categorical or non-categorical outcome for a particular subject is described, based on the use of training data. Figure 25 provides an illustrative flow diagram for the system. The process described in Figure 25 is a general approach to selecting the best model given the data that is available for a particular patient, the 10 phenotype being modeled, and a given set of testing and training data, and the process is independent of the particular modeling techniques. In a preferred embodiment the set of modeling techniques that may be used include expert rules, contingency tables, linear regression models trained with LASSO or with simple least-squares where the data is no under-determined, and nonlinear regression models using support vector machines. 15 The process begins 2501 with the selection of a particular subject and a particular dependent variable (DV) that will be modeled, or - if it's a categorical variable - for which the probability may be modeled. The system then determines 2502 the set of Independent Variables (IVs) that are associated with that subject's record and which may be relevant to modeling the outcome of the DV. The human user of the system may also 20 select that subset of IVs that the user considers to be possible relevant to the model. The system then checks 2503a to see whether a model has already been trained and selected for the given combination of independent variables and the given dependent variable to be modeled. If this is the case, and the data used for training and testing the ready-made model is not out of date, the system will go directly to generating a prediction 2519 using 25 that model. Otherwise, the system will extract from the database all other records that have the particular DV of interest and which may or may not have the same set of IV's as the particular subject of interest In so doing, the system determines 2503b whether data is available for training and testing a model. If the answer is no, the system checks 2515 to see if there are any expert rules available to predict the outcome based on a subset of the 30 IV's available for the subject. If no expert rules are available then the system exits 2504 and indicates that it cannot make a valid prediction. If one or more expert rules are available, then the system will select 2505 a subset of expert rules that are best suited to the particular subject's data. In a preferred embodiment, the selection of which expert rule 119 to apply to a subject will be based on the level of confidence in that expert rule's estimate. If no such confidence estimate is available, the expert rules can be ranked based on their level of specificity, namely based on how many of the IVs available for the subject of interest the expert rules uses in the prediction. The selected subset of expert rules is then 5 used to generate a prediction 2506. If it is determined 2503b that data is available, the system will check 2516 to determine whether or not there is any data missing in the test and training data. In other words, for all those records that include the relevant DV, the system will check to see if all those records have exactly the same set of IVs as are available for the patient of 10 interest and which may be potential predictors in the model. Typically, the answer will be 'no' since different information will be available on different patients. If there is missing data, the system will go through a procedure to find that subset of IV's that should be used to make the best possible prediction for the subject. This procedure is time consuming since it involves a multiple rounds of model training and cross-validation. 15 Consequently, the first step in this procedure is to reduce 2507 the set of IVs considered to a manageable size based on the available computational time. In a preferred embodiment, the set of IVs are reduced based on there being data on that IV for a certain percentage of the subjects that also have the DV available. The set of IVs can be further reduced using other techniques that are known in the art such as stepwise selection which 20 assumes a simple linear regression model and selects IVs based on the extent to which they are correlated with the modeling error. The system then enters a loop in which every combination of the remaining IVs is examined. In a preferred embodiment the following states for each IV and the DV are considered: each IV can either be included or not included in the model and for numerical data for an IV or DV that is positive for all 25 subjects, the data may or may not be preprocessed by taking its logarithm. For each particular combination of inclusions/exclusions and pre-processing of the IWs and the DV, a set of modeling technique is applied 2510. Most modeling techniques will have some tuning parameter that can be optimized or tuned based on a grid-search approach using cross-validation with the test data. For 30 example, for the LASSO technique discussed above, many values will be explored for the variable parameter X. For each value of k, the regression parameters may be trained, and the model predictions may be compared with the measured values of test data. Similarly, for the support vector machine approach discussed above, the tuning parameters to be 120 optimized using a grid-search approach include C, e and possibly parameters describing the characteristics of the kernel functions. For techniques based on contingency tables, the tunable parameter may correspond to the highest standard deviation that can be accepted from a contingency table model, while making the contingencies as specific as possible 5 for the given subject, as discussed above. Many different metrics may be used to compare the model predictions with test data in order to optimize the tunable parameters and select among models. In a preferred embodiment, the standard deviation of the error is used. In other embodiments, one may use the correlation coefficient R between the predicted and measured outcomes. In the 10 context of logistic regression or contingency tables, one may also use the maximum a posteriori probability, namely the probability of the given set of test data assuming the model's prediction of the likelihood of each test outcome. Whatever metric is used, that value of the tuning parameter is selected that optimizes the value of the metric, such as minimizing the standard deviation of the prediction error if the standard deviation of the 15 prediction error is used as a test metric. Since model training and cross-validation is a slow process, at this stage 2510 the grid that defines the different tuning parameters to be examined is set coarsely, based on the amount of available time, so that only a rough idea of the best model and best tuning parameters can be obtained. Once all the different IV/DV combinations have been examined in this way 2511, 20 the system selects that combination of IVs/DV, that model and those tuning parameters that achieved the best value of the test metric. Note that if there is no missing data then the system will skip the step of checking all combinations of the IVs/DV. Instead, the system will examine the different modeling techniques and tuning parameters 2508, and will select that modeling method and set of tuning parameters that maximizes the test 25 metric. The system then performs refined tuning of the best regression model, using a more finely spaced grid, and for each set of tuning parameter values, determines the correlation with the test data. The set of tuning parameters is selected that produces the best value of the test metric. The system then determines 2518 whether or not the test metric, such as the standard deviation of the prediction error, is below or below a selected 30 threshold so that the prediction can be considered valid. For example, in one embodiment, a correlation coefficient of R > 0.5 is desirable for a prediction to be deemed valid. If the resultant test metric does not meet the threshold then no prediction can be made 2517. If the test metric meets the requisite threshold, a phenotype prediction may be produced, 121 together with the combination of IV's that was used for that prediction and the correlation coefficient that the model achieved with the test data. illustrating Model Selection by Cross Validation in Cancer Cohorts with Missing Data 5 In order to demonstrate this aspect, a focus was on utilizing the genetic and phenotypic data sets related to colon cancer that can be found in PharmGKB which is part of the National Institute of Health's Pharmacogenomic Research Network and has a mission to discover how individual genetic variations contribute to different drug response. For this dataset, a key challenge was missing information. Ideally, one would 10 like to apply the regression techniques described above to automatically select an IV subset for the model from all IVs that are available on a particular patient. However, this limits the amount of data that is available from other patients for training and testing the model. Consequently, for datasets containing few enough IVs, it is possible to search through all possible subsets of the independent variables. For each, as described above, 15 one can extract that set of patients for which the required outcome has been measured, and the relevant set of independent variables is available. As described above, one can also search the space of possible ways to preprocess the included independent variables, such as taking the logs of positive numeric independent variables. For each combination of independent variables included and independent variable preprocessing techniques, the 20 model is trained and tested by cross-validation with test data. That model is selected which has the best cross-validation with test data. Once a model has been created for a given set on IWs, that model is applied to new patient data submitted with the same set of IVs without requiring the exhaustive model search. This technique has been used to predict clinical side effects for colorectal cancer 25 drug Irinotecan. Severe toxicity is commonly observed in cancer patients receiving Irinotecan. Data was included which describes the relationships between Irinotecan pharmacokinetics and side effects with allelic variants of genes coding for Irinotecan metabolizing enzymes and transporters of putative relevance. Patients were genotyped for variations in the genes encoding MDR1 P-glycoprotein (ABCB1), multidrug resistance 30 associated proteins MRP-1 (ABCC1) and MRP-2 (ABCC2), breast cancer resistance protein (ABCG2), cytochrome P450 isozymes (CYP3A4, CYP3A5), carboxylesterases (CES1, CES2), UDP glucuronosyl-transferases (UGTlA1, UGTIA9), and the hepatic 122 transcription factor TCF1. The phenotypic data that is associated with the genetic sequence data for this study is described in Table 15. Figure 26 illustrates a model of prediction outcome for colon cancer treatment with irinotecan given the available PharmGKB data that was submitted using the 5 pharmacogenomic translation engine. In Figure 26, the model shows the relevant genetic loci (2601), the indicators used, in this casethe log of CPT-l1 area-under-the concentration curve (AUC) from 0-24 hours (2602) and the log of SN-38 AUC from 0-24 hours (2603) to predict the log of the Nadir of Absolute Neutrophil Count from day 12 to day 14 (2604). Cross-validating the model with test data, a correlation coefficient of 10 R=64% was achieved (2605). The empirical standard deviation of the model prediction is shown (2606) superimposed against the histogram of outcomes that were used to train the model (2607). These statistics can be used to make an informed treatment decision, such as to forgo irinotecan treatment completely or to administer a second drug, such as granulocyte colony stimulating factor, to prevent a low ANC and resultant infections. 15 Enhanced Diagnostic Reporting In the context of disease treatment, the generated phenotypic data is of most use to a clinician who can use the data to aid in selecting a treatment regimen. In one aspect, the phenotypic predictions will be contextualized and organized into a report for the clinician 20 or patient. In another aspect, the system and method disclosed herein could be used as part of a larger system (see Figure 27) wherein a diagnostic lab 2703 validates data from lab tests 2701 and medical reports 2702, and sends it to a data center 2704 where it is integrated into a standard ontology, analyzed using the disclosed method, and an enhanced diagnostic report 2705 could be generated and sent to the physician 2706. 25 One possible context in which a report may be generated would be related to predicting clinical outcomes for colon cancer patients being treated with irinotecan. It may take into consideration concepts such as contraindications for treatment, dosing schedules, side effect profiles. Examples of such side effects include myelosuppression and late-onset diarrhea which are two common, dose-limiting side effects of irinotecan 30 treatment which require urgent medical care. In addition, severe neutropenia and severe diarrhea affect 28% and 31% of patients, respectively. Certain UGTIAI alleles, liver function tests, past medical history of Gilbert's Syndrome, and identification of patient 123 medications that induce cytochrome p450, such as anti-convulsants and some anti emetics, are indicators warranting irinotecan dosage adjustment. Figure 28 is a mock-up of an enhanced report for colorectal cancer treatment with irinotecan that makes use of phenotype prediction. Prior to treatment, the report takes into 5 account the patient's cancer stage, past medical history, current medications, and UGTAIl genotype to recommend drug dosage. Roughly one data after the first drug dosage, the report includes a prediction of the expected Nadir of the patient's absolute neutrophil count in roughly two weeks time, based on the mutations in the UGTIAl gene and metabolites (e.g. SN-38, CPT-1 1) measured from the patient's blood. Based on this 10 prediction, the doctor can make a decision whether to give the patient colony stimulating factor drugs, or change the Irinotecan dosage. The patient is also monitored for blood counts, diarrhea grade. Data sources and justification for recommendations are provided. Combinations of the Aspects 15 As noted previously, given the benefit of this disclosure, other aspects, features and embodiments may implement one or more of the methods and systems disclosed herein. Below is a short list of examples illustrating situations in which the various aspects of the disclosed invention can be combined in a plurality of ways. It is important to note that this list is not meant to be comprehensive, many other combinations of the 20 aspects, features and embodiments of this invention are possible. One example could utilize a variety of genotyping measurement techniques in a way that would optimize the value of each. For example, a lab could use an technique that is expensive but can give high quality data in cases with low signal, such as AppliedBioscience's Taqman assay, to measure the target DNA, and use a technique that 25 is less expensive but requires a greater amount of genetic material to give good quality data, such as Affymetrix's 500K Genechip, or MIPS, to measure the parental DNA. Another example could be a situation in which a couple undergoing IVF treatment have eggs harvested from the woman, and fertilized with sperm from the man, producing eight viable embryos. A blastocyst is harvested from each embryo, and the genomic data 30 from the blastocysts are measured using Taqman Genotyping Assay. Meanwhile, the diploid data is measured from tissue taken from both parents using Molecular Inversion Probes. Haploid data from one of the man's sperm, and one of the woman's eggs is also measured using MIPs. The genetic data of the parents is used to clean the SNP data of 124 the eight blastocysts. The cleaned genetic data is then used to allow predictions to be made concerning the potential phenotypes of the embryos. Two embryos are selected which have the most promising profile, and allowed to implant in the woman's uterus. Another example could be a situation where a pregnant woman whose husband 5 has a family history of Tay-Sachs disease wants to know if the fetus she is carrying is genetically susceptible, but she does not want to undergo amniocentesis, as it carries a significant risk of miscarriage. She has her blood drawn, some fetal DNA is isolated from her blood, and that DNA is analyzed using MIPs. She and her husband had already had their full genomic data analyzed previously and it is available in silico. The doctor is 10 able to use the in silico knowledge of the parental genomes and the method disclosed herein to clean the fetal DNA data, and check if the critical gene that is responsible for Tay-Sachs disease is present in the genome of the fetus. Another example could be a situation where a 44-year old pregnant woman is concerned that the fetus she is carrying may have Downs Syndrome. She is wary of 15 having an intrusive technique used for pre-natal diagnosis, given a personal history of miscarriages, so she chooses to have her blood analyzed. The health care practitioner is able to find fetal cells in the maternal blood sample, and using the method disclosed herein, together with the knowledge of the woman's own genetic data, is able to diagnose for aneuploidy. 20 Another example could be a situation where a couple are undergoing IVF treatment they have eggs harvested from the woman, and fertilized with sperm from the man, producing nine viable embryos. A blastocyst is harvested from each embryo, and the genomic data from the blastocysts are measured using an Ilumina Bead Assay. Meanwhile, the diploid data is measured from tissue taken from both parents using 25 Molecular Inversion Probes. Haploid data from the father's sperm is measured using the same method. There were no extra eggs available from the mother, so bulk diploid tissue samples are taken from her own father and mother, and a sperm sample from her father. They are all analyzed using MIPs and the method disclosed herein is used to provide a genetic analysis for the mother's genome. That data is then used, along with the father's 30 diploid and haploid data, to allow a highly accurate analysis of the genetic data of each of the blastocysts. Based on the phenotypic predictions, the couple chooses three embryos to implant. 125 Another example could be a situation where a racehorse breeder wants to increase the likelihood that the foals sired by his champion racehorse become champions themselves. He arranges for the desired mare to be impregnated by IF, and uses genetic data from the stallion and the mare to clean the genetic data measured from the viable 5 embryos. The cleaned embryonic genetic data allows the breeder to find relevant genotypic-phenotypic correlations and select the embryos for implantation that are most likely to produce a desirable racehorse. Another example could be a situation where a pregnant woman wants to know whether the fetus she is carrying is predisposed towards any serious illness. The father 10 has since passed away, and so the haploid and diploid data generated from the father's brother and the father's father are used to help clean the genetic data of the fetus, measured from fetal cells gathered during fetal blood sampling. A company contracted by the health care practitioner uses the cleaned fetal genetic data to provide a list of phenotypes that the fetus is likely to exhibit, along with the confidence of each prediction. 15 Another example could be an amniocentesis lab that must occasionally contend with contaminated fetal genetic data due to poor laboratory techniques. The disclosed method could be used to clean the contaminated fetal genetic data using maternal and paternal genetic data. One could imagine a situation where a laboratory is able to cut costs by relaxing sterility procedures, knowing that the disclosed method would be able to 20 compensate for an increased rate of contaminating DNA. Another example could be a situation in which a woman in her forties is undergoing IVF to get pregnant. She wants to screen the embryos to select the one(s) that are least likely to have a genetic illness, and are most likely to implant and carry to term. The IVF clinic she is using harvests a blastocyst from each of the viable embryos, and 25 uses standard procedures to amplify the DNA, and measure key SNPs. The technician then uses the methods disclosed herein to screen for chromosomal imbalances, and also to find and clean the genetic data of the embryos to make predictions about the phenotypic predispositions of each embryo. Another example could be a situation where a pregnant woman has amniocentesis, 30 and the genetic material in the fetal cells in the blood sample are used, along with the methods described herein to screen for aneuploidy and other chromosomal abnormalities. One example could be a situation in which a non-linear model using Support Vector Machine with radial basis kernel functions and a norm loss function utilizes 126 genetic and phenotypic data of a human adult to predict the likelihood of early onset Alzheimer's disease, and to suggest possible lifestyle changes and exercise regimens which may delay the onset of the disease. Another example could be a situation in which a linear model using the LASSO 5 technique utilizes the genetic and phenotypic data of an adult woman afflicted with lung cancer, along with genetic data of the cancer to generate a report for the -woman's physicians predicting which pharmaceuticals will be most effective in delaying the progression of the disease. Another example could be a situation in which a plurality of models are tested on 10 aggregated data consisting of genetic, phenotypic and clinical data of Crohn's disease patients, and then the non-linear regression model that is found to be the most accurate utilizes the phenotypic and clinical data of an adult man to generate a report suggesting certain nutritional supplements that are likely to alleviate the symptoms of his Crohn's disease. 15 Another example could be a situation in which a model utilizing contingency tables built from data acquired through the Hapmap project, and utilizing genetic information gathered from a blastocyst from an embryo are used to make predictions regarding likely phenotypes of a child which would result if the embryo were implanted. Another example could be a situation where linear regression models utilizing 20 genetic information of the strain of IRV infecting a newborn are used to generate a report for the baby's physician suggesting which antiretroviral drugs give her the greatest chance of reaching adulthood if administered. Another example could be a situation where a new study is published suggesting certain correlations between the prevalence of myocardial infarctions in middle aged 25 women and certain genetic and phenotypic markers. This then prompts the use of a non linear regression model to reexamine the aggregate data of middle aged data, as well as genetic and phenotypic data of identified individuals whose data is known to the system, and the model then identifies those women who are most at risk of myocardial infarctions, and generates reports that are sent to the women's respective physicians 30 informing them of the predicted risks. Another example could be a situation where a plurality of models are tested on aggregated data of people suffering from colon cancer, including the various drug interventions that were attempted. The model that is found to allow the best predictions is 127 used to identify the patients who are most likely to benefit from an experimental new pharmaceutical, and those results are used by the company which owns the rights to the new pharmaceutical to aid them in conducting their clinical trials. 5 Definitions SNP (Single Nucleotide Polymorphism): A specific locus on a chromosome that tends to show inter-individual variation. To call a SNP: to interrogate the identity of a particular base pair, taking into account the direct and indirect evidence. 10 To call an allele: to call a SNP. To clean genetic data: to take imperfect genetic data and correct some or all of the errors, using genetic data of related individuals and the method describe herein. Imperfect genetic data: genetic data with any of the following: allele dropouts, unclear base pair measurements, incorrect base pair measurements, spurious signals, or 15 missing measurements. Confidence: the statistical likelihood that the called SNP, allele, or set of alleles correctly represents the real genetic state of the individual. Multigenic: affected by multiple genes, or alleles. Noisy genetic data: incomplete genetic data, also called incomplete genetic data.; 20 Uncleaned genetic data: genetic data as measured, that is, with no method has been used to correct for the presence of noise in the raw genetic data; also called crude genetic data. Direct relation: mother, father, son, or daughter. Chromosomal Region: a segment of a chromosome, or a full chromosome. 25 Parental Support: the name sometimes used for the disclosed method of cleaning genetic data. Section of a chromosome: a section of a chromosome that can range in size from one base pair to the entire chromosome. 128 The term 'comprise' and variants of the term such as 'comprises' or 'comprising' are used herein to denote the inclusion of a stated integer or stated integers but not to exclude any other integer or any other integers, unless in the context or usage an exclusive interpretation of the term is required. 5 Any reference to publications cited in this specification is not an admission that the disclosures constitute common general knowledge in Australia. 128a TABLES Tabl MA1..... __________________ ail_ - 21 wL!'2l' I± ~ z T-hiqi Fm~aA~dip~e QOfoJpyNb Ms- Aimpff Pdiia Fba Sag C34lssr TCd4Is 033 i N______ NMuc IMEffl ~ rd _ C' 'd~' li N SIP 2io4s sS$ I-Gcf n. Wal V YN I 1FCl4 Q Ozpdnom m*rw w 21 Z nm lq -M 31 416 0Q 2ci1 I 14120' $3Z 3 f 286 _ ox $48~ *2~2mfr Ta~Ti 22g1n1-~ 31A am a 2,s 1 3:1f $3 2874 E C161$5 omdnm4'2uw S28 kdbuwwmle~~ aQI~f-O ar 7fA 017 9 2vs1 811'A 103 $)Z 3 2 2'dmg 14131t wm yj ___ _ 322 1R Q 0-1 6 2 vs1 60'A 10 3111M 2 ZMEAM $313 rmpfnw 4'MI~I2441l271Y Wud ,s.vie 4f~ MI8 __2 M1 Qf 1 9 2\i 61' $241 14X13290M1O $3311 )=dnismro r(M~(14rIhsf2YZ om swi ft7(mtnOY$3fi=0CE6 Tw - 32 4U1, 069 -S 2\Ii 4.1% 8 1' I= $414ta If-(m1-ci1'$3 =411 Tx Y 317 gap 41 2481 1913' C = 2 2-j6c 35 $1411 dnmn 22mr -R26 aS6 a16 W2481 IMO 22M 3 28? 61_ Id E3 fd 2eW2VY~t4~y q( 27.6 2a6 OM 412481vs 101 $3 __ 1 8W $71,Z dnmr8n, *1$s Mule OMM~q cPC 27.6 201 rM2481 E8 m$31 a 3S ( 6 m t4b *mr ti1 m~s cM 2765 2L6 , Q412481 E03 $3 m, 3 2 0E ( 9 16aS C dof zr., 4V((l'f)W3'249' swrie ffO"I-nG4T2ff=1WO QE5 27 2vs 4- 31281 1 1 $3l 3 2 2IE f$338xoa dffamw ('z(4)94)~ Wnlf 330~ 7 ~4 128 4 14 3 1. 1x $318jm saifle $1'2Tntfrg13Y ff--l9 27 saA 21 4123015M 21 $ _ $131 Table 2. 129 snpid el e2 pl p2 ml m2 pel pe2 pp1 pp2 pm1 pm2 101100940 C T T C C T 0.9538 0.8902 0.8626 0.8580 0.8654 0.9101 101164838 T C T C T C 0.9359 0.9521 0.9406 0.9253 0.9957 0.8770 rs1463589 C C T C C C 0.9428 0.9928 0.9841 0.9266 0.8661 0.9798 101028396 C G C G C C 0.9252 0.8792 0.9246 0.9856 0.9819 0.8631 101204217 A G G A G G 0.9799 0.9843 0.9194 0.9478 0.9438 0.9709 101214313 A G G A G A 0.8513 0.9863 0.9521 0.9707 0.8570 0.9639 101231593 G A G G A A 0.9857 0.9653 0.8908 0.9036 0.9431 0.9832 Ts1426442 G C C G C G 0.9338 0.9278 0.9469 0.9514 0.8766 0.9017 rs7486852 C C C T T C 0.9566 0.9616 0.9390 0.8673 0.8785 0.8889 101266729 A G A G . A G 0.9238 0.9500 0.9026 0.9855 0,8760 0.9381 Table 3. snp id el e2. .pl p2 ml m2. -pel pe2 -pp1 pp2 m pm 101019515 G G G G G G 0.9134 0.8768 0.8666 0.9690 0.8679 0.8599 101100940 C T T C C T 0.9538 0.8902 0.8626 0.8580 0.8654 0-9101 101160854 A A A A A A 0.8705 0.9769 0.8763 0.8870 0.9311 0.9553 .rs4980809 A G G A A G 0.9638 0.9951 0.9582 0.9621 0.9197 0.9199 101058479 G A G A G A 0.9003 0.9885 0.8906 0.9235 0.9787 0.8792 101236938 G G G G G A 0.8528 0.9710 0.8810 0.9249 0.9274 0.9891 rs7137405 T T T T T A 0.9360 0.9918 0.9148 0.9558 0.9135 0.9388 101251161 G G G G G G 0.9802 0.8620 0.9372 0.8501 0.9891 0.8679 101270051 G G G G G A 0.9004 0.9643 0.9778 0.9060 0.9943 0.8962 rs215227 G- G G G G A 0.9244 0.9236 0.9629 0.8575 0.9019 0.9362 101245075 G G G G G G 0.9958 0.8593 0.9129 0.8504 0.8534 0.9866 101158538 A G A G G G 0.9471 0.8909 0.8710 0.9581 0.8961 0.9046 rs2535386 A A A A A A 0.9273 0.9479 0.9867 0.8918 0.9264 0.9750 rs6489653 T T T T T T 0.9453 0.9776 0.9051 0.8547 0.9636 0.9532 101137205 C G C C G G 0.8619 0.9503 0.9029 0.9426 0.8845 0.9282 101089311 T C C C C T 0.8844 0.9381 0.9719 0.8636 0.9186 0.9652 101205712 A A A A A A 0.8513 0.9226 0.8755 0.8999 0.9193 0.8535 101124605 G G G G G G 0.8981 0.9093 0.9075 0.8676 0.8931 0.9258 101025989 G T T G G T 0.9695 0.9016 0.8722 0.8821 0.9787 0.9273 rs4766370 T A A T T A 0.8886 0.9166 0.8762 0.8767 0.9890 0.8536 Table 4. 130 101100940 CT p2 m2 CT TC CT 0.8416 0.5206 101164838 CT p2 ml TC TC TC 0.9061 0.5206 rs1463589 CC p2 ml CC TC CC 0.9946 0.5206 101028396 GC p2 ml CG CG CC 0.9791 0.5206 101204217 AG p2 m2 AG GA GG 0.9577 0.5206 101214313 GA pl m2 AG GA GA 0.9308 0.5206 101231593 GA p1 m2 GA GG AA 1.0000 0.5206 rs1426442 CG pl m2 GC CG CG 0.9198 0.5206 rs7486852 CC p1 m2 CC CT TC 0.9138 0.5206 101266729 AG p1 m2 AG AG AG 0.9296 0.5206 Table 5. 101019515 GO pl mI GG GG GG 1.0000 0.9890 101100940 TC plm1 CT TC CT 0.9946 0.9890 .101160854 AA plmi AA AA AA 1.0000 0.9890 rs4980809 GA pl m1 AG GA AG 0.9961 0.9890 101058479 GG pI mi GA GA GA 0.9957 0.9890 101236938 GG p1 ml GG GG GA 1.0000 0.9890 rs7137405 TT plm1 TT TT TA 1.0000 0.9890 101251161 GG pl ml GG GG GG 1.0000 0.9890 101270051 GG p1 ml GG GG GA 1.0000 0.9890 rs215227 GG plmi GG GG GA 1.0000 0.9890 101245075 GG pl ml GG GG GG 1.0000 0.9890 101158538 AG pi mi AG AG GG 0.9977 0.9890 rs2535386 AA p1 ml AA AA AA 1.0000 0.9890 rs6489653 TT p1 ml TT TT TT 1.0000 0.9890 101137205 CG pi mi CG CC GG 1.0000 0.9890 101089311 CC pI mi TC CC CT 0.9940 0.9890 101205712 AA pl ml AA AA AA 1.0000 0.9890 101124605 GG pI mI GG GG GG 1.0000 0.9890 101025989 TG pl ml GT TG GT 0.9973 0.9890 rs4766370 AT p1 ml TA AT TA 0.9973 0.9890 - Table 6. 131 DHAlgorithml DHAIgorithm2 Pop.Freq ph pd P1 accuracy P2accuracy PI accuracy P2accuracy data 0.95 0.95 0.982 0.951 0.95 0.906 data 0.75 0.75 0.891 0.811 0.749 0.618 data 0.25 0.25 0.71 0.71 0.253 0.25 data 0.5 0.9 0.849 0.838 0.499 0.768 data 0.9 0.5 0.942 0.734 0.898 0.347 data 0.6 0.8 0.852 0.816 0.601 0.673 uniform 0.95 0.95 0.95 0.906 0.949 0.905 uniform 0.75 0.75 0.749 0.612 0.749 0.612 uniform 0.25 0.25 0.25 0.248 0.25 0.25 uniform 0.5 0.9 0.69 0.669 0.501 0.671 uniform 0.9 0.5 0.901 0.412 0.901 0.413 uniform 0.6 0.8 0.678 0.618 0.6 0.618 Table 7. PSAlgorithml PSAlgorithm2 Pop.Freq ph pd pe P1accuracy P2accuracy P1accuracy P2accuracy data 0.95 0.95 0.95 0.834 0.815 0.928 0.931 data 0.75 0.75 0.75 0.797 0.769 0.819 0.819 data 0.25 0.25 0.25 0.711 0.682 0.703 0.687 data 0.5 0.9 0.9 0.849 0.838 0.866 0.864 data 0.9 0.5 0.5 0.792 0.809 0.756 0.752 data 0.6 0.8 0.8 0.777 0.788 0.835 0.828 uniform 0.95 0.95 0.95 0.673 0.631 0.898 0.901 uniform 0.75 0.75 0.75 0.549 0.497 0.635 0.65 uniform 0.25 0.25 0.25 0.239 0.249 0.252 0.25 uniform 0.5 0.9 0.9 0.601 0.611 0.814 0.818 uniform 0.9 0.5 0.5 0.459 0.391 0.449 0.468 uniform 0.6 0.8 0.8 0.544 0.511 0.672 0.679 Table 8. jt~ W3 R,-Aaezt ~ : I A Table 9. 132 F~rB~/D 0~28 'Vdz 0024 a4-, NET uwG ~Q "0 P A~ i~ 6t Ws0.~ S~& ~7 005 U~02 Awe 0'- 11 . S 0%f~ !'~e .2t tf2 Table 10. Table 129 -% Igkf& WPM &IW dent. ---------- Table 13 A' M OM I -2 MI 133: do* M. T ~1C V ~ ' Table 14. Table 15. 134 Other embodiments of the invention as described herein are defined in the following paragraphs: 1. A method for determining genetic data of a target individual based on the imperfect knowledge of the target individual's genetic data, and knowledge of the genetic 5 data of one or more individuals genetically related to the target, the method comprising: (i) creating a set of one or more hypotheses concerning which segments of which chromosomes from the related individual(s) correspond to those segments found in the target individual's genome, (ii) determining the probability of each of the hypotheses given the measurements 10 of the target individual's genetic data and of the related individual's genetic data, and (iii) using the probabilities associated with each hypothesis to determine the most likely state of the actual genetic material of the target individual. 2. The method of paragraph 1, wherein the method involves determining which 15 regions of the parental chromosomes have the maximum likelihood of having contributed to the formation of the gametes that contributed to the target individual, based on the measurements of genetic data of the target, and determination of the likelihood of those particular measurements given the genetic data of the parents. 20 3. The method of paragraph 1, where the haplotypes of at least one of the parents have been determined by using genetic data measured from the parent's diploid sample, and the genetic data measured from a haploid sample from the parent which is used to determine which alleles measured from the diploid sample belong to which haplotype. 25 4. The method of paragraph 1, wherein the genetic data from genetically related individual(s) is(are) taken from a group comprising genetic data from a diploid maternal sample, a haploid maternal sample, a diploid paternal sample, and a haploid paternal sample. 30 5. The method of paragraph 1, wherein a confidence is computed for each of the individual SNP calls in the cleaned embryonic genetic data. 6. The method of paragraph 1, wherein the genetic data from the genetically related individual(s) is(are) taken from a group comprising genetic data from diploid maternal 135 cells, diploid paternal cells, haploid paternal cells, diploid cells from the maternal grandfather, and haploid cells from the maternal grandfather. 7. The method of paragraph 1, wherein the genetic data from the genetically related 5 individual(s) is(are) taken from a group comprising genetic data from diploid maternal cells, diploid paternal cells, and diploid cells from a related individual known to be a carrier of the phenotype in question. 8. The method of paragraph 1, wherein the genetically related individual(s) is/are 10 taken from a group consisting of the father, mother, son, daughter, brother, sister, grandfather, grandmother, uncle, aunt, nephew, niece, grandson, granddaughter, cousin, clone, other individuals with known genetic relationship to the target, and combinations thereof. 15 9. The method of paragraph 1, where the target individual is taken from a group consisting of an adult human, a juvenile human, a human fetus, a human embryo, a non human adult, a non-human juvenile, a non-human fetus, and a non-human embryo. 10. The method of paragraph 1, where one or more of the individual's genetic data has 20 been amplified using tools and/or techniques taken from the group comprising Polymerase Chain Reaction (PCR), Ligand mediated PCR, degenerative oligonucleotide primer PCR, Multiple Displacement Amplification, allele-specific amplification techniques, and combinations thereof. 25 11. The method of paragraph 1, wherein one or more of the individual's genetic data has been measured using tools and or techniques taken from the group comprising Molecular Inversion Probes (MIP), Genotyping Microarrays, the Taqman SNP Genotyping Assay, the Ulumina Genotyping System, other genotyping assays, fluorescent in-situ hybridization (FISH), and combinations thereof. 30 12. The method of paragraph 1, where one or more of the individual's genetic data has been measured by analyzing substances taken from the group comprising the individual's bulk diploid tissue, one or more diploid cells taken from the individual, one or more blastocysts taken from the individual, the individual's semen, the individual's egg, 35 extracellular genetic material found on the individual, extra-cellular genetic material from 136 the individual found in maternal blood, extra-cellular genetic material from the individual found in maternal plasma, cells from the individual found in maternal blood, genetic material known to have originated from the individual, and combinations thereof. 5 13. The method of paragraph 1, where one or more of the related individual's genetic data is partly or wholly known in silico, or is provided by parties other than those determining the target individual's genetic data. 14. The method of paragraph 1, where the haploid genetic data of one or more of the 10 individuals is partly or wholly generated in silico by means of a computer algorithm that simulates haploid data from diploid data. 15. The method of paragraph 14, wherein the computer algorithm comprises a hidden Markov model. 15 16. The method of paragraph 1, wherein the determination of the target's genetic data is used for the purpose of embryo selection in the context of in-vitro fertilization. 17. The method of paragraph 1, wherein the determination of the target's genetic data 20 is used for the purpose of prenatal genetic diagnosis. 18. The method of paragraph 1, wherein the determination of the target's genetic data is used for the purpose of making predictions of phenotype susceptibility using statistical models and/or expert rules. 25 19. The method of paragraph 1, wherein the determination of the target's genetic data is used for the purpose of making phenotypic predictions, and where the likelihood of displaying some or all of the phenotypes is affected by other previously known phenotypic information. 30 20. The method of paragraph 1, wherein the determination of the target's genetic data is used for the purpose of making phenotypic predictions, and where the predictions are made by comparing the target's genetic data to known genetic markers found in the public domain. 35 137 21. The method of paragraph 1, wherein the determination of the target's genetic data is used for the purpose of making clinical decisions. 22. The method of paragraph 1, wherein the determination of the target's genetic data 5 is used, in combination with phenotypic markers, for the purpose of making clinical decisions. 23. The method of paragraph 1, wherein the determination of the target's genetic data is used for the purpose of screening for susceptibility to one or more diseases, where no 10 family history of the disease(s) is (are) present. 24. The method of paragraph 1, wherein the determination of the target's genetic data is used for the purpose of screening for susceptibility to one or more phenotypes, where some or all of the phenotypes are multigenic. 15 25. The method of paragraph 1, wherein the knowledge of the target's genetic data is known to or suspected to contain spurious data from contaminating DNA or RNA. 26. The method of paragraph 1, wherein the genetic data of one or more of the 20 individuals includes the allele calls for a plurality of SNPs and the confidence with which each SNP is known. 27. The method of paragraph 1, wherein the confidence in the SNP calls of the target individual is determined by computing the odds ratio of the probabilities that the SNP is 25 correctly versus incorrectly called. 28. A system configured to accomplish the method of paragraph 1. 29. A computer-implemented system configured to accomplish the method of 30 paragraph 1. 30. A method wherein the measurement of multiple loci on a given segment of a given chromosome of a target individual is used to determine the number of instances of the given segment in the genome of the target individual, the method comprising: 138 (i) creating a set of one or more hypotheses about the number of instances of the given segment present in the genome of the target individual, (ii) measuring the amount of genetic data for some or all of the possible alleles at a plurality of loci on the given segment, 5 (iii) determining the relative probability of each of the hypotheses given the measurements of the target individual's genetic data and possibly also the related individual's genetic data, and (iv) using the relative probabilities associated with each hypothesis to determine the most likely state of the actual genetic material of the target individual. 10 31. The method of paragraph 30 where the determination of the number of instances of segments of chromosomes that are present in the target's genome is done in the context of screening for a chromosomal abnormality, that abnormality taken from a list comprising monosomy, uniparental disomy, trisomy, other aneuploidies, unbalanced 15 translocation, and combinations thereof. 32. The method of paragraph 30 where the determination of the relative probability of each hypothesis is made using the concept of matched filtering. 20 33. The method of paragraph 30 where the determination of the relative probability of each hypothesis is made using quantitative techniques that do not make allele calls and where the mean and standard deviation for the measurement of each locus is either known, unknown, or uniform. 25 34. The method of paragraph 30 where the determination of the relative probability of each hypothesis is made using qualitative techniques that make use of allele calls. 35. The method of paragraph 30 where the determination of the relative probability of each hypothesis is made by making use of known alleles of reference sequences, and 30 quantitative allele measurements. 36. The method of paragraph 30, where the target individual is taken from a group consisting of an adult human, a juvenile human, a human fetus, a human embryo, a non human adult, a non-human juvenile, a non-human fetus, and a non-human embryo. 35 139 37. The method of paragraph 30, wherein the target individual's genetic data has been amplified using tools and or techniques taken from the group comprising Polymerase Chain Reaction (PCR), Ligand mediated PCR, degenerative oligonucleotide primer PCR, Multiple Displacement Amplification, allele-specific amplification and combinations 5 thereof. 38. The method of paragraph 30, wherein the target individual's genetic data has been measured using tools and or techniques taken from the group comprising Molecular Inversion Probes (MIP), Genotyping Microarrays, the Taqman SNP Genotyping Assay, 10 the Illumina Genotyping System, other genotyping assays, fluorescent in-situ hybridization (FISH), and combinations thereof. 39. The method of paragraph 30, wherein the target individual's genetic data has been measured by analyzing substances taken from the group comprising the target individual's 15 bulk diploid tissue, one or more diploid cells taken from the target individual, one or more blastocysts taken from the target individual, extra-cellular genetic material found on the target individual, extra-cellular genetic material from the target individual found in maternal blood, cells from the target individual found in maternal blood, genetic material known to have originated from the target individual, and combinations thereof. 20 40. The method of paragraph 30, wherein the determination of the number of chromosomes or chromosome segments in the target is used for the purpose of embryo selection in the context of in-vitro fertilization. 25 41. The method of paragraph 30, wherein the determination of the number of chromosomes or chromosome segments of the target is used for the purpose of prenatal genetic diagnosis. 42. A system configured to accomplish the method of paragraph 30. 30 43. A computer-implemented system configured to accomplish the method of paragraph 30. 44. A method for determining the genetic data of a target individual, and also the 35 number of instances of chromosomes or segments of chromosomes that are present in the 140 target's genome, based on the imperfect knowledge of the target individual's genetic data, and knowledge of the genetic data of one or more individuals genetically related to the target, the method comprising: (i) creating a set of one or more hypotheses about which segments of which 5 chromosomes from the related individual(s) correspond to those segments found in the target individual's genome, (ii) creating a set of one or more hypotheses about the number of a given chromosomal segment present in the genome of the target, (iii) measuring the amount of genetic data for each of the possible alleles at a 10 plurality of loci on the given segment, (iv) determining the relative probability of each of the hypotheses given the measurements of the target individual's genetic data and of the related individual's genetic data, and (v) using the relative probabilities associated with each hypothesis to determine 15 the most likely state of the actual genetic material of the target individual. 45. A method for making predictions regarding an individual, the method comprising: (i) constructing models based on contingency tables built from publicly available information about gene-disease associations; and 20 (ii) applying models to make the predictions by operating on data pertaining to the individual. 46. The method of paragraph 45, wherein an accuracy of the contingency tables that employ multiple independent variables can be refined using outcome data where only a 25 subset of those independent variables was measured. 47. The method of paragraph 45, wherein an accuracy of the contingency tables that employ multiple independent variables can be refined using data about the association of the independent variables. 30 48. The method of paragraph 45, wherein an accuracy of the contingency tables that employ multiple independent variables can be refined using data about the frequency of occurrence of certain values of the independent variables. 141 49. A method for making predictions regarding a first individual, the method comprising: (i) creating and testing a plurality of models using aggregated data from a second set of individuals for whom the outcome to be predicted is known; 5 (ii) calculating the relative accuracies of the various models for making the prediction given the data available on the first individual; and (iii) using the model that is identified as the most accurate to make a prediction for the first individual. 10 50. The method of paragraph 45, where the type of data pertaining to the individual comprises data taken from a group consisting of the individual's genotypic data, the individual's phenotypic data, the individual's clinical data, and the individual's laboratory data. 15 51. The method of paragraph 49, where the type data pertaining to the individual comprises data taken from a group consisting of the individual's genotypic data, the individual's phenotypic data, and the individual's clinical data, and the individual's laboratory data. 20 52. The method of paragraph 45, where the type of data also consists of data of a pathogen infecting the individual. 53. The method of paragraph 49, where the type of data also consists of data of a pathogen infecting the individual. 25 54. The method of paragraph 45, where said predictions concern topics selected from the group consisting of the individual's phenotypes, phenotype susceptibilities, possible clinical outcomes, lifestyle decisions, physical exercise, dietary habits, hormonal supplements, nutritional supplements, treatments for a disease, treatments for a pathogen, 30 treatments for an undesirable condition, treatments with pharmaceuticals, and combinations thereof. 55. The method of paragraph 49, where said predictions concern topics selected from the group consisting of the individual's phenotypes, phenotype susceptibilities, possible 35 clinical outcomes, lifestyle decisions, physical exercise, mental exercise, dietary habits, 142 hormonal supplements, nutritional supplements, treatments for a disease, treatments for a pathogen, treatments for an undesirable condition, treatments with pharmaceuticals, and combinations thereof. 5 56. The method of paragraph 45, where said predictions are used to generate a report for the individual or for an agent of the individual. 57. The method of paragraph 49, where said predictions are used to generate a report for the individual or for an agent of the individual. 10 58. The method of paragraph 45, wherein the act of operating comprises operating on new data to update the individual's predictions where the data is taken from a group comprising new research data or new aggregated data on other subjects. 15 59. The method of paragraph 49, wherein the act of operating comprises operating on new data to update the individual's predictions where the data is taken from a group comprising new research data or new aggregated data on other subjects. 60. A system configured to accomplish the method of paragraph 45. 20 61. A system configured to accomplish the method of paragraph 49. Still further embodiments are within the scope of the following claims. 143

Claims (30)

1. A method for detecting the presence or absence of a chromosomal abnormality in a target individual, the method comprising: (a) measuring the amount of genetic material at multiple loci on a chromosome or chromosome segment of interest in a sample comprising DNA from the target individual; (b) comparing the amount from step (a) to either (i) a threshold value or (ii) an expected amount for a particular copy number hypothesis; and (c) identifying the presence or absence of a chromosomal abnormality based on the comparison.
2. The method of claim 1, wherein the expected amount is a reference amount.
3. The method of claim 1, wherein the expected amount is the mean value for a reference chromosome or chromosome segment that is present in two copies.
4. The method of claim 1, wherein the expected amount is the mean value for samples with two copies of a reference chromosome or chromosome segment.
5. The method of claim 1, further comprising determining the number of copies of the chromosome or chromosome segment of interest in the genome of the target individual.
6. The method of claim 1, wherein the amount of genetic material at a particular locus is determined irrespective of the identity of the alleles at the locus.
7. The method of claim 1, wherein the multiple loci comprise an allele with 100% penetrance.
8. The method of claim 1, wherein sequencing is used to measure the amount of genetic material at the multiple loci.
9. The method of claim 1, wherein the identification of the presence or absence of a chromosomal abnormality is used to make a clinical decision about the target individual. 144
10. The method of claim 9, wherein the target individual(s) is one or more embryos, and wherein the method further comprises: (i) using the identification of the presence or absence of a chromosomal abnormality in the one or more embryos to select an embryo for in vitro fertilization; and (ii) performing in vitro fertilization with the selected embryo.
11. The method of claim 1, wherein the target individual is a fetus, and wherein the sample is a maternal blood sample comprising DNA from the fetus and DNA from the mother of the fetus.
12. The method of claim 1, wherein the target individual is a fetus, and wherein the method further comprises performing amniocentesis or chorion villus biopsy.
13. The method of claim 1, wherein the chromosome of interest is selected from the group consisting of chromosome 13, chromosome 18, chromosome 21, the X chromosome, the Y chromosome, and combinations thereof.
14. The method of claim 1, wherein the chromosomal abnormality is selected from the group consisting of monosomy, uniparental disomy, trisomy, other aneuploidies, unbalanced translocations, insertions, deletions, and combinations thereof.
15. The method of claim 1, wherein the method comprises determining whether the target individual has Down syndrome, Klinefelters syndrome, or Turner syndrome.
16. A method for determining the number of copies of a chromosome or chromosome segment of interest in the genome of a target individual, the method comprising: (a) measuring the amount of genetic material at multiple loci on a chromosome or chromosome segment of interest in a sample comprising DNA from the target individual; (b) creating a set of one or more hypotheses about the number of copies of the chromosome or chromosome segment of interest in the genome of the target individual; (c) determining on a computer the probability of each of the hypotheses being true or false given the measurements from step (a); and (d) determining the number of copies of the chromosome or chromosome segment of interest in the genome of the target individual using the probabilities associated with each hypothesis. 145
17. The method of claim 16, wherein the probabilities are determined using a combined measurement that is a function of the measurements from step (a) at the multiple loci.
18. The method of claim 16, wherein the probabilities are determined using a comparison of the measurements from step (a) to either (i) a threshold value or (ii) an expected amount for a particular copy number hypothesis.
19. The method of claim 18, wherein the expected amount is a reference amount.
20. The method of claim 18, wherein the expected amount is the mean value for a reference chromosome or chromosome segment that is present in two copies.
21. The method of claim 16, wherein the amount of genetic material at a particular locus is determined irrespective of the identity of the alleles at the locus.
22. The method of claim 16, wherein the multiple loci comprise an allele with 100% penetrance.
23. The method of claim 16, wherein sequencing is used to measure the amount of genetic material at the multiple loci.
24. The method of claim 16, wherein the copy number determination is used to make a clinical decision about the target individual.
25. The method of claim 24, wherein the target individual(s) is one or more embryos, and wherein the method further comprises (i) using the copy number determination for the one or more embryos to select an embryo for in vitro fertilization, and (ii) performing in vitro fertilization with the selected embryo.
26. The method of claim 16, wherein the target individual is a fetus, and wherein the sample is a maternal blood sample comprising DNA from the fetus and DNA from the mother of the fetus.
27. The method of claim 16, wherein the target individual is a fetus, and wherein the method further comprises performing amniocentesis or chorion villus biopsy. 146
28. The method of claim 16, wherein the chromosome of interest is selected from the group consisting of chromosome 13, chromosome 18, chromosome 21, the X chromosome, the Y chromosome, and combinations thereof.
29. The method of claim 16, wherein the method comprises screening for a chromosomal abnormality in the genome of the target individual; wherein the chromosomal abnormality is selected from the group consisting of monosomy, uniparental disomy, trisomy, other aneuploidies, unbalanced translocations, insertions, deletions, and combinations thereof.
30. The method of claim 16, wherein the method comprises determining whether the target individual has Down syndrome, Klinefelters syndrome, or Turner syndrome. Date: 4 April 2013 147
AU2013202555A 2005-11-26 2013-04-04 System and Method for Cleaning Noisy Genetic Data and Using Data to Make Predictions Abandoned AU2013202555A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU2013202555A AU2013202555A1 (en) 2005-11-26 2013-04-04 System and Method for Cleaning Noisy Genetic Data and Using Data to Make Predictions
AU2016201386A AU2016201386B2 (en) 2005-11-26 2016-03-03 System and Method for Cleaning Noisy Genetic Data and Using Data to Make Predictions

Applications Claiming Priority (10)

Application Number Priority Date Filing Date Title
US60/739,882 2005-11-26
US60/742,305 2005-12-06
US60/754,396 2005-12-29
US60/774,976 2006-02-21
US60/789,506 2006-04-04
US60/817,741 2006-06-30
US11/496,982 2006-07-31
US60/846,610 2006-09-22
AU2006318425A AU2006318425B2 (en) 2005-11-26 2006-11-22 System and method for cleaning noisy genetic data and using data to make predictions
AU2013202555A AU2013202555A1 (en) 2005-11-26 2013-04-04 System and Method for Cleaning Noisy Genetic Data and Using Data to Make Predictions

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
AU2006318425A Division AU2006318425B2 (en) 2005-11-26 2006-11-22 System and method for cleaning noisy genetic data and using data to make predictions

Related Child Applications (1)

Application Number Title Priority Date Filing Date
AU2016201386A Division AU2016201386B2 (en) 2005-11-26 2016-03-03 System and Method for Cleaning Noisy Genetic Data and Using Data to Make Predictions

Publications (1)

Publication Number Publication Date
AU2013202555A1 true AU2013202555A1 (en) 2013-05-02

Family

ID=48480812

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2013202555A Abandoned AU2013202555A1 (en) 2005-11-26 2013-04-04 System and Method for Cleaning Noisy Genetic Data and Using Data to Make Predictions

Country Status (1)

Country Link
AU (1) AU2013202555A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114045298A (en) * 2021-09-09 2022-02-15 武汉科前生物股份有限公司 Mycoplasma synoviae subunit vaccine and preparation method and application thereof
CN116843735A (en) * 2023-08-17 2023-10-03 长春工业大学 Machine learning-based three-dimensional point cloud accurate registration method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114045298A (en) * 2021-09-09 2022-02-15 武汉科前生物股份有限公司 Mycoplasma synoviae subunit vaccine and preparation method and application thereof
CN114045298B (en) * 2021-09-09 2023-10-20 武汉科前生物股份有限公司 Chicken bursa mycoplasma subunit vaccine and preparation method and application thereof
CN116843735A (en) * 2023-08-17 2023-10-03 长春工业大学 Machine learning-based three-dimensional point cloud accurate registration method
CN116843735B (en) * 2023-08-17 2023-12-29 长春工业大学 Machine learning-based three-dimensional point cloud accurate registration method

Similar Documents

Publication Publication Date Title
AU2006318425B2 (en) System and method for cleaning noisy genetic data and using data to make predictions
US10597724B2 (en) System and method for cleaning noisy genetic data from target individuals using genetic data from genetically related individuals
US8682592B2 (en) System and method for cleaning noisy genetic data from target individuals using genetic data from genetically related individuals
AU2016201386B2 (en) System and Method for Cleaning Noisy Genetic Data and Using Data to Make Predictions
AU2013202555A1 (en) System and Method for Cleaning Noisy Genetic Data and Using Data to Make Predictions
US20200321075A1 (en) Genotyping polyploid loci

Legal Events

Date Code Title Description
MK5 Application lapsed section 142(2)(e) - patent request and compl. specification not accepted