CN105648045B

CN105648045B - Method and apparatus for determining the haplotype of a target region of a fetus

Info

Publication number: CN105648045B
Application number: CN201410639577.7A
Authority: CN
Inventors: 袁媛; 王垚燊; 朱红梅; 易鑫
Original assignee: TIANJIN BGI TECHNOLOGY Co Ltd; BGI Shenzhen Co Ltd
Current assignee: TIANJIN BGI TECHNOLOGY Co Ltd; BGI Shenzhen Co Ltd
Priority date: 2014-11-13
Filing date: 2014-11-13
Publication date: 2019-10-11
Anticipated expiration: 2034-11-13
Also published as: CN105648045A

Abstract

The present invention provides a method and an apparatus for determining the haplotype of a target region of a fetus. The method includes: sequencing the target region of free nucleic acid in a body fluid of a pregnant woman to obtain first sequencing data; sequencing the same target region of family members of the fetus to obtain second, third and fourth sequencing data, wherein the second sequencing data is the sequencing data of the mother of the fetus, the third sequencing data is the sequencing data of the father of the fetus, and the fourth sequencing data is the sequencing data of the proband; determining the fetal nucleic acid content in the body fluid of the pregnant woman based on the first, second and, optionally, third sequencing data; constructing the target region haplotype of the mother and the target region haplotype of the father based on the second, third and fourth sequencing data; and determining the target region haplotype of the fetus based on the target region haplotypes of the mother and the father and the fetal nucleic acid content.

Description

Method and apparatus for determining the haplotype of a target region of a fetus

Technical Field

The present invention relates to the field of bioinformatics, and in particular to a method and apparatus for determining the haplotype of a target region of a fetus.

Background

Spinal muscular atrophy (SMA) is a group of common autosomal recessive genetic diseases. It ranks second among lethal autosomal recessive genetic diseases, with an incidence of 1/6000 to 1/10000 in live-born infants. Current studies show that SMA is mainly caused by deletion of the SMN gene: SMN1 is the disease-determining gene and expresses the complete, stable functional SMN protein, while SMN2 is a modifier gene of SMA. It has been reported that 98.7% (226/229) of pediatric patients carry a deletion of the SMN1 gene, and about 90% of SMA patients show a homozygous deletion of exon 7 and/or exon 8 of SMN1. SMA is a severe neuromuscular disease for which no effective clinical treatment currently exists, so prenatal diagnosis is an important means of preventing this birth defect.

The discovery of fetal free DNA in the peripheral plasma of pregnant women made noninvasive prenatal detection of the fetal genotype possible. However, no report of noninvasive fetal SMA detection using maternal plasma free DNA has been found to date. Existing SMA detection reports mostly detect deletion-type SMN1 mutations by designing QPCR primers and probes against exon 7 of SMN1, for example 'a fluorescence quantitative PCR kit for diagnosing human spinal muscular atrophy' disclosed by Xuxiangmin et al. (publication No. CN 103614477A). However, because the fetal DNA content of maternal plasma is relatively low, QPCR is not sensitive enough to detect mutations of the fetal SMN1 gene against the high maternal background.

Therefore, the development of a detection method capable of non-invasively detecting the SMN1 genotype of the fetus plays an important role in prenatal diagnosis of the disease.

Disclosure of Invention

According to an aspect of the present invention, there is provided a method of determining the haplotype of a target region of a fetus, the method comprising the steps of: sequencing the target region of free nucleic acid in a body fluid of a pregnant woman to obtain first sequencing data; sequencing the target region of family members of the fetus to obtain second sequencing data, third sequencing data and fourth sequencing data, wherein the second sequencing data is the sequencing data of the mother of the fetus, the third sequencing data is the sequencing data of the father of the fetus, and the fourth sequencing data is the sequencing data of the proband; determining the fetal nucleic acid content in the maternal body fluid based on the first sequencing data, the second sequencing data and, optionally, the third sequencing data; constructing a target region haplotype of the mother of the fetus and a target region haplotype of the father of the fetus based on the second sequencing data, the third sequencing data and the fourth sequencing data; and determining the target region haplotype of the fetus based on the target region haplotype of the mother of the fetus, the target region haplotype of the father of the fetus, and the fetal nucleic acid content. The first, second, third and fourth sequencing data need not be obtained in any particular order: they can be obtained simultaneously, one by one, or several at a time. Likewise, the step of determining the fetal nucleic acid content and the step of constructing the parental haplotypes need not be performed in any particular order.

According to another aspect of the present invention there is provided an apparatus for determining the haplotype of a target region of a fetus, the apparatus being capable of performing some or all of the steps of the method provided by one aspect of the present invention, the apparatus comprising: a sequencing unit, configured to perform sequencing on the target region of free nucleic acid in a body fluid of a pregnant woman to obtain first sequencing data, and perform sequencing on the target region of a family member of the fetus to obtain second sequencing data, third sequencing data and fourth sequencing data, wherein the second sequencing data is sequencing data of a mother of the fetus, the third sequencing data is sequencing data of a father of the fetus, and the fourth sequencing data is sequencing data of a proband; a fetal nucleic acid content determination unit, connected to the sequencing unit, for determining a fetal nucleic acid content in the body fluid of the pregnant woman based on the first sequencing data, the second sequencing data and optionally the third sequencing data; a parent haplotype determining unit connected with the sequencing unit and used for respectively constructing a target region haplotype of the mother of the fetus and a target region haplotype of the father of the fetus based on the second sequencing data, the third sequencing data and the fourth sequencing data; and a fetal haplotype determining unit, coupled to the fetal nucleic acid content determining unit and the parent haplotype determining unit, for determining a target regional haplotype of the fetus based on the target regional haplotype of the mother of the fetus, the target regional haplotype of the father of the fetus, and the fetal nucleic acid content.

The method and/or apparatus of one aspect of the invention is based on target region capture and linkage analysis of the target region haplotypes of the family. The fetal target region genotype is deduced by linkage analysis from the sequencing data of a maternal body fluid sample, such as the peripheral plasma DNA of the pregnant woman, and can be used to judge, or to assist in judging, whether the fetus has a disease or abnormality associated with variation in the target region. By using linked haplotype information, the method or apparatus greatly reduces the false negative and false positive results caused by inaccurate allele ratios or sequencing errors at a single site, so that the detection result is more accurate and reliable. Applied to families at high risk of SMN1-related disease, the method can effectively detect affected fetuses and reduce unnecessary invasive sampling procedures such as amniocentesis.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic diagram of an apparatus for determining the haplotype of a target region of a fetus in accordance with one embodiment of the present invention;

FIG. 2 is a schematic diagram of the overall technical route for fetal genotype determination in one embodiment of the present invention;

FIG. 3 shows the results of determining the genotype of a fetus according to an embodiment of the present invention: FIG. 3A shows the result of determining the haplotype inherited from the father of the fetus, and FIG. 3B shows the result of determining the haplotype inherited from the mother of the fetus. In the figure, each point represents the difference between the probability that an SNP locus was inherited from Hap0 and the probability that it was inherited from Hap1, and the loop is the combined judgment result.

Detailed Description

According to an embodiment of the present invention, there is provided a method for determining a haplotype of a target region of a fetus, comprising the steps of:

Step one: obtaining the first, second, third and fourth sequencing data.

Free nucleic acid is obtained from a body fluid of a pregnant woman, the target region is captured, and the captured target region is sequenced to obtain the first sequencing data. The maternal body fluid sample is a sample containing fetal nucleic acid, such as maternal peripheral blood plasma; the free nucleic acid extracted from peripheral blood is a highly fragmented mixture of maternal and fetal nucleic acids. Following the protocol of an existing sequencing platform, free nucleic acid is extracted from a peripheral blood sample of the pregnant woman and used to construct a sequencing library, the target region sequencing library is obtained by capture with a probe, a chip or a liquid-phase probe, and the target region sequencing library is sequenced on the instrument to obtain the first sequencing data, which is mixed data of maternal and fetal nucleic acids. Sequencing platforms include, but are not limited to, CG (Complete Genomics), Illumina/Solexa, Life Technologies ABI SOLiD and Roche 454. The sequencing library is prepared according to the selected platform, and single-end or paired-end sequencing can be chosen. Each set of sequencing data consists of a plurality of short sequences, each of which is called a read.

The chip used for capture consists of a solid-phase substrate and a plurality of probes fixed on it. The probes recognize the target region, which can be part of the genomic DNA of the sample to be tested or the whole genome. In one embodiment of the invention, the target capture region comprises the exon regions of the SMN1 gene, whose positions on the reference genome HG19 are shown in Table 1. The target region also comprises SNP sites of high heterozygosity within the SMN1 gene and in the 3 Mb regions upstream and downstream of the SMN1 gene; the distribution of the number of SNPs in each region is shown in Table 2, and the minor allele frequency (MAF) of the SNPs is between 0.3 and 0.5. Information on these regions and sites facilitates the determination and analysis of fetal haplotypes. Capturing the 3 Mb regions upstream and downstream of the target gene reduces the recombination probability to less than one in ten thousand, so that subsequent haplotype construction or determination can be carried out accurately. Capturing SNP sites of high heterozygosity makes it easy to obtain sites or sequences specific to the fetus, from which the fetal nucleic acid content of the mixed DNA can be estimated. When designing probes that specifically recognize these regions, to ensure capture performance and detection accuracy, each probe containing at least one SNP site must align uniquely to the reference genome, which enhances the specificity of the captured target sites. The GC content of each probe is designed to be 40-50%, so that the whole set of probes can bind specifically to the target regions in the same system and be eluted together in the same reaction system.

TABLE 1 capture ranges for SMN1 gene regions

Region	Chromosome (chr)	Start position (start)	End position (end)
1	chr5	70220738	70221835
2	chr5	70222126	70223263
3	chr5	70223351	70223620
4	chr5	70224046	70224569
5	chr5	70224596	70225332
6	chr5	70225421	70227146
7	chr5	70227276	70229560
8	chr5	70229641	70230603
9	chr5	70230671	70231084
10	chr5	70231091	70231402
11	chr5	70231511	70232075
12	chr5	70232161	70232534
13	chr5	70233276	70233724
14	chr5	70234111	70235041
15	chr5	70235136	70235933
16	chr5	70236016	70236631
17	chr5	70236716	70239101
18	chr5	70239196	70239701
19	chr5	70239786	70241034
20	chr5	70241131	70242428
21	chr5	70242496	70242844
22	chr5	70243026	70243331
23	chr5	70243681	70244193
24	chr5	70244286	70244815
25	chr5	70245011	70245717
26	chr5	70247436	70248868

TABLE 2 SNP site distribution for SMN1 region haplotype analysis

Region	Number of SNP sites
Upstream 10M-3M	7
Upstream 3M-2.5M	1
Upstream 2.5M-2M	14
Upstream 2M-1.5M	98
Upstream 1.5M-1M	52
Upstream 1M-500K	71
Upstream 500K-0K	66
Gene±1M	1629
Downstream 0K-500K	67
Downstream 500K-1M	26
Downstream 1M-1.5M	42
Downstream 1.5M-2M	78
Downstream 2M-2.5M	87
Downstream 2.5M-3M	0
Downstream 3M-10M	7

Samples are obtained from the family members of the fetus, including nucleic acid samples of the biological mother of the fetus (the pregnant woman), the biological father of the fetus and a proband. Nucleic acid is extracted from each family member's sample, the same target region is captured in the manner used to obtain the first sequencing data, and the same target region of each family member is sequenced to obtain the family member sequencing data. The family member sequencing data comprise the second, third and fourth sequencing data, which correspond to the sequencing data of the same target region of the biological mother, the biological father and the proband, respectively. The second sequencing data, i.e., the maternal sequencing data, can be obtained from the same maternal peripheral blood sample used for the first sequencing data: the sample is separated into maternal plasma and maternal blood cells, and maternal genomic nucleic acid is extracted from the blood cells, such as leukocytes, and sequenced. The proband is a family member identified as carrying the relevant variation in the target region. Here the proband is a sibling of the fetus to be tested who has the same biological parents, whether born or unborn (including an in vitro cultured embryo or fertilized egg), and whether living or deceased. In other embodiments, the proband may also be a sibling of a parent of the fetus to be tested, such as a maternal uncle, paternal uncle or paternal aunt of the fetus. In that case, the family member sequencing data should further include that of the paternal and/or maternal grandparents of the fetus, so that the target region haplotypes of the grandparents can be constructed from the sequencing data of the parent's sibling and of the parent, and the target region haplotype inherited by the parent can be determined. The first, second, third and fourth sequencing data need not be obtained in any particular order. They can be obtained simultaneously, for example by labeling the samples with tags, pooling the libraries of the several samples and sequencing the pool in a single run; they can also be obtained one at a time or several at a time.

Step two: determining the fetal nucleic acid content.

Determining the fetal nucleic acid content in the maternal body fluid sample based on the first and second sequencing data, or based on the first, second and third sequencing data.

Determining the fetal nucleic acid content in the maternal body fluid sample based on the first and second sequencing data is performed as follows. First, sites are selected that show two genotypes in the first sequencing data and only one genotype in the second sequencing data. The sites can be screened by alignment using software such as SOAP (Short Oligonucleotide Analysis Package), BWA and SAMtools, but this embodiment is not limited thereto, provided that the polymorphic sites can be identified by alignment. The reference sequence used for alignment is a known sequence and may be any previously obtained reference template of the biological class to which the target individual belongs. For example, if the target individual is a human, the reference sequence may be HG19 provided in the NCBI database. Furthermore, a resource library containing more reference sequences may be configured in advance and, before alignment, a closer sequence may be selected or assembled as the reference according to factors such as the sex, ethnicity and region of the target individual, which helps to obtain more accurate detection and analysis results. During alignment, according to the alignment parameter settings, each read or each pair of reads (for paired-end sequencing) in each set of sequencing data is allowed at most n base mismatches, where n is preferably 1 or 2; a read or read pair with more than n mismatches is treated as unable to align to the reference sequence.

Consider a site at which the reference sequence is A. Suppose the alignment of the second sequencing data, i.e., the maternal sequencing data, shows that all bases aligned to this site are A, whereas the alignment of the first sequencing data, i.e., the mixed maternal and fetal sequencing data, shows both A and another base (T, C or G) at this site. Because the first sequencing data is a mixture of maternal and fetal nucleic acids and the second sequencing data shows that the mother is AA at this site, the non-A base in the first sequencing data can be attributed to the fetus. All such sites are selected, and the fetal nucleic acid content of the mixed nucleic acid is estimated from the proportion of reads supporting each allele at these sites. Similarly, if the second sequencing data shows that the mother is heterozygous at a site, such as AG, and the first sequencing data supports both the AG and AA genotypes at that site, the fetal nucleic acid content in the peripheral blood sample of the pregnant woman can also be estimated from the number, content or proportion of A bases in the first sequencing data.

In the former case, where the second sequencing data shows only a homozygous genotype and the first sequencing data shows a heterozygous genotype in addition to that homozygous genotype, the fetal nucleic acid content is f = 2d/(c + d). In the latter case, where the second sequencing data shows only a heterozygous genotype and the first sequencing data shows a homozygous genotype in addition to that heterozygous genotype, the fetal nucleic acid content is f = (c - d)/(c + d). In both formulas, c is the number of reads supporting allele A in the first sequencing data and d is the number of reads supporting the non-A allele in the first sequencing data.
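The two per-site formulas above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and combining the per-site estimates by taking their median is an assumption of this sketch, since the text only states that a plurality of sites is used.

```python
from statistics import median


def fetal_fraction_at_site(c: int, d: int, mother_homozygous: bool) -> float:
    """Estimate the fetal nucleic acid content f at one site.

    c: reads supporting allele A in the first (plasma) sequencing data.
    d: reads supporting the non-A allele in the first sequencing data.
    """
    total = c + d
    if total == 0:
        raise ValueError("site has no covering reads")
    if mother_homozygous:
        # Former case: the mother is AA, so non-A reads come from the fetus only.
        return 2 * d / total
    # Latter case: the mother is heterozygous and the fetus is AA, so the
    # excess of A reads over non-A reads reflects the fetal contribution.
    return (c - d) / total


def estimate_fetal_fraction(sites) -> float:
    """Combine per-site estimates; using the median is an assumption here.

    sites: iterable of (c, d, mother_homozygous) tuples.
    """
    return median(fetal_fraction_at_site(c, d, hom) for c, d, hom in sites)
```

For example, a site with c = 190 and d = 10 where the mother is homozygous, and a site with c = 110 and d = 90 where the mother is heterozygous, both correspond to a fetal nucleic acid content of about 10%.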

Determining the fetal nucleic acid content in the maternal body fluid sample based on the first, second and third sequencing data is performed as follows. Sites are screened out at which the second and third sequencing data show different homozygous genotypes, for example RR in the second sequencing data and rr in the third sequencing data, so that by inheritance the genotype of the site in the fetal nucleic acid is Rr. The fetal nucleic acid content in the peripheral blood sample of the pregnant woman is then calculated from a plurality of such sites as g/(g + h), where g is the number of reads supporting allele r in the first sequencing data and h is the number of reads supporting allele R in the first sequencing data. For the alignment, alignment parameter settings and alignment results involved in screening these sites, refer to the description above of estimating the fetal nucleic acid content based on the first and second sequencing data.

Step three: constructing the target region haplotypes of the parents.

The target region haplotypes of the mother and the father are constructed based on the second, third and fourth sequencing data, i.e., the haplotypes of each parent are constructed from the parents' own sequencing data and the sequencing data of their child (the proband) known to carry the variation in the target region. The sequencing data of the parents and of the proband are each aligned to the reference sequence, and software such as SOAPsnp, GATK and Bowtie is used to identify the SNPs in the target region of the parents and the proband and to obtain their genotypes. The two haplotypes (two sets of SNPs) of the proband consist of one haplotype from the father and one from the mother. The haplotypes of the father and the mother can therefore be constructed according to Mendel's laws of inheritance from the genotypes of the parents and the proband at the SNP sites, for example by using a plurality of discriminating SNPs, i.e., SNPs at which the parents have different genotypes and which therefore make it possible to distinguish the parental origin of a haplotype in the next generation. A haplotype, which here is a set of SNPs, tends to be inherited by progeny as one genetic unit.

It should be noted that, the embodiment of the present invention does not have any sequence restriction on the implementation of step two and step three, and step two may be performed first and step three may be performed second, or step three may be performed first to obtain the parental target region haplotype and then step two may be performed to determine the fetal nucleic acid content.

Step four: determining the haplotype of the fetal target region.

The fetal target region haplotype is determined based on the target region haplotypes of the mother and the father and the fetal nucleic acid content. Specifically, the paternal target region haplotype inherited by the fetus is determined using a plurality of sites that are heterozygous in the paternal target region haplotypes and homozygous in the maternal target region haplotypes: if such an SNP site is heterozygous in the fetus, only one type of base can have come from the mother, so the other base at the site must have come from the father. Using a plurality of such sites, for example more than 10 sites at which an allele is found to derive from the father, the one of the two fetal haplotypes that derives from the father can be determined. The other fetal haplotype can be determined similarly using a plurality of sites that are homozygous in the paternal target region haplotypes and heterozygous (Rr) in the maternal target region haplotypes. However, because the fetal nucleic acid sample, i.e., the maternal peripheral blood sample, contains a large amount of maternal DNA, either allele at such a site may come from the mother alone, so SNPs of this type cannot by themselves show whether the fetus inherited the maternal haplotype carrying R or the one carrying r. The maternal haplotype inherited by the fetus is therefore determined by also taking the fetal nucleic acid content into account.

For a plurality of polymorphic sites that are homozygous on the paternal haplotypes and heterozygous on the maternal haplotypes, the genotype of each such site in the maternal peripheral blood sample can be designated Rr, where R and r represent a pair of alleles (R being the allele for which the father is homozygous). If a plurality of such sites all conform to R/r = (1 + x%)/(1 - x%), the fetus is determined to have inherited the maternal haplotype carrying allele R; if a plurality of such sites all conform to R/r = 1, the fetus is determined to have inherited the maternal haplotype carrying allele r. Here x% represents the fetal nucleic acid content, and R/r is the number of reads supporting R divided by the number of reads supporting r in the aligned first sequencing data. The haplotype of the fetus is thereby determined.
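The decision rules of step four can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the patented implementation: the function names, the `min_reads` threshold, the majority vote across sites and the nearest-expectation test are all assumptions, and the haplotype labels Hap0 and Hap1 follow the naming used for FIG. 3. R is taken to be the allele for which the father is homozygous. The maternal test compares R/(R + r) with (1 + f)/2 and 1/2, which is the same pair of hypotheses as R/r = (1 + f)/(1 - f) and R/r = 1 but avoids division by zero.

```python
from collections import Counter


def _other(hap: str) -> str:
    """Return the label of the other haplotype of the same parent."""
    return "Hap1" if hap == "Hap0" else "Hap0"


def infer_paternal_haplotype(sites, min_reads: int = 2):
    """Vote over sites that are heterozygous in the father, homozygous in the mother.

    sites: (paternal haplotype carrying the father-specific allele,
            plasma reads supporting that allele).
    min_reads is an assumed noise threshold, not a value from the text.
    """
    votes = Counter(hap for hap, reads in sites if reads >= min_reads)
    return votes.most_common(1)[0][0] if votes else None


def infer_maternal_haplotype(sites, fetal_fraction: float):
    """Vote over sites that are heterozygous (Rr) in the mother, RR in the father.

    sites: (maternal haplotype carrying R, reads supporting R, reads supporting r).
    """
    votes = Counter()
    for hap_with_R, n_R, n_r in sites:
        if n_R + n_r == 0:
            continue  # no coverage, so no evidence at this site
        observed = n_R / (n_R + n_r)
        # Fetus inherited R: expected R/(R + r) = (1 + f)/2.
        dist_R = abs(observed - (1 + fetal_fraction) / 2)
        # Fetus inherited r: expected R/(R + r) = 1/2.
        dist_r = abs(observed - 0.5)
        votes[hap_with_R if dist_R < dist_r else _other(hap_with_R)] += 1
    return votes.most_common(1)[0][0] if votes else None
```

A per-site nearest-expectation vote is the simplest way to combine sites; a likelihood-based combination across linked sites would be a natural refinement but is not shown here.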

It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic or optical disk, and the like.

According to another embodiment of the present invention, there is provided an apparatus for determining the haplotype of a target region of a fetus, which can be used to perform some or all of the steps of the method according to one embodiment of the present invention. As shown in FIG. 1, the apparatus 1000 comprises: a sequencing unit 100 for obtaining free nucleic acid in a body fluid of a pregnant woman, capturing a target region and sequencing the captured target region to obtain first sequencing data, and for capturing the same target region in the nucleic acid of family members of the fetus and sequencing it to obtain family member sequencing data, wherein the family member sequencing data comprise second, third and fourth sequencing data corresponding respectively to the sequencing data of the same target region of the mother of the fetus, the father of the fetus and the proband; a fetal nucleic acid content determination unit 200, connected to the sequencing unit 100, for determining the fetal nucleic acid content in the maternal body fluid sample based on the first and second sequencing data, or based on the first, second and third sequencing data; a parent haplotype determination unit 300, connected to the sequencing unit 100, for constructing the maternal and paternal target region haplotypes based on the second, third and fourth sequencing data; and a fetal haplotype determination unit 400, connected to the fetal nucleic acid content determination unit 200 and the parent haplotype determination unit 300, for determining the fetal target region haplotype based on the maternal and paternal target region haplotypes and the fetal nucleic acid content. The description of the technical features and advantages of the method according to an embodiment of the invention also applies to the apparatus of this embodiment and is not repeated here.

The following describes in detail, together with the results, the use of the method of the present invention to determine the haplotype or genotype of a target region in specific samples. The following examples are given for the purpose of illustration only and are not to be construed as limiting the invention. The terms "first," "second," "third," etc. are used in this disclosure for convenience of description only and are not to be construed as indicating or implying any relative importance or any order among the items. In the description of the present invention, "a plurality" means two or more unless otherwise specified.

Unless otherwise noted, the reagents, sequences (adapters, tags and primers), software and instruments not specifically described in the following examples are conventional commercially available or publicly available products, such as the library construction kits for the HiSeq 2000 sequencing platform available from Illumina for sequencing library construction.

The general method comprises the following steps:

1. Selection of target capture regions and design of probes

The target capture region comprises the exon regions of the SMN1 gene, together with SNP sites of high heterozygosity within the SMN1 gene and in the 3 Mb regions upstream and downstream of it, which are captured and sequenced. SNPs are selected with reference to the dbSNP database: SNP sites genotyped on more than 100 reference chromosomes and with a MAF between 0.3 and 0.5 are chosen. In addition, to ensure detection accuracy, the 63-mer base sequence containing each SNP site must align uniquely to the genome and have a GC content of 40-50%. The capture regions of the SMN1 region are shown in Tables 1 and 2.
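The selection criteria above can be expressed as a simple filter. The sketch below is illustrative only: the `CandidateSnp` record and its field names are hypothetical, and the unique-alignment check is assumed to have been computed beforehand by an external aligner rather than performed here.

```python
from dataclasses import dataclass


@dataclass
class CandidateSnp:
    snp_id: str              # hypothetical identifier, e.g. a dbSNP rs number
    chrom_count: int         # number of reference chromosomes genotyped
    maf: float               # minor allele frequency
    flank_63mer: str         # 63-mer base sequence containing the SNP site
    uniquely_aligned: bool   # precomputed elsewhere (assumption of this sketch)


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_selection(snp: CandidateSnp) -> bool:
    """Apply the stated criteria: chromosome count, MAF, uniqueness, GC content."""
    return (
        snp.chrom_count > 100
        and 0.3 <= snp.maf <= 0.5
        and len(snp.flank_63mer) == 63
        and snp.uniquely_aligned
        and 0.40 <= gc_content(snp.flank_63mer) <= 0.50
    )
```

Candidates would typically be filtered with a list comprehension such as `[s for s in candidates if passes_selection(s)]`.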

2. Obtaining the pathogenic haplotypes of the family

Through bioinformatic analysis, the genotypes of the SNP sites in the target gene and its upstream and downstream regions are determined for the pregnant woman, her husband and the proband. Linkage analysis of the SNP genotypes of the three then identifies the SNP alleles closely linked with the pathogenic mutations, yielding the haplotype information linked with the pathogenic mutations. The overall technical route is shown in FIG. 2.

(1) Genomic DNA was extracted from the peripheral blood of the pregnant woman, her husband and the proband, and the quality of the DNA was checked by electrophoresis and OD measurement.

(2) A target region capture library was prepared from genomic DNA that passed quality testing. For library preparation, 1 μg of genomic DNA was fragmented into small DNA fragments with a main band of 200-300 bp; the fragments were end-repaired and an A base was added to the 3' end so that they could be ligated to adapters carrying a T base at the 3' end; the library was completed by pre-capture PCR (Non-Captured PCR); the exons and flanking ±30 bp regions of the gene targeted by the SMN1 gene target region capture probes were enriched by hybridization; the enriched product was amplified by PCR; and finally the capture hybridization efficiency was determined by QPCR of the PCR products before and after hybridization.

(3) The resulting sample library was sequenced on a high-throughput sequencer so that the average sequencing depth of the target region exceeded 200×.

(4) The sequencing data are analyzed bioinformatically to obtain genetic variation information for the relevant genes, such as single nucleotide variations (SNVs) and insertions and deletions of a few bases (InDels), and to identify the SNP information linked with the target pathogenic mutation to be detected, i.e., the pathogenic haplotype. Assuming that the proband inherited one pathogenic mutation from each parent:

1) Suppose the genotype at a certain site outside the pathogenic gene is AA in the proband, AC in the father and AA in the mother. It follows that the proband obtained A from the father and A from the mother, and that these two alleles are inherited in linkage with the pathogenic mutations. In the father, C is linked with the non-pathogenic allele;

2) Suppose the genotype at a certain site outside the pathogenic gene is AC in the proband, AC in the father and AA in the mother. It follows that the proband obtained C from the father and A from the mother, and that these two alleles are inherited in linkage with the pathogenic mutations. In the father, A is linked with the non-pathogenic allele;

3) Suppose the genotype at a certain site outside the pathogenic gene is AC in the proband, AA in the father and AC in the mother. It follows that the proband obtained A from the father and C from the mother, and that these two alleles are inherited in linkage with the pathogenic mutations. In the mother, A is linked with the non-pathogenic allele;

Applying the above inference to the SNP sites of the SMN1 gene and of the 3 Mb regions on both sides yields haplotype information across the region, including the haplotype linked with the pathogenic mutation. From this, the SNP information closely linked with the non-pathogenic allele can be further deduced.
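The per-site inference in cases 1) to 3) can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the patent: the function name `phase_site`, the two-character genotype strings and the returned dictionary layout are all assumptions made for this example. It relies on the stated premise that the proband inherited one pathogenic haplotype from each parent.

```python
def phase_site(proband, father, mother):
    """Assign parental alleles at one SNP to the pathogenic / non-pathogenic haplotype.

    Genotypes are two-character strings such as "AC" (hypothetical encoding).
    Returns None when the parental origin of the proband's alleles cannot be
    resolved (for example all three heterozygous) or the trio is inconsistent.
    """
    first, second = proband[0], proband[1]
    # Try both ways of assigning the proband's alleles to (paternal, maternal) origin.
    options = set()
    for pat, mat in ((first, second), (second, first)):
        if pat in father and mat in mother:
            options.add((pat, mat))
    if len(options) != 1:
        return None
    pat, mat = options.pop()

    def other_allele(genotype, transmitted):
        # The allele the parent did NOT pass to the proband.
        alleles = list(genotype)
        alleles.remove(transmitted)
        return alleles[0]

    return {
        "father": {"pathogenic": pat, "non_pathogenic": other_allele(father, pat)},
        "mother": {"pathogenic": mat, "non_pathogenic": other_allele(mother, mat)},
    }


# Case 2): proband AC, father AC, mother AA
print(phase_site("AC", "AC", "AA"))
```

Running the function over every SNP in the captured region and concatenating the informative sites gives, for each parent, one allele string linked with the pathogenic mutation and one linked with the non-pathogenic allele.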

3. Pregnant woman plasma DNA target region capture sequencing

Target region capture sequencing is performed on the plasma DNA of the pregnant woman, followed by bioinformatic SNP/indel analysis. Confirmation of the family relationships and the fetal DNA content serve as quality control steps, and only samples that pass quality control proceed to subsequent analysis. The plasma cell-free DNA sequencing data of the pregnant woman are genotyped, and linkage analysis is performed in combination with the family haplotypes to judge whether the fetus has inherited the pathogenic haplotypes of the couple.

(1) Cell-free DNA was extracted from 1.2 mL of maternal plasma, and the DNA was quantified with a Qubit for quality control.

(2) A target region capture library was prepared from DNA that passed quality testing. The DNA fragments are first end-repaired, and a base A is added at the 3' end so that each fragment can be ligated to a special adapter carrying a T base at its 3' end. A finished library is constructed by Non-Captured PCR; the exons and flanking ±30 bp regions of the specific genes selected by the SMN1 target region capture probes are enriched; the enriched product is amplified by PCR; and finally the sequence capture hybridization efficiency is obtained by qPCR of the PCR products before and after hybridization.

(3) The obtained sample library was sequenced on a high-throughput sequencer so that the average sequencing depth of the target region exceeded 500X.

4. Fetal genotype prediction

(1) The sequencing data are analyzed bioinformatically to obtain genetic variation information for the relevant genes, such as single nucleotide variations (SNVs) and small insertions and deletions (InDels).

(2) The fetal DNA content f of the plasma cell-free DNA was calculated as follows:

a) Assuming that the maternal leukocyte DNA genotype is AA and the fetal genomic DNA genotype is AT, the alleles observed in plasma are A and T; if the number of reads supporting A is c and the number of reads supporting T is d, then f = 2d/(c + d);

b) Assuming that the maternal leukocyte DNA genotype is AT and the fetal genomic DNA genotype is AA, the alleles observed in plasma are A and T; if the number of reads supporting A is c and the number of reads supporting T is d, then f = (c - d)/(c + d).

If the fetal DNA content is greater than 3%, the sample passes quality control and proceeds to the subsequent steps.
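The two formulas and the 3% threshold can be written out as a short Python sketch. The function names and the example read counts are illustrative assumptions; only the formulas f = 2d/(c + d) and f = (c - d)/(c + d) and the 3% cut-off come from the text above.

```python
def fetal_fraction_mother_hom(c, d):
    """Mother AA, fetus AT: c reads support A, d reads support T (fetal-specific)."""
    if c + d == 0:
        raise ValueError("no reads cover this site")
    return 2 * d / (c + d)


def fetal_fraction_mother_het(c, d):
    """Mother AT, fetus AA: c reads support A, d reads support T."""
    if c + d == 0:
        raise ValueError("no reads cover this site")
    return (c - d) / (c + d)


def passes_qc(f, threshold=0.03):
    """Quality control: fetal DNA content must exceed 3%."""
    return f > threshold


# Hypothetical read counts, chosen so that both cases give f = 0.1
f_a = fetal_fraction_mother_hom(950, 50)
f_b = fetal_fraction_mother_het(550, 450)
print(f_a, f_b, passes_qc(f_a))
```

In case a) the fetus contributes T on only one of its two chromosomes, so the T read fraction is f/2, which is why the observed fraction is doubled. In case b) the A fraction is (1 + f)/2 and the T fraction is (1 - f)/2, so their difference equals f.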

(3) The genotype inherited by the fetus from the father is judged as follows:

a) Sites at which the mother is homozygous and the father is heterozygous are selected for judging the paternally inherited haplotype. Assume that at a certain SNP locus the maternal genotype is AA and the paternal genotype is AC. If SNP calling on the plasma sequencing data detects both A and C, and the proportion of C is consistent with the estimated fetal concentration, the fetus has obtained the C allele of this SNP;

b) All SNPs in the SMN1 capture region that meet condition a) are used to judge the SNP alleles the fetus obtained from the father, which together form the haplotype the fetus inherited from the father. According to the information in 2-(4), it is determined whether this haplotype is linked with the pathogenic mutation, and thus whether the fetus obtained the pathogenic allele from the father.
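One way to make steps a) and b) concrete is sketched below in Python. The text does not specify how "consistent with the estimated fetal concentration" is tested or how sites are combined, so everything here is an assumption for illustration: the per-site test compares two binomial likelihoods (paternal-specific allele at fraction f/2 versus a hypothetical sequencing error rate of 0.005), and sites are combined by a simple majority vote.

```python
import math


def log_binom(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def paternal_allele_transmitted(specific_reads, total_reads, f, error_rate=0.005):
    """True if the paternal-specific allele (C in the AA x AC example) looks inherited.

    Assumption: if inherited, its read fraction is about f/2; if not, reads
    supporting it arise only from errors at the hypothetical rate error_rate.
    """
    ll_inherited = log_binom(specific_reads, total_reads, f / 2)
    ll_absent = log_binom(specific_reads, total_reads, error_rate)
    return ll_inherited > ll_absent


def paternal_haplotype(sites, f):
    """Majority vote over sites (an assumed combination rule).

    sites: iterable of (specific_reads, total_reads, hap) where hap (0 or 1) is
    the paternal haplotype that carries the paternal-specific allele.
    Returns 0, 1, or None on a tie.
    """
    votes = [0, 0]
    for specific_reads, total_reads, hap in sites:
        inherited = paternal_allele_transmitted(specific_reads, total_reads, f)
        votes[hap if inherited else 1 - hap] += 1
    if votes[0] == votes[1]:
        return None
    return 0 if votes[0] > votes[1] else 1


# Hypothetical sites with f = 0.1: two show the Hap1-specific allele, one lacks the Hap0-specific allele
print(paternal_haplotype([(25, 500, 1), (2, 500, 0), (30, 600, 1)], 0.1))
```

The returned haplotype index is then looked up against the family phasing to see whether it is the one linked with the pathogenic mutation.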

(4) The genotype inherited by the fetus from the mother is judged as follows:

Sites at which the mother is heterozygous and the father is homozygous are selected for judging the maternally inherited haplotype. Assume that at a certain SNP locus the maternal genotype is AC and the paternal genotype is AA, and that SNP calling on the plasma sequencing data detects both A and C. If the fetus inherits the A allele from the mother, the fetal genotype is AA and an A/C read ratio of approximately (1 + f)/(1 - f) is observed; if the fetus inherits the C allele, the fetal genotype is AC and an A/C read ratio of approximately 1 (that is, an A read fraction of approximately 0.5) is observed. A binomial distribution model is constructed on the read support numbers of the alleles to calculate the probabilities of inheriting A and C, giving the relative probabilities Pa and Pc (Pa + Pc = 1). An HMM is then constructed over the per-site probabilities of all SNPs, and the haplotype the fetus obtained from the mother is judged with the Viterbi algorithm (Lawrence R. Rabiner, Proceedings of the IEEE, Vol. 77, No. 2, February 1989). Whether the fetus obtained the pathogenic allele from the mother follows from whether this haplotype is linked with the pathogenic mutation.
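The per-site relative probabilities Pa and Pc can be sketched in Python as below. This is an illustrative reading of the binomial model, not the patent's code: the function name and the example read counts are assumptions. The expected fraction of A-supporting reads is taken as (1 + f)/2 when the fetus is AA and 0.5 when the fetus is AC, which follows from the mixture of maternal and fetal DNA in plasma.

```python
import math


def maternal_site_probs(a_reads, c_reads, f):
    """Relative probabilities (Pa, Pc) at a site where mother is AC and father is AA.

    Pa: fetus inherited maternal A (fetus AA, expected A fraction (1 + f) / 2).
    Pc: fetus inherited maternal C (fetus AC, expected A fraction 0.5).
    The binomial coefficient is identical under both hypotheses and cancels
    when the two likelihoods are normalized, so it is omitted.
    """
    def log_lik(p_a):
        return a_reads * math.log(p_a) + c_reads * math.log(1 - p_a)

    la = log_lik((1 + f) / 2)
    lc = log_lik(0.5)
    m = max(la, lc)  # subtract the maximum for numerical stability
    wa, wc = math.exp(la - m), math.exp(lc - m)
    return wa / (wa + wc), wc / (wa + wc)


# Hypothetical counts with f = 0.1
print(maternal_site_probs(550, 450, 0.1))  # A in excess: favors inheriting A
print(maternal_site_probs(500, 500, 0.1))  # balanced: favors inheriting C
```

At the depths and fetal fractions involved, a single site is often only weakly informative, which is why the per-site probabilities are passed on to the HMM rather than being called one site at a time.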

(5) The results of (3) and (4) are combined to obtain the genotype information of the fetus.

Examples

Noninvasive prenatal genetic testing was performed on one pregnant woman (Tianjin maternal care institute) who was in her second pregnancy and at high risk of an SMN1-affected fetus. Both the pregnant woman and her husband are heterozygous carriers of a deletion of exon 7 of the SMN1 gene, and they have already had a child affected by a homozygous SMN1 mutation. During the second pregnancy, peripheral blood was drawn from the pregnant woman and the plasma was separated promptly; capture sequencing was then performed on the plasma DNA and on the genomic DNA of the pregnant woman, her husband and the proband to analyze the genetic status of the fetus.

Sample DNA was extracted by a salting-out method. The large-fragment DNA was then sheared ultrasonically; at present, a Covaris shearing method is used to break the sample DNA into fragments in the range of 100-700 bp. (Note: shearing is considered satisfactory when the main band lies at 200-250 bp, the insert size required for library preparation; if the result is unsatisfactory, the DNA must be sheared again.)

Plasma cell-free DNA was extracted by a salting-out method and, after quantification with a Qubit, was used directly for library construction.

1. Library preparation

1.1 end repair and purification

After the prepared mix was shaken and mixed well, 25 μL of the enzyme reaction mixture was added to each reaction.

Reaction conditions: 20 °C for 30 min

Product purification was performed using 180 μL of Ampure Beads, and the recovered DNA was dissolved in 30 μL of water (of which 1.9 μL is allowed for loss).

1.2 adding A (A-Tailing) at the end

After the prepared mix was shaken and mixed well, 6.9 μL of the enzyme reaction mixture was added to each tube.

Reaction conditions: 20 °C for 30 min

Note: after the "A" is added to the ends, the product is purified.

1.3 ligation and purification of Adapter

After the prepared mix was shaken and mixed well, 15 μL of the enzyme reaction mixture was added to each reaction.

Reaction conditions: 16 °C for 12-16 h (overnight)

Product purification was performed using 75 μL of Ampure Beads, and the recovered DNA was dissolved in 35 μL of water (of which 2 μL is allowed for loss).

1.4 Non-Captured sample Pre-LM-PCR

PCR procedure:

2. Chip hybridization and target region capture enrichment

In this experiment, hybridization and elution were performed according to the NimbleGen instructions to capture the target genes, followed by PCR enrichment.

3. Sequencing on machine

In this experiment, sequencing was performed on a HiSeq 2000 or HiSeq 2500 using the PE101+8+101 program.

4. Information analysis

obtaining the raw short sequences (reads) from the sequencer;

removing adapters and low-quality data from the sequencing data;

aligning the short sequences to the corresponding positions of the human genome reference;

collecting statistics on the sequencing results, such as the number of short sequences, the coverage of the target region and the average sequencing depth;

filtering out single-nucleotide calls with low quality values or low coverage;

annotating the variants, determining the gene, coordinates, amino acid changes, etc. at each mutation site;

determining the genotype of each SNP within the SMN1 capture region.

5. Analysis of results

1) Data output

As shown in Table 3, the average sequencing depth in the target region exceeds 100X for every tested sample, and the plasma sequencing depth reaches 271X.

TABLE 3 Data output

2) SNP phasing profile

We used SNP sites within 1 Mb upstream and downstream of the SMN1 gene to construct the haplotypes from the genotypes of the father, the mother and the proband. Table 4 shows the number of SNPs in this region whose haplotype was successfully determined (phased SNPs). These phased SNPs were subsequently used to judge the paternally inherited haplotype (SNP used for Pat-Hap) and the maternally inherited haplotype (SNP used for Mat-Hap).

TABLE 4 Statistics of phased SNPs in the SMN1 gene-related region

3) Analysis of fetal DNA content in plasma

Sites at which the father is heterozygous and the mother is homozygous were selected, and the fetal DNA content in plasma was estimated: assuming that the maternal genotype is AA and the fetal genotype is AT, if the numbers of reads supporting A and T are a and b respectively, the fetal DNA content in plasma is 2b/(a + b). The results showed that the fetal DNA content in this plasma sample was 0.0930.

4) Fetal genotype determination

The peripheral plasma data of the pregnant woman in the SMA 1 family were analyzed, and the SMN1 gene status of the fetus was inferred with an HMM algorithm. Specifically, Hap0 and Hap1 of the fetus are used as the hidden states, and the SNPs whose haplotype was successfully determined are used as the observation sequence (observations). The state transition probabilities (transition probabilities) are calculated from the SNP positions and the recombination probability between adjacent SNPs derived from a genetic map, and the relative probability that each SNP site supports Hap0 or Hap1 (emission probability) is calculated from the read support numbers. The haplotype arrangement supported by the SNPs is then deduced with the Viterbi algorithm, giving the most probable combination of fetal haplotypes. For details, see Chen S, Ge H, Wang X, et al. Haplotype-assisted accurate non-invasive fetal whole genome recovery through maternal plasma sequencing. Genome Med. 2013, 5(2): 18.
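The HMM decoding described above can be illustrated with a minimal two-state Viterbi sketch in Python. It is not the implementation from the cited reference: the function name, the uniform 0.5 starting probability for each haplotype and the example numbers are assumptions. Inputs are the per-SNP emission probabilities for Hap0 and Hap1 and the recombination probability between each pair of adjacent SNPs.

```python
import math


def viterbi_two_state(emissions, recomb_probs):
    """Most probable Hap0/Hap1 path over an ordered list of SNPs.

    emissions:    list of (p_hap0, p_hap1), one pair per SNP.
    recomb_probs: recombination probability between SNP i and SNP i + 1
                  (length len(emissions) - 1); used as the switch probability.
    Assumes a uniform prior (0.5 / 0.5) on the first SNP.
    """
    def safe_log(x):
        return math.log(max(x, 1e-300))  # guard against log(0)

    n = len(emissions)
    if n == 0:
        return []
    score = [safe_log(0.5) + safe_log(emissions[0][s]) for s in (0, 1)]
    back = []  # back[i][s] = best previous state leading to state s at SNP i + 1
    for i in range(1, n):
        r = recomb_probs[i - 1]
        stay, switch = safe_log(1 - r), safe_log(r)
        new_score, pointers = [], []
        for s in (0, 1):
            from_same = score[s] + stay
            from_other = score[1 - s] + switch
            if from_same >= from_other:
                best, prev = from_same, s
            else:
                best, prev = from_other, 1 - s
            new_score.append(best + safe_log(emissions[i][s]))
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical emissions: the third SNP weakly favors Hap1, but a switch is too costly
print(viterbi_two_state([(0.9, 0.1), (0.8, 0.2), (0.4, 0.6), (0.9, 0.1)], [1e-4] * 3))
```

Because recombination between closely spaced SNPs is rare, a single weakly discordant site does not flip the call; this is the property that lets the HMM smooth over noisy per-site evidence from low fetal fractions.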

To avoid the effect of repetitive sequence regions on the analysis results, only unique sequence regions were used for the analysis. The results are shown in FIG. 3. Each point on the graph represents the difference between the probability that an SNP locus was inherited from the paternal/maternal Hap0 and the probability that it was inherited from the paternal/maternal Hap1; each small circle is a combined judgment result. A line of small circles above the middle baseline represents a final judgment of inheritance from Hap0, and a line of small circles below the middle baseline represents a final judgment of inheritance from Hap1. As can be seen from FIG. 3, Pat-Hap0 and Mat-Hap0 are the paternal and maternal haplotypes that carry the pathogenic mutation, and Pat-Hap1 and Mat-Hap1 are the paternal and maternal haplotypes that do not. The inference results show that the fetus obtained Pat-Hap1 from its father and Mat-Hap1 from its mother, that is, the chromosomes that do not carry the SMN1 pathogenic mutation, indicating that the fetus does not have an SMN1 deletion.

Claims

1. A method of determining a haplotype of a target region of a fetus for non-disease diagnostic purposes, comprising,

sequencing the target region of free nucleic acid in a bodily fluid of a pregnant woman to obtain first sequencing data;

sequencing the target region of the family member of the fetus to obtain second sequencing data, third sequencing data and fourth sequencing data, wherein the second sequencing data is the sequencing data of the mother of the fetus, the third sequencing data is the sequencing data of the father of the fetus, and the fourth sequencing data is the sequencing data of the proband;

determining a fetal nucleic acid content in the maternal body fluid based on the first sequencing data, second sequencing data, and optionally third sequencing data;

respectively constructing a target region haplotype of the mother of the fetus and a target region haplotype of the father of the fetus based on the second sequencing data, the third sequencing data and the fourth sequencing data; and

determining a target region haplotype of the fetus based on the target region haplotype of the mother of the fetus, the target region haplotype of the father of the fetus, and the fetal nucleic acid content;

wherein the fetal nucleic acid content is determined by:

determining loci at which the second and third sequencing data show different homozygous genotypes, wherein RR and rr represent the different homozygous genotypes, and R and r are a pair of alleles,

determining the fetal nucleic acid content based on the formula f = g/(g + h),

wherein,

g is the number of reads supporting allele R in the first sequencing data, h is the number of reads supporting allele R in the first sequencing data;

the determining of the fetal target region haplotype comprises,

determining the haplotype of the paternal target region inherited by the fetus using a plurality of loci that are heterozygous in the haplotype of the paternal target region and homozygous in the haplotype of the maternal target region, and determining the haplotype of the maternal target region inherited by the fetus using a plurality of loci that are homozygous in the haplotype of the paternal target region and heterozygous in the haplotype of the maternal target region, together with the fetal nucleic acid content;

wherein for the plurality of loci that are homozygous in the haplotype of the paternal target region and heterozygous in the haplotype of the maternal target region, if a plurality of such loci meet R/r = (1 + x%)/(1 - x%), it is determined that the fetus has inherited the target region haplotype carrying the maternal allele R; if a plurality of such loci meet R/r = 1, it is determined that the fetus has inherited the target region haplotype carrying the maternal allele r; R and r represent a pair of alleles, x% represents the fetal nucleic acid content, and R/r is the number of reads supporting R in the first sequencing data divided by the number of reads supporting r in the first sequencing data.

2. The method of claim 1, wherein sequencing said target region of free nucleic acid in a bodily fluid of a pregnant woman comprises:

capturing the free nucleic acid with a probe that specifically recognizes the target region.

3. The method of claim 2, wherein the probes are provided in the form of a chip.

4. The method of claim 2, characterized in that said probes comprise SNP site probes that are uniquely aligned on a reference genome.

5. The method of claim 2, characterized in that the GC content of the probe is 40-50%.

6. An apparatus for determining the haplotype of a target region of a fetus, comprising,

a sequencing unit, configured to perform sequencing on the target region of free nucleic acid in a body fluid of a pregnant woman to obtain first sequencing data, and perform sequencing on the target region of a family member of the fetus to obtain second sequencing data, third sequencing data and fourth sequencing data, wherein the second sequencing data is sequencing data of a mother of the fetus, the third sequencing data is sequencing data of a father of the fetus, and the fourth sequencing data is sequencing data of a proband;

a fetal nucleic acid content determination unit, connected to the sequencing unit, for determining a fetal nucleic acid content in the body fluid of the pregnant woman based on the first sequencing data, the second sequencing data and optionally the third sequencing data;

a parent haplotype determining unit connected with the sequencing unit and used for respectively constructing a target region haplotype of the mother of the fetus and a target region haplotype of the father of the fetus based on the second sequencing data, the third sequencing data and the fourth sequencing data; and

a fetal haplotype determination unit coupled to the fetal nucleic acid content determination unit and the parent haplotype determination unit for determining a target regional haplotype of the fetus based on the target regional haplotype of the fetal mother, the target regional haplotype of the fetal father, and the fetal nucleic acid content;

wherein the fetal nucleic acid content determination unit determines the fetal nucleic acid content by:

wherein,

the determining of the fetal target region haplotype comprises,

7. The device of claim 6, wherein said target region comprises an exonic region of the SMN1 gene.

8. The device of claim 7, wherein said target region further comprises SNP sites having a minor allele frequency of 0.3 to 0.5 within the SMN1 gene and in each of the 3 Mb regions upstream and downstream of the SMN1 gene.