CN111304309A

CN111304309A - Detection method for sequencing platform tag sequence pollution

Info

Publication number: CN111304309A
Application number: CN202010152944.6A
Authority: CN
Inventors: 林健; 杨敬敏; 覃振东; 唐嘉婕; 朱学萍
Original assignee: Shanghai Wickham Biomedical Technology Co ltd
Current assignee: Shanghai Wickham Biomedical Technology Co ltd
Priority date: 2020-03-06
Filing date: 2020-03-06
Publication date: 2020-06-19

Abstract

The invention provides a method for detecting tag sequence contamination of a sequencing platform, comprising at least the following steps: (1) linking each tag sequence to be detected to a known sequence to obtain a tag sequence-known sequence, wherein the types of tag sequence of the sequencing platform are fixed and known; (2) sequencing the sequences obtained in step (1) to obtain a sequencing result comprising the base sequences and the number of sequences; (3) splitting the sequencing result by tag sequence type, and, if a known sequence rm appears in the classification result of another tag sequence Tn in addition to that of its corresponding tag sequence Tm, determining that the tag sequence Tm is contaminated by the other tag sequence Tn, wherein m and n are natural numbers and m is not equal to n. By verifying the uniqueness of the tag sequences, the method ensures the reliability of the subsequent sequencer output data.

Description

Detection method for sequencing platform tag sequence pollution

Technical Field

The invention relates to the field of bioinformatics and biotechnology, in particular to a detection method for sequencing platform tag sequence pollution.

Background

In recent years, as the technology has developed and sequencing costs have fallen, high-throughput sequencing has spread from scientific research into daily life. At present the technology is mainly applied to genome sequencing, RNA sequencing, DNA methylation analysis and the like, and is widely used in the study of tumors, genetic diseases, metagenomes and so on. High-throughput sequencing data are assigned to their samples by tag sequences; however, if an index sequence is not unique, or other contamination is introduced during operation, the tag sequences are no longer pure, and errors occur in subsequent sample assignment and biological analysis.

Disclosure of Invention

In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a method for detecting tag sequence contamination of a sequencing platform. The method at least comprises the following steps:

(1) linking each tag sequence to be detected to a known sequence to obtain a tag sequence-known sequence, wherein the types of tag sequence of the sequencing platform are fixed and known;

(2) sequencing the sequences obtained in step (1) to obtain a sequencing result, wherein the sequencing result comprises the base sequences and the number of sequences;

(3) splitting the sequencing result by tag sequence type, and, if a known sequence rm appears in the classification result of another tag sequence Tn in addition to that of its corresponding tag sequence Tm, determining that the tag sequence Tm is contaminated by the other tag sequence Tn; wherein m and n are natural numbers, and m is not equal to n.
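The splitting in step (3) can be illustrated with a minimal sketch. It assumes each read has already been reduced to a pair of strings (the index read and the insert read), that tags are matched exactly with no mismatch tolerance, and that the index read is reported in the same orientation as the listed tag sequence. The function name `split_by_tag`, the data layout and the toy sequences are illustrative assumptions, not part of the method as filed.

```python
from collections import Counter, defaultdict


def split_by_tag(reads, tags, known):
    """Assign each read to a tag class and count the known sequences per class.

    reads: iterable of (index_read, insert_read) strings
    tags:  {tag_id: tag_sequence}
    known: {known_id: known_sequence}
    Reads whose index or insert matches nothing are skipped.
    """
    tag_by_seq = {seq: tag_id for tag_id, seq in tags.items()}
    bins = defaultdict(Counter)
    for index_read, insert_read in reads:
        tag_id = tag_by_seq.get(index_read)  # exact match only (assumption)
        if tag_id is None:
            continue
        for known_id, seq in known.items():
            if seq in insert_read:
                bins[tag_id][known_id] += 1
                break
    return dict(bins)


# Toy data: the known sequences below are made up for illustration.
tags = {"T1": "TTATAT", "T2": "ACCAAC"}
known = {"r1": "GATTACAGATTACA", "r2": "CCGGTTAACCGGTT"}
reads = [
    ("TTATAT", "GATTACAGATTACA"),
    ("TTATAT", "GATTACAGATTACA"),
    ("ACCAAC", "CCGGTTAACCGGTT"),
    ("ACCAAC", "GATTACAGATTACA"),  # r1 turns up in the class of T2
    ("GGGGGG", "GATTACAGATTACA"),  # unknown tag, skipped
]

bins = split_by_tag(reads, tags, known)
for tag_id in sorted(bins):
    print(tag_id, dict(bins[tag_id]))
```

In this toy run the known sequence r1 is counted under both T1 and T2, which is the situation step (3) treats as contamination.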

The invention also provides the use of the method for detecting tag sequence contamination of a sequencing platform in gene sequencing.

As mentioned above, the method for detecting the tag sequence contamination of the sequencing platform has the following beneficial effects:

The invention builds a library from single DNA sequences, sequences it on the instrument, and uses the known sequences in the sequencer output data to check the tag sequences, thereby ensuring that each tag sequence is free from contamination by other tag sequences. Verifying the uniqueness of the tag sequences in this way ensures the reliability of the subsequent output data.

Drawings

FIG. 1 shows the electrophoresis results of PCR amplification of known sequences.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.

Before the present embodiments are further described, it is to be understood that the scope of the invention is not limited to the particular embodiments described below; it is also to be understood that the terminology used in the examples is for the purpose of describing particular embodiments, and is not intended to limit the scope of the present invention; in the description and claims of the present application, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise.

When numerical ranges are given in the examples, it is understood that both endpoints of each range and any value between them can be selected unless otherwise indicated. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. In addition to the specific methods, devices, and materials used in the examples, any methods, devices, and materials similar or equivalent to those described in the examples can be used in the practice of the invention, in light of the knowledge of one skilled in the art and the description of the invention.

Unless otherwise indicated, the experimental methods, detection methods, and preparation methods disclosed herein all employ techniques conventional in the art of molecular biology, biochemistry, chromatin structure and analysis, analytical chemistry, cell culture, recombinant DNA technology, and related arts.

The method for detecting tag sequence contamination of a sequencing platform comprises at least steps (1) to (3) set out above, in which the following terms are used:

The tag sequence refers to a nucleic acid fragment used for distinguishing different samples during mixed sequencing.

The known sequence refers to a nucleic acid fragment having a known base sequence.

The number of sequences refers to the number of nucleic acid fragments.

Further, when detecting whether the sequencing platform tag sequence is contaminated, the steps (1) - (3) can be adopted to directly detect the tag sequence.

There may be one or more other known sequences rn. When several other known sequences rn are present, the tag sequence Tm contaminates several tag sequences. For example, if the class of the tag sequence T1 contains, in addition to its corresponding known sequence r1, the other known sequences r2 and r5, then the tag sequence T1 (corresponding to r1) contaminates the tag sequence T2 (corresponding to r2) and the tag sequence T5 (corresponding to r5).
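The T1 example can be expressed as a short sketch, assuming the known sequences counted in one tag class are available as a dictionary. The function name `tags_contaminated_by`, the data layout and the read counts are hypothetical and used only for illustration.

```python
def tags_contaminated_by(bin_tag, bin_counts, expected):
    """Return the tags whose known sequence turns up in the class of `bin_tag`.

    bin_counts: {known_id: read_count} observed in the class of bin_tag
    expected:   {tag_id: known_id}, the one-to-one pairing made in step (1)
    """
    owner = {known_id: tag_id for tag_id, known_id in expected.items()}
    return sorted(
        owner[known_id]
        for known_id, count in bin_counts.items()
        if count > 0 and known_id in owner and owner[known_id] != bin_tag
    )


expected = {f"T{i}": f"r{i}" for i in range(1, 6)}  # T1-r1 ... T5-r5
t1_class = {"r1": 1000, "r2": 3, "r5": 1}           # counts are made up

print(tags_contaminated_by("T1", t1_class, expected))  # ['T2', 'T5']
```

A class that contains only its own known sequence yields an empty list, which corresponds to the case in which no contamination is detected.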

Further, in step (1), the known sequence is derived from the genome of any species.

In a further embodiment, in step (1), the known sequence is derived from an archaeal or phage genome. Archaeal and bacteriophage genomes are convenient sources of known sequences and avoid sequences that are present in multiple copies.

Preferably, within a single assay the known sequences all come from the same species or the same individual organism. This ensures, to the greatest extent possible, that the known sequences differ from one another.

In one embodiment, the bacteriophage is selected from lambda bacteriophage.

In one embodiment, the archaea are selected from the phyla Korarchaeota, Nanoarchaeota, Thaumarchaeota or Euryarchaeota, for example Thermococcus, Pyrodictium or Halobacterium salinarum.

The tag sequence may be a single-ended or double-ended barcode.

The invention can detect whether one tag sequence or a plurality of tag sequences is contaminated.

In step (1), when there is more than one type of tag sequence to be detected, different tag sequences are linked to different known sequences. That is, in step (1) the tag sequences and the known sequences must be linked in one-to-one correspondence: each tag sequence to be detected is linked to exactly one known sequence.

Further, the detection method also comprises the following step: calculating the contamination ratio of the tag sequence Tm to the tag sequence Tn as follows:

contamination ratio = (number of sequences of the known sequence rm in the class of the tag sequence Tn) / (number of sequences of the known sequence rm in the class of the tag sequence Tm) × 100%.
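The ratio can be computed directly from the two read counts. The sketch below is a minimal illustration; the function name `contamination_ratio` and the example counts are assumptions chosen for the illustration, not values from the disclosure.

```python
def contamination_ratio(count_rm_in_tn, count_rm_in_tm):
    """Percentage of rm sequences found in the class of Tn,
    relative to rm sequences found in the class of its own tag Tm."""
    if count_rm_in_tm <= 0:
        raise ValueError("known sequence rm has no sequences under its own tag Tm")
    return count_rm_in_tn / count_rm_in_tm * 100.0


# Made-up counts: 12 rm sequences under Tn, 4800 rm sequences under Tm.
ratio = contamination_ratio(12, 4800)
print(f"{ratio:.2f}%")  # 0.25%
```

The guard against a zero denominator reflects that the ratio is only meaningful when the known sequence is actually recovered under its own tag.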

Preferably, the length of the known sequence is 100-250 bp.

The tag sequence to be detected can be linked to the known sequence by PCR or by a ligase. The tag sequence may be linked to the known sequence directly or indirectly. Indirect linkage may be made through a primer sequence. Here the primer sequence refers to a linker (adapter) commonly used in sequencing, such as a short Y-linker or a long Y-linker.

In one embodiment, in the multiplex pooling sequencing mode, the tag sequence can be linked to the known sequence by PCR.

In one embodiment, in the whole-genome library sequencing mode, if a short Y-linker is used, the tag sequence is first ligated to the short Y-linker, and the known sequence is then joined to the short Y-linker by PCR to obtain the tag sequence-short Y-linker-known sequence.

In one embodiment, if a long Y-linker is used, which already contains the tag sequence, the known sequence is indirectly linked to the tag sequence by ligating the long Y-linker to the known sequence with a ligase.

Optionally, the sequencing platform is an Illumina next generation sequencing platform.

The detection method for tag sequence contamination of the sequencing platform can be used for gene sequencing.

Example 1

1.1 Reagents used: 2× Taq Plus Master Mix (Dye Plus), Ligation Module (assist in san), DNA purification magnetic beads (assist in san), 2× Kapa Enzyme Mix, NextSeq 500/550 Mid Output Reagent Cartridge V2, NextSeq Accessory Box V2, NextSeq 500/550 Mid Output Flow Cell Cartridge V2, NextSeq 500/550 Buffer Cartridge V2.

1.2 Fragment amplification: DNA was amplified using 2× Taq Plus Master Mix (Dye Plus), with lambda phage DNA as the template and the sequences listed in Table 1 as primers, in the reaction system of Table 2 under the reaction conditions of Table 3.

TABLE 1 primer List

TABLE 2 reaction System

PCR component	Volume (μL)
λ DNA, 50 ng/μL	1
Forward primer, 10 μM	1
Reverse primer, 10 μM	1
2× Taq Plus Master Mix	25
Water	22
Total	50

TABLE 3 reaction conditions

The PCR product was detected by agarose gel electrophoresis, and as can be seen from FIG. 1, the size of the detected amplified fragment was consistent with the size of the expected target fragment.

1.3 Purification of PCR products: the PCR products were purified with a 1.4× sample volume of the DNA purification magnetic beads listed in 1.1, washed twice with 80% ethanol and eluted with 50 μL TE, and the concentration was determined. The known sequences r1-r15 were thus obtained.

1.4 Ligation: a ligation reaction system was prepared as shown in Table 5, and each tag sequence to be detected was linked to a known sequence through the primer sequence under the reaction conditions shown in Table 6. The nucleotide sequence of the primer sequence is as follows:

CAAGCAGAAGACGGCATACGAGATNNNNNNGTGACTGGAGTTCCTTGGCACCCGAGAATTC (SEQ ID NO: 46), wherein NNNNNN represents the tag sequence; the specific sequences are shown in Table 4. The primer sequences were ligated to the known sequences, that is, the known sequences were linked to the tag sequences to be detected T1-T15, to obtain the tag sequence-known sequences, specifically T1-r1, T2-r2, T3-r3, T4-r4, T5-r5, T6-r6, T7-r7, T8-r8, T9-r9, T10-r10, T11-r11, T12-r12, T13-r13, T14-r14 and T15-r15. The tag sequences to be detected are barcode sequences.

TABLE 4 tag sequences

Tag number	Tag sequence	SEQ ID NO
T1	TTATAT	31
T2	ACCAAC	32
T3	CTATGC	33
T4	ATTCCT	34
T5	CAACTC	35
T6	TTAGGC	36
T7	AGGATC	37
T8	CAGCAA	38
T9	AAGTAG	39
T10	ACAGTG	40
T11	GGTCCA	41
T12	GATCAG	42
T13	ATTATG	43
T14	GGCTAC	44
T15	GAACCT	45
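As an illustration of how the NNNNNN placeholder in SEQ ID NO: 46 relates to the tag sequences of Table 4, the following minimal sketch substitutes each tag into the primer sequence. It assumes the tag is inserted exactly as listed in Table 4 (the text does not state whether the listed orientation or its reverse complement is used), and the function name `build_primer` is hypothetical.

```python
# Primer sequence with a 6-nt placeholder (SEQ ID NO: 46).
TEMPLATE = "CAAGCAGAAGACGGCATACGAGATNNNNNNGTGACTGGAGTTCCTTGGCACCCGAGAATTC"

# Tag sequences T1-T15 as listed in Table 4.
TAGS = {
    "T1": "TTATAT", "T2": "ACCAAC", "T3": "CTATGC", "T4": "ATTCCT",
    "T5": "CAACTC", "T6": "TTAGGC", "T7": "AGGATC", "T8": "CAGCAA",
    "T9": "AAGTAG", "T10": "ACAGTG", "T11": "GGTCCA", "T12": "GATCAG",
    "T13": "ATTATG", "T14": "GGCTAC", "T15": "GAACCT",
}


def build_primer(template, tag):
    """Replace the single run of N placeholders in the template with one tag."""
    placeholder = "N" * len(tag)
    if template.count("N") != len(tag) or placeholder not in template:
        raise ValueError("template must contain exactly one N run as long as the tag")
    if set(tag) - set("ACGT"):
        raise ValueError(f"tag contains non-ACGT characters: {tag}")
    return template.replace(placeholder, tag)


primers = {tag_id: build_primer(TEMPLATE, seq) for tag_id, seq in TAGS.items()}
print(primers["T1"])
```

Each resulting primer keeps the 61-nt length of SEQ ID NO: 46 and differs from the others only in the six tag positions.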

TABLE 5 connection System

Reagent	Volume (μL)
Ligation buffer (assist in san)	10
Ligase	2.5
DNA (10 ng/μL)	3
Tag sequence to be detected (barcode sequence)	2
Water	32.5
Total	50

TABLE 6 reaction conditions

Reaction temperature	Reaction time
4℃	Hold
22℃	60 min
4℃	Hold

After ligation, the ligation product was purified with a 1.0× sample volume of the DNA purification magnetic beads listed in 1.1, washed twice with 80% ethanol, and eluted with 12 μL TE.

1.5 Library amplification: the tag sequence-known sequences were mixed, a reaction system was prepared as shown in Table 7, and the P5/P7 linkers were added under the reaction conditions shown in Table 8.

TABLE 7 connection system

Reagent	Volume (μL)
KAPA HiFi Mix	12.5
P5/P7 linker	2
DNA (tag sequence-known sequence)	10.5
Total	25

TABLE 8 reaction conditions

After amplification, the product was purified with a 1.2× sample volume of the DNA purification magnetic beads listed in 1.1, washed twice with 80% ethanol, and eluted with 30 μL TE to obtain the DNA library to be tested.

1.6 Quality control

The DNA concentration and fragment size of the amplified sequencing library were measured with Qubit dye and 1.5% agarose gel electrophoresis, respectively.

1.7 Dilution of the library: the constructed libraries were diluted to 10 nM and mixed in amounts that yield 1 M of data per library.
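The dilution to 10 nM can be planned from the measured mass concentration and fragment size. The sketch below is a minimal illustration that assumes the common approximation of 660 g/mol per base pair of double-stranded DNA; the function names, the 20 μL final volume and the example library values are hypothetical and do not come from the example.

```python
def ng_per_ul_to_nM(conc_ng_per_ul, mean_fragment_bp, g_per_mol_per_bp=660.0):
    """Convert a dsDNA mass concentration to a molar concentration in nM."""
    return conc_ng_per_ul / (g_per_mol_per_bp * mean_fragment_bp) * 1e6


def dilution_volumes(stock_nM, target_nM=10.0, final_ul=20.0):
    """Return (library volume, diluent volume) in uL, using C1*V1 = C2*V2."""
    if stock_nM < target_nM:
        raise ValueError("library is already below the target concentration")
    stock_ul = target_nM * final_ul / stock_nM
    return stock_ul, final_ul - stock_ul


# Made-up library: 6.6 ng/uL with a mean fragment size of 400 bp.
stock = ng_per_ul_to_nM(6.6, 400)
library_ul, diluent_ul = dilution_volumes(stock)
print(f"{stock:.1f} nM; mix {library_ul:.1f} uL library with {diluent_ul:.1f} uL diluent")
```

For the made-up values above, the library is at about 25 nM, so 8 μL of library plus 12 μL of diluent gives 20 μL at 10 nM.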

1.8 Sequencing: the samples were sequenced on an Illumina CN500 second-generation sequencing platform, and after sequencing the data were split by tag sequence; the results are shown in Tables 9, 10 and 11.

TABLE 9 Splitting results-1

TABLE 10 Splitting results-2

TABLE 11 Splitting results-3

The data were analyzed as follows. As shown in Table 9, splitting by the tag sequence T2 (corresponding to r2) found no known sequence other than r2 under T2, which indicates that the barcode corresponding to r2 does not contaminate the remaining 14 barcodes. As shown in Table 10, splitting by the tag sequence T14 (corresponding to r14) found r11 in addition to r14 under T14, which indicates that the barcode corresponding to r14 contaminates the barcode corresponding to r11. According to Table 11, splitting by the tag sequence T11 (corresponding to r11) gave 5612 sequences of r11, so the contamination ratio is 4/5612 × 100% = 0.07%.
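Written out with the formula from the description, the ratio for this case is as follows. The notation N(r in class T) for a sequence count is introduced here for illustration, and reading the numerator 4 as the number of r11 sequences found under T14 is an inference from that formula, since the text gives only the quotient.

```latex
\text{contamination ratio}
  = \frac{N(r_{11}\ \text{in class}\ T_{14})}{N(r_{11}\ \text{in class}\ T_{11})} \times 100\%
  = \frac{4}{5612} \times 100\%
  \approx 0.07\%
```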

The above examples are intended to illustrate the disclosed embodiments of the invention and are not to be construed as limiting the invention. In addition, various modifications of the methods and compositions set forth herein, as well as variations of the methods and compositions of the present invention, will be apparent to those skilled in the art without departing from the scope and spirit of the invention. While the invention has been specifically described in connection with various specific preferred embodiments thereof, it should be understood that the invention should not be unduly limited to such specific embodiments. Indeed, various modifications of the above-described embodiments which are obvious to those skilled in the art to which the invention pertains are intended to be covered by the scope of the present invention.

Sequence listing

<110> Shanghai Wehn biomedical science and technology, Inc

<120> detection method for sequencing platform tag sequence pollution

<160>46

<170>SIPOSequenceListing 1.0

<210>1

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>1

gctgacattt tcggt 15

<210>2

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>2

tggcctgccg cagtt 15

<210>3

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>3

cagccaggaa ctatt 15

<210>4

<211>16

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>4

gttttccagt tccgga 16

<210>5

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>5

atccgtgagg tgaat 15

<210>6

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>6

cagcgacgga atatc 15

<210>7

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>7

gatattgaac aggaa 15

<210>8

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>8

taagatactg ctcct 15

<210>9

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>9

gtcatccgcc agcag 15

<210>10

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>10

agtctttgac aatct 15

<210>11

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>11

tatcgactcc cagct 15

<210>12

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>12

catttctgca ccatt 15

<210>13

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>13

tccgtctacg gaaag 15

<210>14

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>14

tcgggaagtg aacgg 15

<210>15

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>15

gacgcaatga ggcac 15

<210>16

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>16

tcatcctctc cggat 15

<210>17

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>17

atgacctgat gacag 15

<210>18

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>18

atacataaaa tcctg 15

<210>19

<211>16

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>19

gaatatgccg gttatc 16

<210>20

<211>16

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>20

cctgatgcag ctggat 16

<210>21

<211>16

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>21

gaagcggcat ggaaag 16

<210>22

<211>16

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>22

ctgaccatcc ggaact 16

<210>23

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>23

tattacgtca gcgag 15

<210>24

<211>16

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>24

tgcccgtcct ccacgg 16

<210>25

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>25

cagcgtgatg gagca 15

<210>26

<211>16

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>26

ccaatccagc cggtca 16

<210>27

<211>16

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>27

tgcagacggc tcagga 16

<210>28

<211>16

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>28

aaagtacgcc cacgac 16

<210>29

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>29

gaaagaagtt cagga 15

<210>30

<211>15

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>30

gattcaaatg ctgca 15

<210>31

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>31

ttatat 6

<210>32

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>32

accaac 6

<210>33

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>33

ctatgc 6

<210>34

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>34

attcct 6

<210>35

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>35

caactc 6

<210>36

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>36

ttaggc 6

<210>37

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>37

aggatc 6

<210>38

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>38

cagcaa 6

<210>39

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>39

aagtag 6

<210>40

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>40

acagtg 6

<210>41

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>41

ggtcca 6

<210>42

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>42

gatcag 6

<210>43

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>43

attatg 6

<210>44

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>44

ggctac 6

<210>45

<211>6

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>45

gaacct 6

<210>46

<211>61

<212>DNA

<213> Artificial Sequence (Artificial Sequence)

<400>46

caagcagaag acggcatacg agatnnnnnn gtgactggag ttccttggca cccgagaatt 60

c 61

Claims

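1. A method for detecting tag sequence contamination of a sequencing platform, comprising at least the following steps: (1) linking each tag sequence to be detected to a known sequence to obtain a tag sequence-known sequence, wherein the types of tag sequence of the sequencing platform are fixed and known; (2) sequencing the sequences obtained in step (1) to obtain a sequencing result, wherein the sequencing result comprises the base sequences and the number of sequences; (3) splitting the sequencing result by tag sequence type, and, if a known sequence rm appears in the classification result of another tag sequence Tn in addition to that of its corresponding tag sequence Tm, determining that the tag sequence Tm is contaminated by the other tag sequence Tn; wherein m and n are natural numbers, and m is not equal to n.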

2. The method for detecting tag sequence contamination of a sequencing platform of claim 1, wherein in step (1), the known sequence is derived from a genome of any species.

3. The method for detecting tag sequence contamination of a sequencing platform of claim 2, wherein in step (1), the known sequence is derived from archaea or phage genome.

4. The method for detecting contamination of sequencing platform tag sequences according to claim 3, wherein the bacteriophage is selected from lambda bacteriophages.

5. The method of detecting sequencing platform tag sequence contamination of claim 1, further comprising one or more of the following features:

1) the tag sequence is selected from single-ended barcode and double-ended barcode;

2) in the step (1), when the types of the tag sequences to be detected are more than 1, the known sequences connected with different tag sequences are different.

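6. The method for detecting tag sequence contamination of a sequencing platform of claim 1, further comprising the following step: calculating the contamination ratio of the tag sequence Tm to the tag sequence Tn as follows: contamination ratio = (number of sequences of the known sequence rm in the class of the tag sequence Tn) / (number of sequences of the known sequence rm in the class of the tag sequence Tm) × 100%.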

7. The method for detecting tag sequence contamination of a sequencing platform of claim 1, wherein the known sequence has a length of 100-250 bp.

8. The method for detecting the contamination of the sequencing platform tag sequence according to claim 1, wherein the tag sequence to be detected is linked to the known sequence by means of PCR or ligase.

9. The method for detecting tag sequence contamination of a sequencing platform of claim 1, wherein the sequencing platform is an Illumina next generation sequencing platform.

10. Use of the method for detecting tag sequence contamination of a sequencing platform of any one of claims 1 to 9 in gene sequencing.