WO2015113063A1 - Methods and systems for identifying crispr/cas off-target sites - Google Patents

Info

Publication number
WO2015113063A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
guide
target
mismatches
nucleotide
Application number
PCT/US2015/013134
Other languages
French (fr)
Inventor
Thomas James CRADICK
Gang Bao
Peng QIU
Original Assignee
Georgia Tech Research Corporation
Application filed by Georgia Tech Research Corporation
Priority to US15/114,799 (patent US10354746B2)
Publication of WO2015113063A1
Priority to US16/410,395 (publication US20190295689A1)
Priority to US16/594,905 (patent US11315659B2)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468 Fuzzy queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20 Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Definitions

  • the invention is generally directed to bioinformatics methods and systems for identifying CRISPR/Cas, or similar nucleotide-directed nuclease on-target and putative off-target sites.
  • the invention also includes systems for ranking and comparing CRISPR/Cas, or similar nucleotide-directed nuclease target sites. These putative cleavage sites can have mismatches, insertions, and/or deletions compared to the guide strand. Determining the possible off-target sites allows better choice of guide strands and testing for effects from nuclease treatment. These methods are an improvement over partial search methods that fail to locate every possible target site.
  • Genome editing has successfully created cell lines and animal models for biological and disease studies, and has a wide range of potential therapeutic applications (Gaj, et al., Trends Biotechnol, 31:397-405 (2013)).
  • engineered nucleases creating DNA double-strand breaks or single-strand breaks ("nicks") at specific genomic sequences greatly enhance the rate of genomic manipulation.
  • Double-strand breaks repaired by the cellular non-homologous end joining (NHEJ) pathway often induce insertions, deletions, and mutations, or other events, which are effective for gene disruptions and knockouts.
  • NHEJ non-homologous end joining
  • CRISPR Clustered regularly interspaced short palindromic repeats
  • Target sites for CRISPR/Cas9 systems can be found near most genomic loci; the only requirement is that the target sequence, matching the guide strand RNA, is followed by a protospacer adjacent motif (PAM) sequence in either orientation (Mojica, et al., Microbiology, 155 (Pt. 3):733-740 (2009); Shah, et al., RNA Biol, 10:891-899 (2013); Horvath, et al., J Bacteriol, 190:1401-1412 (2008)).
  • PAM protospacer adjacent motif
  • Sp Streptococcus pyogenes
  • for the Cas9 from Streptococcus pyogenes (Sp), the PAM is any nucleotide followed by a pair of guanines (marked as NGG).
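  • The "either orientation" requirement can be illustrated with a minimal sketch that lists every 20-nt sequence followed by NGG on the forward strand, or preceded by CCN (the reverse complement of NGG) for sites on the reverse strand. The function names, the 20-nt default, and the 0-based coordinates are illustrative assumptions, not part of the disclosed method.

```python
import re

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_ngg_sites(dna, guide_len=20):
    """List (start, strand, protospacer, pam) for each guide_len-mer next to an NGG PAM.

    start is the 0-based forward-strand coordinate of the leftmost base of the hit.
    Hypothetical helper for illustration only.
    """
    dna = dna.upper()
    sites = []
    # Forward strand: protospacer followed by N-G-G (lookahead keeps overlapping hits).
    for m in re.finditer(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len, dna):
        sites.append((m.start(), "+", m.group(1), m.group(2)))
    # Reverse strand: seen on the forward strand as C-C-N followed by the protospacer.
    for m in re.finditer(r"(?=(CC[ACGT])([ACGT]{%d}))" % guide_len, dna):
        sites.append((m.start(), "-", reverse_complement(m.group(2)),
                      reverse_complement(m.group(1))))
    return sites
```

  • Reverse-strand hits are reported as the sequence read 5' to 3' on that strand, so both orientations can be compared directly against the same guide sequence.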
  • RNA guide strands containing insertions or deletions in addition to base mismatches can result in cleavage and mutagenesis at genomic target sites at levels similar to that of the original guide strand (Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014)). These studies provide the first experimental evidence that genomic sites could be cleaved when the DNA sequences contain insertions or deletions compared with the CRISPR guide strand. These results have demonstrated the need to identify potential off-target sites when choosing guide strand designs and to examine off-target effects experimentally when using CRISPR/Cas systems in cells, plants and/or animals.
  • mismatches and indels are tolerated between the guide strand and target sequences
  • the intended mismatches, truncations, indels or other non-complementary sequences may be included, such that the guide sequence will direct cleavage to the target site, although not a direct matching sequence.
  • CRISPR tools including Cas Online Designer (Hsu, et al., Nat Biotechnol, 31:827-832 (2013)), ZiFit, CRISPR Tools (Hsu, et al., Nat Biotechnol, 31:827-832 (2013)) and Cas OFFinder (Bae, et al., Bioinformatics, 30:1473-1475 (2014)), for different functions (Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Bae, et al., Bioinformatics, 30:1473-1475 (2014); Xiao, et al., Bioinformatics, 30:1180-1182 (2014); Grissa, et al., Nucleic Acids Res, 35:W52-W57 (2007); Grissa, et al., BMC Bioinformatics, 8:172 (2007); Rousseau, et al., Bioinformatics, 25: 33
  • RNA guide strand of choice and genomic sequences.
  • the methods include ranking the potential off-target sites based on the number and location of mismatches, insertions, and/or deletions in the gRNA guide sequence relative to the genomic DNA sequence at a putative target site in the genome, allowing the selection of better target sites and/or experimental confirmation of off-target sites.
  • nuclease preferably a nucleotide-directed nuclease, most preferably a CRISPR/Cas nuclease
  • the nuclease is RNA-directed, DNA-directed, or directed by RNA, DNA and/or alternative nucleotide format.
  • the nuclease can cleave both DNA strands, can be a single nickase, or be a double nickase.
  • the nuclease is Cas9, or a variant thereof.
  • methods that identify binding locations of a nucleotide-directed protein that binds to and/or interacts with DNA, but is not a nuclease, are provided.
  • the methods can include, in a computer system, comparing a series of query sequences including a guide strand sequence (a guide sequence) and at least one variant sequence thereof including one or more nucleotide insertions, one or more nucleotide deletions, and/or one or more nucleotide substitutions relative to the guide sequence, to genomic sequence and reporting target cleavage sites corresponding to locations in the genomic sequence having sequence identity to one or more of the query sequences.
  • the series of query sequences can include all possible guide strand sequence variants having between 0 and 10, preferably between 0 and 5, more preferably 0, 1, or 2 nucleotide insertions relative to the guide sequence; all possible guide strand sequence variants having between 0 and 10, preferably between 0 and 5, more preferably 0, 1, or 2 nucleotide deletions relative to the guide sequence; between 0 and 10, preferably between 0 and 5, more preferably 0, 1, 2, or 3 nucleotide mismatches (e.g., substitutions) relative to the guide sequence; and all possible combinations thereof.
  • an interface, for example a computer-implemented interface, that allows the user to select the number of insertions, deletions, and/or mismatches.
  • the interface is a web-based interface.
  • a web-based interface allows the user choice of insertions or deletions of a single nucleotide, though other
  • the query guide sequences provide guide strand variant sequences having no indels and 0, 1, 2, or 3 mismatches; 1-base deletion, no insertions, and 0, 1, or 2 mismatches; 1-base insertion, no deletions, and 0, 1, or 2 mismatches; 1-base deletion, 1-base insertion, and 0, 1, or 2 mismatches; or any combination thereof.
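  • As a rough illustration of such a series of query sequences, the sketch below enumerates the first three categories (no indel, 1-base deletion, 1-base insertion) by brute force. The function names and the dictionary layout are hypothetical, the combined deletion-plus-insertion category is omitted for brevity, and an actual implementation may search with mismatch tolerance rather than enumerate every variant.

```python
from itertools import combinations, product

BASES = "ACGT"


def deletions(guide):
    """All distinct sequences obtained by deleting one base from the guide."""
    return {guide[:i] + guide[i + 1:] for i in range(len(guide))}


def insertions(guide):
    """All distinct sequences obtained by inserting one base into the guide."""
    return {guide[:i] + b + guide[i:] for i in range(len(guide) + 1) for b in BASES}


def substitutions(seq, n):
    """All sequences that differ from seq at exactly n positions."""
    out = set()
    for positions in combinations(range(len(seq)), n):
        alternatives = [[b for b in BASES if b != seq[p]] for p in positions]
        for replacement in product(*alternatives):
            chars = list(seq)
            for p, b in zip(positions, replacement):
                chars[p] = b
            out.add("".join(chars))
    return out


def query_series(guide, max_mm_no_indel=3, max_mm_indel=2):
    """Build query sets per category (hypothetical layout, illustration only)."""
    series = {"no_indel": set().union(
        *(substitutions(guide, n) for n in range(max_mm_no_indel + 1)))}
    for name, variants in (("deletion", deletions(guide)),
                           ("insertion", insertions(guide))):
        series[name] = {q for v in variants
                        for n in range(max_mm_indel + 1)
                        for q in substitutions(v, n)}
    return series
```

  • The defaults mirror the limits listed above: up to 3 mismatches without indels and up to 2 mismatches alongside a single-base insertion or deletion.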
  • the methods typically include comparing or searching one or more query sequences against genome sequence(s) and reporting putative target sites.
  • an individual guide strand is searched.
  • multiple guide strands are searched, which can allow comparisons of the output or other testing.
  • a target site is reported if a genomic sequence is identified that matches the user-supplied search criteria, which can include presence or lack of sites with no indel, with insertion(s), with deletion(s), with mismatch(es), or with combinations thereof.
  • the user-supplied preferences typically include the number of allowed mismatches for each of the categories listed above. In each of these cases, the user can alternatively choose preferences from general or search type-specific defaults, or modify such preferences.
  • the output contains each site in the genome satisfying the search criteria.
  • the output can also include sites that might satisfy the search criteria if the ambiguous nucleotides were known.
  • the output can contain exact matches to the query sequences and/or contain sites that differ (have mismatches) at, for example, 1-12 positions, that differ at 1-5 positions, or that differ at 1-3 positions. The percentage of the sequences matching can then vary depending on the length of the query sequence and the number of mismatches.
  • the search criteria can result in the reporting of genomic sequences that have approximately at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% sequence identity to one or more of the sequences in the series of query sequences.
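  • The relation between query length, mismatch count, and percent identity is simple arithmetic; a small sketch with a hypothetical helper name makes it concrete. For a 20-nt query, 3 mismatches correspond to 85% identity and 1 mismatch to 95%.

```python
def percent_identity(query_len, mismatches):
    """Percent of query positions that match, given the number of mismatches."""
    if query_len <= 0 or not 0 <= mismatches <= query_len:
        raise ValueError("mismatches must be between 0 and query_len")
    return 100.0 * (query_len - mismatches) / query_len


# Illustrative values for a 20-nt query sequence.
for mm in range(4):
    print(mm, percent_identity(20, mm))
```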
  • the report can include the genomic location and preferably the genomic target sequence for each target site identified.
  • the report can include the cleavage location and/or genomic sequence.
  • the report can include a score indicating the likelihood that the guide sequence will direct a CRISPR/Cas system to the DNA sequence and facilitate nuclease cleavage.
  • the score can be used to rank the putative target sites in a list.
  • the score can include additional information from experiments and/or databases, such as ENCODE, about the genomic context. For example, data on the histones, protein binding or conformation of individual chromosomal regions can indicate if there is less or more likelihood of cleavage.
  • target cleavage locations including genomic sequences with higher sequence identity to the guide sequence receive a lower score relative to target cleavage locations having genomic sequences with lower sequence identity to the guide sequence.
  • the score is increased more for deletion(s) in the genomic sequence relative to the guide sequence (RNA bulges) than for insertions in the genomic sequence relative to the guide sequence (DNA bulges).
  • the score can also reflect that sgRNA bulges are less tolerant to additional base mismatches, and vice versa.
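  • The qualitative scoring rules above can be sketched as a simple additive penalty in which a lower score means a closer match to the guide. All weights below are hypothetical placeholders chosen only to respect the stated ordering (RNA bulges penalized more than DNA bulges, and bulges combined with mismatches penalized further); they are not values from the disclosure.

```python
def off_target_score(mismatches, rna_bulges=0, dna_bulges=0,
                     w_mismatch=1.0, w_rna_bulge=3.0, w_dna_bulge=2.0,
                     w_interaction=0.5):
    """Penalty score for a putative site; lower means closer to the guide.

    All weights are hypothetical. rna_bulges counts deletions in the genomic
    sequence relative to the guide; dna_bulges counts insertions.
    """
    bulges = rna_bulges + dna_bulges
    score = (w_mismatch * mismatches
             + w_rna_bulge * rna_bulges
             + w_dna_bulge * dna_bulges)
    # Bulges are less tolerant of additional mismatches, so add an interaction term.
    score += w_interaction * mismatches * bulges
    return score


# Rank hypothetical sites, best (lowest score) first.
sites = {"perfect": (0, 0, 0), "one_mm": (1, 0, 0),
         "dna_bulge_mm": (1, 0, 1), "rna_bulge_mm": (1, 1, 0)}
ranked = sorted(sites, key=lambda name: off_target_score(*sites[name]))
```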
  • each query sequence in the series includes a protospacer adjacent motif (PAM) suffix.
  • exemplary suffixes include, but are not limited to, NGG, NAG, and NRG.
  • a target cleavage site having an NGG PAM is given a lower score than one having an NAG PAM.
  • Some embodiments may include PAM flanking sequences that are deemed to affect binding.
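  • PAM suffixes such as NGG, NAG, and NRG use IUPAC nucleotide codes (N is any base, R is A or G). A minimal sketch of matching a genomic PAM against such a pattern follows; the function names are illustrative, and the penalty values in pam_penalty are hypothetical, chosen only so that NGG scores lower than NAG.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_pam(site_pam, pattern):
    """True if the genomic PAM is compatible with an IUPAC pattern such as NGG."""
    site_pam, pattern = site_pam.upper(), pattern.upper()
    return len(site_pam) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(site_pam, pattern))


def pam_penalty(site_pam):
    """Hypothetical penalty: 0.0 for NGG, 1.0 for NAG, None if neither matches."""
    if matches_pam(site_pam, "NGG"):
        return 0.0
    if matches_pam(site_pam, "NAG"):
        return 1.0
    return None
```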
  • the scoring and ranking may be separated, with or without user input.
  • the ranking can also be conducted using two steps, such as an initial ranking and then ranking or re-ranking, based on input weight factors.
  • the ranking method may involve a series of weight scores or a position weight matrix that weighs the positions of individual mismatches, insertions, or deletions, totals the scores, and influences the scoring based on their impact on the design criteria.
  • the ranking can also include sequence-specific features such that a match or mismatch weight considers the interacting nucleotide.
  • the sequence-specific weight scores may correlate with hydrogen bonds, as with G-C versus A-T interactions, or may relate to sequence specificities at individual positions, possibly due to protein interactions.
  • the design criteria can include binding, DNA cleavage rate, mutation rate, or other criteria.
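  • A position-weighted ranking of the kind described can be sketched as follows. The sketch assumes sequences are written 5' to 3' with the PAM-proximal base last, handles mismatches only (no bulges), and uses a purely hypothetical weight vector; real weights would have to come from experimental data.

```python
def weighted_mismatch_score(guide, target, weights):
    """Sum the weights of the positions where guide and target differ."""
    if not (len(guide) == len(target) == len(weights)):
        raise ValueError("guide, target, and weights must have equal length")
    return sum(w for g, t, w in zip(guide, target, weights) if g != t)


def rank_sites(guide, targets, weights):
    """Return targets sorted from lowest (best) to highest weighted score."""
    return sorted(targets, key=lambda t: weighted_mismatch_score(guide, t, weights))


# Hypothetical weights that grow toward the PAM-proximal (3') end.
guide = "ACGT"
weights = [1, 2, 3, 4]
ranked = rank_sites(guide, ["ACGA", "TCGT", "ACGT"], weights)
```

  • A two-step ranking could reuse rank_sites with a second weight vector supplied by the user after the initial pass.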
  • the ranking method is applied to genomic loci independently of the search method. In some embodiments the ranking
  • primer sequences suitable for amplifying the genomic sequence at the target cleavage site are reported. These primers may be suitable for PCR amplification or DNA preparation or isolation using other techniques, such as pull-down preparations. The primers may be used for Sanger sequencing, next-generation sequencing, mutation detection assays, such as the Surveyor (Cradick 2009 Thesis) and T7 Endonuclease I, and others.
  • the genome sequence or sequences that the series of query sequences are searched against typically makes up an organismal genome, preferably a complete or nearly complete organismal genome.
  • the organismal genome is a human genome, a rat genome, a mouse genome, or a rhesus macaque genome.
  • the searched sequence could be artificial sequences or a combination of artificial and genomic sequences.
  • the searched sequences can be DNA, RNA, etc.
  • the searched sequences are mRNA, for example, a transcriptome.
  • the genomic sequence(s) can be DNA sequence converted into FASTA or similarly formatted files, then transformed into index entries that have all possible 25-base-long tags in the DNA sequence. In other embodiments, other tagging schemes can be used including longer and shorter tags.
  • the index entries can be sorted and the results stored as a binary main index file.
  • the main index file can be divided into parts, each representing entries whose first approximately 12 nucleotides are identical. In other embodiments, other lengths of index files may be used.
  • a secondary index file can include the position in the main index file where each part starts, and can be added to the end of the index file. Searching genome sequence organized and indexed in such a way can improve the speed of the search, while allowing exhaustive searching.
  • Preferred embodiments utilize index files, though other embodiments could use other index methods, similar expedited search strategies, or provide searching without index files, as done with linear searches through the full sequence space, though these would increase run times.
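  • The indexing scheme can be illustrated with a small in-memory sketch: every fixed-length tag is stored with its position, the entries are sorted, and a secondary index records where each prefix group starts. This is an assumption-laden toy (Python lists and a dict instead of binary index files, exact-prefix lookup only, hypothetical function names), not the disclosed file format.

```python
def build_index(genome, tag_len=25, prefix_len=12):
    """Return (entries, secondary) for all tag_len-long tags in genome.

    entries: sorted list of (tag, position) pairs, standing in for the main index.
    secondary: dict mapping each prefix to the offset in entries where its part starts.
    """
    entries = sorted((genome[i:i + tag_len], i)
                     for i in range(len(genome) - tag_len + 1))
    secondary = {}
    for offset, (tag, _) in enumerate(entries):
        secondary.setdefault(tag[:prefix_len], offset)
    return entries, secondary


def lookup(entries, secondary, query, prefix_len=12):
    """Sorted positions of tags that start with query.

    Assumes prefix_len <= len(query) <= tag length used to build the index.
    """
    prefix = query[:prefix_len]
    start = secondary.get(prefix)
    if start is None:
        return []
    hits = []
    # Scan only the part of the main index that shares the query's prefix.
    for tag, pos in entries[start:]:
        if not tag.startswith(prefix):
            break
        if tag.startswith(query):
            hits.append(pos)
    return sorted(hits)


# Toy genome with short tags so the structure is easy to inspect.
entries, secondary = build_index("ACGTACGTTTACGTAC", tag_len=6, prefix_len=3)
```

  • A mismatch-tolerant search would probe several prefix groups per query; the sketch shows only how the secondary index narrows an exact lookup to one part of the sorted main index.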
  • a particular embodiment of the disclosed method is referred to herein as COSMID (CRISPR Off-target Sites with Mismatches, Insertions, and Deletions).
  • the disclosed methods and systems can aid the design and optimization of CRISPR guide strands by selecting the preferred target sites with minimum Cas-induced off-target cleavage and facilitate the experimental confirmation of off-target activity by providing both putative off-target sites and primers for testing cleavage at the sites in a CRISPR/Cas system.
  • the disclosed methods are more exhaustive and/or have a higher sensitivity for identifying putative and/or actual off-target sites than previously known methods or programs.
  • Figure 1A is a sequence alignment of guide strands to their target sites in HBB and aligned to the corresponding region in HBD.
  • Forward direction guide strands (marked 'greater than') are shown adjacent to NGG, representing the PAM sequence.
  • Guide strands complementary to the reverse strand (marked 'less than') are listed to the right of CCN.
  • Asterisks between HBB and HBD indicate nucleotides that differentiate the two genes, whereas the other nucleotides are the same in both genes.
  • the first base shown in HBB is the sickle cell anemia mutation site.
  • Figure 1B is a sequence alignment showing the high levels of cleavage and mutation that can be found at off-target sites even with mismatch to the guide strands in the first 12 nucleotides closest to the PAM.
  • the on- and off-target mutation rates are listed in decreasing order of the off-target mutation rates at HBD, and illustrate differences between the guide sequence and HBD.
  • a lowercase g indicates that the first base in HBB does not match the guide strands' initial G (for all but R-01).
  • the 12 bases closest to the PAM are boxed and numbered on top.
  • Figure 1C is a bar graph showing the indel percentage in HBB (left-hand bar of each pair) and HBD (right-hand bar of each pair) for mock and guide strands R-01 through R-08 as determined by T7EI mutation detection assays.
  • Figure 2A is a sequence alignment of guide strands to their target sites in CCR5 (shown below the guide strands) and aligned to corresponding region in CCR2 (shown below CCR2). Forward direction guide strands (marked 'greater than') are shown adjacent to NGG, representing the PAM sequence. Guide strands
  • FIG. 2B is an illustration showing that cleavage can occur at off-target sites even with mismatch to the guide strands in both of the first two nts closest to the PAM (R-30).
  • the first two guide strands in the list are in ranked order of the off-target mutation rates at CCR2.
  • Figure 2C is a bar graph showing the indel percentage in CCR5 (left-hand bar of each pair) and CCR2 (right-hand bar of each pair) for mock and guide strands R-01 through R-08 as determined by T7EI mutation detection assays.
  • Figures 3A-3E are bar graphs illustrating how the transfection dosage variability affects on- and off-target mutation rates (%).
  • Figures 3A-3C show R-03 (3A), R-04 (3B), or R-08 (3C) guide strand mutation rates at HBB (left-hand bar of each pair) and HBD (right-hand bar of each pair) loci when cells were transfected with 100, 200, 400, or 800 ng of CRISPR plasmid.
  • Figures 3D-3E show R-25 (3D) or R-30 (3E) guide strand mutation rates at CCR5 (left-hand bar of each pair) and CCR2 (right-hand bar of each pair) loci when cells were transfected with 100, 200, 400, or 800 ng of CRISPR plasmid.
  • Figures 4A-4B are sequence alignments showing on-target loci (4A) and off-target loci (4B) for guide strand R-03 after transfection with the CRISPR plasmid. The regions were amplified with flanking PCR primers, cloned and Sanger sequenced. Sequencing reads are given for each guide strand and aligned to the wild-type sequence. The number of times each read occurred is indicated to the left of the alignment. Unmodified reads are indicated by 'WT'. Mutations, insertions, or deletions were detected in 70% of the reads at HBB and 62% of the reads at HBD. In Figure 4B the guide strand mismatch is boxed.
  • Figure 4C depicts the sequence of chromosomal deletions as a sequence alignment showing PCR products of genomic DNA from cells treated with R-03, amplified using an HBD forward primer and reverse primer downstream of the HBB site, sequenced and aligned to 'HBB-HBD'. Sequencing detected that each product contained indels and mutations consistent with NHEJ, near the target sites for R-03. Insertions, point mutations, and deletions are illustrated.
  • Figure 4D is a line graph depicting the quantitative PCR determination of the percentage of HBD-HBB chromosomal deletions at R-03, and the lower amount after transfection of R-02.
  • Figures 5A-5B are sequence alignments showing on-target loci (5A) and off-target loci (5B) for guide strand R-25 after transfection with the CRISPR plasmid. The regions were amplified with flanking PCR primers, cloned and Sanger sequenced. Sequencing reads are given for each guide strand and aligned to the wild-type sequence. The number of times each read occurred is indicated to the left of the alignment. Unmodified reads are indicated by 'WT'. Mutations, insertions or deletions were detected in 50% of the reads at CCR5 and 32% of the reads at CCR2. In Figure 5B the guide strand mismatch is boxed.
  • Figure 5C depicts the sequence of chromosomal deletions as a sequence alignment showing PCR products of genomic DNA from cells treated with R-25, amplified using a CCR2 forward primer and reverse primer downstream of the CCR5 site, sequenced and aligned to 'CCR2-CCR5'. Sequencing detected that each product contained indels and mutations consistent with NHEJ, near the target sites for R-25. Insertions, point mutations, and deletions are illustrated.
  • Figures 6A-6C are sequence alignments showing on- and off-target sequencing after CRISPR transfection: R-02 targeted mutations at HBB (6A), R-02 mutations at off-target site 2, GRIN3A (6B), and R-30 off-target mutations at CCR2 (6C).
  • Target loci in genomic DNA of HEK-293T cells transfected with each CRISPR construct were amplified, cloned, Sanger sequenced, and aligned to the reference gene, listed above the alignment, and shown aligned to the guide strand. After the guide strand name and genetic loci for each alignment, the number of clones with indels is shown, as is the total number of clones and percentage with indels.
  • the alignment includes the reference gene and guide strand with mismatches boxed.
  • the first column lists the number of times each read occurred and indel size change in basepairs. Unmodified reads are indicated by "WT". Insertions, point mutations, and deletions are illustrated.
  • Figure 7 is a bar graph showing the indel spectra from CRISPR/Cas9 cleavage and NHEJ mis-repair. The change in number of base pairs resulting from each indel was calculated and compiled. The y-axis represents the percentage of each number of insertion or deletion.
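  • Compiling an indel spectrum of this kind amounts to counting reads by their change in length and expressing each count as a percentage. The sketch below assumes the percentages are taken over indel-containing reads only (size change not equal to zero) and uses made-up size changes purely for illustration.

```python
from collections import Counter


def indel_spectrum(size_changes):
    """Map each non-zero size change (bp) to its percentage among indel reads.

    Negative values are deletions, positive values are insertions, and 0 marks
    an unmodified read that is excluded from the spectrum.
    """
    indels = [c for c in size_changes if c != 0]
    if not indels:
        return {}
    counts = Counter(indels)
    return {size: 100.0 * n / len(indels) for size, n in sorted(counts.items())}


# Hypothetical per-read size changes, not data from the figure.
spectrum = indel_spectrum([0, 0, -1, -1, 1, -3])
```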
  • Figures 8A and 8B are diagrams showing that CRISPR can cleave at genomic sites with mismatches to the guide strand and with insertions or deletions relative to the guide strand, for example at off-target sites with a 1-bp insertion (DNA bulge) (8A) or a 1-bp deletion (RNA bulge) (8B).
  • the 20-nt guide sequence in the sgRNA is shown aligned with the genomic target sequence (protospacer) containing single-base DNA bulge (8A, asterisk) or single-base sgRNA bulge (8B, ⁇ ).
  • the zoom-in nucleotide sequences of protospacer and PAM are shown above the sgRNA guide sequence. Positions of nucleotides in the target are numbered 3' to 5' starting from the nucleotide next to PAM.
  • Figure 9A is a sequence alignment illustrating that a single nucleotide was deleted from the original R-01 sgRNA at all possible positions (dashes) throughout the guide sequence for sgRNA R-01 targeting HBB.
  • Figure 9B is a grid mapping the deletions, which in the case of repeated bases, can be thought to have been a deletion of either base. Semi-transparent squares in two positions in the same sgRNA indicate that deletions can be interpreted at either of adjacent positions (also marked by 'or') due to identical nucleotides at both positions. Sequence of the original sgRNA is in the top row of the grid.
  • Figure 9C is a bar graph showing cleavage activity aligned to the corresponding sgRNA variants of 9A and 9B.
  • Figure 10A is a sequence alignment illustrating that a single nucleotide was deleted from the original sgRNA at all possible positions (dashes) throughout the guide sequence for sgRNA R-30 targeting CCR5.
  • Figure 10B is a grid mapping the deletions, which in the case of repeated bases, can be thought to have been a deletion of either base.
  • Semi-transparent squares in two positions in the same sgRNA indicate that deletions can be interpreted at either of adjacent positions (also marked by 'or') due to identical nucleotides at both positions.
  • the sequence of the original sgRNA is in the top row of the grid.
  • the graph in Figure 10C indicates cleavage activity for the corresponding sgRNA variants measured by T7EI assay in HEK293T cells at the HBB site for the sgRNA variants in (10A), and compares to the activity of the original full-length guide strand.
  • Figures 11A and 11B are alignments of -1 nt sgRNA variants to the HBB (11A) and CCR5 (11B) target loci showing mismatches instead of a DNA bulge. Only the variants with detectable intracellular activities are shown. The target loci and index names of the sgRNA variants are indicated on the left of each alignment. Mismatches in the guide sequence and in the "NGG" PAM are marked with asterisks below each alignment. The alignment with the minimum number of mismatches is shown for each sgRNA variant. Nucleotide "U" in the guide RNA is replaced with "T" for the ease of comparison to the target site. For example, the cleavage of R-01 with a deletion at position 6 or 7 (11A) can be modeled either with a deletion and no mismatches or without a deletion but with four mismatches close to the PAM.
  • the CCR5 guide strand with a deletion at position 9 or 10 (11B) has considerable activity and can be modeled either with a deletion and no mismatches or without a deletion. If this interaction were modeled without a deletion, there would be six mismatches close to the PAM (indicated by *), which would generally prevent cleavage.
  • Figure 12A is a sequence alignment showing 1-6 bp truncations at the 5' end of the guide sequence R-01 targeted to the HBB gene.
  • Figure 12B is a grid showing cleavage activity for the corresponding sgRNA variants measured by T7EI assay in HEK293T cells at the HBB site for the sgRNA variants in (12A). Truncated positions are highlighted in the grid. Sequence of the original sgRNA is in the top row of the grid.
  • Figure 12C is a bar graph showing cleavage activity aligned to the
  • Figure 13A is a grid showing the activity of Cas9 at the HBB target site carrying single-base sgRNA bulges associated with different variants of the original sgRNA R-01. Each variant shown has a single nucleotide, A, G, C, or U, inserted into the original sgRNA at the positions shown throughout the guide sequence.
  • Figure 14A is a grid showing the activity of Cas9 at the CCR5 target site resulting from treatment with different variants of R-30 with single-base bulges.
  • Figure 14B is a bar graph showing corresponding cleavage activities quantified by T7EI assay in HEK293T cells.
  • Figures 15A and 15B are sequence alignments of +1 nt sgRNA variants to the HBB (15A) and CCR5 (15B) target loci, showing that alignment without a bulge leads to many mismatches instead of an sgRNA bulge. Only the variants with detectable intracellular activities are shown. The target loci and index names of the sgRNA variants are indicated on the left of each alignment. Mismatches in the guide sequence and in the "NGG" PAM are marked with asterisks below each alignment. The alignment with the minimum number of mismatches is shown for each sgRNA variant. Nucleotide "U" in the guide RNA is replaced with "T" for the ease of comparison to the target site.
  • Figures 16A and 16C are grids showing the activity of Cas9 at the HBB target site carrying single-base DNA bulges (16A) or sgRNA bulges (16C) associated with different variants of the original sgRNAs R-08.
  • Figure 17A is a series of sequence alignments comparing guide RNA variants with insertions greater than one nucleotide and their original target sites R-01 or R-30.
  • the guide RNAs are named for the position of the insertions.
  • Figures 17A and 17B show that larger bulges can also lead to activity.
  • Figure 18A is a sequence alignment showing the human HBB gene targeted by Cas9 nickases (Cas9n) with paired guide strands R-01 and R-02. PAMs are indicated with bars.
  • Asterisks indicate P-values from a two-tailed independent two-sample t-test. *P < 0.05, **P < 0.01, ***P < 0.001.
  • Figures 18A and 18B show that bulges are tolerated in other CRISPR systems including the nickase nucleases, which only cut one strand.
  • Figures 19A and 19B are sequence alignments showing on-target and off-target alignments containing bulges for sgRNAs R-30 targeted to the CCR5 gene (19A), and R-31 targeted to the ERCC5 gene (19B).
  • Off-4 has a mismatch with R-30, 14 nt from the PAM.
  • Horizontal lines indicate the PAM.
  • the mismatch shown between the initial G in sgRNA R-31 and the corresponding nt in its target site or in Off-1 does not affect binding or cleavage.
  • the genomic DNA was harvested and amplified by flanking primers.
  • Figures 19C and 19D display the mutations, insertions and deletions introduced by mis-repair after cleavage at these sites.
  • the Sanger sequencing reads of amplified off-target sites are aligned to the wild-type genomic sequence and sgRNAs for R-30 (19C) and R-31 (19D). The number of times each sequence occurred is indicated to the left of the alignment, if greater than one.
  • Figure 19E is a bar graph showing activities (indel percent) analyzed by deep sequencing at genomic off-target loci containing bulges coupled with mismatches and in some cases alternative NAG-PAMs.
  • the level after CRISPR treatment with the indicated guide strand is graphed against mutations detected in mock treated samples (likely by mis-reads) (top bar in each pair, outlined) and treated samples (bottom bar in each pair) with sgRNAs at off-target loci shown in the table to the left.
  • the table on the left shows numbers of mismatches at off-target loci in addition to the bulge (no. mis), bulge types (bulge types), positions of bulges from the PAM (bulge pos), labels for the loci, and sequences of off-target sites including PAMs. In these off-target genomic sequences, mismatches are lighter, a deleted base compared to the sgRNA is marked as '-' (sgRNA bulge), and an inserted base compared to the sgRNA is marked as underlined letters (DNA bulge). Error bars, Wilson intervals (see 'Materials and Methods' section). *P < 0.05, ***P < 0.001 as determined by Fisher's exact test. The % indel values of treated samples are also indicated.
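  • The Wilson intervals used for the error bars can be computed directly from read counts. The following is a minimal sketch, assuming a 95% interval (z = 1.96) and hypothetical function and parameter names; it is not taken from the 'Materials and Methods' section.

```python
import math


def wilson_interval(indel_reads: int, total_reads: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the proportion indel_reads / total_reads.

    z = 1.96 corresponds to an approximately 95% interval (an assumption here).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    p = indel_reads / total_reads
    denom = 1 + z * z / total_reads
    centre = (p + z * z / (2 * total_reads)) / denom
    half = z * math.sqrt(p * (1 - p) / total_reads + z * z / (4 * total_reads ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)
```

Unlike the simple normal approximation, this interval stays within 0 and 1 and remains informative when few or no indel reads are observed, which is the usual situation at weak off-target sites.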
  • Figure 20 is a sequence alignment showing the effects of R-30 cleavage and mis-repair at the off-target site 5 (Off-5), quantified by Sanger sequencing.
  • One of the 24 sequencing reads was not wild type with an inserted a in lowercase, the other
  • Figures 21A and 21B are genetic maps showing the histone modification status and annotation of R30 Off-4 (21A) and Off-5 (21B) loci obtained from the
  • Figure 22 is a bar graph showing the results of quantitative PCR of sgRNA expression (sgRNA Log Fold Change (-ddCt)) levels in HEK293T cells for R-01 and
  • Figures 23A-23C are bar graphs showing the range of insertions and deletions introduced with matching guide strand and guide strands with bulges (the indel spectra, the percent in total indels mapped against change in bases) for original sgRNAs and sgRNA variants determined using deep sequencing for R-01 original sgRNA (23A), and variants for DNA bulge (R1 -7/6) (23B) and sgRNA bulge (R1 C+12) (23C).
  • the change in bases at predicted cut sites resulting from indicated sgRNAs was calculated from ~10^4 reads per sample.
  • the y-axis represents percentages in all indel-reads for that sgRNA. Overall % indel in total reads are indicated in each graph.
  • Figures 24A-24C are bar graphs showing indel spectra (percent in total indels mapped against change in number of bases) for original sgRNAs and sgRNA variants determined using deep sequencing for R-30 original sgRNA (24A), and variants for DNA bulge (R30-11) (24B) and sgRNA bulge (R30 U+12) (24C).
  • the change in bases at predicted cut sites resulting from indicated sgRNAs was calculated from ~10^4 reads per sample.
  • the y-axis represents percentages in all indel-reads for that sgRNA. Overall % indel in total reads are indicated in each graph. Expression of Cas9 and the original guide strand or guide strand with indels result in insertions or ranges of deletions.
  • Figure 25A is a screen-shot of an exemplary COSMID user input interface, including drop-down list of searchable genomes, a box to enter a query guide sequence of choice, a box to enter the type of PAM, radio buttons to select allowed number of mismatches, insertions and deletions, and both selection criteria and user input boxes to modify the primer design parameters.
  • Figure 25B is a flow chart showing the COSMID software design and the major steps in performing a search.
  • Figure 25C is a list of exemplary search strings with insertions or deletions in the first six possible positions demonstrating how the program searches for each insertion or deletion (if selected by user). Alternate deletions of repeated bases are synonymous.
  • Figure 26A is an exemplary COSMID user interface for selecting a searchable genome.
  • Figure 26B is an exemplary COSMID user interface for entering a query sequence.
  • Figure 26C is an exemplary COSMID user interface for entering the protospacer adjacent motif (PAM) and selecting the type and number of mismatches and indels.
  • Figure 26D is an exemplary COSMID user interface for entering primer design parameters.
  • Figure 26E is an alignment showing the tags generated and used to search the human genome when a COSMID user enters the guide sequence exemplified in Figures 26A and allows a 1-base deletion to permit a gRNA bulge (e.g., the DNA is one base shorter than the guide sequence, as illustrated above the alignment). Deletions of either of two consecutive identical bases result in the same sequence and are therefore omitted from the list.
  • Figure 26F is an alignment showing the tags generated and used to search the human genome when a COSMID user enters the guide sequence exemplified in Figures 26A and allows a 1-base insertion to permit a DNA bulge (e.g., the guide sequence RNA is one base shorter than the DNA, as illustrated above the alignment).
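  • The tag generation described for Figures 25C, 26E and 26F can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the inserted genomic base is represented by the wildcard N, and insertions are only placed between two guide bases; the actual COSMID tag format may differ.

```python
def deletion_variants(guide: str) -> list[str]:
    """Search strings with one guide base removed (gRNA bulge).

    Deleting either of two consecutive identical bases gives the same
    string, so duplicates are dropped while keeping first-seen order.
    """
    seen: set[str] = set()
    variants: list[str] = []
    for i in range(len(guide)):
        variant = guide[:i] + guide[i + 1:]
        if variant not in seen:
            seen.add(variant)
            variants.append(variant)
    return variants


def insertion_variants(guide: str) -> list[str]:
    """Search strings with a wildcard N between adjacent guide bases (DNA bulge)."""
    return [guide[:i] + "N" + guide[i:] for i in range(1, len(guide))]
```

For a guide containing a run of repeated bases, the deletion list is shorter than the guide length, which mirrors the statement that alternate deletions of repeated bases are synonymous.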
  • Figure 26G is an exemplary COSMID HTML output that shows query type, number of mismatches if the PAM ends in RG (NAG or NGG), the chromosomal position, strand, cut site, the ranking score and left PCR primer. The right primer is off screen here.
  • Figure 27 is a bar graph showing on- and off-target cleavage rates (% indel frequency) for guide strand R-01 for groups of identical sites. This experiment indicated that other factors in addition to complementary sequence may play a role in mutation rate; these features may be added into the search calculations, scoring and ranking in other embodiments.
  • Figures 28A and 28B are sequence alignments showing two examples of genomic sites identified using different search queries for R-30. Both possible off-target sites can align to search strings without indels, with a deletion and with an insertion. Search strings are shown aligned to each identified chromosomal location. Mismatches are shaded, and insertions or deletions are illustrated with a dash ('-').
  • Figures 29A-29D are genetic maps showing the number and location of the additional genomic loci found while searching for putative off-target sites with and without indels for R-01 (29A, 29C) and R-30 (29B, 29D).
  • Figures 29A and 29B display putative off-target sites with up to three mismatches and no indels.
  • Figures 29C and 29D include the addition of sites with up to two mismatches and either an insertion or a deletion.
  • Each vertical line represents each identified off-target site, plotted at its chromosomal location by the UCSC genome browser. The chromosome numbers are listed on edges of the plots.
  • Figure 30A is a flow chart of an exemplary method for generating a ranked list of off-target sites that could be implemented on a computer.
  • a user query is used to generate search parameters used by the algorithm to construct a list of possible off-target cleavage sites.
  • the possible off-target sites are ranked by their predicted off-target cleavage activity (or chance for activity) and output as results in a ranked list.
  • Figure 30B is a flow chart of an additional exemplary method for generating a ranked list of off-target sites that could be implemented on a computer. This method includes estimating the results and generating a list of primers designed for amplifying and/or testing the mutations introduced at each site.
  • Figure 30C is a flow chart illustrating an exemplary algorithm for executing the disclosed methods of identifying target sites and/or ranking or scoring target sites.
  • Figure 31 is a block diagram of a preferred network-based implementation containing a computer server and one or more client computers in communication over a network.
  • Figure 32 is a block diagram of a computer server containing I/O device(s), a processor, memory, and storage.
  • Figure 33 is a schematic of a graphical user interface (GUI) for receiving input parameters for a computer-implemented off-target site search method.
  • the GUI is displayed in a web browser and contains check boxes, drop-down lists, radio buttons, and text boxes for inputting the query sequence, modifying the search parameters, and customizing design criteria for PCR primers that can be used to test off-target cleavage using the queried guide sequence.
  • Figure 34 is a curve illustrating the score (x-axis) as a function of the location/position of the mismatch or indel relative to the PAM (Y-axis).
  • operative linkage and "operatively linked” (or “operably linked”) are used interchangeably with reference to a juxtaposition of two or more components (such as sequence elements), in which the components are arranged such that both components function normally and allow the possibility that at least one of the components can mediate a function that is exerted upon at least one of the other components.
  • an enhancer is a transcriptional regulatory sequence that is operatively linked to a coding sequence, even though they are not contiguous.
  • an "exogenous" molecule is a molecule that is not normally present in a cell, but can be introduced into a cell by one or more genetic, biochemical or other methods. "Normal presence in the cell" is determined with respect to the particular developmental stage and environmental conditions of the cell. Thus, for example, a molecule that is present only during embryonic development of muscle is an exogenous molecule with respect to an adult muscle cell. Similarly, a molecule induced by heat shock is an exogenous molecule with respect to a non-heat-shocked cell.
  • An exogenous molecule can include, for example, a functioning version of a malfunctioning endogenous molecule, a malfunctioning version of a normally-functioning endogenous molecule or an ortholog (functioning version of endogenous molecule from a different species).
  • As used herein, the terms “nucleic acid,” “polynucleotide,” and “oligonucleotide” are interchangeable and refer to a deoxyribonucleotide or ribonucleotide polymer, in linear or circular conformation, and in either single- or double-stranded form. For the purposes of the present disclosure, these terms are not to be construed as limiting with respect to the length of a polymer.
  • the terms can encompass known analogues of natural nucleotides, as well as nucleotides that are modified in the base, sugar and/or phosphate moieties (e.g., phosphorothioate backbones). In general and unless otherwise specified, an analogue of a particular nucleotide has the same base-pairing specificity; i.e., an analogue of A will base-pair with T.
  • As used herein, the terms “polypeptide,” “peptide” and “protein” are used interchangeably to refer to a polymer of amino acid residues. The term also applies to amino acid polymers in which one or more amino acids are chemical analogues or modified derivatives of corresponding naturally-occurring amino acids.
  • cleavage refers to the breakage of the covalent backbone of a nucleic acid molecule. Cleavage can be initiated by a variety of methods including, but not limited to, enzymatic or chemical hydrolysis of a phosphodiester bond. Both single-stranded cleavage and double- stranded cleavage are possible, and double-stranded cleavage can occur as a result of two distinct single-stranded cleavage events. DNA cleavage can result in the production of either blunt ends or staggered "sticky" ends. In certain embodiments cleavage refers to the double-stranded cleavage between nucleic acids within a double-stranded DNA or RNA chain.
  • genomic DNA refers to deoxyribonucleic acids that are obtained from the nucleus of an organism.
  • genomic DNA encompasses genetic material that may have undergone amplification, purification, or fragmentation. In some cases, genomic DNA encompasses nucleic acids isolated from a single cell, or a small number of cells, clones of cells or pools of cells.
  • the "genome” in the sample that is of interest in a study may encompass the entirety of the genetic material from an organism, or it may encompass only a selected fraction thereof: for example, a genome may encompass one chromosome from an organism with a plurality of chromosomes.
  • the genome may refer to the reference sequence for an organism or the sequence of one or more individuals.
  • the genomic sequence can contain or be comprised solely of man-made, altered or non-natural sequences, including, but not limited to, natural genomic sequences with the inclusion of knocked-in sequences, such as GFP expression cassettes or tags, or cDNA or other sequences for the expression of a gene of interest.
  • the genome may not consist of natural chromosomal sequences, but of sequences assembled by man.
  • genomic region or “genomic segment”, as used interchangeably herein, denote a contiguous length of nucleotides in a genome of an organism.
  • a genomic region may be of a length as small as a few kb (e.g., at least 5 kb, at least 10 kb or at least 20 kb), up to an entire chromosome or more.
  • the terms “genome-wide” and “whole genome” are interchangeable and refer generally to the entire genome of a cell or population of cells and include the sequences normally found in those cells and introduced DNA such as knocked-in cDNAs, promoters, enhancers, tags or other naturally occurring or man-made sequences or combinations of sequences.
  • the terms “genome-wide” and “whole genome” will generally encompass a complete DNA sequence of all of an organism's DNA (chromosomal, mitochondrial, etc.). Alternatively, the terms “genome-wide” or “whole genome” may refer to most or nearly all of the genome.
  • the terms "genome-wide” or “whole genome” may exclude a few portions of the genome that are difficult to sequence, do not differ among cells or cell types, are not represented on a whole genome array, or raise some other issue or difficulty that prompts exclusion of such portions of the genome.
  • the genome is considered complete if more than 90%, more than 95%, more than 99%, or more than 99.9% of the base pairs have been sequenced. In some cases, less is known of a genome, but the known fraction, can be of use.
  • the genome can refer to any organism for which a portion of the genome has been sequenced.
  • the whole genome is a human genome, a rat genome, a mouse genome, a Zebrafish genome, an Arabidopsis genome, a yeast genome, a D.
  • the "genome” will contain inserted or modified genomic sequences.
  • the set {A, C, G, T, U} for adenosine, cytidine, guanosine, thymidine, and uridine respectively.
  • the set {A, C, G, T, U, I, X, Ψ} for adenosine, cytidine, guanosine, thymidine, uridine, inosine, xanthosine, and pseudouridine respectively.
  • the set of characters is {A, C, G, T, U, I, X, Ψ, R, Y, N} for adenosine, cytidine, guanosine, thymidine, uridine, inosine, xanthosine, pseudouridine, unspecified purine, unspecified pyrimidine, and unspecified nucleotide respectively.
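  • A query or genomic sequence can be checked against whichever character set an embodiment accepts. The sketch below is illustrative only: the set names and the function are hypothetical, and the symbols Ψ (pseudouridine) and R (unspecified purine) are assumed from standard nucleotide notation.

```python
# Character sets of increasing size; the names are illustrative.
BASIC = set("ACGTU")
EXTENDED = BASIC | set("IXΨ")      # adds inosine, xanthosine, pseudouridine
AMBIGUOUS = EXTENDED | set("RYN")  # adds purine, pyrimidine, any nucleotide


def invalid_characters(sequence: str, alphabet: set[str] = AMBIGUOUS) -> list[str]:
    """Return the sorted characters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence.upper()) - alphabet)
```

An empty list means the sequence can be passed on to the search; a non-empty list can be reported back to the user before any genome is scanned.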
  • the modified sequences, non-natural sequences, or sequences with modified binding, may be in the genomic, the guide or the tracr sequences.
  • Nucleotide and/or amino acid sequence identity percent is understood as the percentage of nucleotide or amino acid residues that are identical with nucleotide or amino acid residues in a candidate sequence in comparison to a reference sequence when the two sequences are aligned. To determine percent identity, sequences are aligned and if necessary, gaps are introduced to achieve the maximum percent sequence identity. Sequence alignment procedures to determine percent identity are well known to those of skill in the art. Often publicly available computer software such as BLAST, BLAST2, ALIGN2 or MEGALIGN (DNASTAR) software is used to align sequences. Those skilled in the art can determine appropriate parameters for measuring alignment, including any algorithms needed to achieve maximal alignment over the full-length of the sequences being compared.
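  • As a worked illustration of percent identity, the sketch below counts identical columns in two sequences that have already been aligned (for example by one of the programs named above). The function name is hypothetical, and dividing by the number of alignment columns is one convention among several, assumed here for simplicity.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns that are gaps in both sequences.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if not (a == gap and b == gap)]
    if not columns:
        raise ValueError("alignment has no informative columns")
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)
```

With this convention, an alignment of ACGT against ACGA gives 75% identity, and introducing a single gap into a five-column alignment with four matching columns gives 80%.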
  • mutation encompasses any change in a DNA, RNA, or protein sequence from the wild type sequence or some other reference, including without limitation point mutations, transitions, insertions, transversions, translocations, deletions, inversions, duplications, recombinations, or combinations thereof.
  • insertion is used when the endogenous DNA sequence has one or more extra bases compared with the sequence of the guide strand (a DNA bulge).
  • deletion is used when the endogenous DNA sequence has one or more missing bases compared with the guide strand (a RNA bulge).
  • the term "indels” indicates either insertions or deletions. Although insertions and deletions may be viewed as mismatches, as used herein in the context of alignments and identity between a CRISPR guide strand and a genomic target site, the term "mismatch" is used exclusively for base-pair mismatch when the guide strand and the potential off-target sequence have the same length, but differ in base composition.
  • Guide strands and genomic sequences can have multiple mismatches, multiple insertions, multiple deletions or combinations thereof, such as one nucleotide inserted and two mismatches. In some cases the alignment could be represented in several ways, such as with an indel and a few mismatches or without an indel but with a larger number of mismatches.
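  • The three terms can be made concrete by counting them in a gapped alignment. The following sketch assumes the guide is written as DNA, that both strings are already aligned with '-' marking a gap, and that the function name is hypothetical.

```python
def classify_alignment(guide: str, dna: str, gap: str = "-") -> dict[str, int]:
    """Count mismatches, insertions (DNA bulge) and deletions (RNA bulge)."""
    if len(guide) != len(dna):
        raise ValueError("aligned strings must have equal length")
    counts = {"mismatch": 0, "insertion": 0, "deletion": 0}
    for g, d in zip(guide.upper(), dna.upper()):
        if g == gap and d == gap:
            continue
        if g == gap:
            counts["insertion"] += 1   # DNA has an extra base
        elif d == gap:
            counts["deletion"] += 1    # DNA is missing a base
        elif g != d:
            counts["mismatch"] += 1    # same position, different base
    return counts
```

Because the same pair of sequences can sometimes be aligned in more than one way, the counts describe a particular alignment rather than the pair of sequences itself.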
  • endonucleases refers to any wild-type or variant enzyme capable of catalyzing the hydrolysis (cleavage) of bonds between nucleic acids within a DNA or RNA molecule, preferably a DNA molecule.
  • endonucleases include type II restriction endonucleases such as FokI, HhaI, HindIII, NotI, BbvCI, EcoRI, BglII, and AlwI.
  • Endonucleases comprise also rare-cutting endonucleases when having typically a polynucleotide recognition site of about 12-45 basepairs (bp) in length, more preferably of 14-45 bp.
  • Rare-cutting endonucleases induce DNA double-strand breaks (DSBs) at a defined locus.
  • Rare-cutting endonucleases can for example be a homing endonuclease, a mega-nuclease, a chimeric Zinc-Finger nuclease (ZFN) or TAL effector nuclease (TALEN) resulting from the fusion of engineered zinc-finger domains or TAL effector domain, respectively, with the catalytic domain of a restriction enzyme such as FokI, other nuclease or a chemical endonuclease.
  • exonuclease refers to any wild type or variant enzyme capable of removing nucleic acids from the terminus of a DNA or RNA molecule, preferably a DNA molecule.
  • Non-limiting examples of exonucleases include exonuclease I, exonuclease II, exonuclease III, and exonuclease IV.
  • nuclease generally encompasses both endonucleases and exonucleases; however, in some embodiments the terms "nuclease" and "endonuclease" are used interchangeably herein to refer to endonucleases, i.e., to refer to enzymes that catalyze bond cleavage within a DNA or RNA molecule.
  • CRISPR Clustered Regularly Interspaced Short Palindromic Repeats
  • the prokaryotic CRISPR/Cas system has been adapted for gene editing (silencing, enhancing or changing specific genes) in eukaryotes (see, for example, Cong, Science, 15:339(6121):819-823 (2013) and Jinek, et al., Science, 337(6096):816-21 (2012)).
  • a number of methods exist for introducing the guide strand and Cas protein into cells including viral transduction, injection or micro-injection, nano-particle or other delivery, uptake of proteins, uptake of RNA or DNA, uptake of combination of protein and RNA or DNA. Combinations of methods can also be used.
  • compositions for use in genome editing using the CRISPR/Cas systems are described in detail in WO 2013/176772 and WO 2014/018423, which are specifically incorporated by reference herein in their entireties.
  • CRISPR refers to clustered regularly interspaced short palindromic repeats or any of the DNA loci that serve to direct CRISPR associated proteins or similar nucleotide-directed nucleases. It also describes man-made, constructed, or selected systems derived using these frameworks or proteins. CRISPR systems and the related proteins vary among the currently described type I, type II and type III systems, though it is possible other analogous systems have yet to be described.
  • CRISPR system refers collectively to transcripts and other elements involved in the expression of or directing the activity of CRISPR-associated (“Cas”) genes, including sequences encoding a Cas gene, a tracr (trans-activating CRISPR) sequence (e.g., tracrRNA or an active partial tracrRNA), a tracr-mate sequence (encompassing a "direct repeat” and a tracrRNA-processed partial direct repeat in the context of an endogenous CRISPR system), a guide sequence (also referred to as a "spacer” in the context of an endogenous CRISPR system), and other sequences and transcripts from a CRISPR locus.
  • One or more tracr mate sequences operably linked to a guide sequence can also be referred to as pre-crRNA (pre-CRISPR RNA) before processing or crRNA after processing by a nuclease.
  • CRISPR systems can also include modified, swapped or engineered guide, tracr or chimeric RNA sequences and the protein with which they interact (for example, Briner, et al., Mol Cell 56(2):333-9 (2014)).
  • the methods disclosed herein may also be applicable to other, non-CRISPR nucleotide-directed nucleases.
  • a tracrRNA and crRNA are linked and form a chimeric crRNA-tracrRNA hybrid where a mature crRNA is fused to a partial tracrRNA via a synthetic stem loop to mimic the natural crRNA:tracrRNA duplex as described in Cong, Science, 15:339(6121):819-823 (2013) and Jinek, et al, Science,
  • a single fused crRNA-tracrRNA construct can also be referred to as a guide RNA or gRNA (or single-guide RNA (sgRNA)).
  • the crRNA portion can be identified as the 'target sequence' and the tracrRNA is often referred to as the 'scaffold'.
  • the target sequence can be perfectly complementary to a targeted site, as is often the case for an on-target site, or may also contain mismatches, insertions, deletions or be of different length than the cleaved intended or un-intended sites.
  • the tracrRNA can be modified in length, sequence or other composition.
  • the guide portion or guide sequence can be modified in sequence and/or in length.
  • the guide strand length varies between species.
  • the length of the guide RNA is shortened, lengthened or further changed to alter the affinity to the complementary sequence in the hope of increasing specificity or affecting the activity (Fu, et al., Nature Biotech. (3):279-84. (2014)).
  • a gRNA/Cas9 complex forms and is recruited to the genomic target sequence through binding to the PAM and/or the base-pairing between the gRNA sequence and the complement to the target sequence in the genomic DNA (Addgene, "CRISPR in the Lab: A Practical Guide,” Addgene website, 2014).
  • the guide strand and target sequence must be sufficiently complementary, and the target sequence must be followed by a protospacer adjacent motif (PAM) sequence.
  • the specified nucleotides in the PAM may range in spacing from the protospacer; in some systems the PAM sequence is NGG, while in others the specified nucleotides can be further away, as in NNNNGATT, where N is any nucleotide.
  • the PAM sequence is present in the DNA target sequence, but not in the gRNA sequence. Any DNA sequence with the correct target sequence followed by the PAM sequence may be bound by Cas9, and may be cleaved.
  • wild type Sp Cas9 makes a double strand break 3-4 nucleotides upstream of the PAM sequence, which can be repaired by the Non-Homologous End Joining (NHEJ) DNA repair pathway, the Homology Directed Repair (HDR) pathway or alternative DNA repair pathways.
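  • The position of the expected break relative to the PAM can be expressed as simple index arithmetic. The sketch below is a minimal illustration, assuming a site written 5' to 3' with the PAM at its 3' end, 0-based indexing, and a default offset of 3 nucleotides; the function names are hypothetical.

```python
def cut_index(pam_start: int, offset: int = 3) -> int:
    """0-based index of the first base on the PAM side of the break."""
    if pam_start < offset:
        raise ValueError("PAM is too close to the start of the sequence")
    return pam_start - offset


def split_at_cut(site_with_pam: str, pam_length: int = 3, offset: int = 3) -> tuple[str, str]:
    """Split a protospacer+PAM string into the two fragments flanking the break."""
    pam_start = len(site_with_pam) - pam_length
    cut = cut_index(pam_start, offset)
    return site_with_pam[:cut], site_with_pam[cut:]
```

For a 20-nucleotide protospacer followed by a 3-nucleotide PAM, the break falls after the 17th protospacer base, leaving three protospacer bases attached to the PAM.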
  • the system can be manipulated to induce a variety of gene modifications including insertions and deletions causing frameshifts and/or premature stop codons, specific nucleotide changes, etc.
  • one or more vectors driving expression of one or more elements of a CRISPR system are introduced into a target cell such that expression of the elements of the CRISPR system direct formation of a CRISPR complex.
  • the sgRNA expression plasmid contains the target sequence (generally about 20 nucleotides), a form of the tracrRNA sequence (the scaffold), as well as a suitable promoter and necessary elements for proper processing in eukaryotic cells.
  • Such vectors are commercially available (see, for example, Addgene). Many of the systems rely on custom, complementary oligonucleotides that are annealed to form a double stranded DNA and then cloned into the sgRNA expression plasmid.
  • sequences can also be generated using PCR cloning or mutagenic strategies. Selection methodologies can also be used to isolate guide RNAs from pools of guide RNAs. Co-expression of the sgRNA and the appropriate Cas enzyme from the same or separate plasmids in transfected cells results in a single or double strand break (depending on the activity of the Cas enzyme) at the desired target site.
  • the literature also contains examples indicating the importance of off-target analysis.
  • the Examples below show that levels of off-target cleavage using CRISPR/Cas9-based gene modification strategies can be comparable with the on-target rates, even when there are multiple mismatches to the guide strand in the region close to the PAM.
  • the Examples also show that RNA guide strands containing insertions or deletions in addition to base mismatches can result in cleavage and mutagenesis at genomic target sites with levels similar to that of the original guide strand. These studies provide experimental evidence that genomic sites can be cleaved when the DNA sequences contain insertions or deletions compared with the CRISPR guide strand. Accordingly, methods and systems for identifying target sites, and particularly off-target sites, of CRISPR/Cas guide strands are provided.
  • methods and systems for ranking target sites, and particularly off-target sites, of CRISPR/Cas guide strands are provided.
  • the methods and systems can be used to prepare a list of off-target sites for a guide strand based on 1, 2, 3, or more mismatches, insertions, deletions, or combinations thereof.
  • a chimeric guide RNA contains a target sequence, or guide sequence, and a tracrRNA sequence
  • guide refers to a gRNA or sgRNA sequence including, and preferably consisting of the target sequence of the gRNA that binds to a complementary genomic sequence at the target site (Jinek, et al., Science, 337:816- 821 (2012)).
  • the guide sequence is not a chimeric sequence, but contains two parts: the guide portion and the tracrRNA.
  • Alternative versions also exist in other embodiments with combinations of sequences, or replacements or modifications of portions of the tracrRNA or linking of RNA fragments, such as modifications to the lower or upper stem, nexus or hairpins, or the inclusion of additional sequences.
  • the additional sequences may permit quantitation, binding to other nucleotides, linking to functional domains, other uses, or not provide a function.
  • the guide sequence can be expressed from a plasmid, provided as RNA, or complexed with the Cas protein prior to adding to the cells.
  • the sequence can be articulated as an RNA sequence or a cDNA sequence.
  • gRNA and genomic sequences can be compared as RNA-to-DNA or DNA-DNA and have the same sequence identity.
  • the disclosed systems and methods include converting an RNA sequence to DNA, or vice versa, so that sequences are compared as DNA-to-DNA, or RNA-to-RNA.
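  • Converting between RNA and DNA alphabets before comparison amounts to exchanging U and T. A minimal sketch, with hypothetical function names and assuming only unmodified bases:

```python
def rna_to_dna(sequence: str) -> str:
    """Write an RNA sequence in the DNA alphabet (U becomes T)."""
    return sequence.upper().replace("U", "T")


def dna_to_rna(sequence: str) -> str:
    """Write a DNA sequence in the RNA alphabet (T becomes U)."""
    return sequence.upper().replace("T", "U")


def same_sequence(a: str, b: str) -> bool:
    """Compare two sequences as DNA-to-DNA, whichever alphabet each was given in."""
    return rna_to_dna(a) == rna_to_dna(b)
```

Normalizing both inputs to one alphabet means a guide entered as RNA and a genomic site stored as DNA yield the same identity as a DNA-to-DNA comparison.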
  • other nucleotides including non-natural nucleotides can be included.
  • target site generally refers to a genomic location to which a guide strand might bind.
  • the binding level may vary and may depend on context, accessibility or other factors.
  • An "on-target” site generally refers to a genomic site to which a practitioner desires binding and/or cleavage to occur, while “off-target” refers to a genomic site to which a practitioner does not desire binding and/or cleavage to occur.
  • the definition of target site or on-target site can be thought of as the intended binding or cleavage site, regardless of its level of identity, or number of mismatches, and regardless of how this site compares to other un-intended sites that may score below or higher in these indices.
  • an on-target site can be a genomic site at which genetic modification is desired, while an off-target site can be a genomic site at which genetic modification is not required, not desired, or undesirable.
  • On-target and off-target sites can have the same (e.g., identical), or different nucleotide sequences.
  • a "cleavage site” is the site where the nuclease creates a single-strand break or double-stranded DNA break. In the CRISPR systems used in some embodiments, this is within the target site, 3 nucleotides from the PAM.
  • target sequence and “target site sequence” are used interchangeably.
  • the terms generally refer to the genomic DNA sequence at the target site and can optionally include the sequence of a PAM motif.
  • the site is double-stranded genomic DNA, and therefore, the target sequence can be expressed or described by providing the sequence of either strand of DNA at the target site.
  • the target sequence can be expressed as the sequence of the strand of genomic DNA to which the guide sequence of a gRNA binds, or its complementary strand. Therefore, a target sequence can also be expressed as a sequence that is the same or similar to the gRNA sequence.
  • a site can be cleaved using more than one guide strand on one or the other DNA strand.
  • the target sequence is most typically expressed as the same or similar sequence to the guide sequence so that the guide sequence can be aligned to the sequence of genomic DNA at the target site and establish the identity between the guide sequence and DNA sequence at the site.
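  • Because a site can lie on either strand, a search written against one strand of the genome can also look for the reverse complement of the target sequence. The sketch below illustrates this with exact matching only; the function names are hypothetical and the genome is a plain string.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T and N."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def find_on_both_strands(genome: str, target: str) -> list[tuple[int, str]]:
    """Exact occurrences of `target` as (0-based start on the given strand, strand)."""
    genome = genome.upper()
    hits: list[tuple[int, str]] = []
    for strand, query in (("+", target.upper()), ("-", reverse_complement(target))):
        start = genome.find(query)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(query, start + 1)
    return sorted(hits)
```

Reporting the strand alongside the coordinate lets the identified site be rewritten in the same orientation as the guide, so that the alignment and identity can be established as described above.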
  • the systems and methods described herein for predicting off-target sites generally involve generating search criteria derived from input criteria, generating a list of target sites, and directing the list of target sites as output to the user.
  • the input criteria will generally include information regarding the guide sequence, and optionally the PAM sequence, the number of allowed mismatches, the number of allowed insertions, the number of allowed deletions, the genome to be searched, etc.
  • the output is provided in the form of a ranked-list wherein each of the target sites are assigned a numerical value, "score", that correlates with the likelihood of nuclease cleavage at that site.
  • the practitioner knows the on-target location and, although the methods and systems are designed to identify off-target locations, the output may nonetheless also include the on-target site(s).
  • the user may wish to determine if there are on- or off-target sites within different genomes. Therefore, in some embodiments, the list of target sites includes both on-target sites and off-target sites. In other embodiments, only off-target sites are provided.
  • An example of genomic search for only off-target sites is when targeting non-genomic sequences, such as mutated sites, chromosomal re-arrangements, introduced sequences (such as cDNA or other expression cassettes) or viral sequences.
  • the on-target site(s) can be subtracted or removed from the output.
  • the methods and systems rank the target sites based on the likelihood of cleavage.
  • the ranking can be based upon a scoring function for predicting nuclease activity based at least in-part on identity between the guide strand and each genomic target sequence and/or the ability of the guide sequence to hybridize to the complement thereof.
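  • One way to picture a position-dependent ranking is a penalty that grows as a mismatch moves closer to the PAM. The weights below are purely hypothetical and are not the disclosed scoring function; the sketch only assumes equal-length guide and site strings written 5' to 3' with the PAM following the 3' end, and ranks sites from lowest to highest penalty.

```python
def mismatch_penalty(guide: str, site: str) -> float:
    """Sum of hypothetical linear weights; mismatches nearer the PAM cost more."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have equal length")
    n = len(guide)
    penalty = 0.0
    for i, (g, s) in enumerate(zip(guide.upper(), site.upper())):
        if g != s:
            distance_from_pam = n - i          # 1 means adjacent to the PAM
            penalty += 1.0 - (distance_from_pam - 1) / n
    return penalty


def rank_sites(guide: str, sites: list[str]) -> list[str]:
    """Order candidate sites from lowest to highest penalty."""
    return sorted(sites, key=lambda site: mismatch_penalty(guide, site))
```

A production scoring function could combine such sequence terms with the other features mentioned in this disclosure, such as accessibility or genomic context; this sketch only shows how identity and position can be turned into a sortable number.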
  • the predictions can be based on the sequences and other known or predicted features such as accessibility, type of sequence, expression state or genomic context. In some embodiments the predictions will also include information about the cells in question, their
  • the methods and systems provide PCR primer sequences that can be used for synthesizing oligonucleotide primers for testing cleavage in vivo.
  • user input can include the genome of interest, guide strand sequence, PAM sequence, and the number of base mismatches, insertions, and deletions allowed.
  • a user chooses the genome of interest from the list, and enters the guide strand and optionally PAM sequences (Figure 25A).
  • Types of indel query include, for example, (i) the number of mismatches with no insertion or deletion (i.e., "No indels"); (ii) the number of mismatches in addition to a single-base deletion (i.e., "Del"); and (iii) the number of mismatches in addition to a single-base insertion (i.e., "Ins").
  • selections can range from mismatches without indels up to two mismatches together with a one-base insertion and/or a one-base deletion.
  • 4, 5, 6, 7, 8, 9, 10, or more mismatches, insertions, deletions, or any combination thereof can be selected.
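The input criteria described above can be sketched as a small record type. This is only an illustration: the class name, field names, default mismatch limits, and the example guide sequence are assumptions made for the sketch and are not taken from the disclosed system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchCriteria:
    """Hypothetical container for the user input of one search."""
    guide: str               # guide strand sequence
    pam: str = "NRG"         # PAM suffix; may contain ambiguity codes
    genome: str = "human"    # label of the genome to search
    mm_no_indel: int = 3     # mismatches allowed with no indels (assumed default)
    mm_with_del: int = 2     # mismatches allowed with a 1-base deletion (assumed default)
    mm_with_ins: int = 2     # mismatches allowed with a 1-base insertion (assumed default)
    allow_del: bool = True
    allow_ins: bool = True

    def query_types(self):
        """Return (label, max_mismatches) for each enabled type of indel query."""
        types = [("No indels", self.mm_no_indel)]
        if self.allow_del:
            types.append(("Del", self.mm_with_del))
        if self.allow_ins:
            types.append(("Ins", self.mm_with_ins))
        return types


# Arbitrary 20-nt guide used only for illustration.
criteria = SearchCriteria(guide="GACGTTACCGGATCAATGCA", allow_ins=False)
print(criteria.query_types())
```

Keeping a separate mismatch limit per query type mirrors the three types of indel query ("No indels", "Del", "Ins") listed above.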
  • PAM variants such as NRG or other PAM sequences can be entered in the suffix box.
  • the spacer (Ns) and required nucleotides are entered into the suffix box, such as "NNNNGATT", "NNAGAA", or "NAAAAC", and the output includes genomic sites with any nucleotide at the N positions.
  • a range of other sequences may constitute naturally occurring or modified PAM sequences. If primers are desired, primer design parameter settings and parameter templates can also be entered.
  • parameters may be entered that correspond to cell type, culture conditions, animal age or growth, developmental state, genomic context, chromosomal or methylation state, DNA mutation repair, pathway choice and other features affecting cleavage and/or mutation rates.
  • the disclosed methods for identifying off-target cleavage locations of a CRISPR/Cas nuclease are typically computer-implemented methods that include scanning or searching the genomic sequence data for the target cleavage locations of the nuclease based on parameters selected from the group consisting of guide strand sequence, organismal genome, number of mismatches, insertions, and/or deletions, to return target cleavage location sequences and/or locations in the genome.
  • the target sites identified by the search are assigned a score that is used to rank the target cleavage locations based on the likelihood of target cleavage.
  • the primary function is ranking sequences according to a range of criteria.
  • a series of search entries are constructed according to the user-specified guide strand and search criteria (Figure 25B).
  • the search entries include all insertions and deletions at each possible location (Figure 25C, Figures 26E-26F).
  • RNA bulges
  • DNA bulges
  • searching for a wide range of insertions and deletions will likely return a very large number of sites. Therefore, in a preferred embodiment, only single-base insertions and deletions in the DNA sequence relative to the guide strand are searched (Figure 25A). In other embodiments, larger numbers of nucleotide insertions or deletions, or multiple insertions and/or deletions, can be accommodated, though this is likely to result in a longer list of output sites.
  • the search algorithm can allow some ambiguities (such as N for any nucleotide). Ambiguities included in the search string are not counted toward the user-specified mismatch limits.
  • ranges of ambiguities can be employed, such as the codes for either of two nucleotides (R, Y, S, W, K or M) or three nucleotides (B, D, H, V), in addition to N.
  • the use of ambiguities allows the inclusion of the matching genomic base with the output sequences.
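As a minimal sketch of how ambiguity codes might be handled, the following compares a query string containing IUPAC codes against a genomic site of equal length. The code table is the standard IUPAC nucleotide alphabet; the function names and example sequences are illustrative assumptions, not part of the disclosed search program.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one permits.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def count_mismatches(query, site):
    """Count positions where the genomic base is not permitted by the query code.

    Ambiguous positions in the query (e.g. N) that cover the genomic base are
    not counted as mismatches. Returns None when the lengths differ.
    """
    if len(query) != len(site):
        return None
    return sum(1 for q, s in zip(query, site) if s not in IUPAC[q])


def matches(query, site, max_mismatches):
    """True if the site matches the query within the allowed mismatch limit."""
    n = count_mismatches(query, site)
    return n is not None and n <= max_mismatches


# Short toy query: wildcard first base, three fixed bases, then an "NRG" suffix.
print(count_mismatches("NACGNRG", "TACGAAG"))  # N and R both cover the genomic bases
print(count_mismatches("NACGNRG", "TACGACG"))  # C is not covered by R
```

Because the comparison returns the genomic site unchanged, the actual base found at each N position can be reported in the output alongside the match.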
  • One possibility is to include an "N" in positions that can have substitutions, such as the first base in a guide strand that is often a G primarily to aid in transcription, but does not need to match the
  • the search algorithm is based on sequence homology and identity, with the option to allow insertions or deletions in a search method, a ranking method, or a combination thereof.
  • the off-target site lists can be constructed using, for example, existing search algorithms such as FASTA or BLAST.
  • these types of existing or freshly generated lists can be ranked by the methods described here.
  • the FASTA algorithm is described in W.R. Pearson, and D.J. Lipman (1988) Proc. Natl. Acad. Sci., 85:2444-2448 and D.J.
  • the BLAST algorithm is described in S. Altschul, et al. (1990) J. Mol. Biology, 215:403-410. While FASTA, BLAST, megaBLAST, BLAST Bowtie, and other later improvements can be used to construct a list of target sites, these are not the preferred approaches. In some embodiments, other search methods are used, and the results are then refined using a ranking algorithm that can weigh the number and positions of mismatches, insertions, deletions and their combinations. The output from non-exhaustive search tools cannot be assumed to contain all possible off-target sites.
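A ranking step that weighs the number and positions of mismatches and indels could look roughly like the sketch below. The penalty formula, the weights, and the convention that mismatches nearer the PAM cost more are placeholder assumptions chosen for illustration; they are not the scoring function of the disclosed methods.

```python
def penalty(mismatch_positions, n_insertions=0, n_deletions=0,
            guide_len=20, mismatch_weight=1.0, indel_weight=3.0):
    """Hypothetical penalty; a lower value ranks the site higher in the list.

    mismatch_positions are 1-based and counted from the PAM-distal end, so a
    larger position is assumed to lie closer to the PAM and to cost more.
    """
    score = 0.0
    for pos in mismatch_positions:
        score += mismatch_weight * pos / guide_len
    score += indel_weight * (n_insertions + n_deletions)
    return score


def rank_sites(sites):
    """Sort candidate sites (dicts) from lowest to highest penalty."""
    return sorted(
        sites,
        key=lambda s: penalty(s["mismatches"], s.get("ins", 0), s.get("del", 0)),
    )


# Toy candidate list; the ids and contents are made up for the example.
candidates = [
    {"id": "b", "mismatches": [20]},
    {"id": "c", "mismatches": [2], "ins": 1},
    {"id": "a", "mismatches": []},
    {"id": "d", "mismatches": [1, 2]},
]
print([s["id"] for s in rank_sites(candidates)])
```

The same ranking function can be applied to a list produced by any search tool, which is why a separately generated list can be refined after the fact.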
  • on-target and off-target sites of the CRISPR guide strands are determined by comparing the query sequence both with and without insertions, deletions, and/or mismatches at one or multiple positions using the FetchGWI search program (Iseli, et al., PLoS ONE, 2(6): e579 (2007)).
  • FetchGWI operates on indexed genome sequences that are precompiled and stored (Figures 26A-26G). It can identify genomic locations with sequences that match any of the series of search entries.
  • FetchGWI saves run time by searching indexed files that represent the genome sequences, rather than the sequences themselves. There is one index entry for each nucleotide in the genome, which allows a rapid and exhaustive search. In other embodiments, other indexing strategies can be used. Exhaustive, complete searches are a key advantage over BLAST and other programs that scan non-overlapping words and may miss potential off-target sites.
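The difference between indexing every nucleotide position and scanning non-overlapping words can be illustrated with a toy example. This is a deliberately simplified illustration of the point made above, not a model of how BLAST or FetchGWI is implemented; the sequences and function names are invented for the example.

```python
def overlapping_tags(seq, k):
    """One (tag, offset) entry for every start position in seq."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]


def nonoverlapping_words(seq, k):
    """Entries only at offsets 0, k, 2k, ... (words that do not overlap)."""
    return [(seq[i:i + k], i) for i in range(0, len(seq) - k + 1, k)]


genome = "TTGACGTACCAGT"   # toy sequence
query = "ACGTACC"          # occurs at offset 3, which is not a multiple of 7
k = len(query)

hits_all = [off for tag, off in overlapping_tags(genome, k) if tag == query]
hits_words = [off for tag, off in nonoverlapping_words(genome, k) if tag == query]
print(hits_all)    # the per-position index finds the site
print(hits_words)  # the non-overlapping scan misses it
```

A site that straddles a word boundary is invisible to the non-overlapping scan, whereas the per-position index contains an entry that starts exactly at the site.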
  • the guide strand sequence and/or variants thereof and/or other query sequences can be compared to an organismal genome, or any loaded sequence files.
  • the searched genome is a human, mouse, Caenorhabditis elegans, or rhesus macaque genome.
  • any genome, modified genome or sequence file can be searched.
  • the searchable genome is prepared using the genwin program (Iseli, et al., PLoS ONE, 2(6): e579 (2007)) to transform the DNA sequence from FASTA formatted files into unsorted index entries which have all possible 25 bases-long tags in the DNA sequence.
  • After that, the sortGWI program is used to sort the index entries and store the result as a binary index file.
  • sortGWI subdivides the whole index file into parts, each representing entries having identical first 12 nucleotides.
  • a secondary index, recording the position in the main index file where each part starts, is added to the end of the index file to enable faster search and reduce file size.
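The index layout described above (sorted tags, partitioned by a fixed-length prefix, with a secondary index pointing at the start of each part) can be sketched in memory as follows. The tag length of 25 and prefix length of 12 come from the text; everything else, including the function names and the use of Python lists instead of a binary file, is an assumption made for illustration.

```python
from bisect import bisect_left

TAG_LEN = 25      # tag length stated in the text
PREFIX_LEN = 12   # prefix length that defines each part of the index


def build_index(seq, tag_len=TAG_LEN, prefix_len=PREFIX_LEN):
    """Return (sorted entries, secondary index) for every tag in seq."""
    entries = sorted((seq[i:i + tag_len], i) for i in range(len(seq) - tag_len + 1))
    secondary = {}
    for row, (tag, _) in enumerate(entries):
        # Record the first row of each part (all tags sharing one prefix).
        secondary.setdefault(tag[:prefix_len], row)
    return entries, secondary


def lookup(entries, secondary, tag, prefix_len=PREFIX_LEN):
    """Return all offsets where tag occurs, using the secondary index to
    jump to the right part and binary search within the sorted entries."""
    start = secondary.get(tag[:prefix_len])
    if start is None:
        return []
    row = bisect_left(entries, (tag, -1), lo=start)
    hits = []
    while row < len(entries) and entries[row][0] == tag:
        hits.append(entries[row][1])
        row += 1
    return hits


# Small parameters so the toy example is easy to follow.
entries, secondary = build_index("ACGTACGTTACGTA", tag_len=4, prefix_len=2)
print(lookup(entries, secondary, "ACGT", prefix_len=2))
```

The secondary index narrows the binary search to the part of the file that shares the query prefix, which is the role the text assigns to it.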
  • the index files can be stored in a server.
  • the sequence tags can be used to generate a series of additional tags that contain indels if the insertion or deletion boxes are checked, or if defaults are used. Identical tags are removed if they are duplications for strings containing consecutive identical bases, or in other embodiments, these can be removed at other steps in the processing.
  • the resulting tags are all searched against the user-selected genome.
  • the working Examples include exemplary searches. For example, if guide strand R-01 is entered and one (1) insertion and one (1) deletion are selected, the tags illustrated in Figures 26E and 26F are generated and used to search a genome.
  • the FetchGWI program can be used (Iseli, et al., PLoS ONE, 2(6): e579 (2007)). For example, if the user specifies a search with one or more mismatches, all possible sequence tags can be generated by replacing the specified number of nucleotides with all other possibilities. In the preferred embodiment, FetchGWI can search the genome allowing the user-specified number of mismatches. FetchGWI can then sort all the query tags and search for matches in the index file using binary search.
  • FetchGWI can report the search results by appending the actual sequence tag found, along with the accession number and position offset within the sequence for each matched query tag.
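Generating every tag within a given number of mismatches, as described above, can be sketched as follows. The function names are illustrative, and the sketch assumes the guide contains only A, C, G and T. The closed-form count is included to show how quickly the number of tags grows.

```python
from itertools import combinations, product
from math import comb


def substitution_variants(guide, max_mismatches):
    """All sequences that differ from guide at no more than max_mismatches positions."""
    variants = {guide}
    for k in range(1, max_mismatches + 1):
        for positions in combinations(range(len(guide)), k):
            # At each chosen position, try the three bases that differ from the guide.
            choices = [[b for b in "ACGT" if b != guide[p]] for p in positions]
            for bases in product(*choices):
                v = list(guide)
                for p, b in zip(positions, bases):
                    v[p] = b
                variants.add("".join(v))
    return variants


def expected_count(length, max_mismatches):
    """Number of variants: sum over k of C(length, k) * 3**k."""
    return sum(comb(length, k) * 3 ** k for k in range(max_mismatches + 1))


print(len(substitution_variants("ACGT", 2)), expected_count(4, 2))
print(expected_count(20, 3))
```

For a 20-nt guide with up to three mismatches the count is 32,551 tags, which illustrates why letting the search program tolerate mismatches directly can be preferable to enumerating every tag.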
  • Programs such as the TagScan algorithm can be used to minimize run times while still performing exhaustive genome searches. In other embodiments, other programs are used that can allow greater numbers of mismatches to the genomic sequences.
  • a series of guide sequence variants are constructed based on a user entered guide sequence and used to query the selected genome for potential target sites.
  • the series of query guide sequences is typically constructed from user-entered parameters, including the number of mismatches (e.g., 0, 1, 2, 3, etc.), insertions (e.g., 0, 1, 2, etc.), and/or deletions (e.g., 0, 1, 2, etc.) that are allowed at the target site relative to the guide sequence. In some embodiments, multiple insertions and/or deletions may be allowed.
  • duplicative query sequences are subtracted or culled from the series before the search such that each sequence in the series is unique and only searched once.
  • the query guide sequences provide guide strand variant sequences having no indels and 0, 1, 2, or 3 mismatches; 1-base deletion, no insertions, and 0, 1, or 2 mismatches; 1-base insertion, no deletions, and 0, 1, or 2 mismatches; 1-base deletion, 1-base insertion, and 0, 1, or 2 mismatches; or any combination thereof.
  • each nucleotide can be inserted generating different guide strand variations.
  • because there are four natural nucleotides, in most embodiments there will be four variations for each position, with A, C, G or T introduced at that position.
  • an "N" is inserted that will match any of these. If insertions of greater than one nt are allowed, then the single inserted N can also be replaced with two or more Ns, which can be inserted into each position to generate variations with one or more nt insertions.
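The insertion variants described above can be sketched as follows. Both forms are shown: one wildcard "N" per position, and the explicit A/C/G/T variants. Whether insertions before the first base and after the last base are included is an assumption of this sketch, as are the function names.

```python
def insertion_variants(guide, inserted="N", n_bases=1):
    """Insert n_bases copies of `inserted` at every gap, including both ends."""
    ins = inserted * n_bases
    return [guide[:i] + ins + guide[i:] for i in range(len(guide) + 1)]


def explicit_insertion_variants(guide):
    """Insert each of A, C, G, T at every gap and drop duplicate strings."""
    return sorted({guide[:i] + b + guide[i:]
                   for i in range(len(guide) + 1) for b in "ACGT"})


print(insertion_variants("ACG"))
print(insertion_variants("ACG", n_bases=2))
print(len(explicit_insertion_variants("AC")))
```

Using a single "N" gives one query per position instead of four, provided the search program treats N as matching any nucleotide.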
  • each nucleotide can be deleted resulting in a guide strand that is one nt shorter.
  • deleting any one would result in the same variant. This is consistent if either is deleted when two nt are the same, or deleting any of a longer repeated string of nts. If deletions of greater than one nt are allowed, then the single nt deleted can also be replaced with two or more deleted nt that can be deleted at each position along the guide strand.
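Deletion variants, including the removal of duplicates that arise from runs of identical bases, can be sketched as follows. The function name is illustrative, and multi-base deletions are assumed here to remove consecutive bases.

```python
def deletion_variants(guide, n_deleted=1):
    """Delete n_deleted consecutive bases at each position.

    Duplicates (e.g. from deleting either base of "AA") are dropped while the
    order of first appearance is preserved.
    """
    seen = {}
    for i in range(len(guide) - n_deleted + 1):
        seen.setdefault(guide[:i] + guide[i + n_deleted:], i)
    return list(seen)


# Deleting either A of the "AA" run gives the same variant, so only four
# unique variants result from a five-base sequence.
print(deletion_variants("GAATC"))
print(deletion_variants("GAATC", n_deleted=2))
```

Removing duplicates before the search means each unique query is searched only once.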
  • a series of query guide sequences are generated that are variations of the original guide sequence.
  • each nucleotide can be inserted generating different guide strand variations.
  • because there are four natural nucleotides, in most embodiments there will be four variations for each position, with A, C, G or T introduced at that position.
  • an "N" is inserted that will match any of these as with insertions alone.
  • the resulting string of queries is then subjected to individual deletions as in (2) above resulting in variations that have inserted and deleted bases. Deleting an inserted base would result in the original sequence.
  • Allowing more than one base to be inserted and/or deleted would introduce even more variations.
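Combining one insertion with one deletion can be sketched by inserting a wildcard at each gap and then deleting each base other than the wildcard itself. The function name and the inclusion of end positions are assumptions of this sketch.

```python
def ins_del_variants(guide):
    """Variants carrying one inserted N and one deleted base."""
    variants = set()
    for i in range(len(guide) + 1):
        with_ins = guide[:i] + "N" + guide[i:]
        for j in range(len(with_ins)):
            if j == i:
                continue  # deleting the inserted N would restore the original guide
            variants.add(with_ins[:j] + with_ins[j + 1:])
    return variants


result = ins_del_variants("ACG")
print(sorted(result))
```

Each variant has the same length as the original guide. Where the deleted base is adjacent to the inserted N, the variant is equivalent to a wildcard at that position, so the set collapses the duplicates that would otherwise arise.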
  • each nucleotide can be inserted generating different guide strand variations.
  • because there are four natural nucleotides, in most embodiments there will be four variations for each position, with A, C, G or T introduced at that position.
  • an "N" is inserted that will match any of these.
  • other embodiments can allow the introduction of a second insertion at each point in the guide sequence.
  • each nucleotide can be deleted resulting in a guide strand that is one nt shorter.
  • deleting any one would result in the same variant. This is consistent if either is deleted when two nt are the same, or deleting any of a longer repeated string of nts.
  • embodiments can allow the introduction of a second insertion at each point in the guide sequence.
  • a series of query guide sequences are generated that are variations of the original guide sequence.
  • each nucleotide can be inserted generating different guide strand variations.
  • because there are four natural nucleotides, in most embodiments there will be four variations for each position, with A, C, G or T introduced at that position.
  • an "N" is inserted that will match any of these as with insertions alone.
  • the resulting string of queries is then subjected to individual deletions as in (5) above resulting in variations that have inserted and deleted bases. Deleting an inserted base would result in the original sequence, though deleting one of the inserted bases may produce a variation already included in the output.
  • these query sequences, or tags are used to search the specified genome(s). In one embodiment, this is using FetchGWI to compare each variant to sequences throughout the genome and output the sites that match the user-specified guideline. In one embodiment, that is the number of mismatches for each condition: no indels, with insertions or with deletions. In other embodiments, the output contains other user-specified or default criteria to limit the sequences output.
  • Examples of this type of screening include outputting only sites that appear to be in open chromatin, or only sites with particular annotations, such as in exons, regulatory sequences or defined oncogenic regions.
  • mismatches can similarly be added to the query sequences prior to searching,
  • the series of query guide sequences includes the guide sequence and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides,
  • each of the query guide sequences in the series has zero or one mismatches, zero insertions, and zero deletions relative to the guide sequence
  • the series of query guide sequences includes the guide sequence and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides, and
  • each of the query guide sequences in the series has zero, one, or two mismatches, zero insertions, and zero deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides,
  • each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide, and such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero insertions, and zero deletions relative to the guide sequence;
  • (13) if zero mismatches, one insertion, and zero deletions is selected:
  • the series of query guide sequences includes the guide sequence and sequence variants thereof wherein each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence,
  • each of the query guide sequences in the series has zero mismatches, one insertion, and zero deletions relative to the guide sequence
  • the series of query guide sequences includes the guide sequence and sequence variants thereof wherein each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence, and
  • each of the query guide sequences in the series has zero mismatches, two insertions, and zero deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence and sequence variants thereof wherein one nucleotide is individually deleted from each nucleotide position of the guide sequence,
  • each of the query guide sequences in the series has zero mismatches, zero insertions, and one deletion relative to the guide sequence.
  • the series of query guide sequences includes the guide sequence and sequence variants thereof wherein one nucleotide is individually deleted from each nucleotide position of the guide sequence, and
  • each of the query guide sequences in the series has zero mismatches, zero insertions, and two deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; and guide sequence variants having the combination thereof,
  • each of the query guide sequences in the series has zero or one mismatches, zero or one insertions, and zero deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, or two mismatches, zero or one insertions, and zero deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, two, or three mismatches, zero or one insertions, and zero deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; and guide sequence variants having the combination thereof,
  • each of the query guide sequences in the series has zero or one mismatches, zero, one, or two insertions, and zero deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, or two mismatches, zero, one, or two insertions, and zero deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, two, or three mismatches, zero, one, or two insertions, and zero deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having the combination thereof,
  • each of the query guide sequences in the series has zero or one mismatches, zero insertions, and zero or one deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, or two mismatches, zero insertions, and zero or one deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, two, or three mismatches, zero insertions, and zero or one deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having the combination thereof,
  • each of the query guide sequences in the series has zero or one mismatches, zero insertions, and zero, one, or two deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, or two mismatches, zero insertions, and zero, one, or two deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, two, or three mismatches, zero insertions, and zero, one, or two deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having the combination thereof, such that each of the query guide sequences in the series has zero or one mismatches, zero or one insertions, and zero or one deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof, such that each of the query guide sequences in the series has zero, one, or two mismatches, zero or one insertions, and zero or one deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, two, or three mismatches, zero or one insertions, and zero or one deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having the combination thereof,
  • each of the query guide sequences in the series has zero or one mismatches, zero, one, or two insertions, and zero or one deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, or two mismatches, zero, one, or two insertions, and zero or one deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, two, or three mismatches, zero, one, or two insertions, and zero or one deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having the combination thereof,
  • each of the query guide sequences in the series has zero or one mismatches, zero or one insertions, and zero, one, or two deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, or two mismatches, zero or one insertions, and zero, one, or two deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof, such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero or one insertions, and zero, one, or two deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is individually inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having the combination thereof,
  • each of the query guide sequences in the series has zero or one mismatches, zero, one, or two insertions, and zero, one, or two deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is individually inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, or two mismatches, zero, one, or two insertions, and zero, one, or two deletions relative to the guide sequence;
  • the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is individually inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
  • each of the query guide sequences in the series has zero, one, two, or three mismatches, zero, one, or two insertions, and zero, one, or two deletions relative to the guide sequence.
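The enumeration of query guide sequences described in the preceding items can be sketched in Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names (`substitutions`, `insertions`, `deletions`, `query_series`) and the order in which edits are combined (deletions, then insertions, then substitutions) are hypothetical choices made for the example. Sets are used so that a variant reachable in more than one way, such as by deleting either of two repeated bases, is stored only once.

```python
from itertools import combinations, product

BASES = "ACGT"  # canonical nucleotides


def substitutions(seq, k):
    """All sequences in which exactly k positions carry an alternative base."""
    out = set()
    for positions in combinations(range(len(seq)), k):
        alternatives = [[b for b in BASES if b != seq[p]] for p in positions]
        for choice in product(*alternatives):
            s = list(seq)
            for p, b in zip(positions, choice):
                s[p] = b
            out.add("".join(s))
    return out


def insertions(seq):
    """All sequences with one canonical base inserted at any position."""
    return {seq[:i] + b + seq[i:] for i in range(len(seq) + 1) for b in BASES}


def deletions(seq):
    """All sequences with one base deleted."""
    return {seq[:i] + seq[i + 1:] for i in range(len(seq))}


def apply_repeatedly(seqs, edit, times):
    """Sequences reachable by applying `edit` between 0 and `times` times."""
    reached = set(seqs)
    frontier = set(seqs)
    for _ in range(times):
        frontier = {v for s in frontier for v in edit(s)}
        reached |= frontier
    return reached


def query_series(guide, max_mm=1, max_ins=1, max_del=1):
    """Hypothetical helper: guide plus variants within the given edit limits."""
    seqs = apply_repeatedly({guide}, deletions, max_del)
    seqs = apply_repeatedly(seqs, insertions, max_ins)
    series = set()
    for s in seqs:
        for k in range(max_mm + 1):
            series |= substitutions(s, k)
    return series
```

Because the series grows combinatorially with the edit limits, a sketch like this is practical only for small limits; replacing enumerated bases with symbolic nucleotides such as "N" reduces the number of queries.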
  • the guide sequence and the series of query guide sequences can be modified to include one or more PAM sequence suffixes as discussed above.
  • the guide sequence and the series of query guide sequences, with and/or without the PAM sequence suffix(es), are compared or aligned to a genome.
  • the genome is a user selected genome composed of indexed files that represent the genome sequences, rather than the sequences themselves.
  • a target site location in the genome is typically identified or reported in the output when the genomic sequence matches the user-specified criteria.
  • the number of mismatches is below the user-supplied limit, and it lacks indels in relation to the guide strand if only "no indels" is chosen.
  • the maximal number of mismatches allowed can be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15 or more depending on the guide strand length.
  • a site can be output if it does have an insertion or deletion and that type of search is chosen by the user, subject to the site having a direct match or having fewer mismatches than the user-specified limit.
  • the maximal number of mismatches allowed can be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15 or more depending on the guide strand length.
  • the user can also specify one, two, three or more PAM sequences individually or using consensus or ambiguous sequences.
  • the genomic sequence may have at least 60, 65, 70, 80, 85, 90, 92, 95, 96, 97, 98, 99 or 100 percent identity to the guide strand.
  • genomic sites most similar to the guide strand may correspond to lower levels of identity, such as at least 60, 70, 80, 85, 90, 92, 95, 96, 97, 98, 99 or 100 percent identity to the guide strand. It may be important to query sequences throughout this range, as tissue culture experiments have revealed that guide strands cleave sites with identities in this range.
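The relationship between a percent-identity threshold and a mismatch limit can be illustrated with a short Python sketch. It assumes an ungapped, equal-length comparison (no indels); the function names are hypothetical and are not taken from the disclosure.

```python
def percent_identity(guide, site):
    """Percent of positions that match in an ungapped, equal-length comparison."""
    if len(guide) != len(site):
        raise ValueError("ungapped comparison requires equal lengths")
    matches = sum(g == s for g, s in zip(guide, site))
    return 100.0 * matches / len(guide)


def max_mismatches_for_identity(guide_len, min_identity):
    """Largest mismatch count that still meets an integer percent threshold."""
    return guide_len * (100 - min_identity) // 100
```

Under these assumptions, a 20-nucleotide guide searched at 80 percent identity tolerates up to 4 mismatches, and at 60 percent identity up to 8.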
  • the level of matching is further or solely weighed based on sequence-dependent scoring, such that modified counts of the number of mismatches or indels or a modified percentage is determined by the sequence of the guide, the complementary genomic sequence or both. In some embodiments this may be weighed as the change in nucleotide affinity, the ability to tolerate mismatches or indels, or based on other modeling or data.
  • other search programs are used to scan the genomes using the range of guide strand variants generated.
  • Other index strategies can be used or whole genomic sequences can be scanned using Perl, Python, or other direct search programs or scripts.
  • the programs or scripts would identify sites that match the search criteria, though in other embodiments the sites would correspond to matching the guide strands and variants based on identity percentage.
  • the sites output can be the highest percentages, or those sites above a calculated percentage (based on probability of finding sites after comparing the guide strand, PAM lengths and/or genome size).
  • a target site location in the genome is typically identified or reported when the genomic sequence has 100% sequence identity with the guide sequence, or the highest percentage in the genome and/or one or more of the query guide sequences with or without one or more appended PAM sequences.
  • the sequence identity between the genomic sequence and the guide sequence and/or one or more of the query guide sequences with or without one or more appended PAM sequences is at least 80, 85, 90, 92, 95, 96, 97, 98, or 99 percent.
  • the target site or on-target site can be thought of as the intended cleavage site, regardless of its level of identity, or number of mismatches, if it includes indels related to the gRNA and regardless of how this site compares to other un-intended sites (i.e., off-target sites) that may score below or higher in these indices.
  • any search method using local alignment or index searches could be used, such as Eland, SOAP, SHRiMP, Bowtie, Q-pick, Maq, BWA.
  • the programs can vary in their speed and ability to locate all sites. Searches that fail to exhaustively locate all possible target sites will not output the sites they fail to test or measure. Other embodiments that fail to filter sites may produce very long lists of sites to sort through scoring and ranking. In some embodiments, the scoring and ranking methods are used to weigh every site in a genome, and only output top sites, sites scoring above a specified threshold, or a specified number of sites.
  • the guide sequences, variants thereof, query sequences, etc. can include one or more "N" and other symbolic nucleotides, such as those described herein, that refer to one or more nucleotides. It will be appreciated that in some embodiments, where variant and query sequences are constructed by adding (insertions) or substituting (mismatches) each nucleotide, or each alternative nucleotide as appropriate, relative to a parent sequence (e.g., the guide sequence(s)) at one or more positions, this can additionally or alternatively be accomplished by adding or substituting with an "N" and other symbolic nucleotides, and vice versa. Such symbols can be understood by the user and/or computational software, and thus reduce the total number of variant or query sequences that have to be prepared relative to adding or substituting each of the possible alternative nucleotides individually.
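One way software can interpret symbolic nucleotides such as "N" is to translate a query into a regular expression. The sketch below uses the standard IUPAC ambiguity codes and a direct scan of a sequence string; the helper names `to_regex` and `find_sites` are hypothetical, and a direct scan is only one of the search strategies the text contemplates.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def to_regex(query):
    """Translate a query containing ambiguity codes into a regex string."""
    return "".join(IUPAC[c] for c in query.upper())


def find_sites(query, sequence):
    """Return (start, matched_text) for every site, including overlapping ones."""
    # A lookahead is used so that overlapping matches are all reported.
    pattern = re.compile("(?=(" + to_regex(query) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]
```

A single query such as `NGG` then stands in for four explicit sequences, which is the reduction in query count the passage above describes.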
  • the disclosed methods and systems can rank the target sites.
  • the ranking can be based on a score that reflects the expectation of how likely the target site will be cleaved by a CRISPR/Cas nuclease such as Cas9, and can be weighted based on one or more factors or attributes.
  • the ranking can be based upon a scoring function for predicting nuclease activity based at least in part on sequence identity between the guide strand and the genomic target sequence and/or complementarity between the guide strand and the complementary strand of the genomic target sequence.
  • the scoring function is derived empirically or by incorporating various design rules.
  • the rank can be determined based on the sum of scores corresponding to different design considerations.
  • the ranking can include scoring systems that include the weights for mismatches, insertions, deletions and the combinations of these with particular weights.
  • the ranking can include scoring systems with additive (or subtractive) weight factors and/or multiplicative factors and/or higher-order weights.
  • rankings will include features corresponding to the cell type, culture conditions, animal age and/or growth, developmental state, genomic context, chromosomal and/or methylation state, other features affecting cleavage rate, and combinations thereof. Therefore, the method is flexible and will be able to incorporate more design variables into the function as more information about the factors affecting nuclease activity at various target sites becomes available. In addition, the method can be re-applied to an enlarged training set of data once more experimental data become available.
  • Figure 30 presents a flow chart of an exemplary target site prediction method (700) that generates search parameters (710) based upon an input query, constructs a list of on- and off-target sites (720) based upon the search parameters, and ranks (730) the target sites in the list before outputting the results.
  • the score can also include consideration of the number and location of base mismatches, insertions, and/or deletions when ranking the more likely target sites. Other considerations include, but are not limited to, the distance between mismatch(es) and the PAM. The Examples below show that mismatches farther from the PAM are more likely to result in off-target cleavage. In some or all sequences, there are positions that may vary from this general trend.
  • Bioinformatics based ranking of CRISPR/Cas off-target sites may be hindered by the effects of genomic context and DNA modifications. Identical genomic sites and duplicated sites may have dramatic differences in off-target activity.
  • the data presented in the Examples below shows that the indel rate at off-target site R-01 OT2 was 44%, though other loci with the same complementary sequence have much less, or no activity, possibly due to nuclease blocking or any of the other features described above.
  • the accessibility of the genomic DNA may influence nuclease activity at sites of similar sequence. Accordingly, in some embodiments, the score includes consideration of factors including chromatin condensation and/or DNA availability at the genomic location of the on- and off-target sites, alone or in combination with other factors in the search algorithm.
  • the results are sorted for unique sites with the lowest mismatch and indel score to locate the most likely target sites.
  • a low score correlates with a high likelihood of nuclease cleavage at the target site.
  • one or more on-target sites are reported, generally first in the list, having a score of "0" and off-target sites are ranked in descending order of likelihood of cleavage based on ascending scores of greater than 0.
  • the Examples below show an exemplary scoring paradigm wherein a binding site of a NGG PAM guide strand is typically ranked ahead of a binding site for the guide strand with a NAG PAM (by non-limiting example, +0.3 points can be added to the default scoring).
  • a high score correlates with a high likelihood of nuclease cleavage at the target site.
  • Other scoring schemes can be used in other embodiments, such as having 100 equal a perfect match or the top scoring site and scoring lower the less probable sites in accordance to mismatches, insertions and deletions, their combinations and positions.
  • the mismatches, insertions, and/or deletions result in the addition to the score corresponding to their location in the guide strand, here in nucleotides from the PAM.
  • the weights for each mismatch, insertion, or deletion are added to make the score.
  • the method adds 0.1; for positions 9-12, 0.5; for positions 7 and 8, 1.0; for position 6, 1.4; for position 5, 1.9; for position 4, 2.0; for positions 1-3, 4; and for mismatches in the PAM, 10.
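The position-dependent weights listed above can be expressed as a small lookup, sketched here in Python. Positions are counted in nucleotides from the PAM. The condition attached to the 0.1 weight is not stated in the text as reproduced here; the sketch assumes it applies to positions more than 12 nucleotides from the PAM, and that assumption is marked in the code. The function names are hypothetical.

```python
PAM_MISMATCH_WEIGHT = 10.0


def mismatch_weight(pos):
    """Weight for a mismatch `pos` nucleotides from the PAM (1 = adjacent)."""
    if pos < 1:
        raise ValueError("position must be 1 or greater")
    if pos <= 3:
        return 4.0
    if pos == 4:
        return 2.0
    if pos == 5:
        return 1.9
    if pos == 6:
        return 1.4
    if pos <= 8:
        return 1.0
    if pos <= 12:
        return 0.5
    # ASSUMPTION: the 0.1 weight applies to positions beyond 12.
    return 0.1


def mismatch_score(positions, pam_mismatches=0):
    """Additive score; lower values indicate more likely cleavage."""
    total = sum(mismatch_weight(p) for p in positions)
    return total + PAM_MISMATCH_WEIGHT * pam_mismatches
```

A perfect match scores 0 under this additive scheme, and a single mismatch adjacent to the PAM outweighs several PAM-distal mismatches.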
  • the weight scores are multiplied or they can be added/subtracted while other weights are multiplied to include score for individual or multiple mismatches or indels or multiple sets of mismatches or indels.
  • sequence-specific weights are used in addition to position-specific weights, and these weights can reflect the guide sequence, the complementary sequence, or both.
  • mismatches at G-C base pairing may be weighed differently than mismatches replacing A-T base pairs.
  • the resulting mismatches may be weighed, such that G-A, G-T, C-A, or C-T can be scored differently depending on the orientation, the surrounding bases or other features.
  • other sequence-specific features are weighed such as the binding affinity, sequence patterns, GC or AT content, di-nucleotide pair usage or RNA secondary or tertiary structures or capacity to form such structures.
  • Each of these embodiments may be used with each application, such that one scoring system may be applied to look for on- and off-target binding, on- and off-target binding when linked to effector domains, nuclease or nickase binding, nuclease or nickase cleavage, or other binding or functional effects.
  • Table 22 illustrates an example of two scoring paradigms that can be used to analyze and rank target sites based on the location/position of the mismatch or indel, and its type (e.g., mismatch, deletion, or insertion).
  • a "penalty" or "fine" of 0.5 is assessed for deletions, 0.6 for insertions, 0.3 for NAG PAM, and 20 for less preferred PAMs (anything outside NRG for S. pyogenes Cas9).
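The flat penalties stated above can be sketched as follows. The sketch assumes a three-nucleotide S. pyogenes Cas9 PAM and treats the penalties as simple additive terms; whether indel penalties are combined with a position-dependent weight is left open here. Function names are hypothetical.

```python
def pam_penalty(pam):
    """Penalty for a 3-nt PAM: 0 for NGG, 0.3 for NAG, 20 for anything else."""
    pam = pam.upper()
    if len(pam) != 3:
        raise ValueError("expected a 3-nucleotide PAM")
    if pam[1:] == "GG":
        return 0.0
    if pam[1:] == "AG":
        return 0.3
    return 20.0  # outside NRG


def indel_penalty(n_deletions=0, n_insertions=0):
    """0.5 per deletion and 0.6 per insertion."""
    return 0.5 * n_deletions + 0.6 * n_insertions
```

With these values, a site with an NAG PAM ranks just behind the otherwise identical site with an NGG PAM, while a site with a PAM outside NRG falls far down the list.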
  • the weights may be different, in some, or all positions.
  • weight scores are not decreasing as their distance varies from the PAM, but may be based on off-target data, biochemical or cellular testing, or other data or modeling.
  • the total scoring is combinations of additive and/or multiplicative weight scores and may include factors weighing combinations of features, such as pairs of mismatches, or mismatches and indels.
  • the weights may include sequence-specific weights including combinations of features, such as pairs of mismatches, or mismatches and indels. In such an embodiment changing a given nucleotide to any of the others may result in different weight scores, depending on that sequence change and the sequence of the remainder of the guide and/or complementary sequence. There may be a number of concurrent embodiments based on the particular applications, or user-specified features or requirements.
  • Table 22: Exemplary Scoring Paradigm
  • Figure 34 is a curve illustrating the score (x-axis) as a function of the location/position of the mismatch or indel relative to the PAM (y-axis). Mismatches in the PAM are not plotted.
  • This graph displays one embodiment of the relationship between weight scores for the position of indels or mismatches.
  • Lower scores under this scoring paradigm are believed to correlate with increased likelihood of nuclease activity at the target site with a mismatch or indel at this site.
  • weight scores or "fines" are added for multiple mismatches or indels according to these individual weights. Accordingly, in some embodiments under this paradigm, scores would be reported in ascending order with the target site believed to have the highest nuclease activity appearing first and others following in descending order.
  • Output typically includes some or all genomic sequences that match the user-supplied search criteria in comparison with the entered guide strand.
  • the output method can be based on number of mismatches, indels, or as percentages.
  • the output list of target sites allows a user to compare the number and scores of target sites for the input guide sequence.
  • the output can include returning polymerase chain reaction primer sequences for amplification of the ranked cleavage site locations, returning a full nucleic acid sequence of an amplicon for detecting induced mutations; and designating each target cleavage location as being in an exon, intron, promoter, or regulatory or intergenic region.
  • the output can return hyperlinks to internet resources on the genomic region of the cleavage locations.
  • the output includes a ranked list of perfectly matched (on-target site and possibly other sites) and partially matched (potential off-target) sites in the genome, their ranking score, optionally along with reference sequences and primer designs that can be used for sequencing and/or mutation detection assays.
  • each line of the output file describes one genomic locus matching the search criteria. A locus may appear on multiple lines if it can be modeled and found in multiple ways.
  • the output shows the genomic target site sequence ("hit"), preferably aligned to the query sequence (e.g., guide sequence) to highlight matches, mismatches, indels, etc.
  • nucleotides that are not a direct match, including mismatches, insertions, and deletions are colored or shaded differently or otherwise distinguished from matches.
  • Ambiguities in the query sequence such as the "N" in the PAM sequence NGG, are indicated differently or are similarly shown, though they do not count as mismatches.
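A plain-text analogue of the highlighting described above is a marker line printed between the query and the hit. The sketch below handles only ungapped alignments of equal length and treats "N" in the query as an ambiguity that is flagged but not counted as a mismatch. The marker characters and function names are arbitrary choices for illustration.

```python
def mark_alignment(query, hit):
    """Marker string: '|' match, 'x' mismatch, '*' ambiguity (N) in the query."""
    if len(query) != len(hit):
        raise ValueError("this sketch handles ungapped alignments only")
    marks = []
    for q, h in zip(query.upper(), hit.upper()):
        if q == "N":
            marks.append("*")  # ambiguity: shown, but not a mismatch
        elif q == h:
            marks.append("|")
        else:
            marks.append("x")
    return "".join(marks)


def count_mismatches(query, hit):
    """Number of mismatches, ignoring ambiguous query positions."""
    return mark_alignment(query, hit).count("x")
```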
  • the output can also include the query type, including (i) no deletion or insertion (No indel), (ii) deletions (Del), or (iii) insertions (Ins), with or without mismatches.
  • This portion of the output can indicate if there are insertions or deletions, and specify the indel positions as the number of nucleotides away from the PAM.
  • the output can also include the number of mismatched bases between the guide sequence and target sequences. As illustrated in more detail in the Examples below, when two repeated bases appear in the guide strand, a deletion of either one of them in the target sequence gives the same query sequence, so the ambiguity can be noted in the output.
  • the output can also indicate if the PAM in the hit ends in G, as NGG is the Cas9 PAM with the highest activity, followed by NAG. This portion of the output helps in ruling out genomic sites with unlikely PAMs.
  • Other information that can be provided in the output includes, but is not limited to, the chromosomal location of the matching sequence, its strand, and the chromosomal location of the cleavage site.
  • the predicted cleavage position is based on the fact that Cas9 primarily cleaves both DNA strands three nucleotides from the PAM.
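The predicted cleavage coordinate follows directly from the PAM location and strand. The sketch below assumes 0-based plus-strand coordinates and a 3-nucleotide PAM, and returns the index `i` such that the break lies between plus-strand bases `i - 1` and `i`. The function name and coordinate convention are assumptions made for the example.

```python
def cut_index(pam_left, strand, pam_len=3, offset=3):
    """Index i such that the predicted break lies between bases i-1 and i.

    pam_left: 0-based plus-strand coordinate of the leftmost PAM base.
    strand:   '+' if the site reads 5'->3' on the plus strand, '-' otherwise.
    """
    if strand == "+":
        # Protospacer lies to the left of the PAM; cut 3 nt before the PAM.
        return pam_left - offset
    if strand == "-":
        # On the plus strand the PAM appears first; cut 3 nt after its end.
        return pam_left + pam_len + offset
    raise ValueError("strand must be '+' or '-'")
```

For a 20-nucleotide plus-strand site at coordinates 0-19 with its PAM at 20-22, the function returns 17, placing three protospacer bases (17, 18, 19) between the break and the PAM.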
  • the output can include hyperlinks directed to the chromosomal sites in one or more genomic websites or databases, for example, the UCSC genome browser. This allows determination of the gene that best matches the target sequence and if the target site is in an exon, intron, or other region. This information is helpful as mutations may be better tolerated in regions that are noncoding and nonfunctional. This information can also be included as part of the output.
  • the output is grouped by query types, including (i) genomic sites with base mismatches, but no insertions or deletions (No indels), (ii) sites with deletions (Del), and (iii) sites with insertions (Ins) between the query and potential off-target sites (e.g., Table 12).
  • sites with mismatches farther from the PAM, which are more likely to result in off-target cleavage, are typically listed first.
  • the scoring is the primary determinant of the order in the lists, though a number of tie-breaking criteria, such as lack of indels, or chromosomal location can be used.
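Ordering by score with tie-breaking criteria maps naturally onto a composite sort key. In the sketch below each site is a dictionary with hypothetical field names (`score`, `query_type`, `chrom`, `pos`); ties in score are broken first by lack of indels and then by chromosomal location, which are the two example criteria named above.

```python
def rank_sites(sites):
    """Sort ascending by score; ties go to sites without indels, then by location."""
    return sorted(
        sites,
        key=lambda s: (
            s["score"],                     # primary: lower score first
            s["query_type"] != "No indel",  # False sorts before True
            s["chrom"],
            s["pos"],
        ),
    )
```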
  • the same genomic location may satisfy two or more search criteria, such as those sites that satisfy the mismatched base limit without and with an insertion or deletion. For example, mismatches at the base farthest from the PAM and deletions of this base will give the same set of genomic locations. This can also occur when the guide strand contains consecutively repeated bases. Since genomic locations can be specified through multiple criteria, they can be indicated as duplications in the output, for example, by listing in each of the corresponding groupings to aid further evaluation and scoring. In other embodiments, duplicate sites are removed or withheld in the output.
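Both behaviors described above, flagging duplicates or withholding them, start from grouping hits by genomic locus. The sketch below keys each hit on chromosome, strand, and predicted cleavage position; that choice of key, like the field names, is an assumption made for illustration.

```python
from collections import defaultdict


def group_by_locus(hits):
    """Group hits that point at the same locus, however they were found."""
    loci = defaultdict(list)
    for h in hits:
        loci[(h["chrom"], h["strand"], h["cut"])].append(h)
    return loci


def best_per_locus(hits):
    """Keep one hit per locus: the interpretation with the lowest score."""
    return [min(group, key=lambda h: h["score"])
            for group in group_by_locus(hits).values()]
```

Groups containing more than one hit can instead be reported in full, one entry per query type, when duplicates are to be shown rather than removed.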
  • the output lists the potential off-target sites according to attributes or by adding weight matrixes to rank the most likely off-target sites.
  • the accumulation of additional experiments on CRISPR off-target activity will allow creation of a more predictive scoring system. It is believed that mutations in the PAM are least well tolerated followed by sites closest to the PAM; however, little is known about how the guide strand sequence influences these effects (Jinek, et al., Elife 2:e00471 (2013); Fu, et al., Nat Biotechnol, 31:822-826 (2013); Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013)).
  • the output is in HyperText Markup Language (HTML).
  • some or all of the output is exported into a spreadsheet, such as in Excel, text, comma-separated, or tab-separated formats.
  • the spreadsheet can facilitate further processing by the user, such as sorting by attributes or adding weight matrixes to rank the most likely off-target sites.
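Export to a comma- or tab-separated file that a spreadsheet can open can be sketched with Python's standard `csv` module. The column names below are hypothetical and simply mirror the output elements discussed in this section.

```python
import csv
import io

# Hypothetical column names for the exported table.
FIELDS = ["hit", "query_type", "mismatches", "chrom", "position", "strand", "score"]


def to_delimited(sites, delimiter="\t"):
    """Serialize site records as delimited text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter=delimiter,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(sites)
    return buf.getvalue()
```

Passing `delimiter=","` gives comma-separated output; the returned text can be written to a file for sorting or re-weighting in the spreadsheet.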
  • the primary ranking is done in the spreadsheet to allow iterative tuning or ranking based on the default or user-supplied weight factors.
  • secondary, tertiary, or further rankings are done in the spreadsheet to add newer, alternative or other weight or multiplicative scores.
  • the preferred embodiment allows the search method to greatly decrease the number of sites in the genome to a relatively low number, possibly hundreds, or to many thousands of loci to process in spreadsheets.
  • Table 10 shows an exemplary output in HTML.
  • the output includes the genomic sites matching the user-supplied criteria in comparison to a user supplied guide strand sequence with chromosomal location. Scoring of the mismatches is provided for ranking, as are PCR primers and reference sequence.
  • Other typical output elements include, but are not limited to, right and/or left primer sequences and links to test each primer pair using the UCSC in silico PCR web site, amplicon sequence, and digest size (discussed in more detail below).
  • the chromosomal location ("Chr. position") for each "hit" in Table 12 is provided as a hyperlink to genomic resources, e.g. UCSC genome browser, and to an output file as a spreadsheet for further manipulation and primer ordering.
  • links can be provided with genomic annotation, sequence viewers, in silico primer testing, and/or PubMed links.
  • each hit is appropriately aligned to the query shown in the "Result" box.
  • DNA bases corresponding to mismatches, indels, ambiguity codes, such as N, are shown in the query line to identify the matching genomic bases.
  • To the right of the "Result" box are boxes with the query type, number of mismatches, chromosomal position, score, primers, and other features.
  • a spreadsheet output allows the user to manipulate the output to evaluate the number and scores of the low-scoring sites that are predicted to be more likely off-target sites, which may provide important guidelines when evaluating and choosing guide strands and/or testing for true cleavage events using DNA samples from cells after CRISPR/Cas treatment.
  • An automated primer pair design is sometimes included to design primers appropriate for target site validation assays, matching user input criteria.
  • the primer design function can be used in combination with assays for off-target cleavage after cells or animals are treated with CRISPR guide strands and nuclease.
  • Primers are designed that fit the criteria needed for the particular assay or sequencing platform using an automated primer pair design process. This greatly simplifies the standard method for primer design that requires iterative steps of primer design and verification of the resulting fragment sizes.
  • an automated design process allows the primers to be custom designed for the downstream assays or sequencing, and to be matched for high-throughput, full-plate PCR amplification. Primers can be designed according to specified criteria or to the defaults given for particular applications (Figure 25A).
  • the primer pair design will sometimes provide for specifying the minimum distance from the edge of the amplicon to the nuclease site.
  • the recommended parameters will in some cases include a separation distance between cleavage bands that is greater than 0, 20, 40, 60, 80, 100, 120, 140, 160, 180, or 200 base pairs.
  • primer pairs are chosen such that the minimum separation between uncleaved and cleaved products is greater than 50, 75, 100, 125, 150, 175, or 200 base pairs.
  • the primers may be optimally chosen for a variety of sequencing assays, such as appropriate for each sequencing platform.
  • users can also input the number of bases the cleavage site must be from each amplicon's edge to ensure sequencing coverage depending on the different sequencing platforms.
  • SMRT: single molecule, real-time (sequencing)
  • a set of exemplary recommended parameters are: Minimum Distance Between Cleavage Bands of 0 base pairs, Minimum Separation Between Uncleaved and Cleaved Products of 125 base pairs.
  • the primer design parameters can be specified to ensure that the nuclease site is placed in an optimal position within the amplicon to yield cleavage bands that can be easily distinguished from the parental band and each other using agarose, polyacrylamide, other gels or capillary apparatus.
  • exemplary recommended parameters for use in Surveyor assays resolved on 2% agarose gels are: Minimum Distance Between Cleavage Bands— 100 bp, Minimum Separation Between Uncleaved and Cleaved Products— 150 bp.
  • the recommended parameters may be: Minimum Distance Between Cleavage Bands of 100 base pairs, Minimum Separation Between Uncleaved and Cleaved Products of 150 base pairs.
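The two gel-resolution parameters can be checked for a candidate amplicon with a few lines of Python. The sketch rests on one interpretation, which is an assumption: the distance between cleavage bands is the size difference between the two cleavage products, and the separation between uncleaved and cleaved products is the size difference between the parental amplicon and the larger product. Thresholds are treated as inclusive minimums, and the function names are hypothetical.

```python
def band_sizes(amplicon_len, cut):
    """Sizes of the two products when an amplicon is cut `cut` bases from its left end."""
    if not 0 < cut < amplicon_len:
        raise ValueError("cut must fall inside the amplicon")
    return cut, amplicon_len - cut


def passes_gel_criteria(amplicon_len, cut,
                        min_band_distance=100, min_parent_separation=150):
    """Check both separations against their minimums (defaults: 100 bp, 150 bp)."""
    left, right = band_sizes(amplicon_len, cut)
    band_distance = abs(left - right)
    parent_separation = amplicon_len - max(left, right)
    return (band_distance >= min_band_distance
            and parent_separation >= min_parent_separation)
```

Under this reading, a 500 bp amplicon cut 200 bp from one end yields 200 bp and 300 bp products and satisfies the 100 bp and 150 bp settings, whereas a centered cut fails because the two products co-migrate.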
  • the output primers can also be easily modified in the spreadsheet, such as to add flanking sequences for additional amplification and/or barcodes for sequencing.
  • the primer pair design process implemented will in some cases use the following steps and considerations to yield primer pairs suitable for high-throughput PCR.
  • the primer design process may take into account the potential secondary structure that could arise from the 3' end of a primer folding back; may take into account estimated physical properties including the temperature or length; may define targets for the content of specific bases in the primer; and may check to ensure that primers are not self-complementary.
  • Each possible position in the sequence 5' of the nuclease binding sites is considered as a possible 5' base for a primer (in some cases allowing for a user-specified minimum distance between the edge of the amplicon and the nuclease site).
  • a first number of bases in the 3' direction are taken as an initial sequence for the primer.
  • the first number of bases may be any integer number of bases, but in some preferred embodiments the first number of bases chosen will be 15, 16, 17, 18, 19, or 20 bases. Then the following design loop begins:
  • the specified range may be greater than 25, 30, 31, 32, 33, 34, 35, or 40 % and less than 55, 60, 61, 62, 63, 64, 65, 70, or 75%.
  • the melting temperature can be approximated by a number of methods. In one embodiment it is approximated by the empirical relation below, where the %GC is the percentage of G and C residues and the length is the primer length in units of the number of nucleotides.
  • if the predicted melting temperature falls outside of certain specified values, then lengthen the primer by one base in the 3' direction and repeat the loop.
  • the predicted melting temperature is desirably less than 70, 65, 60, 59, 58, 57, 56, 55, or 50 degrees when using the empirical formula above.
  • 3) If the primer is longer than a specified maximum primer length, e.g., 30 base pairs, then exit the loop unsuccessfully (no primer for this position). In some cases the maximum primer length may be 20, 30, 35, 40, 50, 60, or 70 base pairs.
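The design loop can be sketched in Python as follows. Two points are assumptions and are marked as such in the code: the melting-temperature relation used is a commonly cited GC-content and length approximation, substituted here because the empirical relation referred to above is not reproduced in this text; and a primer whose %GC falls outside the range is simply lengthened, like one whose melting temperature is out of range. Default values and function names are illustrative.

```python
def gc_percent(seq):
    """Percentage of G and C residues in the primer."""
    return 100.0 * sum(b in "GC" for b in seq.upper()) / len(seq)


def approx_tm(seq):
    """ASSUMPTION: a common GC/length approximation, not the source's relation."""
    return 81.5 + 0.41 * gc_percent(seq) - 675.0 / len(seq)


def design_primer(template, start, init_len=18, max_len=30,
                  gc_range=(35.0, 65.0), tm_range=(55.0, 65.0)):
    """Grow a primer from `start` in the 3' direction until it passes, or give up."""
    length = init_len
    while length <= max_len and start + length <= len(template):
        primer = template[start:start + length]
        gc_ok = gc_range[0] <= gc_percent(primer) <= gc_range[1]
        tm_ok = tm_range[0] <= approx_tm(primer) <= tm_range[1]
        if gc_ok and tm_ok:
            return primer
        # ASSUMPTION: any failed check leads to lengthening by one base.
        length += 1
    return None  # no primer for this position
```

Running the function at every allowed start position on each side of the nuclease site gives the candidate forward and reverse primers that are later paired.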
  • pairs may then be made by matching each forward primer to each possible reverse primer.
  • This list of pairs can then be pruned in some cases to remove any that would result in products where the distances between nuclease sites and the ends of the amplicon fall outside of some specified ranges.
  • This list may be further pruned to remove primer pairs that are somehow undesirable, e.g., those that could potentially form primer dimers as defined by having the final 3' bases of one primer match the reverse complement of the final 3' bases of the other primer.
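The primer-dimer test defined above reduces to a reverse-complement comparison of the 3' ends. In the sketch below both primers are written 5' to 3', and the number of 3' bases compared (`n=4`) is an assumed default, since the text does not fix it.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def may_form_dimer(fwd, rev, n=4):
    """True if the last n 3' bases of one primer pair with those of the other."""
    return fwd.upper()[-n:] == reverse_complement(rev[-n:])
```

Pairs for which the function returns `True` would be dropped from the candidate list before sorting.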
  • primer pairs may then be sorted by some selection criteria depending upon the application, for example how close the melting temperature is to a specified target melting temperature. Primer pairs may also be sorted and/or filtered by providing a preference, for instance for shorter amplicon lengths, or may be sorted alphabetically or any other acceptable manner.
  • the primer pairs are then sorted by how close their melting temperature is to the target melting temperature (the default is 60°C) by
  • the algorithm may selectively relax constraints in some embodiments to generate a minimum number of primer pairs.
  • the most lenient set of criteria still require a minimum %GC of 25, a maximum %GC of 70, a maximum length of 38, and a minimum melting temperature of 55°C.
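  • One way to picture the selective relaxation of constraints is as an ordered list of criteria tiers that are tried from strictest to most lenient. In the sketch below only the last tier (minimum %GC of 25, maximum %GC of 70, maximum length of 38, minimum melting temperature of 55°C) and the 60°C default target come from the text; the stricter tiers, the class names, and the use of one summary value per pair are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Criteria:
    gc_min: float
    gc_max: float
    max_len: int
    tm_min: float


@dataclass(frozen=True)
class PrimerPair:
    name: str
    gc: float     # summary %GC for the pair (assumed representation)
    length: int   # longer of the two primers
    tm: float     # summary melting temperature, deg C


TIERS = [
    Criteria(40, 60, 25, 58),  # hypothetical strict tier
    Criteria(35, 65, 30, 57),  # hypothetical intermediate tier
    Criteria(25, 70, 38, 55),  # most lenient tier stated in the text
]


def passes(pair: PrimerPair, c: Criteria) -> bool:
    return (c.gc_min <= pair.gc <= c.gc_max
            and pair.length <= c.max_len
            and pair.tm >= c.tm_min)


def select_pairs(pairs: List[PrimerPair], min_count: int = 1,
                 target_tm: float = 60.0) -> List[PrimerPair]:
    """Relax criteria tier by tier until at least min_count pairs pass,
    then sort by closeness of Tm to the target."""
    chosen: List[PrimerPair] = []
    for criteria in TIERS:
        chosen = [p for p in pairs if passes(p, criteria)]
        if len(chosen) >= min_count:
            break
    return sorted(chosen, key=lambda p: abs(p.tm - target_tm))
```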
  • the output can include returning polymerase chain reaction primer sequences for amplification of the ranked off-site cleavage locations alone, or in combination with a full nucleic acid sequence of an amplicon for detecting induced mutations.
  • the output "primer sequences" can be used for other applications such as binding without amplification, pull-down sequences, probe sequences, or as sequence-specific tags.
  • Some embodiments provide an estimate of the number of expected target sites based upon the search criteria, for example to provide the user with a guide for selecting appropriate search parameters or to prohibit queries that would generate so many hits as to be too time- or resource-intensive.
  • Figure 30B depicts a flow chart for an exemplary method (900) for generating target sites.
  • a query is obtained and search parameters are generated (910).
  • an estimate of the number of expected results is provided (920).
  • the query may then be updated with a revised query, wherein a revised estimate is subsequently generated of the number of expected results. This process can be completed to obtain a desirable number of expected results.
  • the query is then used to construct a target site list (930) using methods provided herein.
  • the results in the target site list are ranked by score (940) and/or filtered by specified selection criteria (950).
  • the list of target sites is then used to generate primer pairs (960) for generating test amplicons.
  • the list of target sites and primer pairs is then output as results.
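  • The estimate in step (920) could be produced in many ways; the text does not say how. As a rough illustration only, the sketch below assumes a genome of independent, equally frequent bases, in which case the expected number of sites within k mismatches of an n-base guide is 2·G·P(PAM)·Σ C(n,m)·3^m / 4^n for m from 0 to k, where G is the genome size and the factor of 2 accounts for both strands. Real genomes contain repeats and biased base composition, so this is an order-of-magnitude guide at best. The function name and parameters are hypothetical.

```python
from math import comb

# Number of bases each IUPAC code can stand for.
DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1,
              "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
              "B": 3, "D": 3, "H": 3, "V": 3, "N": 4}


def expected_hits(genome_size: int, guide_len: int,
                  max_mismatches: int, pam: str = "") -> float:
    """Expected site count under a uniform random-base model (assumption)."""
    # Probability that a random window is within max_mismatches of the guide.
    p_guide = sum(comb(guide_len, m) * 3 ** m
                  for m in range(max_mismatches + 1)) / 4 ** guide_len
    # Probability that the adjacent bases satisfy the PAM exactly.
    p_pam = 1.0
    for code in pam.upper():
        p_pam *= DEGENERACY[code] / 4
    return 2 * genome_size * p_guide * p_pam
```

  • A front end could compare this estimate with a threshold before running the full search, prompting the user to tighten the query when the expected count is too large.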
  • D Exemplary Algorithm for Identifying and/or Ranking
  • An exemplary decision tree for identifying and/or ranking putative target sites is illustrated in Figure 30C (100).
  • a guide strand sequence gRNA
  • variants of the guide RNA are generated that vary in insertion(s) and/or deletion(s) in each possible position.
  • the collection of these variants, without the original guide (or with the original guide, depending on the embodiment) (120), is then aligned to the chosen genomic (or other) sequence (130). If specified, the required adjacent motif must be present within the supplied limits or mismatches. This can be a PAM or other type of sequence.
  • the program can determine if each of the guides or variant guides matches within the user specified number of mismatches (140).
  • If not, the sequence is not added to the output (150) and the search moves one nt further through the genome index, the specified sequence or file and searches again (130).
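  • Generating the guide variants (120) can be sketched as below for the single-base case. The representation is an assumption: a deletion removes one guide base, and an insertion is modelled by placing a wildcard "N" between two guide bases so that the extra position matches any genomic base. Insertions at the very ends are omitted here because they would only lengthen the guide; an implementation may choose differently.

```python
from typing import Set


def indel_variants(guide: str, include_original: bool = False) -> Set[str]:
    """All single-base deletion and single-base insertion variants of a guide."""
    variants = set()
    for i in range(len(guide)):
        variants.add(guide[:i] + guide[i + 1:])    # delete base i
    for i in range(1, len(guide)):
        variants.add(guide[:i] + "N" + guide[i:])  # insert wildcard before base i
    if include_original:
        variants.add(guide)
    return variants
```

  • Using a set removes duplicate variants that arise when adjacent bases are identical, for example deleting either A from "AAC" gives the same sequence.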
  • the input guide strand sequence (gRNA) (110), can also be used to search the genomic or other sequences without the possible addition of indels, based on the user-supplied input (170). This process can occur in parallel, or as part of the search with variants, or it may occur prior or at other times than the search described above (130).
  • the program can determine if each of the guides or variant guides matches within the user specified number of mismatches (180). If specified, the required adjacent motif must be present within the supplied limits or mismatches. This can be a PAM or other type of sequence. If not, the sequence is not added to the output (190) and the search moves one nt further through the genome index, the specified sequence or file and searches again (170).
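  • The mismatch-limited search with a required adjacent motif (170-190) can be illustrated by a naive scan that advances one nucleotide at a time. This sketch is a simplified stand-in, not the disclosed search: it checks only the forward strand of an in-memory string, assumes the motif lies immediately 3' of the site, and uses no genome index. Function names and defaults are hypothetical.

```python
# Bases matched by each IUPAC nucleotide code.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def count_mismatches(pattern: str, site: str) -> int:
    """Positions where the site base is not allowed by the pattern code."""
    return sum(base not in IUPAC[code] for code, base in zip(pattern, site))


def find_sites(sequence: str, guide: str, pam: str = "NGG",
               max_mismatches: int = 3, max_pam_mismatches: int = 0):
    """Return (start, matched_sequence, guide_mismatches) for each hit."""
    hits = []
    window = len(guide) + len(pam)
    for start in range(len(sequence) - window + 1):  # move one nt at a time
        site = sequence[start:start + len(guide)]
        pam_site = sequence[start + len(guide):start + window]
        if count_mismatches(pam, pam_site) > max_pam_mismatches:
            continue  # required adjacent motif not present
        mismatches = count_mismatches(guide, site)
        if mismatches <= max_mismatches:
            hits.append((start, site + pam_site, mismatches))
    return hits
```

  • Because the guide is compared through the same IUPAC table, a wildcard "N" in a guide variant matches any base, and the reverse strand can be covered by scanning the reverse complement of the sequence.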
  • the output can contain some or all of the following information or additional information: a list of genomic sequences, the genomic location, such as the chromosome number and base position in most genomes, and annotation on the nearest gene, if the site is in an exon, intron or other annotated sequence or other data from current or future data bases.
  • an output without indels (220) and one that can include indels (250) remain separate. This data can be generated from the process listed above (110-210), or can be derived from other sources, and processed primarily in terms of ranking the output or sequences collected from any source.
  • each site of a given length, sub-sequences, in a genome or other sequence can be scanned and given a ranking score using the algorithm described below (240, 270).
  • the user would request only the sub-sequences above a user-input or default cut-off, generally the sites that would likely be cut.
  • the listed sites are each individually compared to the guide sequence (220), or guide sequence allowing indels (260) with the ranking performed in any of a number of weighted methods (one embodiment described in Table 22).
  • the site is aligned to the genomic site and included in the output (230 or 260), whereas in other embodiments, the site can be iteratively compared to the genomic site with different combinations of mismatches, insertions and/or deletions (260, 270), or aligned across the full specified sequence or genomic indices. Based on the alignment, the differences are scored with weights for mismatches, insertions and/or deletions using one of the default or user-supplied ranking methods (240, 270).
  • the results of the ranking are given as output (280), which can be combined with other annotated information and provided as HTML, graphical, text, spreadsheet and/or other forms of output (290).
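  • A weighted ranking of the kind described (240, 270) might look like the following sketch. The weights of Table 22 are not reproduced in this text, so every number here is a hypothetical placeholder: mismatches are penalised more heavily the closer they lie to the PAM (assumed to be at the 3' end of the site), and each insertion or deletion adds a fixed penalty. Lower penalties rank first.

```python
def site_penalty(guide: str, site: str, insertions: int = 0, deletions: int = 0,
                 w_mismatch: float = 1.0, w_insertion: float = 3.0,
                 w_deletion: float = 3.0) -> float:
    """Penalty for an aligned site; all weights are illustrative assumptions.

    guide and site are equal-length aligned strings (at least 2 bases),
    written 5' to 3' with the PAM following the last base.
    """
    n = len(guide)
    total = w_insertion * insertions + w_deletion * deletions
    for i, (g, s) in enumerate(zip(guide, site)):
        if g != s:
            # Weight grows linearly from 1x (PAM-distal) to 2x (PAM-proximal).
            total += w_mismatch * (1.0 + i / (n - 1))
    return total


def rank_sites(guide: str, sites):
    """sites: iterable of (name, aligned_site, insertions, deletions).
    Returns names ordered from lowest to highest penalty."""
    scored = [(site_penalty(guide, site, ins, dels), name)
              for name, site, ins, dels in sites]
    return [name for _, name in sorted(scored)]
```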
  • the output can be further processed based on the results of this output, such as the number of sites returned, based on newer or different data that emerged, based on alternative applications or other reasons.
  • the output can therefore be re-ranked using independent scoring or scoring systems that incorporate the previously determined score. In one embodiment, this can be as simple as adding further weights for additional features, such as PAM mismatches. In other
  • re-ranking can be used to add data not in the original ranking such as chromosomal context, DNA accessibility, sequence specific features or known interactions (310).
  • This output can be provided as HTML, graphical, text, spreadsheet and/or other forms of output (320).
  • the output in one preferred embodiment allows one to avoid guide strands that may result in high off-target activity, that may target important genes or may result in other off-target events (300). In other embodiments, this process allows a better choice of guide strands by comparing the ranked output between guide strands that may target the same gene or region, or that are otherwise alternatives (300).
  • the genomic, plasmid or other DNA can be harvested to measure activity.
  • output primers are provided that can be used to determine cleavage, homologous recombination, mutation rates or the rates of other events at the on-target and putative off-target sites (330). Similarly, one can use the output primers or other methods to evaluate the on-target or off-target activity of the guide strands and then compare between the guide strands (330).
  • FIG. 31 is a block diagram of a preferred network-based implementation (400) wherein a client computer system (410) is in communication with a server computer system (420) via a network (430), i.e. the Internet or in some cases a private network or a local intranet.
  • One or both of the connections to the network may be wireless.
  • the server is in communication with a multitude of clients over the network, preferably a heterogeneous multitude of clients including personal computers and other computer servers as well as hand-held devices such as smartphones or tablet computers.
  • the server computer is in communication, i.e. is able to receive an input query from or direct output results to, one or more laboratory automation systems, i.e. one or more automated laboratory systems or automation robotics that automate biochemical assays, PCR amplification, or synthesis of PCR primers. See for example automated systems available from Beckman Coulter.
  • FIG. 32 is a block diagram of the basic components of an exemplary computer server (500) on which the methods may be implemented.
  • the systems will typically contain storage space (510), memory (520), one or more processors (530), and one or more input/output devices (540).
  • processor as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit).
  • memory as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, etc.
  • input/output devices or “I/O devices” as used herein is intended to include, for example, one or more input devices, e.g., keyboard, for making queries and/or inputting data to the processing unit, and/or one or more output devices, e.g., a display and/or printer, for presenting query results and/or other results associated with the processing unit.
  • An I/O device might also be a connection to the network where queries are received from and results are directed to one or more client computers.
  • processor may refer to more than one processing device.
  • processing devices may share the elements associated with the processing device. Accordingly, software components including instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more of the associated memory or storage devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole into memory (e.g., into RAM) and executed by a CPU.
  • the storage may be further utilized for storing program codes, databases of genomic sequences, etc.
  • the storage can be any suitable form of computer storage including traditional hard-disk drives, solid-state drives, or ultrafast disk arrays.
  • the storage includes network-attached storage that may be operatively connected to multiple similar computer servers that comprise a computing cluster.
  • the computer server receives input submitted through a graphical user interface (GUI).
  • the GUI may be presented on an attached monitor or display and may accept input through a touch screen, attached mouse or pointing device, or from an attached keyboard.
  • the GUI will be communicated across a network using an accepted standard to be rendered on a monitor or display attached to a client computer and capable of accepting input from one or more input devices attached to the client computer.
  • Figure 33 depicts some of the components that may be found in an exemplary GUI for inputting parameters for target site searches capable of being rendered in a standard web browser window (600) on a client computer.
  • a phone interface can identify, read, and/or run entered sequences.
  • the GUI contains a target genome selection region (612) where the user selects the genome to be searched.
  • a genome is indicated by clicking, touching, highlighting or selecting one of the genomes that are listed (615).
  • the target genome is selected from a drop-down list.
  • the GUI contains a query sequence region (620) for entering or uploading one or more query guide sequences.
  • the GUI typically includes a text box for the user to input a query guide strand sequence (622).
  • users may input any sequence or sequences for which they would like to design amplification primers.
  • the GUI may additionally or
  • the text file must contain only one query sequence per line.
  • the GUI may also contain radio buttons that allow the user to select whether the target sequence will be entered in a text box (624) or uploaded from a text file (628).
  • the GUI may include a button for choosing the file (626), may allow a user to drag and drop the intended file, or other means of having the file uploaded.
  • the GUI generally accepts a sequence of length acceptable for serving as a CRISPR/Cas guide strand sequence, for example between about 10 and about 55 nucleotides. In preferred embodiments this may range from 17-22 nucleotides.
  • the input is typically a string of letters, each corresponding to a single letter designating a nucleotide, or other symbols allowing ambiguity at indicated positions (N, R, etc.), and together providing the nucleic acid sequence of the guide strand polynucleotide.
  • the sequence will generally be entered using a combination of characters selected from the allowable characters and dependent upon the implementation may be limited to characters for the standard nucleotides, or may include non-standard nucleotides.
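  • Input checking of a query guide sequence could be sketched as follows. The length limits reflect the range of about 10 to about 55 nucleotides given above; everything else (the function name, stripping of whitespace, upper-casing, and conversion of U to T for DNA searching) is an assumption for illustration.

```python
STANDARD = set("ACGTU")
AMBIGUOUS = set("RYSWKMBDHVN")  # IUPAC ambiguity codes


def clean_guide(text: str, allow_ambiguous: bool = True,
                min_len: int = 10, max_len: int = 55) -> str:
    """Normalise a user-entered guide sequence or raise ValueError."""
    guide = "".join(text.split()).upper()  # drop spaces and line breaks
    allowed = (STANDARD | AMBIGUOUS) if allow_ambiguous else STANDARD
    bad = sorted(set(guide) - allowed)
    if bad:
        raise ValueError(f"unsupported characters: {''.join(bad)}")
    if not min_len <= len(guide) <= max_len:
        raise ValueError(f"length {len(guide)} outside {min_len}-{max_len}")
    return guide.replace("U", "T")  # assumption: search in DNA alphabet
```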
  • the GUI contains a region where the user selects search options (630).
  • the region can include a text box for the user to input a target sequence protospacer adjacent motif (PAM) (632).
  • the input is typically a string of three letters corresponding to the single letter code for the PAM.
  • Exemplary PAM include, but are not limited to, NGG, NAG, and NRG.
  • the GUI also typically includes additional radio buttons, boxes, and/or other means for the user to input the number of allowed mismatches, insertions, and/or deletions.
  • the search options region (630) provides a check button for selecting if no indels should be included in the search (634), a check button for selecting if deletions should be included in the search (636), a check button for selecting if insertions should be included in the search (638), and radio buttons for entering how many mismatches (e.g., 0, 1, 2, or 3, etc.), deletions, (e.g., 0, 1, 2, etc.), insertions (e.g., 0, 1, 2, etc.), or a combination thereof should be searched.
  • the interface provides a check button to elect no indels in combination with radio buttons for selecting 0, 1, 2, or 3 mismatches; a check button to elect 1-base deletion in combination with radio buttons for selecting 0, 1, or 2 mismatches; and a check button to elect 1-base insertion in combination with radio buttons for selecting 0, 1, or 2 mismatches (640).
  • the number of mismatches, insertions, and/or deletions may be entered as individual numeric values, as a list of numeric values, or as a range of numeric values in a text box(es).
  • the input strings "0,1,2,3", "0,1-3", "0,1,2-3", or "0,1-2,3" would in some cases all be accepted inputs and would generate all possible alignments including 0, 1, 2, or 3 mismatches, insertions, or deletions.
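  • Parsing such list-and-range strings is straightforward; a minimal sketch is shown below. The function name and the error handling are assumptions, and an implementation could accept other separators.

```python
from typing import List


def parse_counts(spec: str) -> List[int]:
    """Expand a string such as '0,1-3' into a sorted list of integers."""
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"descending range: {part}")
            values.update(range(low, high + 1))  # inclusive range
        else:
            values.add(int(part))
    return sorted(values)
```

  • All four example strings above expand to the same list, so equivalent inputs produce identical searches.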
  • the GUI can include options for the user to select pre-determined primer design options and/or to customize certain design parameters.
  • the PCR primer design options region (650) includes a check box (652) or radio button that allows the user to select whether or not primer sequences should be included with the output.
  • the GUI can include radio buttons or tabs (654) that allow the user to select a preferred primer design strategy, for example, default, Illumina 250, Illumina 250 - paired, SMRT, or enzyme. Additionally, or
  • the GUI can include text boxes that allow the user to customize primer parameter settings including, for example, the minimum separation of uncleaved to cleaved (660), minimum cleavage product size difference (662), minimum amplicon length (664), maximum amplicon length (666), optimal amplicon length (668), etc.
  • the user input for each text box is typically an integer, for example, between about 0 and 100,000 inclusive, preferably between about 0 and 10,000 inclusive, or between 0 and 1,000 inclusive.
  • the text boxes can be populated with default settings before or after the user submits the query.
  • the user can also elect not to include primer sequence as part of the output, which can reduce the runtime associated with the query.
  • the GUI also typically includes an interface for the user to initiate a search.
  • the exemplary GUI embodiment (600) includes a submit button or tab (680) that when selected initiates a search according to the user entered or default criteria.
  • the GUI can also include a reset button or tab (682) that when selected removes the user input and/or restores the default settings.
  • the GUI will in some embodiments have an example button that, when selected by the user, populates all of the input fields with default values.
  • the option selected by the example values may in some embodiments coincide with an example described in detail in a tutorial, manual, or help section.
  • the GUI will in some embodiments contain all or only some of the elements described above.
  • the GUI may contain any graphical user input element or combination thereof including one or more menu bars, text boxes, buttons, hyperlinks, drop-down lists, list boxes, combo boxes, check boxes, radio buttons, cycle buttons, data grids, or tabs.
  • Figures 26A-26G and Table 14 illustrate an exemplary search string processed according to the disclosed methods and include examples showing the input, and portions of a web result and spreadsheet output for a search of the human genome using guide strand R-01.
  • the genome of interest is chosen from the Target Genome list (Figure 26A).
  • the target sequence is entered into the Query Sequence box (Figure 26B).
  • the required protospacer adjacent motif (PAM) is entered into the 'Add suffix' Box of the Search Options section (Figure 26C).
  • the spacers (Ns) and required bases are included, such as NGG or NRG.
  • Primer design parameters are set by pressing the button for 'Default', 'Illumina 250', 'Illumina 250 paired', 'SMRT' or 'enzyme' (when using other enzymes). Any of the parameters can be entered by hand to further customize.
  • the methods provided herein will in some cases completely replace the need for experimentally screening nuclease target sites or nuclease activities, allowing for the design of CRISPR/Cas guide strands in a completely in silico manner.
  • the tools provided herein will serve as an essential first step in the design process by screening and selecting only the few potential guide strands that are predicted to have the desired cleavage-mediating activity at the on-target site, with limited off-site cleavage.
  • the tool will prevent the use of guide strands that have medium or high probability of cleaving an off-target site or cleaving multiple sites in the genome. This will allow for far less experimental time and resources being applied to preparing and testing guide strands that do not have the desired features.
  • the methods provided herein for predicting off-target sites are used without the need for experimental data. In some cases the methods provided herein for predicting off-target sites are parameterized to correlate with
  • the methods provided herein for predicting off-target sites are used to screen candidate guide strands wherein a much smaller subset are subsequently tested experimentally.
  • the methods of predicting off-target sites can be used in combination with experimental methods for measuring both on-target and/or off-target cleavage activity. In some embodiments this includes using the results from one or more experiments to guide the search for guide strand with the desired activity at the target site and little or no activity on off-target sites.
  • the experimental methods can include any method capable of measuring the cleavage activity or identifying off-target active sites of a guide strand in combination with a CRISPR/Cas nuclease.
  • Non-limiting exemplary experimental methods are described below.
  • mutation detection assays can be used to determine if off-target cleavage occurs at putative off-target sites identified according to the disclosed methods.
  • Suitable assays such as enzyme mismatch assays, are known in the art, see, for example, Guschin, et al., Methods Mol. Biol., 649:247-56 (2010), which describes a procedure for quantifying mutations that result from DNA double-strand break repair via non-homologous end joining; and Huang, et al., Electrophoresis, 33(5):788-96 (2012), which describes a T7 endonuclease I-based assay.
  • the assays are typically based on the ability of a nuclease to selectively cleave distorted duplex DNA formed via cross-annealing of mutated and wild-type sequence.
  • primers such as primers designed according to the methods described herein
  • PCR is used to amplify the genomic loci of putative target sites after transfecting test cells with the elements of the CRISPR/Cas system (e.g., a plasmid expressing Cas9 and a test guide strand).
  • Sanger sequencing can be used to observe mutations.
  • Deep sequencing can also be used to detect and quantitate nuclease induced mutations in CRISPR/Cas-treated cell populations.
  • Example 1 CRISPR guide strands can exhibit off-target activity at similar levels as on-target activity, even with mismatches within first 12 nucleotides.
  • CRISPR/Cas9 guide strands targeting HBB were chosen by comparing the similar regions in the human hemoglobin δ (HBD) gene.
  • Eight 20-base guide strands were designed to target sites near the sickle mutation in the HBB gene (Figure 1A), each adjacent to a PAM sequence that contains the canonical trinucleotide NGG.
  • Three guide strands were also designed to target two segments in the human CCR5 gene (Figure 2A), and the corresponding CRISPR/Cas9 systems were tested to determine their on-target cleavage and potential off-target activity at the human C-C chemokine receptor type 2 (CCR2) gene.
  • the name of the guide strand (such as R-03) is used to represent the CRISPR/Cas9 system with the specified guide strand.
  • CRISPR plasmids were generated by kinasing and annealing oligonucleotides containing a G followed by 19 additional bases of the guide strand plus sticky ends, and ligating them into pX330, an 8.5-kb expression plasmid that contains a U6 promoter-driven chimeric +85-bp guide strand and a CBh promoter-driven Cas9 expression cassette (provided by Dr. Feng Zhang, and also available through Addgene, plasmid 42230) (Hsu, et al., Nat. Biotechnol., 31:827-832 (2013)).
  • HEK-293T cells/well were seeded and cultured in Dulbecco's modified Eagle medium supplemented with 10% fetal bovine serum (FBS) and 2 mM fresh L-glutamine, 24 h prior to transfection.
  • Cells were transfected with 100, 200, 400 or 800 ng of CRISPR plasmids (normalized to 800 ng with pUC18) using FuGENE HD (Promega). The genomic DNA was harvested after 3 days using QuickExtract (EpiCentre).
  • Targeted cleavage was measured at the endogenous loci by the rate of mutations through mis-repair, detected using amplification of these sites using bar-coded or traditional primers (Table 1) and the T7EI assay.
  • the fragments were separated on agarose gels and quantitated using ImageJ; the mutation frequencies were calculated and averaged. To better determine the mutation rate, amplification bands were cloned using the TOPO® TA kit
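  • The text does not state how mutation frequencies were computed from the quantitated bands. A formula commonly used with enzyme mismatch cleavage assays, shown here only as an illustration and not as the authors' confirmed calculation, is % modification = 100 × (1 − √(1 − f)), where f is the fraction of total band intensity found in the cleaved bands. The square root reflects that cleavage occurs only in heteroduplexes formed by random re-annealing of mutant and wild-type strands. The function name and the intensities in the usage note are hypothetical.

```python
import math
from typing import Sequence


def t7e1_indel_percent(uncut: float, cleaved_bands: Sequence[float]) -> float:
    """Estimate percent gene modification from gel band intensities."""
    cut = sum(cleaved_bands)
    total = uncut + cut
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    fraction_cleaved = cut / total
    return 100.0 * (1.0 - math.sqrt(1.0 - fraction_cleaved))
```

  • For example, hypothetical intensities of 75 for the uncleaved band and 15 and 10 for the two cleavage products give a cleaved fraction of 0.25 and an estimate of about 13.4%.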
  • Table 1 Sequence of primers used to amplify endogenous loci for the T7EI assay, sequencing and quantitative PCR
  • Off-target analysis was performed using a bioinformatics-based search tool to select potential off-target sites, which were evaluated using the T7EI mutation detection assay. Sanger sequencing was used to confirm the gene modification frequencies for the CRISPR/Cas9 systems, including guide strand R-02 at GRIN3A (see Figure 6B) and compared to the on-target rate (Figure 6A).
  • ZFNs zinc finger nucleases
  • TALENs transcription activator-like effector nucleases
  • CRISPR/Cas9 systems have a high potential for off-target activity, as they have more promiscuous binding abilities at positions distal from the protospacer-adjacent motif (PAM) region (Cong, et al., Science, 339:819-823 (2013); Gasiunas, et al., Proc. Natl Acad. Sci. USA, 109:E2579-E2586 (2012); Jinek, et al., Elife, 2:e00471 (2013); Jiang, et al., Nat. Biotechnol., 31:233-239 (2013)).
  • Because the guide RNA strands typically target a DNA sequence of ~20 bp, relatively short compared with the >36 bp targeted by TALENs, many potential off-target sites may exist in large genomes, such as in mammals. Additionally, because non-Watson-Crick base pairing is known to occur (Jiang, et al., Nat. Biotechnol., 31:233-239 (2013)), it is possible that CRISPR/Cas9 systems have more off-target activities compared with corresponding ZFNs and TALENs.
  • To determine the off-target effects of CRISPR/Cas9 systems in the context of the human genome, a series of CRISPR/Cas9 systems was constructed with guide RNA strands targeting the human hemoglobin β (HBB) and C-C chemokine receptor type 5 (CCR5) genes, expressed in human embryonic kidney 293T (HEK-293T) cells, and assayed for on- and off-target activities using the T7 endonuclease I (T7EI) mutation detection assay and Sanger sequencing. Special attention was placed on the effects of mismatches between the guide strands and the complementary target sequences.
  • Table 2 summarizes the on- and off-target cleavage rates in which, for each CRISPR/Cas9 system, the complementary sequence of the guide strand, the number of mismatches within the guide strand and the name and genetic region of the on- and off-target activities are provided.
  • the third and fourth columns list, respectively, the indel percentages determined by Sanger sequencing and T7EI.
  • Some CRISPR/Cas9 systems with guide strands targeting HBB also cleaved HBD (some at high rates), even though there are mismatches between the guide strands and the complementary HBD sequences.
  • guide strands having just one-base mismatch with the complementary HBD sequences located at positions 4 (R-07), 7 (R-01), 8 (R-08), 10 (R-04) and 11 (R-03) bases from the PAM sequence, resulted in off-target mutation rates ranging from 7 to 58%, roughly corresponding to the distance between the mismatch location and the PAM sequence, with R-04 as an exception (Figure 1B).
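  • Mismatch positions such as those quoted above are counted from the PAM. A small helper makes the convention explicit; it assumes the guide and target are equal-length strings written 5' to 3' with the PAM immediately after the last base, so that position 1 is the base adjacent to the PAM. The function name and the sequences used in the usage note are hypothetical, not the R-series guides.

```python
from typing import List


def mismatch_positions_from_pam(guide: str, target: str) -> List[int]:
    """Positions (1 = PAM-adjacent base) at which guide and target differ."""
    if len(guide) != len(target):
        raise ValueError("guide and target must be the same length")
    n = len(guide)
    # Index i counts from the 5' end, so n - i counts from the PAM end.
    positions = [n - i for i, (g, t) in enumerate(zip(guide, target)) if g != t]
    return sorted(positions)
```

  • For an 8-base example, comparing "ACGTACGT" with "ACGAACGT" reports a single mismatch at position 5 from the PAM.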
  • two off-target sites at HBD had mutation rates even higher than the on-target rates at HBB, especially R-08, which induced a mutation rate of 48% at HBD, much higher than that at HBB (36%).
  • the guide strand is typically preceded by a guanine (Cong, et al., Science, 339:819-823 (2013)). Results show that it is not necessary for the guanine base to match the target site for efficient cleavage, as seven guide strands without a guanine at this position induced mutations in HBB (R-02 to R-08) and four guide strands (R-03, R-04, R-07, R-08) induced mutations in HBD (Figure 1B).
  • guide strand R-25 was designed with two identical genomic targets in CCR5 and CCR2 genes to identify the influence of factors beyond sequence homology, such as genomic context.
  • the CRISPR/Cas9 system with R-25 showed a >2-fold difference in mutation rate at these two sites (46% versus 20% mutation rate, Figure 2C).
  • These results indicate that other features such as genomic context may play an important role in cleavage activity.
  • Although guide strand R-30 had two mismatches with CCR2 at the two bases proximal to the PAM region, it induced mutations in CCR2 at a rate of 5% as measured by T7EI with 800 ng of plasmid in transfection (Figure 2B).
  • a distinct feature of CRISPR off-target activity as related to mismatches in the guide strand is that mismatches in the PAM region can prevent off-target cleavage (Hsu, et al., Nat. Biotechnol, 31 :827-832 (2013)).
  • R-06, which has a one-base mismatch in the PAM, did not induce detectable mutations at HBD, although it has a perfect match of the 14 bases proximal to the PAM (Figures 1B-1C).
  • R-02 did not induce cleavage at HBD because of the one-base mismatch in the PAM and two mismatches at positions 2 and 4 from the PAM (Figure 1B).
  • CRISPR plasmids were transfected at doses from 100 to 800 ng, and corresponding on- and off-target activities measured by T7EI (Figures 3A-3E).
  • R-04 and R-25 gave lower on- and off-target activities
  • R-30 resulted in increased on-target activity and decreased off-target activity
  • the on- and off-target activities of R-03 and R-08 remained roughly the same.
  • transfection with the lowest dose (100 ng) increased the ratio of on-target to off-target activities for R-04, R-25 and R-30, although not for R-03 and R-08.
  • genomic DNA from cells transfected with R-03 was amplified using the HBD forward primer and the reverse primer downstream of the HBB site.
  • Genomic DNA from cells transfected with R-25 or R-30 was similarly amplified using the CCR2 forward and the CCR5 reverse primers. Agarose gels were used to confirm that the polymerase chain reaction (PCR) product sizes were consistent with chromosomal deletions between these sites.
  • the R-03, R-25 and R-30 PCR products were cloned and the individual colonies Sanger sequenced and aligned.
  • HEK-293 cells were transfected in triplicate with CRISPR plasmids containing guide strands R-02 or R-03, or mock transfected cells. Genomic DNA was harvested using QuickExtract (EpiCentre), per manufacturer's protocol.
  • Amplification reactions contained 1 µl of genomic DNA added to mastermix aliquots containing: 0.1 µl of each 10 µM primer, 3.8 µl of water and 5 µl of iTaq Universal SYBR Green 2x Supermix. The reactions were analysed on an Mx3005P qPCR
  • CRISPR-targeted loci showed a wide variety of insertions, deletions and point mutations. Because HBD is located ~7 kb upstream of HBB on chromosome 11, cleavage at both sites raises the possibility of chromosomal rearrangements, including a deletion of the intervening segment (Lee, et al., Genome Res., 20:81-89 (2010); Gupta, et al., Genome Res., 23:1008-1017 (2013); Xiao, et al., Nucleic Acids Res., 41:e141 (2013); Gratz, et al., Genetics, 194:1029-1035 (2013)).
  • Quantitative PCR was used to estimate the number of HBB alleles containing the chromosomal deletion with HBD. Standard curves were made using serial dilutions of cloned HBD-HBB deletion fragment, so that the standard curves of both sets of primers could be compared (Figure 4D). Quantities were very similar across this standard curve using either the HBB pair of primers or the HBD-HBB pair of primers, which allowed comparison of the total amount of HBB and the amount of HBD to HBB deletions. The groupings of three HBD/HBB samples for R-02 and R-03 are labelled (Figure 4D).
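  • The standard-curve comparison can be sketched as follows, as an illustration of the general approach rather than the authors' analysis. A line Ct = slope × log10(quantity) + intercept is fitted to each dilution series, sample Ct values are converted back to quantities, and the deletion frequency is the ratio of the HBD-HBB quantity to the total HBB quantity. Function names are hypothetical, and no measured Ct values from the study are assumed.

```python
import math
from typing import Sequence, Tuple


def fit_standard_curve(quantities: Sequence[float],
                       ct_values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of Ct against log10(quantity); returns (slope, intercept)."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantity_from_ct(ct: float, slope: float, intercept: float) -> float:
    """Invert the standard curve to recover a template quantity."""
    return 10 ** ((ct - intercept) / slope)


def deletion_percent(ct_deletion: float, ct_total: float,
                     curve_deletion: Tuple[float, float],
                     curve_total: Tuple[float, float]) -> float:
    """Deletion-product quantity as a percentage of total-locus quantity."""
    deleted = quantity_from_ct(ct_deletion, *curve_deletion)
    total = quantity_from_ct(ct_total, *curve_total)
    return 100.0 * deleted / total
```

  • Fitting each primer pair against the same dilution series is what allows the two quantities to be compared directly.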
  • Genomic DNA from the cells transfected with guide strand R-03 contained HBD-HBB chromosomal deletions equal to 12.6% of the copies of total HBB (Table 3). This was compared to genomic DNA from the cells transfected with guide strand R-02, which had higher HBB cleavage, but low HBD cleavage. The R-02 treated genomic DNA contained HBD-HBB chromosomal deletions equal to 0.4% of the copies of total HBB. Table 3: Results of quantitative PCR analysis
  • CCR5 is located ~8 kb upstream of CCR2 on chromosome 3; thus, chromosomal rearrangements may occur with cleavages at both CCR5 and CCR2.
  • These gross chromosomal deletions were detected with the R-25 CRISPR/Cas9 system, which cleaved both genes at high rates ( Figure 5A and 5B).
  • PCR amplification and sequence analysis revealed two cleavage events in (or near) a conserved region of the CCR5 and CCR2 genes, as indicated by indels consistent with NHEJ ( Figure 5C).
  • Cells transfected with the R-30 CRISPR/Cas9 system also had chromosomal deletions between CCR5 and CCR2 ( Figure 5C).
  • Sequencing the on- and off-target loci revealed a range of different indels as a result of CRISPR/Cas9-induced DNA cleavage and mis-repair. Cleavage followed by correct repair is more difficult to detect, as the sequence does not change. The changes include three large insertions (140, 216 and 448 bp), and a range of deletions. Some sequencing reads had mutations and indels and some with only mutations, but no change in length. Specifically, the results indicated that one-base insertions and deletions occurred frequently, usually several bases from the PAM sequence, consistent with the reported cleavage between the third and fourth bases from the PAM (Jinek, et al, Science, 337:816-821 (2012)).
  • CRISPR/Cas9 systems can induce high rates of gene modification in mammalian cells, they do not have perfect specificity, similar to previous
  • a CRISPR/Cas9 system may cause chromosomal rearrangements with one guide strand inducing cleavage at two defined locations, or with a pair of guide strands inducing deletion between the target sites (Xiao, et al., Nucleic Acids Res., 41:e141 (2013)); in both cases the off-target effects of each guide strand must be assayed. Therefore, multiplexed gene editing using CRISPR/Cas9-based approaches might have limitations unless optimal design of the guide strands can be performed to reduce or even eliminate the potential for gross chromosomal rearrangements.
  • CRISPR/Cas9 systems may have high rates of off-target cleavage; therefore, care must be taken when choosing and evaluating target sites. Even with diligent choice of target sites, in most genome editing applications, quantifying the off-target activities is necessary to identify unintended cleavage and mutagenesis. Transfection conditions, including plasmid dosage, may be optimized to decrease off-target cleavage, although the effects may vary with guide strands ( Figures 3A-3E).
  • sgRNA variants were constructed and tested with one or more nucleotides inserted or deleted (Table 5).
  • Table 5 sgRNA variants
  • Index names correspond to the index in Figures 6A-6H and Figures 2A-5C. Dashes indicate deleted nucleotides, "nd" means activity was not detected in the T7EI assay.
  • the annealed oligonucleotides have 4-bp overhangs that are compatible with the ends of BbsI-digested pX330 plasmid. Constructed plasmids were sequenced to confirm the guide strand region using the primer CRISPR seq 5'-CGATACAAGGCTGTTAGAGAGATAATTGG-3'. T7 endonuclease I (T7EI) mutation detection assay for
  • RNA-guided Cas9 at endogenous loci was quantified based on the mutation rates resulting from the imperfect repair of double-stranded breaks by NHEJ.
  • DMEM Dulbecco's Modified Eagle Medium
  • FBS Fetal Bovine Serum
  • Cells were transfected with 750 ng (sgRNA variants) or 1000 ng of CRISPR plasmids using 3.4 µl FuGene HD (Promega), following manufacturer's instructions.
  • Each sgRNA plasmid was transfected as biological duplicates in two separate transfections. All subsequent steps, including the T7EI assay, were performed independently for the duplicates.
  • a HEK293T-derived cell line containing stably integrated EGFP gene was used for sgRNAs targeted to the EGFP gene. This cell line was constructed by correcting the mutations in the EGFP gene in the cell line 293/A658 (Jinek, et al., Science, 337:816-821 (2012)) (kindly provided by Dr
  • PCRs polymerase chain reactions
  • the PCR products used in the T7EI assays were cloned into plasmid vectors using TOPO TA Cloning Kit for Sequencing (Life Technologies) or Zero Blunt TOPO PCR Cloning Kit (Life Technologies).
  • Plasmid DNAs were purified and Sanger sequenced using a M13F primer (5'-TGTAAAACGACGGCCAGT-3'). The mutation rates were determined by comparing each sequence read to the genomic sequence.
  • CRISPR Clustered regularly interspaced short palindromic repeats
  • Cas CRISPR-associated proteins
  • Chimeric single-guide RNAs (sgRNAs) based on CRISPR (Jinek, et al., Science, 337:816-821 (2012)) have been engineered to direct the Cas9 nuclease to cleave complementary genomic sequences when followed by a 5'-NGG protospacer-adjacent motif (PAM) in eukaryotic cells (Mali, et al., Nat. Methods, 10:957-963 (2013); Cong, et al., Science, 339:819-823 (2013); Mali, et al., Science, 339:823-826 (2013)). Since gene targeting by CRISPR/Cas9 is directed by base pairing, such that only the short 20-nt sequence of the sgRNA needs to be changed for different target sites, CRISPR/Cas systems enable simultaneous targeting of multiple deoxyribonucleic acid (DNA) sequences and robust gene modification (Jinek, et al., Science, 337:816-821 (2012); Mali, et al., Nat. Methods, 10:957-963 (2013); Cong, et al., Science, 339:819-823 (2013); Yang, et al., Cell, 154:1370-1379 (2013); Xie, et al., Mol Plant, 6 (2013); Hwang, et al., Nat. Biotechnol., 31:227-229 (2013); Cho, et al., Nat. Biotechnol., 31:230-232 (2013); Li, et al., Nat. Biotechnol., 31:681-683 (2013); Shan, et al., Nat. Biotechnol., 31:686-688 (2013)).
  • Endogenous DNA sequences followed by a PAM sequence can be targeted for cleavage by designing a ~20-nt sequence of the sgRNA complementary to the target.
  • other sequences in the genome may also be cleaved non-specifically, and such off-target cleavage by CRISPR/Cas systems remains a major concern.
  • there is a partial match between the on- and off-target sites and the differences between the on- and off-target sequences can be grouped into three cases: (a) same length but with base mismatches; (b) off-target site has one or more bases missing ('deletions'); (c) off-target site has one or more extra bases ('insertions').
  • off-target effects may limit the applications of Cas9-mediated gene modification, especially in large mammalian genomes that contain multiple DNA sequences differing by only a few mismatches.
  • One report revealed that 99.96% of the sites previously assumed to be unique Cas9 targets in human exons may have potential off-target sites containing a functional (NAG or NGG) PAM and one single-base mismatch compared with the on-target site (Mali, et al., Nat.
  • Examples 3-8 examine the above-mentioned cases (b) and (c) of potential CRISPR/Cas9 off-target cleavage in human cells by systematically varying sgRNAs at different positions throughout the guide sequence to mimic insertions or deletions between off-target sequences and RNA guide strand. To avoid confusion, for single-base insertions, a 'DNA bulge' was used to represent the extra, unpaired base in the DNA sequence compared with the guide sequence. Similarly, for single-base deletions, an 'RNA bulge' was used to represent the extra, unpaired base in the guide sequence compared with the DNA sequence (Figures 8A-8B).
  • RNA-guided Cas9 at endogenous loci in HEK293T cells transfected with plasmids encoding Cas9 and sgRNA variants was quantified as the mutation rates induced by Non-Homologous End Joining (NHEJ).
  • NHEJ Non-Homologous End Joining
  • Cas9-mediated mutagenesis was also examined at 114 potential off-target loci in the human genome carrying single-base DNA bulges or sgRNA bulges together with a range of base mismatches, and the results confirmed 15 off-target sites with mutation frequencies up to 45.5%.
  • the results illustrate the need to search for genomic sites with base-pair mismatches, insertions and deletions compared with the guide RNA sequence in analyzing
  • off-target sites with DNA bulges may also be interpreted as sequences having various base mismatches with guide sequence and/or PAM (Figure 11A-11B).
  • the sgRNA-DNA interfaces corresponding to removing 5'-end bases in the guide sequences can be viewed as having DNA bulges or having mismatches in the 5'-end region of sgRNA, which have been shown to be better tolerated compared to the 3'-end region (Cong, et al., Science, 339:819-823 (2013); Fu, et al., Nat.
  • the Cas9 cleavage activities induced by these guide strands may be interpreted as tolerance of base mismatches at the 5'-end of the guide RNA.
  • the position- 1 variant of R-30 results in a shift in the adjacent PAM from GGG to CGG (another canonical PAM), which could explain why the activity of this guide sequence variant was similar to the original R-30.
  • the cleavage activity induced by the R-01 variant at position 2/1 may be alternatively interpreted as Cas9 cleavage with a GTG PAM ( Figure 9B-9C and Figure 11A), which is highly unlikely according to previous studies (Hsu, et al., Nat. Biotechnol, 31 :827-832 (2013), Pattanayak, et al., Nat.
  • a R-30 guide strand variant at position 11 would contain at least seven mismatches if modeled without a bulge.
  • This guide strand resulted in a 1.8-fold higher cleavage activity compared to the original R-30 (Figure 10B-10C and Figure 11B), which cannot be readily explained by the high level of base mismatches (which should prohibit cleavage), and thus should be attributed to the tolerance of DNA bulges.
  • Nucleotide additions in sgRNA sometimes created consecutive identical nucleotides, such as adding a G before or after position 14 of R-01 or before or after position 15 of R-30. These sgRNA variants model a G-bulge that can be at either position in the sgRNA (Figure 13A-14B). In many cases sgRNA bulges with a single U gave rise to high nuclease activities. Among all sgRNA variants with activities higher than the original sgRNAs, ~71% (5/7) were targeted to the loci with a U-bulge. Overall, single-base sgRNA bulges induced higher Cas9 cleavage activities at many more positions than single-base DNA bulges did. This is not surprising since RNA molecules are more flexible than DNA molecules, and thus incur a smaller binding energy penalty with single-base RNA bulges, resulting in a higher tolerance (Alberts, et al., Garland Science (2007)).
  • RNA-DNA interfaces with single-base RNA bulges can also be viewed as sequences with various mismatches in the guide sequence and PAM (Figure 15A-15B). Specifically, sgRNA bulges at the 5'-end of guide RNA sequences (e.g.
  • U+20/19 for R-01 and R-30 interfaces can be alternatively viewed as having one to a few base mismatches with the 3'-end of DNA sequences (Figure 15A-15B), which are often tolerated, similar to deletions of 1-2 bp at the 5' end of guide strands (Figure 12A-12B).
  • sgRNA bulges close to the 3'-end of guide sequence can be alternatively viewed as having base mismatches in the 3'-end region, including those at the third base of PAM (R-30 variants) (the last six variants in Figure 15B).
  • Example 5 GC (guanine-cytosine) content of sgRNAs affects the tolerance of single-base sgRNA bulges
  • The specificity profile (location and level of off-target cleavage) of R-01 variants is substantially different from that of R-30 variants.
  • R-30 which showed a higher level of tolerance to DNA and RNA bulges than R-01, has a GC content of 70%, whereas R-01 has a GC content of 50%. It was hypothesized that the GC content of guide strands R-01 and R-30 played a significant role in causing this difference. To investigate this hypothesis, two additional sets of guide strands targeted to HBB and CCR5 genes, respectively, were tested with different GC contents compared to R-01 and R-30 (Table 10).
  • Table 10 Target sites, cleavage activities (% indels by T7EI assay) and GC contents of different guide strands targeted to HBB and CCR5 genes.
  • R-08 has a moderately higher GC content compared to R-01 (65% compared to 50%), whereas the GC content of R-25 is half of that of R-30 (35% compared to 70%).
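The GC percentages compared above follow from a simple base count. A minimal sketch of that calculation is given here; the function name `gc_content` and the example sequence are illustrative assumptions and are not taken from Table 10.

```python
def gc_content(guide: str) -> float:
    """Return the percentage of G and C bases in a guide sequence."""
    seq = guide.upper()
    if not seq:
        raise ValueError("empty guide sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


# Hypothetical 20-nt sequence, used only to exercise the function;
# it is NOT one of the guide strands listed in Table 10.
example = "GACGTTACCGGATTACGCAT"
print(f"{gc_content(example):.0f}% GC")
```

For a 20-nt guide, each G or C contributes 5 percentage points, so a 70% GC guide such as R-30 carries 14 G/C bases and a 35% GC guide such as R-25 carries 7.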
  • Cas9 induced cleavage with sgRNA variants of R-08 and R-25 was individually tested to quantify the bulge tolerance in HEK 293T cells.
  • R-25 which contains a low percentage of GC
  • all R-25 variants tested showed non-detectable activities using the T7EI assay (Table 5).
  • cleavage activities were observed at more positions compared with R-01 ( Figure 16B-16D).
  • Example 6 sgRNA variants containing 2- to 5-bp bulges induce Cas9 cleavage
  • Example 7 sgRNA variants containing single-base bulges can mediate cleavage by paired Cas9 nickases
  • Paired Cas9 nickases were developed to generate DNA double-strand breaks by inducing two closely spaced single-strand nicks using an appropriately designed pair of guide RNAs (Mali, et al., Nat. Biotechnol, 31 :833-838 (2013); Ran, et al., Cell, 154: 1380-1389 (2013)).
  • This strategy may lower the off-target cleavage, as double stranded breaks (DSBs) could occur only when both guide RNAs of the pair induced two nicks adjacent to each other at roughly the same time.
  • Assays were designed to test if paired Cas9n systems can tolerate bulges by using one bulge-forming guide variant paired with a perfectly matched guide strand.
  • the paired Cas9 nickases with single sgRNA bulges showed activities comparable to Cas9 system having one bulge in R-01; however, for DNA bulges, the activities of paired Cas9 nickases were >2-fold higher than that of Cas9.
  • Example 8 Cas9 cleavage occurs at genomic loci with both base mismatches and DNA or sgRNA bulges
  • Off-target sites in the human genome were identified using TagScan (https://www.isrec.isb-sib.ch/tagger), a web tool providing genome searches for short sequences (Iseli, et al., PLoS One, 2:e579 (2007)). Guide sequences containing single-base insertions (represented with an 'N' in the sequence) and single-base deletions at different positions were entered, followed by the PAM sequence 'NGG'. Off-target sites were alternatively searched for using the recently developed bioinformatics program COSMID that can identify potential off-target sites due to insertions and deletions between target DNA and guide RNA sequences (disclosed herein). Primers were individually designed to amplify the genomic loci identified in the output.
  • HEK 293T cells were transfected with 750 ng sgRNA variants, as described above. Each sgRNA was transfected as biological triplicates in three separate wells and processed independently. Total RNA was isolated from cells using the RNeasy kit (Qiagen). Extracted RNA was reverse-transcribed using the iScript cDNA
  • GAPDH glyceraldehyde-3-phosphate dehydrogenase
  • PCR reactions for each locus were performed independently for eight touchdown cycles in which annealing temperature was lowered by 1°C each cycle from 65 to 57°C, followed by 35 cycles with annealing temperature at 57°C.
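The touchdown program described above can be written out as a per-cycle list of annealing temperatures. The sketch below assumes one reading of the text: eight cycles stepping from 65°C down to 58°C, after which the 35 remaining cycles anneal at 57°C. The function name and its parameters are illustrative.

```python
def touchdown_schedule(start=65, final=57, step=1, final_cycles=35):
    """List the annealing temperature (deg C) of every PCR cycle.

    Assumption: the touchdown phase covers start, start-step, ... down to
    (but not including) `final`; the constant phase then runs at `final`.
    """
    touchdown = list(range(start, final, -step))  # 65, 64, ..., 58
    return touchdown + [final] * final_cycles


schedule = touchdown_schedule()
print(len(schedule), "cycles in total")
print("touchdown phase:", schedule[:8])
```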
  • PCR products were purified using Agencourt AMPure XP (Beckman Coulter) following manufacturer's protocol. The second PCR amplification was performed for each individual amplicon from first PCR using primers containing the adapter sequences from the first PCR, P5/P7 adapters and sample barcodes in the reverse primers (Table 11). PCR products were purified as in first PCR, pooled in an equimolar ratio, and subjected to 2 x 250 paired-end sequencing with an Illumina MiSeq.
  • Paired-end reads from MiSeq were filtered by an average Phred quality (Q score) greater than 20 and merged into a longer single read from each pair with a minimum overlap of 10 nucleotides. Alignments were performed using Burrows-Wheeler Aligner (BWA) for each barcode (Li, et al., Bioinformatics, 26:589-595
  • Deep sequencing was performed at 55 putative off-target sites corresponding to single-base sgRNA bulges and 21 sites corresponding to single-base DNA bulges.
  • the sites were amplified from genomic DNA harvested from HEK 293T cells transfected with Cas9 and sgRNAs.
  • the 55 sites with sgRNA bulges contain 35 sites tested in the preliminary T7EI assay, and the 21 sites with DNA bulges include seven sites tested in the T7EI assay.
  • Putative bulge-forming loci containing one to three PAM-distal mismatches were chosen, since sites associated with a bulge without any base mismatch were not found.
  • Examples 3-8 show that CRISPR/Cas9 systems can have off-target cleavage when DNA sequences have an extra base (DNA bulge) or a missing base (sgRNA bulge) at various locations compared with the corresponding RNA guide strand.
  • sgRNA bulges of up to 4-bp could be tolerated by CRISPR/Cas9 systems ( Figures 17A-17B).
  • the correlation between cleavage activity and the position of DNA bulge or sgRNA bulge relative to the PAM appears to be locus and sequence dependent when comparing the specificity profiles of guide sequences R-01 and R-30.
  • guide strand R-30 (70% GC) showed the highest tolerance to sgRNA and DNA bulges among the four guide strands tested (R-01, R-08, R-25 and R-30), while guide strand R-25 (35% GC) does not seem to tolerate any bulges.
  • bulges in the PAM distal or PAM proximal regions can reflect either mismatch tolerance or RNA/DNA bulge tolerance.
  • some of the potential off-target sites identified may overlap with a search considering bulges.
  • the mismatch and bulge-containing sites should be tested for off-target cleavage, a better understanding of the bulge tolerance as well as the difference in the mechanisms underlying these two scenarios is needed.
  • One study revealed that a Cas9 ortholog from Streptococcus thermophilus has a PAM located 2 bps downstream of the protospacer (Chen, et al., J Biol.
  • the cleavage resulting from the variant R-01 -2/1 may reflect the tolerance of a linker between the target sequence and PAM instead of a DNA bulge.
  • Cas9 cleavage with RNA or DNA bulges in the middle of the target sequence may reflect only the bulge tolerance.
  • Bulge-forming sgRNA variants may be more effective than regular sgRNAs in creating larger deletions that might be preferred in certain applications, such as targeted disruption of genomic elements. These larger deletions may also occur at off-target loci, which strengthens the need to include them in genomic searches.
  • Three types of indel query are allowed: (i) the number of mismatches with no insertion or deletion (No indels); (ii) the number of mismatches in addition to a single-base deletion (Del); and (iii) the number of mismatches in addition to a single-base insertion (Ins). Up to three mismatches without indels, and up to two mismatches together with a one-base insertion or deletion could be chosen.
  • primer design parameter settings and parameter templates should also be entered (Figure 25A).
  • PAM variants such as NRG can be entered in the suffix box, as well as other PAM sequences (Fischer, et al., J Biol Chem, 287:33351-33363 (2012)).
  • the spacer (Ns) and required nucleotides are entered into the suffix box, such as "NNNNGATT" (Hou, et al., Proc Natl Acad Sci USA, 110: 15644-15649 (2013)), and include genomic sites with any nucleotide at the N positions in the output.
  • COSMID constructs a series of search entries according to the user-specified guide strand and search criteria (Figure 25B).
  • the search entries include all insertions and deletions at each possible location (Figure 25C), and are subsequently used to perform rapid and accurate searches of the entire sequence of the genome of interest, while allowing for the user-specified number of mismatches. These searches took ~4 seconds without primer design (Figure 26A-26G).
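The construction of search entries described above can be sketched as follows. This is an illustration under stated assumptions rather than COSMID's actual code: the function name `indel_search_entries` is hypothetical, an inserted genomic base is represented by an `N` (as in the TagScan searches of Example 8), insertions are placed only between guide bases, and a `set` is used so that identical entries arising from consecutive repeated bases are kept once.

```python
def indel_search_entries(guide: str, pam: str = "NGG") -> dict:
    """Build 'No indel', 'Del' and 'Ins' query strings for one guide strand."""
    guide = guide.upper()
    n = len(guide)
    return {
        # Same length as the guide; only base mismatches are allowed later.
        "No indel": {guide + pam},
        # Genomic site is one base shorter: drop each guide base in turn.
        "Del": {guide[:i] + guide[i + 1:] + pam for i in range(n)},
        # Genomic site is one base longer: place an N between guide bases
        # (internal positions only, an assumption of this sketch).
        "Ins": {guide[:i] + "N" + guide[i:] + pam for i in range(1, n)},
    }


# Hypothetical 4-nt "guide" with a repeated base, to show de-duplication.
entries = indel_search_entries("AACG")
for query_type, queries in entries.items():
    print(query_type, sorted(queries))
```

Deleting either of the two leading A bases yields the same entry, so the four possible deletions collapse to three distinct queries.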
  • COSMID only allows searches for single-base insertions and deletions in the DNA sequence compared with the guide strand ( Figure 25 A).
  • the search algorithm allows some ambiguities (such as N for any nucleotide).
  • Ambiguities included in the search string are marked in red in the HTML results (as are mismatches and indels), but are not counted toward the user-specified mismatch limits.
  • ambiguities allows the inclusion of the matching genomic base with the output sequences.
  • One possibility is to include an "N" in positions that can have substitutions, such as the first base in a guide strand that is often a G primarily to aid in transcription, but does not need to match the complementary target sequence (Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013); Mali, et al., Science, 339:823-826 (2013)).
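The treatment of ambiguity codes described above, where an ambiguous query position is reported but not counted toward the mismatch limit, can be illustrated with a small comparison routine. The code table covers only a few IUPAC symbols, and the function name `count_mismatches` is an assumption of this sketch, not part of COSMID or FetchGWI.

```python
# Subset of the IUPAC nucleotide codes: query symbol -> bases it accepts.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}


def count_mismatches(query: str, hit: str) -> int:
    """Count positions where the genomic base is not accepted by the query.

    A position holding an ambiguity code is not a mismatch as long as the
    genomic base is one of the bases that the code stands for.
    """
    if len(query) != len(hit):
        raise ValueError("query and hit must be the same length")
    return sum(1 for q, h in zip(query.upper(), hit.upper())
               if h not in IUPAC[q])


# Hypothetical query with a 5' N and an NGG suffix against a made-up hit.
print(count_mismatches("NACGTNGG", "TACCTAGG"))
```

Only the G-to-C difference at the fourth position counts here; the two N positions accept any genomic base.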
  • COSMID outputs all genomic sequences that match the user-supplied search criteria in comparison with the entered guide strand.
  • the first column of the HTML output shows the genomic sequence ("hit") aligned to the query sequence with matches shown in black. Nucleotides that are not a direct match, including mismatches, insertions, and deletions, are shown in red (Table 12). Ambiguities in the query sequence, such as the N in the PAM sequence NGG, are also shown in red, though they do not count as mismatches.
  • the second column lists the query type, including (i) no deletion or insertion (No indel), (ii) deletions (Del), or (iii) insertions (Ins).
  • This column indicates if there are insertions or deletions, and specifies the indel positions as the number of nucleotides away from the PAM.
  • the third column lists the number of mismatched bases between the query and target sequences. When two repeated bases appear in the guide strand, a deletion of either one of them in the target sequence gives the same query sequence, so the ambiguity is noted in the query column.
  • the fourth column indicates if the PAM in the hit ends in G, as NGG is the Cas9 PAM with the highest activity, followed by NAG (Hsu, et al., Nat Biotechnol, 31 : 827-832 (2013)). This column helps in ruling out genomic sites with unlikely PAMs. This function must be added to the excel spreadsheet for other PAMs.
  • the fifth, sixth, and seventh columns contain respectively the chromosomal location of the matching sequence, its strand and the chromosomal location of the cleavage site.
  • the predicted cleavage position is based on the fact that Cas9 primarily cleaves both DNA strands three nucleotides from the PAM (Jinek, et al., Science, 337: 816-821 (2012)).
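The cleavage-site column can be derived from the hit coordinates and strand using the three-nucleotide rule stated above. The sketch below assumes 1-based inclusive coordinates on the forward strand for a hit that includes the PAM, and returns the forward-strand position immediately before the cut; the coordinate convention actually used in COSMID's output is not specified here, so both the convention and the function name are assumptions.

```python
def cleavage_position(start: int, end: int, strand: str,
                      pam_len: int = 3, offset: int = 3) -> int:
    """Return the forward-strand position just before the predicted cut.

    start, end: 1-based inclusive coordinates of the hit (guide match + PAM).
    On '+', the PAM occupies the last pam_len bases of the hit;
    on '-', it occupies the first pam_len bases in forward coordinates.
    """
    if strand == "+":
        return end - pam_len - offset
    if strand == "-":
        return start + pam_len + offset - 1
    raise ValueError("strand must be '+' or '-'")


# Hypothetical 23-bp hit (20-nt protospacer + 3-nt PAM) at positions 101-123.
print(cleavage_position(101, 123, "+"))  # cut falls between 117 and 118
print(cleavage_position(101, 123, "-"))  # cut falls between 106 and 107
```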
  • the HTML links included in the COSMID output are directed to the chromosomal sites in the UCSC genome browser. This allows determination of the gene that best matches the target sequence and if the target site is in an exon, intron, or other region. This information is helpful as mutations may be better tolerated in regions that are noncoding and nonfunctional.
  • the output is grouped by query types, including (i) genomic sites with base mismatches, but no insertions or deletions (No indels), (ii) sites with deletions (Del), and (iii) sites with insertions (Ins) between the query and potential off-target sites (Table 12). Within each category, sites with mismatches further from the PAM are listed first, which are more likely to result in off-target cleavage (Fu, et al., Nat Biotechnol, 31:822-826 (2013); Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013)).
  • the same genomic location may satisfy two or more search criteria, such as those sites that satisfy the mismatched base limit without and with an insertion or deletion. For example, mismatches at the base farthest from the PAM and deletions of this base will give the same set of genomic locations. This can also occur when the guide strand contains consecutively repeated bases. Since genomic locations can be specified through multiple criteria (examples shown in Figures 28A and 28B), they are listed in each of the
  • Duplicate sites can be removed in the spreadsheet, as described below.
  • COSMID also outputs the potential off-target sites identified in a spreadsheet to allow for further processing, such as sorting by attributes or adding weight matrixes to rank the most likely off-target sites.
  • the accumulation of additional experiments on CRISPR off-target activity will allow creation of a more predictive scoring system.
  • COSMID's primer design function is used to assay for off-target cleavage after cells or animals are treated with CRISPR guide strands and nuclease. Primers are designed that fit the criteria needed for the particular assay or sequencing platform using an automated primer pair design process, not found in other CRISPR programs.
  • the algorithm was developed for the zinc finger nucleases and TAL effector nucleases off-target search program PROGNOS and found to give a single specific band in ~93% of amplifications (Fine, et al., Nucleic Acids Res, 42:e42 (2013)).
  • the automated primer design alleviates the need for the iterative steps of primer design and verification of the resulting fragment sizes, which slow primer design, especially for mutation detection assays where the cleavage product sizes determine how easily the cleavage bands can be distinguished on gels.
  • the recommended parameters for use in Surveyor assays resolved on 2% agarose gels are: Minimum Distance Between Cleavage Bands— 100 bp, Minimum Separation Between Uncleaved and Cleaved
  • SMRT real-time sequencing
  • the recommended parameters are: Minimum Distance Between Cleavage Bands— 0, Minimum Separation Between Uncleaved and Cleaved Products— 125 bp.
  • the output primers can be easily modified in the spreadsheet, such as to add flanking sequences for additional amplification and/or barcodes for sequencing.
  • the COSMID algorithm is based on sequence homology; it searches a genome of interest for sites similar to CRISPR guide strands using the efficient FetchGWI search program that has powered search tools including TagScan and ZFN-site (Cradick, et al., BMC Bioinformatics, 12:152 (2011)).
  • FetchGWI operates on indexed genome sequences that are precompiled and stored ( Figures 26A-26G). It can identify genomic locations with sequences that match any of the series of search entries.
  • FetchGWI saves run time by searching indexed files that represent the genome sequences, rather than the sequences themselves. There is one index entry for each nucleotide in the genome, which allows a rapid and exhaustive search. This is a key advantage of COSMID over BLAST and other programs that scan
  • COSMID currently allows searching the human, mouse, Caenorhabditis elegans, and rhesus macaque genomes.
  • COSMID is a CRISPR off-target search tool with a web interface that allows directed and exhaustive genomic searches to identify potential off-target sites for guide strand choice or experimental validation.
  • a user chooses the genome of interest from the list, and enters the guide strand and PAM sequences (Figure 25A).
  • a user can choose to include (i) ≤2 base mismatches with an insertion and/or deletion, or (ii) ≤3 base mismatches without any indels (Figure 25A).
  • the user has the option to have primers as part of the output.
  • Primers are designed by COSMID that are optimized to the specified criteria or to the defaults given for particular applications (Figure 25 A).
  • COSMID exhaustively scans the genome based on these input parameters ( Figure 25B), allowing consideration of mismatches, insertions, and/or deletions ( Figure 25C, Figure 26A-26G).
  • COSMID outputs a ranked list of perfectly matched (on-target site and possibly other sites) and partially matched (potential off-target) sites in the genome, their ranking score, along with reference sequences and primer designs that can be used for sequencing and/or mutation detection assays (Table 12).
  • Each line of the output file describes one genomic locus matching the search criteria. A locus may appear on multiple lines if it can be modeled and found in multiple ways.
  • An exemplary COSMID Output includes the following text, a hyperlink for viewing the raw search results in a txt file and Table 12.
  • Table 12 shows an exemplary COSMID output in HTML and includes the genomic sites matching the user-supplied criteria in comparison to guide strand R-01 with chromosomal location. Scoring of the mismatches is provided for ranking, as are PCR primers and reference sequence. The right primers, in silico link, amplicon, and digest sizes are provided in the output, but not shown here. Links are provided to each location in the UCSC genome browser, and to the output file as a spreadsheet for further manipulation and primer ordering.
  • Each hit is appropriately aligned to the query shown in the "Result" box (Table 12).
  • DNA bases corresponding to mismatches, indels, and ambiguity codes, such as N, are shown in the query line to identify the matching genomic bases.
  • To the right of the "Result" box are boxes with the query type, number of mismatches, chromosomal position, score, primers, and other features.
  • the web page showing COSMID output also includes links to test each primer pair and to reformat the output file as text or in a spreadsheet.
  • the spreadsheet output allows thorough evaluation of the number and scores of the low-scoring sites that are predicted to be more likely off-target sites, which may provide important guidelines when evaluating and choosing guide strands and/or testing for true cleavage events using DNA samples from cells after
  • COSMID uses the TagScan algorithm to minimize run times while still performing exhaustive genome searches (Iseli, et al., PLoS One, 2:e579 (2007)). With the primer design option off, the run times averaged 4 seconds for the guide strands without indels (Table 13).
  • Run times were measured for COSMID using variations of guide strands R-01 and R-30, with and without a 5'G, using standard (NGG) or relaxed PAM (NRG). All runs included sites matching the guide strand with three or fewer mismatches without indels. More matching loci ("hits") were identified by allowing single-base insertions or deletions together with ≤2 base mismatches.
  • Allowing insertions or deletions in addition to mismatches increases run time. For example, searching with a 19-nt guide strand and an NRG PAM, and including two mismatches with either an insertion or a deletion, resulted in run times averaging 42 seconds for R-01 and 36 seconds for R-30. The run times for the search with three mismatches without insertions or deletions were similar. Including primer design increased the run times in proportion to the number of primer sets and reference sequences returned.
  • Figures 26A-26G and Table 14 illustrate exemplary search string processing by COSMID, including examples showing the input, and portions of the web results and spreadsheet output for a search of the human genome using guide strand R-01.
  • the genome of interest is chosen from the Target Genome list ( Figure 26 A).
  • the target sequence is entered into the Query Sequence box ( Figure 26B).
  • the required protospacer adjacent motif (PAM) is entered into the 'Add suffix' Box of the Search Options section ( Figure 26C).
  • the spacers (Ns) and required bases are included, such as NGG or NRG.
  • the boxes in the 'Allowed indels and mismatch' of the Search Options section are checked to indicate if genome sites to be searched include genomic sites that have No indels (with ≤3 mismatches but the same length), have 1-base Del (are 1-base shorter), or have 1-base Ins (are 1-base longer) (Figure 26C).
  • Primer design parameters are set by pressing the button for 'Default', 'Illumina 250', 'Illumina 250 paired', 'SMRT' or 'enzyme' (when using other enzymes). Any of the parameters can be entered by hand to further customize.
  • the genwin program was used to transform the DNA sequence from FASTA formatted files into unsorted index entries, which have all possible 25 bases-long tags in the DNA sequence.
  • the sortGWI program was used to sort the index entries, and store the result as a binary index file.
  • sortGWI subdivided the whole index file into 16,777,216 parts, each representing entries having identical first 12 nucleotides.
  • a secondary index recording the position in the main index file where each part starts, was added to the end of the index file to enable faster search and reduce file size.
  • the index files are stored in the COSMID server.
  • the sequence tags in COSMID are used to generate a series of additional tags that contain indels if the insertion or deletion boxes are checked. Duplicate tags, which arise for strings containing consecutive identical bases, are removed. The resulting tags are all searched against the user-selected genome. For example, if guide strand R-01 is entered, the tags illustrated in Figures 26E and 26F are generated and used to search the human genome.
  • the FetchGWI program is used. If the user specifies a search with one or more mismatches, FetchGWI generates all possible sequence tags by replacing the specified number of nucleotides with all other possibilities. After that, FetchGWI sorts all the query tags and searches for matches in the index file using binary search. FetchGWI reports the search results by appending the actual sequence tag found, along with the accession number and position offset within the sequence, for each matched query tag.
  • For each match that FetchGWI finds, COSMID generates a score that reflects the empirical expectation of how likely it is to be an off-target site.
  • COSMID web output includes links for html, txt and excel files (Figure 26G). Links are provided to test each primer pair using the UCSC in-silico PCR web site.
  • the excel output is sorted for unique sites with the lowest mismatch and indel score to locate the most likely off-target sites.
  • the Score+ column contains a ranking to place NGG ahead of NAG sites (+0.3 points added to the COSMID default scoring).
  • the second column represents the query type, then the chromosomal location, the ranked number and a grid showing the mismatches, insertions and deletions (Table 14). Different sections of the output are illustrated in Table 14.
  • Example 10 COSMID searches and identifies putative off-target cleavage sites Materials and Methods
  • the on- and off-target cleavage activity of Cas9 and guide strand R-01 was measured using the mutation rates resulting from the imperfect repair of double-stranded breaks by non-homologous end joining.
  • An Amaxa Nucleofector 4D was used to transfect 200,000 K-562 cells with 1 µg px330 expressing R-01 sgRNA, following the manufacturer's instructions.
  • the genomic DNA was harvested after 3 days using QuickExtract DNA extraction solution (Epicentre, Madison, WI), as described (Guschin, et al, Methods Mol Biol, 649: 247-256 (2010)).
  • On- and off-target loci were amplified using AccuPrime Taq DNA Polymerase High Fidelity (Life
  • This guide strand was shown to have on-target cleavage at beta-globin and off-target cleavage at delta-globin,24 so a range of off-target sites were chosen, including two pairs of identical sites (OT6-OT7 and OT8-OT9) and five identical sites (OT1-OT5) to test for off-target mutations and evaluate the role of genomic context on cleavage and mutation rates. It is hoped that increased cellular data, such as provided in ENCODE for some cell lines, may prove useful in this regard.
  • the nucleotides in position 20 and in the first position of the NGG PAM are lowercase, as there are not mismatches at these positions.
  • Table 16 lists these eight experimentally validated off-target sites in decreasing order of mutation rate (%), their ranking by COSMID, as well as that by other on-line CRISPR tools.
  • Table 16 Comparison of COSMID with other available tools in predicting off-target sites with two mismatches for guide strand R-01.
  • cleavage rates at R-01 on-target site and off-target sites OT1-OT10 are listed by decreasing T7EI activity in Table 16.
  • OT3 and OT9 had activities below T7EI detection limit.
  • Annotated genes corresponding to the sites are listed.
  • Off-target analysis was performed with different online search tools. If the genomic sites with measurable T7EI activity (Figure 27) were identified by a specific tool (such as Cas OFFinder), their rankings in its output (if sortable) are shown. Sites not in the output of that tool are indicated by a dash in a grey box (e.g., R01 OT1 under "Cas OFFinder").
  • the output from COSMID was also compared with the output from other web tools for their ability to identify off-target sites that contain an extra base (DNA bulge) or a missed base (RNA bulge) relative to the complementary genomic DNA sequence (Lin, et al, Nucleic Acids Res, 42:7473-7485 (2014)) (Table 17).
  • the off-target sites in Table 17 might also be modeled as sites with four mismatches or noncanonical PAMs compared with the on-target site, though it is less likely that binding of Cas9 would occur without an NGG or NAG PAM.
  • the columns corresponding to the individual tools follow from Table 16, above.
  • R30_Ins9 where the additional G in the genomic sequence might be the first, second, or third of the three adjacent Gs, at locations 2, 3, or 4 nucleotides from the PAM (Table 18).
  • Table 17 Comparison of search results for sequence-verified off-target sites that contain insertions or deletions and that can also be modeled as loci with four mismatches or an alternate PAM.
  • Genomic sequences of the off-target sites are given, together with the number of mismatches, bulge type (guide bulge or gDNA bulge) and bulge position relative to PAM. *gDNA mismatches compared to guide strand are shown by alignment;
  • deletions are underlined, and deletions (guide bulge) are represented as dashes.
  • the first nucleotide in PAM is in lower case.
  • this off-target site can be modeled as having three mismatches with a shift in the PAM from NGG to NAG.
  • the off-target site R01_Ins1 may be modeled as having a NAG PAM. Without a bulge, R30_Ins14 would need to have the unlikely GTA PAM, so it remains unclear how it was modeled by Cas Online Designer.
  • Each site in Tables 17 and 19 is marked "yes" when found by COSMID (first column) or another search method; if any of the confirmed off-target sites could not be identified by a search tool, it is shown as a box with a dash.
  • Table 19 The sequence-verified off-target sites with insertions or deletions that cannot be modeled as four mismatches or alternate PAM can only be predicted by COSMID.
  • COSMID has better ability in identifying off-target sites with indels.
  • COSMID provides exhaustive genomic searches for off-target sites due to
  • COSMID In addition to providing optimized primer designs for sequencing and mutation detection for confirming putative off-target sites, COSMID also provides the reference sequence to facilitate sequencing.
  • the reference sequence and knowledge of the cut site location facilitates mutation detection assays, including surveyor and T7EI, and possibly other uses, such as searching for restriction sites that may overlap the cut site.
  • search results for two guide strands were compared with validated activity and known off-target cleavage, including the guide strand R-01 that targets the human HBB gene, and the guide strand R-30 (GTAGAGCGGAGGCAGGAGC) that targets the human HIV co-receptor CCR5 gene (Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013); Lin, et al, Nucleic Acids Res, 42:7473-7485 (2014)).
  • COSMID searches were compared with the output given by other existing search tools.
  • off-target sites contain insertions or deletions in addition to mismatches
  • only COSMID searches could identify all of the 10 sequence-validated off-target sites (Tables 15, 16, and 17).
  • the deletion contained in off-target sites R-01_Del1 or R-30_Del1 (Table 17) could be modeled as four mismatches, and the insertion in off-target sites R-01_Ins1, R-30_Ins9, or R-30_Ins14 (Table 17) could be modeled as having alternative PAMs.
  • the number of putative genomic off-target sites output by COSMID increased drastically when indels were allowed in the search. For example, allowing one-base insertions together with two mismatches increased the number of genomic sites adjacent to a NAG or NGG PAM ~3 and ~7 times for R-01 and R-30, respectively, compared with searches allowing two mismatches without indels (166 versus 49 for R-01 and 224 versus 34 for R-30, Table 20).
  • Table 20 Comparison of search results for guide strands R-01 and R-30 with deletion or insertion permitted.
  • COSMID is 333 for R-01 and 761 for R-30 (Table 21).
  • Table 21 Off-target loci when a one-base deletion was allowed in addition to ≤2 mismatches.
  • NRG PAM located 1,040 unique putative off-target sites for R-01 and 1,218 for R-30.
  • There were many identical sites located by multiple query types (examples shown in Figures 28A and 28B).
  • the results varied between the two guide strands R-01 and R-30 (each targets a coding sequence), as can be expected in a nonrandom genome (Figures 29A-29D).
  • R-01 had a markedly larger number of matching sites with no indels. Of note was a particular 3-mismatch hit in 69 sites.
  • identifying off-target cleavage by CRISPR/Cas9 systems in a genome of interest is important, especially in treating human disease and creating model organisms, as CRISPR off-target cleavage (Fu, et al., Nat Biotechnol, 31 : 822-826 (2013); Hsu, et al., Nat Biotechnol, 31 : 827-832 (2013)) can result in mutations, deletions, inversions, and translocations (Cradick, et al., Nucleic Acids Res, 41 :9584-9592 (2013); Xiao, et al., Nucleic Acids Res, 41 :e141 (2013)), inducing detrimental biological consequences and potentially causing disease.
  • COSMID can quickly and exhaustively search a genome for DNA sequences that partially match the target sequence of the guide strand, but contain insertions or deletions in addition to base mismatches. As shown in Table 21, a large number of potential off-target sites would be missed using search tools that only consider base mismatches, but not insertions or deletions.
  • COSMID outputs potential off-target sites ("hits") corresponding to allowed mismatches and indels, the PAM sequence and the chromosomal location of the hits. COSMID also outputs primer designs for experimental validation of the off-target sites. Further processing of the COSMID results from the output spreadsheets extends COSMID's utility to different
  • CRISPR/Cas platforms including the use of Cas9 nickase pairs (Ran, et al., Cell, 154: 1380-1389 (2013)), Cas9/FokI fusion (Tsai, et al, Nat Biotechnol, 32:569-576 (2014); Guilinger, et al, Nat Biotechnol, 32: 577-582 (2014)), and multiplexed targeting (Cong, et al., Science, 339: 819-823 (2013)) by searching for multiple (sometimes paired) sites within a user-input chromosomal proximity.
  • COSMID can be used to identify potential off-target sites of CRISPR activators, repressors, or other effector domains (Cheng, et al, Cell Res, 23: 1163-1171 (2013)).
  • the on-target and potential off-target sites given in the COSMID output can be tested experimentally using mutation detection assays (Guschin, et al., Methods Mol Biol, 649: 247-256 (2010)) or deep sequencing with genomic DNA harvested from cells treated by CRISPR/Cas. Mutation detection assays, including Surveyor and T7EI, are very commonly used to measure on- and off-target cleavage and
  • COSMID facilitates these assays by automatically designing primers to enable facile gel separation of the uncleaved and cleavage bands.
  • the output also includes the genomic reference sequence for comparison to the sequencing results.
  • COSMID scores the potential off-target sites based on the number and location of base mismatches, allowing ranking of the more likely off-target sites.
  • Bioinformatics based ranking of CRISPR/Cas off-target sites may be influenced by the effects of genomic context and DNA modifications. As exemplified herein, identical genomic sites and duplicated sites may have differences in off-target activity. The indel rate at off-target site R-01 OT2 was 44%, though other loci with the same complementary sequence have much less, or no activity, possibly due to nuclease blocking. It is believed that incorporating parameters such as the effects of chromatin condensation, DNA availability and other factors into the COSMID search algorithm will improve the scoring and ranking of the target sites.
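
The search step described in the list above (expanding a query tag into every sequence within the allowed number of mismatches, sorting the query tags, and locating each one in a sorted index by binary search) can be sketched roughly as follows. This is a minimal illustration and not the FetchGWI implementation: the function names `mismatch_variants` and `search_sorted_index`, the in-memory list of `(tag, accession, offset)` tuples, and the toy 4-base tags are all assumptions made for the example.

```python
from bisect import bisect_left
from itertools import combinations, product

BASES = "ACGT"

def mismatch_variants(tag, max_mismatches):
    """Return, sorted, the tag plus every sequence within max_mismatches substitutions."""
    variants = {tag}
    for k in range(1, max_mismatches + 1):
        for positions in combinations(range(len(tag)), k):
            # At each chosen position, try only bases that differ from the original.
            choices = [[b for b in BASES if b != tag[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(tag)
                for p, b in zip(positions, replacement):
                    chars[p] = b
                variants.add("".join(chars))
    return sorted(variants)

def search_sorted_index(index, queries):
    """Binary-search each query in a sorted list of (tag, accession, offset) entries."""
    keys = [entry[0] for entry in index]
    hits = []
    for query in queries:
        i = bisect_left(keys, query)
        # Collect every index entry whose tag equals the query.
        while i < len(keys) and keys[i] == query:
            hits.append(index[i])
            i += 1
    return hits

# Toy index (hypothetical data): three 4-base tags with accession and offset.
index = sorted([("ACGA", "chr1", 10), ("ACGT", "chr2", 5), ("TTTT", "chr1", 0)])
hits = search_sorted_index(index, mismatch_variants("ACGT", 1))
```

With one allowed mismatch, the toy query `ACGT` matches the entries tagged `ACGA` and `ACGT` but not `TTTT`, which differs at three positions.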

Abstract

Methods and systems for searching genomes for potential CRISPR off-target sites are provided. In preferred embodiments, the methods include identifying possible on- and off-target cleavage sites and/or ranking the potential off-target sites based on the number and location of mismatches, insertions, and/or deletions in the gRNA guide sequence relative to the genomic DNA sequence at a putative target site in the genome. These methods allow for the selection of better target sites and/or experimental confirmation of off-target sites and are an improvement over partial search mechanisms that fail to locate every possible target site.

Description

METHODS AND SYSTEMS FOR IDENTIFYING CRISPR/CAS
OFF-TARGET SITES
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of and priority to U.S. S.N. 61/932,003 filed January 27, 2014 and which is incorporated by reference in its entirety.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH
This invention was made with government support under Grant PN2EY018244 awarded by the National Institutes of Health. The government has certain rights in the invention.
FIELD OF THE INVENTION
The invention is generally directed to bioinformatics methods and systems for identifying CRISPR/Cas, or similar nucleotide-directed nuclease on-target and putative off-target sites. The invention also includes systems for ranking and comparing CRISPR/Cas, or similar nucleotide-directed nuclease target sites. These putative cleavage sites can have mismatches, insertions, and/or deletions compared to the guide strand. Determining the possible off-target sites allows better choice of guide strands and testing for effects from nuclease treatment. These methods are an improvement over partial search methods that fail to locate every possible target site.
BACKGROUND OF THE INVENTION
Genome editing has successfully created cell lines and animal models for biological and disease studies, and has a wide range of potential therapeutic applications (Gaj, et al., Trends Biotechnol, 31 :397-405 (2013)). In particular, engineered nucleases creating DNA double-strand breaks or single-strand breaks ("nicks") at specific genomic sequences greatly enhance the rate of genomic manipulation. Double-strand breaks repaired by the cellular non-homologous end joining (NHEJ) pathway often induce insertions, deletions, and mutations, or other events, which are effective for gene disruptions and knockouts. Alternatively, when a donor DNA is supplied, double-strand breaks and DNA nicks can be repaired through homologous recombination, which incorporates the donor DNA and results in precise modification of the genomic sequence. Regardless of the DNA repair pathway, it is important to minimize off-target cleavage in order to reduce the detrimental effects of mutations and chromosomal rearrangements. Although zinc finger nucleases and TAL effector nucleases potentially have a wide range of applications, they were found to cleave at off-target sites at detectable rates (Cornu, et al, Methods Mol Biol, 649:237-245 (2010); Ramirez, et al, Nucleic Acids Res, 40:5560-5568 (2012); Tesson, et al., Nat Biotechnol, 29:695-696 (2011); Hockemeyer, et al., Nat Biotechnol, 29:731-734 (2011); Mussolino, et al., Nucleic Acids Res, 39:9283-9293 (2011)). Clustered regularly interspaced short palindromic repeats (CRISPR), the bacterial defense system using RNA-guided DNA cleaving enzymes (Bolotin, et al., Microbiology, 151 (Pt. 8): 2551-2561 (2005); Barrangou, et al, Science, 315: 1709-1712 (2007); Brouns, et al., Science, 321 : 960-964 (2008); Hale, et al, Cell, 139: 945-956 (2009); Horvath, et al, Science, 327: 167-170 (2010); Marraffini, et al, Nat Rev Genet, 11 : 181-190 (2010); Garneau, et al, Nature, 468: 67-71 (2010)), is an exciting alternative to zinc finger nucleases and TAL effector nucleases due to the ease of directing the CRISPR-associated (Cas) proteins (such as Cas9) to multiple gene targets by providing guide RNA sequences complementary to the target sites (Jinek, et al, Science, 337: 816-821 (2012); Cong, et al., Science, 339: 819-823 (2013)). Target sites for CRISPR/Cas9 systems can be found near most genomic loci; the only requirement is that the target sequence, matching the guide strand RNA, is followed by a protospacer adjacent motif (PAM) sequence in either orientation (Mojica, et al, Microbiology, 155 (Pt. 3): 733-740 (2009); Shah, et al, RNA Biol, 10:891-899 (2013); Horvath, et al, J Bacteriol, 190: 1401-1412 (2008)). For Streptococcus pyogenes (Sp) Cas9, this is any nucleotide followed by a pair of guanines (marked as NGG). Studies on CRISPR/Cas9 systems indicate the possibility of high off-target activity due to nonspecific hybridization of the guide strand to DNA sequences with base pair mismatches at positions distal from the PAM region (Cong, et al, Science, 339: 819-823 (2013); Gasiunas, et al, Proc Natl Acad Sci USA, 109:E2579-E2586 (2012); Jinek, et al, Elife 2:e00471 (2013); Jiang, et al, Nat Biotechnol, 31 : 233-239 (2013)).
For CRISPR/Cas9 systems, studies have confirmed levels of off-target cleavage comparable with the on-target rates (Fu, et al., Nat Biotechnol, 31 : 822-826 (2013); Hsu, et al., Nat Biotechnol, 31 : 827-832 (2013); Cradick, et al., Nucleic Acids Res, 41 :9584-9592 (2013); Pattanayak, et al, Nat Biotechnol, 31 : 839-843 (2013)), even with multiple mismatches to the guide strand in the region close to the PAM. RNA guide strands containing insertions or deletions in addition to base mismatches can result in cleavage and mutagenesis at genomic target site with levels similar to that of the original guide strand (Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014)). These studies provide the first experimental evidence that genomic sites could be cleaved when the DNA sequences contain insertions or deletions compared with the CRISPR guide strand. These results have demonstrated the need to identify potential off-target sites when choosing guide strand designs and examine off-target effects experimentally when using CRISPR/Cas systems in cells, plants and/or animals.
As mismatches and indels (insertions and deletions) are tolerated between the guide strand and target sequences, there may be embodiments where there are known or unknown differences between the guide strand and its complementary sequences. In some embodiments, the intended mismatches, truncations, indels or other non-complementary sequences may be included, such that the guide sequence will direct cleavage to the target site, although not a direct matching sequence.
A number of CRISPR tools have been developed, including Cas Online Designer (Hsu, et al, Nat Biotechnol, 31 : 827-832 (2013)), ZiFit,27 CRISPR Tools, (Hsu, et al, Nat Biotechnol, 31 : 827-832 (2013)) and Cas OFFinder (Bae, et al, Bioinformatics, 30: 1473-1475 (2014)), for different functions (Hsu, et al., Nat Biotechnol, 31 : 827-832 (2013); Bae, et al, Bioinformatics, 30: 1473-1475 (2014); Xiao, et al., Bioinformatics, 30: 1180-1182 (2014); Grissa, et al., Nucleic Acids Res, 35: W52-W57 (2007); Grissa, et al., BMC Bioinformatics, 8: 172 (2007); Rousseau, et al., Bioinformatics, 25: 3317-3318 (2009); Montague, et al., Nucleic Acids Res, 42:W401-W407 (2014)). However, none of these bioinformatics search tools has considered the off-target sites due to insertions or deletions between target DNA and guide RNA sequences, nor do they provide application-specific primers. Off-target cleavage could be detected in cells with 15 different insertions and deletions between the guide strand and genomic sequence, sometimes at rates higher than that of the perfectly matched guide strand (Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014)).
Therefore, it is an object of the invention to provide a bioinformatics tool to identify potential off-target sites that have mismatches, insertions, and/or deletions between an RNA guide strand of choice and genomic sequences.
It is a further object of the invention to provide application-specific primers.
SUMMARY OF THE INVENTION
Methods and systems for searching genomes for potential CRISPR off-target sites are provided. In preferred embodiments, the methods include ranking the potential off-target sites based on the number and location of mismatches, insertions, and/or deletions in the gRNA guide sequence relative to the genomic DNA sequence at a putative target site in the genome, allowing the selection of better target sites and/or experimental confirmation of off-target sites.
For example, computer-implemented methods for identifying cleavage locations of a nuclease, preferably a nucleotide-directed nuclease, most preferably a CRISPR/Cas nuclease, are provided. In some embodiments, the nuclease is RNA-directed, DNA-directed, or directed by RNA, DNA and/or an alternative nucleotide format. The nuclease can cleave both DNA strands, can be a single nickase, or can be a double nickase. In the most preferred embodiments, the nuclease is Cas9, or a variant thereof. In some embodiments, methods that identify binding locations of a nucleotide-directed protein that binds to and/or interacts with DNA, but is not a nuclease, are provided.
The methods can include, in a computer system, comparing a series of query sequences including a guide strand sequence (a guide sequence) and at least one variant sequence thereof including one or more nucleotide insertions, one or more nucleotide deletions, and/or one or more nucleotide substitutions relative to the guide sequence, to genomic sequence and reporting target cleavage sites corresponding to locations in the genomic sequence having sequence identity to one or more of the query sequences.
The series of query sequences can include all possible guide strand sequence variants having between 0 and 10, preferably between 0 and 5, more preferably 0, 1, or 2 nucleotide insertions relative to the guide sequence; all possible guide strand sequence variants having between 0 and 10, preferably between 0 and 5, more preferably 0, 1, or 2 nucleotide deletions relative to the guide sequence; between 0 and 10, preferably between 0 and 5, more preferably 0, 1, 2, or 3 nucleotide mismatches (e.g., substitutions) relative to the guide sequence; and all possible combinations thereof. In some embodiments, the method is carried out through an interface, for example a computer-implemented interface, that allows the user to select the number of insertions, deletions, and/or mismatches. In some embodiments, the interface is a web-based interface. In particular embodiments, a web-based interface allows the user a choice of insertions or deletions of a single nucleotide, though other embodiments are possible, as described above. Larger numbers of nucleotides may be more applicable to other nucleases, particularly nucleotide-directed nucleases, with either longer guide strands or different binding arrangements. In a particular embodiment, the query guide sequences provide guide strand variant sequences having no indels and 0, 1, 2, or 3 mismatches; 1-base deletion, no insertions, and 0, 1, or 2 mismatches; 1-base insertion, no deletions, and 0, 1, or 2 mismatches; 1-base deletion, 1-base insertion, and 0, 1, or 2 mismatches; or any combination thereof.
The methods typically include comparing or searching one or more query sequences against one or more genome sequences and reporting putative target sites. In some embodiments an individual guide strand is searched. In other embodiments multiple guide strands are searched, which can allow comparisons of the output or other testing. In the most preferred embodiments, a target site is reported if a genomic sequence is identified that matches the user-supplied search criteria, which can include presence or lack of sites with no indel, with insertion(s), with deletion(s), with mismatch(es), or with combinations thereof. The user-supplied preferences typically include the number of allowed mismatches for each of the categories listed above. In each of these cases, the user can alternatively choose preferences from general or search type-specific defaults, or modify such preferences.
In the preferred embodiment, the output contains each site in the genome satisfying the search criteria. In other embodiments, particularly relevant with less well-sequenced genomes or DNA regions, the output can also include sites that might satisfy the search criteria if the ambiguous nucleotides were known. The output can contain exact matches to the query sequences and/or contain sites that differ (have mismatches) at, for example, 1-12 positions, that differ at 1-5 positions, or that differ at 1-3 positions. The percentage of the sequences matching can then vary depending on the length of the query sequence and the number of mismatches. In some embodiments, the search criteria can result in the reporting of genomic sequences that have approximately at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% sequence identity to one or more of the sequences in the series of query sequences. The report can include the genomic location and preferably the genomic target sequence for each target site identified. The report can include the cleavage location and/or genomic sequence.
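The dependence of percent identity on query length and mismatch count can be shown with a short calculation. The helper below is a hypothetical illustration and simply treats identity as the fraction of matching positions in an ungapped comparison.

```python
def percent_identity(query_length, mismatches):
    """Percent of positions that match in an ungapped comparison."""
    if not 0 <= mismatches <= query_length:
        raise ValueError("mismatches must be between 0 and query_length")
    return 100.0 * (query_length - mismatches) / query_length

# The same number of mismatches gives a lower identity for a shorter query.
identity_20nt = percent_identity(20, 2)   # 90.0
identity_25nt = percent_identity(25, 2)   # 92.0
```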
The report can include a score indicating the likelihood that the guide sequence will direct a CRISPR/Cas system to the DNA sequence and facilitate nuclease cleavage. The score can be used to rank the putative target sites in a list. The score can include additional information from experiments and/or databases, such as ENCODE, about the genomic context. For example, data on the histones, protein binding or conformation of individual chromosomal regions can indicate if there is less or more likelihood of cleavage. In some embodiments, target cleavage locations including genomic sequences with higher sequence identity to the guide sequence receive a lower score relative to target cleavage locations having genomic sequences with lower sequence identity to the guide sequence. Typically, in such embodiments, increasing numbers of substitutions, deletions, and insertions at the target cleavage location increase the score, as do substitutions, deletions, and/or insertions closer to the PAM. The scoring mechanism and position weights can be changed to alter the scoring to better model certain CRISPR/Cas activities. For example, in some embodiments, the score is increased more for deletion(s) in the genomic sequence relative to the guide sequence (RNA bulges) than for insertions in the genomic sequence relative to the guide sequence (DNA bulges). The score can also reflect that sgRNA bulges are less tolerant to additional base mismatches, and vice versa.
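A scoring scheme with the qualitative properties described above (a lower score for closer matches, a larger penalty for differences nearer the PAM, and a larger penalty for RNA bulges than for DNA bulges) might be sketched as below. The specific weights are illustrative assumptions and are not the COSMID weights; the only value taken from this document is the 0.3-point adjustment used to rank NGG sites ahead of NAG sites. The function name and its arguments are hypothetical.

```python
def site_score(mismatch_distances, guide_length, bulge=None, pam="NGG"):
    """Illustrative off-target score; lower means a more likely off-target site.

    mismatch_distances: distance of each mismatch from the PAM
                        (1 = base adjacent to the PAM).
    bulge: None, "RNA" (deletion in genomic DNA) or "DNA" (insertion in genomic DNA).
    pam: the three-base PAM found at the site, e.g. "TGG" or "CAG".
    """
    score = 0.0
    for distance in mismatch_distances:
        # Assumed weighting: 1 point per mismatch plus up to ~1 extra point
        # the closer the mismatch lies to the PAM.
        score += 1.0 + (guide_length - distance) / guide_length
    if bulge == "RNA":
        score += 2.0   # assumed penalty; larger than for a DNA bulge
    elif bulge == "DNA":
        score += 1.5   # assumed penalty
    if pam[1:] == "AG":
        score += 0.3   # ranks NGG sites ahead of NAG sites, per this document
    return score

# Rank a few hypothetical sites for a 20-nt guide, most likely first.
sites = {
    "perfect_NGG": site_score([], 20, pam="TGG"),
    "perfect_NAG": site_score([], 20, pam="TAG"),
    "distal_mismatch": site_score([20], 20, pam="AGG"),
    "proximal_mismatch": site_score([1], 20, pam="AGG"),
}
ranked = sorted(sites, key=sites.get)
```

Separating the per-position weights from the bulge and PAM adjustments keeps each factor easy to retune, which matches the statement that the scoring mechanism and position weights can be changed.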
In some embodiments, each query sequence in the series includes a protospacer adjacent motif (PAM) suffix. Exemplary suffixes include, but are not limited to, NGG, NAG, and NRG. In some embodiments, a target cleavage site having a NGG PAM guide strand is given a lower score than that of NAG PAM.
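PAM suffixes such as NGG, NAG, and NRG use IUPAC nucleotide codes, where N stands for any base and R for a purine (A or G). The sketch below shows one way to test whether a genomic trinucleotide satisfies such a pattern; the function name and the reduced code table are assumptions made for illustration.

```python
# Subset of the IUPAC nucleotide codes needed for the PAMs named in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",      # purine
    "Y": "CT",      # pyrimidine
    "N": "ACGT",    # any base
}

def pam_matches(pattern, sequence):
    """Return True if the sequence satisfies the IUPAC PAM pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC[code] for code, base in zip(pattern, sequence.upper()))

# NRG accepts both NGG and NAG sites; NGG alone rejects NAG sites.
examples = [pam_matches("NRG", "TAG"), pam_matches("NRG", "TGG"), pam_matches("NGG", "TAG")]
```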
Some embodiments may include PAM flanking sequences that are deemed to affect binding.
In some embodiments, the scoring and ranking may be separated, with or without user input. The ranking can also be conducted using two steps, such as an initial ranking and then ranking or re-ranking, based on input weight factors. The ranking method may involve a series of weight scores or a position weight matrix to weight the positions of mismatches, insertions, or deletions, total the individual scores, and influence the scoring based on their impact on the design criteria. The ranking can also include sequence-specific features, such that a match or mismatch weight considers the interacting nucleotide. The sequence-specific weight scores may correlate with hydrogen bonds, as with G-C versus A-T interactions, or may relate to sequence specificities at individual positions, possibly due to protein interactions. The design criteria can include binding, DNA cleavage rate, mutation rate, or other criteria.
In some embodiments, the ranking method is applied to genomic loci independently of the search method. In some embodiments he ranking
In some embodiments, primer sequences suitable for amplifying the genomic sequence at the target cleavage site are reported. These primers may be suitable for PC amplification or DNA preparation or isolation using other techniques, such as pull-down preparations. The primers may be used for Sanger sequencing, next generational sequencing, mutation detection assays, such as the surveyor (Cradick 2009 Thesis) and T7 Endonuclease I, and others.
The genome sequence or sequences that the series of query sequences are searched against typically makes up an organismal genome, preferably a complete or nearly complete organismal genome. In specific embodiments, the organismal genome is a human genome, a rat genome, a mouse genome, or a rhesus macaque genome. In other embodiments, the searched sequence could be artificial sequences or a combination or artificial and genomic sequences. The searched sequences can be DNA, RNA, etc. In a particular embodiment the searched sequences are mRNA, for example, a transcriptome.
The genomic sequence(s) can be DNA sequence converted into FASTA or similarly formatted files, then transformed into index entries that have all possible 25 bases-long tags in the DNA sequence. In other embodiments, other tagging schemes can be used including longer and shorter tags. The index entries can be sorted and the results stored as a binary main index file. The main index file can be divided into parts, each representing entries having about 12 nucleotides of the first nucleotides identical. In other embodiments, other lengths of index files may be used. A secondary index file can include the position in the main index file where each part starts added to the end of the index file. Searching genome sequence organized and indexed in such a way can improve the speed of the search, while allowing exhaustive searching. Preferred embodiments utilize index files, though other embodiments could use other index methods, similar expedited search strategies, or provide searching without index files, as done with linear searches through the full sequence space, though these would increase run times. A particular embodiment of the disclosed method is referred to herein as COSMID (C ISP Off-target Sites with Mismatches, Insertions, and Deletions).
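The indexing scheme described above can be approximated in a few lines of Python. The sketch below is a simplified in-memory illustration and not the genwin or sortGWI programs: it extracts every fixed-length tag with its accession and offset, sorts the entries, and builds a secondary index mapping each prefix to the position where its block of entries begins. The function names, the in-memory data structures, and the choice to skip tags containing ambiguous bases are assumptions. With 12-nucleotide prefixes there are 4^12 = 16,777,216 possible parts.

```python
def build_index(sequences, tag_len=25):
    """Return sorted (tag, accession, offset) entries for every tag in each sequence."""
    entries = []
    for accession, seq in sequences.items():
        seq = seq.upper()
        for offset in range(len(seq) - tag_len + 1):
            tag = seq[offset:offset + tag_len]
            # Assumption: tags with ambiguous bases (e.g. N) are skipped.
            if set(tag) <= set("ACGT"):
                entries.append((tag, accession, offset))
    entries.sort()
    return entries

def build_secondary_index(entries, prefix_len=12):
    """Map each tag prefix to the position of its first entry in the main index."""
    starts = {}
    for position, (tag, _accession, _offset) in enumerate(entries):
        starts.setdefault(tag[:prefix_len], position)
    return starts

# Number of possible parts when entries are grouped by a 12-nt prefix.
possible_parts = 4 ** 12

# Toy example with short tags so the result is easy to inspect.
toy_entries = build_index({"seq1": "ACGTACGT"}, tag_len=4)
toy_starts = build_secondary_index(toy_entries, prefix_len=2)
```

A search can then jump directly to the block for a query's prefix and binary-search only within that block, rather than across the whole index.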
The disclosed methods and systems can aid the design and optimization of CRISPR guide strands by selecting the preferred target sites with minimum Cas-induced off-target cleavage, and can facilitate the experimental confirmation of off-target activity by providing both putative off-target sites and primers for testing cleavage at the sites in a CRISPR/Cas system. In some embodiments, the disclosed methods are more exhaustive and/or have a higher sensitivity for identifying putative and/or actual off-target sites than previously known methods or programs.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1A is a sequence alignment of guide strands to their target sites in HBB and aligned to the corresponding region in HBD. Forward direction guide strands (marked 'greater than') are shown adjacent to NGG, representing the PAM sequence. Guide strands complementary to the reverse strand (marked 'less than') are listed to the right of CCN. Asterisks between HBB and HBD indicate nucleotides that differentiate the two genes, whereas the other nucleotides are the same in both genes. The first base shown in HBB is the sickle cell anemia mutation site. Figure 1B is a sequence alignment showing the high levels of cleavage and mutation that can be found at off-target sites even with mismatch to the guide strands in the first 12 nucleotides closest to the PAM. The on- and off-target mutation rates are listed in decreasing order of the off-target mutation rates at HBD, and illustrate differences between the guide sequence and HBD. A lowercase g indicates that the first base in HBB does not match the guide strands' initial G (for all but R-01). The 12 bases closest to the PAM are boxed and numbered on top. Figure 1C is a bar graph showing the indel percentage in HBB (left-hand bar of each pair) and HBD (right-hand bar of each pair) for mock and guide strands R-01 through R-08 as determined by T7EI mutation detection assays.
Figure 2A is a sequence alignment of guide strands to their target sites in CCR5 (shown below the guide strands) and aligned to corresponding region in CCR2 (shown below CCR2). Forward direction guide strands (marked 'greater than') are shown adjacent to NGG, representing the PAM sequence. Guide strands complementary to the reverse strand (marked 'less than') are listed to the right of CCN. Asterisks between CCR5 and CCR2 indicate nucleotides that differentiate the two genes. Figure 2B is an illustration showing that cleavage can occur at off-target sites even with mismatch to the guide strands in both of the first two nts closest to the PAM (R-30). The first two guide strands in the list are in ranked order of the off-target mutation rates at CCR2. By sequence comparison, one can identify the differences between the guide strand sequence and complementary sequence in CCR2. The 12 bases closest to the PAM are boxed and numbered on top. Figure 2C is a bar graph showing the indel percentage in CCR5 (left-hand bar of each pair) and CCR2 (right-hand bar of each pair) for mock and guide strands R-01 through R-08 as determined by T7EI mutation detection assays.
Figures 3A-3E are bar graphs illustrating how the transfection dosage variability affects on- and off-target mutation rates (%). Figures 3A-3C show R-03 (3A), R-04 (3B), or R-08 (3C) guide strand mutation rates at HBB (left-hand bar of each pair) and HBD (right-hand bar of each pair) loci when cells were transfected with 100, 200, 400, or 800 ng of CRISPR plasmid. Figures 3D-3E show R-25 (3D) or R-30 (3E) guide strand mutation rates at CCR5 (left-hand bar of each pair) and CCR2 (right-hand bar of each pair) loci when cells were transfected with 100, 200, 400, or 800 ng of CRISPR plasmid.
Figures 4A-4B are sequence alignments showing on-target loci (4A) and off-target loci (4B) for guide strand R-03 after transfection with the CRISPR plasmid. The regions were amplified with flanking PCR primers, cloned and Sanger sequenced. Sequencing reads are given for each guide strand and aligned to the wild-type sequence. The number of times each read occurred is indicated to the left of the alignment. Unmodified reads are indicated by 'WT'. Mutations, insertions, or deletions were detected in 70% of the reads at HBB and 62% of the reads in HBD. In Figure 4B the guide strand mismatch is boxed. Figure 4C depicts the sequence of chromosomal deletions as a sequence alignment showing PCR products of genomic DNA from cells treated with R-03, amplified using an HBD forward primer and reverse primer downstream of the HBB site, sequenced and aligned to 'HBB-HBD'. Sequencing detected that each product contained indels and mutations consistent with NHEJ, near the target sites for R-03. Insertions, point mutations, and deletions are illustrated. Figure 4D is a line graph depicting the quantitative PCR determination of the percentage of HBD-HBB chromosomal deletions at R-03, and the lower amount after transfection of R-02.
Figures 5A-5B are sequence alignments showing on-target loci (5A) and off-target loci (5B) for guide strand R-25 after transfection with the CRISPR plasmid. The regions were amplified with flanking PCR primers, cloned and Sanger sequenced. Sequencing reads are given for each guide strand and aligned to the wild-type sequence. The number of times each read occurred is indicated to the left of the alignment. Unmodified reads are indicated by 'WT'. Mutations, insertions or deletions were detected in 50% of the reads at CCR5 and 32% of the reads in CCR2. In Figure 5B the guide strand mismatch is boxed.
Figure 5C depicts the sequence of chromosomal deletions as a sequence alignment showing PCR products of genomic DNA from cells treated with R-25, amplified using a CCR2 forward primer and reverse primer downstream of the CCR5 site, sequenced and aligned to 'CCR2-CCR5'. Sequencing detected that each product contained indels and mutations consistent with NHEJ, near the target sites for R-25. Insertions, point mutations, and deletions are illustrated.
Figures 6A-6C are sequence alignments showing on- and off-target sequencing after CRISPR transfection: R-02 targeted mutations at HBB (6A), R-02 mutations at off-target site 2, GRIN3A (6B), and R-30 off-target mutations at CCR2 (6C). Target loci in genomic DNA of HEK-293T cells transfected with each CRISPR construct were amplified, cloned, Sanger sequenced, and aligned to the reference gene, listed above the alignment, and shown aligned to the guide strand. After the guide strand name and genetic loci for each alignment, the number of clones with indels is shown, as is the total number of clones and percentage with indels. The alignment includes the reference gene and guide strand with mismatches boxed. The first column lists the number of times each read occurred and indel size change in basepairs. Unmodified reads are indicated by "WT". Insertions, point mutations, and deletions are illustrated.
Figure 7 is a bar graph showing the indel spectra from CRISPR/Cas9 cleavage and NHEJ mis-repair. The change in number of base pairs resulting from each indel was calculated and compiled. The y-axis represents the percentage of each number of insertion or deletion.
Figures 8A and 8B are diagrams showing that CRISPR can cleave at genomic sites with mismatches to the guide strand and with insertions or deletions relative to the guide strand, for example at off-target sites with a 1-bp insertion (DNA bulge) (8A) or a 1-bp deletion (RNA bulge) (8B). The 20-nt guide sequence in the sgRNA is shown aligned with the genomic target sequence (protospacer) containing single-base DNA bulge (8A, asterisk) or single-base sgRNA bulge (8B, Δ). The zoom-in nucleotide sequences of protospacer and PAM are shown above the sgRNA guide sequence. Positions of nucleotides in the target are numbered 3' to 5' starting from the nucleotide next to PAM.
Figure 9A is a sequence alignment illustrating that a single nucleotide was deleted from the original R-01 sgRNA at all possible positions (dashes) throughout the guide sequence for sgRNA R-01 targeting HBB. Figure 9B is a grid mapping the deletions, which in the case of repeated bases, can be thought to have been a deletion of either base. Semi-transparent squares in two positions in the same sgRNA indicate that deletions can be interpreted at either of adjacent positions (also marked by 'or') due to identical nucleotides at both positions. Sequence of the original sgRNA is in the top row of the grid. Figure 9C is a bar graph showing cleavage activity aligned to the corresponding sgRNA variants of 9A and 9B. The graph in Figure 9C indicates cleavage activity for the corresponding sgRNA variants measured by T7EI assay in HEK293T cells at the HBB site for the sgRNA variants in (9A), and compares to the activity of the original full-length guide strand. Positions relative to PAM are labeled on the y-axis. The vertical dashed lines mark the activity levels of the original sgRNAs. Error bar, SEM (n = 2).
Figure 10A is a sequence alignment illustrating that a single nucleotide was deleted from the original sgRNA at all possible positions (dashes) throughout the guide sequence for sgRNA R-30 targeting CCR5. Figure 10B is a grid mapping the deletions, which in the case of repeated bases, can be thought to have been a deletion of either base. Semi-transparent squares in two positions in the same sgRNA indicate that deletions can be interpreted at either of adjacent positions (also marked by 'or') due to identical nucleotides at both positions. The sequence of the original sgRNA is in the top row of the grid. The graph in Figure 10C indicates cleavage activity for the corresponding sgRNA variants measured by T7EI assay in HEK293T cells at the HBB site for the sgRNA variants in (10A), and compares to the activity of the original full-length guide strand. Figure 10C is a bar graph showing cleavage activity aligned to the corresponding sgRNA variants of 10A and 10B. Considerable activity, even higher than with the original guide strand, was detected with deletions at a number of different positions. Positions relative to PAM are labeled on the y-axis. The vertical dashed lines mark the activity levels of the original sgRNAs. Error bar, SEM (n = 2).
Figures 11A and 11B are alignments of -1 nt sgRNA variants to the HBB (11A) and CCR5 (11B) target loci showing mismatches instead of a DNA bulge. Only the variants with detectable intracellular activities are shown. The target loci and index names of the sgRNA variants are indicated on the left of each alignment. Mismatches in the guide sequence and in the "NGG" PAM are marked with asterisks below each alignment. The alignment with the minimum number of mismatches is shown for each sgRNA variant. Nucleotide "U" in the guide RNA is replaced with "T" for the ease of comparison to the target site. For example, the cleavage of R-01 with a deletion at position 6 or 7 (11A) can either be modeled with a deletion and no mismatches or without a deletion, but with four mismatches close to the PAM (indicated by *), which would generally not be well tolerated and would prevent cleavage. Similarly, the CCR5 guide strand with a deletion at position 9 or 10 (11B), which has considerable activity, can either be modeled with a deletion and no mismatches or without a deletion. If this interaction was modeled without a deletion, there would be six mismatches close to the PAM (indicated by *), which would generally prevent cleavage.
Figure 12A is a sequence alignment showing 1-6 bp truncations at the 5' end of the guide sequence R-01 targeted to the HBB gene. Figure 12B is a grid showing cleavage activity for the corresponding sgRNA variants measured by T7EI assay in HEK293T cells at the HBB site for the sgRNA variants in (12A). Truncated positions are highlighted in the grid. Sequence of the original sgRNA is in the top row of the grid. Figure 12C is a bar graph showing cleavage activity aligned to the corresponding sgRNA variants of 12A and 12B. The number of deleted nucleotides is labeled on the y-axis. The vertical dashed lines mark the activity levels of the original sgRNAs. Error bar, SEM (n = 2).
Figure 13A is a grid showing the activity of Cas9 at the HBB target site carrying single-base sgRNA bulges associated with different variants of the original sgRNA R-01. Each variant shown has a single nucleotide, A, G, C, or U, inserted into the original sgRNA at the positions shown throughout the guide sequence. Sequence of the original sgRNA is in the top row of the grid. Positions of the original guide sequence are shaded, while the inserted positions are white. Due to identical nucleotides at adjacent positions, some inserted nucleotides can be in multiple positions (marked by 'or'). Figure 13B is a bar graph showing corresponding cleavage activities quantified by T7EI assay in HEK293T cells. Positions relative to PAM and the single nucleotides added are labeled on the y-axis. Error bar, SEM (n = 2).
Figure 14A is a grid showing the activity of Cas9 at the CCR5 target site resulting from treatment with different variants of R-30 with single-base bulges. A single nucleotide, A, G, C, or U, was inserted into the original sgRNA throughout the guide sequence. Sequence of the original sgRNA is in the top row of the grid. Positions of the original guide sequence are shaded, while the inserted positions are white. Due to identical nucleotides at adjacent positions, some inserted nucleotides can be in multiple positions (marked by 'or'). Figure 14B is a bar graph showing corresponding cleavage activities quantified by T7EI assay in HEK293T cells. Positions relative to PAM and the single nucleotides added are labeled on the y-axis. Error bar, SEM (n = 2).
Figures 15A and 15B are sequence alignments of +1 nt sgRNA variants to the HBB (15A) and CCR5 (15B) target loci, showing that aligning the variants without a bulge leads to many mismatches instead of an sgRNA bulge. Only the variants with detectable intracellular activities are shown. The target loci and index names of the sgRNA variants are indicated on the left of each alignment. Mismatches in the guide sequence and in the "NGG" PAM are marked with asterisks below each alignment. The alignment with the minimum number of mismatches is shown for each sgRNA variant. Nucleotide "U" in the guide RNA is replaced with "T" for the ease of comparison to the target site.
Figures 16A and 16C are grids showing the activity of Cas9 at the HBB target site carrying single-base DNA bulges (16A) or sgRNA bulges (16C) associated with different variants of the original sgRNAs R-08. Figures 16B and 16D are bar graphs showing corresponding cleavage activities of 16A and 16C, respectively, quantified by T7EI assay in HEK293T cells. Positions relative to PAM and the single nucleotides added are labeled on the y-axis. Error bar, SEM (n = 2).
Figure 17A is a series of sequence alignments comparing guide RNA variants with insertions greater than one nucleotide and their original target sites R-01 or R-30. The guide RNAs are named for the position of the insertions. Figure 17B is a bar graph showing cleavage activities of the sgRNA variants shown in 17A quantified by T7EI assay in HEK293T cells. Error bar, SEM (n = 2). Figures 17A and 17B show that larger bulges can also lead to activity.
Figure 18A is a sequence alignment showing the human HBB gene targeted by Cas9 nickases (Cas9n) with paired guide strands R-01 and R-02. PAMs are indicated with bars. Figure 18B is a bar graph showing T7EI activities of Cas9n with R-01 bulge-variants paired with R-02, compared with original Cas9 activities of the R-01 bulge-variants as in Figures 9-10 and 13-14. Error bar, SEM (n = 2). Asterisks indicate P-values from a two-tailed independent two-sample t-test. *P < 0.05, **P < 0.01, ***P < 0.001. Figures 18A and 18B show that bulges are tolerated in other CRISPR systems including the nickase nucleases, which only cut one strand.
Figures 19A and 19B are sequence alignments showing on-target and off-target alignments containing bulges for sgRNAs R-30 targeted to the CCR5 gene (19A), and R-31 targeted to the ERCC5 gene (19B). Upper: guide strands aligned to target sequences (CCR5 and ERCC5). Lower: guide strands (R-30 and R-31) aligned to off-target sequences (Off-4 and Off-1) each with a DNA bulge compared to the sgRNA (R-30 and R-31) tested. Off-4 has a mismatch with R-30, 14 nt from the PAM. Horizontal lines indicate the PAM. The mismatch shown between the initial G in sgRNA R-31 and the corresponding nt in its target site or in Off-1 does not affect binding or cleavage. After transfection of R-30 and R-31 expression plasmids, and tissue culture for 2 days, the genomic DNA was harvested and amplified by flanking primers. Figures 19C and 19D display the mutations, insertions and deletions introduced by mis-repair after cleavage at these sites. The Sanger sequencing reads of amplified off-target sites are aligned to the wild-type genomic sequence and sgRNAs for R-30 (19C) and R-31 (19D). The number of times each sequence occurred is indicated to the left of the alignment, if greater than one. Unmodified reads are indicated by 'WT'. Deletions are marked with a dash ('-') and insertions are shaded. Figure 19E is a bar graph showing activities (indel percent) analyzed by deep sequencing at genomic off-target loci containing bulges coupled with mismatches and in some cases alternative NAG-PAMs. The level after CRISPR treatment with the indicated guide strand is graphed against mutations detected in mock treated samples (likely by mis-reads) (top bar in each pair, outlined) and treated samples (bottom bar in each pair) with sgRNAs at off-target loci shown in the table to the left. The table on the left shows numbers of mismatches at off-target loci in addition to bulge (no. of mis), bulge types, positions of bulges from PAM (bulge pos), labels for the loci and sequences of off-target sites including PAMs. In these off-target genomic sequences, mismatches are lighter, a deleted base compared to the sgRNA is marked as '-' (sgRNA bulge), and an inserted base compared to the sgRNA is marked as an underlined letter (DNA bulge). Error bars, Wilson intervals (see 'Materials and Methods' section). *P < 0.05, ***P < 0.001 as determined by Fisher's exact test. The % indel values of treated samples are also indicated.
Figure 20 is a sequence alignment showing the effects of R-30 cleavage and mis-repair at the off-target site 5 (Off-5), quantified by Sanger sequencing. One of the 24 sequencing reads was not wild type, with an inserted 'a' shown in lowercase; the other 23 reads were wild type and are marked "WT".
Figures 21A and 21B are genetic maps showing the histone modification status and annotation of R30 Off-4 (21A) and Off-5 (21B) loci obtained from the UCSC genome browser.
Figure 22 is a bar graph showing the results of quantitative PCR of sgRNA expression (sgRNA Log Fold Change (-ddCt)) levels in HEK293T cells for R-01 and R-30 variants.
Figures 23A-23C are bar graphs showing the range of insertions and deletions introduced with matching guide strand and guide strands with bulges (the indel spectra, the percent in total indels mapped against change in bases) for original sgRNAs and sgRNA variants determined using deep sequencing for R-01 original sgRNA (23A), and variants for DNA bulge (R1 -7/6) (23B) and sgRNA bulge (R1 C+12) (23C). The change in bases at predicted cut sites resulting from indicated sgRNAs was calculated from ~10^4 reads per sample. The y-axis represents percentages in all indel-reads for that sgRNA. Overall % indel in total reads are indicated in each graph.
Figures 24A-24C are bar graphs showing indel spectra (percent in total indels mapped against change in number of bases) for original sgRNAs and sgRNA variants determined using deep sequencing for R-30 original sgRNA (24A), and variants for DNA bulge (R30-11) (24B) and sgRNA bulge (R30 U+12) (24C). The change in bases at predicted cut sites resulting from indicated sgRNAs was calculated from ~10^4 reads per sample. The y-axis represents percentages in all indel-reads for that sgRNA. Overall % indel in total reads are indicated in each graph. Expression of Cas9 and the original guide strand or guide strand with indels results in insertions or ranges of deletions.
Figure 25A is a screen-shot of an exemplary COSMID user input interface, including a drop-down list of searchable genomes, a box to enter a query guide sequence of choice, a box to enter the type of PAM, radio buttons to select allowed number of mismatches, insertions and deletions, and both selection criteria and user input boxes to modify the primer design parameters. Figure 25B is a flow chart showing the COSMID software design and the major steps in performing a search. Figure 25C is a list of exemplary search strings with insertions or deletions in the first six possible positions demonstrating how the program searches for each insertion or deletion (if selected by user). Alternate deletions of repeated bases are synonymous.
Figure 26A is an exemplary COSMID user interface for selecting a searchable genome. Figure 26B is an exemplary COSMID user interface for entering a query sequence. Figure 26C is an exemplary COSMID user interface for entering the protospacer adjacent motif (PAM) and selecting the type and number of mismatches and indels. Figure 26D is an exemplary COSMID user interface for entering primer design parameters. Figure 26E is an alignment showing the tags generated and used to search the human genome when a COSMID user enters the guide sequence exemplified in Figures 26A and allows a 1-base deletion to allow a gRNA bulge (e.g., the DNA is one base shorter than the guide sequence, as illustrated above the alignment). Deletions of either of two consecutive identical bases result in the same sequence, and the duplicates are therefore omitted from the list. Figure 26F is an alignment showing the tags generated and used to search the human genome when a COSMID user enters the guide sequence exemplified in Figures 26A and allows a 1-base insertion to allow a DNA bulge (e.g., the guide sequence RNA is one base shorter than the DNA, as illustrated above the alignment). Figure 26G is an exemplary COSMID HTML output that shows query type, number of mismatches if the PAM ends in RG (NAG or NGG), the chromosomal position, strand, cut site, the ranking score and left PCR primer. The right primer is off screen here.
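The generation of search strings with a single deletion or insertion, as described for Figures 25C, 26E and 26F, might be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and representing the inserted base by the wildcard "N" is an assumption rather than something stated in the text.

```python
def deletion_variants(guide):
    """Distinct search strings made by deleting one base from the guide.

    Deleting either of two adjacent identical bases gives the same string,
    so duplicates are dropped while keeping first-seen order.
    """
    seen = set()
    variants = []
    for i in range(len(guide)):
        variant = guide[:i] + guide[i + 1:]
        if variant not in seen:
            seen.add(variant)
            variants.append(variant)
    return variants


def insertion_variants(guide, wildcard="N"):
    """Search strings with one wildcard base inserted at each internal position.

    The wildcard stands for any base present in the genomic DNA but absent
    from the guide (assumed representation).
    """
    return [guide[:i] + wildcard + guide[i:] for i in range(1, len(guide))]
```

For the toy guide "GATTC", deletion_variants returns four strings rather than five, because deleting either T yields "GATC".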
Figure 27 is a bar graph showing on- and off-target cleavage rates (% indel frequency) for guide strand R-01 for groups of identical sites. This experiment indicated that other factors in addition to complementary sequence may play a role in the mutation rate - these features may be added into the search calculations, scoring and ranking in other embodiments.
Figures 28A and 28B are sequence alignments showing two examples of genomic sites identified using different search queries for R-30. Both possible off-target sites can align to search strings without indels, with a deletion and with an insertion. Search strings are shown aligned to each identified chromosomal location. Mismatches are shaded, and insertions or deletions are illustrated with a dash ('-').
Figures 29A-29D are genetic maps showing the number and location of the additional genomic loci found while searching for putative off-target sites with and without indels for R-01 (29A, 29C) and R-30 (29B, 29D). Figures 29A and 29B display putative off-target sites with up to three mismatches and no indels. Figures 29C and 29D include the addition of sites with up to two mismatches and either an insertion or a deletion. Each vertical line represents one identified off-target site, plotted at its chromosomal location by the UCSC genome browser. The chromosome numbers are listed on the edges of the plots.
Figure 30A is a flow chart of an exemplary method for generating a ranked list of off-target sites that could be implemented on a computer. A user query is used to generate search parameters used by the algorithm to construct a list of possible off-target cleavage sites. The possible off-target sites are ranked by their predicted off-target cleavage activity (or chance for activity) and output as results in a ranked list. Figure 30B is a flow chart of an additional exemplary method for generating a ranked list of off-target sites that could be implemented on a computer. This method includes estimating the results and generating a list of primers designed for amplifying and/or testing the mutations introduced at each site. Figure 30C is a flow chart illustrating an exemplary algorithm for executing the disclosed methods of identifying target sites and/or ranking or scoring target sites.
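The ranking step of Figure 30A can be illustrated with a short Python sketch. The scoring rule here is purely hypothetical and is not the disclosed scoring function: it assumes a linear penalty that grows as a mismatch lies closer to the PAM (position 1 is the base next to the PAM), and it assumes that a lower penalty means a site more likely to be cleaved. The dictionary keys "name" and "mismatches" are likewise illustrative.

```python
def penalty(mismatch_positions, guide_len=20):
    """Hypothetical penalty: mismatches nearer the PAM (position 1) weigh more."""
    return sum(guide_len - pos + 1 for pos in mismatch_positions)


def rank_sites(sites, guide_len=20):
    """Sort candidate sites so the lowest-penalty (assumed most likely) come first."""
    return sorted(sites, key=lambda site: penalty(site["mismatches"], guide_len))


# Toy candidates; positions are counted from the PAM-proximal end.
candidates = [
    {"name": "site_a", "mismatches": [1, 2]},    # two PAM-proximal mismatches
    {"name": "site_b", "mismatches": [19, 20]},  # two PAM-distal mismatches
    {"name": "site_c", "mismatches": []},        # perfect match
]
ranked_names = [site["name"] for site in rank_sites(candidates)]
```

Under these assumptions the perfect match ranks first, followed by the site with PAM-distal mismatches, then the site with PAM-proximal mismatches. Any real weighting would need to be fitted to measured cleavage data.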
Figure 31 is a block diagram of a preferred network-based implementation containing a computer server and one or more client computers in communication over a network.
Figure 32 is a block diagram of a computer server containing I/O device(s), a processor, memory, and storage.
Figure 33 is a schematic of a graphical user interface (GUI) for receiving input parameters for a computer-implemented off-target site search method. The GUI is displayed in a web browser and contains check boxes, drop-down lists, radio buttons, and text boxes for inputting the query sequence, modifying the search parameters, and customizing design criteria for PCR primers that can be used to test off-target cleavage using the queried guide sequence.
Figure 34 is a curve illustrating the score (x-axis) as a function of the location/position of the mismatch or indel relative to the PAM (Y-axis).
DETAILED DESCRIPTION OF THE INVENTION
I. Definitions
As used herein, the terms "operative linkage" and "operatively linked" (or "operably linked") are used interchangeably with reference to a juxtaposition of two or more components (such as sequence elements), in which the components are arranged such that both components function normally and allow the possibility that at least one of the components can mediate a function that is exerted upon at least one of the other components. For example, an enhancer is a transcriptional regulatory sequence that is operatively linked to a coding sequence, even though they are not contiguous.
As used herein, an "exogenous" molecule is a molecule that is not normally present in a cell, but can be introduced into a cell by one or more genetic, biochemical or other methods. "Normal presence in the cell" is determined with respect to the particular developmental stage and environmental conditions of the cell. Thus, for example, a molecule that is present only during embryonic development of muscle is an exogenous molecule with respect to an adult muscle cell. Similarly, a molecule induced by heat shock is an exogenous molecule with respect to a non-heat-shocked cell. An exogenous molecule can include, for example, a functioning version of a malfunctioning endogenous molecule, a malfunctioning version of a normally-functioning endogenous molecule or an ortholog (functioning version of endogenous molecule from a different species).
As used herein, the terms "nucleic acid," "polynucleotide," and "oligonucleotide" are interchangeable and refer to a deoxyribonucleotide or ribonucleotide polymer, in linear or circular conformation, and in either single- or double-stranded form. For the purposes of the present disclosure, these terms are not to be construed as limiting with respect to the length of a polymer. The terms can encompass known analogues of natural nucleotides, as well as nucleotides that are modified in the base, sugar and/or phosphate moieties (e.g., phosphorothioate backbones). In general and unless otherwise specified, an analogue of a particular nucleotide has the same base-pairing specificity; i.e., an analogue of A will base-pair with T.
As used herein, the terms "polypeptide," "peptide" and "protein" are used interchangeably to refer to a polymer of amino acid residues. The term also applies to amino acid polymers in which one or more amino acids are chemical analogues or modified derivatives of corresponding naturally-occurring amino acids.
As used herein, the terms "cleavage" or "cleaving" of nucleic acids refer to the breakage of the covalent backbone of a nucleic acid molecule. Cleavage can be initiated by a variety of methods including, but not limited to, enzymatic or chemical hydrolysis of a phosphodiester bond. Both single-stranded cleavage and double-stranded cleavage are possible, and double-stranded cleavage can occur as a result of two distinct single-stranded cleavage events. DNA cleavage can result in the production of either blunt ends or staggered "sticky" ends. In certain embodiments cleavage refers to the double-stranded cleavage between nucleic acids within a double-stranded DNA or RNA chain.
As used herein, the term "genome" refers to the nuclear DNA of an organism, though it can also include all the DNA in a given organism including mitochondrial DNA. The term "genomic DNA" refers to deoxyribonucleic acids that are obtained from the nucleus of an organism. The terms "genome" and "genomic DNA" encompass genetic material that may have undergone amplification, purification, or fragmentation. In some cases, genomic DNA encompasses nucleic acids isolated from a single cell, or a small number of cells, clones of cells or pools of cells. The "genome" in the sample that is of interest in a study may encompass the entirety of the genetic material from an organism, or it may encompass only a selected fraction thereof: for example, a genome may encompass one chromosome from an organism with a plurality of chromosomes. The genome may refer to the reference sequence for an organism or the sequence of one or more individuals. In some embodiments, the genomic sequence can contain or be comprised solely of man-made, altered or non-natural sequences, including, but not limited to, natural genomic sequences with the inclusion of knocked-in sequences, such as GFP expression cassettes or tags, or cDNA or other sequences for the expression of a gene of interest. In other embodiments, the genome may not consist of natural chromosomal sequences, but of sequences assembled by man.
As used herein, the terms "genomic region" and "genomic segment" are used interchangeably and denote a contiguous length of nucleotides in a genome of an organism. A genomic region may be of a length as small as a few kb (e.g., at least 5 kb, at least 10 kb or at least 20 kb), up to an entire chromosome or more.
As used herein, the terms "genome-wide" and "whole genome" are interchangeable and refer generally to the entire genome of a cell or population of cells and include the sequences normally found in those cells and introduced DNA such as knocked-in cDNAs, promoters, enhancers, tags or other naturally occurring or man-made sequences or combinations of sequences. The terms "genome-wide" and "whole genome" will generally encompass a complete DNA sequence of all of an organism's DNA (chromosomal, mitochondrial, etc.). Alternatively, the terms "genome-wide" or "whole genome" may refer to most or nearly all of the genome. For example, the terms "genome-wide" or "whole genome" may exclude a few portions of the genome that are difficult to sequence, do not differ among cells or cell types, are not represented on a whole genome array, or raise some other issue or difficulty that prompts exclusion of such portions of the genome. In some embodiments the genome is considered complete if more than 90%, more than 95%, more than 99%, or more than 99.9% of the base pairs have been sequenced. In some cases, less is known of a genome, but the known fraction can be of use. The genome can refer to any organism for which a portion of the genome has been sequenced. In some embodiments the whole genome is a human genome, a rat genome, a mouse genome, a Zebrafish genome, an Arabidopsis genome, a yeast genome, a D. melanogaster genome, a C. elegans genome, a dog genome, a cow genome, an ape genome, or a pig genome. In some embodiments the "genome" will contain inserted or modified genomic sequences.
In some cases nucleotide sequences are provided using character representations recommended by the International Union of Pure and Applied Chemistry (IUPAC) or a subset thereof. IUPAC nucleotide codes used herein include A = Adenine, C = Cytosine, G = Guanine, T = Thymine, U = Uracil, R = A or G, Y = C or T, S = G or C, W = A or T, K = G or T, M = A or C, B = C or G or T, D = A or G or T, H = A or C or T, V = A or C or G, N = any base, "." or "-" = gap. In some embodiments the set of characters is {A, C, G, T, U} for adenosine, cytidine, guanosine, thymidine, and uridine respectively. In some embodiments the set of characters is {A, C, G, T, U, I, X, Ψ} for adenosine, cytidine, guanosine, thymidine, uridine, inosine, xanthosine, and pseudouridine respectively. In some embodiments the set of characters is {A, C, G, T, U, I, X, Ψ, R, Y, N} for adenosine, cytidine, guanosine, thymidine, uridine, inosine, xanthosine, pseudouridine, unspecified purine, unspecified pyrimidine, and unspecified nucleotide respectively. The modified sequences, non-natural sequences, or sequences with modified binding may be in the genomic, the guide or the tracr sequences.
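By way of non-limiting illustration, the IUPAC codes listed above can be represented as a lookup table and used to test whether a concrete sequence is permitted by a pattern such as a PAM. The following is a minimal sketch only; the names `IUPAC`, `code_matches` and `pattern_matches` are hypothetical, and treating U as equivalent to T is an assumption made for the example.

```python
# Illustrative lookup table: each IUPAC code maps to the bases it can stand for.
# Assumption for this sketch: U is treated as equivalent to T.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def code_matches(code: str, base: str) -> bool:
    """Return True if a concrete base (A, C, G, T or U) is allowed by an IUPAC code."""
    base = base.upper()
    if base == "U":
        base = "T"
    return base in IUPAC[code.upper()]


def pattern_matches(pattern: str, sequence: str) -> bool:
    """Return True if every position of `sequence` is allowed by `pattern`."""
    if len(pattern) != len(sequence):
        return False
    return all(code_matches(c, b) for c, b in zip(pattern, sequence))
```

For example, under these assumptions `pattern_matches("NRG", "TAG")` is true because R permits A, whereas `pattern_matches("NRG", "TCG")` is false.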
Nucleotide and/or amino acid sequence identity percent (%) is understood as the percentage of nucleotide or amino acid residues that are identical with nucleotide or amino acid residues in a candidate sequence in comparison to a reference sequence when the two sequences are aligned. To determine percent identity, sequences are aligned and, if necessary, gaps are introduced to achieve the maximum percent sequence identity. Sequence alignment procedures to determine percent identity are well known to those of skill in the art. Often publicly available computer software such as BLAST, BLAST2, ALIGN2 or MEGALIGN (DNASTAR) software is used to align sequences. Those skilled in the art can determine appropriate parameters for measuring alignment, including any algorithms needed to achieve maximal alignment over the full length of the sequences being compared. When sequences are aligned, the percent sequence identity of a given sequence A to, with, or against a given sequence B (which can alternatively be phrased as a given sequence A that has or comprises a certain percent sequence identity to, with, or against a given sequence B) can be calculated as: percent sequence identity = (X/Y) × 100, where X is the number of residues scored as identical matches by the sequence alignment program's or algorithm's alignment of A and B and Y is the total number of residues in B. If the length of sequence A is not equal to the length of sequence B, the percent sequence identity of A to B will not equal the percent sequence identity of B to A. Mismatches can be similarly defined as differences between the natural binding partners of nucleotides. The number, position and type of mismatches can be calculated and used for identification or ranking purposes.
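As a non-limiting illustration of the percent identity calculation described above, the following sketch computes (X/Y) × 100 from two sequences that have already been aligned by some other program. The function name `percent_identity` and the use of "-" as the gap character are assumptions made for the example.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of A to B: 100 * X / Y.

    X is the number of aligned positions scored as identical matches,
    Y is the total number of residues in B (gap characters excluded).
    Both inputs are assumed to be already aligned and of equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    x = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    y = sum(1 for b in aligned_b if b != gap)
    return 100.0 * x / y
```

With A aligned as "ACGT-A" and B as "ACGTTA", X is 5 in both directions, but Y is 6 when B is the reference and 5 when A is the reference, so the two percentages differ, consistent with the statement above about sequences of unequal length.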
As used herein, "mutation" encompasses any change in a DNA, RNA, or protein sequence from the wild type sequence or some other reference, including without limitation point mutations, transitions, insertions, transversions, translocations, deletions, inversions, duplications, recombinations, or combinations thereof. As used herein, in the context of alignments and identity between a CRISPR guide strand and each genomic on- or off-target site, the term "insertion" is used when the endogenous DNA sequence has one or more extra bases compared with the sequence of the guide strand (a DNA bulge). Similarly, in the context of alignments and identity between a CRISPR guide strand and a genomic target site, the term "deletion" is used when the endogenous DNA sequence has one or more missing bases compared with the guide strand (an RNA bulge). In the context of alignments and identity between a CRISPR guide strand and a genomic target site, the term "indels" indicates either insertions or deletions. Although insertions and deletions may be viewed as mismatches, as used herein in the context of alignments and identity between a CRISPR guide strand and a genomic target site, the term "mismatch" is used exclusively for a base-pair mismatch when the guide strand and the potential off-target sequence have the same length but differ in base composition. Guide strands and genomic sequences can have multiple mismatches, multiple insertions, multiple deletions or combinations thereof, such as one nucleotide inserted and two mismatches. In some cases the alignment could be represented in several ways, such as with an indel and a few mismatches or without an indel but with a larger number of mismatches.
As used herein, the term "endonuclease" refers to any wild-type or variant enzyme capable of catalyzing the hydrolysis (cleavage) of bonds between nucleic acids within a DNA or RNA molecule, preferably a DNA molecule. Non-limiting examples of endonucleases include type II restriction endonucleases such as FokI, HhaI, HindIII, NotI, BbvCI, EcoRI, BglII, and AlwI. Endonucleases also comprise rare-cutting endonucleases, which typically have a polynucleotide recognition site of about 12-45 base pairs (bp) in length, more preferably of 14-45 bp. Rare-cutting endonucleases induce DNA double-strand breaks (DSBs) at a defined locus. Rare-cutting endonucleases can for example be a homing endonuclease, a mega-nuclease, a chimeric Zinc-Finger nuclease (ZFN) or TAL effector nuclease (TALEN) resulting from the fusion of engineered zinc-finger domains or TAL effector domain, respectively, with the catalytic domain of a restriction enzyme such as FokI, other nuclease or a chemical endonuclease.
As used herein, the term "exonuclease" refers to any wild type or variant enzyme capable of removing nucleic acids from the terminus of a DNA or RNA molecule, preferably a DNA molecule. Non-limiting examples of exonucleases include exonuclease I, exonuclease II, exonuclease III, exonuclease IV, exonuclease V, exonuclease VI, exonuclease VII, Xrn1, and Rat1.
In some cases an enzyme is capable of functioning both as an endonuclease and an exonuclease. The term nuclease generally encompasses both endonucleases and exonucleases; however, in some embodiments the terms "nuclease" and "endonuclease" are used interchangeably herein to refer to endonucleases, i.e., to refer to enzymes that catalyze bond cleavage within a DNA or RNA molecule.
II. Methods
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is an acronym for DNA loci that contain multiple, short, direct repetitions of base sequences. The prokaryotic CRISPR/Cas system has been adapted for gene editing (silencing, enhancing or changing specific genes) in eukaryotes (see, for example, Cong, Science, 15:339(6121):819-823 (2013) and Jinek, et al., Science, 337(6096):816-21 (2012)). By transfecting a cell with the required elements including a cas gene and specifically designed CRISPRs, the organism's genome can be cut and modified at virtually any desired location. A number of methods exist for expressing the guide strand or Cas protein, including inducible expression of one or both. A number of methods exist for introducing the guide strand and Cas protein into cells including viral transduction, injection or micro-injection, nano-particle or other delivery, uptake of proteins, uptake of RNA or DNA, and uptake of a combination of protein and RNA or DNA. Combinations of methods can also be used, simultaneously or in sequence. Multiple rounds of delivery of RNA, DNA or protein can occur with or without further protein expression. Methods of preparing compositions for use in genome editing using the CRISPR/Cas systems are described in detail in WO 2013/176772 and WO 2014/018423, which are specifically incorporated by reference herein in their entireties.
In general, "CRISPR" refers to clustered regularly interspaced short palindromic repeats or any of the DNA loci that serve to direct CRISPR associated proteins or similar nucleotide-directed nucleases. It also describes man-made, constructed, or selected systems derived using these frameworks or proteins. CRISPR systems and the related proteins vary among the currently described type I, type II and type III systems, though it is possible other analogous systems have yet to be described.
In general, "CRISPR system" refers collectively to transcripts and other elements involved in the expression of or directing the activity of CRISPR-associated ("Cas") genes, including sequences encoding a Cas gene, a tracr (trans-activating CRISPR) sequence (e.g., tracrRNA or an active partial tracrRNA), a tracr-mate sequence (encompassing a "direct repeat" and a tracrRNA-processed partial direct repeat in the context of an endogenous CRISPR system), a guide sequence (also referred to as a "spacer" in the context of an endogenous CRISPR system), and other sequences and transcripts from a CRISPR locus. One or more tracr mate sequences operably linked to a guide sequence (e.g., direct repeat-spacer-direct repeat) can also be referred to as pre-crRNA (pre-CRISPR RNA) before processing or crRNA after processing by a nuclease. CRISPR systems can also include modified, swapped or engineered guide, tracr or chimeric RNA sequences and the protein with which they interact (for example, Briner, et al., Mol Cell 56(2):333-9 (2014)). The methods disclosed herein may also be applicable to other, non-CRISPR nucleotide-directed nucleases.
In some embodiments, a tracrRNA and crRNA are linked and form a chimeric crRNA-tracrRNA hybrid where a mature crRNA is fused to a partial tracrRNA via a synthetic stem loop to mimic the natural crRNA:tracrRNA duplex (as described in Cong, Science, 15:339(6121):819-823 (2013) and Jinek, et al., Science, 337(6096):816-21 (2012)). A single fused crRNA-tracrRNA construct can also be referred to as a guide RNA or gRNA (or single-guide RNA (sgRNA)). Within a gRNA, the crRNA portion can be identified as the 'target sequence' and the tracrRNA is often referred to as the 'scaffold'. The target sequence can be perfectly complementary to a targeted site, as is often the case for an on-target site, or may also contain mismatches, insertions, deletions or be of different length than the cleaved intended or unintended sites.
In some embodiments, the tracrRNA can be modified in length, sequence or other composition. Similarly, the guide portion or guide sequence can be modified in sequence and/or in length. The guide strand length varies between species. In some embodiments the length of the guide RNA is shortened, lengthened or further changed to alter the affinity to the complementary sequence with the aim of increasing specificity or affecting the activity (Fu, et al., Nature Biotech. (3):279-84. (2014)).
When a gRNA and Cas9 are expressed together in a cell, a gRNA/Cas9 complex forms and is recruited to the genomic target sequence through binding to the PAM and/or the base-pairing between the gRNA sequence and the complement to the target sequence in the genomic DNA (Addgene, "CRISPR in the Lab: A Practical Guide," Addgene website, 2014). For Cas9 to successfully bind to a DNA sequence, the target sequence must be sufficiently complementary to the guide strand and must be followed by a protospacer adjacent motif (PAM) sequence. Mismatches are tolerated in both the guide and the PAM sequence (Fu, et al., Nat Biotechnol, 31:822-826 (2013); Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013)). The specified nucleotides in the PAM may vary in their spacing from the protospacer: in some systems the PAM sequence is NGG, while in others the specified nucleotides are further away, as in NNNNGATT, where N is any nucleotide. The PAM sequence is present in the DNA target sequence, but not in the gRNA sequence. Any DNA sequence with the correct target sequence followed by the PAM sequence may be bound by Cas9, and may be cleaved.
The binding of the gRNA/Cas9 complex localizes the Cas9 to the genomic target sequence. In one embodiment, wild type Sp Cas9 makes a double strand break 3-4 nucleotides upstream of the PAM sequence, which can be repaired by the Non-Homologous End Joining (NHEJ) DNA repair pathway, the Homology Directed Repair (HDR) pathway or alternative DNA repair pathways. The system can be manipulated to induce a variety of gene modifications including insertions and deletions causing frameshifts and/or premature stop codons, specific nucleotide changes, etc.
In some embodiments, one or more vectors driving expression of one or more elements of a CRISPR system are introduced into a target cell such that expression of the elements of the CRISPR system direct formation of a CRISPR complex.
Although the specifics can vary between different engineered CRISPR systems, the overall methodology is similar. A practitioner interested in using CRISPR technology to target a DNA sequence can insert a short DNA fragment containing the target sequence into a guide RNA expression plasmid. The sgRNA expression plasmid contains the target sequence (generally about 20 nucleotides), a form of the tracrRNA sequence (the scaffold), as well as a suitable promoter and necessary elements for proper processing in eukaryotic cells. Such vectors are commercially available (see, for example, Addgene). Many of the systems rely on custom, complementary oligonucleotides that are annealed to form a double stranded DNA and then cloned into the sgRNA expression plasmid. These sequences can also be generated using PCR cloning or mutagenic strategies. Selection methodologies can also be used to isolate guide RNAs from pools of guide RNAs. Co-expression of the sgRNA and the appropriate Cas enzyme from the same or separate plasmids in transfected cells results in a single or double strand break (depending on the activity of the Cas enzyme) at the desired target site.
The literature also contains examples indicating the importance of off-target analysis. The Examples below show that levels of off-target cleavage using CRISPR/Cas9-based gene modification strategies can be comparable with the on-target rates, even when there are multiple mismatches to the guide strand in the region close to the PAM. The Examples also show that RNA guide strands containing insertions or deletions in addition to base mismatches can result in cleavage and mutagenesis at a genomic target site with levels similar to that of the original guide strand. These studies provide experimental evidence that genomic sites can be cleaved when the DNA sequences contain insertions or deletions compared with the CRISPR guide strand. Accordingly, methods and systems for identifying target sites, and particularly off-target sites, of CRISPR/Cas guide strands are provided.
Additionally, methods and systems for ranking target sites, and particularly off-target sites, of CRISPR/Cas guide strands are provided. The methods and systems can be used to prepare a list of off-target sites for a guide strand based on 1, 2, 3, or more mismatches, insertions, deletions, or combinations thereof.
Although, as discussed above, a chimeric guide RNA (gRNA) contains a target sequence, or guide sequence, and a tracrRNA sequence, with respect to the methods and systems disclosed herein, "guide", "guide strand", "guide strand sequence" and "guide sequence" are used interchangeably and refer to a gRNA or sgRNA sequence including, and preferably consisting of the target sequence of the gRNA that binds to a complementary genomic sequence at the target site (Jinek, et al., Science, 337:816- 821 (2012)). In other embodiments, the guide sequence is not a chimeric sequence, but contains two parts: the guide portion and the tracrRNA. Alternative versions also exist in other embodiments with combinations of sequences, or replacements or modifications of portions of the tracrRNA or linking of RNA fragments, such as modifications to the lower or upper stem, nexus or hairpins, or the inclusion of additional sequences. The additional sequences may permit quantitation, binding to other nucleotides, linking to functional domains, other uses, or not provide a function. The guide sequence can be expressed from a plasmid, provided as RNA, or complexed with the Cas protein prior to adding to the cells. The sequence can be articulated as an RNA sequence or a cDNA sequence. With respect to the methods and systems discussed herein, for purposes of identity, homology, and other means of sequence comparison between gRNA sequence and genomic sequence, there is generally no "penalty" or other loss of identity for uracil (U) in the place of thymine (T). Therefore, the gRNA and genomic sequences can be compared as RNA-to-DNA or DNA-DNA and have the same sequence identity. In some embodiments, the disclosed systems and methods include converting an RNA sequence to DNA, or vice versa, so that sequences are compared as DNA-to-DNA, or RNA-to-RNA. In other embodiments other nucleotides, including non-natural nucleotides can be included.
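By way of non-limiting illustration of comparing a gRNA sequence with a genomic sequence without penalizing uracil in place of thymine, the following sketch converts both sequences to a DNA alphabet before counting base mismatches. The function names `to_dna` and `count_mismatches` are hypothetical and are not taken from any disclosed implementation.

```python
def to_dna(sequence: str) -> str:
    """Express a sequence in the DNA alphabet by writing U as T."""
    return sequence.upper().replace("U", "T")


def count_mismatches(guide: str, site: str) -> int:
    """Count base mismatches between a guide and a site of equal length.

    Both sequences are first converted to DNA, so U versus T carries no penalty.
    """
    guide_dna, site_dna = to_dna(guide), to_dna(site)
    if len(guide_dna) != len(site_dna):
        raise ValueError("sequences must have the same length to count mismatches")
    return sum(1 for g, s in zip(guide_dna, site_dna) if g != s)
```

Under these assumptions, an RNA guide and the DNA sequence with the same bases yield zero mismatches, so the comparison gives the same result whether performed as RNA-to-DNA or DNA-to-DNA.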
As used herein, "target site" generally refers to a genomic location to which a guide strand might bind. The binding level may vary and may depend on context, accessibility or other factors. An "on-target" site generally refers to a genomic site to which a practitioner desires binding and/or cleavage to occur, while "off-target" refers to a genomic site to which a practitioner does not desire binding and/or cleavage to occur. The definition of target site or on-target site can be thought of as the intended binding or cleavage site, regardless of its level of identity, or number of mismatches, and regardless of how this site compares to other unintended sites that may score below or higher in these indices. In the context of the CRISPR/Cas system, an on-target site can be a genomic site at which genetic modification is desired, while an off-target site can be a genomic site at which genetic modification is not required, not desired, or undesirable. On-target and off-target sites can have the same (e.g., identical), or different nucleotide sequences. A "cleavage site" is the site where the nuclease creates a single-strand break or double-stranded DNA breaks. In the CRISPR systems used in some embodiments, this is within the target site, 3 nucleotides from the PAM. As used herein, "target sequence" and "target site sequence" are used interchangeably. The terms generally refer to the genomic DNA sequence at the target site and can optionally include the sequence of a PAM motif. It will be appreciated that the site is double-stranded genomic DNA, and therefore, the target sequence can be expressed or described by providing the sequence of either strand of DNA at the target site. For example, the target sequence can be expressed as the sequence of the strand of genomic DNA to which the guide sequence of a gRNA binds, or its complementary strand. Therefore, a target sequence can also be expressed as a sequence that is the same or similar to the gRNA sequence.
In some instances a site can be cleaved using more than one guide strand on one or the other DNA strand. As discussed and exemplified in more detail below, the target sequence is most typically expressed as the same or similar sequence to the guide sequence so that the guide sequence can be aligned to the sequence of genomic DNA at the target site and establish the identity between the guide sequence and DNA sequence at the site.
The systems and methods described herein for predicting off-target sites generally involve generating search criteria derived from input criteria, generating a list of target sites, and directing the list of target sites as output to the user. The input criteria will generally include information regarding the guide sequence, and optionally the PAM sequence, the number of allowed mismatches, the number of allowed insertions, the number of allowed deletions, the genome to be searched, etc. In preferred embodiments the output is provided in the form of a ranked list wherein each of the target sites is assigned a numerical value, or "score", that correlates with the likelihood of nuclease cleavage at that site. It will be appreciated that in many cases the practitioner knows the on-target location and, although the methods and systems are designed to identify off-target locations, the output may nonetheless also include the on-target site(s). In some embodiments, the user may wish to determine if there are on- or off-target sites within different genomes. Therefore, in some embodiments, the list of target sites includes both on-target sites and off-target sites. In other embodiments, only off-target sites are provided. An example of a genomic search for only off-target sites is when targeting non-genomic sequences, such as mutated sites, chromosomal re-arrangements, introduced sequences (such as cDNA or other expression cassettes) or viral sequences. In some embodiments, the on-target site(s) can be subtracted or removed from the output.
In some embodiments, the methods and systems rank the target sites based on the likelihood of cleavage. The ranking can be based upon a scoring function for predicting nuclease activity based at least in part on identity between the guide strand and each genomic target sequence and/or the ability of the guide sequence to hybridize to the complement thereof. In some embodiments the predictions can be based on the sequences and other known or predicted features such as accessibility, type of sequence, expression state or genomic context. In some embodiments the predictions will also include information about the cells in question, their development, tissue-type, or expression pattern. In some embodiments, the methods and systems provide PCR primer sequences that can be used for synthesizing oligonucleotide primers for testing cleavage in vivo.
A. Search Inputs
Typically, to perform a search, user input can include the genome of interest, guide strand sequence, PAM sequence, and the number of base mismatches, insertions, and deletions allowed. To perform a search, a user chooses the genome of interest from the list, and enters the guide strand and optionally PAM sequences (Figure 25A). Types of indel query include, for example, (i) the number of mismatches with no insertion or deletion (i.e., "No indels"); (ii) the number of mismatches in addition to a single-base deletion (i.e., "Del"); and (iii) the number of mismatches in addition to a single-base insertion (i.e., "Ins"). Typically, up to three mismatches without indels, and up to two mismatches together with a one-base insertion and/or one-base deletion can be selected. However, in some embodiments, 4, 5, 6, 7, 8, 9, 10, or more mismatches, insertions, deletions, or any combination thereof can be selected.
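As a non-limiting illustration, the user inputs described above could be gathered into a single record before query construction. The following is a sketch only; the class name `SearchInput`, its field names and its default values are hypothetical and do not describe any actual user interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SearchInput:
    """Hypothetical container for the user-specified search criteria."""
    genome: str                  # genome of interest chosen by the user
    guide: str                   # guide strand sequence
    pam: str = "NGG"             # optional PAM; NGG is used here as an example default
    max_mismatches: int = 3      # mismatches allowed when no indel is used
    allow_insertion: bool = False
    allow_deletion: bool = False

    def query_types(self) -> List[str]:
        """List the indel query types implied by the selected options."""
        types = ["No indels"]
        if self.allow_deletion:
            types.append("Del")
        if self.allow_insertion:
            types.append("Ins")
        return types
```

For instance, selecting only the deletion option yields the query types "No indels" and "Del", mirroring query types (i) and (ii) above.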
In some embodiments, PAM variants such as NRG or other PAM sequences can be entered in the suffix box. For example, the spacer (Ns) and required nucleotides are entered into the suffix box, such as "NNNNGATT", "NNAGAA", or "NAAAAC", and the output includes genomic sites with any nucleotide at the N positions. In other embodiments, a range of other sequences may constitute naturally occurring or modified PAM sequences. If primers are desired, primer design parameter settings and parameter templates can also be entered.
In other embodiments, parameters may be entered that correspond to cell type, culture conditions, animal age or growth, developmental state, genomic context, chromosomal or methylation state, DNA mutation repair, pathway choice and other features affecting cleavage and/or mutation rates.
B. Processing
The disclosed methods for identifying off-target cleavage locations of a CRISPR/Cas nuclease are typically computer-implemented methods that include scanning or searching the genomic sequence data for the target cleavage locations of the nuclease based on parameters selected from the group consisting of guide strand sequence, organismal genome, number of mismatches, insertions, and/or deletions, to return target cleavage location sequences and/or locations in the genome. Typically the target sites identified by the search are assigned a score that is used to rank the target cleavage locations based on the likelihood of target cleavage. In other embodiments the primary function is ranking sequences according to a range of criteria.
1. Searching for Off-target Sites
In the preferred embodiments, before performing a search, a series of search entries are constructed according to the user-specified guide strand and search criteria (Figure 25B). The search entries include all insertions and deletions at each possible location (Figure 25C, Figures 26E-26F).
Although multi-base deletions (RNA bulges) and insertions (DNA bulges) could be tolerated (Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014)), a search for a wide range of insertions and deletions will likely result in a very large number of returned sites. Therefore, in a preferred embodiment, searches are performed only for single-base insertions and deletions in the DNA sequence compared with the guide strand (Figure 25A). In other embodiments, larger numbers of nucleotide insertions or deletions, or multiple insertions and/or deletions, can be accommodated, though this is likely to result in a longer list of output sites. Widening the scope of output sites may be particularly useful when trying to model the cause of verified off-target events that cannot be explained by stricter criteria. For the potential target sites, the search algorithm can allow some ambiguities (such as N for any nucleotide). Ambiguities included in the search string are not counted toward the user-specified mismatch limits. In certain embodiments, ranges of ambiguities can be employed, such as the codes for either of two nucleotides (R, Y, S, W, K or M) or three nucleotides (B, D, H, V), in addition to N. The use of ambiguities allows the inclusion of the matching genomic base with the output sequences. One possibility is to include an "N" in positions that can have substitutions, such as the first base in a guide strand, which is often a G primarily to aid in transcription but does not need to match the complementary target sequence (Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013); Mali, et al., Science, 339:823-826 (2013)). One can leave off this base when performing a search, or include a 5' N in the search string, which allows output and alignment of the corresponding 5' bases at each locus to the "N."
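By way of non-limiting illustration of a search string containing an ambiguity that is not counted toward the mismatch limit, the following sketch scans one strand of a sequence with a simple sliding window. It is not the indexed search described elsewhere herein; the function names `count_query_mismatches` and `scan` are hypothetical, and only the "N" ambiguity is handled in this example.

```python
from typing import List, Tuple


def count_query_mismatches(query: str, site: str, wildcard: str = "N") -> int:
    """Count mismatches, skipping query positions that hold the wildcard."""
    return sum(1 for q, s in zip(query, site) if q != wildcard and q != s)


def scan(sequence: str, query: str, max_mismatches: int) -> List[Tuple[int, str, int]]:
    """Return (start, matched sequence, mismatch count) for each qualifying window.

    Only the given strand is scanned in this sketch; a practical search
    would also consider the reverse complement.
    """
    hits = []
    width = len(query)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        found = count_query_mismatches(query, window)
        if found <= max_mismatches:
            hits.append((start, window, found))
    return hits
```

Because each hit reports the genomic window itself, the base aligned to a 5' "N" in the query appears in the output, as described above.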
In preferred embodiments, the search algorithm is based on sequence homology and identity, with the option to allow insertions or deletions, and can serve as a search method, a ranking method, or a combination thereof. The off-target site lists can be constructed using, for example, existing search algorithms such as FASTA or BLAST. In some embodiments, these types of existing or freshly generated lists can be ranked by the methods described here. The FASTA algorithm is described in W.R. Pearson and D.J. Lipman (1988) Proc. Natl. Acad. Sci., 85:2444-2448 and D.J. Lipman and W.R. Pearson (1985) Science, 227:1435-1441. The BLAST algorithm is described in S. Altschul, et al. (1990) J. Mol. Biology, 215:403-410. While FASTA, BLAST, megaBLAST, BLAST Bowtie, and other later improvements can be used to construct a list of target sites, these are not the preferred approaches. In some embodiments, other search methods are used, then refined by using a ranking algorithm that can weigh the number and positions of mismatches, insertions, deletions and their combinations. The output from non-exhaustive search tools may not include all possible off-target sites.
In preferred embodiments, on-target and off-target sites of the CRISPR guide strands are determined by comparing the query sequence both with and without insertions, deletions, and/or mismatches at one or multiple positions using the FetchGWI search program (Iseli, et al., PLoS ONE, 2(6): e579 (2007)). FetchGWI operates on indexed genome sequences that are precompiled and stored (Figures 26A-26G). It can identify genomic locations with sequences that match any of the series of search entries. FetchGWI saves run time by searching indexed files that represent the genome sequences, rather than the sequences themselves. There is one index entry for each nucleotide in the genome, which allows a rapid and exhaustive search. In other embodiments, other indexing strategies can be used. Exhaustive, complete searches are a key advantage over BLAST and other programs that scan non-overlapping words and may miss potential off-target sites.
The guide strand sequence and/or variants thereof and/or other query sequences can be compared to an organismal genome, or any loaded sequence files. In preferred embodiments, the searched genome is human, mouse, Caenorhabditis elegans, or rhesus macaque genomes. In other embodiments, any genome, modified genome or sequence file can be searched. In the most preferred embodiments, the searchable genome is prepared using the genwin program (Iseli, et al., PLoS ONE, 2(6): e579 (2007)) to transform the DNA sequence from FASTA formatted files into unsorted index entries which have all possible 25 bases-long tags in the DNA sequence. After that, the sortGWI program is used to sort the index entries, and store the result as a binary index file. sortGWI subdivides the whole index file into parts, each representing entries having identical first 12 nucleotides. A secondary index, recording the position in the main index file where each part starts, is added to the end of the index file to enable faster search and reduce file size. The index files can be stored in a server.
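As a non-limiting illustration of the indexing strategy described above, the following sketch builds a sorted list of fixed-length tags with their positions, records where each group of tags sharing a common prefix begins, and looks up a query tag by binary search within its group. It is an in-memory simplification and does not reproduce the file formats or behavior of the programs named above; the function names `build_index` and `lookup` are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


def build_index(sequence: str, tag_len: int = 25,
                prefix_len: int = 12) -> Tuple[List[Tuple[str, int]], Dict[str, int]]:
    """Build sorted (tag, position) entries plus a secondary prefix index.

    The secondary index maps each prefix to the rank of the first entry
    that starts with it, so a lookup can begin inside the right group.
    """
    entries = sorted(
        (sequence[i:i + tag_len], i)
        for i in range(len(sequence) - tag_len + 1)
    )
    starts: Dict[str, int] = {}
    for rank, (tag, _) in enumerate(entries):
        starts.setdefault(tag[:prefix_len], rank)
    return entries, starts


def lookup(entries: List[Tuple[str, int]], starts: Dict[str, int],
           tag: str, prefix_len: int = 12) -> List[int]:
    """Return all positions where `tag` occurs (tag must have the indexed length)."""
    prefix = tag[:prefix_len]
    if prefix not in starts:
        return []
    rank = bisect_left(entries, (tag, -1), starts[prefix])
    positions = []
    while rank < len(entries) and entries[rank][0] == tag:
        positions.append(entries[rank][1])
        rank += 1
    return positions
```

The default tag and prefix lengths follow the 25-base tags and 12-nucleotide grouping described above; smaller values can be passed when experimenting with short test sequences.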
When the search is initiated, the sequence tags can be used to generate a series of additional tags that contain indels if the insertion or deletion boxes are checked, or if defaults are used. Identical tags are removed if they are duplications for strings containing consecutive identical bases, or in other embodiments, these can be removed at other steps in the processing. The resulting tags are all searched against the user-selected genome. The working Examples include exemplary searches, for example, if guide strand R-01 is entered and one (1) insertion and one (1) deletion are selected, the tags illustrated in Figure 26E and 26F are generated and used to search a genome.
To search the query sequences against the user-selected genome, the FetchGWI program can be used (Iseli, et al., PLoS ONE, 2(6): e579 (2007)). For example, if the user specifies a search with one or more mismatches, all possible sequence tags can be generated by replacing the specified number of nucleotides with all other possibilities. In the preferred embodiment, FetchGWI can search the genome allowing the user-specified number of mismatches. After that, FetchGWI can sort all the query tags and search for matches in the index file using binary search.
FetchGWI can report the search results by appending the actual sequence tag found, along with the accession number and position offset within the sequence, for each matched query tag. Programs such as the TagScan algorithm can be used to minimize run times while still performing exhaustive genome searches. In other embodiments, other programs are used that can allow greater numbers of mismatches to the genomic sequences.
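By way of non-limiting illustration of generating all sequence tags that differ from a query at a specified number of positions, the following sketch enumerates every choice of positions and every replacement base at those positions. The function name `mismatch_variants` is hypothetical, and the sketch assumes an upper-case tag over the alphabet A, C, G and T.

```python
from itertools import combinations, product
from typing import Set


def mismatch_variants(tag: str, n_mismatches: int, alphabet: str = "ACGT") -> Set[str]:
    """Return all tags that differ from `tag` at exactly `n_mismatches` positions."""
    variants = set()
    for positions in combinations(range(len(tag)), n_mismatches):
        # At each chosen position, every base other than the original is allowed.
        choices = [[b for b in alphabet if b != tag[p]] for p in positions]
        for replacement in product(*choices):
            chars = list(tag)
            for p, b in zip(positions, replacement):
                chars[p] = b
            variants.add("".join(chars))
    return variants
```

The number of variants grows quickly: a tag of length L with n mismatches yields C(L, n) × 3^n tags, which is one reason the number of allowed mismatches is typically kept small.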
2. Exemplary Methods of Constructing Query Sequences
As discussed above, a series of guide sequence variants are constructed based on a user-entered guide sequence and used to query the selected genome for potential target sites. The series of query guide sequences is typically constructed from user-entered parameters, including the number of mismatches (e.g., 0, 1, 2, 3, etc.), insertions (e.g., 0, 1, 2, etc.), and/or deletions (e.g., 0, 1, 2, etc.) that are allowed at the target site relative to the guide sequence. In some embodiments, multiple insertions and/or deletions may be allowed. In some embodiments, duplicative query sequences are subtracted or culled from the series before the search such that each sequence in the series is unique and only searched once. In a particular embodiment, the query guide sequences provide guide strand variant sequences having no indels and 0, 1, 2, or 3 mismatches; 1-base deletion, no insertions, and 0, 1, or 2 mismatches; 1-base insertion, no deletions, and 0, 1, or 2 mismatches; 1-base deletion, 1-base insertion, and 0, 1, or 2 mismatches; or any combination thereof.
In specific embodiments,
(1) if insertions are allowed:
a series of query guide sequences are generated that are variations of the original guide sequence. At each position in the guide sequence (such as between the PAM and the closest nucleotide, between the first and second nucleotide, between the second and third nucleotide, etc.), each nucleotide can be inserted, generating different guide strand variations. As there are four natural nucleotides, in most embodiments there will be four variations at each position, with A, C, G, or T introduced at that position. In the preferred embodiments, an "N" is inserted that will match any of these. If insertions of greater than one nt are allowed, then the single inserted N can also be replaced with two or more Ns, which can be inserted into each position to generate variations with one or more nt insertions.
(2) if deletions are allowed:
a series of query guide sequences are generated that are variations of the original guide sequence. At each position in the guide sequence (such as between the PAM and the closest nucleotide, between the first and second nucleotide, between the second and third nucleotide, etc.), each nucleotide can be deleted, resulting in a guide strand that is one nt shorter. At positions where there are repeated nucleotides, deleting any one would result in the same variant. This holds whether either of two identical adjacent nt is deleted, or any one of a longer run of repeated nt is deleted. If deletions of greater than one nt are allowed, then the single nt deleted can also be replaced with two or more deleted nt that can be deleted at each position along the guide strand.
(3) if insertions and deletions are allowed:
a series of query guide sequences are generated that are variations of the original guide sequence. At each position in the guide sequence (such as between the PAM and the closest nucleotide, between the first and second nucleotide, between the second and third nucleotide, etc.), each nucleotide can be inserted, generating different guide strand variations. As there are four natural nucleotides, in most embodiments there will be four variations at each position, with A, C, G, or T introduced at that position. In the preferred embodiments, an "N" is inserted that will match any of these, as with insertions alone. The resulting string of queries is then subjected to individual deletions as in (2) above, resulting in variations that have inserted and deleted bases. Deleting an inserted base would result in the original sequence.
Allowing more than one base inserted and / or deleted would introduce even more variations.
(4) if insertions are allowed with:
a series of query guide sequences are generated that are variations of the original guide sequence. At each position in the guide sequence (such as between the PAM and the closest nucleotide, between the first and second nucleotide, between the second and third nucleotide, etc.), each nucleotide can be inserted, generating different guide strand variations. As there are four natural nucleotides, in most embodiments there will be four variations at each position, with A, C, G, or T introduced at that position. In the preferred embodiments, an "N" is inserted that will match any of these. In addition, other embodiments can allow the introduction of a second insertion at each point in the guide sequence.
(5) if deletions are allowed:
a series of query guide sequences are generated that are variations of the original guide sequence. At each position in the guide sequence (such as between the PAM and the closest nucleotide, between the first and second nucleotide, between the second and third nucleotide, etc.), each nucleotide can be deleted, resulting in a guide strand that is one nt shorter. At positions where there are repeated nucleotides, deleting any one would result in the same variant. This holds whether either of two identical adjacent nt is deleted, or any one of a longer run of repeated nt is deleted. In addition, other embodiments can allow the introduction of a second insertion at each point in the guide sequence.
(6) if insertions and deletions are allowed:
a series of query guide sequences are generated that are variations of the original guide sequence. At each position in the guide sequence (such as between the PAM and the closest nucleotide, between the first and second nucleotide, between the second and third nucleotide, etc.), each nucleotide can be inserted, generating different guide strand variations. As there are four natural nucleotides, in most embodiments there will be four variations at each position, with A, C, G, or T introduced at that position. In the preferred embodiments, an "N" is inserted that will match any of these, as with insertions alone. The resulting string of queries is then subjected to individual deletions as in (5) above, resulting in variations that have inserted and deleted bases. Deleting an inserted base would result in the original sequence, though deleting one of the inserted bases may produce a variation already included in the output.
(7) if insertions are allowed with:
in other embodiments, other numbers of insertions may be allowed, leading to a large number of guide strand variations.
(8) if deletions are allowed:
in other embodiments, other numbers of deletions may be allowed, leading to a large number of guide strand variations, though the introduction of many deletions would lead to shortening of the guide strand.
(9) if insertions and deletions are allowed:
variations can be derived as in (7) and (8) above, and can also contain combinations as described in (6). Generating and screening the resulting large number of variations may not be feasible using current computer configurations and testing or sequencing methods, but advances may allow screening larger numbers of variations in other embodiments.
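The single-insertion, single-deletion, and combined cases in (1) through (3) can be sketched as below. This is an illustrative sketch under stated assumptions: the wildcard "N" is inserted at every gap including both ends of the guide (the text names the PAM-proximal end explicitly; including the distal end is an assumption), and the function names are hypothetical.

```python
def insertion_variants(guide, wildcard="N"):
    """Insert one wildcard base at each gap in the guide, ends included."""
    return {guide[:i] + wildcard + guide[i:] for i in range(len(guide) + 1)}

def deletion_variants(guide):
    """Delete one base at each position. Deleting any base within a run of
    repeated nucleotides gives the same variant, so the set collapses them."""
    return {guide[:i] + guide[i + 1:] for i in range(len(guide))}

def insertion_deletion_variants(guide, wildcard="N"):
    """Apply single deletions to every single-insertion variant."""
    variants = set()
    for inserted in insertion_variants(guide, wildcard):
        variants |= deletion_variants(inserted)
    # Deleting the inserted base simply restores the original sequence.
    variants.discard(guide)
    return variants
```

For the guide fragment "GAAT", deleting either of the two adjacent A bases yields "GAT", so only three distinct one-base deletion variants result rather than four.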
Once the variations with indels are created as in (1-9) above, these query sequences, or tags, are used to search the specified genome(s). In one embodiment, FetchGWI is used to compare each variant to sequences throughout the genome and output the sites that match the user-specified criteria. In one embodiment, the criterion is the number of mismatches allowed for each condition: no indels, with insertions, or with deletions. In other embodiments, the output is limited by other user-specified or default criteria. Examples of this type of screening include reporting only sites that appear to be in open chromatin, or outputting only sites with particular annotations, such as in exons, regulatory sequences, or defined oncogenic regions.
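The output filtering described above might be sketched as follows. The `Hit` record, the per-condition limits dictionary, and the half-open (chromosome, start, end) region tuples are all hypothetical structures chosen for illustration, and the mismatch limit is treated here as an inclusive maximum, which is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    chrom: str
    position: int
    mismatches: int
    indel: str  # "none", "insertion", or "deletion"

def within_limits(hit, limits):
    """Compare a hit's mismatch count with the limit for its indel condition."""
    return hit.mismatches <= limits[hit.indel]

def in_regions(hit, regions):
    """True if the hit lies inside any annotated (chrom, start, end) region."""
    return any(chrom == hit.chrom and start <= hit.position < end
               for chrom, start, end in regions)

def filter_hits(hits, limits, regions=None):
    """Apply mismatch limits, then optionally keep only annotated sites."""
    kept = [h for h in hits if within_limits(h, limits)]
    if regions is not None:
        kept = [h for h in kept if in_regions(h, regions)]
    return kept
```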
In specific embodiments, the mismatches can similarly be added to the query sequences prior to searching:
(10) if one mismatch, zero insertions, and zero deletions is selected:
the series of query guide sequences includes the guide sequence and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides,
such that each of the query guide sequences in the series has zero or one mismatches, zero insertions, and zero deletions relative to the guide sequence;
(11) if two mismatches, zero insertions, and zero deletions is selected:
the series of query guide sequences includes the guide sequence and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides, and
guide sequence variants wherein each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide,
such that each of the query guide sequences in the series has zero, one, or two mismatches, zero insertions, and zero deletions relative to the guide sequence;
(12) if three mismatches, zero insertions, and zero deletions is selected:
the series of query guide sequences includes the guide sequence and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides,
guide sequence variants wherein each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide,
guide sequence variants wherein each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide,
such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero insertions, and zero deletions relative to the guide sequence;
(13) if zero mismatches, one insertion, and zero deletions is selected:
the series of query guide sequences includes the guide sequence and sequence variants thereof wherein each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence,
such that each of the query guide sequences in the series has zero mismatches, one insertion, and zero deletions relative to the guide sequence;
(14) if zero mismatches, two insertions, and zero deletions is selected:
the series of query guide sequences includes the guide sequence and sequence variants thereof wherein each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence, and
guide sequence variants wherein each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence,
such that each of the query guide sequences in the series has zero mismatches, two insertions, and zero deletions relative to the guide sequence;
(15) if zero mismatches, zero insertions, and one deletion is selected:
the series of query guide sequences includes the guide sequence and sequence variants thereof wherein one nucleotide is individually deleted from each nucleotide position of the guide sequence,
such that each of the query guide sequences in the series has zero mismatches, zero insertions, and one deletion relative to the guide sequence.
(16) if zero mismatches, zero insertions, and two deletions is selected:
the series of query guide sequences includes the guide sequence and sequence variants thereof wherein one nucleotide is individually deleted from each nucleotide position of the guide sequence, and
guide sequence variants wherein two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence
such that each of the query guide sequences in the series has zero mismatches, zero insertions, and two deletions relative to the guide sequence;
(17) if one mismatch, one insertion, and zero deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; and guide sequence variants having the combination thereof,
such that each of the query guide sequences in the series has zero or one mismatches, zero or one insertions, and zero deletions relative to the guide sequence;
(18) if two mismatches, one insertion, and zero deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, or two mismatches, zero or one insertions, and zero deletions relative to the guide sequence;
(19) if three mismatches, one insertion, and zero deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero or one insertions, and zero deletions relative to the guide sequence;
(20) if one mismatch, two insertions, and zero deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; and guide sequence variants having the combination thereof,
such that each of the query guide sequences in the series has zero or one mismatches, zero, one, or two insertions, and zero deletions relative to the guide sequence;
(21) if two mismatches, two insertions, and zero deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, or two mismatches, zero, one, or two insertions, and zero deletions relative to the guide sequence;
(22) if three mismatches, two insertions, and zero deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero, one, or two insertions, and zero deletions relative to the guide sequence;
(23) if one mismatch, zero insertions, and one deletion is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having the combination thereof,
such that each of the query guide sequences in the series has zero or one mismatches, zero insertions, and zero or one deletions relative to the guide sequence;
(24) if two mismatches, zero insertions, and one deletion is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, or two mismatches, zero insertions, and zero or one deletions relative to the guide sequence;
(25) if three mismatches, zero insertions, and one deletion is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero insertions, and zero or one deletions relative to the guide sequence;
(26) if one mismatch, zero insertions, and two deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having the combination thereof,
such that each of the query guide sequences in the series has zero or one mismatches, zero insertions, and zero, one, or two deletions relative to the guide sequence;
(27) if two mismatches, zero insertions, and two deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, or two mismatches, zero insertions, and zero, one, or two deletions relative to the guide sequence;
(28) if three mismatches, zero insertions, and two deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero insertions, and zero, one, or two deletions relative to the guide sequence;
(29) if one mismatch, one insertion, and one deletion is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having the combination thereof,
such that each of the query guide sequences in the series has zero or one mismatches, zero or one insertions, and zero or one deletions relative to the guide sequence;
(30) if two mismatches, one insertion, and one deletion is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, or two mismatches, zero or one insertions, and zero or one deletions relative to the guide sequence;
(31) if three mismatches, one insertion, and one deletion is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero or one insertions, and zero or one deletions relative to the guide sequence;
(32) if one mismatch, two insertions, and one deletion is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having the combination thereof,
such that each of the query guide sequences in the series has zero or one mismatches, zero, one, or two insertions, and zero or one deletions relative to the guide sequence;
(33) if two mismatches, two insertions, and one deletion is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, or two mismatches, zero, one, or two insertions, and zero or one deletions relative to the guide sequence;
(34) if three mismatches, two insertions, and one deletion is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero, one, or two insertions, and zero or one deletions relative to the guide sequence;
(35) if one mismatch, one insertion, and two deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having the combination thereof,
such that each of the query guide sequences in the series has zero or one mismatches, zero or one insertions, and zero, one, or two deletions relative to the guide sequence;
(36) if two mismatches, one insertion, and two deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, or two mismatches, zero or one insertions, and zero, one, or two deletions relative to the guide sequence;
(37) if three mismatches, one insertion, and two deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero or one insertions, and zero, one, or two deletions relative to the guide sequence;
(38) if one mismatch, two insertions, and two deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having the combination thereof,
such that each of the query guide sequences in the series has zero or one mismatches, zero, one, or two insertions, and zero, one, or two deletions relative to the guide sequence;
(39) if two mismatches, two insertions, and two deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, or two mismatches, zero, one, or two insertions, and zero, one, or two deletions relative to the guide sequence;
(40) if three mismatches, two insertions, and two deletions is selected:
the series of query guide sequences includes the guide sequence, and sequence variants thereof wherein each nucleotide position in the guide sequence is individually substituted by each of the alternative nucleotides; each combination of two nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each combination of three nucleotide positions in the guide sequence is substituted with each alternative nucleotide; each canonical nucleotide is individually inserted into each nucleotide position of the guide sequence; each combination of two canonical nucleotides is inserted into the guide sequence at each combination of two positions in the guide sequence; one nucleotide is individually deleted from each nucleotide position of the guide sequence; two nucleotides are deleted from each combination of two nucleotide positions of the guide sequence; and guide sequence variants having combinations thereof,
such that each of the query guide sequences in the series has zero, one, two, or three mismatches, zero, one, or two insertions, and zero, one, or two deletions relative to the guide sequence.
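The enumeration in (10) through (40) follows one pattern: apply up to the selected number of deletions, insertions, and substitutions, and keep the union. A minimal sketch of that pattern follows, assuming explicit canonical bases rather than the "N" wildcard; the helper names are hypothetical, and the sketch does not classify each variant by its minimal edit description (for example, a deletion followed by an insertion at the same position is equivalent to a mismatch and is simply deduplicated by the set).

```python
from math import comb

BASES = "ACGT"

def one_substitution(seq):
    return {seq[:i] + b + seq[i + 1:]
            for i in range(len(seq)) for b in BASES if b != seq[i]}

def one_insertion(seq):
    return {seq[:i] + b + seq[i:] for i in range(len(seq) + 1) for b in BASES}

def one_deletion(seq):
    return {seq[:i] + seq[i + 1:] for i in range(len(seq))}

def apply_up_to(seqs, operation, times):
    """Union of the inputs with everything reachable in 1..times applications."""
    result, frontier = set(seqs), set(seqs)
    for _ in range(times):
        frontier = {v for s in frontier for v in operation(s)}
        result |= frontier
    return result

def build_series(guide, mismatches, insertions, deletions):
    """Unique query sequences with zero up to the selected number of each edit."""
    series = apply_up_to({guide}, one_deletion, deletions)
    series = apply_up_to(series, one_insertion, insertions)
    return apply_up_to(series, one_substitution, mismatches)

def count_mismatch_queries(length, max_mismatches):
    """Number of sequences within max_mismatches substitutions of a guide."""
    return sum(comb(length, k) * 3 ** k for k in range(max_mismatches + 1))
```

With no indels, the size of the series is the sum over k of C(L, k) times 3^k; for a 20-nt guide and up to three mismatches this is 1 + 60 + 1,710 + 30,780 = 32,551 query sequences, which illustrates why the series grows quickly once indels are combined with mismatches.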
The guide sequence and the series of query guide sequences can be modified to include one or more PAM sequence suffixes as discussed above. Next, the guide sequence and the series of query guide sequences, with and/or without the PAM sequence suffix(es), are compared or aligned to a genome. As discussed above, in the most preferred embodiments, the genome is a user-selected genome composed of indexed files that represent the genome sequences, rather than the sequences themselves.
A target site location in the genome is typically identified or reported in the output when the genomic sequence matches the user-specified criteria. For example, a site is reported when the number of mismatches is below the user-supplied limit and, if only "no indels" is chosen, the site lacks indels in relation to the guide strand. The maximal number of mismatches allowed can be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15 or more, depending on the guide strand length. Alternatively, a site can be output if it does have an insertion or deletion and that type of search is chosen by the user, subject to the site having a direct match or having fewer mismatches than the user-specified limit. The user can also specify one, two, three or more PAM sequences individually or using consensus or ambiguous sequences. Depending on the number of mismatches, number of indels, guide strand length, and PAM lengths, the genomic sequence may have at least 60, 65, 70, 80, 85, 90, 92, 95, 96, 97, 98, 99 or 100 percent identity to the guide strand.
Searching genomes with a longer guide strand or longer PAM sequences will decrease the number of sites output if using the same number of mismatches; therefore, the genomic sites most similar to the guide strand may correspond to lower levels of identity, such as at least 60, 70, 80, 85, 90, 92, 95, 96, 97, 98, 99 or 100 percent identity to the guide strand. It may be important to query sequences throughout this range, as tissue culture experiments have revealed that guide strands can cleave sites with identities in this range.
In preferred embodiments, the level of matching is further or solely weighed based on sequence-dependent scoring, such that modified counts of the number of mismatches or indels or a modified percentage is determined by the sequence of the guide, the complementary genomic sequence or both. In some embodiments this may be weighed as the change in nucleotide affinity, the ability to tolerate mismatches or indels, or based on other modeling or data.
In other embodiments, other search programs are used to scan the genomes using the range of guide strand variants generated. Other index strategies can be used, or whole genomic sequences can be scanned using Perl, Python, or other direct search programs or scripts. In some embodiments, the programs or scripts would identify sites that match the search criteria, though in other embodiments the sites would correspond to matching the guide strands and variants based on identity percentage. The sites output can be those with the highest percentages, or those sites above a calculated percentage (based on the probability of finding sites after comparing the guide strand, PAM lengths and/or genome size).
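A direct scan of this kind can be sketched in a few lines of Python. This is an illustrative, unindexed search rather than the disclosed program; the function names are hypothetical, only "N" is treated as ambiguous in the PAM, and a site is assumed to be accepted when its mismatch count is less than or equal to the limit.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def pam_matches(pam, site):
    """True if site fits the PAM pattern; 'N' in the pattern matches any base."""
    return len(pam) == len(site) and all(p == "N" or p == s for p, s in zip(pam, site))


def scan(sequence, guide, pam="NGG", max_mismatches=2):
    """Return (start, strand, site, mismatches) for each matching window on both strands.

    start is the 0-based forward-strand coordinate of the leftmost base of the window.
    """
    hits = []
    width = len(guide) + len(pam)
    for strand, text in (("+", sequence), ("-", reverse_complement(sequence))):
        for i in range(len(text) - width + 1):
            site = text[i:i + len(guide)]
            pam_site = text[i + len(guide):i + width]
            n = count_mismatches(guide, site)
            if n <= max_mismatches and pam_matches(pam, pam_site):
                start = i if strand == "+" else len(text) - (i + width)
                hits.append((start, strand, site, n))
    return hits
```

Indels can be covered under the same scheme by scanning with each insertion or deletion variant of the guide in turn.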
A target site location in the genome is typically identified or reported when the genomic sequence has 100% sequence identity with the guide sequence, or the highest percentage in the genome and/or one or more of the query guide sequences with or without one or more appended PAM sequences. In some alternative embodiments, the sequence identity between the genomic sequence and the guide sequence and/or one or more of the query guide sequences with or without one or more appended PAM sequences is at least 80, 85, 90, 92, 95, 96, 97, 98, or 99 percent. The target site or on-target site can be thought of as the intended cleavage site, regardless of its level of identity, or number of mismatches, if it includes indels related to the gRNA and regardless of how this site compares to other un-intended sites (i.e., off-target sites) that may score below or higher in these indices.
In other embodiments any search method using local alignment or index searches could be used, such as Eland, SOAP, SHRiMP, Bowtie, Q-pick, Maq, or BWA. The programs can vary in their speed and ability to locate all sites. A search that fails to exhaustively locate all possible target sites will not output the sites it fails to test or measure. Other embodiments that fail to filter sites may produce very long lists of sites to sort through by scoring and ranking. In some embodiments, the scoring and ranking methods are used to weigh every site in a genome, and only the top sites, the sites scoring above a specified threshold, or a specified number of sites are output.
As discussed above, the guide sequences, variants thereof, query sequences, etc. can include one or more "N" and other symbolic nucleotides, such as those described herein, that refer to one or more nucleotides. It will be appreciated that in some embodiments, where variant and query sequences are constructed by adding (insertions) or substituting (mismatches) each nucleotide, or each alternative nucleotide as appropriate, relative to a parent sequence (e.g., the guide sequence(s)) at one or more positions, this can additionally or alternatively be accomplished by adding or substituting with an "N" and other symbolic nucleotides, and vice versa. Such symbols can be understood by the user and/or computational software, and thus reduce the total number of variant or query sequences that have to be prepared relative to adding or substituting each of the possible alternative nucleotides individually.
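The saving obtained with symbolic nucleotides can be shown with a short sketch using the standard IUPAC nucleotide codes. The function names are hypothetical and the code is only an illustration of the idea.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one stands for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(pattern, site):
    """True if every base of site is allowed by the symbol at that position."""
    return len(pattern) == len(site) and all(s in IUPAC[p] for p, s in zip(pattern, site))


def expand(pattern):
    """All concrete sequences represented by a pattern with symbolic nucleotides."""
    return ["".join(bases) for bases in product(*(IUPAC[p] for p in pattern))]
```

A single query ending in NRG stands in for the eight concrete PAM suffixes returned by `expand("NRG")`.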
2. Constructing the Target Site List
If more than one target site is identified, the target sites are typically reported as a list, preferably a ranked list. Therefore, the disclosed methods and systems can rank the target sites. The ranking can be based on a score that reflects the expectation of how likely the target site will be cleaved by a CRISPR/Cas nuclease such as Cas9, and can be weighted based on one or more factors or attributes. The ranking can be based upon a scoring function for predicting nuclease activity based at least in part on sequence identity between the guide strand and the genomic target sequence and/or complementarity between the guide strand and the complementary strand of the genomic target sequence. In some embodiments the scoring function is derived empirically or by incorporating various design rules. The rank can be determined based on the sum of scores corresponding to different design considerations. The ranking can include scoring systems that include weights for mismatches, insertions, deletions, and combinations of these, with particular weights corresponding to their location in the guide strand, based on nucleotide proximity or relative position, and/or distance from the PAM. The ranking can include scoring systems with additive (or subtractive) weight factors and/or multiplicative factors and/or higher-order weights. In some embodiments, rankings will include features corresponding to the cell type, culture conditions, animal age and/or growth, developmental state, genomic context, chromosomal and/or methylation state, other features affecting cleavage rate, and combinations thereof. Therefore, the method is flexible and will be able to incorporate more design variables into the function as more information about the factors affecting nuclease activity at various target sites becomes available. In addition, the method can be re-applied to an enlarged training set of data once more experimental data become available. In some embodiments a range of different scoring functions is provided, with some applying generally and others optimally for a specific guide strand sequence. Figure 30 presents a flow chart of an exemplary target site prediction method (700) that generates search parameters (710) based upon an input query, constructs a list of on- and off-target sites (720) based upon the search parameters, and ranks (730) the target sites in the list before outputting the results. The score can also include consideration of the number and location of base mismatches, insertions, and/or deletions when ranking the more likely target sites. Other considerations include, but are not limited to, the distance between mismatch(es) and the PAM. The Examples below show that mismatches further from the PAM are more likely to result in off-target cleavage. In some or all sequences, there are positions that may vary from this general trend.
Bioinformatics-based ranking of CRISPR/Cas off-target sites may be hindered by the effects of genomic context and DNA modifications. Identical genomic sites and duplicated sites may have dramatic differences in off-target activity. The data presented in the Examples below show that the indel rate at off-target site R-01 OT2 was 44%, though other loci with the same complementary sequence have much less, or no, activity, possibly due to nuclease blocking or any of the other features described above. The accessibility of the genomic DNA may influence nuclease activity at sites of similar sequence. Accordingly, in some embodiments, the score includes consideration of factors including chromatin condensation and/or DNA availability at the genomic location of the on- and off-target sites, alone or in combination with other factors in the search algorithm.
Typically, the results are sorted for unique sites with the lowest mismatch and indel score to locate the most likely target sites. In some embodiments, a low score correlates with a high likelihood of nuclease cleavage at the target site. For example, in a particular embodiment, one or more on-target sites are reported, generally first in the list, having a score of "0" and off-target sites are ranked in descending order of likelihood of cleavage based on ascending scores of greater than 0. By way of further illustration, the Examples below show an exemplary scoring paradigm wherein a binding site of a NGG PAM guide strand is typically ranked ahead of a binding site for the guide strand with a NAG PAM (by non-limiting example, +0.3 points can be added to the default scoring).
In other embodiments, a high score correlates with a high likelihood of nuclease cleavage at the target site. Other scoring schemes can be used in other embodiments, such as having 100 equal a perfect match or the top scoring site and scoring lower the less probable sites in accordance to mismatches, insertions and deletions, their combinations and positions.
In some embodiments, the mismatches, insertions, and/or deletions result in an addition to the score corresponding to their location in the guide strand, here in nucleotides from the PAM.
In some embodiments the weights corresponding to the location of each mismatch, insertion or deletion are added to make the score. For example, in an exemplary embodiment, for mismatches at or beyond position 13 the method adds 0.1; for positions 9-12, 0.5; for positions 7 and 8, 1.0; for position 6, 1.4; for position 5, 1.9; for position 4, 2.0; for positions 1-3, 4; and for mismatches in the PAM, 10. In other embodiments, there are multiplications of the individual scores, or combinations of additive scores and multiplication weights. In other embodiments, the weight scores are multiplied, or they can be added/subtracted while other weights are multiplied, to include scores for individual or multiple mismatches or indels or multiple sets of mismatches or indels. In other embodiments, there are sequence-specific weights in addition to position-specific weights, and these weights can include the guide or complementary sequence or both. For example, mismatches at G-C base pairs may be weighed differently than mismatches replacing A-T base pairs. Similarly, the resulting mismatches may be weighed, such that G-A, G-T, C-A, or C-T can be scored differently depending on the orientation, the surrounding bases or other features. In other embodiments, other sequence-specific features are weighed, such as the binding affinity, sequence patterns, GC or AT content, di-nucleotide pair usage, or RNA secondary or tertiary structures or capacity to form such structures. Each of these embodiments may be used with each application, such that one scoring system may be applied to look for on- and off-target binding, on- and off-target binding when linked to effector domains, nuclease or nickase binding, nuclease or nickase cleavage, or other binding or functional effects.
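The additive, position-dependent weights of this exemplary embodiment can be written out as a short sketch. The function names and the `in_pam` flag are hypothetical; positions are counted in nucleotides from the PAM, with position 1 taken to be the base adjacent to the PAM.

```python
def mismatch_weight(position, in_pam=False):
    """Weight for one mismatch, using the exemplary values given in the text."""
    if in_pam:
        return 10.0
    if position < 1:
        raise ValueError("position is counted from 1 (adjacent to the PAM)")
    if position <= 3:
        return 4.0
    if position == 4:
        return 2.0
    if position == 5:
        return 1.9
    if position == 6:
        return 1.4
    if position <= 8:
        return 1.0
    if position <= 12:
        return 0.5
    return 0.1  # position 13 and beyond


def additive_score(mismatch_positions):
    """Sum of the position weights; a perfect match scores 0."""
    return sum(mismatch_weight(p) for p in mismatch_positions)
```

Under these weights a site with mismatches at positions 20 and 10 scores 0.6, while a single mismatch at position 2 scores 4.0, so the first site would rank as the more likely off-target site.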
Table 22 illustrates two exemplary scoring paradigms that can be used to analyze and rank target sites based on the location/position of the mismatch or indel, and its type (e.g., mismatch, deletion, or insertion). In the exemplary embodiment shown in the right column of Table 22 ("scoring"), a "penalty" or "fine" of 0.5 is assessed for deletions, 0.6 for insertions, 0.3 for a NAG PAM, and 20 for less preferred PAMs (anything outside NRG for S. pyogenes Cas9). This means there is a position penalty or fine for an insertion or deletion, then an additional penalty or fine for it being an indel instead of a mismatch. In another embodiment, the weights may be different in some or all positions.
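The combination of a position penalty with the additional indel and PAM penalties can be sketched as follows. The values 0.5, 0.6, 0.3 and 20 are those stated in the paragraph above; the data layout (a list of `(kind, position_weight)` events per site) and the function names are assumptions made for illustration.

```python
# Additional fine for the type of event, on top of its position weight
TYPE_PENALTY = {"mismatch": 0.0, "deletion": 0.5, "insertion": 0.6}


def pam_penalty(pam):
    """0 for NGG, 0.3 for NAG, 20 for any PAM outside NRG (three-base PAM assumed)."""
    if pam[1:] == "GG":
        return 0.0
    if pam[1:] == "AG":
        return 0.3
    return 20.0


def site_score(events, pam):
    """events is a list of (kind, position_weight) tuples for one genomic site."""
    return sum(weight + TYPE_PENALTY[kind] for kind, weight in events) + pam_penalty(pam)


def rank_sites(sites):
    """Sort (name, events, pam) records by ascending score, most likely sites first."""
    return sorted(sites, key=lambda site: site_score(site[1], site[2]))
```

With this layout an on-target site with an NGG PAM scores 0 and is listed first, and the same site sequence next to a NAG PAM scores 0.3.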
Another embodiment is shown in the left column of Table 22 ("current"). In this embodiment, the weight scores are not decreasing as their distance varies from the PAM, but may be based on off-target data, biochemical or cellular testing, or other data or modeling. In other embodiments the total scoring is a combination of additive and/or multiplicative weight scores and may include factors weighing combinations of features, such as pairs of mismatches, or mismatches and indels. In other embodiments, the weights may include sequence-specific weights including combinations of features, such as pairs of mismatches, or mismatches and indels. In such an embodiment changing a given nucleotide to any of the others may result in different weight scores, depending on that sequence change and the sequence of the remainder of the guide and/or complementary sequence. There may be a number of concurrent embodiments based on the particular applications, or user-specified features or requirements.
Table 22: Exemplary Scoring Paradigm
Figure imgf000054_0001
plus additional for deletions: 0.51; insertions: 0.7
Figure 34 is a curve illustrating the score (x-axis) as a function of the location/position of the mismatch or indel relative to the PAM (y-axis). Mismatches in the PAM are not plotted. This graph displays one embodiment of the relationship between weight scores and the position of indels or mismatches. Lower scores under this scoring paradigm are believed to correlate with increased likelihood of nuclease activity at the target site with a mismatch or indel at this site. In this embodiment, weight scores or "fines" are added for multiple mismatches or indels according to these individual weights. Accordingly, in some embodiments under this paradigm, scores would be reported in ascending order, with the target site believed to have the highest nuclease activity appearing first and others following in descending order of predicted activity.
C. Output
Output typically includes some or all of the genomic sequences that match the user-supplied search criteria in comparison with the entered guide strand. The output method can be based on number of mismatches, indels, or percentages. The output list of target sites allows a user to compare the number and scores of target sites for the input guide sequence. As discussed in more detail below, the output can include returning polymerase chain reaction primer sequences for amplification of the ranked cleavage site locations; returning a full nucleic acid sequence of an amplicon for detecting induced mutations; and designating each target cleavage location as being in an exon, intron, promoter, or regulatory or intergenic region. In addition, the output can return hyperlinks to internet resources on the genomic region of the cleavage locations.
1. Target Sites
In some embodiments, the output includes a ranked list of perfectly matched (on-target site and possibly other sites) and partially matched (potential off-target) sites in the genome, their ranking score, optionally along with reference sequences and primer designs that can be used for sequencing and/or mutation detection assays. In a particular embodiment, each line of the output file describes one genomic locus matching the search criteria. A locus may appear on multiple lines if it can be modeled and found in multiple ways.
In some embodiments, the output shows the genomic target site sequence ("hit"), preferably aligned to the query sequence (e.g., guide sequence) to highlight matches, mismatches, indels, etc. In particular embodiments, nucleotides that are not a direct match, including mismatches, insertions, and deletions, are colored or shaded differently or otherwise distinguished from matches. Ambiguities in the query sequence, such as the "N" in the PAM sequence NGG, are indicated differently or are similarly shown, though they do not count as mismatches.
The output can also include the query type, including (i) no deletion or insertion (No indel), (ii) deletions (Del), or (iii) insertions (Ins), with or without mismatches. This portion of the output can indicate if there are insertions or deletions, and specify the indel positions as the number of nucleotides away from the PAM.
The output can also include the number of mismatched bases between the guide sequence and target sequences. As illustrated in more detail in the Examples below, when two repeated bases appear in the guide strand, a deletion of either one of them in the target sequence gives the same query sequence, so the ambiguity can be noted in the output.
The output can also indicate if the PAM in the hit ends in G, as NGG is the Cas9 PAM with the highest activity, followed by NAG. This portion of the output helps in ruling out genomic sites with unlikely PAMs.
Other information that can be provided in the output includes, but is not limited to, the chromosomal location of the matching sequence, its strand, and the chromosomal location of the cleavage site. The predicted cleavage position is based on the fact that Cas9 primarily cleaves both DNA strands three nucleotides from the PAM. The output can include hyperlinks directed to the chromosomal sites in one or more genomic websites or databases, for example, the UCSC genome browser. This allows determination of the gene that best matches the target sequence and whether the target site is in an exon, intron, or other region. This information is helpful as mutations may be better tolerated in regions that are noncoding and nonfunctional. This information can also be included as part of the output.
In some embodiments, the output is grouped by query types, including (i) genomic sites with base mismatches, but no insertions or deletions (No indels), (ii) sites with deletions (Del), and (iii) sites with insertions (Ins) between the query and potential off-target sites (e.g., Table 12). Within each category, sites with mismatches further from the PAM, which are more likely to result in off-target cleavage, are typically listed first. In some embodiments the scoring is the primary determinant of the order in the lists, though a number of tie-breaking criteria, such as lack of indels or chromosomal location, can be used.
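The predicted cleavage position can be derived from a hit's coordinates as in the following sketch. The coordinate convention is an assumption made for illustration: `start` is the 0-based forward-strand coordinate of the leftmost base of the site (protospacer plus PAM), and the returned value is the number of forward-strand bases lying to the left of the cut.

```python
def predicted_cut(start, strand, guide_length=20, pam_length=3):
    """Forward-strand position of the cut, three nucleotides from the PAM."""
    if strand == "+":
        # Protospacer first, PAM last: the cut lies 3 nt upstream of the PAM
        return start + guide_length - 3
    if strand == "-":
        # On the forward strand the reverse-complemented PAM comes first
        return start + pam_length + 3
    raise ValueError("strand must be '+' or '-'")
```

For a 20-nt guide with a 3-nt PAM and a site starting at coordinate 100, this gives 117 on the plus strand and 106 on the minus strand.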
The same genomic location may satisfy two or more search criteria, such as those sites that satisfy the mismatched base limit without and with an insertion or deletion. For example, mismatches at the base farthest from the PAM and deletions of this base will give the same set of genomic locations. This can also occur when the guide strand contains consecutively repeated bases. Since genomic locations can be specified through multiple criteria, they can be indicated as duplications in the output, for example, by listing in each of the corresponding groupings to aid further evaluation and scoring. In other embodiments, duplicate sites are removed or withheld in the output.
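For embodiments in which duplicate sites are removed, one possible way to collapse hits found by several query types is sketched below. The record fields (`chrom`, `cut_site`, `strand`, `query_type`, `score`) are hypothetical, and keeping the lowest-scoring model of each locus is an assumption.

```python
def collapse_duplicates(hits):
    """Keep one record per genomic locus and note every query type that found it."""
    grouped = {}
    for hit in hits:
        key = (hit["chrom"], hit["cut_site"], hit["strand"])
        grouped.setdefault(key, []).append(hit)
    collapsed = []
    for group in grouped.values():
        best = min(group, key=lambda h: h["score"])  # lowest score = most likely model
        query_types = sorted({h["query_type"] for h in group})
        collapsed.append({**best, "query_types": query_types})
    return sorted(collapsed, key=lambda h: h["score"])
```

Retaining the list of query types preserves the information that a locus could be modeled in more than one way, even when only one line is output for it.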
In some embodiments, the output lists the potential off-target sites according to attributes or by adding weight matrixes to rank the most likely off-target sites. The accumulation of additional experiments on CRISPR off-target activity will allow creation of a more predictive scoring system. It is believed that mutations in the PAM are least well tolerated, followed by sites closest to the PAM; however, little is known about how the guide strand sequence influences these effects (Jinek, et al., Elife 2:e00471 (2013); Fu, et al., Nat Biotechnol, 31:822-826 (2013); Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013)).
In some embodiments the output is in HyperText Markup Language (HTML). In some embodiments some or all of the output is exported into a spreadsheet, such as in Excel, text, comma-separated, or tab-separated formats. The spreadsheet can facilitate further processing by the user, such as sorting by attributes or adding weight matrixes to rank the most likely off-target sites. In some embodiments, the primary ranking is done in the spreadsheet to allow iterative tuning or ranking based on the default or user-supplied weight factors. In other embodiments, secondary, tertiary, or further ranking is done in the spreadsheet to add newer, alternative or other weight or multiplicative scores. The preferred embodiment allows the search method to greatly decrease the number of sites in the genome to a relatively low number, possibly hundreds, or to many thousands of loci to process in spreadsheets.
Table 10 shows an exemplary output in HTML. The output includes the genomic sites matching the user-supplied criteria in comparison to a user-supplied guide strand sequence, with chromosomal location. Scoring of the mismatches is provided for ranking, as are PCR primers and reference sequence. Other typical output elements (not illustrated in Table 12) include, but are not limited to, right and/or left primer sequences and links to test each primer pair using the UCSC in silico PCR web site, amplicon sequence, and digest size (discussed in more detail below). The chromosomal location ("Chr. position") for each "hit" in Table 12 is provided as a hyperlink to genomic resources, e.g. the UCSC genome browser, and to an output file as a spreadsheet for further manipulation and primer ordering. In other embodiments, links can be provided with genomic annotation, sequence viewers, in silico primer testing, and/or PubMed links.
In Table 12, each hit is appropriately aligned to the query shown in the "Result" box. DNA bases corresponding to mismatches, indels, and ambiguity codes, such as N, are shown in the query line to identify the matching genomic bases. To the right of the "Result" box are boxes with the query type, number of mismatches, chromosomal position, score, primers, and other features. A spreadsheet output allows the user to manipulate the output to evaluate the number and scores of the low-scoring sites that are predicted to be more likely off-target sites, which may provide important guidelines when evaluating and choosing guide strands and/or testing for true cleavage events using DNA samples from cells after CRISPR/Cas treatment.
2. PCR Primers
An automated primer pair design is sometimes included to design primers appropriate for target site validation assays, matching user input criteria. The primer design function can be used in combination with assays for off-target cleavage after cells or animals are treated with CRISPR guide strands and nuclease. Primers are designed that fit the criteria needed for the particular assay or sequencing platform using an automated primer pair design process. This greatly simplifies the standard method for primer design, which requires iterative steps of primer design and verification of the resulting fragment sizes. In addition to speeding the primer design throughput, an automated design process allows the primers to be custom designed for the downstream assays or sequencing, and to be matched for high-throughput, full-plate PCR amplification. Primers can be designed according to specified criteria or to the defaults given for particular applications (Figure 25A).
To optimize amplicons for different sequencing platforms, the primer pair design will sometimes provide for specifying the minimum distance from the edge of the amplicon to the nuclease site. The recommended parameters will in some cases include a separation distance between cleavage bands that is greater than 0, 20, 40, 60, 80, 100, 120, 140, 160, 180, or 200 base pairs. In some embodiments primer pairs are chosen such that the minimum separation between uncleaved and cleaved products is greater than 50, 75, 100, 125, 150, 175, or 200 base pairs. The primers may be optimally chosen for a variety of sequencing assays, such as appropriate for each sequencing platform. In some embodiments, users can also input the number of bases the cleavage site must be from each amplicon's edge to ensure sequencing coverage depending on the different sequencing platforms. For single molecule, real-time (SMRT) sequencing, a set of exemplary recommended parameters is: Minimum Distance Between Cleavage Bands of 0 base pairs, Minimum Separation Between Uncleaved and Cleaved Products of 125 base pairs. In another example, for Surveyor assays, the primer design parameters can be specified to ensure that the nuclease site is placed in an optimal position within the amplicon to yield cleavage bands that can be easily distinguished from the parental band and each other using agarose, polyacrylamide, other gels or capillary apparatus. For example, exemplary recommended parameters for use in Surveyor assays resolved on 2% agarose gels are: Minimum Distance Between Cleavage Bands of 100 base pairs, Minimum Separation Between Uncleaved and Cleaved Products of 150 base pairs.
The output primers can also be easily modified in the spreadsheet, such as to add flanking sequences for additional amplification and/or barcodes for sequencing.
The primer pair design process implemented will in some cases use the following steps and considerations to yield primer pairs suitable for high-throughput PCR. In some embodiments the primer design process may take into account the potential secondary structure that could arise from the 3' end of a primer folding back; may take into account estimated physical properties including the melting temperature or length; may define targets for the content of specific bases in the primer; and may check to ensure that primers are not self-complementary.
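The two gel parameters can be checked for a candidate amplicon as in the sketch below. The interpretation is an assumption: the distance between cleavage bands is taken as the size difference between the two cleavage fragments, and the separation between uncleaved and cleaved products as the size difference between the full amplicon and the larger fragment. The function names are hypothetical.

```python
def cleavage_fragments(amplicon_length, cut_offset):
    """Sizes of the two fragments produced by a cut cut_offset bases from the left end."""
    return cut_offset, amplicon_length - cut_offset


def passes_gel_criteria(amplicon_length, cut_offset,
                        min_band_distance=100, min_parent_separation=150):
    """Check both separation criteria; defaults are the exemplary Surveyor values."""
    left, right = cleavage_fragments(amplicon_length, cut_offset)
    band_distance = abs(left - right)
    parent_separation = amplicon_length - max(left, right)
    return band_distance >= min_band_distance and parent_separation >= min_parent_separation
```

For example, a 500-bp amplicon cut at position 150 yields 150-bp and 350-bp fragments, which satisfies the exemplary Surveyor parameters, whereas a cut at the midpoint yields two co-migrating 250-bp fragments and does not.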
Outlined below is an example primer design process that may be employed in certain preferred embodiments.
Primer Design Process
Each possible position in the sequence 5' of the nuclease binding sites is considered as a possible 5' base for a primer (in some cases allowing for a user-specified minimum distance between the edge of the amplicon and the nuclease site).
For a given 5' starting position, a first number of bases in the 3' direction are taken as an initial sequence for the primer. The first number of bases may be any integer number of bases, but in some preferred embodiments the first number of bases chosen will be 15, 16, 17, 18, 19, or 20 bases. Then the following design loop begins:
LOOP:
1) Check for potential secondary structure that could result from the 3' end folding back.
Check that the sequence of the primer up to the 4th most 3' base does not contain any exact matches to the reverse complement of the three most 3' bases.
Example:
Potential Primer Sequence: 5'-ACATTGAGGCACTACTTG-3'
Check that the sequence CAA does not appear in ACATTGAGGCACTA
If there is a match, lengthen the primer by one base in the 3' direction and repeat the loop.
2) Check the predicted melting temperature of the primer and GC content. %GC is the percentage (not fraction) of G and C residues in the sequence, i.e., 33 rather than 0.33.
If the %GC content falls outside a specified range then lengthen the primer by one base in the 3' direction and repeat the loop. In some embodiments the specified range may be greater than 25, 30, 31, 32, 33, 34, 35, or 40 % and less than 55, 60, 61, 62, 63, 64, 65, 70, or 75%.
The melting temperature can be approximated by a number of methods. In one embodiment it is approximated by the empirical relation below, where the %GC is the percentage of G and C residues and the length is the primer length in units of the number of nucleotides.
Figure imgf000060_0001
If the predicted melting temperature falls outside of certain specified values, then lengthen the primer by one base in the 3' direction and repeat the loop. In preferred embodiments the predicted melting temperature is desirably less than 70, 65, 60, 59, 58, 57, 56, 55, or 50 degrees when using the empirical formula above.
3) If the primer is longer than a specified maximum primer length, e.g., 30 base pairs, then exit the loop unsuccessfully: no primer for this position. In some cases the maximum primer length may be 20, 30, 35, 40, 50, 60, or 70 base pairs.
4) Check the primer sequence for high self-complementarity.
Ensure that all base pair sequences in the primer are not a perfect match to anywhere in the reverse complement sequence of the primer.
If any match is found, then exit the loop unsuccessfully— no primer for this position.
5) If all requirements are met, then exit the loop successfully and record the primer for this position.
END LOOP
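The design loop above can be sketched in Python as follows. Several details are assumptions rather than part of the disclosure: `approx_tm` uses a commonly cited approximation (81.5 + 0.41 x %GC - 675/length) that may differ from the empirical relation in the figure, the self-complementarity check uses a hypothetical 6-base window, the default ranges are placeholders, and the maximum length is tested at the top of each pass so that the loop always terminates.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def folds_back(primer):
    """Step 1: reverse complement of the three 3'-most bases found further upstream."""
    return reverse_complement(primer[-3:]) in primer[:-4]


def gc_percent(primer):
    """%GC as a percentage, e.g. 33 rather than 0.33."""
    return 100.0 * sum(base in "GC" for base in primer) / len(primer)


def approx_tm(primer):
    """ASSUMPTION: a common approximation, not necessarily the relation in the figure."""
    return 81.5 + 0.41 * gc_percent(primer) - 675.0 / len(primer)


def self_complementary(primer, window=6):
    """Step 4 (window size is an assumption): any window-mer also present in the reverse complement."""
    rc = reverse_complement(primer)
    return any(primer[i:i + window] in rc for i in range(len(primer) - window + 1))


def design_primer(template, start, initial_length=18, max_length=30,
                  gc_range=(35.0, 65.0), tm_range=(55.0, 65.0), tm=approx_tm):
    """Return a primer beginning at start, or None if no primer fits this position."""
    length = initial_length
    while start + length <= len(template):
        primer = template[start:start + length]
        if len(primer) > max_length:          # step 3: too long, give up
            return None
        acceptable = (not folds_back(primer)  # step 1
                      and gc_range[0] <= gc_percent(primer) <= gc_range[1]  # step 2
                      and tm_range[0] <= tm(primer) <= tm_range[1])
        if not acceptable:
            length += 1                       # lengthen by one base in the 3' direction
            continue
        if self_complementary(primer):        # step 4
            return None
        return primer                         # step 5: all requirements met
    return None
```

Applied to the example sequence from step 1, `folds_back("ACATTGAGGCACTACTTG")` is false because CAA does not occur in ACATTGAGGCACTA.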
After attempts to generate primers for all forward positions and all reverse positions are complete, pairs may then be made by pairing each forward primer with each possible reverse primer. This list of pairs can then be pruned in some cases to remove any that would result in products where the distances between nuclease sites and the ends of the amplicon fall outside of some specified ranges. This list may be further pruned to remove primer pairs that are otherwise undesirable, e.g., pairs that could potentially form primer dimers, as defined by having the final 3' bases of one primer match the reverse complement of the final 3' bases of the other primer.
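The primer-dimer pruning step can be sketched as follows. The number of 3' bases compared is not stated in the text, so the default of three used here is an assumption, as are the function names.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def may_form_dimer(forward, reverse, n=3):
    """True if the final n 3' bases of one primer match the reverse complement of the other's."""
    return forward[-n:] == reverse_complement(reverse[-n:])


def make_pairs(forward_primers, reverse_primers, n=3):
    """Pair every forward primer with every reverse primer, dropping likely dimers."""
    return [(f, r) for f in forward_primers for r in reverse_primers
            if not may_form_dimer(f, r, n)]
```

The comparison is symmetric: if the 3' end of the forward primer equals the reverse complement of the 3' end of the reverse primer, the converse also holds.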
The primer pairs may then be sorted by some selection criteria depending upon the application, for example how close the melting temperature is to a specified target melting temperature. Primer pairs may also be sorted and/or filtered by providing a preference, for instance for shorter amplicon lengths, or may be sorted alphabetically or any other acceptable manner.
In some embodiments, the primer pairs are then sorted by how close their melting temperature is to the target melting temperature (the default is 60°C) by
Figure imgf000061_0001
Take all pairs where the Tdiff < 2 and apply further sorting criteria in order of priority:
1) Prefer shorter amplicon length
2) Prefer a shorter length of the longer primer sequence in the pair
3) As a final tie-break, sort the primer sequences alphabetically
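The filtering and tie-breaking order above maps naturally onto a tuple sort key, as in this sketch. The `PrimerPair` record and its field names are hypothetical, and `tdiff` is assumed to have been computed beforehand for each pair.

```python
from typing import NamedTuple


class PrimerPair(NamedTuple):
    forward: str
    reverse: str
    amplicon_length: int
    tdiff: float  # deviation from the target melting temperature, computed elsewhere


def select_pairs(pairs, max_tdiff=2.0):
    """Keep pairs with tdiff below the cutoff, then apply the three sorting criteria."""
    kept = [p for p in pairs if p.tdiff < max_tdiff]
    return sorted(
        kept,
        key=lambda p: (
            p.amplicon_length,                    # 1) shorter amplicon first
            max(len(p.forward), len(p.reverse)),  # 2) shorter "longer primer" first
            p.forward, p.reverse,                 # 3) alphabetical tie-break
        ),
    )
```

Because Python compares tuples element by element, each later criterion is consulted only when all earlier ones are tied.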
If no primer pairs are found acceptable under a specified set of criteria, the algorithm may selectively relax constraints in some embodiments to generate a minimum number of primer pairs. In a particular embodiment, the most lenient set of criteria still requires a minimum %GC of 25, a maximum %GC of 70, a maximum length of 38, and a minimum melting temperature of 55°C.
The output can include returning polymerase chain reaction primer sequences for amplification of the ranked off-site cleavage locations alone, or in combination with a full nucleic acid sequence of an amplicon for detecting induced mutations.
In other embodiments, the output "primer sequences" can be used for other applications such as binding without amplification, pull-down sequences, probe sequences, or as sequence-specific tags.
3. Estimating Target Sites
Some embodiments provide an estimate of the number of expected target sites based upon the search criteria, for example to provide the user with a guide for selecting appropriate search parameters or to prohibit queries that would generate so many hits as to be too time- or resource-intensive. In other embodiments these calculations are done to provide the default or suggested parameters.
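The text does not specify how the estimate is computed. As a rough illustration only, the sketch below assumes a random genome with equal base frequencies and counts the expected number of windows matching a guide within a mismatch limit, followed by a PAM with a given number of fixed bases (two for NGG). The model, the function names, and the factor of two for both strands are all assumptions.

```python
from math import comb


def match_probability(length: int, max_mismatches: int) -> float:
    """Probability that a random window of the given length matches a
    fixed guide with at most max_mismatches substitutions, assuming
    each base is equally likely (a simplifying assumption)."""
    return sum(
        comb(length, k) * 0.75 ** k * 0.25 ** (length - k)
        for k in range(max_mismatches + 1)
    )


def expected_sites(genome_size: int, guide_length: int,
                   max_mismatches: int, pam_fixed_bases: int = 2) -> float:
    """Expected number of hits on both strands under the random model."""
    p_guide = match_probability(guide_length, max_mismatches)
    p_pam = 0.25 ** pam_fixed_bases
    return 2 * genome_size * p_guide * p_pam


if __name__ == "__main__":
    # Show how quickly the expected hit count grows with allowed mismatches.
    for mm in range(0, 5):
        print(mm, expected_sites(3_000_000_000, 20, mm))
```

Real genomes are repetitive and compositionally biased, so such a figure is at best an order-of-magnitude guide for choosing search parameters.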
Figure 30B depicts a flow chart for an exemplary method (900) for generating target sites. A query is obtained and search parameters are generated (910).
Optionally, an estimate of the number of expected results is provided (920). The query may then be updated with a revised query, wherein a revised estimate is subsequently generated of the number of expected results. This process can be completed to obtain a desirable number of expected results. The query is then used to construct a target site list (930) using methods provided herein. The results in the target site list are ranked by score (940) and/or filtered by specified selection criteria (950). The list of target sites is then used to generate primer pairs (960) for generating test amplicons. The list of target sites and primer pairs is then output as results.
D. Exemplary Algorithm for Identifying and/or Ranking Target Sites
An exemplary decision tree for identifying and/or ranking putative target sites is illustrated in Figure 30C (100). Following input of a guide strand sequence (gRNA) (110), based on the user-supplied inputs ("input"), variants of the guide RNA are generated that vary in insertion(s) and/or deletion(s) in each possible position. The collection of these variants without the original guide (or with the original guide, depending on embodiment) (120) is then aligned to the chosen genomic (or other) sequence (130). If specified, the required adjacent motif must be present within the supplied limits or mismatches. This can be a PAM or other type of sequence. At each site, the program can determine if each of the guides or variant guides matches within the user specified number of mismatches (140). If not, the sequence is not added to the output (150) and the search moves one nt further through the genome index, the specified sequence or file and searches again (130). The sites matching the criteria are collected as output (160), whereas the sites not matching are not output (150), though they may be included in other output using other guide sequences or inputs, such as a greater allowed number of mismatches.
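The generation of guide variants with an insertion or deletion at each possible position can be sketched as below for the single-base case. This is an assumption-laden illustration: an inserted position is written as the wildcard `N` so that it can pair with any genomic base, insertions are placed only at interior positions, and the function names are hypothetical.

```python
def deletion_variants(guide: str):
    """All distinct sequences obtained by deleting one base from the guide
    (models a target site that is one base shorter)."""
    return sorted({guide[:i] + guide[i + 1:] for i in range(len(guide))})


def insertion_variants(guide: str, placeholder: str = "N"):
    """All sequences obtained by inserting one wildcard base at an interior
    position of the guide (models a target site that is one base longer).

    ASSUMPTION: terminal insertions are skipped because they are
    equivalent to shifting the search window by one base.
    """
    return sorted(
        {guide[:i] + placeholder + guide[i:] for i in range(1, len(guide))}
    )


if __name__ == "__main__":
    guide = "ACGTACGTACGTACGTACGT"  # hypothetical 20-nt guide
    print(len(deletion_variants(guide)), len(insertion_variants(guide)))
```

Deleting either base of a repeated pair gives the same sequence, which is why the variants are collected in a set before sorting.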
The input guide strand sequence (gRNA) (110) can also be used to search the genomic or other sequences without the possible addition of indels, based on the user-supplied input (170). This process can occur in parallel, or as part of the search with variants, or it may occur prior to or at other times than the search described above (130). At each site, the program can determine if each of the guides or variant guides matches within the user specified number of mismatches (180). If specified, the required adjacent motif must be present within the supplied limits or mismatches. This can be a PAM or other type of sequence. If not, the sequence is not added to the output (190) and the search moves one nt further through the genome index, the specified sequence or file and searches again (170). The sites matching the criteria are collected as output (200), whereas the sites not matching are not output (190), though they may be included in other output using other guide sequences or inputs, such as a greater allowed number of mismatches.
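The search without indels, moving one nucleotide at a time and testing the adjacent motif and the mismatch limit at each site, can be sketched as a naive sliding window. This is not the disclosed implementation (which may use a genome index): it scans the forward strand only, and the function and parameter names are hypothetical.

```python
# IUPAC nucleotide codes mapped to the bases they permit.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(pattern: str, site: str) -> int:
    """Number of positions where the site base is not permitted by the
    (possibly ambiguous) pattern base."""
    return sum(base not in IUPAC[p] for p, base in zip(pattern, site))


def find_sites(sequence: str, guide: str, pam: str = "NGG",
               max_mismatches: int = 2, max_pam_mismatches: int = 0):
    """Return (start, site, pam_site, mismatches) for every forward-strand
    window that satisfies the PAM and mismatch limits."""
    sequence = sequence.upper()
    guide = guide.upper().replace("U", "T")  # accept RNA-style input
    pam = pam.upper()
    width = len(guide) + len(pam)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        site, pam_site = window[:len(guide)], window[len(guide):]
        if count_mismatches(pam, pam_site) > max_pam_mismatches:
            continue  # required adjacent motif absent; move one nt further
        mismatches = count_mismatches(guide, site)
        if mismatches <= max_mismatches:
            hits.append((start, site, pam_site, mismatches))
    return hits
```

A complete search would also scan the reverse complement and, where indels are allowed, repeat the scan for each guide variant.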
Each of the sites that was located through these processes is compiled into the collected output (210). The output can contain some or all of the following information or additional information: a list of genomic sequences, the genomic location, such as the chromosome number and base position in most genomes, and annotation on the nearest gene, whether the site is in an exon, intron or other annotated sequence, or other data from current or future databases. In other embodiments an output without indels (220) and one that can include indels (250) remain separate. This data can be generated from the process listed above (110-210), or can be derived from other sources, and processed primarily in terms of ranking the output or sequences collected from any source. In other embodiments each site of a given length (each sub-sequence) in a genome or other sequence can be scanned and given a ranking score using the algorithm described below (240, 270). Generally the user would request only the sub-sequences above a user-input or default cut-off, generally the sites that would likely be cut.
The listed sites are each individually compared to the guide sequence (220), or guide sequence allowing indels (260) with the ranking performed in any of a number of weighted methods (one embodiment described in Table 22). In the preferred embodiment the site is aligned to the genomic site and included in the output (230 or 260), whereas in other embodiments, the site can be iteratively compared to the genomic site with different combinations of mismatches, insertions and/or deletions (260, 270), or aligned across the full specified sequence or genomic indices. Based on the alignment, the differences are scored with weights for mismatches, insertions and/or deletions using one of the default or user-supplied ranking methods (240, 270). The results of the ranking are given as output (280), which can be combined with other annotated information and provided as HTML, graphical, text, spreadsheet and/or other forms of output (290). The output can be further processed based on the results of this output, such as the number of sites returned, based on newer or different data that emerged, based on alternative applications or other reasons. The output can therefore be re-ranked using independent scoring or scoring systems that incorporate the previously determined score. In one embodiment, this can be as simple as adding further weights for additional features, such as PAM mismatches. In other embodiments, re-ranking can be used to add data not in the original ranking such as chromosomal context, DNA accessibility, sequence specific features or known interactions (310). This output can be provided as HTML, graphical, text, spreadsheet and/or other forms of output (320).
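A weighted scoring of mismatches, insertions and deletions can be sketched as a simple penalty sum. The weights below are placeholders and do not reproduce Table 22; the position-dependent mismatch term (heavier penalties near the PAM) and all names are hypothetical choices made only to show the shape of such a ranking.

```python
def penalty_score(mismatch_positions, n_insertions: int = 0,
                  n_deletions: int = 0, guide_length: int = 20,
                  mismatch_weight: float = 1.0,
                  insertion_weight: float = 2.0,
                  deletion_weight: float = 2.0) -> float:
    """Sum of weighted differences between a guide and a candidate site.

    mismatch_positions are counted from the PAM (1 = adjacent to the PAM).
    ASSUMPTION: all weights and the linear position scaling are
    illustrative placeholders, not the values of the source's Table 22.
    """
    mismatch_penalty = sum(
        mismatch_weight * (guide_length - pos + 1) / guide_length
        for pos in mismatch_positions
    )
    return (mismatch_penalty
            + insertion_weight * n_insertions
            + deletion_weight * n_deletions)


def rank_sites(sites):
    """Sort site records (dicts) from lowest to highest penalty."""
    return sorted(
        sites,
        key=lambda s: penalty_score(
            s["mismatches"], s.get("insertions", 0), s.get("deletions", 0)
        ),
    )
```

Re-ranking with additional features, such as PAM mismatches, would amount to adding further weighted terms to the same sum.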
The output, in one preferred embodiment, allows one to avoid guide strands that may result in high off-target activity, that may target important genes, or that may result in other off-target events (300). In other embodiments, this process allows a better choice of guide strands by comparing the output across a ranking of guide strands that may target the same gene or region, or otherwise be alternatives (300). After the guide strands are used in cells, the genomic, plasmid or other DNA can be harvested to measure activity. In one embodiment, output primers are provided that can be used to determine cleavage, homologous recombination, mutation rates or the rates of other events at the on-target and putative off-target sites (330). Similarly, one can use the output primers or other methods to evaluate the on-target or off-target activity of the guide strands and then compare between the guide strands (330).
III. Systems
A. Computer Implemented Systems
The systems and methods provided herein are generally useful for predicting the location of CRISPR/Cas on- and off-target cleavage sites, particularly those due to insertions and/or deletions in the target DNA relative to the guide RNA sequences and vice versa. In certain embodiments the methods are implemented on a computer server accessible over one or more computer networks. Figure 31 is a block diagram of a preferred network-based implementation (400) wherein a client computer system (410) is in communication with a server computer system (420) via a network (430), i.e. the Internet or in some cases a private network or a local intranet. One or both of the connections to the network may be wireless. In a preferred embodiment the server is in communication with a multitude of clients over the network, preferably a heterogeneous multitude of clients including personal computers and other computer servers as well as hand-held devices such as smartphones or tablet computers. In some embodiments the server computer is in communication, i.e. is able to receive an input query from or direct output results to, one or more laboratory automation systems, i.e. one or more automated laboratory systems or automation robotics that automate biochemical assays, PCR amplification, or synthesis of PCR primers. See for example automated systems available from Beckman Coulter.
The computer server where the methods are implemented may in principle be any computing system or architecture capable of performing the computations and storing the necessary data. The exact specifications of such a system will change with the growth and pace of technology, so the exemplary computer systems and components described herein should not be seen as limiting. Figure 32 is a block diagram of the basic components of an exemplary computer server (500) on which the methods may be implemented. The systems will typically contain storage space (510), memory (520), one or more processors (530), and one or more input/output devices (540). It is to be appreciated that the term "processor" as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit). The term "memory" as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, etc. In addition, the term "input/output devices" or "I/O devices" as used herein is intended to include, for example, one or more input devices, e.g., keyboard, for making queries and/or inputting data to the processing unit, and/or one or more output devices, e.g., a display and/or printer, for presenting query results and/or other results associated with the processing unit. An I/O device might also be a connection to the network where queries are received from and results are directed to one or more client computers. It is also to be understood that the term "processor" may refer to more than one processing device. Other processing devices, either on a computer cluster or in a multi-processor computer server, may share the elements associated with the processing device. 
Accordingly, software components including instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more of the associated memory or storage devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole into memory (e.g., into RAM) and executed by a CPU. The storage may be further utilized for storing program codes, databases of genomic sequences, etc. The storage can be any suitable form of computer storage including traditional hard-disk drives, solid-state drives, or ultrafast disk arrays. In some embodiments the storage includes network-attached storage that may be operatively connected to multiple similar computer servers that comprise a computing cluster.
B. Graphical User Interface
In a preferred set of embodiments the computer server receives input submitted through a graphical user interface (GUI). The GUI may be presented on an attached monitor or display and may accept input through a touch screen, attached mouse or pointing device, or from an attached keyboard. In some embodiments the GUI will be communicated across a network using an accepted standard to be rendered on a monitor or display attached to a client computer and capable of accepting input from one or more input devices attached to the client computer.
Figure 33 depicts some of the components that may be found in an exemplary GUI for inputting parameters for target site searches capable of being rendered in a standard web browser window (600) on a client computer. In other embodiments, a phone interface can identify, read and/or run entered sequences.
In the exemplary embodiment (600), the GUI contains a target genome selection region (612) where the user selects the genome to be searched. In this exemplary system a genome is indicated by clicking, touching, highlighting or selecting one of the genomes that are listed (615). In preferred embodiments, the target genome is selected from a drop-down list.
In the exemplary embodiment (600), the GUI contains a query sequence region (620) for entering or uploading one or more query guide sequences. The GUI typically includes a text box for the user to input a query guide strand sequence (622). In other embodiments, users may input any sequence or sequences for which they would like to design amplification primers. The GUI may additionally or alternatively contain an interface for uploading a text file containing one or more query sequences (628, 626). In a particular embodiment, the text file must contain only one query sequence per line. In embodiments that include both options, the GUI may also contain radio buttons that allow the user to select if the target sequence will be entered in a text box (624) or uploaded from a text file (628). The GUI may include a button for choosing the file (626), may allow a user to drag and drop the intended file, or may provide other means of having the file uploaded. The GUI generally accepts a sequence of length acceptable for serving as a CRISPR/Cas guide strand sequence, for example between about 10 and about 55 nucleotides. In preferred embodiments this may range from 17-22 nucleotides. The input is typically a string of letters, each corresponding to a single letter designating a nucleotide, or other symbols allowing ambiguity at indicated positions (N, R, etc.), and together providing the nucleic acid sequence of the guide strand polynucleotide. The sequence will generally be entered using a combination of characters selected from the allowable characters and, dependent upon the implementation, may be limited to characters for the standard nucleotides, or may include non-standard nucleotides.
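Validation of an uploaded query file with one sequence per line might look like the sketch below. The allowed character set (standard bases plus IUPAC ambiguity codes), the hard length limits of 10 and 55, and the function name are illustrative assumptions; an actual implementation may accept a different alphabet.

```python
# Standard bases (DNA or RNA) plus IUPAC ambiguity codes.
# ASSUMPTION: this alphabet is illustrative.
ALLOWED = set("ACGTURYSWKMBDHVN")


def parse_query_file(text: str, min_length: int = 10, max_length: int = 55):
    """Parse a text file body containing one query sequence per line.

    Blank lines are skipped. Raises ValueError on a disallowed character
    or a sequence outside the accepted length range.
    """
    guides = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        seq = line.strip().upper()
        if not seq:
            continue
        bad = sorted(set(seq) - ALLOWED)
        if bad:
            raise ValueError(
                f"line {line_number}: unexpected character(s) {bad}"
            )
        if not min_length <= len(seq) <= max_length:
            raise ValueError(
                f"line {line_number}: length {len(seq)} outside "
                f"{min_length}-{max_length}"
            )
        guides.append(seq)
    return guides
```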
In the exemplary GUI embodiment (600), the GUI contains a region where the user selects search options (630). The region can include a text box for the user to input a target sequence protospacer adjacent motif (PAM) (632). The input is typically a string of three letters corresponding to the single letter code for the PAM. Exemplary PAM include, but are not limited to, NGG, NAG, and NRG.
The GUI also typically includes additional radio buttons, boxes, or/and other manners for the user to input the number of allowed mismatches, insertions, and/or deletions. In the exemplary GUI embodiment (600), the search options region (630) provides a check button for selecting if no indels should be included in the search (634), a check button for selecting if deletions should be included in the search (636), a check button for selecting if insertions should be included in the search (638), and radio buttons for entering how many mismatches (e.g., 0, 1, 2, or 3, etc.), deletions, (e.g., 0, 1, 2, etc.), insertions (e.g., 0, 1, 2, etc.), or a combination thereof should be searched. In some embodiments, the interface provides a check button to elect no indels in combination with radio buttons for selecting 0, 1, 2, or 3 mismatches; a check button to elect 1-base deletion in combination with radio buttons for selecting 0, 1, or 2 mismatches; and a check button to elect 1-base insertion in combination with radio buttons for selecting 0, 1, or 2 mismatches (640). In some embodiments, the number of mismatches, insertions, and/or deletions may be entered as individual numeric values, as a list of numeric values, or as a range of numeric values in a text box(es). For example, the input strings "0,1,2,3", "0,1-3", "0,1,2-3", or "0,1-2,3" would in some cases all be accepted inputs and would generate all possible alignments including 0, 1, 2, or 3 mismatches, insertions, or deletions.
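Parsing of the text-box input described above, where individual values, lists and ranges may be mixed, can be sketched as follows. The function name is hypothetical and the error handling is a minimal assumption.

```python
def parse_counts(spec: str):
    """Expand a string such as "0,1-3" into a sorted list of the distinct
    non-negative integers it denotes."""
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"invalid range: {part!r}")
            values.update(range(low, high + 1))
        else:
            values.add(int(part))
    return sorted(values)


if __name__ == "__main__":
    # The four example strings from the text all denote the same set.
    for spec in ("0,1,2,3", "0,1-3", "0,1,2-3", "0,1-2,3"):
        print(spec, parse_counts(spec))
```

Collecting the values in a set means overlapping entries such as "0-2,1-3" are handled without duplicates.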
The GUI can include options for the user to select pre-determined primer design options and/or to customize certain design parameters. In the exemplary GUI embodiment (600), the PCR primer design options region (650) includes a check box (652) or radio button that allows the user to select whether or not primer sequences should be included with the output. The GUI can include radio buttons or tabs (654) that allow the user to select a preferred primer design strategy, for example, default, Illumina 250, Illumina 250 - paired, SMRT, or enzyme. Additionally, or alternatively, the GUI can include text boxes that allow the user to customize primer parameter settings including, for example, the minimum separation of uncleaved to cleaved (660), minimum cleavage product size difference (662), minimum amplicon length (664), maximum amplicon length (666), optimal amplicon length (668), etc. The user input for each text box is typically an integer, for example, between about 0 and 100,000 inclusive, preferably between about 0 and 10,000 inclusive, or between 0 and 1,000 inclusive. In the absence of user input or user editing, the text boxes can be populated with default settings before or after the user submits the query. The user can also elect not to include primer sequences as part of the output, which can reduce the runtime associated with the query.
The GUI also typically includes an interface for the user to initiate a search. The exemplary GUI embodiment (600) includes a submit button or tab (680) that when selected initiates a search according to the user entered or default criteria. The GUI can also include a reset button or tab (682) that when selected removes the user input and/or restores the default settings.
The GUI will in some embodiments have an example button that, when selected by the user populates all of the input fields with default values. The option selected by the example values may in some embodiments coincide with an example described in detail in a tutorial, manual, or help section. The GUI will in some embodiments contain all or only some of the elements described above. The GUI may contain any graphical user input element or combination thereof including one or more menu bars, text boxes, buttons, hyperlinks, drop-down lists, list boxes, combo boxes, check boxes, radio buttons, cycle buttons, data grids, or tabs.
Figures 26A-26G and Table 14 (below) illustrate an exemplary search string processed according to the disclosed methods and include examples showing the input, and portions of a web result and spreadsheet output for a search of the human genome using guide strand R-01.
The genome of interest is chosen from the Target Genome list (Figure 26A). The target sequence is entered into the Query Sequence box (Figure 26B). The required protospacer adjacent motif (PAM) is entered into the 'Add suffix' Box of the Search Options section (Figure 26C). The spacers (Ns) and required bases are included, such as NGG or NRG.
The boxes in the 'Allowed indels and mismatch' of the Search Options section are checked to indicate if genome sites to be searched include genomic sites that have No indels (with <3 mismatches but the same length), have 1-base Del (are 1-base shorter), or have 1-base Ins (are 1-base longer) (Figure 26C).
The boxes in the PCR Primer Design Options section are chosen, which allow COSMID to design primers matching the specific application. Primer design parameters are set by pressing the button for 'Default', 'Illumina 250', 'Illumina 250 paired', 'SMRT' or 'enzyme' (when using other enzymes). Any of the parameters can be entered by hand to further customize.
IV. Experimental Methods
The methods provided herein will in some cases completely replace the need for experimentally screening nuclease target sites or nuclease activities, allowing for the design of CRISPR/Cas guide strands in a completely in silico manner. In some cases the tools provided herein will serve as an essential first step in the design process by screening and selecting only the few potential guide strands that are predicted to have the desired cleavage-mediating activity at the on-target site, with limited off-site cleavage. In some cases, the tool will prevent the use of guide strands that have medium or high probability of cleaving an off-target site or cleaving multiple sites in the genome. This will allow for far less experimental time and resources being applied to preparing and testing guide strands that do not have the desired features.
In some cases the methods provided herein for predicting off-target sites are used without the need for experimental data. In some cases the methods provided herein for predicting off-target sites are parameterized to correlate with experimentally determined values. In some embodiments the methods provided herein for predicting off-target sites are used to screen candidate guide strands wherein a much smaller subset are subsequently tested experimentally.
The methods of predicting off-target sites can be used in combination with experimental methods for measuring on-target and/or off-target cleavage activity. In some embodiments this includes using the results from one or more experiments to guide the search for guide strands with the desired activity at the target site and little or no activity on off-target sites. The experimental methods can include any method capable of measuring the cleavage activity or identifying off-target active sites of a guide strand in combination with a CRISPR/Cas nuclease.
Non-limiting exemplary experimental methods are described below. For example, mutation detection assays can be used to determine if off-target cleavage occurs at putative off-target sites identified according to the disclosed methods. Suitable assays, such as enzyme mismatch assays, are known in the art, see, for example, Guschin, et al., Methods Mol. Biol., 649:247-56 (2010), which describes a procedure for quantifying mutations that result from DNA double-strand break repair via non-homologous end joining; and Huang, et al., Electrophoresis, 33(5):788-96 (2012), which describes a T7 endonuclease I-based assay. The assays are typically based on the ability of a nuclease to selectively cleave distorted duplex DNA formed via cross-annealing of mutated and wild-type sequence. Briefly, using primers, such as primers designed according to the methods described herein, PCR is used to amplify the genomic loci of putative target sites after transfecting test cells with the elements of the CRISPR/Cas system (e.g., a plasmid expressing Cas9 and a test guide strand). Sanger sequencing can be used to observe mutations. Deep sequencing can also be used to detect and quantitate nuclease induced mutations in CRISPR/Cas-treated cell populations.
Examples
Example 1: CRISPR guide strands can exhibit off-target activity at similar levels as on-target activity, even with mismatches within first 12 nucleotides.
Materials and Methods
CRISPR design and testing
There were no CRISPR target sites in the human HBB gene sequence with their proximal 12 bases unique in the human genome (Cong, et al., Science, 339:819-823 (2013)); therefore, CRISPR/Cas9 guide strands targeting HBB were chosen by comparing the similar regions in the human hemoglobin δ (HBD) gene. Eight 20-base guide strands were designed to target sites near the sickle mutation in the HBB gene (Figure 1A), each adjacent to a PAM sequence that contains the canonical trinucleotide NGG. Five guide strands were also designed to target two segments in the human CCR5 gene (Figure 2A), and the corresponding CRISPR/Cas9 systems were tested to determine their on-target cleavage and potential off-target activity at the human C-C chemokine receptor type 2 (CCR2) gene. Herein the name of the guide strand (such as R-03) is used to represent the CRISPR/Cas9 system with the specified guide strand.
CRISPR plasmids were generated by kinasing and annealing oligonucleotides containing a G followed by 19 additional bases of the guide strand plus sticky ends, and ligating them into the pX330 plasmid, which contains a U6 promoter-driven chimeric +85-bp guide strand and a CBh promoter-driven Cas9 expression cassette, so that both are expressed together from the 8.5-kb Cas9 gene expression plasmid, pX330 (provided by Dr. Feng Zhang, and also available through Addgene 42230) (Hsu, et al., Nat. Biotechnol., 31:827-832 (2013)). In a 24-well plate, 80,000 HEK-293T cells/well were seeded and cultured in Dulbecco's modified Eagle medium supplemented with 10% fetal bovine serum (FBS) and 2 mM fresh L-glutamine, 24 h prior to transfection. Cells were transfected with 100, 200, 400 or 800 ng of CRISPR plasmids (normalized to 800 ng with pUC18) using FuGENE HD (Promega). The genomic DNA was harvested after 3 days using QuickExtract (EpiCentre). Targeted cleavage was measured at the endogenous loci by the rate of mutations through mis-repair, detected using amplification of these sites using bar-coded or traditional primers (Table 1) and the T7EI assay. The fragments were separated on agarose gels and quantitated using ImageJ; the mutation frequencies were calculated and averaged. To better determine the mutation rate, amplification bands were cloned using the TOPO® TA kit (Invitrogen), Sanger sequenced and aligned to the genomic sequence to observe the individual mutations and determine the mutational spectra. Sanger sequencing was chosen to ensure the detection of large insertions and deletions, as well as to effectively detect single base indels, both of which can be problematic with the next-generation sequencing methods.
Table 1: Sequence of primers used to amplify endogenous loci for the T7EI assay, sequencing and quantitative PCR
Figure imgf000072_0001
Off-target analysis
Off-target analysis was performed using a bioinformatics-based search tool to select potential off-target sites, which were evaluated using the T7EI mutation detection assay. Sanger sequencing was used to confirm the gene modification frequencies for the CRISPR/Cas9 systems, including guide strand R-02 at GRIN3A (see Figure 6B) and compared to the on-target rate (Figure 6A).
Results
The ability to precisely edit endogenous DNA sequences has greatly facilitated the creation of cell lines and animal models for biological and disease studies, and led to unprecedented opportunities in therapeutics. For example, engineered zinc finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs) have generated hundreds of animal models for disease studies (Perez, et al., Nat. Biotechnol., 26:808-816 (2008); Geurts, et al., Science, 325:433 (2009)), and nuclease-based treatment strategies are currently undergoing clinical trials. The discovery of a bacterial defense system that uses RNA-guided DNA cleaving enzymes and clustered, regularly interspaced, short palindromic repeats (CRISPR) (Bolotin, et al., Microbiology, 151:2551-2561 (2005); Horvath, Science, 327:167-170 (2010); Marraffini, et al., Nat. Rev. Genet., 11:181-190 (2010); Garneau, et al., Nature, 468:67-71 (2010); Hale, et al., Cell, 139:945-956 (2009)) may provide an exciting alternative to ZFNs and TALENs, as the CRISPR-associated (Cas) protein remains the same for different gene targets; only the short sequence of the guide RNA needs to be changed to redirect the site-specific cleavage (Cong, et al., Science, 339:819-823 (2013)).
Potential off-target cleavage by engineered nucleases poses concerns both for adverse events in therapeutic applications and confounding variables in biological studies. ZFNs (Pattanayak, et al., Nat. Methods, 8:765-770 (2011); Gabriel, et al., Nat. Biotechnol., 29:816-823 (2011)) have been shown to lack exquisite specificity and may cleave sequences in addition to their intended targets, which often induces unwanted mutations and/or toxicity (Cornu, et al., Methods Mol. Biol., 649:237-245 (2010); Ramirez, et al., Nucleic Acids Res., 40:5560-5568 (2012)). Although reports indicate that TALENs have better specificity than ZFNs, off-target activities have been found for TALENs as well (Tesson, et al., Nat. Biotechnol., 29:695-696 (2011); Hockemeyer, et al., Nat. Biotechnol., 29:731-734 (2011); Mussolino, et al., Nucleic Acids Res., 39:9283-9293 (2011)). Previous in vitro studies indicate that CRISPR/Cas9 systems have a high potential for off-target activity, as they have more promiscuous binding abilities at positions distal from the protospacer-adjacent motif (PAM) region (Cong, et al., Science, 339:819-823 (2013); Gasiunas, et al., Proc. Natl Acad. Sci. USA, 109:E2579-E2586 (2012); Jinek, et al., Elife, 2:e00471 (2013); Jiang, et al., Nat. Biotechnol., 31:233-239 (2013)). Further, because the guide RNA strands typically target a DNA sequence of ~20 bp, relatively short compared with the >36 bp targeted by TALENs, many potential off-target sites may exist in large genomes, such as in mammals. Additionally, because non-Watson-Crick base pairing is known to occur (Jiang, et al., Nat. Biotechnol., 31:233-239 (2013)), it is possible that CRISPR/Cas9 systems have more off-target activities compared with corresponding ZFNs and TALENs.
To determine the off-target effects of CRISPR/Cas9 systems in the context of the human genome, a series of CRISPR/Cas9 systems were constructed with guide RNA strands targeting the human hemoglobin β (HBB) and C-C chemokine receptor type 5 (CCR5) genes and expressed in human embryonic kidney 293T (HEK-293T) cells, and their on- and off-target activities were quantified using the T7 endonuclease I (T7EI) mutation detection assay and Sanger sequencing. Special attention was placed on the effects of mismatches between the guide strands and the complementary target sequences. This allowed a direct evaluation of the impact of the location and number of mismatches within the 12 bases nearest the PAM region, as well as those in the PAM region (that usually match the canonical NGG motif, or NAG) (Table 2) on potential off-target activities (Cong, et al., Science, 339:819-823 (2013); Sapranauskas, et al., Nucleic Acids Res., 39:9275-9282 (2011)). The results show that the CRISPR/Cas9 systems targeting the human HBB and CCR5 genes had significant off-target cleavage activities, especially at the HBD and CCR2 genes, which have high sequence homology with HBB and CCR5, respectively.
Table 2: CRISPR on- and off-target cleavage rates
Figure imgf000075_0001
(a) Number of base differences between the guide strand and complementary sequence, including the 5' nucleotide.
(b) Base pair positions from the PAM are numbered above the loci. The differences between the guide strand and complementary sequences are indicated in lowercase underlined nucleotides. The first of the three nucleotides in the PAM sequence is also indicated in lowercase.
* T7EI was performed in duplicate for this off-target site, not triplicate as with all other cases.
Table 2 summarizes the on- and off-target cleavage rates in which, for each CRISPR/Cas9 system, the complementary sequence of the guide strand, the number of mismatches within the guide strand and the name and genetic region of the on- and off-target activities are provided. Specifically, in Table 2, the third and fourth columns list, respectively, the indel percentages determined by Sanger sequencing and T7EI.
Guide strands directed toward HBB resulted in high rates of on-target activity, with an average mutation frequency of 54% measured by the T7EI assay (Figure 1B-1C). Because the T7EI assay may not cleave the PCR product completely and assumptions must be made about the indel diversity to calculate the mutation percentages (Guschin, et al., Methods Mol. Biol., 649:247-256 (2010)), the mutation frequencies were verified using Sanger sequencing. It was determined that for some guide strands and loci, Sanger sequencing gave much higher mutation frequencies than the T7EI measurements. For example, Sanger sequencing of the HBB loci indicated that R-02 and R-03 resulted, respectively, in 60 of 80 (75%) and 31 of 44 (70%) sequences with insertions or deletions (indels) indicative of the error-prone nonhomologous end-joining (NHEJ) DNA repair pathway (Figure 1A-C, Figure 4A-C). Similarly, HEK-293T cells transfected with CRISPR constructs containing guide strands targeting CCR5 resulted in high rates of on-target activity, with an average of 57% mutation frequency measured by the T7EI assay (Figure 2A-C, Figure 5A-C).
Some CRISPR/Cas9 systems with guide strands targeting HBB also cleaved HBD (some at high rates), even though there are mismatches between the guide strands and the complementary HBD sequences. For example, guide strands having just a one-base mismatch with the complementary HBD sequences, located at positions 4 (R-07), 7 (R-01), 8 (R-08), 10 (R-04) and 11 (R-03) bases from the PAM sequence, resulted in off-target mutation rates ranging from 7 to 58%, roughly corresponding to the distance between the mismatch location and the PAM sequence, with R-04 as an exception (Figure 1B). Note that two off-target sites at HBD had mutation rates even higher than the on-target rates at HBB, especially R-08, which induced a mutation rate of 48% at HBD, much higher than that at HBB (36%).
To allow RNA transcription by the U6 polymerase, the guide strand is typically preceded by a guanine (Cong, et al., Science, 339:819-823 (2013)). Results show that it is not necessary for the guanine base to match the target site for efficient cleavage, as seven guide strands without a guanine at this position induced mutations in HBB (R-02 to R-08) and four guide strands (R-03, R-04, R-07, R-08) induced mutations in HBD (Figure 1B).
To a lesser extent, CCR5-targeting CRISPR/Cas9 systems also induced off-target cleavage on CCR2, with mutation rates of 5% and 20% (Figure 2B-2C).
Specifically, guide strand R-25 was designed with two identical genomic targets in the CCR5 and CCR2 genes to identify the influence of factors beyond sequence homology, such as genomic context. The CRISPR/Cas9 system with R-25 showed a >2-fold difference in mutation rate at these two sites (46% versus 20% mutation rate, Figure 2C). These results indicate that other features such as genomic context may play an important role in cleavage activity. Although guide strand R-30 had two mismatches with CCR2 at the two bases proximal to the PAM region, it induced mutations in CCR2 at a rate of 5% as measured by T7EI with 800 ng of plasmid in transfection (Figure 2B). R-30 transfections with 1100 ng of plasmid induced mutations at a rate of 21% as quantified by sequencing (Figure 6C), but only 6% by T7EI (Figure 3E); part of the difference is likely due to the incomplete cleavage of PCR products by T7EI.
A distinct feature of CRISPR off-target activity as related to mismatches in the guide strand is that mismatches in the PAM region can prevent off-target cleavage (Hsu, et al., Nat. Biotechnol., 31:827-832 (2013)). For example, R-06, which has a one-base mismatch in the PAM, did not induce detectable mutations at HBD, although it has a perfect match of the 14 bases proximal to the PAM (Figure 1B-1C). Further, R-02 did not induce cleavage at HBD because of the one-base mismatch in the PAM and two mismatches at positions 2 and 4 from the PAM (Figure 1B).
Similarly, there was no off-site mutagenesis detected at CCR2 by the CCR5-targeting CRISPR/Cas9 systems with guide strands R-27 and R-29 that had NTG and NGT PAM substitutions, respectively. In particular, although R-29 had a perfect match with the 18-bp sequence proximal to the PAM, a one-base mismatch in the PAM region prevented cleavage of CCR2 (Figure 2B-2C). Clearly, off-target cleavage could also be prevented without any mismatch in the PAM, by having multiple mismatches between the guide strand and the complementary target sequence proximal to the PAM, as demonstrated by R-05 (Figure 1B) and R-26 (Figure 2B).
To quantify the change in CRISPR/Cas9 cleavage activity with transfection conditions, CRISPR plasmids were transfected at doses from 100 to 800 ng, and the corresponding on- and off-target activities were measured by T7EI (Figure 3A-3E). As the dose decreased, R-04 and R-25 gave lower on- and off-target activities, whereas R-30 resulted in increased on-target activity and decreased off-target activity; the on- and off-target activities of R-03 and R-08 remained roughly the same. In general, transfection with the lowest dose (100 ng) increased the ratio of on-target to off-target activities for R-04, R-25 and R-30, although not for R-03 and R-08. These findings expand the results of a study where no appreciable changes in on- and off-target rates were found with two CRISPR guide strands at two doses (Fu, et al., Nat. Biotechnol., 31:822-826 (2013)).
Example 2: CRISPR-targeted loci showed a wide variety of insertions, deletions and point mutations
Materials and Methods
Chromosomal deletion analysis
To assay for gross chromosomal deletions, genomic DNA from cells transfected with R-03 was amplified using the HBD forward primer and the reverse primer downstream of the HBB site. Genomic DNA from cells transfected with R-25 or R-30 was similarly amplified using the CCR2 forward and the CCR5 reverse primers. Agarose gels were used to confirm that the polymerase chain reaction (PCR) product sizes were consistent with chromosomal deletions between these sites. The R-03, R-25 and R-30 PCR products were cloned, and the individual colonies were Sanger sequenced and aligned.
Quantitative PCR
Quantitative PCR was used to determine the percentage of HBD-HBB chromosomal deletions. HEK-293 cells were transfected in triplicate with CRISPR plasmids containing guide strands R-02 or R-03, or were mock transfected. Genomic DNA was harvested using QuickExtract (EpiCentre), per manufacturer's protocol.
Amplification reactions contained 1 μl of genomic DNA added to mastermix aliquots containing 0.1 μl of each 10 μM primer, 3.8 μl of water and 5 μl of iTaq Universal SYBR Green 2x Supermix. The reactions were analysed on an Mx3005P qPCR System (Stratagene) using MxPro qPCR software. As the genomic DNA could not be normalized, the total amount of HBB and the amount of HBD to HBB deletions were measured to determine the percentage of chromosomal deletions. Total HBB was measured using primers HBB-308R and HBB-mid99, which generated a 99 bp product from unmodified HBB or from chromosomal DNA with HBD to HBB deletions, as the primers bind outside the cleavage site. The HBD-HBB chromosomal deletion was measured using primers HBB-308R and HBD-520F, which generated a 225 bp product that spans the cleavage site. The HBB product was seen in mock transfections, as HBB was unmodified. Mock transfection DNA did not amplify using HBB-308R and HBD-520F, indicating a lack of these chromosomal deletions. The no-template controls for each primer set were negative.
Results
As revealed by Sanger sequencing, CRISPR-targeted loci showed a wide variety of insertions, deletions and point mutations. Because HBD is located ~7 kb upstream of HBB on chromosome 11, cleavage at both sites raises the possibility of chromosomal rearrangements, including a deletion of the intervening segment (Lee, et al., Genome Res., 20:81-89 (2010); Gupta, et al., Genome Res., 23:1008-1017 (2013); Xiao, et al., Nucleic Acids Res., 41:e141 (2013); Gratz, et al., Genetics, 194:1029-1035 (2013)). These gross chromosomal deletions are seen with guide strand R-03, which cleaves both HBB and HBD at high rates, even though it has a mismatch to HBD (Figure 4A and 4B). PCR amplification and sequence analysis revealed gross chromosomal deletions resulting from rejoining the DNA double-strand break ends induced by two cleavage events in (or near) the conserved region of HBB and HBD (Figure 4C). Each of these joined HBD-HBB clones amplified from cells transfected with R-03 had an indel consistent with NHEJ.
Quantitative PCR was used to estimate the number of HBB alleles containing the chromosomal deletion with HBD. Standard curves were made using serial dilutions of cloned HBD-HBB deletion fragment, so that the standard curves of both sets of primers could be compared (Figure 4D). Quantities were very similar across this standard curve using either the HBB pair of primers or the HBD-HBB pair of primers, which allowed comparison of the total amount of HBB and the amount of HBD to HBB deletions. The groupings of three HBD/HBB samples for R-02 and R-03 are labelled (Figure 4D). Genomic DNA from the cells transfected with guide strand R-03 contained HBD-HBB chromosomal deletions equal to 12.6% of the copies of total HBB (Table 3). This was compared to genomic DNA from the cells transfected with guide strand R-02, which had higher HBB cleavage, but low HBD cleavage. The R-02 treated genomic DNA contained HBD-HBB chromosomal deletions equal to 0.4% of the copies of total HBB.
Table 3: Results of quantitative PCR analysis
Figure imgf000080_0001
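By way of non-limiting illustration, the percentage calculation summarized in Table 3 can be sketched in Python as follows. The sketch assumes a log-linear standard curve (Ct as a linear function of log10 quantity), which is a common qPCR convention and is not stated in the text above; the function names and the example values in the usage lines are hypothetical and do not reproduce the MxPro software or the measured data.

```python
import math

def fit_standard_curve(quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the template quantity."""
    return 10 ** ((ct - intercept) / slope)

def percent_deletion(deletion_copies, total_hbb_copies):
    """HBD-HBB deletion copies expressed as a percentage of total HBB copies."""
    return 100.0 * deletion_copies / total_hbb_copies

# Hypothetical dilution series of the cloned deletion fragment (assumed values).
slope, intercept = fit_standard_curve([10, 100, 1000], [34.68, 31.36, 28.04])
total_hbb = quantity_from_ct(28.04, slope, intercept)   # HBB primer pair
deletions = quantity_from_ct(31.36, slope, intercept)   # HBD-HBB primer pair
print(round(percent_deletion(deletions, total_hbb), 1))
```

Because both primer pairs are read against the same dilution series, the ratio does not depend on normalizing the amount of input genomic DNA, which is the reason given above for measuring total HBB alongside the deletion product.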
Similarly, CCR5 is located ~8 kb upstream of CCR2 on chromosome 3; thus, chromosomal rearrangements may occur with cleavages at both CCR5 and CCR2. These gross chromosomal deletions were detected with the R-25 CRISPR/Cas9 system, which cleaved both genes at high rates (Figure 5A and 5B). Here again, PCR amplification and sequence analysis revealed two cleavage events in (or near) a conserved region of the CCR5 and CCR2 genes, as indicated by indels consistent with NHEJ (Figure 5C). Cells transfected with the R-30 CRISPR/Cas9 system also had chromosomal deletions between CCR5 and CCR2 (Figure 5C).
Sequencing the on- and off-target loci revealed a range of different indels as a result of CRISPR/Cas9-induced DNA cleavage and mis-repair. Cleavage followed by correct repair is more difficult to detect, as the sequence does not change. The changes include three large insertions (140, 216 and 448 bp), and a range of deletions. Some sequencing reads had both mutations and indels, and some had only mutations with no change in length. Specifically, the results indicated that one-base insertions and deletions occurred frequently, usually several bases from the PAM sequence, consistent with the reported cleavage between the third and fourth bases from the PAM (Jinek, et al., Science, 337:816-821 (2012)). As shown in Figure 7, the frequency of cleavage-induced gene modifications varied significantly with indels of different sizes, though 21% were one-base insertions and 12% one-base deletions. Interestingly, a common indel size was a 9-bp deletion that occurred in 14% of the clones, possibly due to microhomologies in the sequence. Because the range of indels is influenced by sequence differences, microhomologies and/or palindromes in the area being cleaved (Yu, et al., Nucleic Acids Res., 38:5706-5717 (2010)), and the results were primarily from a limited number of overlapping target sites, further sequence analysis is needed to ensure a more general distribution.
Although CRISPR/Cas9 systems can induce high rates of gene modification in mammalian cells, they do not have perfect specificity, similar to previous observations with ZFNs and TALENs. The results presented in Examples 1 and 2 demonstrate that CRISPR/Cas9 systems can have significant off-target activities even if 10 or 11 of the 12 bases proximal to the PAM sequence match. Therefore, it is likely that there are many more potential off-target sites in the human genome than previously thought (Cong, et al., Science, 339:819-823 (2013); Mali, et al., Science, 339:823-826 (2013)), if cleavage occurs when any permutation of 10 of the 12 bases in the guide strand matches a genomic sequence. The results indicate that mismatches in, or proximal to, the PAM sequence could block cleavage, as seen by others (Hsu, et al., Nat. Biotechnol., 31:827-832 (2013); Fu, et al., Nat. Biotechnol., 31:822-826 (2013); Mali, et al., Science, 339:823-826 (2013)). However, there are contrary examples, such as R-30 that cleaves CCR2 with mismatches in the two PAM-proximal bases (Figure 2B, Figure 6C).
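To give a sense of scale for the statement above, the following short Python sketch counts how many distinct 12-base sequences match a given 12-base PAM-proximal sequence at 10 or more positions. It assumes a four-letter alphabet and substitutions only (no insertions or deletions), and it counts sequences rather than genomic occurrences; the function name is hypothetical.

```python
from math import comb

def sequences_within_mismatches(length, max_mismatches, alphabet_size=4):
    """Count sequences differing from a reference at no more than max_mismatches positions."""
    return sum(
        comb(length, k) * (alphabet_size - 1) ** k  # choose k positions, 3 alternatives each
        for k in range(max_mismatches + 1)
    )

print(sequences_within_mismatches(12, 2))  # 1 + 36 + 594 = 631
```

Allowing up to two mismatches therefore enlarges the set of candidate 12-base sequences from 1 to 631 before the PAM requirement is applied.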
The importance of the PAM sequence (Mojica, et al., Microbiology, 155:733-740 (2009)) was corroborated by the lack of cleavage at some complementary sequences similar to the guide strand, but with PAM sequences differing from NGG (Figures 1B and 2B). An example is guide strand R-06, which cleaved HBB at 59% but had no detectable cleavage at HBD, presumably due to the NGA in the PAM sequence. Similarly, R-29 cleaves CCR5 at 65% efficiency. R-29 failed to cleave at CCR2, possibly due to the less tolerated, adjacent NGT PAM sequence, although the R-29 guide strand matches the 18 bases closest to the PAM sequence at CCR2.
Although Cas9 is thought to generate blunt ends (Gasiunas, et al., Proc. Natl Acad. Sci. USA, 109:E2579-E2586 (2012); Jinek, et al., Science, 337:816-821 (2012)), the results presented in Examples 1 and 2 indicate that CRISPR-directed on- and off-target cleavage can induce a wide range of indels, with a large number of one-base insertions and a few large deletions. The high rate of off-target cleavage may result in large indels, causing a significant potential of mutagenesis and chromosomal rearrangements. For example, if two or more cleavage sites are on the same chromosome, it may lead to gross chromosomal deletions, as seen with R-03 (Figure 4C) and R-25 (Figure 5C). These chromosomal deletions and the high levels of on- and off-target cleavage indicate that there might be other chromosomal rearrangements, translocations and inversions. Although the ability of engineered CRISPR/Cas9 systems to target multiple sites/genes with different guide strands is an exciting feature (Cong, et al., Science, 339:819-823 (2013); Mali, et al., Science, 339:823-826 (2013); Wang, et al., Cell, 153:910-918 (2013)), each system may lead to off-target cleavage. The effect of having multiple guide strands on off-target cleavage and its effect on rates of chromosomal rearrangement have yet to be thoroughly studied (Wang, et al., Cell, 153:910-918 (2013)). A CRISPR/Cas9 system may cause chromosomal rearrangements with one guide strand inducing cleavage at two defined locations, or with a pair of guide strands inducing deletion between the target sites (Xiao, et al., Nucleic Acids Res., 41:e141 (2013)); in both cases the off-target effects of each guide strand must be assayed. Therefore, multiplexed gene editing using CRISPR/Cas9-based approaches might have limitations unless optimal design of the guide strands can be performed to reduce or even eliminate the potential for gross chromosomal rearrangements.
As demonstrated in this work and elsewhere (Hsu, et al., Nat. Biotechnol., 31:827-832 (2013); Fu, et al., Nat. Biotechnol., 31:822-826 (2013)), CRISPR/Cas9 systems may have high rates of off-target cleavage; therefore, care must be taken when choosing and evaluating target sites. Even with diligent choice of target sites, in most genome editing applications, quantifying the off-target activities is necessary to identify unintended cleavage and mutagenesis. Transfection conditions, including plasmid dosage, may be optimized to decrease off-target cleavage, although the effects may vary with guide strands (Figures 3A-3E). The variety of on- and off-target cleavage rates induced by CRISPR/Cas9 systems raises hope that better selection of target sites, possibly through rational design and/or screening in cells, can result in gene editing with improved specificity. Advanced genome searches may be needed in choosing optimal target sites by minimizing the number of potential off-target sites corresponding to different mismatches. More extensive off-target analysis of the CRISPR/Cas9 systems, with a combination of bioinformatics and experimental approaches, may reveal patterns and design guidelines that better predict the target sites that can be effectively cleaved with high specificity.
Example 3: sgRNA variants containing single-base DNA bulges induce Cas9 cleavage
Materials and Methods
CRISPR/Cas9 plasmid assembly
DNA oligonucleotides containing a G followed by a 19-nt guide sequence (Table 3) were kinased, annealed to create sticky ends and ligated into the pX330 plasmid that contains the +85 chimeric RNA under the U6 promoter and a Cas9 expression cassette under the CBh promoter (available at Addgene) (Hsu, et al., Nat. Biotechnol., 31 (2013)).
Table 4: Protospacer target sites for the sgRNAs used in Examples 3-8
Figure imgf000083_0001
Variants of sgRNAs were constructed and tested with one or more nucleotides inserted or deleted (Table 5).
Table 5: sgRNA variants
Figure imgf000084_0001
Figure imgf000085_0001
Figure imgf000086_0001
Figure imgf000087_0001
Figure imgf000088_0001
Index names correspond to the index in Figures 6A-6H and Figures 2A-5C. Dashes indicate deleted nucleotides; "nd" means activity was not detected in the T7EI assay.
The annealed oligonucleotides have 4-bp overhangs that are compatible with the ends of BbsI-digested pX330 plasmid. Constructed plasmids were sequenced to confirm the guide strand region using the primer CRISPR seq 5'-CGATACAAGGCTGTTAGAGAGATAATTGG-3'.
T7 endonuclease I (T7EI) mutation detection assay for measuring endogenous gene modification rates
The cleavage activity of RNA-guided Cas9 at endogenous loci was quantified based on the mutation rates resulting from the imperfect repair of double-stranded breaks by NHEJ. In a 24-well plate, 60,000 HEK293T cells per well were seeded and cultured in Dulbecco's Modified Eagle Medium (DMEM) supplemented with 10% Fetal Bovine Serum (FBS) and 2 mM fresh L-glutamine, 24 h prior to transfection. Cells were transfected with 750 ng (sgRNA variants) or 1000 ng of CRISPR plasmids using 3.4 μl FuGene HD (Promega), following manufacturer's instructions. Each sgRNA plasmid was transfected as biological duplicates in two separate transfections. All subsequent steps, including the T7EI assay, were performed independently for the duplicates. A HEK293T-derived cell line containing a stably integrated EGFP gene was used for sgRNAs targeted to the EGFP gene. This cell line was constructed by correcting the mutations in the EGFP gene in the cell line 293/A658 (Jinek, et al., Science, 337:816-821 (2012)) (kindly provided by Dr Francesca Storici). The genomic DNA was harvested after 3 days using QuickExtract DNA extraction solution (Epicentre), as described in (Yu, et al., Nucleic Acids Res., 38:5706-5717 (2010)). T7EI mutation detection assays were performed as described previously (Mali, et al., Science, 339:823-826 (2013)), and the digestions were separated on 2% agarose gels. The cleavage bands were quantified using ImageJ. The percentage of gene modification = 100 x (1 - (1 - fraction cleaved)^0.5), as described (28).
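As a non-limiting illustration, the gene modification formula given above can be written as a short Python function. The helper that derives the fraction cleaved from band intensities uses a common convention (cleaved bands divided by all bands) that is an assumption here and is not stated in the text; the function names and the example intensities are hypothetical.

```python
import math

def fraction_cleaved(parent_band, cleaved_bands):
    """Assumed convention: summed cleaved-band intensity over total band intensity."""
    cleaved = sum(cleaved_bands)
    return cleaved / (parent_band + cleaved)

def percent_gene_modification(frac_cleaved):
    """Percentage of gene modification = 100 x (1 - (1 - fraction cleaved)^0.5)."""
    if not 0.0 <= frac_cleaved <= 1.0:
        raise ValueError("fraction cleaved must lie between 0 and 1")
    return 100.0 * (1.0 - math.sqrt(1.0 - frac_cleaved))

# Hypothetical band intensities (arbitrary units) from a gel quantification.
frac = fraction_cleaved(parent_band=250.0, cleaved_bands=[450.0, 300.0])
print(percent_gene_modification(frac))  # fraction cleaved 0.75 gives 50.0
```

The square root reflects that only heteroduplexes are cleaved: when strands re-anneal at random, the uncleaved fraction is approximately the square of the unmodified fraction, so the estimate is always lower than the raw fraction cleaved.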
Unless otherwise stated, all polymerase chain reactions (PCRs) were performed using AccuPrime Taq DNA Polymerase High Fidelity (Life Technologies) following manufacturer's instructions for 40 cycles (94°C, 30 s; 60°C, 30 s; 68°C, 60 s) in a 50 μl reaction containing 1.5 μl of the cell lysate, 3% Dimethyl sulfoxide (DMSO) and 1.5 μl of each 10 μM target region amplification primer (Tables 6 and 7) or off-target region amplification primer (Tables 8 and 9).
Table 6: Primers for Target PCR
Figure imgf000090_0001
Table 7: Primer sequences
Figure imgf000090_0002
Sequences of primers used to amplify endogenous loci for testing the on-target activities of sgRNAs, and primers for qPCR. The target gene, the sgRNAs using the primers, and any special PCR conditions are listed with each pair of primers in Table 6. The primer sequences are listed in the lower portion of Table 7.
Figure imgf000091_0001
Figure imgf000092_0001
Figure imgf000093_0001
Figure imgf000094_0001
Figure imgf000095_0001
Sanger sequencing of gene modifications resulting from Cas9
To validate the mutation rates measured by T7EI assay, the PCR products used in the T7EI assays were cloned into plasmid vectors using TOPO TA Cloning Kit for Sequencing (Life Technologies) or Zero Blunt TOPO PCR Cloning Kit (Life Technologies), following manufacturer's instructions. Plasmid DNAs were purified and Sanger sequenced using a M13F primer (5'-TGTAAAACGACGGCCAGT-3'). The mutation rates were determined by comparing each sequence read to the genomic sequence.
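The read-by-read comparison described above can be sketched, under simplifying assumptions, as the following Python fragment. It assumes every clone read has already been trimmed to the same window as the reference, so that any difference from the reference counts as a modification; the function names are hypothetical, and the counts in the usage lines are the R-02 and R-03 HBB clone counts reported in Example 1.

```python
def count_modified(reads, reference):
    """Count reads that differ from the reference window in any way."""
    return sum(1 for read in reads if read != reference)

def percent_modified(n_modified, n_total):
    """Mutation rate as a percentage of sequenced clones."""
    if n_total <= 0:
        raise ValueError("at least one sequenced clone is required")
    return 100.0 * n_modified / n_total

print(round(percent_modified(60, 80)))  # R-02 at HBB: 75
print(round(percent_modified(31, 44)))  # R-03 at HBB: 70
```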
Results
Advances with engineered nucleases allow high-efficiency, targeted gene editing in numerous organisms, primary cells and cell lines. Gene editing has been used to create user-defined cells, model animals and gene-modified stem cells with novel characteristics that can be used for gene functional studies, disease modeling and therapeutic applications. Clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated (Cas) proteins constitute a bacterial defense system that cleaves invading foreign nucleic acids (Bolotin, et al., Microbiology, 151:2551-2561 (2005); Horvath, et al., Science, 327:167-170 (2010); Marraffini, et al., Nat. Rev. Genet., 11:181-190 (2010); Garneau, et al., Nature, 468:67-71 (2010); Hale, et al., Cell, 139:945-956 (2009); Makarova, et al., Biol. Direct, 1:7 (2006); Barrangou, et al., Science, 315:1709-1712 (2007); Brouns, et al., Science, 321:960-964 (2008)). Chimeric single-guide RNAs (sgRNAs) based on CRISPR (Jinek, et al., Science, 337:816-821 (2012)) have been engineered to direct the Cas9 nuclease to cleave complementary genomic sequences when followed by a 5'-NGG protospacer-adjacent motif (PAM) in eukaryotic cells (Mali, et al., Nat. Methods, 10:957-963 (2013); Cong, et al., Science, 339:819-823 (2013); Mali, et al., Science, 339:823-826 (2013)). Since gene targeting by CRISPR/Cas9 is directed by base pairing, such that only the short 20-nt sequence of the sgRNA needs to be changed for different target sites, CRISPR/Cas systems enable simultaneous targeting of multiple deoxyribonucleic acid (DNA) sequences and robust gene modification (Jinek, et al., Science, 337:816-821 (2012); Mali, et al., Nat. Methods, 10:957-963 (2013); Cong, et al., Science, 339:819-823 (2013); Yang, et al., Cell, 154:1370-1379 (2013); Xie, et al., Mol. Plant, 6 (2013); Hwang, et al., Nat. Biotechnol., 31:227-229 (2013); Cho, et al., Nat. Biotechnol., 31:230-232 (2013); Li, et al., Nat. Biotechnol., 31:681-683 (2013); Shan, et al., Nat. Biotechnol., 31:686-688 (2013)).
Endogenous DNA sequences followed by a PAM sequence can be targeted for cleavage by designing a ~20-nt sequence of the sgRNA complementary to the target. However, other sequences in the genome may also be cleaved non-specifically, and such off-target cleavage by CRISPR/Cas systems remains a major concern. Generally speaking, there is a partial match between the on- and off-target sites, and the differences between the on- and off-target sequences can be grouped into three cases: (a) same length but with base mismatches; (b) off-target site has one or more bases missing ('deletions'); (c) off-target site has one or more extra bases ('insertions'). Recent studies have shown that CRISPR/Cas9 systems non-specifically cleave genomic DNA sequences containing base-pair mismatches (case a), generating off-target mutations in mammalian cells with considerable frequencies (Fu, et al., Nat. Biotechnol., 31:822-826 (2013); Hsu, et al., Nat. Biotechnol., 31:827-832 (2013); Pattanayak, et al., Nat. Biotechnol., 31:839-843 (2013); Cradick, et al., Nucleic Acids Res., 41:9584-9592 (2013); Mali, et al., Nat. Biotechnol., 31:833-838 (2013); Cho, et al., Genome Res., 24:132-141 (2014)). Mismatches in the PAM sequence are less tolerated, although Cas9 also recognizes an alternative NAG PAM with low frequency (Hsu, et al., Nat. Biotechnol., 31:827-832 (2013); Mali, et al., Nat. Biotechnol., 31:833-838 (2013); Jiang, et al., Nat. Biotechnol., 31:233-239 (2013)). In addition, Cas9 off-target cleavage at a similar gene sequence with a base pair mismatch may lead to gross chromosomal deletions with high frequencies, as demonstrated by the deletion of the 7-kb sequence between two cleavage sites in HBB and HBD, respectively (Cradick, et al., Nucleic Acids Res., 41:9584-9592 (2013)). These results indicate that, although Cas9 specificity extends past the 7-12 bp seed sequence (Hsu, et al., Nat. Biotechnol., 31:827-832 (2013); Pattanayak, et al., Nat. Biotechnol., 31:839-843 (2013)), off-target effects may limit the applications of Cas9-mediated gene modification, especially in large mammalian genomes that contain multiple DNA sequences differing by only a few mismatches. One report revealed that 99.96% of the sites previously assumed to be unique Cas9 targets in human exons may have potential off-target sites containing a functional (NAG or NGG) PAM and one single-base mismatch compared with the on-target site (Mali, et al., Nat. Biotechnol., 31:833-838 (2013)).
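Case (a) above, a same-length site with base mismatches followed by an NGG or NAG PAM, can be illustrated with the following minimal Python sketch. It scans only the forward strand of a short sequence by brute force and is intended as an illustration under those assumptions, not as the search method of the disclosed system; the function names and the toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def find_mismatch_sites(genome, guide, max_mismatches, pam_suffixes=("GG", "AG")):
    """Return (start, site, pam, mismatches) for forward-strand sites followed by N + suffix."""
    hits = []
    n = len(guide)
    for start in range(len(genome) - n - 3 + 1):
        site = genome[start:start + n]
        pam = genome[start + n:start + n + 3]
        if pam[1:] not in pam_suffixes:
            continue  # no NGG or NAG PAM directly 3' of the candidate site
        mismatches = hamming(site, guide)
        if mismatches <= max_mismatches:
            hits.append((start, site, pam, mismatches))
    return hits

# Toy example: one perfect site with a TGG PAM and one single-mismatch site with an AAG PAM.
guide = "ACGTACGTAC"
genome = "TTT" + "ACGTACGTAC" + "TGG" + "CCC" + "ACGTACCTAC" + "AAG" + "TT"
for hit in find_mismatch_sites(genome, guide, max_mismatches=1):
    print(hit)
```

A practical search would also cover the reverse-complement strand and use an indexed genome; cases (b) and (c) require allowing a gap in either the site or the guide, which a plain Hamming comparison cannot express.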
Examples 3-8 examine the above-mentioned cases (b) and (c) of potential CRISPR/Cas9 off-target cleavage in human cells by systematically varying sgRNAs at different positions throughout the guide sequence to mimic insertions or deletions between off-target sequences and the RNA guide strand. To avoid confusion, for single-base insertions, a 'DNA bulge' was used to represent the extra, unpaired base in the DNA sequence compared with the guide sequence. Similarly, for single-base deletions, an 'RNA bulge' was used to represent the extra, unpaired base in the guide sequence compared with the DNA sequence (Figures 8A-8B). Therefore, adding a base into the guide RNA would result in an RNA bulge, while removing a base in the guide strand can be used to model a DNA bulge. The cleavage activity of RNA-guided Cas9 at endogenous loci in HEK293T cells transfected with plasmids encoding Cas9 and sgRNA variants was quantified as the mutation rates induced by Non-Homologous End Joining (NHEJ). The results below show that off-target cleavage resulting from the sgRNA variants occurred with DNA bulges or sgRNA bulges at multiple positions in the guide strands, sometimes at levels comparable to or even higher than those of the original sgRNAs. Cas9-mediated mutagenesis was also examined at 114 potential off-target loci in the human genome carrying single-base DNA bulges or sgRNA bulges together with a range of base mismatches, and the results confirmed 15 off-target sites with mutation frequencies up to 45.5%. The results illustrate the need to search for genomic sites with base-pair mismatches, insertions and deletions compared with the guide RNA sequence in analyzing CRISPR/Cas9 off-target activity and in designing RNA guide strands for targeting specific genomic sites.
To determine if CRISPR/Cas9 systems tolerate genomic target sites containing single-base DNA bulges (Figure 8A), the sgRNA-DNA interfaces of two sgRNAs, R-01 and R-30, targeting the HBB and CCR5 genes, respectively, were used as a model system (Cradick, et al., Nucleic Acids Res., 41:9584-9592 (2013)). Systematically removing single nucleotides at all possible positions throughout the original 19-nt guide sequences of R-01 and R-30 resulted in single-base DNA bulges at their original HBB and CCR5 target sites that model single-base insertions at potential off-target sites in the genome (Figure 9A and 10A).
Cleavage of the genomic DNA in HEK293T cells was quantified using the T7EI mutation detection assay. For both groups of sgRNA variants (generated from R-01 and R-30, respectively), single-base DNA bulges at certain positions in the DNA sequences were well tolerated (i.e., Cas9-induced cleavage still occurred), though variants of R-30 had higher cleavage activity at more locations (Figure 9B-9C and 10B-10C). For both groups, it was clear that Cas9 tolerated DNA bulges in target sites in three regions: seven bases from the PAM, the 5'-end (PAM-distal) and the 3'-end (PAM-proximal). Specifically, "-1 nt" variants of R-01 induced Cas9 cleavage activity when a single-base DNA bulge was present at positions 1 or 2, 6 or 7, 18 and 19 of the target DNA sequence from the PAM (Figure 9B-9C). Due to the presence of consecutive identical nucleotides at positions 1 and 2, and at positions 6 and 7, removing either one of the identical nucleotides in the sgRNA at these adjacent positions gives the same sequence and the same sgRNA-DNA interface (their position is therefore marked as 'or' in Figure 9B-9C and 10B-10C).
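The construction of "-1 nt" variants, including the collapsing of adjacent identical nucleotides into a single 'or' position, can be sketched in Python as follows. The sketch assumes the guide is written 5' to 3' with the PAM on the 3' side, so that the last base is position 1; the function name and the five-base toy guide are hypothetical and are not the R-01 or R-30 sequences.

```python
def deletion_variants(guide):
    """Map each distinct '-1 nt' variant to the PAM-relative positions that produce it."""
    n = len(guide)
    variants = {}
    for idx in range(n):
        variant = guide[:idx] + guide[idx + 1:]
        pos_from_pam = n - idx  # the 3'-most base of the guide is position 1
        variants.setdefault(variant, []).append(pos_from_pam)
    return {variant: sorted(positions) for variant, positions in variants.items()}

# Toy guide with identical bases at positions 2 and 3 from the PAM.
for variant, positions in deletion_variants("ACGGT").items():
    print(variant, " or ".join(str(p) for p in positions))
```

For the toy guide, removing either G yields the same variant, which is reported once with the label "2 or 3", mirroring the 'or' notation used in the figures.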
In contrast, "-1 nt" variants of R-30 induced variable cleavage activity at more positions throughout the guide sequence: positions 1, 2 or 3, 7, 8, 9 or 10, 11, 16, 17, 18 and 19 from the PAM (Figure 10B-10C). Seven R-30 variants have activities comparable to or even higher than that of the original sgRNA. These variants correspond to DNA bulges at positions 1, 2 or 3, 8, 9 or 10, 11, 18 and 19 from the PAM (Figure 10B-10C). Consistent with previous studies showing that the specificity of CRISPR/Cas9 systems is guide-strand and target-site dependent (Fu, et al., Nat. Biotechnol., 31:822-826 (2013); Hsu, et al., Nat. Biotechnol., 31:827-832 (2013); Cradick, et al., Nucleic Acids Res., 41:9584-9592 (2013)), the positions in R-01 sgRNA variants where DNA bulges were tolerated are different from those in R-30 sgRNA variants. However, these positions seem to group in the 5'-end, middle and 3'-end regions of the target loci, as in both R-01 and R-30 sgRNA-DNA interfaces, single-base DNA bulges at the following five positions seem to be tolerated: positions 1, 2, 7, 18 and 19. Although additional studies are needed to determine if these positions are common for different target sequences, the data support a conclusion that single-base DNA bulges at the target sites corresponding to these positions are worth investigating when performing off-target analysis for CRISPR/Cas9 systems.
In certain cases, off-target sites with DNA bulges may also be interpreted as sequences having various base mismatches with the guide sequence and/or PAM (Figure 11A-11B). For example, the sgRNA-DNA interfaces corresponding to removing 5'-end bases in the guide sequences (positions 18 and 19 of the R-01 interface and 16-19 of the R-30 interface) can be viewed as having DNA bulges or having mismatches in the 5'-end region of the sgRNA, which have been shown to be better tolerated compared to the 3'-end region (Cong, et al., Science, 339:819-823 (2013); Fu, et al., Nat. Biotechnol., 31:822-826 (2013); Hsu, et al., Nat. Biotechnol., 31:827-832 (2013)). Therefore, the Cas9 cleavage activities induced by these guide strands may be interpreted as tolerance of base mismatches at the 5'-end of the guide RNA. In addition, the position-1 variant of R-30 results in a shift in the adjacent PAM from GGG to CGG (another canonical PAM), which could explain why the activity of this guide sequence variant was similar to the original R-30. However, off-target activities associated with most other DNA bulges for the R-01 and R-30 interfaces cannot be attributed to base mismatch tolerance, since a base removal in the sgRNAs (corresponding to a DNA bulge) could result in many base mismatches or a mutation in the PAM sequence. For example, the cleavage activity induced by the R-01 variant at position 2/1 may be alternatively interpreted as Cas9 cleavage with a GTG PAM (Figure 9B-9C and Figure 11A), which is highly unlikely according to previous studies (Hsu, et al., Nat. Biotechnol., 31:827-832 (2013); Pattanayak, et al., Nat. Biotechnol., 31:839-843 (2013)). Further, an R-30 guide strand variant at position 11 would contain at least seven mismatches if modeled without a bulge. This guide strand resulted in a 1.8-fold higher cleavage activity compared to the original R-30 (Figure 10B-10C and Figure 11B), which cannot be readily explained by the high level of base mismatches (which should prohibit cleavage), and thus should be attributed to the tolerance of DNA bulges. This is a good example of a sequence-verified off-target site with a bulge that could be modeled as mismatches without indels, though the number and position of mismatches would likely not allow cleavage.
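The contrast between the two interpretations discussed above, many mismatches without a bulge versus few or no mismatches with a single-base DNA bulge, can be illustrated with the following Python sketch. It assumes sequences written 5' to 3' with the PAM on the 3' side and a target site exactly one base longer than the guide variant; the function names and the toy sequences are hypothetical and do not correspond to the R-01 or R-30 interfaces.

```python
def mismatches_ungapped(guide, target):
    """Align the PAM-proximal (3') ends and count mismatches over the overlap, with no bulge."""
    return sum(1 for g, t in zip(reversed(guide), reversed(target)) if g != t)

def best_dna_bulge(guide, target):
    """Try each target base as the unpaired bulge; return (mismatches, position from PAM)."""
    best = None
    for idx in range(len(target)):
        paired = target[:idx] + target[idx + 1:]  # target with one base left unpaired
        mm = sum(1 for g, t in zip(guide, paired) if g != t)
        if best is None or mm < best[0]:
            best = (mm, len(target) - idx)
    return best

target = "ACGTTGCATG"   # toy 10-base target site
guide = "ACGTGCATG"     # toy guide variant lacking one of the two adjacent T bases
print(mismatches_ungapped(guide, target))  # mismatches when modeled without a bulge
print(best_dna_bulge(guide, target))       # fewest mismatches when one DNA bulge is allowed
```

In this toy case the ungapped alignment shows three mismatches in the PAM-distal region, whereas allowing one DNA bulge gives a perfect match, which is the kind of discrepancy that argues for searching with bulges rather than with mismatches alone.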
Studies were also designed to determine if sgRNAs with small truncations at the 5'-end retain cleavage activity. One to six nucleotides were deleted from the 5' end of R-01 except for the nucleotide at position 20, because the guanine here is required for expression under the U6 promoter (Figure 12A). For these guide sequence truncations, it was discovered that 1- to 2-bp 5' truncations could still induce cleavage activities similar to the full-length sgRNA (Figure 12B-12C).
Example 4: sgRNA variants containing single-base sgRNA bulges induce Cas9 cleavage
In addition to Cas9-induced cleavage at off-target sites with single-base DNA bulges, additional studies were designed to investigate whether single-base sgRNA bulges (that model single-base deletions in the DNA sequence) could induce Cas9 cleavage (Figure 8B). Again, using sgRNA-DNA interfaces R-01 and R-30 as model systems, single nucleotides were added at positions throughout the original guide sequences, so that the interfaces with target sequences in HBB or CCR5 carry single-base sgRNA bulges (Figure 13A-13B). For some positions, the addition of each single nucleotide (A, C, G and U) to the guide sequence was tested to account for the effect of base identity. As above, HEK293T cells were transfected with plasmids of the Cas9 and sgRNA variants and the T7EI mutation detection assay was used to measure the Cas9 cleavage activity.
sgRNA bulges in the R-30 sgRNA-DNA interface were better tolerated compared to those of R-01. In contrast to the tolerance of DNA bulges adjacent to the PAM, sgRNA bulges close to the PAM prohibited cleavage (Figure 13A-14B). For the R-01 interface, single-base sgRNA bulges between each of the 11 PAM-proximal guide-strand nucleotides resulted in no detectable activity (Figure 13A-13B). Single-base sgRNA bulges of the four nucleotides closest to the PAM in R-30 also eliminated T7EI activity (Figure 14A-14B). The sgRNA bulges 3' to position 11 in R-30 resulted in reduced cleavage activities (Figure 14A-14B). The lack of activity with PAM-proximal sgRNA bulges in R-01 and low levels of activity with PAM-proximal sgRNA bulges in R-30 are consistent with the reduced mismatch tolerance in the 'seed sequence' reported in previous studies (Jinek, et al., Science, 337:816-821 (2012); Cong, et al., Science, 339:819-823 (2013); Sapranauskas, et al., Nucleic Acids Res., 39:9275-9282 (2011)). Nucleotide additions in sgRNA sometimes created consecutive identical nucleotides, such as adding a G before or after position 14 of R-01 or before or after position 15 of R-30. These sgRNA variants model a G-bulge that can be at either position in the sgRNA (Figure 13A-14B). In many cases sgRNA bulges with a single U gave rise to high nuclease activities. Among all sgRNA variants with activities higher than the original sgRNAs, ~71% (5/7) were targeted to the loci with a U-bulge. Overall, single-base sgRNA bulges induced higher Cas9 cleavage activities at many more positions than single-base DNA bulges did. This is not surprising since RNA molecules are more flexible than DNA molecules and thus incur a smaller binding energy penalty with single-base RNA bulges, resulting in a higher tolerance (Alberts, et al., Garland Science (2007)).
RNA-DNA interfaces with single-base RNA bulges can also be viewed as sequences with various mismatches in the guide sequence and PAM (Figure 15A-15B). Specifically, sgRNA bulges at the 5'-end of guide RNA sequences (e.g. U+20/19 for R-01 and R-30 interfaces) can be alternatively viewed as having one to a few base mismatches with the 3'-end of DNA sequences (Figure 15A-15B), which are often tolerated, similar to deletions of 1-2 bp at the 5' end of guide strands (Figure 12A-12B). sgRNA bulges close to the 3'-end of the guide sequence can be alternatively viewed as having base mismatches in the 3'-end region, including those at the third base of PAM (R-30 variants) (the last six variants in Figure 15B). Among all sgRNA variants with considerable activities (Figure 15A-15B), most could not be explained by tolerance of base mismatches, since they would contain more than five mismatches or a change in the third base of PAM, which was shown to abolish cleavage activity (Hsu, et al., Nat. Biotechnol, 31:827-832 (2013)).
Example 5: GC (guanine-cytosine) content of sgRNAs affects the tolerance of single-base sgRNA bulges
The specificity profile (location and level of off-target cleavage) of R-01 variants is substantially different from that of R-30 variants. R-30, which showed a higher level of tolerance to DNA and RNA bulges than R-01, has a GC content of 70%, whereas R-01 has a GC content of 50%. It was hypothesized that the GC content of guide strands R-01 and R-30 played a significant role in causing this difference. To investigate this hypothesis, two additional sets of guide strands targeted to HBB and CCR5 genes, respectively, were tested with different GC contents compared to R-01 and R-30 (Table 10).
Table 10: Target sites, cleavage activities (% indels by T7EI assay) and GC contents of different guide strands targeted to HBB and CCR5 genes.
Figure imgf000102_0001
*Cleavage activity of R-25 is from reference (Cradick, et al., Nucleic Acids Res., 41 :9584-9592 (2013)).
Specifically, R-08 has a moderately higher GC content compared to R-01 (65% compared to 50%), whereas the GC content of R-25 is half of that of R-30 (35% compared to 70%). Cas9 induced cleavage with sgRNA variants of R-08 and R-25 was individually tested to quantify the bulge tolerance in HEK 293T cells. For the guide strand R-25, which contains a low percentage of GC, all R-25 variants tested showed non-detectable activities using the T7EI assay (Table 5). In contrast, for R-08 variants with bulges throughout the guide sequence, cleavage activities were observed at more positions compared with R-01 (Figure 16B-16D). These results of bulge tolerance for variants of R-08 and R-25 support the GC dependence hypothesis.
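The GC content values compared above can be computed directly from a guide sequence. The following is a minimal sketch only; the function name and the example sequence are hypothetical and are not taken from Table 10.

```python
def gc_content(guide: str) -> float:
    """Return the percentage of G and C bases in a guide sequence."""
    seq = guide.upper()
    if not seq:
        raise ValueError("guide sequence is empty")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


# Hypothetical 20-nt sequence used only for illustration (not R-01 or R-30).
example_guide = "GACGTTACCGGATCCAGTCA"
print(f"GC content: {gc_content(example_guide):.0f}%")
```

A guide can be given as DNA or RNA, since only G and C are counted.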
Example 6: sgRNA variants containing 2- to 5-bp bulges induce Cas9 cleavage
In addition to single-base bulges between sgRNA and target sequence, it is important to determine if bulges longer than 1 bp can also be tolerated by the CRISPR/Cas9 systems. Consequently, the tolerance of 2- to 5-bp bulges was tested at locations where single-base bulges were well tolerated. For sgRNA bulges, two to five U's 15- or 12-bp upstream of PAM were added into the guide sequences of R-01 and R-30, respectively. To generate DNA bulges, two bases were deleted from the guide sequences of R-01 and R-30 (Figure 17A). Strikingly, sgRNA variants forming 2-, 3- and 4-bp RNA bulges induced cleavage activities as determined by the T7EI assay in HEK 293T cells (Figure 17B). Since sgRNA variants forming 2-bp DNA bulges did not show any detectable activity, longer DNA bulges were not tested. The findings that sgRNA bulges of >2-bp are better tolerated than DNA bulges of similar size are consistent with the higher cleavage activities by guide strands with 1-bp sgRNA bulges compared to those with 1-bp DNA bulges as shown in Figures 9A-9C, 10A-10C, 13A-13B, and 14A-14B.
Example 7: sgRNA variants containing single-base bulges can mediate cleavage by paired Cas9 nickases
Paired Cas9 nickases (Cas9n) were developed to generate DNA double-strand breaks by inducing two closely spaced single-strand nicks using an appropriately designed pair of guide RNAs (Mali, et al., Nat. Biotechnol, 31:833-838 (2013); Ran, et al., Cell, 154:1380-1389 (2013)). This strategy may lower the off-target cleavage, as double-strand breaks (DSBs) could occur only when both guide RNAs of the pair induced two nicks adjacent to each other at roughly the same time. Assays were designed to test if paired Cas9n systems can tolerate bulges by using one bulge-forming guide variant paired with a perfectly matched guide strand. Specifically, four variants of R-01 showing high activities with Cas9 were paired with R-02, including R1 U+14/13 and R1 C+12 to test sgRNA bulges and R1 -7/6 and R1 -2/1 to test DNA bulges. Each sgRNA pair created a 34-bp 5' overhang in the HBB gene (Figure 18A) (Cradick, et al., Nucleic Acids Res., 41:9584-9592 (2013)), and the Cas9n cleavage activities were determined by the T7EI assay. The results show that both sgRNA and DNA bulges were also well tolerated in the Cas9n system (Figure 18B). The paired Cas9 nickases with single sgRNA bulges showed activities comparable to the Cas9 system having one bulge in R-01; however, for DNA bulges, the activities of paired Cas9 nickases were >2-fold higher than those of Cas9.
Example 8: Cas9 cleavage occurs at genomic loci with both base mismatches and DNA or sgRNA bulges
Materials and Methods
Identification of off-target sites
Potential off-target sites in the human genome (hg19) were identified using TagScan (https://www.isrec.isb-sib.ch/tagger), a web tool providing genome searches for short sequences (Iseli, et al., PLoS One, 2:e579 (2007)). Guide sequences containing single-base insertions (represented with an 'N' in the sequence) and single-base deletions at different positions were entered, followed by the PAM sequence 'NGG'. Off-target sites were alternatively searched for using the recently developed bioinformatics program COSMID, which can identify potential off-target sites due to insertions and deletions between target DNA and guide RNA sequences (disclosed herein). Primers were individually designed to amplify the genomic loci identified in the output.
Quantitative PCR to measure the expression levels of different guide RNAs
HEK 293T cells were transfected with 750 ng sgRNA variants, as described above. Each sgRNA was transfected as biological triplicates in three separate wells and processed independently. Total RNA was isolated from cells using the RNeasy kit (Qiagen). Extracted RNA was reverse-transcribed using the iScript cDNA Synthesis Kit (Bio-Rad). The cDNA was amplified using the iTaq Universal SYBR Green Supermix (Bio-Rad) and analyzed with quantitative PCR using specific primers that annealed at 60°C (Tables 6-7). Quantitative PCR was performed in technical triplicates for each cDNA sample from a single transfected well. Relative mRNA expression was analyzed using an MX3005P (Agilent) and normalized to glyceraldehyde-3-phosphate dehydrogenase (GAPDH) expression. GAPDH expression remained relatively constant among treatments.
Relative mRNA expression of target genes was calculated with the ddCT method. All target genes were normalized to GAPDH in reactions performed in triplicate. Differences in CT values (ΔCT = CT of the gene of interest - CT of GAPDH in experimental samples) were calculated for each target mRNA by subtracting the mean value of GAPDH. ΔCT values were subsequently normalized to the reference sample (mock-transfected cells) to get ΔΔCT or ddCT (relative expression = 2^(-ΔΔCT)).
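The ddCT calculation described above can be sketched as follows. This is an illustration only: the function name, argument names and CT values are hypothetical, and the instrument software used in the study is not reproduced here.

```python
from statistics import mean


def relative_expression(ct_target_sample, ct_gapdh_sample,
                        ct_target_mock, ct_gapdh_mock):
    """Relative expression by the ddCT method.

    Each argument is a list of CT values from technical replicates.
    """
    # dCT: target CT minus the mean GAPDH CT, per condition.
    d_ct_sample = mean(ct_target_sample) - mean(ct_gapdh_sample)
    d_ct_mock = mean(ct_target_mock) - mean(ct_gapdh_mock)
    # ddCT: experimental dCT normalized to the mock-transfected reference.
    dd_ct = d_ct_sample - d_ct_mock
    return 2.0 ** (-dd_ct)


# Hypothetical CT values (technical triplicates), for illustration only.
fold = relative_expression([24.0, 24.1, 23.9], [18.0, 18.0, 18.0],
                           [25.0, 25.0, 25.0], [18.0, 18.0, 18.0])
print(f"relative expression: {fold:.2f}")
```

A ddCT of -1 corresponds to a two-fold higher expression than the reference sample.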
Deep sequencing to determine activities at genomic loci
Genomic DNAs from mock and nuclease-treated cells that were prepared for T7EI assays were used as templates for the first round of PCR using locus-specific primers that contained overhang adapter sequences to be used in the second PCR. Table 11 shows primers used in PCRs for deep sequencing by an Illumina MiSeq 2 x 250 paired-end read. These reactions were sequenced as in Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014). Primers for reaction 1 contain the adapter sequences shown (the same adapter sequences are also present in reaction-2 primers), in addition to gene-specific sequences. In the final pooled sample containing all the amplicons, each barcode has a similar occurrence to ensure the diversity required by Illumina sequencing. Custom sequencing primers for read 1 (forward), read 2 (reverse), and index read (read barcodes) are used in place of standard Illumina sequencing primers.
Table 11: Sequencing primers
Figure imgf000106_0001
PCR reactions for each locus were performed independently for eight touchdown cycles in which annealing temperature was lowered by 1°C each cycle from 65 to 57°C, followed by 35 cycles with annealing temperature at 57°C. PCR products were purified using Agencourt AmPure XP (Beckman Coulter) following manufacturer's protocol. The second PCR amplification was performed for each individual amplicon from first PCR using primers containing the adapter sequences from the first PCR, P5/P7 adapters and sample barcodes in the reverse primers (Table 11). PCR products were purified as in first PCR, pooled in an equimolar ratio, and subjected to 2 x 250 paired-end sequencing with an Illumina MiSeq.
Paired-end reads from MiSeq were filtered by an average Phred quality (Q score) greater than 20 and merged into a longer single read from each pair with a minimum overlap of 10 nucleotides. Alignments were performed using Burrows-Wheeler Aligner (BWA) for each barcode (Li, et al., Bioinformatics, 26:589-595 (2010)) and the percentage of insertions and deletions containing bases within a ±10-bp window of the predicted cut sites was quantified. Error bounds for indel percentages are Wilson score intervals calculated using the binom package for R statistical software (version 3.0.3) with a confidence level of 95% (32). To determine if each off-target indel percentage from a CRISPR-treated sample is significant compared to a mock-treated sample, a two-tailed P-value was calculated using Fisher's exact test.
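The two statistics named above can be illustrated with a short sketch. The study used the binom package for R; the Python functions below are a re-expression written for illustration only, with hypothetical names and read counts, and use the standard Wilson score formula and a direct hypergeometric sum for the two-tailed Fisher's exact test.

```python
import math


def wilson_interval(k, n, z=1.959963984540054):
    """Wilson score interval for k indel reads out of n reads (z for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed P-value for the 2x2 table [[a, b], [c, d]].

    Rows: treated / mock; columns: indel reads / non-indel reads.
    Sums the probabilities of all tables no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = math.comb(row1 + row2, col1)

    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_value = sum(p for p in map(prob, range(lo, hi + 1))
                  if p <= p_obs * (1 + 1e-7))
    return min(1.0, p_value)


# Hypothetical read counts, for illustration only.
low, high = wilson_interval(k=45, n=1000)
p_value = fisher_exact_two_tailed(45, 955, 5, 995)
print(f"indel rate 4.5%, 95% CI {low:.3%} to {high:.3%}, P = {p_value:.2e}")
```

The exhaustive sum is adequate for small examples; for read counts in the tens of thousands a statistics library would be the more practical choice.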
Results
To gain a better understanding of CRISPR/Cas9 off-target activity, 27 different sgRNAs targeting six different genes were examined (Table 4): seven targeting HBB, two EGFP, five CCR5, seven ERCC5, four TARDBP and two HPRT1. Off-target analyses of these sgRNAs were performed by searching the human genome for potential off-target sites. For the sgRNAs searched, no sites with single-base DNA or sgRNA bulges and no mismatches were found in the human genome. Therefore, for each sgRNA, a subset of the potential sites with one to three mismatches was selected, avoiding mismatches close to the PAM as much as possible. All of these sgRNAs efficiently induced mutations at their intended target loci in human HEK293T cells, as measured by the T7EI assay. Using the T7EI assay, 18 potential off-target sites containing target-site insertions and 62 containing deletions were investigated (Table 8). Two sgRNAs targeted to CCR5 and ERCC5, respectively, also induced cleavage at two off-target sites each bearing one DNA bulge and one mismatch (Figure 19A and 19B). For R-30, the identified off-target site R-30 Off-4 contains a single-base DNA bulge at position 5, 6 or 7 and a base mismatch at position 14. The off-target gene modification rate determined by T7EI is 9%, almost one third of the 30% on-target activity at the CCR5 gene (Figure 19A). For an R-31 off-target site with a single-base DNA bulge at position 2 and a mismatch at position 20, the off-target gene modification rate determined by T7EI was 3%, compared to 60% on-target activity at the ERCC5 gene (Figure 19B). Due to the high frequency of small indels (insertions and deletions) that result from repair of Cas9-induced cleavage, which may be poorly detected by the T7EI assay, the mutagenesis at these off-target sites was verified using Sanger sequencing (Figure 19C and 19D). For both off-target sites, the mutation frequencies quantified by Sanger sequencing are higher than those by T7EI, which is consistent with a previous study (Cradick, et al., Nucleic Acids Res., 41:9584-9592 (2013)). No off-target cleavage was observed for the 62 sites tested with both sgRNA bulge and base mismatch, although in the model systems with sgRNA bulges only, high cleavage activities were observed (Figure 13A-14B). This discrepancy indicates that sites forming sgRNA bulges may be less tolerant to additional base mismatches and vice versa.
Two genomic off-target sites for guide strand R-30, Off-4 and Off-5, have identical target sequences (Table 8), but were cleaved at different rates. Specifically, R-30 Off-4 had a cleavage rate of 9%, while the cleavage at Off-5 was undetectable with the T7EI assay (Figure 20). Sanger sequencing revealed a 45.5% mutation rate at the R-30 Off-4 locus (Figure 19C), compared to a 4.2% mutation rate at R-30 Off-5 (Figure 20). Since R-30 Off-4 and R-30 Off-5 sites have identical sequences, the results indicate that off-target cleavage of Cas9 nuclease is very dependent on genomic context (Cradick, et al., Nucleic Acids Res., 41:9584-9592 (2013)). Further investigation of these two sites using the ENCODE annotation from the UCSC genome browser (Rosenbloom, et al., Nucleic Acids Res., 41:D56-D63 (2013); Landt, et al., Genome Res., 22:1813-1831 (2012)) revealed that R-30 Off-4, which had high off-target activity, targeted a site within 400 bp of the 3' end of a long non-coding RNA (RP4-756H11.3) and 12 kb of the protein-coding gene RABGEF. Analysis of the ENCODE data for chromatin structure in normal human embryonic kidney (NHEK) cells, the cell type of origin for the HEK293 cells used in this study, shows Off-4 to be within 3 kb of a strong enhancer (marked by H3K27Ac and H3K4me1) and a strong DNase I hypersensitive site, indicative of an open chromatin structure. In contrast, R-30 Off-5, which had low activity, targeted a site in a 162-kb intergenic region between the WBSCR28 and ELN genes that is marked by the more heterochromatic H3K27me3, and hence may be less accessible for Cas9-induced cleavage (Figure 21A and 21B). Taken together, these data lead to the conclusion that differences in the local chromatin structure may underlie the observed differences in cleavage efficiency between Off-4 and Off-5.
Deep sequencing was performed at 55 putative off-target sites corresponding to single-base sgRNA bulges and 21 sites corresponding to single-base DNA bulges. The sites were amplified from genomic DNA harvested from HEK 293T cells transfected with Cas9 and sgRNAs. The 55 sites with sgRNA bulges contain 35 sites tested in the preliminary T7EI assay, and the 21 sites with DNA bulges include seven sites tested in the T7EI assay. Putative bulge-forming loci containing one to three PAM-distal mismatches were chosen, since sites associated with a bulge without any base mismatch were not found. Some of the bulge-forming sites with a high level of sequence similarity, but containing an alternative NAG-PAM were also selected. For comparison, the deep sequencing also investigated 16 on-target sites of the sgRNAs tested. Each locus was sequenced from mock-transfected cells as control.
An additional 13 bulge-forming off-target sites showed low but significant cleavage activities from CRISPR/Cas9 systems compared to the mock-transfected samples (Figure 19E). The number of genomic off-target cleavage sites associated with sgRNA bulges was relatively small (some of these cases are indistinguishable from a few mismatches at the 5' end), but there was considerable activity at genomic sites with DNA bulges coupled with one to three additional base mismatches, even with an alternative NAG-PAM (R30_ins_10 and R30_ins_14). Similar results showing more off-target effect with DNA bulges plus mismatches compared to sgRNA bulges plus mismatches were observed in the preliminary T7EI assay (Figures 19A and 19B). The positions of these tolerated DNA bulges are 1-3 and 7-10 bp from PAM, consistent with the results from the model systems using sgRNA variants. The majority of the sites with off-target activities detected, as shown in Figures 19A, 19B and 19E, are associated with the sgRNA R-30, which has a high GC content (70%). Other sgRNAs that resulted in off-target cleavage at bulge-forming loci have GC content >50%.
In summary, Examples 3-8 show that CRISPR/Cas9 systems can have off-target cleavage when DNA sequences have an extra base (DNA bulge) or a missing base (sgRNA bulge) at various locations compared with the corresponding RNA guide strand. sgRNA bulges of up to 4-bp could be tolerated by CRISPR/Cas9 systems (Figures 17A-17B). The correlation between cleavage activity and the position of the DNA bulge or sgRNA bulge relative to the PAM appears to be locus- and sequence-dependent when comparing the specificity profiles of guide sequences R-01 and R-30.
It is believed that the following design guidelines will help reduce potential off-target effects of CRISPR/Cas9 systems: (i) conservatively choose target sequences with relatively low GC contents (e.g. <35%), (ii) avoid target sequences (with either NGG- or NAG-PAM) with <3 mismatches that form DNA bulges at the 5' end, the 3' end or around 7-10 bp from PAM and (iii) if possible, avoid potential sgRNA bulges further than 12 bp from PAM.
Different specificity profiles of R-01 and R-30 guide sequences (and variants) are not due to different expression levels of the sgRNAs. Quantitative PCR of inactive R-01 variants and active R-30 variants indicated similar sgRNA expression levels (Figure 22). It is believed that high GC-content, which makes the RNA/DNA hybrids more stable (Sugimoto, et al., Biochemistry, 34: 11211-11216 (1995)), may be responsible for increased tolerance of DNA bulges and sgRNA bulges. Consistent with this belief, guide strand R-30 (70% GC) showed the highest tolerance to sgRNA and DNA bulges among the four guide strands tested (R-01, R-08, R-25 and R-30), while guide strand R-25 (35% GC) does not seem to tolerate any bulges. Guide sequences showing bulge-related off-target activity in Figures 19A-19E all have GC contents >50%, which further confirms that it is important to consider DNA-bulges for sgRNAs with high GC content, even with up to three base mismatches, when investigating off-target effects.
As shown in Figures 11A-11B and 12A-12B, bulges in the PAM-distal or PAM-proximal regions can reflect either mismatch tolerance or RNA/DNA bulge tolerance. In a bioinformatics search considering base mismatches only, some of the potential off-target sites identified may overlap with a search considering bulges. Although in both scenarios the mismatch and bulge-containing sites should be tested for off-target cleavage, a better understanding of the bulge tolerance as well as the difference in the mechanisms underlying these two scenarios is needed. One study revealed that a Cas9 ortholog from Streptococcus thermophilus has a PAM located 2 bp downstream of the protospacer (Chen, et al., J Biol. Chem., (2014), in press). Thus, the cleavage resulting from the variant R-01 -2/1 (Figures 9A-9B) may reflect the tolerance of a linker between the target sequence and PAM instead of a DNA bulge. On the other hand, Cas9 cleavage with RNA or DNA bulges in the middle of the target sequence may reflect only the bulge tolerance.
An interesting finding from this study is that sgRNA variants with bulges had different indel spectra than sgRNA without bulges (Figure 23A-23C and 24A-24C). Indel spectra for original sgRNAs R-01 and R-30, as well as sgRNA variants R1 -7/6, R1 C+12, R30 -11 and R30 U+12, were quantified using deep sequencing with around 10^4 reads for each sample. Bulge-forming sgRNA variants showed higher ratios of larger deletions (Δ10 or Δ7), whereas the original sgRNAs without bulges generated mostly 1-bp insertions. This effect is more prominent for variants forming sgRNA bulges (R1 C+12 and R30 U+12). Bulge-forming sgRNA variants may be more effective than regular sgRNAs in creating larger deletions that might be preferred in certain applications, such as targeted disruption of genomic elements. These larger deletions may also occur at off-target loci, which strengthens the need to include them in genomic searches.
Recently, paired Cas9 nickases have been shown to increase target specificity of CRISPR/Cas9 systems. However, only off-target activity associated with single guide RNAs was investigated (Mali, et al., Nat. Biotechnol, 31:833-838 (2013); Ran, et al., Cell, 154:1380-1389 (2013)), and the effect of cooperative nicking at potential off-target sites with sequence similarity to a pair of guide RNAs has not been characterized. Examples 3-8 show that Cas9n is able to cleave efficiently at target sites despite a single-base bulge in one of the paired guide RNAs. The results of this work provide some insight into off-target cleavage of the paired Cas9 nickases, as nicks on opposite DNA strands are likely to be independent events and the knowledge of bulge tolerance at the sgRNA-DNA interface would be applicable to off-target cleavage of Cas9 nickases.
Recent studies on the specificity of CRISPR/Cas9 systems revealed that a broad range of partial matches between sgRNA and DNA sequences could induce off-target cleavage (Fu, et al., Nat. Biotechnol, 31:822-826 (2013); Hsu, et al., Nat. Biotechnol, 31:827-832 (2013); Pattanayak, et al., Nat. Biotechnol, 31:839-843 (2013); Cradick, et al., Nucleic Acids Res., 41:9584-9592 (2013)), which may limit the choice of sgRNA designs. While the use of existing bioinformatic tools based on base mismatches is certainly useful for predicting the most likely potential off-target sites, it might miss some important sites, since there would be too many base mismatches if bulges were not allowed to form in the middle of a target sequence, so the potential off-target sites with bulges are not likely to be included in the output of these search tools. Therefore, based on these results, it is preferable to search partially matched sequences including base mismatches, deletions and insertions and their combinations in identifying off-target sites. Since there might be a large number of potential off-target sites due to the many partially matched sequences, and the effect of sgRNA-DNA sequence differences on off-target cleavage is target-site and genome-context dependent, experimentally determining the true off-target activities is preferred, including the use of deep sequencing.
Example 9: COSMID search algorithm and web interface
Materials and Methods
COSMID search inputs
To perform a COSMID search, the genome of interest, guide strand, PAM sequence, and the number of base mismatches, insertions, and deletions allowed are specified (Figure 25A, Figure 26A-26G, Table 12 below). Three types of indel query are allowed: (i) the number of mismatches with no insertion or deletion (No indels); (ii) the number of mismatches in addition to a single-base deletion (Del); and (iii) the number of mismatches in addition to a single-base insertion (Ins). Up to three mismatches without indels, and up to two mismatches together with a one-base insertion or deletion, can be chosen. If primers are desired, primer design parameter settings and parameter templates should also be entered (Figure 25A). PAM variants, such as NRG, can be entered in the suffix box, as well as other PAM sequences (Fischer, et al., J Biol Chem, 287:33351-33363 (2012)). The spacer (Ns) and required nucleotides are entered into the suffix box, such as "NNNNGATT" (Hou, et al., Proc Natl Acad Sci USA, 110:15644-15649 (2013)), and genomic sites with any nucleotide at the N positions are included in the output.
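How a suffix such as NGG, NRG or "NNNNGATT" can be compared against a genomic sequence may be sketched with the standard IUPAC nucleotide codes. This is an illustration only and does not describe COSMID's internal implementation; the function and table names are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases each one permits.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_suffix(site_suffix: str, pattern: str) -> bool:
    """Return True if a genomic suffix satisfies a PAM pattern such as NRG."""
    if len(site_suffix) != len(pattern):
        return False
    return all(base in IUPAC_CODES[code]
               for base, code in zip(site_suffix.upper(), pattern.upper()))


print(matches_suffix("TAG", "NRG"))            # NAG-type PAM satisfies NRG
print(matches_suffix("ACGTGATT", "NNNNGATT"))  # spacer Ns accept any base
```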
Before performing the search, COSMID constructs a series of search entries according to the user-specified guide strand and search criteria (Figure 25B). The search entries include all insertions and deletions at each possible location (Figure 25C), and are subsequently used to perform rapid and accurate searches of the entire sequence of the genome of interest, while allowing for the user-specified number of mismatches. These searches took ~4 seconds without primer design (Figure 26A-26G).
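The construction of search entries can be illustrated as follows. This is a minimal sketch under stated assumptions, not COSMID's code: the function name is hypothetical, insertions are represented by an 'N' placed between two guide bases, and positions are counted as nucleotides from the PAM. Identical queries produced by deleting either of two repeated bases are merged into a single entry that records both positions.

```python
def build_search_entries(guide: str, pam: str = "NGG") -> dict:
    """Return {query: (query_type, positions)} for a guide strand.

    Positions count nucleotides from the PAM (1 = PAM-adjacent base).
    """
    guide = guide.upper()
    n = len(guide)
    entries = {guide + pam: ("No indel", [])}

    # Single-base deletions: drop guide[i], which sits n - i nt from the PAM.
    for i in range(n):
        query = guide[:i] + guide[i + 1:] + pam
        entries.setdefault(query, ("Del", []))[1].append(n - i)

    # Single-base insertions: an 'N' placed between two guide bases, with
    # n - i guide bases remaining between the 'N' and the PAM.
    for i in range(1, n):
        query = guide[:i] + "N" + guide[i:] + pam
        entries.setdefault(query, ("Ins", []))[1].append(n - i)

    return entries


# Short hypothetical guide, used only to keep the output readable.
for query, (kind, positions) in build_search_entries("ACGGT").items():
    print(query, kind, positions)
```

In this example the two deletions within the repeated "GG" yield the same query, so the entry lists both candidate positions.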
Although multi-base deletions (RNA bulges) and insertions (DNA bulges) could be tolerated (Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014)), they are less common, and a search for a wide range of insertions and deletions will likely result in a very large number of returned sites. Therefore, COSMID only allows searches for single-base insertions and deletions in the DNA sequence compared with the guide strand (Figure 25A). For the potential off-target sites, the search algorithm allows some ambiguities (such as N for any nucleotide). Ambiguities included in the search string are marked in red in the HTML results (as are mismatches and indels), but are not counted toward the user-specified mismatch limits. The use of ambiguities allows the inclusion of the matching genomic base with the output sequences. One possibility is to include an "N" in positions that can have substitutions, such as the first base in a guide strand that is often a G primarily to aid in transcription, but does not need to match the complementary target sequence (Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013); Mali, et al., Science, 339:823-826 (2013)). One can leave off this base when performing a search, or include a 5' N in the search string, which allows COSMID to output, aligned to the "N," the corresponding 5' base at each locus.
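The treatment of ambiguities can be sketched as a mismatch count that skips every 'N' in the query while still reporting the genomic base aligned to it. The function name and sequences below are hypothetical and serve only to illustrate the rule stated above.

```python
def compare_to_hit(query: str, hit: str):
    """Count mismatches between an aligned query and genomic hit.

    'N' positions in the query never count as mismatches; the genomic
    bases aligned to them are returned so they can be reported.
    """
    if len(query) != len(hit):
        raise ValueError("query and hit must be aligned to the same length")
    mismatches = 0
    bases_at_n = []
    for q, h in zip(query.upper(), hit.upper()):
        if q == "N":
            bases_at_n.append(h)
        elif q != h:
            mismatches += 1
    return mismatches, bases_at_n


# Hypothetical query with a 5' N followed by an NGG PAM.
print(compare_to_hit("NACGTNGG", "TACCTAGG"))
```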
COSMID search outputs
COSMID outputs all genomic sequences that match the user-supplied search criteria in comparison with the entered guide strand. The first column of the HTML output shows the genomic sequence ("hit") aligned to the query sequence with matches shown in black. Nucleotides that are not a direct match, including mismatches, insertions, and deletions, are shown in red (Table 12). Ambiguities in the query sequence, such as the N in the PAM sequence NGG, are also shown in red, though they do not count as mismatches. The second column lists the query type, including (i) no deletion or insertion (No indel), (ii) deletions (Del), or (iii) insertions (Ins). This column indicates if there are insertions or deletions, and specifies the indel positions as the number of nucleotides away from the PAM. The third column lists the number of mismatched bases between the query and target sequences. When two repeated bases appear in the guide strand, a deletion of either one of them in the target sequence gives the same query sequence, so the ambiguity is noted in the query column. The fourth column indicates if the PAM in the hit ends in G, as NGG is the Cas9 PAM with the highest activity, followed by NAG (Hsu, et al., Nat Biotechnol, 31:827-832 (2013)). This column helps in ruling out genomic sites with unlikely PAMs. This function must be added to the Excel spreadsheet for other PAMs. The fifth, sixth, and seventh columns contain, respectively, the chromosomal location of the matching sequence, its strand and the chromosomal location of the cleavage site. The predicted cleavage position is based on the fact that Cas9 primarily cleaves both DNA strands three nucleotides from the PAM (Jinek, et al., Science, 337:816-821 (2012)). The HTML links included in the COSMID output are directed to the chromosomal sites in the UCSC genome browser. This allows determination of the gene that best matches the target sequence and whether the target site is in an exon, intron, or other region. This information is helpful as mutations may be better tolerated in regions that are noncoding and nonfunctional.
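The predicted cleavage position can be derived from the location of a hit as sketched below. The coordinate convention is an assumption made for this illustration (0-based positions on the plus strand, with the cut reported as the index of the first base to the right of the break); COSMID's own convention is not stated in the text, and the function name is hypothetical.

```python
def predicted_cut_site(start: int, guide_len: int, pam_len: int, strand: str) -> int:
    """Predicted Cas9 cut, three nucleotides from the PAM.

    `start` is the lowest plus-strand coordinate covered by the hit
    (protospacer plus PAM). On the '+' strand the PAM lies at the high end
    of the hit; on the '-' strand it lies at the low end.
    """
    if strand == "+":
        pam_start = start + guide_len
        return pam_start - 3
    if strand == "-":
        pam_end = start + pam_len  # first protospacer base on the plus strand
        return pam_end + 3
    raise ValueError("strand must be '+' or '-'")


# Hypothetical hit: 20-nt protospacer with a 3-nt PAM starting at 100.
print(predicted_cut_site(100, 20, 3, "+"))
print(predicted_cut_site(100, 20, 3, "-"))
```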
The output is grouped by query types, including (i) genomic sites with base mismatches, but no insertions or deletions (No indels), (ii) sites with deletions (Del), and (iii) sites with insertions (Ins) between the query and potential off-target sites (Table 12). Within each category, sites with mismatches further from the PAM, which are more likely to result in off-target cleavage, are listed first (Fu, et al., Nat Biotechnol, 31:822-826 (2013); Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013)). The same genomic location may satisfy two or more search criteria, such as those sites that satisfy the mismatched base limit without and with an insertion or deletion. For example, mismatches at the base farthest from the PAM and deletions of this base will give the same set of genomic locations. This can also occur when the guide strand contains consecutively repeated bases. Since genomic locations can be specified through multiple criteria (examples shown in Figures 28A and 28B), they are listed in each of the corresponding groupings to aid further evaluation and scoring. Duplicate sites can be removed in the spreadsheet, as described below.
COSMID also outputs the potential off-target sites identified in a spreadsheet to allow for further processing, such as sorting by attributes or adding weight matrixes to rank the most likely off-target sites. The accumulation of additional experiments on CRISPR off-target activity will allow creation of a more predictive scoring system. It is thought that mutations in the PAM are least well tolerated, followed by sites closest to the PAM; however, little is known about how the guide strand sequence influences these effects (Jinek, et al., Elife 2:e00471 (2013); Fu, et al., Nat Biotechnol, 31:822-826 (2013); Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013)). The spreadsheet can also be used to indicate duplicate genomic sequences found using different search criteria, as mentioned above. The output list of off-target sites allows a user to compare the number and score of off-target sites for the input target sites.
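The kind of post-processing described for the spreadsheet, such as flagging duplicate sites and ordering sites by how far their mismatches lie from the PAM, could be sketched as follows. The row layout, field names and coordinates are hypothetical and do not reproduce the actual COSMID spreadsheet columns; identifying duplicates by chromosome, strand and cut site is likewise an assumption.

```python
from collections import defaultdict


def merge_duplicates(rows):
    """Group rows that report the same genomic site under several query types."""
    merged = defaultdict(list)
    for row in rows:
        key = (row["chrom"], row["strand"], row["cut_site"])
        merged[key].append(row["query_type"])
    return dict(merged)


def pam_distal_first(rows):
    """Order rows so that sites whose closest mismatch is farthest from the PAM come first."""
    def key(row):
        positions = row["mismatch_positions"]  # nucleotides from the PAM
        closest = min(positions) if positions else float("inf")
        return -closest
    return sorted(rows, key=key)


# Hypothetical rows, for illustration only.
rows = [
    {"chrom": "chr1", "strand": "+", "cut_site": 1000,
     "query_type": "No indel", "mismatch_positions": [20, 3]},
    {"chrom": "chr1", "strand": "+", "cut_site": 1000,
     "query_type": "Del", "mismatch_positions": [3]},
    {"chrom": "chr2", "strand": "-", "cut_site": 5000,
     "query_type": "Ins", "mismatch_positions": [18]},
]
print(merge_duplicates(rows))
print([row["chrom"] for row in pam_distal_first(rows)])
```

A weight matrix, once experimentally supported, could replace the simple ordering key used here.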
COSMID Primer design
COSMID's primer design function is used to assay for off-target cleavage after cells or animals are treated with CRISPR guide strands and nuclease. Primers are designed that fit the criteria needed for the particular assay or sequencing platform using an automated primer pair design process not found in other CRISPR programs. The algorithm was developed for the zinc finger nuclease and TAL effector nuclease off-target search program PROGNOS and found to give a single specific band in ~93% of amplifications (Fine, et al., Nucleic Acids Res, 42:e42 (2013)). The automated primer design alleviates the need for the iterative steps of primer design and verification of the resulting fragment sizes that slow primer design, especially for mutation detection assays where the cleavage product sizes determine how easily the cleavage bands can be distinguished on gels. The recommended parameters for use in Surveyor assays resolved on 2% agarose gels are: Minimum Distance Between Cleavage Bands, 100 bp; Minimum Separation Between Uncleaved and Cleaved Products, 150 bp. Users can also input the number of bases the cleavage site must be from each amplicon's edge to ensure sequencing coverage depending on the different sequencing platforms. For single molecule, real-time (SMRT) sequencing, the recommended parameters are: Minimum Distance Between Cleavage Bands, 0; Minimum Separation Between Uncleaved and Cleaved Products, 125 bp. The output primers can be easily modified in the spreadsheet, such as to add flanking sequences for additional amplification and/or barcodes for sequencing.
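The two gel-resolution parameters above can be read as simple arithmetic on fragment lengths. The following Python sketch is one possible interpretation, not the PROGNOS or COSMID implementation: it assumes that cleavage at the cut site splits the amplicon into two fragments, that "Minimum Distance Between Cleavage Bands" is the size difference between those two fragments, and that "Minimum Separation Between Uncleaved and Cleaved Products" is the size difference between the full amplicon and the larger fragment. The function and parameter names are hypothetical.

```python
def bands_resolvable(amplicon_len, cut_offset,
                     min_band_gap=100, min_uncleaved_gap=150):
    """Check whether an amplicon would give distinguishable bands on a gel.

    amplicon_len: length of the PCR product in bp.
    cut_offset:   distance in bp from the amplicon's left edge to the cut site.
    Defaults follow the Surveyor / 2% agarose values quoted in the text;
    the interpretation of the two parameters is an assumption.
    """
    left = cut_offset
    right = amplicon_len - cut_offset
    if left <= 0 or right <= 0:
        return False  # cut site lies outside the amplicon
    band_gap = abs(left - right)                    # cleaved vs cleaved
    uncleaved_gap = amplicon_len - max(left, right)  # uncleaved vs larger band
    return band_gap >= min_band_gap and uncleaved_gap >= min_uncleaved_gap
```

Under this reading, a 500 bp amplicon cut 180 bp from one edge passes the default thresholds, whereas a 400 bp amplicon cut in the middle passes only with the SMRT settings (`min_band_gap=0`, `min_uncleaved_gap=125`).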
Results
The COSMID algorithm is based on sequence homology; it searches a genome of interest for sites similar to CRISPR guide strands using the efficient FetchGWI search program that has powered search tools including TagScan and ZFN-site (Cradick, et al., BMC Bioinformatics, 12:152 (2011)). FetchGWI operates on indexed genome sequences that are precompiled and stored (Figures 26A-26G). It can identify genomic locations with sequences that match any of the series of search entries. FetchGWI saves run time by searching indexed files that represent the genome sequences, rather than the sequences themselves. There is one index entry for each nucleotide in the genome, which allows a rapid and exhaustive search. This is a key advantage of COSMID over BLAST and other programs that scan nonoverlapping words and may miss potential off-target sites (Cradick, et al., BMC Bioinformatics, 12:152 (2011)). COSMID currently allows searching the human, mouse, Caenorhabditis elegans, and rhesus macaque genomes.
COSMID is a CRISPR off-target search tool with a web interface that allows directed and exhaustive genomic searches to identify potential off-target sites for guide strand choice or experimental validation. To perform a search, a user chooses the genome of interest from the list, and enters the guide strand and PAM sequences (Figure 25A). By clicking the appropriate selection buttons, a user can choose to include (i) ≤2 base mismatches with an insertion and/or deletion, or (ii) ≤3 base mismatches without any indels (Figure 25A). The user has the option to have primers as part of the output. Primers are designed by COSMID that are optimized to the specified criteria or to the defaults given for particular applications (Figure 25A). COSMID exhaustively scans the genome based on these input parameters (Figure 25B), allowing consideration of mismatches, insertions, and/or deletions (Figure 25C, Figures 26A-26G).
COSMID outputs a ranked list of perfectly matched (on-target site and possibly other sites) and partially matched (potential off-target) sites in the genome, their ranking score, along with reference sequences and primer designs that can be used for sequencing and/or mutation detection assays (Table 12). Each line of the output file describes one genomic locus matching the search criteria. A locus may appear on multiple lines if it can be modeled and found in multiple ways.
An exemplary COSMID Output includes the following text, a hyperlink for viewing the raw search results in a txt file and Table 12.
Table 12: Exemplary COSMID Output - Search Results
Table 12 shows an exemplary COSMID output in HTML and includes the genomic sites matching the user-supplied criteria in comparison to guide strand R-01 with chromosomal location. Scoring of the mismatches is provided for ranking, as are PCR primers and reference sequence. The right primers, in silico link, amplicon, and digest sizes are provided in the output, but not shown here. Links are provided to each location in the UCSC genome browser, and to the output file as a spreadsheet for further manipulation and primer ordering.
Each hit is appropriately aligned to the query shown in the "Result" box (Table 12). DNA bases corresponding to mismatches, indels, and ambiguity codes, such as N, are shown in the query line to identify the matching genomic bases. To the right of the "Result" box are boxes with the query type, number of mismatches, chromosomal position, score, primers, and other features. The web page showing COSMID output also includes links to test each primer pair and to reformat the output file as text or in a spreadsheet. The spreadsheet output allows thorough evaluation of the number and scores of the low-scoring sites that are predicted to be more likely off-target sites, which may provide important guidelines when evaluating and choosing guide strands and/or testing for true cleavage events using DNA samples from cells after CRISPR/Cas treatment.
COSMID uses the TagScan algorithm to minimize run times while still performing exhaustive genome searches (Iseli, et al., PLoS One, 2:e579 (2007)). With the primer design option off, the run times averaged 4 seconds for the guide strands without indels (Table 13).
Table 13: Run Times
Run times were measured for COSMID using variations of guide strands R-01 and R-30, with and without a 5'G, using standard (NGG) or relaxed PAM (NRG). All runs included sites matching the guide strand with three or fewer mismatches without indels. More matching loci ("hits") were identified by allowing single-base insertions or deletions together with ≤2 base mismatches.
Allowing insertions or deletions in addition to mismatches increases run time. For example, searching with a 19-nt guide strand and an NRG PAM, and including two mismatches with either an insertion or a deletion, resulted in run times averaging 42 seconds for R-01 and 36 seconds for R-30. The run times for the search with three mismatches without insertions or deletions were similar. Including primer design increased the run times in proportion to the number of primer sets and reference sequences returned.
Figures 26A-26G and Table 14 illustrate exemplary search string processing by COSMID, including examples showing the input and portions of the web results and spreadsheet output for a search of the human genome using guide strand R-01.
The genome of interest is chosen from the Target Genome list (Figure 26A). The target sequence is entered into the Query Sequence box (Figure 26B). The required protospacer adjacent motif (PAM) is entered into the 'Add suffix' box of the Search Options section (Figure 26C). The spacers (Ns) and required bases are included, such as NGG or NRG. The boxes in the 'Allowed indels and mismatch' part of the Search Options section are checked to indicate whether the genomic sites to be searched include sites that have No indels (with ≤3 mismatches but the same length), have a 1-base Del (are 1 base shorter), or have a 1-base Ins (are 1 base longer) (Figure 26C).
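The PAM suffixes mentioned above use IUPAC ambiguity codes, in which N stands for any base and R for A or G, so NRG covers both NGG and NAG. As a minimal Python sketch of how such a suffix could be tested against genomic bases (the function name and the regular-expression approach are illustrative assumptions, not the method used by COSMID or FetchGWI):

```python
import re

# Subset of the IUPAC nucleotide codes sufficient for NGG / NRG style PAMs.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "N": "[ACGT]",  # any base
}

def pam_matches(pam_pattern, bases):
    """Return True if the genomic bases satisfy the PAM pattern."""
    regex = "".join(IUPAC[code] for code in pam_pattern.upper())
    return re.fullmatch(regex, bases.upper()) is not None
```

For example, `pam_matches("NRG", "TAG")` is true while `pam_matches("NGG", "TAG")` is false, which mirrors the difference between the relaxed and standard PAM searches.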
The boxes in the PCR Primer Design Options section are chosen, which allow COSMID to design primers matching the specific application. Primer design parameters are set by pressing the button for 'Default', 'Illumina 250', 'Illumina 250 paired', 'SMRT' or 'enzyme' (when using other enzymes). Any of the parameters can be entered by hand to further customize.
For each genome included in COSMID, the genwin program was used to transform the DNA sequence from FASTA formatted files into unsorted index entries, which contain all possible 25-base-long tags in the DNA sequence. After that, the sortGWI program was used to sort the index entries, and store the result as a binary index file. sortGWI subdivided the whole index file into 16,777,216 parts (4^12), each representing entries having identical first 12 nucleotides. A secondary index, recording the position in the main index file where each part starts, was added to the end of the index file to enable faster search and reduce file size. The index files are stored on the COSMID server.
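The indexing scheme described above (one tag per genomic position, sorted, then partitioned by a fixed-length prefix) can be sketched in a few lines of Python. This is an in-memory illustration only and makes several assumptions: it is not the genwin or sortGWI code, it handles a single strand of a single sequence, it skips tags containing non-ACGT characters, and it stores plain tuples rather than a binary file. The function name `build_index` is hypothetical.

```python
def build_index(sequence, tag_len=25, prefix_len=12):
    """Build a sorted tag index plus a secondary prefix index.

    Returns (entries, bucket_start) where entries is a sorted list of
    (tag, position) and bucket_start maps each tag prefix to the row in
    entries where tags with that prefix begin.
    """
    sequence = sequence.upper()
    entries = sorted(
        (sequence[i:i + tag_len], i)
        for i in range(len(sequence) - tag_len + 1)
        if set(sequence[i:i + tag_len]) <= set("ACGT")
    )
    bucket_start = {}
    for row, (tag, _position) in enumerate(entries):
        # Record only the first row of each prefix bucket.
        bucket_start.setdefault(tag[:prefix_len], row)
    return entries, bucket_start
```

With a 12-nucleotide prefix over a four-letter alphabet there are 4**12 = 16,777,216 possible buckets, matching the number of parts stated in the text. A lookup can jump directly to the bucket for a query's first 12 bases and search only within it.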
When the submit button is clicked, the sequence tags in COSMID are used to generate a series of additional tags that contain indels if the insertion or deletion boxes are checked. Identical tags, which arise when a string contains consecutive identical bases, are removed. The resulting tags are all searched against the user-selected genome. For example, if guide strand R-01 is entered, the tags illustrated in Figures 26E and 26F are generated and used to search the human genome.
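The generation of indel-containing tags and the removal of duplicates can be sketched as follows. This is an illustrative Python example under stated assumptions, not the COSMID implementation: deletions remove one base of the guide at each position, insertions add each concrete base (A, C, G, or T) at each position including both ends, and a Python set is used to discard identical tags. How COSMID actually encodes inserted bases is not specified in the text.

```python
def deletion_tags(guide):
    """All distinct tags obtained by deleting exactly one base."""
    return {guide[:i] + guide[i + 1:] for i in range(len(guide))}

def insertion_tags(guide, alphabet="ACGT"):
    """All distinct tags obtained by inserting exactly one base.

    Assumption: insertions are tried at every position, including the ends.
    """
    return {
        guide[:i] + base + guide[i:]
        for i in range(len(guide) + 1)
        for base in alphabet
    }
```

Deleting either G from "GGA" yields the same tag "GA", so `deletion_tags("GGA")` contains two tags rather than three. This is the duplication from consecutive identical bases that the text describes.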
To search the query sequences against the user-selected genome, the FetchGWI program is used. If the user specifies a search with one or more mismatches, FetchGWI generates all possible sequence tags by replacing the specified number of nucleotides with all other possibilities. After that, FetchGWI sorts all the query tags and searches for matches in the index file using binary search. FetchGWI reports the search results by appending the actual sequence tag found, along with the accession number and position offset within the sequence, for each matched query tag.
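The two steps described above, enumerating mismatch variants and locating them by binary search in a sorted index, can be illustrated with the Python sketch below. It is a simplified stand-in for FetchGWI, whose internals are not given in the text: the index is assumed to be a sorted list of hypothetical `(tag, position)` tuples, and the standard-library `bisect` module provides the binary search.

```python
from bisect import bisect_left
from itertools import combinations, product

def mismatch_variants(tag, n_mismatches):
    """All tags differing from `tag` at exactly n_mismatches positions."""
    variants = set()
    for positions in combinations(range(len(tag)), n_mismatches):
        # For each chosen position, every base except the original one.
        choices = [[b for b in "ACGT" if b != tag[p]] for p in positions]
        for substitution in product(*choices):
            chars = list(tag)
            for p, base in zip(positions, substitution):
                chars[p] = base
            variants.add("".join(chars))
    return variants

def find_tags(sorted_index, queries):
    """Binary-search each query tag in a sorted list of (tag, position)."""
    hits = []
    for query in sorted(queries):
        # (query,) sorts before every (query, position) entry.
        row = bisect_left(sorted_index, (query,))
        while row < len(sorted_index) and sorted_index[row][0] == query:
            hits.append((query, sorted_index[row][1]))
            row += 1
    return hits
```

A tag of length L has C(L, k) * 3^k variants with exactly k mismatches, so a 3-base tag has 9 single-mismatch variants. Searching "up to" k mismatches would take the union over 0 through k.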
For each match that FetchGWl finds, COSMID generates a score that reflects the empirical expectation of how likely it is an off-target site.
COSMID web output includes links for html, txt and excel files (Figure 26G). Links are provided to test each primer pair using the UCSC in-silico PCR web site. The excel output is sorted for unique sites with the lowest mismatch and indel score to locate the most likely off-target sites. Here the Score+ column contains a ranking to place NGG ahead of NAG sites (+0.3 points added to the COSMID default scoring). The second column represents the query type, then the chromosomal location, the ranked number and a grid showing the mismatches, insertions and deletions (Table 14). Different sections of the output are illustrated in Table 14.
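The Score+ adjustment described above can be expressed as a small Python sketch. The 0.3 offset for NAG sites and the convention that lower scores indicate more likely off-target sites come from the text; the dictionary keys `score` and `pam` and the function names are hypothetical, and the underlying COSMID default score is treated as a given input because its formula is not stated here.

```python
def score_plus(score, pam, nag_penalty=0.3):
    """Add the NAG offset so NGG sites rank ahead of NAG sites."""
    return score + nag_penalty if pam.upper().endswith("AG") else score

def rank_sites(sites):
    """Sort sites by Score+, lowest (most likely off-target) first."""
    return sorted(sites, key=lambda s: score_plus(s["score"], s["pam"]))
```

With this adjustment a site with default score 0.5 and a TAG PAM (Score+ about 0.8) is listed after a site with default score 0.7 and a TGG PAM.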
Table 14: Exemplary COSMID excel output
Example 10: COSMID searches and identifies putative off-target cleavage sites
Materials and Methods
CRISPR transfection and mutation detection assays
The on- and off-target cleavage activity of Cas9 and guide strand R-01 was measured using the mutation rates resulting from the imperfect repair of double-stranded breaks by non-homologous end joining. An Amaxa Nucleofector 4D was used to transfect 200,000 K-562 cells with 1 μg px330 expressing R-01 sgRNA, following manufacturer's instructions. The genomic DNA was harvested after 3 days using QuickExtract DNA extraction solution (Epicentre, Madison, WI), as described (Guschin, et al., Methods Mol Biol, 649:247-256 (2010)). On- and off-target loci were amplified using AccuPrime Taq DNA Polymerase High Fidelity (Life Technologies, Carlsbad, CA) following manufacturer's instructions for 40 cycles (94 °C, 30 seconds; 52-60 °C, 30 seconds; 68 °C, 60 seconds) in 50 μl reactions containing 1 μl of the cell lysate, and 1 μl of each 10 μmol/l amplification primer. The T7EI mutation detection assays were performed as per the manufacturer's protocol (Reyon, et al., Nat Biotechnol, 30:460-465 (2012)), with the digestions separated on 2% agarose gels (Figure 2a) and quantified using ImageJ (Figure 2b) (Guschin, et al., Methods Mol Biol, 649:247-256 (2010)). This guide strand was shown to have on-target cleavage at beta-globin and off-target cleavage at delta-globin, so a range of off-target sites were chosen, including two pairs of identical sites (OT6-OT7 and OT8-OT9) and five identical sites (OT1-OT5) to test for off-target mutations and evaluate the role of genomic context on cleavage and mutation rates. It is hoped that increased cellular data, such as provided in ENCODE for some cell lines, may prove useful in this regard.
Table 15: Genomic sequences and chromosomal positions of the off-target sites tested using the mutation detection assay in Figure 27.
The nucleotides in position 20 and in the first position of the NGG PAM are lowercase, as there are not mismatches at these positions.
Results
To validate COSMID predictions, mutation detection assays were performed to determine if off-target cleavage occurred at putative off-target sites identified by COSMID. A search for the guide strand R-01 (GTGAACGTGGATGAAGTTGG), which targets the human beta-globin gene (Cradick, et al., Nucleic Acids Res, 41 :9584-9592 (2013)), gave 1,040 potential off-target sites in the human genome when allowing for up to three mismatches without any indels, and up to two mismatches with a one-base deletion or one-base insertion, adjacent to a NRG PAM (Figure 25 A).
Using primers provided as part of the COSMID output, mutation detection assays were performed based on PCR amplification of the genomic loci (Guschin, et al., Methods Mol Biol, 649:247-256 (2010)) after transfecting K-562 cells with a plasmid expressing Cas9 and guide strand R-01. A range of potential off-target sites without indels were studied in order to compare COSMID with other available bioinformatics tools. Of the 10 off-target sites tested, 8 sites, all with two mismatches, had off-target mutagenesis that could be detected by the T7EI mutation detection assay (Figure 27, Table 15), including an off-target site with higher activity than the on-target cleavage rate (44% versus 35%, Table 16, below). Similar to previous results, the level of off-target activity was generally diminished at sites with mismatches closer to the PAM (Gasiunas, et al., Proc Natl Acad Sci USA, 109:E2579-E2586 (2012); Jinek, et al., Elife 2:e00471 (2013); Jiang, et al., Nat Biotechnol, 31:233-239 (2013); Fu, et al., Nat Biotechnol, 31:822-826 (2013); Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013)).
Five different genomic sites with identical sequences, containing two mismatches respectively 14 and 19 bases from the PAM, had cleavage activities ranging from below the detection limit to 44%. The 10 sites chosen also contained two pairs of duplicated sites that had different mutation rates (13% versus 3%, and 7% versus below detection). The large variation in mutation rates at identical sequences, but different genomic regions may be due to the difference in gRNA/Cas9 accessibility and/or binding affinity at different genomic loci. This exemplifies the role genomic context can play in Cas9-induced cleavage and the difficulty in ranking off-target sites solely based on target sequences. See also, Figure 2C which compares the mutation rates at two identical sequences.
Table 16 lists these eight experimentally validated off-target sites in decreasing order of mutation rate (%), their ranking by COSMID, as well as that by other on-line CRISPR tools.
Table 16: Comparison of COSMID with other available tools in predicting off-target sites with two mismatches for guide strand R-01.
The cleavage rates at the R-01 on-target site and off-target sites OT1-OT10 are listed by decreasing T7EI activity in Table 16. OT3 and OT9 had activities below the T7EI detection limit. Annotated genes corresponding to the sites are listed. Off-target analysis was performed with different online search tools. If the genomic sites with measurable T7EI activity (Figure 27) were identified by a specific tool (such as Cas OFFinder), their rankings in its output (if sortable) are shown. Sites not in the output of that tool are indicated by a dash in a grey box (e.g., R01 OT1 under "Cas OFFinder").
The output from COSMID was also compared with the output from other web tools for their ability to identify off-target sites that contain an extra base (DNA bulge) or a missed base (RNA bulge) relative to the complementary genomic DNA sequence (Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014)) (Table 17). The off-target sites in Table 17 might also be modeled as sites with four mismatches or noncanonical PAMs compared with the on-target site, though it is less likely that binding of Cas9 would occur without an NGG or NAG PAM. The columns corresponding to the individual tools follow from Table 16, above. When an extra base is present in the genomic sequence, next to one or more of the same nucleotide, the DNA bulge may occur in multiple locations, such as in the off-target site R30_Ins9 where the additional G in the genomic sequence might be the first, second, or third of the three adjacent Gs, at locations 2, 3, or 4 nucleotides from the PAM (Table 18).
Table 17: Comparison of search results for sequence-verified off-target sites with insertions or deletions that can also be modeled as loci with four mismatches or an alternate PAM.
Genomic sequences of the off-target sites are given, together with the number of mismatches, bulge type (guide bulge or gDNA bulge) and bulge position relative to PAM. *gDNA mismatches compared to guide strand are shown by alignment; insertions are underlined, and deletions (guide bulge) are represented as dashes. The first nucleotide in PAM is in lower case.
In addition to being modeled as having one insertion with two mismatches, this off-target site can be modeled as having three mismatches with a shift in the PAM from NGG to NAG. Further, the off-target site R-01_Ins1 may be modeled as having an NAG PAM. Without a bulge, R30_Ins14 would need to have the unlikely GTA PAM, so it remains unclear how it was modeled by Cas Online Designer. Each site in Tables 17 and 19 is marked "yes" when found by COSMID (first column) or another search method; if any of the confirmed off-target sites could not be identified by a search tool, it is shown as a box with a dash. Specifically, of the six off-target sites identified by COSMID (and previously sequence confirmed) (Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014)), Cas Online Designer, ZiFit, and CRISPR Tools each found only two, and Cas OFFinder found only one. Table 19 lists the sequence-confirmed off-target sites containing DNA or RNA bulges that could not be represented by other means, with COSMID in the first column and the remaining columns the same as in Table 16. Each of these sequence-verified off-target sites was identified by COSMID, but they were not output by the other search tools, as those tools fail to locate sites with insertions or deletions.
Table 19: The sequence-verified off-target sites with insertions or deletions that cannot be modeled as four mismatches or alternate PAM can only be predicted by COSMID.
COSMID has better ability in identifying off-target sites with indels.
Although a number of bioinformatics programs can be used for CRISPR designs, COSMID provides exhaustive genomic searches for off-target sites due to mismatches, deletions, and insertions, as well as providing primers for experimental validation of predicted off-target sites. The results shown in Tables 16, 17, and 19 give examples of validated off-target sites identified by COSMID, but not found by other search tools, including Cas Online Designer (Hsu, et al., Nat Biotechnol, 31:827-832 (2013)), ZiFit (Sander, et al., Nucleic Acids Res, 38 (suppl.):W462-468 (2010)), CRISPR Tools (Hsu, et al., Nat Biotechnol, 31:827-832 (2013)), and Cas OFFinder (Bae, et al., Bioinformatics, 30:1473-1475 (2014)), which have different functions, such as determining CRISPR guide sequences (Grissa, et al., Nucleic Acids Res, 35:W52-W57 (2007); Grissa, et al., BMC Bioinformatics, 8:172 (2007); Rousseau, et al., Bioinformatics, 25:3317-3318 (2009); Bland, et al., BMC Bioinformatics, 8:209 (2007)), scanning a genome for possible target sites, and comparing the potential off-target sites (Hsu, et al., Nat Biotechnol, 31:827-832 (2013); Montague, et al., Nucleic Acids Res, 42:W401-W407 (2014); Ronda, et al., Biotechnol Bioeng, 111:1604-1616 (2014)).
In addition to providing optimized primer designs for sequencing and mutation detection for confirming putative off-target sites, COSMID also provides the reference sequence to facilitate sequencing. The reference sequence and knowledge of the cut site location facilitates mutation detection assays, including surveyor and T7EI, and possibly other uses, such as searching for restriction sites that may overlap the cut site.
To illustrate the ability of COSMID and the importance of locating indels, search results were compared for two guide strands with validated activity and known off-target cleavage: the guide strand R-01 that targets the human HBB gene, and the guide strand R-30 (GTAGAGCGGAGGCAGGAGC) that targets the human HIV co-receptor CCR5 gene (Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013); Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014)). The results of COSMID searches were compared with the output given by other existing search tools. When off-target sites contain insertions or deletions in addition to mismatches, only COSMID searches could identify all of the 10 sequence-validated off-target sites (Tables 15, 16, and 17). Note that the deletion contained in off-target sites R-01_Del1 or R-30_Del1 (Table 17) could be modeled as four mismatches, and the insertion in off-target sites R-01_Ins1, R-30_Ins9, or R-30_Ins14 (Table 17) could be modeled as having alternative PAMs. These alternative interpretations of the insertions and deletions for the sites shown in Table 17 explain why some existing bioinformatics tools such as Cas Online Designer, ZiFit, CRISPR Tools, and Cas OFFinder could still identify some of the off-target sites listed in Table 17, although these tools do not allow insertions or deletions to be considered in the searches. Since the insertions or deletions in off-target sites R-30_Del10, R-30_Ins4, R-30_Ins7, R-30_Ins8, and R-30_Ins10 (Table 19) could not be modeled as either mismatches or having an alternative PAM, they were not found by any other tools at this time.
Example 11: Extensive searches for HBB-targeted (R-01) and CCR5-targeted (R-30) guide strands, allowing indels greatly increases the number of putative off-target sites.
In addition to off-target sites of the same length as the guide strand but with mismatches, many similar sites exist in a genome with insertions (DNA bulges) and deletions (RNA bulges). Cas9 can tolerate DNA and RNA bulges and induce cleavage at genomic loci with high rates, sometimes even higher than the target site (Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014)). To further demonstrate the capabilities of COSMID, the guide strands R-01 and R-30 (Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013); Lin, et al., Nucleic Acids Res, 42:7473-7485 (2014)) were extensively analyzed using COSMID to search the human genome for sites similar to the R-01 or R-30 guide strands, having (i) up to three mismatches with no indels, (ii) up to two mismatches with a single-base insertion, and (iii) up to two mismatches with a single-base deletion. Since matching a guide strand's initial G is not essential, it was omitted in these searches. The off-target sites with a mismatched A at this position (OT1 and OT2) happened to have higher mutation rates than the three sites with a matching G (OT3-5) (Figure 27). The outputs provided many possible off-target sites, including those with insertions or deletions.
The number of putative genomic off-target sites output by COSMID increased drastically when indels were allowed in the search. For example, allowing one-base insertions together with two mismatches increased the number of genomic sites adjacent to a NAG or NGG PAM ~3 and ~7 times for R-01 and R-30 respectively compared with those without indels and two mismatches (166 versus 49 for R-01 and 224 versus 34 for R-30, Table 20).
Table 20: Comparison of search results for guide strands R-01 and R-30 with deletion or insertion permitted.
When one-base deletions are allowed together with two mismatches, the number of genomic sites identified is even higher, ~18 and ~26 times higher for R-01 and R-30 respectively compared with those without indels (883 sites for R-01 and 883 sites for R-30) (Table 20). With a one-base insertion or one-base deletion in addition to base mismatches, the number of unique loci found was greatly increased compared with the corresponding number without indels. For example, when a one-base deletion was allowed in addition to ≤2 mismatches, the number of unique off-target loci found by COSMID was 333 for R-01 and 761 for R-30 (Table 21).
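The approximate fold increases quoted above follow directly from the site counts given in the text (49 and 34 two-mismatch sites without indels, 166 and 224 with a one-base insertion, and 883 for each guide strand with a one-base deletion). The short Python check below uses only those counts; the dictionary layout and function name are illustrative.

```python
# Site counts quoted in the text for two mismatches adjacent to NAG/NGG PAMs.
SITE_COUNTS = {
    "R-01": {"no_indel": 49, "insertion": 166, "deletion": 883},
    "R-30": {"no_indel": 34, "insertion": 224, "deletion": 883},
}

def fold_increase(guide, query_type):
    """Ratio of sites found with an indel to sites found without indels."""
    counts = SITE_COUNTS[guide]
    return counts[query_type] / counts["no_indel"]
```

Rounded to one decimal place, the ratios are 3.4 and 6.6 for insertions and 18.0 and 26.0 for deletions, consistent with the ~3, ~7, ~18, and ~26 stated for R-01 and R-30.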
Table 21: Off-target loci when a one-base deletion was allowed in addition to ≤2 mismatches.
When allowing (i) up to three mismatches with no indels, or (ii) up to two mismatches with a one-base insertion, or (iii) up to two mismatches with a one-base deletion, COSMID searches of off-target sites for guide strands R-01 and R-30 with NRG PAM located 1,040 unique putative off-target sites for R-01 and 1,218 for R-30. There were many identical sites located by multiple query types (examples shown in Figures 28A and 28B). The results varied between the two guide strands R-01 and R-30 (each targets a coding sequence), as can be expected in a nonrandom genome (Figures 29A-29D). R-01 had a markedly larger number of matching sites with no indels. Of note was a particular 3-mismatch hit in 69 sites.
In summary, identifying off-target cleavage by CRISPR/Cas9 systems in a genome of interest is important, especially in treating human disease and creating model organisms, as CRISPR off-target cleavage (Fu, et al., Nat Biotechnol, 31:822-826 (2013); Hsu, et al., Nat Biotechnol, 31:827-832 (2013)) can result in mutations, deletions, inversions, and translocations (Cradick, et al., Nucleic Acids Res, 41:9584-9592 (2013); Xiao, et al., Nucleic Acids Res, 41:e141 (2013)), inducing detrimental biological consequences and potentially causing disease. However, accurate and complete genome-wide analysis of off-target effects is a daunting task, since unbiased sequencing of a full genome to determine off-target activity is very costly, and many nuclease-treated clones would have to be sequenced. Therefore, a bioinformatics-based tool that can predict and/or rank potential off-target cleavage sites can greatly aid the off-target analysis, and provide valuable guidance for guide strand designs. In particular, it is important to perform extensive bioinformatics searches for potential off-target sites that contain base mismatches, insertions, and deletions compared with the intended CRISPR target site.
COSMID can quickly and exhaustively search a genome for DNA sequences that partially match the target sequence of the guide strand, but contain insertions or deletions in addition to base mismatches. As shown in Table 21, a large number of potential off-target sites would be missed using search tools that only consider base mismatches, but not insertions or deletions. COSMID outputs potential off-target sites ("hits") corresponding to allowed mismatches and indels, the PAM sequence and the chromosomal location of the hits. COSMID also outputs primer designs for experimental validation of the off-target sites. Further processing of the COSMID results from the output spreadsheets extends COSMID's utility to different CRISPR/Cas platforms, including the use of Cas9 nickase pairs (Ran, et al., Cell, 154:1380-1389 (2013)), Cas9/FokI fusion (Tsai, et al., Nat Biotechnol, 32:569-576 (2014); Guilinger, et al., Nat Biotechnol, 32:577-582 (2014)), and multiplexed targeting (Cong, et al., Science, 339:819-823 (2013)) by searching for multiple (sometimes paired) sites within a user-input chromosomal proximity. In addition to aiding the design of CRISPR/Cas systems for DNA cleavage, COSMID can be used to identify potential off-target sites of CRISPR activators, repressors, or other effector domains (Cheng, et al., Cell Res, 23:1163-1171 (2013)).
The on-target and potential off-target sites given in the COSMID output can be tested experimentally using mutation detection assays (Guschin, et al., Methods Mol Biol, 649:247-256 (2010)) or deep sequencing with genomic DNA harvested from cells treated by CRISPR/Cas. Mutation detection assays, including Surveyor and T7EI, are very commonly used to measure on- and off-target cleavage and mutagenesis (Guschin, et al., Methods Mol Biol, 649:247-256 (2010)). COSMID facilitates these assays by automatically designing primers to enable facile gel separation of the uncleaved and cleavage bands. The output also includes the genomic reference sequence for comparison to the sequencing results.
COSMID scores the potential off-target sites based on the number and location of base mismatches, allowing ranking of the more likely off-target sites. Bioinformatics based ranking of CRISPR/Cas off-target sites may be influenced by the effects of genomic context and DNA modifications. As exemplified herein, identical genomic sites and duplicated sites may have differences in off-target activity. The indel rate at off-target site R-01 OT2 was 44%, though other loci with the same complementary sequence have much less, or no activity, possibly due to nuclease blocking. It is believed that incorporating parameters such as the effects of chromatin condensation, DNA availability and other factors into the COSMID search algorithm will improve the scoring and ranking of the target sites.
Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of skill in the art to which the disclosed invention belongs. Publications cited herein and the materials for which they are cited are specifically incorporated by reference.
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.

Claims

We claim:
1. A computer-implemented method for identifying cleavage locations of a nucleotide-directed nuclease comprising: in a computer system,
comparing a series of query sequences comprising a guide strand sequence and one or more guide strand sequence variants comprising one or more nucleotide insertions, one or more nucleotide deletions, and/or one or more nucleotide substitutions relative to the guide sequence to genomic sequence and reporting target cleavage sites corresponding to locations in the genomic sequence within specified search conditions or having a specified sequence identity to the guide sequence or one or more of the query sequences.
2. The method of claim 1, wherein the series of query sequences comprises all possible guide strand sequence variants comprising between 0 and 10 nucleotide insertions relative to the guide sequence.
3. The method of claim 1, wherein the series of query sequences comprises all possible guide strand sequence variants comprising between 0 and 5 nucleotide insertions relative to the guide sequence.
4. The method of claim 1, wherein the series of query sequences comprises all possible guide strand sequence variants comprising 0, 1, or 2 nucleotide insertions relative to the guide sequence.
5. The method of any of claims 1-4, wherein the series of query sequences comprises all possible guide strand sequence variants comprising between 0 and 10 nucleotide deletions relative to the guide sequence.
6. The method of any of claims 1-4, wherein the series of query sequences comprises all possible guide strand sequence variants comprising between 0 and 5 nucleotide deletions relative to the guide sequence.
7. The method of any of claims 1-4, wherein the series of query sequences comprises all possible guide strand sequence variants comprising 0, 1, or 2 nucleotide deletions relative to the guide sequence.
8. The method of any of claims 1-7, wherein the series of query sequences comprises all possible guide strand sequence variants comprising between 0 and 10 nucleotide substitutions relative to the guide sequence.
9. The method of any of claims 1-7, wherein the series of query sequences comprises all possible guide strand sequence variants comprising between 0 and 5 nucleotide substitutions relative to the guide sequence.
10. The method of any of claims 1-7, wherein the series of query sequences comprises all possible guide strand sequence variants comprising 0, 1, or 2 nucleotide substitutions relative to the guide sequence.
11. The method of any of claims 1-10, wherein the specified search conditions comprise the number of insertions, deletions, and/or mismatches between the guide strand sequence and the genomic sequence.
12. The method of claim 11, wherein the specified search conditions comprise 5 or fewer mismatches, 5 or fewer insertions, 5 or fewer deletions, and combinations thereof.
13. The method of claim 12, wherein the specified search conditions comprise one or two mismatches with or without one or more insertions and/or one or more deletions.
14. The method of claim 12, wherein the specified search conditions comprise zero mismatches with or without one or more insertions and/or one or more deletions.
15. The method of claim 12, wherein the specified search conditions comprise zero, one, two, or three mismatches, zero insertions, and zero deletions; zero, one, or two mismatches with one insertion and zero deletions; one or two mismatches with zero insertions and one deletion; or one or two mismatches with one insertion and one deletion; and combinations thereof.
16. The method of any of claims 1-15, wherein the method further comprises assigning a score to the returned target cleavage locations indicative of the predictive likelihood of cleavage at the target cleavage location, and ranking the target cleavage locations based on their scores.
17. The method of claim 16, wherein target cleavage locations comprising genomic sequences having higher sequence identity to the guide sequence receive a lower score relative to target cleavage locations comprising genomic sequences having lower sequence identity to the guide sequence.
18. The method of claim 17, wherein increasing numbers of substitutions, deletions, and insertions at the target cleavage location increase the score.
19. The method of claim 18, wherein the score is increased more for deletion(s) in the genomic sequence relative to the guide sequence (RNA bulges) than for insertions in the genomic sequence relative to the guide sequence (DNA bulges).
20. The method of claim 19, wherein the score reflects that sgRNA bulges are less tolerant to additional base mismatches, and vice versa.
21. The method of any of claims 1-20, wherein the series of query sequences comprise a protospacer adjacent motif (PAM) suffix.
22. The method of claim 21, wherein the PAM suffix is selected from the group consisting of NGG, NAG, and NRG.
23. The method of claim 22, wherein a target cleavage site comprising a NGG PAM guide strand is given a lower score than that of NAG PAM.
24. The method of any of claims 1-23, further comprising providing primer sequences suitable for amplifying the genomic sequence at the target cleavage site.
25. The method of any of claims 1-24, wherein the genomic sequence is an organismal genome selected from the group consisting of a human genome, a rat genome, a mouse genome, and a rhesus macaque genome.
26. The method of any of claims 1-25, wherein the genomic sequence comprises DNA sequence from FASTA formatted files transformed into index entries, which have all possible 25-base-long tags in the DNA sequence.
27. The method of claim 26, wherein the index entries are sorted and the results are stored as a binary main index file.
28. The method of claim 27, wherein the main index file is divided into parts, each representing entries having identical first about 12 nucleotides.
29. The method of claim 28, wherein a secondary index file comprises the position in the main index file where each part starts added to the end of the index file.
30. The method of any of claims 1-29, wherein the nuclease is a CRISPR/Cas nuclease.
31. The method of claim 30, wherein the CRISPR/Cas nuclease is Cas9 or a variant thereof.
32. The method of claim 30, wherein the nuclease is RNA-directed.
33. The method of claim 30, wherein the nuclease is DNA-directed, or directed by RNA, DNA and/or alternative nucleotide format.
34. The method of any of claims 1-33, wherein the nuclease cleaves both DNA strands, is a single nickase, or a double nickase.
35. The method of any of claims 1-33, wherein the nucleotide-directed protein binds or interacts with DNA, but is not a nuclease.
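
Illustrative sketch (not part of the claims). The comparison step recited in claims 1-10, 21, and 22 can be pictured as two parts: enumerating a series of query sequences from the guide strand, then scanning the genomic sequence for each query. The Python below is a minimal sketch under stated assumptions, not the disclosed implementation. The function names (`one_substitution`, `one_insertion`, `one_deletion`, `expand`, `query_series`, `find_sites`) are hypothetical, the genome is assumed to be an uppercase A/C/G/T string, only the forward strand is scanned, and a naive regular-expression scan stands in for the sorted, indexed search described in claims 26-29.

```python
import re

BASES = "ACGT"
# IUPAC codes needed for the PAM suffixes named in claim 22 (NGG, NAG, NRG).
IUPAC = {"N": "[ACGT]", "R": "[AG]"}


def one_substitution(seq):
    """All sequences that differ from seq by exactly one substituted base."""
    return {seq[:i] + b + seq[i + 1:]
            for i in range(len(seq)) for b in BASES if b != seq[i]}


def one_insertion(seq):
    """All sequences obtained by inserting one base anywhere in seq."""
    return {seq[:i] + b + seq[i:]
            for i in range(len(seq) + 1) for b in BASES}


def one_deletion(seq):
    """All sequences obtained by deleting one base from seq."""
    return {seq[:i] + seq[i + 1:] for i in range(len(seq))}


def expand(seqs, edit, times):
    """Apply edit up to `times` times, keeping every intermediate variant."""
    found, frontier = set(seqs), set(seqs)
    for _ in range(times):
        frontier = {v for s in frontier for v in edit(s)}
        found |= frontier
    return found


def query_series(guide, max_sub=0, max_ins=0, max_del=0, pam="NGG"):
    """Guide plus its variants, each with the PAM appended as a suffix."""
    variants = expand({guide}, one_deletion, max_del)
    variants = expand(variants, one_insertion, max_ins)
    variants = expand(variants, one_substitution, max_sub)
    return {v + pam for v in variants}


def find_sites(genome, queries):
    """Sorted (position, matched sequence) pairs for every query hit."""
    hits = set()
    for query in queries:
        pattern = "".join(IUPAC.get(c, c) for c in query)
        # A lookahead lets overlapping matches all be reported.
        for m in re.finditer("(?=(" + pattern + "))", genome):
            hits.add((m.start(), m.group(1)))
    return sorted(hits)


if __name__ == "__main__":
    # Toy 4-base guide; the genome lacks the G, so only a deletion variant hits.
    print(find_sites("TTACTAGGTT", query_series("ACGT", max_del=1)))
```

The number of variants grows quickly with the allowed edits and the guide length, which is one reason a practical tool would search a prebuilt index rather than scan the genome once per query as this sketch does.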
PCT/US2015/013134 2014-01-27 2015-01-27 Methods and systems for identifying crispr/cas off-target sites WO2015113063A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US15/114,799 US10354746B2 (en) 2014-01-27 2015-01-27 Methods and systems for identifying CRISPR/Cas off-target sites
US16/410,395 US20190295689A1 (en) 2014-01-27 2019-05-13 Methods and systems for identifying crispr/cas off-target sites
US16/594,905 US11315659B2 (en) 2014-01-27 2019-10-07 Methods and systems for identifying nucleotide-guided nuclease off-target sites

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201461932003P 2014-01-27 2014-01-27
US61/932,003 2014-01-27

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US15/114,799 A-371-Of-International US10354746B2 (en) 2014-01-27 2015-01-27 Methods and systems for identifying CRISPR/Cas off-target sites
US16/410,395 Continuation US20190295689A1 (en) 2014-01-27 2019-05-13 Methods and systems for identifying crispr/cas off-target sites

Publications (1)

Publication Number Publication Date
WO2015113063A1 (en) 2015-07-30

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/013134 WO2015113063A1 (en) 2014-01-27 2015-01-27 Methods and systems for identifying crispr/cas off-target sites

Country Status (2)

Country Link
US (2) US10354746B2 (en)
WO (1) WO2015113063A1 (en)

Cited By (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9340800B2 (en) 2013-09-06 2016-05-17 President And Fellows Of Harvard College Extended DNA-sensing GRNAS
WO2016094872A1 (en) * 2014-12-12 2016-06-16 The Broad Institute Inc. Dead guides for crispr transcription factors
US9388430B2 (en) 2013-09-06 2016-07-12 President And Fellows Of Harvard College Cas9-recombinase fusion proteins and uses thereof
US9526784B2 (en) 2013-09-06 2016-12-27 President And Fellows Of Harvard College Delivery system for functional nucleases
EP3129484A1 (en) * 2014-03-25 2017-02-15 Editas Medicine, Inc. Crispr/cas-related methods and compositions for treating hiv infection and aids
EP3219799A1 (en) 2016-03-17 2017-09-20 IMBA-Institut für Molekulare Biotechnologie GmbH Conditional crispr sgrna expression
US9834791B2 (en) 2013-11-07 2017-12-05 Editas Medicine, Inc. CRISPR-related methods and compositions with governing gRNAS
US9840699B2 (en) 2013-12-12 2017-12-12 President And Fellows Of Harvard College Methods for nucleic acid editing
WO2018064226A1 (en) * 2016-09-27 2018-04-05 uBiome, Inc. Method and system for crispr-based library preparation and sequencing
US10077453B2 (en) 2014-07-30 2018-09-18 President And Fellows Of Harvard College CAS9 proteins including ligand-dependent inteins
US20180291383A1 (en) * 2013-04-04 2018-10-11 President And Fellows Of Harvard College THERAPEUTIC USES OF GENOME EDITING WITH CRISPR/Cas SYSTEMS
US10113163B2 (en) 2016-08-03 2018-10-30 President And Fellows Of Harvard College Adenosine nucleobase editors and uses thereof
US10167457B2 (en) 2015-10-23 2019-01-01 President And Fellows Of Harvard College Nucleobase editors and uses thereof
US10227581B2 (en) 2013-08-22 2019-03-12 President And Fellows Of Harvard College Engineered transcription activator-like effector (TALE) domains and uses thereof
US10323236B2 (en) 2011-07-22 2019-06-18 President And Fellows Of Harvard College Evaluation and improvement of nuclease cleavage specificity
CN110070912A (en) * 2019-04-15 2019-07-30 桂林电子科技大学 A kind of prediction technique of CRISPR/Cas9 undershooting-effect
CN110335640A (en) * 2019-07-09 2019-10-15 河南师范大学 A kind of prediction technique of drug-DBPs binding site
US10494621B2 (en) 2015-06-18 2019-12-03 The Broad Institute, Inc. Crispr enzyme mutations reducing off-target effects
US10508298B2 (en) 2013-08-09 2019-12-17 President And Fellows Of Harvard College Methods for identifying a target site of a CAS9 nuclease
US10550372B2 (en) 2013-12-12 2020-02-04 The Broad Institute, Inc. Systems, methods and compositions for sequence manipulation with optimized functional CRISPR-Cas systems
US10577630B2 (en) 2013-06-17 2020-03-03 The Broad Institute, Inc. Delivery and use of the CRISPR-Cas systems, vectors and compositions for hepatic targeting and therapy
US10696986B2 (en) 2014-12-12 2020-06-30 The Broad Institute, Inc. Protected guide RNAS (PGRNAS)
US10711285B2 (en) 2013-06-17 2020-07-14 The Broad Institute, Inc. Optimized CRISPR-Cas double nickase systems, methods and compositions for sequence manipulation
US10738305B2 (en) 2015-02-23 2020-08-11 Vertex Pharmaceuticals Incorporated Materials and methods for treatment of hemoglobinopathies
US10745677B2 (en) 2016-12-23 2020-08-18 President And Fellows Of Harvard College Editing of CCR5 receptor gene to protect against HIV infection
US10781444B2 (en) 2013-06-17 2020-09-22 The Broad Institute, Inc. Functional genomics using CRISPR-Cas systems, compositions, methods, screens and applications thereof
US10851357B2 (en) 2013-12-12 2020-12-01 The Broad Institute, Inc. Compositions and methods of use of CRISPR-Cas systems in nucleotide repeat disorders
US10930367B2 (en) 2012-12-12 2021-02-23 The Broad Institute, Inc. Methods, models, systems, and apparatus for identifying target sequences for Cas enzymes or CRISPR-Cas systems for target sequences and conveying results thereof
US10946108B2 (en) 2013-06-17 2021-03-16 The Broad Institute, Inc. Delivery, use and therapeutic applications of the CRISPR-Cas systems and compositions for targeting disorders and diseases using viral components
US11008588B2 (en) 2013-06-17 2021-05-18 The Broad Institute, Inc. Delivery, engineering and optimization of tandem guide systems, methods and compositions for sequence manipulation
US11041173B2 (en) 2012-12-12 2021-06-22 The Broad Institute, Inc. Delivery, engineering and optimization of systems, methods and compositions for sequence manipulation and therapeutic applications
US11155795B2 (en) 2013-12-12 2021-10-26 The Broad Institute, Inc. CRISPR-Cas systems, crystal structure and uses thereof
US11268082B2 (en) 2017-03-23 2022-03-08 President And Fellows Of Harvard College Nucleobase editors comprising nucleic acid programmable DNA binding proteins
US11268077B2 (en) 2018-02-05 2022-03-08 Vertex Pharmaceuticals Incorporated Materials and methods for treatment of hemoglobinopathies
US11306324B2 (en) 2016-10-14 2022-04-19 President And Fellows Of Harvard College AAV delivery of nucleobase editors
US11315659B2 (en) 2014-01-27 2022-04-26 Georgia Tech Research Corporation Methods and systems for identifying nucleotide-guided nuclease off-target sites
US11319532B2 (en) 2017-08-30 2022-05-03 President And Fellows Of Harvard College High efficiency base editors comprising Gam
US11407985B2 (en) 2013-12-12 2022-08-09 The Broad Institute, Inc. Delivery, use and therapeutic applications of the CRISPR-Cas systems and compositions for genome editing
US11447770B1 (en) 2019-03-19 2022-09-20 The Broad Institute, Inc. Methods and compositions for prime editing nucleotide sequences
US11542496B2 (en) 2017-03-10 2023-01-03 President And Fellows Of Harvard College Cytosine to guanine base editor
US11542509B2 (en) 2016-08-24 2023-01-03 President And Fellows Of Harvard College Incorporation of unnatural amino acids into proteins using base editing
EP3371306B1 (en) * 2015-11-04 2023-01-04 Crispr Therapeutics AG Materials and methods for treatment of hemoglobinopathies
US11560566B2 (en) 2017-05-12 2023-01-24 President And Fellows Of Harvard College Aptazyme-embedded guide RNAs for use with CRISPR-Cas9 in genome editing and transcriptional activation
US11578312B2 (en) 2015-06-18 2023-02-14 The Broad Institute Inc. Engineering and optimization of systems, methods, enzymes and guide scaffolds of CAS9 orthologs and variants for sequence manipulation
US11661590B2 (en) 2016-08-09 2023-05-30 President And Fellows Of Harvard College Programmable CAS9-recombinase fusion proteins and uses thereof
US11732274B2 (en) 2017-07-28 2023-08-22 President And Fellows Of Harvard College Methods and compositions for evolving base editors using phage-assisted continuous evolution (PACE)
US11795443B2 (en) 2017-10-16 2023-10-24 The Broad Institute, Inc. Uses of adenosine base editors
US11898179B2 (en) 2017-03-09 2024-02-13 President And Fellows Of Harvard College Suppression of pain by gene editing
US11912985B2 (en) 2020-05-08 2024-02-27 The Broad Institute, Inc. Methods and compositions for simultaneous editing of both strands of a target double-stranded nucleotide sequence
KR102667508B1 (en) * 2022-02-08 2024-06-11 주식회사 툴젠 A method for predicting off-targets which are cappable of occuring in process of genome editing by prime editing system
US12031155B2 (en) 2015-05-08 2024-07-09 President And Fellows Of Harvard College Universal donor stem cells and related methods
US12110545B2 (en) 2017-01-06 2024-10-08 Editas Medicine, Inc. Methods of assessing nuclease cleavage
US12129471B2 (en) 2015-02-23 2024-10-29 Vertex Pharmaceuticals Incorporated Materials and methods for treatment of human genetic diseases including hemoglobinopathies
US12134767B2 (en) 2020-06-23 2024-11-05 Vertex Pharmaceuticals Incorporated Materials and methods for treatment of hemoglobinopathies

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016154579A2 (en) * 2015-03-26 2016-09-29 Editas Medicine, Inc. Crispr/cas-mediated gene conversion
US20200190534A1 (en) * 2017-04-27 2020-06-18 President And Fellows Of Harvard College Method of Off-Target Recording of Spacer Sequences within a Cell In Vivo
US11142788B2 (en) 2017-06-13 2021-10-12 Genetics Research, Llc Isolation of target nucleic acids
US10947599B2 (en) 2017-06-13 2021-03-16 Genetics Research, Llc Tumor mutation burden
US20180355380A1 (en) * 2017-06-13 2018-12-13 Genetics Research, Llc, D/B/A Zs Genetics, Inc. Methods and kits for quality control
US10527608B2 (en) 2017-06-13 2020-01-07 Genetics Research, Llc Methods for rare event detection
US10081829B1 (en) 2017-06-13 2018-09-25 Genetics Research, Llc Detection of targeted sequence regions
WO2019010384A1 (en) * 2017-07-07 2019-01-10 The Broad Institute, Inc. Methods for designing guide sequences for guided nucleases
EP3794130A4 (en) 2018-05-16 2022-07-27 Synthego Corporation Methods and systems for guide rna design and use
WO2021003343A1 (en) * 2019-07-03 2021-01-07 Integrated Dna Technologies, Inc. Identification, characterization, and quantitation of crispr-introduced double-stranded dna break repairs
CN114981445B (en) 2019-10-24 2024-09-27 合成Dna技术公司 Modified double-stranded donor templates
CN111261223B (en) * 2020-01-12 2022-05-03 湖南大学 CRISPR off-target effect prediction method based on deep learning
WO2022144437A1 (en) 2020-12-30 2022-07-07 Universidad De Granada Crispna for genome editing
US11783001B2 (en) 2021-07-08 2023-10-10 Bank Of America Corporation System and method for splitting a video stream using breakpoints based on recognizing workflow patterns
US20230091138A1 (en) * 2021-08-05 2023-03-23 Monsanto Technology Llc Methods and systems for use in identifying guide nucleic acid sequences consistent with experimental scaling
WO2023209614A1 (en) * 2022-04-27 2023-11-02 Crispr Therapeutics Ag Guide design and off-target searches

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013176772A1 (en) 2012-05-25 2013-11-28 The Regents Of The University Of California Methods and compositions for rna-directed target dna modification and for rna-directed modulation of transcription
WO2014018423A2 (en) 2012-07-25 2014-01-30 The Broad Institute, Inc. Inducible dna binding proteins and genome perturbation tools and applications thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9189594B2 (en) * 2010-08-31 2015-11-17 Annai Systems Inc. Method and systems for processing polymeric sequence data and related information
WO2014039729A1 (en) * 2012-09-05 2014-03-13 Stamatoyannopoulos John A Methods and compositions related to regulation of nucleic acids
EP3553176A1 (en) * 2014-03-10 2019-10-16 Editas Medicine, Inc. Crispr/cas-related methods and compositions for treating leber's congenital amaurosis 10 (lca10)

Non-Patent Citations (106)

* Cited by examiner, † Cited by third party
Title
ADDGENE: "CRISPR in the Lab: A Practical Guide", ADDGENE WEBSITE, 2014
ALBERTS ET AL., GARLAND SCIENCE, 2007
BAE ET AL., BIOINFORMATICS, vol. 30, 2014, pages 1473 - 1475
BARRANGOU ET AL., SCIENCE, vol. 315, 2007, pages 1709 - 1712
BLAND ET AL., BMC BIOINFORMATICS, vol. 8, 2007, pages 209
BOLOTIN ET AL., MICROBIOLOGY, vol. 151, 2005, pages 2551 - 2561
BRINER ET AL., MOL CELL, vol. 56, no. 2, 2014, pages 333 - 9
BROUNS ET AL., SCIENCE, vol. 321, 2008, pages 960 - 964
CHEN ET AL., J. BIOL. CHEM., 2014
CHENG ET AL., CELL RES, vol. 23, 2013, pages 1163 - 1171
CHO ET AL., GENOME RES., vol. 24, 2014, pages 132 - 141
CHO ET AL., NAT. BIOTECHNOL, vol. 31, 2013, pages 230 - 232
CONG ET AL., SCIENCE, vol. 339, 2013, pages 819 - 823
CONG, SCIENCE, vol. 339, no. 6121, 15 December 2012 (2012-12-15), pages 819 - 823
CORNU ET AL., METHODS MOL BIOL, vol. 649, 2010, pages 237 - 245
CORNU ET AL., METHODS MOL. BIOL., vol. 649, 2010, pages 237 - 245
CRADICK ET AL., BMC BIOINFORMATICS, vol. 12, 2011, pages 152
CRADICK ET AL., NUCLEIC ACIDS RES, vol. 41, 2013, pages 9584 - 9592
CRADICK ET AL., NUCLEIC ACIDS RES., vol. 41, 2013, pages 9584 - 9592
D.J. LIPMAN; W.R. PEARSON, SCIENCE, vol. 227, 1989, pages 1435 - 1441
FINE ET AL., NUCLEIC ACIDS RES, vol. 42, 2013, pages E42
FISCHER ET AL., J BIOL CHEM, vol. 287, 2012, pages 33351 - 33363
FU ET AL., NAT BIOTECHNOL, vol. 31, 2013, pages 822 - 826
FU ET AL., NAT. BIOTECHNOL, vol. 31, 2013, pages 822 - 826
FU ET AL., NATURE BIOTECH., 2014, pages 279 - 84
GABRIEL ET AL., NAT. BIOTECHNOL, vol. 29, 2011, pages 816 - 823
GAJ ET AL., TRENDS BIOTECHNOL, vol. 31, 2013, pages 397 - 405
GARNEAU ET AL., NATURE, vol. 468, 2010, pages 67 - 71
GASIUNAS ET AL., PROC. NATL ACAD. SCI. USA, vol. 109, 2012, pages E2579 - E2586
GASIUNAS ET AL., PROC NATL ACAD SCI USA, vol. 109, 2012, pages E2579 - E2586
GEURTS ET AL., SCIENCE, vol. 325, 2009, pages 433
GRATZ ET AL., GENETICS, vol. 194, 2013, pages 1029 - 1035
GRISSA ET AL., BMC BIOINFORMATICS, vol. 8, 2007, pages 172
GRISSA ET AL., NUCLEIC ACIDS RES, vol. 35, 2007, pages W52 - W57
GUILINGER ET AL., NAT BIOTECHNOL, vol. 32, 2014, pages 577 - 582
GUPTA ET AL., GENOME RES., vol. 23, 2013, pages 1008 - 1017
GUSCHIN ET AL., METHODS MOL BIOL, vol. 649, 2010, pages 247 - 256
GUSCHIN ET AL., METHODS MOL. BIOL., vol. 649, 2010, pages 247 - 256
GUSCHIN ET AL., METHODS MOL. BIOL., vol. 649, 2010, pages 247 - 56
HALE ET AL., CELL, vol. 139, 2009, pages 945 - 956
HOCKEMEYER ET AL., NAT BIOTECHNOL, vol. 29, 2011, pages 731 - 734
HOCKEMEYER ET AL., NAT. BIOTECHNOL, vol. 29, 2011, pages 731 - 734
HORVATH ET AL., J BACTERIOL, vol. 190, 2008, pages 1401 - 1412
HORVATH ET AL., SCIENCE, vol. 327, 2010, pages 167 - 170
HORVATH, SCIENCE, vol. 327, 2010, pages 167 - 170
HOU ET AL., PROC NATL ACAD SCI USA, vol. 110, 2013, pages 15644 - 15649
HSU ET AL., NAT BIOTECHNOL, vol. 31, 2013
HSU ET AL., NAT BIOTECHNOL, vol. 31, 2013, pages 827 - 832
HSU ET AL., NAT. BIOTECHNOL, vol. 31, 2013, pages 827 - 832
HUANG ET AL., ELECTROPHORESIS, vol. 33, no. 5, 2012, pages 788 - 96
HWANG ET AL., NAT. BIOTECHNOL, vol. 31, 2013, pages 227 - 229
ISELI ET AL., PLOS ONE, vol. 2, 2007, pages E579
ISELI ET AL., PLOS ONE, vol. 2, no. 6, 2007, pages E579
JIANG ET AL., NAT BIOTECHNOL, vol. 31, 2013, pages 233 - 239
JIANG ET AL., NAT. BIOTECHNOL, vol. 31, 2013, pages 233 - 239
JINEK ET AL., ELIFE, vol. 2, 2013, pages E00471
JINEK ET AL., SCIENCE, vol. 337, 2012, pages 816 - 821
JINEK ET AL., SCIENCE, vol. 337, no. 6096, 2012, pages 816 - 21
LANDT ET AL., GENOME RES., vol. 22, 2012, pages 1813 - 1831
LEE ET AL., GENOME RES., vol. 20, 2010, pages 81 - 89
LI ET AL., BIOINFORMATICS, vol. 26, 2010, pages 589 - 595
LI ET AL., NAT. BIOTECHNOL, vol. 31, 2013, pages 681 - 683
LIN ET AL., NUCLEIC ACIDS RES, vol. 42, 2014, pages 7473 - 7485
MAKAROVA ET AL., BIOL. DIRECT, vol. 1, 2006, pages 7
MALI ET AL., NAT. BIOTECHNOL, vol. 31, 2013, pages 833 - 838
MALI ET AL., NAT. METHODS, vol. 10, 2013, pages 957 - 963
MALI ET AL., SCIENCE, vol. 339, 2013, pages 823 - 826
MARRAFFINI ET AL., NAT REV GENET, vol. 11, 2010, pages 181 - 190
MARRAFFINI ET AL., NAT. REV. GENET., vol. 11, 2010, pages 181 - 190
MING MA ET AL: "A Guide RNA Sequence Design Platform for the CRISPR/Cas9 System for Model Organism Genomes", BIOMED RESEARCH INTERNATIONAL, vol. 31, no. 3, 1 January 2013 (2013-01-01), pages 822 - 4, XP055118861, ISSN: 2314-6133, DOI: 10.1186/1748-7188-6-26 *
MOJICA ET AL., MICROBIOLOGY, vol. 155, 2009, pages 733 - 740
MONTAGUE ET AL., NUCLEIC ACIDS RES, vol. 42, 2014, pages W401 - W407
MUSSOLINO ET AL., NUCLEIC ACIDS RES, vol. 39, 2011, pages 9283 - 9293
MUSSOLINO ET AL., NUCLEIC ACIDS RES., vol. 39, 2011, pages 9283 - 9293
PATRICK D HSU ET AL: "DNA targeting specificity of RNA-guided Cas9 nucleases", NATURE BIOTECHNOLOGY, NATURE PUBLISHING GROUP, UNITED STATES, vol. 31, no. 9, 1 September 2013 (2013-09-01), pages 827 - 832, XP002718604, ISSN: 1546-1696, [retrieved on 20130721], DOI: 10.1038/NBT.2647 *
PATTANAYAK ET AL., NAT BIOTECHNOL, vol. 31, 2013, pages 839 - 843
PATTANAYAK ET AL., NAT. BIOTECHNOL, vol. 31, 2013, pages 839 - 843
PATTANAYAK ET AL., NAT. METHODS, vol. 8, 2011, pages 765 - 770
PEREZ ET AL., NAT. BIOTECHNOL, vol. 26, 2008, pages 808 - 816
PRASHANT MALI ET AL: "CAS9 transcriptional activators for target specificity screening and paired nickases for cooperative genome engineering", NATURE BIOTECHNOLOGY, vol. 31, no. 9, 1 August 2013 (2013-08-01), pages 833 - 838, XP055186073, ISSN: 1087-0156, DOI: 10.1038/nbt.2675 *
RAMIREZ ET AL., NUCLEIC ACIDS RES, vol. 40, 2012, pages 5560 - 5568
RAMIREZ ET AL., NUCLEIC ACIDS RES., vol. 40, 2012, pages 5560 - 5568
RAN ET AL., CELL, vol. 154, 2013, pages 1380 - 1389
REYON ET AL., NAT BIOTECHNOL, vol. 30, 2012, pages 460 - 465
RONDA ET AL., BIOTECHNOL BIOENG, vol. 11, 2014, pages 1604 - 1616
ROSENBLOOM ET AL., NUCLEIC ACIDS RES., vol. 41, 2013, pages D56 - D63
ROUSSEAU ET AL., BIOINFORMATICS, vol. 25, 2009, pages 3317 - 3318
S. ALTSCHUL ET AL., J. MOL. BIOLOGY, vol. 215, 1990, pages 403 - 410
SANDER ET AL., NUCLEIC ACIDS RES, vol. 38, 2010, pages W462 - 468
SAPRANAUSKAS ET AL., NUCLEIC ACIDS RES., vol. 39, 2011, pages 9275 - 9282
SHAH ET AL., RNA BIOL, vol. 10, 2013, pages 891 - 899
SHAN ET AL., NAT. BIOTECHNOL, vol. 31, 2013, pages 686 - 688
SUGIMOTO ET AL., BIOCHEMISTRY, vol. 34, 1995, pages 11211 - 11216
T. J. CRADICK ET AL: "CRISPR/Cas9 systems targeting β-globin and CCR5 genes have substantial off-target activity", NUCLEIC ACIDS RESEARCH, vol. 41, no. 20, 1 November 2013 (2013-11-01), pages 9584 - 9592, XP055186069, ISSN: 0305-1048, DOI: 10.1093/nar/gkt714 *
TESSON ET AL., NAT BIOTECHNOL, vol. 29, 2011, pages 695 - 696
TESSON ET AL., NAT. BIOTECHNOL, vol. 29, 2011, pages 695 - 696
THOMAS J CRADICK ET AL: "COSMID: A Web-based Tool for Identifying and Validating CRISPR/Cas Off-target Sites", MOLECULAR THERAPY-NUCLEIC ACIDS, vol. 3, no. 12, 2 December 2014 (2014-12-02), pages e214, XP055186449, DOI: 10.1038/mtna.2014.64 *
TSAI ET AL., NAT BIOTECHNOL, vol. 32, 2014, pages 569 - 576
W.R. PEARSON; D.J. LIPMAN, PROC. NATL. ACAD. SCI., vol. 85, 1988, pages 2444 - 2448
WANG ET AL., CELL, vol. 153, 2013, pages 910 - 918
XIAO ET AL., BIOINFORMATICS, vol. 30, 2014, pages 1180 - 1182
XIAO ET AL., NUCLEIC ACIDS RES, vol. 41, 2013, pages E141
XIAO ET AL., NUCLEIC ACIDS RES., vol. 41, 2013, pages E141
XIE ET AL., MOL PLANT, vol. 6, 2013
YANG ET AL., CELL, vol. 154, 2013, pages 1370 - 1379
YU ET AL., NUCLEIC ACIDS RES., vol. 38, 2010, pages 5706 - 5717

Cited By (99)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12006520B2 (en) 2011-07-22 2024-06-11 President And Fellows Of Harvard College Evaluation and improvement of nuclease cleavage specificity
US10323236B2 (en) 2011-07-22 2019-06-18 President And Fellows Of Harvard College Evaluation and improvement of nuclease cleavage specificity
US10930367B2 (en) 2012-12-12 2021-02-23 The Broad Institute, Inc. Methods, models, systems, and apparatus for identifying target sequences for Cas enzymes or CRISPR-Cas systems for target sequences and conveying results thereof
US11041173B2 (en) 2012-12-12 2021-06-22 The Broad Institute, Inc. Delivery, engineering and optimization of systems, methods and compositions for sequence manipulation and therapeutic applications
US20180291383A1 (en) * 2013-04-04 2018-10-11 President And Fellows Of Harvard College THERAPEUTIC USES OF GENOME EDITING WITH CRISPR/Cas SYSTEMS
US11597949B2 (en) 2013-06-17 2023-03-07 The Broad Institute, Inc. Optimized CRISPR-Cas double nickase systems, methods and compositions for sequence manipulation
US12018275B2 (en) 2013-06-17 2024-06-25 The Broad Institute, Inc. Delivery and use of the CRISPR-CAS systems, vectors and compositions for hepatic targeting and therapy
US10711285B2 (en) 2013-06-17 2020-07-14 The Broad Institute, Inc. Optimized CRISPR-Cas double nickase systems, methods and compositions for sequence manipulation
US10781444B2 (en) 2013-06-17 2020-09-22 The Broad Institute, Inc. Functional genomics using CRISPR-Cas systems, compositions, methods, screens and applications thereof
US10577630B2 (en) 2013-06-17 2020-03-03 The Broad Institute, Inc. Delivery and use of the CRISPR-Cas systems, vectors and compositions for hepatic targeting and therapy
US10946108B2 (en) 2013-06-17 2021-03-16 The Broad Institute, Inc. Delivery, use and therapeutic applications of the CRISPR-Cas systems and compositions for targeting disorders and diseases using viral components
US11008588B2 (en) 2013-06-17 2021-05-18 The Broad Institute, Inc. Delivery, engineering and optimization of tandem guide systems, methods and compositions for sequence manipulation
US11920181B2 (en) 2013-08-09 2024-03-05 President And Fellows Of Harvard College Nuclease profiling system
US10508298B2 (en) 2013-08-09 2019-12-17 President And Fellows Of Harvard College Methods for identifying a target site of a CAS9 nuclease
US10954548B2 (en) 2013-08-09 2021-03-23 President And Fellows Of Harvard College Nuclease profiling system
US10227581B2 (en) 2013-08-22 2019-03-12 President And Fellows Of Harvard College Engineered transcription activator-like effector (TALE) domains and uses thereof
US11046948B2 (en) 2013-08-22 2021-06-29 President And Fellows Of Harvard College Engineered transcription activator-like effector (TALE) domains and uses thereof
US11299755B2 (en) 2013-09-06 2022-04-12 President And Fellows Of Harvard College Switchable CAS9 nucleases and uses thereof
US10912833B2 (en) 2013-09-06 2021-02-09 President And Fellows Of Harvard College Delivery of negatively charged proteins using cationic lipids
US10858639B2 (en) 2013-09-06 2020-12-08 President And Fellows Of Harvard College CAS9 variants and uses thereof
US9526784B2 (en) 2013-09-06 2016-12-27 President And Fellows Of Harvard College Delivery system for functional nucleases
US9340800B2 (en) 2013-09-06 2016-05-17 President And Fellows Of Harvard College Extended DNA-sensing GRNAS
US9999671B2 (en) 2013-09-06 2018-06-19 President And Fellows Of Harvard College Delivery of negatively charged proteins using cationic lipids
US9737604B2 (en) 2013-09-06 2017-08-22 President And Fellows Of Harvard College Use of cationic lipids to deliver CAS9
US9388430B2 (en) 2013-09-06 2016-07-12 President And Fellows Of Harvard College Cas9-recombinase fusion proteins and uses thereof
US9340799B2 (en) 2013-09-06 2016-05-17 President And Fellows Of Harvard College MRNA-sensing switchable gRNAs
US10682410B2 (en) 2013-09-06 2020-06-16 President And Fellows Of Harvard College Delivery system for functional nucleases
US10597679B2 (en) 2013-09-06 2020-03-24 President And Fellows Of Harvard College Switchable Cas9 nucleases and uses thereof
US10640788B2 (en) 2013-11-07 2020-05-05 Editas Medicine, Inc. CRISPR-related methods and compositions with governing gRNAs
US11390887B2 (en) 2013-11-07 2022-07-19 Editas Medicine, Inc. CRISPR-related methods and compositions with governing gRNAS
US10190137B2 (en) 2013-11-07 2019-01-29 Editas Medicine, Inc. CRISPR-related methods and compositions with governing gRNAS
US9834791B2 (en) 2013-11-07 2017-12-05 Editas Medicine, Inc. CRISPR-related methods and compositions with governing gRNAS
US11124782B2 (en) 2013-12-12 2021-09-21 President And Fellows Of Harvard College Cas variants for gene editing
US11597919B2 (en) 2013-12-12 2023-03-07 The Broad Institute Inc. Systems, methods and compositions for sequence manipulation with optimized functional CRISPR-Cas systems
US9840699B2 (en) 2013-12-12 2017-12-12 President And Fellows Of Harvard College Methods for nucleic acid editing
US11155795B2 (en) 2013-12-12 2021-10-26 The Broad Institute, Inc. CRISPR-Cas systems, crystal structure and uses thereof
US10851357B2 (en) 2013-12-12 2020-12-01 The Broad Institute, Inc. Compositions and methods of use of CRISPR-Cas systems in nucleotide repeat disorders
US11053481B2 (en) 2013-12-12 2021-07-06 President And Fellows Of Harvard College Fusions of Cas9 domains and nucleic acid-editing domains
US11591581B2 (en) 2013-12-12 2023-02-28 The Broad Institute, Inc. Compositions and methods of use of CRISPR-Cas systems in nucleotide repeat disorders
US10550372B2 (en) 2013-12-12 2020-02-04 The Broad Institute, Inc. Systems, methods and compositions for sequence manipulation with optimized functional CRISPR-Cas systems
US11407985B2 (en) 2013-12-12 2022-08-09 The Broad Institute, Inc. Delivery, use and therapeutic applications of the CRISPR-Cas systems and compositions for genome editing
US10465176B2 (en) 2013-12-12 2019-11-05 President And Fellows Of Harvard College Cas variants for gene editing
US11315659B2 (en) 2014-01-27 2022-04-26 Georgia Tech Research Corporation Methods and systems for identifying nucleotide-guided nuclease off-target sites
EP3129484A1 (en) * 2014-03-25 2017-02-15 Editas Medicine, Inc. Crispr/cas-related methods and compositions for treating hiv infection and aids
US11578343B2 (en) 2014-07-30 2023-02-14 President And Fellows Of Harvard College CAS9 proteins including ligand-dependent inteins
US10077453B2 (en) 2014-07-30 2018-09-18 President And Fellows Of Harvard College CAS9 proteins including ligand-dependent inteins
US10704062B2 (en) 2014-07-30 2020-07-07 President And Fellows Of Harvard College CAS9 proteins including ligand-dependent inteins
WO2016094872A1 (en) * 2014-12-12 2016-06-16 The Broad Institute Inc. Dead guides for crispr transcription factors
US11624078B2 (en) 2014-12-12 2023-04-11 The Broad Institute, Inc. Protected guide RNAs (pgRNAs)
US10696986B2 (en) 2014-12-12 2020-06-30 The Broad Institute, Inc. Protected guide RNAs (pgRNAs)
US10738305B2 (en) 2015-02-23 2020-08-11 Vertex Pharmaceuticals Incorporated Materials and methods for treatment of hemoglobinopathies
US12129471B2 (en) 2015-02-23 2024-10-29 Vertex Pharmaceuticals Incorporated Materials and methods for treatment of human genetic diseases including hemoglobinopathies
US12110500B2 (en) 2015-05-08 2024-10-08 President And Fellows Of Harvard College Universal donor stem cells and related methods
US12031155B2 (en) 2015-05-08 2024-07-09 President And Fellows Of Harvard College Universal donor stem cells and related methods
US12031154B2 (en) 2015-05-08 2024-07-09 President And Fellows Of Harvard College Universal donor stem cells and related methods
US10876100B2 (en) 2015-06-18 2020-12-29 The Broad Institute, Inc. CRISPR enzyme mutations reducing off-target effects
US11578312B2 (en) 2015-06-18 2023-02-14 The Broad Institute Inc. Engineering and optimization of systems, methods, enzymes and guide scaffolds of CAS9 orthologs and variants for sequence manipulation
US10494621B2 (en) 2015-06-18 2019-12-03 The Broad Institute, Inc. CRISPR enzyme mutations reducing off-target effects
US12123032B2 (en) 2015-06-18 2024-10-22 The Broad Institute, Inc. CRISPR enzyme mutations reducing off-target effects
US12043852B2 (en) 2015-10-23 2024-07-23 President And Fellows Of Harvard College Evolved Cas9 proteins for gene editing
US11214780B2 (en) 2015-10-23 2022-01-04 President And Fellows Of Harvard College Nucleobase editors and uses thereof
US10167457B2 (en) 2015-10-23 2019-01-01 President And Fellows Of Harvard College Nucleobase editors and uses thereof
EP3371306B1 (en) * 2015-11-04 2023-01-04 Crispr Therapeutics AG Materials and methods for treatment of hemoglobinopathies
US12043843B2 (en) 2015-11-04 2024-07-23 Vertex Pharmaceuticals Incorporated Materials and methods for treatment of hemoglobinopathies
EP3219799A1 (en) 2016-03-17 2017-09-20 IMBA-Institut für Molekulare Biotechnologie GmbH Conditional crispr sgrna expression
WO2017158153A1 (en) 2016-03-17 2017-09-21 Imba - Institut Für Molekulare Biotechnologie Gmbh Conditional crispr sgrna expression
US10113163B2 (en) 2016-08-03 2018-10-30 President And Fellows Of Harvard College Adenosine nucleobase editors and uses thereof
US11999947B2 (en) 2016-08-03 2024-06-04 President And Fellows Of Harvard College Adenosine nucleobase editors and uses thereof
US10947530B2 (en) 2016-08-03 2021-03-16 President And Fellows Of Harvard College Adenosine nucleobase editors and uses thereof
US11702651B2 (en) 2016-08-03 2023-07-18 President And Fellows Of Harvard College Adenosine nucleobase editors and uses thereof
US11661590B2 (en) 2016-08-09 2023-05-30 President And Fellows Of Harvard College Programmable CAS9-recombinase fusion proteins and uses thereof
US12084663B2 (en) 2016-08-24 2024-09-10 President And Fellows Of Harvard College Incorporation of unnatural amino acids into proteins using base editing
US11542509B2 (en) 2016-08-24 2023-01-03 President And Fellows Of Harvard College Incorporation of unnatural amino acids into proteins using base editing
US11572555B2 (en) 2016-09-27 2023-02-07 Psomagen, Inc. Method and system for CRISPR-based library preparation and sequencing
WO2018064226A1 (en) * 2016-09-27 2018-04-05 uBiome, Inc. Method and system for crispr-based library preparation and sequencing
US11306324B2 (en) 2016-10-14 2022-04-19 President And Fellows Of Harvard College AAV delivery of nucleobase editors
US10745677B2 (en) 2016-12-23 2020-08-18 President And Fellows Of Harvard College Editing of CCR5 receptor gene to protect against HIV infection
US11820969B2 (en) 2016-12-23 2023-11-21 President And Fellows Of Harvard College Editing of CCR2 receptor gene to protect against HIV infection
US12110545B2 (en) 2017-01-06 2024-10-08 Editas Medicine, Inc. Methods of assessing nuclease cleavage
US11898179B2 (en) 2017-03-09 2024-02-13 President And Fellows Of Harvard College Suppression of pain by gene editing
US11542496B2 (en) 2017-03-10 2023-01-03 President And Fellows Of Harvard College Cytosine to guanine base editor
US11268082B2 (en) 2017-03-23 2022-03-08 President And Fellows Of Harvard College Nucleobase editors comprising nucleic acid programmable DNA binding proteins
US11560566B2 (en) 2017-05-12 2023-01-24 President And Fellows Of Harvard College Aptazyme-embedded guide RNAs for use with CRISPR-Cas9 in genome editing and transcriptional activation
US11732274B2 (en) 2017-07-28 2023-08-22 President And Fellows Of Harvard College Methods and compositions for evolving base editors using phage-assisted continuous evolution (PACE)
US11932884B2 (en) 2017-08-30 2024-03-19 President And Fellows Of Harvard College High efficiency base editors comprising Gam
US11319532B2 (en) 2017-08-30 2022-05-03 President And Fellows Of Harvard College High efficiency base editors comprising Gam
US11795443B2 (en) 2017-10-16 2023-10-24 The Broad Institute, Inc. Uses of adenosine base editors
US11268077B2 (en) 2018-02-05 2022-03-08 Vertex Pharmaceuticals Incorporated Materials and methods for treatment of hemoglobinopathies
US11643652B2 (en) 2019-03-19 2023-05-09 The Broad Institute, Inc. Methods and compositions for prime editing nucleotide sequences
US11447770B1 (en) 2019-03-19 2022-09-20 The Broad Institute, Inc. Methods and compositions for prime editing nucleotide sequences
US11795452B2 (en) 2019-03-19 2023-10-24 The Broad Institute, Inc. Methods and compositions for prime editing nucleotide sequences
CN110070912A (en) * 2019-04-15 2019-07-30 桂林电子科技大学 Prediction method for CRISPR/Cas9 off-target effect
CN110070912B (en) * 2019-04-15 2023-06-23 桂林电子科技大学 Prediction method for CRISPR/Cas9 off-target effect
CN110335640B (en) * 2019-07-09 2022-01-25 河南师范大学 Prediction method of drug-DBPs binding sites
CN110335640A (en) * 2019-07-09 2019-10-15 河南师范大学 Prediction method of drug-DBPs binding sites
US12031126B2 (en) 2020-05-08 2024-07-09 The Broad Institute, Inc. Methods and compositions for simultaneous editing of both strands of a target double-stranded nucleotide sequence
US11912985B2 (en) 2020-05-08 2024-02-27 The Broad Institute, Inc. Methods and compositions for simultaneous editing of both strands of a target double-stranded nucleotide sequence
US12134767B2 (en) 2020-06-23 2024-11-05 Vertex Pharmaceuticals Incorporated Materials and methods for treatment of hemoglobinopathies
KR102667508B1 (en) * 2022-02-08 2024-06-11 주식회사 툴젠 A method for predicting off-targets which are capable of occurring in the process of genome editing by a prime editing system

Also Published As

Publication number Publication date
US10354746B2 (en) 2019-07-16
US20190295689A1 (en) 2019-09-26
US20170053062A1 (en) 2017-02-23

Similar Documents

Publication Publication Date Title
US20190295689A1 (en) Methods and systems for identifying crispr/cas off-target sites
Li et al. Computational tools and resources for CRISPR/Cas genome editing
JP7095031B2 (en) Genome-wide and bias-free DSB identification assessed by sequencing (GUIDE-Seq)
Hendel et al. Quantifying on-and off-target genome editing
US11120889B2 (en) Method for synthesizing a nuclease with reduced off-site cleavage
Yadav et al. Genome-wide development of transposable elements-based markers in foxtail millet and construction of an integrated database
Hsu et al. DNA targeting specificity of RNA-guided Cas9 nucleases
Kleinstiver et al. Engineered CRISPR-Cas9 nucleases with altered PAM specificities
Lin et al. CRISPR/Cas9 systems have off-target activity with insertions or deletions between target DNA and guide RNA sequences
Varshney et al. Understanding and editing the zebrafish genome
Fu et al. Targeted genome editing in human cells using CRISPR/Cas nucleases and truncated guide RNAs
Rudd Expressed sequence tags: alternative or complement to whole genome sequences?
Townsend et al. High-frequency modification of plant genes using engineered zinc-finger nucleases
Salem et al. Recently integrated Alu elements and human genomic diversity
EP3724214A1 (en) Systems and methods for predicting repair outcomes in genetic engineering
Lee et al. Allele-specific quantitative PCR for accurate, rapid, and cost-effective genotyping
US11315659B2 (en) Methods and systems for identifying nucleotide-guided nuclease off-target sites
Booher et al. Tools for TAL effector design and target prediction
Marzec et al. Targeted base editing systems are available for plants
JP2024079842A (en) Methods and systems for guide RNA design and use
Terrazas et al. The origins and the biological consequences of the Pur/Pyr DNA· RNA asymmetry
Hanscom et al. Characterization of sequence contexts that favor alternative end joining at Cas9-induced double-strand breaks
KR20210065085A (en) Methods for Characterizing Modifications Elicited by Use of Designer Nucleases
Zhang et al. Subtelomeric 5-enolpyruvylshikimate-3-phosphate synthase copy number variation confers glyphosate resistance in Eleusine indica
De Filippis Plant bioinformatics: next generation sequencing approaches

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15704161; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
WWE WIPO information: entry into national phase (Ref document number: 15114799; Country of ref document: US)
122 EP: PCT application non-entry in European phase (Ref document number: 15704161; Country of ref document: EP; Kind code of ref document: A1)