AU2013201171B2

AU2013201171B2 - Plant chimeric binding polypeptides for universal molecular recognition

Info

Publication number: AU2013201171B2
Application number: AU2013201171A
Authority: AU
Inventors: Jennifer Jones
Original assignee: Monsanto Technology LLC
Current assignee: Monsanto Technology LLC
Priority date: 2006-02-13
Filing date: 2013-02-28
Publication date: 2016-01-21
Anticipated expiration: 2027-02-13
Also published as: AU2013201171A1

Abstract

Libraries of nucleic acids encoding chimeric binding polypeptides based on plant scaffold polypeptide sequences. Also described are methods for generating the libraries.

Description

PLANT CHIMERIC BINDING POLYPEPTIDES FOR UNIVERSAL MOLECULAR RECOGNITION
This is a divisional of Australian Patent Application No. 2007215062, the entire contents of which are incorporated herein by reference.
BACKGROUND
The binding specificity and affinity of a protein for a target are determined primarily by the protein's amino acid sequence within one or more binding regions. Accordingly, varying the amino acid sequence of the relevant regions reconfigures the protein's binding properties.
In nature, combinatorial changes in protein binding are best illustrated by the vast array of immunoglobulins produced by the immune system. Each immunoglobulin includes a set of short, virtually unique, amino acid sequences known as hypervariable regions (i.e., protein binding domains), and another set of longer, invariant sequences known as constant regions. The constant regions form β sheets that stabilize the three dimensional structure of the protein in spite of the enormous sequence diversity among hypervariable regions in the population of immunoglobulins. Each set of hypervariable regions confers binding specificity and affinity. The assembly of two heavy chain and two light chain immunoglobulins into a large protein complex (i.e., an antibody) further increases the number of combinations with diverse binding activities.
The binding diversity of antibodies has been successfully exploited in many biomedical and industrial applications. For example, libraries have been constructed that express immunoglobulins bearing artificially diversified hypervariable regions. Immunoglobulin expression libraries are very useful for identifying high affinity antibodies to a target molecule (e.g., a receptor or receptor ligand). A nucleic acid encoding the identified immunoglobulin can then be isolated and expressed in host cells or organisms.
However, despite the usefulness of immunoglobulins and antibodies in general, their expression in transgenic plants can be problematic. Immunoglobulins may not fold properly in plant cytoplasm because they require the formation of multiple disulfide bonds. Further, the large size of immunoglobulins prevents their effective uptake by some plant pests. Thus, immunoglobulins are frequently not useful as protein pesticides or pesticide targeting molecules. Finally, expressing mammalian proteins such as immunoglobulins (e.g., as so-called "plantibodies") in edible plants also raises potential issues of consumer acceptance and is thus an impediment to commercialization. This may effectively prevent use of plantibodies for many input and output traits in transgenic plants.
WO 2007/095300 PCT/US2007/003937
The above-mentioned disadvantages of immunoglobulins can be circumvented by generating diverse libraries of binding proteins from other classes of structurally tolerant proteins, preferably plant-derived proteins. These libraries can be screened to identify individual proteins that bind with desired specificity and affinity to a target of interest. Afterwards, identified binding proteins can be efficiently expressed in transgenic plants.
SUMMARY
Diverse libraries of nucleic acids encoding plant chimeric binding polypeptides, as well as methods for generating them, are described herein. The chimeric binding polypeptides are conceptually analogous to immunoglobulins in that they feature highly varied binding domains in the framework of unvarying sequences that encode a structurally robust protein. However, the chimeric binding polypeptides described herein have the considerable advantage of being derived from plant protein sequences, thereby avoiding many of the problems associated with immunoglobulin expression in plants.
The amino acid sequences of the encoded plant chimeric binding proteins are derived from a scaffold polypeptide sequence that includes subsequences to be varied. The varied subsequences correspond to putative binding domains of the plant chimeric binding polypeptides, and are highly heterogeneous in the library of encoded plant chimeric binding proteins. In contrast, the sequence of the encoded chimeric binding proteins outside of the varied subsequences is essentially the same as the parent scaffold polypeptide sequence and highly homogeneous throughout the library of encoded plant chimeric binding proteins. Such libraries can serve as a universal molecular recognition platform to select proteins with high selectivity and affinity binding for expression in transgenic plants.
Accordingly, one aspect described herein is a library of nucleic acid molecules encoding at least ten (e.g., at least 1,000, 10^5, or 10^6) different chimeric binding polypeptides. The amino acid sequence of each polypeptide includes C1-X1-C2-X2-C3-X3-C4, where C1-C4 are backbone subsequences selected from purple acid phosphatase (i.e., SEQ ID NOs: 1-30, 31-60, 61-90, and 91-120, respectively) that can include up to 30 (e.g., 20, 10, or 5) single amino acid substitutions, deletions, insertions, or additions to the selected purple acid phosphatase sequences. The C1-C4 subsequences are homogeneous across many of the polypeptides encoded in the library. In contrast to the C1-C4 backbone subsequences, the X1-X3 subsequences are independent variable subsequences consisting of 2-20 amino acids, and these subsequences are heterogeneous across many of the polypeptides in the library. For example, the library of chimeric polypeptides can have the amino acid sequence of any one of SEQ ID NOs:124-126 including one to ten single amino acid substitutions, deletions, insertions, or additions to amino acid positions corresponding to 23-39, 51-49, and 79-84 of SEQ ID NOs:124-126.
Another aspect described herein is a method for generating the just-described library. The method includes providing a parental nucleic acid encoding a plant scaffold polypeptide sequence containing C1-X1-C2-X2-C3-X3-C4 as defined above. The method further includes replicating the parental nucleic acid (e.g., at least one of the X1-X3 subsequences is selected from SEQ ID NOs: 121-123) under conditions that introduce up to 10 single amino acid substitutions, deletions, insertions, or additions to the parental X1, X2, or X3 subsequences, whereby a heterogeneous population of randomly varied subsequences encoding X1, X2, or X3 is generated. The population of varied subsequences is then substituted into a population of parental nucleic acids at the positions corresponding to those encoding X1, X2, or X3. The amino acid substitutions, deletions, insertions, or additions can be introduced into the parental nucleic acid subsequences by replication in vitro (e.g., using a purified mutagenic polymerase or nucleotide analogs) or in vivo (e.g., in a mutagenic strain of E. coli). The just-described library can be introduced into a biological replication system (e.g., E. coli or bacteriophage) and amplified.
A related aspect described herein is another method for generating the above-described library of nucleic acids. The method includes selecting an amino acid sequence containing C1-X1-C2-X2-C3-X3-C4 as defined above. The method further includes providing a first and second set of oligonucleotides having overlapping complementary sequences. Oligonucleotides of the first set encode the C1-C4 subsequences and multiple heterogeneous X1-X3 subsequences. Oligonucleotides of the second set are complementary to nucleotide sequences encoding the C1-C4 subsequences and multiple heterogeneous X1-X3 subsequences. The two sets of oligonucleotides are combined to form a first mixture and incubated under conditions that allow hybridization of the overlapping complementary sequences. The resulting hybridized sequences are then extended to form a second mixture containing the above-described library.
Yet another aspect of the invention is a library of nucleic acids encoding chimeric binding polypeptides, each of which includes an amino acid sequence at least 70% (i.e., any percentage between 70% and 100%) identical to any of SEQ ID NOs: 127-129. The amino acid sequence of each of the encoded polypeptides includes amino acids that differ from those of SEQ ID NOs: 127-129 at positions 14, 15, 33, 35-36, 38, 47-48, 66, 68-69, 71, 80, 81, 99, 101-102, and 104, and the amino acid differences are heterogeneous across a plurality of the encoded polypeptides. The amino acid sequence of each of the encoded polypeptides outside of the above-listed positions is homogeneous across a plurality of the encoded chimeric polypeptides.
A related aspect described herein is a method for generating the just-described library. The method includes selecting an amino acid sequence corresponding to any of SEQ ID NOs: 127-129, in which the selected sequence differs from SEQ ID NOs:127-129 in at least one of the above-mentioned positions. The method further includes providing a first and second set of oligonucleotides having overlapping complementary sequences.
Oligonucleotides of the first set encode subsequences of the selected amino acid sequence, the subsequences being heterogeneous at the above-mentioned positions. Oligonucleotides of the second set are complementary to nucleotide sequences encoding subsequences of the selected amino acid sequence, the subsequences being heterogeneous at the above-mentioned positions. The two sets of oligonucleotides are combined to form a first mixture and incubated under conditions that allow hybridization of the overlapping complementary sequences. The resulting hybridized sequences are then extended to form a second mixture containing the above-described library.
Various implementations of the invention can include one or more of the following. For example, each nucleic acid in a library can include a vector sequence. Also featured is any nucleic acid isolated from one of the above-described libraries, as well as the chimeric binding polypeptide encoded by it, in pure form. In one implementation, a population of cells (or individual cells selected from the population of cells) is provided which express chimeric binding polypeptides encoded by one of the libraries. Another implementation features a library of purified chimeric binding polypeptides encoded by one of the nucleic acid libraries. Yet another implementation provides a population of filamentous phage displaying the chimeric binding polypeptides encoded by one of the nucleic acid libraries.
In various implementations of methods for generating the above-described nucleic acid libraries by oligonucleotide assembly, one or more of the following can be included. For example, the method can further include, after the second mixture that contains the nucleic acid library is generated, performing a cycle of denaturing the population of nucleic acids followed by a hybridization and an elongation step. Optionally, this cycle can be repeated (e.g., up to 100 times).
The nucleic acid libraries can be amplified by a polymerase chain reaction that includes a forward and a reverse primer that hybridize to the 5' and 3' end sequences, respectively, of all nucleic acids in the library.
In one implementation, amino acids to be encoded in variable sequence positions are selected from a subset (e.g., only 4, 6, 8, 10, 12, 14 or 16) of alanine, arginine, asparagine, aspartate, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, cysteine and valine (the 20 naturally occurring amino acids). In other cases, 19 of the 20 are used (cysteine is excluded). In other cases, all 20 are used. In another implementation, the subset of amino acids includes at least one aliphatic, one acidic, one neutral, and one aromatic amino acid (e.g., alanine, aspartate, serine, and tyrosine).
Described herein is a library of nucleic acids encoding at least ten different polypeptides, the amino acid sequence of each polypeptide comprising: C1-X1-C2-X2-C3-X3-C4, wherein: (i) subsequence C1 is selected from SEQ ID NOs:1-30, subsequence C2 is selected from SEQ ID NOs:31-60, subsequence C3 is selected from SEQ ID NOs:61-90, subsequence C4 is selected from SEQ ID NOs:91-120, and each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (ii) C1-C4 are homogeneous across a plurality of the encoded polypeptides; (iii) each of X1-X3 is an independently variable subsequence consisting of 2-20 amino acids; and each of X1-X3 is heterogeneous across a plurality of the encoded polypeptides.
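The C1-X1-C2-X2-C3-X3-C4 layout defined above can be modelled as a small data structure. The following is only an illustrative sketch: the backbone strings are made-up placeholders (not SEQ ID NOs: 1-120), the loop lengths are arbitrary choices within the stated 2-20 range, and the names `Scaffold`, `random_loop`, and `sample_library` are hypothetical.

```python
import random
from dataclasses import dataclass

# Four-residue alphabet named in the text: alanine, aspartate, serine, tyrosine.
SUBSET = "ADSY"


@dataclass(frozen=True)
class Scaffold:
    """Backbone subsequences C1-C4 (placeholder strings, not real SEQ ID NOs)."""
    c1: str
    c2: str
    c3: str
    c4: str

    def chimera(self, x1: str, x2: str, x3: str) -> str:
        """Join the segments in the order C1-X1-C2-X2-C3-X3-C4."""
        for x in (x1, x2, x3):
            if not 2 <= len(x) <= 20:
                raise ValueError("each X subsequence must be 2-20 amino acids long")
        return self.c1 + x1 + self.c2 + x2 + self.c3 + x3 + self.c4


def random_loop(rng: random.Random, length: int, alphabet: str = SUBSET) -> str:
    """Draw one variable subsequence of the given length from the alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def sample_library(scaffold, loop_lengths, size, alphabet=SUBSET, seed=0):
    """Return `size` distinct chimeric sequences that share one backbone."""
    if size > len(alphabet) ** sum(loop_lengths):
        raise ValueError("requested size exceeds the number of possible variants")
    rng = random.Random(seed)
    members = set()
    while len(members) < size:
        x1, x2, x3 = (random_loop(rng, n, alphabet) for n in loop_lengths)
        members.add(scaffold.chimera(x1, x2, x3))
    return sorted(members)


# Hypothetical backbone; a library of at least ten different polypeptides.
example_scaffold = Scaffold("MKT", "GLV", "PNE", "RWQ")
example_library = sample_library(example_scaffold, (5, 4, 6), size=10)
```

Every member of `example_library` carries the same C1-C4 segments (homogeneous backbone) while the X1-X3 segments differ between members (heterogeneous loops), which is the property the library definition requires.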
Also described is a library of nucleic acids encoding at least ten different polypeptides, the amino acid sequence of each polypeptide comprising: C1-X1-C2-X2-C3-X3-C4, wherein: (i) subsequence C1 is selected from FIG. 2 or FIG. 4, subsequence C2 is selected from FIG. 2 or FIG. 4, subsequence C3 is selected from FIG. 2 or FIG. 4, subsequence C4 is selected from FIG. 2 or FIG. 4, and each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (ii) C1-C4 are homogeneous across a plurality of the encoded polypeptides; (iii) each of X1-X3 is an independently variable subsequence consisting of 2-20 amino acids; and each of X1-X3 is heterogeneous across a plurality of the encoded polypeptides.
Also described is a library of nucleic acids encoding at least ten different polypeptides, the amino acid sequence of each polypeptide comprising: C1-X1-C2-X2-C3-X3-C4, wherein: (i) subsequence C1 is selected from FIG. 3 or FIG. 5, subsequence C2 is selected from FIG. 3 or FIG. 5, subsequence C3 is selected from FIG. 3 or FIG. 5, subsequence C4 is selected from FIG. 3 XX, and each of C1-C4 comprises up to 30 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (ii) C1-C4 are homogeneous across a plurality of the encoded polypeptides; (iii) each of X1-X3 is an independently variable subsequence consisting of 2-20 amino acids; and each of X1-X3 is heterogeneous across a plurality of the encoded polypeptides.
In various embodiments: at least 1,000 different polypeptides are encoded; at least 100,000 different polypeptides are encoded; at least 1,000,000 different polypeptides are encoded; each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; none of C1-C4 comprises amino acid substitutions, deletions, insertions, or additions to the selected subsequence; amino acids of X1-X3 are selected from fewer than 20 amino acids genetically encoded in plants; amino acids of X1-X3 are selected from all 20 amino acids genetically encoded in plants; the fewer than 20 genetically encoded amino acids include at least one aliphatic amino acid, at least one acidic amino acid, at least one neutral amino acid, and at least one aromatic amino acid; the fewer than 20 genetically encoded amino acids comprise alanine, aspartate, serine, and tyrosine.
In some cases, the amino acid sequence of each polypeptide is selected from: (a) a polypeptide comprising C1-X1-C2-X2-C3-X3-C4 wherein C1 = SEQ ID NO: 1, C2 = SEQ ID NO: 31, C3 = SEQ ID NO: 61, and C4 = SEQ ID NO: 91; (b) a polypeptide comprising C1-X1-C2-X2-C3-X3-C4 wherein C1 = SEQ ID NO: 2, C2 = SEQ ID NO: 32, C3 = SEQ ID NO: 62, and C4 = SEQ ID NO: 92; and (c) a polypeptide comprising C1-X1-C2-X2-C3-X3-C4 wherein C1 = SEQ ID NO: 3, C2 = SEQ ID NO: 33, C3 = SEQ ID NO: 63, and C4 = SEQ ID NO: 93.
In some cases, each encoded polypeptide comprises C1-X1-C2-X2-C3-X3-C4, wherein C1 = SEQ ID NO: X1, C2 = SEQ ID NO: X2, C3 = SEQ ID NO: X3, and C4 = SEQ ID NO: X4; designated SEQ ID NO: 130.
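The library sizes listed above (at least 1,000, 100,000, or 1,000,000 different polypeptides) can be related to the number of randomized positions and the size of the amino acid subset with simple arithmetic. The sketch below assumes every variable position is drawn independently from the same alphabet, which the text does not require; the function names are hypothetical.

```python
def theoretical_diversity(alphabet_size: int, loop_lengths) -> int:
    """Number of distinct X1-X3 combinations when every variable position
    is drawn independently from an alphabet of the given size."""
    return alphabet_size ** sum(loop_lengths)


def min_positions(alphabet_size: int, target: int) -> int:
    """Smallest number of randomized positions whose theoretical diversity
    reaches `target` (integer arithmetic, no floating-point logarithms)."""
    positions, diversity = 0, 1
    while diversity < target:
        diversity *= alphabet_size
        positions += 1
    return positions


# Three loops of the minimum length (2 residues each) with a 4-residue subset.
print(theoretical_diversity(4, (2, 2, 2)))   # 4096
# Positions needed for one million variants with 4 versus 20 residues.
print(min_positions(4, 1_000_000))           # 10
print(min_positions(20, 1_000_000))          # 5
```

Under this assumption, even the shortest permitted loops with a four-residue subset comfortably exceed the "at least 1,000" threshold, and ten randomized positions suffice for the "at least 1,000,000" threshold.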
In some embodiments, each of the nucleic acids comprises a vector sequence.
Also described are an isolated nucleic acid selected from the library; an isolated cell expressing the nucleic acid; a library of purified polypeptides encoded by the library; and a population of filamentous phage displaying the polypeptides encoded by the library.
Described herein is a method of generating a library, comprising: (i) providing a parental nucleic acid encoding a parental polypeptide comprising the amino acid sequence: C1-X1-C2-X2-C3-X3-C4, wherein subsequence C1 is selected from SEQ ID NOs:1-30, subsequence C2 is selected from SEQ ID NOs:31-60, subsequence C3 is selected from SEQ ID NOs:61-90, subsequence C4 is selected from SEQ ID NOs:91-120; each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; and each of X1-X3 is an independent subsequence consisting of 2-20 amino acids; (ii) replicating the parental nucleic acid under conditions that introduce up to 10 single amino acid substitutions, deletions, insertions, or additions to the X1, X2, or X3 subsequences, whereby a population of randomly varied subsequences encoding X1', X2', or X3' is generated; and (iii) the population of randomly varied subsequences X1', X2', or X3' is substituted into a population of parental nucleic acids at the positions corresponding to those that encode X1, X2, or X3.
In various instances: at least one of the X1-X3 subsequences is selected from SEQ ID NOs:121-123; each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; none of C1-C4 comprises amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the replicating generates a heterogeneous population of randomly varied subsequences by introducing up to 5 amino acid substitutions in each of X1, X2, or X3; the method further comprises amplifying the library by introducing it into a biological replication system and proliferating the biological replication system; the biological replication system is a plurality of E. coli cells; the biological replication system is a plurality of bacteriophage; the replicating occurs in vitro; the replicating is performed with a purified mutagenic polymerase; the replicating is performed in the presence of a nucleotide analog; the replicating occurs in vivo; the replicating in vivo occurs in a mutagenic species of E. coli.
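The mutagenic replication step described above can be mimicked in silico. The following is a simplified sketch, not the disclosed wet-lab procedure: it works directly on amino acid sequences rather than on nucleic acids, introduces substitutions only (no deletions, insertions, or additions), and uses an arbitrary parental loop string. The names `mutate_loop`, `varied_population`, and `hamming` are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids


def mutate_loop(parent: str, max_changes: int, rng: random.Random) -> str:
    """Return a copy of `parent` carrying between 0 and `max_changes`
    single amino acid substitutions at distinct random positions."""
    n_changes = rng.randint(0, min(max_changes, len(parent)))
    positions = rng.sample(range(len(parent)), n_changes)
    residues = list(parent)
    for pos in positions:
        # Always pick a residue different from the current one.
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


def varied_population(parent: str, n_variants: int, max_changes: int = 10, seed: int = 0):
    """Generate a population of randomly varied subsequences X' from one parental X."""
    rng = random.Random(seed)
    return [mutate_loop(parent, max_changes, rng) for _ in range(n_variants)]


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Arbitrary 10-residue parental loop; at most 5 substitutions per variant.
parental_x1 = "GSTYADLKQW"
population = varied_population(parental_x1, n_variants=200, max_changes=5, seed=1)
```

Each variant in `population` would then replace the parental X1 segment in a copy of the parental sequence, corresponding to step (iii) of the method.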
Also described is a method of generating the library of claim 1, comprising: (i) selecting an amino acid sequence comprising the amino acid sequence C1-X1-C2-X2-C3-X3-C4 to be encoded, wherein: (a) subsequence C1 is selected from SEQ ID NOs:1-30, subsequence C2 is selected from SEQ ID NOs:31-60, subsequence C3 is selected from SEQ ID NOs:61-90, and subsequence C4 is selected from SEQ ID NOs:91-120; (b) each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (c) each of X1, X2, and X3 consists of an amino acid sequence 2-20 amino acids in length; (ii) providing a first plurality and a second plurality of oligonucleotides, wherein: (a) oligonucleotides of the first plurality encode the C1-C4 subsequences and multiple heterogeneous X1-X3 variant subsequences X1'-X3'; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding the C1-C4 subsequences and to nucleotide sequences encoding multiple heterogeneous X1'-X3' subsequences; and (c) the oligonucleotides of the first and second pluralities have overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.
In various instances: each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises from zero up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the method further comprises performing a cycle of steps, the cycle of steps comprising denaturing the library by increasing the temperature of the second mixture to a temperature effective for denaturing double stranded DNA, followed by steps (iv) and (v); the method comprises repeating the cycle of steps up to 100 times; the method further comprises amplifying the library by a polymerase chain reaction consisting essentially of the library, a forward primer, and a reverse primer, wherein the forward and reverse primers can hybridize to the 5' and 3' end sequences, respectively, of all nucleic acids in the library; the amino acid to be encoded in each position of the X1, X2, or X3 subsequences is selected from a subset of alanine, arginine, asparagine, aspartate, cysteine, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine; the amino acid selected for each single amino acid substitution is selected from a group of amino acids consisting of at least one aliphatic, at least one acidic, at least one neutral, and at least one aromatic amino acid; and the group of amino acids consists of alanine, aspartate, serine, and tyrosine.
Also described herein is a method of generating a library, comprising: (i) providing a parental nucleic acid encoding a parental polypeptide comprising the amino acid sequence: C1-X1-C2-X2-C3-X3-C4, wherein subsequence C1 is selected from FIG. 2 or FIG. 4, subsequence C2 is selected from FIG. 2 or FIG. 4, subsequence C3 is selected from FIG. 2 or FIG. 4, subsequence C4 is selected from FIG. 2 or FIG. 4; each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; and each of X1-X3 is an independent subsequence consisting of 2-20 amino acids; (ii) replicating the parental nucleic acid under conditions that introduce up to 10 single amino acid substitutions, deletions, insertions, or additions to the X1, X2, or X3 subsequences, whereby a population of randomly varied subsequences encoding X1', X2', or X3' is generated; and (iii) the population of randomly varied subsequences X1', X2', or X3' is substituted into a population of parental nucleic acids at the positions corresponding to those that encode X1, X2, or X3.
In various embodiments: at least one of the X1-X3 subsequences is selected from SEQ ID NOs:121-123; each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; none of C1-C4 comprises amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the replicating generates a heterogeneous population of randomly varied subsequences by introducing up to 5 amino acid substitutions in each of X1, X2, or X3; the method further comprises amplifying the library by introducing it into a biological replication system and proliferating the biological replication system; the biological replication system is a plurality of E. coli cells; the biological replication system is a plurality of bacteriophage; the replicating occurs in vitro; the replicating is performed with a purified mutagenic polymerase; the replicating is performed in the presence of a nucleotide analog; the replicating occurs in vivo; and the replicating in vivo occurs in a mutagenic species of E. coli.
Also described is a method of generating the library, comprising: (i) selecting an amino acid sequence comprising C1-X1-C2-X2-C3-X3-C4 to be encoded, wherein (a) subsequence C1 is selected from FIG. 2 or FIG. 4, subsequence C2 is selected from FIG. 2 or FIG. 4, subsequence C3 is selected from FIG. 2 or FIG. 4, and subsequence C4 is selected from FIG. 2 or FIG. 4; (b) each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; (c) each of X1, X2, and X3 consists of an amino acid sequence 2-20 amino acids in length; (ii) providing a first plurality and a second plurality of oligonucleotides, wherein (a) oligonucleotides of the first plurality encode the C1-C4 subsequences and multiple heterogeneous X1-X3 variant subsequences X1'-X3'; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding the C1-C4 subsequences and to nucleotide sequences encoding multiple heterogeneous X1'-X3' subsequences; and (c) the oligonucleotides of the first and second pluralities have overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.
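Steps (iii) to (v) above (combine, hybridize overlapping complementary oligonucleotides, elongate) can be illustrated for a single pair of oligonucleotides. This is a minimal sketch under stated assumptions: hybridization is reduced to exact string matching of the overlap, the minimum overlap of 8 nucleotides and the 30-nucleotide target are arbitrary, and the function names are hypothetical. Real annealing depends on temperature, salt, and sequence composition, none of which is modelled here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_length(top: str, bottom_as_top: str, min_overlap: int) -> int:
    """Length of the longest suffix of `top` that equals a prefix of
    `bottom_as_top`, or 0 if it is shorter than `min_overlap`."""
    for n in range(min(len(top), len(bottom_as_top)), min_overlap - 1, -1):
        if top.endswith(bottom_as_top[:n]):
            return n
    return 0


def extend_pair(top: str, bottom: str, min_overlap: int = 8) -> str:
    """Hybridize a first-set (top strand) oligo with a second-set (bottom
    strand) oligo through their 3' overlap and fill in both strands.
    Returns the top strand of the resulting duplex."""
    bottom_as_top = reverse_complement(bottom)
    n = overlap_length(top, bottom_as_top, min_overlap)
    if n == 0:
        raise ValueError("oligonucleotides do not share a sufficient overlap")
    return top + bottom_as_top[n:]


# Arbitrary 30-nt target split into two oligos that overlap by 8 nt.
target = "ATGGCTGATTCTTATGGTACCAAAGCTTGA"
top_oligo = target[:20]                          # first set: coding strand
bottom_oligo = reverse_complement(target[12:])   # second set: complementary strand
assembled = extend_pair(top_oligo, bottom_oligo)
```

In a library setting, the first and second pluralities would contain many oligonucleotides whose variable-region codons differ while the overlapping backbone regions stay constant, so the same pairing logic yields many distinct full-length products.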
In various cases: each of C1-C4 independently comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises from zero up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the method further comprises performing a cycle of steps, the cycle of steps comprising denaturing the library by increasing the temperature of the second mixture to a temperature effective for denaturing double stranded DNA, followed by steps (iv) and (v); the method further comprises repeating the cycle of steps up to 100 times; the method further comprises amplifying the library by a polymerase chain reaction consisting essentially of the library, a forward primer, and a reverse primer, wherein the forward and reverse primers can hybridize to the 5' and 3' end sequences, respectively, of all nucleic acids in the library; the amino acid to be encoded in each position of the X1, X2, or X3 subsequences is selected from a subset of alanine, arginine, asparagine, aspartate, cysteine, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine; the amino acid selected for each single amino acid substitution is selected from a group of amino acids consisting of at least one aliphatic, at least one acidic, at least one neutral, and at least one aromatic amino acid; and the group of amino acids consists of alanine, aspartate, serine, and tyrosine.
Also disclosed is a method of generating the library, comprising: (i) providing a parental nucleic acid encoding a parental polypeptide comprising the amino acid sequence: C1-X1-C2-X2-C3-X3-C4, wherein subsequence C1 is selected from FIG. 3 or FIG. 5, subsequence C2 is selected from FIG. 3 or FIG. 5, subsequence C3 is selected from FIG. 3 or FIG. 5, subsequence C4 is selected from FIG. 3 or FIG. 5; each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; and each of X1-X3 is an independent subsequence consisting of 2-20 amino acids; (ii) replicating the parental nucleic acid under conditions that introduce up to 10 single amino acid substitutions, deletions, insertions, or additions to the X1, X2, or X3 subsequences, whereby a population of randomly varied subsequences encoding X1', X2', or X3' is generated; and (iii) the population of randomly varied subsequences X1', X2', or X3' is substituted into a population of parental nucleic acids at the positions corresponding to those that encode X1, X2, or X3.
In various instances: at least one of the Xl-X3 subsequences is selected from SEQ ID NOs:121-123; each of Cl-C4 independently comprises up to 20 single amino 15 acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of Cl-C4 independently comprises up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; none of CI-C4 comprise amino acid substitutions, deletions, 20 insertions, or additions to the selected subsequence; the replicating generates a heterogeneous population of randomly varied subsequences by introducing up to 5 amino acid substitutions in each of X1, X2, or X3; the method further comprises amplifying the library by introducing it into a biological replication system and proliferating the biological replication system; the biological replication system is a 25 plurality of E. coli cells; the biological replication system is a plurality of bacteriophage; the replicating occurs in vitro; the replicating is performed with a purified mutagenic polymerase; the replicating is performed in the presence of a nucleotide analog; the replicating occurs in vivo; and the replicating in vivo occurs in a mutagenic species of E. coli. 30 Also described is a method of generating the library, comprising: (i) selecting an amino acid sequence comprising: C1-X1-C2-X2 C3 X3-C4 to be encoded, wherein (a) subsequence C1 is selected from FIG 3 or FIG 5, subsequence C2 is selected from FIG 3 or FIG 5, subsequence C3 is selected from FIG. 3 or FIG 5, and subsequence C4 is 12 WO 2007/095300 PCT/US2007/003937 selected from FIG. 
3 or FIG. 5; (b) each of C1-C4 comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; and (c) each of X1, X2, and X3 consists of an amino acid sequence 2-20 amino acids in length; (ii) providing a first plurality and a second plurality of oligonucleotides, wherein (a) oligonucleotides of the first plurality encode the C1-C4 subsequences and multiple heterogeneous X1-X3 variant subsequences X1'-X3'; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding the C1-C4 subsequences and to nucleotide sequences encoding multiple heterogeneous X1'-X3' subsequences; and (c) the oligonucleotides of the first and second pluralities have overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.
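By way of illustration only, the hybridizing and elongating of steps (iv) and (v) can be modeled in a few lines of Python. The sketch below is a simplified in-silico picture and not the disclosed laboratory procedure: the function names, the 12-nucleotide overlap, the flanking sequences, and the choice of one codon per amino acid (GCT, GAT, TCT, TAT for alanine, aspartate, serine, tyrosine) are all assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def extend_pair(sense, antisense, min_overlap=10):
    """Model annealing of two oligonucleotides through their 3' ends and elongation.

    Returns the sense strand of the resulting full-length duplex.
    """
    # Express the antisense oligonucleotide in sense-strand coordinates.
    template = reverse_complement(antisense)
    # Look for the longest suffix of `sense` that matches a prefix of `template`.
    for k in range(min(len(sense), len(template)), min_overlap - 1, -1):
        if sense[-k:] == template[:k]:
            return sense + template[k:]
    raise ValueError("oligonucleotides share no overlap of the required length")


# Illustrative (hypothetical) sequences: a constant upstream region, one variable
# codon, a shared overlap, and a constant downstream region.
UPSTREAM = "ATGGCTGGT"
OVERLAP = "GTTCTGGGTGGT"
DOWNSTREAM = "GTTGAACCG"
VARIANT_CODONS = ["GCT", "GAT", "TCT", "TAT"]  # Ala, Asp, Ser, Tyr

# First plurality: sense oligonucleotides carrying the heterogeneous codon.
sense_oligos = [UPSTREAM + codon + OVERLAP for codon in VARIANT_CODONS]
# Second plurality: an antisense oligonucleotide overlapping the first plurality.
antisense_oligo = reverse_complement(OVERLAP + DOWNSTREAM)

library = [extend_pair(s, antisense_oligo) for s in sense_oligos]
for member in library:
    print(member)
```

Each product differs only at the variable codon, while the flanking regions are identical across the resulting population.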
In various embodiments: each of C1-C4 comprises up to 20 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises up to 10 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; each of C1-C4 independently comprises from zero up to 5 single amino acid substitutions, deletions, insertions, or additions to the selected subsequence; the method further comprises performing a cycle of steps, the cycle comprising denaturing the library by increasing the temperature of the second mixture to a temperature effective for denaturing double stranded DNA, followed by steps (iv) and (v); the method further comprises repeating the cycle up to 100 times; the method further comprises amplifying the library by a polymerase chain reaction consisting essentially of the library, a forward primer, and a reverse primer, wherein the forward and reverse primers can hybridize to the 5' and 3' end sequences, respectively, of all nucleic acids in the library; the amino acid to be encoded in each position of the X1, X2, or X3 subsequences is selected from a subset of alanine, arginine, asparagine, aspartate, cysteine, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine; the amino acid selected for each single amino acid substitution is selected from a group of amino acids consisting of at least one aliphatic, one acidic, one neutral, and one aromatic amino acid; and the group of amino acids consists of alanine, aspartate, serine, and tyrosine.
Also described is a library of nucleic acids encoding at least ten different polypeptides, wherein: (i) the amino acid sequence of each of the encoded polypeptides comprises an amino acid sequence at least 70% identical to any of SEQ ID NOs: 127-129; (ii) the amino acid sequence of each of the encoded polypeptides includes amino acids that differ from those of SEQ ID NOs: 127-129 at positions 14, 15, 33, 35-36, 38, 47-48, 66, 68-69, 71, 80, 81, 99, 101-102, and 104, and the amino acid differences are heterogeneous across a plurality of the encoded polypeptides; and (iii) the amino acid sequence of each of the encoded polypeptides outside of the residues corresponding to positions 14, 15, 33, 35-36, 38, 47-48, 66, 68-69, 71, 80, 81, 99, 101-102, and 104 of SEQ ID NOs: 127-129 is homogeneous across a plurality of the encoded polypeptides.
In various embodiments: the amino acid sequence of the polypeptides has at least 75% identity to any of SEQ ID NOs: 127-129; the amino acid sequence of the polypeptides has at least 80% identity to any of SEQ ID NOs: 127-129; the amino acid sequence of the polypeptides has at least 85% identity to any of SEQ ID NOs: 127-129; and each of the nucleic acids comprises a vector sequence.
Also disclosed: an isolated nucleic acid encoding a polypeptide, selected from the library; a purified polypeptide encoded by the nucleic acid; a population of cells expressing the polypeptides encoded by the library; a cell selected from the population of cells; a purified library of polypeptides encoded by the library; and a population of filamentous phage displaying the library of polypeptides encoded by the library.
Also disclosed is a method of generating the library, comprising: (i) selecting an amino acid sequence corresponding to any one of SEQ ID NOs: 127-129 to be encoded, wherein the selected sequence differs from those of SEQ ID NOs: 127-129 in at least one of variable positions 14, 15, 33, 35-36, 38, 47-48, 66, 68-69, 71, 80, 81, 99, 101-102, and 104; (ii) chemically providing a first and a second plurality of oligonucleotides, wherein (a) oligonucleotides of the first plurality encode amino acid subsequences of the selected amino acid sequence, the subsequences being heterogeneous at the encoded variable positions; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding subsequences of the selected amino acid sequence, the subsequences being heterogeneous at the encoded variable positions; and (c) the first and second pluralities comprise oligonucleotides that have overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.
In various instances: the method further comprises performing a cycle of denaturing the library by increasing the temperature of the second mixture to a temperature effective for denaturing double stranded DNA, followed by steps (iv) and (v); the method further comprises repeating the cycle up to 100 times; the method further comprises amplifying the library by a polymerase chain reaction consisting essentially of the library, a forward primer, and a reverse primer, wherein the forward and reverse primers can hybridize to the 5' and 3' end sequences, respectively, of all nucleic acids in the library; the amino acids to be encoded for the variable positions are selected from a subset of alanine, arginine, asparagine, aspartate, cysteine, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine; the amino acids selected for the variable positions are selected from a group consisting of an aliphatic, an acidic, a neutral, and an aromatic amino acid; and the group of amino acids consists of alanine, aspartate, serine, and tyrosine.
The details of one or more embodiments of the invention are set forth in the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic representation depicting the generation of a library of nucleic acids encoding chimeric binding polypeptides by diversifying subsequences within an encoded polypeptide scaffold sequence. The encoded scaffold polypeptide sequence is SEQ ID NO: 124. The encoded chimeric binding polypeptides of the library are SEQ ID NOs: 844, 845 and 846, respectively (i.e., from top to bottom).
FIG. 2 is an alignment of the sequences of a number of proteins that have regions which can be used as a scaffold. These proteins are homologous to oryzacystatin.
The C1, C2, C3 and C4 are boxed and labeled. The sequences shown are SEQ ID NO: 132 (i.e., homologous sequences Q2V816_CUCMA_1441/1-28 - Q2V814_CUCMO_734/1-28); SEQ ID NO: 133 (i.e., Q2V8H9_LAGLE_431/1-28); SEQ ID NO: 134 (i.e., Q6DKU9_CUCMA_1441/1-28 and Q6DLC8_CUCMA_1441/1-28); SEQ ID NO: 135 (i.e., 080389_CUCSA_795/1-89); SEQ ID NOs: 136-150 (i.e., QIRVW3_MEDTR_2578/1-54 - Q8GZV2_CHEMJ_340/1-38); SEQ ID NO: 130 (i.e., Reference/1-102); and SEQ ID NOs: 151-198 and 200-330 (i.e., CYT IORYSA_ 1097/1-88 to end).
FIG. 3 is an alignment of the sequences of a number of proteins that have regions which can be used as a scaffold. These proteins are homologous to C2. The C1, C2, C3 and C4 are boxed and labeled. Sheets 1-3 show SEQ ID NOs: 331-367 (i.e., Q9M366ARATH_43120/1-78 - Q9FJG3_ARATH_325405/1-81); SEQ ID NO: 130 (i.e., Reference/1-156); and SEQ ID NOs: 368-384 (i.e., ERGI _ORYSA_795/1-89 - Q4JH18_CUCMA_692/1-87). Sheets 4-6, 7-9, 10-12, 13-15, 16-18, 19-21, 22-24, and 25-27 show SEQ ID NOs: 385-827.
FIG. 4 is an alignment of the sequences of a number of proteins that have regions which can be used as a scaffold. The sequences shown are SEQ ID NO: 130 (i.e., oryza full) and SEQ ID NOs: 828-838. These proteins are homologous to oryzacystatin. The C1, C2, C3 and C4 are boxed and labeled.
FIG. 5 is an alignment of the sequences of a number of proteins that have regions which can be used as a scaffold. The sequences shown are, from top to bottom, SEQ ID NO: 131 and SEQ ID NOs: 839-843. These proteins are homologous to C2. The C1, C2, C3 and C4 are boxed and labeled.
DETAILED DESCRIPTION
Diverse libraries of nucleic acids (e.g., cDNA libraries) encoding plant chimeric binding polypeptides, as well as methods for generating them, are described below.
The amino acid sequences of the library of encoded plant chimeric binding proteins are derived from a scaffold polypeptide sequence that includes subsequences to be varied. The varied subsequences correspond to putative binding domains of the plant chimeric binding proteins, and are highly heterogeneous in the library of plant chimeric binding proteins. In contrast, the sequence of the encoded chimeric binding proteins outside of the varied subsequences is essentially the same as the parent scaffold polypeptide sequence and highly homogeneous throughout the library of encoded plant chimeric binding proteins. Thus, libraries of plant chimeric binding proteins can serve as a universal molecular recognition library platform for selection of specialized binding proteins for expression in transgenic plants. Libraries of plant chimeric binding proteins can be expressed by transfected cells (i.e., as expression libraries) and tested for interaction with a molecular target.
I. Plant Scaffold Polypeptide Sequences
A plant scaffold polypeptide sequence is an amino acid sequence based on a plant protein that is structurally tolerant of extreme sequence variation within one or more regions. The regions to be varied within the scaffold polypeptide sequence are conceptually analogous to the hypervariable regions of immunoglobulins, and form putative binding domains in a chimeric binding polypeptide. Thus, a large library of nucleic acid sequences encoding diverse plant chimeric binding polypeptides is produced by diversifying specific sequences within a scaffold polypeptide sequence, as is described in detail below.
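The relationship between a constant backbone and diversified binding subsequences can be pictured with a short Python sketch. It is offered only as an illustration under stated assumptions: the `Scaffold` class, the `random_library` helper, and the toy backbone strings are hypothetical and are not sequences from the figures; only substitution is modeled (loop lengths are held fixed), and the four-letter alphabet is the alanine, aspartate, serine, tyrosine group mentioned in the summary.

```python
import random
from dataclasses import dataclass

# Restricted substitution alphabet: alanine, aspartate, serine, tyrosine.
ALPHABET = "ADSY"


@dataclass(frozen=True)
class Scaffold:
    backbone: tuple  # (C1, C2, C3, C4), held constant across the library
    loops: tuple     # parental (X1, X2, X3), replaced in each library member

    def assemble(self, loops=None):
        """Concatenate C1-X1-C2-X2-C3-X3-C4, using the parental loops by default."""
        x1, x2, x3 = loops if loops is not None else self.loops
        c1, c2, c3, c4 = self.backbone
        return c1 + x1 + c2 + x2 + c3 + x3 + c4


def random_library(scaffold, size, rng):
    """Build `size` members whose loops are redrawn from ALPHABET (lengths unchanged)."""
    members = []
    for _ in range(size):
        loops = tuple(
            "".join(rng.choice(ALPHABET) for _ in parent) for parent in scaffold.loops
        )
        members.append(scaffold.assemble(loops))
    return members


# Hypothetical toy scaffold; these strings are placeholders for illustration.
TOY = Scaffold(
    backbone=("MKTLV", "GWEQR", "HINKL", "PFTVE"),
    loops=("GGGG", "GGGGG", "GGG"),
)
LIBRARY = random_library(TOY, size=20, rng=random.Random(0))
for member in LIBRARY[:3]:
    print(member)
```

Every member printed shares the same backbone segments and differs only within the three loop windows, which mirrors the heterogeneous-loop, homogeneous-backbone design described above.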
Plant scaffold polypeptide sequences are selected to have a number of properties, e.g., they: (i) are derived from sequences that are of plant origin; (ii) encode proteins that structurally tolerate the introduction of sequence diversity; (iii) only contain disulfide bonds that do not interfere with folding of the polypeptide when expressed in a plant; (iv) express at high levels in diverse plant tissues; and (v) can be targeted to different subcellular locations (e.g., cytoplasm, mitochondria, plastid) or secreted from the cell. Based on these properties, plant scaffold polypeptide sequences permit the generation of large libraries of chimeric binding polypeptides with highly diverse binding activities. Libraries of chimeric binding polypeptides can be screened for binding to a target molecule. Chimeric binding proteins having the desired binding activity can subsequently be expressed in plants to confer input traits (e.g., pest or pathogen resistance, drought tolerance) or output traits (e.g., modified lipid composition, heavy metal binding for phytoremediation, medicinal uses). Such binding proteins can also be used in various affinity-based applications, e.g., diagnostic detection of an antigen using a sandwich ELISA; histochemical detection of antigens; generation of protein biochips; and affinity purification of antigens.
It is helpful to select the scaffold polypeptide sequence based on the sequence of a plant protein or protein domain of known three dimensional structure (see, e.g., Nygren et al. (2004) "Binding Proteins from Alternative Scaffolds," J. of Immun. Methods 290:3-28). However, even without experimentally determined structural data for a potential scaffold polypeptide sequence, valuable inferences can be gleaned from computational structural analysis of a candidate amino acid sequence.
Useful programs for structure prediction from an amino acid sequence include, e.g., the "SCRATCH Protein Predictor" suite of programs available to the public on the world wide web at ics.uci.edu/~baldig/scratch/index. It is important that introduction of sequence variation not destabilize the known or predicted secondary structure of the scaffold polypeptide sequence. Accordingly, the known or predicted secondary structure of the scaffold polypeptide sequence informs the selection of amino acid subsequences that can be varied within a scaffold polypeptide sequence to form putative binding domains. The structural adequacy of a particular scaffold polypeptide sequence can be readily tested, e.g., by phage display expression analysis methods that are commonly known in the art. For example, a scaffold polypeptide sequence containing 0, 1, 2, 3, or more disulfide bonds can be tested for its ability to fold into a stable protein. Since proteins that do not fold properly will not be incorporated into a phage coat, they will not be displayed. Thus, without undue effort, many candidate scaffold polypeptide sequences can be rapidly screened for their ability to fold into stable proteins once expressed.
The plant scaffold polypeptide sequences can be based on the accessory domain from purple acid phosphatases (PAPs). The crystal structure of the PAP accessory domain of kidney bean, Phaseolus vulgaris, has been determined (Strater et al. (1995), Science 268(5216):1489-1492). Three exposed loops within the protein are reminiscent of the hypervariable domains found in immunoglobulins. The loops are brought together by the rigid anti-parallel β-sheet framework of the protein. The subsequences that form each loop form the putative binding domains of a chimeric binding protein derived from a PAP. These subsequences are diversified by substituting, deleting, inserting, or adding up to 10 (e.g., up to 3, 4, 6, 8) amino acids.
The loops that form the putative binding domains are particularly well suited to binding target molecules containing pockets or clefts. PAP-based scaffold polypeptide sequences take the general form:
C1-X1-C2-X2-C3-X3-C4
where C1, C2, C3, and C4 correspond to "backbone" subsequences which can include some introduced variation, but are not highly diversified. On the other hand, X1, X2, and X3 correspond to highly varied subsequences that form the putative binding domains of each PAP-based chimeric binding protein. Table 1 shows a list of suitable C1-C4 backbone subsequences derived from the amino acid sequences of 30 PAPs. C1, C2, C3, and C4 correspond to SEQ ID NOs: 1-30, 31-60, 61-90, and 91-120, respectively, in Table 1.
X1, X2, and X3 can be based on naturally occurring variants of corresponding PAP sequences, e.g., those shown in Table 2 as SEQ ID NOs: 121-123. Table 2 shows the range of variation at each amino acid position in subsequences corresponding, respectively, to X1, X2, and X3, within 30 naturally occurring PAP sequences. Alternatively, the parent variable subsequences, X1-X
3 , can be arbitrary sequences 2-20 amino acids in length. In some implementations, C 1 , C 2 , C 3 , and C 4 of a scaffold polypeptide sequence can be selected from multiple PAP-based scaffold polypeptide sequence sequences listed in Table 1, in any combination, e.g., CI(SEQ ID NO:5), C2(SEQ ID NO:12), C3(SEQ ID NO:7), 10 and C4(SEQ IDNO:19); CI(SEQ ID NO:5), C2(SEQ ID NO:12), C3(SEQ ID NO:S), and C4(SEQ ID NO:12); C4(SEQ ID NO:22); CI(SEQ ID NO:17), C2(SEQ ID NO:17), C3(SEQ ID NO:19), and C4(SEQ IDNO:!), and so forth. 19 WO 2007/095300 PCT/US2007/003937 Table 1:. SPSs -Based on the Accessor Domain of PA;s Seq D C Seq13)C 2 I PQQV11ITQGDI{VGKAVIVSWVT 31 VVVYWS ENS KYKKSAEGTVTT 2 PQQV1*ITQGDLVGKAVIVSWVT 32 EVHYWSliNSDKKKIAEGI(LV'I 3 PQQVHITQGDLVGRAt.IIISWVT 33 AVRYWSEKNGRKRIAKGKMST 4 PQQVHITQGDLVGXAVIVSWVT 34 EV14YWSEN'SDKKKIAEGKLVT 5 PQQV1IIQGD{VGKAVIVSWVT 35 AVPYWSKNSKQKRLAKGKIVT 6 PQQVHITQGDHVGYAMIVSWVT 36 KVV-YWS ENSQHKKVALKGN IRT 7 PQQVHITQGDHVGXM.IVSWVT 37 KVVYWSENSQHKKVARGNIRT 8 VIQGHGTVVWV 38 TVLYWS EKSKQKNTAKGKVTT 9 PQQVIIQGDLVGQAMIISWVT 39 QVI:YWSDSSLQNFTAEGEVUFT 10 PQQVHITQGDLiVGQAMIISWVT 40 QVIYWSDSSLQNFTAEGEVFT 11 PQQVHITQGDHVGXKANIVSWVT 41 TLYWSNNSKQIO'KATGAVrT 12 PQQVHITQGDLEGPAN II SWVR 42 KThYWIDGSNQKHSAINGYK:TY 13 PQQV1iI[TQGDHVGX(AVIVSWVT 43 TVVYWSEKSKLKNKANGKVTT 14 PQQVHITQGDHVGQAMIISWVT 44 EVIYWSNSSLQNFTAEGEVF'T 15 PQQVYITQGDHEGKGVIASWTT 45 SVLYWAENSNVKSSAEGFVVS 16 PQQVHITQGDYEGKGVIISWVT 46 TV~IESVKPDVV 17 PQQVIIQGDLVGPAMIISWVT 47 AVRYWSEKNGRKRIAKGKMST 18 PQQV1ILTQGDHVGKGVIVSWT 48 KVLYWEFNSKIKQIAKGTVST 19 PQQVHITQGDVEGXAVIVSWVT 49 KVIYWKENSTKK4KHG3KTWT 20 PQQVHVTQGNHEGNGVIISWVT- -50 TVRYWCENKKSRKQAFATV14T 21 PQQVIVTQGNUEGNG3VIISWVT 51 TVQYWCENEKSRKQAEATVNT 22 PQQV141TQGDYDGXAVIVSWVT 52 KVQFGTSENKFQTSAEGTVSN 23 PQQV141TQGDHEGIASI IVSWIT 53 TVFYGTSENKLDQHAEGTVtM 24 PQQVHilTGDQTGTAMVSWVT 54 TVRYGSSPEKLDRAAEGSH-TR 25 PQQVHI TQGDYDGKAVIVSWVT 55 EV"GTSP1SYDHSAQGKTTN 26 PQQVITQDYDG3KVIISWVT 56 HIQYGTSENKFQTSEEGTVTN 27 PQQVHXTQGDYDGEAVIISWVT 57 
EVRYGLSEGKYDVTVEGThNN 28 PQQV141TQGDYDG(AVIISWVT 58 QVHYGAVQGKYEFVAQGTYMN 29 PQQVHI:TQGDYDGI(AVIISWVT 59 QVHYGAVQGKYEFVAQGYHN 30 PQQVHiITQGDYNGKAVIVSWVT 60 EVLYGKNEHQYDQRVEGTVTN 20 WO 2007/095300 PCT/US2007/003937 Table I continued Seq ID C 3 Seq ID C 4 61 YIHHCYIKGLEYDTKYYYV 91 SREFWFR 62 FIHHTTIRNLEYKTKYYYE 92 TRQFWFV 63 FIHHTTIRKLKYNTKYYYE 93 TRRFSFI 64 FIHHTTIRNLEYKTKYYYE 94 TRQFWFV 65 FIHHTTIRNLEYNTKYYYE 95 TRQFWFV 66 YIHHCTIRNLEYNTKYYYE 96 TRSFWFT 67 YIHHCTIRNLEYNTKYYYE 97 TRSFWFT 68 YIHHSTIRHLEFNTKYYYK 98 ARTFWFV 69 FIHHTTITNLEFDTTYYYE 99 TRQFWFI 70 FIHHTTITNLEFDTTYYYE 100 TRQFWFI 71 YIHHCIIKHLKFNTYYYE 101 PRTFWFV 72 FIHHCTIRRLKHNTKYHYE 102 VRSFWFM 73 YIHHCNIKNLKFDTKYYYK 103 ARTFWFT 74 FIHHTNITNLEFNTTYFYV 104 TRQFWFI 75 YIHHCTIKDLEFDTKYYYE 105 TRKFWFV 76 YIHHCTIKDLEYDTKYYYE 106 KRQFWFV 77 YIHHCTIKNLEYNTKYFYE 107 TRQFWFT 78 YIHHCTIQNLKYNTKYYYM 108 RRTFWFV 79 FIHHCPIRNLEYDTKYYYV 109 ERKFWFF 80 YIHHCLIDDLEFDTKYYYE 110 SRRFWFF 81 YIHHCLIDDLEFDTKYYYE 111 SRRFWFF 82 YVHHCLIEGLEYKTKYYYR 112 SREFWFE 83 YIHHCVLTDLKYDRKYFYK 113 ARLFWFK 84 FIHHCTLTGLTHATKYYYA 114 VRTFSFT 85 YIHCLLDKLEYDTKYYYK 115 AREFWFH 86 YIHHCLIEGLEYETKYYYR 116 SREFWFK 87 YIHQCLVTGLQYDTKYYYE 117 ARKFWFE 88 FIHHCLVSDLEHDTKYYYK 11g SREFWFV 89 FIHHCLVSDLEHDTKYYYK 119 SREFWFV 90 YIHHCLVDGLEYNTKYYYK 120 AREFWFE 21 WO 2007/095300 PCT/US2007/003937 Table 2: Naturally Occurring Residue Variation in PAP Subsequences Corresponding to X 1 , X 2 , and X 3 (SEQ ID NOs:121-123) X, X2 X3 (SEQ ID NO: 121) (SEQ ID NO: 122) (SEQ ID NO: 123) Position Position Position a b c d e f g a b c d e f g h i a b c d e f M D E P G S S Y K Y Y N Y T S G V G L R N T V E A K P N R F F T S P I E I G H E N K L K K T H K N L V E D P V D T F D K M E D Q Q s H E E T K T I T S s A A E F F K 5 After diversification of the above-listed subsequences of the scaffold polypeptide sequence, the diversified Xj', X 2 ', and X 3 ' subsequences are highly heterogeneous within the library of encoded plant chimeric binding polypeptides, and can each 
contain up to 10 (e.g., 8, 6, 4, 3) single amino acid substitutions, deletions, insertions, or additions with respect to SEQ ID NOs: 121-123 listed in Table 2, respectively (see, e.g., FIG. 1). For example, the length of the amino acid sequences corresponding to regions X1, X2, or X3 can be unaltered, shortened, or lengthened relative to SEQ ID NOs: 121-123.
The regions outside of the putative binding domains are referred to as "backbone" regions (i.e., C1, C2, C3, and C4). Unlike the amino acid sequences for X1, X2, and X3, the amino acid sequences of the backbone regions are generally not substantially diversified within the library of encoded chimeric binding proteins, although some sequence variation in these regions within the library is permissible. The backbone regions of a plant scaffold polypeptide sequence can be at least 70% (e.g., 80, 85, 90, 95, 98, or 100%) identical to any of SEQ ID NOs: 1-120. Alternatively, the backbone regions can contain up to 30 (e.g., 28, 26, 24, 22, 20, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1) single amino acid substitutions, deletions, insertions or additions. For example, C1, C2, C3, and C4 can each include 0, 1, 2, 3, 4, 5, or more single amino acid changes. If amino acid substitutions are to be introduced into the backbone regions, it is preferable to make conservative substitutions. A conservative substitution is one that replaces an amino acid with one that has similar chemical properties (e.g., substitution of a polar amino acid such as serine with another polar amino acid such as threonine).
In one embodiment, the plant scaffold polypeptide sequence is one of SEQ ID NOs: 124-126 shown below. Sequences corresponding to X1, X2, and X3 are in bold and underlined.
SEQ ID NO: 124
PQQVHITQGDHVGKAVIVSWVTMDEPGSSVVVYWSENSKYKKSAEGTVTTYRFYNYTSGYIHHCYIKGLEYDTKYYYVVGIGNTSREFWFR
SEQ ID NO: 125
PQQVHITQGDLVGKAVIVSWVTVDEPGSSEVHYWSENSDKKKIAEGKLVTYRFFNYSSGFIHHTTIRNLEYKTKYYYEVGLGNTTRQFWFV
SEQ ID NO: 126
PQQVHITQGDLVGRAMIISWVTMDEPGSSAVRYWSEKNGRKRIAKGKMSTYRFFNYSSGFIHHTTIRKLKYNTKYYYEVGLRNTTRRFSFI
In other embodiments, a plant scaffold polypeptide sequence is based on the amino acid sequence of plant proteins that have ankyrin-like repeats. Ankyrin-like repeats are small turn-helix-helix (THH) repeats consisting of approximately 33 amino acids. The number of THH repeats within a scaffold polypeptide sequence can vary from 2 to 20. The putative binding sites within the THH repeats are typically non-contiguous, but clustered on the same side of the protein of which they are a part.
A plant THH repeat-containing scaffold polypeptide sequence can have an amino acid sequence that is based on any of SEQ ID NOs: 127-129 listed below. High levels of amino acid sequence variation are introduced at the bolded/underlined residues. The plant THH repeat-containing scaffold polypeptide sequences can contain substitutions of up to 3 amino acids or a deletion in the place of the amino acids corresponding to residues 12-13, 33, 35-36, 38, 46-47, 66, 68-69, 71, 79-80, 99, 101-102, 104, and 112-113 (residues in bold and underlined) of SEQ ID NOs: 127-129.
SEQ ID NO: 127
GDDLGKKLHLAASRGHLEIVRVLVEAGADVNALDKFGRTALHIAASRGHLEVVKLLLEAGADVNALDKFGRTALHLAASRGHLEVVKLLLEAGADVNALDKFGDTALHVSIDNGNEDIAEILQ
SEQ ID NO: 128
GDDLGKKLHLAASRGHLEIVRVLVEAGADVNALDKFGRTPLHIAASKGNEQVVKLLLEAGADPNALDKFGRTPLHIAASKGNEQVVKLLLEAGADPNAQDFGDTALHVSIDNGNEDIAEILQ
SEQ ID NO: 129
GSDLGKKLLEAARGQDDEVRILMANGADVNALDKFGRTPLHIAASKGNEQVVKLLLEAGADPNALDKFGRTPLHIAASKGNEQVVKLLLEAGADPNAQDKFGKTAFDISIDNGNEDLAEILQ
The sequence of the scaffold polypeptide sequences can be at least 70% (e.g., 80, 85, 90, 95, 98, or 100%) identical to the sequence outside of the foregoing amino acid positions (in bold) of SEQ ID NOs: 127-129. Alternatively, the sequence of the scaffold polypeptide sequences outside of the foregoing amino acid positions (in bold) of SEQ ID NOs: 127-129 can contain up to 30 (e.g., 28, 26, 24, 22, 20, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1) single amino acid substitutions, deletions, insertions or additions.
In some cases it can be desirable to include additional repeating units. SEQ ID NOs: 127-129 have an amino-terminal cap, two internal repeats and a carboxy-terminal cap. It might be desirable to have 1-6 internal repeats. The amino-terminal cap sequence is aa 1-33. The first internal repeat is aa 34-66 and the second internal repeat is aa 67-99. The carboxy-terminal cap sequence is aa 100-123. The first or the second internal repeats or both can be independently repeated 1, 2, 3, 4, 5 or 6 times.
The putative binding sites are formed by amino acid side chains protruding from the rigid secondary structure formed by the scaffold polypeptide sequence. These proteins may typically form a larger, flatter binding surface and are particularly useful for binding to targets that do not have deep clefts or pockets.
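The module boundaries given above (cap, two internal repeats, cap) lend themselves to a small worked example. The following Python sketch splits SEQ ID NO: 127, as transcribed in this text, at the stated coordinates and rebuilds a scaffold with extra copies of an internal repeat. The function names are hypothetical, and the reading of "repeated 1 to 6 times" as the total number of copies of each repeat is an assumption of the sketch.

```python
# SEQ ID NO: 127 as transcribed in this document (123 residues).
SEQ_ID_NO_127 = (
    "GDDLGKKLHLAASRGHLEIVRVLVEAGADVNALDKFGRTALHIAASRGHLEV"
    "VKLLLEAGADVNALDKFGRTALHLAASRGHLEVVKLLLEAGADVNALDKFG"
    "DTALHVSIDNGNEDIAEILQ"
)


def split_modules(seq):
    """Split a THH scaffold at the 1-based coordinates 1-33, 34-66, 67-99, 100-123."""
    return {
        "n_cap": seq[0:33],
        "repeat1": seq[33:66],
        "repeat2": seq[66:99],
        "c_cap": seq[99:123],
    }


def extend_repeats(seq, copies1=1, copies2=1):
    """Rebuild the scaffold with the given total copy number of each internal repeat."""
    for copies in (copies1, copies2):
        if not 1 <= copies <= 6:
            raise ValueError("each internal repeat may be present 1 to 6 times")
    m = split_modules(seq)
    return m["n_cap"] + m["repeat1"] * copies1 + m["repeat2"] * copies2 + m["c_cap"]


modules = split_modules(SEQ_ID_NO_127)
for name, segment in modules.items():
    print(f"{name:8s} {len(segment):3d} aa  {segment}")

# A scaffold with three internal repeats in total (first repeat duplicated).
longer = extend_repeats(SEQ_ID_NO_127, copies1=2, copies2=1)
print(len(longer))
```

With one copy of each repeat the function returns the parental sequence unchanged; each additional copy adds one 33-residue module.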
Another suitable scaffold can be based on oryzacystatin (J Biol Chem 262:16793 (1987); Biochemistry 39:14753 (2000)), a member of the cystatin/Papain Family (Pfam Identifier PF00031) that is identified as a cysteine proteinase inhibitor of rice. The sequence of oryzacystatin is depicted below. A scaffold having the amino acid sequence C1-X1-C2-X2-C3-X3-C4, where each of X1, X2 and X3 is a variable region and C1, C2, C3 and C4 are the backbone regions, can be created based on oryzacystatin.
MSSVGGPVLGGVEPVGNENDLHLVDLARFAVTEHNKKANSLLEFEKLVSVKQQVVAGTLYYFTLEVKEGDAKKLYEAKVWEKPWMDFKELQEFKPVDASANA (SEQ ID NO: 130)
C1 - MSS (aa 1-3 of SEQ ID NO: 130)
X1 - VGGP (aa 4-7 of SEQ ID NO: 130)
C2 - VLGGVEPVGNENDLHLVDLARFAVTEHNKKANSLLEFEKLVSV (aa 8-50 of SEQ ID NO: 130)
X2 - KQQVVAGT (aa 51-58 of SEQ ID NO: 130)
C3 - LYYFTLEVKEGDAKKLYEAKVWE (aa 59-81 of SEQ ID NO: 130)
X3 - KPWM (aa 82-85 of SEQ ID NO: 130)
C4 - DFKELQEFKPVDASANA (aa 86-102 of SEQ ID NO: 130)
FIG. 2 depicts the sequences of a large number of plant proteins aligned with oryzacystatin. Examples of suitable C1-C4 regions are indicated. FIG. 4 depicts the sequences of a small number of plant proteins aligned with oryzacystatin. Examples of suitable C1-C4 regions are indicated. In general, X1 can be a sequence of 2-20 random amino acids (e.g., 3 amino acids). X2 can be a sequence of 2-20 random amino acids (e.g., 4 amino acids). X3 can be a sequence of 2-20 random amino acids (e.g., 4 amino acids).
Yet another suitable scaffold can be based on the C2 protein of rice (Biochemistry 42:11625 (2003)), a member of the C2 domain family (Pfam Identifier PF00168) that is thought to be involved in plant defense signaling systems. The sequence of rice C2 is depicted below. A scaffold having the amino acid sequence C1-X1-C2-X2-C3-X3-C4, where each of X1, X2 and X3 is a variable region and C1, C2, C3 and C4 are the backbone regions, can be created based on rice C2.
MAGSGVLEVHLVDAKGLTGNDFLGKIDPYVVVQYRSQERKSSVARDQGKNPSWNEVFKFQINSTAATGQHKLFLRLMDHDTFSRDDFLGEATINVTDLISLGMEHGTWEMSESKHRVVLADKTYHGEIRVSLTFTASAKAQDHAEQVGGWAHSFRQ (SEQ ID NO: 131)
C1 - MAGSGVLEVHLVDAKG (aa 1-16 of SEQ ID NO: 131)
X1 - LTGNDFLGKID (aa 17-27 of SEQ ID NO: 131)
C2 - PYVVVQYRSQERK (aa 28-40 of SEQ ID NO: 131)
X2 - SSVARDQGKNP (aa 41-51 of SEQ ID NO: 131)
C3 - SWNEVFKFQINSTAATGQHKLFLRL (aa 52-76 of SEQ ID NO: 131)
X3 - MDHDTFSRDDFL (aa 77-88 of SEQ ID NO: 131)
C4 - GEATINVTDLISLGMEHGTWEMSESKHRVVLADKTYHGEIRVSLTFTASAKAQDHAEQVGGWAHSFRQ (aa 89-156 of SEQ ID NO: 131)
FIG. 3 depicts the sequences of a large number of plant proteins aligned with rice C2. Examples of suitable C1-C4 regions are indicated. FIG. 5 depicts the sequences of a small number of plant proteins aligned with rice C2. Examples of suitable C1-C4 regions are indicated. In general, X1 can be a sequence of 2-20 random amino acids (e.g., 11 amino acids). X2 can be a sequence of 2-20 random amino acids (e.g., 11 amino acids). X3 can be a sequence of 2-20 random amino acids (e.g., 12 amino acids).
The following sections disclose methods for generating libraries of nucleic acids encoding chimeric binding proteins based on plant scaffold polypeptide sequences.
II. Generation of Nucleic Acid Libraries Based on a Plant Scaffold Polypeptide Sequence
A large library of nucleic acid sequence variants encoding the plant scaffold polypeptide sequence is created based on one or more plant scaffold polypeptide sequences. The library of nucleic acids encodes at least 5 (e.g., 1,000, 10^5, 10^6, 10^7, 10^9, 10^12, or more) different chimeric binding protein sequences. It is recognized that not every member of a library generated by the methods described herein will encode a unique amino acid sequence.
Nevertheless, it is desirable that at least 10% (e.g., 25%, 30%, 40%, 50%, 60%, 70%, 75%, or 90%) of the encoded chimeric binding proteins represented in the library be unique.
Prior to diversifying a plant scaffold polypeptide sequence, it may be useful to estimate computationally the expected sequence diversity to be generated with a given set of sequence variation parameters. A method for estimating sequence diversity is described, e.g., in Volles et al. (2005), 33(11): 3667-3677. For example, the number of different sequences expected in a library of nucleic acids generated by PCR can be estimated based on the mutation frequency of the mutagenic polymerase used for the amplification. Useful algorithms for estimating sequence diversity in randomized protein-encoding libraries can also be found on the world wide web, e.g., at guinevere.otago.ac.nz/mlrgd/STATS/index.
Libraries of nucleic acids encoding plant chimeric binding proteins can be generated by a number of known methodologies. Sequence diversity is introduced into a plant scaffold polypeptide sequence by substitution, deletion, insertion, or addition of amino acids at the highly variable positions of a scaffold polypeptide sequence as described above. Since the set of 20 amino acids that are genetically encoded in plants have somewhat redundant chemical and structural properties, a subset of amino acids (e.g., a subset of 4 types of amino acids) that encompasses this structural diversity can be adopted for substitutions. For example, amino acids to be used for substitution or insertion can be selected to include an acidic amino acid, a neutral amino acid, an aliphatic amino acid, and an aromatic amino acid (see Table 3). For example, the amino acids used for substitution could be limited to aspartate, serine, alanine, and tyrosine.
Limiting the redundancy of amino acid substitutions will increase the overall structural and binding diversity of the library of chimeric binding proteins.
Table 3: Chemical Properties of Amino Acids Genetically Encoded in Plants
Acidic: Aspartate, Glutamate
Neutral: Asparagine, Cysteine, Glutamine, Methionine, Proline, Serine, Threonine
Aliphatic: Alanine, Glycine, Isoleucine, Leucine, Valine
Aromatic: Histidine, Phenylalanine, Tryptophan, Tyrosine
Basic: Arginine, Lysine
The library of nucleic acids can be generated in vitro by assembly of sets of oligonucleotides with overlapping complementary sequences. First, a scaffold polypeptide sequence is selected that is to be encoded by sets of assembled oligonucleotides. The sequences to be encoded in the variable regions of a given scaffold polypeptide sequence will include a multitude of heterogeneous sequences containing substitutions, insertions, deletions, or additions in accordance with the library of chimeric binding polypeptides to be generated as described above. The scaffold polypeptide sequences to be encoded can include the C1-C4 subsequences corresponding to any of SEQ ID NOs: 1-30, 31-60, 61-90, and 91-120, respectively.
One set of oligonucleotides encodes regions of the plant scaffold polypeptide sequence where diversity is to be introduced (e.g., at X1, X2, and X3). In contrast, regions of the scaffold polypeptide sequence in which little or no variation is to be introduced (e.g., in backbone domains of PAP scaffold polypeptide sequences) are encoded by a set of oligonucleotides encoding amino acid sequences with no less than 70% (e.g., 75%, 80%, 85%, 90%, 95%, or 100%) identity to any one of the above mentioned scaffold polypeptide sequences. The details of this method are described, e.g., in U.S. patent No. 6,521,453, hereby incorporated by reference.
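The effect of a restricted amino acid subset on library size, and the fraction of a library expected to be unique, can be approximated with a short calculation. The Python sketch below is not the estimation method of the cited reference; it assumes, for illustration only, that every variant is equally likely and that library members are drawn independently, so that the expected number of distinct variants among L members drawn from V possibilities is V(1 - (1 - 1/V)^L). The function names and the example figures (10 variable positions, a library of 10^6 members) are hypothetical.

```python
import math


def theoretical_variants(n_positions, alphabet_size):
    """Number of distinct sequences when each variable position takes any allowed residue."""
    return alphabet_size ** n_positions


def expected_distinct(library_size, n_variants):
    """Expected count of distinct variants among `library_size` uniform random draws.

    Uses E = V * (1 - (1 - 1/V)**L), evaluated in log space for numerical stability.
    """
    if n_variants == 1:
        return 1.0 if library_size > 0 else 0.0
    # Log of the probability that one particular variant is never drawn.
    log_miss = library_size * math.log1p(-1.0 / n_variants)
    return n_variants * -math.expm1(log_miss)


# Ten variable positions diversified with a four-residue subset (e.g., A, D, S, Y).
variants = theoretical_variants(n_positions=10, alphabet_size=4)
distinct = expected_distinct(library_size=10**6, n_variants=variants)
print(f"theoretical variants: {variants}")
print(f"expected distinct members: {distinct:.0f}")
print(f"unique fraction of library: {distinct / 10**6:.1%}")
```

Under these assumptions the same ten positions diversified with all 20 amino acids would give 20^10 theoretical variants, far more than a library of practical size can sample, which is one way to see why a reduced alphabet can improve coverage of the designed sequence space.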
Sequence-varied oligonucleotides used to generate libraries of nucleic acids are typically synthesized chemically according to the solid phase phosphoramidite triester method described by Beaucage and Caruthers (1981), Tetrahedron Letts., 22(20):1859-1862, e.g., using an automated synthesizer, as described in Needham-VanDevanter et al. (1984) Nucleic Acids Res., 12:6159-6168. A wide variety of equipment is commercially available for automated oligonucleotide synthesis. Multi-nucleotide synthesis approaches (e.g., tri-nucleotide synthesis), as discussed, supra, are also useful.

Nucleic acids can be custom ordered from a variety of commercial sources, such as Sigma-Genosys (at sigma-genosys.com/oligo.asp); The Midland Certified Reagent Company ([email protected]), The Great American Gene Company (at genco.com), ExpressGen Inc. (at expressgen.com), Operon Technologies Inc. (Alameda, Calif.) and many others.

The oligonucleotides can have a codon use optimized for expression in a particular cell type (e.g., in a plant cell, a mammalian cell, a yeast cell, or a bacterial cell). Codon usage frequency tables are publicly available, e.g., on the world wide web at kazusa.or.jp/codon. Codon biasing can be used to optimize expression in a cell or on the surface of a cell in which binding of a plant chimeric binding protein is to be assessed, and can also be used to optimize expression of the chimeric binding protein in a transgenic organism of commercial interest (e.g., a transgenic plant). In general, codons with a usage frequency of less than 10% are not used. Before synthesis, oligonucleotide sequences are checked for potentially problematic sequences, e.g., restriction sites useful for subcloning, potential plant splice acceptor or donor sites (see, e.g., cbsdtu.dk/services/FeatureExtract/), potential mRNA destabilization sequences (e.g., "ATTTA"), and stretches of more than four occurrences of the same nucleotide.
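The pre-synthesis check described above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used. The function name is hypothetical, the restriction sites and destabilizing motifs are supplied by the caller, and splice-site prediction is left out because it needs a dedicated tool. The usage lines pass NotI (GCGGCCGC) and the ATTTA motif named in Example 1.

```python
import re


def find_problem_sites(seq, restriction_sites, destabilizing_motifs, max_run=4):
    """Return sorted (position, description) pairs for flagged features in `seq`.

    restriction_sites: dict mapping an enzyme name to its recognition sequence.
    destabilizing_motifs: iterable of motifs to flag (e.g., "ATTTA").
    max_run: longest tolerated run of one nucleotide; longer runs are flagged.
    Positions are 0-based.
    """
    seq = seq.upper()
    hits = []
    # Zero-width lookahead so that overlapping occurrences are all reported.
    for name, site in restriction_sites.items():
        for m in re.finditer(f"(?={re.escape(site.upper())})", seq):
            hits.append((m.start(), f"restriction site {name}"))
    for motif in destabilizing_motifs:
        for m in re.finditer(f"(?={re.escape(motif.upper())})", seq):
            hits.append((m.start(), f"motif {motif.upper()}"))
    # One base followed by at least `max_run` copies of itself: a run > max_run.
    for m in re.finditer(r"([ACGT])\1{%d,}" % max_run, seq):
        hits.append((m.start(), f"run of {len(m.group(0))} {m.group(1)}"))
    return sorted(hits)


if __name__ == "__main__":
    example = "ATGGCGGCCGCAAAAATTTAC"  # made-up test sequence
    for pos, what in find_problem_sites(example, {"NotI": "GCGGCCGC"}, ["ATTTA"]):
        print(pos, what)
```

A flagged position would then be resolved by choosing a synonymous codon and rerunning the check.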
Potentially problematic sequences are changed accordingly.

Populations of oligonucleotides are synthesized that encode amino acid variations in the putative binding regions of the selected scaffold polypeptide sequence (e.g., in regions X1, X2, and X3 of a PAP scaffold polypeptide sequence). Preferably, all of the oligonucleotides of a selected length (e.g., about 10, 12, 15, 20, 30, 40, 50, 60, 70, 80, 90, or 100 or more nucleotides) that correspond to regions where sequence diversity is to be introduced in the scaffold polypeptide sequence encode all possible amino acid variations from a diverse set of amino acids as described above. This includes N oligonucleotides per N sequence variations, where N is the number of different sequences at a locus. The N oligonucleotides are identical in sequence, except for the nucleotide(s) encoding the variant amino acid(s). In generating the sequence-varied oligonucleotides, it can be advantageous to utilize parallel or pooled synthesis strategies in which a single synthesis reaction or set of reagents is used to make common portions of each oligonucleotide. This can be performed, e.g., by well-known solid-phase nucleic acid synthesis techniques, or, e.g., utilizing array-based oligonucleotide synthetic methods (see e.g., Fodor et al. (1991) Science, 251: 767-777; Fodor (1997) "Genes, Chips and the Human Genome" FASEB Journal. 11:121-121; Fodor (1997) "Massively Parallel Genomics" Science. 277:393-395; and Chee et al. (1996) "Accessing Genetic Information with High-Density DNA Arrays" Science 274:610-614).

In typical synthesis strategies the oligonucleotides have at least about 10 bases of sequence identity to either side of a region of variance to ensure reasonably efficient recombination.
However, flanking regions with identical bases can have fewer identical bases (e.g., 4, 5, 6, 7, 8, or 9) and can, of course, have larger regions of identity (e.g., 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 50, or more).

The oligonucleotides to be assembled together are incubated to allow hybridization between oligonucleotides containing overlapping complementary sequences. Each set of hybridizing overlapping oligonucleotides thereby forms a contiguous nucleic acid interrupted by small gaps. These small gaps can be filled to form full length sequences using any of a variety of polymerase-mediated reassembly methods, e.g., as described herein and as known to one of skill. The greatest sequence diversity is introduced in oligonucleotides encoding the plant scaffold polypeptide sequence putative binding regions and residues. However, oligonucleotides encoding specific sequence variations can be "spiked" in the recombination mixture at any selected concentration, thus causing preferential incorporation of desirable modifications into the encoded plant chimeric binding proteins in regions outside of the putative binding domains.

For example, during oligonucleotide elongation, hybridized oligonucleotides are incubated in the presence of a nucleic acid polymerase, e.g., Taq, Klenow, or the like, and dNTPs (i.e., dATP, dCTP, dGTP and dTTP). If regions of sequence identity are large, Taq or other high-temperature polymerase can be used with a hybridization temperature of between about room temperature (i.e., about 25 °C) and, e.g., about 65 °C. If the areas of identity are small, Klenow, Taq, or other polymerases can be used with a hybridization temperature of below room temperature. The polymerase can be added to the assembly reaction prior to, simultaneously with, or after hybridization of the oligonucleotides.
Afterwards, the resulting elongated double-stranded nucleic acid sequences are denatured, hybridized, and elongated again. This cycle can be repeated for any desired number of times. The cycle is repeated, e.g., from about 2 to about 100 times.

Optionally, after multiple cycles of combinatorial nucleic acid assembly, the resulting products can be amplified, e.g., by standard polymerase chain reaction (PCR). A portion of the volume of the above-described assembly reaction is incubated with unique forward and reverse primers that hybridize universally to the ends of the nucleic acids, as well as dNTPs and a suitable polymerase (e.g., Pfu polymerase). The PCR reaction is then carried out for about 10 to 40 cycles.

To determine the extent of oligonucleotide incorporation, any approach which distinguishes similar nucleic acids can be used. For example, the nucleic acids can be cloned and sequenced, or amplified (in vitro or by cloning, e.g., into a standard cloning or expression vector) and cleaved with a restriction enzyme which specifically recognizes a particular oligonucleotide sequence variant.

It is useful to include rare restriction sites (e.g., NotI) in the 5' ends of the 5' and 3' most primers used either in the assembly or PCR reactions. Inclusion of restriction sites in these primers facilitates subcloning of the nucleic acids into a vector by restriction digestion and subsequent ligation. Alternatively, the assembly reaction or PCR products can also be subcloned, without being restriction digested, using standard methods, e.g., "TA" cloning.

Other methods for introducing diversity into a plant scaffold polypeptide sequence can also be used. For example, a scaffold polypeptide sequence can be encoded in a nucleic acid template, e.g., a plasmid construct.
Alternatively, a PCR product, mRNA or genomic DNA from an appropriate plant species such as soybean may also serve as a template encoding a plant scaffold polypeptide sequence. One or more scaffold polypeptide sequence subsequences to be diversified (e.g., the X2 region of a PAP scaffold polypeptide sequence) can be diversified during or after amplification from the scaffold polypeptide sequence nucleic acid template by any of a number of error-prone PCR methods. Error-prone PCR methods can be divided into (a) methods that reduce the fidelity of the polymerase by unbalancing nucleotide concentrations and/or adding chemical compounds such as manganese chloride (see, e.g., Lin-Goerke et al. (1997) Biotechniques, 23, 409-412), (b) methods that employ nucleotide analogs (see, e.g., U.S. Patent No. 6,153,745), (c) methods that utilize 'mutagenic' polymerases (see, e.g., Cline, J. and Hogrefe, H.H. (2000) Strategies (Stratagene Newsletter), 13, 157-161), and (d) combined methods (see, e.g., Xu, H., Petersen, E.I., Petersen, S.B. and el-Gewely, M.R. (1999) Biotechniques, 27, 1102-1108). Other PCR-based mutagenesis methods include those described, e.g., by Osuna J, Yanez J, Soberon X, and Gaytan P. (2004), Nucleic Acids Res. 2004, 32(17):e136 and Wong TS, Tee KL, Hauer B, and Schwaneberg, Nucleic Acids Res. 2004 Feb 10;32(3):e26, and others known in the art.

After generating a population of sequence variants, these can be substituted into the appropriate region of a chosen plant scaffold polypeptide sequence nucleic acid (e.g., a plasmid containing a scaffold polypeptide sequence) by subcloning which thereby effectively acts as a vector for the library of diversified sequences.

Yet another approach to mutagenizing specific plant scaffold polypeptide sequence regions is the use of a mutagenic E. coli strain (see, e.g., Wu et al. (1999), Plant Mol. Biol., 39(2):381-386).
A nucleic acid vector containing a target sequence to be mutated is introduced into the mutator strain, which is then propagated. Error-prone DNA replication in the mutator E. coli strain introduces mutations into the introduced target sequence. The population of altered target sequences is then recovered and subcloned into the appropriate position of a nucleic acid encoding the selected plant scaffold polypeptide sequence to generate a diverse library of nucleic acids encoding plant chimeric binding proteins.

III. Expression and Screening of Plant Chimeric Binding Proteins

The library of nucleic acids based on a plant scaffold polypeptide sequence and encoding plant chimeric binding polypeptides is subcloned into an expression vector and introduced into a biological replication system to generate an expression library. The expression library can be propagated and screened to identify plant chimeric binding proteins that bind a target molecule (TM) of interest (e.g., a nematode, insect, fungal, viral or plant protein).

The biological replication system on which screening of plant chimeric binding proteins will be practiced should be capable of growth in a suitable environment, after selection for binding to a target. Alternatively, the nucleic acid encoding the selected plant chimeric binding protein can be isolated by in vitro amplification. During at least part of the growth of the biological replication system, the increase in number is preferably approximately exponential with respect to time. The frequency of library members that exhibit the desired binding properties may be quite low, for example, one in 10^6 or less.

Biological replication systems can be bacterial DNA viruses, vegetative bacterial cells, or bacterial spores. Eukaryotic cells (e.g., yeast cells) can also be used as a biological replication system.

In a particularly useful embodiment, a chimeric binding protein-phage coat protein fusion is encoded in a phagemid construct.
The phagemid constructs are transformed into host bacteria, which are subsequently infected with a helper phage that expresses wild type coat proteins. The resulting phage progeny have protein coats that include both fusion protein and wild-type coat proteins. This approach has the advantage that phage viability is greater compared to viability of phage that have exclusively chimeric binding protein-coat fusion proteins. Phagemid-based display library construction and screening kits are commercially available, e.g., the EZnet™ Phage Display cDNA Library Construction Kit and Screening Kit (Maxim Biotech, Inc., San Francisco, CA).

Nonetheless, a strain of any living cell or virus is potentially useful if the strain can be: 1) genetically altered with reasonable facility to encode a plant chimeric binding protein, 2) maintained and amplified in culture, 3) manipulated to display the potential binding protein domain where it can interact with the target material, and 4) selected while retaining the genetic information encoding the expressed plant chimeric binding protein in recoverable form. Preferably, the biological replication system remains viable after affinity-based selection.

When the biological replication system is a bacterial cell or a phage which is assembled in the periplasm, the expression vector for display of the plant chimeric binding protein encodes the chimeric binding protein itself fused to two additional components. The first component is a secretion signal which directs the initial expression product to the inner membrane of the cell (a host cell when the package is a phage). This secretion signal is cleaved off by a signal peptidase to yield a processed, mature, plant chimeric binding protein. The second component is an outer surface transport signal which directs the biological replication system to assemble the processed protein into its outer surface.
This outer surface transport signal can be derived from a surface protein native to the biological replication system (e.g., the M13 phage coat protein gIII).

For example, the expression vector comprises a DNA encoding a plant chimeric binding protein operably linked to a signal sequence (e.g., the signal sequences of the bacterial phoA or bla genes or the signal sequence of M13 phage gene III) and to DNA encoding a coat protein (e.g., the M13 gene III or gene VIII proteins) of a filamentous phage (e.g., M13). The expression product is transported to the inner membrane (lipid bilayer) of the host cell, whereupon the signal peptide is cleaved off to leave a processed hybrid protein. The C-terminus of the coat protein-like component of this hybrid protein is trapped in the lipid bilayer, so that the hybrid protein does not escape into the periplasmic space. As the single-stranded DNA of the nascent phage particle passes into the periplasmic space, it collects both wild-type coat protein and the hybrid protein from the lipid bilayer. The hybrid protein is thus packaged into the surface sheath of the filamentous phage, leaving the plant chimeric binding protein exposed on its outer surface. Thus, the filamentous phage, not the host bacterial cell, is the biological replication system in this embodiment. If a secretion signal is necessary for the display of the plant chimeric binding protein, a "secretion-permissive" bacterial strain can be used for growth of the filamentous phage biological replication system.

It is unnecessary to use an inner membrane secretion signal when the biological replication system is a bacterial spore, or a phage whose coat is assembled intracellularly. In these cases, the display means is merely the outer surface transport signal, typically a derivative of a spore or phage coat protein.
Filamentous phage in general are attractive as biological replication systems for display of plant chimeric binding proteins, and M13 in particular is especially attractive because: 1) the 3D structure of the virion is known; 2) the processing of the coat protein is well understood; 3) the genome is expandable; 4) the genome is small; 5) the sequence of the genome is known; 6) the virion is physically resistant to shear, heat, cold, urea, guanidinium Cl, low pH, and high salt; 7) the phage is a sequencing vector so that sequencing is especially easy; 8) antibiotic-resistance genes have been cloned into the genome; 9) it is easily cultured and stored, with no unusual or expensive media requirements for the infected cells; 10) it has a high burst size, each infected cell yielding 100 to 1000 M13 progeny after infection; and 11) it is easily harvested and concentrated by standard methods.

For example, when the biological replication system is M13, the gene III or the gene VIII proteins can be used as an outer surface targeting signal. Alternatively, the proteins from genes VI, VII, and IX may also be used.

The encoded plant chimeric binding protein can be fused to the surface targeting signal (e.g., the M13 gene III coat protein) at its carboxy or amino terminal. The fusion boundary between the plant chimeric binding protein and the targeting signal can also include a short linker sequence (e.g., up to 20 amino acids long) to avoid undesirable interactions between the chimeric binding protein and the fused targeting signal. In some embodiments it is advantageous to include within the linker sequence a specific proteolytic cleavage site. In addition, the amino terminal or carboxy terminal of the fused protein can include a short epitope tag (e.g., a hemagglutinin tag).
Inclusion of a proteolytic cleavage site or a short epitope tag is particularly useful for purification of a library of chimeric binding proteins from a population of cells expressing the library. Epitope-tagged chimeric binding proteins can be conveniently purified by proteolytic cleavage of the linker sequence followed by affinity chromatography utilizing an antibody or other binding agent that recognizes the epitope tag.

Many methods exist for screening phage display libraries (see, e.g., Willats (2002), Plant Mol. Biol., 50:837-854). As commonly practiced, the target molecule of interest is adsorbed to a support and then exposed to solutions of phage displaying plant chimeric binding proteins. The target molecule can be immobilized by passive adsorption on a support medium, e.g., tubes, plates, columns, or magnetic beads. Generally, the adsorptive support medium is pre-blocked, e.g., with bovine serum albumin, milk, or gelatin, to reduce non-specific binding of the phage during screening. Alternatively, the target molecule can be biotinylated, so interaction between chimeric binding protein-bearing phage and the target molecule can be carried out in solution. Phage that bind to the target can then be selected using avidin or streptavidin bound to a solid substrate (e.g., beads or a column).

After phage are allowed to interact with the target molecule, non-interacting phage are removed by washing. The remaining, specifically binding phage are then eluted by one of any number of treatments including, e.g., lowering or increasing pH, application of reducing agents, or use of detergents. In one embodiment, a specific proteolytic cleavage site is introduced between the plant chimeric binding protein sequence and the phage coat protein sequence. Thus, phage elution can be accomplished simply by addition of the appropriate protease.
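To show why a selection of this kind can recover binders that start at a very low frequency (for example, the one in 10^6 mentioned earlier), the following Python sketch works through the arithmetic of a simple enrichment model. The model is an assumption for illustration only and is not taken from the specification: each round multiplies the binder-to-non-binder odds by a constant factor, and both the factor of 1000 and the function name are hypothetical.

```python
def binder_fraction(initial_fraction: float, enrichment_per_round: float, rounds: int) -> float:
    """Fraction of binders in the phage pool after `rounds` of selection.

    Assumed model (illustrative only): one round recovers binders
    `enrichment_per_round` times more efficiently than non-binders, so the
    odds binders:non-binders are multiplied by that factor each round.
    """
    if not 0.0 < initial_fraction < 1.0:
        raise ValueError("initial_fraction must lie strictly between 0 and 1")
    if enrichment_per_round <= 0 or rounds < 0:
        raise ValueError("enrichment_per_round must be > 0 and rounds >= 0")
    odds = initial_fraction / (1.0 - initial_fraction)
    odds *= enrichment_per_round ** rounds
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Start at one binder per million library members; assume 1000-fold per round.
    for n in range(5):
        print(n, binder_fraction(1e-6, 1000.0, n))
```

Under these assumed numbers, binders go from a negligible share of the pool to the majority within a few rounds, which is the rationale for repeating the bind, wash, and elute cycle.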
Eluted phage are then amplified by infection of host cells and can subsequently be re-screened by the method just outlined to reduce the number of false positive binders. During each round of phage screening, care should be taken to include growth of the phage on a solid medium rather than exclusively in a liquid medium as this minimizes loss of phage clones that grow sub-optimally.

Plant chimeric binding proteins can also be expressed and screened for binding solely in vitro using ribosomal display. An exclusively in vitro approach circumvents the requirement to introduce the library of nucleic acids encoding plant chimeric binding proteins into a biological replication system. Methods for screening polypeptides in vitro by ribosomal protein display are described in detail, e.g., in U.S. Patent No. 6,589,741. The nucleic acids described in the section above are modified by adding a phage promoter sequence (e.g., a T7 promoter) enabling in vitro transcription, a ribosome binding sequence upstream to the start of translation of the encoded plant chimeric binding protein, and a transcription termination sequence (e.g., from phage T3). The modified library of nucleic acids is then transcribed in vitro to generate a corresponding mRNA population encoding plant chimeric binding proteins. Plant chimeric binding proteins are then expressed in vitro by translating the population of mRNA molecules devoid of stop codons in the correct reading frame in an in vitro translation system, under conditions that allow the formation of polysomes. The polysomes so formed are then brought into contact with a target molecule under conditions that allow the interaction of plant chimeric binding proteins with the target molecule.
Polysomes displaying chimeric binding proteins that interact with the target molecule are then separated from non-interacting polysomes displaying no such (poly)peptides; and the mRNA associated with the interacting polysome is then amplified (e.g., by PCR) and sequenced.

Interaction of a plant chimeric binding protein with a target protein can also be detected in a genetic screen. In the screen, the target protein functions as a "bait protein" and each plant chimeric binding protein functions as a potential "prey" protein in a binding assay that utilizes a two-hybrid assay or three-hybrid assay (see, e.g., U.S. Patent No. 5,283,317; Zervos et al. (1993) Cell 72:223-232; Madura et al. (1993) J. Biol. Chem. 268:12046-12054; Bartel et al. (1993) Biotechniques 14:920-924; Iwabuchi et al. (1993) Oncogene 8:1693-1696; Hubsman et al. (2001) Nuc. Acids Res. Feb 15;29(4):E18; and Brent WO94/10300).

A two-hybrid assay can be carried out using a target polypeptide as the bait protein. In sum, the target polypeptide is fused to the LexA DNA binding domain and used as bait. The prey is a plant chimeric binding protein library cloned into the active site loop of TrxA as a fusion protein with an N-terminal nuclear localization signal, a LexA activation domain, and an epitope tag (Colas et al. 1996 Nature 380:548; and Gyuris et al. Cell 1993 75:791). Yeast cells are transformed with bait and prey genes. When the target fusion protein binds to a plant chimeric binding protein fusion protein, the LexA activation domain is brought into proximity with the LexA DNA binding domain and expression of reporter genes or selectable marker genes having an appropriately positioned LexA binding site increases. Suitable reporter genes include fluorescent proteins (e.g., EGFP) and enzymes (e.g., luciferase, β-galactosidase, alkaline phosphatase, etc.). Suitable selectable marker genes include, for example, the yeast LEU2 gene.
After identification of one or more target-binding chimeric binding proteins, the isolated nucleic acids encoding the chimeric binding proteins can be mutagenized by the methods described herein, to generate small expression libraries expressing variant chimeric binding proteins. The chimeric binding protein-variant expression libraries can be screened to identify chimeric binding protein variants with improved target binding properties (e.g., increased affinity or specificity).

The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.

The following specific examples are to be construed as merely illustrative, and not limitative of the remainder of the disclosure in any way whatsoever. Without further elaboration, it is believed that one skilled in the art can, based on the description herein, utilize the present invention to its fullest extent. All publications cited herein are hereby incorporated by reference in their entirety.

EXAMPLES

Example 1 Design and Expression of Plant Scaffold Polypeptide Sequences

Several protein domain families were analyzed for their potential use as scaffolds. A search of PFAM domains (pfam.wustl.edu; see Bateman et al.
(2004)), restricting the output to Viridiplantae, was conducted to limit domains only to those present in green plants. Four protein domain families were selected to develop plant universal molecular recognition libraries: the accessory domain of purple acid phosphatase (PAP), plant cystatins, plant C2 domains and the turn-helix-helix (THH) motif found in ankyrin repeat proteins.

Three purple acid phosphatase scaffolds were designed having the sequence of SEQ ID NOs:34-36. The amino acid sequence of the accessory domain from kidney bean PAP was used as a query sequence to BLAST the NCBI database. When the output was restricted to proteins found in Viridiplantae, 62 unique sequences were identified. From an alignment of these sequences, a consensus plant PAP sequence was generated (SEQ ID NO:34) by selecting the most frequent amino acid at each position in the alignment. The kidney bean (Phaseolus vulgaris) PAP was selected as a parental scaffold (SEQ ID NO:35), because of its known structure. A PAP from soybean, Glycine max, was also chosen (SEQ ID NO:36), as this species represents a common crop species in which transgenic products are generated.

A set of scaffold polypeptide sequences which contain plant ankyrin-like repeats was also designed. Ankyrin-like repeats are small turn-helix-helix (THH) motifs consisting of approximately 33 amino acids. They are common elements of proteins from all organisms and are often found in tandem arrays of 2 to 20 repeats within a protein.

Three THH scaffolds were generated. These proteins are similar in structure to GA binding protein (GABP-β). This protein consists of THH-like amino and carboxy terminal caps with 3 THH internal repeats. In this protein, it is thought that the caps help stabilize the protein by shielding hydrophobic residues found in the internal repeats.
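The consensus step used for the PAP scaffold above (selecting the most frequent amino acid at each position of an alignment) can be sketched in Python as follows. This is an illustration rather than the tool actually used. The function name is hypothetical and the toy sequences are invented, not PAP sequences. Two further assumptions are labeled in the code: gap characters are ignored when counting, and ties go to the residue seen first in input order.

```python
from collections import Counter


def consensus(aligned_seqs, gap="-"):
    """Most frequent residue at each column of a pre-aligned set of sequences.

    Assumptions (not from the specification): gap characters are ignored when
    counting, an all-gap column is dropped, and ties are resolved in favor of
    the residue encountered first in input order.
    """
    if not aligned_seqs:
        raise ValueError("at least one sequence is required")
    width = len(aligned_seqs[0])
    if any(len(s) != width for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*aligned_seqs):
        residues = [r for r in column if r != gap]
        if not residues:
            continue  # column contains only gaps
        result.append(Counter(residues).most_common(1)[0][0])
    return "".join(result)


if __name__ == "__main__":
    toy_alignment = ["MKTA", "MRTA", "MKSA"]  # invented sequences
    print(consensus(toy_alignment))
```

A real alignment of many sequences would come from an alignment program; only the column-wise majority vote is shown here.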
Three hundred and twelve Viridiplantae ankyrin repeat proteins found in PFAM were aligned to aid in designing plant-specific THH scaffolds. A plant consensus THH sequence was generated by selecting the most frequently occurring amino acid at each position. This sequence was termed the plant consensus internal repeat sequence. This sequence was used to search the NCBI databases by BLAST alignment to find the closest natural THH sequence found in plants. A sequence from wheat (Triticum aestivum) was found. The designed repeat based on T. aestivum contains a substitution of valine for the single cysteine occurring in the T. aestivum sequence. Two sets of N and C terminal caps were generated. One set consists of sequences derived from GABP-β and the second set was derived from the plant THH consensus sequence and optimized to resemble the structure of GABP-β. In particular, the N terminal cap has an extended alpha-helical structure, while the C terminal cap has a truncated helix compared to the typical THH repeat.

Three THH scaffolds were designed: one consists of plant consensus N and C caps and two plant consensus internal THH repeats (SEQ ID NO:37). Another consists of plant consensus N and C caps and two wheat internal repeats (SEQ ID NO:38) and the third consists of ankyrin-like N and C caps with two wheat internal repeats (SEQ ID NO:39).

The genes encoding the plant scaffold polypeptide sequences were designed for expression testing in plants, bacteria, and on the surface of phage. Codons were selected for plant expression using a publicly available Glycine max codon usage table (at kazusa.or.jp/codon, codon usage tabulated from the international DNA sequence databases: status for the year 2000. Nakamura, Y, Gojobori, T and Ikemura, T (2000) Nucl. Acids Res. 28:292.). Codon selection was done manually with the aim for the final codon frequency to roughly reflect the natural frequency for Glycine max.
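Codon selection in this example was done manually. Purely as an illustration of the stated aim (codon frequencies that roughly reflect natural usage, with the 10% cutoff from the general description), the following Python sketch automates the same idea. The usage values in `TOY_USAGE` are hypothetical placeholders, not Glycine max data, and the function names are invented for this sketch.

```python
import random

# HYPOTHETICAL relative codon frequencies per amino acid, for illustration only.
# These are NOT real Glycine max values; a real table would be substituted.
TOY_USAGE = {
    "A": {"GCT": 0.40, "GCC": 0.25, "GCA": 0.28, "GCG": 0.07},
    "D": {"GAT": 0.62, "GAC": 0.38},
    "S": {"TCT": 0.25, "TCC": 0.17, "TCA": 0.20, "TCG": 0.06, "AGT": 0.17, "AGC": 0.15},
    "Y": {"TAT": 0.52, "TAC": 0.48},
}


def allowed_codons(usage, min_fraction=0.10):
    """Drop codons whose relative frequency is below `min_fraction`."""
    return {
        aa: {codon: f for codon, f in codons.items() if f >= min_fraction}
        for aa, codons in usage.items()
    }


def back_translate(protein, usage, min_fraction=0.10, rng=None):
    """Pick one codon per residue, weighted by usage among the allowed codons."""
    rng = rng or random.Random()
    allowed = allowed_codons(usage, min_fraction)
    chosen = []
    for aa in protein:
        options = allowed.get(aa)
        if not options:
            raise ValueError(f"no allowed codon for residue {aa!r}")
        codons = list(options)
        weights = [options[c] for c in codons]
        chosen.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(chosen)


if __name__ == "__main__":
    print(back_translate("ADSY", TOY_USAGE, rng=random.Random(0)))
```

Because the choice is random, the output sequence would still need to be screened for restriction sites, destabilizing motifs, and homopolymer runs before synthesis.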
Rarely used codons (<10% frequency) were not used. Final sequences were checked for potential problematic sequences, including removal of restriction sites needed for cloning, potential plant splice acceptor or donor sites (see website at cbs.dtu.dk/services/NetPgene/), potential mRNA destabilization sequences (ATTTA) and stretches of more than 4 occurrences of the same nucleotide. Any potential problematic sequences were altered in the genes by modifying codon usage. Since the THH sequences have 4 similar repeat sequences within each protein, steps were taken to reduce nucleotide similarity within repeats; the average repeat identity was reduced 10-15% by these means.

Seven constructs were produced using synthetic gene assembly (three based on THH scaffold polypeptide sequences, two based on PAP scaffold polypeptide sequences, one plant cystatin and one plant C2 domain protein). The three THH scaffold polypeptide sequences were placed into a phagemid vector as fusion sequences with the gene III coat protein (gIII) at its carboxy terminus (Phage 3.2, Maxim Biotech, Inc., South San Francisco, CA). A 6-His tag was included at the 5' end of the gene as well as a c-Myc tag between the scaffold gene and the encoded amino terminus of gIII. The phagemid constructs were then packaged into phage particles and the phage were tested for expression and surface display of the THH scaffold. Phage ELISAs using either anti-His or anti-Myc antibodies indicated that the THH scaffold proteins were expressed on the surface of phage, suggesting that all 3 THH scaffold polypeptide sequence constructs fold and express well on the phage surface.

The selected scaffold polypeptide sequences were then used to generate expression vectors to evaluate their expression in transgenic plants by immunoblotting. Tobacco leaves were injected with Agrobacterium LB4404 transformed with THH-containing plant expression vectors.
Two days later, sections of leaves injected with Agrobacterium were harvested, frozen on dry ice, then ground into a fine powder with a pestle. PBS containing 0.2% Tween-20 was added to the fine powder at a 1:1 weight to volume ratio and additional grinding was done. Insoluble material was removed by centrifugation and 10 µl of the remaining supernatant was loaded onto a 4-12% acrylamide SDS-PAGE gel (NuPAGE, Invitrogen). Proteins were transferred to PVDF membranes. Proteins were detected using a rat anti-HA antibody (Roche) and an anti-rat HRP conjugated secondary antibody (Chemicon). HRP was detected using Amersham Lumigen reagents.

All three THH scaffolds were found to be expressed, with the relative level of expression of the three scaffolds being TA-THH > CC-THH > TC-THH.

OTHER EMBODIMENTS

All of the features disclosed in this specification may be combined in any combination. Each feature disclosed in this specification may be replaced by an alternative feature serving the same, equivalent, or similar purpose. Thus, unless expressly stated otherwise, each feature disclosed is only an example of a generic series of equivalent or similar features.

From the above description, one skilled in the art can easily ascertain the essential characteristics of the present invention, and without departing from the spirit and scope thereof, can make various changes and modifications of the invention to adapt it to various usages and conditions. Thus, other embodiments are also within the scope of the following claims.

Claims

1. A library of nucleic acids encoding at least ten different polypeptides, wherein (i) the amino acid sequence of each of the encoded polypeptides comprises an amino acid sequence at least 70% identical to any of SEQ ID NOs:127-129; (ii) the amino acid sequence of each of the encoded polypeptides includes amino acids that differ from those of SEQ ID NOs:127-129 at positions 13, 14, 33, 35-36, 38, 46-47, 66, 68-69, 71, 79, 80, 99, 101-102, 104, and 112-113 and the amino acid differences are heterogeneous across a plurality of the encoded polypeptides; and (iii) the amino acid sequence of each of the encoded polypeptides outside of the residues corresponding to positions 13, 14, 33, 35-36, 38, 46-47, 66, 68-69, 71, 79, 80, 99, 101-102, 104, and 112-113 of SEQ ID NOs:127-129 is homogeneous across a plurality of the encoded polypeptides.

2. The library of claim 1, wherein the amino acid sequence of the polypeptides has at least 75% identity to any of SEQ ID NOs 127-129.

3. The library of claim 1, wherein the amino acid sequence of the polypeptides has at least 80% identity to any of SEQ ID NOs 127-129.

4. The library of claim 1, wherein the amino acid sequence of the polypeptides has at least 85% identity to any of SEQ ID NOs 127-129.

5. The library of claim 1, wherein each of the nucleic acids comprises a vector sequence.

6. A population of cells expressing the polypeptides encoded by the library of claim 1.

7. A cell selected from the population of cells of claim 6.

8. A purified library of polypeptides encoded by the library of claim 1.

9. A population of filamentous phage displaying the library of polypeptides encoded by the library of claim 1.

10. A method of generating the library of claim 1, comprising: (i) selecting an amino acid sequence corresponding to any one of SEQ ID NOs:127-129 to be encoded, wherein the selected sequence differs from those of SEQ ID NOs:127-129 in at least one of variable positions 13, 14, 33, 35-36, 38, 46-47, 66, 68-69, 71, 79, 80, 99, 101-102, 104, and 112-113; (ii) chemically providing a first and a second plurality of oligonucleotides, wherein (a) oligonucleotides of the first plurality encode amino acid subsequences of the selected amino acid sequence, the subsequences being heterogeneous at the encoded variable positions; (b) oligonucleotides of the second plurality are complementary to nucleotide sequences encoding subsequences of the selected amino acid sequence, the subsequences being heterogeneous at the encoded variable positions; and (c) the first and second pluralities comprise oligonucleotides having overlapping sequences complementary to one another; (iii) combining the population of oligonucleotides to form a first mixture; (iv) incubating the mixture under conditions effective for hybridizing the overlapping complementary sequences to form a plurality of hybridized complementary sequences; and (v) elongating the plurality of hybridized complementary sequences to form a second mixture containing the library.

11. The method of claim 10, further comprising performing a cycle of denaturing the library by increasing the temperature of the second mixture to a temperature effective for denaturing double stranded DNA, followed by steps (iv) and (v).

12. The method of claim 11, comprising repeating the cycle up to 100 times.

13. The method of claim 12, further comprising amplifying the library by a polymerase chain reaction consisting essentially of the library, a forward primer, and a reverse primer, wherein the forward and reverse primers can hybridize to the 5' and 3' end sequences, respectively, of all nucleic acids in the library.

14. The method of claim 10, wherein amino acids to be encoded for the variable positions are selected from a subset of alanine, arginine, asparagine, aspartate, cysteine, glutamine, glutamate, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine.

15. The method of claim 14, wherein the amino acids selected for the variable positions are selected from a group consisting of an aliphatic, an acidic, a neutral, and an aromatic amino acid.

16. The method of claim 15, wherein the group of amino acids consists of alanine, aspartate, serine, and tyrosine.

17. The library of claim 1 or the method of claim 10, substantially as herein described and with reference to any of the Examples and/or Figures.