Update proteinnet_records.md

aqlaboratory · Feb 25, 2019 · 5bf299b
1 parent 6f3a56d
Showing 1 changed file with 4 additions and 4 deletions.
diff --git a/docs/proteinnet_records.md b/docs/proteinnet_records.md
@@ -7,13 +7,13 @@ ProteinNet is comprised of ProteinNet Records which can be used to train machine
 * Tertiary Structure
 * Mask
 
-**Sequences** are the primary amino acid chains that constitute a protein. They are represented by a string of characters with an alphabet size of 20. Our standard [parser](../code/parser.py) converts this into a variable-length tensor comprised of 20-dimensional one-hot vectors.
+**Sequences** are the primary amino acid chains that constitute a protein. They are represented by a string of characters with an alphabet size of 20. Our standard [parser](../code/parser.py) converts this into a variable-length tensor comprised of 20-dimensional one-hot vectors; one dimension per amino acid, ordered alphabetically.
 
-**PSSMs**, a.k.a. [position-specific scoring matrices](https://en.wikipedia.org/wiki/Position_weight_matrix), summarize the propensity of each residue position along the protein chain to mutate to other amino acids. They are represented by a sequence of real-valued 20-dimensional vectors, normalized to range in value between 0 and 1. An additional dimension, corresponding to the information content of a residue, is concatenated with each vector to bring the total dimensionality to 21. We will provide multiple types of PSSMs, but this preliminary release of ProteinNet contains PSSMs derived using [JackHMMer](https://hmmer.org) from UniParc and metagenomic sequences.
+**PSSMs**, a.k.a. [position-specific scoring matrices](https://en.wikipedia.org/wiki/Position_weight_matrix), summarize the propensity of each residue position along the protein chain to mutate to other amino acids. They are represented by a sequence of real-valued 20-dimensional vectors (one dimension for each amino acid, ordered alphabetically), normalized to range in value between 0 and 1. An additional dimension, corresponding to the information content of a residue, is concatenated with each vector to bring the total dimensionality to 21. We will provide multiple types of PSSMs, but this preliminary release of ProteinNet contains PSSMs derived using [JackHMMer](https://hmmer.org) from UniParc and metagenomic sequences.
 
-**Secondary structure** is a categorical classification (8 classes) of the local structure of proteins, with the most prominent examples being [alpha helices](https://en.wikipedia.org/wiki/Alpha_helix) and [beta sheets](https://en.wikipedia.org/wiki/Beta_sheet). We derive our classification from the [tertiary structure](https://en.wikipedia.org/wiki/Protein_tertiary_structure) of the protein using the [DSSP](https://swift.cmbi.ru.nl/gv/dssp/) software package. As a result, it is more suitable as a prediction target as opposed to an input modality, although it can be used in either way. Secondary structure is represented by a string of characters with an alphabet size of 8. Our standard parser converts this into a variable-length tensor comprised of 8-dimensional one-hot vectors. Note that secondary structure is not available in the current preliminary release of ProteinNet.
+**Secondary structure** is a categorical classification (8 classes) of the local structure of proteins, with the most prominent examples being [alpha helices](https://en.wikipedia.org/wiki/Alpha_helix) and [beta sheets](https://en.wikipedia.org/wiki/Beta_sheet). We derive our classification from the [tertiary structure](https://en.wikipedia.org/wiki/Protein_tertiary_structure) of the protein using the [DSSP](https://swift.cmbi.ru.nl/gv/dssp/) software package. As a result, it is more suitable as a prediction target as opposed to an input modality, although it can be used in either way. Secondary structure is represented by a string of characters with an alphabet size of 8. Our standard parser converts this into a variable-length tensor comprised of 8-dimensional one-hot vectors; one dimension per DSSP class, using the following ordering: LHBEGITS. Note that secondary structure is not available in the current preliminary release of ProteinNet.
 
-**Tertiary structure** is the three-dimensional atomic representation of a protein. The preliminary release of ProteinNet only contains the [backbone](https://en.wikipedia.org/wiki/Backbone_chain) atoms, corresponding to the sequential chain of N, C_alpha, and C' atoms. Each amino acid residue is represented by a real-valued 3x3 matrix, corresponding to the Cartesian coordinates of the three backbone atoms. The full protein is thus represented by a sequence of 3n 3-dimensional vectors, where n is the number of amino acids in a protein.
+**Tertiary structure** is the three-dimensional atomic representation of a protein. The preliminary release of ProteinNet only contains the [backbone](https://en.wikipedia.org/wiki/Backbone_chain) atoms, corresponding to the sequential chain of N, C_alpha, and C' atoms. Each amino acid residue is represented by a real-valued 3x3 matrix, corresponding to the Cartesian coordinates of the three backbone atoms in picometers (relative to PDB files everything is multiplied by 100). The full protein is thus represented by a sequence of 3n 3-dimensional vectors, where n is the number of amino acids in a protein.
 
 **Masks** are one-bit indicators of whether the atomic coordinates for a protein residue are present. Many protein structures, due to intrinsic or experimental reasons, do not have precisely defined positions for all atoms. Masks provide an explicit indicator of this information that can be incorporated into the learning algorithm, for example to prevent the loss function from penalizing predictions made of unknown atomic coordinates. They are represented by a string of characters with an alphabet size of 2 (+/-). Our standard parser converts this into a 2D binary matrix where columns and rows of residues containing missing atoms are set to 0, and all other entries are set to 1.
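
The encodings described in this change can be illustrated with a minimal Python sketch. It is not the repository's `parser.py`: the helper names `one_hot`, `picometers_to_angstroms`, and `mask_matrix` are hypothetical, and the amino acid alphabet assumes that "ordered alphabetically" refers to the one-letter codes. Only the DSSP ordering `LHBEGITS`, the factor of 100 relative to PDB files, and the `+`/`-` mask alphabet come from the text above.

```python
# Assumption: "ordered alphabetically" refers to the one-letter amino acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# DSSP class ordering as stated in the record description.
DSSP_CLASSES = "LHBEGITS"


def one_hot(symbols, alphabet):
    """Return one one-hot list per symbol, with one dimension per alphabet entry."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = []
    for ch in symbols:
        vec = [0] * len(alphabet)
        vec[index[ch]] = 1  # raises KeyError for symbols outside the alphabet
        vectors.append(vec)
    return vectors


def picometers_to_angstroms(coords):
    """Undo the x100 scaling relative to PDB files (1 angstrom = 100 pm)."""
    return [[value / 100.0 for value in atom] for atom in coords]


def mask_matrix(mask):
    """Expand a '+'/'-' mask string into a 2D binary matrix.

    Entry (i, j) is 1 only if residues i and j are both present, so the
    rows and columns of residues with missing atoms are all 0.
    """
    present = [1 if ch == "+" else 0 for ch in mask]
    return [[a * b for b in present] for a in present]


# Hypothetical toy inputs, not taken from a real ProteinNet record.
seq_vectors = one_hot("ACY", AMINO_ACIDS)      # 3 residues x 20 dimensions
ss_vectors = one_hot("LHE", DSSP_CLASSES)      # 3 residues x 8 dimensions
backbone = picometers_to_angstroms([[1234.0, -250.0, 0.0]])
pairwise = mask_matrix("+-+")                  # middle residue is missing
```

The mask expansion is written as an outer product of the per-residue indicator with itself, which is one way to obtain a matrix whose rows and columns are zeroed for missing residues; the actual parser may construct it differently.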