Commit

Update README.md
alquraishi committed Feb 25, 2019
1 parent bb6d744 commit 6f3a56d
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -1,7 +1,7 @@
# ProteinNet
ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures ([secondary](https://en.wikipedia.org/wiki/Protein_secondary_structure) and [tertiary](https://en.wikipedia.org/wiki/Protein_tertiary_structure)), multiple sequence alignments ([MSAs](https://en.wikipedia.org/wiki/Multiple_sequence_alignment)), position-specific scoring matrices ([PSSMs](https://en.wikipedia.org/wiki/Position_weight_matrix)), and standardized [training / validation / test](https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets) splits. ProteinNet builds on the biennial [CASP](https://predictioncenter.org/) assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data poor and data rich regimes.

-**Note that this is a preliminary release.** The raw data used for construction of the data sets, as well as the MSAs, are not yet generally available. However, the raw data (4TB) for ProteinNet 12 is available upon request. Transfer requires downloading of a Globus client.
+**Note that this is a preliminary release.** The raw data used for construction of the data sets, as well as the MSAs, are not yet generally available. However, the raw MSA data (4TB) for ProteinNet 12 is available upon request. Transfer requires downloading of a Globus client.

### Motivation
Protein structure prediction is one of the central problems of biochemistry. While the problem is well-studied within the biological and chemical sciences, it is less well represented within the machine learning community. We suspect this is due to two reasons: 1) a high barrier to entry for non-domain experts, and 2) lack of standardization in terms of training / validation / test splits that make fair and consistent comparisons across methods possible. If these two issues are addressed, protein structure prediction can become a major source of innovation in ML research, alongside the canonical tasks of computer vision, NLP, and speech recognition. Much like [ImageNet](https://www.image-net.org) helped [spur the development](https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/) of new computer vision techniques, ProteinNet aims to facilitate ML research on protein structure by providing a standardized data set, and standardized training / validation / test splits, that any group can use with minimal effort to get started.
