DeepLearningMoleculeEnergyPredict - dlmep


dlmep is built on DeepChem. It is a neural network model for atom-dependent regression in physics and chemistry. The network is transferable across atom counts and atom types: it can predict any system whose atom types are the same as, or a subset of, those in the training set.

Setting aside whether it is chemically or physically reasonable, a transferable model should satisfy the following rules:

  1. It can predict any system whose atom types are a subset of the atom types in the training set (e.g. when CH, CHO and CHON are used as the training set, it can predict CN, CO, CH, ...).
  2. It can predict systems with a different number of atoms (e.g. when C2H2 is used as the training set, it can predict C60H60).

In our model, transferability is achieved in two stages:

  1. When encoding the atomic structure into feature vectors, the shape of the feature vector must not change with the total number of atoms or atom types (e.g. the feature vector of carbon in CH4 should have the same shape as in C60. In C60, H is an absent atom type, so its part of the vector is zero, but the shape of the feature vector does not change).
  2. When training or predicting, the model produces an output whose shape does not depend on the input.
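The first stage can be illustrated with a toy descriptor. The sketch below is not the ANI transform from DeepChem; it is a simplified radial term written for illustration, and the function name `radial_features`, the parameters `eta` and `cutoff`, and the coordinates are all assumptions. It only shows that one entry per atom type in a fixed list gives a feature vector whose shape does not depend on how many atoms, or which atom types, are present.

```python
import math


def radial_features(coords, symbols, center, atom_types, eta=1.0, cutoff=5.0):
    """Toy per-atom descriptor (hypothetical, not DeepChem's ANI transform).

    Returns one value per entry of `atom_types`, so the length of the
    result is fixed by `atom_types` and not by the number of atoms.
    """
    cx, cy, cz = coords[center]
    features = []
    for atom_type in atom_types:
        total = 0.0
        for j, (x, y, z) in enumerate(coords):
            if j == center or symbols[j] != atom_type:
                continue
            r = math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
            if r < cutoff:
                # Gaussian term damped by a smooth cosine cutoff
                damping = 0.5 * (math.cos(math.pi * r / cutoff) + 1.0)
                total += math.exp(-eta * r * r) * damping
        # An absent atom type contributes nothing and leaves 0.0 here
        features.append(total)
    return features


ATOM_TYPES = ["C", "H"]

# Illustrative geometries only, not optimized structures
methane_symbols = ["C", "H", "H", "H", "H"]
methane_coords = [
    (0.0, 0.0, 0.0),
    (0.63, 0.63, 0.63),
    (-0.63, -0.63, 0.63),
    (-0.63, 0.63, -0.63),
    (0.63, -0.63, -0.63),
]
chain_symbols = ["C", "C", "C"]  # carbon only: H is an absent atom type
chain_coords = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.8, 0.0, 0.0)]

f_methane = radial_features(methane_coords, methane_symbols, 0, ATOM_TYPES)
f_chain = radial_features(chain_coords, chain_symbols, 0, ATOM_TYPES)
```

Both `f_methane` and `f_chain` have length 2. In the carbon-only system the H entry is zero, which mirrors the "absent atom leaves a zero vector" behaviour described above.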

To achieve these two stages, we use the following methods:

  1. For transferability in the encoding stage, use the ANI transform (from DeepChem) or SOAP (from SOAPLite). ANI encodes with symmetry functions, details are in, SOAP is another encoding method that has not been tested with the NN model in the paper, about SOAP:
  2. For transferability in the train/test/predict stage, use a reduce_sum operation in TensorFlow.



dlmep builds its coordinate encoding on the ANI transform from DeepChem and provides an interface that takes VASP result directories as input. dlmep also accepts coordinates and atom cases directly as input, and it can use features encoded by soapml. Some parts of soapml will eventually be merged into dlmep.

Given the directory that contains your VASP result directories, the program converts them into a dataset automatically and then uses the ANI transform from DeepChem to compute the features.

The features are fed into a dense NN to predict the energy. During training and testing, the features are fed into different NNs according to their atom case, and the outputs are finally combined with reduce_sum to give the total energy.

When you predict on other datasets, there is no limit on the number of atoms (since the final layer is a reduce_sum), but the atom cases must be included in the training set. E.g., if your training set uses C2H4O1Cu20, you can predict CHO, C1Cu20, O1Cu20, ..., but you cannot predict CHNCu, because N is not in the training set. (But make sure the distribution of your training set is broad enough for the prediction.)
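The atom-case rule can be checked before running a prediction. The following is a minimal sketch, not part of dlmep: the helper names `atom_cases_of` and `can_predict` are hypothetical, and it assumes formulas are written with standard element symbols such as `C2H4O1Cu20`.

```python
import re


def atom_cases_of(formula):
    """Return the set of element symbols in a formula such as 'C2H4O1Cu20'."""
    # An element symbol is one uppercase letter, optionally followed by one lowercase letter
    return set(re.findall(r"[A-Z][a-z]?", formula))


def can_predict(train_formulas, target_formula):
    """True if every atom case of the target appears somewhere in the training set."""
    train_cases = set()
    for formula in train_formulas:
        train_cases |= atom_cases_of(formula)
    return atom_cases_of(target_formula) <= train_cases


train = ["C2H4O1Cu20"]
ok_cho = can_predict(train, "CHO")        # C, H, O are all in the training set
ok_ccu = can_predict(train, "C1Cu20")     # C, Cu are in the training set
ok_chncu = can_predict(train, "CHNCu")    # N is missing from the training set
```

This check covers atom cases only. It says nothing about whether the training set samples enough geometries for the prediction to be reliable.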




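import os
import pickle

import matplotlib.pyplot as plt

# the dlmep imports (DatasetMaker, DatasetOffer, FullAtomModel, print_file) are not shown in this listing

# give a dir that contains vasp result dirs, like "S:\dataset\carbonNanotobe"
aim_vasp_path = vasp_dir_path
# print_file writes info into a file named "log"
print_file(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>New Game Begin!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<")
print_file("Start Data Collecting")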

dataset_maker = DatasetMaker(aim_vasp_path)
total_info = dataset_maker.give_out_dataset()
print_file("Finished Data Collecting, Start Feature Transform")
# encode
dataset_offer = DatasetOffer(total_data_info=total_info)
# The call on dataset_offer that produces the features is missing from this listing.
# It returns, in this order:
#     total_train_feed_x, total_test_feed_x,
#     total_train_feed_y, total_test_feed_y,
#     atom_cases, n_feat
# and it saves the dataset into ANI_features.pkl

with open(ANI_pkl_file_path, "rb") as f:
    total_train_feed_x, total_test_feed_x, total_train_feed_y, total_test_feed_y, atom_cases, n_feat \
            = pickle.load(f)
# create model from dlmep.FullAtomModel
nn = FullAtomModel(atom_cases, os.getcwd() + "/model", n_feat)
# The attempt to load saved weights is missing from this listing; on failure it logs:
#     print_file("Load Weights Failed")

# `while True` can be used here instead of a fixed number of epochs,
# since the weights are saved at every step
index = 0
for i in range(epoch):
    print_file(">>Loop %s" % (index))
    for dataset_index in range(len(total_train_feed_x)):
        print_file(">>>>Train for %s/%s" % (dataset_index + 1, len(total_train_feed_x)))
        # fit data (the fit call is truncated in this listing; its surviving arguments are:)
        #     ...[dataset_index], total_train_feed_y[dataset_index], epoch=1000, ...
    index += 1
# finally, test
with open(ANI_pkl_file_path, "rb") as f:
    _, total_test_feed_x, _, total_test_feed_y, atom_cases, n_feat \
        = pickle.load(f)

# create model and load weights
nn = FullAtomModel(atom_cases, os.getcwd() + "/model/trained", n_feat)

pred_result = []
true_result = []
for dataset_index in range(len(total_test_feed_x)):
    # the loop body is missing from this listing; it fills pred_result and true_result
    pass

# x axis: predicted energy, y axis: true energy
plt.plot(pred_result, true_result, 'ro')
plt.savefig("test_result.png", dpi=300)

Results: plot of pred_y against true_y, predicted on the test set.

Network structure of the FullAtomModel in the demo



From VASP results to 3D coordinates and energy (potential energy). Input: list of str (dir_path). Output: list of arrays of shape (n_sample, n_atom, 4); the length of the list equals the length of the dir_path list.


Input: list of arrays (n_sample, n_atom, 4) as X, list of arrays (n_sample, 1) as Y.

ANI transform: list of dicts. Each dict uses atom_case as the key, and each value is an array of shape (n_sample, n_atom of that atom_case, n_feature). For the train set and the test set, the only difference is n_sample in the value arrays. (This applies to the train/test stage; the validation stage is different.)
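The dict layout described here can be sketched with nested lists. This is an illustration of the data structure only, not dlmep's implementation: the function `group_by_atom_case` and the `(symbol, feature_vector)` input format are assumptions, and the feature values are made up.

```python
def group_by_atom_case(samples, atom_cases):
    """Group per-atom feature vectors by atom case.

    samples: list of samples, each a list of (symbol, feature_vector) pairs.
             All samples are assumed to share the same composition.
    Returns a dict: atom_case -> nested list indexed as
             [n_sample][n_atom of that atom_case][n_feature].
    """
    grouped = {case: [] for case in atom_cases}
    for sample in samples:
        for case in atom_cases:
            grouped[case].append([feat for sym, feat in sample if sym == case])
    return grouped


# Two samples of a C1H2 system with n_feature = 3 (made-up numbers)
samples = [
    [("C", [0.1, 0.2, 0.3]), ("H", [0.4, 0.5, 0.6]), ("H", [0.7, 0.8, 0.9])],
    [("C", [1.1, 1.2, 1.3]), ("H", [1.4, 1.5, 1.6]), ("H", [1.7, 1.8, 1.9])],
]

grouped = group_by_atom_case(samples, ["C", "H", "O"])
# grouped["C"] has shape (2, 1, 3), grouped["H"] has shape (2, 2, 3),
# and the absent atom case "O" holds an empty atom list for each sample.
```

Keeping the absent atom case as a key, with no atoms under it, means the same set of per-atom-case networks can be fed regardless of which atom cases a particular system contains.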


Input: feature_num, atom_cases. Data: list of dicts. Each dict uses atom_case as the key **(an atom_case is a str such as H, Pt, C, O, ...)**, and each value is an array of shape (n_sample, n_atom of that atom_case, n_feature).

Build a dense NN from n_feature to 1 for each atom_case. For every dense NN, the input is (n_sample, n_atom, n_feature) and the output is (n_sample, 1); the first reduce sum is used to make the output independent of n_atom.

Then reduce sum the outputs of all the dense NNs to get the total energy.
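The two reduce sums can be sketched in plain Python. This is not the FullAtomModel code: a single linear layer stands in for each dense NN, the names `dense` and `predict_energy` are hypothetical, and the weights and features are made-up numbers. It only shows why the output shape depends on n_sample and not on n_atom.

```python
def dense(features, weights, bias):
    """Stand-in for a per-atom-case dense NN: maps n_feature values to 1 value."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_energy(batch, params):
    """batch:  dict atom_case -> nested list [n_sample][n_atom][n_feature]
    params: dict atom_case -> (weights, bias)
    Returns a list with one total energy per sample.
    """
    n_sample = len(next(iter(batch.values())))
    totals = [0.0] * n_sample
    for case, per_sample in batch.items():
        weights, bias = params[case]
        for s, atoms in enumerate(per_sample):
            # First reduce sum: over the atoms of this atom case
            case_energy = sum(dense(f, weights, bias) for f in atoms)
            # Second reduce sum: accumulate over atom cases
            totals[s] += case_energy
    return totals


params = {"C": ([1.0, 0.0], 0.0), "H": ([0.0, 1.0], 0.5)}

# One sample with 1 C and 2 H
small = {"C": [[[2.0, 9.0]]], "H": [[[9.0, 1.0], [9.0, 1.0]]]}
# One sample with 2 C and 4 H, using the same per-atom features
big = {"C": [[[2.0, 9.0]] * 2], "H": [[[9.0, 1.0]] * 4]}

e_small = predict_energy(small, params)
e_big = predict_energy(big, params)
```

Both calls return a list with a single energy even though the atom counts differ, and with identical per-atom features the larger system gets exactly twice the energy of the smaller one.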

