
During implementation on my own dataset, I got weird results #41

Open
DBpackage opened this issue Aug 16, 2023 · 4 comments

@DBpackage commented Aug 16, 2023

Hi! Thank you, as always, for your nice work on drug-target prediction.

I ran run_experiments.py in the deepdta-toy folder to train on my own dataset,
but I could only get this result.

[screenshot: training results]

Could you give me any advice about this result?

Since my GPU is an RTX 3090, I set up a conda environment with TensorFlow 2.4.1 and Keras 2.4.3 (for more detail, I attached a txt file). My guess is that something goes wrong during backpropagation, or somewhere else in your code, because it is based on TF 1.x.
env.txt

Or maybe I misunderstood the dataset format for training on my own data.

  1. I made My_train and My_test data folders to hold my own dataset.
  2. Both have the same format: ligands.tab / proteins.fasta / Y.tab. (At first, I built My_train following the DTC folder that you provided as the training dataset in the README. But with that format, run_experiments.py required a 'proteins.fasta' file in the training folder, which is not included in your DTC folder. So I changed the My_train folder to follow the 'mytest' folder format instead; the resulting layout is sketched below.)
[screenshots: My_train and My_test folder contents]
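For reference, this is the layout I ended up with for both folders (roughly; the comments are my understanding of each file's role):

My_train/
    ligands.tab       # drug SMILES
    proteins.fasta    # target protein sequences
    Y.tab             # binding affinity matrix (drugs x proteins)
My_test/
    ligands.tab
    proteins.fasta
    Y.tab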

I'm still pretty new to the computer science field TT.
The code runs, but the results are weird, so I don't know how to debug this.
If you point out anything suspicious, I will inspect it.

Best regards,

@hkmztrk (Owner) commented Aug 16, 2023

Hi @DBpackage, thanks a lot for your interest. It seems to me your inputs are correctly formatted since the code itself is running.

What is the issue with the results? Do you mean the loss is not improving? Since your dataset is new, the hyperparameters (e.g. kernel sizes and learning rate) might need fine-tuning. You can also try established training sets such as KIBA, DTC, etc. and see whether there is an improvement on the training and test sets. Another note: sometimes isomeric SMILES is more informative than canonical SMILES, so you can also try those (see the example below).
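For example, with RDKit (not part of this repo) you can generate both forms and compare them; the molecule below is just an arbitrary illustration:

from rdkit import Chem

# L-alanine as an arbitrary example; the chiral center is only
# preserved in the isomeric output.
mol = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")
print(Chem.MolToSmiles(mol, isomericSmiles=False))  # e.g. CC(N)C(=O)O (stereo dropped)
print(Chem.MolToSmiles(mol, isomericSmiles=True))   # e.g. C[C@@H](N)C(=O)O (stereo kept)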

Let me know if you have more questions/issues.

@DBpackage (Author) commented:

Thanks for the fast reply!

Adjusting hyperparameters is important for improving model performance. However, when the model does not appear to be learning at all, it seems unlikely that hyperparameter tuning alone will fix it, even if the current values are wrong. Anyway, I haven't tried adjusting the parameters yet, so I will do that too! Many thanks!

[screenshot: training output showing a C-index of 0]

My questions are:

  1. As you can see above, I got a C-index of zero, which is not reasonable. I think something is wrong, since as far as I know a CI below 0.5 is abnormal (see the sketch after this list).
  2. When I first built my own dataset, I misunderstood and thought I had to create the train and test folds myself. But after running the model, I found that it creates the folds by itself. So I don't need to make the train index list file (folds/train_folds.txt) by hand?
  3. My training dataset has almost 110,000 samples, and the model trains on it really fast (only 8 seconds per epoch). Is that a normal training speed? It seems too fast to me.
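For reference, this is my understanding of how the C-index works (a minimal sketch of my own, not the repo's implementation):

def concordance_index(y_true, y_pred):
    # Fraction of comparable pairs (pairs with different true affinities)
    # whose predicted ordering matches the true ordering; ties in the
    # prediction count as 0.5. 0.5 ~ random ranking, 1.0 ~ perfect.
    concordant, comparable = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # not a comparable pair
            comparable += 1
            diff_pred = y_pred[i] - y_pred[j]
            if (y_true[i] - y_true[j]) * diff_pred > 0:
                concordant += 1.0
            elif diff_pred == 0:
                concordant += 0.5
    return concordant / comparable if comparable else 0.0

So a CI of 0 would mean every comparable pair is ranked backwards, which is why a value below 0.5 looks broken to me.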

I'm going to run the model with your KIBA dataset to check whether the model itself works. Thanks!

Respectfully!

@hkmztrk (Owner) commented Aug 16, 2023

  1. I see, yes, that cindex value does not make sense. I would then suggest basic debugging, e.g. starting with a training set of 10 samples and overfitting the model. You should see the loss and cindex change meaningfully; then you can gradually increase the dataset size and test again (see the sketch after this list).

  2. Yes, the new folds are prepared by the code itself. Did you (i) update the train/test path arguments to point at your own train/test data and (ii) make sure that the binding affinities for both datasets are on the same scale?

  3. I really can't judge the runtime, sorry. But with a GPU, I'm guessing one can expect around two hours of training at most.
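The overfitting check in point 1 would look something like this (a sketch with placeholder names; XD/XT/Y stand for the encoded drug inputs, protein inputs, and affinities after preprocessing, and 'model' for the compiled Keras model):

# Take a tiny slice of the training data.
tiny_XD, tiny_XT, tiny_Y = XD[:10], XT[:10], Y[:10]

# A healthy model should overfit 10 samples easily.
history = model.fit([tiny_XD, tiny_XT], tiny_Y,
                    batch_size=2, epochs=300, verbose=0)

# The loss should drop steadily toward ~0. If it stays flat,
# gradients are not flowing and the problem is in the setup,
# not in the hyperparameters.
print(history.history['loss'][::50])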

@DBpackage (Author) commented Aug 21, 2023

I'm sorry for the late reply; I've been crazily busy lately.

  1. I ran with only 100 samples: training on a (22, 7) and testing on a (7, 2) [drug, target] matrix. The model still didn't work; it returns the same CI score of 0.5.
  2. Yes, I updated the train/test paths to point at my own dataset folders. Since I used a BindingDB dataset, it has the same kind of binding affinities as the DAVIS dataset, so I set the argument --isLog 1. I made the train/test splits from the same table, so the affinity scales must be the same (the transform I mean is sketched after this list).
  3. Okay, then I won't worry about the runtime anymore, thanks!
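For the record, the conversion I expect --isLog 1 to apply is the standard DAVIS-style Kd-to-pKd transform (a sketch of my own; the exact line lives in the repo's data helper):

import numpy as np

def kd_to_pkd(kd_nm):
    # Convert a Kd value in nM into log space (pKd), the scale DAVIS uses.
    return -np.log10(kd_nm / 1e9)

print(kd_to_pkd(10000.0))  # 5.0, the DAVIS placeholder for non-binding pairs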

Because I wanted to check whether I am running the model correctly, I tried running it with your own DAVIS dataset.
I used run_experiments.py in the source directory, and I changed only the --dataset_path and --isLog arguments in the go.sh file. The command I used is here:

python run_experiments.py --num_windows 32 \
                          --seq_window_lengths 8 12 \
                          --smi_window_lengths 4 8 \
                          --batch_size 256 \
                          --num_epoch 100 \
                          --max_seq_len 1000 \
                          --max_smi_len 100 \
                          --dataset_path '../data/davis/' \
                          --problem_type 1 \
                          --isLog 1 \
                          --log_dir 'logs/'

and I ran it with:
./go.sh

I also had to change parts of run_experiments.py to run on my system (TensorFlow 2.x; I changed only these two snippets).

1. TensorFlow import part

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()  # run the original TF 1.x graph-mode code under TF 2.x

2. Keras import part

from keras import backend as K

tf.set_random_seed(0)
# session_conf: the tf.ConfigProto defined earlier in the script
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
tf.keras.backend.set_session(sess)

I got the same result.

[screenshot: training output, same as before]

Is there any mistake in how I'm running the code?
I think that if I can't run the model even with your own code and dataset, the TensorFlow 2.x version must be causing something to go wrong during training. (As I said above, my system has an RTX 3090, so I can't use TF 1.x, since CUDA 10 is not supported.)
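One sanity check I can run to confirm the compat shim actually takes effect (just a diagnostic sketch):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# If the shim works, TF 2.x should now behave like TF 1.x:
# eager execution off, graph/session mode on.
print(tf.__version__)          # e.g. 2.4.1
print(tf.executing_eagerly())  # should print False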

Thanks in advance!
