CrabNet matbench results - possibly neglecting 25% of the training data it could have used #19

Open
sgbaird opened this issue Dec 31, 2021 · 2 comments


sgbaird commented Dec 31, 2021

@anthony-wang,

In the CrabNet matbench notebook, the data is split into train/val/test sets. However, if #15 (comment) is correct that the validation data (i.e., val.csv) doesn't contribute to hyperparameter tuning, then that 25% of the training data is essentially thrown away, correct?

In other words, the CrabNet results are based on only 75% of the training data available to the other matbench models. From what I understand, a train/val/test split in the matbench context only really makes sense if you're doing hyperparameter optimization in a nested CV scheme, as follows:

[Nested cross-validation diagram] (Source: https://hackingmaterials.lbl.gov/automatminer/advanced.html)
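For concreteness, here is a minimal nested-CV sketch with scikit-learn; the estimator, parameter grid, and splitter settings are illustrative placeholders, not what matbench or CrabNet actually use:

# Illustrative nested CV: the inner loop tunes hyperparameters, and
# GridSearchCV refits on the full inner-training data afterward, so no
# training data is discarded in the outer evaluation.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, random_state=7)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=7)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=7)  # test-error estimate

model = GridSearchCV(
    RandomForestRegressor(random_state=7),
    param_grid={"n_estimators": [50, 100]},
    cv=inner_cv,
)
scores = cross_val_score(model, X, y, cv=outer_cv,
                         scoring="neg_mean_absolute_error")
print(scores.mean())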

To correct this, I think all that needs to be done is change:

# split_train_val splits the training data into two disjoint sets:
# 75% training and 25% validation
def split_train_val(df):
    df = df.sample(frac=1.0, random_state=7)       # shuffle all rows
    val_df = df.sample(frac=0.25, random_state=7)  # hold out 25% for validation
    train_df = df.drop(val_df.index)               # keep the remaining 75%

    return train_df, val_df

to

# split_train_val now keeps all of the training data in train_df;
# val_df overlaps with train_df and acts only as a dummy validation set
def split_train_val(df):
    train_df = df.sample(frac=1.0, random_state=7)  # all rows, shuffled
    val_df = df.sample(frac=0.25, random_state=7)   # overlapping 25% subset

    return train_df, val_df

This intentionally introduces data bleed between train_df and val_df, but val_df then serves only as a dummy dataset so that CrabNet doesn't error out when a val.csv isn't available.
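As a quick sanity check of the proposed behavior (assuming the proposed split_train_val above is defined, and using a pandas DataFrame with made-up columns):

import pandas as pd

df = pd.DataFrame({"formula": [f"X{i}" for i in range(8)],
                   "target": range(8)})
train_df, val_df = split_train_val(df)

print(len(train_df))                            # 8 -> all rows kept for training
print(val_df.index.isin(train_df.index).all())  # True -> val_df overlaps train_df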

Sterling


sgbaird commented Feb 5, 2022

Email response from @MahamadSalah74

If I remember correctly, for the matbench submission we chose to assign 25% of the training data as validation just as a sanity check to make sure CrabNet was not overfitting on the training data. I don't think CrabNet does any internal hyperparameter optimization using the validation data. Validation data should not be used for optimization or any adjustments to the model whatsoever during training.


sgbaird commented Feb 7, 2022

The results from #15 suggest that folding the extra 25% validation data back into the training data may not necessarily improve the results, since the validation data already influences the weights through stochastic weight averaging (SWA).
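To illustrate the mechanism, here is a hedged sketch of validation-coupled SWA, assuming a generic PyTorch model; this is not CrabNet's actual training loop, only an illustration of how a validation set can shape the final averaged weights:

import torch
from torch.optim.swa_utils import AveragedModel

torch.manual_seed(7)
X_train, y_train = torch.randn(64, 10), torch.randn(64, 1)  # dummy data
X_val, y_val = torch.randn(16, 10), torch.randn(16, 1)

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

swa_model = AveragedModel(model)
best_val_loss = float("inf")

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    # Weights are folded into the running average only when validation
    # loss improves, so the validation set indirectly steers the final
    # (averaged) model even though it never appears in the gradient step.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        swa_model.update_parameters(model)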
