
Make the framework use a consistent strategy when lazy-loading val_set #270

Closed
WenjieDu opened this issue Dec 18, 2023 · 0 comments
Labels: enhancement (New feature or request)


@WenjieDu (Owner)

```python
if val_set is not None:
    if isinstance(val_set, str):
        with h5py.File(val_set, "r") as hf:
            # Here we read the whole validation set from the file to mask a
            # portion for validation. In PyPOTS, a file is usually used because
            # the data is too big. However, the validation set generally
            # shouldn't be too large. For example, if we have 1 billion samples
            # for model training, we won't take 20% of them as the validation
            # set, because we want as much data as possible for the training
            # stage to enhance the model's generalization ability. Therefore,
            # 100,000 representative samples will be enough to validate the model.
            val_set = {
                "X": hf["X"][:],
                "X_intact": hf["X_intact"][:],
                "indicating_mask": hf["indicating_mask"][:],
            }
    # check if X_intact contains missing values
    if np.isnan(val_set["X_intact"]).any():
        val_set["X_intact"] = np.nan_to_num(val_set["X_intact"], nan=0)
        logger.warning(
            "X_intact shouldn't contain missing data but has NaN values. "
            "PyPOTS has imputed them with zeros by default to start the training for now. "
            "Please double-check your data if you have concerns over this operation."
        )
    val_set = BaseDataset(val_set, return_labels=False, file_type=file_type)
```

E.g., in imputation models, when val_set is given as an h5 file path to enable lazy loading, the framework still loads all the data from the file. Although people usually don't have a large validation set, this may increase memory pressure if they do. We also expect the framework to behave consistently with train_set. Therefore, we need to make PyPOTS apply the same lazy-loading strategy to val_set.
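For illustration, the eager load above could be replaced by a dataset that reads one sample per `__getitem__` call, so only the accessed samples ever reach memory. This is a minimal sketch, not PyPOTS' actual `BaseDataset`: the class name `LazyH5Dataset` is hypothetical, and only the h5 keys (`X`, `X_intact`, `indicating_mask`) are taken from the snippet above.

```python
import os
import tempfile

import h5py
import numpy as np


class LazyH5Dataset:
    """Hypothetical sketch: read samples from an HDF5 file on demand
    instead of slurping hf["..."][:] into memory up front."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.file_handle = None
        # Only the sample count is read eagerly; the arrays stay on disk.
        with h5py.File(file_path, "r") as hf:
            self.n_samples = hf["X"].shape[0]

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # Open the handle lazily on first access, so the dataset object can
        # also be pickled and sent to DataLoader worker processes first.
        if self.file_handle is None:
            self.file_handle = h5py.File(self.file_path, "r")
        X = self.file_handle["X"][idx]
        # X_intact shouldn't contain NaNs; impute with zeros just in case,
        # mirroring the eager branch above.
        X_intact = np.nan_to_num(self.file_handle["X_intact"][idx], nan=0)
        indicating_mask = self.file_handle["indicating_mask"][idx]
        return X, X_intact, indicating_mask


# Tiny demo file: 4 samples, 5 time steps, 3 features.
path = os.path.join(tempfile.mkdtemp(), "val_set.h5")
rng = np.random.default_rng(0)
with h5py.File(path, "w") as hf:
    hf.create_dataset("X", data=rng.normal(size=(4, 5, 3)))
    hf.create_dataset("X_intact", data=rng.normal(size=(4, 5, 3)))
    hf.create_dataset("indicating_mask", data=rng.integers(0, 2, size=(4, 5, 3)))

val_set = LazyH5Dataset(path)
X, X_intact, indicating_mask = val_set[0]  # only sample 0 is read from disk
```

With this shape, the val_set branch could reuse whatever lazy dataset class the framework already applies to train_set, which is the consistency this issue asks for.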
