
How can I customize my own dataset to fit PyPOTS SOTA imputation models? #141

Closed
abhishekju06 opened this issue Jun 8, 2023 · 20 comments
Labels
question Further information is requested

Comments

@abhishekju06

1. Feature description

I want to run PyPOTS SOTA models on my own dataset.

2. Motivation

I have a multivariate dataset and want to check how PyPOTS models perform on it for data imputation.

3. Your contribution

None so far

@abhishekju06 abhishekju06 added enhancement New feature or request new feature Proposing to add a new feature labels Jun 8, 2023
@WenjieDu
Owner

WenjieDu commented Jun 8, 2023

Hi there 👋,

Thank you so much for your attention to PyPOTS! If you find PyPOTS helpful to your work, please star ⭐️ this repository. Your star is your recognition, which can help more people notice PyPOTS and help the PyPOTS community grow. It matters and is definitely a kind of contribution to the community.

I have received your message and will respond ASAP. Thank you for your patience! 😃

Best,
Wenjie

@WenjieDu
Owner

WenjieDu commented Jun 8, 2023

Hi, thank you for raising this issue. The only thing you need to do is, after your data preprocessing, ensure that the data you feed into the models has 3 dimensions: [n_samples, n_steps, n_features].
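For example (just a minimal sketch, with random numbers standing in for your own data), a 2D table of shape [total_length, n_features] can be cut into that 3D shape like this:

```python
import numpy as np

# hypothetical 2D data: 1000 time steps recorded for 3 features
data_2d = np.random.randn(1000, 3)

n_steps = 100                            # chosen length of each sample
n_features = data_2d.shape[1]            # 3
n_samples = data_2d.shape[0] // n_steps  # 10 non-overlapping samples

# reshape into the expected [n_samples, n_steps, n_features]
data_3d = data_2d[: n_samples * n_steps].reshape(n_samples, n_steps, n_features)
print(data_3d.shape)  # (10, 100, 3)
```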

@abhishekju06
Author

abhishekju06 commented Jun 8, 2023

What does n_steps indicate in my dataset?
n_features represents the number of attributes, I suppose.
n_samples represents the length of the dataframe, I suppose.

Does data preprocessing consist of cleaning and normalization only?

@WenjieDu
Owner

WenjieDu commented Jun 8, 2023

n_samples indicates how many samples are in your dataset. n_steps is the number of time steps in each sample. You can use a sliding-window algorithm to generate such a 3D dataset from your original 2D dataset.

Yes, of course, cleaning and normalization are included in preprocessing. You know, machine learning is not magic; you have to prepare things for model processing.
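For illustration, a minimal sliding-window sketch (not the SAITS utility itself; the array below is only random stand-in data):

```python
import numpy as np

def sliding_window(data_2d, n_steps, stride):
    """Cut a 2D array [total_length, n_features] into a 3D array
    [n_samples, n_steps, n_features] with a sliding window."""
    windows = [
        data_2d[start : start + n_steps]
        for start in range(0, len(data_2d) - n_steps + 1, stride)
    ]
    return np.stack(windows)

data_2d = np.random.randn(500, 3)                           # hypothetical 2D dataset
data_3d = sliding_window(data_2d, n_steps=100, stride=100)  # non-overlapping windows
print(data_3d.shape)                                        # (5, 100, 3)
```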

@abhishekju06
Author

In my case, the number of time steps in each sample is the same as the length of the dataframe.
Can you give me a reference for a sliding-window algorithm to generate the 3D dataset? It would be of great help.

@WenjieDu
Owner

WenjieDu commented Jun 8, 2023

Please simply try searching on Google or GitHub; I believe you can figure it out quickly. This is not a complicated algorithm, just a simple method.

@WenjieDu WenjieDu added question Further information is requested and removed enhancement New feature or request new feature Proposing to add a new feature labels Jun 8, 2023
@abhishekju06
Author

abhishekju06 commented Jun 8, 2023

Thanks a lot!

@WenjieDu
Owner

WenjieDu commented Jun 8, 2023

My pleasure! @abhishekju06 I just remembered that you can find such a sliding-window function among the data-processing utilities in the SAITS repo here. If you are using the SAITS model for your data imputation and think it's helpful, please kindly consider starring 🌟 the SAITS repo to help more people notice this useful model. Many thanks!

@abhishekju06
Author

abhishekju06 commented Jun 13, 2023

n_samples indicates how many samples are in your dataset. n_steps is the number of time steps in each sample. You can use a sliding-window algorithm to generate such a 3D dataset from your original 2D dataset.

Can you please help me understand, with an example:

  1. n_samples
  2. n_steps
  3. sequence length

========================
I have created the dataset:
2023-06-13 17:41:40,422 - Already masked out 10.0% values in train set
2023-06-13 17:41:40,475 - In val set, num of artificially-masked values: 7917.0
2023-06-13 17:41:40,475 - In test set, num of artificially-masked values: 7244.0
2023-06-13 17:41:40,476 - Feature num: 3,
7805 (0.936) samples in train set
281 (0.034) samples in val set
257 (0.031) samples in test set
2023-06-13 17:41:40,496 - All done.

====================
Below is my code:

import h5py

f = h5py.File('datasets.h5', 'r')
f.keys()

for key in f.keys():
    print(key)  # names of the root-level objects in the HDF5 file - can be groups or datasets
    print(type(f[key]))  # get the object type: usually group or dataset

#################
group_train = f['train']

for key in group_train.keys():
    print("Train:", key)

dataset_for_training = {
    "X": group_train['X'][()],
}

#############################
group_val = f['val']
for key in group_val.keys():
    print("Val:", key)

dataset_for_validating = {
    "X": group_val['X'][()],
    "X_intact": group_val['X_hat'][()],
    "indicating_mask": group_val['indicating_mask'][()],
}

#############################
group_test = f['test']
for key in group_test.keys():
    print("Test:", key)

dataset_for_testing = {
    "X": group_test['X'][()],
}

from pypots.optim import Adam
from pypots.imputation import SAITS

saits = SAITS(
    n_steps=100,  # physionet2012_dataset['n_steps']
    n_features=3,  # physionet2012_dataset['n_features']
    n_layers=2,
    d_model=256,
    d_inner=128,
    n_heads=4,
    d_k=64,
    d_v=64,
    dropout=0.1,
    attn_dropout=0.1,
    diagonal_attention_mask=True,  # otherwise the original self-attention mechanism will be applied
    # you can adjust the weights ORT_weight and MIT_weight to make the SAITS model focus more on one task.
    # Usually you can just leave them at the default values, i.e. 1.
    ORT_weight=1,
    MIT_weight=1,
    batch_size=32,
    # here we set epochs=10 for a quick demo; you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to early-stop the training if the evaluating loss doesn't decrease for 3 epochs.
    # You can leave it at the default None to disable early stopping.
    patience=3,
    # give the optimizer. Different from torch.optim.Optimizer, you don't have to specify the model's parameters
    # when initializing pypots.optim.Optimizer. You can also leave it at the default, which will initialize
    # an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # this num_workers argument is for torch.utils.data.DataLoader. It's the number of subprocesses to use for
    # data loading. Leaving it at the default 0 means data loading runs in the main process, i.e. no subprocesses.
    # You can increase it to >1 if you think data loading is a bottleneck for your model training speed.
    num_workers=1,
    # just leave it at the default; PyPOTS will automatically assign the best device for you.
    # Set it to 'cpu' if you don't have CUDA devices. You can also set it to 'cuda:0' or 'cuda:1'
    # if you have multiple CUDA devices.
    device='cuda',
    # set the path for saving tensorboard and trained model files
    saving_path="C:/Users/e264642/WFD_Projects/IITB/IITB_Code/pots/saits",
    # only save the best model after training has finished.
    # You can also set it to "better" to save models that perform better during training.
    model_saving_strategy="best",
)

# Training
saits.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

I am getting this error:
2023-06-13 17:54:11 [INFO]: Model initialized successfully with the number of trainable parameters: 1,321,802
2023-06-13 17:54:26 [INFO]: epoch 0: training loss 0.3622, validating loss nan
2023-06-13 17:54:42 [INFO]: epoch 1: training loss 0.2156, validating loss nan
2023-06-13 17:54:59 [INFO]: epoch 2: training loss 0.1777, validating loss nan
2023-06-13 17:54:59 [INFO]: Exceeded the training patience. Terminating the training procedure...
Traceback (most recent call last):

File "C:\Users\1234\AppData\Local\Temp\ipykernel_13552\1758113713.py", line 41, in
saits.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

File "C:\Users\1234\Anaconda3\envs\pypots\lib\site-packages\pypots\imputation\saits\model.py", line 420, in fit
self._train_model(training_loader, val_loader)

File "C:\Users\1234\Anaconda3\envs\pypots\lib\site-packages\pypots\imputation\base.py", line 352, in _train_model
if np.equal(self.best_loss.item(), float("inf")):

AttributeError: 'float' object has no attribute 'item'

Please help!

@abhishekju06 abhishekju06 reopened this Jun 13, 2023
@WenjieDu
Owner

According to the info you provided, I think the error is caused by the input data not being properly prepared. Please check whether there are NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"].
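For example, a quick check along these lines (a minimal sketch assuming the dataset_for_validating dictionary built in your code above) will show whether any NaNs slipped in:

```python
import numpy as np

# report whether each validation array still contains NaNs
for name in ("X", "X_intact", "indicating_mask"):
    arr = np.asarray(dataset_for_validating[name])
    print(name, "contains NaNs:", bool(np.isnan(arr).any()))
```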

@abhishekju06
Author

abhishekju06 commented Jun 13, 2023

Should there be NaNs present in my input dataset before I generate the dataset, i.e., can my attribute columns contain NaN values?

@WenjieDu
Owner

WenjieDu commented Jun 13, 2023

Datasets with missing values are fine, of course; PyPOTS is designed for datasets with missing data. But after generation, indicating_mask and X_intact should not have NaNs, and the missing part in X_intact should be imputed with some value like 0, because PyPOTS will use them for loss calculation. NaNs in indicating_mask or X_intact will result in a NaN loss, just like in your case.
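For example, something along these lines should give you NaN-free X_intact and indicating_mask (a minimal sketch assuming the pypots.data.mcar and masked_fill helpers shipped with PyPOTS releases around this time; the random X_3d is only a stand-in for your own 3D array):

```python
import numpy as np
from pypots.data import mcar, masked_fill  # assumed data-corruption helpers in PyPOTS of this era

# hypothetical 3D dataset that may already contain NaNs for genuinely missing values
X_3d = np.random.randn(257, 100, 3)
X_3d[np.random.rand(*X_3d.shape) < 0.05] = np.nan

# hold out 10% of the observed values as artificial missing values for validation
X_intact, X, missing_mask, indicating_mask = mcar(X_3d, 0.1)
X = masked_fill(X, 1 - missing_mask, np.nan)  # the model input keeps NaNs at unobserved positions

dataset_for_validating = {
    "X": X,                              # may contain NaNs, which is fine for the model input
    "X_intact": X_intact,                # missing spots filled with 0, no NaNs
    "indicating_mask": indicating_mask,  # 1 where values were artificially held out, no NaNs
}
assert not np.isnan(X_intact).any()
assert not np.isnan(indicating_mask).any()
```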

@abhishekju06
Author

According to the info you provided, I think the error is caused by the input data not being properly prepared. Please check whether there are NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"].

So NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"] are necessary, right?

@WenjieDu
Owner

Datasets with missing values are fine, of course; PyPOTS is designed for datasets with missing data. But after generation, indicating_mask and X_intact should not have NaNs, and the missing part in X_intact should be imputed with some value like 0, because PyPOTS will use them for loss calculation. NaNs in indicating_mask or X_intact will result in a NaN loss, just like in your case.

Sorry for missing a not in my last reply, just fixed it: "after generation, indicating_mask and X_intact should not have NaNs".

@WenjieDu WenjieDu changed the title How can I customize my own dataset to fit PyPots SOA imputation models? How can I customize my own dataset to fit PyPOTS SOTA imputation models? Jun 13, 2023
@abhishekju06
Author

According to the info you provided, I think the error is caused by the input data not being properly prepared. Please check whether there are NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"].

I have replaced NaN values with 0 in the 'indicating_mask'.
I guess X_hat refers to X_intact. Thus, I have set it as the value for the key X_intact:

dataset_for_validating = {
    "X": group_val['X'][()],
    "X_intact": group_val['X_hat'][()],
    "indicating_mask": group_val['indicating_mask'][()],
}

If not, please let me know what X_intact stands for.

In SAITS/dataset_generating_scripts/data_processing_utils.py, line 86 reads:

X_hat[indices_for_holdout] = np.nan  # X_hat contains artificial missing values

It is evident that X_hat must contain NaNs, as it represents artificial missing values.

So where am I going wrong?

@WenjieDu
Owner

Please read the paper first https://arxiv.org/abs/2202.08516. Thanks.

@abhishekju06
Author

Hi,
Please don't get me wrong. I have read the paper.
I just want to be clear about how the notation in the paper corresponds to the notation in the code:
X cap
M cap
X tilde
I

Without your help it is not possible for me to understand.

@WenjieDu
Owner

Please read it carefully and take a look at the model's implementation code here for reference.

@abhishekju06
Author

Thanks a ton!

@WenjieDu
Owner

No problem. If you have further questions regarding the SAITS model, you're welcome to raise issues in the SAITS repo: https://github.com/WenjieDu/SAITS/issues.
