
How can I customize my own dataset to fit PyPOTS SOTA imputation models? #141

Closed
abhishekju06 opened this issue Jun 8, 2023 · 20 comments
Labels
question Further information is requested

Comments

@abhishekju06

1. Feature description

I want to run PyPOTS SOTA models on my own dataset.

2. Motivation

I have a multivariate dataset and want to check how PyPOTS models perform on it for data imputation.

3. Your contribution

None so far

@abhishekju06 abhishekju06 added enhancement New feature or request new feature Proposing to add a new feature labels Jun 8, 2023
@WenjieDu
Owner

WenjieDu commented Jun 8, 2023

Hi there 👋,

Thank you so much for your attention to PyPOTS! If you find PyPOTS helpful to your work, please star ⭐️ this repository. Your star is your recognition, which can help more people notice PyPOTS and help the PyPOTS community grow. It matters and is definitely a kind of contribution to the community.

I have received your message and will respond ASAP. Thank you for your patience! 😃

Best,
Wenjie

@WenjieDu
Owner

WenjieDu commented Jun 8, 2023

Hi, thank you for raising this issue. The only thing you need to do is, after your data preprocessing, ensure that the data you feed into the models has 3 dimensions: [n_samples, n_steps, n_features].
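For example (just a minimal sketch, with random numbers standing in for your own data), a 2D table of shape [total_length, n_features] can be cut into that 3D shape like this:

```python
import numpy as np

# hypothetical 2D data: 1000 time steps recorded for 3 features
data_2d = np.random.randn(1000, 3)

n_steps = 100                            # chosen length of each sample
n_features = data_2d.shape[1]            # 3
n_samples = data_2d.shape[0] // n_steps  # 10 non-overlapping samples

# reshape into the expected [n_samples, n_steps, n_features]
data_3d = data_2d[: n_samples * n_steps].reshape(n_samples, n_steps, n_features)
print(data_3d.shape)  # (10, 100, 3)
```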

@abhishekju06
Author

abhishekju06 commented Jun 8, 2023

What does n_steps indicate in my dataset?
n_features represents the number of attributes, I suppose.
n_samples represents the length of the dataframe, I suppose.

Does data preprocessing consist of cleaning and normalization only?

@WenjieDu
Owner

WenjieDu commented Jun 8, 2023

n_samples indicates how many samples are in your dataset. n_steps is the number of time steps in each sample. You can use a sliding-window algorithm to generate such a 3D dataset from your original 2D dataset.

Yes, of course, cleaning and normalization are included in preprocessing. You know, machine learning is not magic; you have to prepare things for model processing.
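For illustration, a minimal sliding-window sketch (not the SAITS utility itself; the array below is only random stand-in data):

```python
import numpy as np

def sliding_window(data_2d, n_steps, stride):
    """Cut a 2D array [total_length, n_features] into a 3D array
    [n_samples, n_steps, n_features] with a sliding window."""
    windows = [
        data_2d[start : start + n_steps]
        for start in range(0, len(data_2d) - n_steps + 1, stride)
    ]
    return np.stack(windows)

data_2d = np.random.randn(500, 3)                           # hypothetical 2D dataset
data_3d = sliding_window(data_2d, n_steps=100, stride=100)  # non-overlapping windows
print(data_3d.shape)                                        # (5, 100, 3)
```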

@abhishekju06
Author

In my case, the number of time steps in each sample is the same as the length of the dataframe.
Can you give me a reference for a sliding-window algorithm to generate the 3D dataset? It would be of great help.

@WenjieDu
Owner

WenjieDu commented Jun 8, 2023

Please simply try searching on Google or GitHub; I believe you can figure it out quickly. This is not a complicated algorithm, just a simple method.

@WenjieDu WenjieDu added question Further information is requested and removed enhancement New feature or request new feature Proposing to add a new feature labels Jun 8, 2023
@abhishekju06
Author

abhishekju06 commented Jun 8, 2023

Thanks a lot!

@WenjieDu
Owner

WenjieDu commented Jun 8, 2023

My pleasure! @abhishekju06 I just remembered that you can find such a sliding-window function among the data-processing utilities in the SAITS repo here. If you are using the SAITS model for your data imputation and think it's helpful, please kindly consider starring 🌟 the SAITS repo to help more people notice this useful model. Many thanks!

@abhishekju06
Author

abhishekju06 commented Jun 13, 2023

n_samples indicates how many samples are in your dataset. n_steps is the number of time steps in each sample. You can use a sliding-window algorithm to generate such a 3D dataset from your original 2D dataset.

Can you please help me understand, with an example:

  1. n_samples
  2. n_steps
  3. sequence length

========================
I have created the dataset:
2023-06-13 17:41:40,422 - Already masked out 10.0% values in train set
2023-06-13 17:41:40,475 - In val set, num of artificially-masked values: 7917.0
2023-06-13 17:41:40,475 - In test set, num of artificially-masked values: 7244.0
2023-06-13 17:41:40,476 - Feature num: 3,
7805 (0.936) samples in train set
281 (0.034) samples in val set
257 (0.031) samples in test set
2023-06-13 17:41:40,496 - All done.

====================
Below is my code:

import h5py

f = h5py.File('datasets.h5', 'r')
f.keys()

for key in f.keys():
    print(key)  # names of the root-level objects in the HDF5 file - can be groups or datasets
    print(type(f[key]))  # get the object type: usually group or dataset

#################
group_train = f['train']

for key in group_train.keys():
    print("Train:", key)

dataset_for_training = {
    "X": group_train['X'][()],
}

#############################
group_val = f['val']
for key in group_val.keys():
    print("Val:", key)

dataset_for_validating = {
    "X": group_val['X'][()],
    "X_intact": group_val['X_hat'][()],
    "indicating_mask": group_val['indicating_mask'][()],
}

#############################
group_test = f['test']
for key in group_test.keys():
    print("Test:", key)

dataset_for_testing = {
    "X": group_test['X'][()],
}

from pypots.optim import Adam
from pypots.imputation import SAITS

saits = SAITS(
    n_steps=100,  # physionet2012_dataset['n_steps']
    n_features=3,  # physionet2012_dataset['n_features']
    n_layers=2,
    d_model=256,
    d_inner=128,
    n_heads=4,
    d_k=64,
    d_v=64,
    dropout=0.1,
    attn_dropout=0.1,
    diagonal_attention_mask=True,  # otherwise the original self-attention mechanism will be applied
    # you can adjust the weights ORT_weight and MIT_weight to make the SAITS model focus more on one task.
    # Usually you can just leave them at the default values, i.e. 1.
    ORT_weight=1,
    MIT_weight=1,
    batch_size=32,
    # here we set epochs=10 for a quick demo; you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to early-stop the training if the evaluating loss doesn't decrease for 3 epochs.
    # You can leave it at the default None to disable early stopping.
    patience=3,
    # give the optimizer. Different from torch.optim.Optimizer, you don't have to specify the model's parameters
    # when initializing pypots.optim.Optimizer. You can also leave it at the default, which will initialize
    # an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # this num_workers argument is for torch.utils.data.DataLoader. It's the number of subprocesses to use for
    # data loading. Leaving it at the default 0 means data loading runs in the main process, i.e. no subprocesses.
    # You can increase it to >1 if you think data loading is a bottleneck for your model training speed.
    num_workers=1,
    # just leave it at the default; PyPOTS will automatically assign the best device for you.
    # Set it to 'cpu' if you don't have CUDA devices. You can also set it to 'cuda:0' or 'cuda:1'
    # if you have multiple CUDA devices.
    device='cuda',
    # set the path for saving tensorboard and trained model files
    saving_path="C:/Users/e264642/WFD_Projects/IITB/IITB_Code/pots/saits",
    # only save the best model after training has finished.
    # You can also set it to "better" to save models that perform better during training.
    model_saving_strategy="best",
)

# Training
saits.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

I am getting this error:
2023-06-13 17:54:11 [INFO]: Model initialized successfully with the number of trainable parameters: 1,321,802
2023-06-13 17:54:26 [INFO]: epoch 0: training loss 0.3622, validating loss nan
2023-06-13 17:54:42 [INFO]: epoch 1: training loss 0.2156, validating loss nan
2023-06-13 17:54:59 [INFO]: epoch 2: training loss 0.1777, validating loss nan
2023-06-13 17:54:59 [INFO]: Exceeded the training patience. Terminating the training procedure...
Traceback (most recent call last):

File "C:\Users\1234\AppData\Local\Temp\ipykernel_13552\1758113713.py", line 41, in
saits.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

File "C:\Users\1234\Anaconda3\envs\pypots\lib\site-packages\pypots\imputation\saits\model.py", line 420, in fit
self._train_model(training_loader, val_loader)

File "C:\Users\1234\Anaconda3\envs\pypots\lib\site-packages\pypots\imputation\base.py", line 352, in _train_model
if np.equal(self.best_loss.item(), float("inf")):

AttributeError: 'float' object has no attribute 'item'

Please help!

@abhishekju06 abhishekju06 reopened this Jun 13, 2023
@WenjieDu
Owner

According to the info you provided, I think the error is caused by the input data not being properly prepared. Please check whether there are NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"].
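For example, a quick check along these lines (a minimal sketch assuming the dataset_for_validating dictionary built in your code above) will show whether any NaNs slipped in:

```python
import numpy as np

# report whether each validation array still contains NaNs
for name in ("X", "X_intact", "indicating_mask"):
    arr = np.asarray(dataset_for_validating[name])
    print(name, "contains NaNs:", bool(np.isnan(arr).any()))
```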

@abhishekju06
Author

abhishekju06 commented Jun 13, 2023

Should there be NaNs present in my input dataset before I generate the dataset, i.e., can my attribute columns contain NaN values?

@WenjieDu
Owner

WenjieDu commented Jun 13, 2023

Datasets with missing values are fine, of course; PyPOTS is designed for datasets with missing data. But after generation, indicating_mask and X_intact should not have NaNs, and the missing part in X_intact should be imputed with some value like 0, because PyPOTS will use them for loss calculation. NaNs in indicating_mask or X_intact will result in a NaN loss, just like in your case.
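For example, something along these lines should give you NaN-free X_intact and indicating_mask (a minimal sketch assuming the pypots.data.mcar and masked_fill helpers shipped with PyPOTS releases around this time; the random X_3d is only a stand-in for your own 3D array):

```python
import numpy as np
from pypots.data import mcar, masked_fill  # assumed data-corruption helpers in PyPOTS of this era

# hypothetical 3D dataset that may already contain NaNs for genuinely missing values
X_3d = np.random.randn(257, 100, 3)
X_3d[np.random.rand(*X_3d.shape) < 0.05] = np.nan

# hold out 10% of the observed values as artificial missing values for validation
X_intact, X, missing_mask, indicating_mask = mcar(X_3d, 0.1)
X = masked_fill(X, 1 - missing_mask, np.nan)  # the model input keeps NaNs at unobserved positions

dataset_for_validating = {
    "X": X,                              # may contain NaNs, which is fine for the model input
    "X_intact": X_intact,                # missing spots filled with 0, no NaNs
    "indicating_mask": indicating_mask,  # 1 where values were artificially held out, no NaNs
}
assert not np.isnan(X_intact).any()
assert not np.isnan(indicating_mask).any()
```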

@abhishekju06
Author

According to the info you provided, I think the error is caused by the input data not being properly prepared. Please check whether there are NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"].

So NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"] are necessary, right?

@WenjieDu
Owner

Datasets with missing values are fine, of course; PyPOTS is designed for datasets with missing data. But after generation, indicating_mask and X_intact should not have NaNs, and the missing part in X_intact should be imputed with some value like 0, because PyPOTS will use them for loss calculation. NaNs in indicating_mask or X_intact will result in a NaN loss, just like in your case.

Sorry for missing a not in my last reply, just fixed it: "after generation, indicating_mask and X_intact should not have NaNs".

@WenjieDu WenjieDu changed the title How can I customize my own dataset to fit PyPots SOA imputation models? How can I customize my own dataset to fit PyPOTS SOTA imputation models? Jun 13, 2023
@abhishekju06
Author

According to the info you provided, I think the error is caused by the input data not being properly prepared. Please check whether there are NaNs in dataset_for_validating["indicating_mask"] and dataset_for_validating["X_intact"].

I have replaced NaN values with 0 in the 'indicating_mask'.
I guess X_hat refers to X_intact. Thus, I have set it as the value for the key X_intact:

dataset_for_validating = {
    "X": group_val['X'][()],
    "X_intact": group_val['X_hat'][()],
    "indicating_mask": group_val['indicating_mask'][()],
}

If not, please let me know what X_intact stands for.

In SAITS/dataset_generating_scripts/data_processing_utils.py, line 86 reads:

X_hat[indices_for_holdout] = np.nan  # X_hat contains artificial missing values

It is evident that X_hat must contain NaNs, as it represents artificial missing values.

So where am I going wrong?

@WenjieDu
Owner

Please read the paper first https://arxiv.org/abs/2202.08516. Thanks.

@abhishekju06
Author

Hi,
Please don't get me wrong. I have read the paper.
I just want to be clear about how the notation in the paper corresponds to the notation in the code:
X cap
M cap
X tilde
I

Without your help it is not possible for me to understand.

@WenjieDu
Owner

Please read it carefully and take a look at the model's implementation code here for reference.

@abhishekju06
Author

Thanks a ton!

@WenjieDu
Owner

No problem. If you have further questions regarding the SAITS model, you're welcome to raise issues in the SAITS repo: https://github.com/WenjieDu/SAITS/issues.
