
Using CSV files versus h5 data #1

Closed
Niharikajo opened this issue Apr 4, 2022 · 16 comments

Comments

@Niharikajo

Hello Wenjie,

Thank you for releasing the code; I have a couple of questions. I am trying to run the code on the Air Quality dataset in Google Colab. These are some of my doubts:

  1. !CUDA_VISIBLE_DEVICES=2 python run_models.py --config_path configs/AirQuality_SAITS_best.ini
    Running this gives me the following error message.
    OSError: Unable to open file (unable to open file: name = 'dataset_generating_scripts/RawData/AirQuality/PRSA_Data_20130301-20170228/datasets.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

All the datasets are in .csv format.
1a) Is there an option to use the default .csv data?
1b) How do I convert .csv to h5 format?

  2. Where should we change the file path of the dataset for training purposes? As in the file configs/AirQuality_SAITS_best.ini?

Please do let me know, thanks.

Niharika

@WenjieDu
Owner

WenjieDu commented Apr 4, 2022

Hi @Niharikajo, thank you for submitting the first issue of this repo!

Before starting your model training, you have to generate the datasets first; this is why you got the error. The generated datasets are not included in this repo due to their large sizes, but the dataset downloading and preprocessing scripts are available in the dir SAITS/dataset_generating_scripts. Please check them out there.

For your questions:
1a) Is there an option to use the default .csv data?
Answer: No, this framework does not offer such an option, and I am sorry about that. This repository was created to help researchers/engineers like you reproduce the results in the paper, and all necessary steps and code are included. For your specific requirements, you can fork this repo and customize the workflow. Besides, I am currently working on another project, PyPOTS (a Python Toolbox for Data Mining on Partially-Observed Time Series), that may interest you.

1b) How do I convert .csv to h5 format?
Answer: You can refer to the preprocessing script for the AirQuality dataset here: SAITS/dataset_generating_scripts/gene_UCI_BeijingAirQuality_dataset.py.
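
For a rough idea of the conversion step itself, here is a minimal sketch (not the repo's exact preprocessing) of serializing CSV data into an HDF5 file with pandas and h5py; the file name, column list, and group layout are hypothetical placeholders:

```python
import h5py
import numpy as np
import pandas as pd

# Hypothetical illustration of the CSV -> HDF5 step only. The real script also
# normalizes the data, slices it into fixed-length sequences, splits the data
# into train/val/test, and adds artificial missingness.
df = pd.read_csv("some_air_quality_station.csv")            # placeholder file name
feature_cols = ["PM2.5", "PM10", "SO2", "NO2", "CO", "O3"]  # placeholder feature subset
X = df[feature_cols].to_numpy(dtype=np.float32)

with h5py.File("datasets.h5", "w") as hf:
    train_group = hf.create_group("train")   # the real generated file also contains val/test splits
    train_group.create_dataset("X", data=X)
```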

2. Where should we change the file path of the dataset for training purposes? As in the file configs/AirQuality_SAITS_best.ini?
Answer: Yes. You should give the dataset path in the model configuration file. Specifically, set dataset_base_dir in section [file_path] and give dataset_name (i.e. the dir name of the dataset under dataset_base_dir) in section [dataset].
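
For reference, a minimal sketch of what those two sections might look like; the option names dataset_base_dir and dataset_name come from the answer above, while the values are placeholders to adapt to your own generated data:

```ini
; illustrative values only; adapt the paths to where your generated datasets actually live
[file_path]
dataset_base_dir = generated_datasets

[dataset]
; the dir name of the dataset under dataset_base_dir
dataset_name = AirQuality
```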

I hope my answers are satisfactory to you. If you have any further questions, please let me know. Thanks.

@Niharikajo
Author

Thank you Wenjie, I greatly appreciate your prompt reply.

I was able to run the training script without any issues.

I encountered the following error while running the testing script:

configparser.InterpolationMissingOptionError: Bad value substitution: option 'model_path' in section 'test' contains an interpolation key 'file_path:model_saving_dir' which is not a valid option name. Raw value: '${file_path:model_saving_dir}/${step_527}'

This is the command I used for testing:

!CUDA_VISIBLE_DEVICES=0 python run_models.py --config_path configs/AirQuality_SAITS_best.ini --test_mode

The AirQuality_SAITS_best.ini file I am using can be found here: https://www.dropbox.com/s/4u1br06x7dxh20v/AirQuality_SAITS_best.ini?dl=0

If possible, please let me know how to rectify this issue, thanks.

Best regards
Niharika

@WenjieDu
Owner

WenjieDu commented Apr 5, 2022

The error indicates that your model saving path does not exist. You need to double-check the path model_saving_dir under section [file_path] and your saved model for testing, model_path, under section [test] in the config file.
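
For example (a hedged sketch, since I cannot see your full config): the raw value in your error still contains the ${step_527} placeholder, so a working [test] entry should point to a checkpoint file that actually exists under model_saving_dir; the filename below is hypothetical:

```ini
[file_path]
; this option must exist here for ${file_path:model_saving_dir} to resolve
model_saving_dir = saved_models

[test]
; replace the ${step_527} placeholder with an actual checkpoint saved during training
; (the filename below is hypothetical)
model_path = ${file_path:model_saving_dir}/model_trainStep_527
```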

@Niharikajo
Author

Thank you Wenjie for your reply, it was very helpful.
I was able to train and test the model successfully on the Air Quality and Electricity datasets.

I had a question regarding data preprocessing:

  • Is there a way to generate datasets with missingness in only a few columns/features?

Please do let me know, Thanks.

Best Regards
Niharika

@WenjieDu
Owner

WenjieDu commented Apr 7, 2022

The scripts do not offer an option like this, but you can customize the preprocessing workflow to specify the missing pattern. One function you should pay attention to is add_artificial_mask in data_processing_utils.py#L30, which is used to add artificial missingness to the datasets.
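
As an illustration only (not the repo's API), here is a minimal numpy sketch of adding artificial missingness to selected feature columns, the kind of logic you could adapt inside your customized preprocessing:

```python
import numpy as np

def mask_selected_features(X, feature_indices, missing_rate=0.1, seed=0):
    """Randomly introduce artificial missingness, but only in the given feature columns.

    X is assumed to have shape [n_samples, seq_len, n_features]. This is a
    hypothetical helper for illustration, not a function from the SAITS codebase.
    """
    rng = np.random.default_rng(seed)
    X = X.copy()
    for f in feature_indices:
        col = X[:, :, f].reshape(-1)               # flatten this feature across samples/steps
        observed = np.flatnonzero(~np.isnan(col))  # only mask currently observed values
        n_to_mask = int(len(observed) * missing_rate)
        chosen = rng.choice(observed, size=n_to_mask, replace=False)
        col[chosen] = np.nan
        X[:, :, f] = col.reshape(X.shape[0], X.shape[1])  # write the masked column back
    return X
```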

@Niharikajo
Author

Thank you Wenjie for your reply, it was of great help.

I had a few doubts:

  1. a) How is the sequence length (seq_len) determined?
    b) Is it similar to tokens in Natural Language Processing?
    For example, if I have the data [1 2 3 4 5 6 7 8] and 2, 6 are missing,
    and if I use seq_len=2, will it be [1 nan 3 4] and [5 nan 7 8]?

  2. What are the different model_trainstep files we obtain as output after training?

  3. Is there a way to extract the data from the final imputation file (imputations.h5) so that imputed vs. original data can be plotted?

Please do let me know, Thanks.

Best Regards
Niharika

@WenjieDu
Owner

WenjieDu commented Apr 9, 2022

Answer to Q1.a: The sequence length depends on your application scenario, and there is a trade-off here. A longer sequence provides more context information, but if it is too long, it may be hard for the model to capture all of that context and use it to make proper imputations, and vice versa. Although you can enlarge the model to increase its capacity, here I am talking about architectural limitations: for instance, the RNN-based BRITS¹ model uses a bidirectional architecture to enhance its ability to capture context information, but using self-attention for imputation is a more natural way.

Answer to Q1.b: It is somewhat like tokens in NLP. But in your example, if you want the output to be [1 nan 3 4] and [5 nan 7 8], your sequence length should be 4 and your sliding window size should be 4 as well.
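
To make that concrete, a hypothetical sketch (not the repo's windowing code) of cutting the series into non-overlapping windows with seq_len = 4 and step = 4:

```python
import numpy as np

series = np.array([1, np.nan, 3, 4, 5, np.nan, 7, 8])
seq_len = window_step = 4   # non-overlapping windows when the step equals seq_len

windows = [series[i:i + seq_len] for i in range(0, len(series) - seq_len + 1, window_step)]
# windows -> [array([ 1., nan,  3.,  4.]), array([ 5., nan,  7.,  8.])]
```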

Answer to Q2: They are saved models. During training, whenever the model achieves better validation performance than before, it is saved to a disk file, just as you saw.

Answer to Q3: Yes. .h5 files are just data serialized in the HDF5 format. You only need to read the data with a library like h5py, then manipulate it as you want.
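
For example, a minimal sketch with h5py and matplotlib; the dataset key name is an assumption, so inspect the file's keys first and adjust:

```python
import h5py
import matplotlib.pyplot as plt

with h5py.File("imputations.h5", "r") as hf:
    print(list(hf.keys()))               # inspect what was actually saved
    imputed = hf["imputed_test_set"][:]  # assumed key name, adjust to your file

# Plot one feature of one imputed sample; the original observations from your
# generated datasets.h5 can be loaded and overlaid the same way.
sample_idx, feature_idx = 0, 0
plt.plot(imputed[sample_idx, :, feature_idx], label="imputed")
plt.xlabel("time step")
plt.legend()
plt.show()
```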

Hope my answers can help you.

Footnotes

  1. Wei Cao et al. BRITS: Bidirectional Recurrent Imputation for Time Series. NeurIPS 2018.

@Niharikajo
Author

Thank you Wenjie for your reply, the answers were very insightful.

@WenjieDu WenjieDu pinned this issue Apr 11, 2022
@Rdfing

Rdfing commented Apr 11, 2022

> sliding window size

Wenjie,

In your SAITS model, what is your sliding window size? Or does SAITS need sliding window size at all?

Thanks,
Haochen

@WenjieDu
Owner

Hi Haochen (@Rdfing),

Thanks for asking.
The sliding window sizes of the Air-Quality and Electricity datasets are the same as their sequence lengths. For the PhysioNet-2012 dataset, each sample is from one patient, so I set its sequence length to 48. Shorter samples are padded, and longer ones are truncated. You can refer to the data preprocessing scripts for more details.
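
For illustration only (not the repo's exact code), padding/truncation to a fixed length of 48 steps could look like the sketch below; padding with NaN (to be treated as missing) is an assumption:

```python
import numpy as np

def pad_or_truncate(sample, seq_len=48):
    """Truncate samples longer than seq_len; pad shorter ones with NaN rows.

    `sample` is assumed to have shape [n_steps, n_features]. Hypothetical helper
    for illustration, not taken from the SAITS preprocessing scripts.
    """
    if sample.shape[0] >= seq_len:
        return sample[:seq_len]
    padding = np.full((seq_len - sample.shape[0], sample.shape[1]), np.nan)
    return np.concatenate([sample, padding], axis=0)
```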

@Rdfing

Rdfing commented Apr 12, 2022

Wenjie,

Thank you for the clarification.

Haochen

@Niharikajo
Author

Niharikajo commented Apr 16, 2022

Hello Wenjie,

I had a few doubts about the paper:

  1. What is the difference between the SAITS model and the SAITS (base) model?
    My understanding is that a single DMSA block is used in SAITS (base) and two DMSA blocks are used in the SAITS model.

  2. In Equation 4, Section 3.2.1 Diagonally-Masked Self-Attention (DMSA):
    what are WO, WQ, WK, WV, and why are they multiplied with x in DiagMaskedSelfAttention?

  3. What is the source for the Transformer method used in the performance comparison?

Please do let me know, Thanks.

Best Regards
Niharika

@WenjieDu
Owner

Hi Niharika,

The answers to your questions can be found in the original SAITS paper.

A1: The difference between SAITS and SAITS (base) is that the hyper-parameters of SAITS (base) are fixed to form a base model, while those of SAITS are tuned during training to obtain the best performance it can.

A2: WO, WQ, WK, WV here are the parameters of the projection layers, which project the representation into latent spaces with higher/lower dimensions. WQ, WK, WV project x to form Q, K, V, and WO projects the output of the MHA back to the space with d_model dimensions. SAITS applies the same strategy here as the original paper proposing self-attention¹.
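
In the standard self-attention notation (which SAITS follows), the projections act roughly as below; this is a sketch rather than a verbatim copy of Equation 4:

$$Q = xW^Q, \qquad K = xW^K, \qquad V = xW^V$$

$$\mathrm{DiagMaskedSelfAttention}(x) = \mathrm{softmax}\!\left(\mathrm{DiagMask}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)\right)V$$

Here DiagMask sets the diagonal entries of the attention matrix to negative infinity before the softmax, so each time step cannot attend to its own input, and WO projects the concatenated multi-head outputs back to d_model dimensions.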

Footnotes

  1. Ashish Vaswani et al. Attention Is All You Need. https://arxiv.org/abs/1706.03762

@Niharikajo
Author

Thank you for the answers, Wenjie.

@Niharikajo
Author

Hello Wenjie,

While I was running the BRITS model, I had a doubt.
In brits.py, two models are given: BRITS and RITS.
I tried running the RITS model.

  1. We calculate reconstruction loss, imputation loss, and consistency loss for the BRITS model (brits.py line 182).
    1a. What is consistency loss?
    1b. Why do we not calculate it in the RITS model?

  2. Also, why is imputation loss not calculated for the RITS model?

Please do let me know, Thanks.

Best Regards
Niharika

@WenjieDu
Owner

Hi Niharika,

BRITS consists of two RITS, and a single RITS is not exposed for use in PyPOTS (I will consider exposing it in the near future), though you can refer to its implementation. I strongly recommend you read the original BRITS paper in detail to figure out your questions 1a and 1b.

A kind reminder: your questions are not related to this issue. You can create new issues for your new questions, which helps others who have similar questions find answers (creating proper issues is absolutely a kind of contribution). Or send me your questions by email. I always try my best to help. Thank you. 😃

@WenjieDu WenjieDu unpinned this issue Apr 27, 2022