
Using CSV files versus h5 data #1

Closed
Niharikajo opened this issue Apr 4, 2022 · 16 comments

Comments

@Niharikajo

Hello Wenjie,

Thank you for releasing the code; I have a couple of questions. I am trying to run the code on the Air Quality dataset in Google Colab. These are some of my doubts:

  1. !CUDA_VISIBLE_DEVICES=2 python run_models.py --config_path configs/AirQuality_SAITS_best.ini
    Running this gives me the following error message.
    OSError: Unable to open file (unable to open file: name = 'dataset_generating_scripts/RawData/AirQuality/PRSA_Data_20130301-20170228/datasets.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

All the datasets are in .csv format.
1a) Is there an option to use the default .csv data?
1b) How do I convert .csv to h5 format?

  2. Where should we change the file path of the dataset for training purposes? As in the file configs/AirQuality_SAITS_best.ini?

Please do let me know, thanks.

Niharika

@WenjieDu
Owner

WenjieDu commented Apr 4, 2022

Hi @Niharikajo, thank you for submitting the first issue of this repo!

Before starting your model training, you have to generate the datasets first; this is why you got the error. The generated datasets are not included in this repo due to their large sizes, but the dataset downloading and preprocessing scripts are available in the dir SAITS/dataset_generating_scripts. Please check them out there.

For your questions:
1a) Is there an option to use the default .csv data?
Answer: No, this framework does not offer such an option, and I am sorry about that. This repository was created to help researchers/engineers like you reproduce the results in the paper, and all necessary steps and code are included. For your specific requirements, you can fork this repo and customize the workflow. Besides, I am currently working on another project, PyPOTS (a Python Toolbox for Data Mining on Partially-Observed Time Series), that may interest you.

1b) How do I convert .csv to h5 format?
Answer: You can refer to the preprocessing script for the AirQuality dataset here: SAITS/dataset_generating_scripts/gene_UCI_BeijingAirQuality_dataset.py.
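
For a rough idea of the conversion step itself, here is a minimal sketch (not the repo's exact preprocessing) of serializing CSV data into an HDF5 file with pandas and h5py; the file name, column list, and group layout are hypothetical placeholders:

```python
import h5py
import numpy as np
import pandas as pd

# Hypothetical illustration of the CSV -> HDF5 step only. The real script also
# normalizes the data, slices it into fixed-length sequences, splits the data
# into train/val/test, and adds artificial missingness.
df = pd.read_csv("some_air_quality_station.csv")            # placeholder file name
feature_cols = ["PM2.5", "PM10", "SO2", "NO2", "CO", "O3"]  # placeholder feature subset
X = df[feature_cols].to_numpy(dtype=np.float32)

with h5py.File("datasets.h5", "w") as hf:
    train_group = hf.create_group("train")   # the real generated file also contains val/test splits
    train_group.create_dataset("X", data=X)
```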

2. Where should we change the file path of the dataset for training purposes? As in the file configs/AirQuality_SAITS_best.ini?
Answer: Yes. You should give the dataset path in the model configuration file. Specifically, set dataset_base_dir in section [file_path] and give dataset_name (i.e. the dir name of the dataset under dataset_base_dir) in section [dataset].
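
For reference, a minimal sketch of what those two sections might look like; the option names dataset_base_dir and dataset_name come from the answer above, while the values are placeholders to adapt to your own generated data:

```ini
; illustrative values only; adapt the paths to where your generated datasets actually live
[file_path]
dataset_base_dir = generated_datasets

[dataset]
; the dir name of the dataset under dataset_base_dir
dataset_name = AirQuality
```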

I hope my answers are satisfactory to you. If you have any further questions, please let me know. Thanks.

@Niharikajo
Author

Thank you Wenjie, I greatly appreciate your prompt reply.

I was able to run the training script without any issues.

I encountered the following error while running the testing script:

configparser.InterpolationMissingOptionError: Bad value substitution: option 'model_path' in section 'test' contains an interpolation key 'file_path:model_saving_dir' which is not a valid option name. Raw value: '${file_path:model_saving_dir}/${step_527}'

This is the command I used for testing:

!CUDA_VISIBLE_DEVICES=0 python run_models.py --config_path configs/AirQuality_SAITS_best.ini --test_mode

The AirQuality_SAITS_best.ini file I am using can be found here: https://www.dropbox.com/s/4u1br06x7dxh20v/AirQuality_SAITS_best.ini?dl=0

If possible, please let me know how to rectify this issue, thanks.

Best regards
Niharika

@WenjieDu
Owner

WenjieDu commented Apr 5, 2022

The error indicates that your model saving path does not exist. You need to double-check the path model_saving_dir under section [file_path] and your saved model for testing, model_path, under section [test] in the config file.
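
For example (a hedged sketch, since I cannot see your full config): the raw value in your error still contains the ${step_527} placeholder, so a working [test] entry should point to a checkpoint file that actually exists under model_saving_dir; the filename below is hypothetical:

```ini
[file_path]
; this option must exist here for ${file_path:model_saving_dir} to resolve
model_saving_dir = saved_models

[test]
; replace the ${step_527} placeholder with an actual checkpoint saved during training
; (the filename below is hypothetical)
model_path = ${file_path:model_saving_dir}/model_trainStep_527
```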

@Niharikajo
Author

Thank you Wenjie for your reply, it was very helpful.
I was able to train and test the model successfully on the Air Quality and Electricity datasets.

I had a question regarding data preprocessing:

  • Is there a way to generate datasets with missingness in only a few columns/features?

Please do let me know, Thanks.

Best Regards
Niharika

@WenjieDu
Owner

WenjieDu commented Apr 7, 2022

The scripts do not offer an option like this, but you can customize the preprocessing workflow to specify the missing pattern. One function you should pay attention to is add_artificial_mask in data_processing_utils.py#L30, which is used to add artificial missingness to the datasets.
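
As an illustration only (not the repo's API), here is a minimal numpy sketch of adding artificial missingness to selected feature columns, the kind of logic you could adapt inside your customized preprocessing:

```python
import numpy as np

def mask_selected_features(X, feature_indices, missing_rate=0.1, seed=0):
    """Randomly introduce artificial missingness, but only in the given feature columns.

    X is assumed to have shape [n_samples, seq_len, n_features]. This is a
    hypothetical helper for illustration, not a function from the SAITS codebase.
    """
    rng = np.random.default_rng(seed)
    X = X.copy()
    for f in feature_indices:
        col = X[:, :, f].reshape(-1)               # flatten this feature across samples/steps
        observed = np.flatnonzero(~np.isnan(col))  # only mask currently observed values
        n_to_mask = int(len(observed) * missing_rate)
        chosen = rng.choice(observed, size=n_to_mask, replace=False)
        col[chosen] = np.nan
        X[:, :, f] = col.reshape(X.shape[0], X.shape[1])  # write the masked column back
    return X
```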

@Niharikajo
Author

Thank you Wenjie for your reply, it was of great help.

I had a few doubts:

  1. a) How is the sequence length (seq_len) determined?
    b) Is it similar to tokens in Natural Language Processing?
    For example, if I have the data [1 2 3 4 5 6 7 8] and 2, 6 are missing,
    and if I use seq_len=2, will it be [1 nan 3 4] and [5 nan 7 8]?

  2. What are the different model_trainstep files we obtain as output after training?

  3. Is there a way to extract the data from the final imputation file (imputations.h5) so that imputed vs. original data can be plotted?

Please do let me know, Thanks.

Best Regards
Niharika

@WenjieDu
Owner

WenjieDu commented Apr 9, 2022

Answer to Q1.a: The sequence length depends on your application scenario, and there is a trade-off here. A longer sequence provides more context information, but if it is too long, it may be hard for the model to capture all of that context and use it to make proper imputations, and vice versa. Although you can enlarge the model to increase its capacity, here I am talking about architectural limitations: for instance, the RNN-based BRITS¹ model uses a bidirectional architecture to enhance its ability to capture context information, but using self-attention for imputation is a more natural way.

Answer to Q1.b: It is somewhat like tokens in NLP. But in your example, if you want the output to be [1 nan 3 4] and [5 nan 7 8], your sequence length should be 4 and your sliding window size should be 4 as well.
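
To make that concrete, a hypothetical sketch (not the repo's windowing code) of cutting the series into non-overlapping windows with seq_len = 4 and step = 4:

```python
import numpy as np

series = np.array([1, np.nan, 3, 4, 5, np.nan, 7, 8])
seq_len = window_step = 4   # non-overlapping windows when the step equals seq_len

windows = [series[i:i + seq_len] for i in range(0, len(series) - seq_len + 1, window_step)]
# windows -> [array([ 1., nan,  3.,  4.]), array([ 5., nan,  7.,  8.])]
```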

Answer to Q2: They are saved models. During training, whenever the model achieves better validation performance than before, it is saved to a disk file, just as you saw.

Answer to Q3: Yes. .h5 files are just data serialized in the HDF5 format. You only need to read the data with a library like h5py, then manipulate it as you want.
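
For example, a minimal sketch with h5py and matplotlib; the dataset key name is an assumption, so inspect the file's keys first and adjust:

```python
import h5py
import matplotlib.pyplot as plt

with h5py.File("imputations.h5", "r") as hf:
    print(list(hf.keys()))               # inspect what was actually saved
    imputed = hf["imputed_test_set"][:]  # assumed key name, adjust to your file

# Plot one feature of one imputed sample; the original observations from your
# generated datasets.h5 can be loaded and overlaid the same way.
sample_idx, feature_idx = 0, 0
plt.plot(imputed[sample_idx, :, feature_idx], label="imputed")
plt.xlabel("time step")
plt.legend()
plt.show()
```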

Hope my answers can help you.

Footnotes

  1. Wei Cao et al. BRITS: Bidirectional Recurrent Imputation for Time Series. NeurIPS 2018.

@Niharikajo
Author

Thank you Wenjie for your reply, the answers were very insightful.

@WenjieDu WenjieDu pinned this issue Apr 11, 2022
@Rdfing

Rdfing commented Apr 11, 2022

> sliding window size

Wenjie,

In your SAITS model, what is your sliding window size? Or does SAITS need sliding window size at all?

Thanks,
Haochen

@WenjieDu
Owner

Hi Haochen (@Rdfing),

Thanks for asking.
The sliding window sizes of the Air-Quality and Electricity datasets are the same as their sequence lengths. For the PhysioNet-2012 dataset, each sample is from one patient, so I set its sequence length to 48. Shorter samples are padded, and longer ones are truncated. You can refer to the data preprocessing scripts for more details.
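
For illustration only (not the repo's exact code), padding/truncation to a fixed length of 48 steps could look like the sketch below; padding with NaN (to be treated as missing) is an assumption:

```python
import numpy as np

def pad_or_truncate(sample, seq_len=48):
    """Truncate samples longer than seq_len; pad shorter ones with NaN rows.

    `sample` is assumed to have shape [n_steps, n_features]. Hypothetical helper
    for illustration, not taken from the SAITS preprocessing scripts.
    """
    if sample.shape[0] >= seq_len:
        return sample[:seq_len]
    padding = np.full((seq_len - sample.shape[0], sample.shape[1]), np.nan)
    return np.concatenate([sample, padding], axis=0)
```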

@Rdfing

Rdfing commented Apr 12, 2022

Wenjie,

Thank you for the clarification.

Haochen

@Niharikajo
Author

Niharikajo commented Apr 16, 2022

Hello Wenjie,

I had a few doubts about the paper:

  1. What is the difference between the SAITS model and the SAITS (base) model?
    My understanding is that a single DMSA block is used in SAITS (base) and two DMSA blocks are used in the SAITS model.

  2. In Equation 4, Section 3.2.1 Diagonally-Masked Self-Attention (DMSA):
    what are WO, WQ, WK, WV, and why are they multiplied with x in DiagMaskedSelfAttention?

  3. What is the source for the Transformer method used in the performance comparison?

Please do let me know, Thanks.

Best Regards
Niharika

@WenjieDu
Owner

Hi Niharika,

The answers to your questions can be found in the original SAITS paper.

A1: The difference between SAITS and SAITS (base) is that the hyper-parameters of SAITS (base) are fixed to form a base model, while those of SAITS are tuned during training to obtain the best performance it can.

A2: WO, WQ, WK, WV here are the parameters of the projection layers, which project the representation into latent spaces with higher/lower dimensions. WQ, WK, WV project x to form Q, K, V, and WO projects the output of the MHA back to the space with d_model dimensions. SAITS applies the same strategy here as the original paper proposing self-attention¹.
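
In the standard self-attention notation (which SAITS follows), the projections act roughly as below; this is a sketch rather than a verbatim copy of Equation 4:

$$Q = xW^Q, \qquad K = xW^K, \qquad V = xW^V$$

$$\mathrm{DiagMaskedSelfAttention}(x) = \mathrm{softmax}\!\left(\mathrm{DiagMask}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)\right)V$$

Here DiagMask sets the diagonal entries of the attention matrix to negative infinity before the softmax, so each time step cannot attend to its own input, and WO projects the concatenated multi-head outputs back to d_model dimensions.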

Footnotes

  1. Ashish Vaswani et al. Attention Is All You Need. https://arxiv.org/abs/1706.03762

@Niharikajo
Author

Thank you for the answers, Wenjie.

@Niharikajo
Author

Hello Wenjie,

While I was running the BRITS model, I had a doubt.
In brits.py, two models are given: BRITS and RITS.
I tried running the RITS model.

  1. We calculate reconstruction loss, imputation loss, and consistency loss for the BRITS model (brits.py line 182).
    1a. What is consistency loss?
    1b. Why do we not calculate it in the RITS model?

  2. Also, why is imputation loss not calculated for the RITS model?

Please do let me know, Thanks.

Best Regards
Niharika

@WenjieDu
Owner

Hi Niharika,

BRITS consists of two RITS, and a single RITS is not exposed for use in PyPOTS (I will consider exposing it in the near future), though you can refer to its implementation. I strongly recommend you read the original BRITS paper in detail to figure out your questions 1a and 1b.

A kind reminder: your questions are not related to this issue. You can create new issues for your new questions, which helps others who have similar questions find answers (creating proper issues is absolutely a kind of contribution). Or send me your questions by email. I always try my best to help. Thank you. 😃

@WenjieDu WenjieDu unpinned this issue Apr 27, 2022