
window truncate function #2

Closed
Rdfing opened this issue Apr 12, 2022 · 3 comments

Comments


Rdfing commented Apr 12, 2022

import numpy as np

def window_truncate(feature_vectors, seq_len):
    """ Generate time series samples by truncating non-overlapping windows
    of a given sequence length from the time-series data.
    Parameters
    ----------
    feature_vectors: time series data, len(shape)=2, [total_length, feature_num]
    seq_len: sequence length of each generated sample
    """
    # start index of each non-overlapping window; trailing steps that
    # do not fill a complete window are discarded
    start_indices = np.arange(feature_vectors.shape[0] // seq_len) * seq_len
    sample_collector = []
    for idx in start_indices:
        sample_collector.append(feature_vectors[idx: idx + seq_len])

    return np.asarray(sample_collector).astype('float32')
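
For reference, on toy data (my own example, not from the repo) the function behaves like this:

data = np.random.randn(1000, 10)             # toy series: 1000 steps, 10 features
samples = window_truncate(data, seq_len=24)
print(samples.shape)                         # (41, 24, 10); the trailing 16 steps are discarded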

Wenjie,

I have some questions, if you don't mind clarifying:

  1. In the implementation, is the training data generated by dividing the time series into segments based on the sequence length?
  2. What is the advantage of this training-data configuration over the sliding-window approach, e.g., generating the training set with a one-time-step lag: [t-n, t-n+1, ..., t], [t-n+1, t-n+2, ..., t+1], [t-n+2, t-n+3, ..., t+2]? Wouldn't the sliding-window approach generate more training samples (see the sketch after this list)?
  3. I am not very familiar with the Transformer architecture. In a typical RNN-based imputation method, there are the concepts of sequence length (i.e., the length of historical or future data used as input) and prediction horizon (i.e., how far into the future or the past the model tries to impute). For SAITS, what are the equivalent concepts, and does the notion of a prediction horizon exist at all?
  4. I understand from your paper that the sequence length is fixed across models for comparison purposes. How does the sequence length affect imputation accuracy? How would you recommend determining an appropriate sequence length for the problem at hand?
  5. An unrelated question: does PyPOTS currently work with the Air Quality dataset?

Thanks in advance,
Haochen
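
For concreteness, here is a quick sketch of the one-time-step-lag generation I mean in question 2 (my own illustration, not code from this repo):

import numpy as np

def sliding_window(feature_vectors, seq_len, stride=1):
    """Generate overlapping samples by sliding a window of length seq_len
    over the series with the given stride; stride=1 gives a one-time-step lag."""
    total_length = feature_vectors.shape[0]
    start_indices = range(0, total_length - seq_len + 1, stride)
    samples = [feature_vectors[idx: idx + seq_len] for idx in start_indices]
    return np.asarray(samples).astype('float32')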

@WenjieDu WenjieDu pinned this issue Apr 12, 2022
WenjieDu (Owner) commented

Hi Haochen,

Here are my answers to your questions:

A1: Yes, the dataset is generated by dividing the given feature_vectors into non-overlapping samples of sequence length seq_len.

A2: Yes, the sliding-window approach generates more samples, but also more duplicated information between samples, which may help the model obtain better performance, or in other words, make it easier for the model to capture the context information. In an industrial application scenario, one has to think about the sliding stride carefully (there is a trade-off: a small stride brings more samples but may slow down training due to the larger dataset size, while a bigger stride generates fewer samples). However, in the settings of my paper, my purpose is to evaluate all models across datasets fairly, so I don't have to care much about the stride (you can take the function window_truncate here as a special case of the sliding-window approach, with the stride equal to the sample sequence length).
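
To make the trade-off concrete with toy numbers (hypothetical sizes, just for illustration):

total_length, seq_len = 10_000, 100    # hypothetical sizes
print(total_length // seq_len)         # non-overlapping truncation: 100 samples
print(total_length - seq_len + 1)      # stride-1 sliding window: 9901 samples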

A3: For SAITS, the sequence length does exist: it is the length (number of time steps) of the input feature vectors. Regarding the prediction horizon (how far in the future or the past the model tries to impute), I would say SAITS imputes missing values within the temporal interval of the given data samples (i.e., within the sequence length). I don't fully understand the concepts of sequence length and prediction horizon as you use them here. Can BRITS be taken as a typical RNN-based imputation method? If so, could you use it as an example to explain?

A4: Please refer to my answer to Q1.a in Issue#1. Regarding how to determine an appropriate sequence length for the problem at hand, it is tricky. You can refer to literature in the same domain as your problem to see how other researchers choose the length. Or you can treat the sequence length as a hyperparameter and tune it on the same model to see which value brings better performance, e.g. with a loop like the sketch below.
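
For example, a minimal tuning loop could look like this (window_truncate is the function above; train_and_evaluate is a hypothetical placeholder that trains a model on the samples and returns a validation imputation error):

candidate_lengths = [24, 48, 96, 192]                # hypothetical candidates
results = {}
for seq_len in candidate_lengths:
    samples = window_truncate(feature_vectors, seq_len)
    results[seq_len] = train_and_evaluate(samples)   # hypothetical helper
best_seq_len = min(results, key=results.get)         # lowest validation error wins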

A5: Yes, the Air Quality dataset is included in PyPOTS. Actually, PyPOTS uses the project TSDB as a module, so all datasets listed in TSDB can be automatically downloaded, extracted, and (simply) preprocessed by PyPOTS. The simple preprocessing just organizes the dataset well; I want to leave further preprocessing choices to users. Please let me know if you have any further questions.
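
For example (function names here follow the TSDB README as I recall it; the API may change, so please check the current docs):

import tsdb

print(tsdb.list_available_datasets())                      # datasets TSDB can fetch
data = tsdb.load_dataset("beijing_multisite_air_quality")  # download, extract, preprocess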

Hope my answers can help you.

Rdfing (Author) commented Apr 12, 2022

Wenjie,

Thank you for your insights and clarification.

Follow-up to your A3: we can use an LSTM for univariate forecasting as an example. Given x[t-n], x[t-n+1], ..., x[t], the LSTM can be trained to predict x[t+H], where n is the sequence length (i.e., the amount of history provided) and H is the prediction horizon (i.e., how far into the future we want to predict). The intuition is that the further into the future we want to predict (i.e., the larger H), the harder it gets.
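
To fix the notation in code (a minimal NumPy sketch of my own):

import numpy as np

def make_forecast_pairs(series, n, H):
    """Build (history, target) pairs from a 1-D series: each input is the
    n-step window series[t-n+1 : t+1] and the target is series[t + H]."""
    X, y = [], []
    for t in range(n - 1, len(series) - H):
        X.append(series[t - n + 1: t + 1])
        y.append(series[t + H])
    return np.asarray(X), np.asarray(y)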

I am dealing with a dataset that is much more challenging than the Beijing air quality dataset. The Beijing air quality dataset has simple periodic patterns, and most features are correlated. My dataset is more chaotic and also has monotone missing-data patterns (i.e., data are continuously missing for a period of time). I guess that if I want to apply SAITS to my case, I may need a much larger sequence length so that the continuously missing period only constitutes a small fraction of each sequence. I will follow your suggestion and experiment with different hyperparameters for my case.

Thank you again for developing such a great tool; I am very much looking forward to the impact of PyPOTS!
Haochen

WenjieDu (Owner) commented

The follow-up to Q3: what you are describing is the forecasting case rather than imputation, right? There, the prediction horizon is how far ahead the model forecasts.

Regarding your project, I think you can change the artificial missing pattern used for SAITS training. In the original work of SAITS, MIT (masked imputation task) randomly masks out some portion of the observed values and holds them out to calculate the MIT loss for model optimization. In your case, since your data has a monotone missing pattern, you can instead randomly mask out a continuous period of data in MIT to train SAITS. Please let me know if you would like any help with your project.
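
A minimal sketch of that masking step (my own illustration, not the actual SAITS code; X is one [seq_len, feature_num] sample and gap_len is the length of the block to hold out):

import numpy as np

def mask_continuous_block(X, gap_len, rng=None):
    """Hold out one random contiguous block of time steps: return the masked
    input (block set to NaN) and an indicating mask marking the held-out
    entries, on which the MIT loss would be computed."""
    rng = rng or np.random.default_rng()
    start = rng.integers(0, X.shape[0] - gap_len + 1)   # random block start
    indicating_mask = np.zeros_like(X, dtype=bool)
    indicating_mask[start: start + gap_len] = True      # mark the whole block
    X_masked = X.copy()
    X_masked[indicating_mask] = np.nan                  # remove the block from the input
    return X_masked, indicating_mask

In practice you would intersect indicating_mask with the observed-value mask, so that only entries that were actually observed are held out for the loss.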

Also, thank you very much for your attention to my work. If you have any feedback, please tell me. And you are welcome to join the PyPOTS and TSDB communities!

@WenjieDu WenjieDu unpinned this issue Apr 27, 2022