
Auto-Grouping for SSA #122

Merged · 4 commits merged into johannfaouzi:main on Dec 7, 2021

Conversation

lucasplagwitz (Contributor)

Hi,

I am very interested in your efficient implementation of Singular Spectrum Analysis.
Have you ever thought about a more advanced grouping for the components?
You mentioned in the SSA example that

The first subseries consists of the trend of the original time series.
The second and third subseries consist of noise.

But shouldn't the trend rather be a zero-like signal?

And wouldn't it be nicer to show the possibility of decomposing an additive signal into three components (trend, seasonal, residual)?
Packages like Julia's SSA explicitly return trend and seasonal components.
They group based on a specific consideration of the eigenvalues, which in my opinion also disregards a few special cases.

My commit is only a short demo and tries to demonstrate the potential of
SSA with a grouping inspired by "A Method of Trend Extraction Using Singular Spectrum Analysis".
I also saw that the R package uses a similar approach.

Are you interested in the topic of grouping SSA components?
If so, I could add tests and generalize the process.

Best regards,
Lucas

pep8speaks commented Dec 3, 2021

Hello @lucasplagwitz! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-07 15:59:29 UTC

codecov bot commented Dec 3, 2021

Codecov Report

Merging #122 (b4967b9) into main (bbfef38) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main     #122   +/-   ##
=======================================
  Coverage   99.07%   99.07%           
=======================================
  Files          97       97           
  Lines        5699     5727   +28     
=======================================
+ Hits         5646     5674   +28     
  Misses         53       53           
Impacted Files                          Coverage   Δ
pyts/bag_of_words/bow.py                100.00%    <ø> (ø)
pyts/decomposition/tests/test_ssa.py    100.00%    <ø> (ø)
pyts/decomposition/ssa.py               100.00%    <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

johannfaouzi (Owner) commented Dec 6, 2021

Hi,

Sorry for the delayed response, and thank you for your pull request! I have not given singular spectrum analysis much attention because I've never seen it used for time series classification (it's available in this package only because I had already implemented it, so I thought it was better to make it available anyway). I'm glad that it's useful for some people.

I'm in favor of adding more advanced grouping strategies! I looked at your code and the reference that you provided and I have a few remarks:

  • I don't think that scipy.signal.periodogram does what is described in the reference. According to Wikipedia, there are several definitions of a periodogram; I think that scipy.signal.periodogram implements the time-averaging approach (because it uses Welch's method). In the reference, each element of the periodogram is proportional to the amplitude. If I'm not mistaken, it's probably better to reimplement the periodogram (see the short comparison sketch after this list).
  • It's more efficient to use broadcasting instead of for loops whenever possible. Moreover, the indices for the trend and the residuals are the same for all the time series and all the windows, so we can just compute them once.
  • The reference mentions automatic computations of w0 and c0. Do you plan to implement them? It may be a good idea to add them as parameters of the class in case a user wants to set their values.
  • The original example is based on independent random values from a standard normal distribution... It's probably a terrible idea and I should use another dataset.
  • I'm wondering if it would be better to add your example as a separate example illustrating specifically the 'auto' grouping to extract the trend, the seasonality and the residuals.
  • If you plan on working on this, as you mentioned, it would be nice to add some tests for the new functionality (and to fix the linting issues).
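
To illustrate the periodogram point, here is a minimal comparison sketch (not part of the PR diff): with its default settings, scipy.signal.periodogram subtracts the mean (detrend='constant'), which zeroes the DC bin, whereas trend eigenvectors concentrate their power precisely near zero frequency.

import numpy as np
from scipy.signal import periodogram

# Minimal sketch comparing the two periodogram definitions.
# scipy.signal.periodogram detrends by subtracting the mean by default,
# so the DC bin vanishes; a plain rfft-based periodogram keeps it.
x = 0.5 + np.sin(2 * np.pi * 0.1 * np.arange(50))

_, Pxx_scipy = periodogram(x)  # default: detrend='constant'
Pxx_manual = np.abs(np.fft.rfft(x, norm='ortho')) ** 2
Pxx_manual[1:-1] *= 2          # fold in negative frequencies (even length)

print(Pxx_scipy[0], Pxx_manual[0])  # ~0 vs. a large DC contribution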

Edit: I think the code is wrong. The reference mentions that the periodogram is computed from the eigenvectors, but X_elem[i, :, 0] does not correspond to the matrix of eigenvectors; the matrix of eigenvectors is v.

Edit 2: I edited the code to use the eigenvectors.

Here's a code snippet illustrating my proposed changes (in the end it should be implemented inside the class as you did; this is just a working example):

import numpy as np
from numba import njit, prange
from sklearn.utils.validation import check_array
from pyts.datasets import make_cylinder_bell_funnel
from pyts.utils.utils import _windowed_view


# Toy dataset
X, _ = make_cylinder_bell_funnel(n_samples=6, random_state=42, shuffle=False)
X += 3 * np.sin(np.arange(X.shape[1]))

# Original code for SSA (until grouping)
window_size = 20

@njit
def _outer_dot(v, X, n_samples, window_size, n_windows):
    X_new = np.empty((n_samples, window_size, window_size, n_windows))
    for i in prange(n_samples):
        for j in prange(window_size):
            X_new[i, j] = np.dot(np.outer(v[i, :, j], v[i, :, j]), X[i])
    return X_new


@njit
def _diagonal_averaging(X, n_samples, n_timestamps, window_size,
                        n_windows, grouping_size, gap):
    X_new = np.empty((n_samples, grouping_size, n_timestamps))
    first_row = [(0, col) for col in range(n_windows)]
    last_col = [(row, n_windows - 1) for row in range(1, window_size)]
    indices = first_row + last_col
    for i in prange(n_samples):
        for group in prange(grouping_size):
            for (j, k) in indices:
                X_new[i, group, j + k] = np.diag(
                    X[i, group, :, ::-1], gap - j - k - 1
                ).mean()
    return X_new


X = check_array(X, dtype='float64')
n_samples, n_timestamps = X.shape
n_windows = n_timestamps - window_size + 1

X_window = np.transpose(
    _windowed_view(X, n_samples, n_timestamps,
                   window_size, window_step=1), axes=(0, 2, 1)
).copy()
X_transpose = np.matmul(X_window,
                        np.transpose(X_window, axes=(0, 2, 1)))
w, v = np.linalg.eigh(X_transpose)
w, v = w[:, ::-1], v[:, :, ::-1]

X_elem = _outer_dot(v, X_window, n_samples, window_size, n_windows)


# Grouping
w0 = 0.1   # lower frequency bound
c0 = 0.85  # lower frequency contribution threshold

# Periodogram of each eigenvector (proportional to the squared amplitudes)
f = np.arange(0, 1 + window_size // 2) / window_size
Pxx = np.abs(np.fft.rfft(v, axis=1, norm='ortho')) ** 2
if Pxx.shape[-1] % 2 == 0:
    Pxx[:, 1:-1, :] *= 2
else:
    Pxx[:, 1:, :] *= 2

Pxx_cumsum = np.cumsum(Pxx, axis=1)
idx_trend = np.where(f < w0)[0][-1]
idx_resid = Pxx_cumsum.shape[1] // 2

# Trend: more than c0 of the power lies below w0.
# Residuals: less than c0 of the power lies below half the Nyquist frequency.
# Seasonality: everything else.
trend = Pxx_cumsum[:, idx_trend, :] / Pxx_cumsum[:, -1, :] > c0
resid = Pxx_cumsum[:, idx_resid, :] / Pxx_cumsum[:, -1, :] < c0
season = np.logical_and(~trend, ~resid)

# Sum the elementary matrices within each group
X_groups = np.empty((n_samples, 3, window_size, n_windows))
for i in range(n_samples):
    for j, arr in enumerate((trend, season, resid)):
        X_groups[i, j] = X_elem[i, arr[i]].sum(axis=0)

# SSA
grouping_size = X_groups.shape[1]
if window_size >= n_windows:
    X_groups = np.transpose(X_groups, axes=(0, 1, 3, 2))
    gap = window_size
else:
    gap = n_windows

X_ssa = _diagonal_averaging(
    X_groups, n_samples, n_timestamps, window_size,
    n_windows, grouping_size, gap
)
X_ssa = np.squeeze(X_ssa)
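
As a quick sanity check (an addition to the snippet above): with these thresholds the three boolean masks are disjoint and cover every eigentriple, and diagonal averaging is linear, so the grouped components should sum back to the original series.

# Sanity check: trend + seasonal + residual reconstructs the original series.
assert np.allclose(X_ssa.sum(axis=1), X)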

Best,
Johann

lucasplagwitz (Contributor, Author)

Thank you very much for the very helpful input. It really helps to examine things carefully again. And your code is so much cleaner.

I have now separated the example out as a second one. I hope that my adjustments fit your style. Otherwise, feel free to modify anything, or give me feedback if you want me to change something.

Especially for analyses like the ones I'm currently doing, the automatic component grouping and your SSA implementation are very useful!

@lucasplagwitz lucasplagwitz marked this pull request as ready for review December 6, 2021 23:07
@johannfaouzi johannfaouzi (Owner) left a comment

Thank you very much! It's great, I just left a few minor remarks.

Also, please add your contribution in the changelog. It will be released in version 0.13.0. Please add your name at the end of the description (there are a couple of examples in version 0.10.0).

Comment on lines 229 to 230
raise TypeError("'groups' must be either None, an integer "
"or array-like.")
You need to change the sentence to add 'auto'.

Comment on lines 234 to 238
if not 0 < self.lower_frequency_bound < 1:
raise ValueError(
"'lower_frequency_bound' must be greater than 0 and "
"lower than 1."
)
lower_frequency_bound must be between 0 and 0.5, I think (the normalized frequencies of the periodogram range from 0 to 0.5).

equal to ``groups``. If ``groups='auto'``, ``n_splits`` is equal
to three. If ``groups`` is array-like, ``n_splits`` is equal to
the length of ``groups``. If ``n_split=1``, ``X_new`` is squeezed
and its shape is (n_samples, n_timestamps).
Missing s in n_splits (the typo was there before, but since you're changing the corresponding line, you can fix it).

@@ -183,10 +224,27 @@ def _check_params(self, n_timestamps):
"(got {0}).".format(self.window_size)
)
window_size = max(2, ceil(self.window_size * n_timestamps))
-       if not (self.groups is None
+       if not (self.groups is None or self.groups == "auto"
It's a bit better to make sure that groups is a string before checking equality with a string: if self.groups is an array, I think that the current code could raise a warning because it performs the comparison element-wise. Another (not as good) solution would be to check the equality after the list / tuple / array instance check.

I would change the condition like this:

if not (self.groups is None
        or (isinstance(self.groups, str) and self.groups == "auto")
        or isinstance(self.groups, (int, list, tuple, np.ndarray))):
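
A minimal demonstration of the proposed check (the helper name is made up, just for illustration):

import numpy as np

# Checking isinstance(groups, str) first avoids the element-wise comparison
# that a NumPy array would otherwise attempt against the string "auto".
def is_valid_groups(groups):
    return (groups is None
            or (isinstance(groups, str) and groups == "auto")
            or isinstance(groups, (int, list, tuple, np.ndarray)))

for candidate in (None, "auto", 3, [0, 1], np.arange(2)):
    print(candidate, is_valid_groups(candidate))  # all True
print(is_valid_groups("other"))  # False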

Comment on lines 60 to 68
lower_frequency_bound : float (default = 0.075)
    The boundary of the periodogram to characterize trend, seasonal and
    residual components. Ignored if 'groups' is not set to 'auto'.

lower_frequency_contribution : float (default = 0.85)
    The relative threshold to characterize trend, seasonal and
    residual components by considering the periodogram.
    Ignored if 'groups' is not set to 'auto'.
It would be nice to add the bounds for each parameter in their description (something like "It must be between a and b." before the "Ignored" sentence).

"'lower_frequency_bound' must be a float."),

({'lower_frequency_bound': 1.2}, ValueError,
"'lower_frequency_bound' must be greater than 0 and lower than 1."),
Upper bound needs to be changed.
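
For reference, the corrected entry could look like this (a sketch, assuming the error message is updated accordingly):

({'lower_frequency_bound': 1.2}, ValueError,
 "'lower_frequency_bound' must be greater than 0 and lower than 0.5."),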

lucasplagwitz (Contributor, Author)

Absolutely right. Hopefully I didn't miss a line. Thanks for the review!

@johannfaouzi johannfaouzi (Owner) left a comment

Perfect! Thank you very much!

@johannfaouzi johannfaouzi merged commit 0f023a6 into johannfaouzi:main Dec 7, 2021