Add ability to extract latent representation from clustering algorithms #177

Closed
vemuribv opened this issue Aug 24, 2023 · 12 comments
Labels: enhancement, new feature

Comments

@vemuribv
Contributor

vemuribv commented Aug 24, 2023

1. Feature description

The unsupervised clustering methods (VaDER & CRLI) should internally yield lower-dimensional latent representations of the input data, which are used for the final cluster assignments. Users should be able to extract these latent representations for downstream analysis.

2. Motivation

The ability to extract this latent representation would allow users to calculate internal clustering validation measures like silhouette coefficient, gap statistic, and other indices. This is important especially in cases where there are no ground truth labels.

3. Your contribution

I am not yet totally clear on the clustering architecture implementations in PyPOTS (though I'm starting to familiarize myself more). However, I think these latent representations are already baked into the code:

It may be as simple as including them in what .cluster returns (or providing a separate function that extracts only them).
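To make the motivation concrete, here is a minimal sketch of an internal validation measure computed from a latent matrix and predicted cluster labels alone, with no ground truth. It hand-rolls the silhouette coefficient in NumPy purely for illustration; in practice `sklearn.metrics.silhouette_score` would be the usual choice. The function name, the synthetic data, and the assumed shape `(n_samples, n_latent_dims)` are illustrative assumptions, not part of PyPOTS.

```python
import numpy as np


def silhouette_coefficient(latent, labels):
    """Mean silhouette coefficient (Euclidean distance). Hypothetical helper."""
    latent = np.asarray(latent, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    # Pairwise Euclidean distances, shape (n_samples, n_samples).
    diffs = latent[:, None, :] - latent[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))

    scores = np.zeros(len(latent))
    for i, own in enumerate(labels):
        same = labels == own
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the sample's own cluster
        # (the distance to itself is 0, so dividing by n_same - 1 is correct).
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean() for c in clusters if c != own)
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())


# Synthetic stand-in for an extracted latent matrix: two separated 2-D blobs.
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
pred = np.array([0] * 20 + [1] * 20)
score = silhouette_coefficient(z, pred)
```

Values close to 1 indicate compact, well-separated clusters in latent space; values near 0 or below suggest the assignments do not match the latent structure.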

vemuribv added the enhancement and new feature labels on Aug 24, 2023
@WenjieDu
Owner

Hi there 👋,

Thank you so much for your attention to PyPOTS! You can follow me on GitHub to receive the latest news about PyPOTS. If you find PyPOTS helpful to your work, please star⭐️ this repository. Your star is your recognition, and it helps more people notice PyPOTS and grows the PyPOTS community. It matters and is definitely a kind of contribution to the community.

I have received your message and will respond ASAP. Thank you for your patience! 😃

Best,
Wenjie

@WenjieDu
Owner

WenjieDu commented Aug 30, 2023

Hey @vemuribv, I'm going to adjust the PyPOTS framework API to make the clustering models return their latent representation.

From your end, could you please give some thought to how PyPOTS can provide a more useful utility to help users calculate clustering validation measurements? For example, could you help integrate some of the metrics you mentioned (such as those available in sklearn.metrics) into pypots.utils.metrics? After your code is merged into the PyPOTS main branch, you will be listed as one of the PyPOTS contributors: https://pypots.com/about/#all-contributors

@vemuribv
Contributor Author

Sure thing, I'll work on that this week. Also, it would be great if the imputed array could be returned as well so we can do things like compare with the original array and generate cluster time series plots (see below from VaDER paper)

image
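As a rough sketch of how a returned imputed array could feed cluster time series plots like the one referenced above: group samples by their cluster label and average across samples at each time step. The array shape `(n_samples, n_steps, n_features)`, the function name, and the toy data are assumptions for illustration, not PyPOTS API.

```python
import numpy as np


def cluster_mean_trajectories(imputed, labels):
    """Map each cluster id to (mean, std) arrays of shape (n_steps, n_features).

    Hypothetical helper; assumes `imputed` has shape
    (n_samples, n_steps, n_features) and `labels` has shape (n_samples,).
    """
    imputed = np.asarray(imputed, dtype=float)
    labels = np.asarray(labels)
    trajectories = {}
    for c in np.unique(labels):
        members = imputed[labels == c]  # all samples assigned to cluster c
        trajectories[int(c)] = (members.mean(axis=0), members.std(axis=0))
    return trajectories


# Toy data: 4 samples, 3 time steps, 1 feature, two clusters.
imputed = np.stack([
    np.full((3, 1), 1.0),
    np.full((3, 1), 3.0),
    np.full((3, 1), 10.0),
    np.full((3, 1), 20.0),
])
labels = np.array([0, 0, 1, 1])
trajectories = cluster_mean_trajectories(imputed, labels)
```

Each `(mean, std)` pair can then be drawn as a line with a shaded band per cluster using any plotting library.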

@WenjieDu
Owner

WenjieDu commented Sep 4, 2023

No problem. I'll have the models provide an option to return those values. Could you please add the visualization functions as well?

@vemuribv
Contributor Author

vemuribv commented Sep 5, 2023

Will do. I'm working on something like this (PhysioNet with CRLI, k=3, as an example):

[two images: PhysioNet clustering example (CRLI, k=3)]

@WenjieDu
Owner

WenjieDu commented Sep 5, 2023

Thanks for your PR #179, Bhargav! Will review it 😃

@WenjieDu
Owner

WenjieDu commented Sep 6, 2023

Hey Bhargav, your PR #179 has been merged. Congrats! 👍

@vemuribv
Contributor Author

vemuribv commented Sep 6, 2023

Awesome, thank you! What's the best place for the visualization functions? Also in pypots.utils.metrics?

@WenjieDu
Owner

WenjieDu commented Sep 6, 2023

Absolutely my pleasure ;-) Please put them in pypots/utils/visualization.py; you'll need to create that file.

@vemuribv
Contributor Author

vemuribv commented Sep 8, 2023

Made the PR! Let me know if I can fix anything.

I also want to add visualizations like these below, but won't get to that until at least next week.

image

image

@WenjieDu
Owner

WenjieDu commented Sep 8, 2023

Great! I'm going to make the clustering models return their latent representation. Then you can write some unit tests for your functions.

@WenjieDu
Owner

Hi Bhargav, I've made VaDER and CRLI return their latent representations for clustering, as you requested in this issue. I've also written unit tests for our internal cluster validation metric functions, which you can refer to:

clustering, latent_collector = self.vader.cluster(
    TEST_SET, return_latent=True
)
external_metrics = cal_external_cluster_validation_metrics(
    clustering, DATA["test_y"]
)
internal_metrics = cal_internal_cluster_validation_metrics(
    latent_collector["z"], DATA["test_y"]
)
logger.info(f"{external_metrics}")
logger.info(f"{internal_metrics}")
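For readers unfamiliar with the distinction in the snippet above: external metrics compare predicted cluster assignments against ground-truth labels. What the PyPOTS helpers compute internally is not shown in this thread, so the following is only an illustrative sketch of one common external measure, the Rand index, written in plain NumPy. The function name is hypothetical.

```python
import numpy as np


def rand_index(pred, truth):
    """Fraction of sample pairs on which two labelings agree. Hypothetical helper."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    n = len(pred)
    if n < 2 or len(truth) != n:
        raise ValueError("need two labelings of the same length (>= 2 samples)")

    # For every pair (i, j): are the two samples grouped together?
    same_pred = pred[:, None] == pred[None, :]
    same_truth = truth[:, None] == truth[None, :]

    # A pair counts as an agreement when both labelings group it together
    # or both keep it apart. Only the upper triangle holds distinct pairs.
    upper = np.triu_indices(n, k=1)
    return float((same_pred == same_truth)[upper].mean())


# Label ids are arbitrary, so a relabeled but identical partition scores 1.0.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the measure is defined over pairs of samples, it is invariant to how the cluster ids are numbered, which matters when comparing model output against class labels.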
