Clustergram - Request to cluster based on a selected aggregated "groupby" column #645

Open
adkinsrs opened this issue Dec 8, 2021 · 2 comments

Comments

@adkinsrs

adkinsrs commented Dec 8, 2021

Currently the Dash clustergram can only cluster on all row or column values. There are cases where I would like to sort my data by a chosen metadata category, and then cluster on the mean value of each group in that category. Right now I have to choose between preserving the sort order without clustering, or clustering on the raw data values and losing the visual grouping that came from pre-sorting the data. Below are two pictures of Dash-Bio Clustergrams (with my own post-processing touches) that illustrate the situation.

Clustering by individual samples instead of category
[Image: Screen Shot 2021-12-08 at 11 03 43 AM]

Sorted by a category but no clustering
[Image: Screen Shot 2021-12-08 at 11 03 31 AM]

The functionality I am requesting is similar to the dendrogram option for Scanpy's heatmap function (see https://scanpy.readthedocs.io/en/stable/generated/scanpy.pl.heatmap.html).

I thought a potential solution would be to

  1. Group the data by the chosen category and take the mean of each group
  2. Run dashbio.Clustergram on the grouped means to get the dendrogram traces back
  3. Sort the original data so that its order matches the dendrogram traces
  4. Plug those traces back into dashbio.Clustergram along with the sorted, non-grouped original data.
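
Steps 1 to 3 can be sketched outside of Dash Bio with pandas and SciPy. This is only a minimal illustration under assumptions: the toy data, the `category` series, and the choice of average linkage with Euclidean distance are all hypothetical, and it calls SciPy directly rather than `dashbio.Clustergram`.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list

# Hypothetical toy data: rows are samples, columns are features,
# and `category` is the metadata column to group by.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(9, 4)),
    index=[f"sample_{i}" for i in range(9)],
    columns=[f"gene_{j}" for j in range(4)],
)
category = pd.Series(
    ["A", "B", "C", "A", "B", "C", "C", "C", "A"],
    index=data.index,
    name="category",
)

# Step 1: aggregate to one mean profile per category.
group_means = data.groupby(category).mean()

# Step 2: cluster the aggregated profiles (one leaf per category).
# The linkage method and metric are assumptions, not Dash Bio defaults.
Z = linkage(group_means.to_numpy(), method="average", metric="euclidean")
category_order = group_means.index[leaves_list(Z)].tolist()

# Step 3: sort the original samples so their categories follow the leaf order.
# A stable sort keeps the original sample order within each category.
rank = {cat: pos for pos, cat in enumerate(category_order)}
sorted_index = category.map(rank).sort_values(kind="stable").index
sorted_data = data.loc[sorted_index]
```

Here `sorted_data` keeps every original sample, but its rows are arranged in contiguous category blocks that follow the dendrogram's leaf order.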

But I would be running the "clustergram" tool twice, and since the category groups have uneven counts of members, the traces from step 2 would not line up 1-to-1 with the sorted data and the x/y coords would need to be adjusted.
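
One way to handle the uneven group sizes is to compute the dendrogram coordinates directly from the linkage matrix, placing each category leaf at the center of the block of samples it covers. The sketch below is an assumption-laden illustration, not Dash Bio's or Plotly's implementation: `grouped_dendrogram_coords` is a hypothetical helper, and `unit=10.0` assumes each heatmap cell is 10 units wide, matching the spacing SciPy's `dendrogram` uses between leaves.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def grouped_dendrogram_coords(Z, group_sizes, unit=10.0):
    """Hypothetical helper: build (icoord, dcoord) link coordinates where
    leaf i spans group_sizes[i] heatmap cells of width `unit`."""
    Z = np.asarray(Z, dtype=float)
    n = Z.shape[0] + 1                      # number of leaves (categories)
    order = leaves_list(Z)                  # leaf ids, left to right
    widths = unit * np.asarray(group_sizes, dtype=float)[order]
    starts = np.concatenate(([0.0], np.cumsum(widths)[:-1]))

    x = np.zeros(2 * n - 1)                 # horizontal position of each node
    h = np.zeros(2 * n - 1)                 # merge height of each node (0 for leaves)
    x[order] = starts + widths / 2.0        # each leaf sits at its block's center

    icoord, dcoord = [], []
    for i, (a, b, height, _) in enumerate(Z):
        left, right = sorted((int(a), int(b)), key=lambda node: x[node])
        icoord.append([float(x[left]), float(x[left]),
                       float(x[right]), float(x[right])])
        dcoord.append([float(h[left]), float(height),
                       float(height), float(h[right])])
        x[n + i] = (x[left] + x[right]) / 2.0   # parent is centered over its children
        h[n + i] = height
    return icoord, dcoord

# Toy example: three category means with 3, 2, and 4 member samples.
means = np.array([[0.0], [10.0], [1.0]])
sizes = [3, 2, 4]
Z = linkage(means, method="average")
icoord, dcoord = grouped_dendrogram_coords(Z, sizes)
```

Each `icoord`/`dcoord` pair describes one U-shaped link, in the same four-point form that SciPy's `dendrogram` returns, so the pairs could be drawn as line traces above the sorted, non-grouped heatmap. The loop is iterative, so it does not depend on Python's recursion limit.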

Any thoughts on this enhancement?

@adkinsrs
Author

adkinsrs commented Dec 8, 2021

I just ran into a dataset with so many samples that SciPy hit a "maximum recursion depth exceeded" error when attempting to cluster them, so being able to optionally cluster by an aggregated category would also alleviate this issue.

@nickmelnikov82
Contributor

Hi @adkinsrs.

The data is not reordered in the Clustergram component itself, but in the Dendrogram class from the plotly.figure_factory module, so we are not able to fix the main problem of this issue within the dash-bio project. We can open an issue about the reordering problem for the original Dendrogram component in figure_factory.

Best wishes,
Nick.
