Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calculating the distance between two centroids #404

Open
aguilart opened this issue Apr 1, 2021 · 2 comments
Open

calculating the distance between two centroids #404

aguilart opened this issue Apr 1, 2021 · 2 comments

Comments

@aguilart
Copy link

aguilart commented Apr 1, 2021

HI, I want measure the difference between the spatial medians of two groups. The data comes from using bray distances between samples.Some samples were under control and other under a drought treatment. I found that betadisper followed by TukeyHSD does the trick. However, when I see the medians in the boxplots and check them by eye, it does not seem to correspond with value provided by TukeyHSD. As an example:

ejemplo<-OTU_abundance6[c(1:10),]
y<-vegdist(ejemplo[,c(1:2113)],method = "bray")
rw<-betadisper(y,ejemplo$Treatment)

If I do boxplot
boxplot(rw)

I get:

image

Now calculating TukeyHSD;

TukeyHSD(rw)

It returns a negative difference between the spatial medians (i.e. diff):

Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = distances ~ group, data = df)

group                 diff        lwr        upr     p adj
Drought-Control -0.06495099 -0.2075065 0.07760449 0.3241143

I thought I might be able to calculate by myself the difference between the medians by doing this:

dist(rw$centroids)

which returns a different distance:

Control
Drought 0.1620314

So, in summary, I want to be reallly sure of what is the real difference between the spatial medians between the two groups (i.e. drought vs control)

@gavinsimpson
Copy link
Contributor

These things are showing slightly different things, and the details can be important.

Using the following as an example to illustrate

data(varespec)
     
## Bray-Curtis distances between samples
dis <- vegdist(varespec)
     
## First 16 sites grazed, remaining 8 sites ungrazed
groups <- factor(c(rep(1,16), rep(2,8)), labels = c("grazed","ungrazed"))
     
## Calculate multivariate dispersions
mod <- betadisper(dis, groups)

The boxplot() is a boxplot of the distance between each observation (sample) and its groups median/centre. By convention boxplot() draws boxplots that showing Tukey's summaries, so the thick line in each plot is the median of these individual distances to group centres.

The printed output from betadisper() shows the mean of these distances:

> mod

	Homogeneity of multivariate dispersions

Call: betadisper(d = dis, group = groups)

No. of Positive Eigenvalues: 15
No. of Negative Eigenvalues: 8

Average distance to median:
  grazed ungrazed 
  0.3926   0.2706 

Eigenvalues for PCoA axes:
(Showing 8 of 23 eigenvalues)
 PCoA1  PCoA2  PCoA3  PCoA4  PCoA5  PCoA6  PCoA7  PCoA8 
1.7552 1.1334 0.4429 0.3698 0.2454 0.1961 0.1751 0.1284

If you look at the boxplots of these distances by group

boxplot(mod)

betadisper-boxplot

you'll see that the one for ungrazed indicates the distribution of those distances to centres are skewed, and hence the mean value reported in the output 0.2706 should be expected to be different to that of the median of these distances (on the boxplot), with the median being lower (it is less influenced by the upper tail) and is closer to 0.2.

The same thing applies in your example too.

The output from TukeyHSD() is

> TukeyHSD(mod)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = distances ~ group, data = df)

$group
                      diff        lwr          upr     p adj
ungrazed-grazed -0.1219422 -0.2396552 -0.004229243 0.0429502

where the diff column is showing the estimated difference of means, which is the difference between the sample means of the group distances to centres:

> with(mod, tapply(distances, group, mean))
   grazed  ungrazed 
0.3925879 0.2706457 
> diff(with(mod, tapply(distances, group, mean)))
  ungrazed 
-0.1219422

The first line of the output above is the average (sample mean) distance to group centre for each group, while the second line of code computes the difference of these two samples means, which is the sample estimate of the difference of means, which is what is reported by TukeyHSD() ad diff.

If you want to compute the distance between the centres then dist() is one option, but you have to be a bit more careful about how you handle axes that correspond to negative eigenvalues and you likely have those if you are using a non-metric coefficient like Bray-Curtis.

> with(mod, dist(centroids))
            grazed
ungrazed 0.5155369

So the above is the Euclidean distance between the group centres - not accounting for axes with negative eigenvalues, so it is wrong. You'd need to compute the squared Euclidean distance between centroids for the axes with positive eigenvalues separately from the distance among the axes with negative eigenvalues and then do:

sqrt(dist.pos - dist.neg)

and also worry about situations where dist.neg > dist.pos in which case you'd be trying to take the square root of a negative number which won't work. You can see how we do this for the distances reported in $distances for the samples in the code for betadisper — it's not complicated it's just a bit more complex than simply running dist().

The use of TukeyHSD() is not intended to correspond to distances between the group centres; it is meant as measure of the difference between the groups in terms of the average dispersion of samples around the respective group centres.

@aguilart
Copy link
Author

Thanks a lot for the feedback: it was really useful! Just a quick follow up: when you say: "and also worry about situations where dist.neg > dist.pos". Does that mean than in those cases I should just invert the order and that´s all ( i mean dist.neg - dist.pos). Thank you again for all the effort you put into answer these questions! It really helps understanding better what happens "behind the scenes" in vegan!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants