Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Methods to identify the number of nontrivial axes in wcmdscale #78

Open
eduardszoecs opened this issue Nov 18, 2014 · 6 comments
Open

Comments

@eduardszoecs
Copy link
Contributor

Paul and Anderson (2013) discuss 4 methods to assess how many numbers of axis might be usefull/interpretable/[putwhatyouwanthere]

a) broken-stick [implemented in vegan]
b) bootstrap CI for eigenvalues
c) bootstrap from null model
d) variation of c)

I wondered if b)-c) are already implemented in vegan? And if not, if it's worth implementing them?

I have somewhere on my machine a more veganized version of their code...

Paul WL, Anderson MJ (2013) Causal modeling with multivariate species data. Journal of Experimental Marine Biology and Ecology 448:72–84. doi: 10.1016/j.jembe.2013.05.028

@jarioksa
Copy link
Contributor

The methods that Paul & Anderson discuss look pretty simple and could be implemented. I'm not fond of these methods because I don't think they work. With "not working" I mean that I don't think that they usually give the correct number of "non-trivial" axes even when such a number exists. This also applies to bstick and screeplot that we now have in vegan. @gavinsimpson seems to think differently, though, and he may have some ideas. However, as we have bstick we could also have other simple alternatives.

An ecological classic on the number of axes is

Jackson, DA (1993) Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches. Ecology 74, 2204-2214.

Paul & Anderson cite this paper and say they are implementing one of his heuristic methods.

Then about the aspect of "not working" and why I think so. You can use simulate.rda and simulate.capscale functions to produce random matrices of known number of "non-trivial axes" (rank) plus random error. Here I generate 2-dimensional data with capscale:

> library(vegan)
> data(mite)
> m <- capscale(vegdist(wisconsin(sqrt(mite))) ~ 1)
> m2 <- capscale(vegdist(wisconsin(sqrt(mite))) ~ scores(m, dis="sites", choice=1:2))  
> signbstick <-
function (model, rank="full") 
{
    mod <- capscale(simulate(model, rank=rank) ~ 1)
    max(min(which(bstick(mod) > eigenvals(mod))) - 1, 0)
}
> table(replicate(100, signbstick(model=m2)))

 2  3  4  5  6 
12 24 44 17  3 

This generates data with two axes plus independent random normal error and estimates the residual variance from the residuals of the model (there are other choices). In capscale simulations we ignore negative eigenvalues, though. This gives the "correct" number of axes only in 12/100 cases, and the modal case is four axes. However, if we reduce the residual random variation, we get correct results:

> m5 <- capscale(vegdist(wisconsin(sqrt(mite))) ~ scores(m, dis="sites", choice=1:5))
> table(replicate(100, signbstick(model=m5, rank=2)))

 2  3 
94  6 

It is not only the number of systematic axes, but the signal/noise ratio: the non-trivial axis must be strong relative to residual variation. Unfortunately, if that is not the case, bstick works in a wrong way and suggests too high a number of axes. Similar tests could be useful for alternative methods, too.

@eduardszoecs
Copy link
Contributor Author

Thanks Jari for your thoughts (and the ref to jackson 1993)!

Peres-Neto, P. R., D. A. Jackson, and K. M. Somers. “How Many Principal Components? Stopping Rules for Determining the Number of Non-Trivial Axes Revisited.” Computational Statistics & Data Analysis 49, no. 4 (2005): 974–97.

also discusses a lot of methods. Most of them should be easy to implement.

Your examples are about constrained ordinations (capscale()) - do your critics also apply to unconstrained ordinations?

@jarioksa
Copy link
Contributor

Here is another paper that I found in last spring (northern hemisphere) and found then useful, mainly in explaining why things like broken stick should underestimate first eigenvalues, and for underlying geometry in particular:

Johnstone IM (2007) High dimensional statistical inference and random matrices. Proceedings of the International Congress of Mathematicians, Madrid, Spain, pages 307-333.

This is a conference paper, but I found this on the Internet so that it should be available.

My example above actually studied unconstrained ordination: it only constrained the ordination by first unconstrained axes to get the desired dimensionality to the systematic part, and to get matching random variation.

@gavinsimpson
Copy link
Contributor

I don't think differently; I consider these at best a guide to the potential/likely dimensionality, not some infallible, all-knowing method that can discern the will of <insert favourite deity>.

Bootstrapping eigenvalues seems at first blush a great solution until you realise that all this gives you is say a confidence interval on the eigenvalues and some measure of bias in the values you got for your data. You still have the problem of saying whether the eigenvalues +/- the bootstrap CI are meaningful or not.

I need to revisit Pedro's paper to recall what they did. If we think these are useful as guides and perhaps may perform better than bstick() we should consider implementing them but with big caveats. If anything, having a well-implemented version of these methods in vegan would make it easier to use in teaching to illustrate that these should not be trusted or take at face value.

@gavinsimpson
Copy link
Contributor

I would add that the more formal approaches to doing unconstrained ordination, such as using latent variable models, would allow a real statistical (as opposed to an heuristical) solution to the problem as then we'd have likelihoods to generate an LRT, or use information criteria, to choose between models with different numbers of latent variables (axes).

The longer-term usage of vegan will either decline as these new-fangled methods gain traction (which undoubtedly they will, eventually), or not, if we choose to add vegan-esque implementations. Thoughts?

@jarioksa
Copy link
Contributor

jarioksa commented Mar 31, 2017

This issue was raised in StackOverflow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants