Canned dissimilarities? #196

Open
jarioksa opened this issue Sep 14, 2016 · 0 comments


jarioksa commented Sep 14, 2016

Function designdist is currently faster than vegdist. With "binary" and "quadratic" terms it is much faster than vegdist. With "minimum" terms (used by most dissimilarity functions in vegdist) it used to be slower than vegdist, but I wrote C code with .Call() interface to find those minimum terms (5fb205d), and now even these are faster than in vegdist. The speed comes with some cost:

  • The memory footprint of designdist is higher. I gave vegdist a .Call() interface, which further reduced its memory footprint, so the difference is even larger in 2.5-0 than it used to be (and still is in 2.4-1). In the same process I also made vegdist faster, and it now matches stats::dist(), which used to be much faster. However, this does not close the gap to designdist (major changes in 8125d43).
  • Missing values in input data give missing dissimilarities (NA) in designdist, whereas vegdist can use "pairwise deletion". For "minimum" terms this is the main reason why designdist is faster.
  • The two ways of defining squared Euclidean distance are algebraically equivalent, ∑(x-y)² = ∑x² + ∑y² - 2∑xy, and we use the latter as designdist(x, "A+B-2*J", terms="quadratic"). However, they are not numerically equivalent: quadratic terms can lose precision and give erratic results. The same concern applies to most other indices, and it is safer to use compiled code that was designed to be numerically more stable.
  • designdist coefficients must be designed and written by the user, which may be tricky for some users.
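The precision issue in the third point can be reproduced with plain arithmetic. The following is a minimal base R sketch, not designdist internals: the values are deliberately exaggerated (an assumption for illustration) so that the loss is visible, and the names `direct` and `quadratic` are only illustrative.

```r
## Exaggerated, made-up values: large magnitudes with small differences.
x <- 1e10 + c(0, 1, 2)
y <- 1e10 + c(1, 2, 3)

## Direct form: differences are taken first, so the result is exactly 3.
direct <- sum((x - y)^2)

## Quadratic form, analogous to "A+B-2*J" with quadratic terms.
## Each term is near 1e20, where doubles are spaced 16384 apart,
## so the small true difference cannot survive the subtraction.
A <- sum(x^2)
B <- sum(y^2)
J <- sum(x * y)
quadratic <- A + B - 2 * J

print(c(direct = direct, quadratic = quadratic))
```

With typical community data the magnitudes are far smaller and the error is correspondingly smaller, but the mechanism (subtracting large, nearly equal sums) is the same.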

The last point could be solved by providing a function with a set of canned dissimilarity indices. We could have a long list of dissimilarity indices defined in designdist terms, and these could be selected by index name. The following function demonstrates the concept:

canneddist <-
    function(x, method)
{
    ## each entry gives a designdist formula and the terms it uses
    index <- list(
        "sorensen" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
        "bray" = list(method = "(A+B-2*J)/(A+B)", terms = "minimum"),
        "whittaker" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
        "ochiai" = list(method = "1-J/sqrt(A*B)", terms = "binary"),
        "cosine" = list(method = "1-J/sqrt(A*B)", terms = "quadratic"))
    ## partial matching of the index name
    ind <- match.arg(method, names(index))
    z <- index[[ind]]
    designdist(x, method = z$method, terms = z$terms, name = ind)
}
## use this as
library(vegan)
data(dune)
canneddist(dune, "och")

The list of indices could grow to any desired size. For instance, an article by Z. Hubalek lists 86 binary indices, and there are many more.

The function is simple, but the real challenge is documentation. The list of indices is dynamic, and when it reaches something like 200 alternatives, we also need ways of paging the output, filtering the results, finding synonyms (there are synonyms even in the list above), etc. Currently I have a simple help argument in betadiver which lists the seventeen indices available there, but this would not be sufficient for such a large choice of canned dissimilarities.
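One way to approach filtering and synonym detection is to flatten the index list into a data frame. This is only a sketch in base R: `index_table()` and `find_synonyms()` are hypothetical helper names, not vegan functions, and the list is the five-entry example from above.

```r
## The example list from canneddist(), kept outside the function
index <- list(
    "sorensen" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
    "bray" = list(method = "(A+B-2*J)/(A+B)", terms = "minimum"),
    "whittaker" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
    "ochiai" = list(method = "1-J/sqrt(A*B)", terms = "binary"),
    "cosine" = list(method = "1-J/sqrt(A*B)", terms = "quadratic"))

## Hypothetical helper: tabulate the indices, optionally filtered by a
## regular expression on the name and/or by the type of terms
index_table <- function(index, pattern = NULL, terms = NULL) {
    tab <- data.frame(
        name = names(index),
        method = unname(vapply(index, function(z) z$method, character(1))),
        terms = unname(vapply(index, function(z) z$terms, character(1))),
        stringsAsFactors = FALSE)
    if (!is.null(pattern))
        tab <- tab[grepl(pattern, tab$name), , drop = FALSE]
    if (!is.null(terms))
        tab <- tab[tab$terms %in% terms, , drop = FALSE]
    tab
}

## Hypothetical helper: names that share both formula and terms
find_synonyms <- function(index) {
    tab <- index_table(index)
    groups <- split(tab$name, paste(tab$method, tab$terms))
    unname(groups[lengths(groups) > 1])
}

index_table(index, terms = "binary")
find_synonyms(index)
```

Comparing formula strings only finds synonyms that are written identically; algebraically equivalent formulas with different spelling would need an explicit synonym field.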

We would probably also want optional fields like synonym and note, which could print a message() giving the canonical name or implementation specifics for certain indices. An entry on source could also be useful to give a literature reference for each index (usually not the original paper but a textbook or similar), but this would call for a more complicated design, because the same sources would be repeated and we do not want to write them out in full for each index.
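The optional fields could look roughly like the sketch below. The field names `synonym` and `note`, the list `canned_index`, and the function `resolve_index()` are all assumptions made for illustration; the note text only restates the precision caveat mentioned earlier.

```r
## Hypothetical registry: a synonym entry only points to its canonical
## name, and 'note' is an optional free-text field
canned_index <- list(
    "sorensen" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
    "whittaker" = list(synonym = "sorensen"),
    "bray" = list(method = "(A+B-2*J)/(A+B)", terms = "minimum"),
    "ochiai" = list(method = "1-J/sqrt(A*B)", terms = "binary"),
    "cosine" = list(method = "1-J/sqrt(A*B)", terms = "quadratic",
                    note = "quadratic terms can lose precision"))

## Hypothetical helper: resolve a (possibly abbreviated) name to its
## canonical entry and report synonyms and notes with message()
resolve_index <- function(method, index = canned_index) {
    ind <- match.arg(method, names(index))
    z <- index[[ind]]
    if (!is.null(z$synonym)) {
        message(ind, " is a synonym of ", z$synonym)
        ind <- z$synonym
        z <- index[[ind]]
    }
    if (!is.null(z$note))
        message("note on ", ind, ": ", z$note)
    list(name = ind, method = z$method, terms = z$terms)
}

z <- resolve_index("whit")
## with vegan loaded, the result could then be passed on as
## designdist(x, method = z$method, terms = z$terms, name = z$name)
```

A source field could be handled the same way, storing only a short key per index and keeping the full references in a separate lookup list.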

What do you think of this idea? Should we have a function like this?

This popped up in issue #182 but I decided to make this a separate issue.

jarioksa pushed a commit to jarioksa/natto that referenced this issue Sep 5, 2018
this version is similar as outlined in github issue
vegandevs/vegan#196 and lacks indices and lacks tools
of documentation.