Canned dissimilarities? #196

jarioksa · 2016-09-14T13:21:36Z

Function designdist is currently faster than vegdist. With "binary" and "quadratic" terms it is much faster. With "minimum" terms (used by most dissimilarity functions in vegdist) it used to be slower than vegdist, but I wrote C code with a .Call() interface to find those minimum terms (5fb205d), and now even these are faster than in vegdist. The speed comes at some cost:

The memory footprint of designdist is higher. I gave vegdist a .Call() interface, which further reduced the memory footprint of vegdist, so the difference is even larger in 2.5-0 than it used to be (and still is in 2.4-1). In the same process I also made vegdist faster: it now matches stats::dist(), which used to be much faster. However, this does not close the gap to designdist (major changes in 8125d43).
Missing values in input data give missing dissimilarities (NA) in designdist, whereas vegdist can use "pairwise deletion". For "minimum" terms this is the main reason why designdist is faster.
The two ways of defining squared Euclidean distance are algebraically equivalent, ∑(x-y)² = ∑x² + ∑y² - 2∑xy, and we use the latter as designdist(x, "A+B-2*J", terms="quadratic"). However, they are not numerically equivalent: quadratic terms can lose precision and give erratic results. This concerns most other indices as well, and it is safer to use compiled code that was designed to be numerically more stable.
designdist coefficients must be designed and written by the user, which may be tricky for some.
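
The precision point can be checked in base R without vegan. The following is only a sketch: the two vectors are contrived (large values with small differences, an assumption chosen to make the effect visible, not typical community data), and the names direct, quadratic, A, B and J are illustrative. It computes the squared Euclidean distance both from the direct definition and from the quadratic terms.

```r
## Contrived data: large values with small differences (illustrative assumption)
x <- c(1e8 + 1, 1e8 + 2)
y <- c(1e8, 1e8)

## Direct definition: sum of squared differences
direct <- sum((x - y)^2)

## Quadratic-terms form, the same algebra as "A+B-2*J" with terms = "quadratic"
A <- sum(x^2)
B <- sum(y^2)
J <- sum(x * y)
quadratic <- A + B - 2 * J

## The two results differ because x^2 cannot be stored exactly in double precision
print(c(direct = direct, quadratic = quadratic))
```

With ordinary count or cover data the values are small and the two forms agree much more closely; the sketch only shows that algebraic equivalence does not guarantee numerical equivalence.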

The last point could be solved by providing a function with canned dissimilarity indices. We could have a long list of dissimilarity indices defined in designdist terms, and these could be selected by index name. The following function demonstrates the concept:

canneddist <-
    function(x, method)
{
    ## canned indices: designdist formula and the type of terms it uses
    index <- list(
        "sorensen" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
        "bray" = list(method = "(A+B-2*J)/(A+B)", terms = "minimum"),
        "whittaker" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
        "ochiai" = list(method = "1-J/sqrt(A*B)", terms = "binary"),
        "cosine" = list(method = "1-J/sqrt(A*B)", terms = "quadratic"))
    ## index names can be abbreviated
    ind <- match.arg(method, names(index))
    z <- index[[ind]]
    designdist(x, method = z$method, terms = z$terms, name = ind)
}
## use this as
library(vegan)
data(dune)
canneddist(dune, "och")

The list of indices could grow to any desired size. For instance, an article by Z. Hubalek lists 86 binary indices, and there are many more.

The function is simple, but the real challenge is documentation. The list of indices is dynamic, and when it reaches something like 200 alternatives, we also need ways of paging the output, filtering the results, finding synonyms (there are synonyms even in the list above) etc. Currently I have a simple help argument in betadiver which lists the seventeen indices available there, but this would not be sufficient for this choice of canned dissimilarities.
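
Filtering and synonym detection could be built on a tabular view of the index list. The following is a minimal sketch in base R, not a proposed implementation: it repeats the entries of the canneddist example, and the names index_table, binary_only and synonyms are hypothetical. Synonyms are taken to be entries that share both the formula and the terms.

```r
## Same entries as in the canneddist example
index <- list(
    "sorensen" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
    "bray" = list(method = "(A+B-2*J)/(A+B)", terms = "minimum"),
    "whittaker" = list(method = "(A+B-2*J)/(A+B)", terms = "binary"),
    "ochiai" = list(method = "1-J/sqrt(A*B)", terms = "binary"),
    "cosine" = list(method = "1-J/sqrt(A*B)", terms = "quadratic"))

## Flatten the list to a data frame with one row per index
index_table <- data.frame(
    name = names(index),
    method = vapply(index, function(z) z$method, character(1)),
    terms = vapply(index, function(z) z$terms, character(1)),
    row.names = NULL, stringsAsFactors = FALSE)

## Filtering: all indices based on binary terms
binary_only <- index_table[index_table$terms == "binary", ]

## Synonyms: groups of names that share both formula and terms
key <- paste(index_table$method, index_table$terms)
synonyms <- split(index_table$name, key)
synonyms <- Filter(function(nm) length(nm) > 1, synonyms)

print(binary_only)
print(unname(synonyms))
```

A data frame of this kind can be subset, searched with grep() or printed in pieces, which would cover some of the paging and filtering needs for a long list.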

Probably we would also want to have optional fields like synonym and note, which could print a message() giving canonical names or implementation specifics for certain indices. Perhaps an entry on source could also be useful to give the literature reference for each index (not usually the original but a text book or similar), but this would call for a more complicated design, because the same sources would be repeated and we do not want to write them in full for each index.
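
One possible shape for such optional fields is sketched below in base R. Everything here is an assumption for illustration: the function resolve_index, the choice of "sorensen" as the canonical name, the note text and the placeholder key "ref1" are all hypothetical. A synonym entry only points to its canonical entry, and sources are stored once in a separate vector and referred to by key.

```r
## Hypothetical shared reference table: each source is written only once
sources <- c(ref1 = "<full citation, written once>")

## Entries may carry optional fields synonym, note and source
index <- list(
    "sorensen" = list(method = "(A+B-2*J)/(A+B)", terms = "binary",
                      source = "ref1"),
    "whittaker" = list(synonym = "sorensen"),
    "ochiai" = list(method = "1-J/sqrt(A*B)", terms = "binary",
                    note = "hypothetical note on implementation specifics"))

## Resolve a (possibly abbreviated) name to its canonical entry
resolve_index <- function(method, index) {
    ind <- match.arg(method, names(index))
    z <- index[[ind]]
    if (!is.null(z$synonym)) {
        ## report the canonical name and follow the pointer
        message("'", ind, "' is a synonym of '", z$synonym, "'")
        ind <- z$synonym
        z <- index[[ind]]
    }
    if (!is.null(z$note))
        message(z$note)
    c(list(name = ind), z)
}

r <- resolve_index("whit", index)
print(r$name)
print(sources[[r$source]])
```

The resolved entry could then be passed on as designdist(x, method = r$method, terms = r$terms, name = r$name).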

What do you think of this idea? Should we have a function like this?

This popped up in issue #182 but I decided to make this a separate issue.


jarioksa added the feature-request label Sep 14, 2016

jarioksa mentioned this issue Sep 14, 2016

designdist faster than vegdist for binary distances #182

Closed

jarioksa added the request-for-comments label Dec 30, 2016

jarioksa pushed a commit to jarioksa/natto that referenced this issue Sep 5, 2018

add canned dissimilarities: proof-of-the-concept version

389e436

this version is similar as outlined in github issue vegandevs/vegan#196 and lacks indices and lacks tools of documentation.
