Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a non-SciPy nearest-neighbor function #12

Open
jfinkels opened this issue Sep 16, 2016 · 7 comments
Open

Use a non-SciPy nearest-neighbor function #12

jfinkels opened this issue Sep 16, 2016 · 7 comments

Comments

@jfinkels
Copy link
Contributor

Assuming the changes suggested in pull request #10 are acceptable, we can remove the dependency on SciPy entirely just by reimplementing the get_nearest_neighbor_distances function. This will facilitate goftests being runnable on PyPy, which does not yet have SciPy support.

Here is a short, pure Python nearest-neighbor function backed by a k-d tree: http:https://code.activestate.com/recipes/577497-kd-tree-for-nearest-neighbor-search-in-a-k-dimensi/. It is published under an MIT license. Since we don't need any other behavior from the k-d tree (we just need to find the distance of each point to its nearest neighbor), I think incorporating that code is the best approach.

@jfinkels
Copy link
Contributor Author

I now think that instead of replacing the SciPy k-d tree with a different pure Python k-d-tree implementation which seems to be unreasonably slow (see pull request #14), I think that a better solution is to find a simple, fast, self-contained C implementation of a k-d tree, create a new repository that provides a Python wrapper around the C code, and use that Python wrapper instead.

@fritzo
Copy link
Member

fritzo commented Aug 31, 2017

That seems reasonable to me. The pure python implementation is indeed unreasonably slow.

@danring
Copy link
Contributor

danring commented Aug 31, 2017

Note that SciPy does now (mostly) work with more recent Pypy2 and Pypy3.5 beta versions.

I do see goftests test failures under Pypy, but they are unrelated to cKDTree. (Some Scipy distributions are broken under Pypy due to this PyTuple_SetItem issue: http:https://doc.pypy.org/en/latest/cpython_differences.html#c-api-differences).

@jfinkels
Copy link
Contributor Author

Well, we'll need SciPy to work on PyPy in order to support testing goftests itself, so that is definitely good news Dan. I still think it would be nice to replace a rather heavyweight dependency with a slimmer one for installing goftests.

@jfinkels
Copy link
Contributor Author

jfinkels commented May 8, 2018

Potential replacement for nearest neighbors: https://github.com/spotify/annoy/

(Thanks @mattbarr for the pointer.)

@jfinkels
Copy link
Contributor Author

Annoy seems to be about 15 times slower than scipy.spatial.cKDTree. Here's the diff for my attempt:

diff --git a/goftests/__init__.py b/goftests/__init__.py
index b299495..115467f 100644
--- a/goftests/__init__.py
+++ b/goftests/__init__.py
@@ -37,10 +37,10 @@ except ImportError:
 import random
 import sys
 
+import annoy
 import numpy
 import numpy.random
 from numpy import pi
-from scipy.spatial import cKDTree
 
 from .utils import chi2sf
 
@@ -197,11 +197,23 @@ def volume_of_sphere(dim, radius):
     return radius ** dim * pi ** (0.5 * dim) / gamma(0.5 * dim + 1)
 
 
-def get_nearest_neighbor_distances(samples):
+def get_nearest_neighbor_distances(dim, samples):
+
+    def nn_dist(i):
+        return t.get_nns_by_item(i, 2, include_distances=True)[1][1]
+
     if not hasattr(samples[0], '__iter__'):
         samples = numpy.array([samples]).T
-    distances, indices = cKDTree(samples).query(samples, k=2)
-    return distances[:, 1]
+
+    f = dim  # dimension of each vector
+    t = annoy.AnnoyIndex(f, metric='euclidean')
+    for i, sample in enumerate(samples):
+        t.add_item(i, sample)
+
+    num_trees = 10  # higher is more accurate
+    t.build(num_trees)
+
+    return numpy.array(list(map(nn_dist, range(len(samples)))))
 
 
 def vector_density_goodness_of_fit(
@@ -228,7 +240,7 @@ def vector_density_goodness_of_fit(
     dim = get_dim(samples[0])
     assert dim
     assert len(samples) > 1000 * dim, 'WARNING imprecision; use more samples'
-    radii = get_nearest_neighbor_distances(samples)
+    radii = get_nearest_neighbor_distances(dim, samples)
     density = len(samples) * numpy.array(probs)
     volume = volume_of_sphere(dim, radii)
     exp_samples = density * volume

I then used the following "setup" script for a timeit test:

# filename: speed.py
import numpy
import scipy.stats
import goftests

NUM_BASE_SAMPLES = 250

NUM_SAMPLES_SCALE = 1000

mu = numpy.ones(3)
sigma = numpy.eye(3)
dist = scipy.stats.multivariate_normal(mu, sigma)

dim = goftests.get_dim(dist.rvs(size=2)[0])
sample_count = NUM_BASE_SAMPLES + NUM_SAMPLES_SCALE * dim

xs = dist.rvs(size=sample_count)
ps = dist.pdf(xs)
# intending to call: auto_density_goodness_of_fit(xs, ps)

and timed the call to auto_density_goodness_of_fit(xs, ps) on the master branch (on CPython 3.5):

$ python -m perf timeit -s "from goftests import auto_density_goodness_of_fit; from speed import xs, ps;" "auto_density_goodness_of_fit(xs, ps)"
.....................
Mean +- std dev: 8.24 ms +- 0.31 ms

and with the diff:

$ python -m perf timeit -s "from goftests import auto_density_goodness_of_fit; from speed import xs, ps;" "auto_density_goodness_of_fit(xs, ps)"
.....................
Mean +- std dev: 114 ms +- 1 ms

Bummer! Although there's a good chance I'm not using Annoy correctly, since this is my first time using it.

@jfinkels
Copy link
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants