improve documentation #167

Merged
merged 80 commits into from
Feb 8, 2021
8070fe6
sort imports
sampan501 Jan 13, 2021
c0ac9e7
Merge branch 'master' into energy-mmd-disco
sampan501 Jan 13, 2021
ddfe3b0
fix docs error short underline
sampan501 Jan 13, 2021
65a2569
add energy source code
sampan501 Jan 13, 2021
05c8165
fix tutorials not rendering
sampan501 Jan 13, 2021
95b0a31
add more detail to compute distance and kernel
sampan501 Jan 13, 2021
1884dd2
Merge branch 'master' into better-docs
sampan501 Jan 13, 2021
c6916b9
add tutorials folder with overview
sampan501 Jan 14, 2021
5c83c70
add folder to gitignore
sampan501 Jan 15, 2021
07f55e2
add contributing guidelines to github repo
sampan501 Jan 15, 2021
fff2b0d
make docs look nicer
sampan501 Jan 15, 2021
a7d77fc
change reference to use automodule
sampan501 Jan 15, 2021
8484dbb
add changelog to single file
sampan501 Jan 15, 2021
e5a0402
remove reference folder
sampan501 Jan 15, 2021
760f10d
make clear the package reqs needs to be installed
sampan501 Jan 15, 2021
e4a5911
simplify autosummary table
sampan501 Jan 15, 2021
e44e1d4
remove old changelog files
sampan501 Jan 15, 2021
09e97f7
add file with all automodules
sampan501 Jan 15, 2021
3efedac
change docs layout
sampan501 Jan 15, 2021
7d4f7c0
add gallery and misc docs changes
sampan501 Jan 15, 2021
a996e77
change order of sidebar
sampan501 Jan 15, 2021
7c3bf1b
make install docs more clear
sampan501 Jan 15, 2021
5f8938c
update makefile with autogen and more options
sampan501 Jan 15, 2021
41a224f
update docs package requirements
sampan501 Jan 15, 2021
98cd501
remove github_links file
sampan501 Jan 15, 2021
8ce777e
remove old tutorial folder
sampan501 Jan 15, 2021
65dab0e
remove tutorials and gallery folders from docs
sampan501 Jan 15, 2021
7b1c6d7
add discriminability
sampan501 Jan 18, 2021
02bdeae
add statistic to the docs
sampan501 Jan 18, 2021
f902282
fix dcorr and hsic incorrect stats
sampan501 Jan 18, 2021
d439f6a
Merge branch 'master' into better-docs
sampan501 Jan 19, 2021
92e70c2
Merge branch 'master' into better-docs
sampan501 Jan 20, 2021
e64dae0
add citation page
sampan501 Jan 22, 2021
5f7b144
add internal links within docs
sampan501 Jan 22, 2021
4e68424
use readme in index.rst
sampan501 Jan 22, 2021
4e5d0d9
add links for dependencies
sampan501 Jan 22, 2021
a54fa40
use meth and class
sampan501 Jan 22, 2021
9a17205
update doc requirements
sampan501 Jan 22, 2021
615ec9d
remove docs tutorial page
sampan501 Jan 22, 2021
c85a5fa
update README to rst
sampan501 Jan 22, 2021
b61eb57
Update setup.py
sampan501 Jan 22, 2021
fe8cd2e
fix reference docs formatting
sampan501 Jan 22, 2021
f4b2844
remove figure from package overview
sampan501 Jan 22, 2021
a38fb43
add example folder
sampan501 Jan 22, 2021
45f27a4
add mgc map example
sampan501 Jan 22, 2021
2f83436
add independence simulations example
sampan501 Jan 22, 2021
a0d2412
add feature importance example
sampan501 Jan 22, 2021
572a640
install package pip netlify
sampan501 Jan 26, 2021
0f52c9a
Update netlify.toml
sampan501 Jan 26, 2021
81a58f9
add recommonmark
sampan501 Jan 26, 2021
34452d1
add matplotlib and seaborn
sampan501 Jan 26, 2021
550bf31
add tutorials and mmd
sampan501 Jan 26, 2021
b935187
add disco
sampan501 Jan 26, 2021
00de4f9
update doctest errors
sampan501 Jan 26, 2021
51c1dc8
Update mmd.py
sampan501 Jan 26, 2021
9c1e2ff
fix intersphinx mapping
sampan501 Jan 28, 2021
6b1bf60
move gaussian sim to new example
sampan501 Jan 28, 2021
8d9d0da
remove reps warning
sampan501 Feb 2, 2021
7a17b8e
add fast 1D Dcorr
sampan501 Feb 2, 2021
17225a9
add helper functions reference
sampan501 Feb 2, 2021
9b424cc
add manova and hotelling
sampan501 Feb 7, 2021
42007a2
fix documentation weird renderings
sampan501 Feb 7, 2021
fbbfb24
remove unused import indep_sims
sampan501 Feb 7, 2021
8f9db33
add max margin test
sampan501 Feb 8, 2021
f181538
fix doctest assertion errors
sampan501 Feb 8, 2021
1c657a8
cache numba
sampan501 Feb 8, 2021
6311eb0
add k-sample tutorial
sampan501 Feb 8, 2021
de0ddac
move general example to overview py file
sampan501 Feb 8, 2021
9a6e880
add time-series tutorial
sampan501 Feb 8, 2021
605ea9c
add time series tutorial
sampan501 Feb 8, 2021
5b0d7c9
add time series sims
sampan501 Feb 8, 2021
ec22b5c
fix tutorials
sampan501 Feb 8, 2021
db9b3e6
add v to tag name
sampan501 Feb 8, 2021
da680ec
change order of independence test
sampan501 Feb 8, 2021
f25bb81
add discriminability tutorial
sampan501 Feb 8, 2021
597b6c5
change name of time series sim file
sampan501 Feb 8, 2021
e936ea7
[skip ci] clean netlify directory
sampan501 Feb 8, 2021
7dd860d
[skip ci] rebuild docs
sampan501 Feb 8, 2021
eac3255
[skip ci] update netlify runtime version
sampan501 Feb 8, 2021
66ef343
change netlify build command to fix intersphinx
sampan501 Feb 8, 2021
add disco
sampan501 committed Jan 26, 2021
commit b93518768f65319763bc68f6b2e3d9ec1811fe0f
1 change: 1 addition & 0 deletions docs/api/index.rst
@@ -36,6 +36,7 @@ Independence
 KSample
 Energy
 MMD
+DISCO



197 changes: 197 additions & 0 deletions hyppo/ksample/disco.py
@@ -0,0 +1,197 @@
import numpy as np

from ..independence.hsic import _dcov
from ..tools import compute_dist
from ._utils import _CheckInputs, k_sample_transform
from .base import KSampleTest
from .ksamp import KSample


class DISCO(KSampleTest):
    r"""
    Distance Components (DISCO) test statistic and p-value.

    DISCO is a powerful multivariate `k`-sample test. It leverages distance matrix
    capabilities (similar to tests like distance correlation or Dcorr). In fact, the
    DISCO statistic is equivalent to our 2-sample formulation of nonparametric MANOVA
    via independence testing, i.e. :class:`hyppo.ksample.KSample`,
    and to
    :class:`hyppo.independence.Dcorr`,
    :class:`hyppo.ksample.Energy`,
    :class:`hyppo.independence.Hsic`, and
    :class:`hyppo.ksample.MMD` `[1]`_ `[2]`_.

    Traditionally, the formulation for the DISCO statistic
    is as follows `[3]`_:

    Define
    :math:`\{ u^i_1 \stackrel{iid}{\sim} F_{U_1},\ i = 1, ..., n_1 \}` up to
    :math:`\{ u^j_k \stackrel{iid}{\sim} F_{U_k},\ j = 1, ..., n_k \}` as `k` groups
    of samples deriving from different distributions with the same
    dimensionality. If :math:`d(\cdot, \cdot)` is a distance metric (e.g. Euclidean),
    :math:`N = \sum_{a = 1}^k n_a`,
    and :math:`\mathrm{Energy}` is the Energy test statistic from
    :class:`hyppo.ksample.Energy`,
    then

    .. math::

        \mathrm{DISCO}_N(\mathbf{u}_1, \ldots, \mathbf{u}_k) =
        \sum_{1 \leq a < b \leq k} \frac{n_a n_b}{2N}
        \mathrm{Energy}_{n_a + n_b} (\mathbf{u}_a, \mathbf{u}_b)

    The implementation in the ``hyppo.ksample.KSample`` class (using Dcorr) is in
    fact equivalent to this implementation (for p-values), and the statistics are
    equivalent up to a scaling factor `[1]`_.

    The p-value returned is calculated using a permutation test via
    :meth:`hyppo.tools.perm_test`.
    The fast version of the test uses :meth:`hyppo.tools.chi2_approx`.

    .. _[1]: https://arxiv.org/abs/1910.08883
    .. _[2]: https://arxiv.org/abs/1806.05514
    .. _[3]: https://www.semanticscholar.org/paper/TESTING-FOR-EQUAL-DISTRIBUTIONS-IN-HIGH-DIMENSION-Sz%C3%A9kely-Rizzo/ad5e91905a85d6f671c04a67779fd1377e86d199

    Parameters
    ----------
    compute_distance : str, callable, or None, default: "euclidean"
        A function that computes the distance among the samples within each
        data matrix.
        Valid strings for ``compute_distance`` are, as defined in
        ``sklearn.metrics.pairwise_distances``,

            - From scikit-learn: ["cityblock", "cosine", "euclidean", "l1", "l2",
              "manhattan"]. See the documentation for
              ``sklearn.metrics.pairwise_distances`` for details
              on these metrics.
            - From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev",
              "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis",
              "minkowski", "rogerstanimoto", "russellrao", "seuclidean",
              "sokalmichener", "sokalsneath", "sqeuclidean", "yule"]. See the
              documentation for
              :mod:`scipy.spatial.distance` for details on these metrics.

        To call a custom function, either create the distance matrix
        beforehand or create a function of the form ``metric(x, **kwargs)``
        where ``x`` is the data matrix for which pairwise distances are
        calculated and ``**kwargs`` are extra arguments to send to your custom
        function.
    bias : bool, default: False
        Whether to use the biased (``True``) or unbiased (``False``) test statistic.
    **kwargs
        Arbitrary keyword arguments for ``compute_distance``.
    """

    def __init__(self, compute_distance="euclidean", bias=False, **kwargs):
        # set is_distance to true if compute_distance is None
        self.is_distance = False
        if not compute_distance:
            self.is_distance = True
        KSampleTest.__init__(
            self, compute_distance=compute_distance, bias=bias, **kwargs
        )

    def statistic(self, *args):
        r"""
        Calculates the DISCO test statistic.

        Parameters
        ----------
        *args : ndarray
            Variable length input data matrices. All inputs must have the same
            number of samples and dimensions. That is, the shapes must be `(n, p)`
            where `n` is the number of samples and `p` is
            the number of dimensions.

        Returns
        -------
        stat : float
            The computed DISCO statistic.
        """
        inputs = list(args)
        N = [i.shape[0] for i in inputs]

        if len(set(N)) > 1:
            raise ValueError(
                "Shape mismatch, inputs must have same sample size, "
                "currently {}".format(N)
            )
        u, v = k_sample_transform(inputs)

        # if compute_distance is None, use the transformed inputs as-is so that
        # distu and distv are always defined; otherwise compute pairwise distances
        distu, distv = u, v
        if not self.is_distance:
            distu, distv = compute_dist(
                u, v, metric=self.compute_distance, **self.kwargs
            )

        # exact equivalence transformation between Dcorr and DISCO
        stat = _dcov(distu, distv, self.bias) * np.sum(N) * len(N) / 2
        self.stat = stat

        return stat

    def test(self, *args, reps=1000, workers=1, auto=True):
        r"""
        Calculates the DISCO test statistic and p-value.

        Parameters
        ----------
        *args : ndarray
            Variable length input data matrices. All inputs must have the same
            number of samples and dimensions. That is, the shapes must be `(n, p)`
            where `n` is the number of samples and `p` is
            the number of dimensions.
        reps : int, default: 1000
            The number of replications used to estimate the null distribution
            when the permutation test is used to calculate the p-value.
        workers : int, default: 1
            The number of cores to parallelize the p-value computation over.
            Supply ``-1`` to use all cores available to the Process.
        auto : bool, default: True
            Automatically uses the fast approximation when `n` and the size of the
            array are greater than 20. If ``True`` and the sample size is greater
            than 20, then :meth:`hyppo.tools.chi2_approx` will be run. Parameters
            ``reps`` and ``workers`` are irrelevant in this case. Otherwise,
            :meth:`hyppo.tools.perm_test` will be run.

        Returns
        -------
        stat : float
            The computed DISCO statistic.
        pvalue : float
            The computed DISCO p-value.

        Examples
        --------
        >>> import numpy as np
        >>> from hyppo.ksample import DISCO
        >>> x = np.arange(7)
        >>> y = x
        >>> stat, pvalue = DISCO().test(x, y)
        >>> '%.3f, %.1f' % (stat, pvalue)
        '0.267, 1.0'
        """
        inputs = list(args)
        check_input = _CheckInputs(
            inputs=inputs,
            indep_test="dcorr",
        )
        inputs = check_input()
        N = [i.shape[0] for i in inputs]

        if len(set(N)) > 1:
            raise ValueError("Shape mismatch, inputs must have shape [n, p].")

        # observed statistic
        stat = self.statistic(*inputs)

        # since the stat transformation is invariant under permutation, the
        # k-sample Dcorr p-value is identical to the DISCO p-value
        _, pvalue = KSample(
            indep_test="Dcorr",
            compute_distance=self.compute_distance,
            bias=self.bias,
            **self.kwargs
        ).test(*inputs, reps=reps, workers=workers, auto=auto)

        return stat, pvalue
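
The DISCO formula in the class docstring, a weighted sum of pairwise Energy statistics, can be illustrated with a minimal NumPy-only sketch. It assumes the biased (V-statistic) form of the energy distance with a Euclidean metric, which is not necessarily the scaling used by hyppo's Energy class. The helper names `pairwise_mean_distance`, `energy_distance`, and `disco_sum` are hypothetical and are not part of hyppo.

```python
from itertools import combinations

import numpy as np


def pairwise_mean_distance(a, b):
    """Mean Euclidean distance between every row of a and every row of b."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).mean()


def energy_distance(a, b):
    """Biased (V-statistic) energy distance between two (n, p) samples.

    Assumption: this is an illustrative definition, not hyppo's exact scaling.
    """
    return (
        2 * pairwise_mean_distance(a, b)
        - pairwise_mean_distance(a, a)
        - pairwise_mean_distance(b, b)
    )


def disco_sum(*samples):
    """Weighted sum of pairwise energy distances over all pairs of groups."""
    n_total = sum(s.shape[0] for s in samples)
    total = 0.0
    for a, b in combinations(samples, 2):
        weight = a.shape[0] * b.shape[0] / (2 * n_total)
        total += weight * energy_distance(a, b)
    return total


# two well-separated one-dimensional groups of two points each
group_a = np.array([[0.0], [1.0]])
group_b = np.array([[10.0], [11.0]])
print(disco_sum(group_a, group_b))  # 9.5
print(disco_sum(group_a, group_a))  # 0.0
```

Identical groups give a sum of zero, and the value grows as the groups move apart, which is the behavior the k-sample test relies on.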
4 changes: 2 additions & 2 deletions hyppo/ksample/ksamp.py
@@ -203,7 +203,7 @@ def statistic(self, *args):
         ----------
         *args : ndarray
             Variable length input data matrices. All inputs must have the same
-            number of samples. That is, the shapes must be `(n, p)` and
+            number of dimensions. That is, the shapes must be `(n, p)` and
             `(m, p)`, ... where `n`, `m`, ... are the number of samples and `p` is
             the number of dimensions.

@@ -228,7 +228,7 @@ def test(self, *args, reps=1000, workers=1, auto=True):
         ----------
         *args : ndarray
             Variable length input data matrices. All inputs must have the same
-            number of samples. That is, the shapes must be `(n, p)` and
+            number of dimensions. That is, the shapes must be `(n, p)` and
             `(m, p)`, ... where `n`, `m`, ... are the number of samples and `p` is
             the number of dimensions.
         reps : int, default: 1000
4 changes: 0 additions & 4 deletions hyppo/ksample/mmd.py
@@ -70,10 +70,6 @@ class MMD(KSampleTest):
     """

     def __init__(self, compute_distance="euclidean", bias=False, **kwargs):
-        # set is_distance to true if compute_distance is None
-        self.is_distance = False
-        if not compute_distance:
-            self.is_distance = True
         KSampleTest.__init__(
             self, compute_distance=compute_distance, bias=bias, **kwargs
         )
53 changes: 53 additions & 0 deletions hyppo/ksample/tests/test_disco.py
@@ -0,0 +1,53 @@
import numpy as np
import pytest
from numpy.testing import assert_almost_equal, assert_raises

from ...tools import linear, rot_ksamp
from .. import DISCO


class TestDISCO:
    @pytest.mark.parametrize(
        "n, obs_stat, obs_pvalue",
        [(200, 6.621905272534802, 0.001), (100, 2.675357570989666, 0.001)],
    )
    def test_disco_linear_oned(self, n, obs_stat, obs_pvalue):
        np.random.seed(123456789)
        x, y = rot_ksamp(linear, n, 1, k=2)
        stat, pvalue = DISCO().test(x, y, auto=False)

        assert_almost_equal(stat, obs_stat, decimal=1)
        assert_almost_equal(pvalue, obs_pvalue, decimal=1)


class TestDISCOErrorWarn:
    """Tests errors and warnings derived from DISCO."""

    def test_error_notndarray(self):
        # raises error if x or y is not a ndarray
        x = np.arange(20)
        y = [5] * 20
        z = np.arange(5)
        assert_raises(ValueError, DISCO().test, x, y, z)

    def test_error_shape(self):
        # raises error if number of samples different (n)
        x = np.arange(100).reshape(25, 4)
        y = x.reshape(10, 10)
        z = x
        assert_raises(ValueError, DISCO().test, x, y, z)

    def test_error_lowsamples(self):
        # raises error if samples are low (< 3)
        x = np.arange(3)
        y = np.arange(3)
        assert_raises(ValueError, DISCO().test, x, y)

    def test_error_nans(self):
        # raises error if inputs contain NaNs
        x = np.arange(20, dtype=float)
        x[0] = np.nan
        assert_raises(ValueError, DISCO().test, x, x)

        y = np.arange(20)
        assert_raises(ValueError, DISCO().test, x, y)
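
The parametrized test expects a p-value near 0.001 with `auto=False`. That is consistent with the smallest value a permutation test can return with the default `reps=1000` under the common `(1 + count) / (1 + reps)` convention, which is assumed here and not confirmed from `hyppo.tools.perm_test`. The generic sketch below also illustrates the comment in `DISCO.test` that rescaling a statistic by a positive constant leaves the permutation p-value unchanged. The names `mean_gap` and `perm_pvalue` are hypothetical, and the statistic is a simple stand-in rather than DISCO itself.

```python
import numpy as np


def mean_gap(x, y):
    """Stand-in two-sample statistic: absolute difference of the group means."""
    return abs(x.mean() - y.mean())


def perm_pvalue(stat_fn, x, y, reps=1000, seed=0):
    """Permutation p-value using the (1 + count) / (1 + reps) convention."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = stat_fn(x, y)
    null = np.empty(reps)
    for r in range(reps):
        # shuffle the pooled sample and split it back into two groups
        shuffled = rng.permutation(pooled)
        null[r] = stat_fn(shuffled[: len(x)], shuffled[len(x):])
    return (1 + np.sum(null >= observed)) / (1 + reps)


x = np.arange(20, dtype=float)
y = x + 100.0

p_raw = perm_pvalue(mean_gap, x, y)
# the same statistic multiplied by a positive constant, with the same seed
p_scaled = perm_pvalue(lambda a, b: 3.0 * mean_gap(a, b), x, y)

print(p_raw, p_scaled)
```

Because the same constant multiplies both the observed value and every permuted value, the comparison `null >= observed` gives the same counts, so only the reported statistic differs between the two runs.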