Unbiased squared distance covariance increases as sample size becomes large #353
Hello @loremarchi, I reviewed your code and the error, and it may be due to implementation differences or a lack of statistical power. To resolve this you could use another library, or try this code:

```python
import numpy as np
import hyppo

def _r_distance_corr(X, Y, mode="squared_cov", unbiased=True):
    ...  # function body truncated in the original comment

t = hyppo.independence.Dcorr()
n_samples = [100, 1000, 10000, 50000, 70000, 100000]
```
Dear team, first of all, thank you for maintaining this useful package. I was trying to run the independence test by Shen et al. (2022), but I received some strange results, such as a rejection of independence even though I knew that the two one-dimensional vectors I was testing were independent. I only noticed something was off when working with large samples (more than 70,000 observations).

Although I have not yet identified the source of the problem in the code, I have a script that reproduces the issue. Please note that I have modified the `statistic` method in `dcorr.py` to output both `"stat"` and `"covar"`.
Reproducing code example:
Please note that as the sample size increases (n = 70000 and n = 100000), the hyppo unbiased squared distance covariance becomes unintuitive.
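For context, a minimal NumPy-only sketch of the unbiased (U-centered) squared distance covariance estimator of Székely & Rizzo (2014) for one-dimensional vectors is below. The function names are illustrative, not hyppo's API; for independent inputs the estimate should hover near zero (it can be slightly negative) regardless of sample size, which is the behavior the issue reports as broken.

```python
import numpy as np

def u_centered(D):
    """U-center a pairwise distance matrix (Székely & Rizzo, 2014).

    A_ij = d_ij - row_i/(n-2) - col_j/(n-2) + total/((n-1)(n-2)),
    with the diagonal set to zero.
    """
    n = D.shape[0]
    row = D.sum(axis=1, keepdims=True) / (n - 2)
    col = D.sum(axis=0, keepdims=True) / (n - 2)
    total = D.sum() / ((n - 1) * (n - 2))
    A = D - row - col + total
    np.fill_diagonal(A, 0.0)
    return A

def unbiased_dcov_sq(x, y):
    """Unbiased estimator of the squared distance covariance for 1-D vectors."""
    n = len(x)
    A = u_centered(np.abs(x[:, None] - x[None, :]))
    B = u_centered(np.abs(y[:, None] - y[None, :]))
    # Diagonal entries are zero, so summing the full product is safe.
    return (A * B).sum() / (n * (n - 3))
```

For independent samples this quantity is an unbiased estimate of zero, so it should not grow with n; a systematic increase at n = 70000 and beyond points to a numerical or implementation problem rather than a property of the estimator.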
Results
Version information