Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent results when using multiprocessing in fit_pw method #15

Closed
ben300694 opened this issue May 16, 2023 · 1 comment
Closed

Comments

@ben300694
Copy link

Dear authors,

thank you for your work on this project and for making the code available to the community!
I have encountered an issue with the fit_pw method in the lPCA class when using multiprocessing. The issue is that the output dimensions are not the same when using different numbers of jobs (n_jobs), while they should be consistent. Below is a code sample to reproduce the problem:

import numpy as np
import skdim

# set the random seed
np.random.seed(42)

lpca = skdim.id.lPCA(ver="FO", alphaFO=0.05, verbose=True)

# create a random dataset
X_np = np.random.randn(100, 10)

print("X_np.shape: ", X_np.shape)
print(X_np)

lpca.fit_pw(X=X_np, n_neighbors=20, n_jobs=1, smooth=False)
lpca_dimensions_pw_n_jobs_1 = lpca.dimension_pw_
print(f"lpca.dimension_pw_: {lpca.dimension_pw_}")

lpca.fit_pw(X=X_np, n_neighbors=20, n_jobs=2, smooth=False)
lpca_dimensions_pw_n_jobs_2 = lpca.dimension_pw_
print(f"lpca.dimension_pw_: {lpca.dimension_pw_}")

# Check that the results are the same
assert np.allclose(lpca_dimensions_pw_n_jobs_1, lpca_dimensions_pw_n_jobs_2)

I have identified that the problem is related to the management of class instances and their state when using multiprocessing. In the current implementation, the worker processes do not share memory with the main process, so modifications to the instances within the worker processes are not reflected in the main process. The dimension is being calculated and stored in the _dimension attribute of the instances within the worker processes, but this information is not being propagated back to the main process.

A possible solution is to change the fit_pw function to return the computed dimension values instead of storing them in the _dimension attribute of the instances. Then, use the apply_async function to asynchronously apply the fit function to each data point and collect the results in the main process. Here's an example of a modified fit_pw function that addresses this issue:

def fit_pw(self, X, precomputed_knn=None, smooth=False, n_neighbors=100, n_jobs=1):
    # ...
    if n_jobs > 1:
        with mp.Pool(n_jobs) as pool:
            # Asynchronously apply the `fit` function to each data point and collect the results
            results = [pool.apply_async(self.fit, (X[i, :],)) for i in knnidx]
            # Retrieve the computed dimensions
            self.dimension_pw_ = np.array([r.get().dimension_ for r in results])
    # ...

With this modification, the computed dimensions are correctly returned and stored in the main process, and the dimensions are consistent when running the code with different numbers of jobs.
Would it be possible to consider this change or a similar approach to address the issue with multiprocessing in the fit_pw method? This problem likely also affects the other multiprocessing dimension estimates, and not just the lPCA class.

@j-bac
Copy link
Collaborator

j-bac commented May 16, 2023

thanks a lot for spotting this bug and for the detailed solution ! I implemented your fix in a new version '0.3.3'

@j-bac j-bac closed this as completed May 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants