Dear authors,

Thank you for your work on this project and for making the code available to the community!
I have encountered an issue with the fit_pw method in the lPCA class when using multiprocessing. The issue is that the output dimensions are not the same when using different numbers of jobs (n_jobs), while they should be consistent. Below is a code sample to reproduce the problem:
import numpy as np
import skdim

# set the random seed
np.random.seed(42)

lpca = skdim.id.lPCA(ver="FO", alphaFO=0.05, verbose=True)

# create a random dataset
X_np = np.random.randn(100, 10)
print("X_np.shape: ", X_np.shape)
print(X_np)

lpca.fit_pw(X=X_np, n_neighbors=20, n_jobs=1, smooth=False)
lpca_dimensions_pw_n_jobs_1 = lpca.dimension_pw_
print(f"lpca.dimension_pw_: {lpca.dimension_pw_}")

lpca.fit_pw(X=X_np, n_neighbors=20, n_jobs=2, smooth=False)
lpca_dimensions_pw_n_jobs_2 = lpca.dimension_pw_
print(f"lpca.dimension_pw_: {lpca.dimension_pw_}")

# Check that the results are the same
assert np.allclose(lpca_dimensions_pw_n_jobs_1, lpca_dimensions_pw_n_jobs_2)
I have traced the problem to how class instances and their state are handled under multiprocessing. In the current implementation, the worker processes do not share memory with the main process, so modifications made to an instance inside a worker are not reflected in the main process. The dimension is calculated and stored in the dimension_ attribute of the instance inside each worker process, but this information is never propagated back to the main process.
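To make the mechanism concrete, here is a minimal sketch that simulates it without spawning any processes. It assumes only that arguments and return values cross the process boundary by pickling, which is how multiprocessing passes objects to and from workers. The estimator object and the fit_in_worker helper are hypothetical stand-ins, not skdim code.

```python
import pickle
from types import SimpleNamespace


def fit_in_worker(pickled_estimator, value):
    """Simulate a worker: unpickle a private copy, mutate it, send it back pickled."""
    est = pickle.loads(pickled_estimator)  # the worker only ever sees a copy
    est.dimension_ = value                 # the mutation happens on that copy
    return pickle.dumps(est)               # only the return value travels back


# Hypothetical stand-in for an estimator instance living in the main process.
estimator = SimpleNamespace(dimension_=None)
payload = pickle.dumps(estimator)

# Each "worker" mutates its own copy and returns it.
returned = [pickle.loads(fit_in_worker(payload, v)) for v in (2, 3, 5)]

# The instance in the main process is untouched by the workers' mutations...
print(estimator.dimension_)

# ...but the values are available on the objects that were returned.
dims = [r.dimension_ for r in returned]
print(dims)
```

The point of the sketch is that anything a worker writes to its copy is visible to the main process only if it is part of what the worker returns.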
A possible solution is to change the fit_pw function so that it collects the computed dimension values from what the workers return, instead of relying on the dimension_ attribute of instances that exist only inside the workers. Concretely, use the apply_async function to asynchronously apply the fit function to each data point and gather the results in the main process. Here's an example of a modified fit_pw function that addresses this issue:
def fit_pw(self, X, precomputed_knn=None, smooth=False, n_neighbors=100, n_jobs=1):
    # ...
    if n_jobs > 1:
        with mp.Pool(n_jobs) as pool:
            # Asynchronously apply the `fit` function to each data point and collect the results
            results = [pool.apply_async(self.fit, (X[i, :],)) for i in knnidx]
            # Retrieve the computed dimensions
            self.dimension_pw_ = np.array([r.get().dimension_ for r in results])
    # ...
With this modification, the computed dimensions are correctly returned and stored in the main process, and the dimensions are consistent when running the code with different numbers of jobs.
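The same return-the-result pattern can be sketched in isolation from skdim. The example below is an illustration under stated assumptions, not the library's implementation: toy_local_dimension is a hypothetical stand-in for a local estimator (it simply counts the coordinates that vary within a neighborhood), and fit_pointwise is a hypothetical helper that uses Pool.map so that every estimate comes back to the main process as a return value, in input order.

```python
import multiprocessing as mp
from statistics import pvariance


def toy_local_dimension(neighborhood):
    """Hypothetical stand-in for a local estimator: count the coordinates that vary."""
    columns = zip(*neighborhood)
    return sum(1 for col in columns if pvariance(col) > 1e-12)


def fit_pointwise(neighborhoods, n_jobs=1):
    """Return one estimate per neighborhood, serially or through a process pool."""
    if n_jobs > 1:
        with mp.Pool(n_jobs) as pool:
            # Pool.map sends each neighborhood to a worker and returns the
            # workers' return values to the main process, in input order.
            return pool.map(toy_local_dimension, neighborhoods)
    return [toy_local_dimension(nb) for nb in neighborhoods]


neighborhoods = [
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)],  # varies along one axis
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 1.0, 0.0)],  # varies along two axes
]


def main():
    serial = fit_pointwise(neighborhoods, n_jobs=1)
    parallel = fit_pointwise(neighborhoods, n_jobs=2)
    # Both paths go through the same pure function, so they should agree.
    assert serial == parallel
    print(serial)
```

To try the parallel path, call main() from a script under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that start workers by spawning. Because the worker is a module-level function with no side effects, the result does not depend on n_jobs.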
Would it be possible to consider this change or a similar approach to address the issue with multiprocessing in the fit_pw method? This problem likely also affects the other multiprocessing dimension estimates, and not just the lPCA class.