simulate_for_sbi not saturating CPU #1175
Dear @janko-petkovic, thanks a lot for looking into this and providing the benchmark. I can confirm your findings. This is indeed very interesting behavior. I think the reason could be the progress bar wrappers we use to show progress. Indeed, if I add the following minimal joblib function to the benchmark, I get the same performance as with `p_map`:

```python
import numpy as np
from joblib import Parallel, delayed
from tqdm import tqdm

@timeit
def run_simulations_joblib(prior, num_simulations, num_processes):
    '''Generates the joint using joblib'''
    theta_low = prior.base_dist.low.cpu().numpy()
    theta_high = prior.base_dist.high.cpu().numpy()
    theta_range = theta_high - theta_low
    thetas = np.random.uniform(size=(num_simulations, 1)) * theta_range + theta_low
    x = [
        r
        for r in tqdm(
            Parallel(return_as="generator", n_jobs=num_processes)(
                delayed(simulation)(theta) for theta in thetas
            ),
            total=num_simulations,
        )
    ]
```
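The `@timeit` decorator used above is not part of joblib and is not defined in the snippet; a minimal sketch of such a timing helper (an assumption, not necessarily the one used in the actual benchmark) could be:

```python
import time
from functools import wraps

def timeit(func):
    """Print the wall-clock runtime of the wrapped function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.2f} s")
        return result
    return wrapper
```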
Looking into the code for
@tomMoral might have a word or two to say about this?
Very glad to see this! Also, very interestingly, the fix you proposed seems to work better than `simulate_for_sbi` for

```python
import numpy as np

def simulation(theta):
    '''Some generic simulation function.

    You can change this to study, e.g., the impact of numpy's multithreading
    on the code's multiprocessing (np.linalg.eig).
    '''
    matrix = np.random.rand(200, 200) / theta
    eigval, eigvec = np.linalg.eig(matrix)
    return 1
```

where the vectorization of
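Since the comment above points at the interaction between numpy's multithreading and process-level parallelism, one common culprit in such setups (a general observation, not confirmed for this issue) is BLAS oversubscription: `np.linalg.eig` may spawn its own threads inside every worker process. A sketch of pinning the BLAS/OpenMP thread pools via environment variables, which must happen before numpy is first imported:

```python
import os

# Each variable targets a different BLAS/OpenMP backend; setdefault keeps
# any limits the user already configured. (Assumption: the slowdown comes
# from BLAS threads competing with the worker processes.)
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ.setdefault(var, "1")

import numpy as np

def simulation(theta):
    # same toy simulator as above, now with single-threaded linear algebra
    matrix = np.random.rand(200, 200) / theta
    eigval, eigvec = np.linalg.eig(matrix)
    return 1
```

Note that the variables only take effect if they are set before the BLAS library is loaded, which is why they sit at the very top of the script.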
Sure, I'd love to. How can we organize?
Yes, please!
Please follow the instructions here: https://github.com/sbi-dev/sbi/blob/main/docs/docs/contribute.md

Regarding the refactoring: I would
One strange thing I noticed and haven't understood yet: in the benchmark, you sampled thetas using a `np.random.uniform` call. However, if I do so, the following call to
Great! I will start working on it then. Concerning your observation: yes, I can confirm that, and it seems to happen because, again, the CPU saturation is reduced. From what I see, it is connected to the fact that with

```python
thetas = prior.sample((num_simulations,)).tolist()
# [... joblib code ...]
```

everything goes back to normal, and the CPU returns to full saturation, with equal runtimes across the board. Why
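One plausible reading of the `.tolist()` effect (speculation on my part, not verified against sbi internals) is serialization overhead: `.tolist()` turns the sampled tensor into plain Python floats, which pickle far more compactly than array or tensor objects when joblib ships each theta to a worker. A numpy stand-in (torch omitted to keep the sketch dependency-free) illustrates the per-task size difference:

```python
import pickle

import numpy as np

# Stand-in for prior.sample((num_simulations,)): an (N, 1) float array.
thetas_array = np.random.uniform(size=(1000, 1))
thetas_list = thetas_array.tolist()  # list of [float] rows, as after .tolist()

# Bytes pickled when one theta is shipped to a worker process.
bytes_per_array_row = len(pickle.dumps(thetas_array[0]))
bytes_per_list_row = len(pickle.dumps(thetas_list[0]))
print(bytes_per_array_row, bytes_per_list_row)
```

With torch tensors the mechanics differ (torch has its own shared-memory pickling, see the PyTorch multiprocessing notes linked below), so this is only an analogy for why plain floats can be cheaper to move between processes.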
I am able to reproduce what you find on my machine; indeed,
I think that this behaviour does not depend on
OK, thanks for checking this. It would be interesting to get @tomMoral's take on this. Also, this might be a relevant read: https://pytorch.org/docs/stable/notes/multiprocessing.html
Hello guys. A quick update on the state of the issue. After discussing with @janfb, we decided to proceed as follows:
I will now do the first PR. Please let me know if you have any concerns.
Dear devs,

I have run into this unexpected behaviour while running the `sbi.inference.simulate_for_sbi` method. It appears that, in a multi-processing scenario, the single-core saturation is capped at a value that decreases as a function of the number of simulations to be produced. This inevitably leads to increased simulation times.

I have tried to circumvent the `sbi.inference.simulate_for_sbi` method both by manually implementing multi-processing (with the straightforward Python `multiprocessing.Pool`) and with the `p_map` method from the `p_tqdm` library. The latter has the advantage of being extremely easy to use, and of natively supporting `tqdm` progress bars. Both of these options correctly saturate all the CPUs involved in the computation. I wonder whether this behaviour comes from the choice of `joblib` as the multiprocessing library, and whether a switch to `p_tqdm` could improve the performance of the sbi library.

Below, I provide some code to test this behaviour. The requirements are the source version of `sbi` as well as the pip version of `p_tqdm`. The code contains three benchmarking methods (vanilla `Pool`, `p_map`, and `simulate_for_sbi`) as well as two handles for the number of processes (workers) to be used and the number of simulations to be generated.

Judging by the ETA, the runtime difference between `simulate_for_sbi` and `p_map` is negligible for a low number of simulations (e.g. 100) but grows with it, reaching a factor of 10 for a `num_simulations` equal to 1M.

I hope this report can be helpful, especially since the simulation runtime represents a bottleneck in many common use cases.
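For reference, a minimal version of the vanilla `multiprocessing.Pool` baseline described above might look like the following (a sketch with a trivial stand-in simulator, not the exact benchmark script):

```python
import multiprocessing as mp
import time

def simulation(theta):
    # trivial stand-in for the real simulator
    return sum(theta)

def run_simulations_pool(thetas, num_processes):
    """Vanilla multiprocessing.Pool baseline: one task per theta."""
    # "fork" (Unix) lets worker processes inherit locally defined functions.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=num_processes) as pool:
        return pool.map(simulation, thetas)

if __name__ == "__main__":
    thetas = [[float(i)] for i in range(100)]
    start = time.perf_counter()
    xs = run_simulations_pool(thetas, num_processes=2)
    print(f"{len(xs)} simulations in {time.perf_counter() - start:.3f} s")
```

Watching per-core utilization (e.g. with `htop`) while this runs is the quickest way to compare its saturation against `simulate_for_sbi`.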