
Parallel version slower than sequential #9

Closed
FractalArt opened this issue Dec 31, 2020 · 3 comments

Comments

@FractalArt
Hi,

I am reading your book and have arrived at Chapter 7, and I was wondering whether you have any timings available for comparison. When I timed the code myself, I was surprised to find that it runs faster on a single core than on four:

tsunami_git/src/ch07 on 🌱 master (34af56e)  
> hyperfine "cafrun -n 1 tsunami"
Benchmark #1: cafrun -n 1 tsunami
  Time (mean ± σ):     646.2 ms ±  51.0 ms    [User: 339.5 ms, System: 104.0 ms]
  Range (min … max):   566.3 ms … 719.4 ms    10 runs
 

tsunami_git/src/ch07 on 🌱 master (34af56e) 
> hyperfine "cafrun -n 4 tsunami"
Benchmark #1: cafrun -n 4 tsunami
  Time (mean ± σ):     751.9 ms ±  56.0 ms    [User: 1.473 s, System: 0.212 s]
  Range (min … max):   671.5 ms … 863.1 ms    10 runs

I think my coarray installation works, since I do see a speedup for the weather-buoy example. Could it be that there is so much synchronization going on that it negates the benefits of parallelization?

I am running Ubuntu 20.04.1 LTS with OpenCoarrays 2.9.0 (installed as described in Appendix A) on an Intel i7-8565U CPU @ 1.80 GHz × 8, and I use gfortran 9.3.0 as the compiler.

@milancurcic
Member

Hi @FractalArt, you're right: the communication overwhelms the computation, so the code as-is doesn't scale.

Try two things in tsunami.f90:

  1. Increase grid size:
integer(int32), parameter :: grid_size = 1000

It was 100 originally, but you can try even larger. We kept it small in the book so that the example runs fast on a single core. Increasing the grid size increases the computation, which decreases the communication-to-computation ratio (you want this ratio as small as possible for parallel scaling): the halo exchange between neighboring images costs roughly the same regardless of grid size, while each image's computation grows with the size of its subdomain.

  2. Comment out the gather + print lines inside the time loop:
    ...
    ! gather to image 1 and write current state to screen
    !gather(is:ie)[1] = h(ils:ile) ! there is an all-to-one communication here
    sync all ! this sync is still important before we move to the next time step
    !if (this_image() == 1) print *, n, gather

  end do time_loop

This reduces the communication in each step. The gather operation is only for diagnostic purposes, so removing it doesn't affect the result. Alternatively, to preserve some diagnostics, you can do the gather + print only every 10th or 100th time step, as sketched below.
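For example, here's a minimal sketch of that periodic gather (reusing the n, gather, h, and index variables from tsunami.f90 above; the interval of 100 is arbitrary):

    ! gather and print only every 100th time step; all images evaluate
    ! the same condition, so they take this branch together
    if (mod(n, 100) == 0) gather(is:ie)[1] = h(ils:ile)
    sync all ! still synchronize every step before the next iteration
    if (this_image() == 1 .and. mod(n, 100) == 0) print *, n, gather

This keeps the all-to-one communication out of most iterations while still giving you periodic output to watch.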

Let me know how this works out.

Ideally, this should have been explained in the book. We had a section about it, but it didn't make the cut. However, it should at least be explained in the README of this repo. Do you agree?

Thank you for reading and reporting this.

@FractalArt
Author

Hi @milancurcic,

Thanks a lot for your quick reply. You're right: if I crank up the grid size, I do see an improvement, and as you say, the larger the grid, the bigger the benefit of parallelization.

Regarding the explanation in the README, I have to say that I did not find it.

@milancurcic
Member

> Regarding the explanation in the README, I have to say that I did not find it.

Yes, there isn't one right now; I meant that I should write an explanation there.

milancurcic added a commit that referenced this issue Nov 17, 2021
Add note on parallel scalability to the README; closes #9