
Parallel version slower than sequential #9

Closed
FractalArt opened this issue Dec 31, 2020 · 3 comments

Comments

@FractalArt
Hi,

I am reading your book and have arrived at Chapter 7, and I was wondering whether you have any timings available for comparison. When I timed the code myself, I was surprised to find that it runs faster on a single core than on four:

tsunami_git/src/ch07 on 🌱 master (34af56e)  
> hyperfine "cafrun -n 1 tsunami"
Benchmark #1: cafrun -n 1 tsunami
  Time (mean ± σ):     646.2 ms ±  51.0 ms    [User: 339.5 ms, System: 104.0 ms]
  Range (min … max):   566.3 ms … 719.4 ms    10 runs
 

tsunami_git/src/ch07 on 🌱 master (34af56e) 
> hyperfine "cafrun -n 4 tsunami"
Benchmark #1: cafrun -n 4 tsunami
  Time (mean ± σ):     751.9 ms ±  56.0 ms    [User: 1.473 s, System: 0.212 s]
  Range (min … max):   671.5 ms … 863.1 ms    10 runs

I think my coarray installation works, since I do see a speedup for the weather-buoy example. Could it be that there is so much synchronization going on that it negates the benefits of parallelization?

I am running Ubuntu 20.04.1 LTS with OpenCoarrays 2.9.0 (installed as described in Appendix A) on an Intel i7-8565U CPU @ 1.80 GHz × 8, and I use gfortran 9.3.0 as the compiler.

@milancurcic
Member

Hi @FractalArt, you're right: the communication overwhelms the computation, so the code as-is doesn't scale.

Try two things in tsunami.f90:

  1. Increase grid size:
integer(int32), parameter :: grid_size = 1000

It was 100 originally, but you can try even larger. We kept it small in the book so that the example runs fast on a single core. Increasing the grid size increases the computation, which decreases the communication-to-computation ratio (you want this ratio as small as possible for parallel scaling): the halo exchange between neighboring images costs roughly the same regardless of grid size, while each image's computation grows with the size of its subdomain.

  2. Comment out the gather + print lines inside the time loop:
    ...
    ! gather to image 1 and write current state to screen
    !gather(is:ie)[1] = h(ils:ile) ! there is an all-to-one communication here
    sync all ! this sync is still important before we move to the next time step
    !if (this_image() == 1) print *, n, gather

  end do time_loop

This reduces the communication in each step. The gather operation is only for diagnostic purposes, so removing it doesn't affect the result. Alternatively, to preserve some diagnostics, you can do the gather + print only every 10th or 100th time step, as sketched below.
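For example, here's a minimal sketch of that periodic gather (reusing the n, gather, h, and index variables from tsunami.f90 above; the interval of 100 is arbitrary):

    ! gather and print only every 100th time step; all images evaluate
    ! the same condition, so they take this branch together
    if (mod(n, 100) == 0) gather(is:ie)[1] = h(ils:ile)
    sync all ! still synchronize every step before the next iteration
    if (this_image() == 1 .and. mod(n, 100) == 0) print *, n, gather

This keeps the all-to-one communication out of most iterations while still giving you periodic output to watch.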

Let me know how this works out.

Ideally, this should have been explained in the book. We had a section about it, but it didn't make the cut. However, it should at least be explained in the README of this repo. Do you agree?

Thank you for reading and reporting this.

@FractalArt
Author

Hi @milancurcic,

Thanks a lot for your quick reply. You're right: if I crank up the grid size, I do see an improvement, and as you say, the larger the grid, the bigger the benefit of parallelization.

Regarding the explanation in the README, I have to say that I did not find it.

@milancurcic
Member

> Regarding the explanation in the README, I have to say that I did not find it.

Yes, there isn't one right now; I meant that I should write an explanation there.

milancurcic added a commit that referenced this issue Nov 17, 2021
Add note on parallel scalability to the README; closes #9