
Convergence problems in parallel #1778

Closed · tandreasr opened this issue May 8, 2024 · 9 comments
Labels: bug, parallel (Parallel capabilities)

@tandreasr

Hi,
I've just done some more tests of the parallel capabilities of MF6.
For that I used one of your large models from
https://water.usgs.gov/GIS/metadata/styles/landingPage/zell2020_wrr.xml
and decomposed the domain into 24 subdomains using METIS (see the sketch below).
Here I've attached the solution containing the 24 models:
SubdomainDec_0024.zip
SubdomainDec_0024.z01.txt
SubdomainDec_0024.z02.txt
To unpack this multi-volume ZIP archive you need to remove the two .txt extensions (I wasn't allowed to upload *.z0? files).
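For illustration, such a METIS decomposition of a structured grid can be scripted along these lines (a minimal sketch using the pymetis package; the grid dimensions are made up and this is not necessarily the exact workflow used here):

```python
import numpy as np
import pymetis  # Python wrapper around METIS

nrow, ncol, nparts = 300, 400, 24  # hypothetical grid size, 24 subdomains

def cell_id(r, c):
    return r * ncol + c

# Build the cell-connectivity graph of the structured grid (4-point stencil).
adjacency = []
for r in range(nrow):
    for c in range(ncol):
        nbrs = []
        if r > 0:
            nbrs.append(cell_id(r - 1, c))
        if r < nrow - 1:
            nbrs.append(cell_id(r + 1, c))
        if c > 0:
            nbrs.append(cell_id(r, c - 1))
        if c < ncol - 1:
            nbrs.append(cell_id(r, c + 1))
        adjacency.append(nbrs)

# k-way partitioning; membership holds a subdomain index (0..23) per cell.
_, membership = pymetis.part_graph(nparts, adjacency=adjacency)
part_array = np.array(membership).reshape(nrow, ncol)
```

An array like this can then drive a model-splitting tool (for example flopy's Mf6Splitter) to write out the individual submodels.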

As for my actual problem:
I did several test runs on a 24-core machine (Windows), using PETSc only:

"E:\Modflow6\mf6.5.0.dev2_win64par\bin\mpiexec.exe" -np 3 "E:\Modflow6\mf6.5.0.dev2_win64par\bin\mf6.exe" -P

varying only the -np parameter. Here is what I observed:

| np | Time (m:ss) | Iterations |
|----|-------------|------------|
| 1  | 1:18 | 20  |
| 2  | 0:40 | 20  |
| 3  | 1:08 | 82  |
| 4  | 0:19 | 21  |
| 5  | 1:25 | 210 |
| 6  | 0:38 | 82  |
| 7  | 0:52 | 135 |
| 8  | 0:42 | 135 |
| 9  | 0:44 | 135 |
| 10 | 0:55 | 169 |
| 11 | 0:55 | 169 |
| 12 | 0:42 | 169 |
| 16 | 0:42 | 166 |
| 24 | 0:08 | 21  |
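(The sweep itself is simple to script; a minimal sketch, assuming the paths from the command above and measuring only wall-clock time. The iteration counts were taken from the solver output, not from this script.)

```python
import subprocess
import time

BIN = r"E:\Modflow6\mf6.5.0.dev2_win64par\bin"

# Run the parallel (PETSc) solver with a varying number of MPI ranks and
# report the wall-clock time of each run.
for nranks in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 16, 24):
    t0 = time.perf_counter()
    subprocess.run(
        [rf"{BIN}\mpiexec.exe", "-np", str(nranks), rf"{BIN}\mf6.exe", "-P"],
        check=True,
    )
    print(f"np={nranks:2d}: {time.perf_counter() - t0:7.1f} s")
```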

For np = 1, 2, 4, and 24 the model converges quite well and the run times decrease as expected.

All other rank counts, however, drastically increase the number of iterations and thus the run time.
For comparison I've attached the convergence diagrams as they should look (np = 1, 2, 4 & 24)

[Figure: convergence diagram, 1 MPI rank]

and just one example (np=3) of the irregular cases (the others look quite similar, just with increasing numbers of iterations)

[Figure: convergence diagram, 3 MPI ranks]

There still seems to be a bug in those cases?
Best regards
Andreas

tandreasr added the bug label on May 8, 2024
@mjr-deltares (Contributor)

Hi @tandreasr, thanks for reporting. We are seeing very similar issues in one of the test models. I will get back to you on this.

@mjr-deltares (Contributor)

Hi @tandreasr, the issues you reported should be fixed by PR #1780. I have created a separate issue to capture the bug (#1779) and will close this one. Feel free to reopen if the solution is not adequate.

@mjr-deltares (Contributor)

mjr-deltares commented May 8, 2024

Thank you for providing the test model. It was a great help in pinning down the problem!

(PS: by the way, I would love to see an update of the timings above once the PR is merged into develop, if you get a chance.)

@tandreasr (Author)

tandreasr commented May 8, 2024

You are welcome.
I'll let you know about the new timings once I'm able to test it.
Have you already had time to take a look at #1753?
Regards
Andreas

@mjr-deltares (Contributor)

Just to make sure: are you referring to the request to always add the model column to the CSV, even if it is the only model on that specific partition, so that it is easier to parse?
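(To illustrate the point with made-up column names, which do not necessarily match the actual convergence CSV schema: a model column that is always present lets the same parsing code handle every partition, whether it holds one model or several.)

```python
# Hypothetical illustration only: the column names "model" and "dvmax"
# are assumptions, not the actual MODFLOW 6 convergence CSV schema.
import pandas as pd

df = pd.read_csv("solution.convergence.csv")

# Because the model column is guaranteed, single-model and multi-model
# partitions go through the exact same code path:
print(df.groupby("model")["dvmax"].max())
```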

@tandreasr (Author)

Exactly :-)

@tandreasr (Author)

tandreasr commented May 28, 2024

Hi Martijn,
I've been away from work for 3 weeks, so only now do I find the time to let you know that your corrections work for the above-mentioned scenario - thanks!
And here is the promised update on the timings :-)

| np | Time (m:ss) | Iterations |
|----|-------------|------------|
| 1  | 1:27 | 25 |
| 2  | 0:44 | 23 |
| 3  | 0:33 | 25 |
| 4  | 0:27 | 24 |
| 5  | 0:25 | 25 |
| 6  | 0:20 | 25 |
| 7  | 0:20 | 25 |
| 8  | 0:19 | 25 |
| 12 | 0:13 | 25 |
| 16 | 0:13 | 24 |
| 24 | 0:11 | 24 |

So all -np variants now converge in nearly the same number of iterations.

One more question - just out of interest:
Do you have any plans in the foreseeable future to support the GPU backends of PETSc, and if so, which one would be your choice?
(see https://petsc.org/release/overview/gpu_roadmap/ for reference)

Best regards
Andreas

@mjr-deltares (Contributor)

Thanks Andreas, that's good to know!

We are currently not looking at enabling GPU support for PETSc. Have you already seen good results for our (or similar) kind of application? Solving non-linear systems on a GPU is challenging in terms of performance gain; it's not going to give the 100x speedups people sometimes quote for explicit time-stepping algorithms.
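(For completeness: on the PETSc side the GPU backends are selected at run time through the options database, for example via the PETSC_OPTIONS environment variable that every PETSc program reads at startup. Whether the current parallel MODFLOW 6 build actually honors these types is a separate question; the snippet below is only a sketch of the PETSc mechanism, assuming a CUDA-enabled PETSc build.)

```python
import os
import subprocess

BIN = r"E:\Modflow6\mf6.5.0.dev2_win64par\bin"

# PETSc reads extra runtime options from the PETSC_OPTIONS environment
# variable; -vec_type cuda / -mat_type aijcusparse request the CUDA backend
# (only available when PETSc itself was configured with CUDA support).
env = dict(os.environ)
env["PETSC_OPTIONS"] = "-vec_type cuda -mat_type aijcusparse"

subprocess.run(
    [rf"{BIN}\mpiexec.exe", "-np", "4", rf"{BIN}\mf6.exe", "-P"],
    env=env,
    check=True,
)
```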

@tandreasr (Author)

Hi Martijn,
that matches my experience with GPU solvers (ViennaCL & Paralution) in older MODFLOW versions.
But I wanted to ask anyway, in case I had missed something :-)
Thank you very much
Andreas
