
Convergence problems in parallel #1778

Closed · tandreasr opened this issue May 8, 2024 · 9 comments
Labels: bug, parallel (Parallel capabilities)

@tandreasr

Hi,
I've just done some more tests of the parallel capabilities of MF6.
For that I used one of your large models from
https://water.usgs.gov/GIS/metadata/styles/landingPage/zell2020_wrr.xml
and decomposed the domain into 24 subdomains using METIS (see the sketch below).
Here I've attached the solution containing the 24 models:
SubdomainDec_0024.zip
SubdomainDec_0024.z01.txt
SubdomainDec_0024.z02.txt
To unpack this multi-volume ZIP archive you need to remove the two .txt extensions (I wasn't allowed to upload *.z0? files).
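For illustration, such a METIS decomposition of a structured grid can be scripted along these lines (a minimal sketch using the pymetis package; the grid dimensions are made up and this is not necessarily the exact workflow used here):

```python
import numpy as np
import pymetis  # Python wrapper around METIS

nrow, ncol, nparts = 300, 400, 24  # hypothetical grid size, 24 subdomains

def cell_id(r, c):
    return r * ncol + c

# Build the cell-connectivity graph of the structured grid (4-point stencil).
adjacency = []
for r in range(nrow):
    for c in range(ncol):
        nbrs = []
        if r > 0:
            nbrs.append(cell_id(r - 1, c))
        if r < nrow - 1:
            nbrs.append(cell_id(r + 1, c))
        if c > 0:
            nbrs.append(cell_id(r, c - 1))
        if c < ncol - 1:
            nbrs.append(cell_id(r, c + 1))
        adjacency.append(nbrs)

# k-way partitioning; membership holds a subdomain index (0..23) per cell.
_, membership = pymetis.part_graph(nparts, adjacency=adjacency)
part_array = np.array(membership).reshape(nrow, ncol)
```

An array like this can then drive a model-splitting tool (for example flopy's Mf6Splitter) to write out the individual submodels.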

As for my actual problem:
I did several test runs on a 24-core machine (Windows), using PETSc only:

"E:\Modflow6\mf6.5.0.dev2_win64par\bin\mpiexec.exe" -np 3 "E:\Modflow6\mf6.5.0.dev2_win64par\bin\mf6.exe" -P

varying only the -np parameter. Here is what I observed:

| np | Time (m:ss) | Iterations |
|----|-------------|------------|
| 1  | 1:18 | 20  |
| 2  | 0:40 | 20  |
| 3  | 1:08 | 82  |
| 4  | 0:19 | 21  |
| 5  | 1:25 | 210 |
| 6  | 0:38 | 82  |
| 7  | 0:52 | 135 |
| 8  | 0:42 | 135 |
| 9  | 0:44 | 135 |
| 10 | 0:55 | 169 |
| 11 | 0:55 | 169 |
| 12 | 0:42 | 169 |
| 16 | 0:42 | 166 |
| 24 | 0:08 | 21  |
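(The sweep itself is simple to script; a minimal sketch, assuming the paths from the command above and measuring only wall-clock time. The iteration counts were taken from the solver output, not from this script.)

```python
import subprocess
import time

BIN = r"E:\Modflow6\mf6.5.0.dev2_win64par\bin"

# Run the parallel (PETSc) solver with a varying number of MPI ranks and
# report the wall-clock time of each run.
for nranks in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 16, 24):
    t0 = time.perf_counter()
    subprocess.run(
        [rf"{BIN}\mpiexec.exe", "-np", str(nranks), rf"{BIN}\mf6.exe", "-P"],
        check=True,
    )
    print(f"np={nranks:2d}: {time.perf_counter() - t0:7.1f} s")
```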

For np = 1, 2, 4, and 24 the model converges quite well and the run times decrease as expected.

All other rank counts, however, drastically increase the number of iterations and thus the run time.
For comparison I've attached the convergence diagrams as they should look (np = 1, 2, 4 & 24)

[Figure: convergence diagram, 1 MPI rank]

and just one example (np=3) of the irregular cases (the others look quite similar, just with increasing numbers of iterations)

[Figure: convergence diagram, 3 MPI ranks]

There still seems to be a bug in those cases?
Best regards
Andreas

tandreasr added the bug label on May 8, 2024
@mjr-deltares (Contributor)

Hi @tandreasr, thanks for reporting. We are seeing very similar issues in one of the test models. I will get back to you on this.

@mjr-deltares (Contributor)

Hi @tandreasr, the issues you reported should be fixed by PR #1780. I have created a separate issue to capture the bug (#1779) and will close this one. Feel free to reopen if the solution is not adequate.

@mjr-deltares (Contributor)

mjr-deltares commented May 8, 2024

Thank you for providing the test model. It was a great help in pinning down the problem!

(PS: by the way, I would love to see an update of the timings above once the PR is merged into develop, if you get a chance.)

@tandreasr (Author)

tandreasr commented May 8, 2024

You are welcome.
I'll let you know about the new timings once I'm able to test it.
Have you already had time to take a look at #1753?
Regards
Andreas

@mjr-deltares (Contributor)

Just to make sure: are you referring to the request to always add the model column to the CSV, even if it is the only model on that specific partition, so that it is easier to parse?
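(To illustrate the point with made-up column names, which do not necessarily match the actual convergence CSV schema: a model column that is always present lets the same parsing code handle every partition, whether it holds one model or several.)

```python
# Hypothetical illustration only: the column names "model" and "dvmax"
# are assumptions, not the actual MODFLOW 6 convergence CSV schema.
import pandas as pd

df = pd.read_csv("solution.convergence.csv")

# Because the model column is guaranteed, single-model and multi-model
# partitions go through the exact same code path:
print(df.groupby("model")["dvmax"].max())
```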

@tandreasr (Author)

Exactly :-)

@tandreasr (Author)

tandreasr commented May 28, 2024

Hi Martijn,
I've been away from work for 3 weeks, so only now do I find the time to let you know that your corrections work for the above-mentioned scenario - thanks!
And here is the promised update on the timings :-)

| np | Time (m:ss) | Iterations |
|----|-------------|------------|
| 1  | 1:27 | 25 |
| 2  | 0:44 | 23 |
| 3  | 0:33 | 25 |
| 4  | 0:27 | 24 |
| 5  | 0:25 | 25 |
| 6  | 0:20 | 25 |
| 7  | 0:20 | 25 |
| 8  | 0:19 | 25 |
| 12 | 0:13 | 25 |
| 16 | 0:13 | 24 |
| 24 | 0:11 | 24 |

So all -np variants now converge in nearly the same number of iterations.

One more question - just out of interest:
Do you have any plans in the foreseeable future to support the GPU backends of PETSc, and if so, which one would be your choice?
(see https://petsc.org/release/overview/gpu_roadmap/ for reference)

Best regards
Andreas

@mjr-deltares (Contributor)

Thanks Andreas, that's good to know!

We are currently not looking at enabling GPU support for PETSc. Have you already seen good results for our (or similar) kind of application? Solving non-linear systems on a GPU is challenging in terms of performance gain; it's not going to give the 100x speedups people sometimes quote for explicit time-stepping algorithms.
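(For completeness: on the PETSc side the GPU backends are selected at run time through the options database, for example via the PETSC_OPTIONS environment variable that every PETSc program reads at startup. Whether the current parallel MODFLOW 6 build actually honors these types is a separate question; the snippet below is only a sketch of the PETSc mechanism, assuming a CUDA-enabled PETSc build.)

```python
import os
import subprocess

BIN = r"E:\Modflow6\mf6.5.0.dev2_win64par\bin"

# PETSc reads extra runtime options from the PETSC_OPTIONS environment
# variable; -vec_type cuda / -mat_type aijcusparse request the CUDA backend
# (only available when PETSc itself was configured with CUDA support).
env = dict(os.environ)
env["PETSC_OPTIONS"] = "-vec_type cuda -mat_type aijcusparse"

subprocess.run(
    [rf"{BIN}\mpiexec.exe", "-np", "4", rf"{BIN}\mf6.exe", "-P"],
    env=env,
    check=True,
)
```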

@tandreasr (Author)

Hi Martijn,
that matches my experience with GPU solvers (ViennaCL & Paralution) in older MODFLOW versions.
But I wanted to ask anyway, in case I had missed something :-)
Thank you very much
Andreas
