Merge redgreen-optimized into develop (Part 2) #120

samhatfield · 2024-07-15T10:35:55Z

Continuation of #106 because GitHub is too clever for its own good.

This was needed in an earlier version!

CMakeLists.txt

src/programs/ectrans.in

wdeconinck · 2024-07-15T12:31:25Z

src/programs/ectrans-benchmark.F90

@@ -1071,6 +1084,10 @@ subroutine get_command_line_arguments(nsmax, cgrid, iters, nfld, nlev, lvordiv,
  character(len=128) :: carg          ! Storage variable for command line arguments
  integer            :: iarg = 1      ! Argument index

+#ifdef USE_GPU
+  !$acc init
+#endif


wdeconinck · 2024-07-15

Why can this not be done within setup_trans0?
OpenACC should be an implementation detail.

samhatfield · 2024-07-15

According to the commit log, this needs to be done before mpl_init. @marsdeno, is that really the case?

marsdeno · 2024-07-15

Yes, I was told that MPI_INIT benefits from knowing that OpenACC/GPUs will be used, in order to set up the right comms protocols in the UCX layer.

wdeconinck · 2024-07-16

Bummer. If one is using CUDA or HIP directly without OpenACC, how would you do this then?

samhatfield · 2024-07-16

Where is USE_GPU defined anyway? There's only one reference to it:

ac6-102:/ec/res4/hpcperm/nash/ectrans $ ack USE_GPU
src/programs/ectrans-benchmark.F90
1090:#ifdef USE_GPU

src/trans/gpu/internal/trgtol_mod.F90
569:#ifdef USE_GPU_AWARE_MPI
691:#ifdef USE_GPU_AWARE_MPI

src/trans/gpu/internal/trmtol_mod.F90
168:#ifdef USE_GPU_AWARE_MPI
181:#ifdef USE_GPU_AWARE_MPI

src/trans/gpu/internal/trltog_mod.F90
695:#ifdef USE_GPU_AWARE_MPI
730:#ifdef USE_GPU_AWARE_MPI

src/trans/gpu/internal/trltom_mod.F90
174:#ifdef USE_GPU_AWARE_MPI
187:#ifdef USE_GPU_AWARE_MPI

src/trans/gpu/external/setup_trans0.F90
205:#ifdef USE_GPU_AWARE_MPI

src/trans/gpu/CMakeLists.txt
93: $<${HAVE_GPU_AWARE_MPI}:USE_GPU_AWARE_MPI>

marsdeno · 2024-07-16

Good catch Sam, this (USE_GPU) is something I have in my testing on AC that got lost in the rgo-develop preparation.

marsdeno · 2024-07-16

> Bummer. If one is using CUDA or HIP directly without OpenACC, how would you do this then?

Good question, I'll try and find out.

wdeconinck · 2024-07-16

Is there any documentation on this acc init and MPI_Init? Which MPI implementation?
Is there perhaps an environment variable as well?
It seems to have worked without it as well.

lukasm91 · 2024-07-17

In general, it should also work without it, but the default behaviour is much better if you do acc init before MPI_Init, because then MPI knows which GPUs you are using, and so on. You can always tweak UCX variables to achieve the same, but I highly recommend doing acc init before MPI_Init.

samhatfield · 2024-07-18

I think then we have to keep it for now.

src/trans/gpu/CMakeLists.txt

src/trans/gpu/algor/cuda_device_mod.F90

tests/CMakeLists.txt

src/transi/CMakeLists.txt

cmake/ectrans-import.cmake.in

samhatfield · 2024-07-16T09:27:30Z

I've combined the three device module files in the gpu algor directory: 23ee8e2.

Any feedback on that?

wdeconinck · 2024-07-16T10:53:12Z

> I've combined the three device module files in the gpu algor directory: 23ee8e2.
>
> Any feedback on that?

Looks good. This change also removed the use cudafor statement that was in cuda_device_mod. Is that OK, @marsdeno?

samhatfield · 2024-07-16T10:58:37Z

Yes, I'm not sure that was actually used.

samhatfield · 2024-07-16T14:48:25Z

Just to give you a sense of how well this branch performs, here are some numbers comparing rgo-develop/CPU with rgo-develop/GPU:

1 node
CPU: 4 ranks x 32 threads
GPU: 4 ranks (1 GPU per rank) x 32 threads
2 x AMD Rome
TCO639
100 iterations
Single precision

CPU:

Inverse-direct transforms
-------------------------
avg  (s):   3.2426
min  (s):   2.9662
max  (s):   4.5370
med  (s):   3.1509
loop (s): 326.6021

GPU:

Inverse-direct transforms
-------------------------
avg  (s):   0.5349
min  (s):   0.4953
max  (s):   2.9323
med  (s):   0.4995
loop (s):  54.0317
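
For a quick read of the two summaries above, the CPU/GPU ratios can be worked out per statistic. This is only a sketch for illustration: the numbers are copied from the output above, while the `speedups` helper and the dictionary layout are hypothetical names, not part of ecTrans or its benchmark driver.

```python
# Timings in seconds, copied from the CPU and GPU benchmark summaries above.
cpu = {"avg": 3.2426, "min": 2.9662, "max": 4.5370, "med": 3.1509, "loop": 326.6021}
gpu = {"avg": 0.5349, "min": 0.4953, "max": 2.9323, "med": 0.4995, "loop": 54.0317}


def speedups(cpu_times, gpu_times):
    """Return CPU/GPU time ratios per statistic (hypothetical helper).

    A ratio above 1 means the GPU run was faster for that statistic.
    """
    return {key: cpu_times[key] / gpu_times[key] for key in cpu_times}


ratios = speedups(cpu, gpu)
for key, ratio in ratios.items():
    print(f"{key:>4}: {ratio:5.2f}x")
```

The avg, min, med and loop ratios all come out at roughly 6x, whereas the max ratio is only about 1.5x, since the slowest GPU iteration took 2.9323 s.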

samhatfield · 2024-07-17T08:50:54Z

I've removed all adjoint routines and FLT-related stuff from the gpu source tree. Neither was ported to GPU, and in fact the files just contained out-of-date versions of the CPU code.

samhatfield · 2024-07-18T08:33:19Z

Is it time to merge this?

 * In order to use this from fortran, add the following to callsite routine,
 * in the appropriate places
 *   use device_mod, only : devicegetmeminfo
 *   use iso_c_binding, only : c_int
 *   integer(c_int) :: imemfree,imemtotal
 *   DEVICEGETMEMINFO(imemfree,imemtotal)
 *   write(nout,*) 'Current free memory: ',imemfree,' out of total ',imemtotal

Reinstate memgetinfo utility

marsdeno · 2024-07-18T09:32:55Z

Good to go for me now.

wdeconinck

Good to me!

wdeconinck · 2024-07-19T13:21:00Z

Thank you everyone involved, especially @anmrde ! This has been a huge milestone!!!

samhatfield · 2024-07-19T13:22:47Z

Celebrate good times!!!

Now, let's try and at least get IFS to complete 24 hours of integration with this branch without crashing...

marsdeno and others added 30 commits October 21, 2023 01:20

Harmonisation

78580b5

Cleanup TRLTOG vertical offsets

21d7a7a

Explicitly pass arrays into FTDIR

33cb53f

Add back FOURIER_OUT function/file

a6ab754

ZGTF is now a local variable

d674c02

Implement pointer swap in ftdir

c26890a

Non-critical: FTDIR and FTINV perfectly shadow each other now

4580eb9

Minor changes to make FOURIER_IN and FOURIER_OUT more similar

3e3f613

Pass through FOUBUF_IN

e7a002a

Re-allocate FOUBUF_IN in DIR_TRANS

79ed559

Reallocate FOUBUF in DIR_TRANS

ff4e0a3

Reallocate POA1 in LEDIR

95880d9

Remove some allocations from setup_trans

e2532fe

Remove redundant variables from fields and dir files

02ab29a

No more need to compute divergence if vorticity is needed

343c40c

This was needed in an earlier version!

Remove redundant variables

3fb6e56

Use pointers for clarity

d6cc2f3

Accidentally added too many FFTs again

d3db819

Interface changes between complex/non-complex field counts

5decf18

Tiny cleanup in modules

e703dc7

Put copyins and copyouts at the same place for INV and DIR

c9dc45c

Refactor 4XX GSTATS (NVIDIA GSTATS)

423e207

Remove barriers that are not ours

a52804d

Redirect some GSTATS function to add nvtx

73693f7

Add missing GEMM label

83047b8

Increase parallelism again for some slow kernels in DIR

0fa8fca

Pimp a bit the NVTX coloring

ecf077c

Try improve LEDIR GEMM array packing

9af5ac0

CUFFT: Use workspace

0ae7935

The complex part of ZGTF is compact now

48498a1

wdeconinck requested changes Jul 15, 2024

samhatfield added 4 commits July 15, 2024 15:51

Add me and Olivier to author list

edfce03

Fix treatment of MPI feature

ff35922

Rationalise device-related modules in GPU algor directory

23ee8e2

Fix incorrect paths

c038e01

samhatfield force-pushed the rgo-develop branch from e571b57 to c038e01 Compare July 16, 2024 09:26

samhatfield added 2 commits July 16, 2024 10:56

Convert DEVICE_MOD to uppercase

933c19d

Only bring in FFTW for CPU

d926322

Remove all adjoint routines from gpu branch

a8a21fe

samhatfield force-pushed the rgo-develop branch from bba02eb to f52ceb6 Compare July 17, 2024 08:49

samhatfield force-pushed the rgo-develop branch from df394fd to 40936ef Compare July 17, 2024 13:26

samhatfield added 2 commits July 17, 2024 15:18

Remove butterfly algorithm from gpu version

5397abc

Remove overriding of BUILD_SHARED_LIBS

80439a5

samhatfield force-pushed the rgo-develop branch from 40936ef to 80439a5 Compare July 17, 2024 15:18

Fix wrong preprocessor statement

37025f5

marsdeno and others added 2 commits July 18, 2024 08:53

Merge pull request #3 from marsdeno/rgo-develop-memgetinfo

36abc08

Reinstate memgetinfo utility

wdeconinck approved these changes Jul 19, 2024

wdeconinck merged commit 548cce2 into ecmwf-ifs:develop Jul 19, 2024
11 checks passed

samhatfield deleted the rgo-develop branch July 19, 2024 13:27

wdeconinck added a commit to DJDavies2/ectrans that referenced this pull request Jul 19, 2024

Merge branch 'develop' into bugfix/95

5c10ac3

* develop: Add GPU capability (ecmwf-ifs#106 , ecmwf-ifs#120)
