Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge redgreen-optimized into develop (Part 2) #120

Merged
merged 463 commits into from
Jul 19, 2024

Conversation

samhatfield
Copy link
Collaborator

Continuation of #106 because GitHub is too clever for its own good.

CMakeLists.txt Outdated Show resolved Hide resolved
src/programs/ectrans.in Outdated Show resolved Hide resolved
@@ -1071,6 +1084,10 @@ subroutine get_command_line_arguments(nsmax, cgrid, iters, nfld, nlev, lvordiv,
character(len=128) :: carg ! Storage variable for command line arguments
integer :: iarg = 1 ! Argument index

#ifdef USE_GPU
!$acc init
#endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can this not be done within setup_trans0 ?
OpenACC should be implementation detail

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the commit log this needs to be done before mpl_init. @marsdeno is that really the case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was told that MPI_INIT benefits from knowing OpenACC/GPUs will be used, in order to set up the right comms protocols in UCX layer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bummer. If one is using cuda or hip directly without OpenACC, how would you do this then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is USE_GPU defined anyway? There's only one reference to it:

ac6-102:/ec/res4/hpcperm/nash/ectrans $ ack USE_GPU
src/programs/ectrans-benchmark.F90
1090:#ifdef USE_GPU

src/trans/gpu/internal/trgtol_mod.F90
569:#ifdef USE_GPU_AWARE_MPI
691:#ifdef USE_GPU_AWARE_MPI

src/trans/gpu/internal/trmtol_mod.F90
168:#ifdef USE_GPU_AWARE_MPI
181:#ifdef USE_GPU_AWARE_MPI

src/trans/gpu/internal/trltog_mod.F90
695:#ifdef USE_GPU_AWARE_MPI
730:#ifdef USE_GPU_AWARE_MPI

src/trans/gpu/internal/trltom_mod.F90
174:#ifdef USE_GPU_AWARE_MPI
187:#ifdef USE_GPU_AWARE_MPI

src/trans/gpu/external/setup_trans0.F90
205:#ifdef USE_GPU_AWARE_MPI

src/trans/gpu/CMakeLists.txt
93:                               $<${HAVE_GPU_AWARE_MPI}:USE_GPU_AWARE_MPI>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch Sam, this (USE_GPU) is something I have in my testing on AC that got lost in the rgo-develop preparation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bummer. If one is using cuda or hip directly without OpenACC, how would you do this then?

Good question, I'll try and find out

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any documentation on this acc init and MPI_Init? Which MPI implementation?
Is there perhaps an environment variable as well?
It seems to have worked seemingly without as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, it should also work without, but the default behaviour is much better if you do acc init before MPI_Init, because then MPI knows about which GPUs you are using, and so on. You can always tweak UCX variables to achieve the same, but I highly recommend to do acc init before MPI_init.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think then we have to keep it for now.

src/trans/gpu/CMakeLists.txt Outdated Show resolved Hide resolved
src/trans/gpu/algor/cuda_device_mod.F90 Outdated Show resolved Hide resolved
tests/CMakeLists.txt Show resolved Hide resolved
src/transi/CMakeLists.txt Outdated Show resolved Hide resolved
cmake/ectrans-import.cmake.in Outdated Show resolved Hide resolved
@samhatfield
Copy link
Collaborator Author

I've combined the three device module files in the gpu algor directory: 23ee8e2.

Any feedback on that?

@wdeconinck
Copy link
Collaborator

I've combined the three device module files in the gpu algor directory: 23ee8e2.

Any feedback on that?

Looks good. This change also removed use cudafor that was in cuda_device_mod. Is that OK @marsdeno ?

@samhatfield
Copy link
Collaborator Author

Yes, I'm not sure that was actually used.

@samhatfield
Copy link
Collaborator Author

samhatfield commented Jul 16, 2024

Just to give you a sense of how well this branch performs, here are some numbers comparing rgo-develop/CPU with rgo-develop/GPU:

  • 1 node
  • CPU: 4 ranks x 32 threads
  • GPU: 4 ranks (1 GPU per rank) x 32 threads
  • 2 x AMD Rome
  • TCO639
  • 100 iterations
  • Single precision

CPU:

Inverse-direct transforms
-------------------------
avg  (s):   3.2426
min  (s):   2.9662
max  (s):   4.5370
med  (s):   3.1509
loop (s): 326.6021

GPU:

Inverse-direct transforms
-------------------------
avg  (s):   0.5349
min  (s):   0.4953
max  (s):   2.9323
med  (s):   0.4995
loop (s):  54.0317

@samhatfield
Copy link
Collaborator Author

I've removed all adjoint routines and FLT-related stuff from the gpu source tree. Neither of these were ported to GPU, and in fact the files just contained out-of-date versions of the CPU code.

@samhatfield
Copy link
Collaborator Author

Is it time to merge this?

marsdeno and others added 2 commits July 18, 2024 08:53
* In order to use this from fortran, add the following to callsite routine,
* in the appropriate places
*      use device_mod, only : devicegetmeminfo
*      use iso_c_binding, only : c_int
*      integer(c_int) :: imemfree,imemtotal
*      DEVICEGETMEMINFO(imemfree,imemtotal)
*      write(nout,*) 'Current free memory: ',imemfree,' out of total ',imemtotal
@marsdeno
Copy link
Collaborator

Good to go for me now.

Copy link
Collaborator

@wdeconinck wdeconinck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to me!

@wdeconinck wdeconinck merged commit 548cce2 into ecmwf-ifs:develop Jul 19, 2024
11 checks passed
@wdeconinck
Copy link
Collaborator

Thank you everyone involved, especially @anmrde ! This has been a huge milestone!!!

@samhatfield
Copy link
Collaborator Author

Celebrate good times!!!

Now, let's try and at least get IFS to complete 24 hours of integration with this branch without crashing...

@samhatfield samhatfield deleted the rgo-develop branch July 19, 2024 13:27
wdeconinck added a commit to DJDavies2/ectrans that referenced this pull request Jul 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants