
Add batched DGEMM to CCSD(T) for OpenMP offload and OpenACC implementations #980

Conversation

omarkahmed
Contributor

This adds batched DGEMM to CCSD(T) for the OpenMP offload and OpenACC implementations.
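For context, the general pattern being added is a single strided-batched DGEMM in place of many small DGEMM calls. The sketch below is illustrative only, written in CUDA C against the cuBLAS API; the function name, data packing, and dimensions are assumptions made for the example, not the PR's actual Fortran code.

/* Illustrative sketch of a strided-batched DGEMM: one cuBLAS call covering a
 * whole batch of small matrix products C_i = A_i * B_i. All names and sizes
 * are assumptions for this example, not taken from the NWChem sources. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void batched_dgemm_sketch(const double *dA, const double *dB, double *dC,
                          int m, int n, int k, int batch)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0, beta = 0.0;
    const long long strideA = (long long)m * k;  /* spacing between A_i and A_{i+1} */
    const long long strideB = (long long)k * n;
    const long long strideC = (long long)m * n;

    /* One launch for the whole batch instead of `batch` separate DGEMMs. */
    cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              m, n, k,
                              &alpha, dA, m, strideA,
                                      dB, k, strideB,
                              &beta,  dC, m, strideC,
                              batch);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
}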

Acknowledgements include:
Nawal Copty
Rakesh Krishnaiyer
Abhinav Gaba
Ravi Narayanaswamy
Nitin Gawande

@edoapra
Collaborator

edoapra commented Jul 8, 2024

@omarkahmed could you please rebase your pull request to the current master?

@omarkahmed omarkahmed force-pushed the omarkahmed/ccsd_offload_batched_gemm_rebase branch from c66b913 to c697434 Compare July 9, 2024 00:00
@omarkahmed
Contributor Author

@edoapra, just rebased.

@edoapra
Collaborator

edoapra commented Jul 9, 2024

@omarkahmed This comment is about the previous commit 05f5cb5 you made 8 months ago.

Why did you put options for the C compiler when the ccsd directory does not contain any C source code?
https://github.com/nwchemgit/nwchem/blame/80cff689dbd18943911ee13cc7f4b8a305f52445/src/ccsd/GNUmakefile#L133-L137

@omarkahmed
Contributor Author

@omarkahmed This comment is about the previous commit 05f5cb5 you made 8 months ago.

Why did you put options for the C compiler when the ccsd directory does not contain any C source code? https://github.com/nwchemgit/nwchem/blame/80cff689dbd18943911ee13cc7f4b8a305f52445/src/ccsd/GNUmakefile#L133-L137

Thanks, that's a good point. This was left over from earlier versions, which had some C source. I'm currently sanity-checking that removing this code doesn't impact my unit tests.

@edoapra
Collaborator

edoapra commented Jul 9, 2024

I think that the Fortran options might need some cleanup, too.
For example, I see some MKL related bits. Are those needed at compile time or link time?

@jeffhammond
Collaborator

The "OpenACC" version uses CUDA Fortran and only compiles with NVHPC Fortran. I didn't bother renaming it but it really should just be the "nvhpc" or "Nvidia" version.

If you added batched CUBLAS, I'll take a look, but the dimensions used in CCSD(T) are known to not leverage batching effectively. The current version with async streams splits the batch of GEMMs so adding one batched call is more synchronous. It's unlikely to be better on any NVIDIA GPU, which is the only supported hardware for the code as written.
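To make the trade-off described above concrete, here is a minimal sketch, again in CUDA C with illustrative names only (not the actual OpenACC.F code), of the per-stream pattern: each DGEMM is issued on its own async stream and can be scheduled and overlapped independently, which a single batched call gives up.

/* Sketch of the per-stream approach: independent DGEMMs round-robined over
 * CUDA streams so they can overlap with each other and with other work.
 * Names, stream count, and data packing are assumptions for illustration. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

#define NSTREAMS 8

void streamed_dgemms_sketch(cublasHandle_t handle, cudaStream_t streams[NSTREAMS],
                            const double *dA, const double *dB, double *dC,
                            int m, int n, int k, int batch)
{
    const double alpha = 1.0, beta = 0.0;
    for (int i = 0; i < batch; ++i) {
        /* Bind each DGEMM to a stream; it proceeds independently of the rest. */
        cublasSetStream(handle, streams[i % NSTREAMS]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, dA + (size_t)i * m * k, m,
                            dB + (size_t)i * k * n, k,
                    &beta,  dC + (size_t)i * m * n, m);
    }
}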

@omarkahmed
Contributor Author

I think that the Fortran options might need some cleanup, too. For example, I see some MKL related bits. Are those needed at compile time or link time?

@edoapra, thanks for the suggestion! I just pushed up a patch to separate out the INCLUDES and DEFINES.

The "OpenACC" version uses CUDA Fortran and only compiles with NVHPC Fortran. I didn't bother renaming it but it really should just be the "nvhpc" or "Nvidia" version.

If you added batched CUBLAS, I'll take a look, but the dimensions used in CCSD(T) are known to not leverage batching effectively. The current version with async streams splits the batch of GEMMs so adding one batched call is more synchronous. It's unlikely to be better on any NVIDIA GPU, which is the only supported hardware for the code as written.

@jeffhammond, yes, these are batched cuBLAS calls. Thanks for the additional information and review! I can remove the batched calls if you think that makes more sense.

@jeffhammond
Collaborator

I'm on vacation but I'll test the batched stuff when I get back. If it's optional, it shouldn't do any harm to be there.

@edoapra
Collaborator

edoapra commented Jul 26, 2024

I have just rebased.

@edoapra
Collaborator

edoapra commented Jul 26, 2024

@jeffhammond could you please have a look at the commit I have just made to the ccsd USE_OPENACC bit?
3052284

@edoapra edoapra requested a review from jeffhammond July 29, 2024 16:35
@jeffhammond
Collaborator

Do you have performance numbers for the NVIDIA implementation here? This is a large change, and since I will end up maintaining it, I want to understand what the upside is.

@edoapra
Collaborator

edoapra commented Jul 30, 2024

@jeffhammond what environment variables are you using to link when USE_OPENACC_TRPDRV=1?

I am getting a bunch of CUDA and pgf90/pgi undefined symbols.
I can get the link to work by modifying makefile.h with this change:

diff -u config/makefile.h config/makefile.h.linkcuda 
--- config/makefile.h	2024-07-30 10:23:16.989751519 -0700
+++ config/makefile.h.linkcuda	2024-07-26 11:54:17.992993104 -0700
@@ -3658,7 +3658,7 @@
 
 ifdef NWCHEM_LINK_CUDA
     ifeq ($(_FC),pgf90)
-       CORE_LIBS += -acc -cuda -cudalib=cublas
+       CORE_LIBS += -L/usr/local/cuda/targets/x86_64-linux/lib/ -acc -cuda -cudalib=cublas
     endif
     ifeq ($(_FC),gfortran)
        CORE_LIBS +=  -fopenacc -lcublas

@jeffhammond
Collaborator

This code is 6 times slower than it was before. I have no idea why you are contributing this without doing basic testing to determine that it is valuable to NWChem's developers and users.

Without batching:

 ccsd(t): 100% done, Aggregate Gflops=  274.3     in      285.1 secs
CU+MEM free took     0.24475E-01 seconds
 Time for integral evaluation pass     1       10.38
 Time for triples evaluation pass      1      285.17

With batching:

 ccsd(t): 100% done, Aggregate Gflops=  46.21     in     1692.8 secs
CU+MEM free took     0.24896E-01 seconds
 Time for integral evaluation pass     1       10.22
 Time for triples evaluation pass      1     1692.86

These tests were with 4*H2O with cc-pVTZ on a 4090. The utility of batching goes down as matrix sizes get larger, so 6x is on the low end of the slowdown one would expect from this change.

@edoapra
Collaborator

edoapra commented Jul 30, 2024

@jeffhammond @omarkahmed Should we mark this pull request as draft for the time being?

@jeffhammond
Collaborator

@jeffhammond left a comment


Please revert all changes to the OpenACC.F code. They are not good.

@jeffhammond
Collaborator

Intel should just leave my version alone and contribute theirs on its own.

@omarkahmed
Contributor Author

@jeffhammond, thanks for the feedback. I will revert the OpenACC.F modifications.

@jeffhammond
Collaborator

jeffhammond commented Jul 30, 2024 via email

@edoapra
Collaborator

edoapra commented Jul 30, 2024

export USE_DEBUG=1
export USE_F90_ALLOCATABLE=y
export USE_OPENACC_TRPDRV=y
export NWCHEM_LINK_CUDA=y

My link problem vanished after installing a more recent NVHPC release.

@edoapra edoapra force-pushed the omarkahmed/ccsd_offload_batched_gemm_rebase branch from 1d079a7 to 2dc63c1 Compare July 31, 2024 00:20
@omarkahmed omarkahmed force-pushed the omarkahmed/ccsd_offload_batched_gemm_rebase branch from 2dc63c1 to e7a8e04 Compare August 5, 2024 23:32
@omarkahmed
Contributor Author

Please revert all changes to the OpenACC.F code. They are not good.

@jeffhammond, I reorganized the commits to eliminate the changes to the CUDA version. Please let me know if you have any other requests.

@jeffhammond
Collaborator

Thanks. This is good for me.

@edoapra edoapra merged commit b2bd5c6 into nwchemgit:master Aug 7, 2024
62 checks passed