
GPU support? #32

Open · flatstik opened this issue May 7, 2022 · 12 comments
Labels: enhancement (New feature or request)

Comments

flatstik commented May 7, 2022

Is anyone working on the GPU support from the older files?

godotalgorithm (Collaborator) commented

So far, we have not attempted to reactivate the old GPU support. It was not a priority for the initial release, and our development plans include an eventual overhaul of GPU support in MOPAC that is unlikely to reuse any of the old implementation. However, given that development has been slower than expected, it might be worth considering a temporary reactivation of the old GPU support, particularly as a ground-breaking first step toward supporting GPUs through a CMake-based build system. The old code should still work with at most very minor and localized changes, but handling all of the GPU-related dependencies through CMake is likely to be a tedious exercise that has to be confronted eventually anyway.

godotalgorithm added the enhancement label May 10, 2022
flatstik (Author) commented

Working with large protein models, I would really appreciate having the GPU support reactivated, at least temporarily.

godotalgorithm (Collaborator) commented

The presently deactivated GPU support in MOPAC is unlikely to be helpful for large protein models. It only accelerates conventional, cubic-scaling MOPAC calculations [https://doi.org/10.1021/ct3004645], and it leaves you limited by both GPU and CPU memory rather than just CPU memory. This feature can reduce time-to-solution for moderately sized problems, but it does not ease any memory limitations.

Large protein models should be run with MOZYME rather than conventional MOPAC if at all possible. MOZYME is both faster and more memory efficient than conventional MOPAC for large organic molecules. MOZYME has never had GPU support; it is a purely serial CPU implementation. Rocha's group has published preliminary results on GPU acceleration of MOZYME [https://doi.org/10.1007/s00894-020-04571-6], but that feature has not yet been integrated with or contributed to MOPAC. We have solicited the feature from Prof. Rocha, but development activity in his group is a bit slow right now. Also, I should note that the 1-GPU MOZYME benchmarks presented in Rocha's paper are not actually faster than the 1-core CPU MOZYME solver presently implemented in MOPAC. The "40x" speed-up claimed in the abstract of that paper is relative to the 1-core CPU implementation of their new solver algorithm, not the existing MOZYME solver in MOPAC.

flatstik (Author) commented

So the results published by Jimmy were fake to start with?

godotalgorithm (Collaborator) commented

No, I am pointing out that there are two distinct GPU-related developments in MOPAC, both out of the Rocha group at the Federal University of Paraíba in Brazil, but from different time periods and presently in different states of functionality.

In 2012, they published a paper about GPU acceleration of conventional MOPAC calculations by moving dense matrix operations to the GPU using the cuBLAS and MAGMA libraries. This was integrated into the main version of MOPAC circa 2014. It provides a speed-up to conventional calculations as soon as the faster speed of the GPU overcomes the added cost of moving matrix data to and from the GPU, typically at around a few hundred atoms. However, it is limited by the amount of GPU memory, which is typically smaller than main system memory. This feature was deactivated during the open-source transition of MOPAC for the sake of expediency, and it will either be temporarily reactivated or simply superseded by more general GPU support for dense linear algebra in MOPAC in the future. The benchmark page you are linking to shows older benchmarks of this previously functioning MOPAC feature.

In 2020, they published a paper about GPU acceleration of reduced-scaling MOZYME calculations. However, this was more of a preview of a feature in development rather than an announcement of a fully integrated feature in MOPAC. Rather than port the existing MOZYME algorithm to a GPU, it was a test of a different reduced-scaling algorithm altogether using the same sparse matrix format as the original MOZYME algorithm. The alternate algorithm was easier to port to a GPU and was successfully accelerated with respect to its single-core CPU implementation, but it does not yet match the performance of the original MOZYME algorithm when comparing typical 1-GPU and 1-CPU-core performance.

Whether or not the old GPU support is reactivated, the fastest and most memory efficient way to run a large protein simulation in MOPAC for the foreseeable future will be on 1 CPU core using the MOZYME solver. Given the present state of MOPAC's development, this is a situation where a faster algorithm (MOZYME) on slower hardware (1 CPU core) will usually outperform a slower algorithm (dense linear algebra operations) on faster hardware (a GPU). Obviously, it would be best to implement the fastest algorithm on the fastest hardware, but that will require a substantial amount of development work, and it is unclear where, if anywhere, it fits into our active development priorities.

aizvorski commented

@godotalgorithm What type of primitives are needed for GPU MOZYME, is it mostly sparse matrix multiply? How sparse and how large is it?

Can this be done with either PyTorch https://pytorch.org/docs/stable/sparse.html or JAX https://jax.readthedocs.io/en/latest/jax.experimental.sparse.html ?

Going down to a lower-level API, there's cuSPARSE https://docs.nvidia.com/cuda/cusparse/index.html, but I think a higher-level approach would probably be better.

godotalgorithm (Collaborator) commented

The existing MOZYME solver is based on an approximate sparsity-preserving matrix factorization process (called "pseudodiagonalization"), and the closest standard sparse matrix method would be ILU factorization. The underlying matrix primitive is Givens rotations with truncation of the small matrix elements that are created by the rotations. There are no sparse-matrix/dense-vector operations in MOZYME, and those seem to be the primary target of the GPU libraries that you have linked.
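
To make that primitive concrete, here is a minimal dense sketch of one truncated Givens rotation (the dense storage, pivot choice, and threshold value are illustrative simplifications, not MOZYME's actual sparse bookkeeping):

```python
import numpy as np

def givens_rotate_with_truncation(F, i, j, threshold=1e-9):
    """Apply one symmetric Givens rotation to rows/columns i and j of F,
    annihilating F[i, j], then truncate elements below the threshold.
    Dense illustration only; MOZYME operates on a sparse LMO basis."""
    if abs(F[i, j]) < threshold:
        return F
    # Jacobi-style rotation angle that zeroes the (i, j) element
    theta = 0.5 * np.arctan2(2.0 * F[i, j], F[j, j] - F[i, i])
    c, s = np.cos(theta), np.sin(theta)
    # Rotate rows i and j, then the matching columns via the transposed
    # view (keeps F symmetric)
    for A in (F, F.T):
        Ai, Aj = A[i, :].copy(), A[j, :].copy()
        A[i, :] = c * Ai - s * Aj
        A[j, :] = s * Ai + c * Aj
    # Truncation step: discarding newly created small elements is what
    # preserves sparsity in the actual MOZYME solver
    F[np.abs(F) < threshold] = 0.0
    return F

# Toy usage: one rotation on a small random symmetric matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A = givens_rotate_with_truncation(0.5 * (A + A.T), 1, 4)
```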

This is an unfortunate case of a novel algorithm that isn't well adapted to hardware trends, which is also the case in other parts of MOPAC. There was an optional pseudodiagonalization step in the standard MOPAC solver, which uses fewer operations than full diagonalization. However, it inherently uses level-1 BLAS operations rather than level-3, so in practice these days it is much slower despite the smaller operation count, and I subsequently deactivated it.

On the topic of GPU support, I will soon be reactivating the old cuBLAS/MAGMA code in MOPAC, contingent on CMake behaving itself. I was originally going to rewrite and modernize all of it, but I have other more urgent development priorities and there have been enough requests now for GPU support that I feel obligated to do something.

aizvorski commented Sep 20, 2022

@godotalgorithm Thanks for explaining, that's very interesting. Givens rotations: basically, they're matrix-matrix multiplications with a matrix that has just two nontrivial off-diagonal elements (and then truncation)?

I had a look at cuSPARSE and it only has sparse matrix-dense matrix or vector multiply. cuSPARSELt https://docs.nvidia.com/cuda/cusparselt/index.html may have sparse-sparse?

Even if cuSPARSE/cuSPARSELt doesn't have the right operations, I think sparse matrix-sparse matrix multiplication followed by a separate truncation should work fairly efficiently in PyTorch. It would be more efficient to fuse the multiplication and truncation, but doing it as two separate ops shouldn't be terribly slow. I could try to benchmark it if you give me a couple of example matrices for a typical-sized protein. Other libraries could fuse automatically (possibly pytorch-xla), or would support writing efficient code for a fused multiply-truncate (https://openai.com/blog/triton/) without having to write CUDA code directly.
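
For concreteness, here is a minimal sketch of the unfused two-op version in PyTorch (assuming a PyTorch build where torch.sparse.mm accepts two sparse COO operands; the threshold is just a placeholder):

```python
import torch

def sparse_mm_truncate(a, b, threshold=1e-9):
    """Sparse @ sparse followed by a separate truncation pass.
    a and b are sparse COO tensors; entries of the product with
    magnitude below threshold are dropped from the result."""
    c = torch.sparse.mm(a, b).coalesce()   # unfused sparse-sparse matmul
    idx, val = c.indices(), c.values()
    keep = val.abs() >= threshold          # truncation as a second op
    return torch.sparse_coo_tensor(idx[:, keep], val[keep], c.shape)

# Toy usage on a ~99%-sparse random matrix
n = 1000
dense = torch.randn(n, n)
dense[dense.abs() < 2.5] = 0.0
a = dense.to_sparse()
c = sparse_mm_truncate(a, a)
```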

godotalgorithm (Collaborator) commented

You'll need to give me some time, but I should be able to provide both some matrix examples of various sizes and also a minimal, self-contained single-core reference implementation of pseudodiagonalization. The main idea is quite simple, but there are some bookkeeping complications with its implementation in MOZYME. What I'll need to extract and translate from MOZYME are the Fock matrices for the initial choice of localized molecular orbitals (LMOs). In effect, these are sparse symmetric matrices partitioned into 4 large blocks, and the solver performs a sequence of Givens rotations to eliminate the 2 off-diagonal matrix blocks, which decouples the remaining diagonal matrix blocks.
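
As a rough picture of that block structure, something like the following toy dense sweep (the occupied/virtual split point n_occ, the sweep order, and the convergence test are illustrative assumptions, reusing the truncated-rotation sketch from my earlier comment):

```python
import numpy as np

def pseudodiagonalize_blocks(F, n_occ, threshold=1e-9, sweeps=10):
    """Toy dense proxy: F is symmetric and partitioned as
        [[F_oo, F_ov],
         [F_vo, F_vv]]
    with n_occ occupied LMOs. Jacobi-like sweeps of Givens rotations
    over (occupied, virtual) pairs drive the off-diagonal blocks F_ov
    and F_vo to zero, decoupling the two diagonal blocks."""
    F = F.copy()
    n = F.shape[0]
    for _ in range(sweeps):
        if np.abs(F[:n_occ, n_occ:]).max() < threshold:
            break  # off-diagonal blocks eliminated to tolerance
        for i in range(n_occ):           # occupied index
            for j in range(n_occ, n):    # virtual index
                F = givens_rotate_with_truncation(F, i, j, threshold)
    return F
```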

godotalgorithm (Collaborator) commented

@aizvorski I have prepared a first version of a self-contained proxy for the MOZYME solver. It still needs work, but it is already a reasonable reference for any attempts at GPU acceleration of MOZYME and contains small (Crambin) and mid-size (Chymotrypsin) protein examples.

aizvorski commented Oct 6, 2022

@godotalgorithm Thanks Jonathan! That's amazing work. I read it through and the code is very readable. I'll need a little while to understand the sequence of operations it carries out when it runs on the example inputs, and then I'll try to benchmark some of the possible GPU functions.

Meanwhile: it seems there is a sparse matrix-sparse matrix multiply in cuSPARSE after all, as cusparseSpGEMM() https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spgemm. I'm not sure if a sequence of those would be more efficient than a naive approach, e.g. porting the solver you made to CUDA directly, with each column running on its own CUDA thread.

godotalgorithm (Collaborator) commented

@aizvorski It is unlikely that using something like cusparseSpGEMM will produce good performance for 2-by-2 rotation matrices. I do expect that hand-written CUDA code would be needed to accelerate the algorithm as-is. However, when I come back to this (maybe in a month or so), I plan to understand why MOZYME is maintaining sparsity more effectively than my solver. One explanation is that it may be using smaller-angle, more incremental rotation matrices, and in that scenario, it may be possible to merge multiple 2-by-2 rotation matrices into a more substantial transformation matrix that could more effectively use these operations.
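
If that speculation holds up, batching disjoint rotations might look something like this scipy.sparse sketch (standing in for cusparseSpGEMM; the disjoint-pair batching and interfaces here are hypothetical):

```python
import numpy as np
from scipy import sparse

def merged_rotation(n, pairs, angles):
    """Pack several Givens rotations acting on disjoint (i, j) pairs into
    one sparse orthogonal matrix G, so a whole batch can be applied as
    G.T @ F @ G, i.e. two sparse-sparse products instead of many 2-by-2
    row/column updates. Pairs must not share indices, so the rotations
    commute and can be merged exactly."""
    G = sparse.identity(n, format="lil")
    for (i, j), theta in zip(pairs, angles):
        c, s = np.cos(theta), np.sin(theta)
        G[i, i], G[j, j] = c, c
        G[i, j], G[j, i] = -s, s
    return G.tocsr()

# Toy usage: apply a batch of two rotations with sparse products
n = 8
F = sparse.random(n, n, density=0.3, format="csr")
F = F + F.T                                   # symmetric stand-in
G = merged_rotation(n, [(0, 4), (1, 5)], [0.1, -0.2])
F_rot = (G.T @ F @ G).tocsr()
```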

On the other hand, I would expect that the computation of the transformed Fock matrix (project_fock) would translate well to cusparseSpGEMM. My implementation is forming the transformed matrix one column at a time with mat-vec operations, but this is more naturally a sequence of two sparse matrix-matrix products.
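
Roughly, the reshaping I mean, with U standing in for a hypothetical sparse transformation matrix (a sketch of the data flow, not project_fock's actual interface):

```python
from scipy import sparse

n = 500
F = sparse.random(n, n, density=0.02, format="csr")
F = F + F.T                                   # sparse symmetric Fock stand-in
U = sparse.random(n, n, density=0.02, format="csr")  # hypothetical transformation

# Current pattern: form the transformed matrix one column at a time
# with sparse mat-vec operations
cols = [U.T @ (F @ U[:, [k]]) for k in range(n)]
F_colwise = sparse.hstack(cols).tocsr()

# Same result expressed as two sparse matrix-matrix products,
# which maps naturally onto cusparseSpGEMM
F_spgemm = (U.T @ F @ U).tocsr()
```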
