Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use unified memory in CUDA debug builds #621

Merged
merged 2 commits into from
Aug 20, 2020
Merged

Use unified memory in CUDA debug builds #621

merged 2 commits into from
Aug 20, 2020

Conversation

upsj
Copy link
Member

@upsj upsj commented Aug 18, 2020

This PR uses unified memory instead of device memory in CUDA debug builds, allowing you to directly access device memory in cuda-gdb.

It additionally fixes a bug in multi-GPU raw_copy_to, which used the wrong device ordinals due to a wrongly named parameter (src -> dest)

@upsj upsj added is:bug Something looks wrong. is:help-wanted Need ideas on how to solve this. mod:cuda This is related to the CUDA module. 1:ST:ready-for-review This PR is ready for review labels Aug 18, 2020
@upsj upsj self-assigned this Aug 18, 2020
@upsj upsj changed the title Use unified memory on CUDA debug builds Fix GPU raw_copy_to between devices Aug 18, 2020
@codecov
Copy link

codecov bot commented Aug 19, 2020

Codecov Report

Merging #621 into develop will decrease coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #621      +/-   ##
===========================================
- Coverage    92.86%   92.86%   -0.01%     
===========================================
  Files          303      303              
  Lines        21115    21115              
===========================================
- Hits         19609    19608       -1     
- Misses        1506     1507       +1     
Impacted Files Coverage Δ
core/base/extended_float.hpp 91.26% <0.00%> (-0.98%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6dd2eb4...15faee3. Read the comment docs.

@upsj upsj changed the title Fix GPU raw_copy_to between devices Use unified memory in CUDA debug builds Aug 19, 2020
@pratikvn
Copy link
Member

I definitely like the idea of using UVM in debug builds, but I am not so sure if we should be implementing two different implementations for Release and Debug. To me debug builds are meant to have the same implementation but maybe with more possible checks and information, but in this case the implementation (std::memcpy vs CudaMemcpyPeer) is different which kind of violates that principle.

Nice job finding that bug for the multi-GPU raw_copy_to. I guess that was the issue with my CudaUVM memspace invalid device ordinal errors.

Also you dont seem to be doing this for HIP device -->HIP device raw_copy_to.

@upsj
Copy link
Member Author

upsj commented Aug 19, 2020

That's a good point, it might also be that you can just use cudaMemcpy as usual with UVM, I will check.

@upsj upsj force-pushed the executor_debug_uvm branch 2 times, most recently from 2709bba to 50a6db6 Compare August 19, 2020 09:34
@upsj upsj removed the is:help-wanted Need ideas on how to solve this. label Aug 19, 2020
Copy link
Member

@pratikvn pratikvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

dev_tools/scripts/gdb-ginkgo.py Outdated Show resolved Hide resolved
also fix the incorrect peer memcpy device ID order
Copy link
Member

@thoasm thoasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@upsj upsj added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Aug 20, 2020
@sonarcloud
Copy link

sonarcloud bot commented Aug 20, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 0 Security Hotspots to review)
Code Smell A 0 Code Smells

0.0% 0.0% Coverage
0.0% 0.0% Duplication

warning The version of Java (1.8.0_121) you have used to run this analysis is deprecated and we will stop accepting it from October 2020. Please update to at least Java 11.
Read more here

@upsj upsj merged commit 4a45c3f into develop Aug 20, 2020
@upsj upsj deleted the executor_debug_uvm branch August 20, 2020 21:02
tcojean added a commit that referenced this pull request Aug 26, 2020
Release 1.3.0 of Ginkgo.

The Ginkgo team is proud to announce the new minor release of Ginkgo version
1.3.0. This release brings CUDA 11 support, changes the default C++ standard to
be C++14 instead of C++11, adds a new Diagonal matrix format and capacity for
diagonal extraction, significantly improves the CMake configuration output
format, adds the Ginkgo paper which got accepted into the Journal of Open Source
Software (JOSS), and fixes multiple issues.

Supported systems and requirements:
+ For all platforms, cmake 3.9+
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2017+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 2.8+
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2017 15.7+
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


The current known issues can be found in the [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues).


Additions:
+ Add paper for Journal of Open Source Software (JOSS). [#479](#479)
+ Add a DiagonalExtractable interface. [#563](#563)
+ Add a new diagonal Matrix Format. [#580](#580)
+ Add Cuda11 support. [#603](#603)
+ Add information output after CMake configuration. [#610](#610)
+ Add a new preconditioner export example. [#595](#595)
+ Add a new cuda-memcheck CI job. [#592](#592)

Changes:
+ Use unified memory in CUDA debug builds. [#621](#621)
+ Improve `BENCHMARKING.md` with more detailed info. [#619](#619)
+ Use C++14 standard instead of C++11. [#611](#611)
+ Update the Ampere sm information and CudaArchitectureSelector. [#588](#588)

Fixes:
+ Fix documentation warnings and errors. [#624](#624)
+ Fix warnings for diagonal matrix format. [#622](#622)
+ Fix criterion factory parameters in CUDA. [#586](#586)
+ Fix the norm-type in the examples. [#612](#612)
+ Fix the WAW race in OpenMP is_sorted_by_column_index. [#617](#617)
+ Fix the example's exec_map by creating the executor only if requested. [#602](#602)
+ Fix some CMake warnings. [#614](#614)
+ Fix Windows building documentation. [#601](#601)
+ Warn when CXX and CUDA host compiler do not match. [#607](#607)
+ Fix reduce_add, prefix_sum, and doc-build. [#593](#593)
+ Fix find_library(cublas) issue on machines installing multiple cuda. [#591](#591)
+ Fix allocator in sellp read. [#589](#589)
+ Fix the CAS with HIP and NVIDIA backends. [#585](#585)

Deletions:
+ Remove unused preconditioner parameter in LowerTrs. [#587](#587)

Related PR: #625
tcojean added a commit that referenced this pull request Aug 27, 2020
The Ginkgo team is proud to announce the new minor release of Ginkgo version
1.3.0. This release brings CUDA 11 support, changes the default C++ standard to
be C++14 instead of C++11, adds a new Diagonal matrix format and capacity for
diagonal extraction, significantly improves the CMake configuration output
format, adds the Ginkgo paper which got accepted into the Journal of Open Source
Software (JOSS), and fixes multiple issues.

Supported systems and requirements:
+ For all platforms, cmake 3.9+
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2017+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 2.8+
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2017 15.7+
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


The current known issues can be found in the [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues).


Additions:
+ Add paper for Journal of Open Source Software (JOSS). [#479](#479)
+ Add a DiagonalExtractable interface. [#563](#563)
+ Add a new diagonal Matrix Format. [#580](#580)
+ Add Cuda11 support. [#603](#603)
+ Add information output after CMake configuration. [#610](#610)
+ Add a new preconditioner export example. [#595](#595)
+ Add a new cuda-memcheck CI job. [#592](#592)

Changes:
+ Use unified memory in CUDA debug builds. [#621](#621)
+ Improve `BENCHMARKING.md` with more detailed info. [#619](#619)
+ Use C++14 standard instead of C++11. [#611](#611)
+ Update the Ampere sm information and CudaArchitectureSelector. [#588](#588)

Fixes:
+ Fix documentation warnings and errors. [#624](#624)
+ Fix warnings for diagonal matrix format. [#622](#622)
+ Fix criterion factory parameters in CUDA. [#586](#586)
+ Fix the norm-type in the examples. [#612](#612)
+ Fix the WAW race in OpenMP is_sorted_by_column_index. [#617](#617)
+ Fix the example's exec_map by creating the executor only if requested. [#602](#602)
+ Fix some CMake warnings. [#614](#614)
+ Fix Windows building documentation. [#601](#601)
+ Warn when CXX and CUDA host compiler do not match. [#607](#607)
+ Fix reduce_add, prefix_sum, and doc-build. [#593](#593)
+ Fix find_library(cublas) issue on machines installing multiple cuda. [#591](#591)
+ Fix allocator in sellp read. [#589](#589)
+ Fix the CAS with HIP and NVIDIA backends. [#585](#585)

Deletions:
+ Remove unused preconditioner parameter in LowerTrs. [#587](#587)

Related PR: #627
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1:ST:ready-to-merge This PR is ready to merge. is:bug Something looks wrong. mod:cuda This is related to the CUDA module.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants