Exception output from many threads is not helpful #1780

Closed
chengchingwen opened this issue Feb 26, 2023 · 8 comments · Fixed by #2342
Labels
cuda kernels Stuff about writing CUDA kernels. good first issue Good for newcomers

Comments

@chengchingwen

Describe the bug

This crashes Julia if the array is large. It happened on both 1.8.5 and 1.9-beta4.

To reproduce

The Minimal Working Example (MWE) for this bug:

julia> using CUDA

julia> f(x) = (Int32(1), x)
f (generic function with 1 method)

julia> g(a, b) = (a[1] + b[1], b[2] * a[1] + b[1] / a[1])
g (generic function with 1 method)

julia> mapreduce(f, g, CUDA.randn(10, 10, 10); dims=1, init=(one(Int32), zero(Float32)))
ERROR: a exception was thrown during kernel execution.
Stacktrace:
 [1] CuDynamicSharedArray at /home/peter/.julia/packages/CUDA/ZdCxS/src/device/intrinsics/memory_shared.jl:52
 [2] CuDynamicSharedArray at /home/peter/.julia/packages/CUDA/ZdCxS/src/device/intrinsics/memory_shared.jl:61
 [3] reduce_block at /home/peter/.julia/packages/CUDA/ZdCxS/src/mapreduce.jl:57
 [4] partial_mapreduce_grid at /home/peter/.julia/packages/CUDA/ZdCxS/src/mapreduce.jl:126
Manifest.toml

Paste your Manifest.toml here, or accurately describe which version of CUDA.jl and its dependencies (GPUArrays.jl, GPUCompiler.jl, LLVM.jl) you are using.
  [052768ef] CUDA v4.0.1
  [1af6417a] CUDA_Runtime_Discovery v0.1.1
  [0c68f7d7] GPUArrays v8.6.3
  [46192b85] GPUArraysCore v0.1.4
  [61eb1bfa] GPUCompiler v0.17.2
  [929cbde3] LLVM v4.16.0
⌅ [4ee394cb] CUDA_Driver_jll v0.2.0+0
⌅ [76a88914] CUDA_Runtime_jll v0.2.3+2
  [62b44479] CUDNN_jll v8.6.0+3

Version info

Details on Julia:

# please post the output of:
versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787f (2023-02-07 21:53 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 4 on 12 virtual cores

Details on CUDA:

# please post the output of:
CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.6
NVIDIA driver 510.73.5

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+510.73.5

Toolchain:
- Julia: 1.9.0-beta4
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

@chengchingwen chengchingwen added the bug Something isn't working label Feb 26, 2023
@maleadt
Member

maleadt commented Mar 15, 2023

I don't see a segfault here, just the expected exception reporting.

@maleadt maleadt added the needs information Further information is requested label Mar 15, 2023
@chengchingwen
Author

The segfault happens if you replace CUDA.randn(10, 10, 10) with a larger array such as CUDA.randn(512, 128, 16).

@maleadt
Member

maleadt commented Mar 15, 2023

In that case you just get a lot of output on your terminal. I assume you hit CTRL-C, which might have killed Julia or CUDA then.

In any case, there isn't much we can do about this, as I/O is currently handled by CUDA. Maybe we could limit output by keeping track of written bytes and capping it, but that doesn't sound very satisfying.
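
The byte-capping idea mentioned above could be sketched roughly as follows. This is a host-side illustration of the bookkeeping only, not a description of how CUDA handles device output; `capped_print`, `written`, and the 32-byte limit are hypothetical names and values chosen for the example.

```julia
# Hypothetical host-side sketch: cap total output with a shared atomic byte counter.
# It does not touch CUDA's device-side I/O; it only illustrates the bookkeeping.

# Print `msg` to `io` only while the running byte total stays within `limit`.
# Returns true if the message was written, false if it was suppressed.
function capped_print(io::IO, written::Threads.Atomic{Int}, limit::Int, msg::String)
    n = sizeof(msg)
    # atomic_add! returns the value held before the addition
    old = Threads.atomic_add!(written, n)
    if old + n <= limit
        print(io, msg)
        return true
    end
    # Suppressed messages still advance the counter, so output stays capped
    return false
end

written = Threads.Atomic{Int}(0)
buf = IOBuffer()
results = [capped_print(buf, written, 32, "exception in thread $i\n") for i in 1:4]
output = String(take!(buf))
```

With a cap like this, only the first report that fits is kept and later ones are dropped, which is presumably why it "doesn't sound very satisfying": the surviving message is whichever thread happened to write first.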

@maleadt maleadt added cuda kernels Stuff about writing CUDA kernels. and removed needs information Further information is requested labels Mar 15, 2023
@chengchingwen
Author

chengchingwen commented Mar 15, 2023

I didn't hit CTRL-C; I waited for the output to stop, and it still resulted in a segfault. Another issue is that this error is not always captured. On the machine (I reported above) the error is not shown unless I start julia with -g2.

@maleadt maleadt changed the title from "mismatched return type in mapreduce kernel cause segmentation fault" to "Exception output from many threads is not helpful" Mar 15, 2023
@maleadt
Member

maleadt commented Mar 15, 2023

Another issue is that this error is not always captured. On the machine (I reported above) the error is not shown unless I start julia with -g2.

That's intentional. If you run without -g2 you get:

ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.

@maleadt maleadt added good first issue Good for newcomers and removed bug Something isn't working labels Mar 15, 2023
@chengchingwen
Author

That's intentional.

Oh, I thought it would give a compile error, but it seems to run successfully and generate the correct result on that machine.

@maleadt
Member

maleadt commented Mar 15, 2023

Oh, I thought it would give a compile error, but it seems to run successfully and generate the correct result on that machine.

We can't generate a compile error because of Julia's dynamic semantics. You should still see a run-time exception though, albeit without a stack trace (you need -g2 for that).
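
The type mismatch named in this issue's original title can be inspected on the CPU with the MWE's own `f` and `g`. This is only a sketch, assuming the mismatch refers to the reduction result differing in type from `init`; it says nothing about what CUDA.jl does internally. The name `combined` is introduced here for illustration.

```julia
# CPU-only check of the element types involved; no GPU required.
f(x) = (Int32(1), x)
g(a, b) = (a[1] + b[1], b[2] * a[1] + b[1] / a[1])

init = (one(Int32), zero(Float32))

# One reduction step: combine `init` with a mapped Float32 element
combined = g(init, f(1.0f0))

typeof(init)      # Tuple{Int32, Float32}
typeof(combined)  # Tuple{Int32, Float64}, because Int32 / Int32 yields Float64
```

Since the result type only emerges when `g` is actually evaluated, a check like this is something the user runs by hand rather than something the compiler is required to reject.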

@chengchingwen
Author

You should still see a run-time exception though

I didn't get any exception on that machine, and on another machine it failed randomly.
