Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segfaults in ltinv_ctlad_mod with GCC 10.2 and 10.1 (related to OpenMP) #20

Closed
shlyaeva opened this issue Dec 28, 2022 · 9 comments · Fixed by #22
Closed

segfaults in ltinv_ctlad_mod with GCC 10.2 and 10.1 (related to OpenMP) #20

shlyaeva opened this issue Dec 28, 2022 · 9 comments · Fixed by #22

Comments

@shlyaeva
Copy link
Contributor

We see segfaults in calls to ltinv_ctlad when the code is built with GCC 10.2 or GCC 10.1. ectrans_test_adjoint (or any code calling ltinv_ctlad) fails with a segfault with a following backtrace:

2: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
2:
2: Backtrace for this error:
2: #0  0x7f92b929e33f in ???
2: #1  0x7f92bb0a887e in __ltinv_ctlad_mod_MOD_ltinv_ctlad._omp_fn.0
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/src/trans/internal/ltinv_ctlad_mod.F90:107
2: #2  0x7f92bb52ae21 in GOMP_parallel
2: 	at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/parallel.c:171
2: #3  0x7f92bb0a9752 in __ltinv_ctlad_mod_MOD_ltinv_ctlad
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/src/trans/internal/ltinv_ctlad_mod.F90:107
2: #4  0x7f92bb09c25c in __inv_trans_ctlad_mod_MOD_inv_trans_ctlad
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/src/trans/internal/inv_trans_ctlad_mod.F90:289
2: #5  0x7f92bb101425 in inv_transad_
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/src/trans/external/inv_transad.F90:610
2: #6  0x4043de in test_adjoint
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/tests/trans/test_adjoint.F90:212
2: #7  0x4015ce in main
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/tests/trans/test_adjoint.F90:11
1/1 Test #2: ectrans_test_adjoint .............***Exception: SegFault  0.19 sec

The problem appears to be in how compiler deals with this OpenMP-parallelized loop:

!$OMP PARALLEL DO SCHEDULE(DYNAMIC,1) PRIVATE(JM,IM)
DO JM=1,D%NUMP
IM = D%MYMS(JM)
CALL LTINVAD(IM,JM,KF_OUT_LT,KF_UV,KF_SCALARS,KF_SCDERS,ILEI2,IDIM1,&
& PSPVOR,PSPDIV,PSPSCALAR,&
& PSPSC3A,PSPSC3B,PSPSC2 , &
& KFLDPTRUV,KFLDPTRSC,FSPGL_PROC)
ENDDO
!$OMP END PARALLEL DO

If the OpenMP directives are commented/removed, the calling code doesn't have any issues, the test succeeds. Trying to set explicitly PRIVATE or SHARED directives for all the variables in the loop doesn't solve the issue.

The same behavior is observed on a different system with GCC 10.1 (Maryam @mer-a-o ran the tests).

Any ideas on how this can be resolved would be greatly appreciated.

Log of all ectrans tests run on the system with GCC 10.2: LastTest.log

@wdeconinck
Copy link
Collaborator

Thank you, we will investigate.

@wdeconinck
Copy link
Collaborator

I tried to reproduce with GCC 10.3 and GCC 11.2 but interestingly could not reproduce. Do you also see this?
Unfortunately I don't have access to GCC 10.1 or 10.2.

Perhaps the OpenMP pragma could be pre-processor guarded to be removed when GCC 10.1 or 10.2 is detected.
It should be understood that this code path will not utilise any of the available OpenMP resources then though.

@shlyaeva
Copy link
Contributor Author

@wdeconinck: same here: we don't see this issue with GCC10.3 (@xian22 tested yesterday), only 10.1 and 10.2.

@climbfuji
Copy link

We could use something like this in the code, given that it seems to point at a bug in gcc 10.1 and 10.2:

! Bug in gcc 10.1 and 10.2, see https://github.com/ecmwf-ifs/ectrans/issues/20
#if !(__GNUC__ == 10 && (__GNUC_MINOR__ == 1 || __GNUC_MINOR__ == 2))
!$OMP PARALLEL DO SCHEDULE(DYNAMIC,1) PRIVATE(JM,IM) 
#endif
   DO JM=1,D%NUMP 
     IM = D%MYMS(JM) 
     CALL LTINVAD(IM,JM,KF_OUT_LT,KF_UV,KF_SCALARS,KF_SCDERS,ILEI2,IDIM1,& 
      & PSPVOR,PSPDIV,PSPSCALAR,& 
      & PSPSC3A,PSPSC3B,PSPSC2 , & 
      & KFLDPTRUV,KFLDPTRSC,FSPGL_PROC) 
   ENDDO 
#if !(__GNUC__ == 10 && (__GNUC_MINOR__ == 1 || __GNUC_MINOR__ == 2))
!$OMP END PARALLEL DO 
#endif

Or, for simplicity, if we think that slowing down this one particular loop with all gcc 10.x versions won't be a show-stopper:

! Bug in gcc 10.1 and 10.2, see https://github.com/ecmwf-ifs/ectrans/issues/20
#if !(__GNUC__ == 10)
!$OMP PARALLEL DO SCHEDULE(DYNAMIC,1) PRIVATE(JM,IM) 
#endif
   DO JM=1,D%NUMP 
     IM = D%MYMS(JM) 
     CALL LTINVAD(IM,JM,KF_OUT_LT,KF_UV,KF_SCALARS,KF_SCDERS,ILEI2,IDIM1,& 
      & PSPVOR,PSPDIV,PSPSCALAR,& 
      & PSPSC3A,PSPSC3B,PSPSC2 , & 
      & KFLDPTRUV,KFLDPTRSC,FSPGL_PROC) 
   ENDDO 
#if !(__GNUC__ == 10)
!$OMP END PARALLEL DO 
#endif

@wdeconinck
Copy link
Collaborator

I think the first approach is preferred. Then at least the GNU 10.3 version which we have is not affected.
For all we know the bug is also present in 10.0 (if that is around somewhere).
How about:

#if !(defined(__GFORTRAN__) && __GNUC__ == 10 && __GNUC_MINOR__ <= 2)

I would probably use __GFORTRAN__ as well just to make sure that other compilers are not using by coincidence the GNU preprocessor.
I'd be grateful with a PR with this suggestion. Thanks!

@climbfuji
Copy link

Sounds good @shlyaeva and/or I will try this, and if it solves the problem we'll issue the PR. Thanks @wdeconinck !

@shlyaeva
Copy link
Contributor Author

I have tested the fix with GNU10.2 and it works. Thank you all very much!
I will issue a PR with a fix Thursday US morning/Europe afternoon.

@climbfuji
Copy link

Is it possible to create a new release with this bug fix that we can point to?

@wdeconinck
Copy link
Collaborator

Is it possible to create a new release with this bug fix that we can point to?

Sure, it is tagged with 1.2.0 now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants