Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Update to gcc/gfortran 9 #28

Merged
merged 20 commits into from
Jul 5, 2019
Merged

CI: Update to gcc/gfortran 9 #28

merged 20 commits into from
Jul 5, 2019

Conversation

letmaik
Copy link

@letmaik letmaik commented Jun 10, 2019

In addition, on macOS, this switches from a versioned gcc, e.g. gcc@8 to the default gcc. There is not much point pinning gcc as bottled dependencies like MPI and netcdf-fortran would still have been built with the default gcc (and not the pinned gcc) and this is typically a problem when using Fortran modules to depend on them as the compiler versions need to match. Our documentation also doesn't use pinned gcc versions, so it makes sense to test with the default (which currently is gcc 9).

Note: The same issue with pinning exists on MSYS2 as well, in an even more extreme form, as older gcc versions typically can't be easily installed (for the record: I tried downgrading gcc incl. gcc-libs and then cmake wouldn't even run anymore, and probably other stuff was broken as well).

Note 2: We still keep a few Travis CI build matrix configurations that build with the default (=old) compiler versions on Ubuntu 14.04, just for the sake of detecting build failures on older compilers.

@letmaik letmaik added this to the 4.1.0 milestone Jun 10, 2019
@letmaik
Copy link
Author

letmaik commented Jun 11, 2019

There are two issues in CI currently:

1. Divide by zero / surprising evaluation order in conjuction

There's an issue with gfortran 9 in Debug mode when runtime checks are enabled. This happens on all platforms and all parallelization modes, and we get some sensible stack trace on Linux only. This happens for case 2 in wats, which is the first case that uses noahmp as sf_surface_physics.

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f8baa15769d in ???
#1  0x7f8baa1568b5 in ???
#2  0x7f8ba98824af in ???
#3  0x2094149 in __module_first_rk_step_part1_MOD_first_rk_step_part1
	at /home/vsts/work/1/s/dyn_em/module_first_rk_step_part1.F:441
#4  0x1dcdb19 in solve_em_
	at /home/vsts/work/1/s/dyn_em/solve_em.F:756
#5  0x1b8f3cd in solve_interface_
	at /home/vsts/work/1/s/share/solve_interface.F:56
#6  0x19b9d1c in __module_integrate_MOD_integrate
	at /home/vsts/work/1/s/frame/module_integrate.F:392
#7  0x405c7a in __module_wrf_top_MOD_wrf_run
	at /home/vsts/work/1/s/main/module_wrf_top.F:2072
#8  0x404999 in wrf
	at /home/vsts/work/1/s/main/wrf.F:44
#9  0x4049f0 in main
	at /home/vsts/work/1/s/main/wrf.F:6

The offending line is:
https://github.com/WRF-CMake/WRF/blob/ee69e71a49f46bf043d7da59c08597db790a2a96/dyn_em/module_first_rk_step_part1.F#L441

The only way this line can fail is when grid%STEPWTD is 0, as this would be a division by zero. The funny thing is that opt_run is not set in the case config, which means it uses the default of 3. Does that mean gfortran 9 suddenly evaluates the right part of the conjuction even if the left part is false? That would be surprising at least.

Another instance of the issue is at: https://github.com/WRF-CMake/WRF/blob/a500a27c14967380ac410e524878953c6875afaa/phys/module_surface_driver.F#L3106

EDIT: Resolved. See #28 (comment).

2. libgomp linking - unresolved symbols

Only on Linux and Make (not CMake!) with OpenMP, there are linker errors during the build where it doesn't find two OpenMP library symbols:

wrf_io.f:(.text+0x5e6): undefined reference to `GOMP_loop_maybe_nonmonotonic_runtime_start'
wrf_io.f:(.text+0x639): undefined reference to `GOMP_loop_maybe_nonmonotonic_runtime_next'

Curiously, this doesn't happen on Travis CI: https://travis-ci.com/WRF-CMake/WRF/jobs/207338038. I also couldn't reproduce it in fresh 16.04 and 18.04 Docker containers.
Note: By looking at https://github.com/gcc-mirror/gcc/blob/gcc-9_1_0-release/gcc/omp-builtins.def and comparing with 8.x it becomes clear that the undefined symbol was introduced in gcc 9. Somehow the compilation works fine, but then at link-time an old libgomp library is pulled in that lacks the symbol, which is extremely weird as this happens automatically with the -fopenmp flag.

As an aside, I noticed that we forgot to define _OPENMP in CMake builds if OpenMP is requested. I added this in here even though it doesn't strictly belong to the PR. This is automatically added by all OpenMP compliant compilers if OpenMP is enabled.

@dmey
Copy link

dmey commented Jun 11, 2019

Do we need to have this in 4.1.0? Otherwise, I would suggest keeping it as it is until we finish with JOSS in 4.1.0. There may be things that we want to point out directly with UCAR and I think it'd be easier to do once the paper is finalised.

@letmaik
Copy link
Author

letmaik commented Jun 11, 2019

The problem is Windows with MSYS2. We can't define a compiler version there and that's why we saw those CI failures already before this PR, since MSYS2 switched to gcc/gfortran 9. We have a choice on macOS and Linux, although for macOS to do it properly we would have to build netcdf-fortran and open-mpi from source instead of using the brew bottles, as otherwise it only works by accident since you normally can't mix gfortran compiler versions when depending on Fortran modules.

I don't like this situation myself as it's a ticking bomb but I don't see a good solution apart from staying up to date.

@zbeekman
Copy link

Note that this doesn't happen with CMake.

Is that because CMake is passing different floating point exception flags? -ffpe-summary=all may get more info.

@letmaik
Copy link
Author

letmaik commented Jun 12, 2019

@zbeekman You misunderstood. The sentence you referred to was just about the OpenMP linking issue. The other issue related to SIGFPE happened not just with CMake but also in the existing Makefile based build system. I've rewritten the comment slightly to make it more clear.

@letmaik
Copy link
Author

letmaik commented Jun 13, 2019

@zbeekman I managed to fix the FPE issue but I'd like to get your input on it as it really confuses me (EDIT: nevermind, see the other EDITs below). See 6b79e6b and c5e9530, plus my extended commentary on the post above. This code hasn't been touched for years and suddenly fails with the new gfortran version. It seems like the compiler generates code that evaluates the right part of the conjunction even if the left part would already be enough. With optimizations enabled, this doesn't seem to happen (probably because then it figures out it's enough to look at the left part first and only then try the right part), but with -O0 (default) it does happen. Isn't this behaviour defined in the language itself? Could it be a bug in gfortran 9?

EDIT: I found https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/282534 which is the same issue. Looks like Fortran doesn't make guarantees here (for short-circuiting), and the relevant WRF code probably only worked by luck so far.

EDIT2: https://www.scivision.dev/fortran-short-circuit-logic/ This is consistent with my findings for gfortran but it surprises me that Intel also doesn't short-circuit and that would mean no one ever ran a case with opt_run != 5 on Intel.

EDIT3: Reported upstream: wrf-model#930

@letmaik
Copy link
Author

letmaik commented Jul 5, 2019

Issue 2 (libgomp linking errors) is fixed now as well. I'm not sure what it was but I switched to container builds now which give us a clean starting point (using ubuntu:16.04), compared to the regular CI VMs which have all kinds of stuff pre-installed. It also provides more flexibility in case we want to test different images (e.g. distributions) in the future.

@letmaik letmaik merged commit d7a361c into wrf-cmake Jul 5, 2019
@letmaik letmaik deleted the letmaik/gcc-9 branch July 5, 2019 07:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants