-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: Update to gcc/gfortran 9 #28
Conversation
There are two issues in CI currently: 1. Divide by zero / surprising evaluation order in conjuctionThere's an issue with gfortran 9 in Debug mode when runtime checks are enabled. This happens on all platforms and all parallelization modes, and we get some sensible stack trace on Linux only. This happens for case 2 in wats, which is the first case that uses noahmp as sf_surface_physics.
The offending line is: The only way this line can fail is when Another instance of the issue is at: https://github.com/WRF-CMake/WRF/blob/a500a27c14967380ac410e524878953c6875afaa/phys/module_surface_driver.F#L3106 EDIT: Resolved. See #28 (comment). 2. libgomp linking - unresolved symbolsOnly on Linux and Make (not CMake!) with OpenMP, there are linker errors during the build where it doesn't find two OpenMP library symbols:
Curiously, this doesn't happen on Travis CI: https://travis-ci.com/WRF-CMake/WRF/jobs/207338038. I also couldn't reproduce it in fresh 16.04 and 18.04 Docker containers.
|
Do we need to have this in 4.1.0? Otherwise, I would suggest keeping it as it is until we finish with JOSS in 4.1.0. There may be things that we want to point out directly with UCAR and I think it'd be easier to do once the paper is finalised. |
The problem is Windows with MSYS2. We can't define a compiler version there and that's why we saw those CI failures already before this PR, since MSYS2 switched to gcc/gfortran 9. We have a choice on macOS and Linux, although for macOS to do it properly we would have to build netcdf-fortran and open-mpi from source instead of using the brew bottles, as otherwise it only works by accident since you normally can't mix gfortran compiler versions when depending on Fortran modules. I don't like this situation myself as it's a ticking bomb but I don't see a good solution apart from staying up to date. |
…sm to exercise both MPI and OpenMP
Is that because CMake is passing different floating point exception flags? |
@zbeekman You misunderstood. The sentence you referred to was just about the OpenMP linking issue. The other issue related to SIGFPE happened not just with CMake but also in the existing Makefile based build system. I've rewritten the comment slightly to make it more clear. |
@zbeekman I managed to fix the FPE issue but I'd like to get your input on it as it really confuses me (EDIT: nevermind, see the other EDITs below). See 6b79e6b and c5e9530, plus my extended commentary on the post above. This code hasn't been touched for years and suddenly fails with the new gfortran version. It seems like the compiler generates code that evaluates the right part of the conjunction even if the left part would already be enough. With optimizations enabled, this doesn't seem to happen (probably because then it figures out it's enough to look at the left part first and only then try the right part), but with -O0 (default) it does happen. Isn't this behaviour defined in the language itself? Could it be a bug in gfortran 9? EDIT: I found https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/282534 which is the same issue. Looks like Fortran doesn't make guarantees here (for short-circuiting), and the relevant WRF code probably only worked by luck so far. EDIT2: https://www.scivision.dev/fortran-short-circuit-logic/ This is consistent with my findings for gfortran but it surprises me that Intel also doesn't short-circuit and that would mean no one ever ran a case with opt_run != 5 on Intel. EDIT3: Reported upstream: wrf-model#930 |
allows easier use in docker containers
cleaner environment than host vm
missing in docker
avoids installing git inside container jobs
Issue 2 (libgomp linking errors) is fixed now as well. I'm not sure what it was but I switched to container builds now which give us a clean starting point (using ubuntu:16.04), compared to the regular CI VMs which have all kinds of stuff pre-installed. It also provides more flexibility in case we want to test different images (e.g. distributions) in the future. |
In addition, on macOS, this switches from a versioned gcc, e.g.
gcc@8
to the default gcc. There is not much point pinning gcc as bottled dependencies like MPI and netcdf-fortran would still have been built with the default gcc (and not the pinned gcc) and this is typically a problem when using Fortran modules to depend on them as the compiler versions need to match. Our documentation also doesn't use pinned gcc versions, so it makes sense to test with the default (which currently is gcc 9).Note: The same issue with pinning exists on MSYS2 as well, in an even more extreme form, as older gcc versions typically can't be easily installed (for the record: I tried downgrading gcc incl. gcc-libs and then cmake wouldn't even run anymore, and probably other stuff was broken as well).
Note 2: We still keep a few Travis CI build matrix configurations that build with the default (=old) compiler versions on Ubuntu 14.04, just for the sake of detecting build failures on older compilers.