Bad wB97X-D3BJ energies with MKL & in-core DFT #2279
How do you know the correct energy? Have you run the calculation with a completely independent implementation? I think there might also be a difference in libxc between Psi4 1.3.2 and 1.4.0: range-separated hybrids were essentially rewritten in libxc 5.1.0 and should now be correct.

Addendum: of course, if the calculations match in small basis sets, it can't be a libxc issue.
With wB97X-D3BJ/6-31G* on the same system, the non-MKL install gives -1964.4305 Hartree, while the MKL version blows up, failing to converge within 100 SCF iterations (the non-converged energies are around 66325650 Hartree). So you don't need a big basis to observe this instability, which is good for testing. But this is more evidence that the MKL install has a problem.
If the argument to psi4.set_memory() is reduced to 2 GB (forcing the disk algorithm), the MKL install gives a wB97X-D3BJ/6-31G* energy of -1964.4297 Hartree, which is reasonable. So I continue to believe that the in-core algorithm is implicated, or at least magnifying an existing problem.
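The workarounds reported in this thread (reducing the memory budget below the in-core threshold, and running single-threaded) can be collected into a short Psi4 input. This is a minimal sketch, not the reporter's actual script: the water geometry is a placeholder standing in for the larger cluster, and only psi4.set_memory / psi4.set_num_threads are the settings actually reported to dodge the bad numbers.

```python
import psi4

# Placeholder molecule; the reporter's actual system was a much larger cluster.
psi4.geometry("""
0 1
O  0.000  0.000  0.000
H  0.000  0.757  0.587
H  0.000 -0.757  0.587
""")

psi4.set_memory("2 GB")        # below the in-core threshold, forcing the disk DF path
psi4.set_num_threads(1)        # single-threaded runs also avoided the bad energies
psi4.set_options({"scf_type": "df"})

psi4.energy("wb97x-d3bj/6-31g*")
```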
If there’s a difference between the MKL and non-MKL version, I bet the culprit is a memory bug / wrong GEMM arguments in the in-core code.
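To illustrate the kind of failure being hypothesized here (a generic sketch, not Psi4's actual code), SciPy exposes raw BLAS dgemm; with conforming square shapes, a wrong transpose flag produces silently wrong numbers instead of an error:

```python
import numpy as np
from scipy.linalg.blas import dgemm

rng = np.random.default_rng(0)
n = 4
a = np.asfortranarray(rng.standard_normal((n, n)))
b = np.asfortranarray(rng.standard_normal((n, n)))

good = dgemm(alpha=1.0, a=a, b=b)              # C = A @ B
bad = dgemm(alpha=1.0, a=a, b=b, trans_a=1)    # C = A.T @ B: shapes still
                                               # conform, so no error is raised
print(np.allclose(good, a @ b))   # True
print(np.allclose(bad, a @ b))    # False: silently wrong numbers
```

Bugs of this shape are consistent with the symptoms in the thread: no crash, plausible-looking but wrong energies, and sensitivity to buffer layout.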
mem_df is definitely a culprit. Thanks for the nice (and alarming) bug report.
Whoa, and threading matters. The above does not converge with 8 threads, but it converges and differs from disk_df by only 7e-5 with 1 thread.
Which could be caused by a memory bug, since you need more memory for more threads...
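A hypothetical illustration of that accounting error (the function, names, and row size are made up for this sketch, not taken from Psi4): if scratch blocks are sized against the whole memory budget but written by every thread, the safe per-thread slice shrinks with thread count.

```python
def safe_block_rows(total_bytes: int, nthreads: int, row_bytes: int) -> int:
    """Rows each thread may safely use when a memory budget is split across threads."""
    return (total_bytes // nthreads) // row_bytes

budget = 32 * 1024**3   # the 32 GB budget from the report
row = 8 * 200_000       # one hypothetical row of 200k doubles

# Sizing blocks for the whole budget but running 8 threads overshoots
# each thread's fair share by a factor of nthreads.
print(safe_block_rows(budget, 1, row))   # 21474
print(safe_block_rows(budget, 8, row))   # 2684
```

This would also explain why reducing the memory budget or the thread count makes the symptom disappear.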
Thanks so much for the detailed error report. I can reproduce this error and am getting closer to finding the culprit. In the meantime, if you add the option
Hi everyone, I'm on 1.4 and have cherry-picked in the changes from 9163cbd. I'm building from source, as may be relevant to the MKL issues listed in #2283. I'm building with mkl/2019.0.117 (and am stuck there for a bit because of my need for MKL_DEBUG_CPU_TYPE to continue to work). In addition, I have gcc/9.2.0 and icc/2020.2-108 in my build env, with the C, CXX, and Fortran compilers set to the Intel compilers in my cmake config options.

I'm running calculations with wB97M-V and was noticing the same issues @jminuse was. After cherry-picking and recompiling, the issue persists.

Without

And with

Both jobs were run with 8 threads and 29337 MB of memory, on the same machine (An

Should I have expected cherry-picking 9163cbd into 1.4 to have resolved this discrepancy? Or should I be making

Thanks!
@andysim's investigation yielded further oddities that Intel is working on. But yours is the first case that the cherry-pick hasn't fixed, @tallakahath. Would you post the input, please? Yes, it might be advisable to
Ha, I was trying to come up with a minimal reproducible example and found the story somewhat more complicated. We're using a custom basis set, and its nearest sibling (aug-cc-pvdz) doesn't reproduce the bug. Further, if I remove the mol.set_geometry call, the bug goes away. (That call just reloads the same geometry, modulo some flutter from unit conversion and digit truncation; this was part of a bigger calc that keeps updating the geometry, but only this section was required for reproduction.)

Here is the file, attached so it's not GIANT in the field here, because it has the custom basis spec:

And here are the results I got running with and without wcombine:

I'm stumped here, especially by how hard it turned out to be to get the bug to happen. But it happens reliably if I run both of those files.

I've been running a few thousand jobs through now with
An update: after a colleague rebooted the node I'd been using for testing with the

...and now I get the "correct" answer, the one I'd get with

So I think that, despite 9163cbd, there is still a sketchy use of DGETRI or DGETRF somewhere, pulled in by an edge case I'm hitting (because, again, if I tweak the number of processes, or the memory, or the basis set, or the geometry ever so slightly, it goes away!). I think I should flag @andysim here? I'll continue with
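For what it's worth, the LU routines named above are easy to exercise directly through SciPy's LAPACK wrappers. This is a sketch of a sanity check one could run against a suspect BLAS/LAPACK build on such a node (an assumed diagnostic, not part of Psi4):

```python
import numpy as np
from scipy.linalg.lapack import dgetrf, dgetri

a = np.array([[4.0, 2.0],
              [1.0, 3.0]], order="F")

lu, piv, info = dgetrf(a)      # LU factorization; info != 0 signals failure
assert info == 0
inv, info = dgetri(lu, piv)    # inverse reconstructed from the LU factors
assert info == 0

# On a healthy build, inv really is the inverse of a.
print(np.allclose(inv @ a, np.eye(2)))   # True
```

A check like this would not catch a threading or in-core-buffer bug, but it can at least rule out a plainly broken DGETRF/DGETRI on a given machine.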
Andy's Intel report is officially listed as a false positive. |
I am seeing large differences in wB97X-D3BJ energy between 1.4.0 and 1.3.2, and between different 1.4.0 installations. It seems that installing 1.4.0 with -c anaconda can cause the differences between 1.4.0 installations, possibly because it replaces the default linear algebra libraries with MKL versions. Such an installation runs 50% faster, but also gives wrong energies in some situations, sometimes by more than a Hartree.

I've only seen the problem with clusters and large basis sets, which suggests it's a numerical issue. I've tested PBE, M06-2X, and wB97X-D3BJ, and so far it only appears in wB97X-D3BJ. Also, the error goes away if less RAM is provided (say, 10 GB instead of 32 GB). This suggests it may be related to the new ability of Psi4 1.4.0 to do in-core omega integrals (#1749).
Working env:
conda create --name psi4_v1.4.0 python=3.8 psi4 psi4-rt -c psi4 -y
Broken env:
conda create --name psi4_v1.4.0_mkl python=3.8 psi4 psi4-rt -c psi4 -c anaconda -y
Example script: https://drive.google.com/file/d/1c0wZO47h9ooRXQMzTW9eETLWozo4MT_O/view?usp=sharing
To reproduce: install psi4 via conda with -c anaconda as shown, activate the env, then run the provided script: python psi4_1.4.0_omega_issue.py. The energy should be approximately -1965.2319, but will instead give something like -1963.3023.