
MeshDLR creation memory usage is linear in $\beta$ (should be $\log\beta$) #917

Open
HugoStrand opened this issue Sep 14, 2023 · 3 comments

@HugoStrand
Member

Description

This issue is related to the cppdlr library and its construction of the DLR basis.

When creating a TRIQS DLR mesh, e.g. MeshDLR, the memory usage is linear in the inverse temperature $\beta$, preventing the use of DLR at low temperatures. This is probably caused by the current approach to select DLR Matsubara frequencies, since a dense Matsubara grid is used with an upper cutoff proportional to $\beta$.
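For scale, here is a back-of-the-envelope sketch of why a dense Matsubara grid with cutoff proportional to $\beta$ is prohibitive (the fine real-frequency grid size `n_w` is a made-up placeholder, and the actual matrices cppdlr stores may differ):

```python
# Back-of-the-envelope memory estimate (sketch): the dense Matsubara fine
# grid has ~2*Lambda points with Lambda = beta * w_max, and a dense kernel
# matrix K(i nu_n, omega_k) in complex128 is assumed; the size of the fine
# real-frequency grid (n_w) is a hypothetical placeholder.
beta, w_max = 1e4, 30.0
Lambda = beta * w_max               # dimensionless cutoff, linear in beta
n_iw = int(2 * Lambda)              # dense Matsubara grid points
n_w = 1000                          # fine real-frequency grid (assumed)
mem_gb = n_iw * n_w * 16 / 1e9      # complex128 = 16 bytes per entry
print(f"~{mem_gb:.1f} GB at beta = {beta:g}")  # ~9.6 GB
```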

Currently this limits $\beta$ to $10^4$ in practical use cases, see appended plot showing peak memory used during the initialization of a MeshDLR mesh in python.

[Figure: figure_dlr_beta_memory_scaling — peak memory during MeshDLR initialization vs. $\beta$]

This issue was originally observed by @YannIntVeld, thank you for sharing 🥇.

Steps to Reproduce

Here is the benchmark script used to produce the plot above:

import shlex
import subprocess
import numpy as np


def get_mem_usage(beta):
    # Build the MeshDLR in a subprocess under macOS /usr/bin/time -l to
    # capture the number of mesh points and the peak memory footprint
    cmd = f"""/usr/bin/time -l python -c "from triqs.gf import MeshDLR; m = MeshDLR({beta}, 'Fermion', w_max=30., eps=1e-10); print('Mesh points', len(m))" """
    out = subprocess.run(shlex.split(cmd), capture_output=True, text=True)
    n_pts = int(out.stdout.split('points')[-1].split('\n')[0])
    # Parse the "... peak memory footprint" line of the time(1) report
    peak_mem_bytes = int(out.stderr.split('peak memory')[0].split('elapsed')[-1].split('\n')[-1])
    print(f"beta {beta:2.2E} mb {peak_mem_bytes * 1e-6:2.2E}")
    return peak_mem_bytes, n_pts


betas = np.logspace(1, 4, num=20, endpoint=True)
mem_pts = [ get_mem_usage(beta) for beta in betas ]
mem, pts = zip(*mem_pts)
mem = np.array(mem)

import matplotlib.pyplot as plt
subp = [2, 1, 1]

plt.subplot(*subp); subp[-1] += 1
plt.title(r"MeshDLR(beta, 'Fermion', w_max=30., eps=1e-10)")
plt.plot(betas, pts, 'o-')
plt.xlabel(r'$\beta$')
plt.ylabel(r'$N_{DLR}$')
plt.semilogx([],[])

plt.subplot(*subp); subp[-1] += 1
plt.plot(betas, mem*1e-6, 'o-')
plt.xlabel(r'$\beta$')
plt.ylabel('Megabytes')
plt.semilogx([],[])
plt.ylim(bottom=0)

plt.tight_layout()
plt.savefig('figure_dlr_beta_memory_scaling.png')
plt.show()

Expected behavior: The peak memory usage should scale as $\log \beta$, so that DLR meshes are applicable in the low-temperature regime.

Actual behavior: Peak memory usage is linear in $\beta$.

Versions

% python -c "from triqs.version import *; show_version(); show_git_hash();" 
You are using the TRIQS library version 3.2.0
You are using the git hash 2ab1969e4d63aa5567c093263301ed9708e428fc
% uname -v
Darwin Kernel Version 21.6.0: Mon Aug 22 20:19:52 PDT 2022; root:xnu-8020.140.49~2/RELEASE_ARM64_T6000
@jasonkaye

jasonkaye commented Sep 14, 2023

Hi Hugo. Thanks for pointing this out. I agree with your diagnosis, and my back-of-the-envelope memory estimate gives the same order of magnitude as your data.

The unfortunate answer is that improving this is a small research problem, though I am extremely optimistic it is solvable. Here are my favorite possible approaches, in no particular order, though there are others as well.

(1) Some kind of precomputation of DLR nodes.
(1a) Have cppdlr simply precompute and store the DLR nodes for a sufficient number of Lambda/epsilon combinations that one could always easily take the user inputs and floor epsilon & ceiling Lambda to the next stored values. We'll just do this precomputation on a big computer, or do some hacks to get away with it. I think the memory requirements for that data should be relatively modest, and it wouldn't be unreasonable to simply include those values in cppdlr.
(1b) A refinement of 1a: "recompression". Do 1a, but for a much smaller number of Lambda/epsilon combinations, or even one combination with extraordinarily large Lambda and epsilon = machine epsilon. Then follow the procedure in 1a, but just use the resulting grid as the initial dense grid in the pivoted Gram-Schmidt to be subselected from, analogous to the composite Chebyshev grid in imaginary time. In principle, this should work, and gives the correct scaling, but it needs to be tested.
(1c) Compatible with 1a or 1b. Show that for a given Lambda, there is a single set of DLR frequencies/nodes from which one can sub-select given a specific epsilon. This certainly works at the level of DLR frequencies, and one could test whether it works at the level of DLR nodes, in other words whether the DLR nodes for fixed Lambda and a sequence of choices of epsilon can be taken to be nested (we can discuss what I have in mind). Then there would only be one parameter Lambda to loop over in your precomputation. It is even possible that one could cook up a single universal set of DLR frequencies, and corresponding nodes, such that given Lambda and epsilon one can simply sub-select from that set, but I'd have to think more about this. If it's true, we should really just pre-compute that data once and for all and be done with it.
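To make (1a) concrete, a minimal sketch of the table lookup (the stored Lambda/epsilon values below are invented for illustration; in practice the table would ship with precomputed DLR nodes for each pair):

```python
import bisect

# Hypothetical table of precomputed (Lambda, eps) combinations; the real
# values and the associated stored DLR nodes would come from an offline
# computation on a big machine, as described above.
LAMBDAS = [1e2, 1e3, 1e4, 1e5, 1e6]
EPSILONS = [1e-6, 1e-8, 1e-10, 1e-12]

def snap(lam, eps):
    """Snap user input to a stored combination that is at least as
    demanding: ceiling in Lambda, floor in eps."""
    lam_key = LAMBDAS[bisect.bisect_left(LAMBDAS, lam)]  # next stored Lambda >= lam
    eps_key = max(e for e in EPSILONS if e <= eps)       # next stored eps <= eps
    return lam_key, eps_key

print(snap(3e4, 5e-9))  # -> (100000.0, 1e-10)
```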

(2) If you find precomputing and storing things distasteful (I don't): figure out the correct method of treating Matsubara frequency space not as a set of discrete points, but as a continuum, by figuring out the correct interpolation of the kernel and convincing yourself that this is legal (I need to think more about this, but there's probably a way). Assuming this works, now you can use a composite Chebyshev grid as a fine grid in Matsubara frequency, just as for imaginary time. You will get points which aren't Matsubara frequencies, and you have two options. You could show that working with the interpolated kernel is allowed, and then you could just keep the points that come out of the algorithm, but this could cause compatibility issues with other codes. Or, you could snap each point to the nearest Matsubara frequency, and check that this still works (I'm guessing it does).
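The final snapping step of (2) is cheap to express (a sketch; whether snapping preserves the interpolation accuracy is precisely the open question above):

```python
import numpy as np

# Snap arbitrary frequencies to the nearest fermionic Matsubara frequency
# nu_n = (2n + 1) * pi / beta (a sketch of the last step of option (2)).
def snap_to_matsubara(nu, beta):
    n = np.round((nu * beta / np.pi - 1.0) / 2.0).astype(int)
    return (2 * n + 1) * np.pi / beta

beta = 100.0
nu = np.array([0.02, 0.1, 1.0])
print(snap_to_matsubara(nu, beta))  # nearest odd multiples of pi/beta
```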

(3) Use some kind of pre-specified logarithmically-spaced grid as your fine grid, and pray. This isn't necessarily such a crazy idea, because one could check empirically for many values of Lambda and epsilon that it works. Plausibility argument: the choice of nodes tends to be somewhat forgiving, meaning if you jiggle the nodes around, the quality of your interpolation doesn't seem to be negatively affected too much. So it might be that starting with any sufficiently-dense grid with roughly the right clustering is good enough.
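As a sketch of (3), one possible pre-specified grid: geometrically spaced positive indices snapped to integers and mirrored to negative fermionic indices (the construction and the points-per-octave density are guesses, not a validated fine grid):

```python
import numpy as np

def log_fine_grid(nmax, pts_per_octave=16):
    """Hypothetical logarithmically clustered fine grid of fermionic
    Matsubara indices: dense near n = 0, thinning geometrically to nmax."""
    n_oct = int(np.ceil(np.log2(nmax))) + 1
    samples = 2.0 ** np.linspace(0.0, n_oct, n_oct * pts_per_octave)
    n = np.unique(np.round(samples).astype(int))   # snap to integers, dedupe
    n = np.concatenate(([0], n[n <= nmax]))        # keep the lowest frequencies
    return np.concatenate((-1 - n[::-1], n))       # fermionic mirror: nu_{-1-n} = -nu_n

g = log_fine_grid(10**6)
print(len(g))  # grows ~ log(nmax) rather than ~ nmax
```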

So this problem is probably solvable, but somebody needs to sit down and do a little bit of research work and thinking about the correct solution. I'm happy to participate in those discussions and give my thoughts about what to try first, and once we know what to do, I'm happy to implement the solution in cppdlr. Or, it's a nice quick research project for anybody out there to pick up.

Jason

@HugoStrand
Member Author

Dear Jason,

I agree that there are many ways to handle this.

I think that since we have an upper bound on how many imaginary-frequency points the DLR construction will select (given by the epsilon-rank $r$ of the rank-revealing QR on the real-frequency grid), it is possible to construct a dyadic grid that terminates at the order of $\Lambda$, the current imaginary-frequency cutoff. I have experimented with this in pydlr and it is working out well.

The idea is to lay out panels over the positive Matsubara frequency indices, where the first panel contains the first $r$ Matsubara indices, the second panel keeps every second index (again $r$ points), the third panel keeps every fourth index (again $r$ points), etc. The number of panels is controlled by the upper cutoff $N_{max} = \lceil \Lambda \rceil$, just as we are currently doing. (Finally, the index panels are mirrored to the negative frequency indices.)

Here is a code example that constructs the index range from panels whose spacing doubles from one panel to the next, when the flag dense_imfreq=False:

        if dense_imfreq:
            n = np.arange(-nmax, nmax+1)
        else:
            # -- Sparse starting grid in imaginary frequency:
            # dyadic panels of r indices each, with the stride doubling
            # from one panel to the next
            r = self.rank
            n_panels = int(np.ceil(np.log2(nmax/r))) + 1 if nmax > r else 2
            n = np.zeros(r * n_panels)

            idx = 1
            for p in range(n_panels):
                d_idx = r * 2**p
                nn = np.arange(idx, idx + d_idx, 2**p)  # r indices, stride 2**p
                n[r*p:r*(p+1)] = nn
                idx += d_idx

            n = np.concatenate((1-n[::-1], n))  # mirror to negative indices

and here is the commit that implements this in pydlr jasonkaye/libdlr@ff40d13
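A quick standalone check of the scaling (the same loop in self-contained form, with an arbitrary example rank r = 30): the grid size grows like $r \log_2(N_{max}/r)$ while the dense grid has $2 N_{max} + 1$ points.

```python
import numpy as np

# Standalone copy of the dyadic panel construction above, to verify the
# logarithmic growth of the fine grid (r and nmax are example values).
def dyadic_imfreq_grid(r, nmax):
    n_panels = int(np.ceil(np.log2(nmax / r))) + 1 if nmax > r else 2
    n = np.zeros(r * n_panels)
    idx = 1
    for p in range(n_panels):
        d_idx = r * 2**p
        n[r*p:r*(p+1)] = np.arange(idx, idx + d_idx, 2**p)
        idx += d_idx
    return np.concatenate((1 - n[::-1], n))

for nmax in [10**3, 10**5, 10**7]:
    print(nmax, len(dyadic_imfreq_grid(30, nmax)))  # 420, 780, 1200 vs 2*nmax+1
```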

Shall I go ahead and implement this in cppdlr?

Best, Hugo

@jasonkaye

Sorry for the confusion: since this is really a cppdlr issue, I'm moving the discussion to the corresponding issue there. Let's get the cppdlr implementation settled, and then we can figure out what to do on the TRIQS end.
