Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use parsed files to improve header compile commands #123

Open
sam-mccall opened this issue Aug 9, 2019 · 15 comments
Open

Use parsed files to improve header compile commands #123

sam-mccall opened this issue Aug 9, 2019 · 15 comments
Assignees

Comments

@sam-mccall
Copy link
Member

Currently the dominant way we get compile commands for open headers is filename based matching (in JSONCompilationDatabase).

However we have two sets of better information:

  • the files that have been open (headers are often opened after some file that includes them)
  • the background index (most headers have been indexed by some TU)
    we could make use of these to get the TU, and then transfer its compile command.
@ilya-biryukov
Copy link

ilya-biryukov commented Aug 12, 2019

Great idea! We could potentially get a better "switch header-source" for free while implementing this feature too.

@knopp
Copy link

knopp commented May 2, 2020

Is there any progress on this? I believe 2) is roughly how Xcode indexes headers, except that it also reflects additional preprocessor state at the point of include.

so having mysource.cpp

#define A
#include "myheader.h"

Will cause -DA added to myheader.h compilation flags alongside mysource.cpp flags. This leads to very good result.

morehouse pushed a commit to morehouse/llvm-project that referenced this issue Mar 4, 2021
Currently our strategy for getting header compile flags is something like:

A) look for flags for the header in compile_commands.json
   This basically never works, build systems don't generate this info.
B) try to match to an impl file in compile_commands.json and use its flags
   This only (mostly) works if the headers are in the same project.
C) give up and use fallback flags
   This kind of works for stdlib in the default configuration, and
   otherwise doesn't.

Obviously there are big gaps here.

This patch inserts a new attempt between A and B: if the header is
transitively included by any open file (whether same project or not),
then we use its compile command.

This doesn't make any attempt to solve some related problems:
 - parsing non-self-contained header files in context (importing PP state)
 - using the compile flags of non-opened candidate files found in the index

Fixes clangd/clangd#123
Fixes clangd/clangd#695
See clangd/clangd#519

Differential Revision: https://reviews.llvm.org/D97351
@HighCommander4
Copy link

HighCommander4 commented May 10, 2021

It looks to me like the patch that landed here is a partial solution (it picks the right source file if that source file has been opened), not a complete solution (which I would envision as using the right source file even if it hasn't been opened, using include-graph information stored in the index).

Should we re-open, or file a new issue to track the complete solution? This is something that users continue to run into, e.g. clangd/vscode-clangd#187.

@kadircet
Copy link
Member

we've avoided doing that so far mainly because storing reverse include graphs are hard.

How many candidates do we want to store per header?

The current solution stores only one and it works well because most of the time the user would like to see the header in context of the most recent file they were working in and existing solution guarantees that in most cases.

When we go with using the info from background-index, back at the day we've concluded that we can't store a single candidate. Because it might be coming from a different part of the code-base (similar to picking a bad candidate heuristically) and clangd might end up building the header in a not-so-useful state (probably still better than completely broken). Even if you stored multiple candidates, how are you gonna pick between them? (e.g. one has -std=c++20 other has -std=c++17).

Also the candidate might just be invalid because someone deleted the file, or it no longer depends on the header (and static index can go stale). There's no way to easily verify the latter without at least running the preprocessor on the file again.
Anyway, we need fallback candidates now (as they might be invalid too). If we decide to store all, it becomes O(N^2) storage. Even if you cleverly intern filenames and such, in codebases like chromium there are ~300K source files, this is GBs worth of storage.

Maybe some middle ground could be found by still storing a single candidate in static (background) index and using it whenever there's no information coming from dynamic index. Only cases I can think of are:

  1. User directly opening a header file, without opening any cc file depending on it
  2. There are indeed no files depending on a particular header and compile commands don't have an entry for it.

There isn't really much we can do for the second case, it just won't work with a static index approach too.

As for the first case, i got a feeling like it should be rare in practice to justify all the extra complication. (As we already have a heuristic CDB for this case, and are only considering the cases in which heuristics pick a bad file. That possibility still exists with a static-index based approach but becomes less likely as explained)

Are there any cases I am missing here in which the dynamic index based solution doesn't really cover?

@HighCommander4
Copy link

I think the first case is likely to be common for projects which are mostly headers, e.g. a header-only / template library. In such projects, the only source files are often test files, and users don't necessarily open test files before opening the header files.

I also think a single candidate based on the include graph is still better, in that we know that e.g. it provides the flags necessary to resolve includes in the header being opened, than heuristically chosen candidates, which may not.

Preferring information from the dynamic index and falling back on a single candidate stored in the static index seems reasonable to me.

@kadircet
Copy link
Member

I also think a single candidate based on the include graph is still better, in that we know that e.g. it provides the flags necessary to resolve includes in the header being opened, than heuristically chosen candidates, which may not.

Sorry if i wasn't explicit enough. I was also trying to make this point, while adding that:

  • It still won't be foolproof, some things that come to mind:
    • the candidate we pick might vanish between indexing and time to use
    • go stale e.g. not depend on the particular header anymore or got deleted
    • we picked the configuration user wasn't interested in
  • I don't think it happens common enough in practice to justify the complications in static-index + CDB. The two possible designs i can see are:
  • either changing the SymbolIndex interface to support such queries (well doesn't sound like something a "symbol" index should be doing at the very least)
  • signalling either TUScheduler/CDB from "indexer" component. (this one definitely messy)

So it sounds to me like we'll improve the world slightly at the cost of lots of complexity. Hence i was saying that it didn't sound that feasible. But maybe there are better alternatives that I don't see ATM, either foolproof (which i don't think we can at the face of multiple configurations even if we figure out all the rest) or that can be implemented locally without changing lots of interfaces. So we are open to suggestions.

@HighCommander4
Copy link

Your points about the tradeoffs are well taken.

My concrete suggestion for the time being, is that we reopen this issue (or file a new one) to acknowledge that there is a remaining gap here, so that we have a place where we can point users that run into this (such as in clangd/vscode-clangd#187), and document workarounds (such as using a tool like CompDB to generate header entries in the CDB).

Even if the resolution ends up being "it's not worth to spend more time enhancing clangd's command inference, let's just accept that header-only library projects need to use the CompDB approach", that's worth documenting / putting in an FAQ or something.

@kadircet
Copy link
Member

Yes. Since we currently lack the roadmap, it is hard to convey the situation on our side (but i am working on it!). In the long term we would like to ensure that the "open" issues are only for the cases where no action was taken or there's a consensus around taking some action but it isn't complete yet. Then have all the other cases in which we made a decision to take no action documented in a different place ("closed" issues are also a candidate) by explaining the reasons why, so that it is visible to everyone (including us, history is hard to keep track of!). Hopefully we'll get there soon.

In the meantime feel free to open an issue and summarize these discussions if you have some time, but otherwise this particular issue is also on my to-do list to mention (like most other open issues) once we figure out details about how we want to proceed.

@GavinRay97
Copy link

Just leaving a note here, without much to contribute to the discussion, that CompDB is fantastic but for some projects fails & is unusable.

IE when used when building JUCE apps (even ones that have a single .cpp file), the resulting header compilation commands make the compile_commands.json so large that clangd cannot process them anymore =(

I am sure there are a non-insignificant number of projects where this is the case unfortunately.

@HighCommander4
Copy link

I don't think it happens common enough in practice to justify the complications in static-index + CDB.

@kadircet, since we had this discussion ~9 weeks ago, 7 different users (that I'm aware of) have reported running into this problem in various forums (Discourse, Discord, or here in the bug tracker). I've collected them:

  1. 2021-05-21: Source file chosen to infer compile commands for a header sometimes has the wrong language #519 (comment)
  2. 2021-06-01: Use parsed files to improve header compile commands #123 (comment)
  3. 2021-06-09: https://llvm.discourse.group/t/clangd-compilation-unit-incorrectly-inferred-for-a-header-file/3620
  4. 2021-06-14: Source file chosen to infer compile commands for a header sometimes has the wrong language #519 (comment)
  5. 2021-07-01: https://discord.com/channels/636084430946959380/649134148723802113/860276527496298516
  6. 2021-07-10: https://discord.com/channels/636084430946959380/649134148723802113/863578622899781682
  7. 2021-07-20: File is not parsed with C++17 support #823

In my opinion, this is a serious enough recurring issue that it's worth trying to solve via the static index even though it's not foolproof as discussed, and even if it entails some internal complexity such as expanding the interface of SymbolIndex.

@kadircet
Copy link
Member

Thanks for the detailed update Nathan (and sorry for late reply on my side),

We were discussing this offline with the team and came to the conclusion that even though this is going to add some complexity and solutions that we can come up with using the current infra is still going to be limited, the impact is probably justified.

The interaction still needs to be designed carefully though as it will need to touch lots of different components. First we need to figure out how to extract the data needed in SymbolCollector (hopefully consuming IncludeGraph should suffice). Then how to store this new information in index shards and how to build serving structures on top of it in dex/mem. Afterwards we need to make it available through SymbolIndex and define the behaviour for Merge/File and various other index implementations (and probably an error case too). Finally it is going to touch TUScheduler so that we can pivot to a different file if currently active files fallback fails (+ maybe just make existing logic part of dynamic index already and have a single fallback rather than 2 different fallbacks).

It is not something we can prioritize today, or in the near term but definitely something we support. I am planning to come up with a roadmap in the following weeks, hopefully this item will also have its place in the agenda so that someone can pick it up.

@koenlek
Copy link

koenlek commented Sep 9, 2021

Could we re-open this ticket? Or is there a better ticket to subscribe to for getting clangd to infer "includes in headers" based on the compile commands?

@HighCommander4
Copy link

I think it makes sense to re-open this one, as I think there is a lot of relevant discussion here.

@cpsauer
Copy link

cpsauer commented Feb 6, 2022

Chiming in from the bazel plugin side to echo the need (beyond the backlinks, above):

It'd be so nice if clangd proactively traversed the include graph, applying flags to headers from the sources that included them as part of background indexing.

Can verify that this is a case we and our users would hit all the time! And that much of work (runtime, and file size) spent to get a compile_commands.json to work well with clangd involves working around this issue, compdb-style.

mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022
Currently our strategy for getting header compile flags is something like:

A) look for flags for the header in compile_commands.json
   This basically never works, build systems don't generate this info.
B) try to match to an impl file in compile_commands.json and use its flags
   This only (mostly) works if the headers are in the same project.
C) give up and use fallback flags
   This kind of works for stdlib in the default configuration, and
   otherwise doesn't.

Obviously there are big gaps here.

This patch inserts a new attempt between A and B: if the header is
transitively included by any open file (whether same project or not),
then we use its compile command.

This doesn't make any attempt to solve some related problems:
 - parsing non-self-contained header files in context (importing PP state)
 - using the compile flags of non-opened candidate files found in the index

Fixes clangd/clangd#123
Fixes clangd/clangd#695
See clangd/clangd#519

Differential Revision: https://reviews.llvm.org/D97351
@MrDiver
Copy link

MrDiver commented Dec 28, 2023

Any updates ?

AdelKS added a commit to AdelKS/ZeCalculator that referenced this issue Mar 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants