Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compile_commands.json contains entries for headers! And it takes a little while to generate! ~20s on the Hedron main repo at the time of writing. Could we speed it up? #5

Closed
cpsauer opened this issue Nov 17, 2021 · 10 comments

Comments

@cpsauer
Copy link
Contributor

cpsauer commented Nov 17, 2021

Status: The slowness is from having clang preprocess every source file in the whole project to figure out which headers it uses (in refresh.template.py). We'll need to do this for clangd until it does this for us in its index. Clangd issue. Once this is fixed, things should be much faster and we should consider regenerating automatically on BUILD file save. Without the preprocessing, refresh.py only takes 8s or so, single threaded.

Suggested Workaround: You might be able to refresh compile_commands.json slightly less often, making the current runtime okay. If you're adding files, clangd should make pretty decent guesses at completions, using commands from nearby files. And if you're deleting files, there's not a problem. So you may not need to rerun refresh.py on every change to BUILD files. Instead, maybe refresh becomes something you run every so often when you can spare the time, making the current runtime okay.

If that doesn't work for you, consider trying to only index the part of the codebase you care about using the refresh_compile_commands rule (see README). Usually the runtime is only a real problem on huge repos--and there you're probably only editing a small subset of the codebase.

Finally, there are now options to exclude indexing for external workspaces. See docs at the top of the refesh_compile_command.

Otherwise: If you'd like to explore ways it could be made faster or help, read on!

@cpsauer cpsauer changed the title compile_commands.json takes a while to generate! ~30s on the Hedron main repo at the time of writing. Could we speed it up? compile_commands.json contains entries for headers! And it takes a little while to generate! ~20s on the Hedron main repo at the time of writing. Could we speed it up? Dec 25, 2021
@cpsauer
Copy link
Contributor Author

cpsauer commented Feb 6, 2022

See also #24 for more discussion about the same root issue.

@matta
Copy link
Contributor

matta commented Feb 7, 2022

I saw #24 and it triggered some memories.

Bazel requires headers to be declared with a hdrs attribute, so in theory every header, if it is "#includable" at all, appears explicitly in the build graph. I believe that this has been strictly enforced for about seven years or so (e.g. https://groups.google.com/g/bazel-discuss/c/LzlzSuqxhFU).

The discussion on clangd/clangd#123 about the clangd heuristic is in the context of clangd working as best it can regardless of the build system used. Most build systems work at the include path level, and don't require individual header files to be declared to the build system. Some sort of "auto discovery" certainly makes sense in that context.

I'm not sure why bazel-compile-commands-extractor must resort to preprocessing source code, given the strictness of the Bazel build. I would think it at least theoretically possible to discover plausible compile commands for all headers in the build using the Bazel build graph alone.

Clangd can improve its heuristics, but it still won't get it right in all cases. It seems like inferring "correctly representative" compile commands for header files, using explicitly declared information in the Bazel build graph, would I imagine still be worth doing because it will be more likely to be correct.

Sometimes with these things, the devil is in the details. It is possible to express some distinctly non-standard builds with Bazel, which can break Bazel's "strict headers checking" logic, etc. So perhaps the heuristics clangd uses are still relevant. I am curious to understand why preprocessing is currently preferable to processing the build graph, though.

P.S. the bazel-discuss post I link above hints that Bazel doesn't do any "include scanning" itself. I believe it was at least always the team's intent to never ship "include scanning" outside Google, because it was considered a design mistake that in practice produced unsound builds. Internally, Google's Blaze did (and may still do) "include scanning" in some cases. But, perhaps relevant here, this was never full blown preprocessing. Blaze had/has a very dummed down scanner that looks for things like #include "blarg/blah.h" and includes the blarg/blah.h file in the build if it happens to be available.

@cpsauer
Copy link
Contributor Author

cpsauer commented Feb 7, 2022

Thanks for seeing what's new and chipping in, @matta :)

If you have any ideas on how to get 'em out of bazel faster, I'm definitely excited about it!

Some history, if useful: In our previous, extra_action approach (more in #24 and ImplementationReadme) bazel did list the headers for us, but that was after a build, much more slowly, and with a variety of other problems. Maybe it, too, was parsing a dependency (.d) file outputted from the build, as described in the discussion you linked, and like we're doing currently. When we switched to the current, much better aquery-based approach, I was bummed to not see the headers directly listed in the inputs field. But maybe they're findable by traversing the "middlemen." (I am, however, seeing that at least the system SDK headers aren't directly listed anywhere). Any chance you (or anyone reading) know what's going on with middlemen enough to navigate with confidence?

If curious about the browsing the middlemen, try
bazel aquery "mnemonic('(Objc|Cpp)Compile|Middleman',deps(<your target>))" | less

Just to steer on a few things and to try to answer some questions:
(1) I think Clangd can definitely nail finding included headers as well as anyone. It's got the benefit of clang's C++ compiler frontend and is already running the full header inclusion search process as (a small) part of helping you edit. That is, clangd, like clang, can determine exactly which headers are used by a given source file given the compile command. No heuristics needed; they're able to directly invoke the relevant part of the compiler to get an authoritative answer. IMO doing this proactively instead of reactively is a pretty natural build-in for them, given they've already got that ability, plus background indexing and caching--and a need for this across build systems.
(2) Until then, we've been directly invoking clang/gcc to run an equivalently authoritative process up front, invoking header search and dependency generation but stopping after preprocessing, forgoing compilation. (see --dependencies in refresh.template.py)
(3) I'm confident that Bazel knows somewhere which headers (and directories) could be included. But what we'd really like is to know which ones are actually included, to keep the compile_commands specific and small. It seems to me like we need the compiler[preprocessor] to do this correctly, easily, and robustly. Now, maybe we should be willing to give up this specificity for speed. That is just quickly emit entries for all headers a file could include regardless of whether it does, based on Bazel's understanding. We'd want to be careful that we don't miss something! But I think Bazel is less likely to be easily correct here than clang(d), rather than more.

Thoughts? Things I might be missing?

Still super eager to hear about your experience using this with Emacs! But that's a separate issue, I suppose.

@matta
Copy link
Contributor

matta commented Feb 7, 2022

Including the system headers, and other headers from installed source (not under bazel management) is an interesting sub-problem. So is the situation where one header is compiled in potentially multiple ways in one build.

Unfortunately, my Bazel experience predates aquery and cquery (there was just "query"). I don't even know what a Bazel aspect is, really. Perhaps best to ignore me? :-) I'm vaguely aware that a lot of the mobile build stuff has made the logic much more complicated than it used to be (e.g. a target can change how its dependencies are built). Frankly my head kinda explodes whenever I have to think about it.

I do think it would be nice if Bazel was a kind of "source of truth" at finding the "best" .cc file to pair with an .h file for the purposes of compile_commands.json. In my build's case bazel-compile-commands-extractor usually picks a reasonable "nearby" .cc file, but sometimes it doesn't. I'll file an issue about the case I'm seeing.

P.S. Emacs isn't really an interesting special case here. It is using clangd much like VS Code would. It just needs a bit more explicit setup to get a "project" defined and working, instead of the click-to-install-clangd-and-auto-configure approach that VS Code takes.

@cpsauer
Copy link
Contributor Author

cpsauer commented Feb 8, 2022

No, no. Good stuff--and thanks for thinking with me. Turns out that aquery stuff is pretty handy, though! Enables this plugin. And unfortunately, the multiply included header and system headers cases are more the common case than the unusual.

[I'll give a better reply back to your separate issue. But briefly, it's just picking an (arbitrary) source file that includes the header (possible transitively) during compilation, so it can get a valid compile command that runs on that header.]

And on the P.S.: Great to hear emacs isn't too different. Easy enough that people should just be able to figure it out? You're our trailblazer here :) So if there are tips people should know, I would so love it if you'd add it to the README. Even just a note saying "Emacs: It works!"

@cpsauer
Copy link
Contributor Author

cpsauer commented Apr 7, 2022

I thought of another way we could potentially speed this up if people needed it--at least on macOS/Linux:

We could check to see if Bazel has cached .d files listing the dependencies and use those instead of preoprocessing everything to regenerate them.

If all the files listed in the .d file have older last-modified dates than the .d file itself, this should be safe. We'd want to check that bazel isn't 0-timestamping generated files, though.

We could also write .d files when needed, saving work between runs.

Maybe there's a good way of doing the equivalent on Windows, too, maybe using /sourceDependencies

cpsauer added a commit that referenced this issue Apr 7, 2022
@cpsauer
Copy link
Contributor Author

cpsauer commented Apr 7, 2022

I've added this to the other ideas in the OPTIMNOTE section of https://github.com/hedronvision/bazel-compile-commands-extractor/blob/main/refresh.template.py

@cpsauer
Copy link
Contributor Author

cpsauer commented May 20, 2022

^ If someone is reads this an is interested in helping speed this up, I'd love a PR taking a crack at the caching described above #5 (comment).

@cpsauer
Copy link
Contributor Author

cpsauer commented Jun 2, 2022

Closing this down because I just landed code that makes header finding way faster. We now have caching for header file finding and for clang/gcc, piggyback on Bazel's dependency cache--in addition to the options to control header finding merged previously!

That should be a huge improvement in speed. Don't think this can be much faster, and it's now quite good.

(Though if you're a windows user reading this: We might be able to make piggyback Bazel's cache with MSVC too. See #49 (comment) if you're interested in helping.)

@wolfd
Copy link

wolfd commented Apr 29, 2024

For what it's worth, I tried the aquery method that was suggested in this thread and it works great for us at Skydio. We open-sourced it, and I certainly wouldn't mind if it served as inspiration to this project.

https://github.com/danny-skydio/bazel-compile-commands

tiponada4l added a commit to tiponada4l/bazel-compile-commands-extractor that referenced this issue Aug 10, 2024
Trabochuk added a commit to Trabochuk/bazel-compile-commands-extractor that referenced this issue Aug 10, 2024
ngiloq6 added a commit to ngiloq6/bazel-compile-commands-extractor that referenced this issue Aug 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants