compile_commands.json contains entries for headers! And it takes a little while to generate! ~20s on the Hedron main repo at the time of writing. Could we speed it up? #5
See also #24 for more discussion about the same root issue.
I saw #24 and it triggered some memories. Bazel requires headers to be declared to the build system (e.g. in a `cc_library`'s `hdrs`). The discussion on clangd/clangd#123 about the clangd heuristic is in the context of clangd working as best it can regardless of the build system used. Most build systems work at the include-path level and don't require individual header files to be declared to the build system, so some sort of "auto discovery" certainly makes sense in that context.

I'm not sure why bazel-compile-commands-extractor must resort to preprocessing source code, given the strictness of the Bazel build. I would think it at least theoretically possible to discover plausible compile commands for all headers in the build using the Bazel build graph alone. Clangd can improve its heuristics, but it still won't get it right in all cases. Inferring "correctly representative" compile commands for header files from information explicitly declared in the Bazel build graph would, I imagine, still be worth doing, because it is more likely to be correct.

Sometimes with these things, the devil is in the details. It is possible to express some distinctly non-standard builds with Bazel, which can break Bazel's "strict headers checking" logic, etc., so perhaps the heuristics clangd uses are still relevant. I am curious to understand why preprocessing is currently preferable to processing the build graph, though.

P.S. The bazel-discuss post I link above hints that Bazel doesn't do any "include scanning" itself. I believe it was at least always the team's intent to never ship "include scanning" outside Google, because it was considered a design mistake that in practice produced unsound builds. Internally, Google's Blaze did (and may still do) "include scanning" in some cases. But, perhaps relevant here, this was never full-blown preprocessing: Blaze had/has a very dumbed-down scanner that looks for `#include`-like lines rather than actually preprocessing.
Thanks for seeing what's new and chipping in, @matta :) If you have any ideas on how to get 'em out of Bazel faster, I'm definitely excited about it!

Some history, if useful: In our previous, extra_action approach (more in #24 and ImplementationReadme), Bazel did list the headers for us, but that was after a build, much more slowly, and with a variety of other problems. Maybe it, too, was parsing a dependency (.d) file output from the build, as described in the discussion you linked, and like we're doing currently. When we switched to the current, much better aquery-based approach, I was bummed to not see the headers directly listed in the inputs field. But maybe they're findable by traversing the "middlemen." (I am, however, seeing that at least the system SDK headers aren't directly listed anywhere.) Any chance you (or anyone reading) know what's going on with middlemen enough to navigate with confidence? If curious about browsing the middlemen, try it yourself.

Just to steer on a few things and to try to answer some questions: Thoughts? Things I might be missing?

Still super eager to hear about your experience using this with Emacs! But that's a separate issue, I suppose.
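To make the aquery idea above concrete, here's a rough sketch of turning already-parsed `bazel aquery --output=jsonproto` output into compile_commands.json entries, with no preprocessing of sources. The field names (`actions`, `arguments`) and the convention that the source path follows `-c` are my assumptions about the jsonproto format, not guarantees:

```python
def actions_to_entries(aquery_graph, directory='.'):
    """Turn a parsed `bazel aquery --output=jsonproto` graph into
    compile_commands.json entries, one per compile action.

    Assumes each action dict has an 'arguments' list and that the
    source file is the argument following '-c'. Actions without a
    '-c' flag (e.g. middlemen) are skipped.
    """
    entries = []
    for action in aquery_graph.get('actions', []):
        args = action.get('arguments', [])
        if '-c' not in args:
            continue  # Not a compile action we can use.
        entries.append({
            'file': args[args.index('-c') + 1],
            'arguments': args,
            'directory': directory,  # Usually Bazel's execution root.
        })
    return entries

# A possible invocation (not run here), piped through json.loads()
# and then this function:
#   bazel aquery --output=jsonproto 'mnemonic("CppCompile", deps(//...))'
```

This would only cover sources directly; headers would still need to be attributed to one of these entries somehow.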
Including the system headers, and other headers from installed sources (not under Bazel management), is an interesting sub-problem. So is the situation where one header is compiled in potentially multiple ways in one build.

Unfortunately, my Bazel experience predates aquery and cquery (there was just "query"). I don't even know what a Bazel aspect is, really. Perhaps best to ignore me? :-) I'm vaguely aware that a lot of the mobile build stuff has made the logic much more complicated than it used to be (e.g. a target can change how its dependencies are built). Frankly, my head kinda explodes whenever I have to think about it.

I do think it would be nice if Bazel were a kind of "source of truth" for finding the "best" .cc file to pair with an .h file for the purposes of compile_commands.json.

P.S. Emacs isn't really an interesting special case here. It is using clangd much like VS Code would. It just needs a bit more explicit setup to get a "project" defined and working, instead of the click-to-install-clangd-and-auto-configure approach that VS Code takes.
No, no. Good stuff--and thanks for thinking with me. Turns out that aquery stuff is pretty handy, though! It enables this plugin. And unfortunately, the multiply-included-header and system-header cases are more the common case than the unusual one.

[I'll give a better reply in your separate issue. But briefly: it just picks an (arbitrary) source file that includes the header (possibly transitively) during compilation, so it can get a valid compile command that runs on that header.]

And on the P.S.: Great to hear Emacs isn't too different. Easy enough that people should just be able to figure it out? You're our trailblazer here :) So if there are tips people should know, I'd so love it if you'd add them to the README. Even just a note saying "Emacs: It works!"
I thought of another way we could potentially speed this up if people needed it--at least on macOS/Linux: We could check whether Bazel has cached .d files listing the dependencies and use those instead of preprocessing everything to regenerate them. If all the files listed in a .d file have older last-modified dates than the .d file itself, this should be safe. We'd want to check that Bazel isn't 0-timestamping generated files, though. We could also write .d files when needed, saving work between runs. Maybe there's a good way of doing the equivalent on Windows, too, perhaps using /sourceDependencies.
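The freshness check described above could look something like this minimal sketch (the function name is mine; real .d files can contain backslash-escaped spaces in paths, which this ignores):

```python
import os

def dfile_is_fresh(dfile_path):
    """Heuristic freshness check for a cached .d dependency file.

    Returns True if every dependency listed in the .d file has a
    last-modified time no newer than the .d file itself, i.e. the
    cached header list is probably still valid and we can skip
    re-preprocessing this source file.
    """
    dfile_mtime = os.path.getmtime(dfile_path)
    with open(dfile_path) as f:
        content = f.read()
    # .d files are Makefile rules: "target.o: dep1 dep2 \<newline> dep3".
    # Join continuation lines, drop the target, split on whitespace.
    deps = content.split(':', 1)[1].replace('\\\n', ' ').split()
    for dep in deps:
        try:
            if os.path.getmtime(dep) > dfile_mtime:
                return False  # A dependency changed after the .d was written.
        except OSError:
            return False  # Dependency missing; safest to regenerate.
    return True
```

As noted above, this would also need a guard against Bazel 0-timestamping generated files, which would defeat the mtime comparison.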
I've added this to the other ideas in the OPTIMNOTE section of https://github.com/hedronvision/bazel-compile-commands-extractor/blob/main/refresh.template.py |
^ If someone reads this and is interested in helping speed this up, I'd love a PR taking a crack at the caching described above in #5 (comment).
Closing this down because I just landed code that makes header finding way faster. We now cache header-file finding and, for clang/gcc, piggyback on Bazel's dependency cache--in addition to the options to control header finding merged previously! That should be a huge improvement in speed. I don't think this can be made much faster, and it's now quite good. (Though if you're a Windows user reading this: We might be able to make it piggyback on Bazel's cache with MSVC too. See #49 (comment) if you're interested in helping.)
For what it's worth, I tried the aquery method that was suggested in this thread and it works great for us at Skydio. We open-sourced it, and I certainly wouldn't mind if it served as inspiration to this project. https://github.com/danny-skydio/bazel-compile-commands
Status: The slowness is from having clang preprocess every source file in the whole project to figure out which headers it uses (in refresh.template.py). We'll need to do this for clangd until it does this for us in its index (see the clangd issue). Once this is fixed, things should be much faster, and we should consider regenerating automatically on BUILD-file save. Without the preprocessing, `refresh.py` only takes 8s or so, single threaded.

Suggested Workaround: You might be able to refresh compile_commands.json slightly less often, making the current runtime okay. If you're adding files, clangd should make pretty decent guesses at completions, using commands from nearby files. And if you're deleting files, there's not a problem. So you may not need to rerun `refresh.py` on every change to BUILD files. Instead, maybe refresh becomes something you run every so often when you can spare the time.

If that doesn't work for you, consider indexing only the part of the codebase you care about using the `refresh_compile_commands` rule (see README). Usually the runtime is only a real problem on huge repos--and there you're probably only editing a small subset of the codebase. Finally, there are now options to exclude indexing of external workspaces. See the docs at the top of refresh_compile_commands.
Otherwise: If you'd like to explore ways it could be made faster or help, read on!
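For context on what the preprocessing pass is extracting: the per-file header list is just a Makefile-style dependency rule, the same format `clang -MM` emits and compiler-written .d files contain. A small parsing sketch (the helper name and header-extension list are mine, and backslash-escaped spaces in paths aren't handled):

```python
def parse_dep_rule(rule_text):
    """Parse a Makefile-style dependency rule, e.g. the output of
    `clang -MM foo.cc` or the contents of a .d file, and return the
    header paths it lists.

    Input looks like: "foo.o: foo.cc foo.h \<newline> bar.hpp".
    """
    # Join backslash-continued lines, then drop the "target:" prefix.
    flat = rule_text.replace('\\\n', ' ')
    _, _, deps = flat.partition(':')
    # Keep only likely header files; the first dependency is usually
    # the source file itself. (Extension list is a guess, not exhaustive.)
    return [p for p in deps.split()
            if p.endswith(('.h', '.hh', '.hpp', '.inc'))]
```

This is the information a faster path (Bazel's build graph, cached .d files, or a future clangd index) would need to supply to replace the preprocessing step.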