Make compilecache atomic #30174

Closed
wants to merge 5 commits into from

tkf
Member

@tkf tkf commented Nov 28, 2018

Previously, the cache path was passed directly to the --output-ji option of the subprocess that generates the compile cache. However, when several processes are compiling the same package, one process may modify the file while another process is calculating the checksum, yielding a broken cache. This patch lets each process create the cache in a private temporary path and then atomically rename it to the final cache path.
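
The write-then-rename pattern described above can be sketched in shell. This is only an illustration of the general technique, not the code in this PR: `publish_atomically` and the `Example.ji` file name are hypothetical, and the sketch assumes a POSIX filesystem where renaming within one directory is atomic.

```shell
#!/bin/sh
# Sketch: write output to a private temporary file, then rename it into place.
# publish_atomically is a hypothetical helper, not part of Julia.
# Usage: publish_atomically FINAL_PATH COMMAND [ARGS...]
publish_atomically() {
    final_path=$1
    shift
    # Create the temp file next to the final path so the rename stays on one
    # filesystem; a cross-filesystem move would be a copy and not atomic.
    tmp_path=$(mktemp "${final_path}.XXXXXX") || return 1
    # Run the generator command with its stdout captured in the temp file.
    if "$@" > "$tmp_path"; then
        # Readers see either the old file or the complete new one.
        mv -f "$tmp_path" "$final_path"
    else
        rm -f "$tmp_path"
        return 1
    fi
}

# Demo in a throwaway directory; printf stands in for the real cache generator.
workdir=$(mktemp -d)
publish_atomically "$workdir/Example.ji" printf 'compiled-output\n'
cat "$workdir/Example.ji"
```

If several processes run this concurrently, each writes its own temporary file and the last rename wins, so no reader should observe a half-written file.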

Before this PR, I could consistently trigger "Warning: The call to compilecache failed to create a usable precompiled cache file for Example" with:

touch ~/.julia/dev/Example/src/Example.jl
for _ in $(seq 30)
do
    julia --startup-file=no -e "using Example" &
done
wait

With this PR it does not show any warnings (so far).

ref: https://discourse.julialang.org/t/precompilation-error-using-hpc/17094

@JaredCrean2
Contributor

Can I suggest adding an environment variable or command line option to specify the parent directory for mktemp(cachepath)? On some HPC systems, the filesystem where packages are installed may not be able to handle large numbers of concurrent writes.

@JaredCrean2
Contributor

On second thought, if all processes are going to be doing filesystem operations, then there should be a separate path for the cache files (both temporary and final).

@tkf
Member Author

tkf commented Nov 28, 2018

@JaredCrean2 FYI, you can set the environment variable JULIA_DEPOT_PATH to specify the directories to be used instead of ~/.julia:

https://docs.julialang.org/en/v1/stdlib/Pkg/#Glossary-1
https://discourse.julialang.org/t/where-or-how-to-edit-depot-path/13501
https://discourse.julialang.org/t/precompilation-error-using-hpc/17094/4
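
As a rough illustration of the JULIA_DEPOT_PATH suggestion, a job script could set the variable before starting Julia. `SCRATCH_DIR` and the `julia_depot` directory name are hypothetical placeholders. The sketch assumes the Unix convention of a colon-separated list whose first entry is the depot Julia writes to.

```shell
#!/bin/sh
# Sketch: point Julia at an alternative depot ahead of ~/.julia.
# SCRATCH_DIR is a hypothetical site-specific variable; /tmp is only a fallback
# so that the example runs anywhere.
alt_depot="${SCRATCH_DIR:-/tmp}/julia_depot"

# mkdir -p succeeds even if the directory already exists.
mkdir -p "$alt_depot"

# Entries are separated by ':' on Unix; the first one is the writable depot.
export JULIA_DEPOT_PATH="$alt_depot:$HOME/.julia"
echo "$JULIA_DEPOT_PATH"

# To inspect the result from Julia (not run here):
#   julia --startup-file=no -e 'println(DEPOT_PATH)'
```

Note that this relocates everything Julia writes to the first depot, not only the compiled cache files.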

specify the parent directory for mktemp(cachepath)

The purpose of this PR is to make cache file creation atomic. Using a different filesystem seems to defeat the purpose.

@tkf
Member Author

tkf commented Nov 29, 2018

(I meant to write [ci skip]...)

@JaredCrean2
Contributor

The problem is that JULIA_DEPOT_PATH determines two things: the location of the package source files and the location of the compiled files. HPC systems generally have two filesystems: a small, slower, persistent filesystem where users store source code and compiled objects, and a large, fast, non-persistent filesystem where running jobs write their output. Source code should obviously be stored on the persistent filesystem, so JULIA_DEPOT_PATH has to be somewhere there. Prior to this PR, this was not a problem for the cache files because, even if n julia processes are launched at the same time (where n is large), most of the processes get filesystem errors when they try to overwrite the cache file (the error reported on Discourse). As a result, the amount of disk space occupied by the cache file is small and the number of concurrent writes to the slow, persistent filesystem is limited.

The problem with this PR is that there will (temporarily) be n copies of the cache file, which might not fit on the small filesystem. Additionally, having all n processes write the temporary cache files simultaneously may cause the filesystem to freeze (for large n). The solution is to put at least the temporary cache files on the large, fast filesystem. You are probably right that moving files from one filesystem to another is non-atomic, so I suggest keeping both the temporary and final cache files on the large, fast filesystem. Currently, it is not possible to do this without setting JULIA_DEPOT_PATH to be on the large, fast filesystem, which is a bad idea because this filesystem is non-persistent (usually files older than xx days get deleted). Having to regenerate the cache files every xx days is not a problem, but having to replace source code is.

@tkf
Member Author

tkf commented Nov 29, 2018

How about creating a symlink ~/.julia/compiled -> /PATH/TO/SCRATCH/SPACE/julia_compiled (say)? You can put a small piece of Julia code in ~/.julia/config/startup.jl to re-create the /PATH/TO/SCRATCH/SPACE/julia_compiled directory if it is removed by the system. Note that this should be done in "easier to ask for forgiveness than permission" (EAFP) style to allow concurrent execution (i.e., use mkpath instead of an isdir-mkdir combo).
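
The symlink workaround could also be set up from a shell step (for example, at the top of a job script) rather than from startup.jl. The following is a sketch under that assumption: `ensure_compiled_link` is a hypothetical helper, the paths in the demo are throwaway stand-ins, and `mkdir -p` plays the role that mkpath plays in the Julia suggestion.

```shell
#!/bin/sh
# Sketch: make DEPOT/compiled a symlink to a directory on scratch space.
# ensure_compiled_link is a hypothetical helper.
# Usage: ensure_compiled_link DEPOT SCRATCH_TARGET
ensure_compiled_link() {
    depot=$1
    target=$2
    # mkdir -p does not fail when the directory already exists, so concurrent
    # callers do not race; it also re-creates a target purged by the system.
    mkdir -p "$depot" "$target" || return 1
    # An existing link (even a previously dangling one) is left alone.
    if [ -L "$depot/compiled" ]; then
        return 0
    fi
    # A real directory must be moved away by hand; do not touch it here.
    if [ -e "$depot/compiled" ]; then
        echo "refusing to replace existing $depot/compiled" >&2
        return 1
    fi
    # Try to create the link; if another process won the race, accept its link.
    # -n keeps ln from descending into a link that appeared in the meantime.
    ln -sn "$target" "$depot/compiled" 2>/dev/null || [ -L "$depot/compiled" ]
}

# Demo with throwaway directories standing in for ~/.julia and scratch space.
demo_root=$(mktemp -d)
ensure_compiled_link "$demo_root/depot" "$demo_root/scratch/julia_compiled"
ensure_compiled_link "$demo_root/depot" "$demo_root/scratch/julia_compiled"
ls -ld "$demo_root/depot/compiled"
```

Calling the helper twice in the demo shows that it is safe to run on every job start.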

But I agree having some way to configure the path ~/.julia/compiled would be great. Why don't you open an issue?

@JaredCrean2
Contributor

I created a new issue. Symlinking seems like a reasonable workaround (although an FAQ entry to warn people that the default configuration can be problematic in some cases might be warranted).

@jpsamaroo
Member

The problem with symlinking is that Windows users get left out in the cold (AFAIU, symlinks can't be established from user-mode programs in Windows).

@vtjnash
Sponsor Member

vtjnash commented Nov 30, 2018

There's already code to handle this in dump.c. We should probably move that logic here.

@tkf
Member Author

tkf commented Nov 30, 2018

@jpsamaroo The symlink is only relevant for HPC environments. I'm not sure Windows is used there. Besides, there are other workarounds. I mentioned them in the FAQ.

@vtjnash Or remove the temporary file creation code from dump.c? Either way, I think this issue should be fixed by 1.1, as it's very hard to use Julia in an HPC environment without atomic precompilation.

cache files. It may become a major issue in slow and/or small distributed
file systems. There are a few possible workarounds:

1. Use `julia` with `--compilecache=no` flag to turn off precompilation.

This option has been renamed: https://github.com/JuliaLang/julia/pull/23616/files

compilecache -> compiled-modules

Co-Authored-By: Charles Kawczynski <[email protected]>
@vtjnash
Sponsor Member

vtjnash commented Aug 5, 2019

Or remove the temporary file creation code from dump.c?

Yeah, sorry, that is what I meant.

@vtjnash vtjnash added the status:forget me not PRs that one wants to make sure aren't forgotten label Dec 17, 2019
@alyst
Contributor

alyst commented Jan 27, 2020

I hate to ping, but are there any updates on merging this?
Multiprocessing is unusable in 1.3.1 -- after addprocs() workers try to precompile and break the cache.

@vtjnash
Sponsor Member

vtjnash commented Feb 24, 2020

It appears that someone needs to pick this up and finish and/or re-write it.

@davidanthoff
Contributor

We would love to have this fixed for the VS Code Julia extension as well. We are getting a fair number of crash reports where users start two instances of VS Code that both trigger precompile of the core LanguageServer.jl packages, and then things crash.

@grahamas
Contributor

Multiprocessing is unusable in 1.3.1 -- after addprocs() workers try to precompile and break the cache.

You can work around this particular problem by

using PackageName
@everywhere using PackageName

@alyst
Contributor

alyst commented Mar 28, 2020

You can work around this particular problem by ...

I've realized that the real problem here is that the workers have different settings than the master (I'm working from Juno). So @everywhere using ... triggers recompilation of the package for the workers. And when I restart the session, the master sees that it needs to recompile the package, and so forth...
The issue that I just described is different from the one in this PR, but it also needs to be fixed: the workers should have exactly the same compiler settings as the master.

When I use Julia from the command line the problem goes away; apparently the master's and the workers' settings are in sync.

@kpamnany
Contributor

kpamnany commented Jul 8, 2020

This can be closed now I believe.

@KristofferC KristofferC closed this Jul 8, 2020
@tkf tkf deleted the atomic-compilecache branch July 8, 2020 19:42
@simeonschaub simeonschaub removed the status:forget me not PRs that one wants to make sure aren't forgotten label May 29, 2021