Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REPLCompletions spends a lot of time on statx syscalls (when crawling filesystem) #53153

Closed
topolarity opened this issue Feb 2, 2024 · 15 comments · Fixed by #53540
Closed

REPLCompletions spends a lot of time on statx syscalls (when crawling filesystem) #53153

topolarity opened this issue Feb 2, 2024 · 15 comments · Fixed by #53540
Labels
filesystem Underlying file system and functions that use it REPL Julia's REPL (Read Eval Print Loop)

Comments

@topolarity
Copy link
Member

It was observed in #52765 that we generate a lot of statx and readlink syscalls when scanning the filesystem.

Especially on WSL systems this kind of access pattern can be very slow:

julia> @elapsed dir = readdir("/mnt/c/Windows/System32/")
0.048656877

julia> paths = ["/mnt/c/Windows/System32/" * name for name in dir];
julia> @elapsed for path in paths
           isfile(path)
       end
14.98347244   # ~300x slower than the 50 milliseconds spent above

The example above is not contrived. WSL by default appends /mnt/c/Windows/System32/ to my PATH so that when REPLCompletions walks the PATH, it has to do pretty much exactly the above.

Thankfully the getdents64 syscall on many modern Linux filesystems should allow getting file vs. directory information for lots of files in one go, so there should be a pathway to optimize this.

@topolarity topolarity added the REPL Julia's REPL (Read Eval Print Loop) label Feb 2, 2024
@topolarity
Copy link
Member Author

topolarity commented Feb 2, 2024

To give a sense of how much this dominates our filesystem scan (which takes ~27s to scan the PATH on my machine):
image (7)

@topolarity
Copy link
Member Author

I also noted in #52765 (using strace -e getdents64 ./julia) that Julia seems to scan the current directory on every keypress of shell> ls ./dir1/dir2/... which seems unnecessary.

We should be able to scan once (or refresh on an interval) and then re-use the results for completion hints.

@IanButterworth
Copy link
Sponsor Member

Thankfully the getdents64 syscall on many modern Linux filesystems should allow getting file vs. directory information for lots of files in one go, so there should be a pathway to optimize this.

@vtjnash is that something that libuv can handle? Could we add something like readdir(path, filesonly=true) that does something more efficient than isfile on the list?

I wondered if the flags to uv_fs_scandir provide that option, but I can't see docs for that function here

@fatteneder
Copy link
Member

I wondered if the flags to uv_fs_scandir provide that option, but I can't see docs for that function here

Its here: https://docs.libuv.org/en/v1.x/fs.html#c.uv_fs_scandir_next
It says it works similar to Linux's scandir(), but there is also this warning here:

On Linux, getting the type of an entry is only supported by some file systems (btrfs, ext2, ext3 and ext4 at the time of this writing), check the getdents(2) man page.

@IanButterworth
Copy link
Sponsor Member

I'm referring to https://docs.libuv.org/en/v1.x/fs.html#c.uv_fs_scandir which seems to not have docs? Notably what the flags arg is.
My hope was that uv_fs_scandir or some dir scan initializer can be asked to set up the scan to just return files, so they don't have to each be checked.

@fatteneder
Copy link
Member

I think those two are meant to be used together: with uv_fs_scandir you can setup an iterator (the mentioned callback), and with uv_fs_scandir_next you can walk it. But then I did not try running it either, so I might be wrong.

Notably what the flags arg is.

Indeed, can't find them either.

@IanButterworth
Copy link
Sponsor Member

Sorry, I wasn't clear, both are used by readdir currently, hence why I'm asking about them specifically. Flags is just 0 here.

julia/base/file.jl

Lines 921 to 928 in 33d9e2a

err = ccall(:uv_fs_scandir, Int32, (Ptr{Cvoid}, Ptr{Cvoid}, Cstring, Cint, Ptr{Cvoid}),
C_NULL, req, dir, 0, C_NULL)
err < 0 && uv_error("readdir($(repr(dir)))", err)
# iterate the listing into entries
entries = String[]
ent = Ref{uv_dirent_t}()
while Base.UV_EOF != ccall(:uv_fs_scandir_next, Cint, (Ptr{Cvoid}, Ptr{uv_dirent_t}), req, ent)

@fatteneder
Copy link
Member

fatteneder commented Feb 15, 2024

Ok, now I see what you mean, sorry.

Grep-ing libuv doesn't show me any uses of getdents or getdents64 syscalls, so I guess that optimization would need to be added to libuv first?

@topolarity
Copy link
Member Author

The returned uv_dirent_t should already have the type information you need: https://docs.libuv.org/en/v1.x/fs.html#c.uv_dirent_t

libuv uses readdir and scandir which both use the getdents64 syscall under the hood

@IanButterworth
Copy link
Sponsor Member

IanButterworth commented Feb 16, 2024

Great point.
Which of these would be better?

files = readdir(; type=:file)

or

objects, types = readdir_types()

I'm leaning towards the latter because from some testing a symlink will return UV_DIRENT_LINK rather than follow the link.
So that gives the user the chance to, for instance:

for obj, type in zip(readdir_types())
    if type = :file || (type == :link && isfile(obj))
    	# do something with `obj` now we know it's a file 
    end
end

which still avoids isfile for known files.

@IanButterworth
Copy link
Sponsor Member

Alternatively

files = readdir(; filter=(obj, type) -> type == :file || (type == :link && isfile(obj)))

or special-case files which could do that under the hood

files = readdir(; files_only=true)

@fatteneder
Copy link
Member

fatteneder commented Feb 16, 2024

I'm leaning towards the latter because from some testing a symlink will return UV_DIRENT_LINK rather than follow the link.

I think the file system limitation #53153 (comment) still applies, and so you might prepare to also handle UV_DIRENT_UNKNOWN return values (https://github.com/libuv/libuv/blob/507f3046d104db2be2c33d5e8fc0d5322818a62c/include/uv.h#L1220) and then fall back to isfile.

I like readdir_types and the example you gave that avoids at least calling isfile in the user's code. (its avoided in either case, sorry).

@IanButterworth
Copy link
Sponsor Member

Ah so check :unknown too

for obj, type in zip(readdir_types())
    if type = :file || (type in (:link, :unknown) && isfile(obj))
    	# do something with `obj` now we know it's a file 
    end
end

@fatteneder
Copy link
Member

Ah so check :unknown too

No, I meant to use isfile inside readdir_types when the filesystem does not provide the type info, which makes it behave like the initial example.

Also: I think the type can be any of the enums here https://github.com/libuv/libuv/blob/507f3046d104db2be2c33d5e8fc0d5322818a62c/include/uv.h#L1220.
So what do we do about UV_DIRENT_FIFO, UV_DIRENT_SOCKET, UV_DIRENT_CHAR, UV_DIRENT_BLOCK?

@vtjnash vtjnash added the filesystem Underlying file system and functions that use it label Feb 17, 2024
@vtjnash
Copy link
Sponsor Member

vtjnash commented Feb 17, 2024

I wonder if we should have a version (readdirx? scandir?) that returns a vector of structs with the following accessible "fields":

  • .dir::String
  • .name::String
  • .path::String = joinpath(dir, name)
  • .rawtype::FileKind
  • .type::FileKind = rawtype == UNKNOWN ? trystat(path).type : rawtype (where trystat returns UNKNOWN if there is any IOError)

(The path and type fields would be computed values, instead of being actually stored in the struct.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
filesystem Underlying file system and functions that use it REPL Julia's REPL (Read Eval Print Loop)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants