Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetchGit/fetchTree: add support for shallow cloning #9376

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

DavHau
Copy link
Member

@DavHau DavHau commented Nov 18, 2023

This PR should mainly serve as a POC and a basis for discussions.

It implements a significantly more efficient way of fetching git sources (100x in many cases), hidden behind a flag. But rather than implementing this as an optional feature it would be much better if this became the default as part of the stabilization work on fetchTree. If I get green light by the nix team I'm happy to adapt these changes and make it the default for fetchTree.

fixes #5119
fixes #4455 (This was closed, but not fixed, probably due to a misunderstanding, see the 'shallow = true' option confusion paragraph below)

Motivation

TL;DR;

  • Reduce the required download size of some git sources by more than 99.7%.
  • Make fetchGit achieve similar performance as the github tarball feature but without depending on github for it

In lang2nix scenarios it is common to have lock files with entries pointing to git repos. Usually git revisions sha1 hashes are used to pin down such sources and verify their integrity. Fetching these lock file closures via nix is currently rather expensive, because the fetchGit/fetchTree builtins do not support shallow fetching for most git repos.

For example fetching a single revision of the php static analyzer phpstan currently requires fetching 4.17 GiB from the network, while with this PR, it only requires fetching 13.40 MiB, which is a reduction by ~99.7%.

The 'shallow = true' option confusion

The current fetchGit/fetchTree builtin already offers a shallow flag, but this only allows fetching remote shallow repos. It does not support shallow fetching from non-shallow remote repos. This has been confusing to a number of users, because they expected the shallow flag to behave similar to a nix clone https://example.com --branch $ref --depth 1. It's not obvious at first glance, but this behavior wouldn't make much sense. It would not be reproducible, because the selected $ref is not static and might progress over time on the remote. The number for --depth {number} would have to be increased over time in order to still reach the same desired revision.

Fixing the problem

In order to really do a shallow fetch from a non-shallow remote, we have to fetch shallowly by revision ignoring any ref. Git supports this since version 2.5, but the nix builtin doesn't support it yet, as it always enforces fetching via ref (nix will automatically select a ref if the user didn't specify one). This PR fixes that by introducing a new option shallowRev = true, which enforces ref being unset and makes git fetch only the specified revision using fetch --depth 1 on the specific revision sha1 hash.

The way forward

Shallow cloning results in the exact same output in the nix store compared to full cloning. The contents can still be verified either via the revision sha1 hash, or a narHash. The only problem why shallow fetching can not become the default right now is because it cannot reliably return a revCount. As a result, it would break the API.

Therefore, my question to the nix team: What was the original motivation behind having a revCount? It doesn't seem to provide any guarantees. It is not needed for verifying the file contents either. The simplest fix seems to deprecate revCount in fetchTree, or make it optional. Is this change feasible? This way, shallow cloning can be made the default.

Considerations

There is no guarantee if a remote supports or allows shallow cloning by revision. Therefore if shallow cloning became the default, a fallback to full cloning should be implemented which triggers whenever shallow cloning fails.

Also, the current POC implementation of this PR does not cache the git repos, which doesn't hurt mach, as the amount of data transmitted is already much lower than before. But caching could be added here as well. Even if all fetching is done shallowly, caching can still be beneficial, because two different revisions can still contain identical files, in which case re-downloading these git blobs could be omitted. Though, I'm not sure how smart the git protocol really is in omitting unnecessary objects during transfer. Maybe it's not worth it. I can look into this more.

Priorities

Add 👍 to pull requests you find important.

@github-actions github-actions bot added with-tests Issues related to testing. PRs with tests have some priority fetching Networking with the outside (non-Nix) world, input locking labels Nov 18, 2023
@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/stabilising-the-new-nix-command-line-interface/35531/6

@roberth
Copy link
Member

roberth commented Nov 19, 2023

Therefore, my question to the nix team: What was the original motivation behind having a revCount? It doesn't seem to provide any guarantees. It is not needed for verifying the file contents either. The simplest fix seems to deprecate revCount in fetchTree, or make it optional. Is this change feasible? This way, shallow cloning can be made the default.

I believe revCount was added to provide an easy human friendly number to expressions that wish to use it.

Another alternative is to fetch all the commits without fetching the corresponding tree. See #5313 (comment)

Also we might not need to remove it, if we add support for laziness in libfetchers. Laziness there could take the shape of a method that enriches the attrset based on which attribute was demanded. E.g. Attrs InputScheme::fetchAtLeastAttribute(Attrs alreadyLocked, string attribute) (modulo architecture changes), without the requirement to update only that attribute.

@DavHau
Copy link
Member Author

DavHau commented Nov 19, 2023

I believe revCount was added to provide an easy human friendly number to expressions that wish to use it.

OK, but what's the use case for it? Just being friendly doesn't seem to be a strong reason to have it.
If the purpose is to generate a version suffix, then the first few digits of the rev can be used instead.

Another alternative is to fetch all the commits without fetching the corresponding tree. See #5313 (comment)

I already experimented with that, but this approach seems to be problematic. A simple --filter=blob:none is not enough as that omits all blobs, but we need some blobs. Therefore a second fetch to get the required blobs is needed, leading to strage behavior:

# init new repo
grmpf@grmpf-nix /t/dream2nix (main)> rm -rf .git
grmpf@grmpf-nix /t/dream2nix> git init
grmpf@grmpf-nix /t/dream2nix (main)> git remote add origin https://github.com/nix-community/dream2nix

# fetch only the commit objects to get the rev count
grmpf@grmpf-nix /t/dream2nix (main)> git fetch origin 8db4cb11214ec2c415c85632461da0083f3b0a7a --filter=tree:0
[...]
grmpf@grmpf-nix /t/dream2nix (main)> git rev-list --count 8db4cb11214ec2c415c85632461da0083f3b0a7a
1960

# revCount seems fine, now fetch the tree and blob objects for that same revision, so we can access the files
grmpf@grmpf-nix /t/dream2nix (main)> git fetch origin 8db4cb11214ec2c415c85632461da0083f3b0a7a --depth 1
[...]

# now the revCount is broken again
grmpf@grmpf-nix /t/dream2nix (main)> git rev-list --count 8db4cb11214ec2c415c85632461da0083f3b0a7a
1

(I tested also fetching first the blobs and then the commit history, but that leads to the same result)
The reliability of revCount in git seems flaky. I'm not sure if it's a good idea to rely on it. But maybe libgit2 will have different behavior.

@DavHau
Copy link
Member Author

DavHau commented Nov 19, 2023

I need to correct myself. I actually did find a solution to get the full commit history but only blobs of the given revision without breaking revCount:

# init new repo
grmpf@grmpf-nix /t/dream2nix (main)> rm -rf .git
grmpf@grmpf-nix /t/dream2nix> git init
grmpf@grmpf-nix /t/dream2nix (main)> git remote add origin https://github.com/nix-community/dream2nix

# fetch only the given revision shallowly including blobs
grmpf@grmpf-nix /t/dream2nix (main)> git fetch origin 8db4cb11214ec2c415c85632461da0083f3b0a7a --depth 1
[...]
grmpf@grmpf-nix /t/dream2nix (main)> git rev-list --count 8db4cb11214ec2c415c85632461da0083f3b0a7a
1960

# now --unshallow, but only fetch the commit history
grmpf@grmpf-nix /t/dream2nix (main)> git fetch origin 8db4cb11214ec2c415c85632461da0083f3b0a7a --filter=tree:0 --unshallow
[...]

# revCount is correct
grmpf@grmpf-nix /t/dream2nix (main)> git rev-list --count 8db4cb11214ec2c415c85632461da0083f3b0a7a
1960

@tomberek
Copy link
Contributor

If the purpose is to generate a version suffix, then the first few digits of the rev can be used instead.

revCount increases in a predictable way, characters from rev are random.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-11-20-nix-team-meeting-minutes-105/35902/1

@thufschmitt
Copy link
Member

Discussed in the Nix maintenance meeting.

The tentative plan forward is written down in #9402

@DavHau DavHau marked this pull request as draft December 23, 2023 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fetching Networking with the outside (non-Nix) world, input locking with-tests Issues related to testing. PRs with tests have some priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nix doesn't fetch repositories shallowly when told to Allow performing shallow clones with builtins.fetchGit
5 participants