
Git, GitDB, and GitCmdObjectDB give different results when counting objects. #765

Closed
ali1234 opened this issue Jun 5, 2018 · 54 comments


ali1234 commented Jun 5, 2018

  1. Use git to count the objects in a repo:
$ git rev-list --objects --all | wc -l
6990030
$ git rev-list --objects --all | sort | uniq | wc -l
6990030
  2. Parse the output from git rev-list --objects --all, fetch each object with name_to_object, and count each type:
Commits: 909667, Tags: 2469, Trees: 4178263, Blobs: 1899631
  3. Query what is ostensibly the same information using git-python:
import argparse
import pathlib

import git

def log(testname, a, b):
    print(testname, ':', a, b)

def main():

    parser = argparse.ArgumentParser(description='Git x ref.')
    parser.add_argument('repository', metavar='repository', type=pathlib.Path,
                        help='Path to Git repository.')

    args = parser.parse_args()

    repos = [
        git.Repo(str(args.repository), odbt=git.GitCmdObjectDB),
        git.Repo(str(args.repository), odbt=git.GitDB)
    ]

    log('size()', *[r.odb.size() for r in repos])
    log('len(sha_iter())', *[sum(1 for x in r.odb.sha_iter()) for r in repos])
    log('len(iter_trees())', *[sum(1 for x in r.iter_trees()) for r in repos])


if __name__ == '__main__':
    main()

Result:

size() : 3839 8268978
len(sha_iter()) : 3839 8268978
len(iter_trees()) : 568851 568851

So:

Git thinks there are 6,990,030 objects in the database.
GitDB thinks there are 8,268,978.
GitCmdObjectDB thinks there are 3,839.

Git thinks there are 4,178,263 trees in the database.
Both GitDB and GitCmdObjectDB think there are 568,851.


Byron commented Jun 6, 2018

@ali1234 Could you tell me the repository you are looking at? I would like to run https://github.com/Byron/git-count on it, assuming it's as fast as C would be, based on libgit2 which should be the best implementation out there to access git object databases.
I am aware that writing something in Rust is not addressing the problem directly, but it might indeed be a suitable workaround. One could possibly just use libgit2 from python directly, too. For now I just want to see what the counts are, using libgit2.
Thank you


ali1234 commented Jun 6, 2018

This is my .git/config:

[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
[remote "linux-stable"]
	url = git:https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
	fetch = +refs/heads/*:refs/remotes/linux-stable/*
[remote "linux-stable-rc"]
	url = git:https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
	fetch = +refs/heads/*:refs/remotes/linux-stable-rc/*
[remote "linux-next"]
	url = git:https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
	fetch = +refs/heads/*:refs/remotes/linux-next/*
[remote "linux-next-history"]
	url = git:https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next-history.git
	fetch = +refs/heads/*:refs/remotes/linux-next-history/*
[remote "android-common"]
	url = https://android.googlesource.com/kernel/common
	fetch = +refs/heads/*:refs/remotes/android-common/*
[remote "android-mediatek"]
	url = https://android.googlesource.com/kernel/mediatek
	fetch = +refs/heads/*:refs/remotes/android-mediatek/*
[merge]
	renamelimit = 12000


ali1234 commented Jun 6, 2018

Also this is my python object counter. It takes 2m35.187s to run (down from an initial 3m14.257s) and produced the above output. I would be interested to compare the speeds, but I can't get your rust code to compile because it needs experimental features.

import argparse
import pathlib
from collections import defaultdict

import git
from gitdb.util import hex_to_bin


def main():

    parser = argparse.ArgumentParser(description='Git x ref.')
    parser.add_argument('repository', metavar='repository', type=pathlib.Path,
                        help='Path to Git repository.')

    args = parser.parse_args()

    repo = git.Repo(str(args.repository), odbt=git.GitCmdObjectDB)

    typecount = defaultdict(int)
    for line in repo.git.rev_list('--objects', '--all').split('\n'):
        binsha = hex_to_bin(line.split()[0])
        oinfo = repo.odb.info(binsha)
        typecount[oinfo.type] += 1
    print(', '.join('{:s}s: {:d}'.format(k.decode('utf8').capitalize(), v) for k, v in typecount.items()))


if __name__ == '__main__':
    main()


ghost commented Jun 6, 2018

@ali1234 The compilation issue might stem from an old rustc version. Rustc 1.26 is the one I use, and I manage it with Rustup.
Which OS are you on? I am sure I can send you a binary if you are willing to trust me that it's actually built with the code you see in the repository.


ghost commented Jun 6, 2018

Here is the result on my system (OSX) - run right after clone

➜  linux-stable.git git:(master) time git-count
commits: 878995, trees: 4025161, blobs: 1832388, tags: 2387, any: 0, unknown: 0
git-count  369.99s user 10.72s system 98% cpu 6:27.88 total


ghost commented Jun 6, 2018

And here is the result of invoking the script you provided on my system:

git-count git:(master) ✗ time PYTHONPATH=$PWD/../GitPython:$PWD/../GitPython/git/ext  python3 count.py linux-stable.git
Commits: 878995, Tags: 2387, Trees: 4025161, Blobs: 1832388
PYTHONPATH=$PWD/../GitPython:$PWD/../GitPython/git/ext python3 count.py   240.43s user 66.76s system 114% cpu 4:27.86 total
➜

The conclusion I draw is that the default object database implementation is correct, and probably that the GitDB implementation should be deprecated permanently.

We might also note that libgit2 is likely still slower than the native git implementation, assuming that the git command-line does not use libgit2 under the hood by now.
It is also quite interesting to see that the python implementation is indeed the clear winner here.


ali1234 commented Jun 6, 2018

I get different numbers from git-count and my script.

git-count:

commits: 1069996, trees: 4943730, blobs: 2252783, tags: 2469, any: 0, unknown: 0

real	14m9.732s
user	14m7.562s
sys	0m1.664s

and for my script:

Commits: 909667, Tags: 2469, Trees: 4178263, Blobs: 1899631

real	2m36.989s
user	2m37.376s
sys	0m29.017s


ghost commented Jun 6, 2018

It's odd that the numbers don't match, as this was consistent for me.
Also git-count seems unusually slow - it should just be limited by libgit2. Did you compile it with the --release flag (i.e. cargo build --release)?


ali1234 commented Jun 6, 2018

6m32.922s for the release build.


ghost commented Jun 6, 2018

If Rust + libgit2 cannot beat the python + cgit team, then I don't know what will!
Good that at least on my machine the results are consistent. As the Rust version pins all dependencies, could it be that the difference stems from the git binary? I use git v2.17.1.


ali1234 commented Jun 7, 2018

Same git version here, on Ubuntu 18.04.


Byron commented Jun 10, 2018

I realized that the only way to fix GitPython is not to use it, or at least stick with the proven cgit implementation.
In an attempt to eventually remedy the situation and offer alternatives, I decided to implement the whole thing in pure Rust.

Here is the first result:

➜  git-odb git:(master) time ./target/release/examples/git-count /Users/byron/dev/GitPython/linux-stable.git/objects/pack/pack-930627d122f4fd6a6e203b61f58ea4ad441da724.idx /Users/byron/dev/GitPython/linux-stable.git/objects/pack/pack-930627d122f4fd6a6e203b61f58ea4ad441da724.pack
commits: 804501, trees: 170520, blobs: 94462, tags: 2389, deltas: 5667694
./target/release/examples/git-count    1.32s user 0.35s system 99% cpu 1.671 total

Please note that the comparison is probably unfair, as the program just does a lookup by offset, not by SHA1, which is pretty much as efficient as it gets.

I am also declaring this issue as 'wont-fix' as the only 'working' git database is the GitCmdDb, which seems to be doing OK.
Please correct me if I am wrong, but if not it would be nice if the issue could be closed.
Thank you


ali1234 commented Jun 10, 2018

I can't understand how to compile that, but I got different results from git-count and git itself, and I don't even know which one of them is correct.


Byron commented Jun 10, 2018

You can compile it with cargo run --example git-count --release -- <args>.
It has the same result as git verify-pack, as far as the counts go.

I think it would be good if the problem you are trying to solve were stated as well. That would make it easier to determine which algorithm is preferable.

Generally I would think it's best to build the count on top of an algorithm you trust, and the one using git directly and GitPython for lookups is certainly the one I would trust, assuming that I like what git outputs in the first place.


ali1234 commented Jun 10, 2018

I think in the past week I have spent more time explaining to people what I am doing than writing code. :)

The problem: given a list of remote repositories and an arbitrary tree which does not exist in any of them, calculate the list of parent commits that will produce the merge commit with the smallest possible diff when committing the tree with commit-tree.

The algorithm (steps 2 and 3 are sketched in code after the list):

  1. First invert the whole repository graph:
    a. Create a defaultdict mapping binsha to lists.
    b. For every blob and tree in the repo, append its list to the lists of its children.
    c. For every commit, add its binsha to the list of its tree.
  2. Build a bitmap mapping commits to blobs:
    a. Create a defaultdict mapping binsha to bitmaps of length equal to the number of blobs in the input tree.
    b. For every blob in the input tree, look it up in the inverted repo graph and follow the references to the leaf nodes, which are commits. This gives a list of every commit containing the blob.
    c. For every commit in the list, set the bitmap equal to 1 for the current blob.
  3. Find the best commits:
    a. Create a bitmap of length equal to the number of blobs, set entirely to True, called "unfound"
    b. For every commit, AND with unfound, count the number of 1 bits, and yield the one with the most.
    c. Subtract the best commit from unfound, ie unfound &= ~best
    d. Repeat until you run out of commits or blobs, ie best & unfound == 0.
  4. Actually make the commit:
    a. git checkout the first commit produced by the above algorithm.
    b. Copy the input tree over the workdir.
    c. Make a temporary commit and then run git cat-file -p HEAD to get the tree's binsha.
    d. Run git commit-tree -p <parent-sha1> -p ... <tree-sha1> listing every commit output by the algorithm as a parent, including the first one used in step 4a.
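
A condensed sketch of steps 2 and 3 in Python, assuming a backrefs mapping built as in step 1 and using the bitarray module (mentioned later in this thread) for the bitmaps; commits_for_blob and best_parents are illustrative names, not the actual gitxref code:

from collections import defaultdict

from bitarray import bitarray  # any bit-set type with &, ~, count() would do


def commits_for_blob(backrefs, blob_binsha):
    """Step 2b: follow the nested parent references from a blob down to the
    leaves of the inverted graph, which are commit binshas."""
    commits, seen, stack = set(), set(), [backrefs.get(blob_binsha, [])]
    while stack:
        for entry in stack.pop():
            if isinstance(entry, list):     # a reference to a containing tree's list
                if id(entry) not in seen:
                    seen.add(id(entry))
                    stack.append(entry)
            else:                           # a leaf: the binsha of a commit
                commits.add(entry)
    return commits


def best_parents(backrefs, wanted_blobs):
    def empty_bitmap():
        b = bitarray(len(wanted_blobs))
        b.setall(False)
        return b

    # Step 2: one bitmap per commit, one bit per blob of the input tree.
    commit_bits = defaultdict(empty_bitmap)
    for i, blob in enumerate(wanted_blobs):
        for commit in commits_for_blob(backrefs, blob):
            commit_bits[commit][i] = True   # step 2c

    # Step 3: greedy set cover over the blob bitmaps.
    unfound = bitarray(len(wanted_blobs))
    unfound.setall(True)                    # step 3a
    while unfound.any() and commit_bits:
        best = max(commit_bits, key=lambda c: (commit_bits[c] & unfound).count())
        if not (commit_bits[best] & unfound).any():
            break                           # step 3d: no commit covers anything new
        yield best                          # step 3b
        unfound &= ~commit_bits[best]       # step 3c
        del commit_bits[best]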

The list of remote repositories: #765 (comment)
The input tree: https://github.com/dguidipc/gemini-android-kernel-3.18/tree/56760a6e806bb4399d70626dd2e6cf22f7c9e9c1/kernel-3.18
The code: https://github.com/ali1234/gitxref/
The resulting merge: https://github.com/ali1234/linux-gemini/tree/20ba641d006ed0335de2c06ecd075f04d4f7065a

Now you are probably going to ask "why not just do X?" (everyone else did). The answer is speed. This algorithm takes around 8 hours and 10GB of RAM to run. There are 54,000 blobs in the input tree and 900,000 commits in the upstream repos. One algorithm that has been suggested is:

for each commit in the remotes:
    recurse through the tree and for each blob:
        check blob against the input tree

This is not practical, because walking the tree of an arbitrary kernel commit takes approximately 2 seconds. That means this algorithm will take approximately 3 weeks to complete.

Another method which has been suggested:

for each blob in the input tree:
    git describe <sha1 of blob>

This is slightly better but still much worse than my approach. git describe with an arbitrary blob sha1 takes about 5 seconds to run, so this will take approximately 3 days to complete. And when it does, it will only have told me ONE commit containing the blob, instead of ALL of them which is what I want.
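
A back-of-envelope check of those two estimates, using the numbers quoted above:

print(900_000 * 2 / 86400)   # ~20.8 days of walking every commit's tree at ~2 s each, i.e. about 3 weeks
print(54_000 * 5 / 86400)    # ~3.1 days of `git describe` at ~5 s per blob, i.e. about 3 days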

So now that you hopefully fully understand what I am doing and why, let's look at the actual problem:

In step 1b I wrote "for every blob and tree in the repo..." and in 1c I wrote "for every commit...". For speed, I do this with a single loop which just looks at every object in the repo, as sketched below.
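
A minimal sketch of that single pass with GitPython, assuming the nested-list backrefs layout from step 1; the real gitxref code differs in detail:

from collections import defaultdict

import git


def build_backrefs(repo):
    """Step 1: one pass over every object in the database."""
    backrefs = defaultdict(list)
    for binsha in repo.odb.sha_iter():
        otype = repo.odb.info(binsha).type          # b'commit', b'tree', b'blob' or b'tag'
        if otype == b'tree':
            tree = git.Tree(repo, binsha, path='')  # path given so the tree can be iterated
            for child in tree:                      # immediate subtrees and blobs
                # 1b: the child's list gains a reference to its parent tree's list
                backrefs[child.binsha].append(backrefs[binsha])
        elif otype == b'commit':
            # 1c: the commit's root tree gets the commit binsha as a leaf entry
            backrefs[git.Commit(repo, binsha).tree.binsha].append(binsha)
    return backrefs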

And the problem I have is that the various tools do not agree on the number of objects in the repo.


ghost commented Jun 10, 2018

Thanks so much for sharing, @ali1234 !
If I understand the problem correctly, the program reproduces the most efficient merge-commit which yields the least amount of difference to the commits it is merging. Or in other words, it finds a 'historically suitable' place for any tree given a repository. I would assume that if the tree is a tree of the repository's history, it would detect that single commit, too.

Also this problems hits a weak spot of mine, as for some reason, I really like to make things suitably fast.

Looking at the algorithm, there is the part that builds the lookup table (backrefs), and the one that does the lookup. The table seems valuable enough to warrant writing it to disk.

If I were to implement grit far enough to generate the lookup table, do you think that would already help you?
Maybe it would be better to get something naive working first (like the brute-force examples you provided) to verify the algorithm is correct, before introducing lookup tables though.

For that I think what I should do first is to write a version of the program that uses libgit2 for everything. Then there is some value in rewriting it using grit as backend.

What do you think?

Regarding your question, which I understand as "do I actually get to see all commits": I think it all depends on whether this git command does what you need. The GitPython based object lookup would just fail if there was a problem.
My implementation of it would certainly just iterate all objects directly, and avoid the SHA1 based lookup altogether (which doesn't seem to be possible with libgit2, and we know it's slower than the current version in that regard).


ali1234 commented Jun 10, 2018

If I understand the problem correctly, the program reproduces the most efficient merge-commit which yields the least amount of difference the commits it is merging. Or in other words, it finds an 'historically suitable' place for any tree given a repository.

Yes, exactly. It's for when OEMs release tarballs with no history. I like that phrase, "historically-suitable", it captures what I am trying to do very well.

I would assume that if the tree is a tree of the repository's history, it would detect that single commit, too.

Yes. If the tree (or a superset of it) exists in the remotes, the algorithm should output one commit.

Looking at the algorithm, there is the part that builds the lookup table (backrefs), and the one that does the lookup. The table seems valuable enough to warrant it to be written to disk.

I already do this using pickle. Building the backrefs takes about 12 minutes with 8 "threads". Single thread version takes about 70 minutes. Loading it back in from disk takes about 90 seconds. It is 2GB on disk and 6GB in RAM. :)

What do you think?

If you can make a way to fetch a list of every commit containing a blob given its SHA1 which takes less than 1 second and I can call it from Python, then I am interested.

If you want to rewrite the whole thing in rust then go ahead - but I have no knowledge of rust programming at all.

Regarding git rev-list --objects --all: I'm not sure what it really does. I suspect that it does not see orphaned objects, while directly iterating the database will. That would explain why git-count saw more objects. However it cannot explain why GitCmdObjectDB saw fewer.

I don't think I want orphaned objects because presumably they do not have a useful history, and the whole point is to reconstruct history...


ali1234 commented Jun 10, 2018

PS thanks for taking an interest. Most people were just like "well, good luck with that" after I spent 15 minutes explaining the problem to them :)

I also met one other person trying to solve the same problem, with a different OEM tarball, but they took a totally different approach: https://framagit.org/GNUtoo/reconstruct-git-history


ghost commented Jun 10, 2018

If you can make a way to fetch a list of every commit containing a blob given its SHA1 which takes less than 1 second and I can call it from Python, then I am interested.

Perfect - this is a great starting point - it can create the lookup table in memory, especially since it is not taking too long given the data you kindly provided.

My plan is to

  1. create the 'commits-by-blob' program with libgit2
  2. boost grit to the point where it can do the same
  3. re-implement above program with grit to see speed boost
  4. rewrite gitxref in Rust or... mentor you doing it :) to get the ultimate performance.

The alternative program you linked to (I just had a very brief look) seems not suitable for huge repositories, but I may be wrong.
Also I find it very interesting to do any kind of work along the side-lines of the linux-kernel, and I love git, too :D. And efficiency! And... Rust :D.


ghost commented Jun 10, 2018

For 1., here is the first version: https://github.com/Byron/git-commits-by-blob.git .
It's currently only taking commits reachable from head into account, but that can be fixed quickly.
Also it's extremely fresh, I am still checking out the linux-stable repo to do some testing on your typical datasets.
Will be back for more, soon.


ghost commented Jun 10, 2018

Intermediate results are in, and it becomes clear that I will have to look at how the LUT in python is really working. Right now even after processing about 5000 commits of linux-stable, the program is at a whopping 7GB real memory.
The code is simple, maybe you see what the problem is right away.


ghost commented Jun 10, 2018

[screenshot: screen shot 2018-06-11 at 00 09 03]


ali1234 commented Jun 10, 2018

Are you building a map of commit -> list of objects it contains?

That won't work... the result will be 900000 lists of ~54000 objects. You need a minimum of 3 bytes to uniquely identify all the binshas, so the result will be at an absolute minimum 150GB of data, not including datastructure overhead.

What I do is invert the graph... so where in git a tree has a list of subtrees and blobs, in backrefs every tree and blob has a list of the trees it appears in. And top level trees also have the binsha of the commit(s) they belong to. This is still much larger than the forward version that git stores. Trying to flatten it makes it grow exponentially.

I still had to use some tricks to make it fit in memory:

If tree A is in tree B, then A does not store the binsha of B in its list. Instead it stores a direct reference to the list representing B, which is another list of lists... and so on. The only data that gets stored is the binshas of commits. Everything else is lists of lists. This saved about 16GB of RAM, although curiously it had almost no effect on the on-disk size of the pickled data.

Another trick: Every binsha seen by the program is stored in a dict. Whenever it sees a new one it checks the dict and returns a reference to the existing one, instead of allocating new memory for the same string. This saved 2GB. The dict is thrown away once the backrefs table has been built.
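
The interning trick is essentially this (a sketch, not the actual gitxref code):

# every binsha seen goes through this table, so identical 20-byte strings
# are stored in memory only once; the table is discarded once backrefs is built
intern_table = {}

def intern_sha(binsha):
    return intern_table.setdefault(binsha, binsha)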

This seems quite slow too... after 24 minutes it has indexed 10000 commits (using 18GB), so the total expected run time would be about 1.5 days (using 1.6TB of RAM). I think this is because you are walking the commits, and then walking the tree of each commit. This is very slow as you are looking at the same trees over and over. But it is possible (actually very likely) that I do not understand this rust code at all. :)


ghost commented Jun 10, 2018

Thanks for your help! I couldn't sleep, so I got up and fixed the memory issue by throwing in a bitvec. Just pushed a version which will probably not have that memory issue! From what I see, for the 150k blobs on master, it should use just about 3GB of RAM (it now scales with the number of blobs only).
It's still slow at what it does as it indeed traverses everything over and over, but I think it should be super fast when looking things up.
What I like about this trivial algorithm is that it spends most of the time in libgit2, which I can use as a good benchmark for traversal performance.
Oh, and please try again and let me know what you think. My computer will keep running overnight, too - times tomorrow!


ali1234 commented Jun 10, 2018

If you are storing a bitvec/bitarray/bitmap of length total number of blobs for each commit, then you have gone directly to step 2. But the problem with that is my repo has over 1.8 million blobs and 0.9 million commits, so the full bitmap would be about 200GB - in memory or on disk. That's why I build the backrefs table (~6GB) and then build the bitmaps only for the 54,000 blobs I am interested in (another ~6GB).
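
The arithmetic behind those figures:

print(1_800_000 * 900_000 / 8 / 1e9)   # ~202 GB: one bit per (blob, commit) pair for the whole repo
print(54_000 * 900_000 / 8 / 1e9)      # ~6 GB: bitmaps only for the input tree's ~54,000 blobs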


ghost commented Jun 11, 2018

It's great to hear you got some additional speedups - not that it wasn't a challenge already :D!
Here is my result from the night - which was a bit shorter than I wanted, too ;).

➜  linux-stable.git git:(master) time echo | git-commits-by-blob $PWD
⠤ 394985/754092 Commits done; Table with 884224 blobs and a total of 20457642542 back-refs
^C
echo  0.00s user 0.00s system 34% cpu 0.003 total
git-commits-by-blob $PWD  17782.32s user 72.38s system 99% cpu 4:59:04.74 total

Memory consumption was 13.4GB after less than half of the expected blobs - it wouldn't fit!
For a moment I thought I should work with memory compression myself, but I see now that the inverse tree seems to be the only option!
Even if memory was not a problem, the incredible amount of work and time wasted by touching ... 20 billion objects (and causing whatever amount of memory leaks or allocator waste) is just not acceptable.
I believe tonight I will be back with the version that builds the inverse-tree, otherwise the issue won't let me sleep.


ghost commented Jun 11, 2018

Couldn't let it go - now there is a rather crude mapping of vertices (trees, commits or blobs) to their parents, all identified via OIDs.

➜  linux-stable.git git:(master) time echo | git-commits-by-blob $PWD
READY: Build cache from 754092 commits with table of 5259650 blobs and 179511902 refs
error: Failed to count git objects
Caused by:
 1: unable to parse OID - too short; class=Invalid (3)
echo  0.00s user 0.00s system 35% cpu 0.002 total
git-commits-by-blob $PWD  331.21s user 17.20s system 95% cpu 6:04.83 total

Max observed memory consumption was 8GB, so it fits already.
Improvements will clearly be multi-threading, as well as a more optimized storage of the graph.


ghost commented Jun 11, 2018

The current version should be usable for you - it does all the work in 6 minutes and has incredibly quick lookups afterwards. Memory usage, now slightly compacted, is a mere 8GB (6.2GB Real).
When using an index based lookup table, memory consumption should go down by 2GB or so (180_000_000 * 20) / 2.5, even though I am not sure if it's worth the additional complexity of the non-naive implementation.


ghost commented Jun 11, 2018

Ok, that's it :)! The current version implements a better compaction step which brings the overall memory consumption down to 5GB with a peak of the prior 8GB, even though it also increases the time it takes until ready by 3m40s.
It was quite fun to implement it so that no unnecessary work is done and so that there is at most a single extra parent vector allocated per edge during compaction, so as to avoid any kind of memory spike.

The next step I want to take is to multi-thread both compaction as well as lookup table generation. Thanks to Rust, this will be easy to do and guarantees that I won't run into race conditions.

Please let me know what you think :D.


ghost commented Jun 11, 2018

The last features... now you

  • have a proper CLI and
  • can skip memory compaction
  • and by default iterate all remote branches.

I am very curious how this affects your benchmark, and if it makes sense to walk the extra mile and implement the code which creates the most fitting merge-commit, too.
That would be interesting to me as it would push grit to dealing with Index files and writing trees and blobs.


ali1234 commented Jun 11, 2018

I've been implementing speed ups and multiprocessing code all day. My python now builds the LUT in 6 minutes (including 90 seconds to write the cache) vs 8 minutes for your rust.

However I just realised that actually, all we need to do is:

git cat-file --buffer --batch-check --batch-all-objects | grep -E \(tree\|commit\) | cut -d \  -f 1 | git cat-file --buffer --batch

This takes two minutes to run and is naturally multithreaded by nature of being a shell pipeline. It is just a matter of parsing the firehose quickly, because it contains exactly the minimum amount of data you can pull from the repo and nothing else. No need to implement multithreading inside the program. If written directly to disk it is 13GB.
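
A sketch of consuming that firehose from Python, assuming the pipeline above is piped into the script's stdin (gitxref's actual parser may differ):

import sys


def read_objects(stream):
    """Yield (hexsha, type, raw_data) from a `git cat-file --batch` stream."""
    while True:
        header = stream.readline()                  # b"<sha> <type> <size>\n"
        if not header:
            break
        hexsha, otype, size = header.split()
        body = stream.read(int(size))
        stream.read(1)                              # skip the newline that follows each object
        yield hexsha, otype, body


for hexsha, otype, body in read_objects(sys.stdin.buffer):
    pass  # parse tree entries or a commit's "tree " header line here and build the graph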


ghost commented Jun 11, 2018

Wow! That's great to hear! Who said a little competition ever hurt anyone ;)!
As I still don't know how exactly everything works, I only have the feeling that most of the time is still spent actually finding the right merge commit.
For that you said that if a blob-to-commits mapping could be done in a second or less, it would be interesting. This is what the Rust backend already does. Does that work for you?
Right now I see myself implementing the actual algorithm which uses the LUT, which will be valuable for grit anyway, just to see how libgit2 feels API-wise.

And just because I couldn't let you claim that 6min of multi-core python beat my 8 minutes of single-core Rust, please do try again with the latest version :) - it implements multi-threading for the LUT generation (without workstealing, yet, which I leave for another day).
Also by default memory is compacted, please try without this time as it's something you might not need and that python doesn't do either, like so:

# by default it uses the amount of physical cores, you can set the threads to any number with `-t N`
time echo | git-commits-by-blob $PWD --no-compact

The above yielded the following:

➜  linux-stable.git git:(master) time echo | git-commits-by-blob $PWD --no-compact
Didn't find a single remote - pushing head instead to avoid empty traversal
⠠ 328500 Commits done; reverse-tree with 2665938 entries and a total of 82562959 parent-edges
⠉ 376900 Commits done; reverse-tree with 3038963 entries and a total of 92885326 parent-edges
READY: Build reverse-tree from 754092 commits with table of 6091013 entries and 179793390 parent-edges
Waiting for input...
error: Failed to count git objects
Caused by:
 1: unable to parse OID - too short; class=Invalid (3)
echo  0.00s user 0.00s system 33% cpu 0.002 total
git-commits-by-blob $PWD --no-compact  425.55s user 15.80s system 178% cpu 4:06.66 total

4 minutes :)!


ghost commented Jun 12, 2018

I just let it run on a Core i7 laptop with 4 cores, and it takes 2min with compaction (and 7GB of RAM) and 1:30min without compaction (but 12GB of RAM).
There is a memory overhead when doing multithreading, currently.
Some obvious future performance gain could come from faster git interaction, which hopefully is what grit ends up providing.
So exciting :)!


ali1234 commented Jun 12, 2018

54000 lookups completed in 42 minutes.

time git -C repo/ cat-file --buffer --batch-check='%(objecttype) %(objectname)' --batch-all-objects | \
    grep ^b | cut -d \  -f 2 | head -n 54000 | \
    git-commits-by-blob --no-compact repo/ > /dev/null


ghost commented Jun 12, 2018

Thanks again! I never got to the point where I could actually test the lookup speed! 0.05 seconds for each lookup sounds a bit slow - I wonder if inefficient writing to stdout could be the problem.
Let me fix that :)


ali1234 commented Jun 12, 2018

It's actually very fast compared to my python. Of course more speed is always good. :)


ghost commented Jun 12, 2018

:D! It's always nice to hear that!
Please try again with the latest version - I think it wasted some time on IO.
But if not, then implementing it in pure Rust promises great gains. That way, there is no time spent on formatting and IO, and there is quite a huge opportunity for multi-threading, too.
Soon I should have a version which does the lookup and builds the table from commits to the blob-availability-bitmap.
It's quite intriguing to me because it's also my first time with Rust and multi-threading. That can be done without fear as it just won't compile if unsound.


ghost commented Jun 12, 2018

For me it ran in 57min (without compaction) and pv reported 35.4GiB 0:57:58 [10.4MiB/s] - I would expect some of that time to go down, especially when it's threaded.


ali1234 commented Jun 12, 2018

Not sure what you did but it is much slower now - like 100x slower. (Head currently at 55d9886).


ghost commented Jun 13, 2018

Oh, darn, a regression - no wonder given the way I test the performance currently. The given hash I couldn't find, unfortunately.
Maybe the 'optimization for the output' of commits wasn't working, but 100X slower ... odd.
By the way, I moved over to another repo: https://github.com/Byron/git-reconstruct .
The previous one is left with just the initial features that fit to git-commits-by-blob. Once I find the source of the regression, I will backport it.


ghost commented Jun 13, 2018

Ok, it's fixed. Problem was that the commit output buffer was not cleared in one particular case, thus piling up commits and increasing the time it took to output data.
Besides, of course, yielding the wrong result.


ghost commented Jun 13, 2018

The definitive version has landed: https://github.com/Byron/git-reconstruct . Please note that it drops the --no-compact flag, as it is not required anymore.
It only needs 3.6GB for the lookup table, and can do the lookups extremely fast.
The runtime is now 9min!

By now I have the problem that when I leave this project in an unsatisfying state, I have serious trouble thinking about anything else (or do the work I am paid for :D). It's great to work under pressure sometimes.

time git -C $PWD cat-file --buffer --batch-check='%(objecttype) %(objectname)' --batch-all-objects | \
    grep ^b | cut -d \  -f 2 | head -n 55000 | \
    git-reconstruct $PWD | pv > /dev/null
Didn't find a single remote - using head instead to avoid empty traversal
READY: Build reverse-tree from 754092 commits with table of 6091013 entries and 179793390 parent-edges
Waiting for input... B/s] [<=>                                                                                                                                                                            ]
36.1GiB 0:09:16 [66.4MiB/s] [                                                                                                                                 <=>                                         ]
git -C $PWD cat-file --buffer --batch-check='%(objecttype) %(objectname)'   0.89s user 1.49s system 0% cpu 9:04.42 total                               <=>                                                ]
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn} ^b  0.19s user 0.01s system 0% cpu 9:04.39 total
cut -d \  -f 2  0.18s user 0.01s system 0% cpu 9:04.38 total
head -n 55000  0.01s user 0.01s system 0% cpu 9:04.38 total
git-reconstruct $PWD  699.36s user 20.90s system 129% cpu 9:16.54 total
pv > /dev/null  3.21s user 9.67s system 2% cpu 9:16.54 total

Improvements can be made with unchecked array access and multithreading.


ghost commented Jun 13, 2018

Here are the final results. Multithreading for graph generation was removed due to memory constraints later on, and generation of bitmaps is what was taking most of the time for me.
Now we are back to 6min or so for graph generation, and it takes another 18min (single core) to tick the blob-bits per commit.
The last step of actually finding the best commits should be comparatively swift, but it's not implemented yet.

Do you think that's a decent time?

time  git-reconstruct $PWD ../linux-stable-workingtree
Didn't find a single remote - using head instead to avoid empty traversal
READY: Build reverse-tree from 754092 commits with graph with 6013742 vertices and 179511902 parent-edges
Ticked 61319 blob bits in 7505212535 commits
thread 'main' panicked at 'not yet implemented', src/main.rs:126:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.
git-reconstruct $PWD ../linux-stable-workingtree  1476.41s user 23.24s system 97% cpu 25:31.13 total

PS: I found the numbers quite impressive - after all it is seeing 7.5 billion commits in the process.


ali1234 commented Jun 13, 2018

That's a very impressive time.

Today I found an optimization. I don't know if you are already doing this. If a node has only one parent edge, then you can replace it in all children with its parent at no cost.

This Python code is called once for each node in the whole graph:

def list_opt(l):
    """Collapse chains of single-element lists: while an entry is a list with
    exactly one element, replace it with that element. Returns the number of
    edges removed."""
    count = 0
    for n in range(len(l)):
        while type(l[n]) is list and len(l[n]) == 1:
            l[n] = l[n][0]
            count += 1
    return count

It eliminates about 350 million edges, reducing memory use and halving the lookup time. Total run time for the python version is now at about 3 hours.


ghost commented Jun 13, 2018

I am happy to hear that! Right now I don't really have a comparison and rely solely on your judgement! Great work there with the optimization - 3h is impressive too!

I have tried to implement a multi-pass topology optimization as you suggested, but... there seems to be a problem with the output and so far I only see 3 million edges removed for the first pass.

time echo | git-reconstruct --cache-path lut.bincode.lz4  $PWD
Didn't find a single remote - using head instead to avoid empty traversal
⠁ Pass 1: 2989402 edges removed
  Pass 1: 2989402 edges removed
READY: Build reverse-tree from 754092 commits with graph with 6013742 vertices and 176522500 parent-edges
Saving graph...
Waiting for input...
error: unable to parse OID - too short; class=Invalid (3)
echo  0.00s user 0.00s system 35% cpu 0.003 total
git-reconstruct --cache-path lut.bincode.lz4 $PWD  405.02s user 11.55s system 98% cpu 7:00.78 total

What you see as well is caching, and it can now dump its graph to disk in an lz4 compressed binary format. It's ... fast, I didn't really see it happen. The resulting file weighs only 800MB on disk.
Loading failed though, maybe LZ4 has a problem with streaming, or ... bincode isn't doing as it should.

Will keep you posted, there is some time left today to fix this.


ghost commented Jun 13, 2018

Turns out I was able (and allowed) to forget calling .finish() on the lz4-stream. Now that it properly terminates its streams, I have got a decent compressor and decompressor, even though the speed at which the serializer/deserializer works leaves something to be desired. It takes 30s to load the graph, which should be less than 10s to feel reasonable. I believe it's the serialization format which should be chosen more wisely - one of these should be better. Maybe I should try the python pickle format :)!

Judging by these times it seems that either my way of simplifying the topology of the graph isn't really working, or that the changes don't make that much of a difference in my case. Maybe the code tells you something (please note that my data structure is not a hash-based tree).

Memory consumption by now goes up to 17GB (!! half way !!) for me, as I have to trade memory for performance - using hashmaps for commit-to-bits lookups is too expensive for me.
For some reason it seems it wouldn't be able to finish the operation anymore; maybe the topology fix I applied actually broke the graph, and for example causes trees to end up looking like commits for some reason.
Odd, but solvable I hope :D.
Next steps would be (besides fixing this) to implement the actual merge-commit generation.

Right now it seems you are outputting human readable text. Could you show me an example? Or maybe tell me what you would be going for?


ali1234 commented Jun 13, 2018

The output format is not particularly important. All it needs to do is print the set of commits that should be used in the merge commit. I made the commit manually.

For the datastructure stuff, remember the repository is a directed acyclic graph, not a tree. Sources are commits, sinks are blobs, everything else is a tree. Backrefs is the reversed graph, so blobs become sources and commits become sinks.

I use dict/hash to find nodes by SHA1 but they are not internal to the graph. Each vertex is simply a list of lists, except for commits which are just a binsha. The dict containing tree vertices is discarded at the end, as only the blob vertices need to be found by binsha.

While writing this post it occurred to me that, in the language of DAGs, the algorithm would be described like this:

  1. Reverse the repository graph.
  2. Remove all source vertices except the blobs found in the tarball.
  3. Compute the transitive closure of the resulting graph.

There is probably a much better algorithm for computing the transitive closure than walking from every source vertex.


ghost commented Jun 14, 2018

I use dict/hash to find nodes by SHA1 but they are not internal to the graph. Each vertex is simply a list of lists, except for commits which are just a binsha. The dict containing tree vertices is discarded at the end, as only the blob vertices need to be found by binsha.

Our data-structures are actually similar! Even though yours seems to store a little less - I for example keep everything I see during traversal, and thus can look up everything by sha. The reverse is also true, and the SHA can be obtained for every index. That might be an opportunity to save a few hundred megabytes, but won't speed up the operation.
That one optimization you do I will certainly try to apply again; now that I rolled back my version of it, the program works as before - 9.5GB resident size is what it takes, and by now 21min 30s when using the cache.

time echo | git-reconstruct --cache-path lut.bincode.lz4  $PWD ../linux-stable-workingtree

Loading graph...
Ticked 61319 blob bits in 7505212535 commits
unimplemented
echo  0.00s user 0.00s system 34% cpu 0.002 total
git-reconstruct --cache-path lut.bincode.lz4 $PWD ../linux-stable-workingtree  1125.54s user 19.59s system 89% cpu 21:26.05 total

Even though the graph lookup algorithm itself might be suboptimal, I have a feeling that any improvement will likely cost more memory than there is. But I am open to being corrected on that one :D!
One improvement I see possible is to parallelise the blob-to-commit lookup - it's a perfect opportunity for huge gains.

To be more of a feature-guide for grit, I will certainly implement generation of a merge-commit, which will require dealing with relative paths too. Currently I just save the memory :D.


ghost commented Jun 14, 2018

So amazing that fearless concurrency really is a thing in Rust - I just quickly put in concurrency for lookups, which are the most expensive part for me, and here is the timing:

time echo | git-reconstruct --cache-path lut.bincode.lz4  $PWD ../linux-stable-workingtree

Loading graph...
Ticked 61319 blob bits in 7505212535 commits
unimplemented
echo  0.00s user 0.00s system 32% cpu 0.002 total
git-reconstruct --cache-path lut.bincode.lz4 $PWD ../linux-stable-workingtree  1354.33s user 87.09s system 192% cpu 12:28.42 total

12min 30s! And with slight thread overallocation it even goes down to 12min :).


ali1234 commented Jun 15, 2018

6m25s with cached graph, numpy, and 1 thread:

Topological sort: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 53741/53741 [00:46<00:00, 1152.03 sources/s]
Making bitmaps: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1775132/1775132 [04:29<00:00, 6586.07 vertices/s]
	Command being timed: "python3 -m gitxref /home/al/gemini/kernel2/upstream/ /home/al/gemini/kernel2/kernel-3.18/"
	User time (seconds): 380.67
	System time (seconds): 4.70
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 6:25.43
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 14130952
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 3610107
	Voluntary context switches: 199
	Involuntary context switches: 2021
	Swaps: 0
	File system inputs: 0
	File system outputs: 8
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Building the cache takes 10 minutes, also single threaded. The external git pipeline is doing most of the work.

I am not entirely convinced it is still producing valid results but assuming it is, using this algorithm in rust should allow you to get below 1 minute pretty easily I think.


Byron commented Jun 15, 2018 via email


ali1234 commented Jun 15, 2018

Memory use is way up because intermediate bitmaps are cached in the new algorithm. backrefs.py is dead, replaced by graph.py. It works like this (a rough sketch follows the list):

  1. Each blob includes itself, so give it a bitmap with only itself marked.
  2. Do a topological sort, starting from each of the blobs you want to look up.
  3. For each vertex in the sorted list, OR its bitmap into each child's bitmap. If the child doesn't have a bitmap, give it a copy of the current vertex bitmap. If the child is a commit, OR the vertex bitmap into the result set/dict instead.
  4. Delete the current vertex bitmap.
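
A rough sketch of that propagation with numpy boolean arrays, assuming graph maps each vertex of the reversed graph to its children (the trees/commits that directly contain it), commits is the set of sink vertices, and order is the topological sort from step 2 - illustrative names, not the actual graph.py code:

import numpy as np


def propagate(order, graph, commits, wanted_blobs):
    """order: blob/tree vertices reachable from the wanted blobs, topologically
    sorted; commits are sinks and collect their bitmaps in `result` instead."""
    nbits = len(wanted_blobs)
    bitmaps = {}
    for i, blob in enumerate(wanted_blobs):
        bitmaps[blob] = np.zeros(nbits, dtype=bool)
        bitmaps[blob][i] = True                 # step 1: each blob marks itself
    result = {}                                 # commit binsha -> blob bitmap
    for vertex in order:                        # steps 2/3: walk in topological order
        bits = bitmaps.pop(vertex)              # step 4: drop the bitmap once processed
        for child in graph[vertex]:
            if child in commits:
                if child in result:
                    result[child] |= bits       # step 3: OR into the result dict
                else:
                    result[child] = bits.copy()
            elif child in bitmaps:
                bitmaps[child] |= bits          # OR into the child's existing bitmap
            else:
                bitmaps[child] = bits.copy()    # or give it a copy of ours
    return result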

Another side effect of this algorithm is you can partition the search. You could in theory do it with a step size of 1 and that would be the same as the old way. Or you can split it in half. This reduces the memory required. The code handles this, and theoretically the partitions can be done concurrently if you adapt the way temporary bitmaps are generated.

In order to do it all in one go it is necessary to reduce the graph. This is the part I am not sure is correct. Without reduction I have 350 million edges, and generating the bitmaps all at once would need 32GB of RAM. With reduction it fits in about 8GB.


ali1234 commented Jun 16, 2018

I've reimplemented the full output and it looks like the results are reasonable. One problem is that there are usually several commits that match the same number of blobs and the ordering is not stable. Numpy even sped up the last part of the algorithm a lot - it is much faster at bitwise operations on long arrays than the bitarray library.
