Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extremely slow on larger repos #89

Closed
asteele0 opened this issue May 25, 2023 · 9 comments
Closed

Extremely slow on larger repos #89

asteele0 opened this issue May 25, 2023 · 9 comments

Comments

@asteele0
Copy link

I work on a large repo, and I wanted to see if this tool could help my team. It seems like it would, but it is prohibitively slow to use.
Is this level of performance expected, or have I maybe set something up incorrectly?

Large repo

git-sim log -n 2: 24m 5s 7ms

git-sizer output

Processing blobs: 121739
Processing trees: 149846
Processing commits: 17891
Matching commits to trees: 17891
Processing annotated tags: 1491
Processing references: 12914

Name Value Level of concern
Biggest objects
* Trees
* Maximum entries [1] 15.9 k ***************
* Blobs
* Maximum size [2] 75.1 MiB *******
Biggest checkouts
* Maximum path depth [3] 17 *
* Maximum path length [4] 173 B *

Medium repo

I ran it on a medium-ish size repo as another reference point.

git-sim log: 1m 17s 949ms
git-sim log -n 2: 24s 134ms

These times are more acceptable, although still slower than I'd like.

git-sizer output

Processing blobs: 5497
Processing trees: 3449
Processing commits: 977
Matching commits to trees: 977
Processing annotated tags: 0
Processing references: 4399

Name Value Level of concern
Biggest objects
* Trees
* Maximum entries [1] 3.84 k ***
* Blobs
* Maximum size [2] 68.1 MiB *******

Dummy repo

As a control I ran it on the dummy repo generated from

git-dummy --name="dummy-repo" --branches=3 --commits=10

git-sim log: 5s 305ms

Then for funsies I created a few more dummy repos, running log on them:
git-dummy --name="dummy-repo" --branches=5 --commits=100: 5s 331ms
git-dummy --name="dummy-repo" --branches=50 --commits=100: 12s 852ms
git-dummy --name="dummy-repo" --branches=10 --commits=500: 5s 388ms

It's interesting, it looks like number of commits doesn't terribly affect the results, but the number of branches does.

Environment

Windows 10.0.19044 Build 19044
git-sim version 0.3.0
Python 3.10.5
manimce 0.17.3
git version 2.40.1.windows.1
Windows Terminal Version: 1.16.10261.0
@initialcommit-io
Copy link
Contributor

Hi @asteele0. Thanks for the detailed notes. Yes one downside of Git-Sim relying on Manim as a dependency for rendering the output images and videos is that its performance is rather unoptimized. This can result in longer render times for Git-Sim on repos that have large numbers of commits to traverse.

However, A LOT depends on the specific commit structure being traversed by a particular command. The way Git-Sim commit parsing words is it starts at the HEAD commit and works backwards recursively tracing parent commits and rendering them, until N commits have been parsed PER BRANCH. The default N is 5, but as you saw you can manually set it using the -n 2 flag, for example to only show 2 commits per branch.

That being said, showing only 2 commits should NOT take 24 mins... despite Manim's render time. It's possible there is a design flaw in my commit parsing algorithm that's causing this in large repos.

Any chance you can provide the screenshot Manim is outputting for the git-sim log -n 2 command? Also - any chance this is an open-source repo I could take a look at trying to reproduce? Or if not, can you try reproducing this in a large open-source repo so that I can test and play around with it?

@asteele0
Copy link
Author

asteele0 commented Jun 5, 2023

@initialcommit-io sorry for the delay.

Yeah, here's the image generated:
git-sim-log_06-05-23_14-10-43

Unfortunately I didn't copy it, but the execution time was much shorter this time. More like 12m.
This is probably just because of my computer, though.

I did notice that it generated a lot more files than I expected:

C:\source\git-sim_media\source\images
C:\source\git-sim_media\source\texts
C:\source\git-sim_media\source\videos
C:\source\git-sim_media\source\images\git-sim-log_06-05-23_14-10-43.jpg
C:\source\git-sim_media\source\texts\17dd8cd4858722ca.svg
C:\source\git-sim_media\source\texts\1bfa1e6971f55cab.svg
C:\source\git-sim_media\source\texts\4947dc0b08323edd.svg
C:\source\git-sim_media\source\texts\52d1ff1694365c2a.svg
C:\source\git-sim_media\source\texts\5c02b1dbfc9212fb.svg
C:\source\git-sim_media\source\texts\642cbb1c7323e76e.svg
C:\source\git-sim_media\source\texts\760beb94de3b39af.svg
C:\source\git-sim_media\source\texts\96f9676f36dcf0cf.svg
C:\source\git-sim_media\source\videos\1080p60
C:\source\git-sim_media\source\videos\1080p60\partial_movie_files
C:\source\git-sim_media\source\videos\1080p60\git-sim-log_06-05-23_13-41-04.mp4
C:\source\git-sim_media\source\videos\1080p60\partial_movie_files\Log
C:\source\git-sim_media\source\videos\1080p60\partial_movie_files\Log\1019697990_51972775_2762394053.mp4
C:\source\git-sim_media\source\videos\1080p60\partial_movie_files\Log\partial_movie_file_list.txt

Public repo

Unfortunately the repo is not open source, but I did run the same command on the vscode repo.
The execution time is comparable to my repo, so I think it's a good testing area.

git-sim log -n 2: 11m 57s 728ms

git-sim-log_06-05-23_14-31-55

vscode git-sizer output

Processing blobs: 353412
Processing trees: 958462
Processing commits: 115283
Matching commits to trees: 115283
Processing annotated tags: 50
Processing references: 1011

Name Value Level of concern
History structure
* Maximum tag depth [1] 2 *
Biggest checkouts
* Number of directories [2] 6.81 k ***
* Maximum path depth [3] 12 *
* Maximum path length [4] 127 B *

So you're saying that git-sim recursively parses n commits per branch, no matter what branch is checked out?
To me, and of course I don't know how any of it works, it seems that is doing entirely too much unnecessary work.

Naively, it seems that offloading the heavy tree parsing to git log (or some plumbing command, idk), and then just rendering the output would be the fastest solution?

Thanks for looking into this!

@initialcommit-io
Copy link
Contributor

@asteele0 Thank again for your thoroughness and clarity! I did some testing and found out the performance issue is a bug related to checking for remote tracking branch names to use as labels where appropriate. It was inefficiently repeating this code multiple times within loops which was taking forever on repos with many remote tracking branches, (i.e. the big repos you've been working on!)

I fixed it so that this remote-tracking-branch code runs only once, and pushed new version git-sim 0.3.1 (feel free to see here for the exact code change I made). You can test it out by running pip install git-sim --upgrade. It would be awesome if you can give it a try and make sure it works. Should be much faster now, but I'll keep this issue open until you confirm. I hope this makes git-sim usable for your team and use-case!

In case you're curious, here are some answers to some of the points you brought up:

So you're saying that git-sim recursively parses n commits per branch, no matter what branch is checked out?

No - git-sim starts at the HEAD commit and parses backwards recursively until n commits are reached. However, if along the path a commit with multiple parents is found, the recursive nature will keep parsing until up to n commits are drawn along each commit chain, by design. So n doesn't necessarily refer to the total number of drawn commits in the whole diagram, but the number of commits drawn per split chain, if that makes sense.

it seems that offloading the heavy tree parsing to git log...

This is basically the way git-sim works. In general the parsing doesn't eat up performance, it's the Manim rendering (especially when using the --animate flag for video) that comes with heavy performance load.

@asteele0
Copy link
Author

asteele0 commented Jun 6, 2023

Nice! That totally fixed it. I'll mark the ticket as closed, thank you!
git-sim log -n 2: 7s 435ms

It only took 11s to run git-sim log:
git-sim-log_06-06-23_12-25-27

[...]keep parsing until up to n commits are drawn along each commit chain

Ah, yeah that makes sense. I misunderstood what you were saying originally, sorry! 😅

This is basically the way git-sim works.

Awesome! Thanks for taking the time to explain, I appreciate it.

@asteele0 asteele0 closed this as completed Jun 6, 2023
@initialcommit-io
Copy link
Contributor

initialcommit-io commented Jun 6, 2023

@asteele0 Sweet!! Glad to hear that. It still takes longer than I'd like (esp on average processors), but unfortunately it's a bit out of my control due to the dependency on Manim at this time. I hope it's manageable though, and def shouldn't be taking MINUTES for rendering just a few commits (let me know if you run into anything else with execution time that bad...)

Where it does still take a prohibitively long time is if you use a custom higher value n like 15, combined with a heavily-merged repo due to rendering all the commit paths, and ESPECIALLY if you render video using the --animate flag. But for example I just tested git-sim log -n 15 on the vscode repo on my Desktop (13600k) and it took 31.65 seconds, which again isn't amazing but isn't horrible...

Anyway, if you end up finding good use cases for git-sim on your team I'd love to hear about them at some point!

@ehmatthes
Copy link
Contributor

ehmatthes commented Jun 11, 2023

Hi @initialcommit-io. I can open a new issue if you like, but the fix for this seems to have broken visualizations of merge operations:

$ git-sim merge mp34_popular_projects
Manim Community v0.17.3

Simulating: git merge mp34_popular_projects 
...
│ /Users/eric/projects/mostly_python/.venv/lib/python3.11/site-packages/git_sim/merge.py:68 in     │
│ construct                                                                                        │
│                                                                                                  │
│    65 │   │   for k, v in self.__dict__.items():                                                 │
│    66 │   │   │   print(k, v)                                                                    │
│    67 │   │                                                                                      │
│ ❱  68 │   │   if not self.is_remote_tracking_branch(self.branch):                                │
│    69 │   │   │   if self.branch in self.repo.git.branch("--contains", head_commit.hexsha):      │
│    70 │   │   │   │   self.ff = True                                                             │
│    71 │   │   else:
...
AttributeError: 'Merge' object has no attribute 'is_remote_tracking_branch'

My line numbers might not match yours because I had written some diagnostics into the merge.py file. Installing 0.3.0 fixes my issue.

Interestingly, this came up when drafting a newsletter post about newer Python projects that might be appropriate for people to start contributing to. I would be happy to consider helping set up some end to end tests if that would be helpful.

@initialcommit-io
Copy link
Contributor

@ehmatthes Oh, shoot. Yes you're absolutely right and thx for letting me know! I removed a method as a part of the fix and thought it wasn't used anywhere else... I will add this back in tonight and roll a new release with the fix.

And LOL that's a very suitable time to run into this error! Thanks for considering git-sim in your newsletter - I'd love to read it when you send it out.

And yes! It would be awesome to get some help setting up some end to end tests for Git-Sim! I've had this on my mind for months but haven't had the time. We have an existing issue for this (#55) but work has stalled. We can definitely pick it up there.

However, one thing I did do was create a bundled dependency called Git-Dummy (https://github.com/initialcommit-com/git-dummy) which should simplify the process of creating dummy repos in a desired state to run the end-to-end tests on and confirm an expected result...

@initialcommit-io
Copy link
Contributor

@ehmatthes Just pushed new version 0.3.2 with the fix for this. You can install it with pip install git-sim --upgrade.

Please test it out and let me know if the merge subcommand works now!

@ehmatthes
Copy link
Contributor

Yes, 0.3.2 works for me. I'm going to jump into that other thread about testing. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants