
Balance performance on small files #108

Open
progtologist opened this issue Sep 19, 2020 · 3 comments

@progtologist

I am running mergerfs.balance on a filesystem made up of lots of big files but even more small ones (source code, potentially even compile artifacts).
The moment balance starts to move the small files, the whole process turns into unbelievably slow torture: CPU utilization jumps to 100% and disk I/O drops to almost 0%. It's been 2 days now and barely 10GB have been moved!
I switched from CPython to PyPy3 to see if that would improve things; I think it helped slightly, but not by a huge margin. Is there something I could do to help this process? Is there some logic that, if added to the script, would improve the performance of small-file transfers?
E.g. use os or shutil to check whether a folder consists of a large number of small files, then tar them all, move the tarball, and extract it at the new target (rough sketch below)?
If I implemented something like that and filed a PR, would it be of interest?
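
Something like this rough sketch is what I have in mind (the size threshold, helper names, and rsync flags are made up purely to illustrate the idea; none of this is from mergerfs.balance):

```python
import os
import subprocess
import tarfile

SMALL_FILE_LIMIT = 1 * 1024 * 1024  # arbitrary: treat files under 1MiB as "small"

def mostly_small_files(dirpath):
    """True if every regular file directly inside dirpath is 'small'."""
    sizes = [e.stat().st_size for e in os.scandir(dirpath)
             if e.is_file(follow_symlinks=False)]
    return bool(sizes) and all(s < SMALL_FILE_LIMIT for s in sizes)

def tar_move(src_dir, dst_dir):
    """Pack src_dir into one tarball, move that single file, unpack at the target."""
    tar_path = src_dir.rstrip("/") + ".tar"
    with tarfile.open(tar_path, "w") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    # One rsync call moving one big file instead of one call per small file.
    subprocess.run(["rsync", "-a", "--remove-source-files", tar_path, dst_dir],
                   check=True)
    dst_tar = os.path.join(dst_dir, os.path.basename(tar_path))
    with tarfile.open(dst_tar) as tar:
        tar.extractall(dst_dir)
    os.remove(dst_tar)
```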

@trapexit
Owner

The tool mostly just walks over the filesystem and calls rsync to copy files (I could certainly recreate rsync's behavior, but rsync is well trusted). What kind of system do you have?

Small files will always be higher cost. You could add a file-size filter, which may help (the paths still all have to be walked). A better solution, which is already planned, is to decide what to move where all at once, write some temp files, and then use --files-from (sketched below). That will limit tree walking and rsync executions, but it is more complicated to do.
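
Roughly like this (just a sketch of the idea, not the planned implementation; the helper names and the extra rsync flags are illustrative, --files-from itself is the point):

```python
import os
import subprocess
import tempfile

def plan_moves(src_branch):
    """Walk the tree once and collect relative paths to move
    (the actual filtering/placement logic would go here)."""
    paths = []
    for root, _dirs, files in os.walk(src_branch):
        for name in files:
            paths.append(os.path.relpath(os.path.join(root, name), src_branch))
    return paths

def move_batch(src_branch, dst_branch, relative_paths):
    """One rsync invocation per source/target pair instead of one per file."""
    with tempfile.NamedTemporaryFile("w", delete=False) as listing:
        listing.write("\n".join(relative_paths) + "\n")
        listing_path = listing.name
    try:
        subprocess.run(["rsync", "-a", "--remove-source-files",
                        "--files-from=" + listing_path,
                        src_branch + "/", dst_branch + "/"],
                       check=True)
    finally:
        os.unlink(listing_path)
```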

@progtologist
Author

The system is an Intel i7 2760QM (mobile chip) with 16GB of DDR3 RAM. The disks are 8TB and 12TB WD drives, low RPM but capable of 150MB/s sequential read/write. They are all connected through a Dell PERC H310 flashed to IT mode.
To my understanding, it is not rsync that is causing the slowdown (100% CPU usage); if that were true, I would have seen similar issues with the large files (where rsync is doing all the heavy lifting). So it must be the tree walking in Python that is bringing the system to its knees.

@trapexit
Owner

trapexit commented Sep 19, 2020

It blocks on the execution of rsync, which should limit CPU usage... I can't even get close to 100% usage if I change rsync to "bash -c true", but maybe my system is just faster. Regardless, the change I described is basically a full rewrite.
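
For reference, the per-file behaviour is basically a blocking call like this (simplified, flags illustrative, not the literal code in the tool), so the Python side should mostly sit idle while rsync runs:

```python
import subprocess

def copy_one(src_path, dst_branch):
    # Blocks until this rsync finishes; the tree walk only resumes afterwards.
    subprocess.run(["rsync", "-a", "--remove-source-files", src_path, dst_branch],
                   check=True)
```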
