fast writes, slow reads with sshfs #664

Closed
chebee7i opened this issue Feb 16, 2016 · 29 comments

@chebee7i

This is a continuation of discussion that began in #660.

I am trying to create an archive, but the process slows down after handling a few large files. Borg's state is S+ and it appears to be waiting for something. In one case I let it wait for more than 4 hours and it did not resolve itself. Ctrl-C does nothing, and kill and kill -9 have no effect; afterwards the process is in state D+. Rebooting is the only way to stop the process.

Update:

There are no issues if I back up to a local location, with or without compression; for remote backups I have no choice but to use sshfs. When the backup begins, each file is chunked and sent over the wire, and this proceeds at a reasonable speed.

Things slow down whenever Archive.save() is called. In that method, Repository.commit() is called, which compacts the segments. Each segment has a number of objects associated with it, so for each segment we iterate over its objects via LoggedIO.iter_objects(). There is a while loop in there, and the first line of that loop is a call to LoggedIO._read(). Inside that method there is the line data = fd.read(length). This is what is stalling and, presumably, the source of the high %wa values. With nload I can see very regular downloads of data every ~12 seconds, but they are extremely small chunks and each burst lasts far less than a second. To give a sense of the slowdown for two object reads, I logged:

Read 774416 bytes in 83.613407 seconds (approx 74 kbps)
Read 1891722 bytes in 174.902586 seconds (approx 86 kbps)

This seems quite odd. Am I hitting a limitation of sshfs? It seems like it would be better to do this completely locally and then rsync the archive over.

Example: I capped my transfer rate to 5 Mbit/s and tried to send 571 MiB of data remotely. This took 40 minutes; roughly 16 of those were spent sending the chunks over the wire and 18 were spent inside the LoggedIO._read() method triggered by Archive.save().
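(Quick sanity check on the write phase: 571 MiB is roughly 4790 Mbit, and 4790 Mbit / 5 Mbit/s ≈ 960 s ≈ 16 minutes, so the chunk-sending part is essentially wire-limited; the extra ~18 minutes of reads are pure overhead.)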

chebee7i changed the title from "Backup does not finish, blocked or sleeping" to "fast writes, slow reads with sshfs" on Feb 16, 2016
@ThomasWaldmann
Member

In another ticket, there was the idea to separate compaction from archive creation.

@chebee7i
Author

Any guess as to why the read is taking so long (<100 kbps)? This isn't a general property of sshfs; normally I can read just fine at a decent speed.

@ThomasWaldmann
Member

Maybe it depends on read size? If there are small reads, overhead and latency might add up.

@chebee7i
Author

I'm not sure that can be the issue. In my comment above, the statement:

Read 774416 bytes in 83.613407 seconds (approx 74 kbps)

represents how long a single call to data = fd.read(length) took. I did a time.time() before and after and then printed the length of data, so this is a single read statement. The speed shown is abysmal, so much so that latency alone could not possibly explain it; it is also far below what my ISP supports and far below what a speed test reports while the borg backup is running.
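
Concretely, the timing was nothing fancier than this (a rough sketch of what I wrapped around the existing read, not the exact lines):

    import time

    start = time.time()
    data = fd.read(length)           # the single read inside LoggedIO._read()
    elapsed = time.time() - start
    kbps = len(data) * 8 / 1000 / elapsed    # effective throughput in kilobits/s
    print('Read %d bytes in %f seconds (approx %d kbps)' % (len(data), elapsed, kbps))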

Also, I can read both small and large files over an sshfs mount without using borg at all, at much, much higher bit rates. Something fishy is definitely going on, but I'm not sure how to debug this further, especially since I've narrowed it down to a single Python statement.

@ThomasWaldmann
Member

I've looked at the source and found this:

  • every 2nd read() is small (the 9-byte object headers); every other read is large (~chunk-size data)
  • for TAG_COMMIT it issues a zero-length fd.read(0)

This overhead could perhaps be reduced by reading the whole segment into a BytesIO object and then reading from that (i.e. from memory).
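
Roughly what I mean (just a sketch, not actual borg code; the helper name and path are made up):

    import io

    def load_segment(path):
        # Read the whole segment file once over the (slow) sshfs mount,
        # then serve the many small header/data reads from memory.
        with open(path, 'rb') as fd:
            return io.BytesIO(fd.read())

    # buf = load_segment('/path/to/repo/data/0/123')   # illustrative path
    # header = buf.read(9)        # 9-byte object header, now read from RAM
    # data = buf.read(length)     # chunk data, now read from RAM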

@ThomasWaldmann
Member

Could you try to change the open() call in get_fd() to:

            fd = open(self.segment_filename(segment), 'rb', buffering=512*1024)

Then test again to see if it makes any difference.

https://docs.python.org/3.4/library/functions.html#open
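
(The idea: with buffering=512*1024 the file object becomes a BufferedReader with a 512 KiB buffer, so the many tiny 9-byte header reads should be served from that buffer instead of each turning into a separate round trip to the sshfs mount.)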

@ThomasWaldmann
Member

Can you try this before Saturday noon?

I'd like to create 1.0.0rc2 over the weekend, and if the above helps, I'd like to include it.

@chebee7i
Author

I will try tonight and report back here.

@chebee7i
Author

So something external (my ISP or the storage host) has changed: things are much faster right now. I tried with and without your modification and the timings were about the same, but in both cases things were pretty quick and definitely faster than before. I'd have to do a much longer/larger test to know whether this really makes a difference, and unfortunately I won't be able to do so until tomorrow. I'll leave it to you to decide whether you want to include it.

@chebee7i
Author

I've attached two files, with and without the change you suggested. Previously, the checkpoint that I trigger with "-c 150" was taking forever, but now it is very quick. I terminate the backups after the first checkpoint. In both of these tests I am throttling the speed with trickle.

data_with.txt
data_without.txt

@ThomasWaldmann
Member

Hmm, this is not the scenario you had when you filed your issue, right?
Can you just test if it fixes the original problem?

@ThomasWaldmann
Member

To summarize the data_with vs. data_without (too much data in there):

without: checkpoint writing takes 16s
with: checkpoint writing takes 16s

@ThomasWaldmann
Member

Linux: Ubuntu 14.04 trusty  LibC: glibc 2.9

You have glibc 2.9? I'd have expected 2.19.

@ThomasWaldmann
Member

I hadn't read your earlier comment at first.

OK, if this can't be reproduced or pinned down right now, I'll keep the code as is for now.

@chebee7i
Author

Regarding glibc 2.9: could this just be the version that Python was compiled against? I installed Python 3.4 on Ubuntu using apt-get, and while platform.libc_ver() does return ('glibc', '2.9'), if I run lsof while Python is running I see the following file open: /lib/x86_64-linux-gnu/libc-2.19.so. That matches what I see with dpkg-query.

$ dpkg-query -l glibc* | grep un
un  glibc-2.19-1                                          <none>                                              <none>       (no description available)
un  glibc-doc                                             <none>                                              <none>       (no description available)

In other words, it's not clear to me how relevant the output of platform.libc_ver() actually is. The version is obtained by inspecting the contents of the executable, which was certainly not compiled on my machine.
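
For what it's worth, a quick way to cross-check (assuming Linux/glibc; nothing borg-specific, and the printed values are just what I'd expect on this machine):

    import ctypes
    import platform

    # platform.libc_ver() parses version strings out of the executable on disk:
    print(platform.libc_ver())             # ('glibc', '2.9') here

    # Asking the loaded C library directly reports what is actually running:
    libc = ctypes.CDLL('libc.so.6')
    libc.gnu_get_libc_version.restype = ctypes.c_char_p
    print(libc.gnu_get_libc_version())     # presumably b'2.19' here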

@chebee7i
Author

Here's a screen capture of nload during checkpointing. You can see that there are relatively large stretches of time where no reading at all is happening. I'm working on getting some more information on this right now.

nload.txt

@chebee7i
Author

I was able to do a more thorough test. I have a directory consisting of five 100 MiB files:

$ time scp -r data backup:
0.dat                               100%  100MB   1.6MB/s   01:03
1.dat                               100%  100MB   1.6MB/s   01:03
2.dat                               100%  100MB   1.6MB/s   01:03
3.dat                               100%  100MB   1.6MB/s   01:04
4.dat                               100%  100MB   1.6MB/s   01:03

real    5m55.487s
user    0m2.770s
sys     0m1.207s

So it takes about 6 minutes to scp these over. I then ran borg with and without the buffering argument to the open() call. The total backup time with buffering was 46 minutes; without it, 56 minutes. So there might be some benefit, but I'd want to repeat the test a few times to get a sense of the timing variability. Overall, there is clearly still a problem.

During checkpointing, read requests are sent to the remote server (via sshfs). These reads are blocking and extremely slow: the attached log files show the speed sitting somewhere around 10-100 KiB/s, but that is an average. The nload excerpt I attached above shows that the transfers are not sustained and happen in small bursts instead, so it's probably closer to 1 MiB/s for a short burst, then 0 for a long pause, and then the burst happens again.

Also, the blocking read() call causes a lot of problems for my system. While it is in progress (and only then), running, say, ls in another terminal frequently stalls; in one case the results did not appear for 19 seconds. Perhaps it would be better to do some sort of non-blocking read with a long timeout and put the whole thing in a loop? https://stackoverflow.com/a/1810703
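
For illustration, something in the spirit of that, though since regular files always poll as "ready" a loop over bounded reads is probably the more realistic shape (a sketch only; the helper name and chunk size are made up):

    def read_in_pieces(fd, length, chunk=256 * 1024):
        # Read `length` bytes in bounded pieces instead of one huge blocking
        # fd.read(length), so the process can at least react between pieces
        # and a stalled sshfs mount is easier to spot (e.g. by logging each piece).
        parts = []
        remaining = length
        while remaining > 0:
            piece = fd.read(min(chunk, remaining))
            if not piece:
                break  # unexpected EOF / truncated segment
            parts.append(piece)
            remaining -= len(piece)
        return b''.join(parts)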

test_20160228_with.txt
test_20160228_without.txt

Aside: I started a much larger test on Saturday as I left for the day. I did not check on the backup until 16+ hours later, but it still had not moved past the first 4 hours of progress. It was checkpointing, so my guess is that the blocking read call was stalled.

@ThomasWaldmann
Member

You used scp to write to remote.

Can you check cp to read from remote (via sshfs)?

@ThomasWaldmann
Member

@chebee7i
Author

Ahh. I think you've pinpointed the issue. It is definitely sshfs.

$ time scp -r backup:data tmp/
4.dat       100%  100MB   1.0MB/s   01:40    
3.dat       100%  100MB 984.6KB/s   01:44    
2.dat       100%  100MB   1.2MB/s   01:25    
1.dat       100%  100MB   1.0MB/s   01:38    
0.dat       100%  100MB   1.0MB/s   01:39    

real    8m7.398s
user    0m7.246s
sys     0m8.781s

But reading over the sshfs mount (here with rsync) is quite a bit slower, much closer to the observed 10 KiB/s. I don't expect it to finish anytime soon:

$ rsync -ah --progress /path/to/ssh-mount/data/ ~/tmp/
sending incremental file list
./
0.dat
          2.39M   2%   10.39kB/s    2:44:23  

@ThomasWaldmann
Member

OK, so this is an issue with sshfs and not borg.

Shall we add an FAQ entry about network filesystems and their quirks?

@chebee7i
Author

Yeah, sounds good.

Am I the only one experiencing this over sshfs? If it affects everyone, it pretty much means you cannot use sshfs with borg at all: you have to back up locally and scp the repo over, or you need borg available on the remote side. Either way, I cannot use borg right now.

I'll see if I can resolve my sshfs issues, but it would be nice to know whether others hit the same problem.

@jdchristensen
Contributor

I used to regularly use attic to back up local files to a remote repo accessed via sshfs. It was of course slower than running attic close to the remote disk, but it was still usable. Sometimes this was over a home wifi network, other times over a gigabit network.

@ThomasWaldmann
Member

BTW, in a comment (I linked to it above) someone points out that sshfs got slow on Ubuntu 12.04 and was much faster before. So maybe it is just that some sshfs versions are slow?

@jdchristensen
Contributor

For the record, I was mostly using Ubuntu 14.04 (sshfs 2.5-1ubuntu1, openssh-client 1:6.6p1-2ubuntu2.6) for the client and Debian wheezy (openssh-server 1:6.0p1-4+deb7u3) on the server.

@chebee7i
Author

chebee7i commented Mar 1, 2016

Great data point. I'll continue looking into this and report back here sometime later this week.

@ThomasWaldmann
Member

I have added a generic section to the FAQ and will close this now, as the issue is not within borg.

ThomasWaldmann added a commit that referenced this issue Mar 11, 2016
document how to debug / file filesystem issues, fixes #664
@ThomasWaldmann
Member

issue in sshfs issue tracker: libfuse/sshfs#11

@ThomasWaldmann
Member

@chebee7i Nikolaus Rath has trouble reproducing the slowness; can you discuss that with him? See the link in the previous comment.
