Double disk de-duplication on --system pull #2342

uajain · 2018-11-21T14:02:15Z

This PR tries to address the issue : ostreedev/ostree#1723
(ideally we wanted to move this issue inside Flatpak's umbrella but it's seems that it's not possible). So please refer to the discussion there. Most of moderate to high-level details inside commit messages too.

Summary of changes:

Let system-helper create a child repo in $SYSTEM_REPO/tmp. For convenience put this child repo in a cache dir (that has the convention flatpak-cache-$ref-XXXXXX)
FlatpakSystemHelperOngoingPull's struct member root_dir points to this flatpak-cache-$ref-XXXXXX (because it's the root owned directory)
FlatpakSystemHelperOngoingPull's struct member user_dir points to this another directory that gets created flatpak-cache-$ref-XXXXXX-bindfs. We will bindfs mount this directory and give it back to the client to do the pull in the child repo. bindfs lets client (non-sudo) write to this child repo. While the actual file creation is being done in the root_dir folder. As long as the bindfs mount is up, the client is able to write(or pull) to the child repo.
When the child pull is finished, we unmount bindfs mount. In case of failure, we force unmount things or kill the bindfs subprocess. The idea to use a FUSE filesystem is basically that it is in user space and can be spawned into a separate child process(which can be killed).
Make the local "trusted" pull. That should use hardlinks and thus double disk situation is avoided.
In case of error in mounting directories or pull failures, we the system-helper's CancelPull method, to cleanup.
If bindfs is not found on the system, we fallback on the current --system pull behaviour (that would use double-disk space).

I have been testing this since couple of weeks now. It has worked well in my testing locally. Though I am having troubles on how to write rigours tests for this feature (any help is welcome here).

rh-atomic-bot · 2018-11-21T14:04:39Z

Can one of the admins verify this patch?
I understand the following commands:

bot, add author to whitelist
bot, test pull request
bot, test pull request once

uajain · 2018-11-26T14:11:15Z

@alexlarsson @matthiasclasen
Not sure if you have got any time to try this PR out, but some initial feedback regarding the approach will be very helpful. Do you see any major "holes" in the approach that I can continue to improve on?

common/flatpak-dir.c

system-helper/flatpak-system-helper.c

alexlarsson · 2018-12-06T11:08:16Z

Ok. New approach. First of all, we don't need bindfs, we can use rofiles-fuse as well, which is shipped with ostree. Secondly, we create a user(+default group) for flatpak (normally in the flatpak package install). Then, we run rofiles-fuse with -o allow_other to allow other users to access the mountpoint (relies on this being allowed by the sysadmin in /etc/fuse.conf). Then we protect the various directories from access by the wrong users by using regular filesystem permissions.

Here is an example (starting in a root shell):

# adduser flatpak

# mkdir /var/lib/flatpak/repo/tmp/pull_dirs_hidden
# chown flatpak.flatpak /var/lib/flatpak/repo/tmp/pull_dirs_hidden
# chmod 0700 /var/lib/flatpak/repo/tmp/pull_dirs_hidden

# mkdir /var/lib/flatpak/repo/tmp/pull1
# chown flatpak.flatpak /var/lib/flatpak/repo/tmp/pull1
# chmod 0770 /var/lib/flatpak/repo/tmp/pull1

# su flatpak
$ mkdir /var/lib/flatpak/repo/tmp/pull_dirs_hidden/pull1
$ mkdir /var/lib/flatpak/repo/tmp/pull1/repo
$ ostree init --mode=bare-user-only --repo=/var/lib/flatpak/repo/tmp/pull_dirs_hidden/pull1/repo
$ ostree remote add --repo=/var/lib/flatpak/repo/tmp/pull_dirs_hidden/pull1/repo  --no-gpg-verify flathub https://dl.flathub.org/repo/
$ rofiles-fuse -o allow_other /var/lib/flatpak/repo/tmp/pull_dirs_hidden/pull1/repo /var/lib/flatpak/repo/tmp/pull1/repo
$ exit

# chown 1000.flatpak /var/lib/flatpak/repo/tmp/pull1

Now, in a uid 1000 shell you can do:

$ ostree --repo=/var/lib/flatpak/repo/tmp/pull1/repo pull flathub app/org.gnome.frogr/x86_64/stable

Back again as in root shell:

# umount /var/lib/flatpak/repo/tmp/pull1/repo
# killall -9 rofiles-fuse   # This is a bit harsh, should kill only the right process
# rm -rf var/lib/flatpak/repo/tmp/pull1
# chown -R root.root /var/lib/flatpak/repo/tmp/pull_dirs_hidden/pull1/repo # Should fix permissions too

Now we can merge the root-owned, repo from /var/lib/flatpak/repo/tmp/pull_dirs_hidden/pull1/repo into /var/lib/flatpak/repo. All access to it via the rofiles-fuse mount should be impossible, and only the flatpak user can access it via the other path, so that should be trustworthy.

Does this seem safe ok?

rh-atomic-bot · 2018-12-06T15:01:45Z

☔ The latest upstream changes (presumably e14af35) made this pull request unmergeable. Please resolve the merge conflicts.

cgwalters · 2018-12-06T15:36:46Z

I have only sort-of been following this work, but just a quick comment:

we can use rofiles-fuse as well

One note; so far rofiles-fuse was not written with a security posture in mind, rather just protecting against accidental corruption by build/package scripts. That said it's a very small codebase.

Longer term...it feels like what we really want is to potentially support privilege-separated fetches in libostree upstream (think having an API over a socketpair like "does the repo have object X" and "write object X" - I think that'd be close to all we'd need).

alexlarsson · 2018-12-07T08:06:36Z

@cgwalters Yeah, but we're running rofiles-fuse as a non-privileged user, so I don't think its a huge risk. In fact, the reason i prefer it over bindfs is that its smaller and simpler, so less risky.

The reason we're using it is really not anything that rofiles-fuse implements, but the fact that it gates access to the repo files via a resource we control (the fuse process). Killing that process is a robust way to implement revoke() on an entire directory tree.

wjt · 2018-12-07T11:07:11Z

Longer term...it feels like what we really want is to potentially support privilege-separated fetches in libostree upstream (think having an API over a socketpair like "does the repo have object X" and "write object X" - I think that'd be close to all we'd need).

I think this API would run aground against @alexlarsson's earlier point:

So, some questions wrt allowing the user to write files owned by root: This means a user can override the reserved-space-for-non-root setting.

Does the current verify-and-copy scheme have the same issue, though? I guess not: although there's no requirement that the temporary repo path passed to Deploy() is on the same filesystem as the system installation, you do have to pass a polkit check before anything gets copied. The same is true for this branch: you have to pass a polkit check to call GetChildRepoForPull.

wjt · 2018-12-07T17:10:33Z

@alexlarsson Sounds safe to me. Re. "Should fix permissions too", I guess you mean "remove any permission bits outside 0775" even if this invalidates the content hash, on the theory that if the files had such permissions, Flatpak sets BAREUSERONLY_FILES and bareuseronly_dirs they would be rejected/silently ignored anyway. If that's right, I think it might be better to fail the Deploy() operation if such forbidden permission bits are found, to give better errors? (Hmm, do we need to tidy away any xattrs, too?)

I haven't looked at the recent changes

alexlarsson · 2018-12-10T07:25:02Z

@alexlarsson Sounds safe to me. Re. "Should fix permissions too", I guess you mean "remove any permission bits outside 0775" even if this invalidates the content hash, on the theory that if the files had such permissions, Flatpak sets BAREUSERONLY_FILES and bareuseronly_dirs they would be rejected/silently ignored anyway. If that's right, I think it might be better to fail the Deploy() operation if such forbidden permission bits are found, to give better errors? (Hmm, do we need to tidy away any xattrs, too?)

What i mean was that we canonicalize file permissions in according to the current rules at commit time. (I.e. matching what can be represented by bare-user-only). This means for instance that a ostree object with world-writable permissions would be changed to not be world-writable, this will change the checksum and fail the checksum when commiting, so it will report an error. You mean we should fail this with a different error than EWRONGCHECKSUM, which i guess makes sense. However, we need to make sure we don't abort at the first such case, because that could leave other word-writable directories around, which we need to avoid.

alexlarsson · 2018-12-14T10:38:10Z

So, @uajain asked me on irc about the "track when its mounted" part, and i forgot about that. So here goes:

First of all, we want to use core kernel features, not call out to some ipc service that polls some file that some other thing writes or whatever, that will never be robust in face of things like namespaces with different mount tables and whatnot.

Secondly, the correct way to know something is mounted is when the mount() syscall returns. This happens in the fuse filesystem implementation, and this is signaled to the parent by the launched fuse-fs process exiting. So, all we have to do is launch it (without the foreground option) and then wait for it to exit. This is proper anyway, because that way we can track any errors happening up to (and including) the mount syscall and handle it correctly.

Of course, the problem with this is that we were using the foreground flag for a reason. Without it its hard to ensure the fuse filesystem properly goes away. Unfortunately there is no way for rofiles-fuse currently to return the child pid for the fuse process. There are a few options here:

Modify rofiles-fuse to return the child pid. Unfortunately this happens in fuse_daemonize() deep in libfuse, so that is a bit hard.
Kill it some other way than the pid. I was thinking of using process groups for this, but it turns out that fuse_daemonize() already creates a new process group, so we can't use that.

Also, one has to track the exiting of the fuse process. If we know the pid we can use PR_SET_CHILD_SUBREAPER to inherit the process when the intermediate child dies and then get sigchild. However, if we don't know the pid its hard to know what sigchild to wait for. In that case we'd have to use some other way. Maybe pass an pipe fd to the child and poll for the other side closing it (when it dies) for instance.

alexlarsson · 2018-12-14T10:44:39Z

I think the easiest way is to make rofiles-fuse report the pid somehow (e.g. to a pipe fd passed in as a commandline arg). We can use /proc/$pid/task/$pid/children to get the child pid, which is a bit hacky, but a lot easier than changing libfuse.

alexlarsson · 2018-12-14T10:56:49Z

Actually, i take that back. It should be fairly easy to make rofiles-fuse get the real pid. All you need to do is instead of using fuse_main() make your own version of fuse_main() that calls fuse_setup(), fuse_loop(), and fuse_teardown_common(). At this point you know the mount has succeeded by the time fuse_setup() returns. You can then hack this by passing -f as and extra argument, and then reimplementing fuse_daemonize() ourselves before calling fuse_setup().

alexlarsson · 2018-12-14T11:00:48Z

Actually, I think there is an even simpler way. We launch rofiles-fuse with -f --signal-mounted-on-fd=$fd, and then we reimplement fuse_main() and put a write(signal_mounted_on_fd, "x", 1) after fuse_setup() returns.

alexlarsson · 2018-12-14T11:02:59Z

In this case we would create a pipe before spawning rofiles-fuse, and then not continuing until its written too.

Actually, we should probably signal the parent by closing the fd instead of writing to it, because then that also happens if rofiles-fuse returns early due to some error.

alexlarsson · 2018-12-14T13:01:14Z

So, @wjt brought up reservations about relying on the user_allow_other global setting on irc. And i agree.

The ideal approach would be to run the fuse filesystem as the target user. That doesn't work because the user can't write to the filesystem though (that is the point after all). So, as a different approach I propose a new fuse filesystem "revokefs", that basically takes the rofiles-fuse code, removes the existing can't-write-to-hardlink stuff, and then makes all the operations that write stuff and implement them out of process via messages over a unix socket.

The idea here is that you can spawn the fuse filesystem as the target user, and the writing part as the flatpak user, and then you can pass the socket fds between these at setup. You then have the repo readable to the user, but only writable to the flatpak user, and once you shut down the write socket (by e.g. killing the write process, or just calling shutdown() on the socket fd that you kept around) you've essentially revoked all write access to the user.

The messages between the processes can be very simple. I propose using a SOCK_SEQPACKET for easy segmentation and a simple header for the arguments + the raw data.

alexlarsson · 2018-12-14T23:21:49Z

The revoke via privilege separation fuse idea seemed so simple and useful that I had to try it.
Initial version in https://github.com/alexlarsson/flatpak/tree/revokefs
I think this should work well enough for hosting an ostree repo on, but I've not tested that. In fact it is very lightly tested, and needs some solid review. On the other hand, the security is enforced by the kernel (via different uids), so the code that needs reviewing is quite small and simple.

alexlarsson · 2018-12-14T23:44:18Z

Interesting with revokefs-fuse is that we can just have the system helper spawn the backend (as some other user) and then pass the socket to the flatpak client which can then spawn the entire fuse filesystem wherever it wants as the user.

alexlarsson · 2018-12-15T07:58:39Z

Hmm, we need to add some permission filtering to revokefs too, otherwise the user could create writable files in the backing directory, or files setuid to the flatpak user.

uajain · 2018-12-15T08:38:23Z

Hmm, is it really dangerous? Given the files will be owned by flatpak user and before deploying, flatpak_canonicalize_permissions will act upon all these files to clear the setuid bit.

alexlarsson · 2018-12-15T16:00:14Z

The entire point of revokefs is lost if the user can change the base directory to be writable by the user...

alexlarsson · 2018-12-17T07:40:03Z

(oh, and setuid flatpak means you can turn into the flatpak user and get write rights, which similarly is bad)

uajain · 2018-12-17T07:59:16Z

(oh, and setuid flatpak means you can turn into the flatpak user and get write rights, which similarly is bad)

hmm, I see. Maybe my definition of "Trust"(-ed user) as not as rigid as yours. Anyways today, I am learning more about FUSE in general and revokefs. I need to fillup my knowledge holes in order to make progress on this PR. :) hah!

alexlarsson · 2018-12-17T08:08:26Z

Well, the point is it you have direct write rights (be it by just changing a directory permissions and writing there, or making an binary that does writes a file and mark it setuid flatpak and run it), then you can have an outstanding write file-descriptor to some object file by the time you tell the system-helper to merge the repo. Since we're not longer making a copy of the file, this means the file-descriptor you have is still going to be valid and useful after the file has been chowned to root and put into place in the system repo, and you can now modify the system repo however you want.

alexlarsson · 2018-12-17T08:11:04Z

The revokefs name comes from the revoke syscall, btw: https://lwn.net/Articles/192632/

If we had this we wouldn't need all these workarounds. Unfortunately linux doesn't have revoke().

alexlarsson · 2018-12-17T08:33:59Z

Also, if we didn't care about the revoke issue we would just let the user write to files as himself and then move+chown them before the import.

wjt · 2019-01-11T15:12:11Z

I have attempted to summarize the problem, the design of the current proposed solution, and the past solutions we've discussed/explored and decided against, on https://github.com/flatpak/flatpak/wiki/Noncopying-system-app-installation .

uajain · 2019-04-09T13:04:41Z

Closing in favor of #2657 (merged)

uajain mentioned this pull request Nov 21, 2018

Updates can hang the system due to IO #2322

Open