Noncopying system app installation

This page describes a proposed optimization.

Background

Today, unprivileged users install a Flatpak to the system repo as follows:

1. User creates temporary repo R in /var/tmp
2. User pulls Flatpak F into R
3. User calls Deploy(R, F, ...) on privileged flatpak-system-helper process
4. flatpak-system-helper pulls F from R to the system repo, verifying and copying each object
5. Once Deploy() returns, user deletes R

The copy in step 4 is undesirable:

- Installing an app temporarily needs twice the app's size in disk space, because the app is stored in full in both the temporary repo and the system repo until the temporary repo is deleted. With large apps or on space-constrained systems, the install can fail for lack of space even though the installed app alone would fit on the disk.
- All the app's files are written to disk twice. On slow spinning disks, or with large apps, this makes for a poor user experience.

The system helper cannot simply hardlink each object from the temporary repo into the system repo, because the unprivileged user may still have an open file handle on files in the temporary repo. This would allow a malicious user to modify root-owned files in the system repo, leading at best to repo corruption and at worst to privilege escalation. (On filesystems like btrfs which support reflinks, the objects' contents are reflinked rather than copied, avoiding the extra space usage and disk IO without introducing this security hole; unfortunately, ext4 does not support reflinks.)
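The hazard can be reproduced without any privileges. The following is a minimal sketch in Python, not Flatpak code: the file names are hypothetical stand-ins for an object in the temporary repo and its hardlink in the system repo, and a single process plays both the user and the helper.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for one object in each repo.
    temp_obj = os.path.join(tmp, "temp-repo-object")
    system_obj = os.path.join(tmp, "system-repo-object")

    # The "user" writes an object and keeps a writable descriptor open.
    user_fd = os.open(temp_obj, os.O_RDWR | os.O_CREAT, 0o644)
    os.write(user_fd, b"verified contents")

    # The "helper" verifies the object, then hardlinks it instead of copying.
    os.link(temp_obj, system_obj)
    # Deleting the temporary name does not close the user's descriptor.
    os.unlink(temp_obj)

    # The user overwrites the start of the shared inode through the old descriptor.
    os.lseek(user_fd, 0, os.SEEK_SET)
    os.write(user_fd, b"tampered")
    os.close(user_fd)

    with open(system_obj, "rb") as f:
        tampered = f.read()

print(tampered)
```

Both names refer to the same inode, so the write lands in the "system" copy after it was verified. Changing the file's owner or mode afterwards would not help, because permission checks are made when a file is opened, not on each write.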

Solution

The key problem above is that the unprivileged user process may hold an open file descriptor on objects in the temporary repo. The solution is to give the system helper a way to ensure that no such file descriptors exist. Some Unix systems provide a revoke(path) syscall to forcibly close any open file descriptors for a path, but Linux does not. So, we provide our own, through the magic of FUSE. Here's an overview:

- revokefs, derived from rofiles-fuse, which has a two-process architecture:
  - A writer process, which runs as a trusted (but unprivileged) user, and performs write operations in response to requests received over a socket passed to it at startup
  - The FUSE filesystem process, which performs read operations directly, and sends write requests over a socket passed to it at startup
- A new unprivileged-but-trusted flatpak-write-helper (name up for discussion) user
- A new method on flatpak-system-helper, called by the user process, which:
  - Creates a socket pair
  - Creates a temporary directory owned by and writeable only by flatpak-write-helper
  - Spawns the writer half of revokefs as the flatpak-write-helper user, passing it one end of the socket pair
  - Returns the other end of the socket pair to the user process
- User process mounts the FUSE half of revokefs, passing it the socket it got back from the system helper, and performs the temporary pull. Data is actually written to disk as the flatpak-write-helper user.
- When complete, the user process asks flatpak-system-helper to deploy from the temporary pull directory:
  - The system helper closes the socket and kills the writer process, guaranteeing that the files can no longer be written by the untrusted user
  - It canonicalizes permissions on the contents of the temporary pull directory, removing any setuid bits, then chowns the files to root:root
  - Now it calls into libOSTree to pull from the temporary directory into the system repo. libOSTree has already been taught to hardlink root-owned files (after verifying their checksums).
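The revocation idea at the heart of this design can be illustrated with a toy model. The sketch below is not revokefs: it involves no FUSE mount and no separate users, and the wire format and the names `writer_loop` and `request_write` are invented for illustration. It shows only that when every write goes through a writer process reached over a socket pair, killing that process leaves the other side with no way to modify the files.

```python
import os
import shutil
import signal
import socket
import struct
import tempfile


def recv_exact(sock, n):
    """Read exactly n bytes, or return None if the peer has gone away."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return buf


def writer_loop(sock, root):
    """Writer half (hypothetical protocol): append requested data to files in root."""
    while True:
        header = recv_exact(sock, 8)
        if header is None:
            return
        name_len, data_len = struct.unpack("!II", header)
        name = recv_exact(sock, name_len)
        data = recv_exact(sock, data_len)
        if name is None or data is None:
            return
        # basename() keeps this toy protocol from escaping root.
        path = os.path.join(root, os.path.basename(name.decode()))
        with open(path, "ab") as f:
            f.write(data)
        sock.sendall(b"K")  # acknowledge the write


def request_write(sock, name, data):
    """Requesting half: forward a write; return False once the writer is gone."""
    encoded = name.encode()
    try:
        sock.sendall(struct.pack("!II", len(encoded), len(data)) + encoded + data)
        return recv_exact(sock, 1) == b"K"
    except OSError:
        return False


pull_dir = tempfile.mkdtemp(prefix="revoke-sketch-")
user_end, writer_end = socket.socketpair()

pid = os.fork()
if pid == 0:
    # Child process: plays the writer half, then exits without returning.
    try:
        user_end.close()
        writer_loop(writer_end, pull_dir)
    finally:
        os._exit(0)

# Parent keeps only the user's end, so the writer's end dies with the writer.
writer_end.close()

before = request_write(user_end, "object", b"payload")

# "Deploy": kill the writer, after which no further writes can happen.
os.kill(pid, signal.SIGKILL)
os.waitpid(pid, 0)

after = request_write(user_end, "object", b"tampering")

with open(os.path.join(pull_dir, "object"), "rb") as f:
    contents = f.read()

user_end.close()
shutil.rmtree(pull_dir)
print(before, after, contents)
```

In this model the requesting side never holds a file descriptor on the data files themselves, which is the property the real design relies on when it later hardlinks the files into the system repo.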

Open questions

What is the performance penalty of writing to the temporary directory via a (multi-process) FUSE filesystem?
How does it compare to the performance penalty of just copying the data, on SSD or HDD?
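One way to start answering these questions is to time the same write workload against a plain directory and against a directory served by the FUSE filesystem. The sketch below is only a starting point under stated assumptions: `time_write` is a hypothetical helper, the 8 MiB workload is arbitrary, and the example measures a plain temporary directory only. Pointing it at a FUSE mount, and using data representative of real app contents rather than zeros, is left to whoever runs the experiment.

```python
import os
import tempfile
import time


def time_write(directory, total_bytes, chunk_size=1 << 20):
    """Write total_bytes to a new file in directory, fsync it, and time the whole run.

    Returns (elapsed_seconds, bytes_on_disk). The file is removed afterwards.
    """
    chunk = b"\0" * chunk_size
    path = os.path.join(directory, "bench.tmp")
    start = time.perf_counter()
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            n = min(remaining, chunk_size)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # include the cost of reaching stable storage
    elapsed = time.perf_counter() - start
    size = os.path.getsize(path)
    os.remove(path)
    return elapsed, size


with tempfile.TemporaryDirectory() as plain_dir:
    elapsed, size = time_write(plain_dir, 8 * 1024 * 1024)

print(f"wrote {size} bytes in {elapsed:.3f}s")
```

Running the same function several times on each target, on both an SSD and an HDD, would give a rough first comparison with the cost of the extra copy.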

Rejected solutions

Add a "destructive pull" operation to libOSTree

We could teach libOSTree to delete each object in the temporary repository after it has been verified and copied to the system repo. This would solve the "double disk space" issue, but not the "double disk IO" issue.
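For illustration, a destructive pull could look roughly like the sketch below. It does not use the libOSTree API or its real repository layout: it assumes a hypothetical flat directory in which each object's file name is the SHA-256 of its contents, and `destructive_pull` is an invented name.

```python
import hashlib
import os
import tempfile


def destructive_pull(src_repo, dst_repo):
    """Verify and copy each object, deleting the source copy as soon as it is copied."""
    pulled = []
    for name in sorted(os.listdir(src_repo)):
        src = os.path.join(src_repo, name)
        with open(src, "rb") as f:
            data = f.read()
        # Verify the bytes that were read; the same bytes are written below.
        if hashlib.sha256(data).hexdigest() != name:
            raise ValueError(f"checksum mismatch for object {name}")
        with open(os.path.join(dst_repo, name), "wb") as f:
            f.write(data)
        # Free the temporary copy immediately instead of at the end of the pull.
        os.remove(src)
        pulled.append(name)
    return pulled


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    payload = b"example object"
    name = hashlib.sha256(payload).hexdigest()
    with open(os.path.join(src, name), "wb") as f:
        f.write(payload)

    pulled = destructive_pull(src, dst)
    src_left = os.listdir(src)
    dst_now = os.listdir(dst)

print(pulled, src_left, dst_now)
```

The extra space needed at any moment drops to roughly one object rather than the whole app, but every byte is still read and written a second time, which is why this option leaves the disk IO problem unsolved.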

`mount --bind` with a barrier directory

Summary:

Create a "barrier" directory:

# mkdir $REPO/tmp/barrier
# chown root:root $REPO/tmp/barrier
# chmod 700 $REPO/tmp/barrier

Create a temporary directory for the user to pull into. The user can't access it directly because of the opaque barrier directory above it:

# mkdir $REPO/tmp/barrier/pull-dir
# chown user:user $REPO/tmp/barrier/pull-dir
# chmod 775 $REPO/tmp/barrier/pull-dir

Bind-mount the temporary directory somewhere the user can access it:

# mkdir $REPO/tmp/pull-dir-mount
# mount --bind $REPO/tmp/barrier/pull-dir $REPO/tmp/pull-dir-mount

As user: pull into $REPO/tmp/pull-dir-mount

As root: unmount the bind-mount. File descriptors are scoped to the nearest mount point, so if this call succeeds there are no open FDs:

# umount --force $REPO/tmp/pull-dir-mount

Unfortunately, --force – umount2(..., MNT_FORCE) – has no effect on ext4 filesystems. If a buggy/malicious user process holds an open FD in the bind-mount, it cannot be unmounted, and there's no way for the system helper to clean it up.

`bindfs`

FUSE filesystems do support MNT_FORCE, and you can also just kill the process to effectively revoke any open FDs. bindfs is essentially a FUSE implementation of mount --bind, with extra options to tweak the permissions of files as they are written.

This approach was explored in depth on #2342 but ran up against a number of stumbling blocks:

- In order for the "real" directory used by bindfs not to be accessible by the untrusted user, either:
  - bindfs must run as root, which introduces several more problems:
    - Unless special care is taken, files will be written as root, which allows the untrusted user to circumvent the "5% disk space reserved for root" setting on the filesystem
    - Running a large amount of FUSE filesystem code as root is scary
  - bindfs could run as a trusted but unprivileged user; unfortunately, in the default configuration of FUSE, it's not possible for user A to run a FUSE filesystem and allow user B to access it
- In order to easily get the PID of the bindfs process so it can be killed, it must be run in the foreground. However, the normal way for a FUSE filesystem to tell its caller that the mount succeeded is to daemonize itself and exit successfully in the parent, which it cannot do if we tell it not to daemonize. This leaves us with fragile hacks either to determine the daemonized process's PID or to guess when the filesystem is mounted.
