performance: what can image-spec do to improve handling of large images? #1190

Open
rchincha opened this issue May 28, 2024 · 31 comments · May be fixed by project-stacker/stacker#626

Comments

@rchincha

Now that OCI artifacts have landed and are gaining mindshare and use cases, some issues are popping up. It would be best to address them in a standard way.

Perhaps time to resurrect this?
https://groups.google.com/a/opencontainers.org/g/dev/c/Zk3yf45HIdA

@samuelkarp
Member

kubernetes/enhancements#4642 is relevant to this too

@cyphar
Member

cyphar commented May 29, 2024

https://github.com/project-machine/puzzlefs was made to solve the problems my OCIv2 proposal discussed quite a few years ago. I haven't looked into it very deeply unfortunately, and I don't think it will help much with large artefact-filled images.

(My view has slowly moved to thinking that CDC and other compression methods make more sense on the distribution side. If we did that, it would be possible to make large images with any content equally deduplicated. There are downsides to this approach too, but embedding CDC parameters into the image-spec seems like a repeat of the nightmare we've had with compression algorithm settings but now with the added issue that changing the settings would cause you to waste cross-image deduplication.)

@hsiangkao

hsiangkao commented May 30, 2024

https://github.com/project-machine/puzzlefs was made to solve the problems my OCIv2 proposal discussed quite a few years ago. I haven't looked into it very deeply unfortunately, and I don't think it will help much with large artefact-filled images.

I recently happened to notice that @gregkh already mentioned EROFS many years ago in the OCI community :-) https://groups.google.com/a/opencontainers.org/g/dev/c/icXssT3zQxE/m/N4YZsbZcAwAJ

Let me restate what EROFS is about: rather than reinventing a wheel for Android only, the original goal was to address Squashfs's runtime performance issues, since Squashfs does not meet the needs of high-performance use cases like smartphones. Users will not accept high runtime app latencies (and most Android vendors have already switched to EROFS, since they hit the same issue when applying compression). The Squashfs on-disk format has not been updated for a decade (it still lacks a filesystem UUID), and various earlier attempts to improve it (at least at the time I decided to design a new high-performance image filesystem format) were ignored [1][2][3].

The goal of the EROFS filesystem was to launch a general-purpose image filesystem project for various high-performance use cases, ranging from system firmware, container images, and app sandboxes to AI model data. For example, people could use the same image for system firmware on raw block devices (as in Container OS use cases) and for container images. If a new on-disk feature would benefit most image use cases, we will consider adding it after discussion, and new contributors are always welcome.

From my own perspective, although the OCI tar format has many flaws, it is at least quite simple, and various operating systems can parse tar without any barrier. Besides, the Docker image format has existed for almost a decade too, and many base layers already exist in tar layer format. For a public cloud vendor (like my current employer, Alibaba Cloud), image compatibility is quite important for customers, and I guess other cloud vendors have the same concern, since there are plenty of old OCI-compatible runtimes to consider. If people want on-demand fetching, there are already technologies for that, like SOCI, stargz, etc. If people want to mount a filesystem directly in-kernel (although I'm not sure why such a requirement is really important compared with performance and OS boot concerns, unlike system firmware use cases), they could use a Squashfs or EROFS index with OCI tar data instead.

I'd be very happy if the OCI community had a chance to consider using EROFS in some form, but my opinion is that we may first need to improve the current OCI format to address some current high-priority OCI image concerns. If specific areas like AI models need a specific filesystem blob, I think EROFS layer blobs for such use cases are fine too. By the way, EROFS already has an IANA-registered media type, "vnd.erofs"

In my experience, EROFS adoption has been slow because many server users are still on 3.10 or 4.18 kernels. That doesn't matter for system image use cases like our original Android system images (because users upgrade the whole system when they decide to use EROFS), but it may take many years before users actually adopt a new in-kernel feature for something like container images.

(My view has slowly moved to thinking that CDC and other compression methods make more sense on the distribution side. If we did that, it would be possible to make large images with any content equally deduplicated. There are downsides to this approach too, but embedding CDC parameters into the image-spec seems like a repeat of the nightmare we've had with compression algorithm settings but now with the added issue that changing the settings would cause you to waste cross-image deduplication.)

Actually, EROFS has had a variant of CDC since Linux 6.1. It is unlike traditional CDC, but the result is almost the same. My experience is that CDC is good for text materials but has little benefit for executable binaries (which I guess is what we care about more in terms of image size and runtime performance), because jump and data-load instructions kill any possibility of such data deduplication, as in the following code snippets from two minor versions of libc:
[image: code snippets from two minor versions of libc]

In reality, the end result for executable binaries and the like will eventually be something like page-unaligned block-based deduplication (like reflink) or file-based deduplication (like ostree).

IMHO, a CDC-like approach without compression is suitable for archival and transfer uses (like casync or the like), but as a kernel filesystem developer I have certain reservations about it for runtime use because of its block/page-unaligned chunks. CDC is unfriendly to page cache sharing (or FSDAX secure container memory sharing), and data movement is almost always needed. The extra data movement also hurts performance compared to reflink approaches unless compression is also considered, yet EROFS has already had a compressed data deduplication feature for two years, since 2022.
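
To make the alignment concern concrete, here is a minimal sketch of content-defined chunking next to fixed page-sized blocks. It is a toy gear-style rolling hash written for illustration only; it is not the EROFS variant mentioned above, nor casync's algorithm, and the parameters (`min_size`, `max_size`, `bits`) are assumptions chosen for readability.

```python
import hashlib
import random

PAGE_SIZE = 4096

# Fixed pseudo-random table for the toy gear hash (illustrative, not a standard table).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def fixed_chunks(data, size=PAGE_SIZE):
    """Split data into fixed, page-sized blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data, min_size=1024, max_size=16384, bits=12):
    """Split data where a rolling hash of the most recent bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # 32-bit gear hash: bytes older than 32 positions are shifted out.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if size >= max_size or (size >= min_size and (h >> (32 - bits)) == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_chunks(a, b):
    """Count distinct chunks (by SHA-256) that appear in both chunk lists."""
    def digests(chunks):
        return {hashlib.sha256(c).digest() for c in chunks}
    return len(digests(a) & digests(b))


def unaligned_boundaries(chunks, page=PAGE_SIZE):
    """Count chunk start offsets that do not fall on a page boundary."""
    offset, count = 0, 0
    for c in chunks:
        if offset % page:
            count += 1
        offset += len(c)
    return count


if __name__ == "__main__":
    data_rng = random.Random(1)
    old = bytes(data_rng.getrandbits(8) for _ in range(128 * 1024))
    new = old[:1000] + b"X" + old[1000:]  # one inserted byte shifts everything after it
    print("fixed blocks shared:", shared_chunks(fixed_chunks(old), fixed_chunks(new)))
    print("CDC chunks shared:  ", shared_chunks(cdc_chunks(old), cdc_chunks(new)))
    print("CDC chunks starting off a page boundary:", unaligned_boundaries(cdc_chunks(old)))
```

The sketch shows both sides of the trade-off on synthetic data: CDC boundaries follow the content, so an insertion does not invalidate every later chunk, but the same boundaries land at arbitrary byte offsets rather than on page boundaries. It says nothing about executable binaries, where the comment above argues the benefit is small.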

I think the only way to deduplicate these executable binaries is "delta compression", but I'm not sure whether it really belongs as a new on-disk feature in the Linux kernel. I guess most users are already happy with ostree or the like; it needs careful evaluation though.

[1]
https://lore.kernel.org/all/af77c1f80e2725c4cf1bf106d6add820b3b0eed5.1523276963.git.geliangtang@gmail.com
https://lore.kernel.org/all/975b0f7acbb65445551ee374a2dd38d553ac2e6a.1523326310.git.geliangtang@gmail.com
https://lore.kernel.org/all/1702a314dc9de4626fbefc788213a578be88f184.1533630854.git.geliangtang@gmail.com
https://lore.kernel.org/all/15428d5047390927114ad49d7721b3da2bdf40ef.1548403955.git.geliangtang@gmail.com
https://lore.kernel.org/all/d6cbe74944ad1a6be21cc74b99b30d18cba140c5.1548406694.git.geliangtang@gmail.com
[2]
https://lore.kernel.org/all/[email protected]
https://lore.kernel.org/all/[email protected]
https://lore.kernel.org/all/[email protected]
[3]
https://lore.kernel.org/all/[email protected]

@rchincha
Author

rchincha commented May 30, 2024

@hsiangkao thanks for calling out all the salient points.

Just curious, how well is in-kernel erofs supported? in terms of community size and history etc? is there a recommended minimum Linux kernel version? is there a recommended userspace erofs implementation?

@gregkh
Member

gregkh commented May 30, 2024 via email

@hsiangkao

@hsiangkao thanks for calling out all the salient points.

Just curious, how well is in-kernel erofs supported? in terms of community size and history etc? is there a recommended minimum Linux kernel version? is there a recommended userspace erofs implementation?

I fully agree with Greg's point: always use the latest stable kernel. To answer your question, it depends on the features required. If the intention is just to use the EROFS format as an index (like a stargz-like TOC) that refers to tar data (for lazy pulling), I think Linux 5.4+ is enough. Current distro configs can be checked at https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=EROFS_FS
If compatibility is really the main concern, I'd suggest using Squashfs. erofs-utils also provides an official userspace implementation as an alternative approach.
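
As a rough illustration of the "Linux 5.4+" guidance, a tool could check the running kernel before choosing an EROFS-based path. This is only a sketch: the helper names are hypothetical, the `(5, 4)` threshold is taken from the comment above, and a version comparison is a heuristic at best because distro kernels backport or disable features independently of their version string.

```python
import platform
import re


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.4.0-100-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(minimum, release=None):
    """Compare the running (or given) kernel release against a (major, minor) tuple."""
    release = platform.release() if release is None else release
    return parse_kernel_release(release) >= minimum


def erofs_registered(proc_filesystems="/proc/filesystems"):
    """True if the kernel currently lists erofs (built in, or module already loaded)."""
    try:
        with open(proc_filesystems) as f:
            return any(line.split()[-1] == "erofs" for line in f if line.strip())
    except OSError:
        return False


if __name__ == "__main__":
    print("kernel >= 5.4:", kernel_at_least((5, 4)))
    print("erofs registered:", erofs_registered())
```

Note that `erofs_registered()` can return False on a kernel that ships EROFS as a module that has not been loaded yet, so a negative result is not conclusive.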

@rchincha
Author

rchincha commented May 30, 2024

Additional considerations ...

overlayfs (Linux kernel version 4.x but also supported by various *BSD), squashfs (Linux kernel version 2.6.x, also supported by various *BSD at least recently) and erofs (Linux kernel 5.x, not supported on *BSD?). MS Win support is another matter altogether.

@hsiangkao

hsiangkao commented May 30, 2024

Additional considerations ...

overlayfs (Linux kernel version 4.x but also supported by various *BSD), squashfs (Linux kernel version 2.6.x, also supported by various *BSD at least recently) and erofs (Linux kernel 5.x, not supported on *BSD?). MS Win support is another matter altogether.

I'm quite open to that since EROFS is not designed for one specific use case. If the OCI community considers EROFS in some form (or as an alternative), that would be awesome. If not, EROFS will still keep adding features to serve generic image use cases. EROFS feature development is actively driven by Android vendors, some cloud vendors, etc.

@rchincha
Author

rchincha commented Jun 4, 2024

Adding more notes ...

OCI artifacts may package "many small-ish files" such as container image rootfs or "a few very large files" such as AI models.

@ChaoyiHuang

Some thoughts here. Large model files are really large; for example, Llama 3 70B in fp16 is about 141GB.

One way to handle such huge files is to use the same storage for the image registry and the compute nodes, i.e., the model file is stored in raw format in a distributed file system shared by the registry and the compute nodes, so no data is transferred between the registry backend and the compute node.

When the client on the compute node pulls the model blob, the registry returns the location of the model file; the client finds that it is located in a file system the compute node can access, so no blob download is required.
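
The client-side decision described here can be sketched as follows. Everything in this block is hypothetical: `BlobLocation`, its fields, and `resolve_blob` are illustrative names, not part of any existing registry or client API, and the sketch assumes the registry can report a path on the shared file system alongside a normal download URL.

```python
import os
from dataclasses import dataclass


@dataclass
class BlobLocation:
    """Hypothetical registry answer for a blob: where it lives and how to fetch it."""
    digest: str
    shared_path: str   # assumed: path of the raw file on the shared file system
    download_url: str  # assumed: ordinary blob download endpoint as a fallback


def resolve_blob(location, shared_roots):
    """Return ("local", path) if the blob is readable under a shared mount,
    otherwise ("download", url)."""
    real = os.path.realpath(location.shared_path)
    for root in shared_roots:
        root_real = os.path.realpath(root)
        # Only trust paths that really sit inside one of the known shared mounts.
        inside = os.path.commonpath([real, root_real]) == root_real
        if inside and os.path.isfile(real):
            return ("local", real)
    return ("download", location.download_url)
```

A real client would still want to verify the digest of the local file before using it, since the shared path is supplied by a remote party.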

@hsiangkao

hsiangkao commented Jun 6, 2024

Some thoughts here. Large model files are really large; for example, Llama 3 70B in fp16 is about 141GB.

One way to handle such huge files is to use the same storage for the image registry and the compute nodes, i.e., the model file is stored in raw format in a distributed file system shared by the registry and the compute nodes, so no data is transferred between the registry backend and the compute node.

When the client on the compute node pulls the model blob, the registry returns the location of the model file; the client finds that it is located in a file system the compute node can access, so no blob download is required.

Anyway, you could also treat OCI artifacts (like a kind of object storage) as shared immutable storage (like a read-only mini gfs2 or ocfs2). That way you also don't need to download any blob locally in advance (possibly hundreds of GiB); you only need virtual block device clients using nbd/tcmu/ublk or, if you really need local caching, a caching framework like fscache.

@rchincha
Author

rchincha commented Jun 6, 2024

Some thoughts here. Large model files are really large; for example, Llama 3 70B in fp16 is about 141GB.

^ how compressible is this model file?
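
One way to answer this empirically is to compress a sample of the file and look at the ratio. The sketch below uses zlib from the standard library purely as a stand-in; the function name, sample size, and compression level are assumptions, and the result for any particular model file has to be measured rather than guessed.

```python
import zlib


def sample_compression_ratio(path, sample_bytes=64 * 1024 * 1024, level=6):
    """Compress up to sample_bytes from the start of a file; return compressed/original."""
    compressor = zlib.compressobj(level)
    read = written = 0
    with open(path, "rb") as f:
        while read < sample_bytes:
            block = f.read(min(1 << 20, sample_bytes - read))
            if not block:
                break
            read += len(block)
            written += len(compressor.compress(block))
    written += len(compressor.flush())
    # A ratio near 1.0 means the sample is essentially incompressible with zlib.
    return written / read if read else 1.0
```

Sampling only the beginning of a file can be misleading if headers and payload differ in structure, so sampling several offsets would give a better estimate.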

@rchincha
Author

rchincha commented Jun 6, 2024

Is there interest in porting erofs-utils to Go, since most utilities in this space are Go-based?
I am mainly interested in creating an erofs layer/image (so that it is compatible with overlayfs).

@gregkh
Member

gregkh commented Jun 6, 2024 via email

@rchincha
Author

rchincha commented Jun 6, 2024

If the goal is to produce and consume erofs layers - so that they can just be copied over and mounted, then there are two touch points, which may or may not be ok with binary invocations.

  1. Tools that produce said layers and images
  2. Container runtimes that overlay mount these layers

Maybe as an initial PoC, Go bindings (cgo) instead?

@tianon
Member

tianon commented Jun 7, 2024

I recall reading (perhaps incorrectly! 🙈❤️) that many kernel filesystems are not designed to be hardened against attacker-controlled raw input, but given the use cases for erofs, I'm guessing that its implementation is hardened against malicious inputs? 👀😇

@hsiangkao

hsiangkao commented Jun 7, 2024

I recall reading (perhaps incorrectly! 🙈❤️) that many kernel filesystems are not designed to be hardened against attacker-controlled raw input, but given the use cases for erofs, I'm guessing that its implementation is hardened against malicious inputs? 👀😇

This is really a best-effort matter. Unlike generic filesystems with complex metadata and journalling (where consistency between different kinds of metadata is always challenging), the EROFS core on-disk format is quite simple [1]. The EROFS project addresses any new syzkaller fuzzing reports, and we also have our own fuzzer to find potential bugs. However, EROFS is not a completely frozen filesystem project: useful new on-disk/runtime features will be added over time as new scenarios and inputs arise, so new issues may appear (we are all human and not bug-free).

Unlike some other filesystems, EROFS will address newly found or reported issues promptly, and that is all the guarantee I can give. So yes, in brief, the implementation is hardened against malicious inputs on a best-effort basis. Alternatively, we could find ways to let users use only the core stable features, but that looks like a non-technical issue anyway (again, the latest stable kernels are always preferred, to address all kernel issues).

[1] https://erofs.docs.kernel.org/en/latest/core_ondisk.html

@hsiangkao

Is there interest in porting erofs-utils to Go, since most utilities in this space are Go-based? I am mainly interested in creating an erofs layer/image (so that it is compatible with overlayfs).

Some runtimes like gVisor [1] have already landed core on-disk EROFS support in their own Go implementation to enable efficient image passthrough to sandboxes. But an alternative approach (like cgo) is helpful since EROFS is still under active development, and keeping implementations in various languages up to date is somewhat challenging given limited time and engineering resources (although we may later have an experimental Rust implementation developed by students). Anyway, C is still quite a portable language across architectures, platforms, and distributions.

[1] google/gvisor#9486

rchincha added a commit to rchincha/stacker that referenced this issue Jun 14, 2024
rchincha added a commit to rchincha/stacker that referenced this issue Jun 14, 2024
@rchincha
Author

rchincha commented Aug 14, 2024

@hsiangkao what is a good forum to coordinate the cgo changes? and where should they land? is the project receptive to any cgo-related refactoring?

At the very least the main primitives that are needed (as golang APIs) are:

  1. MkErofs()
    https://github.com/erofs/erofs-utils/blob/dev/mkfs/main.c#L1154

  2. MountErofs()/UnmountErofs()
    https://github.com/erofs/erofs-utils/blob/dev/fuse/main.c#L630 (fuse) or use the kernel driver

  3. PopulateErofs()

What we are trying to achieve is to substitute tar blob workflows with erofs blob ones with minimal impact to ecosystem tooling.
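
Until bindings exist, the primitives listed above can be approximated by invoking the erofs-utils binaries, which is the "binary invocation" option raised earlier in the thread. The sketch below only builds and runs command lines; the function names are hypothetical wrappers, not an erofs-utils API, and it assumes `mkfs.erofs`, `erofsfuse`, `mount`, `umount`, and `fusermount` are available on the host.

```python
import shutil
import subprocess


def mkfs_erofs_cmd(image_path, source_dir, compression=None):
    """Command line to build an EROFS image from a directory tree."""
    cmd = ["mkfs.erofs"]
    if compression:
        cmd.append(f"-z{compression}")  # e.g. "lz4hc"
    return cmd + [image_path, source_dir]


def mount_erofs_cmd(image_path, mountpoint, use_fuse=False):
    """Command line to mount an image with erofsfuse or the kernel driver (loop device)."""
    if use_fuse:
        return ["erofsfuse", image_path, mountpoint]
    return ["mount", "-t", "erofs", "-o", "loop,ro", image_path, mountpoint]


def unmount_erofs_cmd(mountpoint, use_fuse=False):
    """Command line to unmount; FUSE mounts made by a regular user use fusermount."""
    if use_fuse:
        return ["fusermount", "-u", mountpoint]
    return ["umount", mountpoint]


def run(cmd):
    """Run a command, failing early with a clear error if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found in PATH")
    subprocess.run(cmd, check=True)
```

For example, `run(mkfs_erofs_cmd("layer.erofs", "rootfs", compression="lz4hc"))` would produce a compressed layer image from `rootfs`. The kernel mount path normally requires root privileges, while the FUSE path does not.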

@hsiangkao

hsiangkao commented Aug 14, 2024

Hi @rchincha,

Thanks for your reply!

@hsiangkao what is a good forum to coordinate the cgo changes?

OCI meeting time is too late for me to attend regularly (It seems 0:00~1:00AM on my side)...sigh...
I think we could have a separate issue tracker (maybe using GitHub issues), and an independent meeting may be better?
If it's needed, I could attend OCI meetings too if you prefer.

and where should they land?

A separate erofs-go project, or just integrate into erofs-utils? Which would you prefer?

Is the project receptive to any cgo-related refactoring?

Yeah, definitely!
Also, as I mentioned in the erofs-utils v1.8 announcement, I think erofs-utils v1.9 is a good time to stabilize all APIs.

At the very least the main primitives that are needed (as golang APIs) are:

  1. MkErofs()
    https://github.com/erofs/erofs-utils/blob/dev/mkfs/main.c#L1154

Agreed.

  2. MountErofs()/UnmountErofs()
    https://github.com/erofs/erofs-utils/blob/dev/fuse/main.c#L630 (fuse) or use the kernel driver

Yeah, it seems containerd could use it too.

  3. PopulateErofs()

What's the purpose of this API?

What we are trying to achieve is to substitute tar blob workflows with erofs blob ones with minimal impact to ecosystem tooling.

Yeah, I'm very happy to coordinate and work, thanks!

@rchincha
Author

@hsiangkao coordination doesn't have to be via an OCI meeting. It can just be over PR reviews, which may be more convenient.

PopulateErofs() is just a placeholder API to indicate entries are copied into the newly created erofs layout. Maybe we just mount and copy things in/out instead and don't actually need this.

rchincha added a commit to rchincha/stacker that referenced this issue Aug 14, 2024
@hsiangkao

@hsiangkao coordination doesn't have to be via an OCI meeting. It can just be over PR reviews, which may be more convenient.

Sounds good, though erofs-utils development has been mailing-list based for years, like most kernel projects.
If you prefer GitHub PRs, we could create a new erofs-go repo or use the erofs-utils GitHub mirror for reviewing this part.

PopulateErofs() is just a placeholder API to indicate entries are copied into the newly created erofs layout. Maybe we just mount and copy things in/out instead and don't actually need this.

Okay, got it.

@cgwalters

What we are trying to achieve is to substitute tar blob workflows with erofs blob ones with minimal impact to ecosystem tooling.

You may also be interested in https://github.com/containers/composefs which uses EROFS for metadata, but splits out shared files into a backing store which gives various advantages like dedup on disk and in the page cache too.

@hsiangkao

hsiangkao commented Aug 18, 2024

What we are trying to achieve is to substitute tar blob workflows with erofs blob ones with minimal impact to ecosystem tooling.

You may also be interested in https://github.com/containers/composefs which uses EROFS for metadata, but splits out shared files into a backing store which gives various advantages like dedup on disk and in the page cache too.

Yes, ostree+composefs is also a great way to distribute images. And composefs is more widely used now.

In my spare time, I'm trying to keep working on containerd erofs snapshotter support, and hopefully this snapshotter will support:

  • erofs native blobs;
  • ostree blobs by using composefs;
  • OCI tar blobs.

@cgwalters

Yes, ostree+composefs is also a great way to distribute images.

To be clear, the composefs core is agnostic to higher level tooling. Yes, it can be used with ostree, but we are also looking at a strong OCI native integration with composefs and some of that already exists in the containers/storage composefs backend: https://github.com/containers/storage/blob/main/docs/containers-storage-composefs.md


Connecting with that but backing up to a higher level though...some of the discussion in this thread seems to be basically floating to "switch to EROFS instead of tar in an OCI v2"...I'm not really in favor of that...I think it'd be too traumatic for the ecosystem, and there's other ways to gain key desirable properties.

Especially this thread started around handling large images in an incremental way, and I think it makes sense to try to standardize work on estargz and https://github.com/containers/storage/blob/main/docs/containers-storage-zstd-chunked.md

@Conan-Kudo

I would be considerably in favor of using a filesystem as the medium instead of tar for a next-generation OCI format. We are definitely running into problems (as many others have outlined) with the tarball-based format, and the band-aid solutions to work around it have similar portability problems with not enough benefit over just adopting something like EROFS for this.

It has always bothered me that we called OCI an image format when it in fact isn't one, and with things like Fedora's bootc attempting to glue OCI to operating systems, having an OCI format that's actually image based would permit those things to work with some kind of reasonable performance and scalability.

@hsiangkao

hsiangkao commented Aug 29, 2024

Especially this thread started around handling large images in an incremental way, and I think it makes sense to try to standardize work on estargz and https://github.com/containers/storage/blob/main/docs/containers-storage-zstd-chunked.md

Just my personal opinion: the approaches that enhance tar while keeping it tar-compatible, like (e)stargz, zstd:chunked, or SOCI, are all good. Yet it would be better to offer another option for their TOC format in addition to JSON.
Surely, from my own POV, I do hope EROFS can be useful and then widely used in the container world (and I hope more people can give me input on this). Thus, I hope the TOC itself can also be expressed in the EROFS (or Squashfs, if someone is interested) metadata format, so that it can be parsed directly by EROFS rather than requiring an extra (and unnecessary) JSON->EROFS metadata conversion. I think that could benefit composefs too.
The complexity of JSON makes it a poor fit for bootloaders and kernels.

rchincha added a commit to rchincha/stacker that referenced this issue Aug 29, 2024
@rchincha
Author

@cgwalters

...I think it'd be too traumatic for the ecosystem, and there's other ways to gain key desirable properties.

There are known issues with the tar format.
https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar

Folks are already moving toward layers as full filesystems - soci etc being a smart tarfs (via fuse) while still staying with tar etc.

Also you don't have to use a newer format if you don't want to.

This is still a consensus-building exercise until working prototypes are demonstrated I suppose.

@hsiangkao

hsiangkao commented Sep 16, 2024

@hsiangkao coordination doesn't have to be via an OCI meeting. It can just be over PR reviews, which may be more convenient.

PopulateErofs() is just a placeholder API to indicate entries are copied into the newly created erofs layout. Maybe we just mount and copy things in/out instead and don't actually need this.

Hi @rchincha, JFYI, file-backed mount has been merged upstream:
https://git.kernel.org/torvalds/c/69a3a0a45a2f
I'm trying to implement this (rather than using loop devices) for containerd too.

10 participants