
System.IO.Compression: ZipArchiveEntry always stores uncompressed data in memory #1544

Open
Tracked by #62658
qmfrederik opened this issue Sep 13, 2016 · 31 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Milestone

Comments

@qmfrederik
Contributor

If a ZipArchive is opened in Update mode, calling ZipArchiveEntry.Open will always result in the entry's data being decompressed and stored in memory.

This is because ZipArchiveEntry.Open calls ZipArchiveEntry.OpenInUpdateMode, which in turn gets the value of the ZipArchiveEntry.UncompressedData property, which decompresses the data and stores it in a MemoryStream.

This on its own is already fairly inconvenient - if I'm updating a large file (say 1 GB) that's compressed in a zip archive, I would want to read it from the ZipArchiveEntry in smaller chunks, save it to a temporary file, and update the entry in the ZipArchive in a similar way (i.e. limiting the memory overhead and preferring temporary files instead).

This also means that as soon as a ZipArchive is opened in Update mode, even reading a ZipArchiveEntry which you'll never update incurs this additional cost.

A short-term fix may be to expose ZipArchiveEntry.OpenInReadMode, which would return a DeflateStream instead. If you're doing mixed reading/writing of entries in a single ZipArchive, this should already help you avoid some of the memory overhead.
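A minimal repro of that cost, assuming a hypothetical large.zip containing a ~1 GB entry big.bin:

using System.IO;
using System.IO.Compression;

using var archive = ZipFile.Open("large.zip", ZipArchiveMode.Update);
var entry = archive.GetEntry("big.bin");

// Even a pure read pays the price: in Update mode, Open() routes through
// OpenInUpdateMode, which first decompresses the whole entry into a MemoryStream.
using var stream = entry.Open();
stream.CopyTo(Stream.Null); // the full ~1 GB is now also resident in memory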

@ericstj
Member

ericstj commented Feb 27, 2019

I've noticed the same. In general ZipArchiveEntry.Open is very non-intuitive in its behavior.

For read only / write only you get a wrapped DeflateStream which doesn't tell you the length of the stream nor permit you to seek it. For read/write (update), ZipArchiveEntry will read and decompress the entire entry into memory (in fact, into a MemoryStream backed by a single contiguous managed array) so that you have a seekable representation of the file. Once opened for update, the file is then written back to the archive when the archive itself is closed.

I agree with @qmfrederik here that we need a better API. Rather than rely solely on the archive's open mode, we should allow individual calls to Open to specify what kind of stream they want. We can then check that against how the archive was opened and throw if it is incompatible. Consider the addition:

  public Stream Open(FileAccess desiredAccess)

For an archive opened with ZipArchiveMode.Update we could allow FileAccess.Read, FileAccess.Write, or FileAccess.ReadWrite, where only the latter would do the MemoryStream expansion. Read and Write would behave as they do today. In addition to solving the OP issue, this would address the case where someone is opening an archive for Update and simply adding a single file: we shouldn't need to copy that uncompressed data into memory just to write it to the archive.

We can also do better in the read case: we know the length of the data (assuming the archive is not inconsistent) and can expose it rather than throwing.
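As an illustration of the call sites this would enable (a sketch only; the Open(FileAccess) overload is the proposal, not a shipped API):

using System.IO;
using System.IO.Compression;

using var archive = ZipFile.Open("data.zip", ZipArchiveMode.Update);

// Read-only: a forward-only DeflateStream; nothing is expanded into memory.
using (var read = archive.GetEntry("old.txt").Open(FileAccess.Read))
    read.CopyTo(Stream.Null);

// Write-only: bytes are compressed straight into the archive.
using (var write = archive.CreateEntry("new.txt").Open(FileAccess.Write))
    write.Write(new byte[] { 1, 2, 3 }, 0, 3);

// Read/write: the only case that still needs the MemoryStream expansion.
using (var update = archive.GetEntry("mutable.bin").Open(FileAccess.ReadWrite))
    update.Seek(0, SeekOrigin.End);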

@ericstj
Member

ericstj commented Feb 27, 2019

Another interesting consideration for this is something like the approach taken by System.IO.Packaging in desktop. It implemented an abstraction over the deflate stream that would change modes depending on how you interacted with it: https://referencesource.microsoft.com/#WindowsBase/Base/MS/Internal/IO/Packaging/CompressStream.cs,e0a52fedb240c2b8

Exclusive reads and small seeks would operate on a deflate stream; the same goes for exclusive writes. Large seeks or random access would fall back to an "emulation mode" wherein it would decompress everything to a stream that was partially backed by memory but would fall back to disk.

I don't really like this since it hides some very expensive operations behind synchronous calls, as well as introducing a potential trust boundary (temp file) behind something that is expected to be purely computational. I think it makes sense to keep Zip lower level and not try to hide this in the streams we return. Perhaps we could allow the caller to provide a stream for temporary storage in the Update case.
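One possible shape for that, sketched as a hypothetical option type (none of these names ship in System.IO.Compression):

using System;
using System.IO;

// Hypothetical: lets ZipArchive spill update buffers to caller-provided storage,
// e.g. a self-deleting temp file, instead of one contiguous MemoryStream.
public sealed class ZipUpdateOptions
{
    public Func<Stream> TemporaryStorageFactory { get; set; } =
        () => new FileStream(Path.GetTempFileName(), FileMode.Create,
            FileAccess.ReadWrite, FileShare.None, 4096, FileOptions.DeleteOnClose);
}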

@carlossanlop
Member

Triage:
This would be nice to have.

@carlossanlop carlossanlop transferred this issue from dotnet/corefx Jan 9, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.IO.Compression untriaged New issue has not been triaged by the area owner labels Jan 9, 2020
@carlossanlop carlossanlop added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels Jan 9, 2020
@carlossanlop carlossanlop added this to the 5.0 milestone Jan 9, 2020
@carlossanlop carlossanlop modified the milestones: 5.0.0, Future Jun 18, 2020
@twsouthwick
Member

twsouthwick commented Jul 24, 2020

@carlossanlop This is blocking many users from moving to .NET Core, as writing large Office files ends up hitting this by way of System.IO.Packaging -> DocumentFormat.OpenXml

@danmoseley
Member

Maybe we should mark this for 6.0.0

@Trysor

Trysor commented Aug 20, 2020

@twsouthwick @danmosemsft Is there a way to work around this issue (for OpenXml or otherwise) as a temporary solution? Alternatively, re-discuss it for 5.0.0.

@danmoseley
Member

I do not have context on this area; that would be @twsouthwick and @carlossanlop

@ericstj
Member

ericstj commented Sep 3, 2020

Typically you can work around this if you open an archive only for create, or only for read. When you open an archive for update, our Zip implementation needs to buffer things to memory since it can't mutate in place. I agree that we should try to fix something here in 6.0.0.
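Sketched with hypothetical file names, the workaround replaces in-place Update with a read-then-rewrite pass:

using System.IO;
using System.IO.Compression;

using var source = ZipFile.Open("input.zip", ZipArchiveMode.Read);
using var target = ZipFile.Open("output.zip", ZipArchiveMode.Create);

foreach (var entry in source.Entries)
{
    // Substitute new content here for any entry you want to "update".
    var copy = target.CreateEntry(entry.FullName);
    using var from = entry.Open(); // forward-only DeflateStream
    using var to = copy.Open();    // compresses as bytes arrive
    from.CopyTo(to);               // fixed-size buffer; nothing held fully in memory
}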

@nike61

nike61 commented Mar 26, 2021

Hello everyone, are there any updates on that issue? We really want that fix 😄

@ericstj
Member

ericstj commented Mar 26, 2021

@nike61 did the suggested workarounds not work for you? Maybe share some of your scenario to help move things along.

Would the suggested API of providing Read | Write granularity on individual entries solve this for you, or do you need a solution that lets you edit the contents of an entry?

@nike61

nike61 commented Mar 28, 2021

@ericstj we use OpenXML to generate large Excel documents. I'm not sure; maybe we should ask the OpenXML team to use the workaround.

@ericstj
Member

ericstj commented Mar 29, 2021

I see, so that'd probably be @twsouthwick, who appears to be the current maintainer. @twsouthwick does OpenXML expose the workarounds suggested above (opening ReadOnly or Create, but not Update)? If ZipArchiveEntry had read-only Open and write-only Open APIs on entries, would this be good enough?

@twsouthwick
Member

@ericstj AFAIK, OpenXML never directly deals with any ZipArchiveEntry. That's handled via System.IO.Packaging.Package. Here's a repro of the same memory growth users are seeing without the OpenXml layer:

using System;
using System.IO;
using System.IO.Packaging;

namespace ClassLibrary2
{
    public class Class1
    {
        public static void Main()
        {
            var filePath = Path.GetTempFileName();

            using var package = Package.Open(filePath, FileMode.Create, FileAccess.ReadWrite);

            var part = package.CreatePart(new Uri("/test", UriKind.Relative), "something/other");

            using var stream = part.GetStream(FileMode.Create);

            for (int i = 0; i < int.MaxValue; i++)
            {
                var bytes = BitConverter.GetBytes(i);
                stream.Write(bytes, 0, bytes.Length);
            }
        }
    }
}

As you can see, everything is using FileMode.Create. I believe all of the writing to the stream of a given part will replace the contents, so we never need to update a given entry.

@ericstj
Member

ericstj commented Mar 29, 2021

That's happening because of FileAccess.ReadWrite in the above. Here's what it's translated to:

if (packageFileAccess == FileAccess.Read)
    zipArchiveMode = ZipArchiveMode.Read;
else if (packageFileAccess == FileAccess.Write)
    zipArchiveMode = ZipArchiveMode.Create;
else if (packageFileAccess == FileAccess.ReadWrite)
    zipArchiveMode = ZipArchiveMode.Update;

If you are only writing, then just use FileAccess.Write; that should work around the entry buffering issue.
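Applied to the repro above, the pure-write path becomes (a sketch; filePath and the part name are the repro's own placeholders):

using var package = Package.Open(filePath, FileMode.Create, FileAccess.Write);

// FileAccess.Write maps to ZipArchiveMode.Create, so part data is compressed
// as it is written instead of being buffered into a MemoryStream.
var part = package.CreatePart(new Uri("/test", UriKind.Relative), "something/other");
using var stream = part.GetStream(FileMode.Create);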

In twsouthwick's sample above, the call to GetStream eventually ignores the FileMode passed in.

So if we created read-only/write-only Open APIs on ZipArchiveEntry, then System.IO.Packaging could be modified to use them, though this would be breaking for people who counted on getting the buffered/seekable stream when opening a package in ReadWrite mode.

A bit of history here: the System.IO.Packaging library was originally ported by a previous maintainer of OpenXML, but since WPF came to .NET Core in 3.0 it's used there as well, so we need to be mindful of those scenarios when changing this.

@salvois

salvois commented Nov 20, 2021

@fkamp this is a shameless plug, but you may try my https://github.com/salvois/LargeXlsx library, which I wrote exactly because I had the same problem.

@ghost ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 20, 2021
MichalStrehovsky pushed a commit to MichalStrehovsky/runtime that referenced this issue Dec 9, 2021
@jeffhandley jeffhandley modified the milestones: Future, 7.0.0 Jan 9, 2022
@insinfo

insinfo commented Apr 8, 2022

I'm having the memory leak issue if I use Update mode. Any idea when this will be fixed?

static void Main(string[] args)
{
    var watch = new System.Diagnostics.Stopwatch();
    watch.Start();
    var user = "root";
    var senha = "pass";
    var connectionInfo = new ConnectionInfo("192.168.133.13", user, new PasswordAuthenticationMethod(user, senha));
    var client = new SftpClient(connectionInfo);
    client.Connect();
    DownloadDirectoryAsZip2(client, "/var/www/dart/intranetbrowser", @"C:\MyCsharpProjects\fsbackup\download2.zip");
    client.Dispose();
    watch.Stop();
    Console.WriteLine($"End Download Execution Time: {watch.ElapsedMilliseconds} ms");
}

public static void DownloadDirectoryAsZip2(SftpClient sftpClient, string sourceRemotePath, string destLocalPath)
{
    using (var zipFile = new FileStream(destLocalPath, FileMode.OpenOrCreate))
    {
        // memory leaks if ZipArchiveMode.Update
        using (var archive = new ZipArchive(zipFile, ZipArchiveMode.Create, leaveOpen: true))
        {
            DownloadDirectoryAsZipRec2(archive, sftpClient, sourceRemotePath);
        }
    }
}

private static void DownloadDirectoryAsZipRec2(ZipArchive archive, SftpClient sftpClient, string sourceRemotePath)
{
    try
    {
        var files = sftpClient.ListDirectory(sourceRemotePath);
        foreach (var file in files)
        {
            if ((file.Name != ".") && (file.Name != ".."))
            {
                var sourceFilePath = sourceRemotePath + "/" + file.Name;

                if (file.IsDirectory)
                {
                    DownloadDirectoryAsZipRec2(archive, sftpClient, sourceFilePath);
                }
                else
                {
                    // also tried buffering via a MemoryStream, or a temp
                    // FileStream created with FileOptions.DeleteOnClose
                    var entry = archive.CreateEntry(sourceFilePath);
                    Stream stream = entry.Open();
                    try
                    {
                        sftpClient.DownloadFile(sourceFilePath, stream);
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine($"Download File failed: {sourceFilePath} | Error:{e}");
                    }
                    finally
                    {
                        stream.Dispose();
                    }
                }
            }
        }
    }
    catch (Exception e)
    {
        Console.WriteLine($"Download Directory failed: {sourceRemotePath} | Error:{e}");
    }
}

sshnet/SSH.NET#948

#23750

@easuter

easuter commented Nov 18, 2022

I appreciate that this will be a complex issue to fix; however, it has been open for more than 8 years now.
Is the .NET Framework version unaffected because it relies on some Windows API? If so, why can't that be ported to .NET/Core?

@petrkoutnycz

No workarounds? No fix? :-(

@LarinLive

I have changed a few jobs since opening this issue dotnet/Open-XML-SDK#244; five years have passed, but there is no solution yet. The core thing is that the old .NET Framework worked and works well, but not the new, stylish, modern .NET Core. Any ideas, Microsoft?

@twsouthwick
Member

I finally got around to attempting a workaround on the OpenXml side. Here's what I did:

  1. Added a new abstraction on top of the packaging APIs, because the packaging APIs are not great (you cannot compose new implementations together - the public methods are not the same as the protected virtual methods)
  2. Enabled the ability to "reload" a package - now a new package instance can be used underneath, if needed, without the rest of the stack knowing
  3. Created temporary storage for part streams, which (currently) are written to files that will be tracked
  4. On "Save()", I reload the package to just be FileAccess.Write
  5. If I try to get a part to save, I am no longer able to access it, because I'm in write-only mode and can't read the parts

So, a write-only API on PackagePart that enables a mode in which the data is not buffered to memory would be greatly appreciated.

As an alternative, what about providing an option to supply the temporary buffer that is used here, rather than always using a MemoryStream?
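Until such an option exists, one stopgap under the Create-mode workaround looks like this (a sketch; ProducePartContent is a hypothetical stand-in for whatever generates the part's data):

using System.IO;
using System.IO.Compression;

// Stage the data in a self-deleting temp file rather than a MemoryStream,
// then stream it into a Create-mode archive entry.
using var temp = new FileStream(Path.GetTempFileName(), FileMode.Create,
    FileAccess.ReadWrite, FileShare.None, 4096, FileOptions.DeleteOnClose);
ProducePartContent(temp); // hypothetical producer of the part's bytes
temp.Position = 0;

using var archive = ZipFile.Open("package.zip", ZipArchiveMode.Create);
using var entryStream = archive.CreateEntry("part.xml").Open();
temp.CopyTo(entryStream);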

@clement911
Copy link

We're struggling with this as well.
Can it get some love?

@LarinLive

This problem is a great obstacle to stream-writing large Excel files with the Open-XML-SDK library. A corresponding bug has been open for six years: dotnet/Open-XML-SDK#244. IMHO, there has been enough time to offer a solution.

@PaulusParssinen
Contributor

Just FYI for subscribers to this issue: I opened an API proposal for the quoted API at #101243

  public Stream Open(FileAccess desiredAccess)

@LarinLive

@PaulusParssinen you suggested a nice approach. @twsouthwick, can the .NET Team triage it to move forward with the initial problem?

@kotofsky

Voting for fixing this!

@kotofsky

I see that there is no fix yet, so I made my own library for big Excel and Word documents too. No leaks, and it works fast. But you still need to work with OpenXml v2.19.0.
Hope it helps someone.
https://www.nuget.org/packages/DocMaker
