Memory usage for large datasets #276

Open
guilleva opened this issue Jan 10, 2014 · 42 comments

@guilleva

Hi, I'm facing a memory issue when trying to generate an xlsx file with >600 columns and ~10k rows. Memory usage grows to approximately 3 GB, which causes Heroku to restart the dyno.

I'm wondering if there is a known way to generate it without such heavy memory usage.

The way I'm doing it can be represented by this script:

Axlsx::Package.new do |p|
  p.workbook.add_worksheet(:name => "Test") do |sheet|
    11_000.times do
      sheet.add_row ["test data"] * 600
    end
  end
  p.serialize('tmp/test.xlsx')
end
@michaeldauria

We're experiencing the same issue. Perhaps RubyZip could be replaced with https://bitbucket.org/winebarrel/zip-ruby/wiki/Home?

@jurriaan
Collaborator

I don't think RubyZip is the issue; axlsx is.
One of the problems is that axlsx creates an enormous string which is then passed to RubyZip. That's very inefficient.
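
To illustrate the alternative, here is a minimal sketch (assuming the rubyzip gem; this is not axlsx code, and it produces only a bare sheetData entry, not a valid workbook) of streaming worksheet XML into the archive chunk by chunk with Zip::OutputStream, so each row becomes garbage as soon as it is written:

require 'zip'

Zip::OutputStream.open('tmp/stream_demo.zip') do |zos|
  zos.put_next_entry('xl/worksheets/sheet1.xml')
  zos.write('<worksheet><sheetData>')
  11_000.times do |i|
    # serialize one row, write it, and let it be collected immediately
    row_xml = %(<row r="#{i + 1}">) + %(<c t="inlineStr"><is><t>test data</t></is></c>) * 600 + '</row>'
    zos.write(row_xml)
  end
  zos.write('</sheetData></worksheet>')
end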

@jurriaan
Collaborator

I've tried to fix it at https://github.com/jurriaan/axlsx/tree/fix_enormous_string. Please report whether this helps (it should reduce usage to 1.5 GB or so).

@jurriaan
Collaborator

@guilleva @michaeldauria Did this change work for you? It should be possible to reduce the memory usage even further, but it's a beginning :)

@guilleva
Author

Hi jurriaan, thanks for looking into this. I haven't been able to test it yet, I'm really sorry; I will do my best to try it today and let you know.

Thanks again

@jurriaan
Collaborator

Just pushed some new commits; the example now uses ~700 MB on my system.

@michaeldauria

I will try this out today and report back

@jurriaan
Collaborator

It's possible it won't help in your situation, though. The fact is that at the moment axlsx allocates far too many objects, which causes the heap to grow very large. I'm trying to reduce the number of allocated objects.
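
For anyone who wants to watch this happen, here is a rough sketch (assuming the axlsx gem is installed; exact numbers vary by Ruby version) that counts live heap slots before and after building a modest worksheet:

require 'axlsx'

def live_slots
  GC.start
  GC.stat[:heap_live_slots]
end

before = live_slots
pkg = Axlsx::Package.new
pkg.workbook.add_worksheet(:name => "Test") do |sheet|
  1_000.times { sheet.add_row ["test data"] * 600 }
end
# every cell is a full Axlsx::Cell object, so this number is large
puts "live heap slots added: #{live_slots - before}"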

@michaeldauria

This did not end up helping much; memory seemed more stable, but we still ran out anyway.

@jeremywadsack

We have a similar issue. We're building sheets with hundreds of thousands of rows and as many as 100 tabs. axlsx uses 5.3 GB to build a 111 MB xlsx file.

I would be happy to test some changes on this, but your branch is no longer available. Any further updates on this?

@jurriaan
Collaborator

@jeremywadsack I've merged most of the changes into the master branch, but it's still using a lot of memory. I hope to have some time to look at this issue again in the coming weeks

@maxle

maxle commented Sep 18, 2014

Hi. Any updates on this?

@srpouyet

srpouyet commented Feb 3, 2015

Our Delayed Job workers aren't processing jobs (out of memory), probably because of this issue. Any updates on the status? Cheers!

@pallymore

+1 on this. We are having the same issue with huge datasets :/

@jeremywadsack

We just ended up spinning up more hardware to deal with this for the time being. :/

@hassanrehman

Old thread, but we're also having problems: a 2.1 MB file needs 650 MB of RAM.

Still waiting for a solution.

@jeremywadsack

Yeah. We would still love a solution. We run a 24 GB server instance so we can build 15 MB files.


@jeremywadsack

Just had another build crash because a 400 MB Excel file took 22 GB of memory to build. (Incidentally, Excel used 1.5 GB to open this file.)

In the Python world we use xlsxwriter to generate Excel files, which has a few options to reduce memory, including flushing rows to disk as they are written. There are trade-offs, but perhaps something like that could be done with axlsx as well?

@AaronMegaphone

+1 for a solution/update for this.

@sudoremo

+1 for a solution

@altjx

altjx commented Jan 19, 2016

First of all, thank you for this wonderful gem. I'm also having this problem when trying to generate 30k rows of text. As an alternative (and as opposed to increasing memory), does Axlsx have the capability to append data to an existing spreadsheet or join two of them without exhausting memory?

I'd at least be able to write multiple files and join them together at the end.

@aalvarado

Maybe there could be an option to use a temp file instead of filling a StringIO object; once writing is finished, everything can be placed into the zip file. Thoughts?
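
As a sketch of that idea (not axlsx's actual internals; assumes the rubyzip gem and writes placeholder row XML), the worksheet could be spooled to a Tempfile and then added to the archive from disk, so the full sheet never lives in a single Ruby string:

require 'tempfile'
require 'zip'

sheet_xml = Tempfile.new(['sheet1', '.xml'])
sheet_xml.write('<worksheet><sheetData>')
10_000.times { |i| sheet_xml.write(%(<row r="#{i + 1}"></row>)) }
sheet_xml.write('</sheetData></worksheet>')
sheet_xml.close

Zip::File.open('tmp/out.zip', Zip::File::CREATE) do |zip|
  # zip.add copies the entry from the file on disk, not from memory
  zip.add('xl/worksheets/sheet1.xml', sheet_xml.path)
end
sheet_xml.unlink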

@jeremywadsack

The XLSX file isn't a single file. It's a collection of files that are zipped together, which I think makes this more complicated.

The python library xlsxwriter has a feature like this:

The optimization works by flushing each row after a subsequent row is written. In this way the largest amount of data held in memory for a worksheet is the amount of data required to hold a single row of data.

This does introduce limits in how you can use it:

Since each new row flushes the previous row, data must be written in sequential row order when 'constant_memory' mode is on:

Perhaps something like this could be implemented in axlsx, though?
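
Purely as a thought experiment, an opt-in mode might look like the sketch below; the constant_memory option here is hypothetical and does not exist in axlsx, and xlsxwriter's restriction would carry over (rows written in order, no going back):

Axlsx::Package.new(:constant_memory => true) do |p| # hypothetical option
  p.workbook.add_worksheet(:name => "Big") do |sheet|
    100_000.times do |i|
      # in this mode, adding row i would flush row i - 1 to disk
      sheet.add_row ["row #{i}"]
    end
  end
  p.serialize('tmp/big.xlsx')
end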

@altjx

altjx commented Jan 19, 2016

I haven't taken a look at it just yet, but I'm assuming it's just like a Word doc... a bunch of XML files zipped together. However, with Word docs you can indeed extract certain contents from the word/document.xml file, and I've done this to pull templates and insert data into other Word docs (same process).

In the meantime, I'm going to try working on my own solution. I haven't seen much feedback from guilleva here, so I'm not really expecting a fix anytime soon.

@aalvarado

The XLSX file isn't a single file. It's a collection of files that are zipped together. Which I think makes this more complicated.

But can the file that has the actual rows still be written sequentially?

Looking inside an xlsx file, I can see that xl/_rel/sharedStrings.xml has the sheet data in it.
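
For what it's worth, the row data actually lives in xl/worksheets/sheetN.xml; xl/sharedStrings.xml is a deduplicated string table that cells reference by index. A quick way to inspect any existing .xlsx (assuming the rubyzip gem):

require 'zip'

Zip::File.open('tmp/test.xlsx') do |zip|
  # lists every part of the package; look for xl/worksheets/sheet1.xml
  zip.each { |entry| puts "#{entry.name} (#{entry.size} bytes)" }
end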

@pol0nium

+1, this is a huge issue for me. I'd be happy to hear about possible workarounds.

@Wenzel

Wenzel commented Jul 11, 2016

+1, big issue for me too.

@pol0nium

Would it help if I added a bounty via Bountysource? @randym, what do you think?

@jeremywadsack

#352 is a similar request to append rows to an existing file.

@Paxa

Paxa commented Apr 23, 2017

I've been struggling with the same issues; when generating >100k rows, our servers started to use swap and became very slow.

I solved it by making my own Excel library, https://github.com/paxa/fast_excel. It works 3-4 times faster and consumes 5-20 MB of RAM regardless of data size.
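
A minimal sketch of the original example on top of fast_excel, based on its README at the time (check the repo for the current API); constant_memory is what keeps usage flat, with the same write-rows-in-order restriction discussed above:

require 'fast_excel'

workbook = FastExcel.open('tmp/test.xlsx', constant_memory: true)
worksheet = workbook.add_worksheet('Test')
11_000.times { worksheet.append_row(['test data'] * 600) }
workbook.close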

@cseelus

cseelus commented Nov 28, 2017

The problem with these memory issues we face using axlsx via the to_spreadsheet gem is that the memory isn't freed after xlsx creation. This means we have to manually restart our worker dyno for now and also limit the maximum number of rows per Excel sheet.

@Paxa Nice work. Depending on the answer from the maintainer of the to_spreadsheet gem (which allows a Rails app to render Excel files using existing slim/haml/erb/etc. views), I'll see if fast_excel could be used as an alternative to axlsx.

@crystianwendel

So, @cseelus, I'm having issues with this lib's memory usage too. What did you do: switch to @Paxa's lib, or stay with to_spreadsheet?

Thank you!

@cseelus

cseelus commented Mar 25, 2018

For now we limit the number of rows per xlsx file to mitigate the memory problems when using the to_spreadsheet gem.

@barttenbrinke

barttenbrinke commented Apr 4, 2018

According to my basic memory-profiling skills, the main issue is here: https://github.com/randym/axlsx/blob/master/lib/axlsx/workbook/worksheet/row.rb#L156. This is never released, expanding to up to 3 GB in my test case.

retained memory by location
-----------------------------------
   3643008  /usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/axlsx-776037c0fc79/lib/axlsx/workbook/worksheet/row.rb:156
    654864  /usr/local/var/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/bundler/gems/axlsx-776037c0fc79/lib/axlsx/util/simple_typed_list.rb:20
    497029  my_report.rb:146

@simi

simi commented Apr 5, 2018

@barttenbrinke can you share your code to reproduce and measure this problem?

@comictvn

Has anyone fixed this issue yet?

@barttenbrinke

@simi: https://github.com/SamSaffron/memory_profiler. @comictvn: This will not be an easy fix. Seeing that row.rb has not been touched in over a year, and that this ticket is from 2014, I would not get your hopes up.
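
For reference, a retained-memory report like the one above can be produced with the memory_profiler gem; a sketch (adjust the row and column counts to match your workload):

require 'memory_profiler'
require 'axlsx'

report = MemoryProfiler.report do
  Axlsx::Package.new do |p|
    p.workbook.add_worksheet(:name => "Test") do |sheet|
      10_000.times { sheet.add_row ["test data"] * 60 }
    end
    p.serialize('tmp/test.xlsx')
  end
end

# prints allocated and retained memory grouped by gem, file, and location
report.pretty_print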

@jeremywadsack

Well, yes, @jurriaan said this in 2014: "the fact is that at the moment axlsx allocates far too many objects, which causes the heap to grow very large."

I think there are two approaches here, neither of which is likely to be "easy":

  1. Significantly reduce the size of the Cell objects.
  2. Flush rows to disk as they are created.

The problem with the first approach is that we might generate a large Excel file with 20+ tabs, each with five million rows (I know Excel doesn't support that many rows, but the file format does). I think you'd need to reduce the size of the Cell object to under 50 bytes to make a significant difference.

The problem with the second approach is that it means the developer would have to write rows in sequential order. That's a change to the interface, so it would need to be "opt-in".
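
As a rough check on the first approach, ObjectSpace can report the shallow size of a single Cell (shallow only: it excludes the strings and style data the cell references, so the real per-cell cost is higher):

require 'objspace'
require 'axlsx'

package = Axlsx::Package.new
sheet = package.workbook.add_worksheet(:name => "Test")
row = sheet.add_row(["test data"])

# shallow byte size of one Axlsx::Cell instance
puts ObjectSpace.memsize_of(row.cells.first)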

@iamtheschmitzer

👍 for solution 2

@mkhoa1412

+1, this is a huge issue for me. I'd be happy to hear about possible workarounds.

@janklimo

We've been very happy with axlsx, but the memory usage really started to hit us. This may not apply to everyone, but if all you need is simple XLSX-formatted output, I'd look into rubyXL. Here's a benchmark showing that its retained memory is constant, while memory retained by axlsx grows linearly with the number of rows/columns.
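
For comparison, a minimal rubyXL version of the original script might look like this (a sketch; check rubyXL's docs for the current API):

require 'rubyXL'

workbook = RubyXL::Workbook.new
worksheet = workbook[0] # a new workbook starts with one sheet
11_000.times do |r|
  600.times { |c| worksheet.add_cell(r, c, "test data") }
end
workbook.write('tmp/test.xlsx')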

@gdimitris

@randym any update on this?
