This repository has been archived by the owner on Nov 19, 2021. It is now read-only.

Load runs out of memory #73

Open
istvan-fodor opened this issue May 22, 2019 · 3 comments

@istvan-fodor

giraffez version 2.0.24.2
Teradata Load Utility Version 16.20.00.09 64-Bit
Ubuntu 16.04
4 cores, 16 GB Ram

If I run a giraffez load operation on large CSVs, the process runs out of memory. The command line shows the usual progress message (Processed X Rows), and top shows memory usage slowly creeping up. After a while the process maxes out at around 16 GB and the giraffez process is killed. Is this expected, or is there a configuration option I am missing?

@istvan-fodor
Author

I observed the same issue when running through the BulkLoad API with put().

@hiker77

hiker77 commented May 29, 2019

My team ran into the same issue. I believe the leak happens when iterating over the elements of a tuple during conversion: PySequence_GetItem is called, which returns a new reference, but only the last element in the list ever gets deallocated. I ran a few tests using 1 million records, 2 integers per record; memory profiler snippets are below.
First Run

Line #    Mem usage    Increment   Line Contents
================================================
    10     22.5 MiB     22.5 MiB   @profile
    11                             def run():
    12     27.4 MiB      4.9 MiB       with giraffez.Cmd() as cmd:
    13     27.4 MiB      0.0 MiB           cmd.execute("drop table creativedb.tom_test;")
    14     27.4 MiB      0.0 MiB           cmd.execute("create table creativedb.tom_test(col integer, col1 integer);")
    15     27.4 MiB      0.0 MiB       with open("export.csv", 'rt') as f, giraffez.BulkLoad("creativedb.tom_test", cleanup=True) as ld:
    16     27.4 MiB      0.0 MiB           reader = csv.reader(f)
    17     93.6 MiB      0.3 MiB           for i, record in enumerate(reader):
    18     93.6 MiB      2.7 MiB               ld.put(record)
    19     93.6 MiB      0.0 MiB               print(f"\rRows Loaded: {i}", end='', flush=True)

And the second run, with the change:

Line #    Mem usage    Increment   Line Contents
================================================
    10     22.7 MiB     22.7 MiB   @profile
    11                             def run():
    12     27.3 MiB      4.6 MiB       with giraffez.Cmd() as cmd:
    13     27.3 MiB      0.0 MiB           cmd.execute("drop table creativedb.tom_test;")
    14     27.3 MiB      0.0 MiB           cmd.execute("create table creativedb.tom_test(col integer, col1 integer);")
    15     27.3 MiB      0.0 MiB       with open("export.csv", 'rt') as f, giraffez.BulkLoad("creativedb.tom_test", cleanup=True) as ld:
    16     27.3 MiB      0.0 MiB           reader = csv.reader(f)
    17     31.7 MiB      0.0 MiB           for i, record in enumerate(reader):
    18     31.7 MiB      2.9 MiB               ld.put(record)
    19     31.7 MiB      0.0 MiB               print(f"\rRows Loaded: {i}", end='', flush=True)

The change can be seen in the commit below. Figuring this one out was my first dive into C/C++ since school, so any feedback, or alternative or better solutions, would be appreciated.
b1b9a50

@istvan-fodor
Author

Thanks @hiker77!
