This repository has been archived by the owner on Nov 19, 2021. It is now read-only.

Load runs out of memory #73

Open
istvan-fodor opened this issue May 22, 2019 · 3 comments

@istvan-fodor

giraffez version 2.0.24.2
Teradata Load Utility Version 16.20.00.09 64-Bit
Ubuntu 16.04
4 cores, 16 GB Ram

If I run a giraffez load operation on large CSVs, the process runs out of memory. The command line shows the usual progress message (Processed X Rows), and top shows memory usage slowly creeping up. After a while the process maxes out at around 16 GB and the giraffez process is killed. Is this expected, or is there a configuration option I am missing?

@istvan-fodor
Author

I observed the same issue when running through the BulkLoad API with put().

@hiker77

hiker77 commented May 29, 2019

My team ran into the same issue. I believe the leak happens when iterating over the elements of a tuple during conversion: PySequence_GetItem is called, which returns a new reference, but only the last element in the list ever gets deallocated. I ran a few tests using 1 million records, 2 integers per record; memory profiler snippets are below.
First Run

Line #    Mem usage    Increment   Line Contents
================================================
    10     22.5 MiB     22.5 MiB   @profile
    11                             def run():
    12     27.4 MiB      4.9 MiB       with giraffez.Cmd() as cmd:
    13     27.4 MiB      0.0 MiB           cmd.execute("drop table creativedb.tom_test;")
    14     27.4 MiB      0.0 MiB           cmd.execute("create table creativedb.tom_test(col integer, col1 integer);")
    15     27.4 MiB      0.0 MiB       with open("export.csv", 'rt') as f, giraffez.BulkLoad("creativedb.tom_test", cleanup=True) as ld:
    16     27.4 MiB      0.0 MiB           reader = csv.reader(f)
    17     93.6 MiB      0.3 MiB           for i, record in enumerate(reader):
    18     93.6 MiB      2.7 MiB               ld.put(record)
    19     93.6 MiB      0.0 MiB               print(f"\rRows Loaded: {i}", end='', flush=True)

And the second run, with the change:

Line #    Mem usage    Increment   Line Contents
================================================
    10     22.7 MiB     22.7 MiB   @profile
    11                             def run():
    12     27.3 MiB      4.6 MiB       with giraffez.Cmd() as cmd:
    13     27.3 MiB      0.0 MiB           cmd.execute("drop table creativedb.tom_test;")
    14     27.3 MiB      0.0 MiB           cmd.execute("create table creativedb.tom_test(col integer, col1 integer);")
    15     27.3 MiB      0.0 MiB       with open("export.csv", 'rt') as f, giraffez.BulkLoad("creativedb.tom_test", cleanup=True) as ld:
    16     27.3 MiB      0.0 MiB           reader = csv.reader(f)
    17     31.7 MiB      0.0 MiB           for i, record in enumerate(reader):
    18     31.7 MiB      2.9 MiB               ld.put(record)
    19     31.7 MiB      0.0 MiB               print(f"\rRows Loaded: {i}", end='', flush=True)

The change can be seen in the commit below. Figuring this one out was my first dive into C/C++ since school, so any feedback, or alternative or better solutions, would be appreciated.
b1b9a50

@istvan-fodor
Author

Thanks @hiker77!
