DataFrame read_table performance? #273

Closed
randyzwitch opened this issue Jun 3, 2013 · 17 comments

@randyzwitch
Contributor

Hey guys -

Just getting around to trying Julia (v 0.2-pre). Tried to read in a 50MB csv file on a 2013 quad-core MBP with 16GB of RAM using the following code:

require("DataFrames")
using DataFrames
df = read_table("filepath.csv")

Starting Julia and loading DataFrames uses < 100MB of RAM. Once I hit the read_table command, RAM usage shoots up to 3.4GB and 100% CPU (single core), and after 25 minutes, the csv file still hasn't loaded.

Is there something I'm missing in terms of the read_table function, a known bug, something else?

@johnmyleswhite
Contributor

Let me experiment with this. read_table still needs a ton of work, but this is beyond what I'd expect from its weaknesses.

@randyzwitch
Contributor Author

Thanks, John. The file is a mishmash of GUIDs, sparsely populated columns with missing data (flags, basically), User-Agent strings, 20-30 character strings, timestamps, ints...all data types, really. 80k rows by 100 cols.

@HarlanH
Contributor


Huh. Does it work if you truncate to the first 1,000 rows and try loading that? It'd be helpful to know whether the tokenizer/parser is breaking or the conversion to DataFrame is breaking...
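A minimal sketch of producing that truncated file from within Julia (modern syntax shown; bigfile.csv and head1000.csv are placeholder names, not files from this thread):

# Copy the header plus the first 1,000 data rows into a smaller file.
open("head1000.csv", "w") do out
    for (i, line) in enumerate(eachline("bigfile.csv"))
        println(out, line)      # eachline strips the trailing newline
        i >= 1_001 && break     # 1 header line + 1,000 data rows
    end
end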


@johnmyleswhite
Contributor

I don't know what the source of this is (I suspect GC), but I have a simple test case that demonstrates the problem:

using DataFrames

# Build an 80,000-row by 100-column DataFrame of random floats.
df = DataFrame()

for j in 1:100
    df[j] = randn(80_000)
end

# Round-trip it through CSV to reproduce the slowdown.
write_table("tmp.csv", df)

df2 = read_table("tmp.csv")

@johnmyleswhite
Contributor

I've been putting off speedups for read_table and write_table, but it's time to get this resolved. By Wednesday we'll have the full US Julia team locked up at MIT, so we can hopefully solve this before Friday.

@johnmyleswhite
Contributor

For context on why this is so bizarre: reading 5,000 rows by 100 columns takes 5 seconds. That's already too slow, but scaled linearly the 80,000-row case should take about 16x that (roughly 80 seconds), not 25+ minutes.
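A rough sketch of that kind of measurement (hedged; the tmp_$n.csv file names are hypothetical and assumed to have been written beforehand with write_table, as in the snippet above):

using DataFrames

# Time read_table at increasing row counts; under linear scaling,
# 80,000 rows should take roughly 16x the 5,000-row time.
for n in (5_000, 10_000, 20_000, 40_000, 80_000)
    t = @elapsed read_table("tmp_$n.csv")   # files assumed to exist
    println("$n rows: $t seconds")
end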

@randyzwitch
Contributor Author

It's a whole other level of bizarre, John. The master file (the one I reported the error for) is a .csv created from an Excel spreadsheet. R reads this file just fine.

Because I'm garbage at Unix, I used write.csv from R to create new files: 1k, 5k...50k...all. The "all.csv" file written from R gets read into Julia using about 1GB of memory and a bit of time (slow, but it completes in a few minutes).

Superficially comparing the files, the "Excel" version of the .csv is 49.6MB and the R version is 57MB. So R does something to clean up the file (fills in NAs?), after which it works in Julia.

@johnmyleswhite
Contributor

Strange. The CSV standard is absurdly permissive, so I'm not surprised that Excel does something wonky that we don't parse correctly.

But our performance is too slow to be acceptable any longer. I'll make a major push this week.

@ViralBShah
Contributor

Also see readcsv performance in JuliaLang/julia#3350

@johnmyleswhite
Contributor

@randyzwitch, any chance you could share your files for testing? I'm making some progress on this now and have managed to produce some code that does as well as R on some basic cases. (Unfortunately I'm turning off GC to do it.)
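For reference, suspending the collector around a read looked roughly like this in 0.2-era Julia (a sketch, not the actual patch; gc_disable/gc_enable were the calls of that era, later replaced by GC.enable):

# Trade memory growth for speed by pausing GC during the parse.
gc_disable()
df = read_table("tmp.csv")
gc_enable()    # turn collection back on once parsing is done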

@randyzwitch
Contributor Author

@johnmyleswhite I can provide some anonymized data, no problem. Do you have an FTP server? If not, I can provide an S3 bucket; just let me know a way to contact you so I can send the details.

@johnmyleswhite
Contributor

I don't think I have any convenient FTP server you can upload to. Something on S3 with info sent to [email protected] would be great.

@ViralBShah
Contributor

I have a large file in my home dir on julia.mit.edu that used to make read_table barf.

@johnmyleswhite
Contributor

@ViralBShah: Your file fails on the following line:

1472,167,Vijayanagar ,UAN5962592,Nagashree H,MANJUNATHA M,  "SANGAM" 189,23,F

That line is a real mess: it's got two spaces at the start of a field, quotation marks mid-field, and then another space inside a field. I'm tempted to say it's simply an invalid CSV that needs cleaning before use.

Note that R 2.15.2 segfaults on your file:

> read.csv("blrsouth.csv")
Error: segfault from C stack overflow
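A quick way to triage lines like that before parsing (a hedged sketch in modern Julia; the heuristic flags a quote character that neither starts a field nor ends one, so legitimate doubled quotes inside quoted fields will also be reported):

# Rough heuristic: a '"' with a non-comma, non-quote character before
# it and a non-comma character after it is probably mid-field quoting.
for (i, line) in enumerate(eachline("blrsouth.csv"))
    if occursin(r"[^,\"]\"[^,]", line)
        println("suspect quoting on line $i: $line")
    end
end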

@ViralBShah
Contributor

The old readcsv used to read it in Julia. The new readcsv detects invalid UTF-8 and barfs, but it seems to do fine with this weird line. Yes, the file certainly needs cleaning.

@ViralBShah
Contributor

I'll prepare a cleaned up version of this file.

@johnmyleswhite
Contributor

Closed by 664f792
