DataFrame read_table performance? #273

Closed
randyzwitch opened this issue Jun 3, 2013 · 17 comments

@randyzwitch
Contributor

Hey guys -

Just getting around to trying Julia (v 0.2-pre). Tried to read in a 50MB csv file on a 2013 quad-core MBP with 16GB of RAM using the following code:

require("DataFrames")
using DataFrames
df = read_table("filepath.csv")

Starting Julia and loading DataFrames uses < 100MB of RAM. Once I hit the read_table command, RAM usage shoots up to 3.4GB and 100% CPU (single core), and after 25 minutes, the csv file still hasn't loaded.

Is there something I'm missing in terms of the read_table function, a known bug, something else?

@johnmyleswhite
Contributor

Let me experiment with this. read_table still needs a ton of work, but this is beyond what I'd expect from its weaknesses.

@randyzwitch
Contributor Author

Thanks, John. The file is a mishmash of GUIDs, sparsely populated columns with missing data (flags, basically), User-Agent strings, 20-30 character strings, timestamps, ints...all data types, really. 80k rows by 100 cols.

@HarlanH
Contributor


Huh. Does it work if you truncate to the first 1,000 rows and try loading that? It'd be helpful to know whether the tokenizer/parser is breaking or the conversion to DataFrame is breaking...
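A minimal sketch of producing that truncated file from within Julia (modern syntax shown; bigfile.csv and head1000.csv are placeholder names, not files from this thread):

# Copy the header plus the first 1,000 data rows into a smaller file.
open("head1000.csv", "w") do out
    for (i, line) in enumerate(eachline("bigfile.csv"))
        println(out, line)      # eachline strips the trailing newline
        i >= 1_001 && break     # 1 header line + 1,000 data rows
    end
end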


@johnmyleswhite
Contributor

I don't know what the source of this is (I suspect GC), but I have a simple test case that demonstrates the problem:

using DataFrames

# Build an 80,000-row by 100-column DataFrame of random floats.
df = DataFrame()

for j in 1:100
    df[j] = randn(80_000)
end

# Round-trip it through CSV to reproduce the slowdown.
write_table("tmp.csv", df)

df2 = read_table("tmp.csv")

@johnmyleswhite
Contributor

I've been putting off speedups for read_table and write_table, but it's time to get this resolved. By Wednesday we'll have the full US Julia team locked up at MIT, so we can hopefully solve this before Friday.

@johnmyleswhite
Contributor

For context on why this is so bizarre: reading 5,000 rows by 100 columns takes 5 seconds. That's already too slow, but scaled linearly the 80,000-row case should take about 16x that (roughly 80 seconds), not 25+ minutes.
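A rough sketch of that kind of measurement (hedged; the tmp_$n.csv file names are hypothetical and assumed to have been written beforehand with write_table, as in the snippet above):

using DataFrames

# Time read_table at increasing row counts; under linear scaling,
# 80,000 rows should take roughly 16x the 5,000-row time.
for n in (5_000, 10_000, 20_000, 40_000, 80_000)
    t = @elapsed read_table("tmp_$n.csv")   # files assumed to exist
    println("$n rows: $t seconds")
end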

@randyzwitch
Contributor Author

It's a whole other level of bizarre, John. The master file (the one I reported the error for) is a .csv created from an Excel spreadsheet. R reads this file just fine.

Because I'm garbage at Unix, I used write.csv from R to create new files: 1k, 5k...50k...all. The "all.csv" file written from R gets read into Julia using about 1GB of memory and a bit of time (slow, but it completes in a few minutes).

Superficially comparing the files, the "Excel" version of the .csv is 49.6MB and the R version is 57MB. So R does something to clean up the file (fills in NAs?), after which it works in Julia.

@johnmyleswhite
Contributor

Strange. The CSV standard is absurdly permissive, so I'm not surprised that Excel does something wonky that we don't parse correctly.

But our performance is too slow to be acceptable any longer. I'll make a major push this week.

@ViralBShah
Contributor

Also see readcsv performance in JuliaLang/julia#3350

@johnmyleswhite
Contributor

@randyzwitch, any chance you could share your files for testing? I'm making some progress on this now and have managed to produce some code that does as well as R on some basic cases. (Unfortunately I'm turning off GC to do it.)
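For reference, suspending the collector around a read looked roughly like this in 0.2-era Julia (a sketch, not the actual patch; gc_disable/gc_enable were the calls of that era, later replaced by GC.enable):

# Trade memory growth for speed by pausing GC during the parse.
gc_disable()
df = read_table("tmp.csv")
gc_enable()    # turn collection back on once parsing is done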

@randyzwitch
Contributor Author

@johnmyleswhite I can provide some anonymized data, no problem. Do you have an FTP server? If not, I can provide an S3 bucket; just let me know a way to contact you so I can send the details.

@johnmyleswhite
Contributor

I don't think I have any convenient FTP server you can upload to. Something on S3 with info sent to [email protected] would be great.

@ViralBShah
Contributor

I have a large file in my home dir on julia.mit.edu that used to make read_table barf.

@johnmyleswhite
Contributor

@ViralBShah: Your file fails on the following line:

1472,167,Vijayanagar ,UAN5962592,Nagashree H,MANJUNATHA M,  "SANGAM" 189,23,F

That line is a real mess: it's got two spaces at the start of a field, quotation marks mid-field, and then another space inside a field. I'm tempted to say it's simply an invalid CSV that needs cleaning before use.

Note that R 2.15.2 segfaults on your file:

> read.csv("blrsouth.csv")
Error: segfault from C stack overflow
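A quick way to triage lines like that before parsing (a hedged sketch in modern Julia; the heuristic flags a quote character that neither starts a field nor ends one, so legitimate doubled quotes inside quoted fields will also be reported):

# Rough heuristic: a '"' with a non-comma, non-quote character before
# it and a non-comma character after it is probably mid-field quoting.
for (i, line) in enumerate(eachline("blrsouth.csv"))
    if occursin(r"[^,\"]\"[^,]", line)
        println("suspect quoting on line $i: $line")
    end
end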

@ViralBShah
Contributor

The old readcsv used to read it in Julia. The new readcsv detects invalid UTF-8 and barfs, but it seems to do fine with this weird line. Yes, the file certainly needs cleaning.

@ViralBShah
Contributor

I'll prepare a cleaned up version of this file.

@johnmyleswhite
Contributor

Closed by 664f792
