readdlm and fixed column data #5391

KenziTrader · 2014-01-14T11:17:43Z

I have a file with fixed column data. The first few lines are:

5 7 35 1.400 .400 .657 2.33 14 23 6 1
6 7 42 1.167 .429 .881 3.60 18 37 5 1
6 18 108 3.000 .287 .741 4.43 31 80 7 1

If I use

page_blocks = readdlm("page-blocks.data")

it reads every row as a single string.

If I use

page_blocks = readdlm("page-blocks.data", Float64)

it reads the data as a long array of NaNs.

If I use

page_blocks = readdlm("page-blocks.data",(Int64, Int64,Int64,Float64,Float64,Float64,Float64,Int64,Int64,Int64, Int64))

I get the error:

ERROR: file entry " 5 7 35 1.400 .400 .657 2.33 14 23 6 1" cannot be converted to (Int64,Int64,Int64,Float64,Float64,Float64,Float64,Int64,Int64,Int64,Int64)
in error at error.jl:21
in dlm_fill at datafmt.jl:135
in readdlm_string at datafmt.jl:82
in readdlm_auto at datafmt.jl:50
in readdlm at datafmt.jl:42
in readdlm at datafmt.jl:35

How can I read fixed column data with readdlm?

acroy · 2014-01-14T11:35:38Z

You need to specify a delimiter as the second argument to readdlm; otherwise char(0xfffffffe) is used as the delimiter for some (probably good) reason.

readdlm("../../test.dat",' ')
3x11 Array{Float64,2}:
 5.0   7.0   35.0  1.4    0.4    0.657  2.33  14.0  23.0  6.0  1.0
 6.0   7.0   42.0  1.167  0.429  0.881  3.6   18.0  37.0  5.0  1.0
 6.0  18.0  108.0  3.0    0.287  0.741  4.43  31.0  80.0  7.0  1.0

seems to work fine.

KenziTrader · 2014-01-14T12:20:42Z

If you specify a delimiter, the padding spaces inside the fixed-width fields are interpreted as separate (empty) columns, so the rows end up with different numbers of columns.
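To illustrate the point, here is a small sketch using plain `split` rather than readdlm itself. The two sample rows are made up (padded so the columns line up) and are not lines from page-blocks.data.

```julia
# Two hypothetical rows, padded with spaces so that the columns align.
rows = ["5 7  35", "6 18 108"]

# Splitting on a single space keeps the empty fields created by padding,
# so the number of fields differs from row to row.
counts = [length(split(r, ' ')) for r in rows]

println(counts)
```

The first row yields an extra empty field because of its double space, which is the kind of mismatch a single-character delimiter runs into.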

The original file is:
https://archive.ics.uci.edu/ml/machine-learning-databases/page-blocks/page-blocks.data.Z

I get a BoundsError:

page_blocks = readdlm("page-blocks.data", ' ')

ERROR: BoundsError()
in getindex at ascii.jl:11
in dlm_fill at datafmt.jl:116
in dlm_fill at datafmt.jl:126
in readdlm_string at datafmt.jl:82
in readdlm_auto at datafmt.jl:50
in readdlm at datafmt.jl:41
in readdlm at datafmt.jl:39

acroy · 2014-01-14T13:48:01Z

I see. It seems readdlm is not able to handle this case. As a workaround, could you use CSV format?

There is also readtable in the package DataFrames.jl. However, I just tried it with test2.dat containing

 5.0   7.0   35.0  1.4    0.4    0.657  2.33  14.0  23.0  6.0  1.0
 6.0   7.0   42.0  1.167  0.429  0.881  3.6   18.0  37.0  5.0  1.0
 6.0  18.0  108.0  3.0    0.287  0.741  4.43  31.0  80.0  7.0  1.0

which gave

julia> readtable("../../test2.dat", separator=' ', header = false)
ERROR: Saw 3 rows, 25 columns and 77 fields
 * Line 1 has 30 columns

 in error at error.jl:21
 in findcorruption at /Users/acr/.julia/DataFrames/src/io.jl:480
 in readtable! at /Users/acr/.julia/DataFrames/src/io.jl:526
 in readtable at /Users/acr/.julia/DataFrames/src/io.jl:595

Maybe someone else (cc: @johnmyleswhite) has an idea?

johnmyleswhite · 2014-01-14T14:59:35Z

We don't support fixed width fields yet. It's not that hard: I might even be able to finish a demo on the way to work today. But fixed width files have almost nothing in common with delimited files, so our existing infrastructure is only slightly usable.
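A minimal sketch of what fixed-width parsing could look like, assuming the field boundaries are known in advance. The helper name `parse_fixed`, the sample line, and the column ranges are hypothetical; they are not taken from page-blocks.data or from DataFrames.jl.

```julia
# Hypothetical helper: cut a line at known character ranges and parse each
# piece as a Float64. Assumes ASCII input, so byte indices equal characters.
function parse_fixed(line::AbstractString, ranges)
    return [parse(Float64, strip(line[r])) for r in ranges]
end

# Made-up layout: fields of width 3, 4, 5 and 6 characters.
line = "  5   7   35 1.400"
ranges = [1:3, 4:7, 8:12, 13:18]

values = parse_fixed(line, ranges)
println(values)
```

Unlike delimiter-based parsing, this never looks for a separator, which is why so little of the delimited-file machinery carries over.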

JeffBezanson · 2014-01-14T15:57:36Z

Maybe we could add an option to readdlm to skip empty columns; that might handle cases like this.
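The idea of skipping empty columns can be sketched outside readdlm with `split` and its `keepempty` keyword (available in current Julia releases). This is only an illustration of the approach, not the readdlm implementation, and the sample rows are made up.

```julia
# Hypothetical space-padded rows.
rows = ["5 7  35", "6 18 108"]

# Drop the empty fields produced by repeated spaces, then parse each row.
fields = [split(r, ' '; keepempty=false) for r in rows]
parsed = [parse.(Float64, f) for f in fields]

# Stack the per-row vectors into a 2x3 matrix.
data = permutedims(reduce(hcat, parsed))
println(data)
```

This works for padded data as long as no field is genuinely empty; a truly blank fixed-width field would still shift the columns.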

KenziTrader · 2014-01-14T16:50:38Z

Well, maybe readdlm is not appropriate for reading fixed-width data, but I just want to read my data. For this file, I converted it into a comma-separated file and was then able to read it. It would be nice to be able to read it without the conversion.
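For reference, a conversion of this kind could be sketched as follows. The function name `to_csv` is hypothetical, the input is an in-memory sample rather than the real file, and the sketch assumes no field is blank.

```julia
# Hypothetical helper: rewrite whitespace-padded lines as comma-separated lines.
function to_csv(input::IO, output::IO)
    for line in eachline(input)
        # split with no delimiter splits on runs of whitespace and drops
        # empty fields, so leading padding disappears as well.
        println(output, join(split(line), ','))
    end
end

src = IOBuffer(" 5   7   35\n 6  18  108\n")
dst = IOBuffer()
to_csv(src, dst)
csv = String(take!(dst))
print(csv)
```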

johnmyleswhite · 2014-01-15T02:14:39Z

I've started the work of doing this in JuliaData/DataFrames.jl#475. I started with binary files, which were more useful to me. I'll get to text files soon.

tanmaykm · 2014-01-15T05:45:01Z

PR #5400 addresses the issue of default delimiters not being applied and the BoundsError when there are empty columns. With this patch, readdlm parses both of the example files mentioned here without errors. However, it still cannot read fixed-width format. @JeffBezanson's suggestion looks like a good option for fixed-width data.
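For readers on current Julia releases, where readdlm lives in the DelimitedFiles standard library: when no delimiter is given, columns are assumed to be separated by one or more whitespace characters, so space-padded numeric data can usually be read directly. The inline sample below is illustrative and is not the original file.

```julia
using DelimitedFiles

# Hypothetical space-padded sample, read from memory instead of a file.
io = IOBuffer("5   7   35\n6  18  108\n")

# With no delimiter argument, runs of whitespace act as one separator.
data = readdlm(io)
println(size(data))
```

This still relies on whitespace between fields, so it does not cover fixed-width files whose fields touch or are left blank.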


tanmaykm added a commit to tanmaykm/julia that referenced this issue Jan 15, 2014

address readdlm default delimiter and boundserror. ref: JuliaLang#5391

51f0f81

JeffBezanson added a commit that referenced this issue Jan 15, 2014

Merge pull request #5400 from tanmaykm/readcsv

address readdlm default delimiter and boundserror. ref: #5391

714fa07

tknopp pushed a commit to tknopp/julia that referenced this issue Jan 15, 2014

address readdlm default delimiter and boundserror. ref: JuliaLang#5391

3760540

tanmaykm added a commit to tanmaykm/julia that referenced this issue Jan 15, 2014

added readdlm option to ignore empty columns. updated tests and docs. fixes JuliaLang#5391

43d5f72

tanmaykm added a commit to tanmaykm/julia that referenced this issue Jan 16, 2014

address readdlm default delimiter and boundserror. ref: JuliaLang#5391

395f0ee

tanmaykm added a commit to tanmaykm/julia that referenced this issue Jan 16, 2014

for whitespace delimited files, adjoining delimiters are treated as single delimiter. fixed bug in handling empty columns. updated tests and docs. fixes JuliaLang#5391

ee61b45

tanmaykm added a commit to tanmaykm/julia that referenced this issue Jan 16, 2014

for whitespace delimited files, adjoining delimiters are treated as single delimiter. fixed bug in handling empty columns. updated tests and docs. fixes JuliaLang#5391

4edeef0

tanmaykm added a commit to tanmaykm/julia that referenced this issue Jan 16, 2014

for whitespace delimited files, adjoining delimiters are treated as single delimiter. fixed bug in handling empty columns. updated tests and docs. fixes JuliaLang#5391

bd1f380

tanmaykm added a commit to tanmaykm/julia that referenced this issue Jan 16, 2014

for whitespace delimited files, adjoining delimiters are treated as single delimiter. fixed bug in handling empty columns. updated tests and docs. fixes JuliaLang#5391

fd31b99

tanmaykm closed this as completed in def6c1d Jan 18, 2014

visr mentioned this issue Jan 29, 2014

readdlm trailing whitespace #5602

Closed
