Package for working with tabular data in Julia using DataFrames.
DataFrames.jl is now an installable package.
To install DataFrames.jl, use the following:
Pkg.add("DataFrames")
DataFrames.jl has one main module named DataFrames. You can load it as:
using DataFrames
- DataFrame for efficient tabular storage of two-dimensional data
- Minimized data copying
- Default columns can handle missing values (NA's) of any type
- PooledDataFrame for efficient storage of factor-like arrays for characters, integers, and other types
- Flexible indexing
- SubDataFrame for efficient subset referencing without copies
- Grouping operations inspired by plyr, pandas, and data.table
- Basic merge functionality
- stack and unstack for long/wide conversions
- Pipelining support (|) for many operations
- Several typical R-style functions, including head, tail, describe, unique, duplicated, with, within, and more
- Formula and design matrix implementation
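The long/wide conversion mentioned in the list above can be sketched as follows. This is only an illustration: it assumes a recent DataFrames.jl 1.x release, whose function signatures may differ from the development preview described here, and the table wide and its column names are hypothetical.

```julia
using DataFrames  # assumption: a recent 1.x release, not the preview documented here

# Hypothetical wide table: one row per subject, one column per measurement.
wide = DataFrame(id = [1, 2], height = [1.5, 1.8], weight = [60.0, 75.0])

# Wide -> long: one row per (id, measurement) pair.
# The result has the columns id, variable, and value.
long = stack(wide, [:height, :weight])

# Long -> wide again: id identifies rows, variable supplies the new
# column names, and value supplies the cell contents.
back = unstack(long, :id, :variable, :value)
```

The round trip returns a table with the same columns and values as wide, which makes it a convenient sanity check when reshaping data.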
Here's a minimal demo showing some grouping operations:
julia> using DataFrames
julia> d = DataFrame(quote # expressions are one way to create a DataFrame
           x = randn(10)
           y = randn(10)
           i = rand(1:3,10)
           j = rand(1:3,10)
       end);
julia> dump(d) # dump() is like R's str()
DataFrame 10 observations of 4 variables
x: DataArray{Float64,1}(10) [-0.22496343871037897,-0.4033933555989207,0.6027847717547058,0.06671669747901597]
y: DataArray{Float64,1}(10) [0.21904975091285417,-1.3275512477731726,2.266353546459277,-0.19840910239041679]
i: DataArray{Int64,1}(10) [2,1,3,1]
j: DataArray{Int64,1}(10) [3,2,1,2]
julia> head(d)
6x4 DataFrame:
             x         y  i  j
[1,] -0.224963   0.21905  2  3
[2,] -0.403393  -1.32755  1  2
[3,]  0.602785   2.26635  3  1
[4,] 0.0667167 -0.198409  1  2
[5,]   1.68303  -1.11183  1  3
[6,]  0.346034   1.68227  2  1
julia> d[1:3, ["x","y"]] # indexing is similar to R's
3x2 DataFrame
             x         y
[1,] -0.224963   0.21905
[2,] -0.403393  -1.32755
[3,]  0.602785   2.26635
julia> # Group on column i, and pipe (|) that result to an expression
julia> # that creates the column x_sum.
julia> groupby(d, "i") | :(x_sum = sum(x))
3x2 DataFrame
     i     x_sum
[1,] 1   2.06822
[2,] 2  -1.80867
[3,] 3  0.319517
julia> groupby(d, "i") | :sum # Another way to operate on a grouping
3x4 DataFrame
     i     x_sum     y_sum  j_sum
[1,] 1   2.06822  -2.73985      8
[2,] 2  -1.80867   1.83489      7
[3,] 3  0.319517   1.03072      2
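For readers on a recent DataFrames.jl release, the grouped sum from the demo might be written roughly as below. This is a hedged sketch, not the preview API shown above: it assumes the 1.x groupby and combine functions, and it uses a small fixed table (hypothetical values) in place of the random demo data so the result is reproducible.

```julia
using DataFrames  # assumption: a recent 1.x release

# Small deterministic table standing in for the random demo data.
d = DataFrame(x = [1.0, 2.0, 3.0, 4.0], i = [1, 2, 1, 2])

# Group on column i, then sum x within each group and
# store the result in a new column named x_sum.
sums = combine(groupby(d, :i), :x => sum => :x_sum)
```

The source => function => destination pair plays the role that the piped expression :(x_sum = sum(x)) plays in the demo.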
See demo/workflow_demo.jl for a basic demo of the parts of a Julian data workflow.
See demo/design_demo.jl for a more in-depth demo of DataFrame, its related types, and the rest of the library.
The Issues page lists a number of open problems and ideas for enhancements. Here are some particular enhancements under way or under discussion:
- Distributed DataFrames: issue 26
DataFrames fit well with Julia's syntax, but some features would improve the user experience, including keyword function arguments (Julia issue 485), "~" for easier expression syntax, and overloading "." for easier column access (df.colA). See here for a bit more information.
Please consider this a development preview. Many things work, but expect some rough edges. We hope that this can become a standard Julia package.