Make as_data_frame a generic so can work on matrices? #876

Closed
jennybc opened this issue Jan 8, 2015 · 11 comments

@jennybc
Member

jennybc commented Jan 8, 2015

I love love love data_frame(). I wish I could get that behaviour when converting an existing matrix to a data.frame.

> foo <- matrix(LETTERS[1:6], nrow = 3)
> data_frame(foo)      # nope
Error: data_frames can not contain data.frames, matrices or arrays
> data.frame(foo)      # this works, BTW
  X1 X2
1  A  D
2  B  E
3  C  F
> as.data.frame(foo)   # but this is probably better
  V1 V2
1  A  D
2  B  E
3  C  F
> str(data.frame(foo)) # but the factor conversion is annoying
'data.frame':   3 obs. of  2 variables:
 $ X1: Factor w/ 3 levels "A","B","C": 1 2 3
 $ X2: Factor w/ 3 levels "D","E","F": 1 2 3


> ## what I WISH I could do -- warning: I made this up!
> (foo_dream <- as_data_frame(foo))
Source: local data frame [3 x 2]
            V1           V2
1            A            D
2            B            E
3            C            F
> str(foo_dream)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   3 obs. of  2 variables:
 $ V1: chr  "A" "B" "C"
 $ V2: chr  "D" "E" "F"
@romainfrancois
Member

We already have as_data_frame but it works on lists, not matrices:

> as_data_frame( list( a = 1:10, b = letters[1:10] ) )
Source: local data frame [10 x 2]

    a b
1   1 a
2   2 b
3   3 c
4   4 d
5   5 e
6   6 f
7   7 g
8   8 h
9   9 i
10 10 j
> as_data_frame( list( a = 1:10, b = letters[1:10] ) ) %>% str
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   10 obs. of  2 variables:
 $ a: int  1 2 3 4 5 6 7 8 9 10
 $ b: chr  "a" "b" "c" "d" ...

@hadley
Member

hadley commented Jan 9, 2015

Maybe we need to make as_data_frame() generic? But that would make it slower. Alternatively, we could just add a second code path for matrices.
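
The generic option can be sketched in base R as follows. This is only an illustration under stated assumptions: `to_df` is a hypothetical name chosen to avoid clashing with the real as_data_frame(), and the methods simply delegate to as.data.frame() rather than to any dplyr internals.

```r
# Hypothetical generic; `to_df` is an illustrative name, not a dplyr function.
to_df <- function(x, ...) {
  UseMethod("to_df")
}

# List method: each list element becomes a column; strings stay character.
to_df.list <- function(x, ...) {
  as.data.frame(x, stringsAsFactors = FALSE)
}

# Matrix method: each matrix column becomes a data frame column (V1, V2, ...
# when the matrix has no column names); strings stay character.
to_df.matrix <- function(x, ...) {
  as.data.frame(x, stringsAsFactors = FALSE)
}

foo <- matrix(LETTERS[1:6], nrow = 3)
str(to_df(foo))
str(to_df(list(a = 1:3, b = letters[1:3])))
```

With a generic, supporting another input type means adding one more method instead of growing an if/else chain inside a single function; the cost is one S3 dispatch per call.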

@jennybc
Member Author

jennybc commented Jan 9, 2015

I have amused myself with this workaround now that I know about existing as_data_frame():

> plyr::alply(foo, 2) %>% as_data_frame
Source: local data frame [3 x 2]

  1 2
1 A D
2 B E
3 C F

@jennybc jennybc changed the title from "as_data_frame would be nice" to "Make as_data_frame a generic so can work on matrices?" Jan 17, 2015
@klmr

klmr commented Feb 3, 2015

To chime in, I do the following all over the place:

result = some_function_call() %>%
    as.data.frame() %>%
    as_data_frame()

Furthermore, I’m not sure I understand the need for the function tbl_df, which does just one additional check. So I suggest merging the current functionality of as_data_frame into tbl_df, and providing a more helpful as_data_frame.

@davharris

This issue was referenced on @jennybc's Twitter account today, so I thought now might be a good time to chime in.

I also use @klmr's pattern very frequently (except my code usually looks more like as.data.frame(stringsAsFactors = FALSE) %>% tbl_df).

I'm not sure I understand @hadley's concern about speed. I thought that generics only took a few extra thousandths of a second. Is as_data_frame called often enough that this extra time is likely to matter?

If speed is still a concern, then I'd also be nearly as happy with a function like this one:

tbl_df_generic = function(x){
    x %>% as.data.frame(stringsAsFactors = FALSE) %>% tbl_df()
}

It would add a small amount of cognitive overhead compared to just using a generic tbl_df, but it would still be way less overhead than needing to remember stringsAsFactors = FALSE.

@hadley
Member

hadley commented May 12, 2015

@davharris I'm concerned about speed because sometimes I apply as_data_frame() to a list that contains >100,000 things. Some of that might go away if rbind_rows() could work with lists directly, but I need to think it through and do some benchmarking.

@davharris

@hadley Okay, I hadn't realized it would be called that many times.

Here's some quick benchmarking that might be useful.

# a new generic for as_data_frame
as_data_frame_generic = function(x){
  UseMethod("as_data_frame_generic")
}

# a list method for the new generic
as_data_frame_generic.list = function(x){
  as_data_frame(x)
}

# A small list to use as an example.
# If the list is small, then dispatch will take proportionately longer
my_list = list(
  a = letters,
  A = LETTERS,
  one = 1:26,
  two = 26:1
)

library(microbenchmark)

microbenchmark(
  no_dispatch = as_data_frame(my_list),
  dispatch = as_data_frame_generic(my_list),
  times = 5E5
)

Here are the results on my laptop:

Unit: microseconds
        expr    min     lq     mean median     uq      max neval
 no_dispatch 40.307 44.849 51.16531 46.135 51.646 88714.88 5e+05
    dispatch 42.263 46.865 53.54021 48.212 54.082 84428.73 5e+05

It looks like this dispatch implementation increases the computation time on half a million lists from 25.6 seconds to 26.8 seconds, or about 2.4 microseconds per call.

It probably makes sense to do more extensive benchmarking (for example, would things slow down if we added more methods?). So far, things look pretty good, though.

@hadley
Member

hadley commented May 13, 2015

@davharris thanks - that seems reasonable and suggests that I shouldn't worry about it. I'll make the change when I'm next working on dplyr. An efficient implementation of as_data_frame.matrix() in C++ should be reasonably simple.
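
For illustration, the conversion a matrix method has to perform can be sketched in plain R. This is not the C++ implementation mentioned above; `matrix_to_df` is a hypothetical helper, and the V1, V2, ... fallback names are an assumption borrowed from as.data.frame().

```r
# Hypothetical helper; a plain-R sketch, not the dplyr/tidyr C++ code.
matrix_to_df <- function(x) {
  stopifnot(is.matrix(x))

  # Split the matrix into a list with one vector per column,
  # dropping any names inherited from the matrix row names.
  cols <- lapply(seq_len(ncol(x)), function(j) unname(x[, j]))

  # Reuse existing column names, or fall back to V1, V2, ...
  nms <- colnames(x)
  if (is.null(nms)) {
    nms <- paste0("V", seq_len(ncol(x)))
  }
  names(cols) <- nms

  # Assemble the data frame directly from the column list.
  # Columns keep the matrix's type, so characters are never turned into factors.
  structure(
    cols,
    class = "data.frame",
    row.names = .set_row_names(nrow(x))
  )
}

foo <- matrix(LETTERS[1:6], nrow = 3)
str(matrix_to_df(foo))
```

Because every column of a matrix has the same type, the method only needs to slice the underlying vector into ncol(x) pieces and attach names, which is why a compiled version should be straightforward.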

@davharris

@hadley Perfect. Thanks.

@hadley hadley added this to the 0.5 milestone May 19, 2015
@jennybc
Member Author

jennybc commented Sep 11, 2015

Déjà vu: trying to turn a character matrix into a tbl and rediscovered my own issue.

Bump.

library(dplyr)
ip <- installed.packages() %>% as.tbl()
#> Error in UseMethod("as.tbl"): no applicable method for 'as.tbl' applied to an object of class "c('matrix', 'character')"
ip <- installed.packages() %>% as_data_frame()
#> Error: is.list(x) is not TRUE

@hadley
Member

hadley commented Sep 11, 2015

I actually just wrote an efficient implementation for this for tidyr - I'll hoist it up into dplyr.

@hadley hadley closed this as completed in 9a23e86 Oct 29, 2015
krlmlr pushed a commit to krlmlr/dplyr that referenced this issue Mar 2, 2016
Add method for list (existing), data frame (trivial) and matrix (from tidyr).

Fixes tidyverse#876
krlmlr pushed a commit to tidyverse/tibble that referenced this issue Mar 22, 2016
- Initial CRAN release

- Extracted from `dplyr` 0.4.3

- Exported functions:
    - `tbl_df()`
    - `as_data_frame()`
    - `data_frame()`, `data_frame_()`
    - `frame_data()`, `tibble()`
    - `glimpse()`
    - `trunc_mat()`, `knit_print.trunc_mat()`
    - `type_sum()`
    - New `lst()` and `lst_()` create lists in the same way that
      `data_frame()` and `data_frame_()` create data frames (tidyverse/dplyr#1290).
      `lst(NULL)` doesn't raise an error (#17, @jennybc), but always
      uses deparsed expression as name (even for `NULL`).
    - New `add_row()` makes it easy to add a new row to data frame
      (tidyverse/dplyr#1021).
    - New `rownames_to_column()` and `column_to_rownames()` (#11, @zhilongjia).
    - New `has_rownames()` and `remove_rownames()` (#44).
    - New `repair_names()` fixes missing and duplicate names (#10, #15,
      @r2evans).
    - New `is_vector_s3()`.

- Features
    - New `as_data_frame.table()` with argument `n` to control name of count
      column (#22, #23).
    - Use `tibble` prefix for options (#13, #36).
    - `glimpse()` now (invisibly) returns its argument (tidyverse/dplyr#1570). It
      is now a generic, the default method dispatches to `str()`
      (tidyverse/dplyr#1325).  The default width is obtained from the
      `tibble.width` option (#35, #56).
    - `as_data_frame()` is now an S3 generic with methods for lists (the old
      `as_data_frame()`), data frames (trivial), matrices (with efficient
      C++ implementation) (tidyverse/dplyr#876), and `NULL` (returns a 0-row
      0-column data frame) (#17, @jennybc).
    - Non-scalar input to `frame_data()` and `tibble()` (including lists)
      creates list-valued columns (#7). These functions return 0-row but n-col
      data frame if no data.

- Bug fixes
    - `frame_data()` properly constructs rectangular tables (tidyverse/dplyr#1377,
      @kevinushey).

- Minor modifications
    - Uses `setOldClass(c("tbl_df", "tbl", "data.frame"))` to help with S4
      (tidyverse/dplyr#969).
    - `tbl_df()` automatically generates column names (tidyverse/dplyr#1606).
    - `tbl_df`s gain `$` and `[[` methods that are ~5x faster than the defaults,
      never do partial matching (tidyverse/dplyr#1504), and throw an error if the
      variable does not exist.  `[[.tbl_df()` falls back to regular subsetting
      when used with anything other than a single string (#29).
      `base::getElement()` now works with tibbles (#9).
    - `all_equal()` allows to compare data frames ignoring row and column order,
      and optionally ignoring minor differences in type (e.g. int vs. double)
      (tidyverse/dplyr#821).  Used by `all.equal()` for tibbles.  (This package
      contains a pure R implementation of `all_equal()`, the `dplyr` code has
      identical behavior but is written in C++ and thus faster.)
    - The internals of `data_frame()` and `as_data_frame()` have been aligned,
      so `as_data_frame()` will now automatically recycle length-1 vectors.
      Both functions give more informative error messages if you are attempting
      to create an invalid data frame.  You can no longer create a data frame
      with duplicated names (tidyverse/dplyr#820).  Both functions now check that
      you don't have any `POSIXlt` columns, and tell you to use `POSIXct` if you
      do (tidyverse/dplyr#813).  `data_frame(NULL)` raises error "must be a 1d
      atomic vector or list".
    - `trunc_mat()` and `print.tbl_df()` are considerably faster if you have
      very wide data frames.  They will now also only list the first 100
      additional variables not already on screen - control this with the new
      `n_extra` parameter to `print()` (tidyverse/dplyr#1161).  The type of list
      columns is printed correctly (tidyverse/dplyr#1379).  The `width` argument is
      used also for 0-row or 0-column data frames (#18).
    - When used in list-columns, S4 objects only print the class name rather
      than the full class hierarchy (#33).
    - Add test that `[.tbl_df()` does not change class (#41, @jennybc).  Improve
      `[.tbl_df()` error message.

- Documentation
    - Update README, with edits (#52, @bhive01) and enhancements (#54,
      @jennybc).
    - `vignette("tibble")` describes the difference between tbl_dfs and
      regular data frames (tidyverse/dplyr#1468).

- Code quality
    - Test using new-style Travis-CI and AppVeyor. Full test coverage (#24,
      #53). Regression tests load known output from file (#49).
    - Renamed `obj_type()` to `obj_sum()`, improvements, better integration with
     `type_sum()`.
    - Internal cleanup.
@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018