
cleanEHR #102

Closed · 13 tasks done

sinanshi opened this issue Feb 21, 2017 · 38 comments

@sinanshi

sinanshi commented Feb 21, 2017

Summary

  • What does this package do? (explain in 50 words or less):
    cleanEHR works with the CCHIC critical care patient record, the UK's first open-access, high-resolution, longitudinal critical care database. The package was created to address various data reliability and accessibility problems of the CCHIC database. It provides a unique platform that enables data manipulation, transformation, reduction, anonymisation, cleaning and validation.

  • Paste the full DESCRIPTION file inside a code block below:

Package: cleanEHR
Type: Package
Title: The Critical Care Clinical Data Processing Tools
Version: 0.1
Date: 2017-01-30
Author: Sinan Shi, David Pérez-Suárez, Steve Harris, Niall MacCallum, David Brealey, Mervyn Singer, James Hetherington
Maintainer: Sinan Shi <[email protected]>
Description: 
    A toolset to deal with the Critical Care Health Informatics Collaborative
    dataset. It is created to address various data reliability and accessibility
    problems of electronic healthcare records (EHR). It provides a unique
    platform which enables data manipulation, transformation, reduction,
    anonymisation, cleaning and validation.
Depends:
    R (>= 3.1.0)
License: GPL-3
LinkingTo: Rcpp
Suggests:
    testthat
Imports:
    data.table,
    XML,
    yaml,
    Rcpp,
    methods,
    knitr,
    ggplot2,
    pander, 
    stats,
    utils
VignetteBuilder: knitr
URL: https://github.com/CC-HIC/cleanEHR, http:https://www.hic.nihr.ac.uk
RoxygenNote: 5.0.1
Collate:
    'RcppExports.R'
    'ccRecord.R'
    'ccTable.R'
    'create2dclean.R'
    'data.quality.report.R'
    'deltaTime.R'
    'filter.categorical.R'
    'filter.missingness.R'
    'filter.range.R'
    'imputation.R'
    'pipeline.R'
    'reallocateTime.R'
    'selectTable.R'
    'sql_demographic.R'
    'stdid.R'
    'summary.R'
    'unique.spell.R'
    'utilities.R'
    'xml2ccdata.R'
    'zzz.R'
  • URL for the package (the development repository, not a stylized html page):
    https://github.com/CC-HIC/cleanEHR

  • Who is the target audience?
    Clinical/medical researchers

  • Are there other R packages that accomplish the same thing? If so, what is different about yours?
    No

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration with Travis CI and/or another service.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

@noamross
Contributor

noamross commented Feb 26, 2017

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Thank you for your submission!

I wanted to clarify one area before moving forward with review: the limitations on access to the CCHIC data set. We aim to facilitate access to open data, but we understand there are limitations in sharing personalized medical records. I am unsure of the difference between the data contained in the package, the "toy" data set linked to from the README, and the extent of additional data available to researchers. Could you explain? Also, what are the requirements for receiving access to the broader data set? Is it a matter of signing the agreement to protect patient data, or is the reason for research considered? (I note that the link from the README seems to go to a page for editing the access form, not the form itself.) Our reviewers will need access to the "toy" set, at least to review the package.

Some additional initial notes (changes not needed immediately, may be addressed during the review):

  • The name "cleanEHR" may be confusing because it could seem the package applies to electronic health records generally. It may make sense to change the name to something with the term CCHIC instead.

  • Here are outputs from automated testing using goodpractice::gp(). I'd also recommend using devtools::spell_check() to examine documentation.
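    For reference, these checks can be reproduced locally with something like the following (a sketch; it assumes the package source is the current working directory):

    library(goodpractice)
    gp(".")                      # static analysis plus R CMD check-style advice
    devtools::spell_check(".")   # spell-check the documentation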

── GP cleanEHR ──────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 60% of code lines are covered by test cases.

    R/ccRecord.R:117:NA
    R/ccRecord.R:118:NA
    R/ccRecord.R:145:NA
    R/ccRecord.R:153:NA
    R/ccRecord.R:246:NA
    ... and 540 more lines

  ✖ omit "Date" in DESCRIPTION. It is not required and it gets invalid quite often. A build date will be added to the
    package when you perform `R CMD build` on it.
  ✖ add a "BugReports" field to DESCRIPTION, and point it to a bug tracker. Many online code hosting services provide
    bug trackers for free, https://github.com, https://gitlab.com, etc.
  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and developers are used it and it is
    easier to read your code for them if you use '<-'.

    inst/pipeline/combine_data.r:5:6
    inst/pipeline/extract_data.r:5:6
    inst/pipeline/untimeit.r:5:6
    R/ccTable.R:202:19
    R/stdid.R:13:39
    ... and 2 more lines

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80
    characters wide. Try make your lines shorter than 80 characters

    inst/doc/tour.R:44:1
    inst/doc/tour.R:73:1
    inst/pipeline/convert.r:7:1
    R/ccRecord.R:205:1
    R/ccTable.R:144:1
    ... and 75 more lines

  ✖ avoid calling setwd(), it changes the global environment. If you need it, consider using on.exit() to restore the
    working directory.

    R/data.quality.report.R:40:9
    R/data.quality.report.R:55:26
    R/data.quality.report.R:58:26
    R/data.quality.report.R:60:13
    R/data.quality.report.R:64:39

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data. Consider
    using vapply() instead.

    R/data.quality.report.R:277:53
    R/data.quality.report.R:281:53
    R/data.quality.report.R:306:22
    R/filter.categorical.R:4:17
    R/imputation.R:8:16
    ... and 12 more lines

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They are error prone and
    result 1:0 if the expression on the right hand side is zero. Use seq_len() or seq_along() instead.

    R/unique.spell.R:13:31

  ✖ not import packages as a whole, as this can cause name clashes between the imported packages. Instead, import
    only the specific functions you need.
  ✖ fix this R CMD check NOTE: File LICENSE is not mentioned in the DESCRIPTION file.
  ✖ fix this R CMD check NOTE: Namespace in Imports field not imported from: ‘stats’ All declared Imports should be
    used.
  ✖ fix this R CMD check NOTE: Found the following hidden files and directories: .travis.yml These were most likely
    included in error. See section ‘Package structure’ in the ‘Writing R Extensions’ manual.
  ✖ avoid 'T' and 'F', as they are just variables which are set to the logicals 'TRUE' and 'FALSE' by default, but
    are not reserved words and hence can be overwritten by the user.  Hence, one should always use 'TRUE' and 'FALSE'
    for the logicals.

    R/ccRecord.R:NA:NA
    R/ccTable.R:NA:NA
    R/data.quality.report.R:NA:NA
    R/data.quality.report.R:NA:NA
    R/data.quality.report.R:NA:NA
    ... and 32 more lines

──────────────────────────────────────────────
Warning messages:
1: In check_encoding(self, private, NULL) :
  Consider adding an Encoding field to DESCRIPTION,

Reviewers:
Due date:

@noamross
Contributor

@sinanshi Just checking in, can you answer my questions above?

@sinanshi
Author

@noamross Apologies for the delay. Thank you for your comment. Regarding the data accessibility,

  1. The example dataset in inst/doc/sample_ccd.RData contains 30 episodes. This is for testing/review purposes only.
  2. The toy dataset is accessible more widely. You only need to sign the data sharing agreement to identify yourself and your institution, and to confirm that you will only be using the data for clinical research (in line with our research ethics permissions). In addition, you undertake to be respectful of the data: specifically not to pass it on, nor to attempt re-identification. The toy dataset contains ~1000 episodes, with identifiable variables removed and quasi-identifiable variables modified. The purpose of the toy dataset is for users to familiarise themselves with the data structure and to develop hypotheses or analyses. Although the physiology fields are unaltered, the analysis results cannot be directly used for publication.
  3. One needs to apply for external collaborator status in order to get access to the full anonymised data. We will carefully review the application based on, first, whether the applicant is affiliated to an appropriate academic institution and second, whether the research topic is ethical. Once the external collaborator status is acquired, we work together to produce a customised data extract. There are often anonymisation trade-offs whereby higher resolution data in one area must be compensated with stricter anonymisation elsewhere (e.g. we can't provide high fidelity geographical and temporal data). The full anonymised dataset currently contains about 20k episodes, and will grow by about 10k per year for now.
  4. To access the identifiable data, you would need to organise formal contracts with one of the founding biomedical research centres (Cambridge, Guys & St Thomas', Imperial, Oxford, or UCL), and you will need to undergo training to work with the UCL Identifiable Data Handling Solution.

I hope I have answered your question. Thank you for the suggestion of using gp() and spell_check(). Will run through these issues and modify accordingly.

@sinanshi
Author

@noamross Just checking whether we have answered the questions you posed previously?

@noamross
Contributor

My apologies, I am a bit behind on my queue here. Will get back to you in the next day or two.

@dpshelio

Hi @noamross - we would like to cite this library on a paper, do you have an estimation of when the review would be completed? Thanks!! 👍

@noamross
Contributor

noamross commented Apr 11, 2017

Hi there, I'm sorry this has dragged on. The editors have had a fair bit of discussion on this one. It's newer territory for us to be dealing with packages with these types of data limitations, but in the end I'm satisfied that its purpose is to expand data access, rather than just improve processing of a closed set. I'm fast-tracking to make up for the lost time and hope to have reviews to you faster than our usual three-week turnaround.

I have re-submitted my request for the toy dataset, which I believe goes to @docsteveharris. Please let me know if I can share this set with the reviewers I select or if they will need to submit this separately.

In the meantime, there remain several lingering things from our automated goodpractice::gp() tests that I suggest you address - it should make the rest of the review go more smoothly.

── GP cleanEHR ─────────────────────────────────
It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 67% of code lines are
    covered by test cases.

    R/ccRecord.R:117:NA
    R/ccRecord.R:118:NA
    R/ccRecord.R:145:NA
    R/ccRecord.R:153:NA
    R/ccRecord.R:246:NA
    ... and 441 more lines

  ✖ omit "Date" in DESCRIPTION. It is not required and it gets invalid quite often. A build date
    will be added to the package when you perform `R CMD build` on it.
  ✖ add a "BugReports" field to DESCRIPTION, and point it to a bug tracker. Many online code
    hosting services provide bug trackers for free, https://github.com, https://gitlab.com, etc.
  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and developers are
    used it and it is easier to read your code for them if you use '<-'.

    inst/pipeline/combine_data.r:5:6
    inst/pipeline/extract_data.r:5:6
    inst/pipeline/untimeit.r:5:6
    R/ccTable.R:202:19
    R/stdid.R:13:39
    ... and 2 more lines

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows
    that are about 80 characters wide. Try make your lines shorter than 80 characters

    inst/doc/tour.R:44:1
    inst/doc/tour.R:73:1
    inst/pipeline/convert.r:7:1
    R/ccRecord.R:205:1
    R/ccTable.R:144:1
    ... and 76 more lines

  ✖ avoid calling setwd(), it changes the global environment. If you need it, consider using
    on.exit() to restore the working directory.

    R/data.quality.report.R:40:9
    R/data.quality.report.R:55:26
    R/data.quality.report.R:58:26
    R/data.quality.report.R:60:13
    R/data.quality.report.R:64:39

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the
    input data. Consider using vapply() instead.

    R/data.quality.report.R:277:53
    R/data.quality.report.R:281:53
    R/data.quality.report.R:306:22
    R/filter.categorical.R:4:17
    R/imputation.R:8:16
    ... and 12 more lines

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They
    are error prone and result 1:0 if the expression on the right hand side is zero. Use
    seq_len() or seq_along() instead.

    R/unique.spell.R:13:31

  ✖ not import packages as a whole, as this can cause name clashes between the imported
    packages. Instead, import only the specific functions you need.
  ✖ fix this R CMD check NOTE: File LICENSE is not mentioned in the DESCRIPTION file.
  ✖ fix this R CMD check NOTE: Namespace in Imports field not imported from: ‘stats’ All
    declared Imports should be used.
  ✖ fix this R CMD check WARNING: Undocumented data sets: ‘icnarc_table’ All user-level objects
    in a package should have documentation entries. See chapter ‘Writing R documentation files’
    in the ‘Writing R Extensions’ manual.
  ✖ fix this R CMD check NOTE: Found the following hidden files and directories: .travis.yml
    These were most likely included in error. See section ‘Package structure’ in the ‘Writing R
    Extensions’ manual.  (You should use an .Rbuildignore file to ignore all these files)
  ✖ avoid 'T' and 'F', as they are just variables which are set to the logicals 'TRUE' and
    'FALSE' by default, but are not reserved words and hence can be overwritten by the user.
    Hence, one should always use 'TRUE' and 'FALSE' for the logicals.

    R/ccRecord.R:NA:NA
    R/ccTable.R:NA:NA
    R/data.quality.report.R:NA:NA
    R/data.quality.report.R:NA:NA
    R/data.quality.report.R:NA:NA
    ... and 32 more lines

───────────────────────────────────────
Warning messages:
1: In check_encoding(self, private, NULL) :
  Consider adding an Encoding field to DESCRIPTION,
Non-ASCII character(s) in Author
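
A few of the gp() points above, illustrated in isolation (a sketch, not code taken from cleanEHR):

# setwd(): save and restore the working directory, even on error
process_in_dir <- function(path) {
  old_wd <- getwd()
  on.exit(setwd(old_wd), add = TRUE)
  setwd(path)
  # ... work that relies on `path` being the working directory ...
}

# vapply() declares the expected return type, unlike sapply()
lens <- vapply(list(1:3, 1:5), length, integer(1))

# seq_len()/seq_along() handle the zero-length case; 1:nrow(df) would give 1:0
df <- data.frame(x = numeric(0))
for (i in seq_len(nrow(df))) print(df$x[i])

# spell out TRUE/FALSE; T and F are ordinary variables that can be reassigned
keep <- TRUE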

@noamross
Contributor

noamross commented Apr 18, 2017

Reviewers: @haozhu233 @aammd
Due Date: 2017-05-09

@noamross
Contributor

noamross commented May 2, 2017

Hi @haozhu233 and @aammd! A friendly reminder that your reviews are due in a week.

@haozhu233

haozhu233 commented May 8, 2017

@sinanshi @noamross Sorry for the delay. Please see my review below. I'm happy to discuss more about this package after this post. :)


Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • [?] Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Paper (for packages co-submitting to JOSS)

The package contains a paper.md with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:
4 hr


Review Comments

This package allows users to extract and clean a unique set of clinical data, the CC-HIC datasets. Since the original dataset contains multiple complicated forms of data, starting everything from scratch could be a very involved process, and I can imagine this package being very helpful for many useRs in that system. I personally think this package agrees with rOpenSci's fit policy and would be a nice addition to the list. However, I also want to note that maybe rOpenHealth fits this topic better?

I do have a few comments after trying out this package.

  • As @noamross mentioned regarding the package name, I would recommend changing the name to something like cchicr instead of cleanEHR, given that this package deals with only one particular form of electronic health record.
  • The rOpenSci packaging guide recommends using xml2 instead of XML for parsing XML files. It seems to me that most of the XML-related code in this package only involves parsing, so I would recommend using xml2 instead.
  • This package frequently uses a very OO-ish programming style to define objects and write functions. For example, after a ccTable is created, you can use cctable$export.csv("file.csv") to write its contents to a csv file (https://github.com/CC-HIC/cleanEHR/blob/master/R/ccTable.R#L206-L213). To me, due to my lack of understanding of the full scenario, it seems a bit like forcibly writing Python in R and is a little redundant. (I'm not saying it's bad practice; my concern is mainly the lack of documentation for these methods, as I note below.)
  • Also, it would be nice to see more documentation for these self-defined methods, or at least a table of available methods with brief explanations. In this case, those methods are effectively "exported functions" as well, even though they aren't exported officially. That's why I'm putting a ? mark on the checklist.
  • I would like to see more documentation about the pipeline. I feel the current documentation in R/pipeline.R and inst/pipeline isn't quite enough. It would be nice to see a README in inst/pipeline.
  • In the package vignette tour.Rmd, we see tb <- create.cctable(ccd, yaml.load(conf), freq=1). However, freq is actually the 2nd argument of that function, which might mislead some users in the future (see the short sketch after this list).
  • I feel the author could put more instructions in the package vignette.
  • Final point: I'm not sure whether it's a good idea, but I feel that building a Shiny app for these tasks would be very helpful, especially in helping users understand what kinds of functionality are available in this package.
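
To illustrate the argument-order point with a toy function (this is not the real create.cctable() signature; only the position of freq follows the comment above):

f <- function(rec, freq, conf) list(rec = rec, freq = freq, conf = conf)
# Named arguments are matched first, then the remaining positional ones,
# so "config" ends up in `conf`, not `freq`; the call works but is easy to misread.
f("ccd", "config", freq = 1)
# Naming the configuration argument (or reordering) makes the intent explicit:
f("ccd", conf = "config", freq = 1)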

@aammd

aammd commented May 10, 2017

This is an intricate and impressive package! It involves data.table, yaml, S4 classes and XML; it is the All Dressed chip of data-access packages! Unfortunately, I am not much of an expert in data.table, nor in the creation and documentation of S4 classes. I have mostly reviewed for usability, documentation and some coding best practices.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Paper (for packages co-submitting to JOSS)

The package contains a paper.md with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:


Review Comments

Review comments

  • Perhaps the biggest opportunity for improvement lies in the package vignette. Here are some thoughts that came to mind as I was reading through it.

    • A clearer description of exactly what kind of data this is, and what the data is useful for, would be very helpful. I get the sense that this package is aimed at users who are already familiar with the dataset; nevertheless, in my experience a reminder is still useful.
    • A clearer description of exactly how the data is being transformed. This could be achieved simply, e.g. by using knitr::kable(head(dataset_name)) to show just the first few lines. This would help add context to what is meant by "long" and "wide" format (which mean different things to different people).
    • The YAML code block in particular should be better documented. Which options are available? What does each part of this list do? A handy reference or "cheatsheet" would be a huge favour to all users.
  • I think the second-biggest opportunity is in the naming of the functions, methods and perhaps even the classes involved here.

    • The authors might consider using a prefix for every function in the package. Many successful packages use this strategy, which takes advantage of tab completion in text editors. Here, perhaps functions which operate on a certain class of object could be prefixed with that class name, e.g. ccd_episode_plot instead of episode.graph.
    • Some names and terminology have the potential to be confusing to users. For example sql.demographic.table has nothing to do with sql per se, but rather arranges tables such that SQL import would be possible.
    • Some names are probably only confusing for the programmers. There is a mix of cases used in function names (sql.demographic.table(ccd), extractIndexTable, unique_spell).
    • I noticed that UpperCamelCase is apparently the convention for S4 classes: link

At the risk of seeming like a meat-based goodpractice, I wanted to echo two comments made above:

  • it is better to use TRUE and FALSE, in place of T and F.
  • it is also safer to use vapply in place of sapply or unlist(lapply).
  • I wonder if some of the functions could be split apart. For example, episode.graph creates both a plot and a rearrangement of the data. Perhaps the plotting code could be moved to a plot() method for the ccRecord class.
  • I note that all of Rcpp is listed as an Import, but only the function evalCpp was imported directly. Is loading the entire package necessary?
  • The Vignette contains an example of cleaning data, based on instructions in a YAML code block: create.cctable(ccd, yaml.load(conf), freq=1). I find using the yaml.load function here a bit awkward, and difficult for users who are unfamiliar with yaml. It might be easiest to encourage users to write a short file, conf.yaml, and to point create.cctable at this filename rather than at an R object (a small sketch follows this list).
  • I feel like pander and ggplot should be listed in Suggests, rather than Imports. Then users could use the basic functionality of this package without necessarily installing these packages. This might also entail separating where these functions are used from the data-cleaning that sometimes precedes them.
  • Some documentation, e.g. in ccRecord, refers to S3 classes when in fact these are S4 classes.
  • There seem to be some vestigial R scripts and other files in inst/sql and inst/report; could these be deleted, or perhaps moved to a development branch?
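
A small sketch of the file-based alternative mentioned above (it assumes, as the authors confirm further down the thread, that create.cctable() accepts a path to a YAML file; conf is the YAML configuration string from the vignette):

writeLines(conf, "conf.yaml")
tb <- create.cctable(ccd, "conf.yaml", freq = 1)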

Warnings and automatic checks

  • I ran all the code in the Vignette, and found that sql.demographic.table(ccd) produced the following error for me (first sentence only): Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference.

  • I ran devtools::spell_check(), and found the following errors (among other, trivial "errors" like variable names etc.):

spelling errors:

  WORD              FOUND IN
anonymisation     description:5
charcter          data.quality.report.Rd:15
chuncks           update_database.Rd:15
conf              ccTable-class.Rd:39,45
datam             extractIndexTable.Rd:12
datat             extractIndexTable.Rd:11
demg              demg.distribution.Rd:5,16
demograhic        lenstay.Rd:10
dfilter           getfilter.Rd:5,15
doesn             selectTable.Rd:16
lenght            lenstay.Rd:5,20
organised         new.episode.Rd:21
structre          ccTable-class.Rd:10
Subseting         sub-sub-ccRecord-method.Rd:6,16
yaml              ccTable-class.Rd:54
YAML              ccTable-class.Rd:19, create.cctable.Rd:14, create2dclean.Rd:12,23
  • R CMD CHECK returned this warning:
 Undocumented data sets:
  ‘icnarc_table’
All user-level objects in a package should have documentation entries.
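
For the undocumented data set, the usual roxygen2 pattern is a block like the following in one of the R/ files ("icnarc_table" comes from the check output; the descriptive text is only a placeholder, not a claim about the table's contents):

#' ICNARC reference table.
#'
#' Placeholder description: say what the table contains and where it comes from.
#'
#' @format A data frame (describe the columns here).
"icnarc_table"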

@noamross
Contributor

noamross commented May 10, 2017

Thank you for your reviews, @aammd and @haozhu233! @sinanshi and @dpshelio, please let us know if you have questions about any of the points in the reviews or the automated checks. When you have made changes, please summarize them in a response to reviewers here (Example).

@sinanshi
Author

Thank you so much for your comments! Will have a look shortly.

@sckott
Contributor

sckott commented Jun 2, 2017

hey @aammd - est. hrs spent reviewing?

@noamross
Contributor

@sinanshi Any updates on making changes in response to reviewers?

@sinanshi
Author

Sorry for the delay. We are trying to modify the code based on the reviews. Will come back to you when the refactoring is done.

@noamross
Contributor

@sinanshi Any updates on your changes? Do you have an expected time frame?

@sinanshi
Author

@noamross sorry for taking such a long time to address the review. As we've received some very constructive reviews, we feel it is necessary to restructure our code slightly; that's why it is taking longer than we expected. I think we can finish it this week.

@sinanshi
Author

sinanshi commented Aug 27, 2017

Dear reviewers,

Thank you so much for your very helpful comments, and we are very sorry for taking such a long time to address them. We have restructured the code in several places, inspired by your comments. See https://github.com/CC-HIC/cleanEHR/tree/review

  • Pipeline removal: we would like to focus more on data cleaning and wrangling. CCHIC is just an example to demonstrate how we clean high-resolution, heterogeneous longitudinal data, and we have an ambition of making cleanEHR an extendable EHR data cleaning and wrangling platform. We feel the CCHIC-specific pipeline is no longer part of our main theme, so it has been removed.
  • Vignette: we've made two separate vignettes, one describing what the CCHIC data is, and another explaining the data structures and demonstrating the data cleaning process step by step, including how to write the YAML configuration.
  • Further documentation: previously, we only used docstrings to document the refclass functions, which was not sufficient, as they are the main components of this package. We have added extra documentation for the member functions.
  • Naming: more consistent naming of functions and classes.

To Reviewer 1 @haozhu233

I personally think this package agrees with rOpenSci's fit policy and is nice to be listed. However, I also want to note that maybe rOpenHealth fits this topic better?

We are not clear about the nature of rOpenHealth, since we didn't find much information online. Is it a subsidiary of rOpenSci? Can we be listed in both organisations? We would love to explore the possibilities here.

I do have a few comments after trying out this package. As @noamross mentioned, regarding the package name, I would recommend changing the name to cchicr or else instead of cleanEHR due to the fact that this is only one particular form of electronic health record.

The critical care data (CC-HIC) is one of the themes of the National Institute for Health Research (NIHR) Health Informatics Collaborative (HIC). It has other themes which cover a wide range of health care topics: https://www.nihr.ac.uk/about-us/how-we-are-managed/our-structure/infrastructure/health-informatics-collaborative.htm
cleanEHR was designed as an extendable data cleaning tool. CCHIC is the first exemplar for it, and we plan to extend cleanEHR's functionality to other themes and link it with other data sources.

rOpenSci packaging guide recommends using xml2 instead of XML for parsing XML files. It seems to me that most of the XML-related codes in this package only involves parsing. As a result, I would recommend using xml2 instead.

Thank you for the suggestion, and you have a point. We do believe xml2 would be a better fit for the XML parser. However, the XML parser is the most complicated part of the code. The XML files we receive from hospitals do not strictly follow the standard, so there are many corner cases that we have fixed over the years of development. The XML parser is rather like a plugin which serves only the CCHIC dataset. Given the complexity of re-writing it with xml2, would it be OK if we leave the XML part this time and redo it when the new XML standard is introduced in the coming years?

This package is frequently using a very OO-ish programming style to define objects and write functions. For example, after a ccTable being created, you can use cctable$export.csv("file.csv") to write contents to a csv file (https://github.com/CC-HIC/cleanEHR/blob/master/R/ccTable.R#L206-L213). To me, due to my lack of understanding of the full scenario, it seems sort of like forcibly writing python in R and is a little redundant. (I'm not saying it's a bad practice. My concern is mainly about the lack of documentations for these methods as I said below)

Thanks for pointing it out. ccTable is a refclass which serves as the data cleaning and manipulation platform. Since the tables it holds are usually very big, we intended to use a refclass to avoid multiple copies. As we have adopted the OO approach, we would like to keep all the ccTable-related functions, e.g. export.csv, consistent. More documentation has been added; I hope it is clearer this time.

Also, it would be nice if we can see more documentations for these self-defined methods or at least a table of available methods with brief explanation.

Thanks, we have added more documentation for the ccTable member methods, which are designated as ccTable_methods.

I would like to see more documentation about pipeline. I feel like the current documentation in R/pipeline.R and inst/pipeline isn't quite enough. It would be nice to see a README in inst/pipeline.

As we mentioned above, the pipeline is removed from the package.

In the package vignette tour.Rmd, we see tb <- create.cctable(ccd, yaml.load(conf), freq=1). However, freq is actually the 2nd argument in that function. It might mislead some users in the future.

Amended, thanks.

I feel like the author can put more instructions in the package vignette.

A new vignette has been written, thanks.

Final point, I'm not sure if it's a good idea but I feel like building a shiny app to deal with these tasks will be very helpful, especially to help users to understand what kinds of functionalities are available in this package.

That is a really good point! We plan to make a website using Shiny to display the data and showcase cleanEHR functionality. It will be done as a separate project.

To Reviewer 2 @aammd

The authors might consider using a prefix for every function in the package. Many successful packages use this strategy, which takes advantage of text editors which use tab completion. Here, perhaps functions which use a certain class of object could be prefixed with that class name, ccd_episode_plot instead of episode.graph etc.

Thank you for your suggestion. We recognise that naming is a big problem. As the package started with a lot of experimental elements, naming was not given much attention. We have amended the package to adopt a prefix-based naming style where applicable. In particular, the episode.graph() function has been moved to the plot() method for ccEpisode.

Some names and terminology have the potential to be confusing to users. For example sql.demographic.table has nothing to do with sql per se, but rather arranges tables such that SQL import would be possible.

sql.demographic.table has been renamed as ccd_demographic_table.

Some names are probably only confusing for the programmers. there is a mix of cases used in function names (sql.demographic.table(ccd), extractIndexTable, unique_spell)

Amended accordingly.

it is better to use TRUE and FALSE, in place of T and F.

Modified, thanks.

it is also safer to use vapply in place of sapply or unlist(lapply).

We didn't manage to switch in all cases, but we changed a couple of them.

I wonder if some of the functions could be split apart. For example, episode.graph creates both a plot and a rearrangement of the data. Perhaps if the plotting code were moved to a plot() method for the ccRecord class

This is a great idea, and we have now moved it to a plot() method for the ccEpisode class.

I note that all of Rcpp is listed as an Import, but only the function evalCpp was imported directly. Is loading the entire package necessary?

Amended accordingly. Thanks.

The Vignette contains an example of cleaning data, based on instructions in a YAML code block. create.cctable(ccd, yaml.load(conf), freq=1). I find using the yaml.load function here to be a bit awkward, and difficult for users who are unfamiliar with yaml. It might be easiest to encourage users to write a short file, conf.yaml, and to direct create.cctable to this filename, rather than to an R object.

In fact, the function can accept both a list and the path of a YAML configuration file as input. You are right that it might be confusing to use yaml.load in the vignette; in the new version of the vignette, we demonstrate it without using yaml.load.

I feel like pander and ggplot should be listed in Suggests, rather than Imports. Then users could use the basic functionality of this package without necessarily installing these packages. This might also entail separating where these functions are used from the data-cleaning that sometimes precedes them.

We actually feel that the data quality reporting and visualisation, where pander and ggplot are used, are crucial to the data cleaning process. For this reason, we feel it is necessary to have these functions available from the start. Please let me know if you feel otherwise; we can discuss this.
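
For reference, the pattern behind the Suggests recommendation is roughly the following (a generic sketch, not cleanEHR code; the function and column names are invented):

plot_quality_report <- function(dat) {
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("Package 'ggplot2' is required for plotting; please install it.")
  }
  # `dat` is assumed to have `time` and `value` columns
  ggplot2::ggplot(dat, ggplot2::aes(x = time, y = value)) + ggplot2::geom_line()
}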

Some documentation, e.g. in ccRecord, refers to S3 classes when in fact these are S4 classes.

That is a mistake. Amended!

There seem to be some vestigial R scripts and other files in inst/sql and inst/report, could these be deleted, or perhaps moved to a development branch?

Unnecessary files removed.

I ran all the code in the Vignette, and found that sql.demographic.table(ccd) produced the following error for me (first sentence only): Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference.

This is probably a warning from data.table; it happens with some R versions, but you can still run through.

I ran devtools::spell_check(), and found the following errors (among other, trivial "errors" like variable names etc.):

Amended.

@noamross
Contributor

Thank you for the response, @sinanshi. @aammd and @haozhu233, let us know if the changes address your concerns and if there are further areas of follow-up. Let me comment on a few things:

  • rOpenHealth is dormant as an organization these days. We previously referred health-related packages to them but they are not currently onboarding in any regular way.
  • The "general tool/specific application" combination pattern is one we see a lot, but is quite challenging from an ongoing maintenance and development perspective. A common solution is to split the general tool and the specific application into separate packages. This isn't required, but it is important to clearly separate the components of the code base and the documentation for both the EHR manipulation and the CCHIC use-case.
  • The XML package question is a good example of why this can be a problem: one does not want to make core package code reliant on an out-of-date package, but a use-specific plug-in would be expected to follow the development cycle of that data source. In this case, keeping XML for current CCHIC data, with a milestone of switching to xml2 for a future iteration of the data standard, is acceptable.

@sinanshi
Author

@aammd and @haozhu233, do you have any comment on the response we made previously?

@noamross
Contributor

Sorry for not following up on this earlier myself, @sinanshi. @aammd and @haozhu233, please do respond to the update.

@haozhu233

@sinanshi @noamross Sorry for the delay. I briefly went through the changes and so far they look good to me. I will take a deep look this weekend.

@sinanshi
Author

@haozhu233 thanks.

@sinanshi
Author

sinanshi commented Nov 8, 2017

@aammd and @haozhu233 Have you had a chance to review the changes?

@haozhu233

haozhu233 commented Nov 9, 2017

I think the changes @sinanshi made to the cleanEHR package address the issues I raised last time really well. I really like the improvements to the documentation, and I think the changes make it a lot easier for first-time users to understand the package.

I also have a few notes here:

  • Running the package check, I received 2 NOTEs:
    • Since you are using the GPL-3 license, you probably don't need the LICENSE file, as CRAN knows this type of license. If you really want to keep it, you can put it in the inst folder.
    • You probably want to put ‘Makefile’, ‘appveyor.yml’ and ‘paper.md’ in .Rbuildignore so they won't appear in the NOTE.
  • When I build it from source, RStudio sometimes crashes, but the package is fully functional after restarting. 🤔
  • Some functions such as xmlLoad are not exported but are documented using roxygen2. As a result, their documentation appears in the package manual. I assume this may confuse some users? My suggestion is to either export those functions or just use regular R comments.
  • Is it possible to have a table of all available methods in one of the package vignettes?

@sinanshi
Author

Hi @haozhu233. Thank you for your further comments.

  • .Rbuildignore has been added; please see the review branch.
  • I tried to build the package from source (.tar.gz) in RStudio, and it works fine for me. Could you let me know how you installed it and which RStudio version you are using?
  • Non-exported functions, including xmlLoad, are now excluded from the manual. All the available methods are shown in the manual. I hope it will no longer confuse users.
  • Is the table of available methods you proposed for the vignette essentially the same as the manual? Since the manual now only contains the available methods, it might serve the same purpose.

@sinanshi
Author

@aammd Do you have any feedback on the change we've made?

@aammd

aammd commented Nov 27, 2017

@sinanshi I'm very sorry for my late reply! Your response to my comments sounds great to me, and looking over the package quickly, everything looks awesome -- especially the vignettes. I'll go over things in more detail by Friday, but I don't anticipate anything major.

@aammd

aammd commented Dec 1, 2017

I think @sinanshi et al. did a great job addressing my comments and polishing what was already a truly impressive package. I especially like some of the new naming conventions; they make it much easier to navigate the package.

My comments are all very minor, mostly small tweaks to the vignettes and some comments on the output of goodpractice. Great work everyone!


  • The output of goodpractice doesn't raise any huge red flags. It suggests that you "not import packages as a whole, as this can cause name clashes between the imported packages. Instead, import only the specific functions you need." But I see that you have only done that for packages you use very extensively, such as yaml.

  • I worked through both vignettes and found them clear; everything seems to work.

  • I might suggest that you give the vignette titled "Tour" a more descriptive name

  • I've never seen this particular approach to loading in data before:

library(cleanEHR)
data.path <- paste0(find.package("cleanEHR"), "/doc/sample_ccd.RData")
print(data.path)

And I worry that it might be a little confusing for some users. Also, is it guaranteed to work on every machine? I'm not sure about that. Personally, I'm more familiar with the approach described in Wickham's "R packages" book, which argues that datasets used in a package should be placed in data/. I see that you do have at least one dataset in the data/ directory. This is the data.checklist, which is saved under an object of a different name:

data("ITEM_REFTABLE")

It would be much easier if this .RData file were called data_checklist instead, but perhaps this was never meant to be user-facing? At any rate, I recommend this approach for the sample_ccd dataset (see the short sketch at the end of this review).

  • I get a rather lengthy error message when I run ccd_demographic_table from the vignette:
dt <- ccd_demographic_table(ccd, dtype=TRUE)
Warning messages:
1: NAs introduced by coercion 
2: NAs introduced by coercion 
3: NAs introduced by coercion 
4: NAs introduced by coercion 
5: NAs introduced by coercion 
6: In `[.data.table`(demogt, , `:=`("index", seq(nrow(demogt)))) :
  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.

I am no data.table expert so perhaps this is nothing to worry about?

  • In the vignette data_clean.html this line:

cleanEHR provides a refclass ccTable. There are several key components in the ccTable structure.

Might be too technical for some of your users. It would perhaps be better to describe what a refclass does, rather than use the technical term. I'm confused about it myself -- does it have anything to do with "Reference Classes" (RC), which is yet another object-oriented system in R?

  • This vignette also mentions "imputation: filling missing data." as one of the jobs performed by ccTable, but this is never expanded upon. This is one topic that users will definitely want to know more about, as everybody has strong feelings about how and when missing data should be "filled". I suggest you either expand on it or take it out, to avoid stressing users out too much.

  • In your section "create a ccTable" I suggest you add a quick link to the definition of what YAML is, and a brief explanation of why you chose YAML and how it works here (i.e. the CCHIC code first, then one level down the shortname etc. ). I also think that a "Glossary" of YAML configuration would be really helpful -- I mean, a sort of central document where users could go to see all the options they can use in a YAML configuration file (e.g. range: and its subcategories label: and apply:). Perhaps this could be another small vignette, or perhaps a wiki on the github repository.

  • Similarly, I strongly recommend against using echo=FALSE at line 100 in the vignette. If your users are anything like me, they will learn how to use cleanEHR by copying code out of the vignette and experimenting with it, and they won't have this conf object that holds all the YAML configuration unless they can see this missing code.

  • A link to the data.table documentation within the vignette would also be really useful.
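
On the data-loading point above, a common alternative to the find.package() construction is system.file(), with data/ reserved for user-facing datasets (a sketch based on the current file layout):

# locate the sample data shipped in inst/doc of the installed package
data.path <- system.file("doc", "sample_ccd.RData", package = "cleanEHR")
load(data.path)

# datasets placed in data/ are available by name via data()
data("ITEM_REFTABLE")   # file and object names currently differ, as noted above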

@noamross
Contributor

noamross commented Dec 1, 2017

Thank you, @aammd. @sinanshi, if you address these remaining items, I will be sure to proceed with the final steps speedily.

@sinanshi
Author

Thanks @aammd for your comments.

It suggests that you "not import packages as a whole, as this can cause name clashes between the imported packages. Instead, import only the specific functions you need."

We double-checked and can confirm that we only import whole packages where it is necessary.

I might suggest that you give the vignette titled "Tour" a more descriptive name.

It is an old file that we forgot to delete. It is now called "Data cleaning and wrangling with cleanEHR".

Data loading

Changed accordingly. Thank you for pointing that out.

I get a rather lengthy error message when I run ccd_demographic_table from the vignette:

We sort of expected the warning messages, though they are not pretty. The function converts tree-style data to a square table. Due to missing/duplicated data points, putting the data into a square table means filling NAs in the missing entries, which is perfectly normal.

I'm confused about it myself -- does it have anything to do with "Reference Classes" (RC), which is yet another object-oriented system in R?

refclass is indeed the Reference Class.
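
A minimal sketch of a Reference Class, unrelated to ccTable itself, for readers unfamiliar with the system:

Counter <- setRefClass(
  "Counter",
  fields  = list(n = "numeric"),
  methods = list(
    add = function(x) {
      n <<- n + x      # fields are modified in place (reference semantics)
      invisible(.self)
    }
  )
)

cnt <- Counter$new(n = 0)
cnt$add(5)
cnt$n   # 5; the object was updated without being copied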

In your section "create a ccTable" I suggest you add a quick link to the
definition of what YAML is, and a brief explanation of why you chose YAML and
how it works here (i.e. the CCHIC code first, then one level down the shortname
etc. ). I also think that a "Glossary" of YAML configuration would be really
helpful -- I mean, a sort of central document where users could go to see all
the options they can use in a YAML configuration file (e.g. range: and its
subcategories label: and apply:). Perhaps this could be another small vignette,
or perhaps a wiki on the github repository.

A YAML explanation has been added. The vignette actually covers almost all the options of the YAML configuration. We mostly expect users to use the template directly with minor modifications.

Similarly, I strongly recommend against using echo=FALSE at line 100 in the vignette. If your users are anything like me, they will learn how to use cleanEHR by copying code out of the vignette and experimenting with it, and they won't have this conf object that holds all the YAML configuration unless they can see this missing code.

You are right, thanks. We have changed accordingly.

A link to the data.table documentation within the vignette would also be really useful.

Done.

This vignette also mentions "imputation: filling missing data." as one of the jobs performed by ccTable, but this is never expanded upon. This is one topic that users will definitely want to know more about, as everybody has strong feelings about how and when missing data should be "filled". I suggest you either expand on it or take it out, to avoid stressing users out too much.

We have added some examples for the imputation function.

Again, thank you for all these comments, which have helped improve the quality of our package.

Sinan

@aammd

aammd commented Dec 15, 2017

@sinanshi This looks great! I just read your comments and looked over the commit history on your master branch, and everything looks wonderful. Thank you for a very enjoyable dialogue; I hope it was helpful to you. Congratulations on producing such an ambitious and polished package!
@noamross I have no further comments here! Thank you for the chance to review this.

@sinanshi
Author

@aammd, it has been a great pleasure to work with you to improve our package. Thank you very much for your most valuable suggestions.

@noamross
Contributor

Thanks @aammd and @haozhu233 for your thorough reviews and follow-ups. @sinanshi, we are almost there, just a couple of remaining notes from R CMD check:

checking data for ASCII and uncompressed saves ... WARNING
  
  Note: significantly better compression could be obtained
        by using R CMD build --resave-data
                      old_size new_size compress
  ITEM_REFTABLE.RData     14Kb     10Kb       xz
  sample_ccd.RData       528Kb    276Kb       xz

You can fix these by adding the compress="xz" argument when generating these files with save().

You'll also want to add LICENSE, Makefile, .travis.yml, appveyor.yml, and paper.md to .Rbuildignore to avoid NOTEs about non-standard/hidden files.
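
A sketch of both fixes (file names follow the check output above; devtools::use_build_ignore() is one convenient way to append escaped entries to .Rbuildignore):

# re-compress the shipped .RData files in place
tools::resaveRdaFiles("data", compress = "xz")
tools::resaveRdaFiles("inst/doc", compress = "xz")
# (equivalently, pass compress = "xz" to save() when regenerating them)

# keep the non-standard files out of the built package
devtools::use_build_ignore(c("LICENSE", "Makefile", ".travis.yml",
                             "appveyor.yml", "paper.md"))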

With those two things fixed, we can approve!

To-dos after those fixes:

  • Transfer the repo to the rOpenSci organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.
  • Add the rOpenSci footer to the bottom of your README
[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
  • Add the "Peer-reviewed" status badge to your README:
[![](https://badges.ropensci.org/145_status.svg)](https://github.com/ropensci/onboarding/issues/102)
  • Fix any links in badges for CI and coverage to point to the ropensci URL. (We'll turn on the services on our end as needed)

For submission to JOSS, you'll then want to generate a new Zenodo record with a DOI (you should be able to once you've joined the rOpenSci team and I have made you admin for the repo). Then, over on the JOSS thread, provide them with the updated DOI.

@sinanshi
Author

@noamross Fixed the two issues.

@noamross
Contributor

Approved! OK, you can go ahead and initiate the repo transfer and the rest of the checklist above.

@sinanshi
Author

Badges added, CI links updated, new DOI obtained.
Thanks a lot!
