
Workflow to access datasets hosted on NDA #3710

Closed
mih opened this issue Sep 25, 2019 · 5 comments
mih (Member) commented Sep 25, 2019

@yarikoptic What would be a sensible workflow to access a dataset hosted on NDA as a DataLad dataset? In particular, access to datasets for which a dedicated data usage permission has been (or has to be) obtained, and that comprise more than just the imaging data hosted on S3 (e.g. clinical assessments coming from some other dataset).

What about this?

  1. Create a dataset
  2. Populate the dataset by running ndatool (https://github.com/NDAR/nda-tools) through datalad run, with the request number obtained through the standard NDA application process. This will download all files from S3 and make the necessary requests to also obtain all other data files.
  3. Use a helper (script) to sift through the NDA metadata and add S3 URLs to the downloaded and annexed data files post factum.
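For illustration, the helper in step 3 might look something like the sketch below. Note that the metadata table layout, the column names, and the idea that ndatool leaves a CSV behind are all assumptions for the sake of the example, not the actual NDA schema; the sketch just pairs annexed files with their S3 URLs and emits `git annex addurl` invocations.

```python
# Hypothetical sketch of step 3: sift through an assumed NDA package
# metadata table (column names are made up, not the real NDA schema)
# and emit git-annex commands that attach S3 URLs to annexed files.
import csv
import io


def s3_url_commands(metadata_csv, path_col="local_path", url_col="s3_url"):
    """Yield 'git annex addurl' invocations for each file/URL pair."""
    reader = csv.DictReader(io.StringIO(metadata_csv))
    for row in reader:
        path, url = row.get(path_col), row.get(url_col)
        if path and url and url.startswith("s3://"):
            # --relaxed records the URL without re-downloading content
            yield f"git annex addurl --relaxed --file={path} {url}"


# toy example with made-up metadata
demo = """local_path,s3_url,size
sub-01/anat.nii.gz,s3://nda-bucket/sub-01/anat.nii.gz,12345
notes.txt,,100
"""

for cmd in s3_url_commands(demo):
    print(cmd)
    # -> git annex addurl --relaxed --file=sub-01/anat.nii.gz s3://nda-bucket/sub-01/anat.nii.gz
```

Rows without an S3 URL (e.g. non-imaging files obtained through other requests) are simply skipped here; whether those need a different access mechanism is exactly the open question below.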

The outcome is a dataset that represents any NDA dataset in its raw form (defined as whatever ndatool is doing). This dataset can be subsequently normalized with tools like https://github.com/psychoinformatics-de/datalad-hirni by adding more required metadata, or using additional helpers to extract this information from the NDA-provided metadata.

ZIP files with DICOMs tracked in the dataset after the initial ndatool run could then be fed to datalad import-dcm. It would make sense to me to implement a metadata extractor for NDA metadata that ends up in a dataset in this way and format, so that tools like datalad hirni can query for such metadata in order to do their job better and with less manual intervention.

Ping @loj @bpoldrack

yarikoptic (Member) commented

Steps 1 and 2 are pretty much the "workaround" @loj described in datalad/datalad-crawler#56.
Step 3 is the tricky one, since AFAIK you cannot just get S3 URLs to NDA's original bucket(s). @obenshaindw could correct me if I am wrong. In the past we have discussed the ability to get some other persistent NDA identifier per file, so that NDA could then act as a proxy that resolves them (after authentication/permission checking) into temporary, downloadable URLs to the S3 content.

Note that "metadata", even file names, in NDA might leak sensitive information (dates, subject IDs such as GUIDs, etc.), so such a dataset might not be shareable openly, but only within the group that obtained the initial permissions from NDA. I am not sure whether NDA grants permissions to "wide" groups such as an entire research center.

mih (Member, Author) commented Sep 25, 2019

Can you please clarify what would prevent me from "getting" the S3 URLs? They seem to be contained in a metadata table that is left behind by ndatool.

yarikoptic (Member) commented Sep 25, 2019

IIRC those would be short-lived (either the URL itself or a "bundle" bucket)... Once again, I might be wrong; I haven't tried it myself, as I haven't recently been granted any access to NDA.

mih (Member, Author) commented Sep 25, 2019

OK, thx. It wasn't clear from your original post that any S3 URL is temporary.

mih (Member, Author) commented May 30, 2020

I think this can be closed.

mih closed this as completed May 30, 2020