
Workflow to access datasets hosted on NDA #3710

Closed
mih opened this issue Sep 25, 2019 · 5 comments
mih (Member) commented Sep 25, 2019

@yarikoptic What would be a sensible workflow to access a dataset hosted on NDA as a DataLad dataset? In particular, access to datasets for which a dedicated data usage permission has been (or has to be) obtained, and that comprise more than just the imaging data hosted on S3 (e.g. clinical assessments coming from some other dataset).

What about this?

  1. Create a dataset
  2. Populate the dataset by running ndatool (https://github.com/NDAR/nda-tools) through datalad run, with the request number obtained through the standard NDA application process. This will download all files from S3 and make the necessary requests to also obtain all other data files.
  3. Use a helper (script) to sift through the NDA metadata and add S3 URLs to the downloaded and annexed data files post factum.
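For illustration, the helper in step 3 might look something like the sketch below. Note that the metadata table layout, the column names, and the idea that ndatool leaves a CSV behind are all assumptions for the sake of the example, not the actual NDA schema; the sketch just pairs annexed files with their S3 URLs and emits `git annex addurl` invocations.

```python
# Hypothetical sketch of step 3: sift through an assumed NDA package
# metadata table (column names are made up, not the real NDA schema)
# and emit git-annex commands that attach S3 URLs to annexed files.
import csv
import io


def s3_url_commands(metadata_csv, path_col="local_path", url_col="s3_url"):
    """Yield 'git annex addurl' invocations for each file/URL pair."""
    reader = csv.DictReader(io.StringIO(metadata_csv))
    for row in reader:
        path, url = row.get(path_col), row.get(url_col)
        if path and url and url.startswith("s3://"):
            # --relaxed records the URL without re-downloading content
            yield f"git annex addurl --relaxed --file={path} {url}"


# toy example with made-up metadata
demo = """local_path,s3_url,size
sub-01/anat.nii.gz,s3://nda-bucket/sub-01/anat.nii.gz,12345
notes.txt,,100
"""

for cmd in s3_url_commands(demo):
    print(cmd)
    # -> git annex addurl --relaxed --file=sub-01/anat.nii.gz s3://nda-bucket/sub-01/anat.nii.gz
```

Rows without an S3 URL (e.g. non-imaging files obtained through other requests) are simply skipped here; whether those need a different access mechanism is exactly the open question below.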

The outcome is a dataset that represents any NDA dataset in its raw form (defined as whatever ndatool is doing). This dataset can be subsequently normalized with tools like https://github.com/psychoinformatics-de/datalad-hirni by adding more required metadata, or using additional helpers to extract this information from the NDA-provided metadata.

ZIP files with DICOMs tracked in the dataset after the initial ndatool run could then be fed to datalad import-dcm. It would make sense to me to implement a metadata extractor for NDA metadata that ends up in a dataset in this way and format, so that tools like datalad hirni can query for such metadata in order to do their job better and with less manual intervention.

Ping @loj @bpoldrack

yarikoptic (Member) commented

Steps 1 and 2 are pretty much the "workaround" @loj described in datalad/datalad-crawler#56.
Step 3 is the tricky one, since AFAIK you cannot just get S3 URLs to NDA's original bucket(s). @obenshaindw could correct me if I am wrong. In the past we have discussed the ability to get some other persistent NDA identifier per file, so that NDA could then act as a proxy that resolves them (after authentication/permission checking) into temporary, downloadable URLs to the S3 content.

Note that "metadata", even file names, in NDA might leak sensitive information (dates, subject IDs such as GUIDs, etc.), so such a dataset might not be shareable openly, but only within the group that obtained the initial permissions from NDA. I am not sure whether NDA grants permissions to "wide" groups such as an entire research center.

mih (Member, Author) commented Sep 25, 2019

Can you please clarify what would prevent me from "getting" the S3 URLs? They seem to be contained in a metadata table that is left behind by ndatool.

yarikoptic (Member) commented Sep 25, 2019

IIRC those would be short-lived (either the URL itself or a "bundle" bucket)... Once again, I might be wrong; I haven't tried it myself, as I haven't recently been granted any access to NDA.

mih (Member, Author) commented Sep 25, 2019

OK, thx. It wasn't clear from your original post that any S3 URL is temporary.

mih (Member, Author) commented May 30, 2020

I think this can be closed.

mih closed this as completed May 30, 2020