Add option to anonymize acquisition datetimes in scans.tsv #265

Open
tsalo opened this issue Aug 17, 2023 · 1 comment
tsalo commented Aug 17, 2023

Following a discussion in today's informatics scrum, I was thinking that it would be nice to be able to anonymize acquisition datetimes in the scans.tsv files (and potentially in sidecar JSON files). @mattcieslak thought this could be made part of the purge-metadata command.

Desiderata:

  1. Set first scan's acquisition to 1800/01/01.
  2. Give users the option to either anonymize the full datetime or just anonymize the date (i.e., retain the time of day).
  3. Preserve relative timing between scans in each session.
  4. Preserve relative timing between sessions.
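
The four points above amount to subtracting one fixed offset per subject from every acquisition datetime. A minimal sketch of that idea, using only the standard library; the function name `anonymize_acq_times` and the `keep_time_of_day` flag are hypothetical and not part of any existing command:

```python
from datetime import datetime

BASELINE = datetime(1800, 1, 1)  # target date from the desiderata


def anonymize_acq_times(acq_times, keep_time_of_day=True):
    """Shift ISO 8601 datetimes so the earliest one lands on 1800-01-01.

    Hypothetical helper for illustration only.
    Pass all of a subject's scans (across sessions) in one call so that a
    single offset is used and relative timing is preserved.
    """
    parsed = [datetime.fromisoformat(t) for t in acq_times]
    first = min(parsed)
    if keep_time_of_day:
        # Shift by whole days only, so every scan keeps its clock time.
        anchor = datetime.combine(first.date(), datetime.min.time())
    else:
        # Shift so the first scan lands exactly on midnight of the baseline.
        anchor = first
    offset = anchor - BASELINE
    return [(dt - offset).isoformat() for dt in parsed]


times = ["2023-08-17T10:30:00", "2023-08-17T10:45:00", "2023-09-07T09:00:00"]
print(anonymize_acq_times(times))
print(anonymize_acq_times(times, keep_time_of_day=False))
```

Because the same offset is applied to every value, the 15-minute gap within the first session and the gap between the two sessions are both unchanged under either setting.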
@tsalo added the enhancement label Aug 17, 2023
@mattcieslak self-assigned this Aug 28, 2023

tsalo commented Sep 7, 2023

Here's some code I've used to do this in another project:

"""Anonymize acquisition datetimes for a dataset.

Works for both longitudinal
and cross-sectional studies. The time of day is preserved, but the first
scan is set to January 1st, 1800. In a longitudinal study, each session is
anonymized relative to the first session, so that time between sessions is
preserved.

Overwrites scan tsv files in dataset. Only run this *after* data collection
is complete for the study, especially if it's longitudinal.
"""
import os
from glob import glob

import pandas as pd
from dateutil import parser

if __name__ == "__main__":
    dset_dir = "/path/to/dset"

    bl_dt = parser.parse("1800-01-01")

    subject_dirs = sorted(glob(os.path.join(dset_dir, "sub-*")))
    for subject_dir in subject_dirs:
        sub_id = os.path.basename(subject_dir)
        print(f"Processing {sub_id}")

        scans_files = sorted(glob(os.path.join(subject_dir, "ses-*/*_scans.tsv")))

        for i_ses, scans_file in enumerate(scans_files):
            ses_dir = os.path.dirname(scans_file)
            ses_name = os.path.basename(ses_dir)
            print(f"\t{ses_name}")

            df = pd.read_table(scans_file)
            if i_ses == 0:
                # Anonymize in terms of first scan for subject.
                first_scan = df["acq_time"].min()
                first_dt = parser.parse(first_scan.split("T")[0])
                diff = first_dt - bl_dt

            acq_times = df["acq_time"].apply(parser.parse)
            acq_times = (acq_times - diff).astype(str)
            df["acq_time"] = acq_times
            df["acq_time"] = df["acq_time"].str.replace(" ", "T")

            # Delete the original file instead of just overwriting it, for Datalad.
            os.remove(scans_file)

            df.to_csv(
                scans_file,
                sep="\t",
                lineterminator="\n",  # named line_terminator before pandas 1.5
                na_rep="n/a",
                index=False,
            )
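
One caveat with the script as posted: it calls `parser.parse` on every row, so a scans file containing a missing `acq_time` (`n/a`) would probably raise an error. A rough sketch of one way to skip such rows, written with the standard library rather than pandas so it can be run in isolation; `shift_scans_tsv` is a hypothetical name, and the offset is assumed to have been computed from the subject's first scan as in the script above:

```python
import csv
import io
from datetime import datetime


def shift_scans_tsv(text, offset):
    """Return scans.tsv content with each non-missing acq_time moved back by offset.

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        if row["acq_time"] != "n/a":  # leave missing values untouched
            shifted = datetime.fromisoformat(row["acq_time"]) - offset
            row["acq_time"] = shifted.isoformat()
        writer.writerow(row)
    return out.getvalue()


scans = (
    "filename\tacq_time\n"
    "anat/sub-01_T1w.nii.gz\t2023-08-17T10:30:00\n"
    "func/sub-01_bold.nii.gz\tn/a\n"
)
offset = datetime(2023, 8, 17) - datetime(1800, 1, 1)
print(shift_scans_tsv(scans, offset))
```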
