Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Allow redefinition of DATASET_CSV_FILE_NAME #403

Closed
kh296 opened this issue Feb 22, 2021 · 3 comments · Fixed by #412
Closed

Allow redefinition of DATASET_CSV_FILE_NAME #403

kh296 opened this issue Feb 22, 2021 · 3 comments · Fixed by #412
Labels
triaged An item on Azure Boards has been created and prioritized

Comments

@kh296
Copy link
Collaborator

kh296 commented Feb 22, 2021

The scans and structures to be used in a training run are specified in an index file, the name of which is given by the variable InnerEye.ML.common.DATASET_CSV_FILE_NAME. This is hardcoded to dataset.csv.

When trying to improve model performance, it can sometimes be useful to consider only a subset of patients and/or structures. This can be achieved by overwriting dataset.csv, or by adding a differently named index file and hacking InnerEye/ML/common.py. It to be nice instead able to set the value of DATASET_CSV_FILE_NAME in the model definition.

AB#3785

@ant0nsc
Copy link
Contributor

ant0nsc commented Mar 6, 2021

def test_custom_dataset_name(test_output_dirs: OutputFolderForTests) -> None:
    filename = "foo.csv"
    (test_output_dirs.root_dir / filename).write_text("""subject,channel,...
    """)
    config = SegmentationModelBase()
    config.local_dataset = test_output_dirs.root_dir
    config.custom_dataset_name = "foo.csv"
    dataframe = config.read_dataset_if_needed()
    assert dataframe is not None

@kh296
Copy link
Collaborator Author

kh296 commented Mar 9, 2021

I've enabled user definition of the dataset csv file name by allowing the parameter dataset_csv to be set as part of model configuration. The parameter is initialised to DATASET_CSV_FILE_NAME in InnerEye/ML/config.py, then is used to locate the dataset csv file in InnerEye/ML/config.py, in InnerEye/ML/run_ml.py and in InnerEye/ML/utils/ml_util.py. I've added a unit test to Tests/ML/test_config_helpers.py.

The above seems to be all that's needed to set custom dataset csv file names for model training. However, I've seen that DATASET_CSV_FILE_NAME is used for file location in a number of other places, for example InnerEye/ML/baselines_util.py and InnerEye/ML/normalize_and_visualize_dataset.py.

Should I submit a pull request for the changes made, in that they provide the functionality requested in the Issue submission, or should I first try to avoid DATASET_CSV_FILE_NAME being used anywhere, except as a fallback value?

@ant0nsc
Copy link
Contributor

ant0nsc commented Mar 9, 2021

@kh296 please do a pull request with the changes you described. As far as I can see now, the other uses of DATASET_CSV_FILE_NAME are fine. Once I see the PR, I'll be able to see more clearly anyway.

@ant0nsc ant0nsc added the triaged An item on Azure Boards has been created and prioritized label Mar 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
triaged An item on Azure Boards has been created and prioritized
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants