HEALNet

Code repository for paper: HEALNet - Hybrid Multi-Modal Fusion for Heterogeneous Biomedical Data.

Quickstart

Local install

First, locally install HEALNet using pip.

git clone <https/ssh_path>
cd healnet
pip install -e .

Usage

from healnet.models import HealNet
from healnet.etl import MMDataset
import torch
import einops

# synthetic data example
n = 1000 # number of samples
b = 4 # batch size
img_c = 3 # image channels
tab_c = 1 # tabular channels
tab_d = 5000 # tabular features
h = 512 # image height
w = 512 # image width

tab_tensor = torch.rand(size=(n, tab_c, tab_d)) # assume 5k tabular features
img_tensor = torch.rand(size=(n, img_c, h, w)) # c h w
dataset = MMDataset([tab_tensor, img_tensor])

[tab_sample, img_sample] = dataset[0]

# batch dim for illustration purposes
tab_sample = einops.repeat(tab_sample, 'c d -> b c d', b=1)
img_sample = einops.repeat(img_sample, 'c h w -> b c (h w)', b=1)

model = HealNet(
            modalities=2, 
            input_channels=[tab_c, img_c], 
            input_axes=[1, 1], # channel axes (0-indexed)
            num_classes = 4
        )

# example forward pass
model([tab_sample, img_sample])

Please view notebooks/sample.ipynb for a more detailed example.
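
The two einops.repeat calls in the usage example add a batch axis of size 1 and, for the image, flatten the two spatial axes into one. As a minimal sketch, the same reshaping can be written in plain PyTorch. The small tensor sizes below are assumptions chosen to keep the check fast; they are not the sizes used in the example above.

```python
import torch

# Small illustrative sizes (assumptions; the usage example uses 512x512 images
# and 5000 tabular features).
img_c, h, w = 3, 8, 8
tab_c, tab_d = 1, 10

tab_sample = torch.rand(tab_c, tab_d)   # c d
img_sample = torch.rand(img_c, h, w)    # c h w

# Equivalent of 'c d -> b c d' with b=1: add a leading batch axis.
tab_batched = tab_sample.unsqueeze(0)

# Equivalent of 'c h w -> b c (h w)' with b=1: flatten h and w into one axis,
# then add the batch axis.
img_batched = img_sample.flatten(start_dim=1).unsqueeze(0)

print(tuple(tab_batched.shape))  # (1, 1, 10)
print(tuple(img_batched.shape))  # (1, 3, 64)
```

Each image channel becomes a single row of h * w values, and pixel (row, col) ends up at flat index row * w + col.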

Reproducing experiments

If you want to reproduce the results in the paper instead of using HEALNet as a standalone module, you need to install a few more dependencies.

Conda/Mamba environment

Install or update the conda/mamba environment using the commands below, then activate it. For a faster installation, we recommend using mamba.

conda env update -f environment.yml
conda activate cognition

CLI for additional dependencies

On Mac or Linux, you can install the dependencies below from the command line using

invoke install --system <system>

where <system> is either linux or mac.

This will auto-install the requirements below (OpenSlide and GDC client). Please follow the detailed instructions below if the automated installation fails.

OpenSlide

Note that for openslide-python to work, you need to install OpenSlide separately on your system. See the OpenSlide documentation for installation instructions.

GDC client

To download the WSI data, you need to install the gdc-client for your platform.
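
Before downloading any data, it can help to confirm that both requirements are discoverable. The following is a minimal sketch, not part of the repository: the function name check_dependencies is hypothetical, and it assumes that the gdc-client executable is either on the PATH or in the repository root.

```python
import importlib.util
import shutil
from pathlib import Path


def check_dependencies(repo_root="."):
    """Report which external requirements can be found on this machine."""
    root = Path(repo_root)
    return {
        # Python bindings only; importing them still needs the OpenSlide system library.
        "openslide_python": importlib.util.find_spec("openslide") is not None,
        # gdc-client may be on the PATH or sit in the repository root.
        "gdc_client_on_path": shutil.which("gdc-client") is not None,
        "gdc_client_in_repo": (root / "gdc-client").is_file(),
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A missing entry only tells you where to look next; it does not replace the platform-specific installation instructions.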

Data

Download

From the root of the repository:

1. Specify the path to the gdc-client executable in main.yml (this will likely be the repository root if you installed the dependencies using invoke install).
2. Run invoke download --dataset <dataset> --config_path <config>, e.g., invoke download --dataset brca.

If you are unsure about which arguments are available, you can always run invoke download --help.

The script downloads the data using the manifest files in data/tcga/gdc_manifests/full and saves it in the data folder under tcga/wsi/<dataset>, with the following structure:

tcga/wsi/<dataset>/
	├── slide_1.svs
	├── slide_2.svs
	└── ...

If a data manifest file is not available for a given cancer site, you can select the files and download the manifest using the NIH Genomic Data Commons Data Portal. You can filter for the .svs tissue and diagnostic slide files.
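
Once a download has finished, the folder layout described above can be checked with a few lines of Python. This is a minimal sketch under the assumption that the slides sit directly in <data_root>/tcga/wsi/<dataset>; the helper name list_slides is hypothetical and not part of the repository.

```python
from pathlib import Path


def list_slides(data_root, dataset):
    """Return the sorted .svs files found under <data_root>/tcga/wsi/<dataset>."""
    slide_dir = Path(data_root) / "tcga" / "wsi" / dataset
    if not slide_dir.is_dir():
        raise FileNotFoundError(f"No slide folder at {slide_dir}")
    # Only direct children are considered; nested folders are ignored.
    return sorted(slide_dir.glob("*.svs"))
```

For example, len(list_slides("data", "brca")) gives the number of downloaded slides, which can be compared against the number of entries in the manifest file.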

Preprocessing

To ensure comparability with baselines, we provide the option to run the model on the WSI patches and on features extracted using the CLAM package.

To extract the patches, run

invoke preprocess --dataset <dataset> --config <config> --level <level>

This will extract the patches to the following structure:

tcga/wsi/<dataset>_preprocessed/
	├── masks
	│   ├── slide_1.png
	│   ├── slide_2.png
	│   └── ...
	├── patches
	│   ├── slide_1.h5
	│   ├── slide_2.h5
	│   └── ...
	├── stitches
	│   ├── slide_1.png
	│   ├── slide_2.png
	│   └── ...
	└── process_list_autogen.csv

Note that the slide.h5 files contain the coordinates of the patches that are to be read in via OpenSlide (x, y coordinates).
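
To confirm that preprocessing produced all three outputs for every slide, the folder structure above can be checked as follows. This is a minimal sketch: the helper name missing_outputs is hypothetical, and it assumes that each output file is named after its slide ID, as in the tree.

```python
from pathlib import Path

# Sub-folder -> file extension, as listed in the folder structure.
EXPECTED = {"masks": ".png", "patches": ".h5", "stitches": ".png"}


def missing_outputs(preprocessed_dir, slide_ids):
    """Map each slide ID to the sub-folders that lack its output file."""
    root = Path(preprocessed_dir)
    missing = {}
    for slide_id in slide_ids:
        absent = [
            folder
            for folder, ext in EXPECTED.items()
            if not (root / folder / f"{slide_id}{ext}").is_file()
        ]
        if absent:
            missing[slide_id] = absent
    return missing
```

An empty dictionary means every listed slide has a mask, a patch file, and a stitch image.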

On the first run of the pipeline, the script adds a folder called patch_features, which contains the features extracted from the normalised patches with an ImageNet-pretrained ResNet50, stored as 1024-dimensional tensors (using PyTorch serialisation).

	├── patch_features
    		├── slide_1.pt
    		├── slide_2.pt
    		└── ...
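
The serialised features can be inspected directly with PyTorch. The sketch below assumes that each .pt file holds a single tensor whose last dimension is the 1024-dimensional feature size, for example one row per patch. That layout is an assumption, and load_patch_features is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path

import torch

FEATURE_DIM = 1024  # feature size stated in the text


def load_patch_features(path):
    """Load one slide's serialised features and check the last dimension."""
    features = torch.load(Path(path), map_location="cpu")
    if features.shape[-1] != FEATURE_DIM:
        raise ValueError(
            f"Expected last dimension {FEATURE_DIM}, got shape {tuple(features.shape)}"
        )
    return features
```

Loading onto the CPU keeps the check usable on machines without a GPU, regardless of where the features were computed.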

Datasets

This repo contains the manifests and scripts to easily download the following 8 cancer sites from The Cancer Genome Atlas. You can use the GDC Data Access Tool and use the same scripts if you require additional data.

TCGA

BLCA: Urothelial Bladder Carcinoma
BRCA: Breast Invasive Carcinoma
UCEC: Uterine Corpus Endometrial Carcinoma
KIRP: Kidney Renal Papillary Cell Carcinoma
LUAD: Lung Adenocarcinoma
LUSC: Lung Squamous Cell Carcinoma
PAAD: Pancreatic Adenocarcinoma
HNSC: Head and Neck Squamous Cell Carcinoma
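
To fetch all eight sites in one go, the download command can be generated per site. This is a minimal sketch that only builds the command strings; it assumes that the --dataset argument takes the lowercase site code, as in the brca example from the Download section.

```python
# TCGA site codes from the list above (lowercased, an assumption based on the
# 'brca' example) mapped to their full names.
TCGA_SITES = {
    "blca": "Urothelial Bladder Carcinoma",
    "brca": "Breast Invasive Carcinoma",
    "ucec": "Uterine Corpus Endometrial Carcinoma",
    "kirp": "Kidney Renal Papillary Cell Carcinoma",
    "luad": "Lung Adenocarcinoma",
    "lusc": "Lung Squamous Cell Carcinoma",
    "paad": "Pancreatic Adenocarcinoma",
    "hnsc": "Head and Neck Squamous Cell Carcinoma",
}


def download_commands(sites=TCGA_SITES):
    """Build one 'invoke download' command per site code."""
    return [f"invoke download --dataset {code}" for code in sites]


for command in download_commands():
    print(command)
```

The printed commands can be run one by one from the repository root, or passed to subprocess.run if you prefer to automate the loop.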

Biobank

To be added

Running Experiments

Single run

Given the configuration in config.yml, you can launch a single run using the command below. Note that all commands in this section assume that you are in the repository root.

python3 healnet/main.py

You can view the available command line arguments using

python3 healnet/main.py --help

Full run

python3 healnet/main.py --mode run_plan

Hyperparameter search

You can launch a hyperparameter search by passing the --hyperparameter_sweep argument.

python3 healnet/main.py --hyperparameter_sweep

Note that the sweep parameters are specified in the config/sweep.yaml file. If a parameter is not specified as part of the parameter sweep, the program will default to whatever is configured in config/main_gpu.yml.
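
The fallback behaviour described above amounts to overlaying the swept values on the default configuration. The following minimal sketch illustrates the idea with flat dictionaries. It is not the repository's implementation, and the parameter names and values are hypothetical.

```python
def resolve_run_config(defaults, sweep_values):
    """Overlay one sweep trial's values on the default configuration."""
    resolved = dict(defaults)      # copy so the defaults stay untouched
    resolved.update(sweep_values)  # swept parameters take precedence
    return resolved


# Hypothetical parameter names and values, for illustration only.
defaults = {"learning_rate": 1e-3, "batch_size": 4, "epochs": 10}
trial = {"learning_rate": 1e-4}

print(resolve_run_config(defaults, trial))
# {'learning_rate': 0.0001, 'batch_size': 4, 'epochs': 10}
```

Only learning_rate is swept here, so batch_size and epochs keep their default values. Nested configuration sections would need a recursive merge, which this sketch does not cover.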
