
Generating synthetic medical images

Automated diagnosis of any kind is hampered by the small size, lack of diversity, and high cost of the available datasets of medical images. To tackle this problem, several approaches based on generative models have been applied.

Hence, in this project we use a conditional Deep Convolutional Generative Adversarial Network (cDCGAN) to synthetically generate medical images; concretely, dermatoscopic images of pigmented skin lesions.
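
As a reference for how the conditioning works, below is a minimal, illustrative sketch (in PyTorch) of a conditional DCGAN generator for 64x64 RGB images. The layer sizes, LATENT_DIM and the class-embedding trick are assumptions for illustration and need not match the exact architecture used in this repository.

import torch
import torch.nn as nn

LATENT_DIM = 100   # assumed size of the random-noise vector
NUM_CLASSES = 7    # the 7 HAM10000 diagnostic categories (see the Dataset section)

class ConditionalGenerator(nn.Module):
    """Maps (noise, class label) to a 3 x 64 x 64 image in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(
            # input: (LATENT_DIM + NUM_CLASSES) x 1 x 1
            nn.ConvTranspose2d(LATENT_DIM + NUM_CLASSES, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512), nn.ReLU(True),             # 512 x 4 x 4
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(True),             # 256 x 8 x 8
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128), nn.ReLU(True),             # 128 x 16 x 16
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(True),              # 64 x 32 x 32
            nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),
            nn.Tanh(),                                      # 3 x 64 x 64
        )

    def forward(self, noise, labels):
        # concatenate the label embedding to the noise vector along the channel axis
        cond = self.label_emb(labels).unsqueeze(-1).unsqueeze(-1)
        return self.net(torch.cat([noise, cond], dim=1))

# usage: generate 16 fake images with arbitrary labels
z = torch.randn(16, LATENT_DIM, 1, 1)
y = torch.randint(0, NUM_CLASSES, (16,))
fake = ConditionalGenerator()(z, y)   # shape: (16, 3, 64, 64)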

Generated samples during the training process

Original samples

[Figure: original samples of the HAM10000 dataset]

Generated samples

[Figure: samples generated from random noise and arbitrary labels]

These samples were generated after training with IMAGE_SIZE = 64 and the current hyperparameters, which yielded the following losses during training:

[Figure: losses during training]

It is easy to see that, at some point, the discriminator was too simple to keep improving (its predictive performance eventually reached a standstill, preventing the generator from improving further). This is because no hyperparameter fine-tuning has been carried out yet.
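
For context, a standard conditional GAN training step looks like the sketch below. It assumes generator and discriminator modules such as the ones outlined above (with the discriminator returning a sigmoid probability of shape (batch, 1)) and returns the two loss values that are tracked to plot curves like the one shown. This is a generic sketch, not necessarily the exact training loop of this repository.

import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_step(generator, discriminator, opt_g, opt_d, real_images, labels, latent_dim=100):
    """One (c)DCGAN update; returns (d_loss, g_loss) for plotting the training curves."""
    device = real_images.device
    batch = real_images.size(0)
    ones = torch.ones(batch, 1, device=device)
    zeros = torch.zeros(batch, 1, device=device)

    # discriminator update: push real images towards 1 and generated images towards 0
    opt_d.zero_grad()
    noise = torch.randn(batch, latent_dim, 1, 1, device=device)
    fake = generator(noise, labels)
    d_loss = bce(discriminator(real_images, labels), ones) + \
             bce(discriminator(fake.detach(), labels), zeros)
    d_loss.backward()
    opt_d.step()

    # generator update: try to make the discriminator output 1 on the fakes
    opt_g.zero_grad()
    g_loss = bce(discriminator(fake, labels), ones)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()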

Dataset

We will use the HAM10000 dataset ("Human Against Machine with 10000 training images"), published in 2018. In the words of the authors:

We collected dermatoscopic images from different populations acquired and stored by different modalities. Given this diversity we had to apply different acquisition and cleaning methods and developed semi-automatic workflows utilizing specifically trained neural networks. The final dataset consists of 11788 dermatoscopic images, of which 10010 will be released as a training set for academic machine learning purposes and will be publicly available through the ISIC archive. This benchmark dataset can be used for machine learning and for comparisons with human experts. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions. More than 50% of lesions have been confirmed by pathology, the ground truth for the rest of the cases was either follow-up, expert consensus, or confirmation by in-vivo confocal microscopy.

This same dataset has been uploaded to several places.

Technical information

Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of the available datasets of dermatoscopic images. This problem was tackled by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. The authors collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions (a minimal label mapping is sketched after the list):

  • Actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec, labeled 0)
  • Basal cell carcinoma (bcc, labeled 1)
  • Benign keratosis-like lesions: solar lentigines / seborrheic keratoses and lichen-planus like keratoses (bkl, labeled 2)
  • Dermatofibroma (df, labeled 3)
  • Melanoma (mel, labeled 4)
  • Melanocytic nevi (nv, labeled 5)
  • Vascular lesions: angiomas, angiokeratomas, pyogenic granulomas and hemorrhage (vasc, labeled 6)
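
In code, this mapping can be kept as a simple dictionary (a hypothetical helper mirroring the list above, not necessarily the repository's own constant):

# mapping from the dx short codes to the integer labels listed above
LABELS = {"akiec": 0, "bcc": 1, "bkl": 2, "df": 3, "mel": 4, "nv": 5, "vasc": 6}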

The ground truth of the lesions is confirmed through:

  • Histopathology (histo): more than 50% of lesions
  • Follow-up examination (follow_up)
  • Expert consensus (consensus)
  • Confirmation by in-vivo confocal microscopy (confocal)

These are labeled accordingly in the dx_type column of the HAM10000_metadata.csv.

Due to upload size limitations, images were stored in two files:

  • HAM10000_images_part_1: with 5000 .jpg files
  • HAM10000_images_part_2: with 5015 .jpg files

The dataset includes lesions with multiple images, which can be tracked by the lesion_id column within the HAM10000_metadata.csv file. This file contains the following columns (a minimal loading sketch follows the list):

  • image_id: id and file name with which the image can be found in either the HAM10000_images_part_1 or the HAM10000_images_part_2 folder
  • lesion_id: id of the lesion (note that one lesion can be captured in more than one image)
  • dx_type: the diagnostic procedure (one of the 4 categories above) through which dx was confirmed
  • age: age of the patient, as an integer
  • sex: male or female, according to the biological sex of the patient
  • localization: body part in which the lesion is found
  • dx: the ground truth, i.e. our label
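
As a minimal sketch of how the metadata and the two image folders can be tied together with pandas (DATA_DIR and the image_path helper are assumptions for illustration, not necessarily the repository's own code):

import os
import pandas as pd

LABELS = {"akiec": 0, "bcc": 1, "bkl": 2, "df": 3, "mel": 4, "nv": 5, "vasc": 6}
DATA_DIR = "data"  # assumed root folder holding the CSV and the two image folders

# load the metadata and turn the dx ground truth into integer labels
metadata = pd.read_csv(os.path.join(DATA_DIR, "HAM10000_metadata.csv"))
metadata["label"] = metadata["dx"].map(LABELS)

def image_path(image_id: str) -> str:
    """Resolve an image_id to its .jpg file in part_1 or part_2."""
    for part in ("HAM10000_images_part_1", "HAM10000_images_part_2"):
        candidate = os.path.join(DATA_DIR, part, f"{image_id}.jpg")
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(image_id)

# e.g. inspect how the ground truth was confirmed (dx_type) and the class balance (dx)
print(metadata["dx_type"].value_counts())
print(metadata["dx"].value_counts())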

Other medical images datasets

There are plenty of medical image datasets. Below we list some websites which may prove useful:

There are also lots of general dataset repositories out there which may be worth looking at:

Reproducing the results

Dedicated conda environment

To run the code locally (without any Docker container), I installed PyTorch (with GPU support) in a dedicated conda environment by following this guide.

Note that the cuDNN files have to be copied to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8. Once CUDA 11.8 and the compatible cuDNN version are installed, we set up GPU-enabled PyTorch through conda according to the official website:

conda create -n hamgan python=3.9
conda activate hamgan
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
conda install jupyter notebook pandas matplotlib seaborn -y
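
As a quick sanity check (not part of the repository's instructions), the following one-liner should print the installed version and True if the GPU build is working:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"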

Set up a dedicated jupyter kernel

In order to run the repository's .ipynb notebooks, we need to register our conda environment as a Jupyter kernel by typing:

python -m ipykernel install --user --name=hamgan
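
Optionally, we can confirm the kernel was registered (hamgan should appear among the listed kernels):

jupyter kernelspec list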
