AF-SfMLearner

[MedIA2022 & ICRA2021] Self-Supervised Monocular Depth and Ego-Motion Estimation in Endoscopy: Appearance Flow to the Rescue



This is the official PyTorch implementation for training and testing depth estimation models using the method described in

Self-Supervised Monocular Depth and Ego-Motion Estimation in Endoscopy: Appearance Flow to the Rescue

Shuwei Shao, Zhongcai Pei, Weihai Chen, Wentao Zhu, Xingming Wu, Dianmin Sun and Baochang Zhang

accepted by Medical Image Analysis (arXiv pdf)


Self-Supervised Learning for Monocular Depth Estimation on Minimally Invasive Surgery Scenes

Shuwei Shao, Zhongcai Pei, Weihai Chen, Baochang Zhang, Xingming Wu, Dianmin Sun and David Doermann

ICRA 2021 (pdf).


✏️ 📄 Citation

If you find our work useful in your research, please consider citing our papers:

  title={Self-Supervised monocular depth and ego-Motion estimation in endoscopy: Appearance flow to the rescue},
  author={Shao, Shuwei and Pei, Zhongcai and Chen, Weihai and Zhu, Wentao and Wu, Xingming and Sun, Dianmin and Zhang, Baochang},
  journal={Medical image analysis},

  title={Self-Supervised Learning for Monocular Depth Estimation on Minimally Invasive Surgery Scenes},
  author={Shao, Shuwei and Pei, Zhongcai and Chen, Weihai and Zhang, Baochang and Wu, Xingming and Sun, Dianmin and Doermann, David},
  booktitle={2021 IEEE International Conference on Robotics and Automation (ICRA)},

⚙️ Setup

We ran our experiments with PyTorch 1.2.0, torchvision 0.4.0, CUDA 10.2, Python 3.7.3 and Ubuntu 18.04.

🖼️ Prediction for a single image or a folder of images

You can predict scaled disparity for a single image or a folder of images with:

CUDA_VISIBLE_DEVICES=0 python --model_path <your_model_path> --image_path <your_image_or_folder_path>
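
The prediction is a scaled disparity, not a depth map. Because the code is based on Monodepth2, a conversion in the style of Monodepth2's `disp_to_depth` is a reasonable sketch of how the two relate. The `min_depth` and `max_depth` defaults below are illustrative assumptions, not values confirmed by this repository; use the bounds your model was trained with.

```python
def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Convert a network disparity in [0, 1] to scaled disparity and depth.

    Works on plain floats and, elementwise, on NumPy arrays or torch tensors.
    min_depth and max_depth are assumed bounds (see the note above).
    """
    min_disp = 1.0 / max_depth  # disparity of the farthest allowed point
    max_disp = 1.0 / min_depth  # disparity of the nearest allowed point
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    depth = 1.0 / scaled_disp
    return scaled_disp, depth


# Usage: a disparity of 0 maps to max_depth, a disparity of 1 to min_depth.
scaled_disp, depth = disp_to_depth(0.5)
print(scaled_disp, depth)
```

The resulting depth is still only defined up to the unknown scale of monocular training.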

💾 Datasets

You can download the Endovis (SCARED) dataset by signing the challenge rules and emailing them to [email protected]. The EndoSLAM, SERV-CT, and Hamlyn datasets are also available for download.

Endovis split

The train/test/validation split of the Endovis dataset used in our work is defined in the splits/endovis folder.
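
A minimal sketch of reading such a split file, assuming it follows the Monodepth2 convention of one sample per line in the form `<folder> <frame_index> <side>`. The line format, the helper name `read_split`, and the file name in the usage comment are assumptions; check the files in splits/endovis before relying on them.

```python
from pathlib import Path


def read_split(path):
    """Parse a split file with lines of the assumed form '<folder> <frame_index> <side>'."""
    entries = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        folder, frame_index, side = line.split()
        entries.append((folder, int(frame_index), side))
    return entries


# Usage (hypothetical file name):
# samples = read_split("splits/endovis/train_files.txt")
```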

Endovis data preprocessing

We use ffmpeg to convert each RGB.mp4 video into PNG frames:

find . -name "*.mp4" -print0 | xargs -0 -I {} sh -c 'output_dir=$(dirname "$1"); ffmpeg -i "$1" "$output_dir/%10d.png"' _ {}

We use only the left frames in our experiments. For datasets 8 and 9, we renumber keyframes 0-4 as keyframes 1-5.
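
The keyframe renumbering for datasets 8 and 9 can be sketched as below. The directory naming (`keyframe_0` ... `keyframe_4` inside each dataset folder) and the helper name `shift_keyframes` are assumptions for illustration; adapt `prefix` to your layout.

```python
from pathlib import Path


def shift_keyframes(dataset_dir, prefix="keyframe_", first=0, last=4):
    """Rename <prefix>0..<prefix>4 to <prefix>1..<prefix>5 inside dataset_dir."""
    dataset_dir = Path(dataset_dir)
    # Walk from the highest index down so a rename never hits an existing name.
    for idx in range(last, first - 1, -1):
        src = dataset_dir / f"{prefix}{idx}"
        dst = dataset_dir / f"{prefix}{idx + 1}"
        if not src.exists():
            continue
        if dst.exists():
            # Refuse to overwrite, which also guards against running twice.
            raise FileExistsError(f"{dst} already exists")
        src.rename(dst)


# Usage (hypothetical paths):
# for name in ("dataset_8", "dataset_9"):
#     shift_keyframes(Path("endovis_data") / name)
```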

Data structure

The dataset directory structure is as follows:


⏳ Endovis training

Stage-wise fashion:

Stage one:

CUDA_VISIBLE_DEVICES=0 python --data_path <your_data_path> --log_dir <path_to_save_model (optical flow)>

Stage two:

CUDA_VISIBLE_DEVICES=0 python --data_path <your_data_path> --log_dir <path_to_save_model (depth, pose, appearance flow, optical flow)> --load_weights_folder <path_to_the_trained_optical_flow_model_in_stage_one>

End-to-end fashion:

CUDA_VISIBLE_DEVICES=0 python --data_path <your_data_path> --log_dir <path_to_save_model (depth, pose, appearance flow, optical flow)>

📊 Endovis evaluation

To prepare the ground-truth depth maps, run:

CUDA_VISIBLE_DEVICES=0 python --data_path endovis_data --split endovis

...assuming that you have placed the endovis dataset in the default location of ./endovis_data/.

The following example command evaluates the epoch 19 weights of a model named mono_model:

CUDA_VISIBLE_DEVICES=0 python --data_path <your_data_path> --load_weights_folder ~/mono_model/mdp/models/weights_19 --eval_mono
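
Monocular models predict depth only up to an unknown scale. In Monodepth2, evaluation with `--eval_mono` rescales each prediction by the ratio of the ground-truth median to the predicted median before computing errors. The sketch below assumes this repository keeps that convention; `median_scale` is a hypothetical helper, and the depth bounds are illustrative values, not ones confirmed by the source.

```python
import numpy as np

MIN_DEPTH = 1e-3   # assumed lower bound for valid depth
MAX_DEPTH = 150.0  # assumed upper bound for valid depth


def median_scale(pred_depth, gt_depth):
    """Align a predicted depth map to ground truth by per-image median scaling."""
    # Keep only pixels with a valid ground-truth measurement.
    valid = (gt_depth > MIN_DEPTH) & (gt_depth < MAX_DEPTH)
    pred, gt = pred_depth[valid], gt_depth[valid]
    ratio = np.median(gt) / np.median(pred)
    # Clip the rescaled prediction to the same range as the ground truth.
    pred = np.clip(pred * ratio, MIN_DEPTH, MAX_DEPTH)
    return pred, gt, ratio
```

The returned `pred` and `gt` are flattened to the valid pixels, so they can be passed directly to an error metric.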

Appearance Flow

Depth Estimation

Visual Odometry

3D Reconstruction

📦 Model zoo

| Model | Abs Rel | Sq Rel | RMSE | RMSE log | Link |
| --- | --- | --- | --- | --- | --- |
| Stage-wise (ID 5 in Table 8) | 0.059 | 0.435 | 4.925 | 0.082 | baidu (code:n6lh); google |
| End-to-end (ID 3 in Table 8) | 0.059 | 0.470 | 5.062 | 0.083 | baidu (code:z4mo); google |
| ICRA | 0.063 | 0.489 | 5.185 | 0.086 | baidu (code:wbm8); google |
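
For reference, the four error metrics reported above have standard definitions in the monocular depth literature. The sketch below follows those definitions; `compute_errors` is an illustrative helper that assumes `gt` and `pred` are 1-D arrays of strictly positive depths over the valid pixels. It is not necessarily the exact function used in this repository.

```python
import numpy as np


def compute_errors(gt, pred):
    """Return (abs_rel, sq_rel, rmse, rmse_log) for positive depth arrays."""
    diff = gt - pred
    abs_rel = np.mean(np.abs(diff) / gt)   # mean absolute relative error
    sq_rel = np.mean(diff ** 2 / gt)       # mean squared relative error
    rmse = np.sqrt(np.mean(diff ** 2))     # root mean squared error
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log
```

Lower is better for all four. RMSE carries the units of the depth maps, while Abs Rel and RMSE log are unitless.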

Important Note

If you use a more recent PyTorch version than the one listed in Setup:

Note1: please add 'align_corners=True' to 'F.interpolate' and 'F.grid_sample' when you train the network, to get a good camera trajectory.
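
The flag matters because it changes how normalized coordinates map to pixels. In `F.grid_sample`, `align_corners=True` was the default up to PyTorch 1.2.0 and `False` afterwards, so a newer install silently warps images differently from the setup used in our experiments. The helper below, `grid_to_pixel`, is a hypothetical illustration of the two coordinate conventions and is not part of PyTorch or this repository.

```python
def grid_to_pixel(coord, size, align_corners):
    """Map a normalized grid coordinate in [-1, 1] to a pixel position.

    `size` is the number of pixels along the axis being sampled.
    """
    if align_corners:
        # -1 and +1 refer to the centers of the first and last pixels.
        return (coord + 1.0) / 2.0 * (size - 1)
    # -1 and +1 refer to the outer edges of the first and last pixels.
    return ((coord + 1.0) * size - 1.0) / 2.0


# Usage: the same normalized coordinate lands on different pixel positions.
for flag in (True, False):
    print(flag, [grid_to_pixel(c, 4, flag) for c in (-1.0, 0.0, 1.0)])
```

With a width of 4, the coordinate -1 maps to pixel 0 under `align_corners=True` but to -0.5 under `align_corners=False`. A model trained under one convention and run under the other therefore sees slightly shifted warps.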

Note2: please change color_aug=transforms.ColorJitter.get_params(self.brightness,self.contrast,self.saturation,self.hue) to color_aug=transforms.ColorJitter(self.brightness,self.contrast,self.saturation,self.hue).


If you have any questions, please feel free to contact [email protected].


Our code is based on the implementation of Monodepth2. We thank Monodepth2's authors for their excellent work and repository.