This is a continuation of https://github.com/Zarand3r/deepStitch, forked from https://github.com/fluongo/deepStitch.
We can divide the suturing process into discrete actions such as needle positioning, needle driving, etc. (TODO: insert a more complete description of the data and the labels.) The focus of this project is to apply computer vision techniques (standard two-stream action recognition models) towards the goals of:
- Action segmentation
Recognition of when a stitch is happening (i.e. "stitch segmentation").
The ConvLSTM is a two-stream network, with one stream taking RGB frames and the other taking optical flow, obtained by visualizing the inverse flow generated by FlowNet. Single-stream models (RGB only or optical flow only) are also explored. We also implement a multi-stage TCN and compare its results with the ConvLSTM. The general model schematic varies the pre-trained backbone (e.g. AlexNet, ResNet, or VGG16) and the recurrent model learned at the classification layer (RNN, GRU, LSTM, convLSTM, convttLSTM). We also implemented saliency map visualizations.
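As a rough illustration of this backbone-plus-recurrent-head pattern, here is a minimal single-stream sketch, assuming a ResNet-18 backbone and a plain LSTM head; the class name and shapes are illustrative and do not reflect the repo's actual convLSTM/convttLSTM implementations.

```python
# Minimal single-stream sketch (illustrative names/shapes, not the repo's actual model):
# a pretrained CNN backbone extracts per-frame features, a recurrent layer aggregates
# them over time, and a linear layer produces gesture logits.
import torch
import torch.nn as nn
import torchvision.models as models

class FrameSequenceClassifier(nn.Module):
    def __init__(self, num_classes, hidden_size=64, rnn_layers=2):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc -> 512-d features
        self.rnn = nn.LSTM(512, hidden_size, num_layers=rnn_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                                # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.features(clips.flatten(0, 1)).flatten(1)  # (batch*time, 512)
        out, _ = self.rnn(feats.view(b, t, -1))              # aggregate per-frame features over time
        return self.classifier(out[:, -1])                   # logits from the last time step

logits = FrameSequenceClassifier(num_classes=5)(torch.randn(2, 8, 3, 224, 224))
```

In the two-stream setting, the same pattern would be applied separately to the RGB and flow streams, with their outputs fused for the final prediction.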
The inputs to the TCN are I3D features extracted from the training videos. The models are trained on sponge data, but we could easily repoint the data paths to the pipe data, which are also compatible with the ConvLSTM and TCN preprocessing scripts. Once we implement the style transfer and optical_flow pixel masking modules, which output transformed versions of the raw videos, we can also train on those paths. The style transfer module will transform videos to look more life-like. These modules will be implemented in the utils folder and are intended to help the model learn more generalizable features and to aid transfer learning. The optical_flow pixel masking module will transform the videos to contain only the pixels that are moving, as determined by optical flow, so that the model does not focus on background information unique to the sponge or the tubes.
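As a very rough sketch of what the planned optical_flow pixel masking module could look like, assuming the flow field has already been computed for each frame; the function name and threshold are placeholders rather than the actual implementation:

```python
# Hypothetical sketch of the planned optical-flow pixel-masking module:
# keep only pixels whose flow magnitude exceeds a threshold, zeroing out
# static background that is specific to the sponge/tube setting.
import numpy as np

def mask_static_pixels(frame, flow, threshold=1.0):
    """frame: (H, W, 3) uint8 RGB image; flow: (H, W, 2) flow vectors in pixels."""
    magnitude = np.linalg.norm(flow, axis=-1)            # (H, W) motion magnitude
    mask = magnitude > threshold                         # True where pixels are moving
    return frame * mask[..., None].astype(frame.dtype)   # zero out static background

# Example with synthetic inputs; real flow would come from FlowNet or DALI.
masked = mask_static_pixels(np.zeros((240, 320, 3), np.uint8), np.zeros((240, 320, 2)))
```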
- Action segmentation with kinematics
Action segmentation aided by semi-supervised injection of kinematics data to help the models focus on finer-grained details.
- Skills assessment
Trying to classify the success of a particular action in the action sequence. We also experimented with transfer learning, fine-tuning a ConvLSTM pre-trained on action segmentation (using frames around time point B, with 1 second of padding). We conclude that the ConvLSTM is not fine-grained enough to be effective for skills assessment, because the success of particular actions depends on very fine-grained details, such as the position at which the needle is held relative to its length, that the ConvLSTM cannot pick up.
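Purely as an illustration of that transfer-learning setup, here is a hedged sketch that assumes the pretrained model exposes a linear head named classifier; the function and attribute names are hypothetical, and the real training entry point is lightning_train.py.

```python
# Hypothetical transfer-learning sketch: freeze a model pretrained on action
# segmentation and retrain only a fresh classification head for skill scores.
import torch.nn as nn

def prepare_for_skill_scoring(pretrained_model: nn.Module, num_skill_levels: int = 3) -> nn.Module:
    for p in pretrained_model.parameters():
        p.requires_grad = False                        # freeze all pretrained weights
    old_head = pretrained_model.classifier             # assumes an nn.Linear head named "classifier"
    pretrained_model.classifier = nn.Linear(old_head.in_features, num_skill_levels)  # new trainable head
    return pretrained_model
```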
The label in column O: "label_needle positionB" refers to the final needle position that was reached at time point B (column I). The needle is being manipulated during the AB interval to reach the final needle position at time point B, so I believe we use this interval for predicting needle position.
The label in column P: "label_needle_entry_angleC" refers to the entry angle assessment for each time point C (column J). If there are multiple entry angle skill assessments, there were multiple attempts to hit the first needle target (time point C). The interval from time point B to the first C is usually used to predict the first entry angle skill assessment; subsequent assessments correspond to C-C intervals.
The label in column S: "label_needle_driving_1D" refers to the needle driving 1 score. The first score is always assigned for the interval from the last time point C to the first time point D. If there are additional attempts, the corresponding needle driving skill score refers to a D-D interval (though this interval usually includes some reverse movement of the needle). The D-to-E interval spans the last D time point to the end of needle driving.
The label in column T: "label_needle_driving_2FG" refers to the needle driving 2 score. This score is assigned for the interval from F to G.
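To make these interval conventions concrete, here is a hedged sketch of how (start, end, label) tuples could be assembled for one video, assuming the annotated time points are available in seconds; the helper name and example times are illustrative, not the actual preprocessing code.

```python
# Hedged sketch of the interval conventions described above (illustrative only):
# given one video's annotated time points (seconds), build (start, end, label) tuples.
def label_intervals(A, B, C_list, D_list, E, F, G):
    """C_list and D_list hold repeated attempts at time points C and D."""
    intervals = [(A, B, "label_needle positionB")]                 # column O: position reached at B
    # Column P: B to the first C, then C-C intervals for repeated entry-angle attempts.
    for start, c in zip([B] + C_list[:-1], C_list):
        intervals.append((start, c, "label_needle_entry_angleC"))
    # Column S: last C to the first D, then D-D intervals for additional driving attempts;
    # the last D to E spans the remainder of needle driving.
    for start, d in zip([C_list[-1]] + D_list[:-1], D_list):
        intervals.append((start, d, "label_needle_driving_1D"))
    # Column T: needle driving 2 is scored over the F-G interval.
    intervals.append((F, G, "label_needle_driving_2FG"))
    return intervals

# Example with one entry-angle attempt and two driving attempts (times in seconds).
print(label_intervals(A=2.0, B=5.5, C_list=[8.0], D_list=[11.0, 14.5], E=16.0, F=17.0, G=21.0))
```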
- Skills assessment with kinematics
Skills assessment aided by semi-supervised injection of kinematics data to help the models focus on finer-grained details.
Initially, we will train on video demonstrations obtained from the Mimic suturing simulation, exploring exercises on sponges and tubes. We will eventually extend to the real domain.
- PyTorch (1.3)
- PyTorch Lightning
- ffmpeg
- NVIDIA DALI
Optical flow computation with FlowNet uses the following GitHub implementation: https://github.com/ClementPinard/FlowNetPytorch
For faster optical flow computation, you can use the NVIDIA DALI framework (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html), which uses NVIDIA's Turing optical flow (turingOF) hardware. Inference is substantially faster than FlowNet, but the estimates are coarser, resulting in slightly lower performance.
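For reference, a minimal DALI pipeline using the GPU video reader and the hardware optical flow operator might look like the sketch below; the clip path, sequence length, and batch settings are placeholders.

```python
# Minimal sketch of hardware optical flow with NVIDIA DALI (placeholder path/params).
from nvidia.dali import pipeline_def, fn

@pipeline_def
def flow_pipeline(video_files):
    # Read short GPU-decoded frame sequences and run the hardware optical flow engine.
    frames = fn.readers.video(device="gpu", filenames=video_files, sequence_length=16)
    return fn.optical_flow(frames)

pipe = flow_pipeline(video_files=["movie_clips/00/example.mp4"],  # placeholder clip
                     batch_size=1, num_threads=2, device_id=0)
pipe.build()
flows, = pipe.run()  # one reduced-resolution flow field per consecutive frame pair
```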
If you have the mp4 files in a directory with labeled subdirectories corresponding to gesture labels:
movie_clips
├───00
│   │   fn*.mp4
│   │   ...
└───01
    │   fn*.mp4
    │   ...
# Preprocess the data into the corresponding movie folder
bash preprocessing/generate_flows.sh movie_clips/
# Train the model
python lightning_train.py \
--arch resnet18 \
--rnn_model convLSTM \
--rnn_layers 2 \
--hidden_size 64 \
--fc_size 128 \
--lr 1e-4
*Instructions for cutting model to come later...*
- 411 labeled video clips of varying length (2-30 s) across the 5 gesture types shown above
- 24 surgical procedure videos (>1 hour) with labeled segments corresponding to times when the needle actually entered tissue
Models for recognizing and evaluating stitches in surgical videos using deep networks.
We are currently migrating over from the old directory; this will be the newest one, as it contains the Lightning files for expedited training during experiments.