Step recognition

This is the code for training and evaluating the perception models built for the PTG project and developed by NYU. It processes videos and predicts task (skill) steps, such as those related to tactical field care.

Note

The following skills are used:

(June/2024 demo) Apply Tourniquet (M2), Pressure Dressing (M3), X-Stat (M5), and Apply Chest Seal (R18)
(December/2024 demo) Nasopharyngeal Airway (NPA) (A8), Wound Packing (M4), Ventilate with Bag-Valve-Mask (BVM) (R16), and Needle Chest Decompression (R19)

Install

Note

This whole process works on the NYU Greene HPC.

Consider using singuconda to make singularity easier to use on the HPC.

Repo

git clone --recursive https://github.com/VIDA-NYU/Perception-training.git  

cd Perception-training/
pip install -r requirements.txt
pip install -e .

cd auditory_slowfast/
pip install -e .

Dataset

All video annotations should be in a CSV file with the EPIC-KITCHENS structure. You should also add the column video_fps, which gives the FPS of each annotated video.

Note

The code uses only these columns: video_id, start_frame, stop_frame, narration, verb_class, video_fps
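The column list above can be checked before training. The following is a minimal sketch, not part of the repository: the function name check_annotations and the two example rows are hypothetical. It verifies that the required columns are present and converts each annotated segment from frames to seconds using video_fps.

```python
import csv
import io

# Columns the README says the code reads.
REQUIRED_COLUMNS = ("video_id", "start_frame", "stop_frame",
                    "narration", "verb_class", "video_fps")


def check_annotations(csv_file):
    """Validate an annotation CSV; return one dict per row with its length in seconds."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    segments = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        start, stop = int(row["start_frame"]), int(row["stop_frame"])
        fps = float(row["video_fps"])
        if fps <= 0 or stop < start:
            raise ValueError(f"line {line_no}: invalid frame range or FPS")
        segments.append({
            "video_id": row["video_id"],
            "narration": row["narration"],
            "seconds": (stop - start) / fps,  # frames divided by frames per second
        })
    return segments


# Hypothetical rows for illustration; a real file would be opened with
# open(path, newline="") and passed in the same way.
example = io.StringIO(
    "video_id,start_frame,stop_frame,narration,verb_class,video_fps\n"
    "vid_001,0,90,place tourniquet,0,30\n"
    "vid_001,90,150,tighten strap,1,30\n"
)
segments = check_annotations(example)
```

Extra columns from the EPIC-KITCHENS structure are ignored by this check, in line with the note that only the six listed columns are read.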

Preprocessing videos

Preprocessing consists of extracting the video frames and the sound. You can do both with the following commands:

1.1 Extracting frames or sound

bash scripts/extract_frames.sh /path/to/the/skill desc/Data /path/to/save/the/frames/ SKILL frame true 

bash scripts/extract_frames.sh /path/to/the/skill desc/Data /path/to/save/the/sound/ SKILL sound true

1.2 /path/to/the/skill desc/ should be structured as follows:

 |- skill desc
   Data
     |- video_id
       video_id.skill_labels_by_frame.txt
       video_id.mp4
     |- video_id   
       video_id.skill_labels_by_frame.txt
       video_id.mp4
     ...
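Before extracting frames, it may be worth confirming that every video folder follows this layout. The sketch below is an illustration only: the function name find_incomplete_videos and the video ids are hypothetical. It reports the folders under Data that lack either the .mp4 file or the skill_labels_by_frame.txt file.

```python
from pathlib import Path
import tempfile


def find_incomplete_videos(data_dir):
    """Return {video_id: [missing file names]} for folders lacking an expected file."""
    problems = {}
    for video_dir in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        video_id = video_dir.name
        expected = [f"{video_id}.mp4", f"{video_id}.skill_labels_by_frame.txt"]
        missing = [name for name in expected if not (video_dir / name).is_file()]
        if missing:
            problems[video_id] = missing
    return problems


# Self-contained demo on a temporary tree with hypothetical video ids.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "skill desc" / "Data"
    layout = {
        "vid_a": ["vid_a.mp4", "vid_a.skill_labels_by_frame.txt"],
        "vid_b": ["vid_b.mp4"],  # label file deliberately left out
    }
    for video_id, files in layout.items():
        (data / video_id).mkdir(parents=True)
        for name in files:
            (data / video_id / name).touch()
    report = find_incomplete_videos(data)
```

An empty result means every folder contains both expected files.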

1.3 Use squash to pack the files into an image that can be used with singularity.

bash scripts/extract_frames.sh /path/to/the/skill desc/Data /path/to/save/the/frames/ SKILL frame false 

bash scripts/extract_frames.sh /path/to/the/skill desc/Data /path/to/save/the/sound/ SKILL sound false

Important

To execute this script, consider using singularity with the image ubuntu-22.04.3.sif or rockylinux-9.2.sif, both available on the NYU HPC.

If you are not using singularity, remember to install ffmpeg.

1.4 If you want to run outside the NYU HPC, execute this script instead, and do not forget to install ffmpeg:

bash scripts/out_hpc/extract_frames.sh /path/to/the/skill desc/Data /path/to/save/the/frames/ SKILL frame

bash scripts/out_hpc/extract_frames.sh /path/to/the/skill desc/Data /path/to/save/the/sound/ SKILL sound
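The two commands above can be wrapped in a small pre-flight check so that a missing ffmpeg is reported before any extraction starts. This is a sketch under stated assumptions, not part of the repository: the function names are hypothetical, and passing a skill code such as M2 as the SKILL argument is an assumption. It is meant to be run from the repository root.

```python
import shutil
import subprocess

SCRIPT = "scripts/out_hpc/extract_frames.sh"  # path taken from the commands above


def build_commands(data_dir, frames_dir, sound_dir, skill):
    """Return the two extraction commands (frames, then sound) as argument lists."""
    return [
        ["bash", SCRIPT, data_dir, frames_dir, skill, "frame"],
        ["bash", SCRIPT, data_dir, sound_dir, skill, "sound"],
    ]


def run_extraction(data_dir, frames_dir, sound_dir, skill, dry_run=False):
    """Check that ffmpeg is on PATH, then run both commands in order."""
    if not dry_run and shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH; install it first")
    commands = build_commands(data_dir, frames_dir, sound_dir, skill)
    for cmd in commands:
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands


# Dry run with the placeholder paths from this README: nothing is executed,
# the commands are only built. "M2" as the skill value is an assumption.
commands = run_extraction(
    "/path/to/the/skill desc/Data",
    "/path/to/save/the/frames/",
    "/path/to/save/the/sound/",
    "M2",
    dry_run=True,
)
```

Passing the arguments as a list rather than a single shell string keeps a path containing a space, such as skill desc, as one argument.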

Training and making predictions

Check the configuration files under the config folder.

2.1 The field TRAIN.ENABLE should be True for training and False for prediction.

2.2 Change the paths to the labels: DATASET.TR_ANNOTATIONS_FILE (train), DATASET.VL_ANNOTATIONS_FILE (validation), and DATASET.TS_ANNOTATIONS_FILE (test).

2.3 If you are evaluating the models, the config file should point to the model used for predictions: MODEL.OMNIGRU_CHECKPOINT_URL.

2.4 You also have to configure the location of the Yolo models needed to extract image features: MODEL.YOLO_CHECKPOINT_URL.

2.5 The following script runs cross-validation by default. To run a single pass instead, change CROSS_VALIDATION="false" inside the script and also change the config TRAIN.USE_CROSS_VALIDATION.

bash scripts/omnimix.sh /path/to/the/frames/squash/files /path/to/the/sound/squash/files config/M2.yaml

Important

This code uses the squash files created previously.

It also expects the use of singuconda.

2.6 If you want to run outside the NYU HPC or without singularity, change the config file to point to your frame (DATASET.LOCATION) and sound (DATASET.AUDIO_LOCATION) paths. Then execute this python script:

python tools/run_step_recog.py --cfg config/M2.yaml
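Since several configuration fields must point to existing locations, a quick check before launching the run can save a failed job. The following is a minimal sketch, assuming the configuration has already been parsed into nested dictionaries. The helper names lookup and unresolved_paths and the example values are hypothetical. Only the dataset fields are checked, because the checkpoint fields may hold URLs rather than local paths.

```python
import os

# Dotted keys named in this README that should hold filesystem paths.
PATH_KEYS = (
    "DATASET.LOCATION",
    "DATASET.AUDIO_LOCATION",
    "DATASET.TR_ANNOTATIONS_FILE",
    "DATASET.VL_ANNOTATIONS_FILE",
    "DATASET.TS_ANNOTATIONS_FILE",
)


def lookup(cfg, dotted_key):
    """Follow a dotted key such as 'DATASET.LOCATION' through nested dicts; None if absent."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def unresolved_paths(cfg, keys=PATH_KEYS):
    """Return the keys that are unset or point to a path that does not exist."""
    bad = []
    for key in keys:
        value = lookup(cfg, key)
        if not value or not os.path.exists(value):
            bad.append(key)
    return bad


# Hypothetical config contents; a real one would come from parsing config/M2.yaml.
cfg = {
    "TRAIN": {"ENABLE": False},
    "DATASET": {"LOCATION": os.getcwd(), "AUDIO_LOCATION": "/no/such/dir"},
}
problems = unresolved_paths(cfg)
```

Here problems lists DATASET.AUDIO_LOCATION and the three annotation files, since the first points to a missing directory and the others are not set.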

Visualizing the results

The configuration file should also point to the model used for prediction.

python step_recog/full/visualize.py /path/to/the/video/mp4/file output.mp4 config/M3.yaml

Feature extraction

The configuration file should also point to the model used for prediction and to a place to save the features (OUTPUT.LOCATION).

python tools/test.py --cfg config/M3.yaml

Code structure

Main code: tools/run_step_recog.py (function train_kfold)
Training/evaluation routines: step_recog/iterators.py (functions train, evaluate)
Model classes: step_recog/models.py
Dataloader: step_recog/datasets/milly.py (methods _construct_loader and __getitem__)

class Milly_multifeature_v4 loads video frames and returns formatted features
class Milly_multifeature_v5 loads (preprocessed) features and returns formatted features

Image augmentation: tools/augmentation.py (function get_augmentation)
Basic configuration: step_recog/config/defaults.py (more important), act_recog/config/defaults.py, auditory_slowfast/config/defaults.py
Visualizer: step_recog/full/visualize.py implements dedicated code that combines data loading, model prediction, and a state machine. It serves as the user interface to the trained models.
