MMBench: End-to-End Benchmarking Tool for Analyzing the Hardware-Software Implications of Multi-modal DNNs
Multi-modal DNNs have become increasingly popular across various application domains due to their significant accuracy improvements over SOTA uni-modal DNNs.
*Figure: multi-modal DNNs span application domains such as self-driving, medical, multimedia, and robotics.*
To understand the implications of multi-modal DNNs for hardware-software co-design, we have developed MMBench, an end-to-end benchmarking tool that evaluates the performance of multi-modal DNNs at both the architecture and system levels.
MMBench provides profiling tools built on the integrated profilers for both CPUs and NVIDIA GPUs, including the PyTorch Profiler, Nsight Systems, and Nsight Compute. Together, these tools enable researchers to comprehensively understand the execution of multi-modal DNNs.
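For instance, here is a minimal sketch (the model attributes and stage names are hypothetical placeholders, not MMBench APIs) of annotating a forward pass with NVTX ranges, so that Nsight Systems can attribute GPU activity to each multi-modal stage when the script is launched under `nsys profile`:

```python
import torch

# Hypothetical stage names; wrap whichever encoder/fusion/head calls your
# application uses. Run under Nsight Systems (e.g. `nsys profile python app.py`)
# to see these ranges on the GPU timeline.
def annotated_forward(model, img, audio):
    torch.cuda.nvtx.range_push("encoder_image")
    z_img = model.img_encoder(img)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("encoder_audio")
    z_audio = model.audio_encoder(audio)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("fusion_and_head")
    out = model.head(torch.cat([z_img, z_audio], dim=-1))
    torch.cuda.nvtx.range_pop()
    return out
```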
In all, MMBench offers the following unique features, closely tied to the characteristics of multi-modal DNNs, which distinguish it from general-purpose benchmarks:
- Fine-grained Network Characterization
- End-to-End Application Execution
- User-friendly Profiler Integration
MMBench includes nine applications drawn from the five most important multi-modal research domains, as shown below, covering a wide range of today's multi-modal DNN workloads.
Application | Domain | Size | Modalities | Unimodal models | Fusion models | Task type |
---|---|---|---|---|---|---|
Avmnist | Multimedia | Small | Image, audio | CNN | Concat/Tensor | Classification |
MMimdb | Multimedia | Medium | Image, text | CNN+Transformer | Concat/Tensor | Classification |
CMU-MOSEI | Affective computing | Large | Language, vision, audio | CNN+Transformer | Concat/Tensor/Transformer | Regression |
Sarcasm | Affective computing | Small | Language, vision, audio | CNN+Transformer | Concat/Tensor/Transformer | Classification |
Medical VQA | Medical | Large | Image, text | CNN+Transformer | Transformer | Generation |
Medical Segmentation | Medical | Large | MRI scans (T1, T1c, T2, FLAIR) | CNN+Transformer | Transformer | Segmentation |
MuJoCo Push | Robotics | Medium | Image, force, proprioception, control | CNN+RNN | Concat/Tensor/Transformer | Classification |
Vision & Touch | Robotics | Large | Image, force, proprioception, depth | CNN+RNN | Concat/Tensor | Classification |
TransFuser | Autonomous driving | Large | Image, LiDAR | ResNet-34, ResNet-18 | Transformer | Classification |
From a software perspective, the chosen applications cover many kinds of subnets (mainly used as encoders), fusion methods, and head methods, which together constitute a complete multi-modal DNN.
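As an illustration, here is a minimal sketch (not MMBench code; all names are hypothetical) of how these pieces compose, with two unimodal encoders, a fusion step, and a classification head:

```python
import torch
import torch.nn as nn

class TwoModalNet(nn.Module):
    """Toy multi-modal DNN: unimodal encoders -> fusion -> head."""
    def __init__(self, img_dim=64, audio_dim=64, hidden=128, n_classes=10):
        super().__init__()
        # Unimodal subnets (stand-ins for the CNN/transformer encoders above).
        self.img_encoder = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        # Classification head over the concatenated features ("Concat" fusion).
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, img, audio):
        z_img = self.img_encoder(img)
        z_audio = self.audio_encoder(audio)
        fused = torch.cat([z_img, z_audio], dim=-1)
        return self.head(fused)

def tensor_fusion(z_a, z_b):
    """"Tensor" fusion: batched outer product of two feature vectors,
    flattened to shape (batch, d_a * d_b)."""
    return torch.bmm(z_a.unsqueeze(2), z_b.unsqueeze(1)).flatten(1)

# Usage: logits = TwoModalNet()(torch.randn(8, 64), torch.randn(8, 64))
```

Swapping the fusion function (concat, tensor, or a transformer block) while keeping the encoders and head fixed is what the "Fusion models" column in the table above refers to.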
Nsight Systems and Nsight Compute measurement scripts are provided in the `scripts` folder; follow the instructions there to run experiments.
The code for measuring with the PyTorch Profiler is contained in each application's folder; results are written to the `log` folder.
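The per-application code varies, but a minimal sketch of the measurement pattern (the function name and arguments here are illustrative, not MMBench's exact interface) looks like:

```python
import os
import torch
from torch.profiler import profile, ProfilerActivity

def measure(model, inputs, logdir="log"):
    os.makedirs(logdir, exist_ok=True)
    # Record CPU and CUDA activity for one inference pass.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        with torch.no_grad():
            model(*inputs)
    # Per-operator summary, sorted by GPU time.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    # Chrome trace, viewable in chrome://tracing or Perfetto.
    prof.export_chrome_trace(os.path.join(logdir, "trace.json"))
```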
Some code and applications were adapted from MultiBench.
Our team has been working on related technologies since 2018. Thanks to everyone who has contributed to this project.
Correspondence to:
- Cheng Xu ([email protected])
- Xuehan Tang ([email protected])
- Jiacheng Liu ([email protected])
- Xiaofeng Hou ([email protected])
- Chao Li ([email protected])
- Jieping Ye ([email protected])
- Lingyu Sun ([email protected])
- Tongqiao Xu ([email protected])
- Peng Tang ([email protected])
- Guangya Li ([email protected])
- Yinglei Teng ([email protected])
- Tianhao Huang ([email protected])
- Xiaozhi Zhu ([email protected])
- Mo Niu ([email protected])
- Tianyu Zang ([email protected])
- Minyi Guo ([email protected])
Characterizing and Understanding End-to-End Multi-modal Neural Networks on GPUs
Xiaofeng Hou, Cheng Xu, Jiacheng Liu, Xuehan Tang, Lingyu Sun, Chao Li and Kwang-Ting Cheng
IEEE Computer Architecture Letters