[ICCV 2023] SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection
We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations. Specifically, SparseFusion utilizes the outputs of parallel detectors in the LiDAR and camera modalities as sparse candidates for fusion. We transform the camera candidates into the LiDAR coordinate space by disentangling the object representations. Then, we can fuse the multi-modality candidates in a unified 3D space by a lightweight self-attention module. To mitigate negative transfer between modalities, we propose novel semantic and geometric cross-modality transfer modules that are applied prior to the modality-specific detectors. SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed.
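As a rough illustration of the fusion step described above, the sketch below concatenates LiDAR and camera candidates and passes them through one round of scaled dot-product self-attention. It is not the SparseFusion module itself: the function names are hypothetical, the projections are identity maps with no learned weights, and there is a single head with no positional encoding, all of which are simplifying assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), so nothing here is learned.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        fused.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return fused


def fuse_candidates(lidar: List[Vector], camera: List[Vector]) -> List[Vector]:
    """Concatenate candidates from both modalities and let them attend to each other."""
    return self_attention(lidar + camera)
```

Each output row is a weighted average of all input candidates, so every candidate can draw on information from both modalities while the number of candidates stays unchanged.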
[paper link] [Chinese summary (自动驾驶之心)]
[2023-8-21] Training GPU memory usage is much lower (45 GB -> 29 GB) with no loss in accuracy or speed!
[2023-7-13] 🔥SparseFusion has been accepted to ICCV 2023!🔥
[2023-3-21] We release the first version of the SparseFusion code.
Compared to existing fusion algorithms, SparseFusion achieves state-of-the-art performance as well as the fastest inference speed on the nuScenes test set. †: The official repository of AutoAlignV2 uses flip as test-time augmentation. ‡: We use the BEVFusion-base results from the official BEVFusion repository to match the input resolutions of other methods.
We do not use any test-time augmentations or model ensembles to get these results. We have released the configuration files and pretrained checkpoints to reproduce our results.
Image Backbone | Point Cloud Backbone | mAP | NDS | Link |
---|---|---|---|---|
ResNet50 | VoxelNet | 70.5 | 72.8 | config/ckpt |
Swin-T | VoxelNet | 71.0 | 73.1 | config/ckpt |
Image Backbone | Point Cloud Backbone | mAP | NDS |
---|---|---|---|
ResNet50 | VoxelNet | 72.0 | 73.8 |
- We test our code in an environment with CUDA 11.5, Python 3.7, PyTorch 1.7.1, TorchVision 0.8.2, NumPy 1.20.0, and numba 0.48.0.
- We use `mmdet==2.10.0` and `mmcv==1.2.7` for our code. Please refer to their official instructions for installation.
- You can install `mmdet3d==0.11.0` directly from our repo by running `cd SparseFusion` followed by `pip install -e .`
- We use `spconv==2.3.3`. Please follow the official instructions to install it based on your CUDA version: `pip install spconv-cuxxx` (e.g. `pip install spconv-cu114`).
- You also need to install the deformable attention module with the following command: `pip install ./mmdet3d/models/utils/ops`
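Because the pinned versions above are fairly old, it can help to check an environment against them before training. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and the dictionary keys are assumed distribution names that may differ on your system (for example, mmcv is sometimes installed as `mmcv-full`). spconv is left out because its distribution name depends on the CUDA version.

```python
from typing import Callable, Dict, Optional, Tuple

# Versions copied from the installation notes. The keys are assumed
# distribution names and may need adjusting for your setup.
EXPECTED_VERSIONS: Dict[str, str] = {
    "torch": "1.7.1",
    "torchvision": "0.8.2",
    "numpy": "1.20.0",
    "numba": "0.48.0",
    "mmdet": "2.10.0",
    "mmcv": "1.2.7",
    "mmdet3d": "0.11.0",
}


def base_version(version: str) -> str:
    """Drop a local build suffix, e.g. '1.7.1+cu110' becomes '1.7.1'."""
    return version.split("+", 1)[0]


def find_mismatches(
    expected: Dict[str, str],
    lookup: Callable[[str], Optional[str]],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (expected, found)} for missing or differing packages."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found is None or base_version(found) != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


def installed_version(name: str) -> Optional[str]:
    """Look up an installed distribution's version, or None if it is absent."""
    try:
        from importlib import metadata  # standard library on Python 3.8+
    except ImportError:
        import importlib_metadata as metadata  # backport, needed on Python 3.7
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    report = find_mismatches(EXPECTED_VERSIONS, installed_version)
    for name, (wanted, found) in report.items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Passing the lookup function as a parameter keeps the comparison logic testable without touching the real environment.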
Download the full nuScenes dataset from the official website. You should have a folder structure like this:
SparseFusion
├── mmdet3d
├── tools
├── configs
├── data
│ ├── nuscenes
│ │ ├── maps
│ │ ├── samples
│ │ ├── sweeps
│ │ ├── v1.0-test
│ │ ├── v1.0-trainval
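Before preprocessing, the layout above can be checked with a few lines of Python. This is only a sketch under the assumption that the five subfolders shown in the tree are the ones that matter; `missing_nuscenes_dirs` is a hypothetical helper, not a script shipped with the repository.

```python
from pathlib import Path
from typing import List

# Subfolders of data/nuscenes shown in the folder structure above.
REQUIRED_SUBDIRS = ["maps", "samples", "sweeps", "v1.0-test", "v1.0-trainval"]


def missing_nuscenes_dirs(repo_root: str) -> List[str]:
    """Return the required subfolders that are absent under data/nuscenes."""
    data_root = Path(repo_root) / "data" / "nuscenes"
    return [name for name in REQUIRED_SUBDIRS if not (data_root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_nuscenes_dirs(".")
    if missing:
        print("Missing under data/nuscenes:", ", ".join(missing))
    else:
        print("nuScenes folder structure looks complete.")
```

Run it from the `SparseFusion` root; an empty result means all listed folders exist, although it does not verify their contents.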
Then, you can select either of the two ways to preprocess the data.
- Run the following two commands sequentially:
  `python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes`
  `python tools/combine_view_info.py`
- Alternatively, you may directly download our preprocessed data from Google Drive and put the files in `data/nuscenes`.
Please download the initial weights for model training and put them in `checkpoints/`.
In our default setting, we train the model with 4 GPUs.
# training
bash tools/dist_train.sh configs/sparsefusion_nusc_voxel_LC_r50.py 4 --work-dir work_dirs/sparsefusion_nusc_voxel_LC_r50
# test
bash tools/dist_test.sh configs/sparsefusion_nusc_voxel_LC_r50.py ${CHECKPOINT_FILE} 4 --eval=bbox
Note: We use A6000 GPUs (48 GB of memory per GPU) for model training. Training the SparseFusion model (ResNet50 backbone) requires about 29 GB of memory per GPU.
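If you launch runs from Python rather than the shell, the two commands above can be assembled programmatically. The sketch below only builds the argument lists; `train_command` and `eval_command` are hypothetical helpers, and deriving the work directory from the config file name is an assumption based on the single example shown above.

```python
from pathlib import Path
from typing import List


def train_command(config: str, num_gpus: int = 4) -> List[str]:
    """Build the dist_train.sh call, using work_dirs/<config name> as output."""
    work_dir = f"work_dirs/{Path(config).stem}"
    return ["bash", "tools/dist_train.sh", config, str(num_gpus), "--work-dir", work_dir]


def eval_command(config: str, checkpoint: str, num_gpus: int = 4) -> List[str]:
    """Build the dist_test.sh call with bounding-box evaluation."""
    return ["bash", "tools/dist_test.sh", config, checkpoint, str(num_gpus), "--eval=bbox"]


if __name__ == "__main__":
    config = "configs/sparsefusion_nusc_voxel_LC_r50.py"
    # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
    print(" ".join(train_command(config)))
```

Changing `num_gpus` alters only the launcher argument; the README does not say whether the learning rate or batch size needs adjusting for a different GPU count.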
If you have any questions, feel free to open an issue or contact us at [email protected].
We sincerely thank the authors of mmdetection3d, TransFusion, BEVFusion, MSMDFusion, and DeepInteraction for providing their code or pretrained weights.
If you find our work useful, please consider citing the following paper:
@article{xie2023sparsefusion,
title={SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection},
author={Xie, Yichen and Xu, Chenfeng and Rakotosaona, Marie-Julie and Rim, Patrick and Tombari, Federico and Keutzer, Kurt and Tomizuka, Masayoshi and Zhan, Wei},
journal={arXiv preprint arXiv:2304.14340},
year={2023}
}