EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion

Jianping Jiang $^{1,2}$ , Xinyu Zhou $^{3}$ , Peiqi Duan $^{1,2}$ , Boxin Shi $^{1,2}$

$^{1}$ National Key Laboratory for Multimedia Information Processing, School of Computer Science
$^{2}$ National Engineering Research Center of Visual Technology, School of Computer Science
$^{3}$ National Key Lab of General AI, School of Intelligence Science and Technology
Peking University
[email protected], {zhouxiny, duanqi0001, shiboxin}@pku.edu.cn

Abstract

Event cameras and RGB cameras exhibit complementary characteristics in imaging: the former possesses high dynamic range (HDR) and high temporal resolution, while the latter provides rich texture and color information. This makes the integration of event cameras into middle- and high-level RGB-based vision tasks highly promising. However, challenges arise in multi-modal fusion, data annotation, and model architecture design. In this paper, we propose EvPlug, which learns a plug-and-play event and image fusion module under the supervision of an existing RGB-based model. The learned fusion module integrates event streams with image features in the form of a plug-in, making the RGB-based model robust to HDR and fast motion scenes while enabling high-temporal-resolution inference. Our method only requires unlabeled event-image pairs (no pixel-wise alignment required) and does not alter the structure or weights of the RGB-based model. We demonstrate the superiority of EvPlug in several vision tasks such as object detection, semantic segmentation, and 3D hand pose estimation.

1 Introduction

Traditional frame-based RGB cameras, featuring rich color and texture information as well as lower noise, are the mainstream sensors in computer vision research. However, due to their imaging mechanisms, they inevitably face issues such as overexposure, motion blur, and limited temporal resolution han21evintsr ; wang20jointfilter ; hu20nga ; Timelens ; EDI . Bio-inspired neuromorphic event cameras, with their asynchronous differential imaging mechanism (events are triggered by observing pixel-wise brightness changes exceeding a threshold in logarithmic domain), possess high dynamic range (HDR), high temporal resolution, low data redundancy, and low power consumption licht08davis ; dvs640 ; davis346 ; survey . However, the imaging quality of event cameras degrades when the camera and scene are relatively static or when the scene illumination changes significantly rebecq21e2vid ; e2sri-cvpr20 ; Timelens . These two types of cameras exhibit complementary characteristics in imaging as shown in Fig. 2, which has been demonstrated in various vision tasks, such as high frame-rate video reconstruction Timelens ; tulyakov22timelens++ ; gef-tpami , HDR image restoration han2020hdr , object detection zhou2023rgb , depth estimation daniel21ramnet , etc.

However, event cameras have been studied for a much shorter period of time than RGB cameras, limiting the progress of their integration with RGB cameras. Researchers have built large-scale datasets deng09imagenet ; lin14coco and various meticulously designed network architectures he16resnet ; vaswani17transformer based on RGB images, while the corresponding tasks in event-based vision are at a much less mature stage. Therefore, when fusing the two modalities of data, challenges arise from two perspectives: 1) data perspective: introducing event cameras requires collecting and annotating new data sequences, and large-scale data annotation implies high costs; 2) model perspective: designing new fusion algorithms and retraining models for event streams requires additional model design work and computational resources.

To alleviate these challenges, existing methods hu20nga ; wang21dual ; wang21evdistill ; messikommer22evtransfer ; gehrig20v2e ; rebecq21e2vid utilize the concept of domain adaptation messikommer22evtransfer (also referred to as transfer learning or knowledge distillation in event-based vision wanglin22kd_survey ), leveraging the consistency of visual signals between event and RGB cameras in geometric and semantic dimensions to transfer RGB-based knowledge to event-based tasks. They have achieved promising results on middle- and high-level vision tasks such as classification messikommer22evtransfer , object detection wang21evdistill ; messikommer22evtransfer ; hu20nga , and semantic segmentation sun22ess , where the input data can be compressed into high-dimensional features. (Event and RGB image fusion for low-level vision tasks, such as video interpolation Timelens ; tulyakov22timelens++ and deblurring teng22nest ; sun22event-based , often relies on physical constraints and requires pixel-wise alignment, so domain adaptation is not applied there.) However, these methods are insufficient in two aspects: 1) difficulty in handling the differences between these two types of visual information: event streams contain rich motion and temporal information but lack color and texture information, while RGB images are the opposite messikommer22evtransfer ; zhu21eventgan ; 2) inability to achieve modality fusion that complementarily utilizes both modalities.

To overcome the limitations of the aforementioned domain adaptation methods, we propose EvPlug, a framework that learns a pluggable event and image fusion module to strengthen the capability of an existing RGB-based model with events under various challenging scenes. In terms of event-and-image connection, unlike previous approaches that utilize the visual similarity between RGB images and event streams, we use the event generation model to constrain their relationship. This not only aligns physically with the differential imaging mechanism of the event camera but also, in theory, endows the existing RGB-based model with the same temporal resolution as the event stream while maintaining temporal consistency. In terms of modality fusion, we employ event features to calibrate RGB features at the feature level to ensure event-image quality consistency. On the one hand, this allows the event information to correct the feature space distortion caused when RGB images degrade, achieving modality complementarity. On the other hand, as a feature-level fusion method, it does not change the structure or weights of the RGB-based model, making the learned event and image feature fusion module plug-and-play during evaluation.

Figure 1: Complementary characteristics of images and events. Event cameras record more meaningful signals than RGB cameras in scenes shown in blue boxes, while results in green boxes are the opposite.

The above information constraints and feature fusion can be unified into a single framework. We compare our method with other domain adaptation methods and supervised RGB-based and event-based methods in several downstream tasks, and the experimental results confirm the effectiveness of our approach as shown in Fig. 2. In summary, our contributions are as follows:

• a unified framework that learns a plug-and-play module under the supervision of existing RGB-based models connected by the event generation model;

• a flexible and generalized event and image feature fusion strategy to retain both merits of event and RGB cameras; and

• an image and event distillation benchmark, which includes the results of six methods in three downstream vision tasks.

2 Related Works

2.1 RGB-Event Fusion Methods

Owing to their high dynamic range, high temporal resolution, low redundancy, and low power consumption, event cameras have shown potential in recent research on object detection anton18detection ; zhou2023rgb ; prophesee , semantic segmentation alonso19evsegnet ; gehrig21dsec ; sun22ess , depth estimation daniel21ramnet , optical flow estimation zhu18evflownet ; zhu19evflow , denoising wang20jointfilter , motion segmentation mitrokhin2020learning ; motion-seg-iccv19 , frame interpolation han21evintsr , human/hand pose estimation zou21eventhpe ; victor21eventhands ; jiang23evhandpose , etc. Despite these advantages, event cameras exhibit higher noise levels due to less mature fabrication processes compared with RGB cameras, and they struggle to capture effective information when the camera and scene are relatively static or when the scene's illumination undergoes significant changes. Consequently, some studies attempt to fuse RGB images with event streams to enhance the robustness of vision models. Researchers have investigated various pixel-level alignment and fusion methods for low-level vision tasks, such as video frame interpolation tulyakov22timelens++ , deblurring sun22event-based , and HDR imaging messikommer22multi-bracket . In middle- and high-level vision tasks, hierarchical feature-level fusion methods have been applied to object detection tomy22fusing ; zhou2023rgb , depth estimation daniel21ramnet , etc. EventCap lan20eventcap utilizes events to track estimated human poses from images. However, these RGB and event fusion methods are trained with ground-truth (GT) supervision for a specific task, which limits their generality.

2.2 Domain Adaptation in Event-based Vision

As both event streams and RGB images are visual signals, existing domain adaptation methods primarily exploit the consistency in geometry and texture between the two modalities to transfer knowledge from the RGB domain to event-based vision tasks. In terms of the data utilized, current methods can be divided into paired and unpaired approaches as shown in Fig. 3. Paired methods hu20nga ; deng21learning require pixel-wise alignment to ensure feature-level constraints, but this imposes higher demands on data acquisition. Although the DAVIS camera licht08davis can simultaneously output event streams and active pixel sensor (APS) frames, the APS frames lack color information and are of low quality.

Table 1: Comparison of training and evaluation factors between EvPlug and other methods.

Methods	Training: Model Input	Training: Supervision	Evaluation: Motion Blur	Evaluation: HDR	Evaluation: Illumination Change	Evaluation: Relatively Static	Evaluation: Temporal Resolution
RGB-based	Image	GT			✓	✓	Low
Event-based	Events	GT	✓	✓			High
Paired deng21learning ; hu20nga	Pixel-wise Paired	RGB Model	✓	✓			High
Unpaired wang21dual ; messikommer22evtransfer ; rebecq21e2vid ; gehrig20v2e ; wang21evdistill ; zhu21eventgan	Unpaired	RGB Model	✓	✓			High
EvPlug	Paired	RGB Model	✓	✓	✓	✓	High

The core idea of unpaired methods gehrig20v2e ; rebecq21e2vid ; zhu21eventgan ; wang21evdistill ; wang21dual ; messikommer22evtransfer is to convert between the two modalities to create paired data and then perform domain adaptation using the paired approach. However, since event streams and RGB images contain mismatched information (events lack absolute brightness values and color information, while images lack motion and illumination change information), this conversion process is ill-conditioned. Although EvTransfer messikommer22evtransfer has attempted to decouple the corresponding information, it does not fundamentally address the problem. As summarized in Tab. 1, these existing methods can only distill pure event-based models without multi-modal fusion.

3 Method

3.1 Event-Image Connection Constrained via Event Generation Model

As model architectures he16resnet ; doso21vit and data scales have advanced, researchers have successfully trained high-performing vision models on RGB images radford21clip ; liu21swintransformer . In contrast, event-based vision is at an early stage of its development. As shown in Fig. 3 (a) and (b), the visual similarity between event streams and RGB images encourages researchers to transfer knowledge from RGB images to events, and such a process can be expressed as:

$$E_{[t_{i-1},t_{i}]}\leftrightarrow I_{t_{i}}, \tag{1}$$

where $E_{[t_{i-1},t_{i}]}$ represents the event stream from time $t_{i-1}$ to $t_{i}$ , $I_{t_{i}}$ is the RGB image at time $t_{i}$ , and $\leftrightarrow$ denotes an information connection between images and events, such as a feature map constraint wanglin22kd_survey .

However, such a straightforward mapping does not work well, because events and RGB images sense motion, texture, and color differently. This difference stems from the imaging mechanism of event cameras, i.e., the event generation model hidalgo22edso .

Event cameras licht08davis generate asynchronous event streams by measuring per-pixel brightness changes. An event $e_{i}=(x_{i},y_{i},t_{i},p_{i})$ occurs at pixel $(x_{i},y_{i})$ at time $t_{i}$ when the logarithmic brightness change reaches the threshold:

$$\log I(x_{i},y_{i},t_{i})-\log I(x_{i},y_{i},t_{p})=p_{i}\cdot C, \tag{2}$$

where $t_{p}$ is the timestamp of the last event at pixel $(x_{i},y_{i})$ , $p_{i}\in\{-1,1\}$ is the polarity, and $C$ is the threshold. Previous research on video frame interpolation and deblurring tulyakov22timelens++ ; han21evintsr ; teng22nest has shown that, based on an image at time $t_{i}$ and the event stream following $t_{i}$ , a frame at any future time can be reconstructed. When images and events are pixel-wise aligned, this relation is denoted as:

$$I(x_{i+1},y_{i+1},t_{i+1})=I(x_{i},y_{i},t_{i})\cdot\exp\left(\sum_{e_{j}\in E_{[t_{i},t_{i+1}]}}{p_{j}\cdot C}\right). \tag{3}$$

Hence, we propose the following information constraint connecting event streams and RGB images:

$$I_{t_{i}}+E_{[t_{i},t_{i+1}]}\leftrightarrow I_{t_{i+1}}, \tag{4}$$

where " $+$ " refers to a multi-modal data fusion process, as shown in Fig. 3 (c).

In addition to expressing the physical constraints between event cameras and RGB cameras according to their image formation model, this constraint also implies that it is possible to predict the results at any future moment using a single RGB image and subsequent event streams.
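To make Eqs. (2) and (3) concrete, the following minimal sketch simulates events for a single pixel and then reconstructs a later intensity from the starting intensity and the accumulated polarities. It is an idealized, noise-free illustration: the function names, the sample intensities, and the threshold value `C = 0.2` are assumptions chosen for the example, not values from the paper.

```python
import math

def generate_events(log_intensity, times, threshold):
    """Emit (t, polarity) events for one pixel, following Eq. (2)."""
    events = []
    ref = log_intensity[0]  # log brightness at the last event (or at the start)
    for t, value in zip(times[1:], log_intensity[1:]):
        # A large change between two samples may trigger several events.
        while abs(value - ref) >= threshold:
            polarity = 1 if value > ref else -1
            ref += polarity * threshold
            events.append((t, polarity))
    return events

def reconstruct(intensity_start, events, threshold):
    """Eq. (3): I(t_{i+1}) = I(t_i) * exp(sum_j p_j * C)."""
    return intensity_start * math.exp(sum(p for _, p in events) * threshold)

C = 0.2  # assumed contrast threshold
times = [0.0, 1.0, 2.0, 3.0, 4.0]
intensity = [1.0, 1.5, 2.5, 2.0, 0.9]  # hypothetical brightness samples
log_i = [math.log(v) for v in intensity]

events = generate_events(log_i, times, C)
estimate = reconstruct(intensity[0], events, C)
```

Because events quantize log-brightness changes in steps of `C`, the reconstruction in this idealized setting matches the true final intensity only up to a log-domain error smaller than `C`.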

3.2 Feature-level Multi-modal Fusion

When fusing multi-modal data, the feature-level constraint allows effective information fusion for complementary use of both modalities of data daniel21ramnet ; zhou2023rgb ; wanglin22kd_survey . This inspires us to extend the information constraint in Eq. (4) to feature level as illustrated in Fig. 5:

$$f_{\text{E-Former}}(f_{\text{ImEncoder}}(I_{t_{i}}),f_{\text{EvEncoder}}(E_{[t_{i},t_{i+1}]}))\leftrightarrow f_{\text{ImEncoder}}(I_{t_{i+1}}), \tag{5}$$

where $f_{\text{E-Former}}$ is the fusion module, and $f_{\text{ImEncoder}}$ and $f_{\text{EvEncoder}}$ are the image and event encoders, respectively.

We design $f_{\text{E-Former}}$ based on the transformer decoder layer vaswani17transformer , which can perform feature association and does not require strict pixel-wise alignment between event streams and RGB images, alleviating the burden of data acquisition. Within the decoder layers, event features serve as the input for cross-attention keys and values, while the image features are updated through multiple decoder layers to obtain the fused features. In terms of information constraints, the feature-level constraint enables $f_{\text{E-Former}}$ to learn rich information from RGB-based models wanglin22kd_survey . Besides, this constraint does not change the structure and weights of the RGB-based model, and $f_{\text{EvEncoder}}$ and $f_{\text{E-Former}}$ can be plug-and-play for evaluation.
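The role of cross-attention described above can be illustrated with a deliberately stripped-down sketch in plain Python. Image tokens act as queries and event tokens as keys and values, so the two token sets may differ in number and need no pixel-wise correspondence. This is not the authors' implementation: a real transformer decoder layer also has learned projections, self-attention, a feed-forward block, normalization, and multiple heads, all of which are omitted here, and the token values are made up.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_tokens, event_tokens):
    """Image tokens query event tokens (keys and values); the result is added residually."""
    dim = len(image_tokens[0])
    fused = []
    for query in image_tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in event_tokens]
        weights = softmax(scores)
        attended = [sum(w * value[d] for w, value in zip(weights, event_tokens))
                    for d in range(dim)]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused

def e_former_sketch(image_tokens, event_tokens, num_layers=3):
    """Update image features through several layers; event features stay fixed."""
    for _ in range(num_layers):
        image_tokens = cross_attention(image_tokens, event_tokens)
    return image_tokens

image_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]  # 4 hypothetical image tokens
event_tokens = [[0.2, 0.1], [0.0, 0.3], [0.4, 0.4]]              # 3 hypothetical event tokens
fused = e_former_sketch(image_tokens, event_tokens)
```

The output keeps the shape of the image features, which is what lets the fused features be passed to the unchanged decoder of the RGB-based model.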

However, using the constraint in Eq. (5) to train $f_{\text{E-Former}}$ faces two challenges: 1) it cannot utilize event streams $E_{[t_{i-1},t_{i}]}$ for robust inference on image $I_{t_{i}}$ ; 2) it cannot ensure temporal consistency when performing model inference at high temporal resolution using events between adjacent RGB images. To tackle these challenges, we propose two constraints that enable the fusion module $f_{\text{E-Former}}$ to learn event-image quality consistency and temporal consistency.

Event-Image Quality Consistency

Existing RGB-based vision models doso21vit ; he16resnet usually extract high-quality features from normal-quality RGB images. Due to the limited imaging characteristics of RGB cameras in HDR or fast motion scenes, the resulting images may suffer from degradation, thus causing distortion in the feature space. Thus, one goal of the fusion module $f_{\text{E-Former}}$ is to use event streams to rectify the feature space distortion caused by RGB image degradation.

For training, it is hard to obtain paired normal and degraded RGB sequences of the same view in the real world. To tackle this issue, we apply photometric degradation to normal-quality RGB sequences to obtain synthetic paired degraded images. As shown in Fig. 5, this can be formulated as:

$$f_{\text{E-Former}}(f_{\text{ImEncoder}}(\delta(I_{t_{i}})),f_{\text{EvEncoder}}(E_{[t_{i-1},t_{i}]}))\leftrightarrow f_{\text{ImEncoder}}(I_{t_{i}}), \tag{6}$$

where $\delta(\cdot)$ is the image degradation function. For synthetic degradation, we introduce two types of operations: strong/low light, simulated with brightness and contrast augmentation in PyTorch paszke19pytorch , and motion blur, simulated with pre-computed optical flow. We compute the optical flow following farneback03flow in OpenCV, warp the original image with it to interpolate frames, and average these frames to simulate motion blur.
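As a rough illustration of the two kinds of degradation $\delta(\cdot)$, the sketch below works on a tiny grayscale image stored as nested lists with values in [0, 1]. It is a simplified stand-in, not the paper's pipeline: brightness and contrast are applied with a hand-written formula instead of PyTorch augmentation, and the optical-flow warp is replaced by an assumed constant horizontal shift. All function names and parameter values are hypothetical.

```python
def adjust_brightness_contrast(image, brightness, contrast):
    """Scale pixels around the mean (contrast), then scale overall (brightness); clamp to [0, 1]."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[min(1.0, max(0.0, brightness * (mean + contrast * (p - mean))))
             for p in row] for row in image]

def shift_right(image, offset):
    """Translate each row by `offset` pixels, repeating the left border pixel."""
    return [[row[max(0, x - offset)] for x in range(len(row))] for row in image]

def motion_blur(image, max_offset):
    """Average copies of the image displaced by 0..max_offset pixels."""
    frames = [shift_right(image, k) for k in range(max_offset + 1)]
    height, width = len(image), len(image[0])
    return [[sum(f[y][x] for f in frames) / len(frames) for x in range(width)]
            for y in range(height)]

image = [[0.1, 0.2, 0.8, 0.2, 0.1],
         [0.1, 0.2, 0.8, 0.2, 0.1]]

overexposed = adjust_brightness_contrast(image, brightness=3.0, contrast=1.0)
blurred = motion_blur(image, max_offset=2)
```

In the overexposed copy the bright column saturates at 1.0, and in the blurred copy its energy is spread over neighboring pixels; both remove detail that the paired event stream is expected to help recover.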

Event-Image Temporal Consistency

Directly using the constraint in Eq. (5) for inference with high temporal resolution cannot assure the temporal consistency of inference results. After obtaining fusion features at time $t_{i}$ as:

$$F_{t_{i}}=f_{\text{E-Former}}(f_{\text{ImEncoder}}(\delta(I_{t_{i}})),f_{\text{EvEncoder}}(E_{[t_{i-1},t_{i}]})), \tag{7}$$

we use $f_{\text{E-Former}}$ to iteratively update the fusion feature with events to assure temporal consistency:

$$F_{t_{j}}=f_{\text{E-Former}}(F_{t_{j-1}},f_{\text{EvEncoder}}(E_{[t_{j-1},t_{j}]})),\quad j=i+1,i+2,\ldots,i+K, \tag{8}$$

where $F_{t_{j}}$ is the fusion feature at time $t_{j}$ , and $K$ is the number of fusion steps.

Thus we get sequential constraints to enable $f_{\text{E-Former}}$ to learn temporal consistency:

$$F_{t_{j}}\leftrightarrow f_{\text{ImEncoder}}(I_{t_{j}}),\quad j=i+1,i+2,\ldots,i+K. \tag{9}$$

By integrating constraints from Eq. (6) and Eq. (9), $f_{\text{E-Former}}$ can utilize events $E_{[t_{i-1},t_{i}]}$ to rectify the degradation of the image $I_{t_{i}}$ and assure the temporal consistency of evaluation results with subsequent events in an iterative way.
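The iterative update in Eqs. (7) and (8) amounts to a simple recurrence, sketched below. The `fuse` argument stands in for $f_{\text{E-Former}}$ applied to already-encoded features; the additive `toy_fuse` and the feature values are purely hypothetical placeholders so that the loop can be run end to end.

```python
def rollout(image_feature, event_features, fuse):
    """Fuse once with the events before t_i, then update step by step.

    event_features[0] encodes E[t_{i-1}, t_i]; the remaining K entries
    encode the consecutive slices E[t_{j-1}, t_j] for j = i+1, ..., i+K.
    """
    fused = fuse(image_feature, event_features[0])  # F_{t_i}, Eq. (7)
    trajectory = [fused]
    for event_feature in event_features[1:]:
        fused = fuse(fused, event_feature)          # F_{t_j} from F_{t_{j-1}}, Eq. (8)
        trajectory.append(fused)
    return trajectory

def toy_fuse(feature, event_feature):
    # Hypothetical stand-in for the fusion module: add the event update to the feature.
    return [f + e for f, e in zip(feature, event_feature)]

image_feature = [1.0, 2.0]
event_features = [[0.0, 0.0], [0.5, -0.5], [0.25, 0.25]]  # K = 2 subsequent slices
trajectory = rollout(image_feature, event_features, toy_fuse)
```

During training, each entry of `trajectory` would be compared with the image feature at the matching timestamp, which is the sequential constraint of Eq. (9).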

3.3 Training and Evaluation

Training

Our training procedure is illustrated in Fig. 6 (left). As mentioned in Eq. (5), feature $F_{t_{j}}$ should be consistent with image feature $f_{\text{ImEncoder}}(I_{t_{j}})$ . In our work, we use two losses to measure the constraints: task-level loss ${\mathcal{L}}_{\text{task}}$ and feature-level losses. Task-level loss is defined by the task-specific training loss under supervision from outputs of RGB-based model on original RGB image inputs. Following hu20nga , feature-level losses consist of feature reconstruction loss ${\mathcal{L}}_{\text{recon}}$ and feature style loss ${\mathcal{L}}_{\text{style}}$ :

$${\mathcal{L}}_{\text{recon}}(t_{j})=\text{MSE}(F_{t_{j}},f_{\text{ImEncoder}}(I_{t_{j}})), \tag{10}$$

$${\mathcal{L}}_{\text{style}}(t_{j})=\text{MSE}(\text{Gram}(F_{t_{j}}),\text{Gram}(f_{\text{ImEncoder}}(I_{t_{j}}))), \tag{11}$$

where MSE is mean squared error, Gram is the Gram matrix gatys16imagestyle .

Thus the final loss becomes:

$${\mathcal{L}}_{\text{all}}=\sum_{j}\left(\lambda_{\text{task}}{\mathcal{L}}_{\text{task}}(t_{j})+\lambda_{\text{recon}}{\mathcal{L}}_{\text{recon}}(t_{j})+\lambda_{\text{style}}{\mathcal{L}}_{\text{style}}(t_{j})\right),\quad j=i,i+1,\ldots,i+K, \tag{12}$$

where $\lambda_{\text{task}}$ , $\lambda_{\text{recon}}$ , and $\lambda_{\text{style}}$ are loss weights.
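A small numerical sketch of Eqs. (10) to (12) follows. Feature maps are represented as channels by flattened spatial positions, and the Gram matrix is normalized by the number of positions; that normalization, the example features, the placeholder task loss of 0.5, and the weight `w_task=1.0` are all assumptions made for illustration.

```python
def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)

def gram(feature):
    """Channel-by-channel inner products of a (channels x positions) feature map."""
    positions = len(feature[0])
    return [[sum(x * y for x, y in zip(row_a, row_b)) / positions
             for row_b in feature] for row_a in feature]

def feature_losses(fused, target):
    """Return (reconstruction loss, style loss) as in Eqs. (10) and (11)."""
    return mse(fused, target), mse(gram(fused), gram(target))

def total_loss(task_losses, fused_seq, target_seq, w_task, w_recon, w_style):
    """Weighted sum over time steps as in Eq. (12)."""
    total = 0.0
    for task, fused, target in zip(task_losses, fused_seq, target_seq):
        recon, style = feature_losses(fused, target)
        total += w_task * task + w_recon * recon + w_style * style
    return total

target = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical image-encoder feature
fused = [[1.0, 0.0], [0.0, 0.0]]   # hypothetical fused feature
recon, style = feature_losses(fused, target)
loss = total_loss([0.5], [fused], [target], w_task=1.0, w_recon=10.0, w_style=10.0)
```

The reconstruction term penalizes element-wise differences, while the style term compares channel correlations and is therefore insensitive to where in the map a feature response occurs.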

Evaluation

As shown in Fig. 6 (right), during evaluation, robust features are first obtained by fusing the image $I_{t_{i}}$ at time $t_{i}$ with the event stream $E_{[t_{i-1},t_{i}]}$ ; these features are then decoded to obtain robust inference results. Next, the incoming high-temporal-resolution event stream is divided into $K$ event slices of equal duration (assuming the interval between neighboring RGB images is $\Delta$ , each slice spans $\Delta/K$ ). These slices are fed sequentially through the fusion module, which iteratively updates the fused features, and the output at each step is obtained through the feature decoder module. Through this iterative process, EvPlug achieves $K$ times the temporal resolution of the original RGB-based method. Since $K$ can be adjusted for different applications, EvPlug allows for flexible high-temporal-resolution inference.
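Dividing the event stream between two neighboring images into $K$ equal-duration slices can be sketched as follows. Events are assumed to be `(x, y, t, p)` tuples with timestamps in seconds; the function name, the half-open interval convention, and the sample events are illustrative assumptions.

```python
def slice_events(events, t_start, t_end, num_slices):
    """Split events (x, y, t, p) with t in [t_start, t_end) into equal-duration slices."""
    duration = (t_end - t_start) / num_slices
    slices = [[] for _ in range(num_slices)]
    for event in events:
        t = event[2]
        if t_start <= t < t_end:
            # Guard against floating-point round-up at the right edge.
            index = min(int((t - t_start) / duration), num_slices - 1)
            slices[index].append(event)
    return slices

# Hypothetical events between two frames that are 50 ms apart, with K = 5.
events = [(3, 4, 0.001, 1), (3, 5, 0.012, -1), (7, 1, 0.024, 1), (2, 2, 0.049, 1)]
slices = slice_events(events, t_start=0.0, t_end=0.05, num_slices=5)
```

A slice may be empty when nothing moves during its interval, which corresponds to the relatively static case where events alone carry little information.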

4 Experiments

Baselines

To compare with domain adaptation methods, we choose NGA hu20nga as the pixel-aligned paired method and, among unpaired methods, E2VID rebecq21e2vid and EvTransfer messikommer22evtransfer , both of which have released open-source code. V2E gehrig20v2e works as an event simulator and inevitably faces a sim-to-real gap compared with the paired method NGA hu20nga , so we do not evaluate it here. EvDistill wang21evdistill and DTL wang21dual are also unpaired methods, but no public training code is available to apply them to our datasets. To validate the benefits of fusing the two modalities, we also compare against supervised RGB-based methods for each task.

Downstream Tasks

Middle- and high-level vision datasets that contain both RGB images and event streams are generally scarce. Considering data availability, we evaluate the performance of EvPlug on three downstream vision tasks: object detection on DSEC-MOD zhou2023rgb , semantic segmentation on DSEC-Semantic sun22ess , and 3D hand pose estimation on EvRealHands jiang23evhandpose . We encourage readers to view the high-temporal-resolution results in the supplementary video.

Implementation Details of EvPlug

EvPlug has a simple and generalized architecture. In all tasks, event streams are represented as voxels zhu19evflow ; victor21eventhands , and the event encoder $f_{\text{EvEncoder}}$ shares the same architecture with the image encoder $f_{\text{ImEncoder}}$ except for the channel dimension of the first CNN layer. The fusion module $f_{\text{E-Former}}$ is composed of $N$ sequential transformer decoder layers vaswani17transformer ( $N$ is 3 in our experiments). In training, sequential fusion step $K$ is set as 2, $\lambda_{\text{recon}}$ as 10, $\lambda_{\text{style}}$ as 10. We use Adam adam15 with learning rate 0.001 to optimize the $f_{\text{EvEncoder}}$ and $f_{\text{E-Former}}$ . Apart from these settings, data processing and technical details are consistent with the RGB-based methods. In our experiments, EvPlug is trained on a single TITAN RTX in two days.
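For readers unfamiliar with voxel representations of events, the sketch below shows one common formulation from the event-vision literature, in which each event's polarity is distributed over the two nearest temporal bins with linear weights. Whether EvPlug uses exactly this weighting is not stated in this section, so treat the details, the function name, and the sample events as assumptions. The number of bins becomes the input channel count, which is consistent with the event encoder differing from the image encoder only in its first layer.

```python
def events_to_voxel(events, num_bins, height, width):
    """Accumulate polarities into num_bins temporal channels with linear weights in time.

    `events` is a time-sorted list of (x, y, t, p) tuples.
    """
    voxel = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    t_first, t_last = events[0][2], events[-1][2]
    span = max(t_last - t_first, 1e-9)  # avoid division by zero for a single timestamp
    for x, y, t, p in events:
        t_norm = (num_bins - 1) * (t - t_first) / span  # position on the bin axis
        lower = int(t_norm)
        for b in (lower, lower + 1):
            if b < num_bins:
                weight = max(0.0, 1.0 - abs(b - t_norm))
                voxel[b][y][x] += p * weight
    return voxel

events = [(0, 0, 0.00, 1), (1, 0, 0.25, -1), (1, 1, 1.00, 1)]  # hypothetical events
voxel = events_to_voxel(events, num_bins=3, height=2, width=2)
```

Here the second event falls halfway between bins 0 and 1, so its negative polarity is split evenly between them.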

4.1 Object Detection

Experimental Setup

We validate EvPlug on the DSEC-MOD dataset zhou2023rgb for moving object detection, with 10495 frames for training and 2819 frames for evaluation. As the RGB-based backbone for the knowledge distillation methods and the RGB-based method, we use DETR carion20detr , a widely used object detection backbone that offers a trade-off between accuracy and efficiency. We take the DETR-R50 model pretrained on COCO lin14coco and finetune it on DSEC-MOD zhou2023rgb for 10 epochs with a batch size of 64 to obtain the RGB-based model. For the event-based method, we represent events as voxels following zhu19evflow and use the same backbone as DETR-R50, which requires training for 30 epochs. For the RGB-event fusion method, we use the released model of RENet zhou2023rgb for evaluation. To evaluate all methods under the same standard, we use the detection metrics of COCO lin14coco (reported as %, $\uparrow$ means higher is better, while $\downarrow$ is the opposite).

Results

As the quantitative results in Tab. 2 show, EvPlug outperforms the domain adaptation methods hu20nga ; rebecq21e2vid ; messikommer22evtransfer by more than 20 $\text{AP}_{\text{50}}$ , which can be attributed to its ability to fuse event streams with RGB images rather than relying solely on events for inference. Compared with RENet zhou2023rgb , which designs a specific architecture for object detection and requires training from scratch, EvPlug can utilize models pretrained on a large-scale image dataset lin14coco , resulting in an 18.2 $\text{AP}_{\text{50}}$ improvement. EvPlug brings only a 2.0 $\text{AP}_{\text{50}}$ improvement over the RGB-based model, because DSEC-MOD zhou2023rgb contains little data from challenging scenes (HDR, fast motion) where events have better imaging quality than RGB images. Qualitative results in Fig. 7 show that EvPlug makes the RGB-based model sensitive to small moving objects in the distance, which is due to the rich motion information contained in event streams.

Table 2: Quantitative results on object detection and semantic segmentation.

Methods	Input	Test Time	Detection AP $\uparrow$	Detection $\text{AP}_{\text{50}}$ $\uparrow$	Detection $\text{AP}_{\text{75}}$ $\uparrow$	Segmentation mIoU $\uparrow$	Segmentation Acc $\uparrow$
Event-based	$E_{[t_{i-1},t_{i}]}$	$t_{i}$	18.5	31.4	14.1	51.2	87.7
NGA hu20nga	$E_{[t_{i-1},t_{i}]}$	$t_{i}$	19.0	32.2	15.4	50.8	87.4
E2VID rebecq21e2vid	$E_{[t_{i-1},t_{i}]}$	$t_{i}$	15.3	24.5	13.6	38.7	74.6
EvTransfer messikommer22evtransfer	$E_{[t_{i-1},t_{i}]}$	$t_{i}$	17.6	27.4	14.4	49.9	87.6
ESS sun22ess	$I_{t_{i}}$ , $E_{[t_{i-1},t_{i}]}$	$t_{i}$	–	–	–	53.3	89.4
RENet zhou2023rgb	$I_{t_{i}}$ , $E_{[t_{i-1},t_{i}]}$	$t_{i}$	18.5	36.9	17.2	–	–
RGB-based	$I_{t_{i}}$	$t_{i}$	29.6	53.1	28.8	63.8	93.2
EvPlug	$I_{t_{i}}$ , $E_{[t_{i-1},t_{i}]}$	$t_{i}$	30.1 (0.5 $\uparrow$ )	55.1 (2.0 $\uparrow$ )	30.1 (1.3 $\uparrow$ )	63.2	93.1
EvPlug (w/o $\delta$ )	$I_{t_{i}}$ , $E_{[t_{i-1},t_{i}]}$	$t_{i}$	29.6	54.3	28.8	63.7	93.2
EvPlug	$I_{t_{i}}$ , $E_{[t_{i-1},t_{i+2}]}$	$t_{i+2}$	19.0	40.6	16.4	53.5	90.7
EvPlug (w/o iter)	$I_{t_{i}}$ , $E_{[t_{i-1},t_{i+2}]}$	$t_{i+2}$	18.7	40.2	16.1	52.4	90.1

4.2 Semantic Segmentation

Experimental Setup

We validate EvPlug on DSEC-Semantic sun22ess (8082 frames for training, 2809 frames for evaluation) for semantic segmentation. For the RGB-based model, we use the extended version of DETR carion20detr for panoptic segmentation Kirillov19panoptic and finetune it on Cityscapes Cordts2016Cityscapes . For the event-based method, we use supervised Ev-SegNet alonso19evsegnet for comparison. In addition to the domain adaptation methods mentioned above, we compare with ESS sun22ess , which proposed an unsupervised domain adaptation method. We use the same evaluation metrics (mIoU (%) and Accuracy (Acc., %)) as ESS sun22ess for a fair comparison.
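For reference, mIoU and pixel accuracy can be computed from per-class counts as in the sketch below. This is a generic illustration on flat label lists: details such as ignore labels or whether counts are accumulated over the whole dataset before averaging follow the benchmark's own protocol and are not reproduced here. The labels are made up.

```python
def segmentation_metrics(predicted, target, num_classes):
    """Return (mIoU, pixel accuracy) in percent for flat lists of class labels."""
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for p, t in zip(predicted, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection[c] / union)
    miou = 100.0 * sum(ious) / len(ious)
    accuracy = 100.0 * sum(intersection) / len(target)
    return miou, accuracy

target = [0, 0, 1, 1, 2, 2]     # hypothetical ground-truth labels
predicted = [0, 1, 1, 1, 2, 0]  # hypothetical predictions
miou, accuracy = segmentation_metrics(predicted, target, num_classes=3)
```

In this example the per-class IoUs are 1/3, 2/3, and 1/2, so mIoU is 50 % while pixel accuracy is about 66.7 %, showing how the two metrics can diverge.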

Results

Results in Tab. 2 and Fig. 8 show that EvPlug outperforms the domain adaptation methods hu20nga ; rebecq21e2vid ; sun22ess ; messikommer22evtransfer thanks to the fusion of both modalities. However, EvPlug achieves slightly lower accuracy than the RGB-based method. The reason is that color and texture play a significant role in semantic segmentation, and there are few degraded RGB images in DSEC-Semantic sun22ess . As a result, the disruption that event features cause to the RGB feature space outweighs the gain that events bring to RGB images. Results in Tab. 2 (methods: EvPlug, test time: $t_{i+2}$ ) show that when performing semantic segmentation with high temporal resolution, EvPlug outperforms the event-based method alonso19evsegnet by 2.3 % mIoU (53.5 vs. 51.2). Due to the complementary nature of the RGB image at time $t_{i}$ to the event stream $E_{[t_{i},t_{i+2}]}$ , EvPlug achieves higher accuracy than methods based solely on event streams. This implies that EvPlug, at the cost of a small loss of accuracy on RGB images, equips the RGB-based model with high-temporal-resolution inference.

4.3 Hand Pose Estimation

Table 3: Quantitative results on 3D hand pose estimation.

Methods	Input	Test Time	Normal MPJPE $\downarrow$	Normal AUC $\uparrow$	Strong Light MPJPE $\downarrow$	Strong Light AUC $\uparrow$	Flash MPJPE $\downarrow$	Flash AUC $\uparrow$
Event-based victor21eventhands	$E_{[t_{i-1},t_{i}]}$	$t_{i}$	29.16	77.7	32.48	71.7	52.92	60.9
NGA hu20nga	$E_{[t_{i-1},t_{i}]}$	$t_{i}$	20.48	79.6	28.37	71.8	41.06	61.4
E2VID rebecq21e2vid	$E_{[t_{i-1},t_{i}]}$	$t_{i}$	29.81	73.6	30.77	70.3	42.35	61.0
EvTransfer messikommer22evtransfer	$E_{[t_{i-1},t_{i}]}$	$t_{i}$	22.67	77.5	29.39	70.8	41.32	60.6
RGB-based cho22fastmetro	$I_{t_{i}}$	$t_{i}$	14.31	85.6	42.43	59.1	26.31	73.9
EvPlug	$I_{t_{i}}$ , $E_{[t_{i-1},t_{i}]}$	$t_{i}$	13.78 (0.53 $\downarrow$ )	86.2 (0.6 $\uparrow$ )	28.16 (14.27 $\downarrow$ )	71.9 (12.8 $\uparrow$ )	26.00 (0.31 $\downarrow$ )	74.3 (0.4 $\uparrow$ )
EvPlug (w/o $\delta$ )	$I_{t_{i}}$ , $E_{[t_{i-1},t_{i}]}$	$t_{i}$	13.78	86.2	46.07	56.0	26.85	73.4
EvPlug	$I_{t_{i}}$ , $E_{[t_{i-1},t_{i+2}]}$	$t_{i+2}$	18.28	81.8	30.10	70.8	35.34	67.3
EvPlug (w/o iter)	$I_{t_{i}}$ , $E_{[t_{i-1},t_{i+2}]}$	$t_{i+2}$	18.39	81.6	30.25	70.0	36.12	65.4

Experimental Setup

We evaluate the performance of EvPlug on hand pose estimation using the EvRealHands dataset jiang23evhandpose , a multi-modal hand dataset consisting of 4452 seconds of event streams from DAVIS346 (346 $\times$ 260 pixels) and corresponding RGB sequences from FLIR cameras (2660 $\times$ 2300 pixels). We perform comparisons on sequences under different illumination and hand movements: normal, strong light, flash, and fast motion. For the RGB-based baseline and the domain adaptation backbones, we select FastMETRO cho22fastmetro , which achieves high accuracy at low computational cost. For efficiency, we use the lightweight version (a ResNet34 he16resnet backbone and 4 transformer vaswani17transformer layers). For event-based methods, we use EventHands victor21eventhands . We follow the implementation details of EventHands victor21eventhands and FastMETRO cho22fastmetro to pre-process event streams and RGB images. In our setup, RGB images are cropped to 192 $\times$ 192 and event streams to 128 $\times$ 128. We use the common MPJPE (root-aligned mean per joint position error in Euclidean distance (mm)) and AUC (reported as %) metrics for quantitative evaluation.
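The root-aligned MPJPE metric can be sketched in a few lines. Each pose is translated so that its root joint sits at the origin before per-joint Euclidean distances are averaged. Using joint index 0 as the root and the three-joint toy poses below are assumptions made for illustration only.

```python
import math

def mpjpe_root_aligned(predicted, ground_truth, root_index=0):
    """Mean Euclidean distance over joints after subtracting each pose's root joint."""
    def align(joints):
        root = joints[root_index]
        return [tuple(c - r for c, r in zip(joint, root)) for joint in joints]
    pred, gt = align(predicted), align(ground_truth)
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

# Hypothetical 3-joint poses in millimetres; the prediction is offset by
# (5, 5, 5) overall and has extra errors of 3 mm and 4 mm on two joints.
ground_truth = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 20.0, 0.0)]
predicted = [(5.0, 5.0, 5.0), (15.0, 5.0, 8.0), (5.0, 29.0, 5.0)]
error_mm = mpjpe_root_aligned(predicted, ground_truth)
```

Root alignment removes the global translation, so only the 0 mm, 3 mm, and 4 mm per-joint errors remain and the metric evaluates to 7/3 mm.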

Results

As the quantitative results in Tab. 3 show, EvPlug outperforms the domain adaptation methods hu20nga ; rebecq21e2vid ; messikommer22evtransfer , with MPJPE more than 6 mm lower in normal scenes and 15 mm lower in flash scenes. This is because EvPlug is an event and image fusion method rather than a pure event-based domain adaptation method. The plug-and-play module learned by our method significantly improves the robustness of the RGB-based model in strong light scenes, resulting in a 14.27 mm lower MPJPE in such conditions. This demonstrates that our method can integrate the HDR robustness of event cameras into the RGB-based model. Qualitative results in Fig. 9 show that the domain adaptation methods hu20nga ; rebecq21e2vid ; messikommer22evtransfer cannot handle scenes in which hands are relatively static or the illumination changes. Moreover, RGB-based models cannot achieve robust hand pose estimation when images degrade, for example through overexposure or motion blur. EvPlug performs robust hand pose estimation in these challenging scenes, which derives from the complementary use of the two modalities.

4.4 Ablation Study

In the ablation study, we demonstrate the effectiveness of the two constraints for event-image quality consistency and temporal consistency.

Event-Image Quality Consistency

To validate the effectiveness of the quality constraint, we compare the performance of EvPlug at time $t_{i}$ with and without incorporating $\delta$ in Eq. (6) (denoted as "w/o $\delta$ ") in the training procedure. Quantitative results in Tab. 2 and Tab. 3 show that this constraint improves the performance of EvPlug on object detection (DSEC-MOD zhou2023rgb ) and 3D hand pose estimation (EvRealHands jiang23evhandpose ), especially under strong light (17.91 mm lower MPJPE, 28.16 vs. 46.07). However, it results in a small decrease in metrics on semantic segmentation, which mainly results from the absence of degraded images in the evaluation data. Qualitative results in Fig. 7 and Fig. 9 show that this constraint improves the performance of EvPlug in HDR and fast motion scenes.

Event-Image Temporal Consistency

A direct approach (denoted as "w/o iter") to validate the effectiveness of the temporal consistency constraint is to integrate the features $F_{t_{i}}$ at time $t_{i}$ with events during evaluation:

$$F_{t_{j}}=f_{\text{E-Former}}\left(F_{t_{i}},f_{\text{EvEncoder}}\left(E_{[t_{j-1},t_{j}]}\right)\right),\quad j=i+1,i+2,\ldots,i+K, \tag{13}$$

which differs from Eq. (8). We therefore compare the results at time $t_{i+2}$, given the image $I_{t_{i}}$ and the event stream $E_{[t_{i-1},t_{i}]}$, to validate the effectiveness of event-image temporal consistency. Quantitative results in Tab. 2 and Tab. 3 show that this constraint contributes a 0.3 gain in AP on object detection, a 1.1% gain in mIoU on semantic segmentation, and a 0.1 mm MPJPE improvement on hand pose estimation.
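The difference between the "w/o iter" propagation of Eq. (13) and an iterative propagation can be sketched in a few lines of Python. This is only an illustrative sketch: Eq. (8) is not reproduced in this section, so the iterative variant below assumes that each new event segment is fused with the previously fused features $F_{t_{j-1}}$. The function `toy_fuse` is a hypothetical stand-in for $f_{\text{E-Former}}$, and the event features are plain lists rather than outputs of $f_{\text{EvEncoder}}$.

```python
from typing import Callable, List, Sequence

Feature = List[float]
FuseFn = Callable[[Feature, Feature], Feature]


def propagate_without_iter(
    f_ti: Feature, event_feats: Sequence[Feature], fuse: FuseFn
) -> List[Feature]:
    """Eq. (13): every F_{t_j} is fused directly from the image feature F_{t_i}."""
    return [fuse(f_ti, e) for e in event_feats]


def propagate_iteratively(
    f_ti: Feature, event_feats: Sequence[Feature], fuse: FuseFn
) -> List[Feature]:
    """Assumed iterative form: F_{t_j} is fused from the previous F_{t_{j-1}}."""
    outputs: List[Feature] = []
    current = f_ti
    for e in event_feats:
        current = fuse(current, e)  # carry the fused state forward in time
        outputs.append(current)
    return outputs


def toy_fuse(feature: Feature, event_feature: Feature) -> Feature:
    """Hypothetical stand-in for f_E-Former: element-wise addition."""
    return [a + b for a, b in zip(feature, event_feature)]


# Toy event features for segments [t_i, t_{i+1}], [t_{i+1}, t_{i+2}], ...
events = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
image_feature = [0.0, 0.0]
print(propagate_without_iter(image_feature, events, toy_fuse))
print(propagate_iteratively(image_feature, events, toy_fuse))
```

In the non-iterative variant, the output at $t_{j}$ only sees the events of the last segment, whereas the iterative variant accumulates information from all segments since $t_{i}$; the two coincide only at $j=i+1$.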

5 Conclusion

In this paper, we propose a unified framework called EvPlug that learns a plug-and-play event and image fusion module under supervision from the RGB-based model, connected by the event generation model. Our method only requires unlabeled, non-strictly pixel-aligned image-event data pairs for training. Working as a plug-in for the RGB-based model, EvPlug endows the RGB-based model with robustness to HDR imaging and enables high-temporal-resolution inference. We demonstrate the effectiveness of EvPlug on three vision tasks: object detection, semantic segmentation, and hand pose estimation.

Limitations

In our fusion model $f_{\text{E-Former}}$ , when the scale of the feature map is large, the computational cost of the transformer-based fusion module can be quite high. Additionally, although $f_{\text{E-Former}}$ successfully bridges the temporal consistency between neighboring images and events, it cannot model long-term consistency across multiple images. Therefore, designing a fusion framework that can model long-term temporal consistency with low computational cost is a promising research direction.
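To give a rough sense of why large feature maps are costly, the sketch below counts the entries of a dense attention score matrix as the feature map grows. It assumes, purely for illustration, that every spatial location of an $H\times W$ feature map becomes one token and that attention is dense over all token pairs; the actual token layout of $f_{\text{E-Former}}$ may differ, and the function name is hypothetical.

```python
def attention_score_entries(height: int, width: int) -> int:
    """Number of pairwise scores for dense attention over height * width tokens."""
    if height <= 0 or width <= 0:
        raise ValueError("feature map dimensions must be positive")
    tokens = height * width  # assumption: one token per spatial location
    return tokens * tokens   # dense attention scores every token pair


for side in (16, 32, 64):
    print(side, attention_score_entries(side, side))
```

Under these assumptions, doubling the side length of the feature map multiplies the number of attention scores by 16, which is the kind of growth that makes large feature maps expensive.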

References

(1) Iñigo Alonso and Ana C. Murillo. Ev-SegNet: Semantic segmentation for event-based cameras. In CVPR Workshops, 2019.
(2) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
(3) Junhyeong Cho, Kim Youwang, and Tae-Hyun Oh. Cross-attention of disentangled modalities for 3D human mesh recovery with transformers. In ECCV, 2022.
(4) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
(5) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
(6) Yongjian Deng, Hao Chen, Huiying Chen, and Youfu Li. Learning from images: A distillation learning framework for event cameras. IEEE TIP, 2021.
(7) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
(8) Peiqi Duan, Zihao Wang, Boxin Shi, Oliver Cossairt, Tiejun Huang, and Aggelos Katsaggelos. Guided Event Filtering: Synergy between intensity images and neuromorphic events for high performance imaging. IEEE TPAMI, 2021.
(9) Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Image Analysis, 2003.
(10) Guillermo Gallego, Tobi Delbrück, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J. Davison, Jörg Conradt, Kostas Daniilidis, and Davide Scaramuzza. Event-based vision: A survey. IEEE TPAMI, 44(1):154–180, 2022.
(11) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
(12) Daniel Gehrig, Mathias Gehrig, Javier Hidalgo-Carrió, and Davide Scaramuzza. Video to events: Recycling video datasets for event cameras. In CVPR, 2020.
(13) Daniel Gehrig, Michelle Rüegg, Mathias Gehrig, Javier Hidalgo-Carrió, and Davide Scaramuzza. Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction. IRAL, 2021.
(14) Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. DSEC: A stereo event camera dataset for driving scenarios. IRAL, 2021.
(15) Jin Han, Yixin Yang, Chu Zhou, Chao Xu, and Boxin Shi. EvIntSR-Net: Event guided multiple latent frames reconstruction and super-resolution. In ICCV, 2021.
(16) Jin Han, Chu Zhou, Peiqi Duan, Yehui Tang, Chang Xu, Chao Xu, Tiejun Huang, and Boxin Shi. Neuromorphic camera guided high dynamic range imaging. In CVPR, 2020.
(17) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
(18) Javier Hidalgo-Carrió, Guillermo Gallego, and Davide Scaramuzza. Event-aided direct sparse odometry. In CVPR, 2022.
(19) Yuhuang Hu, Tobi Delbrück, and Shih-Chii Liu. Learning to exploit multiple vision modalities by using grafted networks. In ECCV, 2020.
(20) Jianping Jiang, Jiahe Li, Baowen Zhang, Xiaoming Deng, and Boxin Shi. EvHandPose: Event-based 3D hand pose estimation with sparse supervision. arXiv:2303.02862, 2023.
(21) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
(22) Alexander Kirillov, Kaiming He, Ross B. Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
(23) Patrick Lichtsteiner, Christoph Posch, and Tobi Delbrück. A 128×128 120 dB 15 $\mu$s latency asynchronous temporal contrast vision sensor. JSSC, 2008.
(24) Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014.
(25) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
(26) Nico Messikommer, Daniel Gehrig, Mathias Gehrig, and Davide Scaramuzza. Bridging the gap between events and frames through unsupervised domain adaptation. IRAL, 2022.
(27) Nico Messikommer, Stamatios Georgoulis, Daniel Gehrig, Stepan Tulyakov, Julius Erbach, Alfredo Bochicchio, Yuanyou Li, and Davide Scaramuzza. Multi-bracket high dynamic range imaging with event cameras. In CVPR, 2022.
(28) Anton Mitrokhin, Cornelia Fermüller, Chethan Parameshwara, and Yiannis Aloimonos. Event-based moving object detection and tracking. In IROS, 2018.
(29) Anton Mitrokhin, Zhiyuan Hua, Cornelia Fermüller, and Yiannis Aloimonos. Learning visual motion segmentation using event surfaces. In CVPR, 2020.
(30) S. Mohammad Mostafavi I., Jonghyun Choi, and Kuk-Jin Yoon. Learning to super resolve intensity images from events. In CVPR, 2020.
(31) Liyuan Pan, Cedric Scheerlinck, Xin Yu, Richard Hartley, Miaomiao Liu, and Yuchao Dai. Bringing a blurry frame alive at high frame-rate with an event camera. In CVPR, 2019.
(32) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
(33) Etienne Perot, Pierre de Tournemire, Davide Nitti, Jonathan Masci, and Amos Sironi. Learning to detect objects with a 1 megapixel event camera. In NeurIPS, 2020.
(34) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
(35) Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera. IEEE TPAMI, 2021.
(36) Viktor Rudnev, Vladislav Golyanik, Jiayi Wang, Hans-Peter Seidel, Franziska Mueller, Mohamed Elgharib, and Christian Theobalt. EventHands: Real-time neural 3D hand pose estimation from an event stream. In ICCV, 2021.
(37) Bongki Son, Yunjae Suh, Sungho Kim, Heejae Jung, Jun-Seok Kim, Changwoo Shin, Keunju Park, Kyoobin Lee, Jinman Park, Jooyeon Woo, et al. A 640×480 dynamic vision sensor with a 9 $\mu$m pixel and 300Meps address-event representation. In ISSCC, pages 66–67, 2017.
(38) Timo Stoffregen, Guillermo Gallego, Tom Drummond, Lindsay Kleeman, and Davide Scaramuzza. Event-based motion segmentation by motion compensation. In ICCV, 2019.
(39) Lei Sun, Christos Sakaridis, Jingyun Liang, Qi Jiang, Kailun Yang, Peng Sun, Yaozu Ye, Kaiwei Wang, and Luc Van Gool. Event-based fusion for motion deblurring with cross-modal attention. In ECCV, 2022.
(40) Zhaoning Sun, Nico Messikommer, Daniel Gehrig, and Davide Scaramuzza. ESS: learning event-based semantic segmentation from still images. In ECCV, 2022.
(41) Gemma Taverni, Diederik Paul Moeys, Chenghan Li, Celso Cavaco, Vasyl Motsnyi, David San Segundo Bello, and Tobi Delbrück. Front and back illuminated dynamic and active pixel vision sensors comparison. IEEE TCAS-II, 65(5):677–681, 2018.
(42) Minggui Teng, Chu Zhou, Hanyue Lou, and Boxin Shi. NEST: neural event stack for event-based image enhancement. In ECCV, 2022.
(43) Abhishek Tomy, Anshul Paigwar, Khushdeep Singh Mann, Alessandro Renzaglia, and Christian Laugier. Fusing event-based and RGB camera for robust object detection in adverse conditions. In ICRA, 2022.
(44) Stepan Tulyakov, Alfredo Bochicchio, Daniel Gehrig, Stamatios Georgoulis, Yuanyou Li, and Davide Scaramuzza. Time Lens++: Event-based frame interpolation with parametric non-linear flow and multi-scale fusion. In CVPR, 2022.
(45) Stepan Tulyakov, Daniel Gehrig, Stamatios Georgoulis, Julius Erbach, Mathias Gehrig, Yuanyou Li, and Davide Scaramuzza. Time Lens: Event-based video frame interpolation. In CVPR, 2021.
(46) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
(47) Lin Wang, Yujeong Chae, and Kuk-Jin Yoon. Dual transfer learning for event-based end-task prediction via pluggable event to image translation. In ICCV, 2021.
(48) Lin Wang, Yujeong Chae, Sung-Hoon Yoon, Tae-Kyun Kim, and Kuk-Jin Yoon. EvDistill: Asynchronous events to end-task learning via bidirectional reconstruction-guided cross-modal knowledge distillation. In CVPR, 2021.
(49) Lin Wang and Kuk-Jin Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE TPAMI, 2022.
(50) Zihao W. Wang, Peiqi Duan, Oliver Cossairt, Aggelos K. Katsaggelos, Tiejun Huang, and Boxin Shi. Joint filtering of intensity images and neuromorphic events for high-resolution noise-robust imaging. In CVPR, 2020.
(51) Lan Xu, Weipeng Xu, Vladislav Golyanik, Marc Habermann, Lu Fang, and Christian Theobalt. EventCap: Monocular 3D capture of high-speed human motions using an event camera. In CVPR, 2020.
(52) Zhuyun Zhou, Zongwei Wu, Rémi Boutteau, Fan Yang, Cédric Demonceaux, and Dominique Ginhac. RGB-event fusion for moving object detection in autonomous driving. In ICRA, 2023.
(53) Alex Zihao Zhu, Ziyun Wang, Kaung Khant, and Kostas Daniilidis. EventGAN: Leveraging large scale image datasets for event cameras. In ICCP, 2021.
(54) Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Ev-FlowNet: Self-supervised optical flow estimation for event-based cameras. In RSS, 2018.
(55) Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Unsupervised event-based learning of optical flow, depth, and egomotion. In CVPR, 2019.
(56) Shihao Zou, Chuan Guo, Xinxin Zuo, Sen Wang, Pengyu Wang, Xiaoqin Hu, Shoushun Chen, Minglun Gong, and Li Cheng. EventHPE: Event-based 3D human pose and shape estimation. In ICCV, 2021.