
License: arXiv.org perpetual non-exclusive license
arXiv:2312.16933v1 [cs.CV] 28 Dec 2023

EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion

Jianping Jiang$^{1,2}$, Xinyu Zhou$^{3}$, Peiqi Duan$^{1,2}$, Boxin Shi$^{1,2}$
$^{1}$National Key Laboratory for Multimedia Information Processing, School of Computer Science
$^{2}$National Engineering Research Center of Visual Technology, School of Computer Science
$^{3}$National Key Lab of General AI, School of Intelligence Science and Technology
Peking University
[email protected], {zhouxiny, duanqi0001, shiboxin}@pku.edu.cn
Abstract

Event cameras and RGB cameras exhibit complementary characteristics in imaging: the former possesses high dynamic range (HDR) and high temporal resolution, while the latter provides rich texture and color information. This makes the integration of event cameras into middle- and high-level RGB-based vision tasks highly promising. However, challenges arise in multi-modal fusion, data annotation, and model architecture design. In this paper, we propose EvPlug, which learns a plug-and-play event and image fusion module from the supervision of the existing RGB-based model. The learned fusion module integrates event streams with image features in the form of a plug-in, making the RGB-based model robust to HDR and fast-motion scenes while enabling inference at high temporal resolution. Our method only requires unlabeled event-image pairs (no pixel-wise alignment required) and does not alter the structure or weights of the RGB-based model. We demonstrate the superiority of EvPlug in several vision tasks such as object detection, semantic segmentation, and 3D hand pose estimation.

1 Introduction

Traditional frame-based RGB cameras, featuring rich color and texture information as well as lower noise, are the mainstream sensors in computer vision research. However, due to their imaging mechanisms, they inevitably face issues such as overexposure, motion blur, and limited temporal resolution han21evintsr ; wang20jointfilter ; hu20nga ; Timelens ; EDI . Bio-inspired neuromorphic event cameras, with their asynchronous differential imaging mechanism (events are triggered when the pixel-wise brightness change exceeds a threshold in the logarithmic domain), possess high dynamic range (HDR), high temporal resolution, low data redundancy, and low power consumption licht08davis ; dvs640 ; davis346 ; survey . However, the imaging quality of event cameras degrades when the camera and scene are relatively static or when the scene illumination changes significantly rebecq21e2vid ; e2sri-cvpr20 ; Timelens . These two types of cameras exhibit complementary characteristics in imaging as shown in Fig. 1, which has been demonstrated in various vision tasks, such as high frame-rate video reconstruction Timelens ; tulyakov22timelens++ ; gef-tpami , HDR image restoration han2020hdr , object detection zhou2023rgb , and depth estimation daniel21ramnet .

However, event cameras have been studied for a much shorter period of time than RGB cameras, limiting the progress of their integration with RGB cameras. Researchers have built large-scale datasets deng09imagenet ; lin14coco and various meticulously designed network architectures he16resnet ; vaswani17transformer based on RGB images, while the corresponding tasks in event-based vision are at a much less mature stage. Therefore, when fusing the two modalities of data, challenges arise from two perspectives: 1) data perspective: introducing event cameras requires collecting and annotating new data sequences, and large-scale data annotation implies high costs; 2) model perspective: designing new fusion algorithms and retraining models for event streams require additional model design work and computational resources.

To alleviate the challenges, existing methods hu20nga ; wang21dual ; wang21evdistill ; messikommer22evtransfer ; gehrig20v2e ; rebecq21e2vid utilize the concept of domain adaptation messikommer22evtransfer (also referred to as transfer learning or knowledge distillation in event-based vision wanglin22kd_survey ), leveraging the consistency of visual signals between event and RGB cameras in geometric and semantic dimensions to transfer RGB-based knowledge to event-based tasks. They have achieved promising results on middle- and high-level vision tasks such as classification messikommer22evtransfer , object detection wang21evdistill ; messikommer22evtransfer ; hu20nga , and semantic segmentation sun22ess , where the input data can be compressed into high-dimension features$^{1}$. However, these methods are insufficient in two aspects: 1) difficulty in handling the differences between these two types of visual information: event streams contain rich motion and temporal information but lack color and texture information, while RGB images are the opposite messikommer22evtransfer ; zhu21eventgan ; 2) inability to achieve modality fusion that complementarily utilizes both modalities.

$^{1}$Event and RGB image fusion methods for low-level vision tasks, such as video interpolation Timelens ; tulyakov22timelens++ and deblurring teng22nest ; sun22event-based , often utilize physical constraints and require pixel-wise alignment, so they do not apply domain adaptation.

To overcome the limitations of the aforementioned domain adaptation methods, we propose EvPlug, a framework that learns a pluggable event and image fusion module to strengthen the capability of the existing RGB-based model with events under various challenging scenes. In terms of event-and-image connection, unlike previous approaches that utilize the visual similarity between RGB images and event streams, we use the event generation model to constrain their relationship. This not only physically aligns with the differential imaging mechanism of the event camera but also theoretically endows the existing RGB-based model with the same temporal resolution as the event stream while preserving temporal consistency. In terms of modality fusion, we employ event features to calibrate RGB features in the feature dimension to assure event-image quality consistency. On the one hand, this allows the event information to correct the feature space distortion caused when the RGB image degrades, achieving modality complementarity. On the other hand, as a feature-level fusion method, it does not change the structure and weights of the RGB-based model, making the learned event and image feature fusion module plug-and-play during evaluation.

Figure 1: Complementary characteristics of images and events. Event cameras record more meaningful signals than RGB cameras in scenes shown in blue boxes, while results in green boxes are the opposite.
Figure 2: Radar plots of experimental results of EvPlug, NGA hu20nga , E2VID rebecq21e2vid , EvTransfer messikommer22evtransfer , RGB-based method, and event-based method on object detection (AP50 on DSEC-MOD zhou2023rgb ), semantic segmentation (mIoU on DSEC-Semantic sun22ess ), and hand pose estimation (AUC on EvRealHands jiang23evhandpose ).

The above information constraints and feature fusion can be unified into a single framework. We compare our method with other domain adaptation methods and supervised RGB-based and event-based methods in several downstream tasks, and the experimental results confirm the effectiveness of our approach as shown in Fig. 2. In summary, our contributions are as follows:

  • a unified framework that learns a plug-and-play module under the supervision of existing RGB-based models connected by the event generation model;

  • a flexible and generalized event and image feature fusion strategy to retain both merits of event and RGB cameras; and

  • an image and event distillation benchmark, which includes the results of six methods in three downstream vision tasks.

2 Related Works

2.1 RGB-Event Fusion Methods

Owing to their high dynamic range, high temporal resolution, low redundancy, and low power consumption, event cameras have shown potential in object detection anton18detection ; zhou2023rgb ; prophesee , semantic segmentation alonso19evsegnet ; gehrig21dsec ; sun22ess , depth estimation daniel21ramnet , optical flow estimation zhu18evflownet ; zhu19evflow , denoising wang20jointfilter , motion segmentation mitrokhin2020learning ; motion-seg-iccv19 , frame interpolation han21evintsr , human/hand pose estimation zou21eventhpe ; victor21eventhands ; jiang23evhandpose , etc. Despite these advantages, event cameras exhibit higher noise levels due to less mature fabrication processes compared with RGB cameras, and they struggle to capture effective information when the camera and scene are relatively static or when the scene's illumination undergoes significant changes. Consequently, some studies attempt to fuse RGB images with event streams to enhance the robustness of vision models. Researchers have investigated various pixel-level alignment and fusion methods for low-level vision tasks, such as video frame interpolation tulyakov22timelens++ , deblurring sun22event-based , and HDR imaging messikommer22multi-bracket . In middle- and high-level vision tasks, hierarchical feature-level fusion methods have been applied to object detection tomy22fusing ; zhou2023rgb , depth estimation daniel21ramnet , etc. EventCap lan20eventcap utilizes events to track estimated human poses from images. However, these RGB and event fusion methods are trained with ground truth (GT) supervision for a specific task, which limits their generality.

Figure 3: Illustration of differences among (a) paired methods hu20nga ; deng21learning , (b) unpaired methods gehrig20v2e ; rebecq21e2vid ; zhu21eventgan ; wang21evdistill ; wang21dual ; messikommer22evtransfer , and (c) EvPlug. EvPlug can learn event and image fusion connected by the event generation model. The legends on the right remain consistent throughout the paper.

2.2 Domain Adaptation in Event-based Vision

As both event streams and RGB images are visual signals, existing domain adaptation methods primarily exploit the consistency in geometry and texture between the two modalities to transfer knowledge from the RGB domain to event-based vision tasks. In terms of the data utilized, current methods can be divided into paired and unpaired approaches as shown in Fig. 3. Paired methods hu20nga ; deng21learning require pixel-wise alignment to ensure feature-level constraints, but this imposes higher demands on data acquisition. Although the DAVIS camera licht08davis can simultaneously output event streams and active pixel sensor (APS) frames, the APS frames lack color information and are of low quality.

Table 1: Comparison of training and evaluation factors among EvPlug and other methods.
| Methods | Training: Model Input | Training: Supervision | Evaluation: Motion Blur | Evaluation: HDR | Evaluation: Illumination Change | Evaluation: Relatively Static | Evaluation: Temporal Resolution |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RGB-based | Image | GT | | | | | Low |
| Event-based | Events | GT | | | | | High |
| Paired deng21learning ; hu20nga | Pixel-wise paired | RGB model | | | | | High |
| Unpaired wang21dual ; messikommer22evtransfer ; rebecq21e2vid ; gehrig20v2e ; wang21evdistill ; zhu21eventgan | Unpaired | RGB model | | | | | High |
| EvPlug | Paired | RGB model | | | | | High |

The core idea of unpaired methods gehrig20v2e ; rebecq21e2vid ; zhu21eventgan ; wang21evdistill ; wang21dual ; messikommer22evtransfer is to convert between the two modalities to create paired data and then perform domain adaptation using the paired approach. However, since event streams and RGB images carry mismatched information (events lack absolute brightness values and color information, while images lack motion and illumination change information), this conversion process is ill-conditioned. Although EvTransfer messikommer22evtransfer has attempted to decouple the corresponding information, it does not fundamentally address the problem. As summarized in Tab. 1, these existing methods can only distill pure event-based models without multi-modal fusion.

3 Method

Figure 4: Feature-level constraint
Figure 5: Event-image quality consistency

3.1 Event-Image Connection Constrained via Event Generation Model

As model architectures he16resnet ; doso21vit and data scales have advanced, researchers have successfully trained high-performing vision models on RGB images radford21clip ; liu21swintransformer . In contrast, event-based vision is at an early stage of its development. As shown in Fig. 3 (a) and (b), the visual similarity between event streams and RGB images encourages researchers to transfer knowledge from RGB images to events, and such a process can be expressed as:

$$E_{[t_{i-1},t_{i}]} \leftrightarrow I_{t_{i}}, \tag{1}$$

where $E_{[t_{i-1},t_{i}]}$ represents event streams from time $t_{i-1}$ to $t_{i}$, $I_{t_{i}}$ is the RGB image at time $t_{i}$, and $\leftrightarrow$ denotes the information connection between images and events, such as a feature map constraint wanglin22kd_survey .

However, such a straightforward mapping does not work well because events and RGB images sense motion, texture, and color differently. This issue is caused by the imaging mechanism of event cameras, also known as the event generation model hidalgo22edso .

Event cameras licht08davis generate asynchronous event streams by measuring per-pixel brightness changes. An event $e_{i}=(x_{i},y_{i},t_{i},p_{i})$ occurs at pixel $(x_{i},y_{i})$ at time $t_{i}$ when the logarithmic brightness change reaches the threshold:

$$\log I(x_{i},y_{i},t_{i})-\log I(x_{i},y_{i},t_{p})=p_{i}\cdot C, \tag{2}$$

where $t_{p}$ is the timestamp of the last event at pixel $(x_{i},y_{i})$, $p_{i}\in\{-1,1\}$ is the polarity, and $C$ is the threshold. Previous research on video frame interpolation and deblurring tulyakov22timelens++ ; han21evintsr ; teng22nest has shown that, based on an image at time $t_{i}$ and the event stream following $t_{i}$, a frame at any future time can be reconstructed. When images and events are pixel-wise aligned, this relation is denoted as:

$$I(x_{i+1},y_{i+1},t_{i+1})=I(x_{i},y_{i},t_{i})\cdot\exp\Big(\sum_{e_{j}\in E_{[t_{i},t_{i+1}]}}p_{j}\cdot C\Big). \tag{3}$$

Hence, we propose the following information constraint connecting event streams and RGB images:

$$I_{t_{i}}+E_{[t_{i},t_{i+1}]}\leftrightarrow I_{t_{i+1}}, \tag{4}$$

where "$+$" refers to a multi-modal data fusion process, as shown in Fig. 3 (c).

In addition to expressing the physical constraints between event cameras and RGB cameras according to their image formation model, this constraint also implies that it is possible to predict the results at any future moment using a single RGB image and subsequent event streams.
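The per-pixel relation in Eq. (3) can be illustrated with a small NumPy sketch. This is a toy example under idealized assumptions and not the authors' implementation: events are noise-free `(x, y, t, p)` tuples, the threshold $C$ is a single constant (0.2 is an arbitrary value), and each pixel stays at the same location between the two timestamps. The function names `accumulate_events` and `reconstruct_frame` are hypothetical.

```python
import numpy as np

def accumulate_events(events, shape):
    """Sum event polarities per pixel; each event is an (x, y, t, p) tuple."""
    polarity_sum = np.zeros(shape, dtype=np.float64)
    for x, y, _t, p in events:
        polarity_sum[y, x] += p
    return polarity_sum

def reconstruct_frame(image, events, threshold):
    """Apply Eq. (3): I(t_{i+1}) = I(t_i) * exp(C * sum of polarities)."""
    polarity_sum = accumulate_events(events, image.shape)
    return image * np.exp(threshold * polarity_sum)

# Toy data (assumed): a 2x2 grayscale frame and three events in [t_i, t_{i+1}].
frame_t0 = np.full((2, 2), 0.5)
toy_events = [(0, 0, 0.01, 1), (0, 0, 0.02, 1), (1, 1, 0.03, -1)]
frame_t1 = reconstruct_frame(frame_t0, toy_events, threshold=0.2)
```

Pixels that receive no events keep their brightness, which mirrors the remark that event cameras carry little information when the camera and scene are relatively static.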

3.2 Feature-level Multi-modal Fusion

When fusing multi-modal data, the feature-level constraint allows effective information fusion for complementary use of both modalities of data daniel21ramnet ; zhou2023rgb ; wanglin22kd_survey . This inspires us to extend the information constraint in Eq. (4) to the feature level as illustrated in Fig. 4:

$$f_{\text{E-Former}}(f_{\text{ImEncoder}}(I_{t_{i}}),f_{\text{EvEncoder}}(E_{[t_{i},t_{i+1}]}))\leftrightarrow f_{\text{ImEncoder}}(I_{t_{i+1}}), \tag{5}$$

where $f_{\text{E-Former}}$ is the fusion module, and $f_{\text{ImEncoder}}$ and $f_{\text{EvEncoder}}$ are the image and the event encoders.

We design $f_{\text{E-Former}}$ based on the transformer decoder layer vaswani17transformer , which can perform feature association and does not require strict pixel-wise alignment between event streams and RGB images, alleviating the burden of data acquisition. Within the decoder layers, event features serve as the input for cross-attention keys and values, while the image features are updated through multiple decoder layers to obtain the fused features. In terms of information constraints, the feature-level constraint enables $f_{\text{E-Former}}$ to learn rich information from RGB-based models wanglin22kd_survey . Besides, this constraint does not change the structure and weights of the RGB-based model, and $f_{\text{EvEncoder}}$ and $f_{\text{E-Former}}$ can be plug-and-play for evaluation.
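To make the roles of queries, keys, and values concrete, the following is a minimal single-head cross-attention sketch in NumPy. It is only an illustration under assumptions: the projection matrices are random, and the self-attention, feed-forward, and normalization sub-layers of a full transformer decoder layer are omitted. The name `cross_attention_update` and all tensor sizes are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def cross_attention_update(image_tokens, event_tokens, w_q, w_k, w_v):
    """One residual cross-attention step: image tokens query event tokens."""
    queries = image_tokens @ w_q   # (N_img, d), from image features
    keys = event_tokens @ w_k      # (N_ev, d), from event features
    values = event_tokens @ w_v    # (N_ev, d), from event features
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    attention = softmax(scores, axis=-1)  # each image token attends over all event tokens
    return image_tokens + attention @ values, attention

rng = np.random.default_rng(0)
dim = 8
image_tokens = rng.normal(size=(6, dim))    # 6 image feature tokens (assumed size)
event_tokens = rng.normal(size=(10, dim))   # 10 event feature tokens; counts need not match
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
fused_tokens, attention = cross_attention_update(image_tokens, event_tokens, w_q, w_k, w_v)
```

Because every image token attends over all event tokens, the two token grids do not need to share a resolution or a pixel-wise correspondence, which is the property the paragraph above relies on.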

Figure 6: Illustration of EvPlug pipeline. During training, event encoder $f_{\text{EvEncoder}}$ (blue) and fusion module $f_{\text{E-Former}}$ (yellow) are learned to ensure the event-image quality consistency and temporal consistency under the supervision of RGB sequential features from the fixed RGB-based model (green). During the evaluation, $f_{\text{EvEncoder}}$ and $f_{\text{E-Former}}$ can work as plug-and-play modules for complementary use of both modalities for robust inference with high temporal resolution.

However, using the constraint in Eq. (5) to train $f_{\text{E-Former}}$ faces two challenges: 1) it cannot utilize event streams $E_{[t_{i-1},t_{i}]}$ for robust inference on image $I_{t_{i}}$; 2) it cannot assure temporal consistency when performing model inference with high temporal resolution using events between adjacent RGB images. To tackle these challenges, we propose two constraints to enable fusion module $f_{\text{E-Former}}$ to learn event-image quality consistency and temporal consistency.

Event-Image Quality Consistency

Existing RGB-based vision models doso21vit ; he16resnet usually extract high-quality features from normal-quality RGB images. Due to the limited imaging capability of RGB cameras in HDR or fast-motion scenes, the resulting images may suffer from degradation, which distorts the feature space. One goal of the fusion module $f_{\text{E-Former}}$ is therefore to rectify, with event streams, the feature space distortion caused by RGB image degradation.

For training, it is hard to obtain paired normal and degraded RGB sequences of the same view in the real world. To tackle this issue, we add photometric degradation on normal-quality RGB sequences to get synthetic paired degraded images. As shown in Fig. 5, this can be formulated as:

$$f_{\text{E-Former}}(f_{\text{ImEncoder}}(\delta(I_{t_{i}})),f_{\text{EvEncoder}}(E_{[t_{i-1},t_{i}]}))\leftrightarrow f_{\text{ImEncoder}}(I_{t_{i}}), \tag{6}$$

where $\delta(\cdot)$ is the image degradation function. For applying synthetic degradation, we introduce two types of operations: strong/low light, simulated with brightness and contrast augmentation in PyTorch paszke19pytorch , and motion blur, simulated with pre-computed optical flow. We warp the original image with optical flow computed following farneback03flow in OpenCV to interpolate frames, and average them to simulate motion blur.
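The two degradation operations can be sketched in a few lines of NumPy. This is a simplified stand-in rather than the pipeline described above: the flow field is supplied by hand instead of being estimated with OpenCV, warping uses nearest-neighbor sampling, and the brightness/contrast adjustment is a generic formula rather than the PyTorch augmentation. The names `warp_nearest`, `motion_blur`, and `adjust_exposure`, and all parameter values, are hypothetical.

```python
import numpy as np

def warp_nearest(image, flow, fraction):
    """Backward-warp a grayscale image by fraction * flow (nearest neighbor)."""
    height, width = image.shape[:2]
    ys, xs = np.mgrid[0:height, 0:width]
    src_x = np.clip(np.rint(xs - fraction * flow[..., 0]).astype(int), 0, width - 1)
    src_y = np.clip(np.rint(ys - fraction * flow[..., 1]).astype(int), 0, height - 1)
    return image[src_y, src_x]

def motion_blur(image, flow, num_steps=5):
    """Average frames interpolated along the flow to simulate motion blur."""
    fractions = np.linspace(0.0, 1.0, num_steps)
    frames = [warp_nearest(image, flow, f) for f in fractions]
    return np.mean(frames, axis=0)

def adjust_exposure(image, brightness=1.0, contrast=1.0):
    """Scale contrast around the mean, then scale brightness, clipped to [0, 1]."""
    mean = image.mean()
    out = (image - mean) * contrast + mean
    return np.clip(out * brightness, 0.0, 1.0)

# Toy data (assumed): one bright pixel moving 4 pixels to the right.
sharp = np.zeros((5, 9))
sharp[2, 2] = 1.0
flow = np.zeros((5, 9, 2))
flow[..., 0] = 4.0
blurred = motion_blur(sharp, flow, num_steps=5)
overexposed = adjust_exposure(sharp, brightness=3.0)
```

In the toy case the single bright pixel is smeared into a five-pixel streak of equal intensity, which is the kind of degraded input $\delta(I_{t_{i}})$ that Eq. (6) pairs with the clean target feature.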

Event-Image Temporal Consistency

Directly using the constraint in Eq. (5) for inference with high temporal resolution cannot assure the temporal consistency of inference results. After obtaining fusion features at time $t_{i}$ as:

$$F_{t_{i}}=f_{\text{E-Former}}(f_{\text{ImEncoder}}(\delta(I_{t_{i}})),f_{\text{EvEncoder}}(E_{[t_{i-1},t_{i}]})), \tag{7}$$

we use $f_{\text{E-Former}}$ to iteratively update the fusion feature with events to assure temporal consistency:

$$F_{t_{j}}=f_{\text{E-Former}}(F_{t_{j-1}},f_{\text{EvEncoder}}(E_{[t_{j-1},t_{j}]})),\quad j=i+1,i+2,\dots,i+K, \tag{8}$$

where $F_{t_{j}}$ is the fusion feature at time $t_{j}$, and $K$ is the fusion step.

Thus we get sequential constraints to enable $f_{\text{E-Former}}$ to learn temporal consistency:

$$F_{t_{j}}\leftrightarrow f_{\text{ImEncoder}}(I_{t_{j}}),\quad j=i+1,i+2,\dots,i+K. \tag{9}$$

By integrating constraints from Eq. (6) and Eq. (9), $f_{\text{E-Former}}$ can utilize events $E_{[t_{i-1},t_{i}]}$ to rectify the degradation of the image $I_{t_{i}}$ and assure the temporal consistency of evaluation results with subsequent events in an iterative way.
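The iterative update in Eqs. (7) and (8) amounts to a simple rollout loop, sketched below. The encoders and the fusion module are passed in as plain callables, and the toy stand-ins at the bottom (identity image encoder, summed polarities as event features, additive fusion) are assumptions chosen only so the example runs; they are not the learned modules. The name `rollout_fused_features` is hypothetical.

```python
import numpy as np

def rollout_fused_features(degraded_image, event_chunks, image_encoder,
                           event_encoder, fusion_module):
    """Return [F_{t_i}, F_{t_{i+1}}, ..., F_{t_{i+K}}] following Eqs. (7) and (8).

    event_chunks[0] holds the events in [t_{i-1}, t_i]; event_chunks[k] holds
    the events in [t_{i+k-1}, t_{i+k}] for k = 1..K.
    """
    # Eq. (7): fuse the (possibly degraded) image feature with preceding events.
    fused = fusion_module(image_encoder(degraded_image), event_encoder(event_chunks[0]))
    features = [fused]
    # Eq. (8): keep updating the fused feature with each later event chunk.
    for chunk in event_chunks[1:]:
        fused = fusion_module(fused, event_encoder(chunk))
        features.append(fused)
    return features

# Toy stand-ins (assumed, for illustration only).
toy_image_encoder = lambda image: np.asarray(image, dtype=float)
toy_event_encoder = lambda chunk: np.full(2, float(sum(chunk)))
toy_fusion = lambda feature, event_feature: feature + event_feature

chunks = [[1, 1], [1], [-1, -1, -1]]  # K = 2 update steps after the first fusion
rollout = rollout_fused_features(np.zeros(2), chunks, toy_image_encoder,
                                 toy_event_encoder, toy_fusion)
```

Each element of the returned list would be compared against the frozen image encoder's feature for the corresponding RGB frame during training, as in Eq. (9); at evaluation time only the first image and the event chunks are needed.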

3.3 Training and Evaluation

Training

Our training procedure is illustrated in Fig. 6 (left). As mentioned in Eq. (5), feature Ftjsubscript𝐹subscript𝑡𝑗F_{t_{j}}italic_F start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT should be consistent with image feature fImEncoder(Itj)subscript𝑓ImEncodersubscript𝐼subscript𝑡𝑗f_{\text{ImEncoder}}(I_{t_{j}})italic_f start_POSTSUBSCRIPT ImEncoder end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ). In our work, we use two losses to measure the constraints: task-level loss tasksubscripttask{\mathcal{L}}_{\text{task}}caligraphic_L start_POSTSUBSCRIPT task end_POSTSUBSCRIPT and feature-level losses. Task-level loss is defined by the task-specific training loss under supervision from outputs of RGB-based model on original RGB image inputs. Following hu20nga , feature-level losses consist of feature reconstruction loss reconsubscriptrecon{\mathcal{L}}_{\text{recon}}caligraphic_L start_POSTSUBSCRIPT recon end_POSTSUBSCRIPT and feature style loss stylesubscriptstyle{\mathcal{L}}_{\text{style}}caligraphic_L start_POSTSUBSCRIPT style end_POSTSUBSCRIPT:

$$\mathcal{L}_{\text{recon}}(t_j) = \text{MSE}(F_{t_j}, f_{\text{ImEncoder}}(I_{t_j})), \tag{10}$$
$$\mathcal{L}_{\text{style}}(t_j) = \text{MSE}(\text{Gram}(F_{t_j}), \text{Gram}(f_{\text{ImEncoder}}(I_{t_j}))), \tag{11}$$

where MSE is the mean squared error and Gram is the Gram matrix gatys16imagestyle.
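As an illustration of Eq. (10) and Eq. (11), the following minimal NumPy sketch computes both feature-level losses for a single feature map. The function names, the `(C, H, W)` layout, and the Gram-matrix normalization are assumptions made for this example; the paper does not specify them.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map: inner products between channels."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    # Dividing by C*H*W is an assumed normalization, not taken from the paper.
    return flat @ flat.T / (c * h * w)

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

def recon_loss(fused_feat, image_feat):
    """Eq. (10): MSE between the fused feature and the image-encoder feature."""
    return mse(fused_feat, image_feat)

def style_loss(fused_feat, image_feat):
    """Eq. (11): MSE between the Gram matrices of the two features."""
    return mse(gram_matrix(fused_feat), gram_matrix(image_feat))

# Random stand-ins for F_{t_j} and f_ImEncoder(I_{t_j}).
rng = np.random.default_rng(0)
fused = rng.normal(size=(8, 4, 4))
target = rng.normal(size=(8, 4, 4))
print(recon_loss(fused, target), style_loss(fused, target))
```

The reconstruction term compares features location by location, whereas the style term only compares channel co-activation statistics, so it is insensitive to where in the feature map a pattern occurs.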

Thus the final loss becomes:

$$\mathcal{L}_{\text{all}} = \sum_{j}{\lambda_{\text{task}}\mathcal{L}_{\text{task}}(t_j) + \lambda_{\text{recon}}\mathcal{L}_{\text{recon}}(t_j) + \lambda_{\text{style}}\mathcal{L}_{\text{style}}(t_j)}, \quad j = i, i+1, \ldots, K, \tag{12}$$

where $\lambda_{\text{task}}$, $\lambda_{\text{recon}}$, and $\lambda_{\text{style}}$ are loss weights.
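The weighted sum in Eq. (12) can be sketched in a few lines of Python. The defaults for `lambda_recon` and `lambda_style` follow the value of 10 reported in the implementation details, while `lambda_task = 1.0` and the per-step loss values are hypothetical, chosen only to make the example concrete.

```python
def total_loss(per_step_losses, lambda_task=1.0, lambda_recon=10.0, lambda_style=10.0):
    """Eq. (12): weighted sum of the three losses over the supervised time steps.

    per_step_losses: iterable of (task, recon, style) scalars, one tuple per t_j.
    lambda_task = 1.0 is an assumption; this section does not state its value.
    """
    return sum(
        lambda_task * task + lambda_recon * recon + lambda_style * style
        for task, recon, style in per_step_losses
    )

# Two supervised time steps with made-up loss values.
steps = [(0.5, 0.02, 0.001), (0.7, 0.03, 0.002)]
print(total_loss(steps))
```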

Evaluation

As shown in Fig. 6 (right), during the evaluation stage, robust features are first obtained by fusing the image $I_{t_i}$ and the event stream $E_{[t_{i-1},t_i]}$; these features are then decoded to obtain robust inference results. Next, the incoming high-temporal-resolution event stream is divided into $K$ event slices of equal time interval (assuming the interval between neighboring RGB images is $\Delta$). These slices are then sequentially passed through the fusion module to iteratively update the features, and the outputs are obtained through the feature decoder module. Through this iterative process, EvPlug achieves $K$ times the temporal resolution of the original RGB-based methods. Since $K$ can be adjusted for different applications, EvPlug allows for flexible high-temporal-resolution inference.
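The slicing step described above can be sketched as follows. This is a minimal example assuming integer microsecond timestamps; the function name `split_into_slices` and its arguments are hypothetical and do not come from the paper.

```python
import numpy as np

def split_into_slices(timestamps, t_start, delta, k):
    """Assign events to k equal-duration slices covering [t_start, t_start + delta].

    Returns a list of k index arrays into `timestamps`.
    """
    timestamps = np.asarray(timestamps, dtype=np.float64)
    edges = np.linspace(t_start, t_start + delta, k + 1)
    # Digitizing against the inner edges maps each timestamp to a slice id in 0..k-1.
    slice_id = np.digitize(timestamps, edges[1:-1])
    in_range = (timestamps >= t_start) & (timestamps <= t_start + delta)
    return [np.flatnonzero(in_range & (slice_id == s)) for s in range(k)]

# Six events (timestamps in microseconds) between two RGB frames 40 ms apart.
ts = np.array([0, 9_999, 10_000, 26_000, 30_000, 39_000])
slices = split_into_slices(ts, t_start=0, delta=40_000, k=4)
print([len(s) for s in slices])
```

Each slice would then be encoded and passed to the fusion module in order, so a larger `k` gives more intermediate predictions between two RGB frames.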

4 Experiments

Baselines

For comparison with domain adaptation methods, we choose NGA hu20nga as the pixel-aligned paired method, and for unpaired methods we select E2VID rebecq21e2vid and EV-Transfer messikommer22evtransfer, which have released open-source code. V2E gehrig20v2e works as an event simulator and inevitably faces a sim-to-real gap compared with the paired method NGA hu20nga, so we do not evaluate it here. Although EvDistill wang21evdistill and DTL wang21dual are unpaired methods, there is no public training code for applying them to our datasets. To validate the benefits of fusing the two modalities, we also compare with supervised RGB-based methods for each task.

Downstream Tasks

Middle- and high-level vision datasets that contain both RGB images and event streams are generally scarce. Considering data availability, we evaluate the performance of EvPlug on three downstream vision tasks: object detection on DSEC-MOD zhou2023rgb, semantic segmentation on DSEC-Semantic sun22ess, and 3D hand pose estimation on EvRealHands jiang23evhandpose. We highly recommend that readers view the high-temporal-resolution results in the supplementary video.

Implementation Details of EvPlug

EvPlug has a simple and generalized architecture. In all tasks, event streams are represented as voxels zhu19evflow ; victor21eventhands, and the event encoder $f_{\text{EvEncoder}}$ shares the same architecture as the image encoder $f_{\text{ImEncoder}}$ except for the channel dimension of the first CNN layer. The fusion module $f_{\text{E-Former}}$ is composed of $N$ sequential transformer decoder layers vaswani17transformer ($N$ is 3 in our experiments). In training, the number of sequential fusion steps $K$ is set to 2, $\lambda_{\text{recon}}$ to 10, and $\lambda_{\text{style}}$ to 10. We use Adam adam15 with a learning rate of 0.001 to optimize $f_{\text{EvEncoder}}$ and $f_{\text{E-Former}}$. Apart from these settings, data processing and technical details are consistent with the RGB-based methods. In our experiments, EvPlug is trained on a single TITAN RTX in two days.
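For readers unfamiliar with the voxel representation mentioned above, the sketch below shows one common way to build it: each event's polarity is shared between its two nearest temporal bins by linear interpolation. The function name, the argument layout, and the choice of interpolation are assumptions for illustration; the exact variant used in the paper is not described in this section.

```python
import numpy as np

def events_to_voxel(xs, ys, ts, ps, num_bins, height, width):
    """Accumulate events (x, y, t, polarity in {-1, +1}) into a (num_bins, H, W) grid."""
    voxel = np.zeros((num_bins, height, width), dtype=np.float64)
    ts = np.asarray(ts, dtype=np.float64)
    if ts.size == 0:
        return voxel
    xs = np.asarray(xs, dtype=np.int64)
    ys = np.asarray(ys, dtype=np.int64)
    ps = np.asarray(ps, dtype=np.float64)
    span = max(ts.max() - ts.min(), 1e-9)            # guard against a zero-length window
    t_norm = (num_bins - 1) * (ts - ts.min()) / span  # normalized time in [0, num_bins - 1]
    lower = np.floor(t_norm).astype(np.int64)
    w_upper = t_norm - lower                          # weight given to the later bin
    upper = np.clip(lower + 1, 0, num_bins - 1)
    # np.add.at accumulates correctly when several events fall into the same cell.
    np.add.at(voxel, (lower, ys, xs), ps * (1.0 - w_upper))
    np.add.at(voxel, (upper, ys, xs), ps * w_upper)
    return voxel

# Three toy events on a 2x2 sensor, binned into 3 temporal channels.
voxel = events_to_voxel(xs=[0, 0, 1], ys=[0, 0, 1], ts=[0.0, 2.5, 10.0],
                        ps=[1, 1, -1], num_bins=3, height=2, width=2)
print(voxel.shape, voxel.sum())
```

The number of temporal bins becomes the input channel count, which is consistent with the event encoder differing from the image encoder only in the channel dimension of its first layer.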

4.1 Object Detection

Experimental Setup

We validate EvPlug on the DSEC-MOD dataset zhou2023rgb for moving object detection, with 10495 frames for training and 2819 frames for evaluation. For the RGB-based backbone of the knowledge distillation methods and the RGB-based methods, we use DETR carion20detr, a widely used backbone for object detection, as a trade-off between accuracy and efficiency. We use the DETR-R50 model pretrained on COCO lin14coco and finetune it on DSEC-MOD zhou2023rgb for 10 epochs with a batch size of 64 to obtain the RGB-based model. For the event-based method, we represent events as voxels as in zhu19evflow and use the same backbone as DETR-R50, which requires training for 30 epochs. For the RGB-event fusion method, we use the released model of RENet zhou2023rgb for evaluation. To evaluate all methods under the same standard, we use the COCO detection metrics lin14coco (reported as %; $\uparrow$ means higher is better, while $\downarrow$ means lower is better).
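To make the COCO metrics concrete: a detection counts as correct only if its intersection over union (IoU) with a ground-truth box reaches a threshold, which is 0.50 for $\text{AP}_{50}$ and 0.75 for $\text{AP}_{75}$, while AP averages over thresholds from 0.50 to 0.95. The sketch below shows only the IoU test with made-up boxes; it is not a full AP implementation, and the helper name is hypothetical.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# COCO's primary AP averages over IoU thresholds 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]

# A made-up prediction that is shifted 10 pixels from its ground-truth box.
pred, gt = (10, 10, 50, 50), (20, 10, 60, 50)
iou = box_iou(pred, gt)
print(iou >= 0.50, iou >= 0.75)
```

In this toy case the prediction would count as a match for $\text{AP}_{50}$ but not for $\text{AP}_{75}$, which is why the stricter metric is more sensitive to localization quality.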

Results

As the quantitative results in Tab. 2 show, EvPlug outperforms the domain adaptation methods hu20nga ; rebecq21e2vid ; messikommer22evtransfer by more than 20 $\text{AP}_{50}$, which can be attributed to its ability to fuse event streams with RGB images rather than relying solely on events for inference. Compared with RENet zhou2023rgb, which designs a specific architecture for object detection and requires training from scratch, EvPlug can utilize models pretrained on a large-scale image dataset lin14coco, resulting in an 18.2 $\text{AP}_{50}$ improvement. EvPlug brings only a 2.0 $\text{AP}_{50}$ improvement over the RGB-based model, because DSEC-MOD zhou2023rgb contains little data from challenging scenes (HDR, fast motion) in which events offer better imaging quality than RGB images. Qualitative results in Fig. 7 show that EvPlug makes the RGB-based model sensitive to small moving objects in the distance, owing to the rich motion information contained in event streams.

Table 2: Quantitative results on object detection and semantic segmentation.

| Methods | Input | Test Time | Detection AP $\uparrow$ | Detection $\text{AP}_{50}$ $\uparrow$ | Detection $\text{AP}_{75}$ $\uparrow$ | Segmentation mIoU $\uparrow$ | Segmentation Acc $\uparrow$ |
|---|---|---|---|---|---|---|---|
| Event-based | $E_{[t_{i-1},t_i]}$ | $t_i$ | 18.5 | 31.4 | 14.1 | 51.2 | 87.7 |
| NGA hu20nga | $E_{[t_{i-1},t_i]}$ | $t_i$ | 19.0 | 32.2 | 15.4 | 50.8 | 87.4 |
| E2VID rebecq21e2vid | $E_{[t_{i-1},t_i]}$ | $t_i$ | 15.3 | 24.5 | 13.6 | 38.7 | 74.6 |
| EvTransfer messikommer22evtransfer | $E_{[t_{i-1},t_i]}$ | $t_i$ | 17.6 | 27.4 | 14.4 | 49.9 | 87.6 |
| ESS sun22ess | $I_{t_i}$, $E_{[t_{i-1},t_i]}$ | $t_i$ | - | - | - | 53.3 | 89.4 |
| RENet zhou2023rgb | $I_{t_i}$, $E_{[t_{i-1},t_i]}$ | $t_i$ | 18.5 | 36.9 | 17.2 | - | - |
| RGB-based | $I_{t_i}$ | $t_i$ | 29.6 | 53.1 | 28.8 | 63.8 | 93.2 |
| EvPlug | $I_{t_i}$, $E_{[t_{i-1},t_i]}$ | $t_i$ | 30.1 (0.5$\uparrow$) | 55.1 (2.0$\uparrow$) | 30.1 (1.3$\uparrow$) | 63.2 | 93.1 |
| EvPlug (w/o $\delta$) | $I_{t_i}$, $E_{[t_{i-1},t_i]}$ | $t_i$ | 29.6 | 54.3 | 28.8 | 63.7 | 93.2 |
| EvPlug | $I_{t_i}$, $E_{[t_{i-1},t_{i+2}]}$ | $t_{i+2}$ | 19.0 | 40.6 | 16.4 | 53.5 | 90.7 |
| EvPlug (w/o iter) | $I_{t_i}$, $E_{[t_{i-1},t_{i+2}]}$ | $t_{i+2}$ | 18.7 | 40.2 | 16.1 | 52.4 | 90.1 |
Figure 7: Qualitative results on DSEC-MOD zhou2023rgb. Green boxes are bounding boxes and gray boxes mark the zoomed-in regions. Columns, left to right: RGB (GT), Event-based carion20detr, NGA hu20nga, E2VID rebecq21e2vid, EvTransfer messikommer22evtransfer, RGB-based carion20detr, EvPlug (w/o $\delta$), EvPlug.

4.2 Semantic Segmentation

Experimental Setup

We validate EvPlug on DSEC-Semantic sun22ess (8082 frames for training, 2809 frames for evaluation) for semantic segmentation. For the RGB-based model, we use the extension of DETR carion20detr to panoptic segmentation Kirillov19panoptic and finetune it on Cityscapes Cordts2016Cityscapes. For the event-based method, we use the supervised Ev-SegNet alonso19evsegnet for comparison. In addition to the domain adaptation methods mentioned above, we compare with ESS sun22ess, which proposes an unsupervised domain adaptation method. We use the same evaluation metrics (mIoU (%) and accuracy (Acc., %)) as ESS sun22ess for a fair comparison.
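As a reminder of how the two segmentation metrics relate, the sketch below derives mIoU and pixel accuracy from a confusion matrix. The helper names and the tiny label arrays are hypothetical, and details such as ignore labels, which real benchmarks usually handle, are left out.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    pred = np.asarray(pred, dtype=np.int64).ravel()
    gt = np.asarray(gt, dtype=np.int64).ravel()
    idx = gt * num_classes + pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_accuracy(conf):
    """Mean IoU over classes that occur, and overall pixel accuracy."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    valid = union > 0  # skip classes absent from both prediction and ground truth
    miou = float(np.mean(tp[valid] / union[valid]))
    acc = float(tp.sum() / conf.sum())
    return miou, acc

# Six pixels, three classes, made-up labels.
gt_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 1, 1, 2, 0]
miou, acc = miou_and_accuracy(confusion_matrix(pred_labels, gt_labels, 3))
print(miou, acc)
```

Pixel accuracy is dominated by frequent classes, whereas mIoU weights every class equally, which is one reason the two columns of a segmentation table can move differently.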

Figure 8: Qualitative results on DSEC-Semantic sun22ess. Columns, left to right: RGB, Events, GT, Ev-SegNet alonso19evsegnet, NGA hu20nga, E2VID rebecq21e2vid, EvTransfer messikommer22evtransfer, DETR carion20detr, EvPlug.

Results

Results in Tab. 2 and Fig. 8 show that EvPlug outperforms the domain adaptation methods hu20nga ; rebecq21e2vid ; sun22ess ; messikommer22evtransfer owing to the fusion of both modalities. However, EvPlug achieves slightly lower accuracy than the RGB-based method. The reason is that color and texture play a significant role in semantic segmentation, and there are few degraded RGB images in DSEC-Semantic sun22ess. As a result, the disruption that event features cause to the RGB feature space outweighs the gain that events bring to RGB images. Results in Tab. 2 (method: EvPlug, test time: $t_{i+2}$) show that when performing semantic segmentation with high temporal resolution, EvPlug outperforms the event-based method alonso19evsegnet by 2.3 % mIoU (53.5 vs. 51.2). Owing to the complementary nature of the RGB image at time $t_i$ and the event stream $E_{[t_i,t_{i+2}]}$, EvPlug achieves higher accuracy than methods based solely on event streams. This implies that EvPlug, at the cost of a small portion of accuracy on RGB images, enables high-temporal-resolution inference for the RGB-based model.

4.3 Hand Pose Estimation

Table 3: Quantitative results on 3D hand pose estimation.

| Methods | Input | Test Time | Normal MPJPE $\downarrow$ | Normal AUC $\uparrow$ | Strong Light MPJPE $\downarrow$ | Strong Light AUC $\uparrow$ | Flash MPJPE $\downarrow$ | Flash AUC $\uparrow$ |
|---|---|---|---|---|---|---|---|---|
| Event-based victor21eventhands | $E_{[t_{i-1},t_i]}$ | $t_i$ | 29.16 | 77.7 | 32.48 | 71.7 | 52.92 | 60.9 |
| NGA hu20nga | $E_{[t_{i-1},t_i]}$ | $t_i$ | 20.48 | 79.6 | 28.37 | 71.8 | 41.06 | 61.4 |
| E2VID rebecq21e2vid | $E_{[t_{i-1},t_i]}$ | $t_i$ | 29.81 | 73.6 | 30.77 | 70.3 | 42.35 | 61.0 |
| EvTransfer messikommer22evtransfer | $E_{[t_{i-1},t_i]}$ | $t_i$ | 22.67 | 77.5 | 29.39 | 70.8 | 41.32 | 60.6 |
| RGB-based cho22fastmetro | $I_{t_i}$ | $t_i$ | 14.31 | 85.6 | 42.43 | 59.1 | 26.31 | 73.9 |
| EvPlug | $I_{t_i}$, $E_{[t_{i-1},t_i]}$ | $t_i$ | 13.78 (0.53$\downarrow$) | 86.2 (0.6$\uparrow$) | 28.16 (14.27$\downarrow$) | 71.9 (12.8$\uparrow$) | 26.00 (0.31$\downarrow$) | 74.3 (0.4$\uparrow$) |
| EvPlug (w/o $\delta$) | $I_{t_i}$, $E_{[t_{i-1},t_i]}$ | $t_i$ | 13.78 | 86.2 | 46.07 | 56.0 | 26.85 | 73.4 |
| EvPlug | $I_{t_i}$, $E_{[t_{i-1},t_{i+2}]}$ | $t_{i+2}$ | 18.28 | 81.8 | 30.10 | 70.8 | 35.34 | 67.3 |
| EvPlug (w/o iter) | $I_{t_i}$, $E_{[t_{i-1},t_{i+2}]}$ | $t_{i+2}$ | 18.39 | 81.6 | 30.25 | 70.0 | 36.12 | 65.4 |

Experimental Setup

We evaluate the performance of EvPlug on hand pose estimation on the EvRealHands dataset jiang23evhandpose, a multi-modal hand dataset consisting of 4452 seconds of event streams from a DAVIS346 (346$\times$260 pixels) and corresponding RGB sequences from FLIR cameras (2660$\times$2300 pixels). We perform comparisons on sequences under different illumination and hand movements: normal, strong light, flash, and fast motion. For the RGB-based baseline and the domain adaptation backbones, we select FastMETRO cho22fastmetro, which achieves high accuracy at low computational cost. For efficiency, we use the lightweight version (a ResNet34 he16resnet backbone and 4 transformer vaswani17transformer layers). For event-based methods, we use EventHands victor21eventhands. We follow the implementation details of EventHands victor21eventhands and FastMETRO cho22fastmetro to pre-process event streams and RGB images. In our setup, RGB images are cropped to 192$\times$192 and event streams to 128$\times$128. We use the common MPJPE (root-aligned mean per joint position error in Euclidean distance, in mm) and AUC (reported as %) metrics for quantitative evaluation.
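The root-aligned MPJPE can be sketched as below: both joint sets are translated so that a root joint sits at the origin, and the remaining per-joint Euclidean errors are averaged. Taking joint 0 as the root and the three-joint toy pose are assumptions for this example only.

```python
import numpy as np

def root_aligned_mpjpe(pred_joints, gt_joints, root_index=0):
    """Mean per joint position error (mm) after aligning both poses at the root joint.

    pred_joints, gt_joints: arrays of shape (J, 3) in millimetres.
    root_index = 0 is an assumed convention for this sketch.
    """
    pred = np.asarray(pred_joints, dtype=np.float64)
    gt = np.asarray(gt_joints, dtype=np.float64)
    pred = pred - pred[root_index]
    gt = gt - gt[root_index]
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

# Toy pose: the prediction is globally shifted, and one joint is off by (3, 4, 0) mm.
gt_pose = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 20.0, 0.0]])
pred_pose = gt_pose + 5.0
pred_pose[2] += np.array([3.0, 4.0, 0.0])
print(root_aligned_mpjpe(pred_pose, gt_pose))
```

Because of the root alignment, the global 5 mm shift does not contribute to the error; only the articulation error of the third joint does.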

Results

As the quantitative results in Tab. 3 show, EvPlug outperforms the domain adaptation methods hu20nga ; rebecq21e2vid ; messikommer22evtransfer, with an MPJPE more than 6 mm lower in normal scenes and 15 mm lower in flash scenes. This is because EvPlug is an event and image fusion method rather than a pure event-based domain adaptation method. The plug-and-play module learned by our method significantly improves the robustness of the RGB-based model in strong light scenes, resulting in a 14.27 mm lower MPJPE in such conditions. This demonstrates that our method can integrate the HDR robustness of event cameras into the RGB-based model. Qualitative results in Fig. 9 show that these domain adaptation methods hu20nga ; rebecq21e2vid ; messikommer22evtransfer cannot handle scenes in which hands are relatively static or the illumination changes, and that RGB-based models cannot achieve robust hand pose estimation when images degrade, for example under overexposure and motion blur. EvPlug performs robust hand pose estimation in these challenging cases, which derives from the complementary use of the two modalities.

Figure 9: Qualitative results on EvRealHands jiang23evhandpose. Columns, left to right: RGB, Events, EventHands victor21eventhands, NGA hu20nga, E2VID rebecq21e2vid, EvTransfer messikommer22evtransfer, FastMETRO cho22fastmetro, EvPlug (w/o $\delta$), EvPlug.

4.4 Ablation Study

In the ablation study, we demonstrate the effectiveness of the two constraints for image-event quality consistency and temporal consistency.

Event-Image Quality Consistency

To validate the effectiveness of the quality constraint, we compare the performance of EvPlug at time $t_i$ with and without incorporating $\delta$ in Eq. (6) (denoted as "w/o $\delta$") in the training procedure. Quantitative results in Tab. 2 and Tab. 3 show that this constraint improves the performance of EvPlug on object detection (DSEC-MOD zhou2023rgb) and 3D hand pose estimation (EvRealHands jiang23evhandpose), especially under strong light (17.91 mm lower MPJPE: 28.16 vs. 46.07 in Tab. 3). However, it results in a small decrease in metrics on semantic segmentation, which mainly stems from the absence of degraded images in the evaluation data. Qualitative results in Fig. 7 and Fig. 9 show that this constraint improves the performance of EvPlug in HDR and fast motion scenes.

Event-Image Temporal Consistency

A direct approach (denoted as "w/o iter") to validate the effectiveness of the temporal consistency constraint is to integrate the features $F_{t_i}$ at time $t_i$ with events during evaluation:

$$F_{t_j} = f_{\text{E-Former}}(F_{t_i}, f_{\text{EvEncoder}}(E_{[t_{j-1},t_j]})), \quad j = i+1, i+2, \ldots, i+K, \tag{13}$$

which is different from Eq. (8). Therefore, we can compare the results at time $t_{i+2}$, given the image $I_{t_i}$ and the event stream $E_{[t_{i-1},t_i]}$, to validate the effectiveness of event-image temporal consistency. Quantitative results in Tab. 2 and Tab. 3 show that this constraint contributes a 0.3 lift in AP on object detection, a 1.1 % lift in mIoU on semantic segmentation, and a 0.1 mm MPJPE improvement on hand pose estimation.
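The difference between the iterative update and the "w/o iter" variant of Eq. (13) can be illustrated with a toy stand-in for the fusion module. Eq. (8) is not reproduced in this section, so the iterative form below (each step consuming the previous fused feature) is an assumption inferred from the description of sequential updating; `toy_fuse` and the scalar "features" are purely hypothetical.

```python
def iterative_fusion(f_ti, event_feats, fuse):
    """Assumed iterative update: each step starts from the previous fused feature."""
    feats, f_prev = [], f_ti
    for ev in event_feats:
        f_prev = fuse(f_prev, ev)
        feats.append(f_prev)
    return feats

def non_iterative_fusion(f_ti, event_feats, fuse):
    """Eq. (13), "w/o iter": every step restarts from the feature at t_i."""
    return [fuse(f_ti, ev) for ev in event_feats]

def toy_fuse(feat, ev):
    """Toy stand-in for the fusion module: the event slice adds its change to the feature."""
    return feat + ev

# Encoded slices for [t_i, t_{i+1}] and [t_{i+1}, t_{i+2}], reduced to scalars.
event_feats = [1.0, 2.0]
print(iterative_fusion(0.0, event_feats, toy_fuse))
print(non_iterative_fusion(0.0, event_feats, toy_fuse))
```

In this toy setting the non-iterative variant loses the change carried by the first slice when predicting at $t_{i+2}$, which gives an intuition for why the iterative scheme scores slightly higher in the ablation.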

5 Conclusion

In this paper, we propose a unified framework called EvPlug that learns a plug-and-play event and image fusion module under supervision from the RGB-based model connected by the event generation model. Our method only requires unlabeled image-event data pairs, which need not be strictly pixel-aligned, for training. Working as a plug-in for the RGB-based model, EvPlug endows the RGB-based model with robustness to HDR imaging and enables high-temporal-resolution inference. We demonstrate the effectiveness of EvPlug on three vision tasks: object detection, semantic segmentation, and hand pose estimation.

Limitations

In our fusion model $f_{\text{E-Former}}$, when the scale of the feature map is large, the computational cost of the transformer-based fusion module can be quite high. Additionally, although $f_{\text{E-Former}}$ successfully bridges the temporal consistency between neighboring images and events, it cannot model long-term consistency across multiple images. Therefore, designing a fusion framework that can model long-term temporal consistency with low computational cost is a promising research direction.
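A rough way to see the first limitation: if every spatial location of a feature map is treated as one token (an assumption, since the tokenization is not described here), the attention score matrix between image tokens and event tokens grows with the product of the two token counts. The helper below is a hypothetical back-of-the-envelope calculation, not a measurement of EvPlug.

```python
def attention_score_entries(height, width, event_height=None, event_width=None):
    """Entries in one cross-attention score matrix (queries x keys).

    Assumes one token per spatial location; the event feature map defaults
    to the same size as the image feature map.
    """
    n_query = height * width
    n_key = (event_height or height) * (event_width or width)
    return n_query * n_key

for side in (16, 32, 64):
    print(side, attention_score_entries(side, side))
```

Under this assumption, doubling the side length of a square feature map multiplies the number of score entries by 16, which is why large feature maps become expensive for attention-based fusion.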

References

  • (1) Iñigo Alonso and Ana C. Murillo. Ev-SegNet: Semantic segmentation for event-based cameras. In CVPR Workshops, 2019.
  • (2) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • (3) Junhyeong Cho, Kim Youwang, and Tae-Hyun Oh. Cross-attention of disentangled modalities for 3D human mesh recovery with transformers. In ECCV, 2022.
  • (4) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • (5) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • (6) Yongjian Deng, Hao Chen, Huiying Chen, and Youfu Li. Learning from images: A distillation learning framework for event cameras. IEEE TIP, 2021.
  • (7) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • (8) Peiqi Duan, Zihao Wang, Boxin Shi, Oliver Cossairt, Tiejun Huang, and Aggelos Katsaggelos. Guided Event Filtering: Synergy between intensity images and neuromorphic events for high performance imaging. IEEE TPAMI, 2021.
  • (9) Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Image Analysis, 2003.
  • (10) Guillermo Gallego, Tobi Delbrück, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J. Davison, Jörg Conradt, Kostas Daniilidis, and Davide Scaramuzza. Event-based vision: A survey. IEEE TPAMI, 44(1):154–180, 2022.
  • (11) Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
  • (12) Daniel Gehrig, Mathias Gehrig, Javier Hidalgo-Carrió, and Davide Scaramuzza. Video to events: Recycling video datasets for event cameras. In CVPR, 2020.
  • (13) Daniel Gehrig, Michelle Rüegg, Mathias Gehrig, Javier Hidalgo-Carrió, and Davide Scaramuzza. Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction. IRAL, 2021.
  • (14) Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. DSEC: A stereo event camera dataset for driving scenarios. IRAL, 2021.
  • (15) Jin Han, Yixin Yang, Chu Zhou, Chao Xu, and Boxin Shi. EvIntSR-Net: Event guided multiple latent frames reconstruction and super-resolution. In ICCV, 2021.
  • (16) Jin Han, Chu Zhou, Peiqi Duan, Yehui Tang, Chang Xu, Chao Xu, Tiejun Huang, and Boxin Shi. Neuromorphic camera guided high dynamic range imaging. In CVPR, 2020.
  • (17) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • (18) Javier Hidalgo-Carrió, Guillermo Gallego, and Davide Scaramuzza. Event-aided direct sparse odometry. In CVPR, 2022.
  • (19) Yuhuang Hu, Tobi Delbrück, and Shih-Chii Liu. Learning to exploit multiple vision modalities by using grafted networks. In ECCV, 2020.
  • (20) Jianping Jiang, Jiahe Li, Baowen Zhang, Xiaoming Deng, and Boxin Shi. EvHandPose: Event-based 3D hand pose estimation with sparse supervision. In arXiv:2303.02862, 2023.
  • (21) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • (22) Alexander Kirillov, Kaiming He, Ross B. Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
  • (23) Patrick Lichtsteiner, Christoph Posch, and Tobi Delbrück. A 128×128 120 dB 15 $\mu$s latency asynchronous temporal contrast vision sensor. JSSC, 2008.
  • (24) Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014.
  • (25) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  • (26) Nico Messikommer, Daniel Gehrig, Mathias Gehrig, and Davide Scaramuzza. Bridging the gap between events and frames through unsupervised domain adaptation. IRAL, 2022.
  • (27) Nico Messikommer, Stamatios Georgoulis, Daniel Gehrig, Stepan Tulyakov, Julius Erbach, Alfredo Bochicchio, Yuanyou Li, and Davide Scaramuzza. Multi-bracket high dynamic range imaging with event cameras. In CVPR, 2022.
  • (28) Anton Mitrokhin, Cornelia Fermüller, Chethan Parameshwara, and Yiannis Aloimonos. Event-based moving object detection and tracking. In IROS, 2018.
  • (29) Anton Mitrokhin, Zhiyuan Hua, Cornelia Fermüller, and Yiannis Aloimonos. Learning visual motion segmentation using event surfaces. In CVPR, 2020.
  • (30) S. Mohammad Mostafavi I., Jonghyun Choi, and Kuk-Jin Yoon. Learning to super resolve intensity images from events. In CVPR, 2020.
  • (31) Liyuan Pan, Cedric Scheerlinck, Xin Yu, Richard Hartley, Miaomiao Liu, and Yuchao Dai. Bringing a blurry frame alive at high frame-rate with an event camera. In CVPR, 2019.
  • (32) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
  • (33) Etienne Perot, Pierre de Tournemire, Davide Nitti, Jonathan Masci, and Amos Sironi. Learning to detect objects with a 1 megapixel event camera. In NeurIPS, 2020.
  • (34) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • (35) Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera. IEEE TPAMI, 2021.
  • (36) Viktor Rudnev, Vladislav Golyanik, Jiayi Wang, Hans-Peter Seidel, Franziska Mueller, Mohamed Elgharib, and Christian Theobalt. EventHands: Real-time neural 3D hand pose estimation from an event stream. In ICCV, 2021.
  • (37) Bongki Son, Yunjae Suh, Sungho Kim, Heejae Jung, Jun-Seok Kim, Changwoo Shin, Keunju Park, Kyoobin Lee, Jinman Park, Jooyeon Woo, et al. A 640×480 dynamic vision sensor with a 9 μm pixel and 300 Meps address-event representation. In ISSCC, pages 66–67, 2017.
  • (38) Timo Stoffregen, Guillermo Gallego, Tom Drummond, Lindsay Kleeman, and Davide Scaramuzza. Event-based motion segmentation by motion compensation. In ICCV, 2019.
  • (39) Lei Sun, Christos Sakaridis, Jingyun Liang, Qi Jiang, Kailun Yang, Peng Sun, Yaozu Ye, Kaiwei Wang, and Luc Van Gool. Event-based fusion for motion deblurring with cross-modal attention. In ECCV, 2022.
  • (40) Zhaoning Sun, Nico Messikommer, Daniel Gehrig, and Davide Scaramuzza. ESS: Learning event-based semantic segmentation from still images. In ECCV, 2022.
  • (41) Gemma Taverni, Diederik Paul Moeys, Chenghan Li, Celso Cavaco, Vasyl Motsnyi, David San Segundo Bello, and Tobi Delbrück. Front and back illuminated dynamic and active pixel vision sensors comparison. IEEE TCAS-II, 65(5):677–681, 2018.
  • (42) Minggui Teng, Chu Zhou, Hanyue Lou, and Boxin Shi. NEST: Neural event stack for event-based image enhancement. In ECCV, 2022.
  • (43) Abhishek Tomy, Anshul Paigwar, Khushdeep Singh Mann, Alessandro Renzaglia, and Christian Laugier. Fusing event-based and RGB camera for robust object detection in adverse conditions. In ICRA, 2022.
  • (44) Stepan Tulyakov, Alfredo Bochicchio, Daniel Gehrig, Stamatios Georgoulis, Yuanyou Li, and Davide Scaramuzza. Time Lens++: Event-based frame interpolation with parametric non-linear flow and multi-scale fusion. In CVPR, 2022.
  • (45) Stepan Tulyakov, Daniel Gehrig, Stamatios Georgoulis, Julius Erbach, Mathias Gehrig, Yuanyou Li, and Davide Scaramuzza. Time Lens: Event-based video frame interpolation. In CVPR, 2021.
  • (46) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • (47) Lin Wang, Yujeong Chae, and Kuk-Jin Yoon. Dual transfer learning for event-based end-task prediction via pluggable event to image translation. In ICCV, 2021.
  • (48) Lin Wang, Yujeong Chae, Sung-Hoon Yoon, Tae-Kyun Kim, and Kuk-Jin Yoon. EvDistill: Asynchronous events to end-task learning via bidirectional reconstruction-guided cross-modal knowledge distillation. In CVPR, 2021.
  • (49) Lin Wang and Kuk-Jin Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE TPAMI, 2022.
  • (50) Zihao W. Wang, Peiqi Duan, Oliver Cossairt, Aggelos K. Katsaggelos, Tiejun Huang, and Boxin Shi. Joint filtering of intensity images and neuromorphic events for high-resolution noise-robust imaging. In CVPR, 2020.
  • (51) Lan Xu, Weipeng Xu, Vladislav Golyanik, Marc Habermann, Lu Fang, and Christian Theobalt. EventCap: Monocular 3D capture of high-speed human motions using an event camera. In CVPR, 2020.
  • (52) Zhuyun Zhou, Zongwei Wu, Rémi Boutteau, Fan Yang, Cédric Demonceaux, and Dominique Ginhac. RGB-event fusion for moving object detection in autonomous driving. In ICRA, 2023.
  • (53) Alex Zihao Zhu, Ziyun Wang, Kaung Khant, and Kostas Daniilidis. EventGAN: Leveraging large scale image datasets for event cameras. In ICCP, 2021.
  • (54) Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Ev-FlowNet: Self-supervised optical flow estimation for event-based cameras. In RSS, 2018.
  • (55) Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Unsupervised event-based learning of optical flow, depth, and egomotion. In CVPR, 2019.
  • (56) Shihao Zou, Chuan Guo, Xinxin Zuo, Sen Wang, Pengyu Wang, Xiaoqin Hu, Shoushun Chen, Minglun Gong, and Li Cheng. EventHPE: Event-based 3D human pose and shape estimation. In ICCV, 2021.