Spann3R

3D Reconstruction with Spatial Memory

University College London

arXiv 2024

Spann3R achieves incremental reconstruction from uncalibrated image collections with a single forward pass through our transformer-based architecture (no test-time optimisation, no camera pose estimation).


Abstract

We present Spann3R, a novel approach for dense 3D reconstruction from ordered or unordered image collections.

Built on the DUSt3R paradigm, Spann3R uses a transformer-based architecture to directly regress pointmaps from images without any prior knowledge of the scene or camera parameters. Unlike DUSt3R, which predicts a pointmap for each image pair expressed in that pair's local coordinate frame, Spann3R predicts per-image pointmaps expressed in a common global coordinate system, thus eliminating the need for optimisation-based global alignment.

The key idea of Spann3R is to manage an external spatial memory that learns to keep track of all previous relevant 3D information. Spann3R then queries this spatial memory to predict the 3D structure of the next frame in a global coordinate system. Taking advantage of DUSt3R's pre-trained weights and further fine-tuning on a subset of datasets, Spann3R shows competitive performance and generalization ability on various unseen datasets and can process ordered image collections in real time.
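Conceptually, the memory query is an attention read: the query tokens of the incoming frame attend over key/value tokens accumulated from all previously seen frames. Below is a minimal sketch of such a read, assuming a single-head scaled dot-product formulation and illustrative tensor shapes; the model's actual memory attention differs in its details.

    import torch.nn.functional as F

    def memory_read(query, mem_keys, mem_values):
        """Scaled dot-product attention read from the spatial memory (sketch).

        query:      (N, C) query tokens for the incoming frame
        mem_keys:   (M, C) keys accumulated from all previously seen frames
        mem_values: (M, C) values holding the encoded 3D information
        Returns an (N, C) fused feature that mixes relevant past 3D context
        into each query token.
        """
        attn = F.softmax(query @ mem_keys.T / query.shape[-1] ** 0.5, dim=-1)
        return attn @ mem_values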


Video

Pipeline

Given a sequence of images, Spann3R maps each image to its corresponding pointmap, expressed in the coordinate system of the initial frame. Following DUSt3R, we use a ViT encoder with two intertwined decoders; an additional lightweight memory encoder encodes previous 3D information into the spatial memory. At each time step, Spann3R takes a new frame and the previous query feature as input. The query feature reads from the spatial memory to produce a fused feature. The fused feature and the visual feature of the new frame are fed into the reference and target decoders, respectively. The pointmap and feature from the reference decoder are encoded into the spatial memory, while the feature from the target decoder is kept as the query feature for the next time step. A code sketch of this loop follows below.
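To make the data flow concrete, here is a minimal sketch of the per-frame loop in PyTorch-style Python. All module arguments (encoder, memory_encoder, ref_decoder, tgt_decoder) are hypothetical stand-ins for the networks described above, and the single-head attention read is a simplification of the model's memory attention.

    import torch
    import torch.nn.functional as F

    def incremental_reconstruction(frames, encoder, memory_encoder,
                                   ref_decoder, tgt_decoder):
        # Sketch of Spann3R's per-frame loop; the module arguments are
        # assumed to be torch.nn.Module callables standing in for the
        # networks described above.
        mem_k, mem_v = None, None     # spatial memory: key/value token banks
        query, pointmaps = None, []
        for frame in frames:
            visual = encoder(frame)   # ViT visual tokens, shape (N, C)
            if query is None:         # first frame: memory is still empty
                fused = visual
            else:                     # attention read from the spatial memory
                attn = F.softmax(query @ mem_k.T / visual.shape[-1] ** 0.5,
                                 dim=-1)
                fused = attn @ mem_v  # (N, C) fused feature
            # Intertwined decoders: their blocks cross-attend to each other.
            ref_feat, pointmap = ref_decoder(fused, visual)  # global pointmap
            query = tgt_decoder(visual, fused)  # kept as next step's query
            # Encode the reference output into the spatial memory.
            k, v = memory_encoder(ref_feat, pointmap)
            mem_k = k if mem_k is None else torch.cat([mem_k, k])
            mem_v = v if mem_v is None else torch.cat([mem_v, v])
            pointmaps.append(pointmap)
        return pointmaps  # all expressed in the first frame's coordinates

Because each pointmap is regressed directly in the first frame's coordinate system, the outputs can be concatenated into a single reconstruction with no test-time optimisation or global alignment step.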


Reconstruction showcase

Visualization of Attention Maps

HOVER over the query image (left) to see the attention map corresponding to each patch.
CLICK the query image (left) to overlay the attention maps on the corresponding image.

BibTeX


      @article{wang2024spann3r,
        title={3D Reconstruction with Spatial Memory},
        author={Wang, Hengyi and Agapito, Lourdes},
        journal={arXiv preprint arXiv:2408.16061},
        year={2024}
      }
    

Acknowledgement

Research presented here has been supported by the UCL Centre for Doctoral Training in Foundational AI under UKRI grant number EP/S021566/1. This project made use of time on the Tier 2 HPC facility JADE2, funded by EPSRC (EP/T022205/1). Hengyi Wang was supported by a sponsored research award from Cisco Research.