WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

Soyong Shin1   Juyong Kim1   Eni Halilaj1   Michael J. Black2
1Carnegie Mellon University       2Max Planck Institute for Intelligent Systems
Abstract

The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations. First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines, limiting their use to offline applications. Finally, existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion), which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features, integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body’s global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. Code is available for research purposes at https://wham.is.tue.mpg.de/.


Figure 1: WHAM: World-grounded Humans with Accurate Motion. State-of-the-art methods like TRACE [45] and SLAHMR [55] fail to capture global 3D human trajectories accurately when given in-the-wild videos captured using a moving camera, producing implausible world-grounded motion (e.g., foot sliding). To address this, WHAM uses two novel strategies: (1) feature integration from 2D keypoints and pixels to reconstruct precise and pixel-aligned 3D human motion and (2) contact-aware trajectory recovery to place the human in a global coordinate system without foot sliding. Gray dots show the ground-truth global trajectory. See Supplemental Video.

1 Introduction

Our goal is to accurately estimate the 3D pose and shape of a person from monocular video. This is a longstanding problem and, while the field has made rapid progress, several key challenges remain. First, human motion should be computed in a consistent global coordinate system. Second, the method should be computationally efficient. Third, the results should be accurate, temporally smooth, detailed, natural looking, and have realistic foot-ground contact. Fourth, the capture should work with an arbitrary moving camera. These constraints need to be satisfied to make markerless human motion capture widely available for applications in gaming, AR/VR, autonomous driving, sports analysis, and human-robot interaction. We address these challenges with WHAM (World-grounded Humans with Accurate Motion), which enables fast and accurate recovery of 3D human motion from a moving camera; see Fig. 1.

It seems natural that, in estimating 3D humans from video, we should be able to exploit the temporal nature of video. Counter-intuitively, existing video-based methods for 3D human pose and shape (HPS) estimation [15, 17, 30, 6, 52, 43] are less accurate than the best single-frame methods [14, 21, 18, 60, 23, 60, 25, 26, 7]. This may be an issue of training data. There are large datasets of single images with ground-truth 3D human poses containing a diversity of body shapes, poses, backgrounds, lighting, etc. In contrast, video datasets with ground truth are much more limited.

To address this, WHAM leverages both the large-scale AMASS motion capture (mocap) dataset [32] and video datasets. Our key idea is to learn about 3D human motion from AMASS and then learn to fuse this information with temporal image cues from video, getting the best of both. Similar to previous work [62, 61], we use AMASS to generate pairs of synthetic 2D keypoint sequences and ground-truth motion sequences, from which we pretrain a motion encoder, which captures the motion context, and a decoder, which lifts sequences of 2D keypoints to sequences of 3D poses. Given the robustness of recent 2D keypoint detection models [54, 59], our pretrained model does a good job of predicting human pose from video.

Keypoints alone, however, are too sparse for accurate 3D mesh estimation. To improve accuracy, we jointly train a feature integrator network that merges information from video and 2D-keypoint sequences. We use a pretrained image encoder from previous work [21, 7, 25, 1] and train the feature integrator using video datasets [49, 11, 15, 33]. This integration process supplements the motion context extracted from the sparse 2D keypoints with dense visual context, significantly improving the recovered pose and shape accuracy.

While the above approach produces accurate motion, we want this motion in global coordinates, unlike most previous methods that compute the body in camera coordinates. Estimating the global human trajectory is challenging when the camera is moving because the motions of the body and the camera are entangled. Recent work addresses this with optimization fitting based on a learned human motion prior and camera information from SLAM methods [55, 41, 20] or dense 3D scene information from COLMAP [28]. However, these methods are computationally expensive and far from real time. Recent regression-based methods are faster but either constrain the problem with static or known camera conditions [61, 42] or have temporal jitter and limited accuracy [45]. We tackle this challenge with two additional modules. First, we predict the global orientation and root velocity of the human from the sequence of 2D keypoints by training a global trajectory decoder. Specifically, we concatenate the camera’s angular velocity to the context and train the global trajectory decoder to recursively predict the current orientation and root velocity, effectively factoring camera motion from human motion. WHAM takes the camera’s angular velocity either from the output of a SLAM method or from a camera’s gyroscope when available.

The above solution relies on knowledge of human motion learned from AMASS. Therefore, it can fail to capture elevation changes when the surface is not flat, e.g., when ascending stairs, because AMASS has a limited amount of such data. To address this, we introduce foot contact as an additional explicit form of motion context. We train WHAM to predict the likelihood of foot-ground contact using estimated contact labels from both AMASS and 3D video datasets. We then train a trajectory refinement network that outputs an update to the root orientation and velocity based on the foot contact and foot velocity. This refinement enables WHAM to accurately estimate human motion in a global coordinate system even when the terrain is not flat.

WHAM has very low computational overhead because it is an on-line algorithm that recursively predicts the pose, shape, and global motion parameters. The network, excluding preprocessing (bounding box detection, keypoint detection, and person identification), runs at 200 fps, significantly faster than prior methods. Also, despite not using global optimization like [55], we obtain accurate 3D camera trajectories and global body motions with minimal drift. Through extensive comparisons on multiple in-the-wild datasets as well as detailed ablation studies, we find that WHAM achieves state-of-the-art (SOTA) accuracy on 3D human pose estimation as well as global trajectory estimation (see Fig. 1).

In summary, in this paper we: (1) introduce the first approach to effectively fuse 3D human motion context and video context for 3D HPS regression; (2) propose a novel global trajectory estimation framework that leverages motion context and foot contact to effectively address foot sliding and enable the 3D tracking of people on non-planar surfaces; (3) efficiently perform HPS regression in global coordinates; (4) achieve state-of-the-art (SOTA) performance on multiple in-the-wild benchmark datasets (3DPW [49], RICH [10], EMDB [16]). WHAM is the first video-based method to outperform all image-based and video-based methods on per-frame accuracy while maintaining temporal smoothness.

2 Related Work

Image-based 3D HPS Estimation. There are two broad classes of methods for recovering 3D HPS from images: model-free [34, 22, 27] and model-based [14, 21, 7, 13, 12, 35, 25, 36]. Here we focus on model-based methods, which estimate the low-dimensional parameters of a statistical body model [29, 38, 37, 53]. While early work explores optimization-based methods [2], here we focus on direct regression methods based on deep learning.

Many existing methods follow the architecture of HMR [14], which uses a pretrained backbone to predict image features followed by a multilayer perceptron (MLP) that regresses SMPL [29] pose parameters from image features. Training such networks typically leverages paired images with SMPL parameters; these parameters are often pseudo-groundtruth (p-GT), estimated from 2D keypoints [21, 12, 35, 36, 25]. Other architectures for HPS regression have also been proposed [18, 23, 60, 24, 7, 19]. None of these methods use video or estimate the body in global coordinates. While quite accurate, when these image-based models are applied independently to frames of a video sequence, the shape and pose can be temporally inconsistent. In contrast, WHAM effectively aggregates temporal information to provide frame-accurate and temporally-coherent 3D HPS estimation.

Video-Based 3D HPS. Video-based methods typically encode temporal information by combining static features extracted by a backbone from each frame. HMMR [15] uses a convolutional encoder, while VIBE [17] and MEVA [30] employ recurrent neural networks. TCMR [6] divides sequences into past, future, and whole frames, aggregating information to strongly constrain the output with motion consistency. MPS-Net [52] uses attention to capture non-local motion context and a hierarchical architecture to aggregate temporal features. Both MAED [50] and GLoT [43] use transformer architectures [48] to encode videos. MAED encodes videos in both temporal (across frames) and spatial (within each frame) dimensions and leverages the kinematic tree to iteratively regress each joint angle. GLoT encodes long-term temporal correlations and refines local details by focusing on nearby frames. Despite integrating information across frames, all existing video-based methods have lower accuracy than the best single-frame methods.

Given limited video training data with ground truth SMPL poses, several single-frame methods infer a mesh from 2D/3D keypoints [5, 8, 39, 31, 34] and use the keypoints as a proxy for training. Another set of approaches exploits 3D mocap data, which is plentiful [32], to train a network to lift 2D joints to 3D poses, which are used as a proxy for ground truth 3D. MotionBERT [62] synthesizes 2D keypoints through orthographic projection to learn a unified motion representation. ProxyCap [61] projects synthetic 3D keypoints into virtual cameras using a heuristic camera pose distribution. Despite benefiting from the scale of mocap datasets, these approaches do not fully utilize the visual information available in the video at test time. Here, we propose a combined network architecture and training strategy that leverages both proxy representations of human pose (lifting) and visual context extracted from video.

Global 3D Motion Estimation with Dense Sensors. Several methods augment video data with other sensors to estimate 3D HPS in world coordinates. The 3DPW dataset [49] employs pre-calibrated body-worn inertial sensors and a handheld camera to jointly optimize the camera and human motion in challenging environments. Similarly, the EMDB dataset [16] uses electromagnetic sensors with an RGB-D camera, enabling accurate human motion capture in the world. While body-worn sensors aid global human motion reconstruction, they are intrusive, require cooperation, and do not help with archival video. BodySLAM++ [9] uses an optimization method with a visual-inertial sensor, comprising stereo cameras and an IMU. In contrast, we use a standard monocular camera, balancing accessibility and accuracy without using specialized equipment. While WHAM can take the camera gyro as input, this is not required.

Monocular Global 3D Human Trajectory Estimation. Estimating the global human trajectory from a monocular dynamic camera is challenging. Previous work relies on learned prior distributions of human motion to separate human motion from camera motion. GLAMR [58] computes the global trajectory based on a predicted and infilled 3D motion sequence and optimizes it across multiple individuals in the scene. However, since GLAMR does not consider camera motion cues, the output trajectory may be noisy when the camera is rotating. SLAHMR [55] and PACE [20] use off-the-shelf SLAM algorithms [46, 47] and jointly optimize the camera and human motion to minimize the negative log-likelihood of a learned motion prior [40]. While they achieve good results, their optimization approach is computationally expensive. TRACE [45] is a pure regression method that utilizes optical flow as a motion cue and estimates multiple people at once, but lacks temporal consistency. GloPro [42] regresses the uncertainty of the global human motion in real-time, but requires known camera poses. In contrast, WHAM leverages both explicit and implicit prior knowledge of human motion and efficiently reconstructs accurate and temporally coherent 3D human motion in world coordinates.


Figure 2: An Overview of WHAM. WHAM takes the sequence of 2D keypoints estimated by a pretrained detector and encodes it into a motion feature. WHAM then updates the motion feature using another sequence of image features extracted from the image encoder through the feature integrator. From the updated motion feature, the Local Motion Decoder estimates 3D motion in the camera coordinate system and foot-ground contact probability. The Trajectory Decoder takes the motion feature and camera angular velocity to initially estimate the global root orientation and egocentric velocity, which are then updated through the Trajectory Refiner using the foot-ground contact. The final output of WHAM is pixel-aligned 3D human motion with the 3D trajectory in the global coordinates.

3 Methods

3.1 Overview

An overview of our World-grounded Humans with Accurate Motion (WHAM) framework is illustrated in Fig. 2. The input to WHAM is raw video data $\{I^{(t)}\}_{t=0}^{T}$, captured by a camera with possibly unknown motion. Our goal is to predict the corresponding sequence of SMPL model parameters $\{\Theta^{(t)}\}_{t=0}^{T}$, as well as the root orientation $\{\Gamma^{(t)}\}_{t=0}^{T}$ and translation $\{\tau^{(t)}\}_{t=0}^{T}$, expressed in the world coordinate system.
We use ViTPose [54] to detect 2D keypoints $\{x^{(t)}_{2D}\}_{t=0}^{T}$, from which we obtain motion features $\{\phi_m^{(t)}\}_{t=0}^{T}$ using the motion encoder. Additionally, we use a pretrained image encoder [21, 25, 7] to extract static image features $\{\phi_i^{(t)}\}_{t=0}^{T}$ and integrate them with $\{\phi_m^{(t)}\}_{t=0}^{T}$ to obtain fine-grained motion features $\{\hat{\phi}_m^{(t)}\}_{t=0}^{T}$, from which we regress 3D human motion in the world coordinate system.

3.2 Network Architecture

Uni-directional Motion Encoder and Decoder. In contrast to existing methods [6, 30, 52, 43, 62], which use windows with a fixed time duration, we use uni-directional recurrent neural networks (RNN) for the motion encoder and decoder, making WHAM suitable for online inference. The objective of the motion encoder, $E_M$, is to extract the motion context, $\phi_m^{(t)}$, from the current and previous sequence of 2D keypoints and the initial hidden state, $h_E^{(0)}$:

$$\phi_m^{(t)} = E_M\big(x_{2D}^{(0)}, x_{2D}^{(1)}, \ldots, x_{2D}^{(t)} \,|\, h_E^{(0)}\big).$$
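The causal structure of this equation can be illustrated with a minimal sketch. It assumes a plain Elman-style recurrent cell with arbitrary toy weights; the text only states that the encoder is a uni-directional RNN, so the cell type, the dimensions, and the names `rnn_step` and `encode_online` are hypothetical. The point it shows is that the feature at frame t depends only on frames 0..t and the initial hidden state, which is what permits online inference.

```python
import math


def rnn_step(x, h, W_x, W_h):
    """One Elman-style update: h_new = tanh(W_x @ x + W_h @ h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(wx_row, x))
            + sum(w * hi for w, hi in zip(wh_row, h))
        )
        for wx_row, wh_row in zip(W_x, W_h)
    ]


def encode_online(keypoint_frames, h0, W_x, W_h):
    """Return one feature per frame; feature t depends only on frames 0..t and h0."""
    h = list(h0)
    features = []
    for x in keypoint_frames:
        h = rnn_step(x, h, W_x, W_h)
        features.append(h)
    return features


# Toy example: 2 input values per frame, 2 hidden units (weights are arbitrary).
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.9, 0.0], [0.0, 0.9]]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h0 = [0.0, 0.0]

full = encode_online(frames, h0, W_x, W_h)
prefix = encode_online(frames[:2], h0, W_x, W_h)
```

In this sketch, the features computed from the first two frames alone (`prefix`) match the first two features of the full sequence (`full`), because later frames never influence earlier outputs.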

We normalize keypoints to a bounding box around the person and concatenate the box’s center and scale to the keypoints, similar to CLIFF [25]. The role of the motion decoder, $D_M$, is to recover SMPL parameters, $(\theta, \beta)$, weak-perspective camera translation, $c$, and foot-ground contact probability, $p$, from the motion feature history:

$$\big(\theta^{(t)}, \beta^{(t)}, c^{(t)}, p^{(t)}\big) = D_M\big(\hat{\phi}_m^{(0)}, \ldots, \hat{\phi}_m^{(t)} \,|\, h_D^{(0)}\big).$$

Here, $\hat{\phi}_m^{(t)}$ is the motion feature integrated with the image feature $\phi_i^{(t)}$ (described below). During pretraining on synthetic data, the image feature is not available and we set $\hat{\phi}_m^{(t)} = \phi_m^{(t)}$. As the encoder and decoder are tasked with reconstructing a dense 3D representation $\Theta$ from a sparse 2D input signal $x_{2D}$, we design an intermediate task to predict the 3D keypoints $x_{3D}$ and use them as the intermediate motion representation. This cascaded approach guides $\phi_m$ to represent the implicit context of motion and the 3D spatial structure of the body. Similar to PIP [57], we use Neural Initialization, in which an MLP initializes the hidden states of the motion encoder and decoder, $\big(h_E^{(0)}, h_D^{(0)}\big)$; see Section 4.3 and Sup. Mat. for details.
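The keypoint normalization mentioned at the start of this subsection (box-relative keypoints with the box center and scale appended) might look roughly like the sketch below. The details are assumptions: a square box whose side length is `bbox_scale` pixels, keypoints mapped so that the box spans [-1, 1], and the raw center and scale appended without further normalization. The function name `normalize_keypoints` is hypothetical, and the exact convention used by WHAM or CLIFF may differ.

```python
def normalize_keypoints(keypoints, bbox_center, bbox_scale):
    """Express pixel keypoints relative to a square box, then append center and scale.

    keypoints:   list of (x, y) pixel coordinates
    bbox_center: (cx, cy) box center in pixels
    bbox_scale:  side length of the square box in pixels (assumed convention)
    """
    cx, cy = bbox_center
    half = bbox_scale / 2.0
    flat = []
    for x, y in keypoints:
        # Points on the box border map to -1 or +1.
        flat.extend([(x - cx) / half, (y - cy) / half])
    # Keep the box location so image-plane position is not lost.
    flat.extend([cx, cy, bbox_scale])
    return flat


example = normalize_keypoints(
    [(100.0, 200.0), (150.0, 250.0)], bbox_center=(125.0, 225.0), bbox_scale=50.0
)
```

Appending the center and scale matters because cropping to the box alone discards where the person sits in the full frame.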

Motion and Visual Feature Integrator. We use the AMASS dataset to synthetically generate 2D sequences by projecting 3D SMPL joints into images with varied camera motions. This provides effectively limitless training data that is far more diverse than existing video datasets that contain ground truth 3D pose and shape. Although we leverage the temporal human motion context, lifting 2D keypoints to 3D meshes is an ambiguous task. A key idea is to augment this 2D keypoint information with image cues that can help disambiguate the 3D pose. Specifically, we use an image encoder [21, 25, 1, 7], pretrained on the human mesh recovery task, to extract image features $\phi_i$, which contain dense visual contextual information related to the 3D human pose and shape. We then train a feature integrator network, $F_I$, to combine $\phi_m$ with $\phi_i$, integrating motion and visual context. The feature integrator uses a simple yet effective residual connection:

$$\hat{\phi}_m^{(t)} = \phi_m^{(t)} + F_I\Big(\text{concat}\big(\phi_m^{(t)}, \phi_i^{(t)}\big)\Big).$$

This supplements motion features pretrained on the 2D-to-3D lifting task using AMASS with visual context, resulting in enriched motion features that use image evidence to help disambiguate the task.
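As a rough illustration of the residual connection, the sketch below stands in a single linear layer for $F_I$; the actual architecture of the feature integrator is not specified in this section, so the layer, the toy dimensions, and the names `linear` and `integrate_features` are assumptions.

```python
def linear(x, W, b):
    """Affine map W @ x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def integrate_features(phi_m, phi_i, W, b):
    """phi_hat = phi_m + F_I(concat(phi_m, phi_i)), with F_I a single linear layer here."""
    fused = linear(phi_m + phi_i, W, b)  # list '+' concatenates the two features
    return [m + f for m, f in zip(phi_m, fused)]


phi_m = [1.0, 2.0]  # toy motion feature
phi_i = [3.0]       # toy image feature

# With all-zero weights the integrator returns the motion feature unchanged.
zero_W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
unchanged = integrate_features(phi_m, phi_i, zero_W, [0.0, 0.0])

# A non-zero example: the image feature shifts the first motion channel.
W = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
updated = integrate_features(phi_m, phi_i, W, [0.0, 0.5])
```

One property of this form is that an integrator whose output is zero leaves the motion feature untouched, which agrees with the pretraining convention of setting $\hat{\phi}_m^{(t)} = \phi_m^{(t)}$ when no image feature is available.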

Global Trajectory Decoder. We design an additional decoder, $D_T$, to predict the rough global root orientation $\Gamma_0^{(t)}$ and root velocity $v_0^{(t)}$ from the motion feature $\phi_m^{(t)}$. Since $\phi_m$ is derived from the input signals in the camera coordinates, it is highly challenging to decouple the human and camera motion from it. To address this ambiguity, we append the angular velocity of the camera, $\omega^{(t)}$, to the motion feature, $\phi_m^{(t)}$, to create a camera-agnostic motion context. This design choice makes WHAM compatible with both off-the-shelf SLAM algorithms [47, 46] and gyroscope measurements that are widely available from modern digital cameras. We recursively predict global orientation, $\Gamma_0^{(t)}$, using the uni-directional RNN:

$$\big(\Gamma_0^{(t)}, v_0^{(t)}\big) = D_T\big(\phi_m^{(0)}, \omega^{(0)}, \ldots, \phi_m^{(t)}, \omega^{(t)}\big).$$

Contact Aware Trajectory Refinement. Good 3D motion in world coordinates in most scenarios implies accurate foot-ground contact without sliding. We want WHAM to generalize beyond flat ground planes, which are typically assumed in prior work. Specifically, our new trajectory refiner aims to resolve foot sliding and enables WHAM to generalize well to diverse motions, including climbing stairs. The refinement involves two stages. First, we adjust the ego-centric root velocity to $\tilde{v}^{(t)}$ to minimize foot sliding, based on the foot-ground contact probability $p^{(t)}$, estimated from the motion decoder $D_M$:

$$\tilde{v}^{(t)} = v_0^{(t)} - \big(\Gamma_0^{(t)}\big)^{-1} \bar{v}_f^{(t)},$$

where $\bar{v}_f^{(t)}$ is the averaged velocity of the toes and heels in world coordinates when their contact probability, $p^{(t)}$, is higher than a threshold. However, this velocity adjustment often introduces noisy translation when the contact and pose estimation is inaccurate. Therefore, in the second stage, we propose a simple learning mechanism in which a trajectory refining network, $R_T$, updates the root orientation and velocity to address this issue. Finally, the global translation is computed through a roll-out operation:

$$\begin{aligned}
\big(\Gamma^{(t)}, v^{(t)}\big) &= R_T\big(\phi_m^{(0)}, \Gamma_0^{(0)}, \tilde{v}^{(0)}, \ldots, \phi_m^{(t)}, \Gamma_0^{(t)}, \tilde{v}^{(t)}\big), \\
\tau^{(t)} &= \sum_{i=0}^{t-1} \Gamma^{(i)} v^{(i)}.
\end{aligned}$$
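The two closed-form steps of this refinement, the contact-based velocity adjustment and the roll-out, can be sketched as follows. The learned refiner $R_T$ is omitted. The sketch assumes orientations are 3x3 rotation matrices (so the inverse is the transpose), velocities are per-frame displacements, the contact threshold is 0.5, and the velocity is left unchanged when no foot joint is in contact; none of these choices is stated in the text, and the helper names are hypothetical.

```python
import math


def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(r * x for r, x in zip(row, v)) for row in R]


def transpose(R):
    """Transpose of a 3x3 matrix; equals the inverse for a rotation matrix."""
    return [list(col) for col in zip(*R)]


def yaw(angle):
    """Rotation about the vertical (y) axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def adjust_velocity(v0, gamma0, foot_vels_world, contact_probs, threshold=0.5):
    """v_tilde = v0 - gamma0^{-1} @ mean(world velocity of foot joints in contact)."""
    in_contact = [v for v, p in zip(foot_vels_world, contact_probs) if p > threshold]
    if not in_contact:
        return list(v0)  # assumption: no correction without contact
    v_bar = [sum(axis) / len(in_contact) for axis in zip(*in_contact)]
    correction = mat_vec(transpose(gamma0), v_bar)
    return [a - b for a, b in zip(v0, correction)]


def roll_out(gammas, vels):
    """tau[t] = sum over i < t of gammas[i] @ vels[i], so tau[0] is the origin."""
    tau = [[0.0, 0.0, 0.0]]
    for R, v in zip(gammas[:-1], vels[:-1]):
        step = mat_vec(R, v)
        tau.append([a + b for a, b in zip(tau[-1], step)])
    return tau


identity = yaw(0.0)

# One foot joint is in contact but still moving at 0.25 units per frame: remove that slide.
v_tilde = adjust_velocity(
    [0.0, 0.0, 1.0], identity,
    foot_vels_world=[[0.0, 0.0, 0.25], [0.0, 0.0, 5.0]],
    contact_probs=[0.9, 0.1],
)

# Walk forward for one frame, then continue after a 90-degree turn.
trajectory = roll_out(
    [identity, yaw(math.pi / 2), identity],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
)
```

Because the roll-out accumulates per-frame displacements, any residual velocity error integrates into drift over time, which is one reason a reliable foot-contact signal is valuable.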

In summary, this full process reconstructs accurate 3D human pose and shape in both the camera and world coordinates from a single monocular video sequence (Fig. 2).
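The roll-out operation can be illustrated with a minimal NumPy sketch. It assumes that the refined velocity $v^{(i)}$ is expressed in the root's local frame (as the product $\Gamma^{(i)} v^{(i)}$ suggests) and measured in displacement per frame; the function name and array layout are illustrative choices, not part of the released code.

```python
import numpy as np

def rollout_translation(root_rotations, local_velocities):
    """Accumulate tau^(t) = sum_{i=0}^{t-1} Gamma^(i) v^(i).

    root_rotations:   (T, 3, 3) refined root orientations Gamma^(i).
    local_velocities: (T, 3) refined root velocities v^(i), assumed to be
                      in the root frame and in displacement per frame.
    Returns a (T, 3) array of translations with tau^(0) = 0.
    """
    # Rotate every per-frame velocity into world coordinates.
    world_steps = np.einsum("tij,tj->ti", root_rotations, local_velocities)
    # Exclusive cumulative sum: tau^(t) only uses steps 0..t-1.
    translation = np.zeros_like(world_steps)
    translation[1:] = np.cumsum(world_steps[:-1], axis=0)
    return translation
```

Because the sum is exclusive, the first frame defines the origin of the world coordinate system, and any error in $\Gamma^{(i)}$ or $v^{(i)}$ accumulates over time, which is why the refinement of both quantities matters for long sequences.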

Figure 3: WHAM’s Two-Stage Training Scheme. During pretraining, we generate synthetic 2D keypoint sequences from AMASS [32] and train a motion encoder and decoder on the generated data (top). We then leverage video datasets with ground truth SMPL parameters, for which there is much less data. We use the fixed-weight pretrained image encoder and keypoint detector to extract image features and 2D keypoints. In this stage, we train the feature integration network while fine-tuning the motion encoder and motion/trajectory decoders (bottom).

3.3 Training

Pretraining on AMASS. We train in two stages: (1) pretraining with synthetic data, and (2) fine-tuning with real data (Fig. 3). The objective of the pretraining stage is to teach the motion encoder to extract motion context from the input 2D keypoint sequence. The motion and trajectory decoders then map this motion context to the corresponding 3D motion and global trajectory spaces (i.e. they lift the encoding to 3D). We use the AMASS dataset [32] to generate an extensive set of synthetic pairs consisting of sequences of 2D keypoints together with the ground truth SMPL parameters.

To synthesize 2D keypoints from AMASS, we create virtual cameras onto which we project 3D keypoints derived from the ground truth mesh. Unlike MotionBERT [62] and ProxyCap [61], which use static cameras for keypoint projection, we employ dynamic cameras that incorporate both rotational and translational motion. This choice has two main motivations. First, it accounts for the inherent differences between human motions captured in static and dynamic camera setups. Second, it enables the learning of a camera-agnostic motion representation, from which the trajectory decoder can reconstruct the global trajectory. We also augment the 2D data with noise and masking. For details of the synthetic generation process see Sup. Mat.
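As a rough illustration of projecting 3D keypoints onto a moving virtual camera, the following sketch uses a standard pinhole model with per-frame extrinsics. The camera path (a linear yaw pan with a linear translation drift), the focal length, and the principal point are hypothetical values chosen for the example; the actual synthesis procedure is described in Sup. Mat.

```python
import numpy as np

def rotation_about_y(angle_rad):
    """Rotation matrix for a yaw of `angle_rad` about the y axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def make_panning_camera(num_frames, max_yaw_deg=30.0, max_shift=(0.5, 0.0, 0.0)):
    """Hypothetical smooth camera path: linear yaw pan plus translation drift."""
    ramp = np.linspace(0.0, 1.0, num_frames)
    rotations = np.stack(
        [rotation_about_y(np.deg2rad(max_yaw_deg) * a) for a in ramp]
    )
    translations = ramp[:, None] * np.asarray(max_shift, dtype=float)[None, :]
    return rotations, translations

def project_with_dynamic_camera(keypoints_world, cam_rotations, cam_translations,
                                focal_length=1000.0, principal_point=(500.0, 500.0)):
    """Project (T, J, 3) world keypoints to (T, J, 2) pixels.

    cam_rotations (T, 3, 3) and cam_translations (T, 3) are world-to-camera
    extrinsics, so x_cam = R_t @ x_world + t_t for every frame t.
    """
    cam_points = (
        np.einsum("tij,tkj->tki", cam_rotations, keypoints_world)
        + cam_translations[:, None, :]
    )
    depth = cam_points[..., 2:3]  # assumed positive (points in front of the camera)
    return focal_length * cam_points[..., :2] / depth + np.asarray(principal_point)
```

With a static camera the rotations and translations are constant over time; making them time-varying changes the 2D trajectories even for identical 3D motion, which is the effect the dynamic-camera pretraining is meant to expose the network to.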

Fine-tuning on Video Datasets. Starting with the pretrained network, we fine-tune WHAM on four video datasets: 3DPW [49], Human3.6M [11], MPI-INF-3DHP [33], and InstaVariety [15]. For the human mesh recovery task, we supervise WHAM on ground-truth SMPL parameters from AMASS and 3DPW, 3D keypoints from Human3.6M and MPI-INF-3DHP, and 2D keypoints from InstaVariety. For the global trajectory estimation task, we use AMASS, Human3.6M, and MPI-INF-3DHP. Additionally, during training we experiment with adding BEDLAM [1], a large synthetic dataset with realistic video and ground truth SMPL parameters.

The fine-tuning has two objectives: 1) exposing the network to real 2D keypoints, instead of training it solely on synthetic data, and 2) training the feature integrator network to aggregate motion and image features. To achieve these goals, we jointly train the entire network on the video datasets while setting a smaller learning rate on the pretrained modules (see Fig. 3). Consistent with prior work [17, 30, 6, 52, 43], we employ a pretrained and fixed-weight image encoder [21] to extract image features. However, to leverage recent network architectures and training strategies, we also experiment with different types of encoders [25, 1, 7] in the following section.

Implementation Details. For the pretraining stage, we train WHAM on AMASS for 80 epochs with a learning rate of $5\times 10^{-4}$. Then we fine-tune WHAM on 3DPW, MPI-INF-3DHP, Human3.6M, and InstaVariety for 30 epochs. We use a learning rate of $1\times 10^{-4}$ for the feature integrator and $1\times 10^{-5}$ for the pretrained components during fine-tuning. During training, we use the Adam optimizer with a batch size of 64 and split sequences into 81-frame chunks.
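The hyperparameters above can be collected in a small configuration sketch. The dictionary keys and the chunking helper are illustrative names, and the choice of non-overlapping chunks that drop the trailing remainder is an assumption made for the example; the text does not specify how chunk boundaries are chosen.

```python
# Values quoted in the text; key names are illustrative labels only.
PRETRAIN = {"epochs": 80, "learning_rate": 5e-4}
FINETUNE = {
    "epochs": 30,
    "learning_rates": {
        "feature_integrator": 1e-4,   # newly trained module
        "pretrained_modules": 1e-5,   # smaller rate to preserve pretraining
    },
}
BATCH_SIZE = 64
CHUNK_LENGTH = 81  # frames per training chunk

def split_into_chunks(num_frames, chunk_length=CHUNK_LENGTH):
    """Return (start, end) frame index pairs of non-overlapping chunks.

    Assumption for this sketch: frames that do not fill a complete chunk
    at the end of a sequence are dropped.
    """
    last_start = num_frames - chunk_length
    return [(s, s + chunk_length) for s in range(0, last_start + 1, chunk_length)]
```

For example, a 200-frame sequence yields the chunks (0, 81) and (81, 162) under this scheme, while a sequence shorter than 81 frames yields none.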

4 Experiments

Datasets. We evaluate WHAM on three in-the-wild benchmarks: 3DPW [49], RICH [10], and EMDB [16]. Following previous work [14, 21, 17, 6, 30, 25, 1], we perform the evaluation in camera coordinates. The estimated global trajectory is evaluated on a subset of EMDB (EMDB 2), which provides ground truth global motion for sequences captured with dynamic cameras. We also test on new sequences captured using an iPhone together with its gyroscope readings. See Sup. Mat. for more details on the datasets and iPhone results.

Evaluation metrics. To evaluate the accuracy of 3D human pose and shape estimation, we compute Mean Per Joint Position Error (MPJPE), Procrustes-aligned MPJPE (PA-MPJPE), and Per Vertex Error (PVE) measured in millimeters ($mm$). We compute Acceleration error (Accel, in $m/s^{2}$) to measure the inter-frame smoothness of the reconstructed motion. (Previous work follows [15] in reporting Accel in $mm/\mathrm{frame}^{2}$. To remove the dependency on frame rate, we convert all previous results to $m/s^{2}$.) We also evaluate the motion reconstruction and trajectory estimation accuracy in the world-frame. Following previous work [55, 20], we split sequences into smaller segments of 100 frames and align each output segment with the ground-truth data using the first two frames (W-MPJPE$_{100}$) or the entire segment (WA-MPJPE$_{100}$) in $mm$. These previous metrics give an unrealistic picture of 3D performance as they do not measure drift over long sequences. Therefore, we also evaluate the error over the entire trajectory after the rigid alignment and measure Root Translation Error (RTE in %) normalized by the actual displacement of the person. We also assess the jitter of the motion in the world coordinate system in $10\,m/s^{3}$ and foot sliding during the contact (FS in $mm$).
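For concreteness, the camera-coordinate metrics can be sketched as follows. This is a minimal NumPy version of the commonly used definitions, not the evaluation code behind the reported numbers; it assumes joint positions in meters with shape (T, J, 3), and `fps` is a hypothetical argument giving the frame rate needed to express Accel in $m/s^{2}$.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error in mm for (T, J, 3) arrays in meters."""
    return 1000.0 * np.linalg.norm(pred - gt, axis=-1).mean()

def procrustes_align(pred, gt):
    """Similarity-align one frame `pred` (J, 3) to `gt` (J, 3)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the cross-covariance gives the optimal rotation (Kabsch/Umeyama).
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so that the result is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after per-frame Procrustes (rotation, scale, translation) alignment."""
    aligned = np.stack([procrustes_align(p, g) for p, g in zip(pred, gt)])
    return mpjpe(aligned, gt)

def accel_error(pred, gt, fps):
    """Mean acceleration error in m/s^2 using second finite differences."""
    acc_pred = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]  # m / frame^2
    acc_gt = gt[2:] - 2.0 * gt[1:-1] + gt[:-2]
    return np.linalg.norm(acc_pred - acc_gt, axis=-1).mean() * fps ** 2
```

Under these definitions, a value reported in $mm/\mathrm{frame}^{2}$ converts to $m/s^{2}$ by multiplying with $\mathrm{fps}^{2}/1000$, which makes results comparable across datasets recorded at different frame rates.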

| Type | Models | 3DPW (14) | | | | RICH (24) | | | | EMDB (24) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | PA-MPJPE | MPJPE | PVE | Accel | PA-MPJPE | MPJPE | PVE | Accel | PA-MPJPE | MPJPE | PVE | Accel |
| per-frame | SPIN [21] | 59.2 | 96.9 | 112.8 | 31.4 | 69.7 | 122.9 | 144.2 | 35.2 | 87.1 | 140.7 | 166.1 | 41.3 |
| per-frame | PARE [18] | 46.5 | 74.5 | 88.6 | – | 60.7 | 109.2 | 123.5 | – | 72.2 | 113.9 | 133.2 | – |
| per-frame | CLIFF [25] | 43.0 | 69.0 | 81.2 | 22.5 | 56.6 | 102.6 | 115.0 | 22.4 | 68.3 | 103.5 | 123.7 | 24.5 |
| per-frame | HybrIK [23] | 41.8 | 71.6 | 82.3 | – | 56.4 | 96.8 | 110.4 | – | 65.6 | 103.0 | 122.2 | – |
| per-frame | HMR2.0 [7] | 44.4 | 69.8 | 82.2 | 18.1 | 48.1 | 96.0 | 110.9 | 18.8 | 60.7 | 98.3 | 120.8 | 19.9 |
| per-frame | ReFit [51] | 40.5 | 65.3 | 75.1 | 18.5 | 47.9 | 80.7 | 92.9 | 17.1 | 58.6 | 88.0 | 104.5 | 20.7 |
| temporal | TCMR [6] | 52.7 | 86.5 | 101.4 | 6.0 | 65.6 | 119.1 | 137.7 | 5.0 | 79.8 | 127.7 | 150.2 | 5.3 |
| temporal | VIBE [17] | 51.9 | 82.9 | 98.4 | 18.5 | 68.4 | 120.5 | 140.2 | 21.8 | 81.6 | 126.1 | 149.9 | 26.5 |
| temporal | MPS-Net [52] | 52.1 | 84.3 | 99.0 | 6.5 | 67.1 | 118.2 | 136.7 | 5.8 | 81.4 | 123.3 | 143.9 | 6.2 |
| temporal | GLoT [43] | 50.6 | 80.7 | 96.4 | 6.0 | 65.6 | 114.3 | 132.7 | 5.2 | 79.1 | 119.9 | 140.8 | 5.4 |
| temporal | GLAMR [58] | 51.1 | – | – | 8.0 | 79.9 | 107.7 | – | – | 73.8 | 113.8 | 134.9 | 33.0 |
| temporal | TRACE [45] | 50.9 | 79.1 | 95.4 | 28.6 | – | – | – | – | 71.5 | 110.0 | 129.6 | 25.5 |
| temporal | SLAHMR [55] | 55.9 | – | – | – | 52.5 | – | – | 9.4 | 69.7 | 93.7 | 111.3 | 7.1 |
| temporal | PACE [20] | – | – | – | – | 49.3 | – | – | 8.8 | – | – | – | – |
| temporal | WHAM (Res) | 40.2 | 62.7 | 75.1 | 6.3 | 51.8 | 89.5 | 103.2 | 5.0 | 57.8 | 84.0 | 99.7 | 5.2 |
| temporal | WHAM (HR) | 39.0 | 62.6 | 74.8 | 6.4 | 49.1 | 84.6 | 96.4 | 5.2 | 57.1 | 85.7 | 103.2 | 5.6 |
| temporal | WHAM (ViT) | 35.9 | 57.8 | 68.7 | 6.6 | 44.3 | 80.0 | 91.2 | 5.3 | 50.4 | 79.7 | 94.4 | 5.3 |

Table 1: Quantitative comparison of state-of-the-art models on the 3DPW [49], RICH [10], and EMDB [16] datasets. Ordering of per-frame and temporal methods is done separately by descending MPJPE on EMDB (except for PACE). For testing on EMDB, we follow the protocol of EMDB 1 [16]. Parentheses denote the number of body joints used to compute MPJPE and PA-MPJPE, and ∗ denotes models trained with the 3DPW training set. Bold numbers denote the most accurate method in each column. Accel is in $m/s^{2}$, all other errors are in $mm$.

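
| Models | Dataset | 3DPW (14) | | | |
|---|---|---|---|---|---|
| | | PA-MPJPE | MPJPE | PVE | Accel |
| CLIFF [25, 1] | R | 43.6 | 68.8 | 82.1 | 19.2 |
| CLIFF [25, 1] | R+B | 43.0 | 66.9 | 78.5 | 31.0 |
| ReFit [51] | R | 40.5 | 65.3 | 75.1 | 18.5 |
| ReFit [51] | R+B | 38.2 | 57.6 | 67.6 | 21.4 |
| WHAM (ViT) | R | 35.9 | 57.8 | 68.7 | 6.6 |
| WHAM (ViT) | R+B | 35.7 | 56.9 | 67.4 | 6.7 |

Table 2: Dataset ablation experiments on 3DPW [49]. R denotes the use of real datasets and B denotes BEDLAM.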

4.1 3D Human Motion Recovery

Per-frame accuracy. Table 1 presents a comprehensive comparison of WHAM and the existing state-of-the-art per-frame and video-based methods across three benchmark datasets [49, 10, 16]. Because none of the methods are exposed to data from RICH or EMDB during training, results on these datasets are indicative of each method’s ability to generalize. WHAM (Res), WHAM (HR), and WHAM (ViT) correspond to different architectures for the pretrained image encoders, derived from SPIN (ResNet-50) [21], CLIFF (HRNet-W48) [25, 1], and HMR2.0 (ViT-H/16) [7], respectively. Not surprisingly, WHAM (HR) is more accurate than WHAM (Res), while the transformer-based version, WHAM (ViT), is the most accurate. The backbone matters, with WHAM (ViT) outperforming all previous methods on all per-frame metrics (MPJPE, PA-MPJPE, and PVE) on all benchmarks. Even with the simplest ResNet backbone, WHAM (Res) outperforms every method except for ReFit on RICH.

Training on BEDLAM consistently improves accuracy in prior work (see [3, 1, 51]), and we find the same here as shown in Table 2. However, compared to ReFit, WHAM exhibits relatively smaller performance improvement by adding BEDLAM. One possible reason for this is that, unlike ReFit, we did not finetune the image encoder during training to limit training time.

Figure 4: Qualitative comparison with previous state-of-the-art methods for 3D human pose and shape estimation. See text.

Inter-frame smoothness. We also evaluate the inter-frame smoothness using the acceleration error. Compared with state-of-the-art per-frame methods [7, 25, 1, 23], WHAM has significantly lower acceleration error. This indicates that WHAM reconstructs smooth and more plausible 3D human motion across frames while not sacrificing high per-frame accuracy. On the other hand, when compared to recent temporal methods [43, 52, 6], WHAM exhibits comparable or slightly higher acceleration error. However, we observe that these video-based methods tend to over-smooth the human motion, resulting in lower accuracy on per-frame metrics.

To provide intuition for these numbers, we qualitatively compare WHAM with TCMR [6] and GLoT [43] in Fig. 4. While producing smooth results, TCMR and GLoT fail to capture the bending of the left knee when the subject is ascending the stairs, while WHAM more accurately reconstructs the 3D human pose.

| Models | EMDB 2 | | | | |
|---|---|---|---|---|---|
| | WA-MPJPE$_{100}$ | W-MPJPE$_{100}$ | RTE | Jitter | FS |
| DPVO (+ HMR2.0) [47, 7] | 647.8 | 2231.4 | 15.8 | 537.3 | 107.6 |
| GLAMR [58] | 280.8 | 726.6 | 11.4 | 46.3 | 20.7 |
| TRACE [45] | 529.0 | 1702.3 | 17.7 | 2987.6 | 370.7 |
| SLAHMR [55] | 326.9 | 776.1 | 10.2 | 31.3 | 14.5 |
| WHAM (w/ DPVO [47]) | 135.6 | 354.8 | 6.0 | 22.5 | 4.4 |
| WHAM (w/ DROID [46]) | 133.3 | 343.9 | 4.6 | 21.5 | 4.4 |
| WHAM (w/ GT gyro) | 131.1 | 335.3 | 4.1 | 21.0 | 4.4 |

Table 3: Global motion estimation accuracy on EMDB [16].

4.2 3D Global Trajectory Recovery

To evaluate global trajectory recovery, we compare WHAM with the state-of-the-art methods and a baseline that combines a SLAM method (DPVO [47]) and a per-frame method (HMR2.0 [7]); see Table 3. WHAM is agnostic to the source of the camera angular velocity and we compare results using DPVO, DROID-SLAM [46] and the ground truth angular velocity (gyro).

As shown in Table 3, WHAM outperforms the existing methods on all metrics. Specifically, combining WHAM with DPVO is more accurate than the global trajectory estimation of DPVO combined with HMR2.0, illustrating that our method actively refines the global trajectory instead of performing a simple integration. DROID-SLAM gives slightly better results than DPVO. Furthermore, WHAM significantly outperforms the regression-based method, TRACE, on jitter and foot sliding metrics. We further demonstrate this in Figs. 5 and 1, where WHAM captures more consistent and plausible human motion in the global coordinate system than TRACE and SLAHMR for videos captured by dynamic cameras. As depicted in Fig. 6, WHAM outperforms GLAMR, TRACE, and SLAHMR in capturing the pattern of human motion in the global coordinate system.

Figure 5: Qualitative comparison with TRACE [45] and SLAHMR [55] on global human motion estimation with dynamic cameras.

Figure 6: Comparison of global trajectory estimation on EMDB [16]. Overall, WHAM shows better alignment to ground truth data compared to GLAMR [58], TRACE [45], and SLAHMR [55].

4.3 Ablation Study

To provide further insight into our approach, we conduct ablation studies to analyze the contribution of each component to the performance. As shown in Table 4, our entire system (WHAM) outperforms the different variants of WHAM that ablate a single component. To be specific, we first observe that adding feature integration improves both motion and global trajectory estimation accuracy when compared with an ablated version without feature integration (w/o $F_{I}$). Similarly, the removal of the pretraining on the 2D-to-3D lifting task using AMASS [32] (w/o lifting) shows significant performance degradation. WHAM also outperforms the ablation of the Neural Initialization (w/o NI), particularly in the MPJPE metric. In addition, we experiment with a variant of WHAM that decodes the trajectory solely from the motion context, without using the estimated camera angular velocity (w/o $\omega$). Although this version shows similar performance in predicting 3D human pose, it suffers from the entanglement of camera and human motion, resulting in a significantly higher overall trajectory error (RTE). Last, we observe that WHAM without the trajectory refinement (w/o traj. ref.) gives larger global trajectory and foot sliding errors in return for less jitter, indicating that our approach contributes to the global trajectory accuracy and helps reduce foot sliding. Core details are presented here; see Sup. Mat. for more details and information on run-time cost.

| Models | EMDB 2 | | | | | | |
|---|---|---|---|---|---|---|---|
| | PA-MPJPE | MPJPE | WA-MPJPE$_{100}$ | W-MPJPE$_{100}$ | RTE | Jitter | FS |
| w/o $F_{I}$ | 44.2 | 69.0 | 147.6 | 377.9 | 6.3 | 23.1 | 5.5 |
| w/o lifting | 60.3 | 83.0 | 238.0 | 693.0 | 11.5 | 24.5 | 5.0 |
| w/o NI | 40.4 | 66.3 | 142.7 | 368.1 | 6.8 | 22.3 | 4.6 |
| w/o $\omega$ | 39.1 | 62.0 | 156.5 | 422.0 | 10.1 | 22.1 | 5.0 |
| w/o traj. ref. | 38.2 | 59.3 | 154.7 | 407.5 | 6.3 | 18.8 | 6.5 |
| WHAM (Ours) | 38.2 | 59.3 | 135.6 | 354.8 | 6.0 | 22.5 | 4.4 |

Table 4: Ablation experiments. See text.

5 Conclusion

WHAM is a new method to recover accurate 3D human motion in global coordinates from a moving camera more efficiently and accurately than the state-of-the-art approaches. Our approach leverages the AMASS dataset to train a network to recursively lift 2D sequences of keypoints to sequences of 3D SMPL parameters. But keypoints alone lack valuable information about the body and its movement. Consequently, we integrate image context information over time and learn to combine it with the motion context to better estimate human body shape and pose. Additionally, our method takes an estimate of the camera angular velocity, which can either be computed from a SLAM method or from the camera’s gyro when available. Finally, we combine all this information with an estimate of foot contact to recover the 3D human motion in global coordinates from a monocular video sequence. WHAM significantly outperforms the existing state-of-the-art methods (both image-based and video-based) on challenging in-the-wild benchmarks in both 3D HPS and the world-coordinate trajectory estimation accuracy. Because of its speed and accuracy, WHAM provides a foundation for in-the-wild motion capture applications.

Limitations and future directions: WHAM learns about human motion from AMASS, limiting generalization to motions that are out of distribution. While we employ random masking as part of our data synthesis process, our data generation approach mainly assumes the scenario where the full body is within the field of view. See Sup. Mat. for more details.

WHAM opens up many directions for future work. For example, while we use SLAM to estimate the camera’s angular velocity, SLAM could also provide camera intrinsics and extrinsics as well as information about the 3D scene that could be used to enforce consistency between the scene and the human. While WHAM is an online method, designed for real-time applications, it could also initialize an optimization-based post-processing akin to bundle adjustment, which would optimize the camera, scene, and human motion together. Furthermore, a real-time and phone-based implementation of WHAM should be feasible.

Acknowledgements. Part of this work was done when the first author was an intern at the Max Planck Institute for Intelligent Systems. This work was partially supported through an NSF CAREER Award (CBET 2145473) and a Chan-Zuckerberg Essential Open Source Software for Science Award.

CoI Disclosure. MJB has received research gift funds from Adobe, Intel, Nvidia, Meta/Facebook, and Amazon. MJB has financial interests in Amazon and Meshcapade GmbH. While MJB is a co-founder and Chief Scientist at Meshcapade, his research in this project was performed solely at, and funded solely by, the Max Planck Society.

References

  • Black et al. [2023] Michael J. Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages 8726–8737, 2023.
  • Bogo et al. [2016] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision – ECCV 2016, pages 561–578. Springer International Publishing, 2016.
  • Cai et al. [2023] Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, and Ziwei Liu. SMPLer-X: Scaling up expressive human pose and shape estimation. In Proc. Neural Information Processing Systems (NeurIPS), 2023.
  • Cao et al. [2019] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • Choi et al. [2020] Hongsuk Choi, Gyeongsik Moon, and Kyoung Mu Lee. Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. In European Conference on Computer Vision (ECCV), 2020.
  • Choi et al. [2021] Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Beyond static features for temporally consistent 3d human pose and shape from a video. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Goel et al. [2023] Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4D: Reconstructing and tracking humans with transformers. arXiv preprint arXiv:2305.20091, 2023.
  • Guler and Kokkinos [2019] Riza Alp Guler and Iasonas Kokkinos. Holopose: Holistic 3d human reconstruction in-the-wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Henning et al. [2023] Dorian F. Henning, Christopher Choi, Simon Schaefer, and Stefan Leutenegger. Bodyslam++: Fast and tightly-coupled visual-inertial camera and human motion tracking, 2023.
  • Huang et al. [2022] Chun-Hao P. Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J. Black. Capturing and inferring dense full-body human-scene contact. In Proceedings IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages 13274–13285, 2022.
  • Ionescu et al. [2014] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
  • Joo et al. [2020a] Hanbyul Joo, Natalia Neverova, and Andrea Vedaldi. Exemplar fine-tuning for 3d human pose fitting towards in-the-wild 3d human pose estimation. In 3DV, 2020a.
  • Joo et al. [2020b] Hanbyul Joo, Natalia Neverova, and Andrea Vedaldi. Exemplar fine-tuning for 3d human pose fitting towards in-the-wild 3d human pose estimation. In 3DV, 2020b.
  • Kanazawa et al. [2018a] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Computer Vision and Pattern Recognition (CVPR), 2018a.
  • Kanazawa et al. [2018b] Angjoo Kanazawa, Jason Y. Zhang, Panna Felsen, and Jitendra Malik. Learning 3d human dynamics from video. arXiv preprint arXiv:1812.01601, 2018b.
  • Kaufmann et al. [2023] Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tianjian Jiang, Chengcheng Tang, Juan José Zárate, and Otmar Hilliges. EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. In International Conference on Computer Vision (ICCV), 2023.
  • Kocabas et al. [2020] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video inference for human body pose and shape estimation. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 5252–5262. IEEE, 2020.
  • Kocabas et al. [2021a] Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, and Michael J. Black. PARE: Part attention regressor for 3D human body estimation. In Proc. International Conference on Computer Vision (ICCV), pages 11127–11137, 2021a.
  • Kocabas et al. [2021b] Muhammed Kocabas, Chun-Hao P. Huang, Joachim Tesch, Lea Müller, Otmar Hilliges, and Michael J. Black. SPEC: Seeing people in the wild with an estimated camera. In Proc. International Conference on Computer Vision (ICCV), pages 11015–11025, Piscataway, NJ, 2021b. IEEE.
  • Kocabas et al. [2024] Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J. Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal. Pace: Human and motion estimation from in-the-wild videos. In 3DV, 2024.
  • Kolotouros et al. [2019a] Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In ICCV, 2019a.
  • Kolotouros et al. [2019b] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, 2019b.
  • Li et al. [2021] Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3383–3393, 2021.
  • Li et al. [2023] Jiefeng Li, Siyuan Bian, Qi Liu, Jiasheng Tang, Fan Wang, and Cewu Lu. NIKI: Neural inverse kinematics with invertible neural networks for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Li et al. [2022] Zhihao Li, Jianzhuang Liu, Zhensong Zhang, Songcen Xu, and Youliang Yan. Cliff: Carrying location information in full frames into human pose and shape estimation. In ECCV, 2022.
  • Lin et al. [2023a] Jing Lin, Ailing Zeng, Haoqian Wang, Lei Zhang, and Yu Li. One-stage 3d whole-body mesh recovery with component aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21159–21168, 2023a.
  • Lin et al. [2023b] Kevin Lin, Chung-Ching Lin, Lin Liang, Zicheng Liu, and Lijuan Wang. Mpt: Mesh pre-training with transformers for human pose and mesh reconstruction, 2023b.
  • Liu et al. [2021] Miao Liu, Dexin Yang, Yan Zhang, Zhaopeng Cui, James M Rehg, and Siyu Tang. 4d human body capture from egocentric video via 3d scene grounding. In 3DV, 2021.
  • Loper et al. [2015] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015.
  • Luo et al. [2020] Zhengyi Luo, S. Alireza Golestaneh, and Kris M. Kitani. 3d human motion estimation via motion compression and refinement. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2020.
  • Ma et al. [2023] Xiaoxuan Ma, Jiajun Su, Chunyu Wang, Wentao Zhu, and Yizhou Wang. 3d human mesh estimation from virtual markers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 534–543, 2023.
  • Mahmood et al. [2019] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision, pages 5442–5451, 2019.
  • Mehta et al. [2017] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3D Vision (3DV), 2017 Fifth International Conference on. IEEE, 2017.
  • Moon and Lee [2020] Gyeongsik Moon and Kyoung Mu Lee. I2l-meshnet: Image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image. In European Conference on Computer Vision (ECCV), 2020.
  • Moon et al. [2022] Gyeongsik Moon, Hongsuk Choi, and Kyoung Mu Lee. Neuralannot: Neural annotator for 3d human mesh training sets. In Computer Vision and Pattern Recognition Workshop (CVPRW), 2022.
  • Moon et al. [2023] Gyeongsik Moon, Hongsuk Choi, Sanghyuk Chun, Jiyoung Lee, and Sangdoo Yun. Three recipes for better 3d pseudo-gts of 3d human mesh estimation in the wild. In Computer Vision and Pattern Recognition Workshop (CVPRW), 2023.
  • Osman et al. [2020] Ahmed A. A. Osman, Timo Bolkart, and Michael J. Black. STAR: Sparse trained articulated human body regressor. In Computer Vision - ECCV 2020, pages 598–613. Springer, 2020.
  • Pavlakos et al. [2019a] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019a.
  • Pavlakos et al. [2019b] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 459–468, 2019b.
  • Rempe et al. [2021] Davis Rempe, Tolga Birdal, Aaron Hertzmann, Jimei Yang, Srinath Sridhar, and Leonidas J. Guibas. Humor: 3d human motion model for robust pose estimation. In International Conference on Computer Vision (ICCV), 2021.
  • Saini et al. [2023] Nitin Saini, Chun-Hao P. Huang, Michael J. Black, and Aamir Ahmad. Smartmocap: Joint estimation of human and camera motion using uncalibrated rgb cameras. IEEE Robotics and Automation Letters, pages 1–8, 2023.
  • Schaefer et al. [2023] Simon Schaefer, Dorian F. Henning, and Stefan Leutenegger. Glopro: Globally-consistent uncertainty-aware 3d human pose estimation & tracking in the wild, 2023.
  • Shen et al. [2023] Xiaolong Shen, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, and Yi Yang. Global-to-local modeling for video-based 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8887–8896, 2023.
  • Shin et al. [2023] Soyong Shin, Zhixiong Li, and Eni Halilaj. Markerless motion tracking with noisy video and imu data. IEEE Transactions on Biomedical Engineering, 70(11):3082–3092, 2023.
  • Sun et al. [2023] Yu Sun, Qian Bao, Wu Liu, Tao Mei, and Michael J. Black. TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Teed and Deng [2021] Zachary Teed and Jia Deng. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. Advances in neural information processing systems, 2021.
  • Teed et al. [2022] Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. arXiv preprint arXiv:2208.04726, 2022.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  • von Marcard et al. [2018] Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In European Conference on Computer Vision (ECCV), 2018.
  • Wan et al. [2021] Ziniu Wan, Zhengjia Li, Maoqing Tian, Jianbo Liu, Shuai Yi, and Hongsheng Li. Encoder-decoder with multi-level attention for 3d human shape and pose estimation. In The IEEE International Conference on Computer Vision (ICCV), 2021.
  • Wang and Daniilidis [2023] Yufu Wang and Kostas Daniilidis. Refit: Recurrent fitting network for 3d human recovery. In International Conference on Computer Vision, 2023.
  • Wei et al. [2022] Wen-Li Wei, Jen-Chun Lin, Tyng-Luh Liu, and Hong-Yuan Mark Liao. Capturing humans in motion: Temporal-attentive 3d human pose and shape estimation from monocular video. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Xu et al. [2020] Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, William T. Freeman, Rahul Sukthankar, and Cristian Sminchisescu. GHUM & GHUML: generative 3D human shape and articulated pose models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6183–6192, 2020.
  • Xu et al. [2022] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. ViTPose: Simple vision transformer baselines for human pose estimation. In Advances in Neural Information Processing Systems, 2022.
  • Ye et al. [2023] Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Yi et al. [2021] Xinyu Yi, Yuxiao Zhou, and Feng Xu. Transpose: Real-time 3d human translation and pose estimation with six inertial sensors. ACM Transactions on Graphics, 40(4), 2021.
  • Yi et al. [2022] Xinyu Yi, Yuxiao Zhou, Marc Habermann, Soshi Shimada, Vladislav Golyanik, Christian Theobalt, and Feng Xu. Physical inertial poser (pip): Physics-aware real-time human motion tracking from sparse inertial sensors. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Yuan et al. [2022] Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Zhang et al. [2020] Feng Zhang, Xiatian Zhu, Hanbin Dai, Mao Ye, and Ce Zhu. Distribution-aware coordinate representation for human pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Zhang et al. [2023a] Hongwen Zhang, Yating Tian, Yuxiang Zhang, Mengcheng Li, Liang An, Zhenan Sun, and Yebin Liu. Pymaf-x: Towards well-aligned full-body model regression from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023a.
  • Zhang et al. [2023b] Yuxiang Zhang, Hongwen Zhang, Liangxiao Hu, Hongwei Yi, Shengping Zhang, and Yebin Liu. Real-time monocular full-body capture in world space via sequential proxy-to-motion learning. arXiv preprint arXiv:2307.01200, 2023b.
  • Zhu et al. [2023] Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, and Yizhou Wang. Motionbert: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

As promised in the main paper, this supplemental document provides details of our synthetic data generation, the datasets we use, our network training, run-time cost, and limitations. Additionally, please refer to our project page at https://wham.is.tue.mpg.de/ for results that illustrate our method and recent SOTA methods applied to video sequences.

Appendix A Synthetic data generation

To address the scarcity of video data with paired 3D ground truth, we pretrain WHAM on a large number of synthetic 2D keypoint sequences for which we have the ground-truth 3D poses. In this section, we describe how we synthesize this data from AMASS [32].

2D keypoint sequence synthesis.

During training, we sample sequences of SMPL poses of length $L=81$ from AMASS. Then, similar to MEVA [30], we uniformly upsample or downsample the frames to speed up or slow down the motion by up to 50% of the original speed. Furthermore, we apply a random root rotation $\Delta\Gamma \sim \mathcal{U}(0^{\circ}, 360^{\circ})$ about the axis perpendicular to the ground plane and add Gaussian noise $\Delta\beta \sim \mathcal{N}(0, 0.1)$ to the shape parameters. Given the augmented SMPL sequence, we extract the 3D keypoints that correspond to the MS-COCO keypoints and add 3D noise modeled following previous work [44]. Finally, we apply a random mask with an average probability of $p=0.15$ to the 3D keypoints and project them onto the virtual camera as described below.
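The augmentation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it operates on 3D keypoints rather than SMPL parameters, treats the y-axis as the vertical axis, picks frames by nearest index, and omits the shape noise and the keypoint noise of [44]. All function names are hypothetical.

```python
import math
import random

def resample_indices(num_out, speed):
    """Source-frame indices that play a clip `speed` times faster (or slower)."""
    return [int(math.floor(i * speed + 0.5)) for i in range(num_out)]

def rotate_about_vertical(points, angle_deg):
    """Rotate (x, y, z) points about the y-axis (assumed to be vertical here)."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]

def random_mask(num_keypoints, p=0.15, rng=random):
    """True marks a keypoint that is hidden; each is masked with probability p."""
    return [rng.random() < p for _ in range(num_keypoints)]

def augment_clip(clip, length=81, rng=random):
    """clip: list of frames, each a list of (x, y, z) keypoints."""
    speed = rng.uniform(0.5, 1.5)   # up to 50% slower or faster
    yaw = rng.uniform(0.0, 360.0)   # one rotation shared by the whole clip
    idx = [min(i, len(clip) - 1) for i in resample_indices(length, speed)]
    frames = [rotate_about_vertical(clip[i], yaw) for i in idx]
    masks = [random_mask(len(f), rng=rng) for f in frames]
    return frames, masks
```

Passing a seeded `random.Random` instance as `rng` makes the sampled speed, rotation, and masks reproducible.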

Contact label generation.

The goal of generating a contact label is to train the motion decoder to detect the foot-ground contact accurately. Previous work [61] uses both the velocity and height of the feet to generate contact labels. However, in order to generalize our approach to arbitrary ground conditions such as slopes or stairs, we only use foot velocity to compute the ground truth contact labels, similar to TransPose [56]. We use the heel and toe vertices of each foot to define foot-ground contact and compute the probability as follows:

$$
\hat{p}^{(t)} = \frac{1}{1 + e^{\alpha (v^{(t)} - v_t)/v_t}}.
$$

We set the threshold velocity $v_t = 1\,\text{cm}/\text{frame}$ and the coefficient $\alpha = 5$.
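As a small worked example, the contact probability above is a logistic function of the vertex speed: it equals 0.5 at the threshold and approaches 1 for a stationary vertex. The sketch below evaluates it with the stated values $v_t = 1$ cm/frame and $\alpha = 5$. The helper names are hypothetical, and computing speed as the distance travelled between consecutive frames is an assumption.

```python
import math

def vertex_speeds(positions_cm):
    """Per-frame speed (cm/frame) from consecutive 3D positions given in cm."""
    return [math.dist(a, b) for a, b in zip(positions_cm, positions_cm[1:])]

def contact_probability(speed, v_thresh=1.0, alpha=5.0):
    """Logistic contact label: near 1 when the vertex is still, near 0 when it moves."""
    z = alpha * (speed - v_thresh) / v_thresh
    if z > 700.0:  # math.exp would overflow; the probability is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

# A heel vertex that rests for one frame and then moves 5 cm in the next.
heel = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
labels = [contact_probability(v) for v in vertex_speeds(heel)]
```

Here the first label is close to 1 (contact) and the second is close to 0 (no contact).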

Camera motion synthesis.

We begin by generating the initial pose of the virtual camera and then model the camera motion. We model the initial roll and pitch angles of the camera using Gaussian distributions:

$$
\begin{aligned}
\gamma_r^{(0)} &\sim \mathcal{N}(0^{\circ}, (5^{\circ})^{2}), \\
\gamma_p^{(0)} &\sim \mathcal{N}(5^{\circ}, (22.5^{\circ})^{2}).
\end{aligned}
$$

Here, we do not model the initial yaw angle since it is already handled by the random SMPL root rotation $\Delta\Gamma$.

Subsequently, we sample the initial camera translation, using a mix of uniform and normal distributions, to capture the ground-truth 3D pose in the camera coordinates:

$$
\begin{aligned}
T_z^{(0)} &\sim \mathcal{U}(2\,\mathrm{m}, (12\,\mathrm{m})^{2}) - \big(R^{(0)} t^{(0)}\big)_z, \\
T_x^{(0)} &\sim \mathcal{N}(0, 0.25^{2})\, d - \big(R^{(0)} t^{(0)}\big)_x, \\
T_y^{(0)} &\sim \mathcal{N}(0, 0.25^{2})\, d - \big(R^{(0)} t^{(0)}\big)_y.
\end{aligned}
$$

Here, $d = w \cdot T_z / 2f$ is the maximum displacement of the camera that keeps the 3D keypoints within the field of view, $R^{(0)}$ is the initial camera pose, $t^{(0)}$ is the initial human translation, $w$ is the image size, and $f$ is the focal length. Next, we sample the magnitude of change in the camera's extrinsics from Gaussian distributions:

$$
\begin{aligned}
\Delta\gamma_y &\sim \mathcal{N}(0^{\circ}, (45^{\circ})^{2}), \\
\Delta\gamma_r, \Delta\gamma_p &\sim \mathcal{N}(0^{\circ}, (22.5^{\circ})^{2}), \\
\Delta T_x, \Delta T_y, \Delta T_z &\sim \mathcal{N}(0\,\mathrm{m}, (1\,\mathrm{m})^{2}).
\end{aligned}
$$

Finally, we interpolate the extrinsics and construct the camera's dynamic path. Here, we sample the time stamps with 20% noise, instead of sampling them uniformly, to model non-linear camera motion.
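The following sketch shows one possible reading of the camera motion synthesis. It samples the initial roll and pitch and the total change in the extrinsics from the Gaussians given above, then interpolates along time stamps whose step sizes are perturbed by up to 20%. Several points are assumptions made for illustration: the angles are interpolated linearly as Euler angles, the 20% noise is applied uniformly to each time step, and the initial translation is supplied by the caller rather than sampled. The function and key names are hypothetical.

```python
import random

def sample_initial_angles(rng=random):
    """Initial roll and pitch in degrees; gauss takes (mean, standard deviation)."""
    return {"roll": rng.gauss(0.0, 5.0), "pitch": rng.gauss(5.0, 22.5)}

def sample_extrinsic_deltas(rng=random):
    """Total change of each extrinsic over the clip (degrees and metres)."""
    delta = {"yaw": rng.gauss(0.0, 45.0)}
    for key in ("roll", "pitch"):
        delta[key] = rng.gauss(0.0, 22.5)
    for key in ("tx", "ty", "tz"):
        delta[key] = rng.gauss(0.0, 1.0)
    return delta

def jittered_timestamps(num_frames, noise=0.2, rng=random):
    """Increasing interpolation weights from 0 to 1 with perturbed step sizes."""
    cum = [0.0]
    for _ in range(num_frames - 1):
        cum.append(cum[-1] + 1.0 + rng.uniform(-noise, noise))
    return [c / cum[-1] for c in cum]

def camera_path(start, delta, num_frames=81, rng=random):
    """Per-frame extrinsics: start + weight * delta, with jittered weights."""
    stamps = jittered_timestamps(num_frames, rng=rng)
    return [{k: start[k] + w * delta[k] for k in start} for w in stamps]
```

A caller would build `start` from `sample_initial_angles()` plus a zero yaw and an initial translation, and pass it together with `sample_extrinsic_deltas()` to `camera_path`.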

We use 6.7M frames in total and uniformly sample them during the synthetic data generation.

Appendix B Datasets

In this section, we describe the datasets we use for training and testing our method.

Human3.6M [11] is an indoor dataset containing individuals performing 15 distinct actions captured by both a motion-capture (mocap) system and 4 calibrated video cameras. The ground truth 3D keypoint locations are provided by the mocap. Following previous work [17, 6, 43, 52], we use 5 subjects (S1, S5, S6, S7, and S8) to train our network after downsampling the mocap data to 25 fps.

MPI-INF-3DHP [33] is a multi-view and markerless dataset containing individuals performing various ranges of motion with corresponding ground-truth 3D keypoint locations. To train the network, we use the training set of the dataset, containing 8 subjects and 16 videos per subject.

InstaVariety [15] is a large-scale in-the-wild video dataset with large variations in subjects, motion, and environment. The dataset contains pseudo-ground-truth 2D keypoints detected by OpenPose [4]. We train our method on the training split of the dataset.

3DPW [49] is an in-the-wild video dataset containing ground truth 3D pose captured by a hand-held camera and 13 body-worn inertial sensors. We use the train, validation, and test splits of 3DPW for training, validating, and testing our method.

RICH [10] is a large-scale multi-view dataset captured in both indoor and outdoor environments. RICH provides the ground truth SMPL-X [38] parameters. Following previous work [1], we use the test split of the dataset to evaluate our method on 3D pose estimation accuracy.

EMDB [16] is a recently captured dataset that uses a dynamic camera and body-worn electromagnetic (EM) sensors. EMDB provides ground-truth SMPL parameters as well as the global trajectory of the individuals in a global coordinate system. We use two distinct test splits, EMDB 1 and EMDB 2, to evaluate the performance on 3D pose and shape estimation (EMDB 1) and global trajectory estimation (EMDB 2).

BEDLAM [1] is a recently proposed large-scale synthetic dataset. BEDLAM introduces realistic modeling of diverse clothing, hair, motion, skin tones, and scene environments to synthesize videos. BEDLAM contains 1 million video frames for individuals with ground truth SMPL/SMPL-X parameters. We optionally use the train split of BEDLAM to train the network.

Appendix C Losses

The loss functions of WHAM can be categorized into two groups, motion reconstruction and trajectory reconstruction:

$$
\begin{aligned}
L_{\mathrm{total}} ={}& \overbrace{L_{\mathrm{smpl}} + L_{\mathrm{verts}} + L_{\mathrm{3D}} + L_{\mathrm{2D}} + L_{\mathrm{casc}}}^{\text{motion reconstruction}} \\
&+ \overbrace{L_{\mathrm{root}} + L_{\mathrm{contact}} + L_{\omega} + L_{\mathrm{cam}} + L_{\mathrm{fs}}}^{\text{trajectory reconstruction}}.
\end{aligned}
$$

Specifically, we supervise WHAM directly on the SMPL parameters and vertices when the labels $\theta^{*}$, $\beta^{*}$, and $V^{*}$ are available:

$$
\begin{aligned}
L_{\mathrm{smpl}} &= \sum_{t=0}^{T} \left( \lambda_{\theta} \|\theta^{(t)} - \theta^{*(t)}\|_{2} + \lambda_{\beta} \|\beta^{(t)} - \beta^{*(t)}\|_{2} \right), \\
L_{\mathrm{verts}} &= \sum_{t=0}^{T} \lambda_{\mathrm{verts}} \|V^{(t)} - V^{*(t)}\|_{1}.
\end{aligned}
$$

When the 3D keypoint annotations $(x_{\mathrm{3D}}^{*})$ are available, we apply the standard MSE loss to the 3D keypoints regressed directly from the Motion Encoder $(x_{\mathrm{3D}})$ and to those extracted from the predicted SMPL mesh $(\hat{x}_{\mathrm{3D}})$:

$$
L_{\mathrm{3D}} = \sum_{t=0}^{T} \lambda_{\mathrm{3D}} \left( \|x_{\mathrm{3D}}^{(t)} - x_{\mathrm{3D}}^{*(t)}\|_{2} + \|\hat{x}_{\mathrm{3D}}^{(t)} - x_{\mathrm{3D}}^{*(t)}\|_{2} \right).
$$

Since WHAM reconstructs the full-body mesh in a cascaded manner, we use a cascade loss $L_{\mathrm{casc}}$ to penalize the difference between $x_{\mathrm{3D}}$ and $\hat{x}_{\mathrm{3D}}$:

$$
L_{\mathrm{casc}} = \sum_{t=0}^{T} \lambda_{\mathrm{casc}} \|x_{\mathrm{3D}}^{(t)} - \hat{x}_{\mathrm{3D}}^{(t)}\|_{2}.
$$
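To make the roles of the two keypoint predictions concrete, the sketch below evaluates the 3D keypoint loss and the cascade loss on toy data. It implements the norms as written, taking each norm over all keypoint coordinates of a frame; whether the released code squares or averages these terms is not stated here, so treat this as an illustration only. The function names and the default weights of 1 are hypothetical.

```python
import math

def frame_norm(a, b):
    """L2 norm of the difference between two lists of (x, y, z) keypoints."""
    return math.sqrt(sum((p - q) ** 2 for ka, kb in zip(a, b) for p, q in zip(ka, kb)))

def loss_3d(x_enc, x_smpl, x_gt, lam=1.0):
    """Both keypoint predictions are compared with the ground truth, frame by frame."""
    return sum(lam * (frame_norm(e, g) + frame_norm(s, g))
               for e, s, g in zip(x_enc, x_smpl, x_gt))

def loss_casc(x_enc, x_smpl, lam=1.0):
    """Consistency between encoder keypoints and keypoints from the SMPL mesh."""
    return sum(lam * frame_norm(e, s) for e, s in zip(x_enc, x_smpl))

# One frame with a single keypoint: the encoder is off by 5 units, the mesh is exact.
x_gt = [[(0.0, 0.0, 0.0)]]
x_enc = [[(3.0, 4.0, 0.0)]]
x_smpl = [[(0.0, 0.0, 0.0)]]
```

With these inputs both `loss_3d(x_enc, x_smpl, x_gt)` and `loss_casc(x_enc, x_smpl)` evaluate to 5.0, since only the encoder prediction deviates.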

For the 2D loss, we follow CLIFF [25] and compute a full-perspective 2D reprojection loss when the label is available, while a weak-perspective 2D loss is applied to the InstaVariety dataset because its full-frame images are not available:

$$
L_{\mathrm{2D}} = \sum_{t=0}^{T} \lambda_{\mathrm{2D}} \left\| \pi\left(\hat{x}_{\mathrm{3D}}^{(t)}\right) - x_{\mathrm{2D}}^{*(t)} \right\|_{2},
$$

where $\pi(\cdot)$ denotes the operation of perspective projection.
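The reprojection term can be illustrated with a plain pinhole camera. This sketch assumes that the perspective projection is a pinhole model with focal length `f` and principal point `(cx, cy)`, all hypothetical parameters, and it does not reproduce the full-frame handling of CLIFF [25] or the weak-perspective variant.

```python
import math

def project(point, f, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) with z > 0."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)

def loss_2d(x_smpl_3d, x_gt_2d, f, cx, cy, lam=1.0):
    """Sum over frames of the L2 norm between projected and annotated keypoints."""
    total = 0.0
    for frame_3d, frame_2d in zip(x_smpl_3d, x_gt_2d):
        sq = 0.0
        for p, (u_gt, v_gt) in zip(frame_3d, frame_2d):
            u, v = project(p, f, cx, cy)
            sq += (u - u_gt) ** 2 + (v - v_gt) ** 2
        total += lam * math.sqrt(sq)
    return total
```

For example, the point (1, 2, 4) with `f=100`, `cx=50`, `cy=60` projects to (75, 110), so an annotation at (72, 106) contributes an error of 5 pixels.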

We supervise the root trajectory and foot contact probability directly using an MSE loss when the ground truth data is available:

$$
\begin{aligned}
L_{\mathrm{root}} ={}& \sum_{t=0}^{T} \lambda_{\Gamma} \left( \|\Gamma_{0}^{(t)} - \Gamma^{*(t)}\|_{2} + \|\Gamma^{(t)} - \Gamma^{*(t)}\|_{2} \right) \\
&+ \sum_{t=0}^{T} \lambda_{v} \left( \|v_{0}^{(t)} - v^{*(t)}\|_{2} + \|v^{(t)} - v^{*(t)}\|_{2} \right), \\
L_{\mathrm{contact}} ={}& \sum_{t=0}^{T} \lambda_{p} \|p^{(t)} - p^{*(t)}\|_{2}.
\end{aligned}
$$

The camera's pose can be expressed using the predicted root orientation in world and camera coordinates:

$$
R^{(t)} = \Gamma_{c}^{(t)} \left(\Gamma^{(t)}\right)^{\top},
$$

where $\Gamma_{c}$ is the predicted root orientation in the camera coordinate system. As a result, we can construct novel losses based on camera pose and angular velocity:

$$
\begin{aligned}
L_{\omega} &= \sum_{t=0}^{T} \lambda_{\omega} \|\omega^{(t)} - \omega^{*(t)}\|_{2}, \\
L_{\mathrm{cam}} &= \sum_{t=0}^{T} \lambda_{\mathrm{cam}} \|R^{(t)} - R^{*(t)}\|_{2},
\end{aligned}
$$

where $\omega$ denotes the camera angular velocities reconstructed from $R$.
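The relation between the two root orientations and the camera rotation can be checked numerically. The sketch below composes the camera rotation from a root orientation in camera and world coordinates and then measures the rotation angle between consecutive camera rotations. How the angular velocity is parameterized is not specified in this section, so the per-frame rotation angle used here is an assumption chosen for illustration, and all helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_y(deg):
    """Rotation about the y-axis, used only to build test orientations."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def camera_rotation(gamma_cam, gamma_world):
    """R = Gamma_c Gamma^T, so that R applied to Gamma gives Gamma_c."""
    return matmul(gamma_cam, transpose(gamma_world))

def relative_angle(r_prev, r_next):
    """Angle (radians) of the rotation taking r_prev to r_next over one frame."""
    rel = matmul(r_next, transpose(r_prev))
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))

def max_abs_diff(a, b):
    """Largest element-wise difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))
```

For a root that is rotated by 30 degrees in the world and 50 degrees in the camera frame about the same axis, `camera_rotation` returns a 20 degree rotation about that axis.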

Finally, we add a foot sliding loss using the foot velocity during contact:

$$
L_{\mathrm{fs}} = \sum_{t=0}^{T} \lambda_{\mathrm{fs}} \|p^{*(t)} \odot v_{f}^{(t)}\|_{2}.
$$

Here, $v_{f}^{(t)}$ is the world-coordinate foot velocity, and $\odot$ denotes the masking operation of foot contact based on the ground truth contact probability $p^{*(t)}$.
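The masking in the foot sliding loss can be read as weighting each contact point's velocity by its ground-truth contact probability, so that only points in contact are penalized for moving. The sketch below follows that reading. Taking the norm jointly over all contact points of a frame is an assumption, and the function name and default weight are hypothetical.

```python
import math

def loss_foot_sliding(contact, foot_vel, lam=1.0):
    """contact[t][j]: ground-truth contact probability of contact point j at frame t.
    foot_vel[t][j]: world-frame velocity (vx, vy, vz) of the same point."""
    total = 0.0
    for p_t, v_t in zip(contact, foot_vel):
        # Scale each velocity by its contact probability, then take the L2 norm.
        sq = sum((p * c) ** 2 for p, v in zip(p_t, v_t) for c in v)
        total += lam * math.sqrt(sq)
    return total

# One frame, two contact points: the first is in contact and slides, the second is in the air.
contact = [[1.0, 0.0]]
foot_vel = [[(3.0, 4.0, 0.0), (10.0, 10.0, 10.0)]]
```

Only the sliding point in contact contributes, so `loss_foot_sliding(contact, foot_vel)` is 5.0 regardless of how fast the airborne point moves.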

Appendix D Neural-network initialization

Uni-directional RNNs introduce the challenge of differing learning objectives between the initial frames and subsequent ones due to the initialization state. Specifically, in traditional RNNs, the initial hidden state is typically set to zeros, so the first frame relies primarily on the input signal. In contrast, the subsequent frames are trained to capitalize on both the input signal and information transferred from the past. To resolve this disparity in learning objectives, we use a neural initialization network, as proposed by [57], to predict $h_{0,E}$ and $h_{0,D}$ from the pose of the 0-th frame, instead of using zero-padding. During training, we use pseudo-ground-truth 3D poses [35] for the video datasets that do not have SMPL parameter annotations [11, 33, 15]. Note that we do not supervise our network on the pseudo labels. At test time, we use the pose and shape predicted by a single-frame regressor as the initial state.


Figure 7: Qualitative comparison between the full WHAM and WHAM without contact-aware trajectory refinement (w/o traj. ref.).

Appendix E Qualitative evaluation of contact estimation

In the main paper, we show the quantitative effect of the contact-aware trajectory refinement unit. To supplement this with more intuitive visualization, we provide qualitative results in Fig. 7. As shown in the figure, WHAM without the trajectory refinement unit induces significant foot sliding during the contact. Conversely, WHAM reconstructs a more feasible trajectory based on its contact estimation.

Appendix F Run-time cost

While the core WHAM network presented here runs at 200 fps, it relies on input from several other methods. Here we compute the run time of each module required by our framework on the EMDB dataset [16]. The inference speed of all methods was computed on a single A100 GPU. We exclude SLAM from this analysis because it is not needed when gyroscope data is available (though real-time SLAM methods exist). As shown in Table 5, our full method, with pre-processing steps, runs at around 9 fps with online inference (i.e., a batch size of 1 and no lag), and around 50 fps when run in batch mode (with resulting lag). We compare the core runtime of WHAM with SLAHMR [55], excluding bounding box detection, person identification, and keypoint detection, for which there are real-time solutions. In this condition, WHAM takes 5 seconds (202 fps) for 1000 frames. Specifically, WHAM takes 4.3 seconds (237 fps) for image feature extraction and 0.7 seconds (1431 fps) to regress the motion and global trajectory. This is significantly faster than SLAHMR, which takes 260 minutes (< 0.1 fps) per 1000 frames.

Runtime: fps (ms per frame)
Methods                     batch size = 1      batch size = 64
Bounding box detection      70 (14.3 ms)        265 (3.8 ms)
Bounding box tracking       7189 (0.1 ms)       7189 (0.1 ms)
2D keypoints detection      12.1 (82.6 ms)      88 (11.4 ms)
Image feature extraction    66 (15.2 ms)        237 (4.3 ms)
Rest of the framework       926 (1.1 ms)        1431 (0.7 ms)
Total                       8.8 (113.3 ms)      49.3 (20.3 ms)
Table 5: Per-frame computation time (running time) of each module in WHAM. We present this both as frames per second (fps) and milliseconds (ms).
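The totals in Table 5 follow from adding the per-frame times of the modules, which run one after another, and converting the sum to frames per second. The short check below reproduces that arithmetic from the per-module times in the table; the dictionary and function names are hypothetical.

```python
def fps_from_ms(ms_per_frame):
    """Convert a per-frame time in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame

def pipeline_total(ms_per_module):
    """Sequential modules add their per-frame times; fps follows from the sum."""
    total_ms = sum(ms_per_module.values())
    return total_ms, fps_from_ms(total_ms)

# Per-frame times (ms) copied from Table 5.
BATCH_1 = {"bbox detection": 14.3, "bbox tracking": 0.1, "2D keypoints": 82.6,
           "image features": 15.2, "rest": 1.1}
BATCH_64 = {"bbox detection": 3.8, "bbox tracking": 0.1, "2D keypoints": 11.4,
            "image features": 4.3, "rest": 0.7}

for name, table in (("batch size 1", BATCH_1), ("batch size 64", BATCH_64)):
    total_ms, fps = pipeline_total(table)
    print(f"{name}: {total_ms:.1f} ms per frame, {fps:.1f} fps")
```

The sums are 113.3 ms and 20.3 ms per frame, matching the Total row. At batch size 1, 2D keypoint detection accounts for most of the per-frame time.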

Figure 8: Failure cases of WHAM in global motion estimation.

Appendix G Discussion of limitations

In Fig. 8, we show example cases in which WHAM fails to capture the global motion of the person. Since the datasets we used to train WHAM on global trajectory estimation do not contain people riding skateboards or bicycles, WHAM does not capture the global trajectory in these scenarios. Furthermore, our contact estimation applies only to the feet; thus, WHAM fails to capture ground contact of other body parts, resulting in physically infeasible body support (floating hands) or sliding of the contact points. More information on human-object interactions could be used to resolve this issue. While we employ random masking as part of our data synthesis process, our data generation approach mainly assumes that the full body is within the field of view. This can be addressed with additional augmentation during training (cf. [18, 26]).