Department Talks
- Guy Tevet
- MPI-IS Tuebingen, N3.022
Character motion synthesis stands as a central challenge in computer animation and graphics. The successful adaptation of diffusion models to the field boosted synthesis quality and provided intuitive controls such as text and music. One of the earliest and most popular methods to do so is the Motion Diffusion Model (MDM) [ICLR 2023]. In this talk, I will review how MDM incorporates domain know-how into the diffusion model and enables intuitive editing capabilities. Then, I will present two recent works, each suggesting a refreshing take on motion diffusion and extending its abilities to new animation tasks. Multi-view Ancestral Sampling (MAS) [CVPR 2024] is an inference-time algorithm that samples 3D animations from 2D keypoint diffusion models. We demonstrate it by generating 3D animations for characters and scenarios that are challenging to record with elaborate motion capture systems, yet ubiquitous in in-the-wild videos, for example horse racing and professional rhythmic gymnastics. Monkey See, Monkey Do (MoMo) [SIGGRAPH Asia 2024] explores the attention space of the motion diffusion model. A careful analysis reveals the roles of the attention keys and queries throughout the generation process. With these findings in hand, we design a training-free method that generates motion following the distinct motifs of one motion while adhering to an outline dictated by another. To conclude the talk, I will give my modest take on the open challenges in the field and our lab's current work attempting to tackle some of them.
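Below is a minimal conceptual sketch of the multi-view ancestral sampling idea: several camera views are denoised jointly, and each step is made consistent by lifting the per-view predictions to a single 3D motion and reprojecting it. The callables `denoise_2d`, `triangulate`, `project`, and `ddpm_step` are hypothetical stand-ins passed in by the caller; this illustrates the per-step consistency idea, not the authors' implementation.

```python
# Conceptual sketch of MAS-style multi-view ancestral sampling (not the
# authors' code). All domain-specific pieces are supplied as callables.
import torch

def mas_sample(denoise_2d, triangulate, project, ddpm_step,
               cameras, n_frames, n_joints, n_steps=100):
    """Sample one 3D motion by jointly denoising 2D keypoints in all views."""
    n_views = len(cameras)
    # Start from pure noise in every view: (views, frames, joints, 2).
    x = torch.randn(n_views, n_frames, n_joints, 2)
    motion_3d = None
    for t in reversed(range(n_steps)):
        # Each view predicts its own clean 2D keypoint sequence.
        x0 = torch.stack([denoise_2d(x[v], t) for v in range(n_views)])
        # Lift the per-view predictions to a single consistent 3D motion ...
        motion_3d = triangulate(x0, cameras)              # (frames, joints, 3)
        # ... and overwrite each view with its reprojection, so all views
        # agree before the next ancestral step.
        x0 = torch.stack([project(motion_3d, cam) for cam in cameras])
        x = ddpm_step(x, x0, t)                           # noise level t -> t-1
    return motion_3d
```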
Organizers: Omid Taheri
- Egor Zakharov
- Max-Planck-Ring 4, N3, Aquarium
Digital humans, or realistic avatars, are a centerpiece of future telepresence and special-effects systems, and human head modeling is one of their main components. These applications, however, place high demands on avatar creation speed, realism, and controllability. This talk will focus on approaches that create controllable and detailed 3D head avatars using data from consumer-grade devices, such as smartphones, in an uncalibrated and unconstrained capture setting. We will discuss leveraging in-the-wild internet videos and synthetic data sources to achieve a high diversity of facial expressions and appearance personalization, including detailed hair modeling. We will also showcase how the resulting human-centric assets can be integrated into virtual environments for real-time telepresence and entertainment applications, illustrating the future of digital communication and gaming.
Organizers: Vanessa Sklyarova
Collaborative Control for Geometry-Conditioned PBR Image Generation
- 26 September 2024 • 14:00—15:00
- Simon Donne
- Virtual, Live stream at Max-Planck-Ring 4, N3, Aquarium
Current diffusion models generate only RGB images. If we want to make progress towards graphics-ready 3D content generation, we need a PBR foundation model, but there is not enough PBR data available to train such a model from scratch. We introduce Collaborative Control, which tightly links a new PBR diffusion model to a pre-trained RGB model. We show that this dual architecture avoids catastrophic forgetting, outputs high-quality PBR images, and generalizes well beyond the PBR training dataset. Furthermore, the frozen base model remains compatible with techniques such as IP-Adapter.
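As a rough illustration of the dual-architecture idea, the sketch below keeps a pretrained RGB branch frozen and trains a parallel PBR branch that receives the RGB features through learned per-block linking layers. The block interfaces and the one-way linking scheme are assumptions for exposition, not the paper's actual design.

```python
# Hedged sketch of a frozen RGB diffusion branch linked to a trainable PBR
# branch. `rgb_blocks` and `pbr_blocks` are hypothetical per-resolution UNet
# blocks with a (features, time_embedding) interface.
import torch.nn as nn

class LinkedDualUNet(nn.Module):
    def __init__(self, rgb_blocks: nn.ModuleList, pbr_blocks: nn.ModuleList, channels):
        super().__init__()
        self.rgb_blocks = rgb_blocks              # pretrained RGB branch
        self.pbr_blocks = pbr_blocks              # newly trained PBR branch
        for p in self.rgb_blocks.parameters():
            p.requires_grad_(False)               # base model stays frozen,
                                                  # so it cannot be "forgotten"
        # One learned 1x1-conv linking layer per block, feeding RGB features
        # into the PBR branch.
        self.links = nn.ModuleList(nn.Conv2d(c, c, kernel_size=1) for c in channels)

    def forward(self, x_rgb, x_pbr, t_emb):
        for rgb_blk, pbr_blk, link in zip(self.rgb_blocks, self.pbr_blocks, self.links):
            x_rgb = rgb_blk(x_rgb, t_emb)                # frozen RGB features
            x_pbr = pbr_blk(x_pbr + link(x_rgb), t_emb)  # PBR branch conditioned on them
        return x_rgb, x_pbr
```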
Organizers: Soubhik Sanyal
Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation
- 26 September 2024 • 14:00—15:00
- Slava Elizarov
- Virtual, Live stream at Max-Planck-Ring 4, N3, Aquarium
In this talk, I will present Geometry Image Diffusion (GIMDiffusion), a novel method designed to generate 3D objects from text prompts efficiently. GIMDiffusion uses geometry images, a 2D representation of 3D shapes, which allows the use of existing image-based architectures instead of complex 3D-aware models. This approach reduces computational costs and simplifies the model design. By incorporating Collaborative Control, the method exploits the rich priors of pretrained Text-to-Image models such as Stable Diffusion, enabling strong generalization even with limited 3D training data. GIMDiffusion produces 3D objects with semantically meaningful, separable parts and internal structures, making them easier to manipulate and edit.
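To illustrate why geometry images make image-based generation convenient for 3D, the sketch below decodes an H x W x 3 geometry image into a triangle mesh by connecting neighboring pixels. This is generic geometry-image decoding, not GIMDiffusion's own pipeline, which additionally handles charts, seams, and appearance.

```python
# Generic decoding of a geometry image (each pixel stores a 3D surface point)
# into a triangle mesh by connecting neighboring pixels.
import numpy as np

def geometry_image_to_mesh(gim: np.ndarray):
    """gim: (H, W, 3) array of 3D surface positions -> (vertices, faces)."""
    h, w, _ = gim.shape
    vertices = gim.reshape(-1, 3)
    idx = np.arange(h * w).reshape(h, w)
    # Two triangles per pixel quad (i, j), (i, j+1), (i+1, j), (i+1, j+1).
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([a, b, c], 1), np.stack([b, d, c], 1)])
    return vertices, faces
```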
Organizers: Soubhik Sanyal
- Panagiotis Filntisis and George Retsinas
- Hybrid
Recent advances in 3D face reconstruction from in-the-wild images and videos have excelled at capturing the overall facial shape associated with a person's identity. However, they often struggle to accurately represent the perceptual realism of facial expressions, especially subtle, extreme, or rarely observed ones. In this talk, we will present two contributions focused on improving 3D facial expression reconstruction. The first part introduces SPECTRE—"Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos"—which offers a method for precise 3D reconstruction of mouth movements linked to speech articulation. This is achieved using a novel "lipread" loss function that enhances perceptual realism. The second part covers SMIRK—"3D Facial Expressions through Analysis-by-Neural-Synthesis"—where we explore how neural rendering techniques can overcome the limitations of differentiable rendering. This approach provides better gradients for 3D reconstruction and allows us to augment training data with diverse expressions for improved generalization. Together, these methods set new standards in accurately reconstructing facial expressions.
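As a hedged illustration of a lipread-style perceptual loss, the sketch below compares mouth-region features, extracted by a pretrained lip-reading network, between the input video and the rendered reconstruction. Here `lipreader` and `crop_mouth` are hypothetical stand-ins and the cosine form is an assumption, not SPECTRE's exact training objective.

```python
# Sketch of a perceptual "lipread" loss over mouth-region features.
import torch.nn.functional as F

def lipread_loss(lipreader, crop_mouth, real_frames, rendered_frames):
    """real_frames, rendered_frames: (B, T, C, H, W) video clips."""
    feat_real = lipreader(crop_mouth(real_frames))      # (B, T, D) features
    feat_rend = lipreader(crop_mouth(rendered_frames))
    # Penalize perceptual differences in mouth articulation, not raw pixels.
    return 1.0 - F.cosine_similarity(feat_real, feat_rend, dim=-1).mean()
```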
Organizers: Victoria Fernandez Abrevaya
- Wanyue Zhang
- Max-Planck-Ring 4, N3, Aquarium
Data-driven virtual 3D character animation has recently witnessed remarkable progress. The realism of virtual characters is a core contributing factor to the quality of computer animations and user experience in immersive applications like games, movies, and VR/AR. However, existing automatic approaches for 3D virtual character motion synthesis supporting scene interactions do not generalize well to new objects outside training distributions, even when trained on extensive motion capture datasets with diverse objects and annotated interactions. In this talk, I will present ROAM, an alternative framework that generalizes to unseen objects of the same category without relying on a large dataset of human-object animations. In addition, I will share some preliminary findings from an ongoing project on hand motion interaction with articulated objects.
Organizers: Nikos Athanasiou
- Thor Besier
- Max Planck Ring 4, N3
Thor Besier leads the musculoskeletal modelling group at the Auckland Bioengineering Institute. He will provide an overview of the institute and some of his team's current research projects, including the Musculoskeletal Atlas Project, harmonising clinical gait analysis data, digital twins for shoulder arthroplasty, and the reproducibility of knee models (the NIH-funded KneeHUB project).
Organizers: Marilyn Keller
- István Sárándi
- Max Planck Ring 4, N3
With the explosive growth of available training data, 3D human pose and shape estimation is on the cusp of a transition to a data-centric paradigm. To leverage data at scale, we need flexible models trainable from heterogeneous data sources. To this end, our latest work, Neural Localizer Fields, seamlessly unifies different human pose and shape-related tasks and datasets through the ability, both at training and test time, to query any point of the human volume and obtain its estimated 3D location from a single RGB image. We achieve this by learning a continuous neural field of body point localizer functions, each of which is a differently parameterized 3D heatmap-based convolutional point localizer. This way, we can naturally exploit differently annotated data sources, including parametric meshes, 2D/3D skeletons, and dense pose, without having to explicitly convert between them, and thereby train large-scale 3D human mesh and skeleton estimation models that outperform the state of the art by a considerable margin on several public benchmarks, including 3DPW, EMDB, and SSP-3D.
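The sketch below shows one way such a field of point localizers could be queried: a small MLP turns a canonical query point into the weights of a per-point heatmap head applied to image features, followed by a soft-argmax (shown in 2D for brevity; the actual model estimates 3D locations). The names, shapes, and weight-generation scheme are assumptions for illustration, not the NLF architecture.

```python
# Hedged sketch of querying a continuous field of point localizers.
import torch
import torch.nn as nn

class PointLocalizerField(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Maps a canonical 3D query point to per-point localizer parameters.
        self.hyper = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                   nn.Linear(128, feat_dim))

    def forward(self, feats, query_points):
        """feats: (B, C, H, W) image features; query_points: (P, 3) canonical."""
        w = self.hyper(query_points)                       # (P, C) localizer weights
        heatmaps = torch.einsum('bchw,pc->bphw', feats, w)
        probs = heatmaps.flatten(2).softmax(-1).view_as(heatmaps)
        # Soft-argmax over the pixel grid -> one location per query point.
        ys, xs = torch.meshgrid(torch.arange(feats.shape[2], dtype=feats.dtype),
                                torch.arange(feats.shape[3], dtype=feats.dtype),
                                indexing='ij')
        x = (probs * xs).sum(dim=(2, 3))
        y = (probs * ys).sum(dim=(2, 3))
        return torch.stack([x, y], dim=-1)                 # (B, P, 2) locations
```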
Organizers: Marilyn Keller
- Jiawei Liu
- Virtual (Zoom)
We live in a dynamic, 4D world. While videos are the most convenient medium for capturing this world, they cannot convey its full 4D nature. Therefore, 4D video reconstruction, free-viewpoint rendering, and high-quality editing and generation offer innovative opportunities for content creation, virtual reality, telepresence, and robotics. Although promising, these tasks pose significant challenges in terms of efficiency, 4D motion and dynamics, temporal and subject consistency, and text-3D/video alignment. In light of these challenges, this talk will discuss our recent progress on representing and learning the 4D dynamic world, from its underlying dynamics to the reconstruction, editing, and generation of 4D dynamic scenes. The talk will motivate discussions about future directions in multi-modal 4D dynamic human-object-scene reconstruction, generation, and perception.
Organizers: Omid Taheri
- Angelica Lim
- Virtual (Zoom)
Science fiction has long promised us interfaces and robots that interact with us as smoothly as humans do - Rosie the Robot from The Jetsons, C-3PO from Star Wars, and Samantha from Her. Today, interactive robots and voice user interfaces are moving us closer to effortless, human-like interactions in the real world. In this talk, I will discuss the opportunities and challenges in finely analyzing, detecting and generating non-verbal communication in context, including gestures, gaze, auditory signals, and facial expressions. Specifically, I will discuss how we might allow robots and virtual agents to understand human social signals (including emotions, mental states, and attitudes) across cultures as well as recognize and generate expressions with controllability, transparency, and diversity in mind.
Organizers: Yao Feng, Michael Black