![[Deformable 3D Gaussians] Bringing 3D Gaussian Splatting to Life for Real-Time Dynamic Scenes](https://i.imgur.com/kGNxuO5.png)
[Deformable 3D Gaussians] Bringing 3D Gaussian Splatting to Life for Real-Time Dynamic Scenes

Paper at a Glance
- Paper Title: Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction
- Authors: Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, Xiaogang Jin
- Affiliation: Zhejiang University, ByteDance Inc.
- Published in: Conference on Computer Vision and Pattern Recognition (CVPR) 2024
- Link to Paper: https://openaccess.thecvf.com//content/CVPR2024/html/Yang_Deformable_3D_Gaussians_for_High-Fidelity_Monocular_Dynamic_Scene_Reconstruction_CVPR_2024_paper.html
- Project Page: https://github.com/ingra14m/Deformable-3D-Gaussians
The Gist of It: TL;DR
In one sentence: This paper adapts the ultra-fast 3D Gaussian Splatting technique for dynamic, moving scenes by learning a “canonical” set of static Gaussians and a neural network that deforms them over time, achieving real-time, high-fidelity rendering from a single moving camera.
Why It Matters: The Big Picture
For years, Neural Radiance Fields (NeRFs) have been the gold standard for creating stunningly photorealistic 3D scenes from images. However, they have a major drawback: they are slow. Rendering a single frame can take seconds or even minutes, making them impractical for real-time applications like VR/AR or content creation.
In 2023, 3D Gaussian Splatting (3D-GS) arrived and changed the game for static scenes. By representing a scene as a collection of millions of tiny, colored 3D ellipsoids (Gaussians), 3D-GS achieved NeRF-level quality but with real-time rendering speeds (>100 FPS). The problem? The real world isn’t static. People move, objects change, and scenes evolve. While dynamic NeRF variants existed, they were still slow and often struggled to capture fine details.
This paper bridges that critical gap. It asks: Can we get the speed and quality of 3D Gaussian Splatting but for dynamic scenes captured from a single monocular video? The answer is yes, and their method, Deformable 3D Gaussians, sets a new standard for performance and efficiency in dynamic scene reconstruction.
The Core Idea: How It Works
1. The Problem They’re Solving
The original 3D-GS method is designed for a single, unchanging scene. You can’t just run it on each frame of a video; that would be computationally expensive and would result in a jerky, inconsistent output with no way to smoothly interpolate between moments in time. The core challenge is to represent a scene that changes over time within the efficient 3D-GS framework.
2. The Key Innovation
The authors’ key insight is to decouple the scene’s static structure from its dynamic motion. Instead of learning a completely new set of Gaussians for every single frame, they propose learning just one set of Gaussians in a static “canonical space”—think of this as a base pose or a template of the scene.
Then, a separate, lightweight neural network acts as a deformation field. This network’s only job is to learn the motion. For any given point in time, it predicts how each canonical Gaussian should move, rotate, and scale to match its appearance in the video at that instant.
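To make this concrete, here is a minimal PyTorch-style sketch of such a deformation network. This is not the authors' released code: the class name, layer widths, and positional-encoding frequencies are illustrative assumptions based on the description above (a NeRF-style frequency encoding of position and time, feeding a small MLP with three output heads).

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    """NeRF-style sin/cos encoding at multiple frequencies (illustrative assumption)."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    scaled = x[..., None] * freqs                # (..., dims, num_freqs)
    enc = torch.cat([scaled.sin(), scaled.cos()], dim=-1)
    return enc.flatten(start_dim=-2)             # (..., dims * 2 * num_freqs)

class DeformNetwork(nn.Module):
    """Maps (canonical Gaussian position, time) -> (delta_x, delta_r, delta_s)."""
    def __init__(self, pos_freqs=10, time_freqs=6, width=256):
        super().__init__()
        in_dim = 3 * 2 * pos_freqs + 1 * 2 * time_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.head_x = nn.Linear(width, 3)   # position offset
        self.head_r = nn.Linear(width, 4)   # rotation (quaternion) offset
        self.head_s = nn.Linear(width, 3)   # scale offset
        self.pos_freqs, self.time_freqs = pos_freqs, time_freqs

    def forward(self, xyz, t):
        # xyz: (N, 3) canonical Gaussian centers; t: (N, 1) normalized time
        h = torch.cat([positional_encoding(xyz, self.pos_freqs),
                       positional_encoding(t, self.time_freqs)], dim=-1)
        h = self.mlp(h)
        return self.head_x(h), self.head_r(h), self.head_s(h)
```

Because the network conditions only on position and time, the same canonical Gaussians can be queried at any timestamp, which is what makes smooth temporal interpolation possible.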
3. The Method, Step-by-Step
The entire process, elegantly illustrated in Figure 2 of the paper, can be broken down into a few key steps:
- Initialization: The process starts by running a standard Structure-from-Motion (SfM) algorithm on the input video to get camera poses and a sparse point cloud. This point cloud provides the initial positions for the 3D Gaussians in the canonical space.
- Deformation: A simple Multi-Layer Perceptron (MLP) serves as the deformation network. It takes two inputs: the position of a canonical Gaussian and a specific time t. It then outputs three small offsets: δx (for position), δr (for rotation), and δs (for scale).
- Transformation: These predicted offsets are added to the parameters of the static, canonical Gaussians. This transforms them into their correct state (position, orientation, and size) for that specific moment in time t.
- Rendering and Optimization: The newly deformed Gaussians are fed into the standard, highly efficient 3D Gaussian rasterizer to produce an image. The difference between this rendered image and the actual video frame is calculated as a loss. This loss is then backpropagated to jointly update both the properties of the canonical Gaussians (improving the scene’s base appearance) and the weights of the deformation MLP (improving its ability to model motion).
- Annealing Smooth Training (AST): Real-world camera poses can be slightly inaccurate, causing jittery artifacts when interpolating between frames. To fix this, the authors introduce a clever trick. During the early stages of training, they add a tiny amount of random noise to the time input t. This forces the deformation network to learn a smoother, more generalized motion, preventing it from overfitting to small pose errors. This noise is gradually reduced (“annealed”) as training progresses, allowing the model to capture finer details in the later stages. A simplified sketch of this training loop, including the annealed time noise, follows this list.
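To tie the last few steps together, here is a hedged sketch of one joint optimization step. The `gaussians`, `render`, and `frame` objects are placeholders for the 3D-GS scene representation, its differentiable rasterizer, and one training view; the AST noise schedule (linear annealing over an assumed iteration budget) illustrates the idea rather than reproducing the paper's exact recipe.

```python
import torch

def train_step(gaussians, deform_net, render, frame, iteration,
               ast_iters=20_000, noise_scale=0.1):
    """One joint optimization step for the canonical Gaussians + deformation MLP.
    `gaussians`, `render`, and `frame` stand in for a 3D-GS scene, its
    differentiable rasterizer, and one training view (image, camera, time)."""
    image_gt, camera, t = frame               # t: scalar tensor, normalized time

    # Annealing Smooth Training (assumed schedule): perturb the time input
    # early in training, then linearly anneal the noise toward zero.
    if iteration < ast_iters:
        anneal = 1.0 - iteration / ast_iters
        t = t + noise_scale * anneal * torch.randn_like(t)

    # Deformation: predict per-Gaussian offsets at time t.
    xyz = gaussians.xyz                       # (N, 3) canonical centers
    d_x, d_r, d_s = deform_net(xyz, t.expand(xyz.shape[0], 1))

    # Transformation: apply offsets to the canonical parameters.
    deformed_xyz = xyz + d_x
    deformed_rot = gaussians.rotation + d_r   # quaternions, normalized inside render
    deformed_scale = gaussians.scaling + d_s

    # Rendering and loss: rasterize the deformed Gaussians and compare to GT.
    image = render(deformed_xyz, deformed_rot, deformed_scale,
                   gaussians.opacity, gaussians.features, camera)
    loss = torch.abs(image - image_gt).mean() # L1 term; 3D-GS also adds a D-SSIM term

    # Backprop updates both the canonical Gaussians and the deformation MLP
    # (the optimizer step would follow outside this function).
    loss.backward()
    return loss
```

The key point the sketch highlights is that a single loss drives two sets of parameters at once: the explicit canonical Gaussians and the weights of the deformation MLP.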
Key Experimental Results
The paper demonstrates a significant leap in performance over previous state-of-the-art methods for dynamic scene rendering.
- Unmatched Quality on Synthetic Data: On the standard D-NeRF synthetic dataset, Deformable 3D Gaussians consistently outperforms methods like D-NeRF, TiNeuVox, and Tensor4D across all metrics (PSNR, SSIM, LPIPS), as shown in Table 1. The qualitative comparisons in Figure 3 are striking, revealing crisp details on complex structures like skeletons and hands, while other methods produce blurry or incomplete results.
- Robustness in the Real World: The method also excels on real-world datasets from NeRF-DS and HyperNeRF, which feature imperfect camera poses. The proposed Annealing Smooth Training proves its worth here, leading to temporally smooth and detailed reconstructions (Table 2, Figure 5).
- Real-Time Rendering: The primary advantage of building on 3D-GS is speed. The authors report real-time rendering speeds (over 30 FPS on an NVIDIA RTX 3090), a feat that remains out of reach for most NeRF-based dynamic scene models.
- Accurate Geometry: The depth maps visualized in Figure 6 show that the model learns accurate scene geometry rather than just “painting” colors onto a surface. This confirms that the deformation field is learning true 3D transformations.
A Critical Look: Strengths & Limitations
Strengths / Contributions
- A Powerful Combination: This is the first work to successfully merge the efficiency of explicit 3D Gaussians with the flexibility of an implicit deformation field for monocular dynamic scenes. It sets a new and powerful precedent.
- State-of-the-Art Performance: The method achieves a new state of the art in both rendering quality and speed, significantly outclassing prior methods in side-by-side comparisons.
- Practical and Robust: The Annealing Smooth Training (AST) mechanism is a simple yet highly effective solution for handling the noisy camera poses common in real-world video captures, making the method more practical.
Limitations / Open Questions
- Complex Motion: The paper acknowledges that its evaluations focused on scenes with “moderate motion dynamics.” The simple MLP-based deformation field might not be sufficient to model extremely fast, complex, or topologically changing motions, such as nuanced facial expressions or turbulent fluids.
- Dependence on Good Poses: While AST improves robustness, the method’s quality is still fundamentally tied to the initial camera pose estimation. Severe errors from the SfM step could still cause the reconstruction to fail.
- Scalability Concerns: The model’s complexity and training time scale with the number of Gaussians. Reconstructing extremely large-scale or intricately detailed scenes might become a computational bottleneck.
Contribution Level: Significant Improvement. This paper represents a major step forward for dynamic scene reconstruction. While it doesn’t introduce a fundamentally new paradigm, it masterfully combines the best of recent advances (3D-GS speed, NeRF’s canonical spaces) to solve a critical problem. It provides a practical, high-performance solution that significantly advances the state of the art.
Conclusion: Potential Impact
Deformable 3D Gaussians provides a compelling answer to one of the biggest challenges in neural rendering: how to create high-fidelity, dynamic 3D scenes that can be rendered in real-time. By extending the revolutionary 3D-GS framework, this work unlocks new possibilities for applications in virtual and augmented reality, 3D content creation, and digital twins. Researchers and engineers now have a powerful tool that balances quality, speed, and flexibility, paving the way for more interactive and immersive 3D experiences built from simple video recordings.
- Title: [Deformable 3D Gaussians] Bringing 3D Gaussian Splatting to Life for Real-Time Dynamic Scenes
- Author: Jellyfish
- Created at: 2025-10-11 14:57:58
- Updated at: 2025-10-11 06:17:56
- Link: https://makepaperseasy.com/posts/20251011145758/
- License: This work is licensed under CC BY-NC-SA 4.0.
