Real-Time Video Rendering with 4D Gaussian Splatting

Paper at a Glance

The Gist of It: TL;DR

In one sentence: This paper introduces 4D Gaussian Splatting (4D-GS), a method that extends the blazing-fast 3D Gaussian Splatting to dynamic scenes by learning a compact neural deformation field, enabling real-time, high-fidelity rendering of moving objects from novel viewpoints.

Why It Matters: The Big Picture

For years, Neural Radiance Fields (NeRFs) have been the dominant force in novel view synthesis, allowing us to create stunning 3D scenes from a handful of images. However, their Achilles’ heel has always been speed; both training and rendering were notoriously slow. This made them impractical for real-time applications like VR, AR, or interactive editing.

The game changed with the introduction of 3D Gaussian Splatting (3D-GS), which replaced the slow volume rendering of NeRF with a highly efficient, point-based rasterization technique. Suddenly, rendering photorealistic static scenes in real-time became a reality. But what about dynamic scenes—videos of people moving, objects interacting, or scenes changing over time?

The naive approach of training a separate 3D-GS model for every single video frame is a non-starter. It would lead to astronomical storage costs and completely ignore the temporal consistency between frames. This is the exact problem that “4D Gaussian Splatting for Real-Time Dynamic Scene Rendering” sets out to solve: how can we achieve the speed and quality of 3D-GS for dynamic scenes in a compact and efficient way?

The Core Idea: How It Works

The authors’ key insight is to represent a dynamic scene not as a collection of independent 3D snapshots, but as a single canonical 3D scene that deforms over time. Instead of storing billions of points for a video, they store one set of points and a small neural network that knows how to move them.

1. The Problem They’re Solving

Directly extending 3D-GS to dynamic scenes leads to a memory explosion. If a scene is represented by N Gaussians and has T timestamps, a per-frame model would require storage proportional to N × T. The challenge is to model the 4D scene (3D space + time) in a way that scales efficiently with the length of the video.
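
A quick back-of-envelope calculation makes the scaling concrete. The numbers below are illustrative assumptions, not figures from the paper: a 100k-Gaussian scene, a 300-frame video, and the roughly 59 float parameters per Gaussian in the usual 3D-GS parameterization.

```python
# Naive per-frame 3D-GS copies vs. a single canonical set (illustrative numbers).
n_gaussians = 100_000          # assumed scene size
n_frames = 300                 # assumed video length (~10 s at 30 FPS)
floats_per_gaussian = 59       # 3 pos + 4 rot + 3 scale + 1 opacity + 48 SH coeffs
bytes_per_float = 4

per_frame_model = n_gaussians * floats_per_gaussian * bytes_per_float * n_frames
canonical_model = n_gaussians * floats_per_gaussian * bytes_per_float  # + a small deformation net

print(f"per-frame copies:     {per_frame_model / 1e9:.1f} GB")   # ~7.1 GB
print(f"single canonical set: {canonical_model / 1e6:.1f} MB")   # ~23.6 MB
```

The point is simply that per-frame storage grows linearly with T, while a shared canonical set does not.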

2. The Key Innovation

The central idea is 4D Gaussian Splatting (4D-GS). This approach maintains only one canonical set of 3D Gaussians. To render the scene at a specific time t, a learned Gaussian deformation field—a small and efficient neural network—predicts the necessary transformations (translation, rotation, and scaling) to apply to each canonical Gaussian. This elegantly separates the static geometry from the dynamic motion.
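
In symbols, the idea is roughly the following (a sketch based on the description above; the paper's exact notation and the way rotation deltas are composed may differ):

```latex
% Canonical Gaussians with centers X, rotations r, scales s, opacity \sigma, color C;
% a deformation field F predicts per-Gaussian deltas at time t.
(\Delta\mathcal{X},\, \Delta r,\, \Delta s) = \mathcal{F}(\mathcal{X}, t)
\quad\Longrightarrow\quad
G'(t) = \{\, \mathcal{X} + \Delta\mathcal{X},\; r + \Delta r,\; s + \Delta s,\; \sigma,\; \mathcal{C} \,\}
```

Opacity and color are left untouched, so only the geometric attributes change with time.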

3. The Method, Step-by-Step

As illustrated in Figure 3 of the paper, the process works as follows (a code sketch of steps 2–5 is given after the list):

  1. Canonical Representation: The model starts with a single set of 3D Gaussians, G, that represents the scene in a base state (e.g., initialized from the first video frame). These Gaussians have properties like position, color, opacity, rotation, and scale.

  2. Spatial-Temporal Encoding: To predict how a Gaussian should move at a specific time, the model needs to understand its context in both space and time. Instead of using a bulky 4D voxel grid, 4D-GS employs a clever decomposition strategy inspired by methods like HexPlane. The 4D coordinate (x, y, z, t) is projected onto six 2D planes: (xy, xz, yz) for spatial information and (xt, yt, zt) for temporal information. By sampling features from these efficient 2D grids and combining them, the model gets a rich feature vector describing each Gaussian’s state at a given moment.

  3. Deformation Prediction: This feature vector is passed to a lightweight multi-head MLP decoder. This network has separate “heads,” each specialized for a specific task:

    • Position Head: Predicts the displacement (Δx, Δy, Δz).
    • Rotation Head: Predicts the change in rotation (Δr).
    • Scaling Head: Predicts the change in scale (Δs).
  4. Applying the Deformation: The predicted deformations are simply added to the attributes of the original canonical Gaussians. This creates a new set of deformed Gaussians, G', which accurately represents the scene at time t.

  5. Rendering: Finally, this set of deformed Gaussians G' is rendered using the standard, highly-efficient 3D-GS rasterizer to produce the final image. Because the deformation network is very small, this entire process can be executed in real-time.
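
Here is a minimal PyTorch sketch of steps 2–5. It is illustrative rather than the authors' implementation: the names (`HexPlaneEncoder`, `DeformationHeads`, `deform_gaussians`), the grid resolutions, the feature sizes, the Hadamard fusion of plane features, and the purely additive update of the rotation quaternions are all assumptions made for brevity, and the final call into a 3D-GS rasterizer is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HexPlaneEncoder(nn.Module):
    """Spatial-temporal encoder: six learnable 2D feature planes
    (xy, xz, yz, xt, yt, zt), each queried with bilinear interpolation."""
    def __init__(self, feat_dim=32, spatial_res=64, time_res=25):
        super().__init__()
        # Index pairs into the (x, y, z, t) coordinate vector.
        self.pairs = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
        self.planes = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(
                1, feat_dim,
                time_res if i == 3 else spatial_res,
                time_res if j == 3 else spatial_res))
            for i, j in self.pairs])

    def forward(self, xyzt):                                      # (N, 4), coords in [-1, 1]
        feat = None
        for (i, j), plane in zip(self.pairs, self.planes):
            # grid_sample expects (x, y) = (width index, height index) = (coord j, coord i).
            coords = xyzt[:, [j, i]].view(1, -1, 1, 2)
            f = F.grid_sample(plane, coords, align_corners=True)  # (1, C, N, 1)
            f = f.squeeze(-1).squeeze(0).t()                      # (N, C)
            feat = f if feat is None else feat * f                # Hadamard fusion (one common choice)
        return feat

class DeformationHeads(nn.Module):
    """Lightweight multi-head MLP decoder: one small head per attribute delta."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.head_pos = nn.Linear(hidden, 3)    # (dx, dy, dz)
        self.head_rot = nn.Linear(hidden, 4)    # quaternion delta
        self.head_scale = nn.Linear(hidden, 3)  # per-axis scale delta

    def forward(self, feat):
        h = self.trunk(feat)
        return self.head_pos(h), self.head_rot(h), self.head_scale(h)

def deform_gaussians(xyz, rot, scale, t, encoder, heads):
    """Deform the canonical Gaussians to time t; the result would then be
    passed to the standard 3D-GS rasterizer (omitted here)."""
    t_col = torch.full_like(xyz[:, :1], t)
    feat = encoder(torch.cat([xyz, t_col], dim=1))
    d_xyz, d_rot, d_scale = heads(feat)
    # Deltas are applied additively to the canonical attributes (step 4).
    return xyz + d_xyz, rot + d_rot, scale + d_scale

# Usage with random stand-in Gaussians (a real pipeline would start from a
# 3D-GS point cloud fitted to the scene's base state).
N = 10_000
xyz = torch.rand(N, 3) * 2 - 1                 # canonical centers in [-1, 1]^3
rot = torch.zeros(N, 4); rot[:, 0] = 1.0       # identity quaternions
scale = torch.full((N, 3), 0.01)
encoder, heads = HexPlaneEncoder(), DeformationHeads()
xyz_t, rot_t, scale_t = deform_gaussians(xyz, rot, scale, t=0.5, encoder=encoder, heads=heads)
```

Because the six planes and the tiny MLP are shared by all Gaussians and all timestamps, storage grows with the plane resolution rather than with N × T, which is where the compactness reported below comes from.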

Key Experimental Results

The paper’s results demonstrate a powerful combination of speed, quality, and efficiency.

  • Unprecedented Speed and Quality: As shown in the summary plot (Figure 1) and detailed tables (Table 1), 4D-GS achieves state-of-the-art rendering quality while being orders of magnitude faster than previous dynamic NeRF methods. It hits 82 FPS on synthetic datasets at 800x800 resolution and 30-34 FPS on high-resolution real-world videos, all on a single RTX 3090 GPU.

  • Extreme Compactness: The model is incredibly storage-efficient. For a synthetic dataset, 4D-GS requires only 18 MB of storage, compared to hundreds of MB for competing methods like KPlanes or V4D. This is a direct benefit of storing only one canonical set of Gaussians and a tiny deformation network.

  • Effective Motion Modeling: Visualizations in the paper (Figure 6) show that 4D-GS produces sharp, coherent renderings of complex motions, outperforming other fast methods like TiNeuVox which can appear blurry.

  • Crucial Components: The ablation studies (Table 4) confirm the importance of their design. Removing the spatial-temporal encoder or the specialized heads for rotation and scaling causes a significant drop in rendering quality, proving that each component plays a vital role in accurately modeling complex deformations.

A Critical Look: Strengths & Limitations

Strengths / Contributions

  • Real-Time Dynamic Scene Rendering: This is the headline achievement. 4D-GS is one of the first methods to deliver high-quality, real-time rendering for dynamic scenes, bridging a critical gap left by 3D-GS.
  • Highly Compact and Scalable Representation: The canonical-plus-deformation approach is extremely storage-efficient, making it practical to represent long video sequences without the memory footprint blowing up.
  • State-of-the-Art Performance: The method sets a new standard on the speed-quality-storage Pareto front, outperforming previous SOTA methods across multiple benchmarks.
  • Coherent Motion Learning: By using a spatial-temporal encoder, the deformation field can leverage information from neighboring Gaussians, resulting in smooth and physically plausible motion.

Limitations / Open Questions

  • Large Motions and Topology Changes: The deformation-based model might struggle with very large, fast motions or scenes with significant changes in topology (e.g., a shirt being taken off, a liquid splashing). The canonical representation may not be expressive enough for such scenarios.
  • Scene Scale: The authors acknowledge that the method may not scale well to massive, urban-scale environments due to the computational cost of querying the deformation field for millions or billions of Gaussians.
  • Static vs. Dynamic Separation: The model learns a single deformation field for the entire scene. It doesn’t explicitly distinguish between the static background and the dynamic foreground, which could potentially lead to subtle artifacts or “wobbling” in the static parts of the scene.

Contribution Level: Significant Improvement. This work is a powerful and logical evolution of the Gaussian Splatting framework. It directly addresses the major limitation of 3D-GS (its static nature) with an elegant and highly effective solution. While not a complete paradigm shift, it significantly pushes the boundaries of what’s possible in real-time 3D scene reconstruction and rendering, setting a new benchmark for performance.

Conclusion: Potential Impact

4D Gaussian Splatting represents a major step forward for creating dynamic 3D content. Its real-time performance and high fidelity open the door to a new wave of applications in virtual and augmented reality, digital twins, telepresence, and next-generation visual effects. Researchers can now iterate on dynamic scene models much faster, and practitioners have a powerful tool for creating interactive 4D experiences. The future of this work will likely involve tackling larger scenes, more complex topological changes, and enabling real-time capture and streaming of dynamic environments.

  • Title: Real-Time Video Rendering with 4D Gaussian Splatting
  • Author: Jellyfish
  • Created at: 2025-10-06 15:05:17
  • Updated at: 2025-10-06 09:24:26
  • Link: https://makepaperseasy.com/posts/20251006150517.html
  • License: This work is licensed under CC BY-NC-SA 4.0.