Paper   |   Code (Coming Soon)
DGS-LRM Overview: Our proposed Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM) takes posed monocular videos as input and predicts deformable 3D Gaussians in a single feedforward pass. On the right, we render novel views from the predicted deformable Gaussians and sample 2D trajectories from their aggregated 3D scene flow.
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can readily adapt for long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.
@article{lin2025dgslrm,
  title={DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos},
  author={Lin, Chieh Hubert and Lv, Zhaoyang and Wu, Songyin and Xu, Zhen and Nguyen-Phuoc, Thu and Tseng, Hung-Yu and Straub, Julian and Khan, Numair and Xiao, Lei and Yang, Ming-Hsuan and Ren, Yuheng and Newcombe, Richard and Dong, Zhao and Li, Zhengqin},
  journal={arXiv preprint arXiv:2506.09997},
  year={2025}
}
We first concatenate the multi-view video frames with time-aware Plücker ray embeddings and tokenize them using a spatial-temporal tokenizer. The transformer then takes the sampled time tokens as input and predicts per-pixel deformable Gaussians with 3D scene flow. During training, we render multi-view synthetic videos using Kubric and draw dual-view ground truth at the same timestamp for each sample, rendering images, depth, and scene flow.
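To make the input encoding concrete, here is a minimal sketch of a time-aware Plücker ray map, assuming rays are derived from per-frame intrinsics `K` and a camera-to-world pose `c2w` and the timestamp is appended as one normalized channel; the function name and exact channel layout are our assumptions, not the paper's.

```python
# Minimal sketch (not the authors' code) of a time-aware Plücker ray embedding:
# per pixel we compute the ray direction d and moment o x d from the camera pose,
# then append a normalized timestamp channel before concatenating with the RGB frame.
import torch

def plucker_time_embedding(K, c2w, H, W, t, num_frames):
    """Return a (H, W, 7) map: 3 ray directions, 3 ray moments, 1 time channel."""
    device = K.device
    # Pixel grid at pixel centers.
    v, u = torch.meshgrid(
        torch.arange(H, device=device) + 0.5,
        torch.arange(W, device=device) + 0.5,
        indexing="ij",
    )
    # Back-project pixels to camera-space ray directions.
    dirs_cam = torch.stack(
        [(u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1], torch.ones_like(u)], dim=-1
    )
    # Rotate into world space and normalize.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)          # Plücker moment o x d
    time = torch.full((H, W, 1), t / max(num_frames - 1, 1), device=device)
    return torch.cat([dirs, moment, time], dim=-1)

# Example: one 256x256 frame at t=3 of a 24-frame clip.
K = torch.tensor([[200.0, 0, 128], [0, 200.0, 128], [0, 0, 1]])
c2w = torch.eye(4)
emb = plucker_time_embedding(K, c2w, 256, 256, t=3, num_frames=24)
print(emb.shape)  # torch.Size([256, 256, 7])
```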
We introduce per-pixel deformable 3D Gaussian splats that model temporal deformation through 3D translation vectors. Each pixel contains a Gaussian splat with depth, RGB colors, rotation, scale, opacity, and deformation vectors across time. This representation is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking by providing physically grounded 3D scene flow that can be aggregated into coherent trajectories.
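A minimal sketch of this per-pixel representation, assuming attributes are stored as flat tensors with one Gaussian per input pixel and the only time-varying quantity is a per-timestep 3D translation; the container and helper names below are illustrative, not the paper's API.

```python
# Minimal sketch: one Gaussian per input pixel, deformed over time by a
# per-timestep 3D translation (the predicted scene flow). Tensor layouts and
# helper names are assumptions for illustration.
from dataclasses import dataclass
import torch

@dataclass
class DeformableGaussians:
    means: torch.Tensor        # (N, 3)  canonical 3D centers (unprojected from per-pixel depth)
    rgb: torch.Tensor          # (N, 3)  colors
    rotation: torch.Tensor     # (N, 4)  unit quaternions
    scale: torch.Tensor        # (N, 3)  per-axis scales
    opacity: torch.Tensor      # (N, 1)
    deformation: torch.Tensor  # (N, T, 3) translation of each Gaussian at each timestep

    def deform_to(self, t: int) -> torch.Tensor:
        """Gaussian centers at timestep t: canonical means plus the predicted translation."""
        return self.means + self.deformation[:, t]

    def trajectory(self, idx: torch.Tensor) -> torch.Tensor:
        """Long-range 3D tracks for selected Gaussians, shape (len(idx), T, 3)."""
        return self.means[idx, None] + self.deformation[idx]
```

Rendering a novel view at timestep `t` would splat `deform_to(t)` together with the static attributes, while `trajectory` aggregates the per-timestep translations of selected Gaussians into long-range 3D tracks.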
Instead of processing each frame individually, we use spatial-temporal tokenization that processes video cubes, significantly reducing memory consumption and enabling training at scale. This approach achieves a 4× memory reduction compared to naive frame-by-frame processing while maintaining temporal coherence through joint spatial-temporal feature learning in the transformer architecture.
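A rough sketch of such cube-based tokenization, assuming non-overlapping cubes of size 4×8×8 over (time, height, width); with a temporal patch size of 4 this yields roughly 4× fewer tokens than per-frame patchification. The patch sizes and flattening order are assumptions for illustration.

```python
# Minimal sketch of spatio-temporal "video cube" tokenization.
import torch

def tokenize_video(video: torch.Tensor, pt: int = 4, ph: int = 8, pw: int = 8):
    """video: (T, C, H, W) -> tokens: (T//pt * H//ph * W//pw, pt*ph*pw*C)."""
    T, C, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    # Group cube indices together, then flatten each cube into one token.
    x = x.permute(0, 3, 5, 1, 4, 6, 2)          # (T/pt, H/ph, W/pw, pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)

clip = torch.randn(24, 10, 256, 256)            # e.g. RGB + 7-dim time-aware Plücker map
tokens = tokenize_video(clip)
print(tokens.shape)                             # (6 * 32 * 32, 4*8*8*10) = (6144, 2560)
```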
We address the fundamental geometry and motion ambiguity in monocular videos by using synchronized multi-view supervision during training. This provides clearer constraints for learning accurate 3D deformation by leveraging ground-truth multi-view videos rendered from different camera positions, enabling the model to distinguish between camera motion and object deformation.
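A hedged sketch of the resulting training signal, assuming the predicted deformable Gaussians are rendered into both the input camera and a held-out camera at the same timestamp; `render` stands in for any differentiable Gaussian splatting renderer, and the loss terms and weights are illustrative rather than the exact objective used in the paper.

```python
# Minimal sketch of the dual-view training signal; `render` is a hypothetical
# differentiable Gaussian splatting renderer, and the weights are made up.
import torch
import torch.nn.functional as F

def dual_view_loss(render, gaussians, cams, gt_rgb, gt_depth, gt_flow, t,
                   w_flow=0.1, w_depth=0.1):
    """cams/gt_*: indexed by view (0 = input view, 1 = held-out view at timestamp t)."""
    loss = 0.0
    for v in range(2):
        pred_rgb, pred_depth = render(gaussians, cams[v], t)     # hypothetical renderer
        loss = loss + F.mse_loss(pred_rgb, gt_rgb[v])            # photometric term
        loss = loss + w_depth * F.l1_loss(pred_depth, gt_depth[v])
    # Dense 3D scene flow supervision on the predicted per-Gaussian deformation.
    loss = loss + w_flow * F.l1_loss(gaussians.deformation[:, t], gt_flow)
    return loss
```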
We create an enhanced synthetic dataset using Kubric with ground-truth multi-view videos and dense 3D scene flow supervision, containing 40,000 scenes with diverse dynamic objects, realistic physics, and complex motion patterns. The dataset includes various object types, materials, lighting conditions, and camera trajectories to ensure robust generalization to real-world scenarios.
DGS-LRM generalizes well to real-world videos, correctly reconstructing thin geometries like bike wheels and challenging scenes with water deformation. The flow visualizations show effective tracking of complex deformations in hand motion and wheel turning while maintaining consistent flow for rigid body movements.
Our DGS-LRM outperforms D3DGS and avoids the warping artifacts present in PGDVS. Both baseline methods fail to recover the correct geometry and repetitive motion patterns, while our method handles these challenging scenarios effectively. Regions with zero covisibility are masked out with black pixels.
DGS-LRM outperforms the LRM-based L4GM and is comparable to optimization-based novel-view synthesis methods, with substantially faster reconstruction time.
The DynMask column indicates whether a dynamic mask is applied so that only the dynamic foreground is evaluated.
Method | Time | DynMask | mPSNR (↑) | mLPIPS (↓) |
---|---|---|---|---|
D3DGS | 1-3 hours | ✗ | 11.92 | 0.66 |
PGDVS | 3 hours | ✗ | 15.88 | 0.34 |
Ours | 0.495 sec | ✗ | 14.89 | 0.42 |
L4GM | 4.8 sec | ✓ | 5.84 | 0.67 |
Ours | 0.495 sec | ✓ | 11.97 | 0.51 |
DGS-LRM demonstrates competitive performance on long-range 3D tracking, achieving results on par with state-of-the-art monocular video tracking methods while providing physically grounded 3D deformation. DGS-LRM shows better performance and consistency in texture-less areas, whereas SpatialTracker predicts tracks inconsistent with the object's moving direction, with several tracking points drifting and colliding around the humanoid's knee.
Method | Frames | PSNR (↑) | ATE-3D (↓) | δ0.1 (↑) | δ0.2 (↑) |
---|---|---|---|---|---|
Chained RAFT3D | 120 | N/A | 0.70 | 0.12 | 0.25 |
Lifted CoTracker | 120 | N/A | 0.77 | 0.51 | 0.64 |
SpatialTracker | 120 | N/A | 0.22 | 0.59 | 0.76 |
Ours (Flow Chaining) | 120 | 27.77 | 0.21 | 0.57 | 0.68 |
Ours (Native) | 24 | 27.77 | 0.11 | 0.72 | 0.84 |
Ours (Flow Chaining + Fully Visible) | 120 | 27.77 | 0.15 | 0.64 | 0.75 |
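The Flow Chaining rows above compose the model's native 24-frame scene flow across overlapping windows into 120-frame tracks. Below is a minimal sketch of one way such chaining could work, assuming a one-frame overlap between windows and a nearest-neighbour hand-off; both choices are our assumptions, not the exact procedure used in the paper.

```python
# Minimal sketch of "flow chaining": longer tracks are built by composing
# per-window 3D tracks across overlapping video windows. The windowing scheme
# and nearest-neighbour hand-off are assumptions for illustration.
import torch

def chain_flows(window_tracks, overlap=1):
    """window_tracks: list of (N, T_w, 3) 3D tracks, one per video window, where the
    last `overlap` frames of window i coincide in time with the first of window i+1.
    Returns chained tracks of shape (N, total_frames, 3)."""
    chained = window_tracks[0]
    for nxt in window_tracks[1:]:
        # Hand off each track to the nearest starting point in the next window,
        # then drop that window's overlapping first frame(s).
        dists = torch.cdist(chained[:, -1], nxt[:, 0])   # (N, N) end-to-start distances
        idx = dists.argmin(dim=1)
        chained = torch.cat([chained, nxt[idx, overlap:]], dim=1)
    return chained

# Example: four 24-frame windows with a 1-frame overlap -> 24 + 3*23 = 93 frames.
windows = [torch.cumsum(torch.randn(128, 24, 3) * 0.01, dim=1) for _ in range(4)]
tracks = chain_flows(windows)
print(tracks.shape)  # torch.Size([128, 93, 3])
```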
We conduct comprehensive ablation studies to validate each proposed component, and each contributes significantly to the final performance: temporal tokenization enables scalable training, the scene flow loss improves deformation quality, reference frames help resolve scale ambiguity, and dual-view supervision provides better geometric constraints.
Method | DyCheck mPSNR (↑) | DyCheck mLPIPS (↓) | Kubric-MV (Test) mPSNR (↑) | Kubric-MV (Test) mLPIPS (↓) |
---|---|---|---|---|
w/o Temporal Tokenization | OOM | OOM | OOM | OOM |
w/o Dual-View Sampling | 14.72 | 0.412 | 25.77 | 0.171 |
w/o Scene Flow Loss | 14.29 | 0.423 | 25.06 | 0.189 |
w/o Reference Frames | 13.91 | 0.438 | 24.69 | 0.186 |
Full Method | 14.67 | 0.412 | 26.05 | 0.161 |
DGS-LRM has a few limitations that can be explored in future work:
- Because the model is trained on temporally continuous video, it cannot handle image frames that are temporally too far apart.
- The predicted scene flow cannot handle extremely large motion in the scene, which may stem from the motion distribution of the physically simulated synthetic dataset; this domain gap can also affect the synthesized novel views.
- The baseline and distribution of the input video significantly influence novel-view rendering quality, since our model relies heavily on triangulation from large camera movements to analyze and extract 3D geometry.