DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos

Chieh Hubert Lin1,2,     Zhaoyang Lv1,     Songyin Wu1,3,     Zhen Xu1,     Thu Nguyen-Phuoc1,
Hung-Yu Tseng1,     Julian Straub1,     Numair Khan1,     Lei Xiao1,     Ming-Hsuan Yang1,2,
Yuheng Ren1,     Richard Newcombe1,     Zhao Dong1,     Zhengqin Li1
1Meta,     2UC Merced,     3UC Santa Barbara

DGS-LRM Overview: Our proposed Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM) takes posed monocular videos as input and predicts deformable 3D Gaussians in a single feedforward pass. On the right, we render novel views from the predicted deformable Gaussians and sample 2D trajectories from the aggregated 3D scene flow of deformable 3D Gaussians.


Abstract

We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method that predicts deformable 3D Gaussian splats from a posed monocular video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can be readily adapted to long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.

[Paper]

arXiv

[Codes]

Coming Soon

[Citation]

@article{lin2025dgslrm,
   title={DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos},
   author={Lin, Chieh Hubert and Lv, Zhaoyang and Wu, Songyin and Xu, Zhen and Nguyen-Phuoc, Thu and Tseng, Hung-Yu and Straub, Julian and Khan, Numair and Xiao, Lei and Yang, Ming-Hsuan and Ren, Yuheng and Newcombe, Richard and Dong, Zhao and Li, Zhengqin},
   journal={arXiv preprint arXiv:2506.09997},
   year={2025}
}

Method Overview

We first concatenate the input video frames with time-aware Plücker ray embeddings and tokenize them with a spatial-temporal tokenizer. The transformer then takes these tokens as input and predicts per-pixel deformable Gaussians together with 3D scene flow. During training, we render multi-view synthetic videos using Kubric and, for each sample, draw two ground-truth views at the same timestamp, rendering images, depth, and scene flow for supervision.
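
As a rough Python sketch of the time-aware Plücker ray embedding that conditions the tokenizer, the function below builds per-pixel ray directions and moments from the camera intrinsics and pose and appends a normalized time channel. The exact channel layout and normalization are our assumptions, not the paper's exact formulation.

import torch

def plucker_ray_embedding(K, c2w, H, W, t, num_frames):
    """Per-pixel time-aware Plücker ray embedding (assumed 7-channel layout:
    ray direction d, moment o x d, and a normalized time scalar)."""
    # Pixel grid at pixel centers.
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)   # (H, W, 3)

    # Unproject pixels to camera-space directions, then rotate to world space.
    dirs_cam = pix @ torch.inverse(K).T                      # (H, W, 3)
    dirs_world = dirs_cam @ c2w[:3, :3].T
    d = torch.nn.functional.normalize(dirs_world, dim=-1)

    # Plücker coordinates: direction d and moment m = o x d.
    o = c2w[:3, 3].expand_as(d)
    m = torch.cross(o, d, dim=-1)

    # Append a normalized timestamp channel to make the embedding time-aware.
    time = torch.full((H, W, 1), t / max(num_frames - 1, 1))
    return torch.cat([d, m, time], dim=-1)                   # (H, W, 7)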

Deformable 3D Gaussian Representation

We introduce per-pixel deformable 3D Gaussian splats that model temporal deformation through 3D translation vectors. Each pixel contains a Gaussian splat with depth, RGB colors, rotation, scale, opacity, and deformation vectors across time. This representation is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking by providing physically grounded 3D scene flow that can be aggregated into coherent trajectories.
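
As a rough illustration, the sketch below shows one way this per-pixel representation could be organized in code; the field names, tensor shapes, and the additive deformation applied to the unprojected ray point are our assumptions, not the paper's exact parameterization.

from dataclasses import dataclass
import torch

@dataclass
class DeformableGaussians:
    """Per-pixel deformable 3D Gaussians for a video with T frames and N = H*W pixels."""
    depth: torch.Tensor      # (N, 1)  depth along each input pixel ray
    rgb: torch.Tensor        # (N, 3)  colors
    rotation: torch.Tensor   # (N, 4)  unit quaternions
    scale: torch.Tensor      # (N, 3)  per-axis scales
    opacity: torch.Tensor    # (N, 1)
    deform: torch.Tensor     # (N, T, 3)  per-frame 3D translation (scene flow)

    def centers_at(self, ray_o: torch.Tensor, ray_d: torch.Tensor, t: int) -> torch.Tensor:
        """Gaussian centers at frame t: unproject along the pixel ray,
        then apply the predicted deformation vector for that frame."""
        base = ray_o + self.depth * ray_d    # (N, 3) canonical centers
        return base + self.deform[:, t]      # (N, 3) deformed centers

Because the deformation vectors are plain 3D translations indexed by time, reading them out for a fixed pixel directly yields a 3D trajectory, which is what makes long-range tracking straightforward.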

Temporal Tokenization

Instead of processing each frame individually, we use spatial-temporal tokenization that groups video frames into spatio-temporal cubes, significantly reducing memory consumption and enabling training at scale. This approach achieves a 4× memory reduction compared to naive frame-by-frame processing while maintaining temporal coherence through joint spatial-temporal feature learning in the transformer architecture.
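
A minimal sketch of such cube tokenization is shown below, assuming a temporal patch size of 4 (consistent with the stated 4× reduction) and a spatial patch size of 8; the exact cube dimensions, input channel count, and embedding width are assumptions.

import torch
import torch.nn as nn

class SpatioTemporalTokenizer(nn.Module):
    """Tokenize a video into spatio-temporal cubes with a strided 3D convolution."""
    def __init__(self, in_ch=10, dim=1024, t_patch=4, s_patch=8):
        super().__init__()
        # Kernel == stride: each (t_patch, s_patch, s_patch) cube becomes one token.
        self.proj = nn.Conv3d(in_ch, dim,
                              kernel_size=(t_patch, s_patch, s_patch),
                              stride=(t_patch, s_patch, s_patch))

    def forward(self, video):                 # video: (B, C, T, H, W)
        x = self.proj(video)                  # (B, dim, T/t_patch, H/s_patch, W/s_patch)
        return x.flatten(2).transpose(1, 2)   # (B, num_tokens, dim)

# Compared with per-frame tokenization (t_patch=1), a temporal patch of 4
# yields 4x fewer tokens for the same video.
tok = SpatioTemporalTokenizer()
video = torch.randn(1, 10, 24, 256, 256)      # e.g. RGB + Plücker/time channels
tokens = tok(video)                           # (1, 6 * 32 * 32, 1024) = (1, 6144, 1024)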

Dual-View Supervision

We address the fundamental ambiguity between geometry and motion in monocular videos by using synchronized multi-view supervision during training. Ground-truth videos rendered from different camera positions provide clearer constraints for learning accurate 3D deformation, enabling the model to distinguish camera motion from object deformation.
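
The sketch below illustrates the idea of dual-view supervision, assuming a differentiable renderer render(gaussians, camera, t) that returns rendered color, depth, and scene flow maps for a given camera and timestamp; the concrete loss terms and weights here are placeholders rather than the paper's exact objective.

import torch
import torch.nn.functional as F

def dual_view_loss(gaussians, render, views, flow_weight=0.5, depth_weight=0.5):
    """Supervise predicted deformable Gaussians with two ground-truth views
    rendered at the same timestamp, plus dense depth and 3D scene flow.

    `views` is a list of two dicts with keys: camera, t, image, depth, scene_flow.
    `render` is an (assumed) differentiable splatting function.
    """
    total = 0.0
    for v in views:
        pred_img, pred_depth, pred_flow = render(gaussians, v["camera"], v["t"])
        total = total + F.l1_loss(pred_img, v["image"])
        total = total + depth_weight * F.l1_loss(pred_depth, v["depth"])
        total = total + flow_weight * F.l1_loss(pred_flow, v["scene_flow"])
    return total / len(views)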

Training on Pure Synthetic Data

We create an enhanced synthetic dataset using Kubric with ground-truth multi-view videos and dense 3D scene flow supervision, containing 40,000 scenes with diverse dynamic objects, realistic physics, and complex motion patterns. The dataset includes various object types, materials, lighting conditions, and camera trajectories to ensure robust generalization to real-world scenarios.
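
For concreteness, one plausible way a training sample from such a dataset could be laid out is sketched below; the key names and shapes are illustrative assumptions, not the released data format.

import torch

# One assumed Kubric training sample: a posed input video plus two extra
# ground-truth views at a shared timestamp for dual-view supervision.
T, H, W = 24, 256, 256
sample = {
    "input_rgb":     torch.zeros(T, 3, H, W),   # posed monocular input video
    "input_K":       torch.zeros(T, 3, 3),      # per-frame intrinsics
    "input_c2w":     torch.zeros(T, 4, 4),      # camera-to-world poses
    "gt_rgb":        torch.zeros(2, 3, H, W),   # two held-out views, same timestamp
    "gt_depth":      torch.zeros(2, 1, H, W),   # ground-truth depth for those views
    "gt_scene_flow": torch.zeros(2, 3, H, W),   # dense 3D scene flow supervision
    "gt_K":          torch.zeros(2, 3, 3),
    "gt_c2w":        torch.zeros(2, 4, 4),
}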


Qualitative: DAVIS In-the-Wild Videos

DGS-LRM generalizes well to real-world videos, correctly reconstructing thin geometries like bike wheels and challenging scenes with water deformation. The flow visualizations show effective tracking of complex deformations in hand motion and wheel turning while maintaining consistent flow for rigid body movements.

Video panels: Input Video · Reconstruction Input Views · Reconstruction Novel Views · 3D Track Input Views · 3D Track Novel Views

Qualitative: DyCheck In-the-Wild Videos

Our DGS-LRM outperforms D3DGS and avoids the warping artifacts present in PGDVS. Both baselines fail to recover correct geometry and repetitive motion patterns, while our method handles these challenging scenarios effectively. Regions with zero covisibility are masked out with black pixels.

Monocular Dynamic View Synthesis on DyCheck

DGS-LRM outperforms the LRM-based L4GM and is comparable to optimization-based novel view synthesis methods, with substantially faster reconstruction time.
DynMask indicates that a dynamic mask is applied so that only the foreground is evaluated.

Method   Time        DynMask   mPSNR (↑)   mLPIPS (↓)
D3DGS    1-3 hours   ✗         11.92       0.66
PGDVS    3 hours     ✗         15.88       0.34
Ours     0.495 sec   ✗         14.89       0.42
L4GM     4.8 sec     ✓          5.84       0.67
Ours     0.495 sec   ✓         11.97       0.51


3D Tracking on PointOdyssey

DGS-LRM demonstrates competitive performance on long-range 3D tracking, achieving results on par with state-of-the-art monocular video tracking methods while providing physically grounded 3D deformation. DGS-LRM is more accurate and consistent in texture-less areas, whereas SpatialTracker predicts tracks inconsistent with the object's moving direction, e.g., several tracking points drifting and colliding at the humanoid's knee.

Quantitative Comparisons
Method                                 Frames   PSNR    ATE-3D (↓)   δ0.1 (↑)   δ0.2 (↑)
Chained RAFT3D                         120      N/A     0.70         0.12       0.25
Lifted CoTracker                       120      N/A     0.77         0.51       0.64
SpatialTracker                         120      N/A     0.22         0.59       0.76
Ours (Flow Chaining)                   120      27.77   0.21         0.57       0.68
Ours (Native)                          24       27.77   0.11         0.72       0.84
Ours (Flow Chaining + Fully Visible)   120      27.77   0.15         0.64       0.75
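
The "Flow Chaining" rows above chain per-window scene flow predictions into longer tracks, while "Native" reads the deformation vectors directly within a single 24-frame window. Below is a minimal sketch of one way such chaining could be implemented via nearest-neighbor flow lookup; the window handling and lookup scheme are assumptions rather than the paper's exact procedure.

import numpy as np
from scipy.spatial import cKDTree

def chain_scene_flow(points_per_frame, flow_per_frame, query_points):
    """Chain per-frame 3D scene flow into long-range 3D tracks.

    points_per_frame[t]: (N, 3) predicted Gaussian centers at frame t.
    flow_per_frame[t]:   (N, 3) predicted 3D displacement from frame t to t+1.
    query_points:        (M, 3) 3D points to track, given at frame 0.
    Returns an array of shape (len(flow_per_frame) + 1, M, 3) with the tracks.
    """
    tracks = [np.asarray(query_points, dtype=np.float64)]
    for pts, flow in zip(points_per_frame, flow_per_frame):
        tree = cKDTree(pts)
        # Advect each tracked point with the flow of its nearest predicted Gaussian.
        _, idx = tree.query(tracks[-1])
        tracks.append(tracks[-1] + flow[idx])
    return np.stack(tracks)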

Ablation Study

We conduct comprehensive ablation studies to validate each proposed component; all of them contribute significantly to the final performance. Temporal tokenization enables scalable training, the scene flow loss improves deformation quality, reference frames help resolve scale ambiguity, and dual-view supervision provides better geometric constraints.

Method                      DyCheck                  Kubric-MV (Test)
                            mPSNR (↑)   mLPIPS (↓)   mPSNR (↑)   mLPIPS (↓)
w/o Temporal Tokenization   OOM         OOM          OOM         OOM
w/o Dual-View Sampling      14.72       0.412        25.77       0.171
w/o Scene Flow Loss         14.29       0.423        25.06       0.189
w/o Reference Frames        13.91       0.438        24.69       0.186
Full Method                 14.67       0.412        26.05       0.161

Limitations

DGS-LRM has a few limitations that can be explored in future work:

Temporal Continuity Requirements

Because the model is trained on temporally continuous video, it cannot handle discrete image frames that are too far apart in time.

Scene Flow Limitations

Our predicted scene flow cannot handle extremely large motion, a limitation that likely stems from the motion distribution of the physically simulated synthetic dataset. This domain gap can also affect the quality of synthesized novel views.

Camera Pose Distribution

The camera baseline and pose distribution of the input video strongly influence novel view rendering quality, since our model relies heavily on triangulation from large camera movements to recover 3D geometry.