Paper   |   Code (Coming Soon)
DGS-LRM Overview: Our proposed Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM) takes posed monocular videos as input and predicts deformable 3D Gaussians in a single feedforward pass. On the right, we render novel views from the predicted deformable Gaussians and sample 2D trajectories from their aggregated 3D scene flow.
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can readily adapt for long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.
@article{lin2025dgslrm,
  title={DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos},
  author={Lin, Chieh Hubert and Lv, Zhaoyang and Wu, Songyin and Xu, Zhen and Nguyen-Phuoc, Thu and Tseng, Hung-Yu and Straub, Julian and Khan, Numair and Xiao, Lei and Yang, Ming-Hsuan and Ren, Yuheng and Newcombe, Richard and Dong, Zhao and Li, Zhengqin},
  journal={arXiv preprint arXiv:2506.09997},
  year={2025}
}
We first concatenate the multi-view video frames with time-aware Plücker ray embeddings and tokenize them using a spatial-temporal tokenizer. The transformer then takes the sampled time tokens as input and predicts per-pixel deformable Gaussians with 3D scene flow. During training, we render multi-view synthetic videos using Kubric and draw dual-view ground truth at the same timestamp for each sample, rendering images, depth, and scene flow.
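To make the input encoding concrete, here is a minimal sketch of a time-aware Plücker ray map, assuming rays are derived from per-frame intrinsics `K` and a camera-to-world pose `c2w` and the timestamp is appended as one normalized channel; the function name and exact channel layout are our assumptions, not the paper's.

```python
# Minimal sketch (not the authors' code) of a time-aware Plücker ray embedding:
# per pixel we compute the ray direction d and moment o x d from the camera pose,
# then append a normalized timestamp channel before concatenating with the RGB frame.
import torch

def plucker_time_embedding(K, c2w, H, W, t, num_frames):
    """Return a (H, W, 7) map: 3 ray directions, 3 ray moments, 1 time channel."""
    device = K.device
    # Pixel grid at pixel centers.
    v, u = torch.meshgrid(
        torch.arange(H, device=device) + 0.5,
        torch.arange(W, device=device) + 0.5,
        indexing="ij",
    )
    # Back-project pixels to camera-space ray directions.
    dirs_cam = torch.stack(
        [(u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1], torch.ones_like(u)], dim=-1
    )
    # Rotate into world space and normalize.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)          # Plücker moment o x d
    time = torch.full((H, W, 1), t / max(num_frames - 1, 1), device=device)
    return torch.cat([dirs, moment, time], dim=-1)

# Example: one 256x256 frame at t=3 of a 24-frame clip.
K = torch.tensor([[200.0, 0, 128], [0, 200.0, 128], [0, 0, 1]])
c2w = torch.eye(4)
emb = plucker_time_embedding(K, c2w, 256, 256, t=3, num_frames=24)
print(emb.shape)  # torch.Size([256, 256, 7])
```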
We introduce per-pixel deformable 3D Gaussian splats that model temporal deformation through 3D translation vectors. Each pixel contains a Gaussian splat with depth, RGB colors, rotation, scale, opacity, and deformation vectors across time. This representation is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking by providing physically grounded 3D scene flow that can be aggregated into coherent trajectories.
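A minimal sketch of this per-pixel representation, assuming attributes are stored as flat tensors with one Gaussian per input pixel and the only time-varying quantity is a per-timestep 3D translation; the container and helper names below are illustrative, not the paper's API.

```python
# Minimal sketch: one Gaussian per input pixel, deformed over time by a
# per-timestep 3D translation (the predicted scene flow). Tensor layouts and
# helper names are assumptions for illustration.
from dataclasses import dataclass
import torch

@dataclass
class DeformableGaussians:
    means: torch.Tensor        # (N, 3)  canonical 3D centers (unprojected from per-pixel depth)
    rgb: torch.Tensor          # (N, 3)  colors
    rotation: torch.Tensor     # (N, 4)  unit quaternions
    scale: torch.Tensor        # (N, 3)  per-axis scales
    opacity: torch.Tensor      # (N, 1)
    deformation: torch.Tensor  # (N, T, 3) translation of each Gaussian at each timestep

    def deform_to(self, t: int) -> torch.Tensor:
        """Gaussian centers at timestep t: canonical means plus the predicted translation."""
        return self.means + self.deformation[:, t]

    def trajectory(self, idx: torch.Tensor) -> torch.Tensor:
        """Long-range 3D tracks for selected Gaussians, shape (len(idx), T, 3)."""
        return self.means[idx, None] + self.deformation[idx]
```

Rendering a novel view at timestep `t` would splat `deform_to(t)` together with the static attributes, while `trajectory` aggregates the per-timestep translations of selected Gaussians into long-range 3D tracks.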
Instead of processing each frame individually, we use spatial-temporal tokenization that processes video cubes, significantly reducing memory consumption and enabling training at scale. This approach achieves a 4× memory reduction compared to naive frame-by-frame processing while maintaining temporal coherence through joint spatial-temporal feature learning in the transformer architecture.
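A rough sketch of such cube-based tokenization, assuming non-overlapping cubes of size 4×8×8 over (time, height, width); with a temporal patch size of 4 this yields roughly 4× fewer tokens than per-frame patchification. The patch sizes and flattening order are assumptions for illustration.

```python
# Minimal sketch of spatio-temporal "video cube" tokenization.
import torch

def tokenize_video(video: torch.Tensor, pt: int = 4, ph: int = 8, pw: int = 8):
    """video: (T, C, H, W) -> tokens: (T//pt * H//ph * W//pw, pt*ph*pw*C)."""
    T, C, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    # Group cube indices together, then flatten each cube into one token.
    x = x.permute(0, 3, 5, 1, 4, 6, 2)          # (T/pt, H/ph, W/pw, pt, ph, pw, C)
    return x.reshape(-1, pt * ph * pw * C)

clip = torch.randn(24, 10, 256, 256)            # e.g. RGB + 7-dim time-aware Plücker map
tokens = tokenize_video(clip)
print(tokens.shape)                             # (6 * 32 * 32, 4*8*8*10) = (6144, 2560)
```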
We address the fundamental geometry and motion ambiguity in monocular videos by using synchronized multi-view supervision during training. This provides clearer constraints for learning accurate 3D deformation by leveraging ground-truth multi-view videos rendered from different camera positions, enabling the model to distinguish between camera motion and object deformation.
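A hedged sketch of the resulting training signal, assuming the predicted deformable Gaussians are rendered into both the input camera and a held-out camera at the same timestamp; `render` stands in for any differentiable Gaussian splatting renderer, and the loss terms and weights are illustrative rather than the exact objective used in the paper.

```python
# Minimal sketch of the dual-view training signal; `render` is a hypothetical
# differentiable Gaussian splatting renderer, and the weights are made up.
import torch
import torch.nn.functional as F

def dual_view_loss(render, gaussians, cams, gt_rgb, gt_depth, gt_flow, t,
                   w_flow=0.1, w_depth=0.1):
    """cams/gt_*: indexed by view (0 = input view, 1 = held-out view at timestamp t)."""
    loss = 0.0
    for v in range(2):
        pred_rgb, pred_depth = render(gaussians, cams[v], t)     # hypothetical renderer
        loss = loss + F.mse_loss(pred_rgb, gt_rgb[v])            # photometric term
        loss = loss + w_depth * F.l1_loss(pred_depth, gt_depth[v])
    # Dense 3D scene flow supervision on the predicted per-Gaussian deformation.
    loss = loss + w_flow * F.l1_loss(gaussians.deformation[:, t], gt_flow)
    return loss
```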
We create an enhanced synthetic dataset using Kubric with ground-truth multi-view videos and dense 3D scene flow supervision, containing 40,000 scenes with diverse dynamic objects, realistic physics, and complex motion patterns. The dataset includes various object types, materials, lighting conditions, and camera trajectories to ensure robust generalization to real-world scenarios.
DGS-LRM generalizes well to real-world videos, correctly reconstructing thin geometries like bike wheels and challenging scenes with water deformation. The flow visualizations show effective tracking of complex deformations in hand motion and wheel turning while maintaining consistent flow for rigid body movements.
Our DGS-LRM outperforms D3DGS and avoids the warping artifacts present in PGDVS. Both baseline methods fail to recover the correct geometry and repetitive motion patterns, while our method handles these challenging scenarios effectively. Regions with zero covisibility are masked out with black pixels.
DGS-LRM outperforms the LRM-based L4GM and is comparable to optimization-based novel-view synthesis methods, with substantially faster reconstruction time.
The DynMask column indicates whether a dynamic mask is applied so that only the dynamic foreground is evaluated.
Method | Time | DynMask | mPSNR (↑) | mLPIPS (↓) |
---|---|---|---|---|
D3DGS | 1-3 hours | ✗ | 11.92 | 0.66 |
PGDVS | 3 hours | ✗ | 15.88 | 0.34 |
Ours | 0.495 sec | ✗ | 14.89 | 0.42 |
L4GM | 4.8 sec | ✓ | 5.84 | 0.67 |
Ours | 0.495 sec | ✓ | 11.97 | 0.51 |
DGS-LRM demonstrates competitive performance on long-range 3D tracking, achieving results on par with state-of-the-art monocular video tracking methods while providing physically grounded 3D deformation. DGS-LRM shows better performance and consistency in texture-less areas, whereas SpatialTracker predicts tracks inconsistent with the object's moving direction, with several tracking points drifting and colliding around the humanoid's knee.
Method | Frames | PSNR (↑) | ATE-3D (↓) | δ0.1 (↑) | δ0.2 (↑) |
---|---|---|---|---|---|
Chained RAFT3D | 120 | N/A | 0.70 | 0.12 | 0.25 |
Lifted CoTracker | 120 | N/A | 0.77 | 0.51 | 0.64 |
SpatialTracker | 120 | N/A | 0.22 | 0.59 | 0.76 |
Ours (Flow Chaining) | 120 | 27.77 | 0.21 | 0.57 | 0.68 |
Ours (Native) | 24 | 27.77 | 0.11 | 0.72 | 0.84 |
Ours (Flow Chaining + Fully Visible) | 120 | 27.77 | 0.15 | 0.64 | 0.75 |
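The Flow Chaining rows above compose the model's native 24-frame scene flow across overlapping windows into 120-frame tracks. Below is a minimal sketch of one way such chaining could work, assuming a one-frame overlap between windows and a nearest-neighbour hand-off; both choices are our assumptions, not the exact procedure used in the paper.

```python
# Minimal sketch of "flow chaining": longer tracks are built by composing
# per-window 3D tracks across overlapping video windows. The windowing scheme
# and nearest-neighbour hand-off are assumptions for illustration.
import torch

def chain_flows(window_tracks, overlap=1):
    """window_tracks: list of (N, T_w, 3) 3D tracks, one per video window, where the
    last `overlap` frames of window i coincide in time with the first of window i+1.
    Returns chained tracks of shape (N, total_frames, 3)."""
    chained = window_tracks[0]
    for nxt in window_tracks[1:]:
        # Hand off each track to the nearest starting point in the next window,
        # then drop that window's overlapping first frame(s).
        dists = torch.cdist(chained[:, -1], nxt[:, 0])   # (N, N) end-to-start distances
        idx = dists.argmin(dim=1)
        chained = torch.cat([chained, nxt[idx, overlap:]], dim=1)
    return chained

# Example: four 24-frame windows with a 1-frame overlap -> 24 + 3*23 = 93 frames.
windows = [torch.cumsum(torch.randn(128, 24, 3) * 0.01, dim=1) for _ in range(4)]
tracks = chain_flows(windows)
print(tracks.shape)  # torch.Size([128, 93, 3])
```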
We conduct comprehensive ablation studies to validate each proposed component, and each contributes significantly to the final performance: temporal tokenization enables scalable training, the scene flow loss improves deformation quality, reference frames help resolve scale ambiguity, and dual-view supervision provides better geometric constraints.
Method | DyCheck mPSNR (↑) | DyCheck mLPIPS (↓) | Kubric-MV (Test) mPSNR (↑) | Kubric-MV (Test) mLPIPS (↓) |
---|---|---|---|---|
w/o Temporal Tokenization | OOM | OOM | OOM | OOM |
w/o Dual-View Sampling | 14.72 | 0.412 | 25.77 | 0.171 |
w/o Scene Flow Loss | 14.29 | 0.423 | 25.06 | 0.189 |
w/o Reference Frames | 13.91 | 0.438 | 24.69 | 0.186 |
Full Method | 14.67 | 0.412 | 26.05 | 0.161 |
DGS-LRM has a few limitations that can be explored in future work:
- Because the model is trained on temporally continuous video, it cannot handle image frames that are temporally too far apart.
- The predicted scene flow cannot handle extremely large motion in the scene, which may stem from the motion distribution of the physically simulated synthetic dataset; this domain gap can also affect the synthesized novel views.
- The baseline and distribution of the input video significantly influence novel-view rendering quality, since our model relies heavily on triangulation from large camera movements to analyze and extract 3D geometry.