Taming Latent Diffusion Model for
Neural Radiance Field Inpainting

ECCV 2024
1Meta, 2University of California, Merced, 3University of Maryland, College Park

The top row shows the original scene, and the bottom row is our inpainted results.


Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images. Despite some recent work showing preliminary success in editing a reconstructed NeRF with diffusion prior, they remain struggling to synthesize reasonable geometry in completely uncovered regions. One major reason is the high diversity of synthetic contents from the diffusion model, which hinders the radiance field from converging to a crisp and deterministic geometry. Moreover, applying latent diffusion models on real data often yields a textural shift incoherent to the image condition due to auto-encoding errors. These two problems are further reinforced with the use of pixel-distance losses. To address these issues, we propose tempering the diffusion model's stochasticity with per-scene customization and mitigating the textural shift with masked adversarial training. During the analyses, we also found the commonly used pixel and perceptual losses are harmful in the NeRF inpainting task. Through rigorous experiments, our framework yields state-of-the-art NeRF inpainting results on various real-world scenes.


MALD-NeRF uses a latent diffusion model to obtain the inpainted training images from the NeRF-rendered images using partial DDIM. The inpainted images are used to update the NeRF training dataset following the iterative dataset update protocol. (reconstruction) We use pixel-level regression loss between the NeRF-rendered and ground-truth pixels to reconstruct the regions observed in the input images. (inpainting) We design a masked patch-based adversarial training, which include an adversarial loss and discriminator feature matching loss, to supervise the the inpainting regions.

Per-Scene customization.

Our per-scene customization effectively forges the latent diffusion model to synthesize consistent and in-context contents across views.

Evaluation on SPIn-NeRF and LLFF Datasets.

We present the results on the SPIn-NeRF and LLFF datasets. Note that the LLFF dataset does not have ground-truth views with object being physically removed, therefore, we only measures C-FID and C-KID on these scenes. The best performance is underscored.

Comparing with Related Methods


Example Image

Example Mask


Ablation Study

# LPIPS Adversarial Others FID (↓) KID (↓)
192.86 0.0447
185.79 0.0419
w/o Per-Scene Customization 224.29 0.0596
w/o Feature Matching 232.28 0.0716
w/o Adv Masking 196.47 0.0472
Ours (full) 183.25 0.0397

Ours (full)

Ablation Variant A


  author    = {Lin, Chieh Hubert and Kim, Changil and Huang, Jia-Bin and Li, Qinbo and Ma, Chih-Yao and Kopf, Johannes and Yang, Ming-Hsuan and Tseng, Hung-Yu},
  title     = {Taming Latent Diffusion Model for Neural Radiance Field Inpainting},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024}