[Abstract]

Toward infinite-scale 3D city synthesis, we propose a novel framework, InfiniCity, which constructs and renders an unconstrainedly large and 3D-grounded environment from random noise. InfiniCity decomposes the seemingly impractical task into three feasible modules, taking advantage of both 2D and 3D data. First, an infinite-pixel image synthesis module generates arbitrary-scale 2D maps from the bird's-eye view. Next, an octree-based voxel completion module lifts the generated 2D maps to 3D octrees. Finally, a voxel-based neural rendering module texturizes the voxels and renders 2D images. InfiniCity can thus synthesize arbitrary-scale and traversable 3D city environments, and allows flexible and interactive editing by users. We quantitatively and qualitatively demonstrate the efficacy of the proposed framework.

[Paper]

ICCV 2023

[Citation]

@inproceedings{lin2023infinicity,
  title={Infini{C}ity: Infinite-Scale City Synthesis},
  author={Lin, Chieh Hubert and Lee, Hsin-Ying and Menapace, Willi and Chai, Menglei and Siarohin, Aliaksandr and Yang, Ming-Hsuan and Tulyakov, Sergey},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023},
}

Framework Overview

We propose InfiniCity, a three-stage framework for infinite-scale city scene synthesis.

From bottom to top: we synthesize multi-modality infinite-pixel satellite images, perform octree-based voxel completion to create a watertight voxel world, and finally texturize the scene with voxel-based neural rendering. In the middle figure, we mark the camera locations (in red and orange) used to render the views in the top figures.



Methodology


InfiniCity consists of three major modules. The infinite-pixel satellite image synthesis stage is trained on image tuples (category, depth, and normal maps) derived from a bird's-eye-view scan of the 3D environment, and synthesizes arbitrary-scale satellite maps during inference. The 3D octree-based voxel completion stage is trained on pairs of surface-scanned and completed octrees. During inference, it takes the surface voxels lifted from the satellite images as input and produces the watertight voxel world. Finally, the voxel-based neural rendering stage performs ray-sampling to retrieve features from the voxel world, then renders the final image with a neural renderer. The neural renderer is trained with both real images and pseudo-ground-truths synthesized by a pretrained SPADE generator. With these modules, InfiniCity can synthesize an arbitrary-scale and traversable 3D city environment from noise.
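For intuition, the following minimal Python sketch shows how the three stages chain together at inference time. Every callable name here is an illustrative placeholder, not our released implementation:

import torch

# Hypothetical glue code for the three-stage pipeline; every callable passed
# in stands for the corresponding InfiniCity module.
def synthesize_city(satellite_gen, lift_fn, completion_net, renderer,
                    camera_pose, style_dim=512, device="cpu"):
    z = torch.randn(1, style_dim, device=device)   # global latent noise

    # Stage 1: multi-modality satellite maps (category, depth, normal).
    category, depth, normal = satellite_gen(z)

    # Stage 2: lift the maps to surface voxels, then complete them into a
    # watertight voxel world.
    surface_voxels = lift_fn(category, depth, normal)
    voxel_world = completion_net(surface_voxels)

    # Stage 3: ray-sample voxel features and decode them into an image,
    # reusing the same global style code z for the whole scene.
    return renderer(voxel_world, camera_pose, style=z)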


Infinite-Pixel and Multi-Modality Satellite Map Synthesis


Synthesized satellite maps.   We train InfinityGAN [1] with a contrastive discriminator on multiple data modalities (category, depth, and normal). The images shown are 1024×1024-pixel crops from the infinite-pixel images.
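As a rough illustration of the contrastive objective, the sketch below uses an InfoNCE-style loss in which aligned crops of two modalities at the same location are positives and crops from other locations are negatives. This is an assumed formulation, not our exact loss:

import torch
import torch.nn.functional as F

# Illustrative InfoNCE-style loss tying two modalities of the same patches
# together. feat_a and feat_b are (N, D) discriminator embeddings of two
# modalities; row i of feat_a should match row i of feat_b.
def modality_contrastive_loss(feat_a, feat_b, temperature=0.1):
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / temperature                 # (N, N) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)          # diagonal = positives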


Quality and Diversity of Synthesized Voxel

Octree-based voxel completion.   High-quality and high-diversity voxels completed from synthetic satellite images. We show the synthesized satellite images, the lifted surface voxels, and the 3D-completed voxels. The samples are 64³ voxel grids.
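As a minimal sketch of the lifting step, the snippet below turns a normalized height (depth) map into a 64³ surface-occupancy grid. The actual lifting uses all satellite modalities and more careful bookkeeping, so treat this function as an assumption-laden illustration:

import torch

# Lift an (H, W) height map with values in [0, 1] into a surface voxel grid.
def lift_height_map(height, grid_size=64):
    h = (height * (grid_size - 1)).long()            # voxel height per pixel
    ys, xs = torch.meshgrid(torch.arange(height.shape[0]),
                            torch.arange(height.shape[1]), indexing="ij")
    # Downsample pixel coordinates into voxel coordinates.
    vy = (ys * grid_size // height.shape[0]).clamp(max=grid_size - 1)
    vx = (xs * grid_size // height.shape[1]).clamp(max=grid_size - 1)
    voxels = torch.zeros(grid_size, grid_size, grid_size, dtype=torch.bool)
    voxels[vy.flatten(), vx.flatten(), h.flatten()] = True   # mark surface
    return voxels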


Neural Rendering Quality and 3D Consistency


Trajectory-wise image rendering results.   Our final rendering results show better quality, structural consistency, and diversity than the competing method GSN [2]. Each group of our images is rendered within the same voxel world using a shared global style code; each GSN group shares the same global latent vector.
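A simplified sketch of the ray-sampling step used by the renderer: per-point features are trilinearly sampled from the voxel feature grid along each camera ray, then aggregated into a per-ray feature that a 2D neural renderer decodes. The mean aggregation and coordinate handling below are simplifications:

import torch
import torch.nn.functional as F

# Sample per-ray features from a voxel feature grid via trilinear lookup.
# feature_grid: (1, C, D, H, W); origins, directions: (R, 3) rays, with all
# coordinates assumed normalized to grid_sample's [-1, 1] convention.
def sample_ray_features(feature_grid, origins, directions,
                        n_samples=32, near=0.0, far=1.0):
    t = torch.linspace(near, far, n_samples)         # depths along each ray
    pts = origins[:, None, :] + t[None, :, None] * directions[:, None, :]
    grid = pts.view(1, -1, 1, 1, 3)                  # grid_sample layout
    feats = F.grid_sample(feature_grid, grid, align_corners=True)
    feats = feats.view(feature_grid.shape[1], -1, n_samples)  # (C, R, S)
    return feats.mean(dim=-1).t()                    # (R, C) per-ray feature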


Rendered Video

Nothing is real.  


Acknowledgement

We sincerely thank the great power from OuO.

References

[1]

InfinityGAN

Chieh Hubert Lin, Hsin-Ying Lee, Yen-Chi Cheng, Sergey Tulyakov, and Ming-Hsuan Yang. "InfinityGAN: Towards Infinite-Pixel Image Synthesis." In ICLR, 2022.

[2]

GSN

Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W. Taylor, and Joshua M. Susskind. "Unconstrained Scene Generation with Locally Conditioned Radiance Fields." In ICCV, 2021.