Training-Free Instance-Aware 3D Scene Reconstruction and Diffusion-Based View Synthesis from Sparse Images

Australian Institute for Machine Learning, The University of Adelaide
Corresponding author

SIGGRAPH Asia 2025

We present a novel, unified pipeline that transforms sparse image inputs into a clean, instance-aware point cloud without requiring any pose pre-processing or scene-specific learning, and directly synthesizes photorealistic novel views through a tailored diffusion-based rendering approach. Beyond reconstruction and rendering, our method also supports essential downstream tasks such as scene-level editing.

Abstract

We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing. The system integrates three key innovations: (1) a robust point cloud reconstruction module that filters unreliable geometry using a warping-based anomaly removal strategy; (2) a warping-guided 2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware 3D representation; and (3) a novel rendering approach that projects the point cloud into new views and refines the renderings with a 3D-aware diffusion model. Our method leverages the generative power of diffusion to compensate for missing geometry and enhance realism, especially under sparse input conditions. We further demonstrate that object-level scene editing, such as instance removal, is naturally supported in our pipeline by modifying only the point cloud, enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction for efficient, editable 3D content generation without relying on scene-specific optimization.
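The 2D-to-3D instance lifting relies on associating per-frame segmentation masks across views before assigning their labels to the pointmap. The paper's implementation is not given on this page; as an illustration only, the cross-frame ID unification can be sketched as greedy IoU matching between one frame's labels warped into another frame's pixel grid and that frame's own labels. The function name, label conventions (0 = background), and threshold below are all hypothetical.

```python
import numpy as np

def unify_instance_ids(warped_labels_a, labels_b, iou_thresh=0.5):
    """Greedy IoU matching between two frames' instance label maps.

    warped_labels_a: (H, W) int map, frame A's instance IDs warped into
                     frame B's pixel grid (0 = background / no correspondence).
    labels_b:        (H, W) int map, frame B's per-frame instance IDs.
    Returns labels_b relabelled so that matched instances reuse frame A's IDs
    and unmatched instances receive fresh global IDs.
    """
    ids_a = [i for i in np.unique(warped_labels_a) if i != 0]
    ids_b = [i for i in np.unique(labels_b) if i != 0]
    next_id = max(ids_a, default=0) + 1
    out = np.zeros_like(labels_b)
    for ib in ids_b:
        mb = labels_b == ib
        best_iou, best_ia = 0.0, None
        for ia in ids_a:
            ma = warped_labels_a == ia
            iou = np.logical_and(ma, mb).sum() / np.logical_or(ma, mb).sum()
            if iou > best_iou:
                best_iou, best_ia = iou, ia
        if best_iou >= iou_thresh:
            out[mb] = best_ia          # inherit frame A's global ID
        else:
            out[mb] = next_id          # unmatched instance gets a new ID
            next_id += 1
    return out
```

Once every frame's labels are expressed in one global ID space, attaching them to the per-pixel pointmap yields the instance-segmented point cloud described above.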

Our Framework


Framework overview. We apply a feed-forward model (e.g., MV-DUSt3R) to predict the initial point cloud from the unposed sparse input images, followed by a novel warping-based anomaly-point removal strategy that eliminates unreliable points, forming a clean and accurate scene point cloud. For 2D-to-3D instance segmentation, we first utilize a foundation segmentation model (e.g., SAM) to generate initial segmentation masks. We then introduce a warping-based instance unification strategy to align instances across frames. By associating the instance masks with the corresponding pointmaps, we obtain an instance-segmented 3D point cloud, which facilitates downstream tasks such as object-level editing. Finally, the point cloud can be directly rendered into high-quality 2D images through our designed diffusion-based rendering pipeline.
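The core check behind the anomaly-point removal is cross-view consistency: a 3D point predicted from one view should, when warped into another view, agree with that view's predicted depth. A minimal sketch of this test follows, assuming pinhole intrinsics `K`, a world-to-camera matrix `T_b`, and a relative-depth threshold; all names and the exact consistency criterion are assumptions, not the paper's released code.

```python
import numpy as np

def flag_anomalies(pointmap_a, depth_b, K, T_b, rel_thresh=0.05):
    """Flag view-A points whose reprojected depth disagrees with view B.

    pointmap_a: (H, W, 3) world-space points predicted for view A.
    depth_b:    (H, W) depth map predicted for view B.
    K:          (3, 3) pinhole intrinsics of view B.
    T_b:        (4, 4) world-to-camera extrinsics of view B.
    Returns a boolean (H, W) mask, True where a point is verified inconsistent.
    """
    H, W = depth_b.shape
    pts = pointmap_a.reshape(-1, 3)
    ones = np.ones((pts.shape[0], 1))
    cam = (T_b @ np.hstack([pts, ones]).T).T[:, :3]    # view-B camera frame
    z = cam[:, 2]
    uv = (K @ cam.T).T / np.maximum(z, 1e-9)[:, None]  # perspective divide
    u, v = uv[:, 0], uv[:, 1]
    # Only points that land inside view B's image can be verified.
    valid = (z > 1e-9) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.where(valid)[0]
    ui = u[idx].astype(int)
    vi = v[idx].astype(int)
    # A point survives when its warped depth matches B's prediction.
    agree = np.abs(z[idx] - depth_b[vi, ui]) < rel_thresh * depth_b[vi, ui]
    anomalous = np.zeros(pts.shape[0], dtype=bool)
    anomalous[idx[~agree]] = True
    return anomalous.reshape(H, W)
```

In a multi-view setting this check would be aggregated over several view pairs before a point is discarded; the sketch shows a single pair for clarity.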

Results

Main Results

Main results. We present the main experimental results, including instance segmentation, point cloud reconstruction, novel view synthesis, and scene edits through object removal. Each row corresponds to a test scene. The left column shows the instance segmentation results. The center section labeled "Original" displays the reconstructed point cloud (top) and novel view synthesis results (bottom). The two columns on the right illustrate two different object removal results, with the removed objects highlighted with black dashed boxes in the point clouds.
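Because every point carries an instance label after lifting, object removal reduces to filtering the point cloud by label; the edited cloud is then rendered through the same diffusion-based pipeline with no retraining. A trivial sketch, with hypothetical array conventions (attributes such as colors would be filtered the same way):

```python
import numpy as np

def remove_instance(points, instance_ids, target_id):
    """Drop all points belonging to one instance from the scene point cloud.

    points:       (N, 3) point coordinates.
    instance_ids: (N,) per-point instance labels from the 2D-to-3D lifting.
    target_id:    instance to remove.
    Returns the filtered points and their labels.
    """
    keep = instance_ids != target_id
    return points[keep], instance_ids[keep]
```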

BibTeX

@article{xia2025tid3r,
  author    = {Xia, Jiatong and Liu, Lingqiao},
  title     = {Training-Free Instance-Aware 3D Scene Reconstruction and Diffusion-Based View Synthesis from Sparse Images},
  journal   = {ACM SIGGRAPH Asia},
  year      = {2025},
}