We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes
from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views
and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing.
The system integrates three key innovations: (1) a robust point cloud reconstruction module that filters
unreliable geometry with a warping-based anomaly removal strategy (sketched below); (2) a warping-guided
2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware
3D representation; and (3) a rendering scheme that projects the point cloud into novel views and refines the
renderings with a 3D-aware diffusion model.
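To make the warping-based anomaly removal concrete, the minimal sketch below shows one way such a consistency filter can be realized; the function name, the availability of per-view depth maps, intrinsics, and relative poses from the reconstruction module, and the 5% rejection threshold are illustrative assumptions rather than the exact procedure used in our module.

import numpy as np

def flag_inconsistent_points(depth_i, depth_j, K, T_ji, rel_thresh=0.05):
    # Hypothetical sketch of a warping-based consistency check.
    # depth_i, depth_j: (H, W) depth maps predicted for views i and j
    # K:    (3, 3) camera intrinsics shared by both views (assumed)
    # T_ji: (4, 4) rigid transform from view-i camera coordinates to view j
    H, W = depth_i.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(np.float64)

    # Back-project every pixel of view i to a 3D point in its camera frame.
    pts_i = np.linalg.inv(K) @ pix.T * depth_i.reshape(1, -1)

    # Warp the points into view j's camera frame.
    pts_j = (T_ji @ np.vstack([pts_i, np.ones((1, pts_i.shape[1]))]))[:3]

    # Project into view j and look up the depth observed there.
    proj = K @ pts_j
    z = proj[2]
    front = z > 1e-6
    uj = np.zeros_like(z, dtype=int)
    vj = np.zeros_like(z, dtype=int)
    uj[front] = np.round(proj[0, front] / z[front]).astype(int)
    vj[front] = np.round(proj[1, front] / z[front]).astype(int)
    visible = front & (uj >= 0) & (uj < W) & (vj >= 0) & (vj < H)

    # A point is flagged as unreliable when its warped depth disagrees with
    # the depth map of view j by more than rel_thresh (relative error).
    flags = np.zeros(H * W, dtype=bool)
    idx = np.where(visible)[0]
    observed = depth_j[vj[idx], uj[idx]]
    flags[idx] = np.abs(z[idx] - observed) > rel_thresh * np.maximum(observed, 1e-6)
    return flags.reshape(H, W)

Points flagged by such checks against several neighboring views can then be removed before the cloud is used for instance lifting and rendering.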
Our method leverages the generative prior of the diffusion model to compensate for missing geometry and
enhance realism, especially under sparse input conditions. We further demonstrate that object-level scene
editing, such as instance removal, is naturally supported in our pipeline by modifying only the point cloud,
enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction
for efficient,
editable 3D content generation without relying on scene-specific optimization.
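To illustrate how object-level editing reduces to a point-cloud modification, the sketch below removes one instance from an instance-labeled cloud; the array names, label convention, and toy data are illustrative assumptions.

import numpy as np

def remove_instance(points, colors, instance_ids, target_id):
    # Hypothetical sketch: object-level editing as a pure point-cloud edit.
    # points:       (N, 3) point positions
    # colors:       (N, 3) per-point colors
    # instance_ids: (N,)   instance labels lifted from the 2D masks
    # target_id:    label of the object to delete
    keep = instance_ids != target_id
    return points[keep], colors[keep], instance_ids[keep]

# Toy usage with random data; in the pipeline, downstream novel-view
# projection and diffusion refinement run unchanged on the edited cloud.
pts = np.random.rand(1000, 3)
col = np.random.rand(1000, 3)
ids = np.random.randint(0, 10, size=1000)
pts_e, col_e, ids_e = remove_instance(pts, col, ids, target_id=7)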