Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain—from active sensors such as LiDAR or from feed-forward predictors like VGGT—offering explicit geometric constraints that current methods fail to exploit.
In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point-cloud priors for geometry-controllable generation of 3D assets and scenes. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces the pure-noise initialization of the sparse-structure latent with an input formulation tailored to point-cloud priors. A structure-inpainting network, trained within the TRELLIS framework on task-specific data designed for global structural inpainting, is then applied at inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors.
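To make the input formulation concrete, the sketch below shows one plausible way the visible-region prior could be turned into the model input: voxelize the points, encode the occupied (visible) region, fill the empty region of the latent with random noise, and concatenate a visibility mask. This is an illustrative sketch, not the released implementation; the grid resolution (64), the latent layout, and the `ss_encoder` callable are assumptions.

```python
# Minimal sketch of the point-cloud-prior input formulation (assumed shapes,
# hypothetical encoder); not the authors' code.
import torch

def voxelize(points: torch.Tensor, res: int = 64) -> torch.Tensor:
    """Map an (N, 3) point cloud in [-0.5, 0.5]^3 to a binary occupancy grid."""
    idx = ((points + 0.5) * res).long().clamp(0, res - 1)
    occ = torch.zeros(res, res, res, dtype=torch.bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ

def build_model_input(points: torch.Tensor, ss_encoder, res: int = 64) -> torch.Tensor:
    """Encode the visible region, fill empty voxels with noise, append a mask."""
    occ = voxelize(points, res)                          # (res, res, res) occupancy of the prior
    mask = occ.float().unsqueeze(0)                      # (1, res, res, res); 1 = visible, 0 = to inpaint
    z_visible = ss_encoder(mask)                         # assumed to return a (C, res, res, res) SS latent
    noise = torch.randn_like(z_visible)
    z_init = mask * z_visible + (1.0 - mask) * noise     # keep the prior latent, noise elsewhere
    return torch.cat([z_init, mask], dim=0)              # (C + 1, res, res, res) model input
```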
In practice, Points-to-3D accepts either accurate point-cloud priors or point clouds estimated by VGGT from single images. Experiments on both single-object and multi-object scenarios consistently demonstrate superior rendering quality and geometric fidelity over state-of-the-art baselines, highlighting the effectiveness of explicitly embedding point-cloud priors for more accurate and structurally controllable 3D generation.
Points-to-3D framework. Given point-cloud priors, either pre-existing or predicted by VGGT from an input image, we first voxelize and VAE-encode them to obtain a sparse-structure (SS) latent; the empty regions are filled with random noise and concatenated with an extracted mask to form the input to our model. During training, we optimize the inpainting flow transformer with a conditional flow-matching loss. During inference, we employ a two-stage sampling procedure: (1) structural inpainting for the first s steps to complete the global structure, and (2) boundary refinement for the remaining (t - s) steps to refine the inpainting boundaries.
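The following sketch illustrates how such a two-stage sampler could look as a simple Euler integration of a rectified flow. It is an assumption-laden illustration rather than the released code: `velocity_model` stands in for the inpainting flow transformer, and the re-injection of the (noised) prior latent during the first s steps is one common way to realize latent-space inpainting.

```python
# Illustrative two-stage sampler: s structural-inpainting steps that clamp the
# visible region to the prior latent, then (t - s) boundary-refinement steps.
import torch

@torch.no_grad()
def two_stage_sample(velocity_model, z_prior: torch.Tensor, mask: torch.Tensor,
                     t: int = 50, s: int = 35) -> torch.Tensor:
    """Euler integration of the flow from noise (tau = 0) to data (tau = 1)."""
    z = torch.randn_like(z_prior)                        # start from pure noise
    taus = torch.linspace(0.0, 1.0, t + 1)
    for i in range(t):
        tau, d_tau = taus[i], taus[i + 1] - taus[i]
        if i < s:
            # Stage 1: structural inpainting -- re-inject the visible-region
            # prior (at the current noise level) so the global structure is
            # completed around the given geometry.
            z_ref = tau * z_prior + (1.0 - tau) * torch.randn_like(z_prior)
            z = mask * z_ref + (1.0 - mask) * z
        # The mask is concatenated as an extra channel, as in the figure above.
        v = velocity_model(torch.cat([z, mask], dim=0), tau)
        z = z + d_tau * v
        # Stage 2 (i >= s): boundary refinement -- no clamping, letting the
        # model smooth the seams between visible and inpainted regions.
    return z
```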
Examples of input point-cloud priors. We support two types of priors: (1) Sampled Point Cloud - partially captured ground-truth point clouds, and (2) VGGT Estimated - point clouds estimated from input images via feed-forward point-map prediction. Both types impose reliable geometric constraints that steer our model toward controllable and faithful 3D generation.
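Either prior type ultimately reduces to an (N, 3) point set in a canonical cube before voxelization. The sketch below shows one plausible preprocessing path under assumed shapes: a partial ground-truth cloud only needs normalization, while a VGGT-style point map (assumed here as an (H, W, 3) array with an (H, W) confidence map; the actual VGGT API is not reproduced) is flattened and confidence-filtered first.

```python
# Hypothetical preprocessing of the two prior types into a common (N, 3)
# point set inside [-0.5, 0.5]^3; shapes and thresholds are assumptions.
import torch

def normalize_to_cube(points: torch.Tensor, margin: float = 0.05) -> torch.Tensor:
    """Center and scale an (N, 3) point cloud into [-0.5, 0.5]^3 with a small margin."""
    center = 0.5 * (points.max(dim=0).values + points.min(dim=0).values)
    scale = (points - center).abs().max() * 2.0 / (1.0 - margin)
    return (points - center) / scale

def prior_from_point_map(point_map: torch.Tensor, confidence: torch.Tensor,
                         conf_thresh: float = 0.5) -> torch.Tensor:
    """Flatten an (H, W, 3) point map to (N, 3), dropping low-confidence pixels."""
    pts = point_map.reshape(-1, 3)
    keep = confidence.reshape(-1) > conf_thresh
    return normalize_to_cube(pts[keep])

# A partially captured ground-truth cloud needs only the normalization step:
# prior_points = normalize_to_cube(sampled_points)
```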
Our method demonstrates consistent high-quality results across diverse object categories and scene types, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving accurate and structurally controllable 3D generation.
@inproceedings{xia2026points2-3d,
author = {Xia, Jiatong and Duan, Zicheng and van den Hengel, Anton and Liu, Lingqiao},
title = {Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026},
}