Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting

1Durham University   2DAMO Academy, Alibaba Group   3Tsinghua University   4UCAS-Terminus AI Lab
Project Lead, Co-corresponding Authors
ICCV 2025

Abstract

Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike 2D imagery, which is abundant, 3D data typically requires specialized sensors and laborious annotation to acquire.

In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations — including point clouds, camera poses, depth maps, and pseudo-RGBD — via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence.

We release two generated spatial datasets, i.e., COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that our generated data can benefit various 3D tasks, ranging from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.

Motivation

Spatial intelligence is crucial for AI to perceive, understand, and interact with the physical world. However, unlike 2D imagery, which is abundant online, 3D datasets are expensive to collect: they require specialized sensors such as LiDAR or RGB-D cameras and laborious annotation. This creates a critical data bottleneck for advancing spatial intelligence.

To address the limitations of existing spatial datasets, we present a novel data-generation pipeline that lifts large-scale 2D image datasets into high-quality, richly annotated 3D representations covering diverse scenes and tasks. Because the lifted scenes inherit real-world appearance and metric scale, this approach avoids the domain gap of purely synthetic simulation data while also avoiding the high cost of sensor-based data collection.

Data Curation Pipeline

Pipeline image

Our pipeline converts single-view 2D images into comprehensive 3D representations through three key steps:

Step 1: We generate a scale-calibrated depth map by integrating scale-invariant and scale-aware depth estimation techniques, ensuring geometrically accurate reconstruction with proper metric scale. (Author's note: recent works such as VGGT and UniDepth-V2 report even stronger depth estimation results; we believe our pipeline can be further improved by adopting these methods.)
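To make Step 1 concrete, here is a minimal sketch of one plausible way to combine the two kinds of depth estimates: a least-squares scale-and-shift alignment of a relative (scale-invariant) depth map to a metric prediction. The function name, the global scale/shift model, and the validity handling are illustrative assumptions, not the exact procedure used in our pipeline.

import numpy as np

def calibrate_depth(relative_depth, metric_depth, valid_mask=None):
    # Align a relative (scale-invariant) depth map to a metric depth estimate
    # with a single least-squares scale and shift, then return the calibrated map.
    if valid_mask is None:
        valid_mask = np.isfinite(relative_depth) & np.isfinite(metric_depth)
    r = relative_depth[valid_mask]
    m = metric_depth[valid_mask]
    # Solve min_{s,t} || s * r + t - m ||^2 in closed form via least squares.
    A = np.stack([r, np.ones_like(r)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, m, rcond=None)
    return s * relative_depth + t

A per-image alignment of this kind keeps the fine geometric detail of the scale-invariant prediction while inheriting the metric scale of the scale-aware one.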

Step 2: Predicted camera parameters are used to back-project each image into 3D space, and invalid points are removed, yielding an accurate point cloud representation of the scene.
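As a reference for Step 2, the sketch below back-projects a metric depth map into a point cloud with a standard pinhole camera model and filters out invalid points. The intrinsics (fx, fy, cx, cy) stand in for the predicted camera parameters, and the depth cutoff is an illustrative validity check rather than the exact filtering rule used in the paper.

import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, max_depth=80.0):
    # Back-project a metric depth map (H x W) into camera-space 3D points.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop non-finite, non-positive, or implausibly distant points.
    valid = np.isfinite(points).all(axis=1) & (points[:, 2] > 0) & (points[:, 2] < max_depth)
    return points[valid], valid  # 'valid' preserves the pixel-to-point correspondence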

Step 3: Original 2D image annotations are lifted to 3D, resulting in a fully annotated 3D representation ready for various downstream spatial intelligence tasks.
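The annotation lifting in Step 3 can reuse the pixel-to-point correspondence from the back-projection sketch above: each retained 3D point simply inherits the label of its source pixel. The snippet below assumes per-pixel semantic or instance ID maps with negative values marking ignored pixels; both assumptions are ours for illustration.

import numpy as np

def lift_annotations(label_map, valid):
    # Transfer per-pixel labels onto the point cloud using the same validity
    # mask produced during back-projection, then group points by instance ID.
    point_labels = label_map.reshape(-1)[valid]
    instances = {int(i): np.flatnonzero(point_labels == i)
                 for i in np.unique(point_labels) if i >= 0}
    return point_labels, instances  # per-point labels and 3D instance masks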

Teaser image

Our pipeline mitigates the scarcity of spatial data by generating metrically scaled, appearance-realistic 3D data (point clouds, depth maps, camera poses, etc.) with rich annotations. The generated data supports a wide range of tasks, including spatial perception as well as MLLM-based captioning, spatial reasoning, and grounding.

Experiments

Through extensive experiments, we demonstrate the effectiveness of our generated 3D data across multiple tasks and scenarios:

(1) Enhanced 3D Perception Performance: Pre-training on our generated data significantly boosts performance on various 3D perception tasks, with models showing substantial improvements in accuracy and robustness over counterparts trained without it.

(2) Strong Zero-Shot Capabilities: Remarkably, using only our generated data and no real 3D datasets, our models achieve strong zero-shot performance on 3D semantic and instance segmentation. This demonstrates that the generated data captures essential spatial patterns that generalize well to real-world scenarios.

Zero-shot instance segmentation results

Zero-shot 3D instance segmentation results using only our generated data, without any real 3D training data.

(3) Improved 3D Vision-Language Tasks: Our data significantly enhances performance on 3D referring segmentation, a representative 3D vision-language task. The rich annotations in our generated 3D scenes provide valuable training signals for models that must understand both spatial relationships and natural language descriptions.

(4) Enhanced 3D Multimodal Large Language Models: In the LLM era, our generated data substantially improves the capabilities of 3D MLLMs. Models trained with our data exhibit stronger spatial reasoning, scene understanding, and grounding of language descriptions in 3D environments.

These comprehensive results validate that our 2D-to-3D data lifting pipeline provides a practical and effective solution for scaling up spatial intelligence.

Demos of Generated 3D Data (with Annotations)

BibTeX

@inproceedings{miao2025towards,
  title={Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting},
  author={Miao, Xingyu and Duan, Haoran and Qian, Quanhao and Wang, Jiuniu and Long, Yang and Shao, Ling and Zhao, Deli and Xu, Ran and Zhang, Gongjie},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}