IGen: Scalable Data Generation for Robot Learning from Open-World Images

Turning unstructured 2D pixels into structured 3D scenes, and synthesizing realistic visuomotor data without any human teleoperation.

Chenghao Gu¹*, Haolan Kang²*, Junchao Lin³*, Jinghe Wang¹, Duo Wu¹, Shuzhao Xie¹, Fanding Huang¹, Junchen Ge¹, Ziyang Gong⁴, Letian Li¹, Hongying Zheng⁵, Changwei Lv⁵, Zhi Wang¹

¹Tsinghua University · ²The University of Hong Kong · ³Beijing University of Chemical Technology
⁴Shanghai Jiao Tong University · ⁵Shenzhen University of Information Technology

*Equal contribution

Key Takeaways

Open-world images become robot data. IGen turns ordinary images into structured 3D scenes and executable robot trajectories, without human teleoperation.
Generation is grounded in geometry and reasoning. IGen combines foundation vision models, vision-language reasoning, and SE(3) action synthesis to create temporally coherent visuomotor demonstrations.
Synthetic data transfers to real robots. Policies trained solely on IGen-generated data achieve strong real-world performance, showing that open-world images can serve as a practical source of robot learning data.

Method

Pipeline of IGen. Given an open-world image and a task description, IGen first reconstructs the environment and objects as point clouds via foundation vision models. After spatial keypoint extraction, a vision-language model maps the task description to high-level plans and low-level control commands. While the robot executes these commands in simulation, a virtual depth camera captures point-cloud sequences of its motion. The resulting end-effector pose trajectory drives the synthesis of dynamic point-cloud sequences, which are then rendered frame-by-frame into visual observations of the manipulation. The final output consists of the generated robot actions and the visual observations.
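To make the step where an end-effector pose trajectory drives a dynamic point-cloud sequence more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: it assumes the object is rigidly attached to the gripper from the first pose onward, and the helper names (`make_pose`, `apply_pose`, `move_grasped_cloud`) and the toy coordinates are hypothetical.

```python
import numpy as np

def make_pose(rotation_z_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z plus a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def apply_pose(T, points):
    """Apply a 4x4 rigid transform to an (N, 3) array of points."""
    return points @ T[:3, :3].T + T[:3, 3]

def move_grasped_cloud(object_points, ee_poses):
    """Return one (N, 3) cloud per end-effector pose.

    Assumption (for illustration only): the object is grasped at ee_poses[0]
    and moves rigidly with the gripper afterwards, so the object's motion at
    step t is the relative transform T_t @ inv(T_0).
    """
    T0_inv = np.linalg.inv(ee_poses[0])
    return [apply_pose(T @ T0_inv, object_points) for T in ee_poses]

# Toy object cloud and a three-pose trajectory: grasp, lift, then move and rotate.
object_points = np.array([[0.50, 0.00, 0.10],
                          [0.50, 0.05, 0.10],
                          [0.55, 0.00, 0.10]])
ee_poses = [
    make_pose(0.0, [0.5, 0.0, 0.2]),
    make_pose(0.0, [0.5, 0.0, 0.4]),
    make_pose(np.pi / 2, [0.3, 0.2, 0.4]),
]
frames = move_grasped_cloud(object_points, ee_poses)
```

Each entry of `frames` is the object cloud at one time step. In a full pipeline these clouds would be merged with the static scene and robot clouds before rendering, which this sketch leaves out.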

Generated Data & Real-World Deployment

Starting from a single captured real-world scene image, IGen automatically generates 1,000 task demonstrations with spatial randomization. The resulting data are used to train a visuomotor policy, which is then deployed and evaluated on a real robot. The results suggest that IGen can serve as an effective and scalable alternative to human teleoperation for training robot policies.
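One way to picture spatial randomization is as sampling a different planar placement of an object's point cloud for each demonstration. The sketch below is illustrative only: the sampling ranges, the uniform distribution, and the function names (`sample_placements`, `place_object`) are assumptions, not details taken from IGen.

```python
import numpy as np

def sample_placements(num_demos, xy_range=0.10, yaw_range=np.pi / 6, seed=0):
    """Sample a planar offset (meters) and a yaw angle (radians) per demonstration.

    The ranges and the uniform distribution are hypothetical choices.
    """
    rng = np.random.default_rng(seed)
    offsets_xy = rng.uniform(-xy_range, xy_range, size=(num_demos, 2))
    yaws = rng.uniform(-yaw_range, yaw_range, size=num_demos)
    return offsets_xy, yaws

def place_object(points, offset_xy, yaw):
    """Rotate an (N, 3) cloud about the vertical axis through its centroid,
    then shift it in the horizontal plane."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    centroid = points.mean(axis=0)
    shift = np.array([offset_xy[0], offset_xy[1], 0.0])
    return (points - centroid) @ R.T + centroid + shift

# One randomized object placement per demonstration, 1,000 in total.
offsets_xy, yaws = sample_placements(1000)
base_cloud = np.array([[0.00, 0.00, 0.00],
                       [0.02, 0.00, 0.00],
                       [0.00, 0.02, 0.00],
                       [0.00, 0.00, 0.02]])
randomized = [place_object(base_cloud, o, y) for o, y in zip(offsets_xy, yaws)]
```

Because the rotation is about the vertical axis and the shift is planar, every point keeps its height, so the object stays on the supporting surface. Fixing the seed keeps the generated set reproducible.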

Real-World Rollouts

BibTeX

igen.bib

@misc{gu2025igenscalabledatageneration,
    title  = {IGen: Scalable Data Generation for Robot Learning from Open-World Images},
    author = {Chenghao Gu and Haolan Kang and Junchao Lin and Jinghe Wang and
              Duo Wu and Shuzhao Xie and Fanding Huang and Junchen Ge and
              Ziyang Gong and Letian Li and Hongying Zheng and Changwei Lv and Zhi Wang},
    year   = {2025},
    eprint = {2512.01773},
    archivePrefix = {arXiv},
    primaryClass  = {cs.RO},
    url    = {https://arxiv.org/abs/2512.01773}
}