IGen: Scalable Data Generation for Robot Learning from Open-World Images
IGen is a scalable data generation system for robot learning from open-world images. Instead of relying on labor-intensive teleoperation, IGen takes diverse real-world images and automatically synthesizes robot trajectories for a variety of tasks. With this generated data, we can train robot policies that transfer to real robots and perform effective manipulation in the real world, without any human teleoperation data.
Abstract
The rise of generalist robotic policies has created an exponential demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, the lack of associated robot actions hinders the practical use of open-world images for robot learning, leaving this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and generate low-level actions as SE(3) end-effector pose sequences. From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of visuomotor data generated by IGen, and show that policies trained solely on IGen-synthesized data achieve performance comparable to those trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training. Code for IGen will be made publicly available.
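As a point of reference for the action format, the sketch below is our own illustration (not IGen's code) of how an SE(3) end-effector pose sequence can be represented as 4x4 homogeneous transforms and interpolated between two keyframe poses; the keyframe values and the use of SciPy are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def make_pose(rotation: Rotation, translation: np.ndarray) -> np.ndarray:
    """Pack a rotation and a translation into a 4x4 homogeneous SE(3) transform."""
    T = np.eye(4)
    T[:3, :3] = rotation.as_matrix()
    T[:3, 3] = translation
    return T

# Made-up keyframes: a top-down grasp pose and a placement pose rotated by 90 degrees.
key_rots = Rotation.from_euler("xyz", [[180, 0, 0], [180, 0, 90]], degrees=True)
grasp_t = np.array([0.45, -0.10, 0.12])
place_t = np.array([0.50, 0.20, 0.15])

# SLERP the rotations and linearly interpolate the translations to obtain a
# temporally smooth sequence of SE(3) end-effector poses between the keyframes.
times = np.linspace(0.0, 1.0, 20)
interp_rots = Slerp([0.0, 1.0], key_rots)(times)
pose_sequence = [make_pose(interp_rots[i], (1 - s) * grasp_t + s * place_t)
                 for i, s in enumerate(times)]
```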
Method
Given an open-world image and a task description, IGen first reconstructs the environment and objects as point clouds using foundation vision models. After spatial keypoints are extracted, a vision-language model maps the task description to high-level plans and low-level control commands. As the robot executes these commands in simulation, a virtual depth camera captures the motion as point-cloud sequences. The resulting end-effector pose trajectory is used to synthesize dynamic point-cloud sequences, which are then rendered frame by frame into visual observations of the manipulation. The final output consists of the generated robot actions and the corresponding visual observations.
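To give a concrete feel for the rendering step, the sketch below projects a point cloud into a depth image with a standard pinhole camera model and a z-buffer. It is a simplified stand-in for the virtual depth camera and frame-by-frame rendering described above, not IGen's renderer; the intrinsics and the random point cloud are made-up values.

```python
import numpy as np

def render_depth(points_cam: np.ndarray, fx: float, fy: float,
                 cx: float, cy: float, width: int, height: int) -> np.ndarray:
    """Project camera-frame points (N, 3) into a z-buffered depth image."""
    depth = np.full((height, width), np.inf)
    z = points_cam[:, 2]
    valid = z > 1e-6                      # keep points in front of the camera
    x, y, z = points_cam[valid, 0], points_cam[valid, 1], z[valid]
    u = np.round(fx * x / z + cx).astype(int)
    v = np.round(fy * y / z + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]
    order = np.argsort(-z)                # draw far points first, near points overwrite
    depth[v[order], u[order]] = z[order]
    depth[np.isinf(depth)] = 0.0          # mark empty pixels with zero depth
    return depth

# Made-up example: render one frame of a synthetic point cloud.
rng = np.random.default_rng(0)
cloud = rng.uniform([-0.2, -0.2, 0.5], [0.2, 0.2, 1.0], size=(5000, 3))
frame = render_depth(cloud, fx=600.0, fy=600.0, cx=320.0, cy=240.0,
                     width=640, height=480)
```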
Starting from a captured real-world scene image, IGen automatically generates 1,000 task demonstrations with spatial randomization. The resulting data are used to train a visuomotor policy, which is then deployed and evaluated in the real world. The results suggest that IGen can serve as an effective and scalable alternative to human teleoperation for training robot policies.
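As a minimal sketch of what spatial randomization over object placements could look like, the snippet below samples planar SE(3) placements inside a workspace and expresses a nominal grasp relative to each randomized object frame. The workspace bounds, grasp offset, and demonstration count are illustrative assumptions, not IGen's actual parameters.

```python
import numpy as np

def sample_placement(rng: np.random.Generator,
                     xy_bounds=((0.35, 0.65), (-0.25, 0.25)),
                     yaw_bounds=(-np.pi, np.pi)) -> np.ndarray:
    """Sample a random planar SE(3) placement (x/y translation, rotation about z)."""
    x = rng.uniform(*xy_bounds[0])
    y = rng.uniform(*xy_bounds[1])
    yaw = rng.uniform(*yaw_bounds)
    T = np.eye(4)
    T[:2, 3] = (x, y)
    T[:3, :3] = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                          [np.sin(yaw),  np.cos(yaw), 0.0],
                          [0.0,          0.0,         1.0]])
    return T

# Illustrative: 1,000 randomized object placements, each seeding one demonstration,
# with the grasp pose defined relative to the randomized object frame.
rng = np.random.default_rng(42)
nominal_grasp_in_object = np.eye(4)
nominal_grasp_in_object[2, 3] = 0.05      # e.g., grasp 5 cm above the object origin
placements = [sample_placement(rng) for _ in range(1000)]
grasps = [T_obj @ nominal_grasp_in_object for T_obj in placements]
```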