Phone2Act: A Low-Cost, Hardware-Agnostic Teleoperation System for Scalable VLA Data Collection
Om Mandhane, Bipin Yadav, Sangeetha Prasanna Ram, Gopalakrishnan Narayanan
THE PROBLEM
This paper focuses on Vision-Language-Action (VLA) models, which take images and language as input and output robot actions. Phone2Act lets you collect VLA training data using just a smartphone as a 6-DoF controller on any robot without hardware lock-in. This democratizes high-quality manipulation dataset collection for researchers who can't afford expensive teleoperation rigs, directly enabling more teams to fine-tune VLA models on their own robot platforms. Read the paper by tracking the task definition, the robot or data assumptions, and the evidence that supports the claimed improvement.
HOW IT WORKS
Task framing
Core method
Data and supervision
Evaluation evidence
KEY RESULTS
Phone2Act lets you collect Vision-Language-Action (VLA) training data using just a smartphone as a 6-DoF controller on any robot without hardware lock-in. This democratizes high-quality manipulation dataset collection for researchers who can't afford expensive teleoperation rigs, directly enabling more teams to fine-tune VLA models on their own robot platforms.
WHY DEVELOPERS SHOULD CARE
Phone2Act lets you collect Vision-Language-Action (VLA) training data using just a smartphone as a 6-DoF controller on any robot without hardware lock-in. This democratizes high-quality manipulation dataset collection for researchers who can't afford expensive teleoperation rigs, directly enabling more teams to fine-tune VLA models on their own robot platforms.
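To make the idea of a smartphone acting as a 6-DoF controller concrete, here is a minimal sketch of a relative ("clutched") pose mapping. It is an illustration only, not the Phone2Act implementation: the names `Pose`, `teleop_target`, and `scale` are hypothetical, and the sketch assumes the phone's tracking frame and the robot's base frame share the same axes.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # unit quaternion as (w, x, y, z)


def quat_mul(a: Quat, b: Quat) -> Quat:
    """Hamilton product a * b."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def quat_conj(q: Quat) -> Quat:
    """Conjugate, which is the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)


@dataclass(frozen=True)
class Pose:
    position: Vec3      # metres
    orientation: Quat   # unit quaternion (w, x, y, z)


def teleop_target(phone_ref: Pose, phone_now: Pose,
                  robot_ref: Pose, scale: float = 1.0) -> Pose:
    """Apply the phone's motion since the clutch was engaged to the robot.

    phone_ref / robot_ref are the poses captured when the operator engaged
    the clutch; phone_now is the latest phone pose. Assumes aligned frames.
    """
    # Translation: scaled phone displacement added to the robot reference.
    position = tuple(
        r + scale * (n - p)
        for r, n, p in zip(robot_ref.position, phone_now.position, phone_ref.position)
    )
    # Rotation: phone's rotation delta in the shared frame, applied to the robot.
    q_delta = quat_mul(phone_now.orientation, quat_conj(phone_ref.orientation))
    orientation = quat_mul(q_delta, robot_ref.orientation)
    return Pose(position, orientation)


if __name__ == "__main__":
    h = math.sqrt(0.5)  # cos(45 deg) = sin(45 deg): a 90-degree turn about z
    identity = (1.0, 0.0, 0.0, 0.0)
    target = teleop_target(
        phone_ref=Pose((0.0, 0.0, 0.0), identity),
        phone_now=Pose((0.1, 0.0, 0.0), (h, 0.0, 0.0, h)),
        robot_ref=Pose((0.5, 0.0, 0.3), identity),
        scale=0.5,
    )
    print(target)
```

Mapping relative motion rather than absolute pose is a common teleoperation choice because the operator can release the clutch, reposition the phone, and continue without the robot jumping. Whether Phone2Act does this is not stated in the summary above.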
LIMITATIONS
The main limitation to check is whether the claimed behavior holds outside the paper's reported setup. That means testing across different robot embodiments, scenes, objects, and data distributions.
WHAT COMES NEXT
The practical next step is independent reproduction with clear baselines, ablations, and stress tests. For a developer, the useful follow-up is to map the paper's Vision-Language-Action (VLA) assumptions onto a concrete robot stack, then test the smallest version of the method that could run end to end.
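One way to start on the smallest end-to-end version is to fix a demonstration record format before touching any hardware. The sketch below is a hypothetical schema, not the paper's dataset format: the `Episode` and `Step` classes, the 7-value action layout (6-DoF delta plus gripper), and the JSON Lines output are all assumptions chosen to match the image-plus-language-in, action-out shape of VLA training data.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

ACTION_DIM = 7  # assumed layout: dx, dy, dz, droll, dpitch, dyaw, gripper


@dataclass
class Step:
    timestamp_s: float
    image_path: str      # camera frame saved to disk (hypothetical layout)
    action: List[float]  # commanded action at this timestep


@dataclass
class Episode:
    instruction: str     # natural-language task description
    robot_name: str      # which embodiment produced the demonstration
    steps: List[Step] = field(default_factory=list)

    def add_step(self, timestamp_s: float, image_path: str,
                 action: List[float]) -> None:
        """Append one timestep, rejecting malformed actions early."""
        if len(action) != ACTION_DIM:
            raise ValueError(
                f"expected {ACTION_DIM} action values, got {len(action)}"
            )
        self.steps.append(Step(timestamp_s, image_path, list(action)))

    def to_jsonl(self) -> str:
        """Serialize as JSON Lines: one header line, then one line per step."""
        header = {
            "instruction": self.instruction,
            "robot": self.robot_name,
            "num_steps": len(self.steps),
        }
        lines = [json.dumps(header)]
        lines.extend(json.dumps(asdict(step)) for step in self.steps)
        return "\n".join(lines)


if __name__ == "__main__":
    episode = Episode(instruction="pick up the red block", robot_name="demo_arm")
    episode.add_step(0.0, "frames/000000.jpg", [0.0] * 6 + [1.0])
    episode.add_step(0.1, "frames/000001.jpg", [0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
    print(episode.to_jsonl())
```

Validating the action length when a step is added keeps malformed demonstrations out of the dataset, and storing the robot name with each episode makes it easier to check later whether results hold across embodiments.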